Re: [Gluster-devel] ./tests/basic/mount-nfs-auth.t spews out warnings

2018-03-15 Thread Raghavendra G
I assume we build with --enable-gnfs turned on. It's unlikely that it's not,
but I am just pointing it out, as I initially ran into some failures due to
the gNFS server not coming up because of a lack of relevant libraries.
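
In case it helps, a couple of quick local checks (the flags and paths below
are assumptions based on a typical source build, not taken from the
regression job's configuration):

# was gNFS enabled at configure time?
./configure --enable-gnfs
grep -i gnfs config.log | tail

# after "make install", the NFS server xlator should be present, e.g.:
ls /usr/local/lib/glusterfs/*/xlator/nfs/server.so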

On Thu, Mar 15, 2018 at 3:46 PM, Raghavendra Gowdappa <rgowd...@redhat.com>
wrote:

> Providing more links to failure:
> https://build.gluster.org/job/experimental-periodic/227/console
> https://build.gluster.org/job/regression-test-burn-in/3836/console
>
>
> On Thu, Mar 15, 2018 at 3:39 PM, Raghavendra Gowdappa <rgowd...@redhat.com
> > wrote:
>
>> I can reproduce the failure even without my patch on master. Looks like
>> this test resulted in failures earlier too.
>>
>> [1] http://lists.gluster.org/pipermail/gluster-devel/2015-May/044932.html
>> [2] See mail to gluster-maintainers with subj: "Build failed in Jenkins:
>> regression-test-burn-in #3836"
>>
>> The usual failures are (not necessarily all of them, but at least a few of them):
>> ./tests/basic/mount-nfs-auth.t (Wstat: 0 Tests: 92 Failed: 4)
>>   Failed tests:  22-24, 28
>>
>> It would be helpful if the gNFS team could take a look into this.
>>
>> regards,
>> Raghavendra
>>
>> On Wed, Mar 14, 2018 at 3:04 PM, Nigel Babu <nig...@redhat.com> wrote:
>>
>>> When the test works it takes less than 60 seconds. If it needs more than
>>> 200 seconds, that means there's an actual issue.
>>>
>>> On Wed, Mar 14, 2018 at 10:16 AM, Raghavendra Gowdappa <
>>> rgowd...@redhat.com> wrote:
>>>
>>>> All,
>>>>
>>>> I was trying to debug a regression failure [1]. When I ran the test locally
>>>> on my laptop, I saw warnings like the following:
>>>>
>>>> ++ gluster --mode=script --wignore volume get patchy nfs.mount-rmtab
>>>> ++ xargs dirname
>>>> ++ awk '/^nfs.mount-rmtab/{print $2}'
>>>> dirname: missing operand
>>>> Try 'dirname --help' for more information.
>>>> + NFSDIR=
>>>>
>>>> To debug I ran the volume get cmds:
>>>>
>>>> [root@booradley glusterfs]# gluster volume get patchy nfs.mount-rmtab
>>>> Option                                  Value
>>>> ------                                  -----
>>>>
>>>> volume get option failed. Check the cli/glusterd log file for more
>>>> details
>>>>
>>>> [root@booradley glusterfs]# gluster volume set patchy nfs.mount-rmtab
>>>> testdir
>>>> volume set: success
>>>>
>>>> [root@booradley glusterfs]# gluster volume get patchy nfs.mount-rmtab
>>>> Option                                  Value
>>>> ------                                  -----
>>>> nfs.mount-rmtab                         testdir
>>>>
>>>>
>>>> Does this mean the option value is not set properly in the script? Need
>>>> your help in debugging this.
>>>>
>>>> @Nigel
>>>> I noticed that test is timing out.
>>>>
>>>> *20:28:39* ./tests/basic/mount-nfs-auth.t timed out after 200 seconds
>>>>
>>>> Could this be an infra issue where NFS was taking too much time to mount?
>>>>
>>>> [1] https://build.gluster.org/job/centos7-regression/316/console
>>>>
>>>> regards,
>>>> Raghavendra
>>>>
>>>
>>>
>>>
>>> --
>>> nigelb
>>>
>>
>>
>
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://lists.gluster.org/mailman/listinfo/gluster-devel
>



-- 
Raghavendra G
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] [Gluster-Maintainers] Release 4.0: Unable to complete rolling upgrade tests

2018-03-08 Thread Raghavendra G
bIgJwCMySIwJm
>>>>> AJvGMBvNEA#
>>>>> I setup 3 server containers to install 3.13 first as follows (within
>>>>> the
>>>>> containers)
>>>>>
>>>>> (inside the 3 server containers)
>>>>> yum -y update; yum -y install centos-release-gluster313; yum install
>>>>> glusterfs-server; glusterd
>>>>>
>>>>> (inside centos-glfs-server1)
>>>>> gluster peer probe centos-glfs-server2
>>>>> gluster peer probe centos-glfs-server3
>>>>> gluster peer status
>>>>> gluster v create patchy replica 3 centos-glfs-server1:/d/brick1
>>>>> centos-glfs-server2:/d/brick2 centos-glfs-server3:/d/brick3
>>>>> centos-glfs-server1:/d/brick4 centos-glfs-server2:/d/brick5
>>>>> centos-glfs-server3:/d/brick6 force
>>>>> gluster v start patchy
>>>>> gluster v status
>>>>>
>>>>> Create a client container as per the document above, and mount the
>>>>> above
>>>>> volume and create 1 file, 1 directory and a file within that directory.
>>>>>
>>>>> Now we start the upgrade process (as laid out for 3.13 here
>>>>> http://docs.gluster.org/en/latest/Upgrade-Guide/upgrade_to_3.13/ ):
>>>>> - killall glusterfs glusterfsd glusterd
>>>>> - yum install
>>>>> http://cbs.centos.org/kojifiles/work/tasks/1548/311548/cento
>>>>> s-release-gluster40-0.9-1.el7.centos.x86_64.rpm
>>>>> - yum upgrade --enablerepo=centos-gluster40-test glusterfs-server
>>>>>
>>>>> < Go back to the client and edit the contents of one of the files and
>>>>> change the permissions of a directory, so that there are things to heal
>>>>> when we bring up the newly upgraded server>
>>>>>
>>>>> - gluster --version
>>>>> - glusterd
>>>>> - gluster v status
>>>>> - gluster v heal patchy
>>>>>
>>>>> The above starts failing as follows,
>>>>> [root@centos-glfs-server1 /]# gluster v heal patchy
>>>>> Launching heal operation to perform index self heal on volume patchy
>>>>> has
>>>>> been unsuccessful:
>>>>> Commit failed on centos-glfs-server2.glfstest20. Please check log file
>>>>> for details.
>>>>> Commit failed on centos-glfs-server3. Please check log file for
>>>>> details.
>>>>>
>>>>>  From here, if further files or directories are created from the
>>>>> client,
>>>>> they just get added to the heal backlog, and heal does not catchup.
>>>>>
>>>>> As is obvious, I cannot proceed, as the upgrade procedure is broken.
>>>>> The
>>>>> issue itself may not be selfheal deamon, but something around
>>>>> connections, but as the process fails here, looking to you guys to
>>>>> unblock this as soon as possible, as we are already running a day's
>>>>> slip
>>>>> in the release.
>>>>>
>>>>> Thanks,
>>>>> Shyam
>>>>>
>>>>
>>>>
>>>
>> ___
>> maintainers mailing list
>> maintain...@gluster.org
>> http://lists.gluster.org/mailman/listinfo/maintainers
>>
>
>
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://lists.gluster.org/mailman/listinfo/gluster-devel
>



-- 
Raghavendra G
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Report ESTALE as ENOENT

2018-02-25 Thread Raghavendra G
On Fri, Feb 23, 2018 at 6:33 AM, J. Bruce Fields <bfie...@fieldses.org>
wrote:

> On Thu, Feb 22, 2018 at 01:17:58PM +0530, Raghavendra G wrote:
> > On Wed, Oct 11, 2017 at 7:32 PM, J. Bruce Fields <bfie...@fieldses.org>
> > wrote:
> >
> > > On Wed, Oct 11, 2017 at 04:11:51PM +0530, Raghavendra G wrote:
> > > > On Thu, Mar 31, 2016 at 1:22 AM, J. Bruce Fields <
> bfie...@fieldses.org>
> > > > wrote:
> > > >
> > > > > On Mon, Mar 28, 2016 at 04:21:00PM -0400, Vijay Bellur wrote:
> > > > > > I would prefer to:
> > > > > >
> > > > > > 1. Return ENOENT for all system calls that operate on a path.
> > > > > >
> > > > > > 2. ESTALE might be ok for file descriptor based operations.
> > > > >
> > > > > Note that operations which operate on paths can fail with ESTALE
> when
> > > > > they attempt to look up a component within a directory that no
> longer
> > > > > exists.
> > > > >
> > > >
> > > > But, "man 2 rmdir"  or "man 2 unlink" doesn't list ESTALE as a valid
> > > error.
> > >
> > > In fact, almost no man pages list ESTALE as a valid error:
> > >
> > > [bfields@patate man-pages]$ git grep ESTALE
> > > Changes.old:Change description for ESTALE
> > > man2/open_by_handle_at.2:.B ESTALE
> > > man2/open_by_handle_at.2:.B ESTALE
> > > man3/errno.3:.B ESTALE
> > >
> > > Cc'ing Michael Kerrisk for advice.  Is there some reason for that, or
> > > can we fix those man pages?
> > >
> > > > Also rm doesn't seem to handle ESTALE too [3]
> > > >
> > > > [4] https://github.com/coreutils/coreutils/blob/master/src/
> remove.c#L305
> > >
> > > I *think* that code is just deciding whether a given error should be
> > > silently ignored in the rm -f case.  I don't think -ESTALE (indicating
> > > the directory is bad) is such an error, so I think this code is
> correct.
> > > But my understanding may be wrong.
> > >
> >
> > For a local filesystem, we may not end up in ESTALE errors. But, when
> rmdir
> > is executed from multiple clients of a network fs (like NFS, Glusterfs),
> > unlink or rmdir can easily fail with ESTALE as the other rm invocation
> > could've deleted it. I think this is what has happened in bugs like:
> > https://bugzilla.redhat.com/show_bug.cgi?id=1546717
> > https://bugzilla.redhat.com/show_bug.cgi?id=1245065
> >
> > This in fact was the earlier motivation to convert ESTALE into ENOENT, so
> > that rm would ignore it. Now that I reverted the fix, looks like the bug
> > has promptly resurfaced :)
> >
> > There is one glitch though. Bug 1245065 mentions that some parts of the
> > directory structure remain undeleted. From my understanding, at least one
> > instance of rm (the one racing ahead of all others and causing them to
> > fail) should've deleted the directory structure completely. Though, I need
> > to understand the directory traversal done by rm to find whether there is
> > a cyclic dependency between two rms causing both of them to fail.
>
> I don't see how you could avoid that.  The clients are each caching
> multiple subdirectories of the tree, and there's no guarantee that 1
> client has fresher caches of every subdirectory.  There's also no
> guarantee that the client that's ahead stays ahead--another client that
> sees which objects the first client has already deleted can leapfrog
> ahead.
>

What are the drawbacks of applications (like rm) treating ESTALE as
equivalent to ENOENT? It seems to me that, from the application's
perspective, both convey similar information. If rm could ignore ESTALE just
like it ignores ENOENT, we probably wouldn't run into this issue.
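
Just to illustrate what I mean, a minimal sketch (not the actual coreutils
code; the helper name and shape are mine):

#include <errno.h>
#include <stdbool.h>

static bool
ignorable_missing_err (int err, bool force)
{
        /* what rm -f effectively ignores today */
        if (err == ENOENT || err == ENOTDIR)
                return force;

        /* proposal: treat "stale" the same way, i.e. "somebody else has
         * already removed it", so concurrent rms don't report a failure */
        if (err == ESTALE)
                return force;

        return false;
}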


> I think the solution is just not to do that--NFS clients aren't really
> equipped to handle directory operations on directories that are deleted
> out from under them, and there probably aren't any hacks on the server
> side that will fix that.  If there's a real need for this kind of case,
> we may need to work on the protocol itself.  For now all we may be able
> to do is educate users about what NFS can and can't do.
>
> --b.
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://lists.gluster.org/mailman/listinfo/gluster-devel
>



-- 
Raghavendra G
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Report ESTALE as ENOENT

2018-02-22 Thread Raghavendra G
On Thu, Feb 22, 2018 at 1:17 PM, Raghavendra G <raghaven...@gluster.com>
wrote:

>
>
> On Wed, Oct 11, 2017 at 7:32 PM, J. Bruce Fields <bfie...@fieldses.org>
> wrote:
>
>> On Wed, Oct 11, 2017 at 04:11:51PM +0530, Raghavendra G wrote:
>> > On Thu, Mar 31, 2016 at 1:22 AM, J. Bruce Fields <bfie...@fieldses.org>
>> > wrote:
>> >
>> > > On Mon, Mar 28, 2016 at 04:21:00PM -0400, Vijay Bellur wrote:
>> > > > I would prefer to:
>> > > >
>> > > > 1. Return ENOENT for all system calls that operate on a path.
>> > > >
>> > > > 2. ESTALE might be ok for file descriptor based operations.
>> > >
>> > > Note that operations which operate on paths can fail with ESTALE when
>> > > they attempt to look up a component within a directory that no longer
>> > > exists.
>> > >
>> >
>> > But, "man 2 rmdir"  or "man 2 unlink" doesn't list ESTALE as a valid
>> error.
>>
>> In fact, almost no man pages list ESTALE as a valid error:
>>
>> [bfields@patate man-pages]$ git grep ESTALE
>> Changes.old:Change description for ESTALE
>> man2/open_by_handle_at.2:.B ESTALE
>> man2/open_by_handle_at.2:.B ESTALE
>> man3/errno.3:.B ESTALE
>>
>> Cc'ing Michael Kerrisk for advice.  Is there some reason for that, or
>> can we fix those man pages?
>>
>> > Also rm doesn't seem to handle ESTALE too [3]
>> >
>> > [4] https://github.com/coreutils/coreutils/blob/master/src/remov
>> e.c#L305
>>
>> I *think* that code is just deciding whether a given error should be
>> silently ignored in the rm -f case.  I don't think -ESTALE (indicating
>> the directory is bad) is such an error, so I think this code is correct.
>> But my understanding may be wrong.
>>
>
> For a local filesystem, we may not end up in ESTALE errors. But, when
> rmdir is executed from multiple clients of a network fs (like NFS,
> Glusterfs), unlink or rmdir can easily fail with ESTALE as the other rm
> invocation could've deleted it. I think this is what has happened in bugs
> like:
> https://bugzilla.redhat.com/show_bug.cgi?id=1546717
> https://bugzilla.redhat.com/show_bug.cgi?id=1245065
>
> This in fact was the earlier motivation to convert ESTALE into ENOENT, so
> that rm would ignore it. Now that I reverted the fix, looks like the bug
> has promptly resurfaced :)
>
> There is one glitch though. Bug 1245065 mentions that some parts of the
> directory structure remain undeleted. From my understanding, at least one
> instance of rm (the one racing ahead of all others and causing them to
> fail) should've deleted the directory structure completely. Though, I need
> to understand the directory traversal done by rm to find whether there is
> a cyclic dependency between two rms causing both of them to fail.
>

Also note that the VFS retries unlink and rmdir if it gets an ESTALE:

https://github.com/torvalds/linux/blob/master/fs/namei.c#L4056
https://github.com/torvalds/linux/blob/master/fs/namei.c#L3927

So an underlying fs like Glusterfs cannot mask ESTALE, as doing so breaks
this functionality.
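
To make that concrete, here is a self-contained paraphrase of the retry
pattern those namei.c links point at (the flag value and function shapes are
my assumptions for illustration; please refer to the kernel source for the
real code):

#include <errno.h>
#include <stdbool.h>

#define LOOKUP_REVAL 0x0020  /* illustrative value, not taken from kernel headers */

/* the syscall is retried exactly once, and only when the fs returned
 * -ESTALE; if glusterfs maps ESTALE to ENOENT, this revalidating retry
 * never fires */
static bool
retry_estale (long error, unsigned int flags)
{
        return error == -ESTALE && !(flags & LOOKUP_REVAL);
}

static long
do_path_op (const char *path, long (*op) (const char *))
{
        unsigned int lookup_flags = 0;
        long error;

retry:
        error = op (path);
        if (retry_estale (error, lookup_flags)) {
                lookup_flags |= LOOKUP_REVAL;  /* force re-resolution of the path */
                goto retry;
        }
        return error;
}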

In this scenario, what are the ways to fix a failing rm when it is run from
multiple mount points (the bugs I pointed out above)? I really can't think
of a way other than fixing rm to ignore the ESTALE error.


>
>> > > Maybe non-creating open("./foo") returning ENOENT would be reasonable
>> in
>> > > this case since that's what you'd get in the local filesystem case,
>> but
>> > > creat("./foo") returning ENOENT, for example, isn't something
>> > > applications will be written to handle.
>> > >
>> > > The Linux VFS will retry ESTALE on path-based systemcalls *one* time,
>> to
>> > > reduce the chance of ESTALE in those cases.
>> >
>> >
>> > I should've anticipated bug [2] due to this comment. My mistake. Bug
>> [2] is
>> > indeed due to kernel not retrying open on receiving an ENOENT error.
>> > Glusterfs sent ENOENT because file's inode-number/nodeid changed but
>> same
>> > path exists. The correct error would've been ESTALE, but due to our
>> > conversion of ESTALE to ENOENT, the latter was sent back to kernel.
>> >
>> > Looking through kernel VFS code, only open *seems* to retry
>> > (do_filep_open). I couldn't find similar logic to other path based
>> syscalls
>> > like rmdir, unlink, stat, chmod etc
>>
>> I believe there is a retry in those cases, but I'm not sure exactly
>> where it is.  Looking around See the retry_estale() checks sprinkled
>> around namei.c, which were added by Jeff Layton a few years ago.
>>
>> --b.
>> ___
>> Gluster-devel mailing list
>> Gluster-devel@gluster.org
>> http://lists.gluster.org/mailman/listinfo/gluster-devel
>>
>
>
>
> --
> Raghavendra G
>



-- 
Raghavendra G
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] GlusterFS project contribution.

2018-02-22 Thread Raghavendra G
> >>> Please take a look and then you
> >>> can start concentrating on areas that interest you.
> >>>
> >>> You can also contribute by filing well thought out bugs with good data
> >>> (reproducers etc), producing reports of performance in various
> use-cases,
> >>> calling out areas of improvement by constructive criticism. Welcome to
> the
> >>> community :).
> >>>
> >>> regards,
> >>> Raghavendra
> >>>
> >>> On Mon, Feb 12, 2018 at 7:57 AM, Javier Romero <xavi...@gmail.com>
> wrote:
> >>>>
> >>>> Hi,
> >>>>
> >>>> My name is Javier, live in Buenos Aires, Argentina, and work in the
> >>>> Network Operations Center of an Internet Service Provider as a Linux
> >>>> Sysadmin. I would like to contribute on the GlusterFS project if
> there is
> >>>> something where I can be useful.
> >>>>
> >>>> Thanks for your kind attention.
> >>>>
> >>>> Best Regards,
> >>>>
> >>>>
> >>>> Javier Romero
> >>>> E-mail: xavi...@gmail.com
> >>>> Skype: xavinux
> >>>>
> >>>>
> >>>> ___
> >>>> Gluster-devel mailing list
> >>>> Gluster-devel@gluster.org
> >>>> http://lists.gluster.org/mailman/listinfo/gluster-devel
> >>>
> >>>
> >>
> >>
> >> ___
> >> Gluster-devel mailing list
> >> Gluster-devel@gluster.org
> >> http://lists.gluster.org/mailman/listinfo/gluster-devel
> >
> >
> >
> >
> > --
> > Amar Tumballi (amarts)
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://lists.gluster.org/mailman/listinfo/gluster-devel
>



-- 
Raghavendra G
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Report ESTALE as ENOENT

2018-02-21 Thread Raghavendra G
On Wed, Oct 11, 2017 at 7:32 PM, J. Bruce Fields <bfie...@fieldses.org>
wrote:

> On Wed, Oct 11, 2017 at 04:11:51PM +0530, Raghavendra G wrote:
> > On Thu, Mar 31, 2016 at 1:22 AM, J. Bruce Fields <bfie...@fieldses.org>
> > wrote:
> >
> > > On Mon, Mar 28, 2016 at 04:21:00PM -0400, Vijay Bellur wrote:
> > > > I would prefer to:
> > > >
> > > > 1. Return ENOENT for all system calls that operate on a path.
> > > >
> > > > 2. ESTALE might be ok for file descriptor based operations.
> > >
> > > Note that operations which operate on paths can fail with ESTALE when
> > > they attempt to look up a component within a directory that no longer
> > > exists.
> > >
> >
> > But, "man 2 rmdir"  or "man 2 unlink" doesn't list ESTALE as a valid
> error.
>
> In fact, almost no man pages list ESTALE as a valid error:
>
> [bfields@patate man-pages]$ git grep ESTALE
> Changes.old:Change description for ESTALE
> man2/open_by_handle_at.2:.B ESTALE
> man2/open_by_handle_at.2:.B ESTALE
> man3/errno.3:.B ESTALE
>
> Cc'ing Michael Kerrisk for advice.  Is there some reason for that, or
> can we fix those man pages?
>
> > Also rm doesn't seem to handle ESTALE too [3]
> >
> > [4] https://github.com/coreutils/coreutils/blob/master/src/remove.c#L305
>
> I *think* that code is just deciding whether a given error should be
> silently ignored in the rm -f case.  I don't think -ESTALE (indicating
> the directory is bad) is such an error, so I think this code is correct.
> But my understanding may be wrong.
>

For a local filesystem, we may not end up with ESTALE errors. But when rmdir
is executed from multiple clients of a network fs (like NFS or Glusterfs),
unlink or rmdir can easily fail with ESTALE, as another rm invocation
could've already deleted the target. I think this is what has happened in
bugs like:
https://bugzilla.redhat.com/show_bug.cgi?id=1546717
https://bugzilla.redhat.com/show_bug.cgi?id=1245065

This in fact was the earlier motivation to convert ESTALE into ENOENT, so
that rm would ignore it. Now that I reverted the fix, looks like the bug
has promptly resurfaced :)

There is one glitch though. Bug 1245065 mentions that some parts of the
directory structure remain undeleted. From my understanding, at least one
instance of rm (the one racing ahead of all others and causing them to
fail) should've deleted the directory structure completely. Though, I need
to understand the directory traversal done by rm to find whether there is
a cyclic dependency between two rms causing both of them to fail.


> > > Maybe non-creating open("./foo") returning ENOENT would be reasonable
> in
> > > this case since that's what you'd get in the local filesystem case, but
> > > creat("./foo") returning ENOENT, for example, isn't something
> > > applications will be written to handle.
> > >
> > > The Linux VFS will retry ESTALE on path-based systemcalls *one* time,
> to
> > > reduce the chance of ESTALE in those cases.
> >
> >
> > I should've anticipated bug [2] due to this comment. My mistake. Bug [2]
> is
> > indeed due to kernel not retrying open on receiving an ENOENT error.
> > Glusterfs sent ENOENT because file's inode-number/nodeid changed but same
> > path exists. The correct error would've been ESTALE, but due to our
> > conversion of ESTALE to ENOENT, the latter was sent back to kernel.
> >
> > Looking through kernel VFS code, only open *seems* to retry
> > (do_filep_open). I couldn't find similar logic to other path based
> syscalls
> > like rmdir, unlink, stat, chmod etc
>
> I believe there is a retry in those cases, but I'm not sure exactly
> where it is.  Looking around See the retry_estale() checks sprinkled
> around namei.c, which were added by Jeff Layton a few years ago.
>
> --b.
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://lists.gluster.org/mailman/listinfo/gluster-devel
>



-- 
Raghavendra G
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Glusterfs and Structured data

2018-02-13 Thread Raghavendra G
I've started marking the "Whiteboard" field of bugs in this class with the
tag "GLUSTERFS_METADATA_INCONSISTENCY". Please add the tag to any bugs which
you think fit into this class.

On Fri, Feb 9, 2018 at 4:30 PM, Raghavendra Gowdappa <rgowd...@redhat.com>
wrote:

>
>
> - Original Message -
> > From: "Pranith Kumar Karampuri" <pkara...@redhat.com>
> > To: "Raghavendra G" <raghaven...@gluster.com>
> > Cc: "Gluster Devel" <gluster-devel@gluster.org>
> > Sent: Friday, February 9, 2018 2:30:59 PM
> > Subject: Re: [Gluster-devel] Glusterfs and Structured data
> >
> >
> >
> > On Thu, Feb 8, 2018 at 12:05 PM, Raghavendra G < raghaven...@gluster.com
> >
> > wrote:
> >
> >
> >
> >
> >
> > On Tue, Feb 6, 2018 at 8:15 PM, Vijay Bellur < vbel...@redhat.com >
> wrote:
> >
> >
> >
> >
> >
> > On Sun, Feb 4, 2018 at 3:39 AM, Raghavendra Gowdappa <
> rgowd...@redhat.com >
> > wrote:
> >
> >
> > All,
> >
> > One of our users pointed out to the documentation that glusterfs is not
> good
> > for storing "Structured data" [1], while discussing an issue [2].
> >
> >
> > As far as I remember, the content around structured data in the Install
> Guide
> > is from a FAQ that was being circulated in Gluster, Inc. indicating the
> > startup's market positioning. Most of that was based on not wanting to
> get
> > into performance based comparisons of storage systems that are frequently
> > seen in the structured data space.
> >
> >
> > Does any of you have more context on the feasibility of storing
> "structured
> > data" on Glusterfs? Is one of the reasons for such a suggestion
> "staleness
> > of metadata" as encountered in bugs like [3]?
> >
> >
> > There are challenges that distributed storage systems face when exposed
> to
> > applications that were written for a local filesystem interface. We have
> > encountered problems with applications like tar [4] that are not in the
> > realm of "Structured data". If we look at the common theme across all
> these
> > problems, it is related to metadata & read after write consistency issues
> > with the default translator stack that gets exposed on the client side.
> > While the default stack is optimal for other scenarios, it does seem
> that a
> > category of applications needing strict metadata consistency is not well
> > served by that. We have observed that disabling a few performance
> > translators and tuning cache timeouts for VFS/FUSE have helped to
> overcome
> > some of them. The WIP effort on timestamp consistency across the
> translator
> > stack, patches that have been merged as a result of the bugs that you
> > mention & other fixes for outstanding issues should certainly help in
> > catering to these workloads better with the file interface.
> >
> > There are deployments that I have come across where glusterfs is used for
> > storing structured data. gluster-block & qemu-libgfapi overcome the
> metadata
> > consistency problem by exposing a file as a block device & by disabling
> most
> > of the performance translators in the default stack. Workloads that have
> > been deemed problematic with the file interface for the reasons alluded
> > above, function well with the block interface.
> >
> > I agree that gluster-block due to its usage of a subset of glusterfs fops
> > (mostly reads/writes I guess), runs into less number of consistency
> issues.
> > However, as you've mentioned we seem to disable perf xlator stack in our
> > tests/use-cases till now. Note that perf xlator stack is one of worst
> > offenders as far as the metadata consistency is concerned (relatively
> less
> > scenarios of data inconsistency). So, I wonder,
> > * what would be the scenario if we enable perf xlator stack for
> > gluster-block?
> > * Is performance on gluster-block satisfactory so that we don't need
> these
> > xlators?
> > - Or is it that these xlators are not useful for the workload usually
> run on
> > gluster-block (For random read/write workload, read/write caching xlators
> > offer less or no advantage)?
> >
> > Yes. They are not useful. Block/VM files are opened with O_DIRECT, so we
> > don't enable caching at any layer in glusterfs. md-cache could be useful
> for
> > serving fstat from glusterfs. But apart from that I don't see any other
> > xlator contributing much.
> >
> >

Re: [Gluster-devel] Glusterfs and Structured data

2018-02-07 Thread Raghavendra G
On Tue, Feb 6, 2018 at 8:15 PM, Vijay Bellur <vbel...@redhat.com> wrote:

>
>
> On Sun, Feb 4, 2018 at 3:39 AM, Raghavendra Gowdappa <rgowd...@redhat.com>
> wrote:
>
>> All,
>>
>> One of our users pointed out to the documentation that glusterfs is not
>> good for storing "Structured data" [1], while discussing an issue [2].
>
>
>
> As far as I remember, the content around structured data in the Install
> Guide is from a FAQ that was being circulated in Gluster, Inc. indicating
> the startup's market positioning. Most of that was based on not wanting to
> get into performance based comparisons of storage systems that are
> frequently seen in the structured data space.
>
>
>> Does any of you have more context on the feasibility of storing
>> "structured data" on Glusterfs? Is one of the reasons for such a suggestion
>> "staleness of metadata" as encountered in bugs like [3]?
>>
>
>
> There are challenges that distributed storage systems face when exposed to
> applications that were written for a local filesystem interface. We have
> encountered problems with applications like tar [4] that are not in the
> realm of "Structured data". If we look at the common theme across all these
> problems, it is related to metadata & read after write consistency issues
> with the default translator stack that gets exposed on the client side.
> While the default stack is optimal for other scenarios, it does seem that a
> category of applications needing strict metadata consistency is not well
> served by that. We have observed that disabling a few performance
> translators and tuning cache timeouts for VFS/FUSE have helped to overcome
> some of them. The WIP effort on timestamp consistency across the translator
> stack, patches that have been merged as a result of the bugs that you
> mention & other fixes for outstanding issues should certainly help in
> catering to these workloads better with the file interface.
>
> There are deployments that I have come across where glusterfs is used for
> storing structured data. gluster-block  & qemu-libgfapi overcome the
> metadata consistency problem by exposing a file as a block device & by
> disabling most of the performance translators in the default stack.
> Workloads that have been deemed problematic with the file interface for the
> reasons alluded above, function well with the block interface.
>

I agree that gluster-block, due to its usage of a subset of glusterfs fops
(mostly reads/writes, I guess), runs into fewer consistency issues. However,
as you've mentioned, we seem to have disabled the perf xlator stack in our
tests/use-cases till now. Note that the perf xlator stack is one of the
worst offenders as far as metadata consistency is concerned (with relatively
fewer scenarios of data inconsistency). So, I wonder:
* What would be the scenario if we enabled the perf xlator stack for
gluster-block?
* Is performance on gluster-block satisfactory enough that we don't need
these xlators?
  - Or is it that these xlators are not useful for the workloads usually run
on gluster-block (for random read/write workloads, read/write caching
xlators offer little or no advantage)?
  - Or ought the workload theoretically to benefit from perf xlators, while
we just don't see that in our results (there are open bugs to this effect)?

I am asking these questions to ascertain the priority of fixing perf xlators
for (meta)data inconsistencies. If we offer a different solution for these
workloads, the need for fixing these issues will be lower.
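
For reference, "disabling the perf xlator stack" in these use-cases amounts
to something like the following (a sketch using the standard volume-set CLI;
<volname> is a placeholder and the exact set of options may differ per
profile):

gluster volume set <volname> performance.quick-read off
gluster volume set <volname> performance.read-ahead off
gluster volume set <volname> performance.io-cache off
gluster volume set <volname> performance.stat-prefetch off
gluster volume set <volname> performance.write-behind off
gluster volume set <volname> performance.open-behind off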

> I feel that we have come a long way from the time the install guide was
> written and an update for removing the "staleness of content" might be in
> order there :-).
>
> Regards,
> Vijay
>
> [4] https://bugzilla.redhat.com/show_bug.cgi?id=1058526
>
>
>>
>> [1] http://docs.gluster.org/en/latest/Install-Guide/Overview/
>> [2] https://bugzilla.redhat.com/show_bug.cgi?id=1512691
>> [3] https://bugzilla.redhat.com/show_bug.cgi?id=1390050
>>
>> regards,
>> Raghavendra
>> _______
>> Gluster-devel mailing list
>> Gluster-devel@gluster.org
>> http://lists.gluster.org/mailman/listinfo/gluster-devel
>>
>
>
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://lists.gluster.org/mailman/listinfo/gluster-devel
>



-- 
Raghavendra G
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] [Gluster-users] Rafi KC attending DevConf and FOSDEM

2018-01-28 Thread Raghavendra G
On Fri, Jan 26, 2018 at 7:27 PM, Niels de Vos <nde...@redhat.com> wrote:

> On Fri, Jan 26, 2018 at 06:24:36PM +0530, Mohammed Rafi K C wrote:
> > Hi All,
> >
> > I'm attending both DevConf (25-28) and Fosdem (3-4). If any of you are
> > attending the conferences and would like to talk about gluster, please
> > feel free to ping me through irc nick rafi on freenode or message me on
> > +436649795838
>
> In addition to that at FOSDEM, there is a Gluster stand (+Ceph, and next
> to oVirt) on level 1 (ground floor) of the K building[0]. We'll try to
> have some of the developers and other contributors to the project around
> at all times. Come and talk to us about your use-cases, questions and
> words of encouragement ;-)
>
> There are several talks related to Gluster too! On Saturday there is
> "Optimizing Software Defined Storage for the Age of Flash" [1],



Thanks, Niels, for that.

Manoj, Krutika, and I will be at FOSDEM 2018 (3rd and 4th Feb 2018). We would
be happy to chat with you about anything related to glusterfs :). Hopefully
we'll have some interesting results to share with you in the talk! Please
do plan to attend it if possible.

> and on
> Sunday the Software Defined Storage DevRoom has scheduled many more.
>
> Hope to see you there!
> Niels
>
>
> 0. https://fosdem.org/2018/schedule/buildings/#k
> 1. https://fosdem.org/2018/schedule/event/optimizing_sds/
> 2. https://fosdem.org/2018/schedule/track/software_defined_storage/
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://lists.gluster.org/mailman/listinfo/gluster-devel
>



-- 
Raghavendra G
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] [Gluster-Maintainers] Release 4.0: Branched

2018-01-26 Thread Raghavendra G
On Fri, Jan 26, 2018 at 4:49 PM, Raghavendra Gowdappa <rgowd...@redhat.com>
wrote:

>
>
> - Original Message -
> > From: "Shyam Ranganathan" <srang...@redhat.com>
> > To: "Gluster Devel" <gluster-devel@gluster.org>, "GlusterFS
> Maintainers" <maintain...@gluster.org>
> > Sent: Thursday, January 25, 2018 9:49:51 PM
> > Subject: Re: [Gluster-Maintainers] [Gluster-devel] Release 4.0: Branched
> >
> > On 01/23/2018 03:17 PM, Shyam Ranganathan wrote:
> > > 4.0 release has been branched!
> > >
> > > I will follow this up with a more detailed schedule for the release,
> and
> > > also the granted feature backport exceptions that we are waiting.
> > >
> > > Feature backports would need to make it in by this weekend, so that we
> > > can tag RC0 by the end of the month.
> >
> > Backports need to be ready for merge on or before Jan, 29th 2018 3:00 PM
> > Eastern TZ.
> >
> > Features that requested and hence are granted backport exceptions are as
> > follows,
> >
> > 1) Dentry fop serializer xlator on brick stack
> > https://github.com/gluster/glusterfs/issues/397
> >
> > @Du please backport the same to the 4.0 branch as the patch in master is
> > merged.
>
> Sure.
>

https://review.gluster.org/#/c/19340/1
But this might fail smoke, as the associated bug is not filed against the
4.0 branch. This is blocked on the 4.0 version tag being available in
Bugzilla.


> >
> > 2) Leases support on GlusterFS
> > https://github.com/gluster/glusterfs/issues/350
> >
> > @Jiffin and @ndevos, there is one patch pending against master,
> > https://review.gluster.org/#/c/18785/ please do the needful and backport
> > this to the 4.0 branch.
> >
> > 3) Data corruption in write ordering of rebalance and application writes
> > https://github.com/gluster/glusterfs/issues/308
> >
> > @susant, @du if we can conclude on the strategy here, please backport as
> > needed.
>
> https://review.gluster.org/#/c/19207/
> Review comments need to be addressed and centos regressions are failing.
>
> https://review.gluster.org/#/c/19202/
> There are some suggestions on the patch. If others agree they are valid,
> this patch can be considered as redundant with approach of #19207. However,
> as I've mentioned in the comments there are some tradeoffs too. So, Waiting
> for response to my comments. If nobody responds in the time period given,
> we can merge the patch and susant will have to backport to 4.0 branch.
>
> >
> > 4) Couple of patches that are tracked for a backport are,
> > https://review.gluster.org/#/c/19223/
> > https://review.gluster.org/#/c/19267/ (prep for ctime changes in later
> > releases)
> >
> > Other features discussed are not in scope for a backports to 4.0.
> >
> > If you asked for one and do not see it in this list, shout out!
> >
> > >
> > > Only exception could be: https://review.gluster.org/#/c/19223/
> > >
> > > Thanks,
> > > Shyam
> > > ___
> > > Gluster-devel mailing list
> > > Gluster-devel@gluster.org
> > > http://lists.gluster.org/mailman/listinfo/gluster-devel
> > >
> > ___
> > maintainers mailing list
> > maintain...@gluster.org
> > http://lists.gluster.org/mailman/listinfo/maintainers
> >
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://lists.gluster.org/mailman/listinfo/gluster-devel
>



-- 
Raghavendra G
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Release 4.0: Making it happen!

2018-01-22 Thread Raghavendra G
https://github.com/gluster/glusterfs/issues/397

The patch has been in discussion for quite some time. Thanks to efforts
from Amar, it's good to be merged. I've proposed it for 4.0 for now. If
branching is already done, or if you feel it's not fit for the 4.0 branch,
please feel free to change the milestone to the appropriate version (maybe
4.1?).

On Fri, Jan 19, 2018 at 9:30 AM, Jiffin Tony Thottan <jthot...@redhat.com>
wrote:

>
>
> On Wednesday 17 January 2018 04:55 PM, Jiffin Tony Thottan wrote:
>
>>
>>
>> On Tuesday 16 January 2018 08:57 PM, Shyam Ranganathan wrote:
>>
>>> On 01/10/2018 01:14 PM, Shyam Ranganathan wrote:
>>>
>>>> Hi,
>>>>
>>>> 4.0 branching date is slated on the 16th of Jan 2018 and release is
>>>> slated for the end of Feb (28th), 2018.
>>>>
>>> This is today! So read on...
>>>
>>> Short update: I am going to wait a couple more days before branching, to
>>> settle release content and exceptions. Branching is hence on Jan, 18th
>>> (Thursday).
>>>
>>> We are at the phase when we need to ensure our release scope is correct
>>>> and *must* release features are landing. Towards this we need the
>>>> following information for all contributors.
>>>>
>>>> 1) Features that are making it to the release by branching date
>>>>
>>>> - There are currently 35 open github issues marked as 4.0 milestone [1]
>>>> - Need contributors to look at this list and let us know which will meet
>>>> the branching date
>>>>
>>> Other than the protocol changes (from Amar), I did not receive any
>>> requests for features that are making it to the release. I have compiled
>>> a list of features based on patches in gerrit that are open, to check
>>> what features are viable to make it to 4.0. This can be found here [3].
>>>
>>> NOTE: All features, other than the ones in [3] are being moved out of
>>> the 4.0 milestone.
>>>
>>> - Need contributors to let us know which may slip and hence needs a
>>>> backport exception to 4.0 branch (post branching).
>>>> - Need milestone corrections on features that are not making it to the
>>>> 4.0 release
>>>>
>>> I need the following contributors to respond and state if the feature in
>>> [3] should still be tracked against 4.0 and how much time is possibly
>>> needed to make it happen.
>>>
>>> - Poornima, Amar, Jiffin, Du, Susant, Sanoj, Vijay
>>>
>>
>> Hi,
>>
>> The two gfapi[1,2] related changes have ack from poornima and Niels
>> mentioned that he will do the review by EOD.
>>
>> [1] https://review.gluster.org/#/c/18784/
>> [2] https://review.gluster.org/#/c/18785/
>>
>>
>>
> Niels has few comments on above patch. I need to have one week
> extension(26th Jan 2018)
> --
> Jiffin
>
>
> Regards,
>> Jiffin
>>
>>
>>> NOTE: Slips are accepted if they fall 1-1.5 weeks post branching, not
>>>> post that, and called out before branching!
>>>>
>>>> 2) Reviews needing priority
>>>>
>>>> - There could be features that are up for review, and considering we
>>>> have about 6-7 days before branching, we need a list of these commits,
>>>> that you want review attention on.
>>>> - This will be added to this [2] dashboard, easing contributor access to
>>>> top priority reviews before branching
>>>>
>>> As of now, I am adding a few from the list in [3] for further review
>>> attention as I see things evolving, more will be added as the point
>>> above is answered by the respective contributors.
>>>
>>> 3) Review help!
>>>>
>>>> - This link [2] contains reviews that need attention, as they are
>>>> targeted for 4.0. Request maintainers and contributors to pay close
>>>> attention to this list on a daily basis and help out with reviews.
>>>>
>>>> Thanks,
>>>> Shyam
>>>>
>>>> [1] github issues marked for 4.0:
>>>> https://github.com/gluster/glusterfs/milestone/3
>>>>
>>>> [2] Review focus for features planned to land in 4.0:
>>>> https://review.gluster.org/#/q/owner:srangana%2540redhat.com+is:starred
>>>>
>>> [3] Release 4.0 features with pending code reviews:
>>> http://bit.ly/2rbjcl8
>>>
>>> ___
>>>> Gluster-devel mailing list
>>>> Gluster-devel@gluster.org
>>>> http://lists.gluster.org/mailman/listinfo/gluster-devel
>>>>
>>>> ___
>>> Gluster-devel mailing list
>>> Gluster-devel@gluster.org
>>> http://lists.gluster.org/mailman/listinfo/gluster-devel
>>>
>>
>> ___
>> Gluster-devel mailing list
>> Gluster-devel@gluster.org
>> http://lists.gluster.org/mailman/listinfo/gluster-devel
>>
>
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://lists.gluster.org/mailman/listinfo/gluster-devel
>



-- 
Raghavendra G
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] configure fails due to failure in locating libxml2-devel

2018-01-21 Thread Raghavendra G
On Mon, Jan 22, 2018 at 11:32 AM, Kaushal M <kshlms...@gmail.com> wrote:

> Did you run autogen.sh after installing libxml2-devel?
>

I hadn't. I did it now and configure succeeds. Thanks Kaushal.
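
For the record, the sequence that resolved it here (CentOS 7, as in my
original mail):

yum install libxml2-devel
./autogen.sh     # regenerates configure so it picks up the newly installed -devel package
./configure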


> On Mon, Jan 22, 2018 at 11:10 AM, Raghavendra G
> <raghavendra...@gmail.com> wrote:
> > All,
> >
> > # ./configure
> > 
> > configure: error: libxml2 devel libraries not found
> >
> > # ls /usr/lib64/libxml2.so
> > /usr/lib64/libxml2.so
> >
> > # ls /usr/include/libxml2/
> > libxml
> >
> > # yum install libxml2-devel
> > Package libxml2-devel-2.9.1-6.el7_2.3.x86_64 already installed and
> latest
> > version
> > Nothing to do
> >
> > Looks like the issue is very similar to one filed in:
> > https://bugzilla.redhat.com/show_bug.cgi?id=64134
> >
> > Has anyone encountered this? How did you workaround this?
> >
> > regards,
> > --
> > Raghavendra G
> >
> >
> > ___
> > Gluster-devel mailing list
> > Gluster-devel@gluster.org
> > http://lists.gluster.org/mailman/listinfo/gluster-devel
>



-- 
Raghavendra G
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] cluster/dht: restrict migration of opened files

2018-01-21 Thread Raghavendra G
On Thu, Jan 18, 2018 at 8:11 PM, Pranith Kumar Karampuri <
pkara...@redhat.com> wrote:

>
>
> On Tue, Jan 16, 2018 at 2:52 PM, Raghavendra Gowdappa <rgowd...@redhat.com
> > wrote:
>
>> All,
>>
>> Patch [1] prevents migration of opened files during rebalance operation.
>> If patch [1] affects you, please voice out your concerns. [1] is a stop-gap
>> fix for the problem discussed in issues [2][3]
>>
>
> What is the impact on VM and gluster-block usecases after this patch? Will
> it rebalance any data in these usecases?
>

Assuming there is always an fd open on these files, and that
cluster.force-migration is set to off, there will be no migration. However,
we can force migration (even of files with open fds) by setting
cluster.force-migration to on.
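
In other words, roughly (using the standard volume-set CLI; <volname> is a
placeholder):

# stop-gap behaviour: files with open fds are skipped during rebalance
gluster volume set <volname> cluster.force-migration off

# opt back in to migrating files with open fds (re-exposes issues [2][3] below)
gluster volume set <volname> cluster.force-migration on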


>
>>
>> [1] https://review.gluster.org/#/c/19202/
>> [2] https://github.com/gluster/glusterfs/issues/308
>> [3] https://github.com/gluster/glusterfs/issues/347
>>
>> regards,
>> Raghavendra
>>
>> ___
>> Gluster-devel mailing list
>> Gluster-devel@gluster.org
>> http://lists.gluster.org/mailman/listinfo/gluster-devel
>>
>
>
>
> --
> Pranith
>
> _______
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://lists.gluster.org/mailman/listinfo/gluster-devel
>



-- 
Raghavendra G
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

[Gluster-devel] configure fails due to failure in locating libxml2-devel

2018-01-21 Thread Raghavendra G
All,

# ./configure

configure: error: libxml2 devel libraries not found

# ls /usr/lib64/libxml2.so
/usr/lib64/libxml2.so

# ls /usr/include/libxml2/
libxml

# yum install libxml2-devel
Package libxml2-devel-2.9.1-6.el7_2.3.x86_64 already installed and latest
version
Nothing to do

Looks like the issue is very similar to one filed in:
https://bugzilla.redhat.com/show_bug.cgi?id=64134

Has anyone encountered this? How did you workaround this?

regards,
-- 
Raghavendra G
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Determine if a file is open in a cluster

2018-01-11 Thread Raghavendra G
Another simple test in code would be to check whether inode->fd_list is
empty, as fd_list represents the list of all fds opened on that inode.
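
Something along these lines from within a translator (only a sketch,
assuming the usual libglusterfs types and helpers are in scope; I haven't
compiled it):

static gf_boolean_t
inode_has_open_fds (inode_t *inode)
{
        gf_boolean_t has_fds = _gf_false;

        LOCK (&inode->lock);
        {
                /* fd_list holds every fd currently open on this inode */
                has_fds = !list_empty (&inode->fd_list);
        }
        UNLOCK (&inode->lock);

        return has_fds;
}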

On Fri, Jan 12, 2018 at 4:38 AM, Vijay Bellur <vbel...@redhat.com> wrote:

> Hi Ram,
>
> Do you want to check this from within a translator? If so, you can look
> for GLUSTERFS_OPEN_FD_COUNT in xlators like dht, afr, ec etc. where they
> check for open file descriptors in various FOPs.
>
> Regards,
> Vijay
>
> On Thu, Jan 11, 2018 at 10:40 AM, Ram Ankireddypalle <are...@commvault.com
> > wrote:
>
>> Hi,
>>
>>Is it possible to find out within a cluster if a file is currently
>> open by any of the clients/self-heal daemon or any other daemon’s within a
>> cluster. Please point to the sample code in any of the Xlator which does
>> such a check.
>>
>>
>>
>> Thanks and Regards,
>>
>> Ram
>> ***Legal Disclaimer***
>> "This communication may contain confidential and privileged material for
>> the
>> sole use of the intended recipient. Any unauthorized review, use or
>> distribution
>> by others is strictly prohibited. If you have received the message by
>> mistake,
>> please advise the sender by reply email and delete the message. Thank
>> you."
>> **
>>
>> ___
>> Gluster-devel mailing list
>> Gluster-devel@gluster.org
>> http://lists.gluster.org/mailman/listinfo/gluster-devel
>>
>
>
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://lists.gluster.org/mailman/listinfo/gluster-devel
>



-- 
Raghavendra G
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] a link issue maybe introduced in a bug fix " Don't let NFS cache stat after writes"

2018-01-10 Thread Raghavendra G
+csaba for fuse. +poornima, in case the issue turns out to be in md-cache's
caching.

On Wed, Jan 10, 2018 at 5:37 PM, Pranith Kumar Karampuri <
pkara...@redhat.com> wrote:

>
>
> On Wed, Jan 10, 2018 at 11:09 AM, Lian, George (NSB - CN/Hangzhou) <
> george.l...@nokia-sbell.com> wrote:
>
>> Hi, Pranith Kumar,
>>
>>
>>
>> I has create a bug on Bugzilla https://bugzilla.redhat.com/sh
>> ow_bug.cgi?id=1531457
>>
>> After my investigation for this link issue, I suppose your changes on
>> afr-dir-write.c with issue " Don't let NFS cache stat after writes" , your
>> fix is like:
>>
>> --
>>
>>if (afr_txn_nothing_failed (frame, this)) {
>>
>> /*if it did pre-op, it will do post-op changing
>> ctime*/
>>
>> if (priv->consistent_metadata &&
>>
>> afr_needs_changelog_update (local))
>>
>> afr_zero_fill_stat (local);
>>
>> local->transaction.unwind (frame, this);
>>
>> }
>>
>> In the above fix, it set the ia_nlink to ‘0’ if option
>> consistent-metadata is set to “on”.
>>
>> And hard link a file with which just created will lead to an error, and
>> the error is caused in kernel function “vfs_link”:
>>
>> if (inode->i_nlink == 0 && !(inode->i_state & I_LINKABLE))
>>
>>  error =  -ENOENT;
>>
>>
>>
>> could you please have a check and give some comments here?
>>
>
> When stat is "zero filled", understanding is that the higher layer
> protocol doesn't send stat value to the kernel and a separate lookup is
> sent by the kernel to get the latest stat value. In which protocol are you
> seeing this issue? Fuse/NFS/SMB?
>
>
>>
>>
>> Thanks & Best Regards,
>>
>> George
>>
>
>
>
> --
> Pranith
>
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://lists.gluster.org/mailman/listinfo/gluster-devel
>



-- 
Raghavendra G
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Need inputs on patch #17985

2017-11-29 Thread Raghavendra G
I think this caused a regression in quick-read. Going through the code, I
realized that quick-read doesn't fetch the content of a file pointed to by a
dentry in readdirplus. Since the patch in question prevents any lookup from
the resolver, reads on the file for the duration of "entry-timeout" (a
cmdline option to the fuse mount, whose default value is 1 second) after the
entry was discovered in readdirplus will not be served by quick-read, even
though the file's size makes it eligible for caching. This may cause a perf
regression in read-heavy workloads on small files. We'll be doing more
testing to confirm this.
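
If anyone wants to check the effect locally, shrinking the entry timeout on
the mount should make the difference visible (mount options as I remember
them; values are in seconds, default 1):

glusterfs --volfile-server=<server> --volfile-id=<volname> --entry-timeout=0 /mnt/test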

On Tue, Sep 12, 2017 at 11:31 AM, Raghavendra G <raghavendra...@gmail.com>
wrote:

> Update. Two more days to go for the deadline. Till now, there are no open
> issues identified against this patch.
>
> On Fri, Sep 8, 2017 at 6:54 AM, Raghavendra Gowdappa <rgowd...@redhat.com>
> wrote:
>
>>
>>
>> - Original Message -
>> > From: "FNU Raghavendra Manjunath" <rab...@redhat.com>
>> > To: "Raghavendra Gowdappa" <rgowd...@redhat.com>
>> > Cc: "Raghavendra G" <raghavendra...@gmail.com>, "Nithya Balachandran" <
>> nbala...@redhat.com>, anoo...@redhat.com,
>> > "Gluster Devel" <gluster-devel@gluster.org>, "Raghavendra Bhat" <
>> raghaven...@redhat.com>
>> > Sent: Thursday, September 7, 2017 6:44:51 PM
>> > Subject: Re: [Gluster-devel] Need inputs on patch #17985
>> >
>> > From snapview client perspective one important thing to note. For
>> building
>> > the context for the entry point (by default ".snaps") a explicit lookup
>> has
>> > to be done on it. The dentry for ".snaps" is not returned when readdir
>> is
>> > done on its parent directory (Not even when ls -a is done). So for
>> building
>> > the context of .snaps (in the context snapview client saves the
>> information
>> > about whether it is a real inode or virtual inode) we need a lookup.
>>
>> Since the dentry corresponding to ".snaps" is not returned, there won't
>> be an inode for this directory linked in itable. Also, glusterfs wouldn't
>> have given nodeid corresponding to ".snaps" during readdir response (as
>> dentry itself is not returned). So, kernel would do an explicit lookup
>> before doing any operation on ".snaps" (unlike for those dentries which
>> contain nodeid kernel can choose to skip a lookup) and we are safe. So,
>> #17985 is safe in its current form.
>>
>> >
>> > From snapview server perspective as well a lookup might be needed. In
>> > snapview server a glfs handle is established between the snapview server
>> > and the snapshot brick. So a inode in snapview server process contains
>> the
>> > glfs handle for the object being accessed from snapshot.  In snapview
>> > server readdirp does not build the inode context (which contains the
>> glfs
>> > handle etc) because glfs handle is returned only in lookup.
>>
>> Same argument I've given holds good for this case too. Important point to
>> note is that "there is no dentry and hence no nodeid corresponding to
>> .snaps is passed to kernel and kernel is forced to do an explicit lookup".
>>
>> >
>> > Regards,
>> > Raghavendra
>> >
>> >
>> > On Tue, Aug 29, 2017 at 12:53 AM, Raghavendra Gowdappa <
>> rgowd...@redhat.com>
>> > wrote:
>> >
>> > >
>> > >
>> > > - Original Message -
>> > > > From: "Raghavendra G" <raghavendra...@gmail.com>
>> > > > To: "Nithya Balachandran" <nbala...@redhat.com>
>> > > > Cc: "Raghavendra Gowdappa" <rgowd...@redhat.com>,
>> anoo...@redhat.com,
>> > > "Gluster Devel" <gluster-devel@gluster.org>,
>> > > > raghaven...@redhat.com
>> > > > Sent: Tuesday, August 29, 2017 8:52:28 AM
>> > > > Subject: Re: [Gluster-devel] Need inputs on patch #17985
>> > > >
>> > > > On Thu, Aug 24, 2017 at 2:53 PM, Nithya Balachandran <
>> > > nbala...@redhat.com>
>> > > > wrote:
>> > > >
>> > > > > It has been a while but iirc snapview client (loaded abt dht/tier
>> etc)
>> > > had
>> > > > > some issues when we ran tiering tests. Rafi might have more info
>> on
>> > > this -
>> > > > > basically it was expecting

Re: [Gluster-devel] regression tests taking time

2017-11-29 Thread Raghavendra G
Isn't there a timeout after which a test is aborted? AFAIR it's 300 seconds.
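
If these tests genuinely need more time, I believe (correct me if my memory
of the harness is wrong) the .t file itself can raise the limit before
sourcing the test framework, something like:

#!/bin/bash
SCRIPT_TIMEOUT=500    # per-test override of the default timeout, in seconds

. $(dirname $0)/../../include.rc
. $(dirname $0)/../../volume.rc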

On Wed, Nov 29, 2017 at 2:55 AM, Amar Tumballi <atumb...@redhat.com> wrote:

> Not sure if to believe it, just these 2 patches took 1hr30mins in
> regression. Is it expected?
>
> *21:08:30* ./tests/bugs/nfs/bug-1053579.t  -  2683 second
> *21:08:30* ./tests/bugs/fuse/many-groups-for-acl.t  -  2229 second
> ./tests/bugs/fuse/many-groups-for-acl.t  -  2229 second
>
>
> Can someone check these out?
>
> --
> Amar Tumballi (amarts)
>
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://lists.gluster.org/mailman/listinfo/gluster-devel
>



-- 
Raghavendra G
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Change #18681 has broken build on master

2017-11-07 Thread Raghavendra G
On Tue, Nov 7, 2017 at 5:32 PM, Nigel Babu <nig...@redhat.com> wrote:

> Landed:https://github.com/gluster/glusterfs/commit/
> c3d7974e2be68f0fac8f54c9557d0f868e6be6c8
>
> Please rebase your patches and re-trigger.
>

Thanks, Nigel. I am wondering how the build for #18681 succeeded in the
first place. Any insights on this?


> On Tue, Nov 7, 2017 at 5:23 PM, Nigel Babu <nig...@redhat.com> wrote:
>
>> Rafi has a fix[1]. I'm going to make it skip regressions and land it
>> directly.
>>
>> https://review.gluster.org/#/c/18680/
>>
>> On Tue, Nov 7, 2017 at 4:42 PM, Raghavendra Gowdappa <rgowd...@redhat.com
>> > wrote:
>>
>>> Please check [1].
>>>
>>> Build on master branch on my laptop failed too:
>>>
>>> [raghu@unused server]$ make > /dev/null
>>> server.c: In function 'init':
>>> server.c:1205:9: error: too few arguments to function
>>> 'rpcsvc_program_register'
>>> In file included from server.h:17:0,
>>>  from server.c:16:
>>> ../../../../rpc/rpc-lib/src/rpcsvc.h:426:1: note: declared here
>>> make[1]: *** [server.lo] Error 1
>>> make: *** [all-recursive] Error 1
>>>
>>> The change was introduced by [2]. However, the puzzling thing is [2]
>>> itself was built successfully and has passed all tests. Wondering how did
>>> that happen.
>>>
>>> [1] https://build.gluster.org/job/centos6-regression/7281/console
>>> [2] review.gluster.org/18681
>>>
>>> regards,
>>> Raghavendra
>>>
>>
>>
>>
>> --
>> nigelb
>>
>
>
>
> --
> nigelb
>
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://lists.gluster.org/mailman/listinfo/gluster-devel
>



-- 
Raghavendra G
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Report ESTALE as ENOENT

2017-10-18 Thread Raghavendra G
+Brian Foster

On Wed, Oct 11, 2017 at 4:11 PM, Raghavendra G <raghaven...@gluster.com>
wrote:

> We ran into a regression [2][3]. Hence reviving this thread.
>
> [2] https://bugzilla.redhat.com/show_bug.cgi?id=1500269
> [3] https://review.gluster.org/18463
>
> On Thu, Mar 31, 2016 at 1:22 AM, J. Bruce Fields <bfie...@fieldses.org>
> wrote:
>
>> On Mon, Mar 28, 2016 at 04:21:00PM -0400, Vijay Bellur wrote:
>> > On 03/28/2016 09:34 AM, FNU Raghavendra Manjunath wrote:
>> > >
>> > >I can understand the concern. But I think instead of generally
>> > >converting all the ESTALE errors ENOENT, probably we should try to
>> > >analyze the errors that are generated by lower layers (like posix).
>> > >
>> > >Even fuse kernel module some times returns ESTALE. (Well, I can see it
>> > >returning ESTALE errors in some cases in the code. Someone please
>> > >correct me if thats not the case).  And aso I am not sure if converting
>> > >all the ESTALE errors to ENOENT is ok or not.
>> >
>> > ESTALE in fuse is returned only for export_operations. fuse
>> > implements this for providing support to export fuse mounts as nfs
>> > exports. A cursory reading of the source seems to indicate that fuse
>> > returns ESTALE only in cases where filehandle resolution fails.
>> >
>> > >
>> > >For fd based operations, I am not sure if ENOENT can be sent or not (as
>> > >though the file is unlinked, it can be accessed if there were open fds
>> > >on it before unlinking the file).
>> >
>> > ESTALE should be fine for fd based operations. It would be analogous
>> > to a filehandle resolution failing and should not be a common
>> > occurrence.
>> >
>> > >
>> > >I feel, we have to look into some parts to check if they generating
>> > >ESTALE is a proper error or not. Also, if there is any bug in below
>> > >layers fixing which can avoid ESTALE errors, then I feel that would be
>> > >the better option.
>> > >
>> >
>> > I would prefer to:
>> >
>> > 1. Return ENOENT for all system calls that operate on a path.
>> >
>> > 2. ESTALE might be ok for file descriptor based operations.
>>
>> Note that operations which operate on paths can fail with ESTALE when
>> they attempt to look up a component within a directory that no longer
>> exists.
>>
>
> But, "man 2 rmdir"  or "man 2 unlink" doesn't list ESTALE as a valid
> error. Also rm doesn't seem to handle ESTALE too [3]
>
> [4] https://github.com/coreutils/coreutils/blob/master/src/remove.c#L305
>
>
>> Maybe non-creating open("./foo") returning ENOENT would be reasonable in
>> this case since that's what you'd get in the local filesystem case, but
>> creat("./foo") returning ENOENT, for example, isn't something
>> applications will be written to handle.
>>
>> The Linux VFS will retry ESTALE on path-based systemcalls *one* time, to
>> reduce the chance of ESTALE in those cases.
>
>
> I should've anticipated bug [2] due to this comment. My mistake. Bug [2]
> is indeed due to kernel not retrying open on receiving an ENOENT error.
> Glusterfs sent ENOENT because file's inode-number/nodeid changed but same
> path exists. The correct error would've been ESTALE, but due to our
> conversion of ESTALE to ENOENT, the latter was sent back to kernel.
>

We have an application which does very frequent renames (around 10-15 per
second). So a single retry by the kernel of an open that failed with ESTALE
is not helping us.
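
For illustration, the workaround on the application side would be a retry
wrapper around open(), roughly like this sketch (illustrative only, not our
application's actual code):

/* Application-side ESTALE retry loop (sketch). The kernel already performs
 * one such retry for path-based open(); with renames happening 10-15 times
 * a second, even a handful of retries may not be enough. */
#include <errno.h>
#include <fcntl.h>

static int
open_retry_estale (const char *path, int flags, int max_retries)
{
        int fd = -1;
        int tries = 0;

        do {
                fd = open (path, flags);
        } while (fd < 0 && errno == ESTALE && ++tries < max_retries);

        return fd;
}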

@Bruce/Brian,

Do you know why the VFS chose an approach of retrying instead of a stricter
synchronization mechanism using locking? For example, rename and open
could've been synchronized using a lock.

For example, rough pseudocode for open and rename could've been (please
ignore the ordering of the src/dst locks in rename):

sys_open ()
{
        LOCK (dentry->lock);
        {
                lookup path;
                open (inode);
        }
        UNLOCK (dentry->lock);
}

sys_rename ()
{
        LOCK (dst-dentry->lock);
        {
                LOCK (src-dentry->lock);
                {
                        rename (src, dst);
                }
                UNLOCK (src-dentry->lock);
        }
        UNLOCK (dst-dentry->lock);
}

@Bruce,

With the current retry model, any suggestions on how to handle applications
that do frequent renames?


> Looking through the kernel VFS code, only open *seems* to retry
> (do_filp_open). I couldn't find similar logic for other path-based syscalls
> like rmdir, unlink, stat, chmod etc.
>

Re: [Gluster-devel] Report ESTALE as ENOENT

2017-10-11 Thread Raghavendra G
We ran into a regression [2][3]. Hence reviving this thread.

[2] https://bugzilla.redhat.com/show_bug.cgi?id=1500269
[3] https://review.gluster.org/18463

On Thu, Mar 31, 2016 at 1:22 AM, J. Bruce Fields <bfie...@fieldses.org>
wrote:

> On Mon, Mar 28, 2016 at 04:21:00PM -0400, Vijay Bellur wrote:
> > On 03/28/2016 09:34 AM, FNU Raghavendra Manjunath wrote:
> > >
> > >I can understand the concern. But I think instead of generally
> > >converting all the ESTALE errors ENOENT, probably we should try to
> > >analyze the errors that are generated by lower layers (like posix).
> > >
> > >Even fuse kernel module some times returns ESTALE. (Well, I can see it
> > >returning ESTALE errors in some cases in the code. Someone please
> > >correct me if thats not the case).  And aso I am not sure if converting
> > >all the ESTALE errors to ENOENT is ok or not.
> >
> > ESTALE in fuse is returned only for export_operations. fuse
> > implements this for providing support to export fuse mounts as nfs
> > exports. A cursory reading of the source seems to indicate that fuse
> > returns ESTALE only in cases where filehandle resolution fails.
> >
> > >
> > >For fd based operations, I am not sure if ENOENT can be sent or not (as
> > >though the file is unlinked, it can be accessed if there were open fds
> > >on it before unlinking the file).
> >
> > ESTALE should be fine for fd based operations. It would be analogous
> > to a filehandle resolution failing and should not be a common
> > occurrence.
> >
> > >
> > >I feel, we have to look into some parts to check if they generating
> > >ESTALE is a proper error or not. Also, if there is any bug in below
> > >layers fixing which can avoid ESTALE errors, then I feel that would be
> > >the better option.
> > >
> >
> > I would prefer to:
> >
> > 1. Return ENOENT for all system calls that operate on a path.
> >
> > 2. ESTALE might be ok for file descriptor based operations.
>
> Note that operations which operate on paths can fail with ESTALE when
> they attempt to look up a component within a directory that no longer
> exists.
>

But, "man 2 rmdir" or "man 2 unlink" doesn't list ESTALE as a valid error.
Also rm doesn't seem to handle ESTALE either [4]

[4] https://github.com/coreutils/coreutils/blob/master/src/remove.c#L305


> Maybe non-creating open("./foo") returning ENOENT would be reasonable in
> this case since that's what you'd get in the local filesystem case, but
> creat("./foo") returning ENOENT, for example, isn't something
> applications will be written to handle.
>
> The Linux VFS will retry ESTALE on path-based systemcalls *one* time, to
> reduce the chance of ESTALE in those cases.


I should've anticipated bug [2] due to this comment. My mistake. Bug [2] is
indeed due to the kernel not retrying the open on receiving an ENOENT error.
Glusterfs sent ENOENT because the file's inode-number/nodeid changed while
the same path still exists. The correct error would've been ESTALE, but due
to our conversion of ESTALE to ENOENT, the latter was sent back to the kernel.

Looking through the kernel VFS code, only open *seems* to retry
(do_filp_open). I couldn't find similar logic for other path-based syscalls
like rmdir, unlink, stat, chmod etc.

> The bugzilla entry that
> tracked those patches might be interesting:
>
> https://bugzilla.redhat.com/show_bug.cgi?id=678544
>
> > NFS recommends that applications add special code for handling
> > ESTALE [1]. Unfortunately changing application code is not easy and
> > hence it does not come as a surprise that coreutils also does not
> > accommodate ESTALE.
>
> We also need to consider whether the application's handling of the
> ENOENT case could be incorrect for the ESTALE case, with consequences
> possibly as bad as or worse than consequences of seeing an unexpected
> error.
>
> My first intuition is that translating ESTALE to ENOENT is less safe
> than not doing so, because:
>
> - once an ESTALE-unaware application his the ESTALE case, we
>   risk a bug regardless of which we return, but if we return
>   ESTALE at least the problem should be more obvious to the
>   person debugging.
> - for ESTALE-aware applications, the ESTALE/ENOENT distinction
>   is useful.
>

Another place not to convert is those cases where the kernel retries the
operation on seeing an ESTALE.

I guess we need to think through each operation; we cannot convert ESTALE to
ENOENT always.
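
To make that concrete, the kind of per-operation mapping I have in mind would
look roughly like this (the enum and helper below are hypothetical,
illustration only - they are not gluster symbols):

/* Hypothetical sketch: convert ESTALE to ENOENT per operation instead of
 * unconditionally. The fop enum is invented for illustration. */
#include <errno.h>

typedef enum {
        FOP_OPEN,    /* the kernel retries a path-based open on ESTALE */
        FOP_FSTAT,   /* fd-based: ESTALE is acceptable, per the discussion */
        FOP_UNLINK,  /* man 2 unlink doesn't list ESTALE; apps expect ENOENT */
        FOP_RMDIR    /* likewise for man 2 rmdir */
} fop_t;

static int
map_estale (fop_t fop, int op_errno)
{
        if (op_errno != ESTALE)
                return op_errno;

        switch (fop) {
        case FOP_OPEN:
        case FOP_FSTAT:
                return ESTALE;   /* keep the real error */
        case FOP_UNLINK:
        case FOP_RMDIR:
        default:
                return ENOENT;   /* no kernel retry, no documented ESTALE */
        }
}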


> But I haven't really thought through examples.
>
> --b.
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel
>



-- 
Raghavendra G
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] brick multiplexing regression is broken

2017-10-05 Thread Raghavendra G
The test which is failing was also introduced by the same patch; it is
supposed to validate the functionality the patch introduces. From the history
of earlier patchsets of the same patch, the same test has failed before too,
albeit inconsistently (though the merged version passed the centos
regressions). So it looks like the patch is not working as intended under
some race conditions. The frequent earlier failures should've served as an
alert, but I failed to notice. Sorry about that.

On Thu, Oct 5, 2017 at 6:04 PM, Atin Mukherjee <amukh...@redhat.com> wrote:

> The following commit has broken the brick multiplexing regression job.
> tests/bugs/bug-1371806_1.t has failed couple of times.  One of the latest
> regression job report is at https://build.gluster.org/job/
> regression-test-with-multiplex/406/console .
>
>
> commit 9b4de61a136b8e5ba7bf0e48690cdb1292d0dee8
> Author: Mohit Agrawal <moagr...@redhat.com>
> Date:   Fri May 12 21:12:47 2017 +0530
>
> cluster/dht : User xattrs are not healed after brick stop/start
>
> Problem: In a distributed volume custom extended attribute value for a
> directory
>  does not display correct value after stop/start or added
> newly brick.
>  If any extended(acl) attribute value is set for a directory
> after stop/added
>  the brick the attribute(user|acl|quota) value is not updated
> on brick
>  after start the brick.
>
> Solution: First store hashed subvol or subvol(has internal xattr) on
> inode ctx and
>   consider it as a MDS subvol.At the time of update custom
> xattr
>   (user,quota,acl, selinux) on directory first check the mds
> from
>   inode ctx, if mds is not present on inode ctx then throw
> EINVAL error
>   to application otherwise set xattr on MDS subvol with
> internal xattr
>   value of -1 and then try to update the attribute on other
> non MDS
>   volumes also.If mds subvol is down in that case throw an
>   error "Transport endpoint is not connected". In
> dht_dir_lookup_cbk|
>   dht_revalidate_cbk|dht_discover_complete call
> dht_call_dir_xattr_heal
>   to heal custom extended attribute.
>   In case of gnfs server if hashed subvol has not found based
> on
>   loc then wind a call on all subvol to update xattr.
>
> Fix:1) Save MDS subvol on inode ctx
> 2) Check if mds subvol is present on inode ctx
> 3) If mds subvol is down then call unwind with error ENOTCONN
> and if it is up
>then set new xattr "GF_DHT_XATTR_MDS" to -1 and wind a call
> on other
>subvol.
> 4) If setxattr fop is successful on non-mds subvol then
> increment the value of
>internal xattr to +1
> 5) At the time of directory_lookup check the value of new
> xattr GF_DHT_XATTR_MDS
> 6) If value is not 0 in dht_lookup_dir_cbk(other cbk)
> functions then call heal
>function to heal user xattr
> 7) syncop_setxattr on hashed_subvol to reset the value of
> xattr to 0
>if heal is successful on all subvol.
>
> Test : To reproduce the issue followed below steps
>1) Create a distributed volume and create mount point
>2) Create some directory from mount point mkdir tmp{1..5}
>3) Kill any one brick from the volume
>4) Set extended attribute from mount point on directory
>   setfattr -n user.foo -v "abc" ./tmp{1..5}
>   It will throw error " Transport End point is not connected "
>   for those hashed subvol is down
>5) Start volume with force option to start brick process
>6) Execute getfattr command on mount point for directory
>7) Check extended attribute on brick
>   getfattr -n user.foo /tmp{1..5}
>   It shows correct value for directories for those
>   xattr fop were executed successfully.
>
> Note: The patch will resolve xattr healing problem only for fuse mount
>   not for nfs mount.
>
> BUG: 1371806
> Signed-off-by: Mohit Agrawal <moagr...@redhat.com>
>
> Change-Id: I4eb137eace24a8cb796712b742f1d177a65343d5
>
>
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://lists.gluster.org/mailman/listinfo/gluster-devel
>



-- 
Raghavendra G
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Need inputs on patch #17985

2017-09-12 Thread Raghavendra G
Update: two more days to go until the deadline. So far, no open issues have
been identified against this patch.

On Fri, Sep 8, 2017 at 6:54 AM, Raghavendra Gowdappa <rgowd...@redhat.com>
wrote:

>
>
> - Original Message -
> > From: "FNU Raghavendra Manjunath" <rab...@redhat.com>
> > To: "Raghavendra Gowdappa" <rgowd...@redhat.com>
> > Cc: "Raghavendra G" <raghavendra...@gmail.com>, "Nithya Balachandran" <
> nbala...@redhat.com>, anoo...@redhat.com,
> > "Gluster Devel" <gluster-devel@gluster.org>, "Raghavendra Bhat" <
> raghaven...@redhat.com>
> > Sent: Thursday, September 7, 2017 6:44:51 PM
> > Subject: Re: [Gluster-devel] Need inputs on patch #17985
> >
> > From snapview client perspective one important thing to note. For
> building
> > the context for the entry point (by default ".snaps") a explicit lookup
> has
> > to be done on it. The dentry for ".snaps" is not returned when readdir is
> > done on its parent directory (Not even when ls -a is done). So for
> building
> > the context of .snaps (in the context snapview client saves the
> information
> > about whether it is a real inode or virtual inode) we need a lookup.
>
> Since the dentry corresponding to ".snaps" is not returned, there won't be
> an inode for this directory linked in itable. Also, glusterfs wouldn't have
> given nodeid corresponding to ".snaps" during readdir response (as dentry
> itself is not returned). So, kernel would do an explicit lookup before
> doing any operation on ".snaps" (unlike for those dentries which contain
> nodeid kernel can choose to skip a lookup) and we are safe. So, #17985 is
> safe in its current form.
>
> >
> > From snapview server perspective as well a lookup might be needed. In
> > snapview server a glfs handle is established between the snapview server
> > and the snapshot brick. So a inode in snapview server process contains
> the
> > glfs handle for the object being accessed from snapshot.  In snapview
> > server readdirp does not build the inode context (which contains the glfs
> > handle etc) because glfs handle is returned only in lookup.
>
> Same argument I've given holds good for this case too. Important point to
> note is that "there is no dentry and hence no nodeid corresponding to
> .snaps is passed to kernel and kernel is forced to do an explicit lookup".
>
> >
> > Regards,
> > Raghavendra
> >
> >
> > On Tue, Aug 29, 2017 at 12:53 AM, Raghavendra Gowdappa <
> rgowd...@redhat.com>
> > wrote:
> >
> > >
> > >
> > > - Original Message -
> > > > From: "Raghavendra G" <raghavendra...@gmail.com>
> > > > To: "Nithya Balachandran" <nbala...@redhat.com>
> > > > Cc: "Raghavendra Gowdappa" <rgowd...@redhat.com>, anoo...@redhat.com
> ,
> > > "Gluster Devel" <gluster-devel@gluster.org>,
> > > > raghaven...@redhat.com
> > > > Sent: Tuesday, August 29, 2017 8:52:28 AM
> > > > Subject: Re: [Gluster-devel] Need inputs on patch #17985
> > > >
> > > > On Thu, Aug 24, 2017 at 2:53 PM, Nithya Balachandran <
> > > nbala...@redhat.com>
> > > > wrote:
> > > >
> > > > > It has been a while but iirc snapview client (loaded abt dht/tier
> etc)
> > > had
> > > > > some issues when we ran tiering tests. Rafi might have more info on
> > > this -
> > > > > basically it was expecting to find the inode_ctx populated but it
> was
> > > not.
> > > > >
> > > >
> > > > Thanks Nithya. @Rafi, @Raghavendra Bhat, is it possible to take the
> > > > ownership of,
> > > >
> > > > * Identifying whether the patch in question causes the issue?
> > >
> > > gf_svc_readdirp_cbk is setting relevant state in inode [1]. I quickly
> > > checked whether its the same state stored by gf_svc_lookup_cbk and it
> looks
> > > like the same state. So, I guess readdirp is handled correctly by
> > > snapview-client and an explicit lookup is not required. But, will wait
> for
> > > inputs from rabhat and rafi.
> > >
> > > [1] https://github.com/gluster/glusterfs/blob/master/xlators/
> > > features/snapview-client/src/snapview-client.c#L1962
> > >
> > > > * Send a fix or at least evaluate whether a fix is possible.
> > > >
> >

Re: [Gluster-devel] Fuse mounts and inodes

2017-09-06 Thread Raghavendra G
On Wed, Sep 6, 2017 at 11:16 AM, Csaba Henk <ch...@redhat.com> wrote:

> Thanks Du, nice bit of info! It made me wander about the following:
>
> - Could it be then the default answer we give to "glusterfs client
> high memory usage"
>   type of complaints to set vfs_cache_pressure to 100 + x?
>
- And then x = ? Was there proper performance testing done to see how
> performance /
>   mem consumtion changes in terms of vfs_cache_performace?
>

I had a discussion with Manoj on this. One drawback with using the
vfs_cache_pressure tunable is that it drives a dynamic algorithm which
decides whether to purge from the page cache or the inode cache by looking
at the current memory pressure. An obvious problem for glusterfs is that the
various glusterfs caches are not visible to the kernel (memory consumed by
glusterfs is reflected neither in the page cache nor in the inode cache).
This _might_ result in the algorithm working poorly.

- vfs_cache_pressure is an allover system tunable. If 100 + x is ideal
> for GlusterFS, can
>   we take the courage to propose this? Is there no risk to trash other
> (disk-based)
>   filesystems' performace?
>

That's a valid point. Behavior of other filesystems would be a concern.

I've not really thought through this suggestion of tuning the /proc/sys/vm
tunables, and I am not an expert on which tunables are at our disposal. I
just wanted to bring this idea to the notice of a wider audience.


> Csaba
>
> On Wed, Sep 6, 2017 at 6:57 AM, Raghavendra G <raghaven...@gluster.com>
> wrote:
> > Another parallel effort could be trying to configure the number of
> > inodes/dentries cached by kernel VFS using /proc/sys/vm interface.
> >
> > ==
> >
> > vfs_cache_pressure
> > --
> >
> > This percentage value controls the tendency of the kernel to reclaim
> > the memory which is used for caching of directory and inode objects.
> >
> > At the default value of vfs_cache_pressure=100 the kernel will attempt to
> > reclaim dentries and inodes at a "fair" rate with respect to pagecache
> and
> > swapcache reclaim.  Decreasing vfs_cache_pressure causes the kernel to
> > prefer
> > to retain dentry and inode caches. When vfs_cache_pressure=0, the kernel
> > will
> > never reclaim dentries and inodes due to memory pressure and this can
> easily
> > lead to out-of-memory conditions. Increasing vfs_cache_pressure beyond
> 100
> > causes the kernel to prefer to reclaim dentries and inodes.
> >
> > Increasing vfs_cache_pressure significantly beyond 100 may have negative
> > performance impact. Reclaim code needs to take various locks to find
> > freeable
> > directory and inode objects. With vfs_cache_pressure=1000, it will look
> for
> > ten times more freeable objects than there are.
> >
> > Also we've an article for sysadmins which has a section:
> >
> > 
> >
> > With GlusterFS, many users with a lot of storage and many small files
> > easily end up using a lot of RAM on the server side due to
> > 'inode/dentry' caching, leading to decreased performance when the kernel
> > keeps crawling through data-structures on a 40GB RAM system. Changing
> > this value higher than 100 has helped many users to achieve fair caching
> > and more responsiveness from the kernel.
> >
> > 
> >
> > Complete article can be found at:
> > https://gluster.readthedocs.io/en/latest/Administrator%
> 20Guide/Linux%20Kernel%20Tuning/
> >
> > regards,
> >
> >
> > On Tue, Sep 5, 2017 at 5:20 PM, Raghavendra Gowdappa <
> rgowd...@redhat.com>
> > wrote:
> >>
> >> +gluster-devel
> >>
> >> Ashish just spoke to me about need of GC of inodes due to some state in
> >> inode that is being proposed in EC. Hence adding more people to
> >> conversation.
> >>
> >> > > On 4 September 2017 at 12:34, Csaba Henk <ch...@redhat.com> wrote:
> >> > >
> >> > > > I don't know, depends on how sophisticated GC we need/want/can get
> >> > > > by. I
> >> > > > guess the complexity will be inherent, ie. that of the algorithm
> >> > > > chosen
> >> > > > and
> >> > > > how we address concurrency & performance impacts, but once that's
> >> > > > got
> >> > > > right
> >> > > > the other aspects of implementation won't be hard.
> >> > > >
> >> > > > Eg. would it be good just to maintain a simple LRU list?
> >> > 

Re: [Gluster-devel] Need inputs on patch #17985

2017-09-06 Thread Raghavendra G
A gentle reminder. We are midway through the proposed timeline.

On Wed, Aug 30, 2017 at 10:28 AM, Raghavendra Gowdappa <rgowd...@redhat.com>
wrote:

>
>
> - Original Message -
> > From: "Raghavendra Gowdappa" <rgowd...@redhat.com>
> > To: "Raghavendra G" <raghavendra...@gmail.com>
> > Cc: "Nithya Balachandran" <nbala...@redhat.com>, anoo...@redhat.com,
> "Gluster Devel" <gluster-devel@gluster.org>,
> > raghaven...@redhat.com
> > Sent: Tuesday, August 29, 2017 10:23:49 AM
> > Subject: Re: [Gluster-devel] Need inputs on patch #17985
> >
> >
> >
> > - Original Message -
> > > From: "Raghavendra G" <raghavendra...@gmail.com>
> > > To: "Nithya Balachandran" <nbala...@redhat.com>
> > > Cc: "Raghavendra Gowdappa" <rgowd...@redhat.com>, anoo...@redhat.com,
> > > "Gluster Devel" <gluster-devel@gluster.org>,
> > > raghaven...@redhat.com
> > > Sent: Tuesday, August 29, 2017 8:52:28 AM
> > > Subject: Re: [Gluster-devel] Need inputs on patch #17985
> > >
> > > On Thu, Aug 24, 2017 at 2:53 PM, Nithya Balachandran <
> nbala...@redhat.com>
> > > wrote:
> > >
> > > > It has been a while but iirc snapview client (loaded abt dht/tier
> etc)
> > > > had
> > > > some issues when we ran tiering tests. Rafi might have more info on
> this
> > > > -
> > > > basically it was expecting to find the inode_ctx populated but it was
> > > > not.
> > > >
> > >
> > > Thanks Nithya. @Rafi, @Raghavendra Bhat, is it possible to take the
> > > ownership of,
> > >
> > > * Identifying whether the patch in question causes the issue?
> >
> > gf_svc_readdirp_cbk is setting relevant state in inode [1]. I quickly
> checked
> > whether its the same state stored by gf_svc_lookup_cbk and it looks like
> the
> > same state. So, I guess readdirp is handled correctly by snapview-client
> and
> > an explicit lookup is not required. But, will wait for inputs from rabhat
> > and rafi.
> >
> > [1]
> > https://github.com/gluster/glusterfs/blob/master/xlators/
> features/snapview-client/src/snapview-client.c#L1962
> >
> > > * Send a fix or at least evaluate whether a fix is possible.
> > >
> > > @Others,
> > >
> > > With the motivation of getting some traction on this, Is it ok if we:
> > > * Set a deadline of around 15 days to complete the review (or testing
> with
> > > the patch in question) of respective components and to come up with
> issues
> > > (if any).
> > > * Post the deadline, if there are no open issues, go ahead and merge
> the
> > > patch?
> > >
> > > If time is not enough, let us know and we can come up with a reasonable
> > > time.
>
> Since I don't see any objection to the deadline proposed, I am assuming
> everyone is fine with it. Post the deadline, if there are no open issues,
> patch will be merged.
>
> > >
> > > regards,
> > > Raghavendra
> > >
> > >
> > > > On 24 August 2017 at 10:13, Raghavendra G <raghavendra...@gmail.com>
> > > > wrote:
> > > >
> > > >> Note that we need to consider xlators on brick stack too. I've added
> > > >> maintainers/peers of xlators on brick stack. Please explicitly
> ack/nack
> > > >> whether this patch affects your component.
> > > >>
> > > >> For reference, following are the xlators loaded in brick stack
> > > >>
> > > >> storage/posix
> > > >> features/trash
> > > >> features/changetimerecorder
> > > >> features/changelog
> > > >> features/bitrot-stub
> > > >> features/access-control
> > > >> features/locks
> > > >> features/worm
> > > >> features/read-only
> > > >> features/leases
> > > >> features/upcall
> > > >> performance/io-threads
> > > >> features/selinux
> > > >> features/marker
> > > >> features/barrier
> > > >> features/index
> > > >> features/quota
> > > >> debug/io-stats
> > > >> performance/decompounder
> > > >> protocol/server
> > > >>
> > > >>
> > > >> For those not following this thread, the question we nee

Re: [Gluster-devel] Fuse mounts and inodes

2017-09-05 Thread Raghavendra G
hya Balachandran
> > > >>> <nbala...@redhat.com
> > > >>> > wrote:
> > > >>>
> > > >>>>
> > > >>>>
> > > >>>> On 4 September 2017 at 10:25, Raghavendra Gowdappa
> > > >>>> <rgowd...@redhat.com
> > > >>>> > wrote:
> > > >>>>
> > > >>>>>
> > > >>>>>
> > > >>>>> - Original Message -
> > > >>>>> > From: "Nithya Balachandran" <nbala...@redhat.com>
> > > >>>>> > Sent: Monday, September 4, 2017 10:19:37 AM
> > > >>>>> > Subject: Fuse mounts and inodes
> > > >>>>> >
> > > >>>>> > Hi,
> > > >>>>> >
> > > >>>>> > One of the reasons for the memory consumption in gluster fuse
> > > >>>>> > mounts
> > > >>>>> is the
> > > >>>>> > number of inodes in the table which are never kicked out.
> > > >>>>> >
> > > >>>>> > Is there any way to default to an entry-timeout and
> > > >>>>> attribute-timeout value
> > > >>>>> > while mounting Gluster using Fuse? Say 60s each so those
> entries
> > > >>>>> will be
> > > >>>>> > purged periodically?
> > > >>>>>
> > > >>>>> Once the entry timeouts, inodes won't be purged. Kernel sends a
> > > >>>>> lookup
> > > >>>>> to revalidate the mapping of path to inode. AFAIK, reverse
> > > >>>>> invalidation
> > > >>>>> (see inode_invalidate) is the only way to make kernel forget
> > > >>>>> inodes/attributes.
> > > >>>>>
> > > >>>>> Is that something that can be done from the Fuse mount ? Or is
> this
> > > >>>> something that needs to be added to Fuse?
> > > >>>>
> > > >>>>> >
> > > >>>>> > Regards,
> > > >>>>> > Nithya
> > > >>>>> >
> > > >>>>>
> > > >>>>
> > > >>>>
> > > >>>
> > > >>
> > > >
> > >
> >
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://lists.gluster.org/mailman/listinfo/gluster-devel
>



-- 
Raghavendra G
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Proposed Protocol changes for 4.0: Need feedback.

2017-09-01 Thread Raghavendra G
If we plan to fix it, bz [1] is another candidate that has to be targeted
for 4.0.

Gluster's XDR does not conform to RFC spec

Description of problem:  In an attempt to generate bindings to
Gluster's XDR with Rust I've run into a few problems.  It appears the
XDR .x spec files are using keywords that are not legal according to
the RFC spec https://tools.ietf.org/html/rfc4506.html.  This hinders
language adoption of Gluster.  If anyone would like to produce pure
language bindings this prevents them from succeeding.  Indeed I think
community adoption of Gluster would benefit from having a wider range
of XDR bindings on which to build tooling.

Some of the problems I've encountered:
1. RFC4506 doesn't define "long" as a basic type in .x files; unsigned
hyper is its 64 bit unsigned type.
2. Also unsigned char uuid[16] not defined
Possibly other problems that I haven't gotten to yet as the parser
moves through the files.

Version-Release number of selected component (if applicable):


How reproducible: Always reproducible.


[1] https://bugzilla.redhat.com/show_bug.cgi?id=1336889
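
To illustrate the kind of constructs in question (hand-written examples, not
verbatim excerpts from our .x files):

/* Declarations like this use types RFC 4506 does not define: */
struct example_req {
        unsigned long size;        /* "long" is not an XDR basic type */
        unsigned char gfid[16];    /* neither is "unsigned char" */
};

/* The RFC-conforming spelling of the same data would be: */
struct example_req_rfc {
        unsigned hyper size;       /* 64-bit unsigned integer in RFC 4506 */
        opaque gfid[16];           /* fixed-length opaque data */
};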

On Fri, Sep 1, 2017 at 2:29 PM, Soumya Koduri <skod...@redhat.com> wrote:

>
>
> On 08/11/2017 06:04 PM, Amar Tumballi wrote:
>
>> Hi All,
>>
>> Below are the proposed protocol changes (ie, XDR changes on the wire) we
>> are thinking for Gluster 4.0.
>>
>
> Poornima and I were discussing if we can include volume uuid as part of
> Handshake protocol between protocol/client and protocol/server so that
> clients do not re-connect if the volume was deleted and recreated with the
> same name, eliminating potential issues at upper layers [1].
>
> We haven't looked into details, but the idea is to have glusterd2 send
> volume uuid as part of GETSPEC request to clients & brick processes which
> shall be used by protocol/client & protocol/server (may be along with vol
> name as well) during HNDSK_SETVOLUME.
>
> Poornima/Ram,
>
> Please add if I missed out anything.
>
> Thanks,
> Soumya
>
>
> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1463191
>
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://lists.gluster.org/mailman/listinfo/gluster-devel
>



-- 
Raghavendra G
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Need inputs on patch #17985

2017-08-28 Thread Raghavendra G
On Thu, Aug 24, 2017 at 2:53 PM, Nithya Balachandran <nbala...@redhat.com>
wrote:

> It has been a while but iirc snapview client (loaded abt dht/tier etc) had
> some issues when we ran tiering tests. Rafi might have more info on this -
> basically it was expecting to find the inode_ctx populated but it was not.
>

Thanks Nithya. @Rafi, @Raghavendra Bhat, is it possible for you to take
ownership of:

* Identifying whether the patch in question causes the issue.
* Sending a fix, or at least evaluating whether a fix is possible.

@Others,

With the motivation of getting some traction on this, is it ok if we:
* Set a deadline of around 15 days to complete the review (or testing with
the patch in question) of the respective components and to come up with
issues (if any).
* Post the deadline, if there are no open issues, go ahead and merge the
patch?

If that is not enough time, let us know and we can come up with a reasonable
deadline.

regards,
Raghavendra


> On 24 August 2017 at 10:13, Raghavendra G <raghavendra...@gmail.com>
> wrote:
>
>> Note that we need to consider xlators on brick stack too. I've added
>> maintainers/peers of xlators on brick stack. Please explicitly ack/nack
>> whether this patch affects your component.
>>
>> For reference, following are the xlators loaded in brick stack
>>
>> storage/posix
>> features/trash
>> features/changetimerecorder
>> features/changelog
>> features/bitrot-stub
>> features/access-control
>> features/locks
>> features/worm
>> features/read-only
>> features/leases
>> features/upcall
>> performance/io-threads
>> features/selinux
>> features/marker
>> features/barrier
>> features/index
>> features/quota
>> debug/io-stats
>> performance/decompounder
>> protocol/server
>>
>>
>> For those not following this thread, the question we need to answer is,
>> "whether the xlator you are associated with works fine if a non-lookup
>> fop (like open, setattr, stat etc) hits it without a lookup ever being done
>> on that inode"
>>
>> regards,
>> Raghavendra
>>
>> On Wed, Aug 23, 2017 at 11:56 AM, Raghavendra Gowdappa <
>> rgowd...@redhat.com> wrote:
>>
>>> Thanks Pranith and Ashish for your inputs.
>>>
>>> - Original Message -
>>> > From: "Pranith Kumar Karampuri" <pkara...@redhat.com>
>>> > To: "Ashish Pandey" <aspan...@redhat.com>
>>> > Cc: "Raghavendra Gowdappa" <rgowd...@redhat.com>, "Xavier Hernandez" <
>>> xhernan...@datalab.es>, "Gluster Devel"
>>> > <gluster-devel@gluster.org>
>>> > Sent: Wednesday, August 23, 2017 11:55:19 AM
>>> > Subject: Re: Need inputs on patch #17985
>>> >
>>> > Raghavendra,
>>> > As Ashish mentioned, there aren't any known problems if upper
>>> xlators
>>> > don't send lookups in EC at the moment.
>>> >
>>> > On Wed, Aug 23, 2017 at 9:07 AM, Ashish Pandey <aspan...@redhat.com>
>>> wrote:
>>> >
>>> > > Raghvendra,
>>> > >
>>> > > I have provided my comment on this patch.
>>> > > I think EC will not have any issue with this approach.
>>> > > However, I would welcome comments from Xavi and Pranith too for any
>>> side
>>> > > effects which I may not be able to foresee.
>>> > >
>>> > > Ashish
>>> > >
>>> > > --
>>> > > *From: *"Raghavendra Gowdappa" <rgowd...@redhat.com>
>>> > > *To: *"Ashish Pandey" <aspan...@redhat.com>
>>> > > *Cc: *"Pranith Kumar Karampuri" <pkara...@redhat.com>, "Xavier
>>> Hernandez"
>>> > > <xhernan...@datalab.es>, "Gluster Devel" <gluster-devel@gluster.org>
>>> > > *Sent: *Wednesday, August 23, 2017 8:29:48 AM
>>> > > *Subject: *Need inputs on patch #17985
>>> > >
>>> > >
>>> > > Hi Ashish,
>>> > >
>>> > > Following are the blockers for making a decision on whether patch
>>> [1] can
>>> > > be merged or not:
>>> > > * Evaluation of dentry operations (like rename etc) in dht
>>> > > * Whether EC works fine if a non-lookup fop (like open(dir), stat,
>>> chmod
>>> > > etc) hits EC without a single lookup performed on file/inode
>>> > >
>>> > > Can you please comment on the patch? I'll take care of dht part.
>>> > >
>>> > > [1] https://review.gluster.org/#/c/17985/
>>> > >
>>> > > regards,
>>> > > Raghavendra
>>> > >
>>> > >
>>> >
>>> >
>>> > --
>>> > Pranith
>>> >
>>> ___
>>> Gluster-devel mailing list
>>> Gluster-devel@gluster.org
>>> http://lists.gluster.org/mailman/listinfo/gluster-devel
>>>
>>> --
>>> Raghavendra G
>>>
>>> <http://lists.gluster.org/mailman/listinfo/gluster-devel>
>>>
>>
>> ___
>> Gluster-devel mailing list
>> Gluster-devel@gluster.org
>> http://lists.gluster.org/mailman/listinfo/gluster-devel
>>
>
>


-- 
Raghavendra G
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Need inputs on patch #17985

2017-08-23 Thread Raghavendra G
Note that we need to consider the xlators on the brick stack too. I've added
maintainers/peers of the xlators on the brick stack. Please explicitly
ack/nack whether this patch affects your component.

For reference, the following are the xlators loaded in the brick stack:

storage/posix
features/trash
features/changetimerecorder
features/changelog
features/bitrot-stub
features/access-control
features/locks
features/worm
features/read-only
features/leases
features/upcall
performance/io-threads
features/selinux
features/marker
features/barrier
features/index
features/quota
debug/io-stats
performance/decompounder
protocol/server


For those not following this thread, the question we need to answer is:
"whether the xlator you are associated with works fine if a non-lookup fop
(like open, setattr, stat etc.) hits it without a lookup ever being done on
that inode"

regards,
Raghavendra

On Wed, Aug 23, 2017 at 11:56 AM, Raghavendra Gowdappa <rgowd...@redhat.com>
wrote:

> Thanks Pranith and Ashish for your inputs.
>
> - Original Message -
> > From: "Pranith Kumar Karampuri" <pkara...@redhat.com>
> > To: "Ashish Pandey" <aspan...@redhat.com>
> > Cc: "Raghavendra Gowdappa" <rgowd...@redhat.com>, "Xavier Hernandez" <
> xhernan...@datalab.es>, "Gluster Devel"
> > <gluster-devel@gluster.org>
> > Sent: Wednesday, August 23, 2017 11:55:19 AM
> > Subject: Re: Need inputs on patch #17985
> >
> > Raghavendra,
> > As Ashish mentioned, there aren't any known problems if upper xlators
> > don't send lookups in EC at the moment.
> >
> > On Wed, Aug 23, 2017 at 9:07 AM, Ashish Pandey <aspan...@redhat.com>
> wrote:
> >
> > > Raghvendra,
> > >
> > > I have provided my comment on this patch.
> > > I think EC will not have any issue with this approach.
> > > However, I would welcome comments from Xavi and Pranith too for any
> side
> > > effects which I may not be able to foresee.
> > >
> > > Ashish
> > >
> > > --
> > > *From: *"Raghavendra Gowdappa" <rgowd...@redhat.com>
> > > *To: *"Ashish Pandey" <aspan...@redhat.com>
> > > *Cc: *"Pranith Kumar Karampuri" <pkara...@redhat.com>, "Xavier
> Hernandez"
> > > <xhernan...@datalab.es>, "Gluster Devel" <gluster-devel@gluster.org>
> > > *Sent: *Wednesday, August 23, 2017 8:29:48 AM
> > > *Subject: *Need inputs on patch #17985
> > >
> > >
> > > Hi Ashish,
> > >
> > > Following are the blockers for making a decision on whether patch [1]
> can
> > > be merged or not:
> > > * Evaluation of dentry operations (like rename etc) in dht
> > > * Whether EC works fine if a non-lookup fop (like open(dir), stat,
> chmod
> > > etc) hits EC without a single lookup performed on file/inode
> > >
> > > Can you please comment on the patch? I'll take care of dht part.
> > >
> > > [1] https://review.gluster.org/#/c/17985/
> > >
> > > regards,
> > > Raghavendra
> > >
> > >
> >
> >
> > --
> > Pranith
> >
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://lists.gluster.org/mailman/listinfo/gluster-devel
>
> --
> Raghavendra G
>
> <http://lists.gluster.org/mailman/listinfo/gluster-devel>
>
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] How commonly applications make use of fadvise?

2017-08-20 Thread Raghavendra G
On Sat, Aug 19, 2017 at 4:27 PM, Csaba Henk <ch...@redhat.com> wrote:

> Hi Niels,
>
> On Fri, Aug 11, 2017 at 2:33 PM, Niels de Vos <nde...@redhat.com> wrote:
> > On Fri, Aug 11, 2017 at 05:50:47PM +0530, Ravishankar N wrote:
> [...]
> >> To me it looks like fadvise (mm/fadvise.c) affects only the linux page
> cache
> >> behavior and is decoupled from the filesystem itself. What this means
> for
> >> fuse  is that the  'advise' is only to the content that the fuse kernel
> >> module has stored in that machine's page cache.  Exposing it as a FOP
> would
> >> likely involve adding a new fop to struct file_operations that is common
> >> across the entire VFS and likely  won't fly with the kernel folks. I
> could
> >> be wrong in understanding all of this. :-)
> >
> > Thanks for checking! If that is the case, we need a good use-case to add
> > a fadvise function pointer to the file_operations. It is not impossible
> > to convince the Linux VFS developers, but it would not be as trivial as
> > adding it to FUSE only (but that requires the VFS infrastructure to be
> > there).
>
> Well, question is: are strategies of the caching xlators' mapping well to
> the POSIX_FADV_* hint set? Would an application that might run on
> a GlusterFS storage backend use fadvise(2) anyway or would fadvise calls
> be added particularly to optimize the GlusterFS backed scenario?
>

+Pranith, +ravi.

If I am not wrong, AFR too has strategies like eager-locking when writes are
sequential. I am wondering whether AFR can benefit from a feature like this.


> Because if usage of fadvise were specifically to address the GlusterFS
> backend -- either because of specifc semantic or specific behavior --, then
> I don't see much point in force-fitting this kind of tuning into the
> fadvise
> syscall. We can just as well operate then via xattrs.
>
> Csaba
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://lists.gluster.org/mailman/listinfo/gluster-devel
>



-- 
Raghavendra G
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] How commonly applications make use of fadvise?

2017-08-18 Thread Raghavendra G
On Fri, Aug 11, 2017 at 5:50 PM, Ravishankar N <ravishan...@redhat.com>
wrote:

>
>
> On 08/11/2017 04:51 PM, Niels de Vos wrote:
>
>> On Fri, Aug 11, 2017 at 12:47:47AM -0400, Raghavendra Gowdappa wrote:
>>
>>> Hi all,
>>>
>>> In a conversation between me, Milind and Csaba, Milind pointed out
>>> fadvise(2) [1] and its potential benefits to Glusterfs' caching
>>> translators like read-ahead etc. After discussing about it, we agreed
>>> that our performance translators can leverage the hints to provide
>>> better performance. Now the question is how commonly applications
>>> actually provide hints? Is it something that is used quite frequently?
>>> If yes, we can think of implementing this in glusterfs (probably
>>> kernel-fuse too?). If no, there is not much of an advantage in
>>> spending our energies here. Your inputs will help us to prioritize
>>> this feature.
>>>
>> If functionality like this is available, we would add support in
>> libgfapi.so as well. NFS-Ganesha is prepared for consuming this
>> (fsal_obj_ops->io_advise), so applications running on top of NFS will
>> benefit. I failed to see if the standard Samba/vfs can use it.
>>
>> A quick check in QEMU does not suggest it is used by the block drivers.
>>
>> I don't think Linux/FUSE supports fadvise though. So this is an
>> oppertunity for a Gluster developer to get their name in the Linux
>> kernel :-) Feature additions like this have been done before by us, and
>> we should continue where we can. It is a relatively easy entry for
>> contributing to the Linux kernel.
>>
>
> To me it looks like fadvise (mm/fadvise.c) affects only the linux page
> cache behavior and is decoupled from the filesystem itself. What this means
> for fuse  is that the  'advise' is only to the content that the fuse kernel
> module has stored in that machine's page cache.  Exposing it as a FOP would
> likely involve adding a new fop to struct file_operations that is common
> across the entire VFS and likely  won't fly with the kernel folks. I could
> be wrong in understanding all of this. :-)
>

That's correct. struct file_operations doesn't have an fadvise member, and
hence exposing this to underlying filesystems would require adding one.
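
For reference, the application-side hints being discussed look roughly like
this (a minimal sketch; the mount path is made up):

/* Minimal sketch of an application issuing fadvise hints (illustrative;
 * the path is made up). Today these hints only influence the local kernel
 * page cache; getting them through fuse to gluster's caching xlators is
 * exactly what is being discussed in this thread. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main (void)
{
        int fd = open ("/mnt/glusterfs/bigfile", O_RDONLY);

        if (fd < 0) {
                perror ("open");
                return 1;
        }

        /* Hint: access will be sequential, so aggressive read-ahead is
         * worthwhile. */
        posix_fadvise (fd, 0, 0, POSIX_FADV_SEQUENTIAL);

        /* ... read the file ... */

        /* Hint: cached data for this file is no longer needed. */
        posix_fadvise (fd, 0, 0, POSIX_FADV_DONTNEED);

        close (fd);
        return 0;
}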


> Regards,
> Ravi
>
>
>>
>>> [1] https://linux.die.net/man/2/fadvise
>>>
>> As well as local man-pages for fadvise64/posix_fadvise.
>>
>> Showing that we have support for this, suggests that the filesystem
>> becomes more mature and gains advanced features. This should impress
>> users and might open up more interest for certain (HPC?) use-cases.
>>
>> Thanks,
>> Niels
>>
>>
>> regards,
>>> Raghavendra
>>> ___
>>> Gluster-devel mailing list
>>> Gluster-devel@gluster.org
>>> http://lists.gluster.org/mailman/listinfo/gluster-devel
>>>
>> ___
>> Gluster-devel mailing list
>> Gluster-devel@gluster.org
>> http://lists.gluster.org/mailman/listinfo/gluster-devel
>>
>
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://lists.gluster.org/mailman/listinfo/gluster-devel
>



-- 
Raghavendra G
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Question regarding to gluster and vfs

2017-08-18 Thread Raghavendra G
On Thu, Aug 17, 2017 at 5:16 PM, Shyam Ranganathan <srang...@redhat.com>
wrote:

> On 08/17/2017 07:36 AM, Amar Tumballi wrote:
>
>>
>>
>> On Thu, Aug 17, 2017 at 1:21 PM, Raghavendra Talur <rta...@redhat.com
>> <mailto:rta...@redhat.com>> wrote:
>>
>> On Wed, Aug 16, 2017 at 5:52 PM, Ilan Schwarts <ila...@gmail.com
>> <mailto:ila...@gmail.com>> wrote:
>>  > Hi,
>>  > So this is a bit odd case.
>>  > I have created 2 servers nodes (running CentOS 7.3)
>>  > From Client machine (CentOS 7.2) I mount to one of the nodes
>> (nfs) using:
>>  > [root@CentOS7286-64 mnt]#  mount -t nfs
>>  > L137B-GlusterFS-Node1.L137B-root.com:/volume1 /mnt/glustervianfs/
>>  >
>>  > When i created (touch) a file over the NFS:
>>  > From Client Machine:
>>  > [revivo@CentOS7286-64 glustervianfs]$ touch nfs3file
>>  > [revivo@CentOS7286-64 glustervianfs]$ id revivo
>>  > uid=2021(revivo) gid=2020(maccabi) groups=2020(maccabi),10(wheel)
>>  >
>>  > On Server machine:
>>  > I monitor the file operations at VFS kernel level.
>>  > I receive 1 event of file create, and 2 events of set attribute
>> changes.
>>  > What I see is that root creates the file (uid/gid of 0)
>>  > And then root (also) use chown and chgrp to set security
>> (attribute)
>>  > of the new file.
>>  >
>>  > When i go to the glutser volume itself and ls -la,i do see the
>>  > *correct* (2021 - revivo /2020 - revivo) uid/gid:
>>  > [root@L137B-GlusterFS-Node1 volume1]# ls -lia
>>  > total 24
>>  > 11 drwxrwxrwx.  3 revivo maccabi 4096 Aug 10 12:13 .
>>  >  2 drwxr-xr-x.  3 root   root4096 Aug  9 14:32 ..
>>  > 12 drw---. 16 root   root4096 Aug 10 12:13 .glusterfs
>>  > 31 -rw-r--r--.  2 revivo maccabi0 Aug 10 12:13 nfs3file
>>  >
>>  > Why on the VFS layer i get uid/gid - 0/0
>>
>> As you have pointed out above, the file is created with 0:0
>> owner:group but subsequent operations change owner and group using
>> chown and chgrp. This is because the glusterfsd(brick daemon) process
>> always runs as root. I don't know the exact reason why setfsuid and
>> setfsgid are not used although the code exist.
>>
>> Amar/Pranith/Raghavendra/Vijay,
>>
>> Do you know why HAVE_SET_FSID is undefined in line
>> https://github.com/gluster/glusterfs/blob/master/xlators/sto
>> rage/posix/src/posix.c#L65
>> <https://github.com/gluster/glusterfs/blob/master/xlators/st
>> orage/posix/src/posix.c#L65>
>>
>>
>> Its been ~10 years since its disabled in codebase, and I don't recollect
>> why completely right now.
>>
>> By checking the patch [1] which got this change, I couldn't make out
>> much: Probably something to do with Solaris support IMO.
>>
>> [1] - https://github.com/gluster/historic/commit/3176ddf99f701412b
>> d799cc730afd598c2a13e39
>>
>> May be time to run a test by removing that line as we are friendly with
>> only Linux/BSD right now.
>>
>
> From memory (so take it with a pinch of salt), setting internal xattrs and
> the like needed root permissions, and not UID/GID permissions, this was
> when parts of DHT xattr setting was fixed and this code path analyzed
> (about less than a year back).
>
> So when testing it out this possibly needs some consideration. @Nithya do
> you have a better context to provide?
>

These scenarios are explicitly handled by setting uid/gid to 0 while doing
those operations (like linkto file creation etc.). Even if we run into bugs
after removing this, explicitly setting credentials should be the preferred
approach.
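
For anyone not familiar with the disabled code path, the setfsuid()/
setfsgid() pattern under discussion looks roughly like this (a simplified
sketch, not the actual posix xlator code):

/* Simplified sketch (not the actual posix xlator code) of what a
 * HAVE_SET_FSID style path would do: the brick process stays root, but
 * performs the filesystem access with the caller's fsuid/fsgid, so the
 * file is created with the right ownership and no follow-up chown() is
 * needed. */
#define _GNU_SOURCE
#include <sys/fsuid.h>
#include <sys/types.h>
#include <fcntl.h>

static int
create_as_caller (const char *path, mode_t mode, uid_t uid, gid_t gid)
{
        int fd = -1;

        setfsgid (gid);
        setfsuid (uid);

        fd = open (path, O_CREAT | O_EXCL | O_WRONLY, mode);

        /* Restore root credentials for internal housekeeping
         * (gfid xattrs, .glusterfs linking, etc.). */
        setfsuid (0);
        setfsgid (0);

        return fd;
}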


>
>> Regards,
>> Amar
>>
>> Thanks,
>> Raghavendra Talur
>>
>>
>>
>>
>> --
>> Amar Tumballi (amarts)
>>
>>
>> ___
>> Gluster-devel mailing list
>> Gluster-devel@gluster.org
>> http://lists.gluster.org/mailman/listinfo/gluster-devel
>>
>> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://lists.gluster.org/mailman/listinfo/gluster-devel
>



-- 
Raghavendra G
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Changing the relative order of read-ahead and open-behind

2017-07-25 Thread Raghavendra G
On Tue, Jul 25, 2017 at 10:39 AM, Amar Tumballi <atumb...@redhat.com> wrote:

>
>
> On Tue, Jul 25, 2017 at 9:33 AM, Raghavendra Gowdappa <rgowd...@redhat.com
> > wrote:
>
>>
>>
>> - Original Message -
>> > From: "Pranith Kumar Karampuri" <pkara...@redhat.com>
>> > To: "Raghavendra G" <raghaven...@gluster.com>
>> > Cc: "Gluster Devel" <gluster-devel@gluster.org>
>> > Sent: Tuesday, July 25, 2017 7:51:07 AM
>> > Subject: Re: [Gluster-devel] Changing the relative order of read-ahead
>> andopen-behind
>> >
>> >
>> >
>> > On Mon, Jul 24, 2017 at 5:11 PM, Raghavendra G <
>> raghaven...@gluster.com >
>> > wrote:
>> >
>> >
>> >
>> >
>> >
>> > On Fri, Jul 21, 2017 at 6:39 PM, Vijay Bellur < vbel...@redhat.com >
>> wrote:
>> >
>> >
>> >
>> >
>> > On Fri, Jul 21, 2017 at 3:26 AM, Raghavendra Gowdappa <
>> rgowd...@redhat.com >
>> > wrote:
>> >
>> >
>> > Hi all,
>> >
>> > We've a bug [1], due to which read-ahead is completely disabled when the
>> > workload is read-only. One of the easy fix was to make read-ahead as an
>> > ancestor of open-behind in xlator graph (Currently its a descendant). A
>> > patch has been sent out by Rafi to do the same. As noted in one of the
>> > comments, one flip side of this solution is that small files (which are
>> > eligible to be cached by quick read) are cached twice - once each in
>> > read-ahead and quick-read - wasting up precious memory. However, there
>> are
>> > no other simpler solutions for this issue. If you've concerns on the
>> > approach followed by [2] or have other suggestions please voice them
>> out.
>> > Otherwise, I am planning to merge [2] for lack of better alternatives.
>> >
>> >
>> > Since the maximum size of files cached by quick-read is 64KB, can we
>> have
>> > read-ahead kick in for offsets greater than 64KB?
>> >
>> > I got your point. We can enable read-ahead only for files whose size is
>> > greater than the size eligible for caching quick-read. IOW, read-ahead
>> gets
>> > disabled if file size is less than 64KB. Thanks for the suggestion.
>> >
>> > I added a comment on the patch to move the xlators in reverse to the
>> way the
>> > patch is currently doing. Milind I think implemented it. Will that lead
>> to
>> > any problem?
>>
>> From gerrit:
>>
>> 
>>
>> It fixes the issue too and it is a better solution than the current one
>> as it doesn't run into duplicate cache problem. The reason open-behind was
>> loaded as an ancestor of quick-read was that it seemed unnecessary that
>> quick-read should even witness an open. However,
>>
>>* looking into code qr_open is indeed setting some priority for the
>> inode which will be used during purging of cache due to exceeding cache
>> limit. So, it helps quick read to witness an open.
>>* the real benefit of open-behind is avoiding fops over network. So,
>> as long as open-behind is loaded in client stack, we reap its benefits.
>>* Also note that if option "read-after-open" is set in open-behind, an
>> open is anyways done over network irrespective of whether quick-read has
>> cached the file, which to me looks unnecessary. By moving open-behind as a
>> descendant of quick-read, open-behind won't even witness a read when the
>> file is cached by quick-read. But, if read-after-open option is implemented
>> in open-behind with the goal of fixing non-posix compliance for the case of
>> open fd on a file is unlinked, we might regress. But again, even this
>> approach doesn't fix the compliance problem completely. One has to turn
>> open-behind off to be completely posix complaint in this scenario.
>>
>> Given the reasons above, it helps just moving open-behind as a descendant
>> of read-ahead.
>>
>> 
>>
>>
> Analysis looks good. But I would like us (all developers) to backup the
> theories like this with some data.
>

> How about you plan a test case which can demonstrate the difference ?
>

What is the scenario you want to measure here?


> I will help you set up metrics measuring with graphs [1] on experimental
> branch [2] to actually measure and graphically represent the hypothesis.
>
> We can set this as an example for future for anyone to try the permutation
> & com

Re: [Gluster-devel] Changing the relative order of read-ahead and open-behind

2017-07-24 Thread Raghavendra G
+Milind

On Mon, Jul 24, 2017 at 5:11 PM, Raghavendra G <raghaven...@gluster.com>
wrote:

>
>
> On Fri, Jul 21, 2017 at 6:39 PM, Vijay Bellur <vbel...@redhat.com> wrote:
>
>>
>> On Fri, Jul 21, 2017 at 3:26 AM, Raghavendra Gowdappa <
>> rgowd...@redhat.com> wrote:
>>
>>> Hi all,
>>>
>>> We've a bug [1], due to which read-ahead is completely disabled when the
>>> workload is read-only. One of the easy fix was to make read-ahead as an
>>> ancestor of open-behind in xlator graph (Currently its a descendant). A
>>> patch has been sent out by Rafi to do the same. As noted in one of the
>>> comments, one flip side of this solution is that small files (which are
>>> eligible to be cached by quick read) are cached twice - once each in
>>> read-ahead and quick-read - wasting up precious memory. However, there are
>>> no other simpler solutions for this issue. If you've concerns on the
>>> approach followed by [2] or have other suggestions please voice them out.
>>> Otherwise, I am planning to merge [2] for lack of better alternatives.
>>>
>>
>>
>> Since the maximum size of files cached by quick-read is 64KB, can we have
>> read-ahead kick in for offsets greater than 64KB?
>>
>
> I got your point. We can enable read-ahead only for files whose size is
> greater than the size eligible for caching quick-read. IOW, read-ahead gets
> disabled if file size is less than 64KB. Thanks for the suggestion.
>
>
>>
>> Thanks,
>> Vijay
>>
>> ___
>> Gluster-devel mailing list
>> Gluster-devel@gluster.org
>> http://lists.gluster.org/mailman/listinfo/gluster-devel
>>
>
>
>
> --
> Raghavendra G
>



-- 
Raghavendra G
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Changing the relative order of read-ahead and open-behind

2017-07-24 Thread Raghavendra G
On Fri, Jul 21, 2017 at 6:39 PM, Vijay Bellur <vbel...@redhat.com> wrote:

>
> On Fri, Jul 21, 2017 at 3:26 AM, Raghavendra Gowdappa <rgowd...@redhat.com
> > wrote:
>
>> Hi all,
>>
>> We've a bug [1], due to which read-ahead is completely disabled when the
>> workload is read-only. One of the easy fix was to make read-ahead as an
>> ancestor of open-behind in xlator graph (Currently its a descendant). A
>> patch has been sent out by Rafi to do the same. As noted in one of the
>> comments, one flip side of this solution is that small files (which are
>> eligible to be cached by quick read) are cached twice - once each in
>> read-ahead and quick-read - wasting up precious memory. However, there are
>> no other simpler solutions for this issue. If you've concerns on the
>> approach followed by [2] or have other suggestions please voice them out.
>> Otherwise, I am planning to merge [2] for lack of better alternatives.
>>
>
>
> Since the maximum size of files cached by quick-read is 64KB, can we have
> read-ahead kick in for offsets greater than 64KB?
>

I got your point. We can enable read-ahead only for files whose size is
greater than the size eligible for caching by quick-read. IOW, read-ahead
gets disabled if the file size is less than 64KB. Thanks for the suggestion.
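
A rough sketch of the check being proposed (the names here are illustrative,
not actual read-ahead/quick-read symbols):

/* Illustrative sketch only: skip read-ahead for files small enough to be
 * cached whole by quick-read, so the same data is not cached twice. */
#include <stdint.h>

#define QR_MAX_CACHED_SIZE (64 * 1024)  /* quick-read's default limit */

static int
should_readahead (uint64_t file_size)
{
        /* quick-read serves reads on these files entirely from its own
         * cache, so read-ahead would only duplicate the data in memory. */
        if (file_size <= QR_MAX_CACHED_SIZE)
                return 0;

        return 1;
}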


>
> Thanks,
> Vijay
>
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://lists.gluster.org/mailman/listinfo/gluster-devel
>



-- 
Raghavendra G
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] GlusterFS v3.12 - Nearing deadline for branch out

2017-07-18 Thread Raghavendra G
Would be good to have [1] in 3.12. The patch is ready for reviews and
existing review comments have been addressed. Request more reviews.

[1] https://review.gluster.org/#/c/17105/

On Mon, Jul 17, 2017 at 9:00 PM, Pranith Kumar Karampuri <
pkara...@redhat.com> wrote:

> hi,
>Status of the following features targeted for 3.12:
> 1) Need a way to resolve split-brain (#135) : Mostly will be merged in a
> day.
> 2) Halo Hybrid mode (#217): Unfortunately didn't get time to follow up on
> this, so will not make it to the release.
> 3) Implement heal throttling (#255): Won't be making it to 3.12
> 4) Delay generator xlator (#257): I can definitely get this in by End of
> next week, otherwise we can do this for next release.
> 5) Parallel writes in EC (#251): This seems like a stretch for this
> weekend but definitely completable by end of next week. I am hopeful Xavi
> will have some cycles to complete the final reviews. Otherwise we may have
> to push this out.
> 6) Discard support for EC (#254): Doable for this weekend IMO, also
> depends on what Xavi thinks...
> 7) Last stripe caching (#256): We are targetting this instead of heal
> throttling (#255). This is not added to 3.12 tracker. I can add this if we
> can wait till next week. This also depends on Xavi's schedule.
>
> Also added Xavi for his inputs.
>
>
> On Wed, Jul 5, 2017 at 9:07 PM, Shyam <srang...@redhat.com> wrote:
>
>> Further to this,
>>
>> 1) I cleared up the projects lane [1] and also issues marked for 3.12 [2]
>>   - I did this optimistically, moving everything to 3.12 (both from a
>> projects and a milestones perspective), so if something is not making it,
>> drop a note, and we can clear up the tags accordingly.
>>
>> 2) Reviews posted and open against the issues in [1] can be viewed here
>> [3]
>>
>>   - Request maintainers and contributors to take a look at these and
>> accelerate the reviews, to meet the feature cut-off deadline
>>
>>   - Request feature owners to ensure that their patches are listed in the
>> link [3]
>>
>> 3) Finally, we need a status of open issues to understand how we can
>> help. Requesting all feature owners to post the same (as Amar has
>> requested).
>>
>> Thanks,
>> Shyam
>>
>> [1] Project lane: https://github.com/gluster/glusterfs/projects/1
>> [2] Issues with 3.12 milestone: https://github.com/gluster/glu
>> sterfs/milestone/4
>> [3] Reviews needing attetion: https://review.gluster.org/#/q
>> /starredby:srangana%2540redhat.com
>>
>> "Releases are made better together"
>>
>>
>> On 07/05/2017 03:18 AM, Amar Tumballi wrote:
>>
>>> All,
>>>
>>> We are around 10 working days remaining for branching out for 3.12
>>> release, after which, we will have just 15 more days open for 'critical'
>>> features to get in, for which there should be more detailed proposals.
>>>
>>> If you have few things planned out, but haven't taken it to completion
>>> yet, OR you have sent some patches, but not yet reviewed, start whining
>>> now, and get these in.
>>>
>>> Thanks,
>>> Amar
>>>
>>> --
>>> Amar Tumballi (amarts)
>>>
>>>
>>> ___
>>> Gluster-devel mailing list
>>> Gluster-devel@gluster.org
>>> http://lists.gluster.org/mailman/listinfo/gluster-devel
>>>
>>> ___
>> Gluster-devel mailing list
>> Gluster-devel@gluster.org
>> http://lists.gluster.org/mailman/listinfo/gluster-devel
>>
>
>
>
> --
> Pranith
>
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://lists.gluster.org/mailman/listinfo/gluster-devel
>



-- 
Raghavendra G
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Performance experiments with io-stats translator

2017-06-07 Thread Raghavendra G
On Wed, Jun 7, 2017 at 11:59 AM, Xavier Hernandez <xhernan...@datalab.es>
wrote:

> Hi Krutika,
>
> On 06/06/17 13:35, Krutika Dhananjay wrote:
>
>> Hi,
>>
>> As part of identifying performance bottlenecks within gluster stack for
>> VM image store use-case, I loaded io-stats at multiple points on the
>> client and brick stack and ran randrd test using fio from within the
>> hosted vms in parallel.
>>
>> Before I get to the results, a little bit about the configuration ...
>>
>> 3 node cluster; 1x3 plain replicate volume with group virt settings,
>> direct-io.
>> 3 FUSE clients, one per node in the cluster (which implies reads are
>> served from the replica that is local to the client).
>>
>> io-stats was loaded at the following places:
>> On the client stack: Above client-io-threads and above protocol/client-0
>> (the first child of AFR).
>> On the brick stack: Below protocol/server, above and below io-threads
>> and just above storage/posix.
>>
>> Based on a 60-second run of randrd test and subsequent analysis of the
>> stats dumped by the individual io-stats instances, the following is what
>> I found:
>>
>> Translator Position              Avg Latency of READ fop as seen by this
>> translator
>>
>> 1. parent of client-io-threads    1666us
>>
>> ∆ (1,2) = 50us
>>
>> 2. parent of protocol/client-0    1616us
>>
>> ∆ (2,3) = 1453us
>>
>> - end of client stack -
>> - beginning of brick stack ---
>>
>> 3. child of protocol/server       163us
>>
>> ∆ (3,4) = 7us
>>
>> 4. parent of io-threads           156us
>>
>> ∆ (4,5) = 20us
>>
>> 5. child of io-threads            136us
>>
>> ∆ (5,6) = 11us
>>
>> 6. parent of storage/posix        125us
>> ...
>>  end of brick stack 
>>
>> So it seems like the biggest bottleneck here is a combination of the
>> network + epoll, rpc layer?
>> I must admit I am no expert with networks, but I'm assuming if the
>> client is reading from the local brick, then
>> even latency contribution from the actual network won't be much, in
>> which case bulk of the latency is coming from epoll, rpc layer, etc at
>> both client and brick end? Please correct me if I'm wrong.
>>
>> I will, of course, do some more runs and confirm if the pattern is
>> consistent.
>>
>
> very interesting. These results are similar to what I also observed when
> doing some ec tests.
>

For EC we've found [1] to increase performance, though I'm not sure whether
it'll have any significant impact on replicated setups.


> My personal feeling is that there's high serialization and/or contention in
> the network layer caused by mutexes, but I don't have data to support that.
>

As to lock contention or lack of concurrency at the socket/rpc layers, AFAIK
we have the following suspects in the I/O path (as opposed to the
accept/listen paths):

* Only one of reading from the socket, writing to the socket, error handling
on the socket, and voluntary shutdown of the socket (through shutdown) can be
in progress at a time. IOW, these operations are not concurrent, as each one
of them acquires a lock contended by the others. My gut feeling is that at
least reading from and writing to the socket can be made concurrent, but I
need to spend more time on this to have a definitive answer.

* Till [1], the handler also incurred the cost of message processing by
higher layers (not just the cost of reading a msg from the socket). Since
epoll is configured with EPOLLONESHOT and the socket is added back only after
the handler completes, there was a lag after one msg was read before another
msg could be read from the same socket.

* EPOLLONESHOT also means that processing of one event (say POLLIN) excludes
other events (like POLLOUT when lots of msgs are waiting to be written to the
socket) till that event is processed. The vice-versa scenario - reads blocked
while writes are pending on a socket and a POLLOUT is received - is also true
here. I think this is another area where we can improve.
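
To make the one-shot behaviour concrete, here is a minimal, generic sketch
(not GlusterFS code) of re-arming an fd registered with EPOLLONESHOT; as I
understand it, re-arming right after the full message has been read, rather
than after it has been processed, is the idea behind [1]:

/* Minimal illustration of the EPOLLONESHOT pattern described above;
 * a generic sketch, not GlusterFS code. */
#include <sys/epoll.h>

/* With EPOLLONESHOT the kernel disarms the fd after delivering one event,
 * so it must be re-armed explicitly with EPOLL_CTL_MOD. Calling this as
 * soon as a complete message has been read lets other poller threads pick
 * up the next message from the same socket while this one is processed. */
static int
rearm_for_read(int epfd, int fd)
{
        struct epoll_event ev = {
                .events = EPOLLIN | EPOLLONESHOT,
                .data = { .fd = fd },
        };

        return epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &ev);
}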

Will update the thread as and when I can think of a valid suspect.

[1] https://review.gluster.org/17391


>
> Xavi
>
>
>
>> -Krutika
>>
>>
>> ___
>> Gluster-devel mailing list
>> Gluster-devel@gluster.org
>> http://lists.gluster.org/mailman/listinfo/gluster-devel
>>
>>
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://lists.gluster.org/mailman/listinfo/gluster-devel
>



-- 
Raghavendra G
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] [Gluster-Maintainers] Backport for "Add back socket for polling of events immediately..."

2017-05-28 Thread Raghavendra G
+gluster-users

On Mon, May 29, 2017 at 8:46 AM, Raghavendra G <raghaven...@gluster.com>
wrote:

> Replying to all queries here:
>
> * Is it a bug or performance enhancement?
>   Its a performance enhancement. No functionality is broken if this patch
> is not taken in.
>
> * Are there performance numbers to validate the claim?
>   https://bugzilla.redhat.com/show_bug.cgi?id=1358606#c9
>
> * Are there any existing users who need this enhancement?
>   https://bugzilla.redhat.com/show_bug.cgi?id=1358606#c27
>
>   Though not sure what branch Zhang Huan is on. @Zhang your inputs are
> needed here.
>
> * Do I think this patch _should_ go into any of the released branches?
>   Personally, I don't feel strongly either way. I am fine with this patch
> not making into any of released branches. But, I do think there are users
> who are affected with this (Especially EC/Disperse configurations). If they
> want to stick to the released branches, pulling into released branches will
> help them. @Pranith/Xavi, what are your opinions on this?
>
> regards,
> Raghavendra
>
> On Sun, May 28, 2017 at 6:58 PM, Shyam <srang...@redhat.com> wrote:
>
>> On 05/28/2017 09:24 AM, Atin Mukherjee wrote:
>>
>>>
>>>
>>> On Sun, May 28, 2017 at 1:48 PM, Niels de Vos <nde...@redhat.com
>>> <mailto:nde...@redhat.com>> wrote:
>>>
>>> On Fri, May 26, 2017 at 12:25:42PM -0400, Shyam wrote:
>>> > Or this one: https://review.gluster.org/15036 <
>>> https://review.gluster.org/15036>
>>> >
>>> > This is backported to 3.8/10 and 3.11 and considering the size and
>>> impact of
>>> > the change, I wanted to be sure that we are going to accept this
>>> across all
>>> > 3 releases?
>>> >
>>> > @Du, would like your thoughts on this.
>>> >
>>> > @niels, @kaushal, @talur, as release owners, could you weigh in as
>>> well
>>> > please.
>>> >
>>> > I am thinking that we get this into 3.11.1 if there is agreement,
>>> and not in
>>> > 3.11.0 as we are finalizing the release in 3 days, and this change
>>> looks
>>> > big, to get in at this time.
>>>
>>>
>>> Given 3.11 is going to be a new release, I'd recommend to get this fix
>>> in (if we have time). https://review.gluster.org/#/c/17402/ is dependent
>>> on this one.
>>>
>>
>> It is not a fix Atin, it is a more fundamental change to request
>> processing, with 2 days to the release, you want me to merge this?
>>
>> Is there a *bug* that will surface without this change or is it a
>> performance enhancement?
>>
>>
>>> >
>>> > Further the change is actually an enhancement, and provides
>>> performance
>>> > benefits, so it is valid as a change itself, but I feel it is too
>>> late to
>>> > add to the current 3.11 release.
>>>
>>> Indeed, and mostly we do not merge enhancements that are non-trivial
>>> to
>>> stable branches. Each change that we backport introduces the chance
>>> on
>>> regressions for users with their unknown (and possibly awkward)
>>> workloads.
>>>
>>> The patch itself looks ok, but it is difficult to predict how the
>>> change
>>> affects current deployments. I prefer to be conservative and not have
>>> this merged in 3.8, at least for now. Are there any statistics in how
>>> performance is affected with this change? Having features like this
>>> only
>>> in newer versions might also convince users to upgrade sooner, 3.8
>>> will
>>> only be supported until 3.12 (or 4.0) gets released, which is
>>> approx. 3
>>> months from now according to our schedule.
>>>
>>> Niels
>>>
>>> ___
>>> maintainers mailing list
>>> maintain...@gluster.org <mailto:maintain...@gluster.org>
>>> http://lists.gluster.org/mailman/listinfo/maintainers
>>> <http://lists.gluster.org/mailman/listinfo/maintainers>
>>>
>>>
>>> ___
>> Gluster-devel mailing list
>> Gluster-devel@gluster.org
>> http://lists.gluster.org/mailman/listinfo/gluster-devel
>>
>
>
>
> --
> Raghavendra G
>



-- 
Raghavendra G
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] [Gluster-Maintainers] Backport for "Add back socket for polling of events immediately..."

2017-05-28 Thread Raghavendra G
Replying to all queries here:

* Is it a bug or performance enhancement?
  It's a performance enhancement. No functionality is broken if this patch
is not taken in.

* Are there performance numbers to validate the claim?
  https://bugzilla.redhat.com/show_bug.cgi?id=1358606#c9

* Are there any existing users who need this enhancement?
  https://bugzilla.redhat.com/show_bug.cgi?id=1358606#c27

  Though not sure what branch Zhang Huan is on. @Zhang your inputs are
needed here.

* Do I think this patch _should_ go into any of the released branches?
  Personally, I don't feel strongly either way. I am fine with this patch
not making it into any of the released branches. But I do think there are
users who are affected by this (especially EC/Disperse configurations). If
they want to stick to the released branches, pulling it into those branches
will help them. @Pranith/Xavi, what are your opinions on this?

regards,
Raghavendra

On Sun, May 28, 2017 at 6:58 PM, Shyam <srang...@redhat.com> wrote:

> On 05/28/2017 09:24 AM, Atin Mukherjee wrote:
>
>>
>>
>> On Sun, May 28, 2017 at 1:48 PM, Niels de Vos <nde...@redhat.com
>> <mailto:nde...@redhat.com>> wrote:
>>
>> On Fri, May 26, 2017 at 12:25:42PM -0400, Shyam wrote:
>> > Or this one: https://review.gluster.org/15036 <
>> https://review.gluster.org/15036>
>> >
>> > This is backported to 3.8/10 and 3.11 and considering the size and
>> impact of
>> > the change, I wanted to be sure that we are going to accept this
>> across all
>> > 3 releases?
>> >
>> > @Du, would like your thoughts on this.
>> >
>> > @niels, @kaushal, @talur, as release owners, could you weigh in as
>> well
>> > please.
>> >
>> > I am thinking that we get this into 3.11.1 if there is agreement,
>> and not in
>> > 3.11.0 as we are finalizing the release in 3 days, and this change
>> looks
>> > big, to get in at this time.
>>
>>
>> Given 3.11 is going to be a new release, I'd recommend to get this fix
>> in (if we have time). https://review.gluster.org/#/c/17402/ is dependent
>> on this one.
>>
>
> It is not a fix Atin, it is a more fundamental change to request
> processing, with 2 days to the release, you want me to merge this?
>
> Is there a *bug* that will surface without this change or is it a
> performance enhancement?
>
>
>> >
>> > Further the change is actually an enhancement, and provides
>> performance
>> > benefits, so it is valid as a change itself, but I feel it is too
>> late to
>> > add to the current 3.11 release.
>>
>> Indeed, and mostly we do not merge enhancements that are non-trivial
>> to
>> stable branches. Each change that we backport introduces the chance on
>> regressions for users with their unknown (and possibly awkward)
>> workloads.
>>
>> The patch itself looks ok, but it is difficult to predict how the
>> change
>> affects current deployments. I prefer to be conservative and not have
>> this merged in 3.8, at least for now. Are there any statistics in how
>> performance is affected with this change? Having features like this
>> only
>> in newer versions might also convince users to upgrade sooner, 3.8
>> will
>> only be supported until 3.12 (or 4.0) gets released, which is approx.
>> 3
>> months from now according to our schedule.
>>
>> Niels
>>
>> ___
>> maintainers mailing list
>> maintain...@gluster.org <mailto:maintain...@gluster.org>
>> http://lists.gluster.org/mailman/listinfo/maintainers
>> <http://lists.gluster.org/mailman/listinfo/maintainers>
>>
>>
>> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://lists.gluster.org/mailman/listinfo/gluster-devel
>



-- 
Raghavendra G
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

[Gluster-devel] Review request - #17105

2017-05-16 Thread Raghavendra G
program/GF-DUMP: Shield ping processing from traffic to Glusterfs Program

Since the poller thread bears the brunt of execution till the request is
handed over to io-threads, it experiences lock contention(s) in the control
flow till io-threads, which slows it down. This delay invariably affects
reading ping requests from the network and responding to them, resulting in
increased ping latencies, which sometimes results in a ping-timer-expiry on
the client leading to a disconnect of the transport. So, this patch aims to
free up the poller thread from executing code of the Glusterfs Program.

We do this by making:
* the Glusterfs Program register itself asking rpcsvc to execute its actors
in the program's own threads.
* the GF-DUMP Program register itself asking rpcsvc to _NOT_ execute its
actors in its own threads; otherwise the program's own threads would become a
bottleneck in processing ping traffic. This means that the poller thread
reads a ping packet, invokes its actor and hands the response msg to the
transport queue.

Change-Id: I526268c10bdd5ef93f322a4f95385137550a6a49
<https://review.gluster.org/#/q/I526268c10bdd5ef93f322a4f95385137550a6a49>
Signed-off-by: Raghavendra G <rgowd...@redhat.com>
BUG: 1421938 <https://bugzilla.redhat.com/show_bug.cgi?id=1421938>

Patch: https://review.gluster.org/#/c/17105/

Note that there is only one thread per program. So, I am wondering whether
this thread can become a performance bottleneck for the Glusterfs program.
Your comments are welcome.
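
To illustrate the dispatch policy described above, here is a rough sketch.
The names are illustrative only; they are not the actual rpcsvc structures or
functions from the patch:

/* Rough sketch of the per-program dispatch policy; illustrative names only. */
#include <stdio.h>

struct rpc_request { int xid; };

struct rpc_program {
        const char *name;
        int         own_thread;  /* 1: actors run in the program's own thread(s) */
};

/* Stubs standing in for the real work. */
static void queue_to_program_thread(struct rpc_program *p, struct rpc_request *r)
{ printf("%s: queued xid %d to program thread\n", p->name, r->xid); }

static void execute_actor(struct rpc_program *p, struct rpc_request *r)
{ printf("%s: executed xid %d inline in poller thread\n", p->name, r->xid); }

/* Called from a poller thread once a complete request has been read. */
static void dispatch(struct rpc_program *prog, struct rpc_request *req)
{
        if (prog->own_thread)
                queue_to_program_thread(prog, req); /* heavy Glusterfs fops leave the poller quickly */
        else
                execute_actor(prog, req);           /* GF-DUMP/ping stays cheap, answered inline */
}

int main(void)
{
        struct rpc_program glusterfs = { "GlusterFS", 1 };
        struct rpc_program gf_dump   = { "GF-DUMP",   0 };
        struct rpc_request req       = { 42 };

        dispatch(&glusterfs, &req);
        dispatch(&gf_dump, &req);
        return 0;
}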

regards,
-- 
Raghavendra G
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Review request - patch #15036

2017-05-11 Thread Raghavendra G
I'll wait for a day on this. If there are no reviews, I'll assume that as a
+1 and will go ahead and merge it. If anyone needs more time, please let me
know and I can wait.

On Thu, May 11, 2017 at 12:22 PM, Raghavendra Gowdappa <rgowd...@redhat.com>
wrote:

> All,
>
> Reviews are requested on [1]. Impact is non-trivial as it introduces more
> concurrency in execution wrt processing of messages read from network.
>
> All tests are passed, though gerrit is not reflecting the last smoke which
> was successful.
>
> For reference, below is the verbatim copy of commit msg:
>
> 
>
> event/epoll: Add back socket for polling of events immediately after
> reading the entire rpc message from the wire Currently socket is added back
> for future events after higher layers (rpc, xlators etc) have processed the
> message. If message processing involves signficant delay (as in writev
> replies processed by Erasure Coding), performance takes hit. Hence this
> patch modifies transport/socket to add back the socket for polling of
> events immediately after reading the entire rpc message, but before
> notification to higher layers.
>
> credits: Thanks to "Kotresh Hiremath Ravishankar" <khire...@redhat.com>
> for assitance in fixing a regression in bitrot caused by this patch.
>
> BUG: 1448364
> 
>
> @Nigel,
>
> Is there a way to override -1 from smoke, as last instance of it is
> successful?
>
> [1] https://review.gluster.org/#/c/15036/
>
> regards,
> Raghavendra
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://lists.gluster.org/mailman/listinfo/gluster-devel
>



-- 
Raghavendra G
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] lock_revocation.t is hanging in regression tests

2017-05-11 Thread Raghavendra G
On Thu, May 11, 2017 at 3:58 AM, Vijay Bellur <vbel...@redhat.com> wrote:

>
>
> On Wed, May 10, 2017 at 9:01 AM, Jeff Darcy <j...@pl.atyp.us> wrote:
>
>>
>>
>>
>> On Wed, May 10, 2017, at 06:30 AM, Raghavendra G wrote:
>>
>> marking it bad won't help as even bad tests are run by build system (and
>> they might hang).
>>
>>
>> This is news to me.  Did something change to make it this way?  If so, we
>> should change it back.  There's no point in having a way to mark tests as
>> bad if the build system ignores it.
>>
>
> skip_bad_tests is set to "yes" by default in run-tests.sh. So the
> regression job  should skip bad tests during a run.
>

My mistake. Sorry about the confusion.


>
> Nigel can probably educate us if the behavior is different :).
>
> -Vijay
>
>
>
>
>
>>
>>
>> ___
>> Gluster-devel mailing list
>> Gluster-devel@gluster.org
>> http://lists.gluster.org/mailman/listinfo/gluster-devel
>>
>
>
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://lists.gluster.org/mailman/listinfo/gluster-devel
>



-- 
Raghavendra G
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] glfsheal: crashed(segfault) with disperse volume in RDMA

2017-05-10 Thread Raghavendra G
Thanks for the analysis. The fix attached to the bug is valid. Can you post
it to review.gluster.org? We can review and merge it.

But I am curious whether the failure of rdma_create_id is valid or a bug in
glusterfs. Are you able to run other tools (like ping-pong) that use
rdma-cm for connection establishment? If they are failing too, then the
issue is with infrastructure and not a bug in glusterfs. Otherwise there
might be a bug in glusterfs.

On Wed, May 10, 2017 at 1:23 PM, Ji-Hyeon Gim <potato...@gluesys.com> wrote:

> Hello, I tried to heal my disperse volume with glfsheal but glfsheal is
> always crushed in RDMA environment!
>
> I did submit a bug report for this problem.
>
> https://bugzilla.redhat.com/show_bug.cgi?id=1449495
>
> Please excuse my lack of bug reporting process.
>
> I'm new, so bear with me :)
>
> Please advise me for more valuable improvements!
>
> Best regards.
>
> --
>
> Ji-Hyeon Gim
> Research Engineer, Gluesys
>
> Address. Gluesys R Center, 5F, 11-31, Simin-daero 327beon-gil,
>  Dongan-gu, Anyang-si,
>  Gyeonggi-do, Korea
>  (14055)
> Phone.   +82-70-8787-1053
> Fax. +82-31-388-3261
> Mobile.  +82-10-7293-8858
> E-Mail.  potato...@potatogim.net
> Website. www.potatogim.net
>
> The time I wasted today is the tomorrow the dead man was eager to see
> yesterday.
>   - Sophocles
>
>
>
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://lists.gluster.org/mailman/listinfo/gluster-devel
>



-- 
Raghavendra G
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] lock_revocation.t is hanging in regression tests

2017-05-10 Thread Raghavendra G
The behaviour is different on my laptop though. The test doesn't hang, but
it fails (both on master and with 17234). So, as Pranith suggested offline,
marking it bad won't help as even bad tests are run by the build system (and
they might hang). Is removing the test the only way?

On Wed, May 10, 2017 at 3:49 PM, Raghavendra G <raghaven...@gluster.com>
wrote:

> The same test is causing failures for [1] too. On checking it failed on
> master too. I've submitted a patch [2] to mark it as failure
>
> [1] https://review.gluster.org/#/c/15036
> [2] https://review.gluster.org/#/c/17234/
>
> On Tue, May 9, 2017 at 11:47 PM, Shyam <srang...@redhat.com> wrote:
>
>> I think this should be disabled for now.
>>
>> See past context here [1] [2], I guess this not being reported as a
>> failure is causing the attention lapse.
>>
>> [1] http://lists.gluster.org/pipermail/gluster-devel/2017-March/
>> 052287.html
>>
>> [2] http://lists.gluster.org/pipermail/gluster-devel/2017-Februa
>> ry/052158.html (see (1) in the mail)
>>
>>
>> On 05/09/2017 01:50 PM, Jeff Darcy wrote:
>>
>>> After seeing this hang with multiplexing enabled yesterday, I saw it
>>> hang a test for a completely unrelated patch
>>> (https://review.gluster.org/17200) without multiplexing.  Does anyone
>>> object to disabling this test while it's debugged?
>>> ___
>>> Gluster-devel mailing list
>>> Gluster-devel@gluster.org
>>> http://lists.gluster.org/mailman/listinfo/gluster-devel
>>>
>>> _______
>> Gluster-devel mailing list
>> Gluster-devel@gluster.org
>> http://lists.gluster.org/mailman/listinfo/gluster-devel
>>
>
>
>
> --
> Raghavendra G
>



-- 
Raghavendra G
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] lock_revocation.t is hanging in regression tests

2017-05-10 Thread Raghavendra G
The same test is causing failures for [1] too. On checking it failed on
master too. I've submitted a patch [2] to mark it as failure

[1] https://review.gluster.org/#/c/15036
[2] https://review.gluster.org/#/c/17234/

On Tue, May 9, 2017 at 11:47 PM, Shyam <srang...@redhat.com> wrote:

> I think this should be disabled for now.
>
> See past context here [1] [2], I guess this not being reported as a
> failure is causing the attention lapse.
>
> [1] http://lists.gluster.org/pipermail/gluster-devel/2017-March/
> 052287.html
>
> [2] http://lists.gluster.org/pipermail/gluster-devel/2017-Februa
> ry/052158.html (see (1) in the mail)
>
>
> On 05/09/2017 01:50 PM, Jeff Darcy wrote:
>
>> After seeing this hang with multiplexing enabled yesterday, I saw it
>> hang a test for a completely unrelated patch
>> (https://review.gluster.org/17200) without multiplexing.  Does anyone
>> object to disabling this test while it's debugged?
>> ___
>> Gluster-devel mailing list
>> Gluster-devel@gluster.org
>> http://lists.gluster.org/mailman/listinfo/gluster-devel
>>
>> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://lists.gluster.org/mailman/listinfo/gluster-devel
>



-- 
Raghavendra G
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] [DHT] The myth of two hops for linkto file resolution

2017-05-04 Thread Raghavendra G
On Thu, May 4, 2017 at 4:36 PM, Xavier Hernandez <xhernan...@datalab.es>
wrote:

> Hi,
>
> On 30/04/17 06:03, Raghavendra Gowdappa wrote:
>
>> All,
>>
>> Its a common perception that the resolution of a file having linkto file
>> on the hashed-subvol requires two hops:
>>
>> 1. client to hashed-subvol.
>> 2. client to the subvol where file actually resides.
>>
>> While it is true that a fresh lookup behaves this way, the other fact
>> that get's ignored is that fresh lookups on files are almost always
>> prevented by readdirplus. Since readdirplus picks the dentry from the
>> subvolume where actual file (data-file) resides, the two hop cost is most
>> likely never witnessed by the application.
>>
>
> This is true for workloads that list directory contents before accessing
> the files, but there are other use cases that directly access the file
> without navigating through the file system. In this case fresh lookups are
> needed.
>

I agree; if the contents of the parent directory are not listed at least
once, the penalty is still there.


> Xavi
>
>
>
>> A word of caution is that I've not done any testing to prove this
>> observation :).
>>
>> regards,
>> Raghavendra
>> ___
>> Gluster-devel mailing list
>> Gluster-devel@gluster.org
>> http://lists.gluster.org/mailman/listinfo/gluster-devel
>>
>>
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://lists.gluster.org/mailman/listinfo/gluster-devel
>



-- 
Raghavendra G
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Priority based ping packet for 3.10

2017-04-24 Thread Raghavendra G
On Fri, Apr 21, 2017 at 11:43 AM, Raghavendra G <raghaven...@gluster.com>
wrote:

> Summing up various discussions I had on this,
>
> 1. Current ping frame work should measure just the responsiveness of
> network and rpc layer. This means poller threads shouldn't be winding the
> individual fops at all (as it might add delay in reading the ping
> requests). Instead, they can queue the requests to a common work queue and
> other threads should pick up the requests.
>

Patch can be found at:
https://review.gluster.org/17105


> 4. We've fixed some lock contention issues on the brick stack due to high
> latency on backend fs. However, this is on-going work as contentions can be
> found in various codepaths (mem-pool etc).
>

These patches were contributed by "Krutika Dhananjay" <
kdhananj.at.redhat.com>.
https://review.gluster.org/16869
https://review.gluster.org/16785
https://review.gluster.org/16462

Thanks Krutika for all those patches :).

regards,
Raghavendra
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Priority based ping packet for 3.10

2017-04-21 Thread Raghavendra G
Summing up various discussions I had on this,

1. The current ping framework should measure just the responsiveness of the
network and rpc layers. This means poller threads shouldn't be winding the
individual fops at all (as that might add delay in reading the ping
requests). Instead, they can queue the requests to a common work queue and
other threads should pick up the requests.
2. We also need another tool to measure the responsiveness of the entire
Brick xlator stack. This tool can have a slightly larger time than ping
timeout as responses naturally will be delayed. Whether this tool should
measure the responsiveness of the backend fs is an open question as we
already have a posix health checker that measures the responsiveness and
sends a CHILD_DOWN when backend fs in not responsive. Also, there are open
questions here like what data structures various xlators are accessing as
part of this fop (like inode, fd, mem-pools etc). Accessing various data
structures will result in a different latency.
3. Currently ping packets are not sent by a client when there is no I/O
from it. As per the discussions above, client should measure the
responsiveness even when there is no traffic to/from it. May be the
interval during which ping packets are sent can be increased.
4. We've fixed some lock contention issues on the brick stack due to high
latency on backend fs. However, this is on-going work as contentions can be
found in various codepaths (mem-pool etc).

We'll shortly send a fix for 1. The other things will be picked based on
the bandwidth. Contributions are welcome :).

regards,
Raghavendra.

On Wed, Jan 25, 2017 at 11:01 AM, Joe Julian <j...@julianfamily.org> wrote:

> Yes, the earlier a fault is detected the better.
>
> On January 24, 2017 9:21:27 PM PST, Jeff Darcy <jda...@redhat.com> wrote:
>>
>>  If there are no responses to be received and no requests being
>>>  sent to a brick, why would be a client be interested in the health of
>>>  server/brick?
>>>
>>
>> The client (code) might not, but the user might want to find out and fix
>> the fault before the brick gets busy again.
>> --
>>
>> Gluster-devel mailing list
>> Gluster-devel@gluster.org
>> http://lists.gluster.org/mailman/listinfo/gluster-devel
>>
>>
> --
> Sent from my Android device with K-9 Mail. Please excuse my brevity.
>
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://lists.gluster.org/mailman/listinfo/gluster-devel
>



-- 
Raghavendra G
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] decoupling network.ping-timeout and transport.tcp-user-timeout

2017-02-28 Thread Raghavendra G
There is a patch on this [1]. Reviews from a wider audience will be helpful
before we merge the patch.

https://review.gluster.org/#/c/16731/
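
For reference, the socket-level tunables discussed in the quoted mail below
map onto standard setsockopt(2) options on Linux; a minimal sketch with
arbitrary example values (not GlusterFS defaults):

/* Sketch of the Linux socket options the proposal maps to; example values only. */
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

static int tune_keepalive(int sock)
{
        int on = 1;
        int idle = 20;      /* seconds of idleness before the first probe    */
        int interval = 2;   /* seconds between probes                        */
        int count = 9;      /* probes before the connection is declared dead */
        unsigned int user_timeout_ms = 42000; /* TCP_USER_TIMEOUT, milliseconds */

        if (setsockopt(sock, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0)
                return -1;
        if (setsockopt(sock, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle)) < 0)
                return -1;
        if (setsockopt(sock, IPPROTO_TCP, TCP_KEEPINTVL, &interval, sizeof(interval)) < 0)
                return -1;
        if (setsockopt(sock, IPPROTO_TCP, TCP_KEEPCNT, &count, sizeof(count)) < 0)
                return -1;
        return setsockopt(sock, IPPROTO_TCP, TCP_USER_TIMEOUT,
                          &user_timeout_ms, sizeof(user_timeout_ms));
}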

regards,
Raghavendra


On Wed, Jan 11, 2017 at 4:19 PM, Milind Changire <mchan...@redhat.com>
wrote:

> +gluster-users
>
> Milind
>
>
> On 01/11/2017 03:21 PM, Milind Changire wrote:
>
>> The management connection uses network.ping-timeout to time out and
>> retry connection to a different server if the existing connection
>> end-point is unreachable from the client.
>> Due to the nature of the parameters involved in the TCP/IP network
>> stack, it becomes imperative to control the other network connections
>> using the socket level tunables:
>> * SO_KEEPALIVE
>> * TCP_KEEPIDLE
>> * TCP_KEEPINTVL
>> * TCP_KEEPCNT
>>
>> So, I'd like to decouple the network.ping-timeout and
>> transport.tcp-user-timeout since they are tunables for different
>> aspects of gluster application. network-ping-timeout monitors the
>> brick/node level responsiveness and transport.tcp-user-timeout is one
>> of the attributes that is used to manage the state of the socket.
>>
>> Saying so, we could do away with network.ping-timeout altogether and
>> stick with transport.tcp-user-timeout for types of sockets. It becomes
>> increasingly difficult to work with different tunables across gluster.
>>
>> I believe, there have not been many cases in which the community has
>> found the existing defaults for socket timeout unusable. So we could
>> stick with the system defaults and add the following socket level
>> tunables and make them open for configuration:
>> * client.tcp-user-timeout
>>  which sets transport.tcp-user-timeout
>> * client.keepalive-time
>>  which sets transport.socket.keepalive-time
>> * client.keepalive-interval
>>  which sets transport.socket.keepalive-interval
>> * client.keepalive-count
>>  which sets transport.socket.keepalive-count
>> * server.tcp-user-timeout
>>  which sets transport.tcp-user-timeout
>> * server.keepalive-time
>>  which sets transport.socket.keepalive-time
>> * server.keepalive-interval
>>  which sets transport.socket.keepalive-interval
>> * server.keepalive-count
>>  which sets transport.socket.keepalive-count
>>
>> However, these settings would effect all sockets in gluster.
>> In cases where aggressive timeouts are needed, the community can find
>> gluster options which have 1:1 mapping with socket level options as
>> documented in tcp(7).
>>
>> Please share your thoughts about the risks or effectiveness of the
>> decoupling.
>>
>> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel
>



-- 
Raghavendra G
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Priority based ping packet for 3.10

2017-01-24 Thread Raghavendra G
On Tue, Jan 24, 2017 at 10:39 AM, Vijay Bellur <vbel...@redhat.com> wrote:

>
>
> On Thu, Jan 19, 2017 at 8:06 AM, Jeff Darcy <jda...@redhat.com> wrote:
>
>> > The more relevant question would be with TCP_KEEPALIVE and
>> TCP_USER_TIMEOUT
>> > on sockets, do we really need ping-pong framework in Clients? We might
>> need
>> > that in transport/rdma setups, but my question is concentrating on
>> > transport/rdma. In other words would like to hear why do we need
>> heart-beat
>> > mechanism in the first place. One scenario might be a healthy socket
>> level
>> > connection but an unhealthy brick/client (like a deadlocked one).
>>
>> This is an important case to consider.  On the one hand, I think it
>> answers
>> your question about TCP_KEEPALIVE.  What we really care about is whether a
>> brick's request queue is moving.  In other words, what's the time since
>> the
>> last reply from that brick, and does that time exceed some threshold?
>
>
I agree with this.


> On a
>> busy system, we don't even need ping packets to know that.  We can just
>> use
>> responses on other requests to set/reset that timer.  We only need to send
>> ping packets when our *outbound* queue has remained empty for some
>> fraction
>> of our timeout.
>>
>
Do we need ping packets sent even when the client is not waiting for any
replies? I assume not. If there are no responses to be received and no
requests being sent to a brick, why would a client be interested in the
health of the server/brick?
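
For what it's worth, a minimal sketch of the bookkeeping Jeff describes above
(illustrative only, not the actual rpc-clnt code): every reply refreshes the
liveness timestamp, and an explicit ping is considered only after the
connection has been idle for a fraction of the timeout:

/* Illustrative heartbeat bookkeeping; not the actual rpc-clnt implementation. */
#include <stdbool.h>
#include <time.h>

struct conn_health {
        time_t last_reply;   /* refreshed by every reply, not just ping replies */
        time_t last_send;    /* refreshed whenever a request is put on the wire */
        int    ping_timeout; /* seconds */
};

static void on_reply(struct conn_health *c) { c->last_reply = time(NULL); }
static void on_send(struct conn_health *c)  { c->last_send = time(NULL); }

/* Called periodically; returns true only when an explicit ping adds information. */
static bool should_send_ping(const struct conn_health *c)
{
        time_t now = time(NULL);

        /* Recent replies already prove the brick is responsive. */
        if (now - c->last_reply < c->ping_timeout / 3)
                return false;

        /* Outbound queue idle for a while and nothing heard back: probe it. */
        return (now - c->last_send) >= c->ping_timeout / 3;
}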


>
>> However, it's important that our measurements be *end to end* and not just
>> at the transport level.  This is particularly true with multiplexing,
>> where multiple bricks will share and contend on various resources.  We
>> should ping *through* client and server, with separate translators above
>> and below each.  This would give us a true end-to-end ping *for that
>> brick*, and also keep the code nicely modular.
>>
>
Agreed. My understanding is that the ping framework is a tool to identify
unhealthy bricks (we are interested in bricks as they are the ones going to
serve fops). With that understanding, ping-pong should be end to end (to
whatever logical entity constitutes the brick). However, where in the brick
xlator stack should ping packets be responded to? Should they go all the way
down to storage/posix?


>
> +1 to this. Having ping, pong xlators immediately above and below protocol
> translators would also address the problem of epoll threads getting blocked
> in gluster's xlator stacks in busy systems.
>
> Having said that, I do see value in Rafi's patch that prompted this
> thread. Would it not help to prioritize ping - pong traffic in all parts of
> the gluster stack including the send queue on the client?
>

I have two concerns here:
1. Responsiveness of the brick to the client invariably involves the latency
of the network and of our own transport's io-queue. Wouldn't prioritizing
ping packets over normal data give us a skewed view of the brick's
responsiveness? For example, on a network with heavy traffic ping-pong might
be happening, but fops might be moving very slowly. What do we achieve with a
successful ping-pong in this scenario? Also, does our response to the
opposite scenario - a ping-timeout expiring and the transport being
disconnected - achieve anything substantially good? Maybe it helps to bring
the latency of syscalls down (as experienced by the application), as our HA
translators like AFR and EC add the latency of identifying a disconnect (or a
successful fop) to the latency of syscalls. As developers, many of us keep
wondering what we are trying to achieve with a heartbeat mechanism.

2. Assuming that we want to prioritize ping traffic over normal traffic
(which we do logically now, as ping packets don't traverse the entire brick
xlator stack all the way down to posix but instead short-circuit at
protocol/server), the fix in discussion here is partial (as we can't
prioritize ping traffic ON the WIRE and through the tcp/ip stack). While I
don't have strong objections to it, I feel that it is a partial solution and
might be inconsequential (just a hunch, no data). However, I can accept the
patch if we feel it helps.


> Regards,
> Vijay
>
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://lists.gluster.org/mailman/listinfo/gluster-devel
>



-- 
Raghavendra G
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Priority based ping packet for 3.10

2017-01-19 Thread Raghavendra G
On Thu, Jan 19, 2017 at 3:59 PM, Raghavendra G <raghaven...@gluster.com>
wrote:

> The more relevant question would be with TCP_KEEPALIVE and
> TCP_USER_TIMEOUT on sockets, do we really need ping-pong framework in
> Clients? We might need that in transport/rdma setups, but my question is
> concentrating on transport/rdma.
>

s \ concentrating on transport/rdma \ concentrating on transport/socket \

> In other words would like to hear why do we need heart-beat mechanism in
> the first place. One scenario might be a healthy socket level connection
> but an unhealthy brick/client (like a deadlocked one). Are there enough
> such realistic scenarios which make ping-pong/heartbeat necessary? What
> other ways brick/client can go bad?
>
> On Thu, Jan 19, 2017 at 3:36 PM, Raghavendra G <raghaven...@gluster.com>
> wrote:
>
>>
>>
>> On Thu, Jan 19, 2017 at 1:50 PM, Mohammed Rafi K C <rkavu...@redhat.com>
>> wrote:
>>
>>> Hi,
>>>
>>> The patch for priority based ping packets [1] are ready to review. As
>>> Shyam mentioned in the comment on patch set 12, it doesn't solve the
>>> problem with network conjunction nor the disk latency. Also it won't
>>> priorities the reply of ping packets at the server end (We don't have a
>>> straight way to identify prognum in the reply).
>>>
>>>
>>> So my question , Is it worth of taking the patch or do we need to think
>>> through a more generic solutions.
>>>
>>
>> Though ping requests can take more time to reach server due to heavy
>> traffic, realistically speaking common reasons for ping-timer expiry have
>> been either
>>
>> 1. client not been able to read ping response [2]
>> 2. server not able to read ping request.
>>
>> Speaking about 2 above, Me, Kritika and Pranith were just discussing
>> today morning about an issue where they had hit ping timer expiry in
>> replicated setups when disk usage was high. The reason for this as Pranith
>> pointed out was,
>> 1. posix has some fops (like posix_xattrop, posix_fxattrop) which do
>> syscalls after holding a lock on inode (inode->lock).
>> 2. During high disk usage scenarios, syscall latencies were high
>> (sometimes >= ping-timeout value)
>> 3. Before being handed over to a new thread at io-threads xlator, a fop
>> gets executed in one of the threads that reads incoming messages from
>> socket. This execution path includes some translators like protocol/server,
>> index, quota-enforcer, marker. And these translators might access inode-ctx
>> which involves locking inode (inode->lock). Due to this locking latency of
>> syscall gets transferred to poller thread. Since poller thread is waiting
>> on inode->lock, it won't be able to read ping requests from network in-time
>> resulting in ping-timer expiry.
>>
>> I think Kritika is working on a patch to eliminate locking on inode in 1
>> above. We also need to reduce the actual fop execution in poller thread.
>> IOW, we need to hand over the fop execution to io-threads/syncop-threads as
>> early as we can. [3] helps in this scenario as it adds back the socket for
>> polling immediately after reading the entire msg but before execution of
>> fop begins. So, even though fop execution is happening in poller thread,
>> msgs from same connection can be read in other poller threads parallely
>> (and we can scale up the number of epoll-threads when load is high).
>>
>> Also, note that there is no way we can send entire ping request as
>> "URGENT" data over network. So, prioritization in [1] is only the queue of
>> messages waiting to be written to network. So, Though I suggested [1], the
>> more I think of it, it seems less irrelevant.
>>
>> [2] http://review.gluster.org/12402
>> [3] http://review.gluster.org/15036
>>
>>
>>>
>>> Note : We could make this patch more generic so that any packets can be
>>> marked as priority to add into the head instead of just Ping packets.
>>>
>>> [1] : http://review.gluster.org/#/c/11935/
>>>
>>> Regards
>>>
>>> Rafi KC
>>>
>>> ___
>>> Gluster-devel mailing list
>>> Gluster-devel@gluster.org
>>> http://lists.gluster.org/mailman/listinfo/gluster-devel
>>>
>>
>>
>>
>> --
>> Raghavendra G
>>
>
>
>
> --
> Raghavendra G
>



-- 
Raghavendra G
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Priority based ping packet for 3.10

2017-01-19 Thread Raghavendra G
The more relevant question would be: with TCP_KEEPALIVE and TCP_USER_TIMEOUT
on sockets, do we really need the ping-pong framework in clients? We might
need that in transport/rdma setups, but my question is concentrating on
transport/rdma. In other words, I would like to hear why we need a heart-beat
mechanism in the first place. One scenario might be a healthy socket-level
connection but an unhealthy brick/client (like a deadlocked one). Are there
enough such realistic scenarios to make ping-pong/heartbeat necessary? In
what other ways can a brick/client go bad?

On Thu, Jan 19, 2017 at 3:36 PM, Raghavendra G <raghaven...@gluster.com>
wrote:

>
>
> On Thu, Jan 19, 2017 at 1:50 PM, Mohammed Rafi K C <rkavu...@redhat.com>
> wrote:
>
>> Hi,
>>
>> The patch for priority based ping packets [1] are ready to review. As
>> Shyam mentioned in the comment on patch set 12, it doesn't solve the
>> problem with network conjunction nor the disk latency. Also it won't
>> priorities the reply of ping packets at the server end (We don't have a
>> straight way to identify prognum in the reply).
>>
>>
>> So my question , Is it worth of taking the patch or do we need to think
>> through a more generic solutions.
>>
>
> Though ping requests can take more time to reach server due to heavy
> traffic, realistically speaking common reasons for ping-timer expiry have
> been either
>
> 1. client not been able to read ping response [2]
> 2. server not able to read ping request.
>
> Speaking about 2 above, Me, Kritika and Pranith were just discussing today
> morning about an issue where they had hit ping timer expiry in replicated
> setups when disk usage was high. The reason for this as Pranith pointed out
> was,
> 1. posix has some fops (like posix_xattrop, posix_fxattrop) which do
> syscalls after holding a lock on inode (inode->lock).
> 2. During high disk usage scenarios, syscall latencies were high
> (sometimes >= ping-timeout value)
> 3. Before being handed over to a new thread at io-threads xlator, a fop
> gets executed in one of the threads that reads incoming messages from
> socket. This execution path includes some translators like protocol/server,
> index, quota-enforcer, marker. And these translators might access inode-ctx
> which involves locking inode (inode->lock). Due to this locking latency of
> syscall gets transferred to poller thread. Since poller thread is waiting
> on inode->lock, it won't be able to read ping requests from network in-time
> resulting in ping-timer expiry.
>
> I think Kritika is working on a patch to eliminate locking on inode in 1
> above. We also need to reduce the actual fop execution in poller thread.
> IOW, we need to hand over the fop execution to io-threads/syncop-threads as
> early as we can. [3] helps in this scenario as it adds back the socket for
> polling immediately after reading the entire msg but before execution of
> fop begins. So, even though fop execution is happening in poller thread,
> msgs from same connection can be read in other poller threads parallely
> (and we can scale up the number of epoll-threads when load is high).
>
> Also, note that there is no way we can send entire ping request as
> "URGENT" data over network. So, prioritization in [1] is only the queue of
> messages waiting to be written to network. So, Though I suggested [1], the
> more I think of it, it seems less irrelevant.
>
> [2] http://review.gluster.org/12402
> [3] http://review.gluster.org/15036
>
>
>>
>> Note : We could make this patch more generic so that any packets can be
>> marked as priority to add into the head instead of just Ping packets.
>>
>> [1] : http://review.gluster.org/#/c/11935/
>>
>> Regards
>>
>> Rafi KC
>>
>> ___
>> Gluster-devel mailing list
>> Gluster-devel@gluster.org
>> http://lists.gluster.org/mailman/listinfo/gluster-devel
>>
>
>
>
> --
> Raghavendra G
>



-- 
Raghavendra G
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] FOP write success when volume of glusterfs is full with write-behind is on

2017-01-19 Thread Raghavendra G
On Fri, Jan 13, 2017 at 1:48 PM, Lian, George (Nokia - CN/Hangzhou) <
george.l...@nokia.com> wrote:

> Hi,
>
> I try write FOP test on case of volume of gusterfs is full, the detail
> process and some investigation is as the below:
> 
> 
> --
> 1)  use dd to write full with volume export
> dd if=/dev/zero of=/mnt/export/large.tar bs=10M count=10
>
> 2) set write-behind option on
># echo "asdf" > /mnt/export/test
>#
>  No error prompt here, and try cat the file with following information.
># cat /mnt/export/test
>cat: /mnt/export/test: No such file or directory
>
># strace echo "asdf" > /mnt/export/test
> write(1, "asdf\n", 5)   = 5
> close(1)= -1 ENOSPC (No space left
> on device)
>
> 3) set write-behind option off
># echo "asdf" > /mnt/export/test
>#-bash: echo: write error: No space left on device
>  Have error prompt here.
># cat /mnt/export/test
>cat: /mnt/export/test: No such file or directory
>
># strace echo "asdf" > /mnt/export/test
> write(1, "asdf\n", 5)   = -1 ENOSPC (No space left
> on device)
> close(1)= 0
> 
> -
> In my view , the action of FOP write is right to application.
> But when the write-behind option is set on, the write FOP return success
> but it can't really write to gluster volume.
> It will let application confuse, and it will lead to more application
> issue.
>
> Although the close will return error, but as you know, more application
> will not do close FOP until the application exit,
> In this case, write FOP show success, but when another thread want to read
> it, but it can't read anything.
>
> Do you think it is an issue?
>

No. The behavior of write-behind is POSIX-compliant. In fact I think this is
the behavior of any write-back cache implementation. As to POSIX compliance,
man 2 write has this section:



NOTES
   Not  checking  the  return value of close() is a common but
nevertheless serious programming error.  It is quite possible that errors
on a previous write(2) operation are first reported at the final close().
Not checking the return value when closing the file may lead to silent loss
of  data.   This  can  especially be observed with NFS and with disk quota.

   A successful close does not guarantee that the data has been
successfully saved to disk, as the kernel defers writes.  It is not common
for a file system to flush the buffers when the stream is closed.  If you
need to be sure that the data is physically stored use fsync(2).  (It will
depend on the disk  hardware at this point.)
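
For application writers, a minimal sketch of the pattern the man page
recommends (check write(), then fsync(), then close(), so deferred errors
such as ENOSPC are not lost); illustrative only:

/* Check write(), fsync() and close() so deferred errors surface. */
#include <fcntl.h>
#include <unistd.h>

int write_safely(const char *path, const char *buf, size_t len)
{
        int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
                return -1;

        ssize_t written = write(fd, buf, len);
        if (written < 0 || (size_t)written != len) { /* may still "succeed" with write-behind */
                close(fd);
                return -1;
        }
        if (fsync(fd) < 0) {   /* forces deferred errors (e.g. ENOSPC) to surface here */
                close(fd);
                return -1;
        }
        return close(fd);      /* and still check close() */
}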



> If not, do you have any comments about this inconvenience?
>
>
> Best Regards,
> George
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel
>



-- 
Raghavendra G
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Priority based ping packet for 3.10

2017-01-19 Thread Raghavendra G
On Thu, Jan 19, 2017 at 1:50 PM, Mohammed Rafi K C <rkavu...@redhat.com>
wrote:

> Hi,
>
> The patch for priority based ping packets [1] are ready to review. As
> Shyam mentioned in the comment on patch set 12, it doesn't solve the
> problem with network conjunction nor the disk latency. Also it won't
> priorities the reply of ping packets at the server end (We don't have a
> straight way to identify prognum in the reply).
>
>
> So my question , Is it worth of taking the patch or do we need to think
> through a more generic solutions.
>

Though ping requests can take more time to reach server due to heavy
traffic, realistically speaking common reasons for ping-timer expiry have
been either

1. the client not being able to read the ping response [2]
2. the server not being able to read the ping request.

Speaking about 2 above, Krutika, Pranith and I were just discussing this
morning an issue where they had hit ping-timer expiry in replicated setups
when disk usage was high. The reason for this, as Pranith pointed out, was:
1. posix has some fops (like posix_xattrop, posix_fxattrop) which do syscalls
after holding a lock on the inode (inode->lock).
2. During high disk usage scenarios, syscall latencies were high
(sometimes >= the ping-timeout value).
3. Before being handed over to a new thread at the io-threads xlator, a fop
gets executed in one of the threads that reads incoming messages from the
socket. This execution path includes translators like protocol/server, index,
quota-enforcer and marker, and these translators might access the inode-ctx,
which involves locking the inode (inode->lock). Due to this locking, the
latency of the syscall gets transferred to the poller thread. Since the
poller thread is waiting on inode->lock, it won't be able to read ping
requests from the network in time, resulting in ping-timer expiry.
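
To illustrate point 1 above with a sketch (not the actual posix xlator code):
the syscall runs while inode->lock is held, so any other thread that needs
the same lock, including a poller thread, inherits the disk latency:

/* Sketch of a slow syscall issued under a per-inode lock; illustrative only. */
#include <pthread.h>
#include <sys/xattr.h>

struct fake_inode {
        pthread_mutex_t lock;   /* stands in for inode->lock */
};

static void xattrop_under_lock(struct fake_inode *inode, const char *path)
{
        pthread_mutex_lock(&inode->lock);
        /* On a heavily loaded disk this call can take seconds; every thread
         * that now tries to take inode->lock (e.g. a poller thread touching
         * the inode ctx) is blocked for that whole duration. */
        setxattr(path, "trusted.example", "v", 1, 0);
        pthread_mutex_unlock(&inode->lock);
}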

I think Krutika is working on a patch to eliminate the locking on the inode
in 1 above. We also need to reduce the actual fop execution done in the
poller thread. IOW, we need to hand over the fop execution to
io-threads/syncop-threads as early as we can. [3] helps in this scenario as
it adds back the socket for polling immediately after reading the entire msg
but before execution of the fop begins. So, even though fop execution is
happening in a poller thread, msgs from the same connection can be read in
other poller threads in parallel (and we can scale up the number of epoll
threads when load is high).

Also, note that there is no way we can send an entire ping request as
"URGENT" data over the network. So, the prioritization in [1] applies only to
the queue of messages waiting to be written to the network. So, though I
suggested [1], the more I think of it, the less relevant it seems.

[2] http://review.gluster.org/12402
[3] http://review.gluster.org/15036


>
> Note : We could make this patch more generic so that any packets can be
> marked as priority to add into the head instead of just Ping packets.
>
> [1] : http://review.gluster.org/#/c/11935/
>
> Regards
>
> Rafi KC
>
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://lists.gluster.org/mailman/listinfo/gluster-devel
>



-- 
Raghavendra G
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Fwd: Behaviour of simple distributed volumes on failed bricks?

2017-01-12 Thread Raghavendra G
On Thu, Jan 12, 2017 at 4:44 PM, Raghavendra G <raghavendra...@gmail.com>
wrote:

>
>
> On Thu, Jan 12, 2017 at 4:35 PM, Raghavendra G <raghavendra...@gmail.com>
> wrote:
>
>>
>>
>> On Thu, Jan 12, 2017 at 9:20 AM, B.K.Raghuram <bkr...@gmail.com> wrote:
>>
>>> cc'ing devel as well for some developer insight..
>>>
>>> -- Forwarded message --
>>> To: gluster-users <gluster-us...@gluster.org>
>>>
>>>
>>> I had a question on the expected behaviour of simple distributed volumes
>>> when a brick fails for the following scenarios (as in, will the scenario
>>> succeed or fail) :
>>>
>>> - New file creation if that file name hashes to the failed brick
>>>
>>
>> Fails
>>
>>
>>> - New file creation if that file name hashes to one of the remaining
>>> functional brick
>>>
>>
>> Succeeds
>>
>> - New directory creation if that file name hashes to the failed brick
>>>
>>
>> Fails
>>
>>
>>> - New directory creation if that file name hashes to one of the
>>> remaining functional brick
>>>
>>
>> Succeeds. Layout is spread only to the functional bricks. Non-functional
>> bricks are not part of layout of the directory. In other words files of the
>> directory won't be stored in non-functional bricks
>>
>>
>>> - File modifications/deletions if that file name hashes to the failed
>>> brick
>>>
>>
>> Fails
>>
>
> Fails mostly. However, let's say if the inode of file has been looked up
> when hashed-brick was up, then operations succeed even if hashed-brick goes
> down later.
>

Sorry, I forgot to mention that this can happen only if actual file
(data-file) is stored on non-hashed brick and hashed-brick contains a
linkto file. If data-file is present on hashed brick and it goes down,
access fails always.

Following sequence of ops succeed:
>
> 1. hashed-brick up
> 2. Lookup (file) => success
> 3. open (file) => success (fd)
> 4. hashed-brick goes down
> 5. write/read (fd) => success
>
> Note that if the inode is cached in vfs/kernel after 2, even path based
> operations like stat, open, chmod etc can succeed if hashed-brick goes down
> immediately after successful lookup. The point is only fresh lookups fail
> if hashed brick/subvol is down.
>
>
>>
>>> - File modifications/deletions if that file name hashes to one of the
>>> remaining functional brick
>>>
>>
>> Succeeds
>>
>>
>>> - Directory renames/deletions if that file name hashes to the failed
>>> brick
>>>
>>
>> Fails
>>
>>
>>> - Directory renames/deletions if that file name hashes to one of the
>>> remaining functional brick
>>>
>>
>> Fails
>>
>>
>>>
>>>
>>> ___
>>> Gluster-devel mailing list
>>> Gluster-devel@gluster.org
>>> http://www.gluster.org/mailman/listinfo/gluster-devel
>>>
>>> --
>>> Raghavendra G
>>>
>>> <http://www.gluster.org/mailman/listinfo/gluster-devel>
>>
>> <http://www.gluster.org/mailman/listinfo/gluster-devel>
>>
>
>
>
> --
> Raghavendra G
>



-- 
Raghavendra G
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Fwd: Behaviour of simple distributed volumes on failed bricks?

2017-01-12 Thread Raghavendra G
On Thu, Jan 12, 2017 at 9:20 AM, B.K.Raghuram <bkr...@gmail.com> wrote:

> cc'ing devel as well for some developer insight..
>
> -- Forwarded message --
> To: gluster-users <gluster-us...@gluster.org>
>
>
> I had a question on the expected behaviour of simple distributed volumes
> when a brick fails for the following scenarios (as in, will the scenario
> succeed or fail) :
>
> - New file creation if that file name hashes to the failed brick
>

Fails


> - New file creation if that file name hashes to one of the remaining
> functional brick
>

Succeeds

- New directory creation if that file name hashes to the failed brick
>

Fails


> - New directory creation if that file name hashes to one of the remaining
> functional brick
>

Succeeds. Layout is spread only to the functional bricks. Non-functional
bricks are not part of layout of the directory. In other words files of the
directory won't be stored in non-functional bricks


> - File modifications/deletions if that file name hashes to the failed brick
>

Fails


> - File modifications/deletions if that file name hashes to one of the
> remaining functional brick
>

Succeeds


> - Directory renames/deletions if that file name hashes to the failed brick
>

Fails


> - Directory renames/deletions if that file name hashes to one of the
> remaining functional brick
>

Fails


>
>
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel
>
> --
> Raghavendra G
>
> <http://www.gluster.org/mailman/listinfo/gluster-devel>

<http://www.gluster.org/mailman/listinfo/gluster-devel>
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] On Gluster resiliency

2017-01-01 Thread Raghavendra G
Ivan,

That's good to hear. Thanks for posting :)

regards,
Raghavendra

On Fri, Dec 23, 2016 at 10:10 PM, Ivan Rossi <rouge2...@gmail.com> wrote:

> Last few days has been tense because a R3 3.8.5 Gluster cluster that I
> built has been plagued by problems.
>
> The first symptom has been a continuous stream in the client logs of:
>
> [2016-12-17 15:55:02.047508] E [MSGID: 108009]
> [afr-open.c:187:afr_openfd_fix_open_cbk]
> 0-hisap-prod-1-replicate-0: Failed to open
> /home/galaxy/HISAP/java/lib/java/jre1.7.0_51/jre/lib/rt.jar on subvolume
> hisap-prod-1-client-2 [Transport endpoint is not connected]
>
> followed by very frequent peer disconnections/reconnections and a
> continuous stream of files to be healed on several volumes.
>
> The problem has been traced back to a flaky X540-T2 10GBE NIC embedded
> in one of the peers motherboard, that was incapable of keeping the
> correct 10Gbit speed negotiation with the switch.
>
> The motherboard has been replaced on the peer. and then the volumes
> healed quickly to complete health.  All of these while the users kept
> running some heavy-duty bioinformatics applications (NGS data
> analysis) on top of Gluster.  No user noticed ANYTHING despite a major
> hardware problem and offi-lining of a peer.
>
> This is a RESILIENT system, in my book.
>
> Gluster people, despite the constant stream of problems and requests
> for help that you see on the ML and IRC, rest assured that you are
> building a nice piece of software, at least IMHO.
>
> Keep-up the good work and Merry Christmas.
>
> Ivan Rossi
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel
>



-- 
Raghavendra G
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] [Gluster-users] Feedback on DHT option "cluster.readdir-optimize"

2016-11-10 Thread Raghavendra G
On Thu, Nov 10, 2016 at 8:57 PM, Vijay Bellur <vbel...@redhat.com> wrote:

> On Thu, Nov 10, 2016 at 3:17 AM, Nithya Balachandran
> <nbala...@redhat.com> wrote:
> >
> >
> > On 8 November 2016 at 20:21, Kyle Johnson <kjohn...@gnulnx.net> wrote:
> >>
> >> Hey there,
> >>
> >> We have a number of processes which daily walk our entire directory tree
> >> and perform operations on the found files.
> >>
> >> Pre-gluster, this processes was able to complete within 24 hours of
> >> starting.  After outgrowing that single server and moving to a gluster
> setup
> >> (two bricks, two servers, distribute, 10gig uplink), the processes
> became
> >> unusable.
> >>
> >> After turning this option on, we were back to normal run times, with the
> >> process completing within 24 hours.
> >>
> >> Our data is heavy nested in a large number of subfolders under
> /media/ftp.
> >
> >
> > Thanks for getting back to us - this is very good information. Can you
> > provide a few more details?
> >
> > How deep is your directory tree and roughly how many directories do you
> have
> > at each level?
> > Are all your files in the lowest level dirs or do they exist on several
> > levels?
> > Would you be willing to provide the gluster volume info output for this
> > volume?
> >>
>
>
> I have had performance improvement with this option when the first
> level below the root consisted several thousands of directories
> without any files. IIRC, I was testing this in a 16 x 2 setup.
>

Yes Vijay, I remember you mentioning it. This option is expected to boost
readdir performance only on a directory containing subdirectories. For files
it has no effect.

On a similar note, I think we can also skip linkto files in readdirp (on the
brick), as dht_readdirp picks the dentry from the subvol containing the
data-file.


> Regards,
> Vijay
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel
>



-- 
Raghavendra G
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] [Gluster-Maintainers] Please pause merging patches to 3.9 waiting for just one patch

2016-11-10 Thread Raghavendra G
On Thu, Nov 10, 2016 at 8:46 PM, Manikandan Selvaganesh <
manikandancs...@gmail.com> wrote:

> Enabling/disabling quota or removing limits are the ways in which
> quota.conf is regenerated to the later version. It works properly. And as
> Pranith said, both enabling/disabling takes a lot of time to crawl(though
> now much faster with enhanced quota enable/disable process) which we cannot
> suggest the users with a lot of quota configuration. Resetting the limit
> using limit-usage does not work properly. I have tested the same. The
> workaround is based on the user setup here. I mean the steps he exactly
> used in order matters here. The workaround is not so generic.
>

Thanks for the reply, Manikandan :). I've not tested this, but I went through
the code. If I am not wrong, the function glusterd_store_quota_config writes a
quota.conf which is compatible with versions >= 3.7. This function is invoked
by glusterd_quota_limit_usage unconditionally in the success path. What am I
missing here?

@Pranith,

Since Manikandan says his tests didn't always succeed, we should probably do
one of the following:
1. hold back the release till we successfully verify that limit-usage rewrites
quota.conf (I can do this tomorrow)
2. get the patch in question into 3.9
3. if 1 fails, debug why limit-usage is not rewriting quota.conf and fix that.

regards,
Raghavendra

> However, quota enable/disable would regenerate the file in any case.
>
> IMO, this bug is critical. I am not sure though how often users would hit
> this - updating from 3.6 to the latest versions. From 3.7 to the latest it's
> fine; this has nothing to do with this patch.
>
> On Nov 10, 2016 8:03 PM, "Pranith Kumar Karampuri" <pkara...@redhat.com>
> wrote:
>
>>
>>
>> On Thu, Nov 10, 2016 at 7:43 PM, Raghavendra G <raghaven...@gluster.com>
>> wrote:
>>
>>>
>>>
>>> On Thu, Nov 10, 2016 at 2:14 PM, Pranith Kumar Karampuri <
>>> pkara...@redhat.com> wrote:
>>>
>>>>
>>>>
>>>> On Thu, Nov 10, 2016 at 1:11 PM, Atin Mukherjee <amukh...@redhat.com>
>>>> wrote:
>>>>
>>>>>
>>>>>
>>>>> On Thu, Nov 10, 2016 at 1:04 PM, Pranith Kumar Karampuri <
>>>>> pkara...@redhat.com> wrote:
>>>>>
>>>>>> I am trying to understand the criticality of these patches.
>>>>>> Raghavendra's patch is crucial because gfapi workloads(for samba and 
>>>>>> qemu)
>>>>>> are affected severely. I waited for Krutika's patch because VM usecase 
>>>>>> can
>>>>>> lead to disk corruption on replace-brick. If you could let us know the
>>>>>> criticality and we are in agreement that they are this severe, we can
>>>>>> definitely take them in. Otherwise next release is better IMO. Thoughts?
>>>>>>
>>>>>
>>>>> If you are asking about how critical they are, then the first two are
>>>>> definitely not but third one is actually a critical one as if user 
>>>>> upgrades
>>>>> from 3.6 to latest with quota enable, further peer probes get rejected and
>>>>> the only work around is to disable quota and re-enable it back.
>>>>>
>>>>
>>>> Let me take Raghavendra G's input also here.
>>>>
>>>> Raghavendra, what do you think we should do? Merge it or live with it
>>>> till 3.9.1?
>>>>
>>>
>>> The commit says quota.conf is rewritten to compatible version during
>>> three operations:
>>> 1. enable/disable quota
>>>
>>
>> This will involve crawling the whole FS doesn't it?
>>
>> 2. limit usage
>>>
>>
>> This is a good way IMO. Could Sanoj/you confirm that this works once by
>> testing it.
>>
>>
>>> 3. remove quota limit
>>>
>>
>> I guess you added this for completeness. We can't really suggest this to
>> users as a work around.
>>
>>
>>>
>>> I checked the code and it works as stated in commit msg. Probably we can
>>> list the above three operations as work around and take this patch in for
>>> 3.9.1
>>>
>>
>>>
>>>>
>>>>>
>>>>> On a different note, 3.9 head is not static and moving forward. So if
>>>>> you are really looking at only critical patches need to go in, that's not
>>>>> happening, just a word of caution!
>>>>>
>>>>>
>>>>>> On Thu, Nov 10, 2016 at 12:56 PM, Atin Mukherjee

Re: [Gluster-devel] [Gluster-Maintainers] Please pause merging patches to 3.9 waiting for just one patch

2016-11-10 Thread Raghavendra G
On Thu, Nov 10, 2016 at 2:14 PM, Pranith Kumar Karampuri <
pkara...@redhat.com> wrote:

>
>
> On Thu, Nov 10, 2016 at 1:11 PM, Atin Mukherjee <amukh...@redhat.com>
> wrote:
>
>>
>>
>> On Thu, Nov 10, 2016 at 1:04 PM, Pranith Kumar Karampuri <
>> pkara...@redhat.com> wrote:
>>
>>> I am trying to understand the criticality of these patches.
>>> Raghavendra's patch is crucial because gfapi workloads(for samba and qemu)
>>> are affected severely. I waited for Krutika's patch because VM usecase can
>>> lead to disk corruption on replace-brick. If you could let us know the
>>> criticality and we are in agreement that they are this severe, we can
>>> definitely take them in. Otherwise next release is better IMO. Thoughts?
>>>
>>
>> If you are asking about how critical they are, then the first two are
>> definitely not, but the third one is actually critical: if a user upgrades
>> from 3.6 to the latest with quota enabled, further peer probes get rejected,
>> and the only workaround is to disable quota and re-enable it.
>>
>
> Let me take Raghavendra G's input also here.
>
> Raghavendra, what do you think we should do? Merge it or live with it till
> 3.9.1?
>

The commit says quota.conf is rewritten to a compatible version during three
operations:
1. enable/disable quota
2. limit usage
3. remove quota limit

I checked the code and it works as stated in the commit message. We can
probably list the above three operations as workarounds and take this patch
in for 3.9.1.
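
For reference, the three operations above map to the usual quota CLI (the
volume name, path and size here are placeholders only):

# 1. disable and re-enable quota (regenerates quota.conf, but triggers a crawl)
gluster volume quota VOLNAME disable
gluster volume quota VOLNAME enable

# 2. re-set a limit, which rewrites quota.conf in the newer format
gluster volume quota VOLNAME limit-usage /dir 10GB

# 3. remove a quota limit
gluster volume quota VOLNAME remove /dir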


>
>>
>> On a different note, 3.9 head is not static and moving forward. So if you
>> are really looking at only critical patches need to go in, that's not
>> happening, just a word of caution!
>>
>>
>>> On Thu, Nov 10, 2016 at 12:56 PM, Atin Mukherjee <amukh...@redhat.com>
>>> wrote:
>>>
>>>> Pranith,
>>>>
>>>> I'd like to see following patches getting in:
>>>>
>>>> http://review.gluster.org/#/c/15722/
>>>> http://review.gluster.org/#/c/15714/
>>>> http://review.gluster.org/#/c/15792/
>>>>
>>>
>>>>
>>>>
>>>>
>>>> On Thu, Nov 10, 2016 at 7:12 AM, Pranith Kumar Karampuri <
>>>> pkara...@redhat.com> wrote:
>>>>
>>>>> hi,
>>>>>   The only problem left was EC taking more time. This should
>>>>> affect small files a lot more. Best way to solve it is using 
>>>>> compound-fops.
>>>>> So for now I think going ahead with the release is best.
>>>>>
>>>>> We are waiting for Raghavendra Talur's http://review.gluster.org/#/c/
>>>>> 15778 before going ahead with the release. If we missed any other
>>>>> crucial patch please let us know.
>>>>>
>>>>> Will make the release as soon as this patch is merged.
>>>>>
>>>>> --
>>>>> Pranith & Aravinda
>>>>>
>>>>> ___
>>>>> maintainers mailing list
>>>>> maintain...@gluster.org
>>>>> http://www.gluster.org/mailman/listinfo/maintainers
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>>
>>>> ~ Atin (atinm)
>>>>
>>>
>>>
>>>
>>> --
>>> Pranith
>>>
>>
>>
>>
>> --
>>
>> ~ Atin (atinm)
>>
>
>
>
> --
> Pranith
>
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel
>



-- 
Raghavendra G
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] [Gluster-users] Feedback on DHT option "cluster.readdir-optimize"

2016-11-09 Thread Raghavendra G
On Thu, Nov 10, 2016 at 12:58 PM, Gandalf Corvotempesta <
gandalf.corvotempe...@gmail.com> wrote:

> On 10 Nov 2016 08:22, "Raghavendra
> <raghaven...@gluster.com> wrote:
> >
> > Kyle,
> >
> > Thanks for your response :). This really helps. From 13s to 0.23s
> > seems like a huge improvement.
>
> From 13 minutes to 23 seconds, not from 13 seconds :)
>

Yeah. That was one confused reply :). Sorry about that.


>
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel
>



-- 
Raghavendra G
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] [Gluster-users] Feedback on DHT option "cluster.readdir-optimize"

2016-11-09 Thread Raghavendra G
Kyle,

Thanks for your response :). This really helps. From 13s to 0.23s
seems like a huge improvement.

regards,
Raghavendra

On Tue, Nov 8, 2016 at 8:21 PM, Kyle Johnson <kjohn...@gnulnx.net> wrote:

> Hey there,
>
> We have a number of processes which daily walk our entire directory tree
> and perform operations on the found files.
>
> Pre-gluster, these processes were able to complete within 24 hours of
> starting.  After outgrowing that single server and moving to a gluster
> setup (two bricks, two servers, distribute, 10gig uplink), the processes
> became unusable.
>
> After turning this option on, we were back to normal run times, with the
> process completing within 24 hours.
>
> Our data is heavily nested in a large number of subfolders under /media/ftp.
>
> A subset of our data:
>
> 15T of files in 48163 directories under /media/ftp/dig_dis.
>
> Without readdir-optimize:
>
> [root@colossus dig_dis]# time ls|wc -l
> 48163
>
> real13m1.582s
> user0m0.294s
> sys 0m0.205s
>
>
> With readdir-optimize:
>
> [root@colossus dig_dis]# time ls | wc -l
> 48163
>
> real0m23.785s
> user0m0.296s
> sys 0m0.108s
>
>
> Long story short - this option is super important to me as it resolved an
> issue that would have otherwise made me move my data off of gluster.
>
>
> Thank you for all of your work,
>
> Kyle
>
>
>
>
>
> On 11/07/2016 10:07 PM, Raghavendra Gowdappa wrote:
>
>> Hi all,
>>
>> We have an option called "cluster.readdir-optimize" which alters the
>> behavior of readdirp in DHT. This value affects how storage/posix treats
>> dentries corresponding to directories (not for files).
>>
>> When this value is on,
>> * DHT asks only one subvol/brick to return dentries corresponding to
>> directories.
>> * Other subvols/bricks filter dentries corresponding to directories and
>> send only dentries corresponding to files.
>>
>> When this value is off (this is the default value),
>> * All subvols return all dentries stored on them. IOW, bricks don't
>> filter any dentries.
>> * Since a directory has one dentry representing it on each subvol, dht
>> (loaded on client) picks up dentry only from hashed subvol.
>>
>> Note that irrespective of value of this option, _all_ subvols return
>> dentries corresponding to files which are stored on them.
>>
>> This option was introduced to boost performance of readdir as (when set
>> on), filtering of dentries happens on bricks and hence there is reduced:
>> 1. network traffic (with filtering all the redundant dentry information)
>> 2. number of readdir calls between client and server for the same number
>> of dentries returned to the application (if filtering happens on the client,
>> there are fewer dentries in each result and hence more readdir calls; IOW,
>> the result buffer is not filled to maximum capacity).
>>
>> We want to hear from you whether you've used this option and, if yes,
>> 1. Did it really boost readdir performance?
>> 2. Do you've any performance data to find out what was the percentage of
>> improvement (or deterioration)?
>> 3. Data set you had (Number of files, directories and organisation of
>> directories).
>>
>> If we find out that this option is really helping you, we can spend our
>> energies on fixing issues that will arise when this option is set to on.
>> One common issue with turning this option on is that when this option is
>> set, some directories might not show up in directory listing [1]. The
>> reason for this is that:
>> 1. If a directory can be created on a hashed subvol, mkdir (result to
>> application) will be successful, irrespective of result of mkdir on rest of
>> the subvols.
>> 2. So, any subvol we pick to give us dentries corresponding to directory
>> need not contain all the directories and we might miss out those
>> directories in listing.
>>
>> Your feedback is important for us and will help us to prioritize and
>> improve things.
>>
>> [1] https://www.gluster.org/pipermail/gluster-users/2016-October
>> /028703.html
>>
>> regards,
>> Raghavendra
>> ___
>> Gluster-users mailing list
>> gluster-us...@gluster.org
>> http://www.gluster.org/mailman/listinfo/gluster-users
>>
>
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel
>



-- 
Raghavendra G
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] [Gluster-users] A question of GlusterFS dentries!

2016-11-07 Thread Raghavendra G
On Wed, Nov 2, 2016 at 9:54 AM, Serkan Çoban <cobanser...@gmail.com> wrote:

> +1 for a "no-rewinddir-support" option in DHT.
> We are seeing very slow directory listings, especially with a 1500+ brick
> volume; 'ls' takes 20+ seconds with 1000+ files.
>

In case it's not clear, I would like to point out that serialized readdir is
not the sole issue causing slowness. If directories are _HUGE_ then I don't
expect too much benefit from parallelizing. Also, as others have been pointing
out (in various in-person discussions), there are other scalability limits,
like the number of messages and the memory consumed, when winding calls in
parallel. I'll probably do a rough POC in the next couple of months to see
whether this idea has any substance or not and post the results.
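
To put a rough number on the memory point, a back-of-envelope sketch (the
per-subvol reply-buffer size below is an assumption picked purely for
illustration, not a measured value):

#include <stdio.h>

int main(void)
{
        int    subvols        = 1500;        /* bricks, from the report above      */
        size_t reply_buf_size = 128 * 1024;  /* assumed readdirp reply buffer size */

        double mem_mb = (double)subvols * (double)reply_buf_size / (1024.0 * 1024.0);
        printf("winding readdirp in parallel: %d in-flight RPCs and ~%.0f MB of "
               "reply buffers for a single directory listing\n", subvols, mem_mb);
        return 0;
}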



> On Wed, Nov 2, 2016 at 7:08 AM, Raghavendra Gowdappa
> <rgowd...@redhat.com> wrote:
> >
> >
> > - Original Message -
> >> From: "Keiviw" <kei...@163.com>
> >> To: gluster-devel@gluster.org
> >> Sent: Tuesday, November 1, 2016 12:41:02 PM
> >> Subject: [Gluster-devel] A question of GlusterFS dentries!
> >>
> >> Hi,
> >> In GlusterFS distributed volumes, listing a non-empty directory was
> slow.
> >> Then I read the dht codes and found the reasons. But I was confused that
> >> GlusterFS dht travesed all the bricks(in the volume) sequentially,why
> not
> >> use multi-thread to read dentries from multiple bricks simultaneously.
> >> That's a question that's always puzzled me, Couly you please tell me
> >> something about this???
> >
> > readdir across subvols is sequential mostly because we have to support
> rewinddir(3). We need to maintain the mapping of offset and dentry across
> multiple invocations of readdir. In other words if someone did a rewinddir
> to an offset corresponding to earlier dentry, subsequent readdirs should
> return same set of dentries what the earlier invocation of readdir
> returned. For example, in an hypothetical scenario, readdir returned
> following dentries:
> >
> > 1. a, off=10
> > 2. b, off=2
> > 3. c, off=5
> > 4. d, off=15
> > 5. e, off=17
> > 6. f, off=13
> >
> > Now if we did rewinddir to off 5 and issue readdir again we should get
> following dentries:
> > (c, off=5), (d, off=15), (e, off=17), (f, off=13)
> >
> > Within a subvol backend filesystem provides rewinddir guarantee for the
> dentries present on that subvol. However, across subvols it is the
> responsibility of DHT to provide the above guarantee. Which means we
> should've some well defined order in which we send readdir calls (Note that
> order is not well defined if we do a parallel readdir across all subvols).
> So, DHT has sequential readdir which is a well defined order of reading
> dentries.
> >
> > To give an example, if we have another subvol - subvol2 - (in addition
> > to the subvol above - say subvol1) with the following listing:
> > 1. g, off=16
> > 2. h, off=20
> > 3. i, off=3
> > 4. j, off=19
> >
> > With parallel readdir we can have many ordering like - (a, b, g, h, i,
> c, d, e, f, j), (g, h, a, b, c, i, j, d, e, f) etc. Now if we do (with
> readdir done parallely):
> >
> > 1. A complete listing of the directory (which can be any one of C(10,4) =
> > 210 possible interleavings, since each subvol's own order is preserved).
> > 2. Do rewinddir (20)
> >
> > We cannot predict what are the set of dentries that come _after_ offset
> 20. However, if we do a readdir sequentially across subvols there is only
> one directory listing i.e, (a, b, c, d, e, f, g, h, i, j). So, its easier
> to support rewinddir.
> >
> > If there is no POSIX requirement for rewinddir support, I think a
> parallel readdir can easily be implemented (which improves performance
> too). But unfortunately rewinddir is still a POSIX requirement. This also
> opens up another possibility of a "no-rewinddir-support" option in DHT,
> which if enabled results in parallel readdirs across subvols. What I am not
> sure is how many users still use rewinddir? If there is a critical mass
> which wants performance with a tradeoff of no rewinddir support this can be
> a good feature.
> >
> > +gluster-users to get an opinion on this.
> >
> > regards,
> > Raghavendra
> >
> >>
> >>
> >>
> >>
> >>
> >>
> >> ___
> >> Gluster-devel mailing list
> >> Gluster-devel@gluster.org
> >> http://www.gluster.org/mailman/listinfo/gluster-devel
> > ___
> > Gluster-users mailing list
> > gluster-us...@gluster.org
> > http://www.gluster.org/mailman/listinfo/gluster-users
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel
>



-- 
Raghavendra G
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Issue about the size of fstat is less than the really size of the syslog file

2016-11-03 Thread Raghavendra G
On Thu, Nov 3, 2016 at 10:27 AM, Raghavendra G <raghaven...@gluster.com>
wrote:

>
>
> On Thu, Nov 3, 2016 at 7:16 AM, Lian, George (Nokia - CN/Hangzhou) <
> george.l...@nokia.com> wrote:
>
>> >Yes. I was assuming that the previous results were tested with:
>> >1. write-behind on with the fix
>> >2. quick-read and readdir-ahead off
>> # gluster volume info log
>>
>> performance.quick-read: off
>> performance.readdir-ahead: off
>> performance.stat-prefetch: on
>> performance.write-behind: on
>>
>>
>> with the above configuration and write-behind.so with patch 2, the "tail
>> truncated" issue still be there.
>>
>> # tail -f syslog >/dev/null
>> tail: syslog: file truncated
>> tail: syslog: file truncated
>>
>> FYI,
>>
>
> Thanks George. I'll take a look.
>

Can you please test with the following configuration?

1. write-behind on with my fix
2. readdir-ahead and quick-read off
3. performance.stat-prefetch on
4. performance.force-readdirp off
5. dht.force-readdirp off
6. Also mount glusterfs with option "use-readdirp=no"

[root@booradley glusterfs]# mount -t glusterfs -o use-readdirp=no
booradley:/newptop /mnt

[root@booradley glusterfs]# ps ax | grep -i mnt
14418 ?Ssl0:00 /usr/local/sbin/glusterfs --use-readdirp=no
--volfile-server=booradley --volfile-id=/newptop /mnt

[root@booradley glusterfs]# gluster volume set newptop
performance.write-behind on
volume set: success

[root@booradley glusterfs]# gluster volume set newptop
performance.quick-read off
volume set: success

[root@booradley glusterfs]# gluster volume set newptop
performance.stat-prefetch on
volume set: success

[root@booradley glusterfs]# gluster volume set newptop
performance.force-readdirp off
volume set: success

[root@booradley glusterfs]# gluster volume set newptop dht.force-readdirp
off
volume set: success

[root@booradley glusterfs]# gluster volume set newptop
performance.readdir-ahead off
volume set: success

[root@booradley glusterfs]# gluster volume info newptop

Volume Name: newptop
Type: Distribute
Volume ID: 092756e1-e095-4e05-9f14-3e9a6aed908c
Status: Started
Snapshot Count: 0
Number of Bricks: 1
Transport-type: tcp
Bricks:
Brick1: booradley:/home/export/newptop
Options Reconfigured:
dht.force-readdirp: off
performance.force-readdirp: off
performance.stat-prefetch: on
performance.write-behind: on
performance.quick-read: off
transport.address-family: inet
performance.readdir-ahead: off
nfs.disable: on



>
>>
>> Best Regards,
>> George
>>
>>
>> -Original Message-
>> From: Raghavendra Gowdappa [mailto:rgowd...@redhat.com]
>> Sent: Wednesday, November 02, 2016 5:41 PM
>> To: Lian, George (Nokia - CN/Hangzhou) <george.l...@nokia.com>
>> Cc: Raghavendra G <raghaven...@gluster.com>; Gluster-devel@gluster.org;
>> Zizka, Jan (Nokia - CZ/Prague) <jan.zi...@nokia.com>; Zhang, Bingxuan
>> (Nokia - CN/Hangzhou) <bingxuan.zh...@nokia.com>
>> Subject: Re: [Gluster-devel] Issue about the size of fstat is less than
>> the really size of the syslog file
>>
>>
>>
>> - Original Message -
>> > From: "George Lian (Nokia - CN/Hangzhou)" <george.l...@nokia.com>
>> > To: "Raghavendra Gowdappa" <rgowd...@redhat.com>
>> > Cc: "Raghavendra G" <raghaven...@gluster.com>,
>> Gluster-devel@gluster.org, "Jan Zizka (Nokia - CZ/Prague)"
>> > <jan.zi...@nokia.com>, "Bingxuan Zhang (Nokia - CN/Hangzhou)" <
>> bingxuan.zh...@nokia.com>
>> > Sent: Wednesday, November 2, 2016 1:38:44 PM
>> > Subject: RE: [Gluster-devel] Issue about the size of fstat is less than
>> the really size of the syslog file
>> >
>> > Yes, I confirm use the Patch 2.
>> >
>> > One update: the issue is occurred when readdir-ahead off and
>> write-behind on.
>> > Seems gone when write-behind and readdir-ahead and quick-read all off.
>> > Not verified with readdir-ahead and quick-read both off and
>> write-behind on
>> > till now.
>> >
>> > Need I test it with write-behind on and readdir-ahead and quick-read
>> both
>> > off?
>>
>> Yes. I was assuming that the previous results were tested with:
>> 1. write-behind on with the fix
>> 2. quick-read and readdir-ahead off
>>
>> If not, test results with this configuration will help.
>>
>> >
>> > Best Regards,
>> > George
>> >
>> > -Original Message-
>> > From: Raghavendra Gowdappa [mailto:rgowd...@redhat.com]
>> > Sent: Wednesday, November 02, 2

Re: [Gluster-devel] A question of GlusterFS dentries!

2016-11-03 Thread Raghavendra G
On Thu, Nov 3, 2016 at 11:34 AM, Keiviw <kei...@163.com> wrote:

> If GlusterFS does not support POSIX seekdir, what problems will users or
> GlusterFS have?
>

Glusterfs won't have any problem if we don't support seekdir. I am also not
sure whether applications have a real use-case for seekdir. However, it is
a POSIX requirement.
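
For anyone unfamiliar with the requirement, here is a minimal sketch of the
pattern an application is allowed to rely on (the mount path is just an
example). The entries returned after seekdir() must be the same ones, in the
same order, that followed the recorded position in the first pass - and that
is exactly the guarantee DHT has to provide across subvols:

#include <dirent.h>
#include <stdio.h>

int main(void)
{
        DIR *dir = opendir("/mnt/glusterfs/somedir");   /* example path */
        struct dirent *entry;
        long mark = -1;
        int count = 0;

        if (!dir)
                return 1;

        /* first pass: remember the position after the third entry */
        while ((entry = readdir(dir)) != NULL) {
                if (++count == 3)
                        mark = telldir(dir);
        }

        /* second pass: seek back and re-read; the entries printed here must
         * match the tail of the first pass */
        if (mark != -1) {
                seekdir(dir, mark);
                while ((entry = readdir(dir)) != NULL)
                        printf("%s\n", entry->d_name);
        }

        closedir(dir);
        return 0;
}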



>
> Sent from NetEase Mail Master
> On 11/03/2016 12:52, Raghavendra G <raghaven...@gluster.com> wrote:
>
>
>
> On Wed, Nov 2, 2016 at 9:38 AM, Raghavendra Gowdappa <rgowd...@redhat.com>
> wrote:
>
>>
>>
>> - Original Message -
>> > From: "Keiviw" <kei...@163.com>
>> > To: gluster-devel@gluster.org
>> > Sent: Tuesday, November 1, 2016 12:41:02 PM
>> > Subject: [Gluster-devel] A question of GlusterFS dentries!
>> >
>> > Hi,
>> > In GlusterFS distributed volumes, listing a non-empty directory was slow.
>> > Then I read the dht code and found the reasons. But I was confused that
>> > GlusterFS dht traversed all the bricks (in the volume) sequentially; why
>> > not use multiple threads to read dentries from multiple bricks
>> > simultaneously? That's a question that has always puzzled me. Could you
>> > please tell me something about this?
>>
>> readdir across subvols is sequential mostly because we have to support
>> rewinddir(3).
>
>
> Sorry, seekdir(3) is the more relevant function here. Since rewinddir
> resets the dir stream to the beginning, it's not much of a difficulty to
> support rewinddir with parallel readdirs across subvols.
>
>
>> We need to maintain the mapping of offset and dentry across multiple
>> invocations of readdir. In other words if someone did a rewinddir to an
>> offset corresponding to earlier dentry, subsequent readdirs should return
>> same set of dentries what the earlier invocation of readdir returned. For
>> example, in an hypothetical scenario, readdir returned following dentries:
>>
>> 1. a, off=10
>> 2. b, off=2
>> 3. c, off=5
>> 4. d, off=15
>> 5. e, off=17
>> 6. f, off=13
>>
>> Now if we did rewinddir to off 5 and issue readdir again we should get
>> following dentries:
>> (c, off=5), (d, off=15), (e, off=17), (f, off=13)
>>
>> Within a subvol backend filesystem provides rewinddir guarantee for the
>> dentries present on that subvol. However, across subvols it is the
>> responsibility of DHT to provide the above guarantee. Which means we
>> should've some well defined order in which we send readdir calls (Note that
>> order is not well defined if we do a parallel readdir across all subvols).
>> So, DHT has sequential readdir which is a well defined order of reading
>> dentries.
>>
>> To give an example, if we have another subvol - subvol2 - (in addition to
>> the subvol above - say subvol1) with the following listing:
>> 1. g, off=16
>> 2. h, off=20
>> 3. i, off=3
>> 4. j, off=19
>>
>> With parallel readdir we can have many ordering like - (a, b, g, h, i, c,
>> d, e, f, j), (g, h, a, b, c, i, j, d, e, f) etc. Now if we do (with readdir
>> done parallely):
>>
>> 1. A complete listing of the directory (which can be any one of C(10,4) =
>> 210 possible interleavings, since each subvol's own order is preserved).
>> 2. Do rewinddir (20)
>>
>> We cannot predict what are the set of dentries that come _after_ offset
>> 20. However, if we do a readdir sequentially across subvols there is only
>> one directory listing i.e, (a, b, c, d, e, f, g, h, i, j). So, its easier
>> to support rewinddir.
>>
>> If there is no POSIX requirement for rewinddir support, I think a
>> parallel readdir can easily be implemented (which improves performance
>> too). But unfortunately rewinddir is still a POSIX requirement. This also
>> opens up another possibility of a "no-rewinddir-support" option in DHT,
>> which if enabled results in parallel readdirs across subvols. What I am not
>> sure is how many users still use rewinddir? If there is a critical mass
>> which wants performance with a tradeoff of no rewinddir support this can be
>> a good feature.
>>
>> +gluster-users to get an opinion on this.
>>
>> regards,
>> Raghavendra
>>
>> >
>> >
>> >
>> >
>> >
>> >
>> > ___
>> > Gluster-devel mailing list
>> > Gluster-devel@gluster.org
>> > http://www.gluster.org/mailman/listinfo/gluster-devel
>> ___
>> Gluster-devel mailing list
>> Gluster-devel@gluster.org
>> http://www.gluster.org/mailman/listinfo/gluster-devel
>>
>
>
>
> --
> Raghavendra G
>
>
>
>
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel
>



-- 
Raghavendra G
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Issue about the size of fstat is less than the really size of the syslog file

2016-11-01 Thread Raghavendra G
> The issue explained in this comment is hit only when writes are done.
>> But, in
>> > your use-case only "tail" is the application running on the mount (If I
>> am
>> > not wrong, the  writer is running on a different mountpoint). So, I
>> doubt
>> > you are hitting this issue. But, you are saying that the issue goes away
>> > when write-behind/md-cache is turned off pointing to some interaction
>> > between md-cache and write-behind causing the issue. I need more time to
>> > look into this issue. Can you file a bug on this?
>> >
>> > > if (IA_ISREG(inode->ia_type) &&
>> > > ((iatt->ia_mtime != mdc->md_mtime) ||
>> > > (iatt->ia_ctime != mdc->md_ctime)))
>> > > if (!prebuf || (prebuf->ia_ctime !=
>> mdc->md_ctime) ||
>> > > (prebuf->ia_mtime != mdc->md_mtime))
>> > > inode_invalidate(inode);
>> > >
>> > > mdc_from_iatt (mdc, iatt);
>> > >
> >> > > time (&mdc->ia_time);
>> > > }
>> > >
>> > > Best Regards,
>> > > George
>> > > -Original Message-
>> > > From: Raghavendra Gowdappa [mailto:rgowd...@redhat.com]
>> > > Sent: Thursday, October 13, 2016 8:58 PM
>> > > To: Lian, George (Nokia - CN/Hangzhou) <george.l...@nokia.com>
>> > > Cc: Gluster-devel@gluster.org; I_EXT_MBB_WCDMA_SWD3_DA1_MATRIX_GMS
>> > > <i_ext_mbb_wcdma_swd3_da1_mat...@internal.nsn.com>; Zhang, Bingxuan
>> (Nokia
>> > > -
>> > > CN/Hangzhou) <bingxuan.zh...@nokia.com>; Zizka, Jan (Nokia -
>> CZ/Prague)
>> > > <jan.zi...@nokia.com>
>> > > Subject: Re: [Gluster-devel] Issue about the size of fstat is less
>> than the
>> > > really size of the syslog file
>> > >
>> > >
>> > >
>> > > - Original Message -
>> > > > From: "George Lian (Nokia - CN/Hangzhou)" <george.l...@nokia.com>
>> > > > To: Gluster-devel@gluster.org
>> > > > Cc: "I_EXT_MBB_WCDMA_SWD3_DA1_MATRIX_GMS"
>> > > > <i_ext_mbb_wcdma_swd3_da1_mat...@internal.nsn.com>, "Bingxuan Zhang
>> > > > (Nokia
>> > > > - CN/Hangzhou)" <bingxuan.zh...@nokia.com>, "Jan Zizka (Nokia -
>> > > > CZ/Prague)"
>> > > > <jan.zi...@nokia.com>
>> > > > Sent: Thursday, October 13, 2016 2:33:53 PM
>> > > > Subject: [Gluster-devel] Issue about the size of fstat is less than
>> the
>> > > > really size of the syslog file
>> > > >
>> > > > Hi, Dear Expert,
>> > > > We have use glusterfs as a network filesystem, and syslog store in
>> there,
>> > > > some clients on different host may write the syslog file via
>> “glusterfs”
>> > > > mount point.
>> > > > Now we encounter an issue when we “tail” the syslog file, it will
>> > > > occasional
>> > > > failed with error “ file truncated ”
>> > > > As we study and trace with the “tail” source code, it failed with
>> the
>> > > > following code:
>> > > > if ( S_ISREG (mode) && stats.st_size < f[i].size )
>> > > > {
>> > > > error (0, 0, _("%s: file truncated"), quotef (name));
>> > > > /* Assume the file was truncated to 0,
>> > > > and therefore output all "new" data. */
>> > > > xlseek (fd, 0, SEEK_SET, name);
>> > > > f[i].size = 0;
>> > > > }
>> > > > When stats.st_size < f[i].size, what mean the size report by fstat
>> is
>> > > > less
>> > > > than “tail” had read, it lead to “file truncated”, we also use
>> “strace”
>> > > > tools to trace the tail application, the related tail strace log as
>> the
>> > > > below:
>> > > > nanosleep({1, 0}, NULL) = 0
>> > > > fstat(3, {st_mode=S_IFREG|0644, st_size=192543105, ...}) = 0
>> > > > nanosleep({1, 0}, NULL) = 0
>> > > > fstat(3, {st_mode=S_IFREG|0644, st_size=192543105, ...}) = 0
>> > > > nanosleep({1, 0}, NULL) = 0
>> > > > fstat(3, {st_mode=S_IFREG|0644, st_size=192543105, ...}) = 0
>> > > > nanosleep({1, 0}, NULL) = 0
>> > > > fstat(3, {st_mode=S_IFREG|0644, st_size=192544549, ...}) = 0
>> > > > read(3, " Data … -"..., 8192) = 1444
>> > > > read(3, " Data.. "..., 8192) = 720
>> > > > read(3, "", 8192) = 0
>> > > > fstat(3, {st_mode=S_IFREG|0644, st_size=192544789, ...}) = 0
>> > > > write(1, “DATA…..” ) = 2164
>> > > > write(2, "tail: ", 6tail: ) = 6
>> > > > write(2, "/mnt/log/master/syslog: file tru"...,
>> 38/mnt/log/master/syslog:
>> > > > file truncated) = 38
> > > > As the above strace log shows, tail has read 1444+720=2164 bytes,
> > > > but fstat tells “tail” 192544789 – 192543105 = 1684, which is less
> > > > than 2164, so it leads the “tail” application to report “file
> > > > truncated”.
>> > > > And if we turn off “write-behind” feature, the issue will not be
>> > > > reproduced
>> > > > any more.
>> > >
>> > > That seems strange. There are no writes happening on the fd/inode
>> through
>> > > which tail is reading/stating from. So, it seems strange that
>> write-behind
>> > > is involved here. I suspect whether any of
>> md-cache/read-ahead/io-cache is
>> > > causing the issue. Can you,
>> > >
>> > > 1. Turn off md-cache, read-ahead, io-cache xlators
>> > > 2. mount glusterfs with --attribute-timeout=0
>> > > 3. set write-behind on
>> > >
>> > > and rerun the tests? If you don't hit the issue, you can experiment by
>> > > turning on/off of md-cache, read-ahead and io-cache translators and
>> see
>> > > what
>> > > are the minimal number of xlators that need to be turned off to not
>> hit the
>> > > issue (with write-behind on)?
>> > >
>> > > regards,
>> > > Raghavendra
>> > >
>> > > > So we think it may be related to cache consistence issue due to
>> > > > performance
>> > > > consider, but we still have concern that:
>> > > > The syslog file is used only with “Append” mode, so the size of file
>> > > > shouldn’t be reduced, when a client read the file, why “fstat” can’t
>> > > > return
>> > > > the really size match to the cache?
>> > > > From current investigation, we doubt that the current implement of
>> > > > “glusterfs” has a bug on “fstat” when cache is on.
>> > > > Your comments is our highly appreciated!
>> > > > Thanks & Best Regards
>> > > > George
>> > > >
>> > > > ___
>> > > > Gluster-devel mailing list
>> > > > Gluster-devel@gluster.org
>> > > > http://www.gluster.org/mailman/listinfo/gluster-devel
>> > >
>> >
>> ___
>> Gluster-devel mailing list
>> Gluster-devel@gluster.org
>> http://www.gluster.org/mailman/listinfo/gluster-devel
>>
>>
>>
>> --
>>
>> Pranith
>>
>>
>>
>>
>> --
>>
>> Pranith
>>
>
>
>
> --
> Pranith
>
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel
>



-- 
Raghavendra G
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Custom Transport layers

2016-10-31 Thread Raghavendra G
On Fri, Oct 28, 2016 at 6:20 PM, Lindsay Mathieson <
lindsay.mathie...@gmail.com> wrote:

> Is it possible to write custom transport layers for gluster? (Data
> transfer, not the management protocols.) Pointers to the existing code
> and/or docs :) would be helpful.
>
>
> I'd like to experiment with broadcast UDP to see if it's feasible in local
> networks.


Another thing to consider here is the ordering of messages sent over the
transport. Broadcast UDP likely doesn't support ordering of messages (I know
plain UDP doesn't, and I assume broadcast UDP doesn't either, but I may be
wrong). If it doesn't, you have to build ordering logic on top of it. If the
transport layer doesn't provide ordering, we cannot reason about the
consistency of the data stored on the filesystem.
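
To sketch what that ordering logic would minimally look like (purely
hypothetical, none of this exists in gluster; deliver/hold/drain below are
placeholder names):

#include <stdint.h>
#include <stddef.h>

struct udp_msg {
        uint64_t seq;             /* per-sender, monotonically increasing */
        size_t   len;
        char     payload[];
};

/* placeholders for illustration only */
extern void deliver (struct udp_msg *msg);
extern void hold_out_of_order (struct udp_msg *msg);
extern void drain_in_order (uint64_t *next_expected);

void on_receive (struct udp_msg *msg, uint64_t *next_expected)
{
        if (msg->seq == *next_expected) {
                deliver (msg);                  /* in-order: hand it up the stack   */
                (*next_expected)++;
                drain_in_order (next_expected); /* flush buffered msgs now in order */
        } else if (msg->seq > *next_expected) {
                hold_out_of_order (msg);        /* hole in the stream: buffer it    */
        }                                       /* else: duplicate/old, drop it     */
}

Note that once you add buffering and retransmission (UDP also drops packets),
this starts to approach what TCP already provides, which is part of the
trade-off to evaluate.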


> It would be amazing if we could write at 1GB speeds simultaneously to all
> nodes.
>
>
> Alternatively let me know if this has been tried and discarded as a bad
> idea ...
>
> thanks,
>
> --
> Lindsay Mathieson
>
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel
>



-- 
Raghavendra G
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] libgfapi zero copy write - application in samba, nfs-ganesha

2016-09-30 Thread Raghavendra G
On Thu, Sep 29, 2016 at 11:11 AM, Raghavendra G <raghaven...@gluster.com>
wrote:

>
>
> On Wed, Sep 28, 2016 at 7:37 PM, Shyam <srang...@redhat.com> wrote:
>
>> On 09/27/2016 04:02 AM, Poornima Gurusiddaiah wrote:
>>
>>> W.r.t Samba consuming this, it requires a great deal of code change in
>>> Samba.
>>> Currently samba has no concept of getting buf from the underlying file
>>> system,
>>> the filesystem comes into picture only at the last layer(gluster plugin),
>>> where system calls are replaced by libgfapi calls. Hence, this is not
>>> readily
>>> consumable by Samba, and i think same will be the case with NFS_Ganesha,
>>> will
>>> let the Ganesha folksc comment on the same.
>>>
>>
>> This is exactly my reservation about the nature of change [2] that is
>> done in this patch. We expect all consumers to use *our* buffer management
>> system, which may not be possible all the time.
>>
>> From the majority of consumers that I know of, other than what Sachin
>> stated as an advantage for CommVault, none of the others can use the
>> gluster buffers at the moment (Ganesha, SAMBA, qemu. (I would like to
>> understand how CommVault can use gluster buffers in this situation without
>> copying out data to the same, just for clarity).
>>
>
> +Jeff cody, for comments on QEMU
>
>
>>
>> This is the reason I posted the comments at [1], stating we should copy
>> out the buffer, when Gluster needs it preserved, but use application
>> provided buffers as long as we can.
>>
>
> My concerns here are:
>
> * We are just moving the copy from gfapi layer to write-behind. Though I
> am not sure what percentage of writes that hit write-behind are
> "written-back", I would assume it to be a significant percentage (otherwise
> there is no benefit in having write-behind). However, we can try this
> approach and get some perf data before we make a decision.
>
> * Buffer management. All gluster code uses iobuf/iobrefs to manage the
> buffers of relatively large size. With the approach suggested above, I see
> two concerns:
> a. write-behind has to differentiate between iobufs that need copying
> (write calls through gfapi layer) and iobufs that can just be refed (writes
> from fuse etc) when "writing-back" the write. This adds more complexity.
> b. For the case where write-behind chooses to not "write-back" the
> write, we need a way of encapsulating the application buffer into
> iobuf/iobref. This might need changes in iobuf infra.
>
>
>> I do see the advantages of zero-copy, but not when gluster api is
>> managing the buffers, it just makes it more tedious for applications to use
>> this scheme, IMHO.
>>
>
Another point we can consider here is gfapi (and the gluster internal xlator
stack) providing both of the behaviors mentioned below:
1. Making the Glusterfs xlator stack use application buffers.
2. Forcing applications to use only gluster-managed buffers if they want
zero copy.

Let the applications choose which interface to use, based on their use-cases
(as there is a trade-off in terms of performance, code changes, legacy
applications which are resistant to change, etc.).
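
To make the trade-off concrete, here is a minimal sketch of what the two
interfaces look like from an application's point of view. The zero-copy calls
below are the ones proposed in the patch; their prototypes are guessed here
for illustration only (see the patch for the real declarations), and error
handling is elided:

#include <glusterfs/api/glfs.h>   /* adjust the include path to your install */
#include <string.h>

/* Prototypes guessed for illustration; the real ones live in the zero-copy
 * patch (http://review.gluster.org/#/c/14784/). */
void   *glfs_get_buffer (glfs_fd_t *fd, size_t size);
ssize_t glfs_zero_write (glfs_fd_t *fd, void *buf, size_t size);
void    glfs_free_buffer (glfs_fd_t *fd, void *buf);

/* 1. Today's interface: the application owns "buf", and gfapi copies it into
 *    an internal iobuf before winding the write down the xlator stack. */
static void write_with_copy (glfs_fd_t *fd, const char *buf, size_t len)
{
        glfs_write (fd, buf, len, 0);
}

/* 2. Proposed zero-copy interface: the application fills a gluster-managed
 *    buffer directly, so gfapi can hand it to the xlator stack without a
 *    memcpy; the application frees it once the write returns. */
static void write_zero_copy (glfs_fd_t *fd, const char *data, size_t len)
{
        void *iob = glfs_get_buffer (fd, len);
        if (!iob)
                return;
        memcpy (iob, data, len);          /* in real use, generate data here directly */
        glfs_zero_write (fd, iob, len);
        glfs_free_buffer (fd, iob);
}

Which of the two an application can realistically adopt is exactly the
Samba/NFS-Ganesha concern raised earlier in the thread: their VFS layers hand
gfapi a buffer they already own.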


>> Could we think and negate (if possible) thoughts around using the
>> application passed buffers as is? One caveat here seems to be when using
>> RDMA (we need the memory registered if I am not wrong), as that would
>> involve a copy to RDMA buffers when using application passed buffers.
>
>
> Actually RDMA is not a problem in the current implementation (ruling out
> suggestions by others to use a pre-registered iobufs  for managing io-cache
> etc). This is because, in current implementation the responsibility of
> registering the memory region lies in transport/rdma. In other words
> transport/rdma doesn't expect pre-registered buffers.
>
>
> What are the other pitfalls?
>>
>> [1] http://www.gluster.org/pipermail/gluster-devel/2016-August/0
>> 50622.html
>>
>> [2] http://review.gluster.org/#/c/14784/
>>
>>
>>>
>>> Regards,
>>> Poornima
>>>
>> ___
>> Gluster-devel mailing list
>> Gluster-devel@gluster.org
>> http://www.gluster.org/mailman/listinfo/gluster-devel
>>
>
>
>
> --
> Raghavendra G
>



-- 
Raghavendra G
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] libgfapi zero copy write - application in samba, nfs-ganesha

2016-09-30 Thread Raghavendra G
On Wed, Sep 28, 2016 at 7:37 PM, Shyam <srang...@redhat.com> wrote:

> On 09/27/2016 04:02 AM, Poornima Gurusiddaiah wrote:
>
>> W.r.t Samba consuming this, it requires a great deal of code change in
>> Samba.
>> Currently samba has no concept of getting buf from the underlying file
>> system,
>> the filesystem comes into picture only at the last layer(gluster plugin),
>> where system calls are replaced by libgfapi calls. Hence, this is not
>> readily
>> consumable by Samba, and i think same will be the case with NFS_Ganesha,
>> will
>> let the Ganesha folksc comment on the same.
>>
>
> This is exactly my reservation about the nature of change [2] that is done
> in this patch. We expect all consumers to use *our* buffer management
> system, which may not be possible all the time.
>
> From the majority of consumers that I know of, other than what Sachin
> stated as an advantage for CommVault, none of the others can use the
> gluster buffers at the moment (Ganesha, SAMBA, qemu. (I would like to
> understand how CommVault can use gluster buffers in this situation without
> copying out data to the same, just for clarity).
>

+Jeff cody, for comments on QEMU


>
> This is the reason I posted the comments at [1], stating we should copy
> out the buffer, when Gluster needs it preserved, but use application
> provided buffers as long as we can.
>

My concerns here are:

* We are just moving the copy from gfapi layer to write-behind. Though I am
not sure what percentage of writes that hit write-behind are
"written-back", I would assume it to be a significant percentage (otherwise
there is no benefit in having write-behind). However, we can try this
approach and get some perf data before we make a decision.

* Buffer management. All gluster code uses iobuf/iobrefs to manage the
buffers of relatively large size. With the approach suggested above, I see
two concerns:
a. write-behind has to differentiate between iobufs that need copying
(write calls through gfapi layer) and iobufs that can just be refed (writes
from fuse etc) when "writing-back" the write. This adds more complexity.
b. For the case where write-behind chooses to not "write-back" the
write, we need a way of encapsulating the application buffer into
iobuf/iobref. This might need changes in iobuf infra.


> I do see the advantages of zero-copy, but not when gluster api is managing
> the buffers, it just makes it more tedious for applications to use this
> scheme, IMHO.
>
> Could we think and negate (if possible) thoughts around using the
> application passed buffers as is? One caveat here seems to be when using
> RDMA (we need the memory registered if I am not wrong), as that would
> involve a copy to RDMA buffers when using application passed buffers.


Actually RDMA is not a problem in the current implementation (ruling out
suggestions by others to use a pre-registered iobufs  for managing io-cache
etc). This is because, in current implementation the responsibility of
registering the memory region lies in transport/rdma. In other words
transport/rdma doesn't expect pre-registered buffers.


What are the other pitfalls?
>
> [1] http://www.gluster.org/pipermail/gluster-devel/2016-August/050622.html
>
> [2] http://review.gluster.org/#/c/14784/
>
>
>>
>> Regards,
>> Poornima
>>
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel
>



-- 
Raghavendra G
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] libgfapi zero copy write - application in samba, nfs-ganesha

2016-09-27 Thread Raghavendra G
+sachin.

On Tue, Sep 27, 2016 at 11:23 AM, Raghavendra Gowdappa <rgowd...@redhat.com>
wrote:

>
>
> - Original Message -
> > From: "Ric Wheeler" <rwhee...@redhat.com>
> > To: "Raghavendra Gowdappa" <rgowd...@redhat.com>, "Saravanakumar
> Arumugam" <sarum...@redhat.com>
> > Cc: "Gluster Devel" <gluster-devel@gluster.org>, "Ben Turner" <
> btur...@redhat.com>, "Ben England"
> > <bengl...@redhat.com>
> > Sent: Tuesday, September 27, 2016 10:51:48 AM
> > Subject: Re: [Gluster-devel] libgfapi zero copy write - application in
> samba, nfs-ganesha
> >
> > On 09/27/2016 07:56 AM, Raghavendra Gowdappa wrote:
> > > +Manoj, +Ben turner, +Ben England.
> > >
> > > @Perf-team,
> > >
> > > Do you think the gains are significant enough, so that smb and
> nfs-ganesha
> > > team can start thinking about consuming this change?
> > >
> > > regards,
> > > Raghavendra
> >
> > This is a large gain but I think that we might see even larger gains (a
> lot
> > depends on how we implement copy offload :)).
>
> Can you elaborate on what you mean "copy offload"? If it is the way we
> avoid a copy in gfapi (from application buffer), following is the workflow:
>
> 
>
> Work flow of zero copy write operation:
> --
>
> 1) Application requests a buffer of specific size. A new buffer is
> allocated from iobuf pool, and this buffer is passed on to application.
>Achieved using "glfs_get_buffer"
>
> 2) Application writes into the received buffer, and passes that to
> libgfapi, and libgfapi in turn passes the same buffer to underlying
> translators. This avoids a memcpy in glfs write
>Achieved using "glfs_zero_write"
>
> 3) Once the write operation is complete, the application must take the
> responsibility of freeing the buffer.
>Achieved using "glfs_free_buffer"
>
> 
>
> Do you have any suggestions/improvements on this? I think Shyam mentioned an
> alternative approach (for zero-copy readv, I think); let me look at that
> too.
>
> regards,
> Raghavendra
>
> >
> > Worth looking at how we can make use of it.
> >
> > thanks!
> >
> > Ric
> >
> > >
> > > - Original Message -
> > >> From: "Saravanakumar Arumugam" <sarum...@redhat.com>
> > >> To: "Gluster Devel" <gluster-devel@gluster.org>
> > >> Sent: Monday, September 26, 2016 7:18:26 PM
> > >> Subject: [Gluster-devel] libgfapi zero copy write - application in
> samba,
> > >>nfs-ganesha
> > >>
> > >> Hi,
> > >>
> > >> I have carried out "basic" performance measurement with zero copy
> write
> > >> APIs.
> > >> Throughput of zero copy write is 57 MB/sec vs default write 43 MB/sec.
> > >> ( I have modified Ben England's gfapi_perf_test.c for this. Attached
> the
> > >> same
> > >> for reference )
> > >>
> > >> We would like to hear how samba/ nfs-ganesha who are libgfapi users
> can
> > >> make
> > >> use of this.
> > >> Please provide your comments. Refer attached results.
> > >>
> > >> Zero copy in write patch: http://review.gluster.org/#/c/14784/
> > >>
> > >> Thanks,
> > >> Saravana
> >
> >
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel
>



-- 
Raghavendra G
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] relative ordering of writes to same file from two different fds

2016-09-23 Thread Raghavendra G
; ones. This wait can fill up write-behind buffer and can
>>> eventually result in a full write-behind cache and hence not able to
>>> "write-back" newer writes.
>>>
>>> * What does POSIX say about it?
>>> * How do other filesystems behave in this scenario?
>>>
>>>
>>> Also, the current write-behind implementation has the concept of
>>> "generation numbers". To quote from comment:
>>>
>>> 
>>>
>>>  uint64_t gen;/* Liability generation number. Represents
>>>  the current 'state' of liability. Every
>>>  new addition to the liability list bumps
>>>  the generation number.
>>>
>>>
>>>   a newly arrived
>>> request is only required
>>>  to perform causal checks against the
>>> entries
>>>  in the liability list which were present
>>>  at the time of its addition. the
>>> generation
>>>  number at the time of its addition is
>>> stored
>>>  in the request and used during checks.
>>>
>>>
>>>   the liability list
>>> can grow while the request
>>>  waits in the todo list waiting for its
>>>  dependent operations to complete.
>>> however
>>>  it is not of the request's concern to
>>> depend
>>>  itself on those new entries which
>>> arrived
>>>          after it arrived (i.e, those that have a
>>>  liability generation higher than itself)
>>>   */
>>> 
>>>
>>> So, if a single thread is doing writes on two different fds, generation
>>> numbers are sufficient to enforce the relative ordering. If writes are from
>>> two different threads/processes, I think write-behind is not obligated to
>>> maintain their order. Comments?
>>>
>>> [1] http://review.gluster.org/#/c/15380/
>>>
>>> regards,
>>> Raghavendra
>>> ___
>>> Gluster-devel mailing list
>>> Gluster-devel@gluster.org
>>> http://www.gluster.org/mailman/listinfo/gluster-devel
>>>
>>
>>
>> ___
>> Gluster-devel mailing list
>> Gluster-devel@gluster.org
>> http://www.gluster.org/mailman/listinfo/gluster-devel
>>
>
>
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel
>



-- 
Raghavendra G
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Query regards to heal xattr heal in dht

2016-09-15 Thread Raghavendra G
>>>>>> anyone does have a better solution to correct it.
>>>>>>
>>>>>>   Problem:
>>>>>>In a distributed volume environment custom extended attribute
>>>>>> value for a directory does not display correct value after stop/start the
>>>>>> brick. If any extended attribute value is set for a directory after stop
>>>>>> the brick the attribute value is not updated on brick after start the 
>>>>>> brick.
>>>>>>
>>>>>>   Current approach:
>>>>>> 1) function set_user_xattr to store user extended attribute in
>>>>>> dictionary
>>>>>> 2) function dht_dir_xattr_heal call syncop_setxattr to update the
>>>>>> attribute on all volume
>>>>>> 3) Call the function (dht_dir_xattr_heal) for every directory
>>>>>> lookup in dht_lookup_revalidate_cbk
>>>>>>
>>>>>>   Psuedocode for function dht_dir_xatt_heal is like below
>>>>>>
>>>>>>1) First it will fetch atttributes from first up volume and store
>>>>>> into xattr.
>>>>>>2) Run loop on all subvolume and fetch existing attributes from
>>>>>> every volume
>>>>>>3) Replace user attributes from current attributes with xattr user
>>>>>> attributes
>>>>>>4) Set latest extended attributes(current + old user attributes)
>>>>>> inot subvol.
>>>>>>
>>>>>>
>>>>>>In this current approach problem is
>>>>>>
>>>>>>1) it will call heal function(dht_dir_xattr_heal) for every
>>>>>> directory lookup without comparing xattr.
>>>>>> 2) The function internally call syncop xattr for every subvolume
>>>>>> that would be a expensive operation.
>>>>>>
>>>>>>I have one another way like below to correct it but again in this
>>>>>> one it does have dependency on time (not sure time is synch on all bricks
>>>>>> or not)
>>>>>>
>>>>>>1) At the time of set extended attribute(setxattr) change time in
>>>>>> metadata at server side
>>>>>>2) Compare change time before call healing function in
>>>>>> dht_revalidate_cbk
>>>>>>
>>>>>> Please share your input on this.
>>>>>> Appreciate your input.
>>>>>>
>>>>>> Regards
>>>>>> Mohit Agrawal
>>>>>>
>>>>>> ___
>>>>>> Gluster-devel mailing list
>>>>>> Gluster-devel@gluster.org
>>>>>> http://www.gluster.org/mailman/listinfo/gluster-devel
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Pranith
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Pranith
>>>
>>
>>
>> ___
>> Gluster-devel mailing list
>> Gluster-devel@gluster.org
>> http://www.gluster.org/mailman/listinfo/gluster-devel
>>
>
>
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel
>



-- 
Raghavendra G
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] [Gluster-users] CFP for Gluster Developer Summit

2016-08-29 Thread Raghavendra G
Though it's a bit late, here is one from me:

Topic: "DHT: current design, (dis)advantages, challenges - A perspective"

Agenda:

I'll try to address
* the whys and (dis)advantages of the current design. As noted in the title,
this is my own perspective, gathered while working on DHT. We don't have any
existing documentation for the motivations; the sources have been bugs (a huge
number of them :)), interactions with other people working on DHT, and code
reading.
* current work going on and a rough roadmap of what we'll be working on
during at least the next few months.
* Going by the objectives of this talk, this might well turn out to be a
discussion.

regards,

On Wed, Aug 24, 2016 at 8:27 PM, Arthy Loganathan <aloga...@redhat.com>
wrote:

>
>
> On 08/24/2016 07:18 PM, Atin Mukherjee wrote:
>
>
>
> On Wed, Aug 24, 2016 at 5:43 PM, Arthy Loganathan < <aloga...@redhat.com>
> aloga...@redhat.com> wrote:
>
>> Hi,
>>
>> I would like to propose below topic as a lightening talk.
>>
>> Title: Data Logging to monitor Gluster Performance
>>
>> Theme: Process and Infrastructure
>>
>> To benchmark any software product, we often need to do performance
>> analysis of the system along with the product. I have written a tool
>> "System Monitor" to collect required data like CPU, memory usage and load
>> average periodically (with graphical representation) of any process on a
>> system. This data collected can help in analyzing the system & product
>> performance.
>>
>
> A link to this project would definitely help here.
>
>
> Hi Atin,
>
> Here is the link to the project - https://github.com/aloganat/
> system_monitor
>
> Thanks & Regards,
> Arthy
>
>
>
>>
>> From this talk I would like to give an overview of this tool and explain
>> how it can be used to monitor Gluster performance.
>>
>> Agenda:
>>   - Overview of the tool and its usage
>>   - Collecting the data in an excel sheet at regular intervals of time
>>   - Plotting the graph with that data (in progress)
>>   - a short demo
>>
>> Thanks & Regards,
>>
>> Arthy
>>
>>
>>
>>
>> ___
>> Gluster-users mailing list
>> gluster-us...@gluster.org
>> http://www.gluster.org/mailman/listinfo/gluster-users
>>
>
>
>
> --
>
> --Atin
>
>
>
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel
>



-- 
Raghavendra G
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] md-cache improvements

2016-08-17 Thread Raghavendra G
On Fri, Aug 12, 2016 at 10:29 AM, Raghavendra G <raghaven...@gluster.com>
wrote:

>
>
> On Thu, Aug 11, 2016 at 9:31 AM, Raghavendra G <raghaven...@gluster.com>
> wrote:
>
>> Couple of more areas to explore:
>> 1. purging kernel dentry and/or page-cache too. Because of patch [1],
>> upcall notification can result in a call to inode_invalidate, which results
>> in an "invalidate" notification to fuse kernel module. While I am sure
>> that, this notification will purge page-cache from kernel, I am not sure
>> about dentries. I assume if an inode is invalidated, it should result in a
>> lookup (from kernel to glusterfs). Nevertheless, we should look into
>> differences between entry_invalidation and inode_invalidation and harness
>> them appropriately.
>>
>> 2. Granularity of invalidation. For eg., We shouldn't be purging
>> page-cache in kernel, because of a change in xattr used by an xlator (eg.,
>> dht layout xattr). We have to make sure that [1] is handling this. We need
>> to add more granularity into invalidation (like internal xattr invalidation,
>> user xattr invalidation, entry invalidation in kernel, page-cache
>> invalidation in kernel, attribute/stat invalidation in kernel etc) and use
>> them judiciously, while making sure other cached data remains to be present.
>>
>
> To stress the importance of this point, it should be noted that with tier
> there can be constant migration of files, which can result in spurious
> (from the application's perspective) invalidations, even though the
> application is not doing any writes on the files [2][3][4]. Also, even if
> the application is writing to a file, there is no point in invalidating the
> dentry cache. We should explore more ways to solve [2][3][4].
>
> 3. We've a long-standing issue of spurious termination of the fuse
> invalidation thread. Since, after termination, the thread is not re-spawned,
> we would not be able to purge kernel entry/attribute/page-cache. This issue
> was touched upon during a discussion [5], though we didn't solve the
> problem then for lack of bandwidth. Csaba has agreed to work on this issue.
>

4. Flooding of the network with upcall notifications. Is it a problem? If yes,
does the upcall infra already solve it? Would NFS/SMB leases help here?


> [2] https://bugzilla.redhat.com/show_bug.cgi?id=1293967#c7
> [3] https://bugzilla.redhat.com/show_bug.cgi?id=1293967#c8
> [4] https://bugzilla.redhat.com/show_bug.cgi?id=1293967#c9
> [5] http://review.gluster.org/#/c/13274/1/xlators/mount/
> fuse/src/fuse-bridge.c
>
>
>>
>> [1] http://review.gluster.org/12951
>>
>>
>> On Wed, Aug 10, 2016 at 10:35 PM, Dan Lambright <dlamb...@redhat.com>
>> wrote:
>>
>>>
>>> There have been recurring discussions within the gluster community to
>>> build on existing support for md-cache and upcalls to help performance for
>>> small file workloads. In certain cases, "lookup amplification" dominates
>>> data transfers, i.e. the cumulative round trip times of multiple LOOKUPs
>>> from the client mitigates benefits from faster backend storage.
>>>
>>> To tackle this problem, one suggestion is to more aggressively utilize
>>> md-cache to cache inodes on the client than is currently done. The inodes
>>> would be cached until they are invalidated by the server.
>>>
>>> Several gluster development engineers within the DHT, NFS, and Samba
>>> teams have been involved with related efforts, which have been underway for
>>> some time now. At this juncture, comments are requested from gluster
>>> developers.
>>>
>>> (1) .. help call out where additional upcalls would be needed to
>>> invalidate stale client cache entries (in particular, need feedback from
>>> DHT/AFR areas),
>>>
>>> (2) .. identify failure cases, when we cannot trust the contents of
>>> md-cache, e.g. when an upcall may have been dropped by the network
>>>
>>> (3) .. point out additional improvements which md-cache needs. For
>>> example, it cannot be allowed to grow unbounded.
>>>
>>> Dan
>>>
>>> - Original Message -
>>> > From: "Raghavendra Gowdappa" <rgowd...@redhat.com>
>>> >
>>> > List of areas where we need invalidation notification:
>>> > 1. Any changes to xattrs used by xlators to store metadata (like dht
>>> layout
>>> > xattr, afr xattrs etc).
>>> > 2. Scenarios where individual xlator feels like it needs a lookup. For
>>> > example failed directory creation on non-hashed su

Re: [Gluster-devel] md-cache improvements

2016-08-10 Thread Raghavendra G
Couple of more areas to explore:
1. purging kernel dentry and/or page-cache too. Because of patch [1],
upcall notification can result in a call to inode_invalidate, which results
in an "invalidate" notification to fuse kernel module. While I am sure
that this notification will purge the page-cache from the kernel, I am not sure
about dentries. I assume that if an inode is invalidated, it should result in a
lookup (from kernel to glusterfs). But nevertheless, we should look into the
differences between entry_invalidation and inode_invalidation and harness
them appropriately (a sketch of the two kernel notifications follows below).

2. Granularity of invalidation. For example, we shouldn't be purging the
page-cache in the kernel because of a change in an xattr used by an xlator
(e.g., the dht layout xattr). We have to make sure that [1] is handling this.
We need to add more granularity into invalidation (like internal xattr
invalidation, user xattr invalidation, entry invalidation in the kernel,
page-cache invalidation in the kernel, attribute/stat invalidation in the
kernel, etc.) and use them judiciously, while making sure other cached data
remains present.

[1] http://review.gluster.org/12951
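
To make the difference in point 1 concrete: glusterfs's fuse-bridge writes
these notifications to /dev/fuse itself, but the libfuse lowlevel API (used
below only for illustration; glusterfs does not link against libfuse) exposes
the same two kernel notifications. As I understand it, the inode notification
drops cached attributes and data pages, while cached dentries are dropped only
by the entry notification.

#define FUSE_USE_VERSION 26
#include <fuse_lowlevel.h>
#include <string.h>

/* Drop cached attributes and data pages of an inode; a later access
 * triggers a fresh GETATTR/READ, but dentries pointing to the inode
 * survive. off=0, len=0 means "the whole range". */
static int
invalidate_inode (struct fuse_chan *ch, fuse_ino_t ino)
{
        return fuse_lowlevel_notify_inval_inode (ch, ino, 0, 0);
}

/* Drop a single cached dentry (parent inode + name); the next path
 * resolution of that name comes back to the filesystem as a LOOKUP. */
static int
invalidate_entry (struct fuse_chan *ch, fuse_ino_t parent, const char *name)
{
        return fuse_lowlevel_notify_inval_entry (ch, parent, name,
                                                 strlen (name));
}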

On Wed, Aug 10, 2016 at 10:35 PM, Dan Lambright <dlamb...@redhat.com> wrote:

>
> There have been recurring discussions within the gluster community to
> build on existing support for md-cache and upcalls to help performance for
> small file workloads. In certain cases, "lookup amplification" dominates
> data transfers, i.e. the cumulative round trip times of multiple LOOKUPs
> from the client mitigates benefits from faster backend storage.
>
> To tackle this problem, one suggestion is to more aggressively utilize
> md-cache to cache inodes on the client than is currently done. The inodes
> would be cached until they are invalidated by the server.
>
> Several gluster development engineers within the DHT, NFS, and Samba teams
> have been involved with related efforts, which have been underway for some
> time now. At this juncture, comments are requested from gluster developers.
>
> (1) .. help call out where additional upcalls would be needed to
> invalidate stale client cache entries (in particular, need feedback from
> DHT/AFR areas),
>
> (2) .. identify failure cases, when we cannot trust the contents of
> md-cache, e.g. when an upcall may have been dropped by the network
>
> (3) .. point out additional improvements which md-cache needs. For
> example, it cannot be allowed to grow unbounded.
>
> Dan
>
> - Original Message -
> > From: "Raghavendra Gowdappa" <rgowd...@redhat.com>
> >
> > List of areas where we need invalidation notification:
> > 1. Any changes to xattrs used by xlators to store metadata (like dht
> layout
> > xattr, afr xattrs etc).
> > 2. Scenarios where individual xlator feels like it needs a lookup. For
> > example failed directory creation on non-hashed subvol in dht during
> mkdir.
> > Though dht succeeds mkdir, it would be better to not cache this inode as
> a
> > subsequent lookup will heal the directory and make things better.
> > 3. removing of files
> > 4. writev on brick (to invalidate read cache on client)
> >
> > Other questions:
> > 5. Does md-cache has cache management? like lru or an upper limit for
> cache.
> > 6. Network disconnects and invalidating cache. When a network disconnect
> > happens we need to invalidate cache for inodes present on that brick as
> we
> > might be missing some notifications. Current approach of purging cache of
> > all inodes might not be optimal as it might rollback benefits of caching.
> > Also, please note that network disconnects are not rare events.
> >
> > regards,
> > Raghavendra
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel
>



-- 
Raghavendra G
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] [Gluster-users] Need a way to display and flush gluster cache ?

2016-07-27 Thread Raghavendra G
On Wed, Jul 27, 2016 at 10:29 AM, Mohammed Rafi K C <rkavu...@redhat.com>
wrote:

> Thanks for your feedback.
>
> In fact meta xlator is loaded only on fuse mount, is there any particular
> reason to not to use meta-autoload xltor for nfs server and libgfapi ?
>

I think its because of lack of resources. I am not aware of any technical
reason for not using on NFSv3 server and gfapi.


> Regards
>
> Rafi KC
> On 07/26/2016 04:05 PM, Niels de Vos wrote:
>
> On Tue, Jul 26, 2016 at 12:43:56PM +0530, Kaushal M wrote:
>
> On Tue, Jul 26, 2016 at 12:28 PM, Prashanth Pai <p...@redhat.com> wrote:
>
> +1 to option (2) which similar to echoing into /proc/sys/vm/drop_caches
>
>  -Prashanth Pai
>
> - Original Message -
>
> From: "Mohammed Rafi K C" <rkavu...@redhat.com>
> To: "gluster-users" <gluster-us...@gluster.org>, "Gluster Devel" <gluster-devel@gluster.org>
> Sent: Tuesday, 26 July, 2016 10:44:15 AM
> Subject: [Gluster-devel] Need a way to display and flush gluster cache ?
>
> Hi,
>
> The Gluster stack has its own caching mechanism, mostly on the client side.
> But there is no concrete method to see how much memory is being consumed by
> gluster for caching and, if needed, there is no way to flush the cache memory.
>
> So my first question is, Do we require to implement this two features
> for gluster cache?
>
>
> If so I would like to discuss some of our thoughts towards it.
>
> (If you are not interested in implementation discussion, you can skip
> this part :)
>
> 1) Implement a virtual xattr on root, and on doing setxattr, flush all
> the cache, and for getxattr we can print the aggregated cache size.
>
> 2) Currently in gluster native client support .meta virtual directory to
> get meta data information as analogues to proc. we can implement a
> virtual file inside the .meta directory to read  the cache size. Also we
> can flush the cache using a special write into the file, (similar to
> echoing into proc file) . This approach may be difficult to implement in
> other clients.
>
> +1 for making use of the meta-xlator. We should be making more use of it.
>
> Indeed, this would be nice. Maybe this can also expose the memory
> allocations like /proc/slabinfo.
>
> The io-stats xlator can dump some statistics to
> /var/log/glusterfs/samples/ and /var/lib/glusterd/stats/ . That seems to
> be acceptible too, and allows to get statistics from server-side
> processes without involving any clients.
>
> HTH,
> Niels
>
>
>
> 3) A cli command to display and flush the data with ip and port as an
> argument. GlusterD need to send the op to client from the connected
> client list. But this approach would be difficult to implement for
> libgfapi based clients. For me, it doesn't seems to be a good option.
>
> Your suggestions and comments are most welcome.
>
> Thanks to Talur and Poornima for their suggestions.
>
> Regards
>
> Rafi KC
>
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel
>



-- 
Raghavendra G
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] One client can effectively hang entire gluster array

2016-07-27 Thread Raghavendra G
I've filed a bug on the issue at:
https://bugzilla.redhat.com/show_bug.cgi?id=1360689

On Fri, Jul 15, 2016 at 12:44 PM, Raghavendra G <raghaven...@gluster.com>
wrote:

> Hi Patrick,
>
> Is it possible to test out whether the patch fixes your issue? There is
> nothing like validation from user experiencing the problem first hand.
>
> regards,
> Raghavendra
>
> On Tue, Jul 12, 2016 at 10:40 PM, Jeff Darcy <jda...@redhat.com> wrote:
>
>> > Thanks for responding so quickly. I'm not familiar with the codebase,
>> so if
>> > you don't mind me asking, how much would that list reordering slow
>> things
>> > down for, say, a queue of 1500 client machines? i.e. round-about how
>> long of
>> > a client list would significantly affect latency?
>> >
>> > I only ask because we have quite a few clients and you explicitly call
>> out
>> > that the queue reordering method used may have problems for lots of
>> clients.
>>
>> It's actually less about the number of clients than about the I/O queue
>> depth.  That's typically a pretty small number, generally proportional to
>> the number of storage devices and inversely proportional to their speed.
>> So for very large numbers of very slow devices there *might* be a problem
>> with the list traversals being slow, but otherwise probably not.
>> _______________
>> Gluster-devel mailing list
>> Gluster-devel@gluster.org
>> http://www.gluster.org/mailman/listinfo/gluster-devel
>>
>
>
>
> --
> Raghavendra G
>



-- 
Raghavendra G
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] What transport type in glfs_set_volfile_server() exactly mean?

2016-07-25 Thread Raghavendra G
On Mon, Jul 18, 2016 at 4:35 PM, Raghavendra Talur <rta...@redhat.com>
wrote:

>
>
> On Mon, Jul 18, 2016 at 4:10 PM, Prasanna Kalever <pkale...@redhat.com>
> wrote:
>
>> Hey Team,
>>
>>
>> My understanding is that @transport argument in
>> glfs_set_volfile_server() is meant for specifying transport used in
>> fetching volfile server,
>>
>
> Yes, @transport arg here is transport to use for fetching volfile.
>
>
>> IIRC which currently supports tcp and unix only...
>>
> Yes, this is correct too.
>
>
>>
>> The doc here
>> https://github.com/gluster/glusterfs/blob/master/api/src/glfs.h
>> +166 shows the rdma as well, which is something I cannot digest.
>>
> This is doc written with assumption that rdma would work too.
>
>
>>
>>
>> Can someone correct me ?
>>
>> Have we ever supported volfile fetch over rdma ?
>>
>
> I think no. To test, you would have to set rdma as only transport option
> in glusterd.vol and see what happens in volfile fetch.
>
>
> IMO, fetching volfile over rdma is an overkill and would not be required.
> RDMA should be kept only for IO operations.
>

+1. Actually there is no point in using infiniband to fetch the volfile, as:
* RDMA/infiniband is useful for large data transfers.
* Even when using RDMA/infiniband, we use a tcp/ip socket for the initial
handshake (specifically an IPoIB stack) via the RDMA connection manager
(librdmacm). So, a tcp/ip based address or pathway is necessary even while
using RDMA.

So, it's overkill to use RDMA for fetching volfiles.
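
For reference, a minimal gfapi sketch (volume name, hostname and the include
path are just examples; error handling mostly omitted) showing that the
@transport argument only controls how the volfile is fetched:

#include <glusterfs/api/glfs.h>

int
main (void)
{
        glfs_t *fs = glfs_new ("testvol");

        /* "tcp" (or "unix") selects the transport for the volfile fetch
         * only; the transport used for I/O comes from the fetched volfile. */
        glfs_set_volfile_server (fs, "tcp", "server1.example.com", 24007);

        if (glfs_init (fs) != 0)
                return 1;

        /* ... glfs_open()/glfs_read()/... as usual ... */

        glfs_fini (fs);
        return 0;
}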


>
> We should just remove it from the docs.
>
> Thanks,
> Raghavendra Talur
>
>
>>
>> Thanks,
>> --
>> Prasanna
>>
>
>
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel
>



-- 
Raghavendra G
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] One client can effectively hang entire gluster array

2016-07-15 Thread Raghavendra G
Hi Patrick,

Is it possible to test out whether the patch fixes your issue? There is
nothing like validation from user experiencing the problem first hand.

regards,
Raghavendra

On Tue, Jul 12, 2016 at 10:40 PM, Jeff Darcy <jda...@redhat.com> wrote:

> > Thanks for responding so quickly. I'm not familiar with the codebase, so
> if
> > you don't mind me asking, how much would that list reordering slow things
> > down for, say, a queue of 1500 client machines? i.e. round-about how
> long of
> > a client list would significantly affect latency?
> >
> > I only ask because we have quite a few clients and you explicitly call
> out
> > that the queue reordering method used may have problems for lots of
> clients.
>
> It's actually less about the number of clients than about the I/O queue
> depth.  That's typically a pretty small number, generally proportional to
> the number of storage devices and inversely proportional to their speed.
> So for very large numbers of very slow devices there *might* be a problem
> with the list traversals being slow, but otherwise probably not.
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel
>



-- 
Raghavendra G
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Reducing merge conflicts

2016-07-14 Thread Raghavendra G
On Fri, Jul 15, 2016 at 1:09 AM, Jeff Darcy <jda...@redhat.com> wrote:

> > The feedback I got is, "it is not motivating to review patches that are
> > already merged by maintainer."
>
> I can totally understand that.  I've been pretty active reviewing lately,
> and it's an *awful* demotivating grind.  On the other hand, it's also
> pretty demotivating to see one's own hard work "rot" as the lack of
> reviews forces rebase after rebase.  Haven't we all seen that?  I'm
> sure the magnitude of that effect varies across teams and across parts
> of the code, but I'm equally sure that it affects all of us to some
> degree.
>
>
> > Do you suggest they should change that
> > behaviour in that case?
>
> Maybe.  The fact is that all of our maintainers have plenty of other
> responsibilities, and not all of them prioritize the same way.  I know I
> wouldn't be reviewing so many patches myself otherwise.  If reviews are
> being missed under the current rules, maybe we do need new rules.
>
> > let us give equal recognition for:
> > patches sent
> > patches reviewed - this one is missing.
> > helping users on gluster-users
> > helping users on #gluster/#gluster-dev
> >
> > Feel free to add anything more I might have missed out. May be new
> > ideas/design/big-refactor?
>
> Also doc, infrastructure work, blog/meetup/conference outreach, etc.
>

Bug triage (more generally collecting information from field on how things
are working, what we are lacking)
Testing
Performance measurements
Futuristic thinking (new features, rewrites etc)
Channelizing our collective brain-power and prioritization of things to
work on (roadmap etc)
Handling competition (Defence/Constructive arguments etc).

I know all of these are not easy to measure (and may be out of scope of this
discussion). But it doesn't hurt to list the things needed to make a
project/product successful.


> > let people do what they like more among these and let us also recognize
> them
> > for all their contributions. Let us celebrate their work in each monthly
> > news letter.
>
> Good idea.
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel
>



-- 
Raghavendra G
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] One client can effectively hang entire gluster array

2016-07-11 Thread Raghavendra G
On Fri, Jul 8, 2016 at 8:02 PM, Jeff Darcy <jda...@redhat.com> wrote:

> > In either of these situations, one glusterfsd process on whatever peer
> the
> > client is currently talking to will skyrocket to *nproc* cpu usage (800%,
> > 1600%) and the storage cluster is essentially useless; all other clients
> > will eventually try to read or write data to the overloaded peer and,
> when
> > that happens, their connection will hang. Heals between peers hang
> because
> > the load on the peer is around 1.5x the number of cores or more. This
> occurs
> > in either gluster 3.6 or 3.7, is very repeatable, and happens much too
> > frequently.
>
> I have some good news and some bad news.
>
> The good news is that features to address this are already planned for the
> 4.0 release.  Primarily I'm referring to QoS enhancements, some parts of
> which were already implemented for the bitrot daemon.  I'm still working
> out the exact requirements for this as a general facility, though.  You
> can help!  :)  Also, some of the work on "brick multiplexing" (multiple
> bricks within one glusterfsd process) should help to prevent the thrashing
> that causes a complete freeze-up.
>
> Now for the bad news.  Did I mention that these are 4.0 features?  4.0 is
> not near term, and not getting any nearer as other features and releases
> keep "jumping the queue" to absorb all of the resources we need for 4.0
> to happen.  Not that I'm bitter or anything.  ;)  To address your more
> immediate concerns, I think we need to consider more modest changes that
> can be completed in more modest time.  For example:
>
>  * The load should *never* get to 1.5x the number of cores.  Perhaps we
>could tweak the thread-scaling code in io-threads and epoll to check
>system load and not scale up (or even scale down) if system load is
>already high.
>
>  * We might be able to tweak io-threads (which already runs on the
>bricks and already has a global queue) to schedule requests in a
>fairer way across clients.  Right now it executes them in the
>same order that they were read from the network.


This sounds like an easier fix. We can make io-threads factor in another
input, i.e., the client through which the request came in (essentially
frame->root->client), before scheduling. That should make the problem at
least bearable rather than crippling. As to which algorithm to use, I think
we can consider the leaky bucket of the bit-rot implementation, or dmclock.
I've not really thought deeper about the algorithm part. If the approach
sounds ok, we can discuss the algorithms in more detail (a rough sketch of
per-client scheduling follows below).
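
A rough standalone sketch of the idea (types invented for the example; the
real io-threads change would additionally need locking, priorities and
removal of idle clients), just to show scheduling keyed by the client a
request came from:

#include <stddef.h>

struct req {
        struct req *next;            /* next request from the same client */
};

struct client_q {
        struct req      *head;       /* pending requests of this client */
        struct req      *tail;
        struct client_q *next;       /* circular ring of clients with work */
};

/* Pick one request from the client whose turn it is, then move on to the
 * next client, so a single busy client cannot starve everyone else. */
static struct req *
pick_next (struct client_q **turn)
{
        struct client_q *c = *turn;
        struct req      *r = NULL;

        if (c == NULL)
                return NULL;

        r = c->head;
        if (r != NULL) {
                c->head = r->next;
                if (c->head == NULL)
                        c->tail = NULL;
        }

        *turn = c->next;             /* round-robin advance */
        return r;
}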

> That tends to
>be a bit "unfair" and that should be fixed in the network code,
>but that's a much harder task.
>
> These are only weak approximations of what we really should be doing,
> and will be doing in the long term, but (without making any promises)
> they might be sufficient and achievable in the near term.  Thoughts?
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel
>



-- 
Raghavendra G
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] dht mkdir preop check, afr and (non-)readable afr subvols

2016-06-06 Thread Raghavendra G
On Wed, Jun 1, 2016 at 12:50 PM, Xavier Hernandez <xhernan...@datalab.es>
wrote:

> Hi,
>
> On 01/06/16 08:53, Raghavendra Gowdappa wrote:
>
>>
>>
>> - Original Message -
>>
>>> From: "Xavier Hernandez" <xhernan...@datalab.es>
>>> To: "Pranith Kumar Karampuri" <pkara...@redhat.com>, "Raghavendra G" <
>>> raghaven...@gluster.com>
>>> Cc: "Gluster Devel" <gluster-devel@gluster.org>
>>> Sent: Wednesday, June 1, 2016 11:57:12 AM
>>> Subject: Re: [Gluster-devel] dht mkdir preop check, afr and
>>> (non-)readable afr subvols
>>>
>>> Oops, you are right. For entry operations the current version of the
>>> parent directory is not checked, just to avoid this problem.
>>>
>>> This means that mkdir will be sent to all alive subvolumes. However it
>>> still selects the group of answers that have a minimum quorum equal or
>>> greater than #bricks - redundancy. So it should be still valid.
>>>
>>
>> What if the quorum is met on "bad" subvolumes? and mkdir was successful
>> on bad subvolumes? Do we consider mkdir as successful? If yes, even EC
>> suffers from the problem described in bz
>> https://bugzilla.redhat.com/show_bug.cgi?id=1341429.
>>
>
> I don't understand the real problem. How a subvolume of EC could be in bad
> state from the point of view of DHT ?
>
> If you use xattrs to configure something in the parent directories, you
> should have needed to use setxattr or xattrop to do that. These operations
> do consider good/bad bricks because they touch inode metadata. This will
> only succeed if enough (quorum) bricks have successfully processed it. If
> quorum is met but for an error answer, an error will be reported to DHT and
> the majority of bricks will be left in the old state (these should be
> considered the good subvolumes). If some brick has succeeded, it will be
> considered bad and will be healed. If no quorum is met (even for an error
> answer), EIO will be returned and the state of the directory should be
> considered unknown/damaged.
>

Yes. Ideally, dht should use a getxattr for the layout xattr. But for
performance reasons we thought of overloading mkdir by introducing
pre-operations (done by the bricks). With plain dht it is a simple comparison
of the xattrs passed as arguments against the xattrs stored on disk (a rough
sketch follows below). But I failed to include afr and EC in the picture,
hence this issue. How difficult would it be for EC and AFR to bring in this
kind of check? Is it even possible for afr and EC to implement this kind of
pre-op check with reasonable complexity?
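
A rough standalone sketch of the comparison the brick would do (this is not
the actual patch; the function name, buffer handling and error codes are
illustrative):

#include <string.h>
#include <errno.h>
#include <sys/xattr.h>

/* The client sends the parent-layout xattr it believes is current; the
 * brick compares it with what is on disk and fails the mkdir if the
 * layout has changed in the meantime. */
static int
preop_layout_check (const char *parent_path,
                    const void *client_layout, size_t client_len)
{
        char    on_disk[4096];
        ssize_t len;

        /* "trusted.glusterfs.dht" is the dht layout xattr */
        len = lgetxattr (parent_path, "trusted.glusterfs.dht",
                         on_disk, sizeof (on_disk));
        if (len < 0)
                return -errno;

        if ((size_t) len != client_len ||
            memcmp (on_disk, client_layout, client_len) != 0)
                return -ESTALE;   /* layout changed; client must re-fetch it */

        return 0;                 /* layouts match; proceed with the mkdir */
}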


> If a later mkdir checks this value in storage/posix and succeeds in enough
> bricks, it necessarily means that is has succeeded in good bricks, because
> there cannot be enough bricks with the bad xattr value.
>
> Note that quorum is always > #bricks/2 so we cannot have a quorum with
> good and bad bricks at the same time.
>
> Xavi
>
>
>
>>
>>> Xavi
>>>
>>> On 01/06/16 06:51, Pranith Kumar Karampuri wrote:
>>>
>>>> Xavi,
>>>> But if we keep winding only to good subvolumes, there is a case
>>>> where bad subvolumes will never catch up right? i.e. if we keep creating
>>>> files in same directory and everytime self-heal completes there are more
>>>> entries mounts would have created on the good subvolumes alone. I think
>>>> I must have missed this in the reviews if this is the current behavior.
>>>> It was not in the earlier releases. Right?
>>>>
>>>> Pranith
>>>>
>>>> On Tue, May 31, 2016 at 2:17 PM, Raghavendra G <raghaven...@gluster.com
>>>> <mailto:raghaven...@gluster.com>> wrote:
>>>>
>>>>
>>>>
>>>> On Tue, May 31, 2016 at 12:37 PM, Xavier Hernandez
>>>> <xhernan...@datalab.es <mailto:xhernan...@datalab.es>> wrote:
>>>>
>>>> Hi,
>>>>
>>>> On 31/05/16 07:05, Raghavendra Gowdappa wrote:
>>>>
>>>> +gluster-devel, +Xavi
>>>>
>>>> Hi all,
>>>>
>>>> The context is [1], where bricks do pre-operation checks
>>>> before doing a fop and proceed with fop only if pre-op check
>>>> is successful.
>>>>
>>>> @Xavi,
>>>>
>>>> We need your inputs on behavior of EC subvolumes as well.
>>>>
>>>>
>>>> If 

Re: [Gluster-devel] dht mkdir preop check, afr and (non-)readable afr subvols

2016-05-31 Thread Raghavendra G
I've filed a bug at [1] to track the issue in afr.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1341429

On Tue, May 31, 2016 at 2:17 PM, Raghavendra G <raghaven...@gluster.com>
wrote:

>
>
> On Tue, May 31, 2016 at 12:37 PM, Xavier Hernandez <xhernan...@datalab.es>
> wrote:
>
>> Hi,
>>
>> On 31/05/16 07:05, Raghavendra Gowdappa wrote:
>>
>>> +gluster-devel, +Xavi
>>>
>>> Hi all,
>>>
>>> The context is [1], where bricks do pre-operation checks before doing a
>>> fop and proceed with fop only if pre-op check is successful.
>>>
>>> @Xavi,
>>>
>>> We need your inputs on behavior of EC subvolumes as well.
>>>
>>
>> If I understand correctly, EC shouldn't have any problems here.
>>
>> EC sends the mkdir request to all subvolumes that are currently
>> considered "good" and tries to combine the answers. Answers that match in
>> return code, errno (if necessary) and xdata contents (except for some
>> special xattrs that are ignored for combination purposes), are grouped.
>>
>> Then it takes the group with more members/answers. If that group has a
>> minimum size of #bricks - redundancy, it is considered the good answer.
>> Otherwise EIO is returned because bricks are in an inconsistent state.
>>
>> If there's any answer in another group, it's considered bad and gets
>> marked so that self-heal will repair it using the good information from the
>> majority of bricks.
>>
>> xdata is combined and returned even if return code is -1.
>>
>> Is that enough to cover the needed behavior ?
>>
>
> Thanks Xavi. That's sufficient for the feature in question. One of the
> main cases I was interested in was what would be the behaviour if mkdir
> succeeds on "bad" subvolume and fails on "good" subvolume. Since you never
> wind mkdir to "bad" subvolume(s), this situation never arises.
>
>
>
>>
>> Xavi
>>
>>
>>
>>> [1] http://review.gluster.org/13885
>>>
>>> regards,
>>> Raghavendra
>>>
>>> - Original Message -
>>>
>>>> From: "Pranith Kumar Karampuri" <pkara...@redhat.com>
>>>> To: "Raghavendra Gowdappa" <rgowd...@redhat.com>
>>>> Cc: "team-quine-afr" <team-quine-...@redhat.com>, "rhs-zteam" <
>>>> rhs-zt...@redhat.com>
>>>> Sent: Tuesday, May 31, 2016 10:22:49 AM
>>>> Subject: Re: dht mkdir preop check, afr and (non-)readable afr subvols
>>>>
>>>> I think you should start a discussion on gluster-devel so that Xavi
>>>> gets a
>>>> chance to respond on the mails as well.
>>>>
>>>> On Tue, May 31, 2016 at 10:21 AM, Raghavendra Gowdappa <
>>>> rgowd...@redhat.com>
>>>> wrote:
>>>>
>>>> Also note that we've plans to extend this pre-op check to all dentry
>>>>> operations which also depend parent layout. So, the discussion need to
>>>>> cover all dentry operations like:
>>>>>
>>>>> 1. create
>>>>> 2. mkdir
>>>>> 3. rmdir
>>>>> 4. mknod
>>>>> 5. symlink
>>>>> 6. unlink
>>>>> 7. rename
>>>>>
>>>>> We also plan to have similar checks in lock codepath for directories
>>>>> too
>>>>> (planning to use hashed-subvolume as lock-subvolume for directories).
>>>>> So,
>>>>> more fops :)
>>>>> 8. lk (posix locks)
>>>>> 9. inodelk
>>>>> 10. entrylk
>>>>>
>>>>> regards,
>>>>> Raghavendra
>>>>>
>>>>> - Original Message -
>>>>>
>>>>>> From: "Raghavendra Gowdappa" <rgowd...@redhat.com>
>>>>>> To: "team-quine-afr" <team-quine-...@redhat.com>
>>>>>> Cc: "rhs-zteam" <rhs-zt...@redhat.com>
>>>>>> Sent: Tuesday, May 31, 2016 10:15:04 AM
>>>>>> Subject: dht mkdir preop check, afr and (non-)readable afr subvols
>>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I have some queries related to the behavior of afr_mkdir with respect
>>>>>> to
>>>>>> readable subvols.
>>>>>>
>>>>>> 1. While winding mkdir to subvols does afr check whether th

Re: [Gluster-devel] [Gluster-users] Fwd: dht_is_subvol_filled messages on client

2016-05-03 Thread Raghavendra G
On Mon, May 2, 2016 at 11:41 AM, Serkan Çoban <cobanser...@gmail.com> wrote:

> >1. What is the output of du -hs ? Please get this
> information for each of the brick that are part of disperse.
>

Sorry, I needed the df output of the filesystem containing the brick, not du.
Sorry about that.


> There are 20 bricks in disperse-56 and the du -hs output is like:
> 80K /bricks/20
> 80K /bricks/20
> 80K /bricks/20
> 80K /bricks/20
> 80K /bricks/20
> 80K /bricks/20
> 80K /bricks/20
> 80K /bricks/20
> 1.8M /bricks/20
> 80K /bricks/20
> 80K /bricks/20
> 80K /bricks/20
> 80K /bricks/20
> 80K /bricks/20
> 80K /bricks/20
> 80K /bricks/20
> 80K /bricks/20
> 80K /bricks/20
> 80K /bricks/20
> 80K /bricks/20
>
> I see that gluster is not writing to this disperse set. All other
> disperse sets are filled 13GB but this one is empty. I see directory
> structure created but no files in directories.
> How can I fix the issue? I will try to rebalance but I don't think it
> will write to this disperse set...
>
>
>
> On Sat, Apr 30, 2016 at 9:22 AM, Raghavendra G <raghaven...@gluster.com>
> wrote:
> >
> >
> > On Fri, Apr 29, 2016 at 12:32 AM, Serkan Çoban <cobanser...@gmail.com>
> > wrote:
> >>
> >> Hi, I cannot get an answer from user list, so asking to devel list.
> >>
> >> I am getting [dht-diskusage.c:277:dht_is_subvol_filled] 0-v0-dht:
> >> inodes on subvolume 'v0-disperse-56' are at (100.00 %), consider
> >> adding more bricks.
> >>
> >> message on client logs.My cluster is empty there are only a couple of
> >> GB files for testing. Why this message appear in syslog?
> >
> >
> > dht uses disk usage information from backend export.
> >
> > 1. What is the output of du -hs ? Please get this
> > information for each of the brick that are part of disperse.
> > 2. Once you get du information from each brick, the value seen by dht
> will
> > be based on how cluster/disperse aggregates du info (basically statfs
> fop).
> >
> > The reason for 100% disk usage may be,
> > In case of 1, backend fs might be shared by data other than brick.
> > In case of 2, some issues with aggregation.
> >
> >> Is is safe to
> >> ignore it?
> >
> >
> > dht will try not to have data files on the subvol in question
> > (v0-disperse-56). Hence lookup cost will be two hops for files hashing to
> > disperse-56 (note that other fops like read/write/open still have the
> cost
> > of single hop and dont suffer from this penalty). Other than that there
> is
> > no significant harm unless disperse-56 is really running out of space.
> >
> > regards,
> > Raghavendra
> >
> >> ___
> >> Gluster-devel mailing list
> >> Gluster-devel@gluster.org
> >> http://www.gluster.org/mailman/listinfo/gluster-devel
> >
> >
> >
> >
> > --
> > Raghavendra G
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel
>



-- 
Raghavendra G
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] [Gluster-users] Fwd: dht_is_subvol_filled messages on client

2016-04-30 Thread Raghavendra G
On Fri, Apr 29, 2016 at 12:32 AM, Serkan Çoban <cobanser...@gmail.com>
wrote:

> Hi, I cannot get an answer from user list, so asking to devel list.
>
> I am getting [dht-diskusage.c:277:dht_is_subvol_filled] 0-v0-dht:
> inodes on subvolume 'v0-disperse-56' are at (100.00 %), consider
> adding more bricks.
>
> message on client logs.My cluster is empty there are only a couple of
> GB files for testing. Why this message appear in syslog?


dht uses disk usage information from the backend export.

1. What is the output of du -hs <brick-path>? Please get this
information for each of the bricks that are part of the disperse set.
2. Once you get the du information from each brick, the value seen by dht will
be based on how cluster/disperse aggregates the du info (basically the statfs
fop; a small sketch follows below).

The reason for the 100% disk usage may be:
In case of 1, the backend fs might be shared by data other than the brick.
In case of 2, some issue with aggregation.
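
For reference, a small sketch of the kind of information the statfs fop
carries, shown with plain statvfs(2) against one of the brick directories
(the path is only an example); the inode counters (f_files/f_ffree) are what
drive the "inodes on subvolume ... are at (100.00 %)" message:

#include <stdio.h>
#include <sys/statvfs.h>

int
main (void)
{
        struct statvfs buf;

        if (statvfs ("/bricks/20", &buf) != 0) {   /* example brick path */
                perror ("statvfs");
                return 1;
        }

        printf ("blocks: total=%llu free=%llu\n",
                (unsigned long long) buf.f_blocks,
                (unsigned long long) buf.f_bfree);
        printf ("inodes: total=%llu free=%llu\n",
                (unsigned long long) buf.f_files,
                (unsigned long long) buf.f_ffree);
        return 0;
}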

> Is it safe to
> ignore it?
>

dht will try not to place data files on the subvol in question
(v0-disperse-56). Hence the lookup cost will be two hops for files hashing to
disperse-56 (note that other fops like read/write/open still have the cost
of a single hop and don't suffer from this penalty). Other than that there is
no significant harm unless disperse-56 is really running out of space.

regards,
Raghavendra

___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel
>



-- 
Raghavendra G
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] gluster 3.7.9 permission denied and mv errors

2016-04-29 Thread Raghavendra G
On Wed, Apr 13, 2016 at 10:00 PM, David F. Robinson <
david.robin...@corvidtec.com> wrote:

> I am running into two problems (possibly related?).
>
> 1) Every once in a while, when I do a 'rm -rf DIRNAME', it comes back with
> an error:
> rm: cannot remove `DIRNAME` : Directory not empty
>
> If I try the 'rm -rf' again after the error, it deletes the
> directory.  The issue is that I have scripts that clean up directories, and
> they are failing unless I go through the deletes a 2nd time.
>

What kind of mount are you using? Is it a FUSE or NFS mount? Recently we
saw a similar issue on NFS clients on RHEL6 where rm -rf used to fail with
ENOTEMPTY in some specific cases.


>
> 2) I have different scripts to move a large numbers of files (5-25k) from
> one directory to another.  Sometimes I receive an error:
> /bin/mv: cannot move `xyz` to `../bkp00/xyz`: File exists
>

Does ./bkp00/xyz exist on backend? If yes, what is the value of gfid xattr
(key: "trusted.gfid") for "xyz" and "./bkp00/xyz" on backend bricks (I need
gfid from all the bricks) when this issue happens?
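
In case it helps, one way to read that xattr directly from a brick (a
getfattr one-liner works just as well; the path below is only an example):

#include <stdio.h>
#include <sys/xattr.h>

int
main (void)
{
        /* example path on one of the bricks */
        const char   *path = "/data/brick01bkp/gfsbackup/bkp00/xyz";
        unsigned char gfid[16];   /* trusted.gfid is a 16-byte uuid on disk */
        ssize_t       len;
        int           i;

        len = lgetxattr (path, "trusted.gfid", gfid, sizeof (gfid));
        if (len != (ssize_t) sizeof (gfid)) {
                perror ("lgetxattr");
                return 1;
        }

        for (i = 0; i < 16; i++)
                printf ("%02x", gfid[i]);
        printf ("\n");
        return 0;
}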


> The move is done using '/bin/mv -f', so it should overwrite the file
> if it exists.  I have tested this with hundreds of files, and it works as
> expected.  However, every few days the script that moves the files will
> have problems with 1 or 2 files during the move.  This is one move problem
> out of roughly 10,000 files that are being moved and I cannot figure out
> any reason for the intermittent problem.
>
> Setup details for my gluster configuration shown below.
>
> [root@gfs01bkp logs]# gluster volume info
>
> Volume Name: gfsbackup
> Type: Distribute
> Volume ID: e78d5123-d9bc-4d88-9c73-61d28abf0b41
> Status: Started
> Number of Bricks: 7
> Transport-type: tcp
> Bricks:
> Brick1: gfsib01bkp.corvidtec.com:/data/brick01bkp/gfsbackup
> Brick2: gfsib01bkp.corvidtec.com:/data/brick02bkp/gfsbackup
> Brick3: gfsib02bkp.corvidtec.com:/data/brick01bkp/gfsbackup
> Brick4: gfsib02bkp.corvidtec.com:/data/brick02bkp/gfsbackup
> Brick5: gfsib02bkp.corvidtec.com:/data/brick03bkp/gfsbackup
> Brick6: gfsib02bkp.corvidtec.com:/data/brick04bkp/gfsbackup
> Brick7: gfsib02bkp.corvidtec.com:/data/brick05bkp/gfsbackup
> Options Reconfigured:
> nfs.disable: off
> server.allow-insecure: on
> storage.owner-gid: 100
> server.manage-gids: on
> cluster.lookup-optimize: on
> server.event-threads: 8
> client.event-threads: 8
> changelog.changelog: off
> storage.build-pgfid: on
> performance.readdir-ahead: on
> diagnostics.brick-log-level: WARNING
> diagnostics.client-log-level: WARNING
> cluster.rebal-throttle: aggressive
> performance.cache-size: 1024MB
> performance.write-behind-window-size: 10MB
>
>
> [root@gfs01bkp logs]# rpm -qa | grep gluster
> glusterfs-server-3.7.9-1.el6.x86_64
> glusterfs-debuginfo-3.7.9-1.el6.x86_64
> glusterfs-api-3.7.9-1.el6.x86_64
> glusterfs-resource-agents-3.7.9-1.el6.noarch
> gluster-nagios-common-0.1.1-0.el6.noarch
> glusterfs-libs-3.7.9-1.el6.x86_64
> glusterfs-fuse-3.7.9-1.el6.x86_64
> glusterfs-extra-xlators-3.7.9-1.el6.x86_64
> glusterfs-geo-replication-3.7.9-1.el6.x86_64
> glusterfs-3.7.9-1.el6.x86_64
> glusterfs-cli-3.7.9-1.el6.x86_64
> glusterfs-devel-3.7.9-1.el6.x86_64
> glusterfs-rdma-3.7.9-1.el6.x86_64
> samba-vfs-glusterfs-4.1.11-2.el6.x86_64
> glusterfs-client-xlators-3.7.9-1.el6.x86_64
> glusterfs-api-devel-3.7.9-1.el6.x86_64
> python-gluster-3.7.9-1.el6.noarch
>
>
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel
>



-- 
Raghavendra G
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Gluster + Infiniband + 3.x kernel -> hard crash?

2016-04-29 Thread Raghavendra G
On Thu, Apr 7, 2016 at 2:02 AM, Glomski, Patrick <
patrick.glom...@corvidtec.com> wrote:

> We run gluster 3.7 in a distributed replicated setup. Infiniband (tcp)
> links the gluster peers together and clients use the ethernet interface.
>
> This setup is stable running CentOS 6.x and using the most recent
> infiniband drivers provided by Mellanox. Uptime was 170 days when we took
> it down to wipe the systems and update to CentOS 7.
>
> When the exact same setup is loaded onto a CentOS 7 machine (minor setup
> differences, but basically the same; setup is handled by ansible), the
> peers will (seemingly randomly) experience a hard crash and need to be
> power-cycled. There is no output on the screen and nothing in the logs.
> After rebooting, the peer reconnects, heals whatever files it missed, and
> everything is happy again. Maximum uptime for any given peer is 20 days.
> Thanks to the replication, clients maintain connectivity, but from a system
> administration perspective it's driving me crazy!
>
> We run other storage servers with the same infiniband and CentOS7 setup
> except that they use NFS instead of gluster. NFS shares are served through
> infiniband to some machines and ethernet to others.
>
> Is it possible that gluster's (and only gluster's) use of the infiniband
> kernel module to send tcp packets to its peers on a 3 kernel is causing the
> system to have a hard crash?
>

Please note that Gluster is only a "userspace" consumer of infiniband. So,
at least in "theory" it shouldn't result in a kernel panic. However,
infiniband also allows userspace programs to do some things which can
normally be done only by the kernel (like pinning pages to a specific
address). I am not very familiar with the internals of infiniband and hence
cannot authoritatively comment on whether a kernel panic is
possible/impossible. Someone with an understanding of infiniband internals
would be in a better position to comment on this.


Pretty specific problem and it doesn't make much sense to me, but that's
> sure where the evidence seems to point.
>
> Anyone running CentOS 7 gluster arrays with infiniband out there to
> confirm that it works fine for them? Gluster devs care to chime in with a
> better theory? I'd love for this random crashing to stop.
>
> Thanks,
> Patrick
>
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel
>



-- 
Raghavendra G
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Regression-test-burn-in crash in EC test

2016-04-29 Thread Raghavendra G
Seems like I missed adding rtalur/sakshi to cc list.

On Fri, Apr 29, 2016 at 5:25 PM, Raghavendra G <raghaven...@gluster.com>
wrote:

> Raghavendra Talur reported another crash in dht_rename_lock_cbk (which is
> similar to, though not exactly the same as, the bt presented here). I heard
> Sakshi is taking a look into this.
>
> Rtalur/Sakshi,
>
> Can you please post your findings here?
>
> regards,
> Raghavendra
>
> On Fri, Apr 29, 2016 at 4:50 PM, Jeff Darcy <jda...@redhat.com> wrote:
>
>> > The test is doing renames where source and target directories are
>> > different. At the same time a new ec-set is added and rebalance started.
>> > Rebalance will cause dht to also move files between bricks. Maybe this
>> > is causing some race in dht ?
>> >
>> > I'll try to continue investigating when I have some time.
>>
>> That would be great, but if you've pursued this as far as DHT then it
>> would be OK to hand it off to that team as well.  Thanks!
>> ___
>> Gluster-devel mailing list
>> Gluster-devel@gluster.org
>> http://www.gluster.org/mailman/listinfo/gluster-devel
>>
>
>
>
> --
> Raghavendra G
>



-- 
Raghavendra G
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Cores generated with ./tests/geo-rep/georep-basic-dr-tarssh.t

2016-03-04 Thread Raghavendra G
On Fri, Mar 4, 2016 at 2:02 PM, Raghavendra G <raghaven...@gluster.com>
wrote:

>
>
> On Thu, Mar 3, 2016 at 6:26 PM, Kotresh Hiremath Ravishankar <
> khire...@redhat.com> wrote:
>
>> Hi,
>>
>> Yes, with this patch we need not set conn->trans to NULL in
>> rpc_clnt_disable
>>
>
> While [1] fixes the crash, things can be improved in the way how changelog
> is using rpc.
>
> 1. In the current code, there is an rpc_clnt object leak during disconnect
> event.
> 2. Also, freed "mydata" of changelog is still associated with rpc_clnt
> object (corollary of 1), though change log might not get any events with
> "mydata" (as connection is dead).
>
> I've discussed with Kotresh about changes needed, offline. So, following
> are the action items.
> 1. Soumya's patch [2] is valid and is needed for 3.7 branch too.
> 2. [2] can be accepted. However, someone might want to re-use an rpc
> object after disabling it, like introducing a new api rpc_clnt_enable_again
> (though no of such use-cases is very less). But [2] doesn't allow it. The
> point is as long as rpc-clnt object is alive, transport object is alive
> (though disconnected) and we can re-use it. So, I would prefer not to
> accept it.
>

[2] will be accepted now.


> 3. Kotresh will work on new changes to make sure changelog makes correct
> use of rpc-clnt.
>
> [1] http://review.gluster.org/#/c/13592
> [2] http://review.gluster.org/#/c/1359
>
> regards,
> Raghavendra.
>
>
>> Thanks and Regards,
>> Kotresh H R
>>
>> - Original Message -
>> > From: "Soumya Koduri" <skod...@redhat.com>
>> > To: "Kotresh Hiremath Ravishankar" <khire...@redhat.com>, "Raghavendra
>> G" <raghaven...@gluster.com>
>> > Cc: "Gluster Devel" <gluster-devel@gluster.org>
>> > Sent: Thursday, March 3, 2016 5:06:00 PM
>> > Subject: Re: [Gluster-devel] Cores generated with
>> ./tests/geo-rep/georep-basic-dr-tarssh.t
>> >
>> >
>> >
>> > On 03/03/2016 04:58 PM, Kotresh Hiremath Ravishankar wrote:
>> > > [Replying on top of my own reply]
>> > >
>> > > Hi,
>> > >
>> > > I have submitted the below patch [1] to avoid the issue of
>> > > 'rpc_clnt_submit'
>> > > getting reconnected. But it won't take care of memory leak problem
>> you were
>> > > trying to fix. That we have to carefully go through all cases and fix
>> it.
>> > > Please have a look at it.
>> > >
>> > Looks good. IIUC, with this patch, we need not set conn->trans to NULL
>> > in 'rpc_clnt_disable()'. Right? If yes, then it takes care of memleak as
>> > the transport object shall then get freed as part of
>> > 'rpc_clnt_trigger_destroy'.
>> >
>> >
>> > > http://review.gluster.org/#/c/13592/
>> > >
>> > > Thanks and Regards,
>> > > Kotresh H R
>> > >
>> > > - Original Message -
>> > >> From: "Kotresh Hiremath Ravishankar" <khire...@redhat.com>
>> > >> To: "Soumya Koduri" <skod...@redhat.com>
>> > >> Cc: "Raghavendra G" <raghaven...@gluster.com>, "Gluster Devel"
>> > >> <gluster-devel@gluster.org>
>> > >> Sent: Thursday, March 3, 2016 3:39:11 PM
>> > >> Subject: Re: [Gluster-devel] Cores generated with
>> > >> ./tests/geo-rep/georep-basic-dr-tarssh.t
>> > >>
>> > >> Hi Soumya,
>> > >>
>> > >> I tested the lastes patch [2] on master where your previous patch
>> [1] in
>> > >> merged.
>> > >> I see crashes at different places.
>> > >>
>> > >> 1. If there are code paths that are holding rpc object without
>> taking ref
>> > >> on
>> > >> it, all those
>> > >> code path will crash on invoking rpc submit on that object as rpc
>> > >> object
>> > >> would have freed
>> > >> by last unref on DISCONNECT event. I see this kind of use-case in
>> > >> chagnelog rpc code.
>> > >> Need to check on other users of rpc.
>> > Agree. We should fix all such code-paths. Since this seem to be an
>> > intricate fix, shall we take these patches only in master branch and not
>> > in 3.7 release for now till we fix all such paths as we encounter?
>> >
>> > >&

Re: [Gluster-devel] Cores generated with ./tests/geo-rep/georep-basic-dr-tarssh.t

2016-03-04 Thread Raghavendra G
On Thu, Mar 3, 2016 at 6:26 PM, Kotresh Hiremath Ravishankar <
khire...@redhat.com> wrote:

> Hi,
>
> Yes, with this patch we need not set conn->trans to NULL in
> rpc_clnt_disable
>

While [1] fixes the crash, things can be improved in the way changelog
is using rpc.

1. In the current code, there is an rpc_clnt object leak during the disconnect
event.
2. Also, the freed "mydata" of changelog is still associated with the rpc_clnt
object (a corollary of 1), though changelog might not get any events with
"mydata" (as the connection is dead).

I've discussed the changes needed with Kotresh offline. So, the following
are the action items.
1. Soumya's patch [2] is valid and is needed for the 3.7 branch too.
2. [2] can be accepted. However, someone might want to re-use an rpc object
after disabling it, e.g. by introducing a new api rpc_clnt_enable_again
(though the number of such use-cases is very small). But [2] doesn't allow it.
The point is that as long as the rpc-clnt object is alive, the transport object
is alive (though disconnected) and we can re-use it. So, I would prefer not to
accept it.
3. Kotresh will work on new changes to make sure changelog makes correct
use of rpc-clnt.

[1] http://review.gluster.org/#/c/13592
[2] http://review.gluster.org/#/c/1359

regards,
Raghavendra.


> Thanks and Regards,
> Kotresh H R
>
> - Original Message -
> > From: "Soumya Koduri" <skod...@redhat.com>
> > To: "Kotresh Hiremath Ravishankar" <khire...@redhat.com>, "Raghavendra
> G" <raghaven...@gluster.com>
> > Cc: "Gluster Devel" <gluster-devel@gluster.org>
> > Sent: Thursday, March 3, 2016 5:06:00 PM
> > Subject: Re: [Gluster-devel] Cores generated with
> ./tests/geo-rep/georep-basic-dr-tarssh.t
> >
> >
> >
> > On 03/03/2016 04:58 PM, Kotresh Hiremath Ravishankar wrote:
> > > [Replying on top of my own reply]
> > >
> > > Hi,
> > >
> > > I have submitted the below patch [1] to avoid the issue of
> > > 'rpc_clnt_submit'
> > > getting reconnected. But it won't take care of memory leak problem you
> were
> > > trying to fix. That we have to carefully go through all cases and fix
> it.
> > > Please have a look at it.
> > >
> > Looks good. IIUC, with this patch, we need not set conn->trans to NULL
> > in 'rpc_clnt_disable()'. Right? If yes, then it takes care of memleak as
> > the transport object shall then get freed as part of
> > 'rpc_clnt_trigger_destroy'.
> >
> >
> > > http://review.gluster.org/#/c/13592/
> > >
> > > Thanks and Regards,
> > > Kotresh H R
> > >
> > > - Original Message -
> > >> From: "Kotresh Hiremath Ravishankar" <khire...@redhat.com>
> > >> To: "Soumya Koduri" <skod...@redhat.com>
> > >> Cc: "Raghavendra G" <raghaven...@gluster.com>, "Gluster Devel"
> > >> <gluster-devel@gluster.org>
> > >> Sent: Thursday, March 3, 2016 3:39:11 PM
> > >> Subject: Re: [Gluster-devel] Cores generated with
> > >> ./tests/geo-rep/georep-basic-dr-tarssh.t
> > >>
> > >> Hi Soumya,
> > >>
> > >> I tested the lastes patch [2] on master where your previous patch [1]
> in
> > >> merged.
> > >> I see crashes at different places.
> > >>
> > >> 1. If there are code paths that are holding rpc object without taking
> ref
> > >> on
> > >> it, all those
> > >> code path will crash on invoking rpc submit on that object as rpc
> > >> object
> > >> would have freed
> > >> by last unref on DISCONNECT event. I see this kind of use-case in
> > >> chagnelog rpc code.
> > >> Need to check on other users of rpc.
> > Agree. We should fix all such code-paths. Since this seem to be an
> > intricate fix, shall we take these patches only in master branch and not
> > in 3.7 release for now till we fix all such paths as we encounter?
> >
> > >>
> > >> 2. And also we need to take care of reconnect timers that are being
> set
> > >> and
> > >> are re-tried to
> > >> connect back on expiration. In those cases also, we might crash
> as rpc
> > >> object would have freed.
> > Your patch addresses this..right?
> >
> > Thanks,
> > Soumya
> >
> > >>
> > >>
> > >> [1] http://review.gluster.org/#/c/13507/
> > >> [2] http://review.gluster.org/#/c/13587/
> > >&g

Re: [Gluster-devel] Cores generated with ./tests/geo-rep/georep-basic-dr-tarssh.t

2016-03-02 Thread Raghavendra G
Hi Soumya,

Can you send a fix to this regression on upstream master too? This patch is
merged there.

regards,
Raghavendra

On Tue, Mar 1, 2016 at 10:34 PM, Kotresh Hiremath Ravishankar <
khire...@redhat.com> wrote:

> Hi Soumya,
>
> I analysed the issue and found out that crash has happened because of the
> patch [1].
>
> The patch doesn't set transport object to NULL in 'rpc_clnt_disable' but
> instead does it on
> 'rpc_clnt_trigger_destroy'. So if there are pending rpc invocations on the
> rpc object that
> is disabled (those instances are possible as happening now in changelog),
> it will trigger a
> CONNECT notify again with 'mydata' that is freed causing a crash. This
> happens because
> 'rpc_clnt_submit' reconnects if rpc is not connected.
>
>  rpc_clnt_submit (...) {
>...
> if (conn->connected == 0) {
> ret = rpc_transport_connect (conn->trans,
>
>  conn->config.remote_port);
> }
>...
>  }
>
> Without your patch, conn->trans was set NULL and hence CONNECT fails not
> resulting with
> CONNECT notify call. And also the cleanup happens in failure path.
>
> So the memory leak can happen, if there is no try for rpc invocation after
> DISCONNECT.
> It will be cleaned up otherwise.
>
>
> [1] http://review.gluster.org/#/c/13507/
>
> Thanks and Regards,
> Kotresh H R
>
> - Original Message -
> > From: "Kotresh Hiremath Ravishankar" <khire...@redhat.com>
> > To: "Soumya Koduri" <skod...@redhat.com>
> > Cc: avish...@redhat.com, "Gluster Devel" <gluster-devel@gluster.org>
> > Sent: Monday, February 29, 2016 4:15:22 PM
> > Subject: Re: Cores generated with
> ./tests/geo-rep/georep-basic-dr-tarssh.t
> >
> > Hi Soumya,
> >
> > I just tested that it is reproducible only with your patch both in
> master and
> > 3.76 branch.
> > The geo-rep test cases are marked bad in master. So it's not hit in
> master.
> > rpc is introduced
> > in changelog xlator to communicate to applications via libgfchangelog.
> > Venky/Me will check
> > why is the crash happening and will update.
> >
> >
> > Thanks and Regards,
> > Kotresh H R
> >
> > - Original Message -
> > > From: "Soumya Koduri" <skod...@redhat.com>
> > > To: avish...@redhat.com, "kotresh" <khire...@redhat.com>
> > > Cc: "Gluster Devel" <gluster-devel@gluster.org>
> > > Sent: Monday, February 29, 2016 2:10:51 PM
> > > Subject: Cores generated with ./tests/geo-rep/georep-basic-dr-tarssh.t
> > >
> > > Hi Aravinda/Kotresh,
> > >
> > > With [1], I consistently see cores generated with the test
> > > './tests/geo-rep/georep-basic-dr-tarssh.t' in release-3.7 branch. From
> > > the cores, looks like we are trying to dereference a freed
> > > changelog_rpc_clnt_t(crpc) object in changelog_rpc_notify(). Strangely
> > > this was not reported in master branch.
> > >
> > > I tried debugging but couldn't find any possible suspects. I request
> you
> > > to take a look and let me know if [1] caused any regression.
> > >
> > > Thanks,
> > > Soumya
> > >
> > > [1] http://review.gluster.org/#/c/13507/
> > >
> >
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel
>



-- 
Raghavendra G
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Posix lock migration design

2016-02-29 Thread Raghavendra G
On Mon, Feb 29, 2016 at 12:52 PM, Susant Palai <spa...@redhat.com> wrote:

> Hi Raghavendra,
>I have a question on the design.
>
>Currently in case of a client disconnection, pl_flush cleans up the
> locks associated with the fd created from that client.
> From the design, rebalance will migrate the locks to the new destination.
> Now in case client gets disconnected from the
> destination brick, how it is supposed to clean up the locks as
> rebalance/brick have no idea whether the client has opened
> an fd on destination and what the fd is.
>

>So the question is how to associate the client fd with locks on
> destination.
>

We don't use fds to clean up the locks during flush. We use the lk-owner,
which doesn't change across migration. Note that the lk-owner for posix-locks
is filled in by the vfs/kernel on the node where the glusterfs mount lives:


        pthread_mutex_lock (&pl_inode->mutex);
        {
                __delete_locks_of_owner (pl_inode, frame->root->client,
                                         &frame->root->lk_owner);
        }
        pthread_mutex_unlock (&pl_inode->mutex);



> Thanks,
> Susant
>
> - Original Message -
> From: "Susant Palai" <spa...@redhat.com>
> To: "Gluster Devel" <gluster-devel@gluster.org>
> Sent: Friday, 29 January, 2016 3:15:14 PM
> Subject: [Gluster-devel] Posix lock migration design
>
> Hi,
>Here, [1]
>
> https://docs.google.com/document/d/17SZAKxx5mhM-cY5hdE4qRq9icmFqy3LBaTdewofOXYc/edit?usp=sharing
> is a google document about proposal for "POSIX_LOCK_MIGRATION". Problem
> statement and design are explained in the document it self.
>
>   Requesting the devel list to go through the document and
> comment/analyze/suggest, to take the thoughts forward (either on the
> google doc itself or here on the devel list).
>
>
> Thanks,
> Susant
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel
>



-- 
Raghavendra G
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] 3.6.8 crashing a lot in production

2016-02-23 Thread Raghavendra G
Came across a glibc bug which could have caused some corruptions. On googling
about possible problems, we found that there is an issue (
https://bugzilla.redhat.com/show_bug.cgi?id=1305406) fixed in
glibc-2.17-121.el7. The bug gives the following test to determine whether the
installed glibc is buggy, and running it on our local setup gave:

 # objdump -r -d /lib64/libc.so.6 | grep -C 20 _int_free |
   grep -C 10 cmpxchg | head -21 | grep -A 3 cmpxchg | tail -1 |
   (grep '%r' && echo "Your libc is likely buggy." || echo "Your libc looks OK.")
 7cc36: 48 85 c9   test   %rcx,%rcx
 Your libc is likely buggy.

Could you check whether the above command on your setup gives the same output
saying "Your libc is likely buggy."?

regards,

On Sat, Feb 13, 2016 at 7:46 AM, Krutika Dhananjay <kdhan...@redhat.com>
wrote:

> Taking a look. Give me some time.
>
> -Krutika
>
> --
>
> *From: *"Joe Julian" <j...@julianfamily.org>
> *To: *"Krutika Dhananjay" <kdhan...@redhat.com>, gluster-devel@gluster.org
> *Sent: *Saturday, February 13, 2016 6:02:13 AM
> *Subject: *Fwd: [Gluster-devel] 3.6.8 crashing a lot in production
>
>
> Could this be a regression from http://review.gluster.org/7981 ?
>
>  Forwarded Message 
> Subject: [Gluster-devel] 3.6.8 crashing a lot in production
> Date: Fri, 12 Feb 2016 16:20:59 -0800
> From: Joe Julian <j...@julianfamily.org>
> To: gluster-us...@gluster.org, gluster-devel@gluster.org
>
>
> I have multiple bricks crashing in production. Any help would be greatly
> appreciated.
>
> The crash log is in this bug report: 
> https://bugzilla.redhat.com/show_bug.cgi?id=1307146
>
> Looks like it's crashing in pl_inodelk_client_cleanup
>
>
>
>
>
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel
>



-- 
Raghavendra G
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] 3.6.8 glusterfsd processes not responding

2016-02-15 Thread Raghavendra G
Had missed out gluster-devel.

On Tue, Feb 16, 2016 at 10:50 AM, Raghavendra G <raghaven...@gluster.com>
wrote:

> The thread is blocked on ctx->lock. Looking at the definition of
> pl_ctx_get, I found that two operations
>
> 1. Check whether ctx is not present.
> 2. Create and set the ctx.
>
> are not atomic. This can result in a non-zero ctx (say ctx1) being overwritten
> (by say ctx2) by a racing thread. And if somebody has already acquired
> ctx1->lock, they would end up doing the unlock on ctx2, resulting in a
> corrupted lock. This can result in hangs. Below is the definition of
> pl_ctx_get:
>
> pl_ctx_t*
> pl_ctx_get (client_t *client, xlator_t *xlator)
> {
>         void     *tmp = NULL;
>         pl_ctx_t *ctx = NULL;
>
>         client_ctx_get (client, xlator, &tmp);
>
>         ctx = tmp;
>
>         if (ctx != NULL)
>                 goto out;
>
>         ctx = GF_CALLOC (1, sizeof (pl_ctx_t), gf_locks_mt_posix_lock_t);
>
>         if (ctx == NULL)
>                 goto out;
>
>         pthread_mutex_init (&ctx->lock, NULL);
>         INIT_LIST_HEAD (&ctx->inodelk_lockers);
>         INIT_LIST_HEAD (&ctx->entrylk_lockers);
>
>         if (client_ctx_set (client, xlator, ctx) != 0) {
>                 pthread_mutex_destroy (&ctx->lock);
>                 GF_FREE (ctx);
>                 ctx = NULL;
>         }
> out:
>         return ctx;
> }
>
> Though this is a bug, I am not sure whether this is the RCA for the issue
> pointed out in this thread. I'll send out a patch for the issue identified
> here.
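
A standalone sketch (own minimal types, not the actual patch) of the
get-or-create pattern needed here: the check, the allocation and the set are
done under one lock, so two racing threads can never end up using different
ctx objects for the same client:

#include <pthread.h>
#include <stdlib.h>

typedef struct {
        pthread_mutex_t lock;
        /* ... inodelk/entrylk lists ... */
} ctx_example_t;

typedef struct {
        pthread_mutex_t ctx_lock;   /* serialises check + create + set */
        ctx_example_t  *ctx;
} client_example_t;

static ctx_example_t *
ctx_get_or_create (client_example_t *client)
{
        ctx_example_t *ctx = NULL;

        pthread_mutex_lock (&client->ctx_lock);
        {
                if (client->ctx == NULL) {
                        ctx = calloc (1, sizeof (*ctx));
                        if (ctx != NULL) {
                                pthread_mutex_init (&ctx->lock, NULL);
                                client->ctx = ctx;
                        }
                }
                ctx = client->ctx;  /* whoever created it, use that one */
        }
        pthread_mutex_unlock (&client->ctx_lock);

        return ctx;
}
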
>
>
> On Sat, Feb 13, 2016 at 8:46 AM, Joe Julian <j...@julianfamily.org> wrote:
>
>> I've also got several glusterfsd processes that have stopped responding.
>>
>> A backtrace from a live core, strace, and state dump follow:
>>
>> Thread 10 (LWP 31587):
>> #0  0x7f81d384289c in __lll_lock_wait () from
>> /lib/x86_64-linux-gnu/libpthread.so.0
>> #1  0x7f81d383e065 in _L_lock_858 () from
>> /lib/x86_64-linux-gnu/libpthread.so.0
>> #2  0x7f81d383deba in pthread_mutex_lock () from
>> /lib/x86_64-linux-gnu/libpthread.so.0
>> #3  0x7f81ce600ff8 in pl_inodelk_client_cleanup ()
>>from /usr/lib/x86_64-linux-gnu/glusterfs/3.6.8/xlator/features/locks.so
>> #4  0x7f81ce5fe84a in ?? () from
>> /usr/lib/x86_64-linux-gnu/glusterfs/3.6.8/xlator/features/locks.so
>> #5  0x7f81d3ef573d in gf_client_disconnect () from
>> /usr/lib/x86_64-linux-gnu/libglusterfs.so.0
>> #6  0x7f81cd74e270 in server_connection_cleanup ()
>>from
>> /usr/lib/x86_64-linux-gnu/glusterfs/3.6.8/xlator/protocol/server.so
>> #7  0x7f81cd7486ec in server_rpc_notify ()
>>from
>> /usr/lib/x86_64-linux-gnu/glusterfs/3.6.8/xlator/protocol/server.so
>> #8  0x7f81d3c70f1b in rpcsvc_handle_disconnect () from
>> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0
>> #9  0x7f81d3c710b0 in rpcsvc_notify () from
>> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0
>> #10 0x7f81d3c74257 in rpc_transport_notify () from
>> /usr/lib/x86_64-linux-gnu/libgfrpc.so.0
>> #11 0x7f81cf4d4077 in ?? () from
>> /usr/lib/x86_64-linux-gnu/glusterfs/3.6.8/rpc-transport/socket.so
>> #12 0x7f81d3ef793b in ?? () from
>> /usr/lib/x86_64-linux-gnu/libglusterfs.so.0
>> #13 0x7f81d4348f71 in main ()
>>
>> Thread 9 (LWP 3385):
>> #0  0x7f81d353408d in nanosleep () from
>> /lib/x86_64-linux-gnu/libc.so.6
>> #1  0x7f81d3533f2c in sleep () from /lib/x86_64-linux-gnu/libc.so.6
>> #2  0x7f81cee50c38 in ?? () from
>> /usr/lib/x86_64-linux-gnu/glusterfs/3.6.8/xlator/storage/posix.so
>> #3  0x7f81d383be9a in start_thread () from
>> /lib/x86_64-linux-gnu/libpthread.so.0
>> #4  0x7f81d35683fd in clone () from /lib/x86_64-linux-gnu/libc.so.6
>> #5  0x in ?? ()
>>
>> Thread 8 (LWP 20656):
>> #0  0x7f81d38400fe in pthread_cond_timedwait@@GLIBC_2.3.2 ()
>>from /lib/x86_64-linux-gnu/libpthread.so.0
>> #1  0x7f81ce3ea032 in iot_worker ()
>>from
>> /usr/lib/x86_64-linux-gnu/glusterfs/3.6.8/xlator/performance/io-threads.so
>> #2  0x7f81d383be9a in start_thread () from
>> /lib/x86_64-linux-gnu/libpthread.so.0
>> #3  0x7f81d35683fd in clone () from /lib/x86_64-linux-gnu/libc.so.6
>> #4  0x in ?? ()
>>
>> Thread 7 (LWP 31881):
>> #0  0x7f81d383fd84 in pthread_cond_wait@@GLIBC_2.3.2 () from
>> /lib/x86_64-linux-gnu/libpthread.so.0
>> #1  0x7f81cee50f3b in posix_fsyncer_pick ()
>>from /usr/lib/x86_64-linux-gnu/glusterfs/3.6.8/xlator/storage/posix.so
>> #2  0x7f81cee51155 in posix_fsyncer ()
>>from /usr/lib/x86_64-linux-gnu/glusterfs/3.6.8/xlator/storage/posix

Re: [Gluster-devel] Feature: Automagic lock-revocation for features/locks xlator (v3.7.x)

2016-02-15 Thread Raghavendra G
On Sat, Feb 13, 2016 at 12:38 AM, Richard Wareing <rware...@fb.com> wrote:

> Hey,
>
> Sorry for the late reply but I missed this e-mail.  With respect to
> identifying locking domains, we use the identical logic that GlusterFS
> itself uses to identify the domains; which is just a simple string
> comparison if I'm not mistaken.   System processes (SHD/Rebalance) locking
> domains are treated identical to any other, this is specifically critical
> to things like DHT healing as this locking domain is used both in userland
> and by SHDs (you cannot disable DHT healing).
>

We cannot disable DHT healing altogether. But we _can_ identify whether
healing is done by a mount process (on behalf of application) or a
rebalance process. All internal processes (rebalance, shd, quotad etc) have
a negative value in frame->root->pid (as opposed to a positive value for a
fop request from a mount process). I agree with you that just by looking at
domain, we cannot figure out whether lock request is from internal process
or a mount process. But, with the help of frame->root->pid, we can. By
choosing to flush locks from rebalance process (instead of locks from mount
process), I thought we can reduce the scenarios where application sees
errors. Of course we'll see more of rebalance failures, but that is a
trade-off we perhaps have to live with. Just a thought :).
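
As a rough illustration of that idea (a sketch only, not the locks xlator
code: the structures and the victim-selection helper below are invented for
the example), a revocation pass could prefer locks whose originating pid is
negative, i.e. locks taken by internal processes such as rebalance or
self-heal, and fall back to application locks only when no internal
candidate exists:

/* Sketch: pick a revocation victim among granted locks, preferring locks
 * held by internal processes (negative pid, as rebalance/shd/quotad use in
 * frame->root->pid) over application locks. lock_entry_t and the list are
 * illustrative stand-ins, not GlusterFS structures. */
#include <stdbool.h>
#include <stdio.h>

typedef struct lock_entry {
        int                client_pid;  /* negative => internal process */
        const char        *domain;      /* locking domain of the lock */
        struct lock_entry *next;
} lock_entry_t;

static bool
is_internal_lock (const lock_entry_t *l)
{
        return l->client_pid < 0;
}

static lock_entry_t *
pick_revocation_victim (lock_entry_t *granted)
{
        lock_entry_t *fallback = NULL;

        for (lock_entry_t *l = granted; l != NULL; l = l->next) {
                if (is_internal_lock (l))
                        return l;       /* revoke internal locks first */
                if (fallback == NULL)
                        fallback = l;   /* else remember an application lock */
        }

        return fallback;
}

int
main (void)
{
        lock_entry_t app  = { .client_pid = 4321, .domain = "app", .next = NULL };
        lock_entry_t rebl = { .client_pid = -3,   .domain = "dht", .next = &app };

        lock_entry_t *victim = pick_revocation_victim (&rebl);
        printf ("revoke lock from pid %d (domain %s)\n",
                victim->client_pid, victim->domain);
        return 0;
}

The trade-off stays the same as described above: internal operations whose
locks get flushed this way will fail and must be able to retry their work.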



>  To illustrate this, consider the case where an SHD holds a lock to do a
> DHT heal but can't because of GFID split-brain... a user comes along and
> hammers that directory attempting to get a lock... you can pretty much kiss
> your cluster good-bye after that :).
>
> With this in mind, we explicitly choose not to respect system process
> (SHD/rebalance) locks any more than a user lock request as they can be just
> as likely (if not more so) to cause a system to fall over vs. a user (see
> example above).  Although this might seem unwise at first, I'd put forth
> that having clusters fall over catastrophically pushes far worse decisions
> on operators such as re-kicking random bricks or entire clusters in
> desperate attempts at freeing locks (the CLI is often unable to free the
> locks in our experience) or stopping run away memory consumption due to
> frames piling up on the bricks.  To date, we haven't even observed a single
> instance of data corruption (and we've been looking for it!) due to this
> feature.
>
> We've even used it on clusters where they were on the verge of falling
> over and we enable revocation and the entire system stabilizes almost
> instantly (it's really like magic when you see it :) ).
>
> Hope this helps!
>
> Richard
>
>
> --
> *From:* raghavendra...@gmail.com [raghavendra...@gmail.com] on behalf of
> Raghavendra G [raghaven...@gluster.com]
> *Sent:* Tuesday, January 26, 2016 9:49 PM
> *To:* Raghavendra Gowdappa
> *Cc:* Richard Wareing; Gluster Devel
> *Subject:* Re: [Gluster-devel] Feature: Automagic lock-revocation for
> features/locks xlator (v3.7.x)
>
>
>
> On Mon, Jan 25, 2016 at 10:39 AM, Raghavendra Gowdappa <
> rgowd...@redhat.com> wrote:
>
>>
>>
>> - Original Message -
>> > From: "Richard Wareing" <rware...@fb.com>
>> > To: "Pranith Kumar Karampuri" <pkara...@redhat.com>
>> > Cc: gluster-devel@gluster.org
>> > Sent: Monday, January 25, 2016 8:17:11 AM
>> > Subject: Re: [Gluster-devel] Feature: Automagic lock-revocation for
>> features/locks xlator (v3.7.x)
>> >
>> > Yup per domain would be useful, the patch itself currently honors
>> domains as
>> > well. So locks in a different domains will not be touched during
>> revocation.
>> >
>> > In our cases we actually prefer to pull the plug on SHD/DHT domains to
>> ensure
>> > clients do not hang, this is important for DHT self heals which cannot
>> be
>> > disabled via any option, we've found in most cases once we reap the lock
>> > another properly behaving client comes along and completes the DHT heal
>> > properly.
>>
>> Flushing waiting locks of DHT can affect application continuity too.
>> Though locks requested by rebalance process can be flushed to certain
>> extent without applications noticing any failures, there is no guarantee
>> that locks requested in DHT_LAYOUT_HEAL_DOMAIN and DHT_FILE_MIGRATE_DOMAIN,
>> are issued by only rebalance process.
>
>
> I missed this point in my previous mail. Now I remember that we can use
> frame->root->pid (being negative) to identify internal processes. Was this
> the approach you followed to identify locks from rebalance process?
>
>
>> These two domains a

Re: [Gluster-devel] Rebalance data migration and corruption

2016-02-08 Thread Raghavendra G
+ read) or extra lock and unlock (for non-compound fop
>>>> based
>>>> implementation) for every read it does from src.
>>>>
>>>
>>> Can we use delegations here? Rebalance process can acquire a
>>> mandatory-write-delegation (an exclusive lock with a functionality
>>> that delegation is recalled when a write operation happens). In that
>>> case rebalance process, can do something like:
>>>
>>> 1. Acquire a read delegation for entire file.
>>> 2. Migrate the entire file.
>>> 3. Remove/unlock/give-back the delegation it has acquired.
>>>
>>> If a recall is issued from brick (when a write happens from mount), it
>>> completes the current write to dst (or throws away the read from src)
>>> to maintain atomicity. Before doing next set of (read, src) and
>>> (write, dst) tries to reacquire lock.
>>>
>>
>> With delegations this simplifies the normal path, when a file is
>> exclusively handled by rebalance. It also improves the case where a
>> client and rebalance are conflicting on a file, to degrade to mandatory
>> locks by either parties.
>>
>> I would prefer we take the delegation route for such needs in the future.
>>
>> Right. But if there are simultaneous access to the same file from any
> other client and rebalance process, delegations shall not be granted or
> revoked if granted even though they are operating at different offsets. So
> if you rely only on delegations, migration may not proceed if an
> application has held a lock or doing any I/Os.
>

Does the brick process wait for the response of delegation holder
(rebalance process here) before it wipes out the delegation/locks? If
that's the case, rebalance process can complete one transaction of (read,
src) and (write, dst) before responding to a delegation recall. That way
there is no starvation for either applications or the rebalance process (though
this makes both of them slower, but that cannot be helped, I think).
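
To make the ordering explicit, here is a self-contained sketch of that
migration loop (purely illustrative: the delegation helpers, the recall flag
and the in-memory "bricks" below are stand-ins, not rebalance or lease
APIs). The key point is that a recall is honoured only on a transaction
boundary: finish the in-flight (read, src) + (write, dst) pair, return the
delegation, then re-acquire it before copying more data.

/* Sketch: migrate a file under a read delegation on the source, yielding
 * the delegation between transactions whenever a recall is pending. */
#include <stdatomic.h>
#include <stdio.h>
#include <string.h>

#define CHUNK_SIZE 4096

static atomic_bool recall_requested;  /* set by the "brick" when a client writes */

static void acquire_read_delegation (void) { atomic_store (&recall_requested, 0); }
static void release_delegation      (void) { /* acknowledge the recall */ }

/* In-memory buffers stand in for the src and dst bricks. */
static size_t
read_src (const char *src, size_t src_len, size_t off, char *buf, size_t len)
{
        size_t n = (off >= src_len) ? 0 : (src_len - off < len ? src_len - off : len);
        memcpy (buf, src + off, n);
        return n;
}

static void
write_dst (char *dst, size_t off, const char *buf, size_t len)
{
        memcpy (dst + off, buf, len);
}

static void
migrate_file (const char *src, size_t src_len, char *dst)
{
        size_t offset = 0;
        char   buf[CHUNK_SIZE];

        acquire_read_delegation ();

        while (offset < src_len) {
                /* One atomic transaction: read from src, then write to dst. */
                size_t n = read_src (src, src_len, offset, buf, sizeof (buf));
                write_dst (dst, offset, buf, n);
                offset += n;

                if (atomic_load (&recall_requested)) {
                        /* The in-flight transaction is complete, so it is
                         * safe to return the delegation and let the client
                         * write, then take it again before continuing. */
                        release_delegation ();
                        acquire_read_delegation ();
                }
        }

        release_delegation ();
}

int
main (void)
{
        static char src[10000], dst[10000];
        memset (src, 'x', sizeof (src));
        migrate_file (src, sizeof (src), dst);
        printf ("migrated %zu bytes, copies match: %d\n",
                sizeof (src), memcmp (src, dst, sizeof (src)) == 0);
        return 0;
}

Because the recall is acknowledged only between transactions, a client write
can never interleave with a half-finished copy of the same region.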


> Also ideally rebalance process has to take write delegation as it would
> end up writing the data on destination brick which shall affect READ I/Os,
> (though of course we can have special checks/hacks for internal generated
> fops).
>

No, read delegations (on src) are sufficient for our use case. All we need is
that a write on src, issued while the rebalance process holds a delegation,
is blocked until the rebalance process returns that delegation. Write
delegations are unnecessarily restrictive, as they conflict with application
reads too, which we don't need. For clarity: the client always writes to src
first and then to dst, and writes to src and dst are serialized. So it's
sufficient that we synchronize on src.


> That said, having delegations shall definitely ensure correctness with
> respect to exclusive file access.
>
> Thanks,
> Soumya
>
>
>
>>> @Soumyak, can something like this be done with delegations?
>>>
>>> @Pranith,
>>> Afr does transactions for writing to its subvols. Can you suggest any
>>> optimizations here so that rebalance process can have a transaction
>>> for (read, src) and (write, dst) with minimal performance overhead?
>>>
>>> regards,
>>> Raghavendra.
>>>
>>>
>>>> Comments?
>>>>
>>>>
>>>>> regards,
>>>>> Raghavendra.
>>>>>
>>>>
>>>> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel
>



-- 
Raghavendra G
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Rebalance data migration and corruption

2016-02-08 Thread Raghavendra G
>>Right. But if there are simultaneous access to the same file from

> any other client and rebalance process, delegations shall not be
>> granted or revoked if granted even though they are operating at
>> different offsets. So if you rely only on delegations, migration may
>> not proceed if an application has held a lock or doing any I/Os.
>>
>>
>> Does the brick process wait for the response of delegation holder
>> (rebalance process here) before it wipes out the delegation/locks? If
>> that's the case, rebalance process can complete one transaction of
>> (read, src) and (write, dst) before responding to a delegation recall.
>> That way there is no starvation for both applications and rebalance
>> process (though this makes both of them slower, but that cannot helped I
>> think).
>>
>
> yes. Brick process should wait for certain period before revoking the
> delegations forcefully in case if it is not returned by the client. Also if
> required (like done by NFS servers) we can choose to increase this timeout
> value at run time if the client is diligently flushing the data.


hmm.. I would prefer an infinite timeout. The only scenario where the brick
process should forcefully flush leases would be loss of the connection to the
rebalance process. The more scenarios in which the brick can flush leases
without the rebalance process's knowledge, the more race windows we open up
for this bug to occur.

In fact, to be strictly correct, the rebalance process should replay all the
transactions that happened under a lease which the brick flushed out (after
re-acquiring that lease). So we would like to avoid any such scenarios.

Btw, what is the necessity of timeouts? Is it an insurance against rogue
clients who won't respond back to lease recalls?
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] [Gluster-users] GlusterFS FUSE client hangs on rsyncing lots of file

2016-02-04 Thread Raghavendra G
+soumyak, +rtalur.

On Fri, Jan 29, 2016 at 2:34 PM, Pranith Kumar Karampuri <
pkara...@redhat.com> wrote:

>
>
> On 01/28/2016 05:05 PM, Pranith Kumar Karampuri wrote:
>
>> With baul jianguo's help I am able to see that FLUSH fops are hanging for
>> some reason.
>>
>> pk1@localhost - ~/Downloads
>> 17:02:13 :) ⚡ grep "unique=" client-dump1.txt
>> unique=3160758373
>> unique=2073075682
>> unique=1455047665
>> unique=0
>>
>> pk1@localhost - ~/Downloads
>> 17:02:21 :) ⚡ grep "unique=" client-dump-0.txt
>> unique=3160758373
>> unique=2073075682
>> unique=1455047665
>> unique=0
>>
>> I will be debugging a bit more and post my findings.
>>
> +Raghavendra G
>
> All the stubs are hung in write-behind. I checked that the statedumps
> don't show any writes in progress. Maybe, because of some race, the flush fop
> is not resumed after the write calls complete? It seems this issue happens
> only when io-threads is enabled on the client.
>
> Pranith
>
>
>> Pranith
>> On 01/28/2016 03:18 PM, baul jianguo wrote:
>>
>>> The client glusterfs gdb info; the main thread id is 70800.
>>> In the top output, thread 70800 shows CPU time 1263:30 and thread 70810
>>> shows 1321:10; the other threads' times are much smaller.
>>> (gdb) thread apply all bt
>>>
>>>
>>>
>>> Thread 9 (Thread 0x7fc21acaf700 (LWP 70801)):
>>>
>>> #0  0x7fc21cc0c535 in sigwait () from /lib64/libpthread.so.0
>>>
>>> #1  0x0040539b in glusterfs_sigwaiter (arg=<value optimized out>) at glusterfsd.c:1653
>>>
>>> #2  0x7fc21cc04a51 in start_thread () from /lib64/libpthread.so.0
>>>
>>> #3  0x7fc21c56e93d in clone () from /lib64/libc.so.6
>>>
>>>
>>>
>>> Thread 8 (Thread 0x7fc21a2ae700 (LWP 70802)):
>>>
>>> #0  0x7fc21cc08a0e in pthread_cond_timedwait@@GLIBC_2.3.2 () from
>>> /lib64/libpthread.so.0
>>>
>>> #1  0x7fc21ded02bf in syncenv_task (proc=0x121ee60) at syncop.c:493
>>>
>>> #2  0x7fc21ded6300 in syncenv_processor (thdata=0x121ee60) at
>>> syncop.c:571
>>>
>>> #3  0x7fc21cc04a51 in start_thread () from /lib64/libpthread.so.0
>>>
>>> #4  0x7fc21c56e93d in clone () from /lib64/libc.so.6
>>>
>>>
>>>
>>> Thread 7 (Thread 0x7fc2198ad700 (LWP 70803)):
>>>
>>> #0  0x7fc21cc08a0e in pthread_cond_timedwait@@GLIBC_2.3.2 () from
>>> /lib64/libpthread.so.0
>>>
>>> #1  0x7fc21ded02bf in syncenv_task (proc=0x121f220) at syncop.c:493
>>>
>>> #2  0x7fc21ded6300 in syncenv_processor (thdata=0x121f220) at
>>> syncop.c:571
>>>
>>> #3  0x7fc21cc04a51 in start_thread () from /lib64/libpthread.so.0
>>>
>>> #4  0x7fc21c56e93d in clone () from /lib64/libc.so.6
>>>
>>>
>>>
>>> Thread 6 (Thread 0x7fc21767d700 (LWP 70805)):
>>>
>>> #0  0x7fc21cc0bfbd in nanosleep () from /lib64/libpthread.so.0
>>>
>>> #1  0x7fc21deb16bc in gf_timer_proc (ctx=0x11f2010) at timer.c:170
>>>
>>> #2  0x7fc21cc04a51 in start_thread () from /lib64/libpthread.so.0
>>>
>>> #3  0x7fc21c56e93d in clone () from /lib64/libc.so.6
>>>
>>>
>>>
>>> Thread 5 (Thread 0x7fc20fb1e700 (LWP 70810)):
>>>
>>> #0  0x7fc21c566987 in readv () from /lib64/libc.so.6
>>>
>>> #1  0x7fc21accbc55 in fuse_thread_proc (data=0x120f450) at
>>> fuse-bridge.c:4752
>>>
>>> #2  0x7fc21cc04a51 in start_thread () from /lib64/libpthread.so.0
>>>
>>> #3  0x7fc21c56e93d in clone () from /lib64/libc.so.6   (most CPU time)
>>>
>>>
>>>
>>> Thread 4 (Thread 0x7fc20f11d700 (LWP 70811)):   (a bit less CPU time)
>>>
>>> #0  0x7fc21cc0b7dd in read () from /lib64/libpthread.so.0
>>>
>>> #1  0x7fc21acc0e73 in read (data=<value optimized out>) at
>>> /usr/include/bits/unistd.h:45
>>>
>>> #2  notify_kernel_loop (data=<value optimized out>) at fuse-bridge.c:3786
>>>
>>> #3  0x7fc21cc04a51 in start_thread () from /lib64/libpthread.so.0
>>>
>>> #4  0x7fc21c56e93d in clone () from /lib64/libc.so.6
>>>
>>>
>>>
>>> Thread 3 (Thread 0x7fc1b16fe700 (LWP 206224)):
>>>
>>> ---Type <return> to continue, or q <return> to quit---
>>>
>>> #0  0x7fc21cc08a0e in pthread_cond_timedwait@@GLIBC_2.3.2 () from
>>> /lib64/libpthread.so.0
>&

Re: [Gluster-devel] GlusterFS FUSE client leaks summary — part I

2016-02-04 Thread Raghavendra G
>>>>> gfapi: Fix inode nlookup counts
>>>>> inode: Retire the inodes from the lru list in inode_table_destroy
>>>>> upcall: free the xdr* allocations
>>>>> ===
>>>>>
>>>>> With those patches we got API leaks fixed (I hope, brief tests show
>>>>> that) and got rid of "kernel notifier loop terminated" message.
>>>>> Nevertheless, FUSE client still leaks.
>>>>>
>>>>> I have several test volumes with several million of small files
>>>>> (100K…2M in average). I do 2 types of FUSE client testing:
>>>>>
>>>>> 1) find /mnt/volume -type d
>>>>> 2) rsync -av -H /mnt/source_volume/* /mnt/target_volume/
>>>>>
>>>>> And most up-to-date results are shown below:
>>>>>
>>>>> === find /mnt/volume -type d ===
>>>>> Memory consumption: ~4G
>>>>> Statedump: https://gist.github.com/10cde83c63f1b4f1dd7a
>>>>> Valgrind: https://gist.github.com/097afb01ebb2c5e9e78d
>>>>> I guess, fuse-bridge/fuse-resolve related.
>>>>>
>>>>> === rsync -av -H /mnt/source_volume/* /mnt/target_volume/ ===
>>>>> Memory consumption: ~3.3...4G
>>>>> Statedump (target volume): https://gist.github.com/31e43110eaa4da663435
>>>>> Valgrind (target volume): https://gist.github.com/f8e0151a6878cacc9b1a
>>>>> I guess, DHT-related.
>>>>>
>>>>> Give me more patches to test :).
>>>>>
>>>>> ___
>>>>> Gluster-devel mailing list
>>>>> Gluster-devel@gluster.org
>>>>> http://www.gluster.org/mailman/listinfo/gluster-devel
>>>>
>>>>
>>>> ___
>>>> Gluster-devel mailing list
>>>> Gluster-devel@gluster.org
>>>> http://www.gluster.org/mailman/listinfo/gluster-devel
>>>>
>>>> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel
>



-- 
Raghavendra G
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Feature: Automagic lock-revocation for features/locks xlator (v3.7.x)

2016-01-26 Thread Raghavendra G
-groot-locks: MONKEY
> LOCKING
> > (forcing stuck lock)!" ... in the logs indicating a request has been
> > dropped.
> >
> > 2. Lock revocation - Once enabled, this feature will revoke a
> *contended*lock
> > (i.e. if nobody else asks for the lock, we will not revoke it) either by
> the
> > amount of time the lock has been held, how many other lock requests are
> > waiting on the lock to be freed, or some combination of both. Clients
> which
> > are losing their locks will be notified by receiving EAGAIN (send back to
> > their callback function).
> >
> > The feature is activated via these options:
> > features.locks-revocation-secs <integer; 0 to disable>
> > features.locks-revocation-clear-all [on/off]
> > features.locks-revocation-max-blocked <integer>
> >
> > Recommended settings are: 1800 seconds for a time-based timeout (give
> > clients the benefit of the doubt); choosing a max-blocked value requires
> > some experimentation depending on your workload, but generally values of
> > hundreds to low thousands work (it's normal for many tens of locks to be
> > taken out when files are being written at high throughput).
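
For a rough feel of the per-lock decision these options drive, here is a
small sketch (illustrative only; the types, field names and sample
thresholds below are invented for the example and are not the actual patch):

/* Sketch: decide whether a *contended* lock should be revoked, based on how
 * long it has been held and how many requests are blocked behind it. */
#include <stdbool.h>
#include <stdio.h>
#include <time.h>

typedef struct {
        unsigned int revocation_secs;        /* 0 disables time-based revocation */
        unsigned int revocation_max_blocked; /* 0 disables count-based revocation */
        bool         revocation_clear_all;   /* clear-all behaviour not modelled here */
} revoke_opts_t;

typedef struct {
        time_t       granted_at;     /* when the lock was granted */
        unsigned int blocked_count;  /* requests currently waiting on it */
} held_lock_t;

static bool
should_revoke (const held_lock_t *lk, const revoke_opts_t *opts, time_t now)
{
        if (opts->revocation_secs != 0 &&
            (now - lk->granted_at) >= (time_t)opts->revocation_secs)
                return true;

        if (opts->revocation_max_blocked != 0 &&
            lk->blocked_count >= opts->revocation_max_blocked)
                return true;

        return false;  /* within limits: leave the holder alone */
}

int
main (void)
{
        revoke_opts_t opts = { .revocation_secs = 1800,
                               .revocation_max_blocked = 500,
                               .revocation_clear_all = false };
        held_lock_t   lk   = { .granted_at = time (NULL) - 2000,
                               .blocked_count = 3 };

        printf ("revoke? %s\n",
                should_revoke (&lk, &opts, time (NULL)) ? "yes" : "no");
        return 0;
}

On revocation the losing holder would see EAGAIN in its callback, as
described above.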
> >
> > I really like this feature. One question though, self-heal, rebalance
> domain
> > locks are active until self-heal/rebalance is complete which can take
> more
> > than 30 minutes if the files are in TBs. I will try to see what we can
> do to
> > handle these without increasing the revocation-secs too much. May be we
> can
> > come up with per domain revocation timeouts. Comments are welcome.
> >
> > Pranith
> >
> >
> >
> >
> > =
> >
> > The patch supplied will patch clean the the v3.7.6 release tag, and
> probably
> > to any 3.7.x release & master (posix locks xlator is rarely touched).
> >
> > Richard
> >
> >
> >
> >
> >
> > ___
> > Gluster-devel mailing list Gluster-devel@gluster.org
> > http://www.gluster.org/mailman/listinfo/gluster-devel
> >
> >
> > ___
> > Gluster-devel mailing list
> > Gluster-devel@gluster.org
> > http://www.gluster.org/mailman/listinfo/gluster-devel
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://www.gluster.org/mailman/listinfo/gluster-devel
>



-- 
Raghavendra G
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel
