Re: [Gluster-devel] Reg. multi thread epoll NetBSD failures

2015-01-23 Thread Anand Avati
Since all of the epoll code and its multithreading is under ifdefs, NetBSD
should just continue working with single-threaded poll, unaffected by the
patch. If NetBSD's kqueue supports one-shot event delivery and edge-triggered
notification, we could have an equivalent implementation on NetBSD too. Even
if kqueue does not support these features, it might well be worth
implementing a single-threaded, level-triggered kqueue-based event handler,
and promote NetBSD beyond plain vanilla poll.
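
For reference, here is a minimal sketch of what a single-threaded,
level-triggered kqueue event loop could look like on NetBSD. This is plain
kqueue usage, not the GlusterFS event framework; the handler typedef and
function names are illustrative only:

#include <sys/types.h>
#include <sys/event.h>
#include <sys/time.h>
#include <unistd.h>

/* Illustrative handler type -- not the actual gluster event_handler_t. */
typedef void (*fd_handler_t) (int fd, void *data);

static int
kqueue_loop (int fd, fd_handler_t handler, void *data)
{
        struct kevent change, events[64];
        int           kq, n, i;

        kq = kqueue ();
        if (kq < 0)
                return -1;

        /* Level-triggered read notification: no EV_CLEAR (edge-triggered)
         * and no EV_ONESHOT (single-shot), so the event stays armed and is
         * re-reported while data is pending -- poll()-like behavior. */
        EV_SET (&change, fd, EVFILT_READ, EV_ADD | EV_ENABLE, 0, 0, NULL);
        if (kevent (kq, &change, 1, NULL, 0, NULL) < 0) {
                close (kq);
                return -1;
        }

        for (;;) {
                n = kevent (kq, NULL, 0, events, 64, NULL);
                if (n < 0)
                        break;
                for (i = 0; i < n; i++)
                        handler ((int) events[i].ident, data);
        }

        close (kq);
        return 0;
}

EV_ONESHOT and EV_CLEAR are the flags that would map to epoll's one-shot and
edge-triggered modes if a multi-threaded variant were attempted later.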

Thanks

On Fri, Jan 23, 2015, 19:29 Emmanuel Dreyfus m...@netbsd.org wrote:

 Ben England bengl...@redhat.com wrote:

  NetBSD may be useful for exposing race conditions, but it's not clear to
  me that all of these race conditions would happen in a non-NetBSD
  environment,

 Many times, NetBSD has exhibited cases where unspecified, Linux-specific
 behaviors were assumed. I recall my very first finding in GlusterFS: Linux
 lets you use a mutex without calling pthread_mutex_init() first. That broke
 on NetBSD, as expected.

 Fixing this kind of issue is valuable beyond NetBSD support, since you
 cannot take for granted that unspecified behavior will not change in the
 future.

 That said, I am fine if you let NetBSD run without fixing the underlying
 issue, but you have been warned :-)

 --
 Emmanuel Dreyfus
 http://hcpnet.free.fr/pubz
 m...@netbsd.org
 ___
 Gluster-devel mailing list
 Gluster-devel@gluster.org
 http://www.gluster.org/mailman/listinfo/gluster-devel

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] RDMA: Patch to make use of pre registered memory

2015-01-23 Thread Anand Avati
A couple of comments -

1. rdma can register init/fini functions (via pointers) into iobuf_pool.
There is absolutely no need to introduce an rdma dependency into libglusterfs
(a sketch of this idea follows below).

2. It might be a good idea to take a holistic approach towards zero-copy
with libgfapi + RDMA, rather than the narrow goal of using pre-registered
memory with RDMA. Do keep the option open for RDMA'ing the user's memory
pointer (passed to glfs_write()) as well.

3. It is better to make io-cache and write-behind use a new iobuf_pool for
caching purposes. There could be an optimization where they just do
iobuf/iobref_ref() when safe - e.g. io-cache can cache with iobuf_ref() when
the transport is socket, or write-behind can unwind by holding onto data with
iobuf_ref() when the topmost layer is FUSE or server (i.e. no gfapi).

4. The next step for zero-copy would be the introduction of a new fop,
readto(), where the destination pointer is passed in by the caller (gfapi
being the primary use case). In this situation RDMA ought to register that
memory if necessary and request the server to RDMA_WRITE into the pointer
provided by the gfapi caller.

Points 2 and 4 require changes in the code you would be modifying if you
were to just do pre-registered memory, so it is better to plan for the bigger
picture upfront. Zero-copy can improve performance (especially reads) in the
qemu use case.
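
To make point 1 concrete, here is a rough sketch of how the rdma transport
could hand arena init/fini callbacks to the iobuf pool, so registration
happens at arena creation without libglusterfs linking against the verbs
library. The hook structure and names are assumptions for illustration, not
the actual libglusterfs/rdma API; ibv_reg_mr()/ibv_dereg_mr() are the real
libibverbs calls:

#include <stddef.h>
#include <infiniband/verbs.h>

/* Hypothetical hook table the rdma transport would install on the iobuf
 * pool; libglusterfs only ever sees opaque function pointers. */
typedef void *(*arena_init_fn) (void *mem, size_t size, void *opaque);
typedef void  (*arena_fini_fn) (void *mem, void *cookie, void *opaque);

struct iobuf_pool_hooks {
        arena_init_fn init;    /* called when an arena is allocated   */
        arena_fini_fn fini;    /* called when an arena is destroyed   */
        void         *opaque;  /* rdma private data, e.g. the ibv_pd  */
};

/* rdma side: register the whole arena with the HCA in one shot and return
 * the memory region as a per-arena cookie. */
static void *
rdma_arena_init (void *mem, size_t size, void *opaque)
{
        struct ibv_pd *pd = opaque;

        return ibv_reg_mr (pd, mem, size,
                           IBV_ACCESS_LOCAL_WRITE |
                           IBV_ACCESS_REMOTE_READ |
                           IBV_ACCESS_REMOTE_WRITE);
}

static void
rdma_arena_fini (void *mem, void *cookie, void *opaque)
{
        (void) mem;
        (void) opaque;

        if (cookie)
                ibv_dereg_mr ((struct ibv_mr *) cookie);
}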

Thanks
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] ctime weirdness

2015-01-14 Thread Anand Avati
I don't think the problem is with the handling of SETATTR in either NetBSD
or Linux. I am guessing NetBSD FUSE is _using_ SETATTR to update atime upon
open? Linux FUSE just leaves it to the backend filesystem to update atime.
Whenever there is a SETATTR fop, ctime is _always_ bumped.
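
As a quick illustration of that last point: POSIX requires any timestamp
update through the setattr path (utimensat() and friends) to also mark ctime
for update, which is what the backend filesystem does under a SETATTR fop. A
small standalone example (not GlusterFS code):

#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <time.h>

int
main (void)
{
        struct stat st;

        /* Touch only atime on "a"; the kernel still bumps ctime because a
         * metadata change happened. */
        struct timespec times[2] = {
                { .tv_sec = 0, .tv_nsec = UTIME_NOW  },  /* atime       */
                { .tv_sec = 0, .tv_nsec = UTIME_OMIT },  /* leave mtime */
        };

        if (utimensat (AT_FDCWD, "a", times, 0) < 0) {
                perror ("utimensat");
                return 1;
        }

        stat ("a", &st);
        printf ("ctime after atime-only update: %ld\n", (long) st.st_ctime);
        return 0;
}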

Thanks

On Mon Jan 12 2015 at 5:03:51 PM Emmanuel Dreyfus m...@netbsd.org wrote:

 Hello

 Here is a NetBSD behavior that looks pathological:
 (it happens on a FUSE mount but not a native mount):

 # touch a
 # stat -x a
   File: a
   Size: 0         FileType: Regular File
   Mode: (0644/-rw-r--r--)   Uid: (0/root)   Gid: (0/wheel)
 Device: 203,7   Inode: 13726586830943880794   Links: 1
 Access: Tue Jan 13 01:57:25 2015
 Modify: Tue Jan 13 01:57:25 2015
 Change: Tue Jan 13 01:57:25 2015
 # cat a > /dev/null
 # stat -x a
   File: a
   Size: 0         FileType: Regular File
   Mode: (0644/-rw-r--r--)   Uid: (0/root)   Gid: (0/wheel)
 Device: 203,7   Inode: 13726586830943880794   Links: 1
 Access: Tue Jan 13 01:57:31 2015
 Modify: Tue Jan 13 01:57:25 2015
 Change: Tue Jan 13 01:57:31 2015

 The NetBSD FUSE implementation does not send ctime with SETATTR. Looking at
 the glusterfs FUSE xlator, I see the setattr code does not handle ctime
 either.

 How does that happen? What does NetBSD's SETATTR do wrong?

 --
 Emmanuel Dreyfus
 http://hcpnet.free.fr/pubz
 m...@netbsd.org
 ___
 Gluster-devel mailing list
 Gluster-devel@gluster.org
 http://www.gluster.org/mailman/listinfo/gluster-devel

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Suggestion needed to make use of iobuf_pool as rdma buffer.

2015-01-14 Thread Anand Avati
On Tue Jan 13 2015 at 11:57:53 PM Mohammed Rafi K C rkavu...@redhat.com
wrote:


 On 01/14/2015 12:11 AM, Anand Avati wrote:

 3) Why not have a separate iobuf pool for RDMA?


 Since every fop uses the default iobuf_pool, if we go with another
 iobuf_pool dedicated to rdma, we would need to copy the buffer from the
 default pool to the rdma one, unless we intelligently allocate buffers based
 on the transport we are going to use. That is an extra level of copying in
 the I/O path.


Not sure what you mean by that. Not every fop uses the default iobuf_pool;
only readv() and writev() do. If you really want to save on memory
registration cost, your first target should be the header buffers (which are
used in every fop, and are currently valloc()'ed and ibv_reg_mr()'d per
call). Making the headers use an iobuf pool where every arena is registered
at arena creation (and deregistered at destruction) will get you the highest
overhead savings.

Coming to file data iobufs: today iobuf pools are used in a mixed way,
i.e. they hold both data being actively transferred/under IO, and data which
is being held long term (cached by io-cache). io-cache just does an
iobuf_ref() and holds on to the data. This avoids memory copies in the
io-cache layer. However, that may be something we want to reconsider:
io-cache could use its own iobuf pool into which data is copied from the
transfer iobuf (which is pre-registered with RDMA in bulk, etc.)

Thanks






 On Tue Jan 13 2015 at 6:30:09 AM Mohammed Rafi K C rkavu...@redhat.com
 wrote:

 Hi All,

 When using the RDMA protocol, we need to register the buffer which is going
 to be sent through rdma with the rdma device. In fact, it is a costly
 operation, and a performance killer if it happens in the I/O path. So our
 current plan is to register the pre-allocated iobuf_arenas from iobuf_pool
 with rdma when rdma is getting initialized. The problem comes when all the
 iobufs are exhausted: then we need to dynamically allocate new arenas from
 the libglusterfs module. Since they are created in libglusterfs, we can't
 make a call to rdma from libglusterfs, so we are forced to register each of
 the iobufs from the newly created arenas with rdma in the I/O path. If
 io-cache is turned on in the client stack, then all the pre-registered
 arenas will be used by io-cache as cache buffers, so we have to do the
 registration in rdma for each I/O call for every iobuf; eventually we cannot
 make use of the pre-registered arenas at all.

 To address the issue, we have two approaches in mind:

  1) Register each dynamically created buffer in iobuf by bringing the
 transport layer together with libglusterfs.

  2) Create a separate buffer for caching and offload the data from the
 read response to the cache buffer in the background.

 If we could make use of pre-registered memory for every rdma call, then we
 would see approximately a 20% improvement for writes and a 25% improvement
 for reads.

 Please give your thoughts on how to address the issue.

 Thanks & Regards
 Rafi KC
 ___
 Gluster-devel mailing list
 Gluster-devel@gluster.org
 http://www.gluster.org/mailman/listinfo/gluster-devel



___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Suggestion needed to make use of iobuf_pool as rdma buffer.

2015-01-13 Thread Anand Avati
3) Why not have a separate iobuf pool for RDMA?

On Tue Jan 13 2015 at 6:30:09 AM Mohammed Rafi K C rkavu...@redhat.com
wrote:

 Hi All,

 When using the RDMA protocol, we need to register the buffer which is going
 to be sent through rdma with the rdma device. In fact, it is a costly
 operation, and a performance killer if it happens in the I/O path. So our
 current plan is to register the pre-allocated iobuf_arenas from iobuf_pool
 with rdma when rdma is getting initialized. The problem comes when all the
 iobufs are exhausted: then we need to dynamically allocate new arenas from
 the libglusterfs module. Since they are created in libglusterfs, we can't
 make a call to rdma from libglusterfs, so we are forced to register each of
 the iobufs from the newly created arenas with rdma in the I/O path. If
 io-cache is turned on in the client stack, then all the pre-registered
 arenas will be used by io-cache as cache buffers, so we have to do the
 registration in rdma for each I/O call for every iobuf; eventually we cannot
 make use of the pre-registered arenas at all.

 To address the issue, we have two approaches in mind:

  1) Register each dynamically created buffer in iobuf by bringing the
 transport layer together with libglusterfs.

  2) Create a separate buffer for caching and offload the data from the
 read response to the cache buffer in the background.

 If we could make use of pre-registered memory for every rdma call, then we
 would see approximately a 20% improvement for writes and a 25% improvement
 for reads.

 Please give your thoughts on how to address the issue.

 Thanks & Regards
 Rafi KC
 ___
 Gluster-devel mailing list
 Gluster-devel@gluster.org
 http://www.gluster.org/mailman/listinfo/gluster-devel

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Order of server-side xlators

2015-01-12 Thread Anand Avati
Valid questions. access-control had to be as close to posix as possible in
its first implementation (to minimize the cost of the STAT calls it
originated), but since the introduction of posix-acl there are no extra STAT
calls, and given the later introduction of quota, it certainly makes sense
to have access-control/posix-acl closer to protocol/server. Some general
constraints to consider while deciding the order:

- keep io-stats as close to protocol/server as possible
- keep io-threads as close to storage/posix as possible
- any xlator which performs direct filesystem operations (with system
calls, not STACK_WIND) is better placed between io-threads and posix, to
keep the epoll thread non-blocking (e.g. changelog)

Thanks

On Mon Jan 12 2015 at 5:02:59 AM Xavier Hernandez xhernan...@datalab.es
wrote:

 Hi,

 looking at the server-side xlator stack created on a generic volume with
 quota enabled, I see the following xlators:

  posix
  changelog
  access-control
  locks
  io-threads
  barrier
  index
  marker
  quota
  io-stats
  server

 The question is why access-control and quota are in this relative order.
 It would seem more logical to me for them to be in the reverse order,
 because if an operation is not permitted, it is irrelevant whether there is
 enough quota to do it or not: gluster should return EPERM or EACCES instead
 of EDQUOT.

 Also, index and marker can operate on requests that can later be denied
 by access-control, having to undo the work done in that case. Wouldn't
 it be better to run index and marker after having validated all
 permissions of the request?

 I'm not very familiar with these xlators, so maybe I'm missing an
 important detail.

 Thanks,

 Xavi
 ___
 Gluster-devel mailing list
 Gluster-devel@gluster.org
 http://www.gluster.org/mailman/listinfo/gluster-devel

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] mandatory lock

2015-01-08 Thread Anand Avati
Note that the mandatory locks available in the locks translator are just the
mandatory extensions for posix locks - at least one of the apps must be
using locks to begin with. What Harmeet is asking for is something
different - automatic exclusive access to edit files, i.e. if one app has
opened a file for editing, other apps which attempt an open must either
fail (EBUSY) or block till the first app closes. We would need to treat
open(O_RDONLY) as a read lock and open(O_RDWR|O_WRONLY) as a write lock
request (essentially an auto-applied oplock). This is something gluster
does not yet have.
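
A minimal sketch of the flag-to-lock-type mapping such an auto-applied oplock
would start from (plain POSIX flag handling, not existing GlusterFS code; the
blocking/EBUSY policy itself would live in the locks translator):

#include <fcntl.h>

/* Map the open() access mode to the implied lock type: readers share,
 * writers are exclusive. */
static short
implied_lock_type (int open_flags)
{
        if ((open_flags & O_ACCMODE) == O_RDONLY)
                return F_RDLCK;   /* read lock: many concurrent readers  */

        return F_WRLCK;           /* O_WRONLY or O_RDWR: exclusive write */
}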

Thanks

On Thu Jan 08 2015 at 2:49:29 AM Raghavendra Gowdappa rgowd...@redhat.com
wrote:



 - Original Message -
  From: Raghavendra Gowdappa rgowd...@redhat.com
  To: Harmeet Kalsi kharm...@hotmail.com
  Cc: Gluster-devel@gluster.org gluster-devel@gluster.org
  Sent: Thursday, January 8, 2015 4:12:44 PM
  Subject: Re: [Gluster-devel] mandatory lock
 
 
 
  - Original Message -
   From: Harmeet Kalsi kharm...@hotmail.com
   To: Gluster-devel@gluster.org gluster-devel@gluster.org
   Sent: Wednesday, January 7, 2015 5:55:43 PM
   Subject: [Gluster-devel] mandatory lock
  
   Dear All.
   Would it be possible for someone to guide me in the right direction to
   enable
   the mandatory lock on a volume please.
   At the moment two clients can edit the same file at the same time
 which is
   causing issues.
 
  I see code related to mandatory locking in posix-locks xlator (pl_writev,
  pl_truncate etc). To enable it you've to set option mandatory-locks
 yes in
  posix-locks xlator loaded on bricks
  (/var/lib/glusterd/vols/volname/*.vol). We've no way to set this
 option
  through gluster cli. Also, I am not sure to what extent this feature is
  tested/used till now. You can try it out and please let us know whether
 it
  worked for you :).

 If mandatory locking doesn't work for you, can you modify your application
 to use advisory locking, since advisory locking is tested well and being
 used for long time?

 
   Many thanks in advance
   Kind Regards
  
   ___
   Gluster-devel mailing list
   Gluster-devel@gluster.org
   http://www.gluster.org/mailman/listinfo/gluster-devel
  
  ___
  Gluster-devel mailing list
  Gluster-devel@gluster.org
  http://www.gluster.org/mailman/listinfo/gluster-devel
 
 ___
 Gluster-devel mailing list
 Gluster-devel@gluster.org
 http://www.gluster.org/mailman/listinfo/gluster-devel

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] mandatory lock

2015-01-08 Thread Anand Avati
Or use an rsync style .filename.rand tempfile, write the new version of the
file, and rename that to filename.

On Thu Jan 08 2015 at 12:21:18 PM Anand Avati av...@gluster.org wrote:

 Ideally you want the clients to coordinate among themselves. Note that
 this feature cannot be implemented foolproof (theoretically) in a system
 that supports NFSv3.

 On Thu Jan 08 2015 at 8:57:48 AM Harmeet Kalsi kharm...@hotmail.com
 wrote:

 Hi Anand, that was spot on. Any idea if there will be development on this
 side in near future as multiple clients writing to the same file can cause
 issues.

 Regards

 --
 From: av...@gluster.org
 Date: Thu, 8 Jan 2015 16:07:50 +

 Subject: Re: [Gluster-devel] mandatory lock
 To: rgowd...@redhat.com; kharm...@hotmail.com
 CC: gluster-devel@gluster.org


 Note that the mandatory locks available in the locks translator are just
 the mandatory extensions for posix locks - at least one of the apps must be
 using locks to begin with. What Harmeet is asking for is something
 different - automatic exclusive access to edit files, i.e. if one app has
 opened a file for editing, other apps which attempt an open must either
 fail (EBUSY) or block till the first app closes. We would need to treat
 open(O_RDONLY) as a read lock and open(O_RDWR|O_WRONLY) as a write lock
 request (essentially an auto-applied oplock). This is something gluster
 does not yet have.

 Thanks

 On Thu Jan 08 2015 at 2:49:29 AM Raghavendra Gowdappa 
 rgowd...@redhat.com wrote:



 - Original Message -
  From: Raghavendra Gowdappa rgowd...@redhat.com
  To: Harmeet Kalsi kharm...@hotmail.com
  Cc: Gluster-devel@gluster.org gluster-devel@gluster.org
  Sent: Thursday, January 8, 2015 4:12:44 PM
  Subject: Re: [Gluster-devel] mandatory lock
 
 
 
  - Original Message -
   From: Harmeet Kalsi kharm...@hotmail.com
   To: Gluster-devel@gluster.org gluster-devel@gluster.org
   Sent: Wednesday, January 7, 2015 5:55:43 PM
   Subject: [Gluster-devel] mandatory lock
  
   Dear All.
   Would it be possible for someone to guide me in the right direction to
   enable
   the mandatory lock on a volume please.
   At the moment two clients can edit the same file at the same time
 which is
   causing issues.
 
  I see code related to mandatory locking in posix-locks xlator
 (pl_writev,
  pl_truncate etc). To enable it you've to set option mandatory-locks
 yes in
  posix-locks xlator loaded on bricks
  (/var/lib/glusterd/vols/volname/*.vol). We've no way to set this
 option
  through gluster cli. Also, I am not sure to what extent this feature is
  tested/used till now. You can try it out and please let us know whether
 it
  worked for you :).

 If mandatory locking doesn't work for you, can you modify your
 application to use advisory locking, since advisory locking is tested well
 and being used for long time?

 
   Many thanks in advance
   Kind Regards
  
   ___
   Gluster-devel mailing list
   Gluster-devel@gluster.org
   http://www.gluster.org/mailman/listinfo/gluster-devel
  
  ___
  Gluster-devel mailing list
  Gluster-devel@gluster.org
  http://www.gluster.org/mailman/listinfo/gluster-devel
 
 ___
 Gluster-devel mailing list
 Gluster-devel@gluster.org
 http://www.gluster.org/mailman/listinfo/gluster-devel


___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Appending time to snap name in USS

2015-01-08 Thread Anand Avati
It would be convenient if the time were appended to the snap name on the fly
(when receiving the list of snap names from glusterd?) so that the timezone
could be applied dynamically (which is what users would expect).

Thanks

On Thu Jan 08 2015 at 3:21:15 AM Poornima Gurusiddaiah pguru...@redhat.com
wrote:

 Hi,

 Windows has a feature called shadow copy. This is widely used by all
 windows users to view the previous versions of a file.
 For shadow copy to work with glusterfs backend, the problem was that
 the clients expect snapshots to contain some format
 of time in their name.

 After evaluating the possible ways (asking the user to create snapshots
 with some format of time in the name, and having a rename-snapshot command
 for existing snapshots), the following method seemed simpler.

 If USS is enabled, then the creation time of the snapshot is appended to
 the snapname as listed in the .snaps directory. The actual name of the
 snapshot is left unmodified, i.e. the snapshot list/info/restore etc.
 commands work with the original snapname. The patch for the same can be
 found at http://review.gluster.org/#/c/9371/

 The impact is that users would see snapnames in the .snaps folder that
 differ from what they created. Also, the current patch does not take care
 of the scenario where the snapname already has time in its name.

 Eg:
 Without this patch:
 drwxr-xr-x 4 root root 110 Dec 26 04:14 snap1
 drwxr-xr-x 4 root root 110 Dec 26 04:14 snap2

 With this patch
 drwxr-xr-x 4 root root 110 Dec 26 04:14 snap1@GMT-2014.12.30-05.07.50
 drwxr-xr-x 4 root root 110 Dec 26 04:14 snap2@GMT-2014.12.30-23.49.02

 Please let me know if you have any suggestions or concerns on the same.

 Thanks,
 Poornima
 ___
 Gluster-devel mailing list
 Gluster-devel@gluster.org
 http://www.gluster.org/mailman/listinfo/gluster-devel

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] mandatory lock

2015-01-08 Thread Anand Avati
Ideally you want the clients to coordinate among themselves. Note that this
feature cannot be implemented foolproof (theoretically) in a system that
supports NFSv3.

On Thu Jan 08 2015 at 8:57:48 AM Harmeet Kalsi kharm...@hotmail.com wrote:

 Hi Anand, that was spot on. Any idea if there will be development on this
 side in near future as multiple clients writing to the same file can cause
 issues.

 Regards

 --
 From: av...@gluster.org
 Date: Thu, 8 Jan 2015 16:07:50 +

 Subject: Re: [Gluster-devel] mandatory lock
 To: rgowd...@redhat.com; kharm...@hotmail.com
 CC: gluster-devel@gluster.org


 Note that the mandatory locks available in the locks translator are just
 the mandatory extensions for posix locks - at least one of the apps must be
 using locks to begin with. What Harmeet is asking for is something
 different - automatic exclusive access to edit files, i.e. if one app has
 opened a file for editing, other apps which attempt an open must either
 fail (EBUSY) or block till the first app closes. We would need to treat
 open(O_RDONLY) as a read lock and open(O_RDWR|O_WRONLY) as a write lock
 request (essentially an auto-applied oplock). This is something gluster
 does not yet have.

 Thanks

 On Thu Jan 08 2015 at 2:49:29 AM Raghavendra Gowdappa rgowd...@redhat.com
 wrote:



 - Original Message -
  From: Raghavendra Gowdappa rgowd...@redhat.com
  To: Harmeet Kalsi kharm...@hotmail.com
  Cc: Gluster-devel@gluster.org gluster-devel@gluster.org
  Sent: Thursday, January 8, 2015 4:12:44 PM
  Subject: Re: [Gluster-devel] mandatory lock
 
 
 
  - Original Message -
   From: Harmeet Kalsi kharm...@hotmail.com
   To: Gluster-devel@gluster.org gluster-devel@gluster.org
   Sent: Wednesday, January 7, 2015 5:55:43 PM
   Subject: [Gluster-devel] mandatory lock
  
   Dear All.
   Would it be possible for someone to guide me in the right direction to
   enable
   the mandatory lock on a volume please.
   At the moment two clients can edit the same file at the same time
 which is
   causing issues.
 
  I see code related to mandatory locking in posix-locks xlator (pl_writev,
  pl_truncate etc). To enable it you've to set option mandatory-locks
 yes in
  posix-locks xlator loaded on bricks
  (/var/lib/glusterd/vols/volname/*.vol). We've no way to set this
 option
  through gluster cli. Also, I am not sure to what extent this feature is
  tested/used till now. You can try it out and please let us know whether
 it
  worked for you :).

 If mandatory locking doesn't work for you, can you modify your application
 to use advisory locking, since advisory locking is tested well and being
 used for long time?

 
   Many thanks in advance
   Kind Regards
  
   ___
   Gluster-devel mailing list
   Gluster-devel@gluster.org
   http://www.gluster.org/mailman/listinfo/gluster-devel
  
  ___
  Gluster-devel mailing list
  Gluster-devel@gluster.org
  http://www.gluster.org/mailman/listinfo/gluster-devel
 
 ___
 Gluster-devel mailing list
 Gluster-devel@gluster.org
 http://www.gluster.org/mailman/listinfo/gluster-devel


___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Readdir d_off encoding

2014-12-23 Thread Anand Avati
Please review http://review.gluster.org/9332/, as it undoes the
introduction of itransform on d_off in AFR. This does not solve
DHT-over-DHT or other future use cases, but at least fixes the regression
in 3.6.x.

Thanks

On Tue Dec 23 2014 at 10:34:41 AM Anand Avati av...@gluster.org wrote:

 Using GFID does not work for d_off. The GFID represents an inode, and a
 d_off represents a directory entry. Therefore using GFID as an alternative
 to d_off breaks down when you have hardlinks to the same inode in a single
 directory.

 On Tue Dec 23 2014 at 2:20:34 AM Xavier Hernandez xhernan...@datalab.es
 wrote:

 On 12/22/2014 06:41 PM, Jeff Darcy wrote:
  An alternative would be to convert directories into regular files from
  the brick point of view.
 
  The benefits of this would be:
 
  * d_off would be controlled by gluster, so all bricks would have the
  same d_off and order. No need to use any d_off mapping or
 transformation.
 
  I don't think a full-out change from real directories to virtual ones is
  in the cards, but a variant of this idea might be worth exploring
 further.
  If we had a *server side* component to map between on-disk d_off values
  and those we present to clients, then it might be able to do a better
 job
  than the local FS of ensuring uniqueness within the bits (e.g. 48 of
 them)
  that are left over after we subtract some for a brick ID.  This could be
  enough to make the bit-stealing approach (on the client) viable.  There
  are probably some issues with failing over between replicas, which
 should
  have the same files but might not have assigned the same internal d_off
  values, but those issues might be avoidable if the d_off values are
  deterministic with respect to GFIDs.

 Having a server-side xlator seems a better approximation, however I see
 some problems that need to be solved:

 The mapper should work on the fly (i.e. it should map between the local
 d_off and the client d_off without having full knowledge of the directory
 contents). This is a good approach for really big directories because it
 doesn't require wasting large amounts of memory, but it will be hard to
 find a way to avoid duplicates, especially if we are limited to ~48 bits.

 Making it based on the GFID would be a good way to have a common d_off
 between bricks; however, maintaining order will be harder. It will also be
 hard to guarantee uniqueness if the mapping is deterministic and the
 directory is very big. Otherwise it would need to read the full directory
 contents before returning mapped d_off's.

 To minimize the collision problem, we need to solve the ordering
 problem. If we can guarantee that all bricks return directory entries in
 the same order and d_off, we don't need to reserve some bits in d_off.

 I think the virtual directories solution should be the one to consider
 for 4.0. For earlier versions we can try to find an intermediate solution.

 Following your idea of a server side component, could this be useful ?

 * Keep all directories and its entries in a double linked list stored in
 xattr of each inode.

 * Use this linked list to build the readdir answer.

 * Use the first 64 (or 63) bits of gfid as the d_off.

 * There will be two special offsets: 0 for '.' and 1 for '..'

 Example (using shorter gfid's for simplicity):

 Directory root with gfid 0001
 Directory 'test1' inside root with gfid 
 Directory 'test2' inside root with gfid 
 Entry 'entry1' inside 'test1' with gfid 
 Entry 'entry2' inside 'test1' with gfid 
 Entry 'entry3' inside 'test2' with gfid 
 Entry 'entry4' inside 'test2' with gfid 
 Entry 'entry5' inside 'test2' with gfid 

 / (0001)
test1/ ()
  entry1 ()
  entry2 ()
test2/ ()
  entry3 ()
  entry4 ()
  entry5 ()

 Note that entry2 and entry3 are hardlinks.

 xattrs of root (0001):
 trusted.dirmap.0001.next = 
 trusted.dirmap.0001.prev = 

 xattrs of 'test1' ():
 trusted.dirmap.0001.next = 
 trusted.dirmap.0001.prev = 0001
 trusted.dirmap..next = 
 trusted.dirmap..prev = 

 xattrs of 'test2' ():
 trusted.dirmap.0001.next = 0001
 trusted.dirmap.0001.prev = 
 trusted.dirmap..next = 
 trusted.dirmap..prev = 

 xattrs of 'entry1' ():
 trusted.dirmap..next = 
 trusted.dirmap..prev = 

 xattrs of 'entry2'/'entry3' ():
 trusted.dirmap..next = 
 trusted.dirmap..prev = 
 trusted.dirmap..next = 
 trusted.dirmap..prev = 

 xattrs of 'entry4' ():
 trusted.dirmap..next = 
 trusted.dirmap..prev = 

 xattrs of 'entry5' ():
 trusted.dirmap..next = 
 trusted.dirmap..prev = 

 It's easy to enumerate all entries from the beginning of a directory.
 Also, since we return extra information from each inode in a directory,
 accessing these new xattrs doesn't represent a big impact

Re: [Gluster-devel] pthread_mutex misusage in glusterd_op_sm

2014-11-26 Thread Anand Avati
This is indeed a misuse. A very similar bug used to exist in io-threads,
but we moved to using pthread_cond there a while ago.

To fix this problem we could use a pthread_mutex/pthread_cond pair plus a
boolean flag in place of the misused mutex. Or we could just declare
gd_op_sm_lock as a synclock_t to achieve the same result.
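
The first option is the classic binary-semaphore-from-a-condvar pattern,
which is safe to release from a thread other than the one that acquired it
(unlike a bare pthread_mutex). A sketch with illustrative names:

#include <pthread.h>
#include <stdbool.h>

/* A lock that may be acquired in one thread and released in another, which
 * POSIX does not permit for a plain pthread_mutex_t. */
struct gd_sm_lock {
        pthread_mutex_t mutex;
        pthread_cond_t  cond;
        bool            busy;
};

static void
gd_sm_lock_acquire (struct gd_sm_lock *l)
{
        pthread_mutex_lock (&l->mutex);
        while (l->busy)
                pthread_cond_wait (&l->cond, &l->mutex);
        l->busy = true;
        pthread_mutex_unlock (&l->mutex);
}

static void
gd_sm_lock_release (struct gd_sm_lock *l)
{
        pthread_mutex_lock (&l->mutex);
        l->busy = false;
        pthread_cond_signal (&l->cond);
        pthread_mutex_unlock (&l->mutex);
}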

Thanks

On Tue Nov 25 2014 at 10:26:34 AM Emmanuel Dreyfus m...@netbsd.org wrote:

 I made a simple fix that address the problem:
 http://review.gluster.org/9197

 Are there other places where the same bug could exist? Anyone familiar
 with the code would tell?

 Emmanuel Dreyfus m...@netbsd.org wrote:

  in glusterd_op_sm(), we lock and unlock the gd_op_sm_lock mutex.
  Unfortunately, locking and unlocking can happen in different threads
  (a task swap will occur in the handler call).
 
  This case is explicitly covered by POSIX: the behavior is undefined.
  http://pubs.opengroup.org/onlinepubs/9699919799/functions/pthread_mutex_lock.html
 
  When unlocking from a thread that is not the owner, Linux seems to be fine
  (though you never know with unspecified operations), while NetBSD returns
  EPERM, causing a spurious error in tests/basic/pump.t . It can be observed
  in a few failed tests here:
  http://build.gluster.org/job/rackspace-netbsd7-regression-triggered/
 
  Fixing it seems far from obvious. I guess it needs to use syncop, but the
  change would be intrusive. Do we have another option? Is it possible to
  switch a task to a given thread?


 --
 Emmanuel Dreyfus
 http://hcpnet.free.fr/pubz
 m...@netbsd.org
 ___
 Gluster-devel mailing list
 Gluster-devel@gluster.org
 http://supercolony.gluster.org/mailman/listinfo/gluster-devel

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Single layout at root (Was EHT / DHT)

2014-11-25 Thread Anand Avati
On Tue Nov 25 2014 at 1:28:59 PM Shyam srang...@redhat.com wrote:

 On 11/12/2014 01:55 AM, Anand Avati wrote:
 
 
  On Tue, Nov 11, 2014 at 1:56 PM, Jeff Darcy jda...@redhat.com
  mailto:jda...@redhat.com wrote:
 
(Personally I would have
  done this by mixing in the parent GFID to the hash calculation, but
  that alternative was ignored.)
 
 
  Actually when DHT was implemented, the concept of GFID did not (yet)
  exist. Due to backward compatibility it has just remained this way even
  later. Including the GFID into the hash has benefits.

 I am curious here as this is interesting.

 So the layout start subvol assignment for a directory to be based on its
 GFID was provided so that files with the same name distribute better
 than ending up in the same bricks, right?


Right; for e.g. we wouldn't want all the README.txt files in various
directories of a volume to end up on the same server. The way it is achieved
today is that the per-server hash-range assignment is rotated by a certain
amount (how much it is rotated is determined by a separate hash of the
directory path) at the time of mkdir.


 Instead, as we _now_ have GFIDs, we could use that along with the name to
 get a similar/better distribution, or use GFID+name to determine the hashed
 subvol.


What we could do now is include the parent directory's gfid as an input to
the DHT hash function.

Today, we do approximately:
  int hashval = dm_hash (readme.txt)
  hash_ranges[] = inode_ctx_get (parent_dir)
  subvol = find_subvol (hash_ranges, hashval)

Instead, we could:
  int hashval = new_hash (readme.txt, parent_dir.gfid)
  hash_ranges[] = global_value
  subvol = find_subvol (hash_ranges, hashval)
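
A sketch of what new_hash() could look like. The combining function here is
FNV-1a, chosen purely for illustration -- it is not the actual dm_hash/DHT
code:

#include <stdint.h>

/* Fold the parent directory's gfid into the hash of the entry name, so the
 * same basename lands on different subvolumes under different directories. */
static uint32_t
dht_hash_with_parent (const char *name, const unsigned char gfid[16])
{
        uint32_t h = 2166136261u;
        int      i;

        for (i = 0; i < 16; i++) {
                h ^= gfid[i];
                h *= 16777619u;
        }
        for (; *name; name++) {
                h ^= (unsigned char) *name;
                h *= 16777619u;
        }

        return h;
}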

The idea here would be that on dentry creates we would need to generate
 the GFID and not let the bricks generate the same, so that we can choose
 the subvol to wind the FOP to.


The GFID would be that of the parent (as an entry name is always in the
context of a parent directory/inode). Also, the GFID for a new entry is
already generated by the client; the brick does not generate a GFID.


 This eliminates the need for a layout per sub-directory, and all the
 (interesting) problems that come with it, replacing them with a single
 layout at the root. Not sure if it handles all the use cases and paths that
 we have now (which needs more understanding).

 I do understand there is a backward compatibility issue here, but other
 than this, this sounds better than the current scheme, as there is a
 single layout to read/optimize/stash/etc. across clients.

 Can I understand the rationale for this better - what are you folks
 thinking? Am I missing something, or over-reading the benefits that this
 can provide?


I think you understand it right. The benefit is one could have a single
hash layout for the entire volume and the directory specific-ness is
implemented by including the directory gfid into the hash function. The way
I see it, the compromise would be something like:

Pro per directory range: By having per-directory hash ranges, we can do
easier incremental rebalance. Partial progress is well tolerated and does
not impact the entire volume. While a given directory is undergoing
rebalance, we need to enter unhashed lookup mode for that directory alone,
and only for that period of time.

Con per directory range: Just the new hash assignment phase (to impact
placement of new files/data, not move old data) itself is an extended
process, crawling the entire volume with complex per-directory operations.
The number of points in the system where things can break (i.e, result in
overlaps and holes in ranges) is high.

Pro single layout with dir GFID in hash: Avoid the numerous parts (per-dir
hash ranges) which can potentially break.

Con single layout with dir GFID in hash: Rebalance phase 1 (assigning new
layout) is atomic for the entire volume - unhashed lookup has to be on
for all dirs for the entire period. To mitigate this, we could explore
versioning the centralized hash ranges, and store the version used by each
directory in its xattrs (and update the version as the rebalance
progresses). But now we have more centralized metadata (may be/ may not be
a worthy compromise - not sure.)

In summary, including GFID into the hash calculation does open up
interesting possibilities and worthy of serious consideration.

HTH,
Avati
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Invalid DIR * usage in quota xlator

2014-10-15 Thread Anand Avati
On Tue, Oct 14, 2014 at 7:22 PM, Emmanuel Dreyfus m...@netbsd.org wrote:

 J. Bruce Fields bfie...@fieldses.org wrote:

  Is the result on non-Linux really to fail any readdir using an offset
  not returned from the current open?

 Yes, but that non-Linux behavior is POSIX compliant. Linux just happens
 to do more than the standard requires here.

  I can't see how NFS READDIR will work on non-Linux platforms in that
  case.

 Different case: seekdir/telldir operate at the libc level on data cached in
 userland. NFS READDIR operates on data in the kernel.


Is there a way to get hold of the directory entry cookies used by NFS
readdir from user-space? Some sort of NetBSD-specific syscall (like Linux's
getdents)?
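
For context, this is what the Linux side looks like: getdents64() exposes the
per-entry d_off cookie directly to user space. (Shown via syscall(2); the
struct layout below matches the Linux ABI, and the whole example is
Linux-specific.)

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

struct linux_dirent64 {
        uint64_t       d_ino;
        int64_t        d_off;     /* opaque cookie for the next entry */
        unsigned short d_reclen;
        unsigned char  d_type;
        char           d_name[];
};

int
main (void)
{
        char buf[4096];
        int  fd    = open (".", O_RDONLY | O_DIRECTORY);
        long nread = syscall (SYS_getdents64, fd, buf, sizeof (buf));

        for (long off = 0; off < nread; ) {
                struct linux_dirent64 *d =
                        (struct linux_dirent64 *) (buf + off);

                printf ("%-24s d_off=%lld\n", d->d_name,
                        (long long) d->d_off);
                off += d->d_reclen;
        }

        close (fd);
        return 0;
}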

Thanks
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] if/else coding style :-)

2014-10-13 Thread Anand Avati
On Mon, Oct 13, 2014 at 2:00 PM, Shyam srang...@redhat.com wrote:

 (apologies, last one on the metrics from me :), as I believe it is more
 about style than actual numbers at a point)

 _maybe_ this is better, and it is pretty close to call now ;)

 find -name '*.c' | xargs grep else | wc -l
 3719
 find -name '*.c' | xargs grep else | grep '}' | wc -l
 1986
 find -name '*.c' | xargs grep else | grep -v '}' | wc -l
 1733


Without taking sides: the last grep counts else lines without either { or }.

[~/work/glusterfs]
sh$ git grep '} else {' | wc -l
1331
[~/work/glusterfs]
sh$ git grep 'else {' | grep -v '}' | wc -l
 142

So going by the numbers alone, } else { is 10x more common than }\n else {.
I also find that believable, based on my familiarity with this pattern in
the code. Either way, it is a good idea to stick to one style and not allow
both in future code.

Thanks
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Feature proposal - FS-Cache support in FUSE

2014-09-02 Thread Anand Avati
On Mon, Sep 1, 2014 at 6:07 AM, Vimal A R arvi...@yahoo.in wrote:

 Hello fuse-devel / fs-cache / gluster-devel lists,

 I would like to propose the idea of implementing FS-Cache support in the
 fuse kernel module, which I am planning to do as part of my UG university
 course. This proposal is by no means final, since I have just started to
 look into this.

 There are several user-space filesystems which are based on the FUSE
 kernel module. As of now, if I understand correctly, the only networked
 filesystems having FS-Cache support are NFS and AFS.

 Implementing support hooks for fs-cache in the fuse module would give
 networked filesystems such as GlusterFS the benefit of a client-side
 caching mechanism, which should decrease access times.


If you are planning to test this with GlusterFS, note that one of the first
challenges would be to have persistent filehandles in FUSE. While GlusterFS
has a notion of a persistent handle (the GFID, 128 bits) which is constant
across clients and remounts, the FUSE kernel module is presented a transient
LONG (64/32 bits) which is specific to the mount instance (actually, the
address of the userspace inode_t within the glusterfs process, which allows
for constant-time filehandle resolution).

This would be a challenge with any FUSE based filesystem which has
persistent filehandles larger than 64bit.
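
To illustrate the constraint: in the libfuse lowlevel API, the handle a
filesystem gives the kernel is the 64-bit fuse_ino_t in the lookup reply, and
glusterfs fills it with the address of its in-memory inode. A simplified
sketch (not the actual fuse-bridge code; resolve_in_itable() is a
hypothetical helper):

#define FUSE_USE_VERSION 30
#include <fuse_lowlevel.h>
#include <stdint.h>

/* Hypothetical: look up the child in the userspace inode table. */
extern void *resolve_in_itable (fuse_ino_t parent, const char *name);

static void
sketch_lookup (fuse_req_t req, fuse_ino_t parent, const char *name)
{
        struct fuse_entry_param e = {0};
        void *inode = resolve_in_itable (parent, name);

        /* The persistent 128-bit GFID stays inside the userspace inode; the
         * kernel only ever sees this transient 64-bit nodeid, valid for the
         * lifetime of this mount. */
        e.ino           = (uintptr_t) inode;
        e.attr_timeout  = 1.0;
        e.entry_timeout = 1.0;
        /* e.attr would be filled from the looked-up iatt */

        fuse_reply_entry (req, &e);
}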

Thanks



 When enabled, FS-Cache would maintain a virtual indexing tree to cache the
 data or object-types per network FS. Indices in the tree are used by
 FS-Cache to find objects faster. The tree or index structure under the main
 network FS index depends on the filesystem. Cookies are used to represent
 the indices, the pages etc..

 The tree structure would be as follows:

 a) The virtual index tree maintained by fs-cache would look like:

 * FS-Cache master index -> network-filesystem index (NFS/AFS etc.)
 -> per-share indices -> file-handle indices -> page indices

 b) In the case of FUSE-based filesystems, the tree would be similar to:

 * FS-Cache master index -> FUSE index -> per-FS indices -> file-handle
 indices -> page indices.

 c) In the case of FUSE-based filesystems such as GlusterFS, the tree would be:

 * FS-Cache master index -> FUSE index (fuse.glusterfs) -> GlusterFS
 volume ID (a UUID exists for each volume) -> GlusterFS file-handle indices
 (based on the GFID of a file) -> page indices.

 The idea is to enable FUSE to work with the FS-Cache network filesystem
 API, which is documented at '
 https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/filesystems/caching/netfs-api.txt
 '.

 The implementation of FS-Cache support in NFS can be taken as a guideline
 to understand and start off.

 I will reply to this mail with any other updates that would come up whilst
 pursuing this further. I request any sort of feedback/suggestions, ideas,
 any pitfalls etc.. that can help in taking this further.

 Thank you,

 Vimal

 References:
 *
 https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/filesystems/caching/fscache.txt
 *
 https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/filesystems/caching/netfs-api.txt
 *
 https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/filesystems/caching/object.txt
 * http://people.redhat.com/dhowells/fscache/FS-Cache.pdf
 * http://people.redhat.com/steved/fscache/docs/HOWTO.txt
 * https://en.wikipedia.org/wiki/CacheFS
 * https://lwn.net/Articles/160122/
 * http://www.linux-mag.com/id/7378/
 ___
 Gluster-devel mailing list
 Gluster-devel@gluster.org
 http://supercolony.gluster.org/mailman/listinfo/gluster-devel

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Transparent encryption in GlusterFS: Implications on manageability

2014-08-13 Thread Anand Avati
+1 for all the points.


On Wed, Aug 13, 2014 at 11:22 AM, Jeff Darcy jda...@redhat.com wrote:

  I.1 Generating the master volume key
 
 
  The master volume key should be generated by the user on a trusted
  machine. Recommendations on master key generation are provided in section
  6.2 of the manpages [1]. Generating the master volume key is the user's
  responsibility.

 That was fine for an initial implementation, but it's still the single
 largest obstacle to adoption of this feature.  Looking forward, we need
 to provide full CLI support for generating keys in the necessary format,
 specifying their location, etc.

 I.2 Location of the master volume key when mounting a
 volume
 
 
  At mount time the crypt translator searches for a master volume key on
  the client machine at the location specified by the respective
  translator option. If there is no any key at the specified location,
  or the key at specified location is in improper format, then mount
  will fail. Otherwise, the crypt translator loads the key to its
  private memory data structures.
 
  Location of the master volume key can be specified at volume creation
  time (see option master-key, section 6.7 of the man pages [1]).
  However, this option can be overridden by user at mount time to
  specify another location, see section 7 of manpages [1], steps 6, 7,
  8.

 Again, we need to improve on this.  We should support this as a volume
 or mount option in its own right, not rely on the generic
 --xlator-option mechanism.  Adding options to mount.glusterfs isn't
 hard.  Alternatively, we could make this look like a volume option
 settable once through the CLI, even though the path is stored locally on
 the client.  Or we could provide a separate special-purpose
 command/script, which again only needs to be run once.  It would even be
 acceptable to treat the path to the key file (not its contents!) as a
 true volume option, stored on the servers.  Any of these would be better
 than requiring the user to understand our volfile format and
 construction so that they can add the necessary option by hand.

 II. Check graph of translators on your client machine
 after mount!
 
 
  During mount your client machine receives configuration info from the
  non-trusted server. In particular, this info contains the graph of
  translators, which can be subjected to tampering, so that encryption
  won't be invoked for your volume at all. So it is highly important to
  verify this graph. After successful mount make sure that the graph of
  translators contains the crypt translator with proper options (see
  FAQ#1, section 11 of the manpages [1]).

 It is important to verify the graph, but not by poking through log files
 and not without more information about what to look for.  So we got a
 volfile that includes the crypt translator, with some options.  The
 *code* should ensure that the master-key option has the value from the
 command line or local config, and not some other.  If we have to add
 special support for this in otherwise-generic graph initialization code,
 that's fine.
 ___
 Gluster-devel mailing list
 Gluster-devel@gluster.org
 http://supercolony.gluster.org/mailman/listinfo/gluster-devel

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] how does meta xlator work?

2014-08-13 Thread Anand Avati
On Tue, Aug 12, 2014 at 9:58 AM, Emmanuel Dreyfus m...@netbsd.org wrote:

 On Mon, Aug 11, 2014 at 09:53:19PM -0700, Anand Avati wrote:
  If FUSE implements proper direct_io semantics (somewhat like how the
  O_DIRECT flag is handled) and allows the mode to be enabled by the FS in
  open_cbk, then I guess such special handling of 0-byte sizes need not be
  necessary? At least it hasn't been necessary in the Linux FUSE
  implementation.

 I made a patch that lets meta.t pass on NetBSD. Unfortunately, the
 direct IO flag gets attached to the vnode and not to the file descriptor,
 which means it is not possible to have a fd with direct IO and another
 without.

 But perhaps it is just good enough.


That may / may not work well in practice depending on the number of
concurrent apps working on a file. But a good start nonetheless.
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] how does meta xlator work?

2014-08-13 Thread Anand Avati
On Wed, Aug 13, 2014 at 8:55 PM, Emmanuel Dreyfus m...@netbsd.org wrote:

 Anand Avati av...@gluster.org wrote:

  That may / may not work well in practice depending on the number of
  concurrent apps working on a file.

 I am not sure what could make a FS decide that for the same file, one
 file descriptor should use direct I/O and another should not.

 Keeping the flag at file descriptor level would require VFS modification
 in the kernel: the filesystem knows nothing about file descriptors, it
 just know vnodes. It could be done, but I expect to meet resistance :-)


For e.g, glusterfs used to enable directio mode for non-read-only FDs (to
cut overheads) but disable directio for read-only (to leverage readahead).
After big_writes was introduced in Linux FUSE this has changed.

But we should be OK having vnode level switch for now, I think.
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] how does meta xlator work?

2014-08-11 Thread Anand Avati
My guess would be that direct_io_mode works differently on *BSD. In Linux
(and it appears in OS/X as well), the VFS takes a hint from the file size
(returned in lookup/stat) to avoid read()ing beyond that offset. So if a
file size of 0 is returned in lookup, the read() is never received even by
FUSE.

In meta all file sizes are 0 (since the contents of the inode are generated
dynamically on open()/read(), size is unknown during lookup() -- just like
/proc). And therefore all meta file open()s are forced into direct_io_mode (
http://review.gluster.org/7506) so that read() requests are sent straight
to FUSE/glusterfs bypassing VFS (size is ignored etc.)
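
In libfuse terms, the per-open switch is just the direct_io bit in the open
reply, which becomes FOPEN_DIRECT_IO on the wire. A generic sketch (not the
glusterfs fuse-bridge itself):

#define FUSE_USE_VERSION 30
#include <fuse_lowlevel.h>

static void
sketch_open (fuse_req_t req, fuse_ino_t ino, struct fuse_file_info *fi)
{
        (void) ino;

        /* With direct_io set, the kernel sends read()s straight down instead
         * of clipping them at the (zero) size returned by lookup/getattr. */
        fi->direct_io = 1;
        fuse_reply_open (req, fi);
}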

So my guess would be to inspect how direct_io_mode works in those FUSE
implementations first. It is unlikely to be any other issue.

Thanks


On Sun, Aug 10, 2014 at 9:56 PM, Harshavardhana har...@harshavardhana.net
wrote:

  I am working on tests/basic/meta.t on NetBSD
  It fails because .meta/frames is empty, like all files in .meta. A quick
  investigation in source code shows that the function responsible for
  filling the code (frames_file_fill) is never called.
 

 Same experience here.

 It does work on OSX, but does not work on FreeBSD for similar reasons
 haven't figured it out yet what is causing the issue.

 --
 Religious confuse piety with mere ritual, the virtuous confuse
 regulation with outcomes
 ___
 Gluster-devel mailing list
 Gluster-devel@gluster.org
 http://supercolony.gluster.org/mailman/listinfo/gluster-devel

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] how does meta xlator work?

2014-08-11 Thread Anand Avati
On Mon, Aug 11, 2014 at 7:37 PM, Harshavardhana har...@harshavardhana.net
wrote:

  But there is something I don't get with the fix:
  - the code forces direct IO if ((state->flags & O_ACCMODE) != O_RDONLY),
  but here the file is open read-only, hence I would expect the fuse xlator
  to do nothing special

 direct_io_mode(xdata) is in-fact gf_true here for 'meta'

  - using direct IO is the kernel's decision, reflecting the flags used by
  the calling process on open(2). How can the fuse xlator convince the
  kernel to enable it?
 

 fuse is convinced by FOPEN_DIRECT_IO, which is proactively set when 'xdata'
 is set with direct-io-mode = 1 by the meta module.

 From FreeBSD fuse4bsd code

 void
 fuse_vnode_open(struct vnode *vp, int32_t fuse_open_flags, struct thread
 *td)
 {
 /*
  * Funcation is called for every vnode open.
  * Merge fuse_open_flags it may be 0
  *
  * XXXIP: Handle FOPEN_DIRECT_IO and FOPEN_KEEP_CACHE
  */

 if (vnode_vtype(vp) == VREG) {
 /* XXXIP prevent getattr, by using cached node size */
 vnode_create_vobject(vp, 0, td);
 }
 }


 FUSE4BSD doesn't seem to implement this as full functionality.
 Only the getpages/putpages API seems to implement IO_DIRECT handling.
 Need to see if that is indeed the case.


I think you found it right. fuse4bsd should start handling FOPEN_DIRECT_IO
in the open handler.
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] how does meta xlator work?

2014-08-11 Thread Anand Avati
On Mon, Aug 11, 2014 at 9:14 PM, Emmanuel Dreyfus m...@netbsd.org wrote:

 Anand Avati av...@gluster.org wrote:

  In meta all file sizes are 0 (since the contents of the inode are
 generated
  dynamically on open()/read(), size is unknown during lookup() -- just
 like
  /proc). And therefore all meta file open()s are forced into
 direct_io_mode (
  http://review.gluster.org/7506) so that read() requests are sent
 straight
  to FUSE/glusterfs bypassing VFS (size is ignored etc.)

 I found the code in the kernel that skips the read if it is beyond the
 known file size. Hence I guess the idea is that on lookup of a file with a
 size equal to 0, special handling should be done so that reads happen
 anyway.


If FUSE implements proper direct_io semantics (somewhat like how the
O_DIRECT flag is handled) and allows the mode to be enabled by the FS in
open_cbk, then I guess such special handling of 0-byte sizes need not be
necessary? At least it hasn't been necessary in the Linux FUSE
implementation.
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Fw: Re: Corvid gluster testing

2014-08-07 Thread Anand Avati
David,
Is it possible to profile the app to understand the block sizes used for
performing write() (using strace, source code inspection, etc.)? The block
sizes reported by gluster volume profile are measured on the server side and
are subject to some aggregation by the client-side write-behind xlator.
Typically the biggest hurdle for small-block writes is FUSE context
switches, which happen even before reaching the client-side write-behind
xlator.

You could also enable the io-stats xlator on the client side just below
FUSE (before reaching write-behind), and extract data using setfattr.



On Wed, Aug 6, 2014 at 10:00 AM, David F. Robinson 
david.robin...@corvidtec.com wrote:

 My apologies.  I did some additional testing and realized that my timing
 wasn't right.  I believe that after I do the write, NFS caches the data and
 until I close and flush the file, the timing isn't correct.
 I believe the appropriate timing is now 38-seconds for NFS and 60-seconds
 for gluster.  I played around with some of the parameters and got it down
 to 52-seconds with gluster by setting:

 performance.write-behind-window-size: 128MB
 performance.cache-size: 128MB

 I couldn't get it closer to the NFS timing on the writes, although the
 read speeds were slightly better than NFS.  I am not sure if this is
 reasonable, or if I should be able to get write speeds that are more
 comparable to the NFS mount...

 Sorry for the confusion I might have caused with my first email... It
 isn't 25x slower.  It is roughly 30% slower for the writes...


 David


 -- Original Message --
 From: Vijay Bellur vbel...@redhat.com
 To: David F. Robinson david.robin...@corvidtec.com;
 gluster-devel@gluster.org
 Sent: 8/6/2014 12:48:09 PM
 Subject: Re: [Gluster-devel] Fw: Re: Corvid gluster testing

  On 08/06/2014 12:11 AM, David F. Robinson wrote:

 I have been testing some of the fixes that Pranith incorporated into the
 3.5.2-beta to see how they performed for moderate levels of i/o. All of
 the stability issues that I had seen in previous versions seem to have
 been fixed in 3.5.2; however, there still seem to be some significant
 performance issues. Pranith suggested that I send this to the
 gluster-devel email list, so here goes:
 I am running an MPI job that saves a restart file to the gluster file
 system. When I use the following in my fstab to mount the gluster
 volume, the i/o time for the 2.5GB file is roughly 45-seconds.
     gfsib01a.corvidtec.com:/homegfs /homegfs glusterfs transport=tcp,_netdev 0 0
 When I switch this to use the NFS protocol (see below), the i/o time is
 2.5-seconds.
     gfsib01a.corvidtec.com:/homegfs /homegfs nfs vers=3,intr,bg,rsize=32768,wsize=32768 0 0
 The read-times for gluster are 10-20% faster than NFS, but the write
 times are almost 20x slower.


 What is the block size of the writes that are being performed? You can
 expect better throughput and lower latency with block sizes that are close
 to or greater than 128KB.

 -Vijay


 ___
 Gluster-devel mailing list
 Gluster-devel@gluster.org
 http://supercolony.gluster.org/mailman/listinfo/gluster-devel

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] regarding fuse mount crash on graph-switch

2014-08-06 Thread Anand Avati
Can you add more logging to the fd migration failure path as well please
(errno and possibly other details)?

Thanks!


On Wed, Aug 6, 2014 at 9:16 PM, Pranith Kumar Karampuri pkara...@redhat.com
 wrote:

 hi,
    Could you guys review http://review.gluster.com/#/c/8402? This fixes the
 crash reported by JoeJulian. We are yet to find out why fd migration failed.

 Pranith

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] regarding resolution for fuse/server

2014-08-01 Thread Anand Avati
There are subtle differences between fuse and server. In fuse the inode
table does not use LRU pruning, so expected inodes are guaranteed to be
cached. For e.g., when a mkdir() FOP arrives, fuse would have already checked
with a lookup, and the kernel guarantees another thread would not have
created the directory in the meantime (it holds a mutex on the dir). In the
server, either because of LRU or threads racing, you need to re-evaluate the
situation to be sure. RESOLVE_MUST/NOT makes more sense in the server
because of this.

HTH

On Fri, Aug 1, 2014 at 2:43 AM, Pranith Kumar Karampuri pkara...@redhat.com
 wrote:

 hi,
  Does anyone know why there is different code for resolution in fuse
 vs server? There are some differences too, like the server asserting about
 resolution types like RESOLVE_MUST/RESOLVE_NOT etc., whereas fuse doesn't do
 any such thing. Wondering if there is any reason why the code is different
 in these two xlators.

 Pranith

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Reuse of frame?

2014-07-28 Thread Anand Avati
Call frames and stacks are re-used from a mem-pool, so pointers might
repeat. Can you describe your use case in a little more detail, just to be
sure?


On Mon, Jul 28, 2014 at 11:27 AM, Matthew McKeen matt...@mmckeen.net
wrote:

 Is it true that different fops will always have a different frame
 (i.e. different frame pointer) as seen in the translator stack?  I've
 always thought this to be true, but it seems that with the
 release-3.6.0 branch the two quick getxattr syscalls that the getfattr
 cli command calls share the same frame pointer.

 This is causing havoc with a translator of mine, and I was wondering
 if this was a bug, or expected behaviour.

 Thanks,
 Matt

 --
 Matthew McKeen
 matt...@mmckeen.net
 ___
 Gluster-devel mailing list
 Gluster-devel@gluster.org
 http://supercolony.gluster.org/mailman/listinfo/gluster-devel

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Reuse of frame?

2014-07-28 Thread Anand Avati
Does your code wait for both clients to unwind so that it merges the two
replies before it unwinds itself? You typically would need to keep a call
count (# of winds) and wait for that many _cbk invocations before calling
STACK_UNWIND yourself.

If you are not waiting for both replies, it is possible that the frame
pointer got re-used for the second call before the second callback of the
first call arrived.
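
The pattern looks roughly like this -- a bare sketch of an N-way getxattr
fan-out; the local struct, child iteration and cleanup are simplified, so
treat it as an illustration rather than a drop-in translator:

/* headers from the glusterfs source tree are assumed to be on the
 * include path, as in any translator build */
#include "xlator.h"

struct fanout_local {
        int       call_cnt;  /* winds still outstanding    */
        gf_lock_t lock;
        dict_t   *merged;    /* replies accumulated so far */
};

int
fanout_getxattr_cbk (call_frame_t *frame, void *cookie, xlator_t *this,
                     int op_ret, int op_errno, dict_t *xattr, dict_t *xdata)
{
        struct fanout_local *local = frame->local;
        int                  last  = 0;

        LOCK (&local->lock);
        {
                if (op_ret == 0 && xattr)
                        dict_copy (xattr, local->merged);
                last = (--local->call_cnt == 0);
        }
        UNLOCK (&local->lock);

        /* only the final callback unwinds; earlier ones just return */
        if (last)
                STACK_UNWIND_STRICT (getxattr, frame, 0, 0,
                                     local->merged, NULL);
        return 0;
}

int
fanout_getxattr (call_frame_t *frame, xlator_t *this, loc_t *loc,
                 const char *name, dict_t *xdata)
{
        struct fanout_local *local = NULL;
        xlator_list_t       *child = NULL;

        local = GF_CALLOC (1, sizeof (*local), gf_common_mt_char);
        LOCK_INIT (&local->lock);
        local->merged = dict_new ();
        frame->local  = local;

        /* set the full count *before* the first wind, so an early
         * callback cannot see a partial count and unwind too soon */
        for (child = this->children; child; child = child->next)
                local->call_cnt++;

        for (child = this->children; child; child = child->next)
                STACK_WIND (frame, fanout_getxattr_cbk, child->xlator,
                            child->xlator->fops->getxattr,
                            loc, name, xdata);
        return 0;
}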


On Mon, Jul 28, 2014 at 12:32 PM, Matthew McKeen matt...@mmckeen.net
wrote:

 I have a translator on the client stack.  For a particular getxattr I
 wind the stack to call two client translator getxattr fops.  The
 callback for these fops is the same function in the original
 translator.  The getfattr cli command calls two getxattr syscalls in
 rapid succession so that I see the callback being hit 4 times, and the
 original getxattr forward fop 2 times.  For both the 4 callbacks and 2
 forward fops the frame pointer for the translator is the same.
 Therefore, when I try and store a pointer to a dict in frame-local,
 the dict pointer points to the same dict for both fops and data set
 into the dict with the same keys ends up overwriting the values from
 the previous fop.



 On Mon, Jul 28, 2014 at 12:19 PM, Anand Avati av...@gluster.org wrote:
  call frames and stacks are re-used from a mem-pool. So pointers might
  repeat. Can you describe your use case a little more in detail, just to
 be
  sure?
 
 
  On Mon, Jul 28, 2014 at 11:27 AM, Matthew McKeen matt...@mmckeen.net
  wrote:
 
  Is it true that different fops will always have a different frame
  (i.e. different frame pointer) as seen in the translator stack?  I've
  always thought this to be true, but it seems that with the
  release-3.6.0 branch the two quick getxattr syscalls that the getfattr
  cli command calls share the same frame pointer.
 
  This is causing havoc with a translator of mine, and I was wondering
  if this was a bug, or expected behaviour.
 
  Thanks,
  Matt
 
  --
  Matthew McKeen
  matt...@mmckeen.net
  ___
  Gluster-devel mailing list
  Gluster-devel@gluster.org
  http://supercolony.gluster.org/mailman/listinfo/gluster-devel
 
 



 --
 Matthew McKeen
 matt...@mmckeen.net


 --
 Matthew McKeen
 matt...@mmckeen.net

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Better organization for code documentation [Was: Developer Documentation for datastructures in gluster]

2014-07-22 Thread Anand Avati
On Tue, Jul 22, 2014 at 7:35 AM, Kaushal M kshlms...@gmail.com wrote:

 Hey everyone,

 While I was writing the documentation for the options framework, I
 thought up of a way to better organize the code documentation we are
 creating now. I've posted a patch for review that implements this
 organization. [1]

 Copying the description from the patch I've posted for review,
 ```
 A new directory hierarchy has been created in doc/code for the code
 documentation, which follows the general GlusterFS source hierarchy.
 Each GlusterFS module has an entry in this tree. The source directory of
 every GlusterFS module has a symlink, 'doc', to its corresponding
 directory in the doc/code tree.

 Taking glusterd for example: with this scheme, there will be a
 doc/code/xlators/mgmt/glusterd directory which will contain the
 documentation relevant to glusterd. This directory will be symlinked to
 xlators/mgmt/glusterd/src/doc.

 This organization should allow for easy reference by developers when
 developing on GlusterFS and also allow for easy hosting of the documents
 when we set it up.
 ```



I haven't read the previous thread, but having the doc dir co-exist with src
in each module would encourage (or at least remind people about) keeping the
docs updated along with src changes. It is generally recommended not to store
symlinks in the source repo (though I think git supports them). You could
create the symlinks from the top-level doc/code tree to each module (or vice
versa) in autogen.sh.

Thanks
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


[Gluster-devel] Fwd: Re: can not build glusterfs3.5.1 on solaris because of missing sys/cdefs.h

2014-07-16 Thread Anand Avati
Copying gluster-devel@

Thanks for reporting Michael. I guess we need to forward port that old
change. Can you please send out a patch to gerrit?

Thanks!

On 7/16/14, 2:36 AM, 马忠 wrote:
 Hi Avati,
 
I tried to build the latest glusterfs 3.5.1 on solaris11.1,  but 
 it stopped because of missing sys/cdefs.h
 
 I've checked the Changelog and found the 
 commit a5301c874f978570187c3543b0c3a4ceba143c25 had once
 
 solved such a problem in the obsolete file 
 libglusterfsclient/src/libglusterfsclient.h. I don't understand why
 
 it appeared again in the later added file api/src/glfs.h. Can you give 
 me any suggestion about this problem? thanks.
 
 --
 
 [root@localhost glusterfs]# git show 
 a5301c874f978570187c3543b0c3a4ceba143c25
 
 commit a5301c874f978570187c3543b0c3a4ceba143c25
 
 Author: Anand V. Avati av...@amp.gluster.com
 
 Date:   Mon May 18 17:24:16 2009 +0530
 
  workaround for not including sys/cdefs.h -- including sys/cdefs.h 
 breaks build on solaris and other platforms
 
 diff --git a/libglusterfsclient/src/libglusterfsclient.h 
 b/libglusterfsclient/src/libglusterfsclient.h
 
 index 1c2441b..5376985 100755
 
 --- a/libglusterfsclient/src/libglusterfsclient.h
 
 +++ b/libglusterfsclient/src/libglusterfsclient.h
 
 @@ -20,7 +20,22 @@
 
   #ifndef _LIBGLUSTERFSCLIENT_H
 
   #define _LIBGLUSTERFSCLIENT_H
 
 -#include sys/cdefs.h
 
 +#ifndef __BEGIN_DECLS
 
 +#ifdef __cplusplus
 
  +#define __BEGIN_DECLS extern "C" {
 
 +#else
 
 -
 
 root@solaris:~/glusterfs-3.5.1# gmake
 
 
 
 
 
 gmake[3]: Entering directory `/root/glusterfs-3.5.1/api/src'
 
CC libgfapi_la-glfs.lo
 
 In file included from glfs.c:50:
 
 glfs.h:41:23: sys/cdefs.h: No such file or directory
 
 In file included from glfs.c:50:
 
 glfs.h:57: error: syntax error before struct
 
 In file included from glfs.c:51:
 
 glfs-internal.h:57: error: syntax error before struct
 
 gmake[3]: *** [libgfapi_la-glfs.lo] Error 1
 
 gmake[3]: Leaving directory `/root/glusterfs-3.5.1/api/src'
 
 gmake[2]: *** [all-recursive] Error 1
 
 gmake[2]: Leaving directory `/root/glusterfs-3.5.1/api'
 
 gmake[1]: *** [all-recursive] Error 1
 
 gmake[1]: Leaving directory `/root/glusterfs-3.5.1'
 
 gmake: *** [all] Error 2
 
 --
 
 Thanks in advance,
 
  Michael
 



___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] inode linking in GlusterFS NFS server

2014-07-07 Thread Anand Avati
On Mon, Jul 7, 2014 at 12:48 PM, Raghavendra Bhat rab...@redhat.com wrote:


 Hi,

 As per my understanding nfs server is not doing inode linking in readdirp
 callback. Because of this there might be some errors while dealing with
  virtual inodes (or gfids). As of now, the meta, gfid-access and snapview-server
  (used for user serviceable snapshots) xlators make use of virtual inodes
  with random gfids. The situation is this:

 Say User serviceable snapshot feature has been enabled and there are 2
 snapshots (snap1 and snap2). Let /mnt/nfs be the nfs mount. Now the
 snapshots can be accessed by entering .snaps directory.  Now if snap1
 directory is entered and *ls -l* is done (i.e. cd /mnt/nfs/.snaps/snap1
 and then ls -l),  the readdirp fop is sent to the snapview-server xlator
 (which is part of a daemon running for the volume), which talks to the
 corresponding snapshot volume and gets the dentry list. Before unwinding it
 would have generated random gfids for those dentries.

 Now nfs server upon getting readdirp reply, will associate the gfid with
 the filehandle created for the entry. But without linking the inode, it
 would send the readdirp reply back to nfs client. Now next time when nfs
 client makes some operation on one of those filehandles, nfs server tries
 to resolve it by finding the inode for the gfid present in the filehandle.
 But since the inode was not linked in readdirp, inode_find operation fails
 and it tries to do a hard resolution by sending the lookup operation on
 that gfid to the normal main graph. (The information on whether the call
 should be sent to main graph or snapview-server would be present in the
 inode context. But here the lookup has come on a gfid with a newly created
 inode where the context is not there yet. So the call would be sent to the
 main graph itself). But since the gfid is a randomly generated virtual gfid
 (not present on disk), the lookup operation fails giving error.

 As per my understanding this can happen with any xlator that deals with
 virtual inodes (by generating random gfids).

 I can think of these 2 methods to handle this:
 1)  do inode linking for readdirp also in nfs server
 2)  If lookup operation fails, snapview-client xlator (which actually
 redirects the fops on snapshot world to snapview-server by looking into the
 inode context) should check if the failed lookup is a nameless lookup. If
 so, AND the gfid of the inode is NULL AND lookup has come from main graph,
 then instead of unwinding the lookup with failure, send it to
 snapview-server which might be able to find the inode for the gfid (as the
 gfid was generated by itself, it should be able to find the inode for that
 gfid unless and until it has been purged from the inode table).


 Please let me know if I have missed anything. Please provide feedback.



That's right. NFS server should be linking readdirp_cbk inodes just like
FUSE or protocol/server. It has been OK without virtual gfids thus far.
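
For reference, the linking the NFS readdirp callback needs is roughly the
following -- a sketch modelled on what fuse-bridge/protocol-server do, with
the surrounding callback plumbing and error handling elided:

#include "inode.h"       /* glusterfs source tree headers assumed */
#include "gf-dirent.h"

static void
readdirp_link_entries (inode_t *parent, gf_dirent_t *entries)
{
        gf_dirent_t *entry  = NULL;
        inode_t     *linked = NULL;

        list_for_each_entry (entry, &entries->list, list) {
                if (!entry->inode)
                        continue;    /* e.g. "." and ".." carry no inode */

                linked = inode_link (entry->inode, parent,
                                     entry->d_name, &entry->d_stat);
                if (linked) {
                        inode_lookup (linked);  /* mark it as looked up */
                        inode_unref (linked);
                }
        }
}
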
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] triggers for sending inode forgets

2014-07-04 Thread Anand Avati
On Fri, Jul 4, 2014 at 8:17 PM, Pranith Kumar Karampuri pkara...@redhat.com
 wrote:


 On 07/05/2014 08:17 AM, Anand Avati wrote:




 On Fri, Jul 4, 2014 at 7:03 PM, Pranith Kumar Karampuri 
 pkara...@redhat.com wrote:

 hi,
 I work on glusterfs and was debugging a memory leak. Need your help
 in figuring out if something is done properly or not.
  When a file is looked up for the first time in gluster through fuse,
  gluster remembers the (parent-inode, basename) association for that inode.
  Whenever an unlink/rmdir happens (or a lookup returns ENOENT), the
  corresponding (parent-inode, basename) association is forgotten.


  This is because the path resolver explicitly calls d_invalidate() on
 a dentry when d_revalidate() fails on it.

  In all other cases it relies on fuse to send forget of an inode to
 release these associations. I was wondering what are the trigger points for
 sending forgets by fuse.

 Lets say M0, M1 are fuse mounts of same volume.
 1) Mount 'M0' creates a file 'a'
 2) Mount 'M1' of deletes file 'a'

 M0 never touches 'a' anymore. Will a forget be sent on inode of 'a'? If
 yes when?


  Really depends on when the memory manager decides to start reclaiming
 memory from dcache due to memory pressure. If the system is not under
 memory pressure, and if the stale dentry is never encountered by the path
 resolver, the inode may never receive a forget. To keep a tight utilization
 limit on the inode/dcache, you will have to proactively 
 fuse_notify_inval_entry
 on old/deleted files.

 Thanks for this info Avati. I see that in fuse-bridge for glusterfs there
 is a setxattr interface to do that. Is that what you are referring to?


In glusterfs fuse-bridge.c:fuse_invalidate_entry() is the function you want
to look at. The setxattr() interface is just for testing the functionality.
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


[Gluster-devel] [PATCH] fuse: ignore entry-timeout on LOOKUP_REVAL

2014-06-26 Thread Anand Avati
The following test case demonstrates the bug:

  sh# mount -t glusterfs localhost:meta-test /mnt/one

  sh# mount -t glusterfs localhost:meta-test /mnt/two

  sh# echo stuff > /mnt/one/file; rm -f /mnt/two/file; echo stuff > /mnt/one/file
  bash: /mnt/one/file: Stale file handle

  sh# echo stuff > /mnt/one/file; rm -f /mnt/two/file; sleep 1; echo stuff > /mnt/one/file

On the second open() on /mnt/one, FUSE would have used the old
nodeid (file handle) trying to re-open it. Gluster is returning
-ESTALE. The ESTALE propagates back to namei.c:filename_lookup()
where lookup is re-attempted with LOOKUP_REVAL. The right
behavior now, would be for FUSE to ignore the entry-timeout and
and do the up-call revalidation. Instead FUSE is ignoring
LOOKUP_REVAL, succeeding the revalidation (because entry-timeout
has not passed), and open() is again retried on the old file
handle and finally the ESTALE is going back to the application.

Fix: if revalidation is happening with LOOKUP_REVAL, then ignore
entry-timeout and always do the up-call.

Signed-off-by: Anand Avati av...@redhat.com
---
 fs/fuse/dir.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 4219835..4eaa30d 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -198,7 +198,7 @@ static int fuse_dentry_revalidate(struct dentry *entry, unsigned int flags)
 	inode = ACCESS_ONCE(entry->d_inode);
 	if (inode && is_bad_inode(inode))
 		goto invalid;
-	else if (fuse_dentry_time(entry) < get_jiffies_64()) {
+	else if (fuse_dentry_time(entry) < get_jiffies_64() || (flags & LOOKUP_REVAL)) {
 		int err;
 		struct fuse_entry_out outarg;
 		struct fuse_req *req;
-- 
1.7.1

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Regarding doing away with refkeeper in locks xlator

2014-06-05 Thread Anand Avati

On 6/3/14, 11:32 PM, Pranith Kumar Karampuri wrote:


On 06/04/2014 11:37 AM, Krutika Dhananjay wrote:

Hi,

Recently there was a crash in locks translator (BZ 1103347, BZ
1097102) with the following backtrace:
(gdb) bt
#0  uuid_unpack (in=0x8 Address 0x8 out of bounds,
uu=0x7fffea6c6a60) at ../../contrib/uuid/unpack.c:44
#1  0x7feeba9e19d6 in uuid_unparse_x (uu=value optimized out,
out=0x2350fc0 081bbc7a-7551-44ac-85c7-aad5e2633db9,
fmt=0x7feebaa08e00
%08x-%04x-%04x-%02x%02x-%02x%02x%02x%02x%02x%02x) at
../../contrib/uuid/unparse.c:55
#2  0x7feeba9be837 in uuid_utoa (uuid=0x8 Address 0x8 out of
bounds) at common-utils.c:2138
#3  0x7feeb06e8a58 in pl_inodelk_log_cleanup (this=0x230d910,
ctx=0x7fee700f0c60) at inodelk.c:396
#4  pl_inodelk_client_cleanup (this=0x230d910, ctx=0x7fee700f0c60) at
inodelk.c:428
#5  0x7feeb06ddf3a in pl_client_disconnect_cbk (this=0x230d910,
client=value optimized out) at posix.c:2550
#6  0x7feeba9fa2dd in gf_client_disconnect (client=0x27724a0) at
client_t.c:368
#7  0x7feeab77ed48 in server_connection_cleanup (this=0x2316390,
client=0x27724a0, flags=value optimized out) at server-helpers.c:354
#8  0x7feeab77ae2c in server_rpc_notify (rpc=value optimized
out, xl=0x2316390, event=value optimized out, data=0x2bf51c0) at
server.c:527
#9  0x7feeba775155 in rpcsvc_handle_disconnect (svc=0x2325980,
trans=0x2bf51c0) at rpcsvc.c:720
#10 0x7feeba776c30 in rpcsvc_notify (trans=0x2bf51c0,
mydata=value optimized out, event=value optimized out,
data=0x2bf51c0) at rpcsvc.c:758
#11 0x7feeba778638 in rpc_transport_notify (this=value optimized
out, event=value optimized out, data=value optimized out) at
rpc-transport.c:512
#12 0x7feeb115e971 in socket_event_poll_err (fd=value optimized
out, idx=value optimized out, data=0x2bf51c0, poll_in=value
optimized out, poll_out=0,
poll_err=0) at socket.c:1071
#13 socket_event_handler (fd=value optimized out, idx=value
optimized out, data=0x2bf51c0, poll_in=value optimized out,
poll_out=0, poll_err=0) at socket.c:2240
#14 0x7feeba9fc6a7 in event_dispatch_epoll_handler
(event_pool=0x22e2d00) at event-epoll.c:384
#15 event_dispatch_epoll (event_pool=0x22e2d00) at event-epoll.c:445
#16 0x00407e93 in main (argc=19, argv=0x7fffea6c7f88) at
glusterfsd.c:2023
(gdb) f 4
#4  pl_inodelk_client_cleanup (this=0x230d910, ctx=0x7fee700f0c60) at
inodelk.c:428
428             pl_inodelk_log_cleanup (l);
(gdb) p l->pl_inode->refkeeper
$1 = (inode_t *) 0x0
(gdb)

pl_inode->refkeeper was found to be NULL even when there were some
blocked inodelks in a certain domain of the inode,
which when dereferenced by the epoll thread in the cleanup codepath
led to a crash.

On inspecting the code (for want of a consistent reproducer), three
things were found:

1. The function where the crash happens (pl_inodelk_log_cleanup()),
makes an attempt to resolve the inode to path as can be seen below.
But the way inode_path() itself
works is to first construct the path based on the given inode's
ancestry and place it in the buffer provided. And if all else fails,
the gfid of the inode is placed in a certain format (gfid:%s).
This eliminates the need for statements from line 4 through 7
below, thereby preventing dereferencing of pl_inode->refkeeper.
Now, although this change prevents the crash altogether, it still
does not fix the race that led to pl_inode->refkeeper becoming NULL,
and comes at the cost of
printing (null) in the log message on line 9 every time
pl_inode->refkeeper is found to be NULL, rendering the logged messages
somewhat useless.

code
  0 pl_inode = lock->pl_inode;
  1
  2 inode_path (pl_inode->refkeeper, NULL, &path);
  3
  4 if (path)
  5         file = path;
  6 else
  7         file = uuid_utoa (pl_inode->refkeeper->gfid);
  8
  9 gf_log (THIS->name, GF_LOG_WARNING,
 10         "releasing lock on %s held by "
 11         "{client=%p, pid=%"PRId64" lk-owner=%s}",
 12         file, lock->client, (uint64_t) lock->client_pid,
 13         lkowner_utoa (&lock->owner));
\code

I think this logging code is from the days when the gfid handle concept was
not there, so inode_path() wasn't returning gfid:gfid-str in cases where the
path is not present in the dentries. I believe the else block can be deleted
safely now.

Pranith


2. There is at least one codepath found that can lead to this crash:
Imagine an inode on which an inodelk operation is attempted by a
client and is successfully granted too.
   Now, between the time the lock was granted and
pl_update_refkeeper() was called by this thread, the client could send
a DISCONNECT event,
   causing cleanup codepath to be executed, where the epoll thread
crashes on dereferencing pl_inode-refkeeper which is STILL NULL at
this point.

   Besides, there are still places in locks xlator where the refkeeper
is NOT updated whenever the lists are modified - for instance in the
cleanup codepath from a 

Re: [Gluster-devel] Regarding doing away with refkeeper in locks xlator

2014-06-05 Thread Anand Avati

On 6/4/14, 9:43 PM, Krutika Dhananjay wrote:





*From: *Pranith Kumar Karampuri pkara...@redhat.com
*To: *Krutika Dhananjay kdhan...@redhat.com, Anand Avati
aav...@redhat.com
*Cc: *gluster-devel@gluster.org
*Sent: *Wednesday, June 4, 2014 12:23:59 PM
*Subject: *Re: [Gluster-devel] Regarding doing away with refkeeper
in locks xlator


On 06/04/2014 12:02 PM, Pranith Kumar Karampuri wrote:


On 06/04/2014 11:37 AM, Krutika Dhananjay wrote:

Hi,

Recently there was a crash in locks translator (BZ 1103347,
BZ 1097102) with the following backtrace:
(gdb) bt
#0  uuid_unpack (in=0x8 Address 0x8 out of bounds,
uu=0x7fffea6c6a60) at ../../contrib/uuid/unpack.c:44
#1  0x7feeba9e19d6 in uuid_unparse_x (uu=value
optimized out, out=0x2350fc0
081bbc7a-7551-44ac-85c7-aad5e2633db9,
 fmt=0x7feebaa08e00
%08x-%04x-%04x-%02x%02x-%02x%02x%02x%02x%02x%02x) at
../../contrib/uuid/unparse.c:55
#2  0x7feeba9be837 in uuid_utoa (uuid=0x8 Address 0x8
out of bounds) at common-utils.c:2138
#3  0x7feeb06e8a58 in pl_inodelk_log_cleanup
(this=0x230d910, ctx=0x7fee700f0c60) at inodelk.c:396
#4  pl_inodelk_client_cleanup (this=0x230d910,
ctx=0x7fee700f0c60) at inodelk.c:428
#5  0x7feeb06ddf3a in pl_client_disconnect_cbk
(this=0x230d910, client=value optimized out) at posix.c:2550
#6  0x7feeba9fa2dd in gf_client_disconnect
(client=0x27724a0) at client_t.c:368
#7  0x7feeab77ed48 in server_connection_cleanup
(this=0x2316390, client=0x27724a0, flags=value optimized
out) at server-helpers.c:354
#8  0x7feeab77ae2c in server_rpc_notify (rpc=value
optimized out, xl=0x2316390, event=value optimized out,
data=0x2bf51c0) at server.c:527
#9  0x7feeba775155 in rpcsvc_handle_disconnect
(svc=0x2325980, trans=0x2bf51c0) at rpcsvc.c:720
#10 0x7feeba776c30 in rpcsvc_notify (trans=0x2bf51c0,
mydata=value optimized out, event=value optimized out,
data=0x2bf51c0) at rpcsvc.c:758
#11 0x7feeba778638 in rpc_transport_notify (this=value
optimized out, event=value optimized out, data=value
optimized out) at rpc-transport.c:512
#12 0x7feeb115e971 in socket_event_poll_err (fd=value
optimized out, idx=value optimized out, data=0x2bf51c0,
poll_in=value optimized out, poll_out=0,
 poll_err=0) at socket.c:1071
#13 socket_event_handler (fd=value optimized out,
idx=value optimized out, data=0x2bf51c0, poll_in=value
optimized out, poll_out=0, poll_err=0) at socket.c:2240
#14 0x7feeba9fc6a7 in event_dispatch_epoll_handler
(event_pool=0x22e2d00) at event-epoll.c:384
#15 event_dispatch_epoll (event_pool=0x22e2d00) at
event-epoll.c:445
#16 0x00407e93 in main (argc=19,
argv=0x7fffea6c7f88) at glusterfsd.c:2023
(gdb) f 4
#4  pl_inodelk_client_cleanup (this=0x230d910,
ctx=0x7fee700f0c60) at inodelk.c:428
428             pl_inodelk_log_cleanup (l);
(gdb) p l->pl_inode->refkeeper
$1 = (inode_t *) 0x0
(gdb)

pl_inode->refkeeper was found to be NULL even when there
were some blocked inodelks in a certain domain of the inode,
which when dereferenced by the epoll thread in the cleanup
codepath led to a crash.

On inspecting the code (for want of a consistent
reproducer), three things were found:

1. The function where the crash happens
(pl_inodelk_log_cleanup()), makes an attempt to resolve the
inode to path as can be seen below. But the way inode_path()
itself
 works is to first construct the path based on the given
inode's ancestry and place it in the buffer provided. And if
all else fails, the gfid of the inode is placed in a certain
format (gfid:%s).
 This eliminates the need for statements from line 4
through 7 below, thereby preventing dereferencing of
pl_inode->refkeeper.
 Now, although this change prevents the crash
altogether, it still does not fix the race that led to
pl_inode->refkeeper becoming NULL, and comes at the cost of
 printing (null) in the log message on line 9 every
time pl_inode->refkeeper is found to be NULL, rendering the
logged messages

Re: [Gluster-devel] Need sensible default value for detecting unclean client disconnects

2014-05-20 Thread Anand Avati
Niels,
This is a good addition. While gluster clients do a reasonably good job at
detecting dead/hung servers with ping-timeout, the server side detection
has been rather weak. TCP_KEEPALIVE has helped to some extent, for cases
where an idling client (which holds a lock) goes dead. However, if an active
client with pending data in the server's socket buffer dies, we have had to
wait for the long TCP retransmission cycle to finish and give up.

The way I see it, this option is complementary to TCP_KEEPALIVE (keepalive
works only for idle connections, user_timeout works only when there are
pending acknowledgements, so together they cover the full spectrum). To
that end, it might make sense to present the admin a single timeout
configuration value rather than two. It would be very frustrating for the
admin to configure one of them to, say, 30 seconds, and then find that the
server does not clean up after 30 seconds of a hung client only because the
connection was idle (or not idle). Configuring a second timeout for the
other case can be very unintuitive.

In fact, I would suggest to have a single network timeout configuration,
which gets applied to all the three: ping-timeout on the client,
user_timeout on the server, keepalive on both. I think that is what a user
would be expecting anyways. Each is for a slightly different technical
situation, but all just internal details as far as a user is concerned.

Thoughts?
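
For illustration, applying a single admin-facing timeout (in seconds) to
both mechanisms on a connected socket would look roughly like this. The
socket options are the standard Linux ones; the function itself and the
keepalive probe split are just a sketch, not the rpc-transport/socket code:

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

static int
apply_network_timeout (int sock, int timeout_sec)
{
        int on = 1;
        int idle  = timeout_sec;   /* seconds of idle before first probe */
        int intvl = 1;             /* seconds between probes             */
        int cnt   = 5;             /* probes before declaring it dead    */
        unsigned int user_timeout_ms = timeout_sec * 1000;

        if (setsockopt (sock, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof (on)))
                return -1;
        if (setsockopt (sock, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof (idle)))
                return -1;
        if (setsockopt (sock, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof (intvl)))
                return -1;
        if (setsockopt (sock, IPPROTO_TCP, TCP_KEEPCNT, &cnt, sizeof (cnt)))
                return -1;

        /* TCP_USER_TIMEOUT (Linux >= 2.6.37): fail pending, unacked data
         * after this many milliseconds instead of the ~15-20 minutes of
         * retransmissions. */
#ifdef TCP_USER_TIMEOUT
        if (setsockopt (sock, IPPROTO_TCP, TCP_USER_TIMEOUT,
                        &user_timeout_ms, sizeof (user_timeout_ms)))
                return -1;
#endif
        return 0;
}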


On Tue, May 20, 2014 at 4:30 AM, Niels de Vos nde...@redhat.com wrote:

 Hi all,

 the last few days I've been looking at a problem [1] where a client
 locks a file over a FUSE-mount, and a 2nd client tries to grab that lock
 too.  It is expected that the 2nd client gets blocked until the 1st
 client releases the lock. This all work as long as the 1st client
 cleanly releases the lock.

 Whenever the 1st client crashes (like a kernel panic) or the network is
 split and the 1st client is unreachable, the 2nd client may not get the
 lock until the bricks detect that the connection to the 1st client is
 dead. If there are pending Replies, the bricks may need 15-20 minutes
 until the re-transmissions of the replies have timed-out.

 The current default of 15-20 minutes is quite long for a fail-over
 scenario. Relatively recently [2], the Linux kernel got
 a TCP_USER_TIMEOUT socket option (similar to TCP_KEEPALIVE). This option
 can be used to configure a per-socket timeout, instead of a system-wide
 configuration through the net.ipv4.tcp_retries2 sysctl.

 The default network.ping-timeout is set to 42 seconds. I'd like to
 propose a network.tcp-timeout option that can be set per volume. This
 option should then set TCP_USER_TIMEOUT for the socket, which causes
 re-transmission failures to be fatal after the timeout has passed.

 Now the remaining question, what shall be the default timeout in seconds
 for this new network.tcp-timeout option? I'm currently thinking of
 making it high enough (like 5 minutes) to prevent false positives.

 Thoughts and comments welcome,
 Niels


 1 https://bugzilla.redhat.com/show_bug.cgi?id=1099460
 2
 http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=dca43c7
 ___
 Gluster-devel mailing list
 Gluster-devel@gluster.org
 http://supercolony.gluster.org/mailman/listinfo/gluster-devel

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] New project on the Forge - gstatus

2014-05-16 Thread Anand Avati
KP, Vipul,

It will be awesome to get io-stats like instrumentation on the client side.
Here are some further thoughts on how to implement that. If you have a
recent git HEAD build, I would suggest that you explore the latency stats
on the client side exposed through meta at
$MNT/.meta/graphs/active/$xlator/profile. You can enable latency
measurement with echo 1 > $MNT/.meta/measure_latency. I would suggest
extending these stats with the extra ones io-stats has, and make
glusterfsiostats expose these stats.

If you can compare libglusterfs/src/latency.c:gf_latency_begin(),
gf_latency_end() and gf_latency_udpate() with the macros in io-stats.c
UPDATE_PROFILE_STATS()
and START_FOP_LATENCY(), you will quickly realize how a lot of logic is
duplicated between io-stats and latency.c. If you can enhance latency.c and
make it capture the remaining stats what io-stats is capturing, the
benefits of this approach would be:

- stats are already getting captured at all xlator levels, and not just at
the position where io-stats is inserted
- file like interface makes the stats more easily inspectable and
consumable, and updated on the fly
- conforms with the way rest of the internals are exposed through $MNT/.meta

In order to do this, you might want to look into:

- latency.c as of today captures fop count, mean latency, total time,
whereas io-stats measures these along with min-time, max-time and
block-size histogram.
- extend gf_proc_dump_latency_info() to dump the new stats
- either prettify that output like 'volume profile info' output, or JSONify
it like xlators/meta/src/frames-file.c
- add support for cumulative vs interval stats (store an extra copy of
this-latencies[])

etc..
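
As a rough illustration of the direction, an extended per-fop latency record
could look like the sketch below. The field and function names are made up
for the example (they are not the existing fop_latency_t or
gf_latency_update()):

#include <stddef.h>
#include <stdint.h>

#define GF_BLOCK_BUCKETS 32   /* power-of-two size buckets, as io-stats does */

struct fop_latency_ext {
        double   min;     /* fastest observed call (usec) */
        double   max;     /* slowest observed call (usec) */
        double   total;   /* sum of all latencies  (usec) */
        uint64_t count;   /* number of calls              */
        uint64_t block_hist[GF_BLOCK_BUCKETS]; /* read/write size histogram */
};

static void
latency_update_ext (struct fop_latency_ext *lat, double elapsed_usec,
                    size_t block_size)
{
        int    bucket = 0;
        size_t sz     = block_size;

        if (lat->count == 0 || elapsed_usec < lat->min)
                lat->min = elapsed_usec;
        if (elapsed_usec > lat->max)
                lat->max = elapsed_usec;
        lat->total += elapsed_usec;
        lat->count++;

        /* bucket by power of two; only meaningful for data fops */
        while (sz > 1 && bucket < GF_BLOCK_BUCKETS - 1) {
                sz >>= 1;
                bucket++;
        }
        lat->block_hist[bucket]++;
}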

Thanks!


On Fri, Apr 25, 2014 at 9:09 PM, Krishnan Parthasarathi kpart...@redhat.com
 wrote:

 [Resending due to gluster-devel mailing list issue]

 Apologies for the late reply.

 glusterd uses its socket connection with brick processes (where io-stats
 xlator is loaded) to
 gather information from io-stats via an RPC request. This facility is
 restricted to brick processes
 as it stands today.

 Some background ...
 io-stats xlator is loaded, both in GlusterFS mounts and brick processes.
 So, we have the capabilities
 to monitor I/O statistics on both sides. To collect I/O statistics at the
 server side, we have

 # gluster volume profile VOLNAME [start | info | stop]
 AND
 #gluster volume top VOLNAME info [and other options]

 We don't have a usable way of gathering I/O statistics (not monitoring,
 though the counters could be enhanced)
 at the client-side, ie. for a given mount point. This is the gap
 glusterfsiostat aims to fill. We need to remember
 that the machines hosting GlusterFS mounts may not have glusterd installed
 on them.

 We are considering rrdtool as a possible statistics database because it
 seems like a natural choice for storing time-series
 data. rrdtool is capable of answering high-level statistical queries on
 statistics that were logged in it by io-stats xlator
 over and above printing running counters periodically.

 Hope this gives some more clarity on what we are thinking.

 thanks,
 Krish
 - Original Message -

  Probably me not understanding.

  the comment about io-stats making data available to glusterd over RPC is
  what I latched on to. I wondered whether this meant that a socket could be
  opened that way to get at the io-stats data flow.

  Cheers,

  PC

  - Original Message -

   From: Vipul Nayyar nayyar_vi...@yahoo.com
 
   To: Paul Cuzner pcuz...@redhat.com, Krishnan Parthasarathi
   kpart...@redhat.com
 
   Cc: Vijay Bellur vbel...@redhat.com, gluster-devel
   gluster-de...@nongnu.org
 
   Sent: Thursday, 20 February, 2014 5:06:27 AM
 
   Subject: Re: [Gluster-devel] New project on the Forge - gstatus
 

   Hi Paul,
 

    I'm really not sure if this can be done in Python (at least comfortably).
    Maybe we can tread on the same path as Justin's glusterflow in Python, but
    I don't think all the io-stats counters will be available with the way
    Justin used Jeff Darcy's previous work to build his tool. I could be
    wrong. My knowledge is a bit incomplete and based on very little
    experience as a user and an amateur Gluster developer. Please do correct
    me if I am wrong.
 

   Regards
 
   Vipul Nayyar
 
 ___
 Gluster-devel mailing list
 Gluster-devel@gluster.org
 http://supercolony.gluster.org/mailman/listinfo/gluster-devel

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Spurious failures because of nfs and snapshots

2014-05-15 Thread Anand Avati
On Thu, May 15, 2014 at 5:49 PM, Pranith Kumar Karampuri 
pkara...@redhat.com wrote:

 hi,
 In the latest build I fired for review.gluster.com/7766 (
 http://build.gluster.org/job/regression/4443/console) failed because of
 spurious failure. The script doesn't wait for nfs export to be available. I
 fixed that, but interestingly I found quite a few scripts with same
 problem. Some of the scripts are relying on 'sleep 5' which also could lead
 to spurious failures if the export is not available in 5 seconds. We found
 that waiting for 20 seconds is better, but 'sleep 20' would unnecessarily
 delay the build execution. So if you guys are going to write any scripts
 which has to do nfs mounts, please do it the following way:

 EXPECT_WITHIN 20 1 is_nfs_export_available;
 TEST mount -t nfs -o vers=3 $H0:/$V0 $N0;


Please also always add -o soft,intr in the regression scripts when
mounting NFS. It becomes so much easier to clean up any hung mess. We
probably need an NFS mounting helper function which can be called like:

TEST mount_nfs $H0:/$V0 $N0;

Thanks

Avati
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Automatically building RPMs upon patch submission?

2014-05-12 Thread Anand Avati
On Mon, May 12, 2014 at 4:23 PM, Justin Clift jus...@gluster.org wrote:

 On 12/05/2014, at 9:04 PM, Anand Avati wrote:
 snip
  And yeah, the other reason: if a dev pushes a series/set of dependent
 patches, regression needs to run only on the last one (regression
 test/voting is cumulative for the set). Running regression on all the
 individual patches (like a smoke test) would be very wasteful, and tricky
 to avoid (this was the part which I couldn't solve)

 What's the manual, intelligence-required process we use to
 do this atm?  e.g. for people wanting to test the combined patch
 set

 Side note - I'm mucking around with the Gerrit Trigger plugin in
 a VM on my desktop running Jenkins.  So if you see any strangeness
 in things with Gerrit comments, it could be me. (feel free to ping
 me as needed)


http://build.gluster.org/job/regression/build - key in the gerrit patch
number for the CHANGE_ID field, and click 'Build'.
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] dht: selfheal of missing directories on nameless (by GFID) LOOKUP

2014-05-04 Thread Anand Avati
On Sun, May 4, 2014 at 9:22 AM, Niels de Vos nde...@redhat.com wrote:

 Hi,

 bug 1093324 has been opened and we have identified the following cause:

 1. an NFS-client does a LOOKUP of a directory on a volume
 2. the NFS-client receives a filehandle (contains volume-id + GFID)
 3. add-brick is executed, but the new brick does not have any
directories yet
 4. the NFS-client creates a new file in the directory; this request is
in the format of filehandle/filename, where the filehandle was received
in step 2
 5. the NFS-server does a LOOKUP on the parent directory identified by
the filehandle - nameless LOOKUP, only GFID is known
 6. the old brick(s) return successfully
 7. the new brick returns ESTALE
 8. the NFS-server returns ESTALE to the NFS-client

 In this case, the NFS-client should not receive an ESTALE. There is also
 no ESTALE error passed to the client when this procedure is done over
 FUSE or samba/libgfapi.

 Selfhealing a directory entry based only on a GFID is not always
 possible. Files do not have a unique filename (hardlinks), so it is not
 trivial to find a filename for a GFID (expensive operation, and the
 result could be a list). However, for a directory this is simpler.
 A directory is not hardlink'd in the .glusterfs directory, directories
 are maintained as symbolic-links. This makes it possible to find the
 name of a directory, when only the GFID is known.
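
For illustration, recovering a directory's name from its GFID boils down to
a readlink() on its .glusterfs handle. The sketch below assumes the standard
<brick>/.glusterfs/<gg>/<gg>/<gfid> layout; the function name and error
handling are purely illustrative:

#include <limits.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* gfid_str is the canonical 36-character uuid string of the directory */
static int
dir_name_from_gfid (const char *brick_path, const char *gfid_str,
                    char *name_out, size_t name_len)
{
        char    handle[PATH_MAX];
        char    target[PATH_MAX];
        ssize_t len;
        char   *base;

        snprintf (handle, sizeof (handle), "%s/.glusterfs/%.2s/%.2s/%s",
                  brick_path, gfid_str, gfid_str + 2, gfid_str);

        len = readlink (handle, target, sizeof (target) - 1);
        if (len < 0)
                return -1;      /* missing handle, or not a directory */
        target[len] = '\0';

        /* target looks like ../../<p0p1>/<p2p3>/<parent-gfid>/<name> */
        base = strrchr (target, '/');
        if (!base)
                return -1;
        snprintf (name_out, name_len, "%s", base + 1);
        return 0;
}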

 Currently DHT is not able to selfheal directories on a nameless LOOKUP.
 I think that it should be possible to change this, and to fix the ESTALE
 returned by the NFS-server.

 At least two changes would be needed, and this is where I would like to
 hear opinions from others about it:

 - The posix-xlator should be able to return the directory name when
   a GFID is given. This can be part of the LOOKUP-reply (dict), and that
   would add a readlink() syscall for each nameless LOOKUP that finds
   a directory. Or (suggested by Pranith) add a virtual xattr and handle
   this specific request with an additional FGETXATTR call.


I think the LOOKUP-reply with readlink() is better, instead of a new
over-the-wire FOP.



 - DHT should selfheal the directory when at least one ESTALE is returned
   by the bricks.



This also makes sense, except when even the parent directory is missing on
that server (yet to be healed). Another important point to note is that the
directories (with the same GFID) may themselves be present at various
locations, as different dentries, on the various servers. A lookup of
dir-gfid/name should succeed transparently, independent of the differing
dentries for dir-gfid across servers.

However, if you want to heal, the choice of the server from which you select
the dir's parent and name becomes important, as the self-heal will impose
that choice on the other servers. For example, one of the AFR subvolumes may
not yet have healed the parent directories. Or, the N-1 servers may each
return a different par-gfid/dir-name in the LOOKUP reply. So it can quickly
get hairy.

As a general approach, using the LOOKUP-reply to send parent info from the
posix level makes sense. But we also need a more detailed proposal on how
that info is used at the cluster xlator levels to achieve a higher level
goal, like self-heal.


 When all bricks return ESTALE, the ESTALE is valid and
should be passed on to the upper layers (NFS-server -> NFS-client).


Yes.

Thanks
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel