Re: [Gluster-devel] regarding special treatment of ENOTSUP for setxattr
I think with the repetitive log message suppression patch being merged, we don't really need gf_log_occasionally (except for messages logged at DEBUG or TRACE levels).

----- Original Message -----
From: Pranith Kumar Karampuri pkara...@redhat.com
To: Vijay Bellur vbel...@redhat.com
Cc: gluster-devel@gluster.org, Anand Avati aav...@redhat.com
Sent: Wednesday, 7 May, 2014 3:12:10 PM
Subject: Re: [Gluster-devel] regarding special treatment of ENOTSUP for setxattr

> ----- Original Message -----
> From: Vijay Bellur vbel...@redhat.com
> To: Pranith Kumar Karampuri pkara...@redhat.com, Anand Avati aav...@redhat.com
> Cc: gluster-devel@gluster.org
> Sent: Tuesday, May 6, 2014 7:16:12 PM
> Subject: Re: [Gluster-devel] regarding special treatment of ENOTSUP for setxattr
>
> On 05/06/2014 01:07 PM, Pranith Kumar Karampuri wrote:
> > hi,
> >     Why is there occasional logging for the ENOTSUP errno when setxattr fails?
>
> In the absence of occasional logging, the log files would be flooded with
> this message every time there is a setxattr() call.
>
> -Vijay

How do we know which keys are failing setxattr with ENOTSUP if it is not logged, when the key keeps changing?

Pranith

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel
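The idea behind gf_log_occasionally can be sketched as a simple counter-based rate limiter: emit the first occurrence of a repeating message and then only every Nth repeat. This is a minimal standalone sketch, not the GlusterFS implementation; the function names and the per-site counter are invented for illustration.

```c
#include <stdio.h>

/* Returns 1 for occurrences 0, N, 2N, ... of a repeating event,
 * 0 otherwise. A real implementation would key one counter per
 * message site; here the caller keeps it explicit. */
static int should_log(unsigned *counter, unsigned every_nth)
{
    return ((*counter)++ % every_nth) == 0;
}

/* usage sketch: guard the noisy message site with the counter */
static unsigned enotsup_hits;

static void log_enotsup(const char *key)
{
    if (should_log(&enotsup_hits, 100))
        fprintf(stderr, "setxattr on key %s failed: ENOTSUP "
                        "(seen %u times so far)\n", key, enotsup_hits);
}
```

The trade-off discussed in the thread is visible here: suppression keeps the log readable, but when the failing key changes between calls, the suppressed occurrences carry keys that never reach the log.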
Re: [Gluster-devel] corrupted hash table
Hi Emmanuel,

Is it possible to get valgrind reports (or a test case which caused this crash)? The inode table is corrupted in this case.

regards,
Raghavendra

----- Original Message -----
From: Emmanuel Dreyfus m...@netbsd.org
To: gluster-devel@gluster.org
Sent: Wednesday, May 21, 2014 12:43:50 PM
Subject: Re: [Gluster-devel] corrupted hash table

> Nobody has an idea on this one? This is the master branch, client side:
>
> Program terminated with signal 11, Segmentation fault.
> #0  uuid_unpack (in=0xffc0 <Address 0xffc0 out of bounds>, uu=0xbf7fd7b0)
>     at ../../contrib/uuid/unpack.c:43
> warning: Source file is more recent than executable.
> 43          tmp = *ptr++;
> (gdb) print tmp
> Cannot access memory at address 0xffc0
> (gdb) bt
> #0  uuid_unpack (in=0xffc0 <Address 0xffc0 out of bounds>, uu=0xbf7fd7b0)
>     at ../../contrib/uuid/unpack.c:43
> #1  0xbb788f63 in uuid_compare (uu1=0xffc0 <Address 0xffc0 out of bounds>,
>     uu2=0xb811f938 "k\350_6") at ../../contrib/uuid/compare.c:46
> #2  0xbb769993 in __inode_find (table=0xbb213368, gfid=0xb811f938 "k\350_6")
>     at inode.c:763
> #3  0xbb769cdc in __inode_link (inode=0x5a70b768, parent=<optimized out>,
>     name=0x5a7e3148 "conf24746.file", iatt=0xb811f930) at inode.c:831
> #4  0xbb769f3f in inode_link (inode=0x5a70b768, parent=0x5af47728,
>     name=0x5a7e3148 "conf24746.file", iatt=0xb811f930) at inode.c:892
> #5  0xbb36bdaa in fuse_create_cbk (frame=0xba417c44, cookie=0xbb28cb98,
>     this=0xb9cbe018, op_ret=0, op_errno=0, fd=0xb799e808, inode=0x5a70b768,
>     buf=0xb811f930, preparent=0xb811f998, postparent=0xb811fa00, xdata=0x0)
>     at fuse-bridge.c:1888
> #6  0xb92a30a0 in io_stats_create_cbk (frame=0xbb28cb98, cookie=0xbb287418,
>     this=0xb9df2018, op_ret=0, op_errno=0, fd=0xb799e808, inode=0x5a70b768,
>     buf=0xb811f930, preparent=0xb811f998, postparent=0xb811fa00, xdata=0x0)
>     at io-stats.c:1260
> #7  0xb92afd80 in mdc_create_cbk (frame=0xbb287418, cookie=0xbb28d7d8,
>     this=0xb9df1018, op_ret=0, op_errno=0, fd=0xb799e808, inode=0x5a70b768,
>     buf=0xb811f930, preparent=0xb811f998, postparent=0xb811fa00, xdata=0x0)
>     at md-cache.c:1404
> #8  0xb92c790f in ioc_create_cbk (frame=0xbb28d7d8, cookie=0xbb28b008,
>     this=0xb9dee018, op_ret=0, op_errno=0, fd=0xb799e808, inode=0x5a70b768,
>     buf=0xb811f930, preparent=0xb811f998, postparent=0xb811fa00, xdata=0x0)
>     at io-cache.c:701
> #9  0xbb3079ba in ra_create_cbk (frame=0xbb28b008, cookie=0xbb287f08,
>     this=0xb9dec018, op_ret=0, op_errno=0, fd=0xb799e808, inode=0x5a70b768,
>     buf=0xb811f930, preparent=0xb811f998, postparent=0xb811fa00, xdata=0x0)
>     at read-ahead.c:173
> #10 0xb92f66b9 in dht_create_cbk (frame=0xbb287f08, cookie=0xbb28f988,
>     this=0xb9cff018, op_ret=0, op_errno=0, fd=0xb799e808, inode=0x5a70b768,
>     stbuf=0xb811f930, preparent=0xb811f998, postparent=0xb811fa00,
>     xdata=0x5c491028) at dht-common.c:3942
> #11 0xb932fa22 in afr_create_unwind (frame=0xba40439c, this=0xb9cfd018)
>     at afr-dir-write.c:397
> #12 0xb9330a02 in __afr_dir_write_cbk (frame=0xba40439c, cookie=0x2,
>     this=0xb9cfd018, op_ret=0, op_errno=0, buf=0xbf7fdfe4,
>     preparent=0xbf7fdf7c, postparent=0xbf7fdf14, preparent2=0x0,
>     postparent2=0x0, xdata=0x5cb61ea8) at afr-dir-write.c:244
> #13 0xb939a401 in client3_3_create_cbk (req=0xb805f028, iov=0xb805f048,
>     count=1, myframe=0xbb28e4f8) at client-rpc-fops.c:2211
> #14 0xbb7daecf in rpc_clnt_handle_reply (clnt=0xb9cd93b8, pollin=0x5a7dbe38)
>     at rpc-clnt.c:767
> #15 0xbb7db7a4 in rpc_clnt_notify (trans=0xb80a7018, mydata=0xb9cd93d8,
>     event=RPC_TRANSPORT_MSG_RECEIVED, data=0x5a7dbe38) at rpc-clnt.c:895
> #16 0xbb7d7d9c in rpc_transport_notify (this=0xb80a7018,
>     event=RPC_TRANSPORT_MSG_RECEIVED, data=0x5a7dbe38) at rpc-transport.c:512
> #17 0xbb3214ab in socket_event_poll_in (this=0xb80a7018) at socket.c:2120
> #18 0xbb3246fc in socket_event_handler (fd=16, idx=4, data=0xb80a7018,
>     poll_in=1, poll_out=0, poll_err=0) at socket.c:2233
> #19 0xbb7a4c9a in event_dispatch_poll_handler (i=4, ufds=0xbb285118,
>     event_pool=0xbb242098) at event-poll.c:357
> #20 event_dispatch_poll (event_pool=0xbb242098) at event-poll.c:436
> #21 0xbb77a160 in event_dispatch (event_pool=0xbb242098) at event.c:113
> #22 0x08050567 in main (argc=4, argv=0xbf7fe880) at glusterfsd.c:2023
> (gdb) frame 2
> #2  0xbb769993 in __inode_find (table=0xbb213368, gfid=0xb811f938 "k\350_6")
>     at inode.c:763
> 763         if (uuid_compare (tmp->gfid, gfid) == 0) {
> (gdb) list
> 758                 return table->root;
> 759
> 760         hash = hash_gfid (gfid, 65536);
> 761
> 762         list_for_each_entry (tmp, &table->inode_hash[hash], hash) {
> 763                 if (uuid_compare (tmp->gfid, gfid) == 0) {
> 764                         inode = tmp;
> 765
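For readers unfamiliar with the code in frame 2: __inode_find hashes the gfid into a bucket and walks the chain comparing full 16-byte gfids. The sketch below mimics that lookup pattern with toy types (the struct names, the simplified hash, and the plain `hash_next` pointer are all stand-ins, not the GlusterFS structs); a corrupted chain pointer, as in this crash, would fault exactly at the gfid comparison.

```c
#include <string.h>
#include <stdint.h>
#include <stddef.h>

#define HASH_BUCKETS 65536

struct toy_inode {
    unsigned char gfid[16];
    struct toy_inode *hash_next;   /* stand-in for the list_head chaining */
};

static unsigned hash_gfid(const unsigned char *gfid)
{
    /* fold the last 4 bytes of the uuid into the bucket range;
     * illustrative only, not GlusterFS's hash_gfid() */
    uint32_t v;
    memcpy(&v, gfid + 12, sizeof(v));
    return v % HASH_BUCKETS;
}

static struct toy_inode *inode_find(struct toy_inode **buckets,
                                    const unsigned char *gfid)
{
    struct toy_inode *tmp;

    for (tmp = buckets[hash_gfid(gfid)]; tmp; tmp = tmp->hash_next)
        if (memcmp(tmp->gfid, gfid, 16) == 0)  /* uuid_compare analogue */
            return tmp;
    return NULL;   /* a bogus hash_next (e.g. 0xffc0) crashes the memcmp,
                      which is what the backtrace above shows */
}
```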
Re: [Gluster-devel] Plea for reviews
Jeff,

Comments inlined.

----- Original Message -----
From: Jeff Darcy jda...@redhat.com
To: Gluster Devel gluster-devel@gluster.org
Sent: Monday, June 23, 2014 6:53:53 PM
Subject: [Gluster-devel] Plea for reviews

> I have several patches queued up for 3.6, which have all passed regression
> tests. Unfortunately, they're all in areas where our resources are pretty
> thin, so getting the required +1 reviews is proving to be a challenge. The
> patches are as follows:
>
> * For heterogeneous bricks [1]
>   http://review.gluster.org/8093

Caught up with release schedules. I'll try to take this up on high priority.

> * For better SSL [2]
>   http://review.gluster.org/3695
>   http://review.gluster.org/8040
>   http://review.gluster.org/8094
>
> * Not in the feature list, but turning out to be important
>   http://review.gluster.org/7702
>
> I know these are all in tricky areas. I'd be glad to do walkthroughs to
> explain what each patch is doing in more detail. Thanks in advance to
> anyone who can help!
>
> [1] http://www.gluster.org/community/documentation/index.php/Features/heterogeneous-bricks
> [2] http://www.gluster.org/community/documentation/index.php/Features/better-ssl

regards,
Raghavendra.
Re: [Gluster-devel] Feature review: Improved rebalance performance
----- Original Message -----
From: Shyamsundar Ranganathan srang...@redhat.com
To: Xavier Hernandez xhernan...@datalab.es
Cc: gluster-devel@gluster.org
Sent: Tuesday, July 1, 2014 1:48:09 AM
Subject: Re: [Gluster-devel] Feature review: Improved rebalance performance

> From: Xavier Hernandez xhernan...@datalab.es
>
> Hi Shyam,
>
> On Thursday 26 June 2014 14:41:13 Shyamsundar Ranganathan wrote:
> > It also touches upon a rebalance-on-access-like mechanism where we could
> > potentially move data out of existing bricks to a newer brick faster, in
> > the case of brick addition (and vice versa for brick removal), and heal
> > the rest of the data on access.
>
> Will this rebalance-on-access feature be enabled always, or only during a
> brick addition/removal to move files that do not go to the affected brick
> while the main rebalance is populating or removing files from the brick?

The rebalance on access, in my head, stands as follows (a little more detailed than what is in the feature page):

Step 1: Initiation of the process
- Admin chooses to rebalance _changed_ bricks. This could mean added/removed/changed-size bricks.
- [3] Rebalance on access is triggered, so as to move files when they are accessed, but asynchronously.
- [1] Background rebalance acts only to (re)move data (from)to these bricks.
- [2] This would also change the layout for all directories, to include the new configuration of the cluster, so that newer data is placed in the correct bricks.

Step 2: Completion of background rebalance
- Once background rebalance is complete, the rebalance status is noted as success/failure based on what the background rebalance process did.
- This will not stop the on-access rebalance, as data is still all over the place, and enhancements like lookup-unhashed=auto will have trouble.

Step 3: Admin can initiate a full rebalance
- When this is complete, the on-access rebalance would be turned off, as the cluster is rebalanced!

Step 2.5/4: Choosing to stop the on-access rebalance
- This can be initiated by the admin, post step 3 (which is more logical) or between steps 2 and 3, in which case lookup-everywhere for files etc. cannot be avoided due to [2] above.

Issues and possible solutions:

[4] One other thought is to create link files, as a part of [1], for files that do not belong to the right bricks but are _not_ going to be rebalanced, as their source/destination is not a changed brick. This _should_ be faster than moving data around and rebalancing these files. It should also avoid the problem that, post a rebalance _changed_ command, the cluster may have files in the wrong place based on the layout, as the link files would be present to correct the situation. In this situation the rebalance on access can be left on indefinitely, and turning it off does not serve much purpose.

Enabling rebalance on access always is fine, but I am not sure it buys us gluster states that mean the cluster is in a balanced situation, for other actions like the lookup-unhashed mentioned, which may need more than just the link files in place. Examples could be mismatched or overly space-committed bricks with old, not-accessed data, etc., but I do not have a clear example yet.

Just stating: the core intention of rebalance _changed_ is to create space in existing bricks faster when the cluster grows, or to be able to remove bricks from the cluster faster.

Redoing a rebalance _changed_ again due to a gluster configuration change (i.e. expanding the cluster again, say) needs some thought. It does not matter whether rebalance on access is running or not; the only thing it may impact is the choice of files that were already put into the on-access queue based on the older layout, due to the older cluster configuration. Just noting this here.

In short, if we do [4] then we can leave rebalance on access turned on always, unless there are counter-examples or use cases we have not thought of. Doing [4] seems logical, so I would state that we should, but from the performance angle of improving rebalance, we need to weigh it against the IO-path access cost of not having [4] (again, considering the improvement that lookup-unhashed brings, it is maybe obvious that [4] should be done).

A note on [3]: the intention is to start an asynchronous sync task that rebalances the file on access, and not impact the IO path. So if a file is identified by the IO path as needing a rebalance, a sync task with the required xattr to trigger a file move is set up, and setxattr is called; that should take care of the file migration while letting the IO path progress as is.

Reading through your mail, a better way of doing this, by sharing the load, would be to use an index, so that each node in the cluster has a list of accessed files that need a rebalance. The above method for [3] would be client-heavy and would incur a network read and write, whereas the index manner of doing
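The "is this file on the wrong brick?" decision that drives both background and on-access rebalance can be sketched as: hash the name, find the subvolume owning that hash range under the old and the new layout, and migrate only if they differ. Everything below is a simplification for illustration — DHT's real layouts are per-directory hash ranges, and the hash function and names here are invented.

```c
#include <stdint.h>

static uint32_t toy_hash(const char *name)   /* stand-in for dht_hash_compute */
{
    uint32_t h = 2166136261u;                /* FNV-1a, for illustration */
    while (*name)
        h = (h ^ (uint32_t)(unsigned char)*name++) * 16777619u;
    return h;
}

/* equal contiguous ranges: subvol i owns [i*2^32/n, (i+1)*2^32/n) */
static int subvol_for(uint32_t hash, int nsubvols)
{
    return (int)(((uint64_t)hash * (uint64_t)nsubvols) >> 32);
}

/* migrate only files whose owning subvolume changed when the brick
 * count changed; everything else at most needs a link file ([4]) */
static int needs_migration(const char *name, int old_subvols, int new_subvols)
{
    uint32_t h = toy_hash(name);
    return subvol_for(h, old_subvols) != subvol_for(h, new_subvols);
}
```

With this predicate, the on-access path only queues a file for the async sync task when `needs_migration` is true; files that stay put are exactly the candidates for the link-file shortcut in [4].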
Re: [Gluster-devel] Feature review: Improved rebalance performance
----- Original Message -----
From: Xavier Hernandez xhernan...@datalab.es
To: Raghavendra Gowdappa rgowd...@redhat.com
Cc: Shyamsundar Ranganathan srang...@redhat.com, gluster-devel@gluster.org
Sent: Tuesday, July 1, 2014 3:10:29 PM
Subject: Re: [Gluster-devel] Feature review: Improved rebalance performance

> On Tuesday 01 July 2014 02:37:34 Raghavendra Gowdappa wrote:
> > > Another thing to consider for future versions is to modify the current
> > > DHT to use consistent hashing, and even change the hash value (using
> > > the gfid instead of a hash of the name would solve the rename
> > > problem). The consistent hashing would drastically reduce the number
> > > of files that need to be moved and already solves some of the current
> > > problems. This change needs a lot of thinking though.
> >
> > The problem with using the gfid for hashing instead of the name is that
> > we run into a chicken-and-egg problem: before lookup, we cannot know the
> > gfid of the file, and to look up the file, we need the gfid to find out
> > the node on which the file resides. Of course, this problem would go
> > away if we looked up (maybe just during fresh lookups) on all the nodes,
> > but that slows down fresh lookups and may not be acceptable.
>
> I think it's not so problematic, and the benefits would be considerable.
> The gfid of the root directory is always known. This means that we could
> always do a lookup on root by gfid. I haven't tested it, but as I
> understand it, when you want to do a getxattr on a file inside a
> subdirectory, for example, the kernel will issue lookups on all
> intermediate directories to check,

Yes, but how does dht handle these lookups? Are you suggesting that we wind the lookup call to all subvolumes (since we don't know which subvolume the file is present on, for lack of a gfid)?

> at least, the access rights before finally reading the xattr of the file.
> This means that we can get and cache the gfids of all intermediate
> directories in the process. Even if there's some operation that does not
> issue a previous lookup, we could do that lookup if it's not cached.
> Of course, if there were many more operations not issuing a previous
> lookup, this solution wouldn't be good, but I think this is not the case.
> I'll try to do some tests to see if this is correct.
>
> Xavi
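Xavi's point that consistent hashing "would drastically reduce the number of files that need to be moved" comes from how keys are placed on a ring: each key belongs to the first node clockwise from it, so adding a node only reassigns the keys in one arc. The sketch below is a minimal illustration of that placement rule, not DHT's actual layout algorithm; node positions and names are invented.

```c
#include <stdint.h>

/* Pick the node owning `key` on a consistent-hashing ring: the node
 * with the smallest position >= key, wrapping to the lowest position
 * when no node lies ahead. Returns an index into `nodes`. */
static int ring_owner(uint32_t key, const uint32_t *nodes, int n)
{
    int best = -1;
    uint32_t best_pos = 0;

    for (int i = 0; i < n; i++)
        if (nodes[i] >= key && (best == -1 || nodes[i] < best_pos)) {
            best = i;
            best_pos = nodes[i];
        }

    if (best == -1) {                /* wrapped around the ring */
        best = 0;
        for (int i = 1; i < n; i++)
            if (nodes[i] < nodes[best])
                best = i;
    }
    return best;
}
```

Adding a node at position P only changes ownership for keys in the arc between P and its predecessor; every other key keeps its owner, which is exactly the property that shrinks rebalance traffic compared to re-dividing the whole hash space.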
[Gluster-devel] syncops and thread specific memory regions
Hi all,

The bug fixed by [1] is one instance of a class of problems where:

1. We access a variable which is stored in a thread-specific area and hence can live in different memory regions across different threads.
2. A single (code) control flow is executed in more than one thread.
3. Optimization prevents recalculating the address of the variable mentioned in 1 every time it is accessed, instead using an address calculated earlier.

The bug fixed by [1] involved errno as the variable. However, there are other pointers which are stored in TLS, like:

1. The xlator object in whose context the current code is executing (aka THIS, set/read using __glusterfs_this_location()).
2. A buffer used to parse binary uuids into strings (used by uuid_utoa()).

I think we can hit the corruption uncovered by [1] in the above two scenarios too. Comments?

[1] http://review.gluster.org/6475

regards,
Raghavendra.
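The hazard is easy to demonstrate in isolation: a pointer into thread-local storage is only meaningful on the thread that computed it, so if a syncop-style control flow resumes on a different thread, a cached address of errno, THIS, or the uuid_utoa() buffer points into the wrong thread's copy. The sketch below just shows that two threads see different addresses for the same `__thread` variable; it is an illustration, not GlusterFS code.

```c
#include <pthread.h>
#include <stddef.h>

static __thread int tls_slot;            /* one copy per thread */

static void *addr_of_slot(void *out)
{
    *(int **)out = &tls_slot;            /* record this thread's address */
    return NULL;
}

/* compare the calling thread's &tls_slot with a worker thread's;
 * returns 1 when they differ (i.e. caching the address across a
 * thread switch would dereference the wrong copy) */
static int tls_addresses_differ(void)
{
    pthread_t t;
    int *mine = &tls_slot, *theirs = NULL;

    pthread_create(&t, NULL, addr_of_slot, &theirs);
    pthread_join(t, NULL);
    return mine != theirs;
}
```

If the compiler hoists the address computation out of a loop (point 3 above) and the flow migrates between threads in between, the code keeps using `mine` on a thread where only `theirs` is valid — which is exactly the corruption pattern [1] fixed for errno.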
Re: [Gluster-devel] regarding inode_link/unlink
----- Original Message -----
From: Pranith Kumar Karampuri pkara...@redhat.com
To: Raghavendra Gowdappa rgowd...@redhat.com
Cc: Gluster Devel gluster-devel@gluster.org, Anand Avati av...@gluster.org, Brian Foster bfos...@redhat.com, Raghavendra Bhat rab...@redhat.com
Sent: Friday, July 4, 2014 5:39:03 PM
Subject: Re: regarding inode_link/unlink

> On 07/04/2014 04:28 PM, Raghavendra Gowdappa wrote:
> > ----- Original Message -----
> > From: Pranith Kumar Karampuri pkara...@redhat.com
> > To: Gluster Devel gluster-devel@gluster.org, Anand Avati av...@gluster.org, Brian Foster bfos...@redhat.com, Raghavendra Gowdappa rgowd...@redhat.com, Raghavendra Bhat rab...@redhat.com
> > Sent: Friday, July 4, 2014 3:44:29 PM
> > Subject: regarding inode_link/unlink
> >
> > > hi,
> > >     I have a doubt about when a particular dentry_unset, and thus
> > > inode_unref on the parent dir, happens in fuse-bridge in gluster.
> > > When a file is looked up for the first time, fuse_entry_cbk does
> > > 'inode_link' with parent-gfid/bname. Whenever an
> > > unlink/rmdir/(lookup gives ENOENT) happens, the corresponding inode
> > > unlink happens. The question is, will the present set of operations
> > > lead to leaks:
> > > 1) Mount 'M0' creates a file 'a'
> > > 2) Mount 'M1' of the same volume deletes file 'a'
> > > M0 never touches 'a' anymore. When will inode_unlink happen in such
> > > cases? Will it lead to memory leaks?
> >
> > The kernel will eventually send forget(a) on M0 and that will clean up
> > the dentries and inode. It's equivalent to a file being looked up and
> > never used again (deleting doesn't matter in this case).
>
> Do you know the trigger points for that? When I do 'touch a' on the mount
> point and leave the system like that, forget is not coming. If I do
> unlink on the file then forget is coming.

I am not very familiar with how the kernel manages its inodes. However, as Avati has mentioned in another mail, you can force the kernel to send forgets by invalidating the inode. I think he has given enough details in that mail.

> Pranith
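The bookkeeping being discussed can be sketched as a per-inode lookup count: every lookup reply to the kernel bumps it, and the kernel's FORGET says how many lookups it is dropping; only at zero can fuse-bridge release the inode and its dentries. The struct and function names below are illustrative stand-ins, not the fuse-bridge ones.

```c
#include <stdint.h>

struct nl_inode {
    uint64_t nlookup;    /* lookups the kernel still holds */
};

static void lookup_reply(struct nl_inode *in)
{
    in->nlookup++;       /* kernel now references this inode once more */
}

/* kernel FORGET: drop `nlookup` references; returns 1 when the inode
 * may finally be purged from the inode table */
static int forget(struct nl_inode *in, uint64_t nlookup)
{
    in->nlookup -= nlookup;
    return in->nlookup == 0;
}
```

This also shows why a plain `touch a` produces no forget: the kernel keeps the cached lookup until memory pressure, unmount, or an explicit invalidation drives the count back to zero.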
Re: [Gluster-devel] When inode table is populated?
----- Original Message -----
From: Jiffin Thottan jthot...@redhat.com
To: gluster-devel@gluster.org
Sent: Wednesday, July 30, 2014 12:22:30 PM
Subject: [Gluster-devel] When inode table is populated?

> Hi,
>
> When we were trying to call rename from a translator (in reconfigure)
> using STACK_WIND, the inode table (this->itable) value seems to be NULL.
> Since an inode is required for performing rename, when does the inode
> table get populated, and why is it not populated in reconfigure or init?

Not every translator has an inode table (nor is it required to). Only the translators which do inode management (like fuse-bridge, protocol/server, libgfapi, possibly the nfsv3 server?) will have an inode table associated with them. If you need to access the itable, you can do that using inode->table.

> Or should we create a private inode table and generate inodes using it?
>
> -Jiffin
Re: [Gluster-devel] Monotonically increasing memory
Anders,

Mostly it's a case of a memory leak. It would be helpful if you can file a bug on this. The following information would be useful to fix the issue:

1. Valgrind reports (if possible).

   a. To start the brick and nfs processes with valgrind, you can use the following cmdline when starting glusterd:

      # glusterd --xlator-option *.run-with-valgrind=yes

      In this case all the valgrind logs can be found in the standard glusterfs log directory.

   b. For the client, you can start glusterfs under valgrind just like any other process. Since glusterfs daemonizes itself, we need to prevent that when running under valgrind by keeping it in the foreground, using the -N option:

      # valgrind --leak-check=full --log-file=<path-to-valgrind-log> glusterfs --volfile-id=xyz --volfile-server=abc -N /mnt/glfs

2. Once you observe a considerable leak in memory, please get a statedump of glusterfs:

   # gluster volume statedump <volname>

   and attach the reports to the bug.

regards,
Raghavendra.

----- Original Message -----
From: Anders Blomdell anders.blomd...@control.lth.se
To: Gluster Devel gluster-devel@gluster.org
Sent: Friday, August 1, 2014 12:01:15 AM
Subject: [Gluster-devel] Monotonically increasing memory

> During rsync of 35 files, memory consumption of glusterfs rose to 12 GB
> (after approx 14 hours). I take it that this is a bug I should try to
> track down? Version is 3.7dev as of Tuesday...
>
> /Anders
>
> --
> Anders Blomdell                  Email: anders.blomd...@control.lth.se
> Department of Automatic Control
> Lund University                  Phone: +46 46 222 4625
> P.O. Box 118                     Fax:   +46 46 138118
> SE-221 00 Lund, Sweden
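Once a statedump is in hand, the leak suspects are the allocation types whose `num_allocs` keeps growing between dumps. A quick way to rank them is sketched below; the here-doc stands in for a real dump file (normally written under the glusterfs run directory, e.g. /var/run/gluster/, after `gluster volume statedump`), and the section names are illustrative samples of the mem-accounting format.

```shell
# Build a small sample statedump, then rank usage-types by outstanding
# allocations (num_allocs), largest first.
cat > /tmp/sample.dump <<'EOF'
[mallinfo]
arena=1234
[global.glusterfs - usage-type gf_common_mt_dict_t memusage]
size=1048576
num_allocs=4096
[global.glusterfs - usage-type gf_common_mt_char memusage]
size=2048
num_allocs=16
EOF

awk -F= '
/usage-type/  { type = $0 }
/^num_allocs/ { print $2, type }
' /tmp/sample.dump | sort -rn | head -3
```

Comparing two such rankings taken an hour apart points straight at the allocation type that is monotonically increasing, which narrows the search well before valgrind output is available.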
Re: [Gluster-devel] Rackspace regression slaves hung?
----- Original Message -----
From: Krutika Dhananjay kdhan...@redhat.com
To: Justin Clift jcl...@redhat.com
Cc: Gluster Devel gluster-devel@gluster.org
Sent: Thursday, August 28, 2014 12:25:35 PM
Subject: [Gluster-devel] Rackspace regression slaves hung?

> Hi Justin,
>
> It looks like slaves 22-25 have been hung for over 23 hours now?

There are a couple of patches [1] submitted by me that are resulting in a hang. I think these slaves were spawned to test the patch [1] and its dependencies. If yes, they can be killed.

[1] http://review.gluster.com/#/c/8523/

> -Krutika
Re: [Gluster-devel] Rackspace regression slaves hung?
I've killed the jobs in question.

----- Original Message -----
From: Raghavendra Gowdappa rgowd...@redhat.com
To: Krutika Dhananjay kdhan...@redhat.com
Cc: Justin Clift jcl...@redhat.com, Gluster Devel gluster-devel@gluster.org
Sent: Thursday, August 28, 2014 12:37:07 PM
Subject: Re: [Gluster-devel] Rackspace regression slaves hung?

> ----- Original Message -----
> From: Krutika Dhananjay kdhan...@redhat.com
> To: Justin Clift jcl...@redhat.com
> Cc: Gluster Devel gluster-devel@gluster.org
> Sent: Thursday, August 28, 2014 12:25:35 PM
> Subject: [Gluster-devel] Rackspace regression slaves hung?
>
> > Hi Justin,
> >
> > It looks like slaves 22-25 have been hung for over 23 hours now?
>
> There are a couple of patches [1] submitted by me that are resulting in a
> hang. I think these slaves were spawned to test the patch [1] and its
> dependencies. If yes, they can be killed.
>
> [1] http://review.gluster.com/#/c/8523/
>
> > -Krutika
Re: [Gluster-devel] how do you debug ref leaks?
----- Original Message -----
From: Raghavendra Gowdappa rgowd...@redhat.com
To: Pranith Kumar Karampuri pkara...@redhat.com
Cc: Gluster Devel gluster-devel@gluster.org
Sent: Thursday, September 18, 2014 10:08:15 AM
Subject: Re: [Gluster-devel] how do you debug ref leaks?

For e.g., if a dictionary is not freed because of a non-zero refcount, information on who holds those references would help to narrow down the code path or component. This solution might be rudimentary; however, someone who has worked on things like garbage collection can probably give better answers. This discussion also reminds me of Greenspun's tenth rule [1].

[1] http://en.wikipedia.org/wiki/Greenspun%27s_tenth_rule

----- Original Message -----
From: Pranith Kumar Karampuri pkara...@redhat.com
To: Raghavendra Gowdappa rgowd...@redhat.com
Cc: Gluster Devel gluster-devel@gluster.org
Sent: Thursday, September 18, 2014 10:05:18 AM
Subject: Re: [Gluster-devel] how do you debug ref leaks?

> On 09/18/2014 09:59 AM, Raghavendra Gowdappa wrote:
> > One thing that would be helpful is allocator info for generic objects
> > like dict, inode, fd etc. That way we wouldn't have to sift through a
> > large amount of code.
>
> Could you elaborate on the idea, please?
>
> Pranith

----- Original Message -----
From: Pranith Kumar Karampuri pkara...@redhat.com
To: Gluster Devel gluster-devel@gluster.org
Sent: Thursday, September 18, 2014 7:43:00 AM
Subject: [Gluster-devel] how do you debug ref leaks?

> hi,
>     Till now the only method I have used to find ref leaks effectively is
> to identify which operation is causing the ref leak and read the code to
> find where the leak is. Valgrind doesn't solve this problem because the
> memory is still reachable from the inode table etc. I am just wondering
> if there is an effective way anyone else knows of. Do you guys think we
> need a better mechanism for finding ref leaks? At least one which
> decreases the search space significantly, i.e. xlator y, fop f, etc.?
> It would be better if we can come up with ways to integrate statedump and
> this infra, just like we did for mem-accounting. One way I thought of was
> to introduce new APIs called xl_fop_dict/inode/fd_ref/unref(). Each xl
> keeps an array of num_fops per inode/dict/fd and increments/decrements
> accordingly. Dump this info on statedump. I myself am not completely sure
> about this idea. It requires all xlators to change. Any ideas?
>
> Pranith
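The "who holds the references?" idea can be sketched as ref/unref wrappers that record their call site, so a statedump hook could list the sites still holding references to a leaked dict/inode/fd rather than just a total refcount. GlusterFS does not do this today; all names below are invented, and the fixed-size table has no overflow handling — it is a sketch of the bookkeeping, not a proposal-quality implementation.

```c
#include <string.h>

#define MAX_SITES 64

struct ref_site    { const char *where; int refs; };
struct ref_tracker { struct ref_site sites[MAX_SITES]; int nsites; };

/* find (or register) the counter for one call site, e.g. "dict.c:100" */
static struct ref_site *site_for(struct ref_tracker *t, const char *where)
{
    for (int i = 0; i < t->nsites; i++)
        if (strcmp(t->sites[i].where, where) == 0)
            return &t->sites[i];
    t->sites[t->nsites].where = where;     /* new call site */
    t->sites[t->nsites].refs = 0;
    return &t->sites[t->nsites++];
}

static void track_ref(struct ref_tracker *t, const char *w)
{
    site_for(t, w)->refs++;
}

static void track_unref(struct ref_tracker *t, const char *w)
{
    site_for(t, w)->refs--;
}

/* a statedump hook would print every site whose refs != 0 — the leak
 * suspects — narrowing the search to one xlator/fop immediately */
```

In practice the `where` string would come from a `__FILE__`/`__LINE__` macro at each ref site, which is what makes the dump actionable without reading every code path.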
Re: [Gluster-devel] question on quota
Hi Emmanuel,

If quota cannot allow the entire payload of a write call without exceeding the limit, it allows the fraction of the payload that fits within the quota boundaries. This leads to the short writes which the comment in frame 12 mentions. It is the behaviour of write-behind to return EIO for short writes. Since it's unlikely that writes arriving at quota when the limit is about to be reached can be allowed in their entirety, we can expect an EIO before EDQUOT. However, without write-behind in the xlator graph, there would be no EIO.

regards,
Raghavendra.

----- Original Message -----
From: Emmanuel Dreyfus m...@netbsd.org
To: Gluster Devel gluster-devel@gluster.org
Sent: Sunday, September 21, 2014 11:09:11 PM
Subject: [Gluster-devel] question on quota

> Hi
>
> I am trying to get tests/basic/quota.t working on NetBSD and I notice an
> oddity: before getting EDQUOT, I get EIO. The backtrace leading there is
> below. Note the comments at frame 12. Shall I assume it is the expected
> behavior to get EIO for an over-quota write?
> #0  0xbb491dc7 in _lwp_kill () from /lib/libc.so.12
> #1  0xbb491d68 in raise () from /lib/libc.so.12
> #2  0xbb491982 in abort () from /lib/libc.so.12
> #3  0xba5a68bb in fuse_writev_cbk (frame=0xb99b7670, cookie=0xb98b5e28,
>     this=0xbb287018, op_ret=-1, op_errno=5, stbuf=0xbf7fe1e0,
>     postbuf=0xbf7fe1e0, xdata=0x0) at fuse-bridge.c:2271
> #4  0xb9ba0d48 in io_stats_writev_cbk (frame=0xb98b5e28, cookie=0xb98b5eb8,
>     this=0xbb2d9018, op_ret=-1, op_errno=5, prebuf=0xbf7fe1e0,
>     postbuf=0xbf7fe1e0, xdata=0x0) at io-stats.c:1402
> #5  0xb9bbb504 in mdc_writev_cbk (frame=0xb98b5eb8, cookie=0xb98b5fd8,
>     this=0xbb2d7018, op_ret=-1, op_errno=5, prebuf=0xbf7fe1e0,
>     postbuf=0xbf7fe1e0, xdata=0x0) at md-cache.c:1509
> #6  0xbb768d7c in default_writev_cbk (frame=0xb98b5fd8, cookie=0xb98b6068,
>     this=0xbb2d6018, op_ret=-1, op_errno=5, prebuf=0xbf7fe1e0,
>     postbuf=0xbf7fe1e0, xdata=0x0) at defaults.c:1019
> #7  0xbb768d7c in default_writev_cbk (frame=0xb98b6068, cookie=0xb98b6338,
>     this=0xbb2d5018, op_ret=-1, op_errno=5, prebuf=0xbf7fe1e0,
>     postbuf=0xbf7fe1e0, xdata=0x0) at defaults.c:1019
> #8  0xb9be372d in ioc_writev_cbk (frame=0xb98b6338, cookie=0xb98b6458,
>     this=0xbb2d4018, op_ret=-1, op_errno=5, prebuf=0xbf7fe1e0,
>     postbuf=0xbf7fe1e0, xdata=0x0) at io-cache.c:1225
> #9  0xb9bf421f in ra_writev_cbk (frame=0xb98b6458, cookie=0xb98b64e8,
>     this=0xbb2d3018, op_ret=-1, op_errno=5, prebuf=0xbf7fe1e0,
>     postbuf=0xbf7fe1e0, xdata=0x0) at read-ahead.c:654
> #10 0xbb3068c4 in wb_do_unwinds (wb_inode=0xbb2403a8, lies=0xbf7fe288)
>     at write-behind.c:921
> #11 0xbb30724c in wb_process_queue (wb_inode=0xbb2403a8)
>     at write-behind.c:1209
> #12 0xbb305ec0 in wb_fulfill_cbk (frame=0xb98e9e70, cookie=0xb98b5be8,
>     this=0xbb2d2018, op_ret=81920, op_errno=0, prebuf=0xb98de3fc,
>     postbuf=0xb98de464, xdata=0xb98bbaa8) at write-behind.c:758
>
> Here we have this code:
>
> 742         if (op_ret == -1) {
> 743                 wb_fulfill_err (head, op_errno);
> 744         } else if (op_ret < head->total_size) {
> 745                 /*
> 746                  * We've encountered a short write, for whatever reason.
> 747                  * Set an EIO error for the next fop. This should be
> 748                  * valid for writev or flush (close).
> 749                  *
> 750                  * TODO: Retry the write so we can potentially capture
> 751                  * a real error condition (i.e., ENOSPC).
> 752                  */
> 753                 wb_fulfill_err (head, EIO);
> 754         }
>
> #13 0xb9c3fed8 in dht_writev_cbk (frame=0xb98b5be8, cookie=0xb98b5c78,
>     this=0xbb2d1018, op_ret=81920, op_errno=0, prebuf=0xb98de3fc,
>     postbuf=0xb98de464, xdata=0xb98bbaa8) at dht-inode-write.c:84
> #14 0xb9c7e97d in afr_writev_unwind (frame=0xb98b5c78, this=0xbb2cf018)
>     at afr-inode-write.c:188
> #15 0xb9c7ed43 in afr_writev_wind_cbk (frame=0xb99b6e70, cookie=0x1,
>     this=0xbb2cf018, op_ret=81920, op_errno=0, prebuf=0xbf7fe438,
>     postbuf=0xbf7fe3d0, xdata=0xb98bbaa8) at afr-inode-write.c:313
> #16 0xb9cdd30d in client3_3_writev_cbk (req=0xb9ad4428, iov=0xb9ad4448,
>     count=1, myframe=0xb98b5918) at client-rpc-fops.c:855
> #17 0xbb7330c2 in rpc_clnt_handle_reply (clnt=0xbb2af508, pollin=0xb98a6fc8)
>     at rpc-clnt.c:766
> #18 0xbb7333ba in rpc_clnt_notify (trans=0xb99a2018, mydata=0xbb2af528,
>     event=RPC_TRANSPORT_MSG_RECEIVED, data=0xb98a6fc8) at rpc-clnt.c:894
> #19 0xbb72fab5 in rpc_transport_notify (this=0xb99a2018,
>     event=RPC_TRANSPORT_MSG_RECEIVED, data=0xb98a6fc8) at rpc-transport.c:516
> #20 0xb9d6d832 in socket_event_poll_in (this=0xb99a2018) at socket.c:2153
> #21 0xb9d6dce7 in socket_event_handler (fd=15, idx=5, data=0xb99a2018,
>     poll_in=1, poll_out=0, poll_err=0) at socket.c:2266
> #22 0xbb7be78f in event_dispatch_poll_handler (event_pool=0xbb242098,
>     ufds=0xbb2856b8, i=5) at
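The partial-allowance behaviour described at the top of this thread reduces to a small computation: when a write would cross the limit, allow only the bytes that still fit, which produces the short write that write-behind then converts to EIO. The sketch below is a simplification for illustration — real quota enforcement works on accounted directory usage, not a single counter.

```c
#include <stdint.h>

/* Returns the number of bytes quota would let through for a write of
 * `write_size`, or -1 when nothing fits (the EDQUOT case). A return
 * smaller than `write_size` is the short write seen in frame 12. */
static int64_t quota_allowed_bytes(uint64_t used, uint64_t limit,
                                   uint64_t write_size)
{
    if (used >= limit)
        return -1;                            /* over limit: EDQUOT */

    uint64_t room = limit - used;
    return write_size <= room ? (int64_t)write_size
                              : (int64_t)room; /* partial: short write */
}
```

So near the limit, almost every write falls into the partial branch, which is why EIO (from write-behind's short-write handling) shows up before the first outright EDQUOT.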
Re: [Gluster-devel] io-threads problem? (was: opendir gets Stale NFS file handle)
----- Original Message -----
From: Niels de Vos nde...@redhat.com
To: Emmanuel Dreyfus m...@netbsd.org
Cc: Gluster Devel gluster-devel@gluster.org
Sent: Tuesday, September 30, 2014 2:08:06 PM
Subject: Re: [Gluster-devel] io-threads problem? (was: opendir gets Stale NFS file handle)

> On Tue, Sep 30, 2014 at 06:03:44AM +0200, Emmanuel Dreyfus wrote:
> > Hello
> >
> > I observe this kind of error in brick logs:
> >
> > [2014-09-30 03:56:10.172889] E [server-rpc-fops.c:681:server_opendir_cbk]
> > 0-patchy-server: 11: OPENDIR (null) (63a151ad-a8b7-496b-92a8-5c3c7897e6fa)
> > ==> (Stale NFS file handle)
>
> ESTALE gets returned when a directory is opened by handle (in this case
> the GFID). The posix xlator should do the OPENDIR on the brick, through
> the .glusterfs/...GFID... structure.

The gfid handle 63a151ad-a8b7-496b-92a8-5c3c7897e6fa is missing from the .glusterfs directory (a nameless lookup on this gfid failed with ENOENT in storage/posix), and when this happens the server resolver returns an ESTALE. It seems an earlier lookup was successful (since the client got the gfid) and the handle was deleted before the opendir came in. Is there a possibility that the directory was deleted from some other client? In that case, this is not really an error. Otherwise, there might be some issue.

> > Here is the backtrace leading to it. Is that a real error?
> > #3  0xb9c45934 in server_opendir_cbk (frame=0xbb235e70, cookie=0x0,
> >     this=0xbb2ca018, op_ret=-1, op_errno=70, fd=0x0, xdata=0x0)
> >     at server-rpc-fops.c:682
> > #4  0xb9c4d402 in server_opendir_resume (frame=0xbb235e70,
> >     bound_xl=0xbb2c8018) at server-rpc-fops.c:2507
> > #5  0xb9c3ef51 in server_resolve_done (frame=0xbb235e70)
> >     at server-resolve.c:557
> > #6  0xb9c3f02d in server_resolve_all (frame=0xbb235e70)
> >     at server-resolve.c:592
> > #7  0xb9c3eefa in server_resolve (frame=0xbb235e70) at server-resolve.c:541
> > #8  0xb9c3f00a in server_resolve_all (frame=0xbb235e70)
> >     at server-resolve.c:588
> > #9  0xb9c3e662 in resolve_continue (frame=0xbb235e70)
> >     at server-resolve.c:233
> > #10 0xb9c3e242 in resolve_gfid_cbk (frame=0xbb235e70, cookie=0xbb287528,
> >     this=0xbb2ca018, op_ret=-1, op_errno=2, inode=0xbb287498,
> >     buf=0xb9b2cd14, xdata=0x0, postparent=0xb9b2ccac)
> >     at server-resolve.c:171
> > #11 0xb9c710a7 in io_stats_lookup_cbk (frame=0xbb287528, cookie=0xbb2875b8,
> >     this=0xbb2c8018, op_ret=-1, op_errno=2, inode=0xbb287498,
> >     buf=0xb9b2cd14, xdata=0x0, postparent=0xb9b2ccac) at io-stats.c:1510
> > #12 0xb9cbc09f in marker_lookup_cbk (frame=0xbb2875b8, cookie=0xbb287648,
> >     this=0xbb2c5018, op_ret=-1, op_errno=2, inode=0xbb287498,
> >     buf=0xb9b2cd14, dict=0x0, postparent=0xb9b2ccac) at marker.c:2614
> > #13 0xbb7667d8 in default_lookup_cbk (frame=0xbb287648, cookie=0xbb2876d8,
> >     this=0xbb2c4018, op_ret=-1, op_errno=2, inode=0xbb287498,
> >     buf=0xb9b2cd14, xdata=0x0, postparent=0xb9b2ccac) at defaults.c:841
> > #14 0xbb7667d8 in default_lookup_cbk (frame=0xbb2876d8, cookie=0xbb287768,
> >     this=0xbb2c2018, op_ret=-1, op_errno=2, inode=0xbb287498,
> >     buf=0xb9b2cd14, xdata=0x0, postparent=0xb9b2ccac) at defaults.c:841
> > #15 0xb9cf12ab in pl_lookup_cbk (frame=0xbb287768, cookie=0xbb287888,
> >     this=0xbb2c1018, op_ret=-1, op_errno=2, inode=0xbb287498,
> >     buf=0xb9b2cd14, xdata=0x0, postparent=0xb9b2ccac) at posix.c:2036
> > #16 0xb9d03fb0 in posix_acl_lookup_cbk (frame=0xbb287888, cookie=0xbb287918,
> >     this=0xbb2c0018, op_ret=-1, op_errno=2, inode=0xbb287498,
> >     buf=0xb9b2cd14, xattr=0x0, postparent=0xb9b2ccac) at posix-acl.c:806
> > #17 0xb9d30601 in posix_lookup (frame=0xbb287918, this=0xbb2be018,
> >     loc=0xb9910048, xdata=0xbb2432a8) at posix.c:189
> > #18 0xbb771646 in default_lookup (frame=0xbb287918, this=0xbb2bf018,
> >     loc=0xb9910048, xdata=0xbb2432a8) at defaults.c:2117
> > #19 0xb9d04384 in posix_acl_lookup (frame=0xbb287888, this=0xbb2c0018,
> >     loc=0xb9910048, xattr=0x0) at posix-acl.c:858
> > #20 0xb9cf1713 in pl_lookup (frame=0xbb287768, this=0xbb2c1018,
> >     loc=0xb9910048, xdata=0x0) at posix.c:2080
> > #21 0xbb76f4da in default_lookup_resume (frame=0xbb2876d8, this=0xbb2c2018,
> >     loc=0xb9910048, xdata=0x0) at defaults.c:1683
> > #22 0xbb786667 in call_resume_wind (stub=0xb9910028) at call-stub.c:2478
> > #23 0xbb78d4f5 in call_resume (stub=0xb9910028) at call-stub.c:2841
> > #24 0xbb30402f in iot_worker (
> >     data=<error reading variable: Cannot access memory at address 0xb9b2cfd8>,
> >     data@entry=<error reading variable: Cannot access memory at address 0xb9b2cfd4>)
> >     at io-threads.c:214
>
> This error suggests that 'data' cannot be accessed. I have no idea why
> io-threads would fail here though...
>
> Niels
Re: [Gluster-devel] io-threads problem? (was: opendir gets Stale NFS file handle)
Pranith had RCAed one of the race conditions where a stale dentry was left in the server inode table. The race can be outlined as below (T1 and T2 are two threads):

1. T1: readdirp in storage/posix reads a dentry (say pgfid1, bname1) along with metadata information and gfid.
2. T2: unlink (pgfid1, bname1) is done in storage/posix and the dentry (pgfid1, bname1) is purged from the server inode table (inode table management is done by protocol/server).
3. T1: links (pgfid1, bname1) with the corresponding gfid read in step 1. Since the unlink in step 2 already removed the entry, the re-linked dentry now remains only in the server inode table (the entry was deleted on the exported brick), resulting in ESTALE errors.

This situation can also be hit when T1 does a lookup on the same dentry instead of readdirp. However, I am not sure this is a serious problem, since the entry is deleted from the backend (we are not giving ESTALE errors for a file/directory which is actually present on the backend). In this case just restarting the volume would make the problem go away, since after restarting the servers start with a fresh inode cache. I am not sure whether this is the same problem you are facing, but it seems related. regards, Raghavendra. - Original Message - From: Emmanuel Dreyfus m...@netbsd.org To: Raghavendra Gowdappa rgowd...@redhat.com, Niels de Vos nde...@redhat.com Cc: Gluster Devel gluster-devel@gluster.org Sent: Tuesday, September 30, 2014 5:19:31 PM Subject: Re: [Gluster-devel] io-threads problem? (was: opendir gets Stale NFS file handle) Raghavendra Gowdappa rgowd...@redhat.com wrote: Is there a possibility that the directory was deleted from some other client? In that case, this is not really an error. Otherwise, there might be some issue. I deleted the volume and started over: the problem vanished. I wonder how to cope with that on a production machine where data should not be deleted like that.
-- Emmanuel Dreyfus http://hcpnet.free.fr/pubz m...@netbsd.org ___ Gluster-devel mailing list Gluster-devel@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-devel
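The interleaving in the race described above can be replayed deterministically with a toy model: a plain dict stands in for the server inode table, and the two threads' steps are executed by hand in the racy order. This is an illustrative sketch only, not gluster code; all names are hypothetical.

```python
# Deterministic replay of the readdirp/unlink race (T1 and T2 interleaved).
inode_table = {}                              # (pgfid, bname) -> gfid cache
brick = {("pgfid1", "bname1"): "gfid1"}       # entry on the exported brick

# T1 (step 1): readdirp reads the dentry and its gfid from the brick
dentry = ("pgfid1", "bname1")
gfid = brick[dentry]

# T2 (step 2): unlink removes the entry from the brick and purges the cache
del brick[dentry]
inode_table.pop(dentry, None)

# T1 (step 3): links the dentry read in step 1, re-inserting a stale entry
inode_table[dentry] = gfid

# The cache and the brick now disagree: the dentry still resolves in the
# server inode table although the file is gone on disk, yielding ESTALE.
stale = dentry in inode_table and dentry not in brick
print("stale dentry cached:", stale)   # True
```

Restarting the volume clears `inode_table`, which is why the problem vanishes after a restart.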
Re: [Gluster-devel] Changing position of md-cache in xlator graph
Adding correct gluster-devel mail id. - Original Message - From: Raghavendra Gowdappa rgowd...@redhat.com To: gluster-devel gluster-de...@nongnu.org Sent: Tuesday, 21 October, 2014 3:26:21 PM Subject: Changing position of md-cache in xlator graph Hi all, The context is bz 1138970 [1]. As discussed in the bug, it would make more sense to load md-cache closer to the bricks (as a descendant of write-behind, to be specific) from the point of correctness, since stats are affected by writes. Do any of you see any issue in doing this? [1] https://bugzilla.redhat.com/show_bug.cgi?id=1138970 regards, Raghavendra. ___ Gluster-devel mailing list Gluster-devel@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Quota problems with dispersed volumes
- Original Message - From: Xavier Hernandez xhernan...@datalab.es To: Gluster Devel gluster-devel@gluster.org, Krishnan Parthasarathi kpart...@redhat.com, Raghavendra Gowdappa rgowd...@redhat.com Cc: Dan Lambright dlamb...@redhat.com Sent: Monday, October 27, 2014 11:07:40 PM Subject: Quota problems with dispersed volumes

Hi, testing quota on a dispersed volume I've found a problem in how the total used space is calculated:

# gluster volume create test disperse server{0..2}:/bricks/disperse
# gluster volume start test
# gluster volume quota test enable
# gluster volume quota test limit-usage / 1GB
# gluster volume quota test list
Path   H-L    S-L   Used     Available  S-L exceeded?  H-L exceeded?
/      1.0GB  80%   0Bytes   1.0GB      No             No
# mount -t glusterfs server0:/test /gluster/test
# dd if=/dev/zero of=/gluster/test/file bs=1024k count=512
# ls -lh /gluster/test
total 512M
-rw-r--r-- 1 root root 512M Oct 27 18:29 file
# gluster volume quota test list
Path   H-L    S-L   Used     Available  S-L exceeded?  H-L exceeded?
/      1.0GB  80%   256.0MB  768.0MB    No             No

As you can see, quota seems to count only the space used by the file on one of the bricks (each file uses 256MB on each brick). What would be the best way to solve this problem? I don't know quota internals, so I'm a bit lost about where to adjust real file sizes... We use the extended attribute with key trusted.glusterfs.quota.size to get the size of a directory/file. The value for this key is probed in lookup and getxattr calls. You can implement logic to handle this key appropriately in the disperse xlator to give a proper size to higher layers. In your existing implementation you have probably been passing xattrs from one of the bricks and hence seeing the size from only one brick. Thanks, Xavi ___ Gluster-devel mailing list Gluster-devel@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Quota problems with dispersed volumes
- Original Message - From: Xavier Hernandez xhernan...@datalab.es To: Raghavendra Gowdappa rgowd...@redhat.com Cc: Gluster Devel gluster-devel@gluster.org Sent: Wednesday, October 29, 2014 6:24:55 PM Subject: Re: [Gluster-devel] Quota problems with dispersed volumes On 10/28/2014 02:05 PM, Xavier Hernandez wrote: On 10/28/2014 04:30 AM, Raghavendra Gowdappa wrote: We use extended attribute with key trusted.glusterfs.quota.size to get the size of a directory/file. The value for this key is probed in lookup and getxattr calls. You can implement logic to handle this key appropriately in disperse xlator to give a proper size to higher layers. In your existing implementation you might have been probably passing xattrs from one of the bricks and hence seeing size from only one brick. I think this patch fixes the problem: http://review.gluster.org/8990 It seems that there are some other xattrs visible from client side. I've identified 'trusted.glusterfs.quota.*.contri'. Are there any other xattrs that I should handle on the client side ? this is an internal xattr which only marker (disk usage accounting xlator) uses. The applications running on glusterfs shouldn't be seeing this. If you are seeing this xattr from mount, we should filter this xattr from being listed (at fuse-bridge and gfapi). It seems that there's also a 'trusted.glusterfs.quota.dirty' This is again an internal xattr. You should not worry about handling this. This also needs to be filtered from being displayed to application. and 'trusted.glusterfs.quota.limit-set'. This should be visible from mount point, as this xattr holds the value of quota limit set on that inode. You can handle this in disperse xlator by picking the value from any of its children. How I should handle visible xattrs in ec xlator if they have different values in each brick ? trusted.glusterfs.quota.size is handled by choosing the maximum value. This depends on how ec is handling the files/directories and the meaning of xattr. 
For example, trusted.glusterfs.quota.size represents the size of the file/directory. When read from a brick, the value will be the size of the directory on that brick. When read from a cluster translator like dht, it will be the size of that directory across the whole cluster. So, in dht we add up the values from all bricks and set the sum as the value. However, in the case of replicate/afr, we just pick the value from any one of the subvolumes. Thanks, Xavi ___ Gluster-devel mailing list Gluster-devel@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-devel
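The aggregation policies described here (dht sums the per-brick values, afr can pick the value from any one full copy, and a disperse xlator would have to scale a fragment size back up to the logical file size) can be sketched as follows. This is an illustrative model, not gluster's code; the ec scaling factor is an assumption based on the earlier example in this thread, where a 512MB file occupies 256MB on each brick of a 3-brick (2+1) disperse volume.

```python
# Combining per-brick trusted.glusterfs.quota.size values, per xlator type.
# Illustrative sketch only; not the actual gluster aggregation code.
def aggregate_quota_size(xlator, brick_values, data_bricks=None):
    if xlator == "dht":
        # distribute: the directory is spread over bricks, so sizes add up
        return sum(brick_values)
    if xlator == "afr":
        # replicate: every brick holds a full copy; any one value will do
        # (the thread suggests picking one; max is the safe choice)
        return max(brick_values)
    if xlator == "ec":
        # disperse: each brick holds a 1/data_bricks fragment of the file,
        # so a fragment size must be scaled back up to the logical size
        return brick_values[0] * data_bricks
    raise ValueError(xlator)

mb = 1024 * 1024
print(aggregate_quota_size("dht", [100 * mb, 50 * mb]))    # sum of bricks
print(aggregate_quota_size("afr", [512 * mb, 512 * mb]))   # one full copy
# 2+1 disperse, 256MB fragment per brick -> 512MB logical size
print(aggregate_quota_size("ec", [256 * mb] * 3, data_bricks=2))
```

This mirrors the symptom in the original report: passing the value from a single brick unmodified (instead of scaling it) makes a 512MB file account as 256MB.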
Re: [Gluster-devel] Quota problems with dispersed volumes
From quota perspective I don't see any other xattrs related to quota. You've listed them all :). - Original Message - From: Raghavendra Gowdappa rgowd...@redhat.com To: Xavier Hernandez xhernan...@datalab.es Cc: Gluster Devel gluster-devel@gluster.org Sent: Wednesday, October 29, 2014 8:30:33 PM Subject: Re: [Gluster-devel] Quota problems with dispersed volumes - Original Message - From: Xavier Hernandez xhernan...@datalab.es To: Raghavendra Gowdappa rgowd...@redhat.com Cc: Gluster Devel gluster-devel@gluster.org Sent: Wednesday, October 29, 2014 6:24:55 PM Subject: Re: [Gluster-devel] Quota problems with dispersed volumes On 10/28/2014 02:05 PM, Xavier Hernandez wrote: On 10/28/2014 04:30 AM, Raghavendra Gowdappa wrote: We use extended attribute with key trusted.glusterfs.quota.size to get the size of a directory/file. The value for this key is probed in lookup and getxattr calls. You can implement logic to handle this key appropriately in disperse xlator to give a proper size to higher layers. In your existing implementation you might have been probably passing xattrs from one of the bricks and hence seeing size from only one brick. I think this patch fixes the problem: http://review.gluster.org/8990 It seems that there are some other xattrs visible from client side. I've identified 'trusted.glusterfs.quota.*.contri'. Are there any other xattrs that I should handle on the client side ? this is an internal xattr which only marker (disk usage accounting xlator) uses. The applications running on glusterfs shouldn't be seeing this. If you are seeing this xattr from mount, we should filter this xattr from being listed (at fuse-bridge and gfapi). It seems that there's also a 'trusted.glusterfs.quota.dirty' This is again an internal xattr. You should not worry about handling this. This also needs to be filtered from being displayed to application. and 'trusted.glusterfs.quota.limit-set'. 
This should be visible from mount point, as this xattr holds the value of quota limit set on that inode. You can handle this in disperse xlator by picking the value from any of its children. How I should handle visible xattrs in ec xlator if they have different values in each brick ? trusted.glusterfs.quota.size is handled by choosing the maximum value. This depends on how ec is handling the files/directories and the meaning of xattr. For eg., trusted.glusterfs.quota.size represents the size of the file/directory. When read from brick, the value will be the size of directory on that brick. When read from a cluster translator like dht, it will be the size of that directory across the whole cluster. So, in dht we add up the values from all bricks and set the sum as the value. However, in case of replicate/afr, we just pick the value from any of the subvolume. Thanks, Xavi ___ Gluster-devel mailing list Gluster-devel@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Wrong behavior on fsync of md-cache ?
- Original Message - From: Xavier Hernandez xhernan...@datalab.es To: Raghavendra Gowdappa rgowd...@redhat.com Cc: Gluster Devel gluster-devel@gluster.org, Emmanuel Dreyfus m...@netbsd.org Sent: Tuesday, November 25, 2014 2:05:25 PM Subject: Re: Wrong behavior on fsync of md-cache ? On 11/25/2014 07:38 AM, Raghavendra Gowdappa wrote: - Original Message - From: Xavier Hernandez xhernan...@datalab.es To: Raghavendra Gowdappa rgowd...@redhat.com Cc: Gluster Devel gluster-devel@gluster.org, Emmanuel Dreyfus m...@netbsd.org Sent: Tuesday, November 25, 2014 12:49:03 AM Subject: Re: Wrong behavior on fsync of md-cache ? I think the problem is here: the first thing wb_fsync() checks is if there's an error in the fd (wd_fd_err()). If that's the case, the call is immediately unwinded with that error. The error seems to be set in wb_fulfill_cbk(). I don't know the internals of write-back xlator, but this seems to be the problem. Yes, your analysis is correct. Once the error is hit, fsync is not queued behind unfulfilled writes. Whether it can be considered as a bug is debatable. Since there is already an error in one of the writes which was written-behind fsync should return the error. I am not sure whether it should wait till we try to flush _all_ the writes that were written behind. Any suggestions on what is the expected behaviour here? I think that it should wait for all pending writes. In the test case I used, all pending writes will fail the same way that the first one, but in other situations it's possible to have a write failing (for example due to a damaged block in disk) and following writes succeeding. From the man page of fsync: fsync() transfers (flushes) all modified in-core data of (i.e., modified buffer cache pages for) the file referred to by the file descriptor fd to the disk device (or other permanent storage device) so that all changed information can be retrieved even after the system crashed or was rebooted. 
This includes writing through or flushing a disk cache if present. The call blocks until the device reports that the transfer has completed. It also flushes metadata information associated with the file (see stat(2)). As I understand it, when fsync is received all queued writes must be sent to the device (regardless if a previous write has failed or not). It also says that the call blocks until the device has finished all the operations. However it's not clear to me how to control file consistency because this allows some writes to succeed after a failed one. Though fsync doesn't wait on queued writes after a failure, the queued writes are flushed to disk even in the existing codebase. Can you file a bug to make fsync to wait for completion of queued writes irrespective of whether flushing any of them failed or not? I'll send a patch to fix the issue. Just to prioritise this, how important is the fix? I assume that controlling this is the responsibility of the calling application that should issue fsyncs on critical points to guarantee consistency. Anyway it seems that there's a difference between linux and NetBSD because this test only fails on NetBSD. Is it possible that linux's fuse implementation delays the fsync request until all pending writes have been answered ? this would explain why this problem has not manifested till now. NetBSD seems to send fsync (probably as the first step of a close() call) when the first write fails. Xavi ___ Gluster-devel mailing list Gluster-devel@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-devel
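The behaviour being argued for in this thread (on fsync, flush every queued write even after one has failed, and report the first error) can be modelled outside gluster. This is a toy sketch of a write-behind queue, not the actual write-behind xlator; all names are illustrative.

```python
# Toy model of a write-behind queue: writes are acknowledged immediately
# and flushed later; fsync() flushes *all* queued writes, even after one
# fails, and reports the first error encountered.
class WriteBehind:
    def __init__(self, device):
        self.device = device      # callable (offset, data); raises OSError on failure
        self.queue = []
        self.first_error = None

    def write(self, offset, data):
        # acknowledged immediately; this is why write-behind improves performance
        self.queue.append((offset, data))
        return len(data)

    def fsync(self):
        # flush every queued write; do NOT stop at the first failure
        for offset, data in self.queue:
            try:
                self.device(offset, data)
            except OSError as e:
                if self.first_error is None:
                    self.first_error = e
        self.queue.clear()
        if self.first_error is not None:
            raise self.first_error

flushed = []
def device(offset, data):
    if offset == 4096:            # simulate one damaged block
        raise OSError(5, "EIO")
    flushed.append((offset, data))

wb = WriteBehind(device)
wb.write(0, b"a" * 4096)
wb.write(4096, b"b" * 4096)       # this write will fail on flush
wb.write(8192, b"c" * 4096)       # ...but this one must still reach the device
try:
    wb.fsync()
except OSError as e:
    print("fsync error:", e.errno)
print("flushed offsets:", [o for o, _ in flushed])
```

The key property is that the write at offset 8192 is flushed even though the write at 4096 failed, matching the "damaged block" scenario: one failed write must not prevent later queued writes from reaching stable storage.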
Re: [Gluster-devel] Wrong behavior on fsync of md-cache ?
- Original Message - From: Xavier Hernandez xhernan...@datalab.es To: Emmanuel Dreyfus m...@netbsd.org Cc: Raghavendra Gowdappa rgowd...@redhat.com, Gluster Devel gluster-devel@gluster.org Sent: Wednesday, November 26, 2014 2:05:58 PM Subject: Re: Wrong behavior on fsync of md-cache ? On 11/25/2014 06:45 PM, Xavier Hernandez wrote: On 11/25/2014 02:25 PM, Emmanuel Dreyfus wrote: On Tue, Nov 25, 2014 at 01:42:21PM +0100, Xavier Hernandez wrote: It seems to fail only in NetBSD. I'm not sure what priority it has. Emmanuel is trying to create a regression test for new patches that checks all tests in tests/basic, and tests/basic/ec/quota.t hits this issue. FWIW, I just tried to change NetBSD FUSE to queue fsync after write, but that does not help, I still crash in dht_writev_cbk() Not sure what could be the problem. I added a sleep between 'dd' and 'rm' to let all pending writes finish before removing the file and it seemed to pass the test reliably. On second thought, I think your change in fuse hasn't solved the problem because fuse really sees all answers to the writes it has sent. If I understand correctly how write-behind works, when it receives a write, it queues it to be processed later, but immediately returns an answer to the upper layers. This is why write-behind improves performance. This means that it won't be possible to solve the problem in the fuse layer because it doesn't have enough information about the real state of all caches. Raghavendra, if that's true, fsync will need to be propagated always, even in case of error, so that other xlators (even the remote brick filesystem) will have a chance to flush their caches, if any. Xavi, yes you are right. I'll take that into consideration. Not sure if this could be important/interesting, but I think (not really sure) that posix says that only answered writes must be flushed on fsync. If a recent write has been received but not yet answered when fsync is received, it's not mandatory to flush it.
Xavi ___ Gluster-devel mailing list Gluster-devel@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] NetBSD regression tests: reviews required
- Original Message - From: Xavier Hernandez xhernan...@datalab.es To: Emmanuel Dreyfus m...@netbsd.org, Vijay Bellur vbel...@redhat.com, Justin Clift jus...@gluster.org, Pranith Kumar Karampuri pkara...@redhat.com, Krishnan Parthasarathi kpart...@redhat.com, Raghavendra Gowdappa rgowd...@redhat.com Cc: gluster-devel@gluster.org Sent: Monday, December 1, 2014 2:32:32 PM Subject: Re: [Gluster-devel] NetBSD regression tests: reviews required On 12/01/2014 05:49 AM, Emmanuel Dreyfus wrote: Vijay Bellur vbel...@redhat.com wrote: And as the fixes crop up, I have a few others to share :-) More the merrier :-). Here is the latest list of NetBSD fixes for regression tests:

http://review.gluster.com/8982
http://review.gluster.com/9071
http://review.gluster.com/9075
http://review.gluster.com/9074
http://review.gluster.com/9212 [1]
http://review.gluster.com/9216 [2]
http://review.gluster.com/9217
http://review.gluster.com/9219
http://review.gluster.com/9220

[1] Krishnan Parthasarathi will probably want to improve the commit message before merging. [2] Here I fix the symptom rather than the cause. Hints are welcome to help fix the cause, but perhaps the symptom fix could be merged as an interim solution so that glustershd stops crashing during the test. The regression.sh script on nbslave71 and nbslave72 still disables two tests that always fail:

./tests/basic/afr/entry-self-heal.t - I am working on it
./tests/basic/ec/quota.t - Xavier Hernandez and Raghavendra Gowdappa may have a word about it.

A temporary solution, if you need to implement this very soon, is to add a sleep of a few seconds between the 'dd' and 'rm' commands in the quota.t script. This prevents the crash in DHT and allows the test to pass. I can do that if needed. Go ahead :). Xavi ___ Gluster-devel mailing list Gluster-devel@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] explicit lookup of inodes linked via readdirp
- Original Message - From: Raghavendra Bhat rab...@redhat.com To: Gluster Devel gluster-devel@gluster.org Cc: Anand Avati aav...@redhat.com Sent: Thursday, December 18, 2014 12:31:41 PM Subject: [Gluster-devel] explicit lookup of inodes linked via readdirp Hi, In fuse I saw that, as part of resolving an inode, an explicit lookup is done on it if the inode is found to be linked via readdirp (at the time of linking in readdirp, fuse sets a flag in the inode context). It is done because many xlators, such as afr, depend upon the lookup call for many things, such as healing. Yes. But the lookup is a nameless lookup and hence is not sufficient. Some of the functionalities that get affected AFAIK are:

1. dht cannot create/heal directories and their layouts.
2. afr cannot identify a gfid mismatch of a file across its subvolumes, since to identify a gfid mismatch we need a name.

From what I heard, afr relies on crawls done by the self-heal daemon for named lookups. But dht is worst hit in terms of maintaining directory structure on newly added bricks (this problem is slightly different, since we don't hit it because of a nameless lookup after readdirp; instead it is because of a lack of a named lookup on the file after a graph switch. Nevertheless, I am clubbing both because a named lookup would've solved the issue). I've a feeling that different components have built their own ways of handling what is essentially the same issue. It's better we devise a single comprehensive solution. But that logic is not there in gfapi. I am thinking of introducing that mechanism in gfapi as well, where as part of resolve it checks if the inode is linked from readdirp. And if so, it will do an explicit lookup on that inode. As you've mentioned, a lookup gives afr a chance to heal the file. So, it's needed in gfapi too. However, you've to speak to the afr folks to discuss whether a nameless lookup is sufficient. NOTE: It can be done in the NFS server as well.
Dht in an NFS setup is also hit because of the lack of named lookups, resulting in non-healing of directories on newly added bricks. Please provide feedback. Regards, Raghavendra Bhat ___ Gluster-devel mailing list Gluster-devel@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-devel
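The flag-and-lookup mechanism under discussion (fuse marks inodes that were linked via readdirp, and forces an explicit lookup at resolve time; the proposal is to do the same in gfapi) can be sketched as below. All names here are hypothetical, not the real fuse/gfapi symbols.

```python
# Sketch of the proposed resolve path: inodes linked via readdirp carry a
# flag in their context, and resolve issues an explicit lookup first.
class Inode:
    def __init__(self, gfid):
        self.gfid = gfid
        self.linked_via_readdirp = False
        self.looked_up = False

def readdirp_link(inode):
    # readdirp links the inode without a lookup; flag it, as fuse does
    inode.linked_via_readdirp = True

def lookup(inode):
    # stands in for a real lookup fop; gives xlators like afr a chance
    # to inspect/heal the file before it is used
    inode.looked_up = True
    inode.linked_via_readdirp = False

def resolve(inode):
    if inode.linked_via_readdirp:
        lookup(inode)             # explicit lookup forced before resolving
    return inode

i = Inode("gfid-1")
readdirp_link(i)
resolve(i)
print(i.looked_up)   # True: the lookup was forced
```

Note that, as the reply points out, this sketch sidesteps the harder question in the thread: the forced lookup is nameless, which is not enough for dht directory healing or afr gfid-mismatch detection.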
Re: [Gluster-devel] explicit lookup of inodes linked via readdirp
+Pranith - Original Message - From: Raghavendra Gowdappa rgowd...@redhat.com To: Raghavendra Bhat rab...@redhat.com Cc: Anand Avati aav...@redhat.com, Gluster Devel gluster-devel@gluster.org Sent: Thursday, December 18, 2014 12:58:27 PM Subject: Re: [Gluster-devel] explicit lookup of inods linked via readdirp - Original Message - From: Raghavendra Bhat rab...@redhat.com To: Gluster Devel gluster-devel@gluster.org Cc: Anand Avati aav...@redhat.com Sent: Thursday, December 18, 2014 12:31:41 PM Subject: [Gluster-devel] explicit lookup of inods linked via readdirp Hi, In fuse I saw, that as part of resolving a inode, an explicit lookup is done on it if the inode is found to be linked via readdirp (At the time of linking in readdirp, fuse sets a flag in the inode context). It is done because, many xlators such as afr depend upon lookup call for many things such as healing. Yes. But the lookup is a nameless lookup and hence is not sufficient enough. Some of the functionalities that get affected AFAIK are: 1. dht cannot create/heal directories and their layouts. 2. afr cannot identify gfid mismatch of a file across its subvolumes, since to identify a gfid mismatch we need a name. From what I heard, afr relies on crawls done by self-heal daemon for named-lookups. But dht is worst hit in terms of maintaining directory structure on newly added bricks (this problem is slightly different, since we don't hit this because of nameless lookup after readdirp. Instead it is because of a lack of named-lookup on the file after a graph switch. Neverthless I am clubbing both because a named lookup would've solved the issue). I've a feeling that different components have built their own way of handling what is essentially same issue. Its better we devise a single comprehensive solution. But that logic is not there in gfapi. I am thinking of introducing that mechanism in gfapi as well, where as part of resolve it checks if the inode is linked from readdirp. 
And if so, it will do an explicit lookup on that inode. As you've mentioned, a lookup gives afr a chance to heal the file. So, it's needed in gfapi too. However, you've to speak to the afr folks to discuss whether a nameless lookup is sufficient. NOTE: It can be done in the NFS server as well. Dht in an NFS setup is also hit because of the lack of named lookups, resulting in non-healing of directories on newly added bricks. Please provide feedback. Regards, Raghavendra Bhat ___ Gluster-devel mailing list Gluster-devel@gluster.org http://supercolony.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Problems with graph switch in disperse
Do you know the origins of EIO? fuse-bridge only fails a lookup fop with EIO (when NULL gfid is received in a successful lookup reply). So, there might be other xlator which is sending EIO. - Original Message - From: Xavier Hernandez xhernan...@datalab.es To: Gluster Devel gluster-devel@gluster.org Sent: Wednesday, December 24, 2014 6:25:17 PM Subject: [Gluster-devel] Problems with graph switch in disperse Hi, I'm experiencing a problem when gluster graph is changed as a result of a replace-brick operation (probably with any other operation that changes the graph) while the client is also doing other tasks, like writing a file. When operation starts, I see that the replaced brick is disconnected, but writes continue working normally with one brick less. At some point, another graph is created and comes online. Remaining bricks on the old graph are disconnected and the old graph is destroyed. I see how new write requests are sent to the new graph. This seems correct. However there's a point where I see this: [2014-12-24 11:29:58.541130] T [fuse-bridge.c:2305:fuse_write_resume] 0-glusterfs-fuse: 2234: WRITE (0x16dcf3c, size=131072, offset=255721472) [2014-12-24 11:29:58.541156] T [ec-helpers.c:101:ec_trace] 2-ec: WIND(INODELK) 0x7f8921b7a9a4(0x7f8921b78e14) [refs=5, winds=3, jobs=1] frame=0x7f8932e92c38/0x7f8932e9e6b0, min/exp=3/3, err=0 state=1 {111:000:000} idx=0 [2014-12-24 11:29:58.541292] T [rpc-clnt.c:1384:rpc_clnt_record] 2-patchy-client-0: Auth Info: pid: 0, uid: 0, gid: 0, owner: d025e932897f [2014-12-24 11:29:58.541296] T [io-cache.c:133:ioc_inode_flush] 2-patchy-io-cache: locked inode(0x16d2810) [2014-12-24 11:29:58.541354] T [rpc-clnt.c:1241:rpc_clnt_record_build_header] 2-rpc-clnt: Request fraglen 152, payload: 84, rpc hdr: 68 [2014-12-24 11:29:58.541408] T [io-cache.c:137:ioc_inode_flush] 2-patchy-io-cache: unlocked inode(0x16d2810) [2014-12-24 11:29:58.541493] T [io-cache.c:133:ioc_inode_flush] 2-patchy-io-cache: locked inode(0x16d2810) [2014-12-24 
11:29:58.541536] T [io-cache.c:137:ioc_inode_flush] 2-patchy-io-cache: unlocked inode(0x16d2810)
[2014-12-24 11:29:58.541537] T [rpc-clnt.c:1577:rpc_clnt_submit] 2-rpc-clnt: submitted request (XID: 0x17 Program: GlusterFS 3.3, ProgVers: 330, Proc: 29) to rpc-transport (patchy-client-0)
[2014-12-24 11:29:58.541646] W [fuse-bridge.c:2271:fuse_writev_cbk] 0-glusterfs-fuse: 2234: WRITE => -1 (Input/output error)

It seems that fuse still has a write request pending for graph 0. It is resumed, but it returns EIO without calling the xlator stack (operations seen between the two log messages are from other operations and they are sent to graph 2). I'm not sure why this happens or how I should avoid it. I tried the same scenario with replicate and it seems to work, so there must be something wrong in disperse, but I don't see where the problem could be. Any ideas? Thanks, Xavi ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
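One way to picture the failure mode in the log (a request created against graph 0 being resumed after that graph has been destroyed) is the following toy model. This is a hypothesis about the mechanism, not gluster's actual fuse-bridge code; all names are illustrative.

```python
# Toy model of the graph-switch hazard: a pending request keeps a reference
# to the graph it was created on; if that graph is torn down before the
# request is resumed, the resume cannot reach any xlator and fails with EIO.
import errno

class Graph:
    def __init__(self, gen):
        self.gen = gen
        self.alive = True

def resume(request):
    if not request["graph"].alive:
        return -errno.EIO          # the WRITE => -1 (Input/output error) case
    return request["size"]         # would normally wind down the xlator stack

old_graph, new_graph = Graph(0), Graph(2)
req = {"graph": old_graph, "size": 131072}   # WRITE queued before the switch
old_graph.alive = False                      # old graph destroyed on switch
print(resume(req))                           # fails: -5 (EIO)

# Had the pending request been migrated to the new graph before being
# resumed, it would have succeeded:
req["graph"] = new_graph
print(resume(req))                           # 131072
```

If this model matches what is happening, the fix direction would be to migrate pending requests (or their fds) to the new graph during the switch, rather than resuming them against the dead one.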
Re: [Gluster-devel] Quota problems without a way of fixing them
- Original Message - From: Raghavendra Gowdappa rgowd...@redhat.com To: Joe Julian j...@julianfamily.org Cc: Gluster-devel@gluster.org gluster-devel@gluster.org, pk...@grid.auth.gr Sent: Thursday, January 22, 2015 12:58:47 PM Subject: Re: [Gluster-devel] Quota problems without a way of fixing them - Original Message - From: Joe Julian j...@julianfamily.org To: Raghavendra Gowdappa rgowd...@redhat.com Cc: pk...@grid.auth.gr, Gluster-devel@gluster.org gluster-devel@gluster.org Sent: Thursday, January 22, 2015 11:16:39 AM Subject: Re: [Gluster-devel] Quota problems without a way of fixing them On 01/21/2015 09:32 PM, Raghavendra Gowdappa wrote: - Original Message - From: Joe Julian j...@julianfamily.org To: Gluster Devel gluster-devel@gluster.org Cc: Paschalis Korosoglou pk...@grid.auth.gr Sent: Thursday, January 22, 2015 12:54:44 AM Subject: [Gluster-devel] Quota problems without a way of fixing them Paschalis (PeterA in #gluster) has reported these bugs, and we've tried to find the source of the problem to no avail. Worse yet, there's no way to just reset the quotas to match what's actually there, as far as I can tell. What should we look for to isolate the source of this problem, since this is a production system with enough activity to make isolating a repro difficult at best, and the debug logs have enough noise to make isolation nearly impossible? Finally, isn't there some simple way to trigger quota to rescan a path and reset trusted.glusterfs.quota.size?

1. Delete the following xattrs from all the files/directories on all the bricks:
   a) trusted.glusterfs.quota.size
   b) trusted.glusterfs.quota.*.contri
   c) trusted.glusterfs.quota.dirty
2. Turn off md-cache:
   # gluster volume set volname performance.stat-prefetch off
3. Mount glusterfs asking it to use readdir instead of readdirp:
   # mount -t glusterfs -o use-readdirp=no volfile-server:volfile-id /mnt/glusterfs
4.
Do a crawl on the mountpoint:

# find /mnt/glusterfs -exec stat \{} \; > /dev/null

This should correct the accounting on the bricks. Once done, you should see correct values in the quota list output. Please let us know if it doesn't work for you. But that could be a months-long process with the size of many of our users' volumes. There should be a way to do this with a single directory tree. If you can isolate a sub-directory tree where size accounting has gone bad, this can be done by setting the xattr trusted.glusterfs.quota.dirty of a directory to 1 and sending a lookup on that directory. (But the problem with this approach is: how do we know whether the parents of this sub-directory have a correct size? If a subdirectory has a wrong size, then most likely the accounting of all the ancestors of that sub-directory up to the root has gone bad. Hence I am skeptical about just healing part of a directory tree.) Basically what this does is add up the sizes of all immediate children and set that sum as the value of trusted.glusterfs.quota.size on the directory. But the catch here is that the sizes of the immediate children need not be accounted correctly. Hence this healing should be done bottom-up, starting with the bottom-most directory and healing towards the top-level subdirectory which is isolated. We can have an algorithm like this:

void
heal (char *path)
{
        char        value = 1;
        struct stat stbuf = {0, };

        setxattr (path, "trusted.glusterfs.quota.dirty",
                  (const void *) &value, sizeof (value), 0);

        /* now that the dirty xattr has been set, trigger a lookup so that
         * the directory is healed */
        stat (path, &stbuf);
}

void
crawl (DIR *dirfd, char *path)
{
        struct dirent *result = NULL;

        while ((result = readdir (dirfd)) != NULL) {
                if (IA_ISDIR (result->d_type)) {
                        char *childpath = construct_path (path, result->d_name);
                        DIR  *childfd   = opendir (childpath);

                        crawl (childfd, childpath);
                        closedir (childfd);
                }
        }

        heal (path);
}

Now call crawl on the isolated sub-directory (on the mountpoint).
Note that the above is pseudo-code; a tool should be written using this algorithm. We'll try to add a program to extras/utils which does this. His production system has been unmanageable for months now. Is it possible for someone to spare some cycles to get this looked at? 2013-03-04 - https://bugzilla.redhat.com/show_bug.cgi?id=917901 2013-10-24 - https://bugzilla.redhat.com/show_bug.cgi?id=1023134 We are working on these bugs. We'll update the bugzilla once we find anything substantial. ___ Gluster-devel mailing list Gluster-devel@gluster.org
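The bottom-up ordering the algorithm relies on is a plain post-order traversal: every subdirectory is healed before its parent, so that a parent's size is recomputed only from already-correct children. A minimal runnable sketch of just that ordering, using an in-memory tree instead of real directories (the real tool would call heal() per the pseudo-code above):

```python
# Post-order (bottom-up) crawl: children are appended to heal_order before
# their parent, matching the healing order the algorithm requires.
# Illustrative only; paths and tree shape are made up.
def crawl(tree, path, heal_order):
    for name, subtree in tree.items():
        crawl(subtree, path + "/" + name, heal_order)
    heal_order.append(path)   # heal(path): set dirty=1, then stat()

tree = {"a": {"b": {}, "c": {"d": {}}}}
order = []
crawl(tree, "/mnt/glusterfs", order)
print(order)
```

Running this yields the leaves first and the starting directory last, which is exactly the invariant needed: by the time a directory is marked dirty and looked up, all of its immediate children already carry corrected sizes.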
Re: [Gluster-devel] cannot delete non-empty directory
- Original Message - From: David F. Robinson david.robin...@corvidtec.com To: Shyam srang...@redhat.com, Gluster Devel gluster-devel@gluster.org, gluster-us...@gluster.org, Susant Palai spa...@redhat.com Sent: Monday, February 9, 2015 10:55:44 PM Subject: Re: [Gluster-devel] cannot delete non-empty directory

So, just to be sure before I do this, it is okay to do the following if I want to get rid of everything in the /old_shelf4/Aegis directory and below? rm -rf /data/brick*/homegfs_bkp/backup.0/old_shelf4/Aegis

Yes. This will solve the issue you are facing now. After this, Aegis can be removed from the mount point.

What happens to all of the files in the .glusterfs directory? Does this get rebuilt, or do the links stay there for files that now no longer exist?

The links stay there for files that now no longer exist. This is not an issue, except that we'll be losing an inode (no data blocks, as the file size was 0).

And, is this the same issue that causes all of the broken links in .glusterfs? See attached image for an example. There appear to be a lot of broken links in the .glusterfs directories. Is this normal, or does it indicate another problem?

There can be other issues which can result in links not getting deleted from the .glusterfs directory. The current issue is not related to that.

Finally, if I search through the /data/brick* directories, should I find no entries of ---------T permission files with zero length? Do I need to clean all of these up somehow? A quick look at /data/brick01bkp/homegfs_bkp/.glusterfs/2f/54 shows many of these files. They look like:

---------T 3 rbhinge pme_ics 0 Jan 9 16:45 2f54d7d6-968b-442f-8cfe-eff01d6cefe7
---------T 2 rbhinge pme_ics 0 Jan 9 21:40 2f54d7e7-b198-4fd4-aec7-f5d0ff020f72

How do I find out what file these entries were pointing to?

As Shyam had mentioned in an earlier mail, these files represent dht linkto files. They are a sort of metadata containing the name of the subvolume where the actual file is stored (hence the name linkto).
The destination to which this linkto file points is stored in xattrs. A dump of all the xattrs matching the regex trusted.glusterfs.* should list them; the value of the trusted.glusterfs.dht.linkto xattr gives the destination subvolume. If the file is not present on the destination, then it's a stale linkto file pointing to a non-existent file (on the destination subvol) and it can be removed. Otherwise it is valid and shouldn't be removed. Again, as Shyam mentioned in a previous mail, [1] should've fixed the issue (it is present in v3.6.0 and above). Not sure why we are seeing this issue again. [1] http://review.gluster.org/8602 David

-- Original Message -- From: Shyam srang...@redhat.com To: David F. Robinson david.robin...@corvidtec.com; Gluster Devel gluster-devel@gluster.org; gluster-us...@gluster.org; Susant Palai spa...@redhat.com Sent: 2/9/2015 11:11:20 AM Subject: Re: [Gluster-devel] cannot delete non-empty directory

On 02/08/2015 12:19 PM, David F. Robinson wrote: I am seeing these messages after I delete large amounts of data using gluster 3.6.2: cannot delete non-empty directory: old_shelf4/Aegis/!!!Programs/RavenCFD/Storage/Jimmy_Old/src_vj1.5_final

From the FUSE mount (as root), the directory shows up as empty:

# pwd
/backup/homegfs/backup.0/old_shelf4/Aegis/!!!Programs/RavenCFD/Storage/Jimmy_Old/src_vj1.5_final
# ls -al
total 5
d--------- 2 root root 4106 Feb 6 13:55 .
drwxrws--- 3 601 dmiller 72 Feb 6 13:55 ..

However, when you look at the bricks, the files are still there (none on brick01bkp, all files are on brick02bkp). All of the files are 0-length and have ---------T permissions.

These are linkto files that are created by DHT, which basically means the files were either renamed or the brick layout changed (I suspect the former to be the cause). These files should have been deleted when the files that they point to were deleted; it looks like this did not happen.
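The stale-versus-valid decision described here can be sketched as follows (illustrative Python, not gluster code; the xattr read and the existence check are passed in as callables so the logic can be shown without a real brick, and all path and subvolume names below are made up):

```python
# Hypothetical sketch of the stale-linkto check: read the destination
# subvolume from the trusted.glusterfs.dht.linkto xattr, and call the linkto
# file stale when that destination no longer holds the real file.
def classify_linkto(path, getxattr, exists_on_dest):
    # On a brick, getxattr would be os.getxattr; the xattr value is a
    # NUL-terminated subvolume name such as "homegfs_bkp-client-0".
    dest = getxattr(path, "trusted.glusterfs.dht.linkto").rstrip(b"\0").decode()
    # Stale: destination subvol no longer has the real file -> safe to remove.
    return dest, ("stale" if not exists_on_dest(dest, path) else "valid")

# Demo with stubbed callables (names invented for illustration):
read_stub = lambda path, name: b"homegfs_bkp-client-0\0"
dest, verdict = classify_linkto("/data/brick02bkp/homegfs_bkp/somefile",
                                read_stub,
                                lambda d, p: False)  # file gone on destination
print(dest, verdict)
```

On an actual brick the xattr can be inspected with getfattr, as described in the next mail; the sketch above only captures the decision rule.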
Can I get the following information for some of the files here?
- getfattr -d -m . -e text <path to file on brick>
- The output of the trusted.glusterfs.dht.linkto xattr should state where the real file belongs; in this case, as there are only 2 bricks, it should be the brick01bkp subvol.
- As the second brick is empty, we should be able to safely delete these files from the brick and proceed to do an rmdir on the mount point of the volume, as the directory is now empty.
- Please check the one sub-directory that is showing up in this case as well, save1.

Any suggestions on how to fix this and how to prevent it from happening? I believe there are renames happening here, possibly by the archive creator. One way to prevent the rename from creating a linkto file is to use the DHT set parameter to set a pattern so that the file name hash considers only the static part of
Re: [Gluster-devel] mandatory lock
- Original Message - From: Raghavendra Gowdappa rgowd...@redhat.com To: Harmeet Kalsi kharm...@hotmail.com Cc: Gluster-devel@gluster.org gluster-devel@gluster.org Sent: Thursday, January 8, 2015 4:12:44 PM Subject: Re: [Gluster-devel] mandatory lock

- Original Message - From: Harmeet Kalsi kharm...@hotmail.com To: Gluster-devel@gluster.org gluster-devel@gluster.org Sent: Wednesday, January 7, 2015 5:55:43 PM Subject: [Gluster-devel] mandatory lock

Dear All. Would it be possible for someone to guide me in the right direction to enable the mandatory lock on a volume please? At the moment two clients can edit the same file at the same time, which is causing issues.

I see code related to mandatory locking in the posix-locks xlator (pl_writev, pl_truncate, etc.). To enable it you have to set "option mandatory-locks yes" in the posix-locks xlator loaded on the bricks (/var/lib/glusterd/vols/<volname>/*.vol). We have no way to set this option through the gluster CLI. Also, I am not sure to what extent this feature is tested/used till now. You can try it out and please let us know whether it worked for you :). If mandatory locking doesn't work for you, can you modify your application to use advisory locking, since advisory locking is well tested and has been in use for a long time?

Many thanks in advance Kind Regards ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
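For illustration, the relevant stanza in a brick volfile would look something like the fragment below. The volume and subvolume names here are invented and depend on your volume; also note that glusterd regenerates volfiles on volume changes, so a hand edit like this can be overwritten.

```
volume myvol-locks
    type features/locks
    option mandatory-locks yes
    subvolumes myvol-access-control
endvolume
```

After editing, the bricks would need a restart for the option to take effect.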
Re: [Gluster-devel] spurious failure in quota-nfs.t
- Original Message - From: Sachin Pandit span...@redhat.com To: Pranith Kumar Karampuri pkara...@redhat.com Cc: Gluster Devel gluster-devel@gluster.org Sent: Tuesday, May 5, 2015 3:24:59 PM Subject: Re: [Gluster-devel] spurious failure in quota-nfs.t - Original Message - From: Pranith Kumar Karampuri pkara...@redhat.com To: Vijaikumar Mallikarjuna vmall...@redhat.com, Sachin Pandit span...@redhat.com Cc: Gluster Devel gluster-devel@gluster.org Sent: Tuesday, May 5, 2015 8:43:18 AM Subject: spurious failure in quota-nfs.t hi Vijai/Sachin, http://build.gluster.org/job/rackspace-regression-2GB-triggered/8268/console Doesn't seem like an obvious failure. Know anything about it? Hi Pranith, I checked the logs and could not find any significant information. It seems like the marker has failed to update the extended attributes till the root. Any idea why it couldn't update till root? I have started the execution of this test case in a loop, it has completed more than 20 successful runs till now. I will update in this thread if I make any progress on root causing the issue. ~ Sachin. Pranith ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Moratorium on new patch acceptance
- Original Message - From: Shyam srang...@redhat.com To: gluster-devel@gluster.org Sent: Tuesday, May 19, 2015 6:13:06 AM Subject: Re: [Gluster-devel] Moratorium on new patch acceptance

On 05/18/2015 07:05 PM, Shyam wrote: On 05/18/2015 03:49 PM, Shyam wrote: On 05/18/2015 10:33 AM, Vijay Bellur wrote:

The etherpad did not call out ./tests/bugs/distribute/bug-1161156.t, which did not have an owner, so I took a stab at it; below are the results. I also think the failure in ./tests/bugs/quota/bug-1038598.t is the same as the observation below. NOTE: Anyone with better knowledge of Quota can possibly chip in as to what we should expect in this case and how to correct the expectation of these test cases.

(Details of ./tests/bugs/distribute/bug-1161156.t)
1) Failure is in TEST #20 Failed line: TEST ! dd if=/dev/zero of=$N0/$mydir/newfile_2 bs=1k count=10240 conv=fdatasync
2) The above line is expected to fail (i.e. dd is expected to fail), as the set quota is 20MB and we are attempting to exceed it by another 5MB at this point in the test case.
3) The failure is easily reproducible on my laptop, 2/10 times.
4) On debugging, I see that when the above dd succeeds (or the test fails, which means dd succeeded in writing more than the set quota), there are no write errors from the bricks or any errors on the final COMMIT RPC call to NFS. As a result the expectation of this test fails. NOTE: Sometimes there is a write failure from one of the bricks (the above test uses AFR as well), but AFR self-healing kicks in and fixes the problem, as expected, as the write succeeded on one of the replicas. I add this observation as the failed regression run logs have some EDQUOT errors reported in the client xlator, but only from one of the client bricks, and there are further AFR self-heal logs noted in the logs.
5) When the test case succeeds, the writes fail with EDQUOT as expected. There are times when the quota is exceeded by, say, 1MB - 4.8MB, but the test case still passes.
Which means that, if we were to try to exceed the quota by 1MB (instead of the 5MB as in the test case), this test case may always fail. Here is why I think this slips past quota sometimes and not others, making this and the other test case mentioned below spurious. - Each write is 256K from the client (that is what is sent over the wire) - If more IO was queued by io-threads after passing quota checks, which in this 5MB case requires 20 IOs to be queued (16 IOs could be active in io-threads itself), we could end up writing more than the quota amount. So, if quota checks to see whether a write is violating the quota, lets it through, and updates the space used on the UNWIND for future checks, we could have more IO outstanding than what the quota allows, and as a result allow such a larger write to pass through, considering the io-threads queue and active IOs as well. Would this be a fair assumption of how quota works? Yes, this is a possible scenario. There is a finite time window between: 1. Querying the size of a directory, in other words checking whether the current write can be allowed 2. The effect of this write getting reflected in the size of all the parent directories of a file till root. If 1 and 2 were atomic, another parallel write which could've exceeded the quota-limit could not have slipped through. Unfortunately, in the current scheme of things they are not atomic. Now there can be parallel writes in this test case because of nfs-client and/or glusterfs write-back (though we have one single-threaded application - dd - running). One way of testing this hypothesis is to disable nfs and glusterfs write-back and run the same (unmodified) test, and the test should always succeed (dd should fail). To disable write-back in nfs you can use the noac option while mounting. The situation becomes worse in real-life scenarios because of the parallelism involved at many layers: 1.
multiple applications, each possibly being multithreaded, writing to possibly many (or a single) file(s) in a quota subtree 2. write-back in the NFS client and glusterfs 3. Multiple bricks holding files of a quota subtree, each brick simultaneously processing many write requests through io-threads. I've tried in the past to fix the issue, though unsuccessfully. It seems to me that one effective strategy is to make enforcement and the update of the parents' sizes atomic. But if we do that, we end up adding the latency of accounting to the latency of the fop. Other options can be explored. But, our Quota functionality requirements allow a buffer of 10% while enforcing limits. So, this issue has not been high on our priority list till now. So, our tests should also expect failures allowing for this 10% buffer. I believe this is what is happening in this case. Checking a fix on my machine, and will post the same if it proves to help the situation. Posted a patch to
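The enforcement-versus-accounting window described in this thread can be reduced to a toy model (illustrative Python; the numbers mirror the test case: 20MB limit, 256KB wire writes, 16 in-flight IOs in io-threads). Every write issued while accounting lags passes the quota check against a stale size, so the subtree can overshoot by up to in-flight × write-size:

```python
# Toy model of non-atomic quota enforcement: a write is checked against the
# accounted size when it is issued, but the accounted size is updated only
# when a write completes.  With `inflight` writes outstanding, up to that
# many writes slip past the limit.
QUOTA = 20 * 1024 * 1024      # quota limit (20MB, as in the test case)
WRITE = 256 * 1024            # wire write size

def run(total, inflight):
    used = written = 0        # used: accounted size; written: issued bytes
    queue = []                # writes issued but not yet accounted
    while written < total:
        if used >= QUOTA:     # enforcement sees only the accounted size
            break
        queue.append(WRITE)
        written += WRITE
        if len(queue) >= inflight:
            used += queue.pop(0)   # accounting catches up one IO at a time
    return used + sum(queue)  # drain: remaining completions get accounted

overshoot = run(25 * 1024 * 1024, inflight=16) - QUOTA
print(overshoot // 1024, "KB written past the limit")
```

The model prints an overshoot just under inflight × 256KB (3840 KB here), comparable to the 1MB - 4.8MB excess observed in the test. With inflight=1, i.e. accounting made synchronous with enforcement, the model never exceeds the limit, which is why disabling the write-back caches is expected to make dd fail reliably.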
Re: [Gluster-devel] Moratorium on new patch acceptance
- Original Message - From: Raghavendra Gowdappa rgowd...@redhat.com To: Shyam srang...@redhat.com Cc: gluster-devel@gluster.org Sent: Tuesday, May 19, 2015 11:46:19 AM Subject: Re: [Gluster-devel] Moratorium on new patch acceptance - Original Message - From: Shyam srang...@redhat.com To: gluster-devel@gluster.org Sent: Tuesday, May 19, 2015 6:13:06 AM Subject: Re: [Gluster-devel] Moratorium on new patch acceptance On 05/18/2015 07:05 PM, Shyam wrote: On 05/18/2015 03:49 PM, Shyam wrote: On 05/18/2015 10:33 AM, Vijay Bellur wrote: The etherpad did not call out, ./tests/bugs/distribute/bug-1161156.t which did not have an owner, and so I took a stab at it and below are the results. I also think failure in ./tests/bugs/quota/bug-1038598.t is the same as the observation below. NOTE: Anyone with better knowledge of Quota can possibly chip in as to what should we expect in this case and how to correct the expectation from these test cases. (Details of ./tests/bugs/distribute/bug-1161156.t) 1) Failure is in TEST #20 Failed line: TEST ! dd if=/dev/zero of=$N0/$mydir/newfile_2 bs=1k count=10240 conv=fdatasync 2) The above line is expected to fail (i.e dd is expected to fail) as, the set quota is 20MB and we are attempting to exceed it by another 5MB at this point in the test case. 3) The failure is easily reproducible in my laptop, 2/10 times 4) On debugging, I see that when the above dd succeeds (or the test fails, which means dd succeeded in writing more than the set quota), there are no write errors from the bricks or any errors on the final COMMIT RPC call to NFS. As a result the expectation of this test fails. NOTE: Sometimes there is a write failure from one of the bricks (the above test uses AFR as well), but AFR self healing kicks in and fixes the problem, as expected, as the write succeeded on one of the replicas. 
I add this observation as the failed regression run logs have some EDQUOT errors reported in the client xlator, but only from one of the client bricks, and there are further AFR self-heal logs noted in the logs. 5) When the test case succeeds, the writes fail with EDQUOT as expected. There are times when the quota is exceeded by, say, 1MB - 4.8MB, but the test case still passes. Which means that, if we were to try to exceed the quota by 1MB (instead of the 5MB as in the test case), this test case may always fail. Here is why I think this slips past quota sometimes and not others, making this and the other test case mentioned below spurious. - Each write is 256K from the client (that is what is sent over the wire) - If more IO was queued by io-threads after passing quota checks, which in this 5MB case requires 20 IOs to be queued (16 IOs could be active in io-threads itself), we could end up writing more than the quota amount. So, if quota checks to see whether a write is violating the quota, lets it through, and updates the space used on the UNWIND for future checks, we could have more IO outstanding than what the quota allows, and as a result allow such a larger write to pass through, considering the io-threads queue and active IOs as well. Would this be a fair assumption of how quota works? Yes, this is a possible scenario. There is a finite time window between: 1. Querying the size of a directory, in other words checking whether the current write can be allowed 2. The effect of this write getting reflected in the size of all the parent directories of a file till root. If 1 and 2 were atomic, another parallel write which could've exceeded the quota-limit could not have slipped through. Unfortunately, in the current scheme of things they are not atomic. Now there can be parallel writes in this test case because of nfs-client and/or glusterfs write-back (though we have one single-threaded application - dd - running).
One way of testing this hypothesis is to disable nfs and glusterfs write-back and run the same (unmodified) test, and the test should always succeed (dd should fail). To disable write-back in nfs you can use the noac option while mounting. The situation becomes worse in real-life scenarios because of the parallelism involved at many layers: 1. multiple applications, each possibly being multithreaded, writing to possibly many (or a single) file(s) in a quota subtree 2. write-back in the NFS client and glusterfs 3. Multiple bricks holding files of a quota subtree, each brick simultaneously processing many write requests through io-threads 4. Background accounting of directory sizes _after_ a write is complete. I've tried in the past to fix the issue, though unsuccessfully. It seems to me that one effective strategy is to make enforcement and the update of the parents' sizes atomic. But if we do that, we end up adding the latency of accounting to the latency of the fop. Other options can
Re: [Gluster-devel] Moratorium on new patch acceptance
- Original Message - From: Vijay Bellur vbel...@redhat.com To: Raghavendra Gowdappa rgowd...@redhat.com, Shyam srang...@redhat.com Cc: gluster-devel@gluster.org Sent: Tuesday, May 19, 2015 1:29:57 PM Subject: Re: [Gluster-devel] Moratorium on new patch acceptance On 05/19/2015 12:21 PM, Raghavendra Gowdappa wrote: Yes, this is a possible scenario. There is a finite time window between, 1. Querying the size of a directory. In other words checking whether current write can be allowed 2. The effect of this write getting reflected in size of all the parent directories of a file till root If 1 and 2 were atomic, another parallel write which could've exceed the quota-limit could not have slipped through. Unfortunately, in the current scheme of things they are not atomic. Now there can be parallel writes in this test case because of nfs-client and/or glusterfs write-back (though we've one single threaded application - dd - running). One way of testing this hypothesis is to disable nfs and glusterfs write-back and run the same (unmodified) test and the test should succeed always (dd should fail). To disable write-back in nfs you can use noac option while mounting. The situation becomes worse in real-life scenarios because of parallelism involved at many layers: 1. multiple applications, each possibly being multithreaded writing to possibly many/or single file(s) in a quota subtree 2. write-back in NFS-client and glusterfs 3. Multiple bricks holding files of a quota-subtree. Each brick processing simultaneously many write requests through io-threads. 4. Background accounting of directory sizes _after_ a write is complete. I've tried in past to fix the issue, though unsuccessfully. It seems to me that one effective strategy is to make enforcement and updation of size of parents atomic. But if we do that we end up adding latency of accounting to latency of fop. Other options can be explored. 
But, our Quota functionality requirements allow a buffer of 10% while enforcing limits. So, this issue has not been high on our priority list till now. So, our tests should also expect failures allowing for this 10% buffer. Since most of our tests are a single instance of single threaded dd running on a single mount, if the hypothesis turns out true, we can turn off nfs-client and glusterfs write-back in all tests related to Quota. Comments? Even with write-behind enabled, dd should get a failure upon close() if quota were to return EDQUOT for any of the writes. I suspect that flush-behind being enabled by default in write-behind can mask a failure for close(). Disabling flush-behind in the tests might take care of fixing the tests. No, my suggestion was aimed at not having parallel writes. In this case quota won't even fail the writes with EDQUOT because of reasons explained above. Yes, we need to disable flush-behind along with this so that errors are delivered to application. It would be good to have nfs + quota coverage in the tests. So let us not disable nfs tests for quota. The suggestion was to continue using nfs, but preventing nfs-clients from using a write-back cache. Thanks, Vijay ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
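Concretely, the suggestions in this exchange amount to something like the following (volume name invented; noac is the NFS mount option mentioned above, and the performance.* settings are existing gluster volume options — treat this as a sketch rather than tested commands):

```
# NFS client: mount with noac, per the suggestion above, so the client
# does not cache writes and issue them in parallel
mount -t nfs -o vers=3,noac server:/myvol /mnt/myvol

# glusterfs: disable write-behind and flush-behind so that EDQUOT reaches
# the application on write()/close() instead of being absorbed
gluster volume set myvol performance.write-behind off
gluster volume set myvol performance.flush-behind off
```

This keeps NFS coverage for the quota tests while removing the client-side parallelism that lets writes slip past enforcement.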
Re: [Gluster-devel] Regression failures
Few more:
1. ./tests/basic/afr/sparse-file-self-heal.t From http://build.gluster.org/job/rackspace-netbsd7-regression-triggered/5887/consoleFull
2. ./tests/bugs/protocol/bug-808400-repl.t From http://build.gluster.org/job/rackspace-regression-2GB-triggered/9920/consoleFull
3. ./tests/bugs/quota/inode-quota.t From http://build.gluster.org/job/rackspace-regression-2GB-triggered/10035/consoleFull

The last two failures are on the same patch, http://review.gluster.org/#/c/10967/. Though I've not had time to look into each failure, these seem like spurious/intermittent failures, since the patch passed all the regressions on master and different tests are failing on different runs.

- Original Message - From: Sachin Pandit span...@redhat.com To: Gluster Devel gluster-devel@gluster.org Sent: Wednesday, June 3, 2015 4:36:50 PM Subject: [Gluster-devel] Regression failures

Hi, http://review.gluster.org/#/c/11024/ failed in the tests/basic/volume-snapshot-clone.t testcase. http://build.gluster.org/job/rackspace-regression-2GB-triggered/10057/consoleFull http://review.gluster.org/#/c/11000/ failed in the tests/bugs/replicate/bug-979365.t testcase. http://build.gluster.org/job/rackspace-regression-2GB-triggered/9985/consoleFull Seems like a spurious failure. Can anyone please have a look at this? Regards, Sachin Pandit. ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] /tests/bugs/quota/bug-1153964.t is consistently failing
- Original Message - From: Atin Mukherjee amukh...@redhat.com To: Niels de Vos nde...@redhat.com, Vijaikumar M vmall...@redhat.com Cc: Raghavendra G raghaven...@gluster.com, Gluster Devel gluster-devel@gluster.org Sent: Wednesday, June 24, 2015 10:15:12 AM Subject: Re: [Gluster-devel] /tests/bugs/quota/bug-1153964.t is consistently failing

When is this patch getting merged? It is blocking other patches from getting in.

The revert of http://review.gluster.org/11311 is waiting for regression runs to pass. There are three patches (duplicates of each other); if any one of them passes both regression runs, I'll merge it. As far as the refcounting mechanism goes, it'll take some time to review and merge the patch. ~Atin

On 06/23/2015 06:26 PM, Niels de Vos wrote: On Tue, Jun 23, 2015 at 05:30:39PM +0530, Vijaikumar M wrote: On Tuesday 23 June 2015 04:28 PM, Niels de Vos wrote: On Tue, Jun 23, 2015 at 03:45:43PM +0530, Vijaikumar M wrote: I have submitted the patch below which fixes this issue. I am handling memory clean-up with a reference count mechanism. http://review.gluster.org/#/c/11361

Is there a reason you can not use the (new) refcounting functions that were introduced with http://review.gluster.org/11022 ?

I was not aware that the ref-counting patch was merged. Sure, we will use these functions and re-submit my patch. Ok, thanks! Niels Thanks, Vijay

It would be nicer to standardize all refcounting mechanisms on one implementation. I hope we can replace existing refcounting with this one too. Introducing more refcounting ways is not going to be helpful. Thanks, Niels Thanks, Vijay

On Tuesday 23 June 2015 12:58 PM, Raghavendra G wrote: Multiple replies to same query. Pick one ;). On Tue, Jun 23, 2015 at 12:55 PM, Venky Shankar yknev.shan...@gmail.com wrote: OK. Two reverts of the same patch ;) Pick one.
On Tue, Jun 23, 2015 at 12:51 PM, Raghavendra Gowdappa rgowd...@redhat.com wrote: Seems like it's a memory corruption caused by: http://review.gluster.org/11311 I've reverted the patch at: http://review.gluster.org/11360

- Original Message - From: Xavier Hernandez xhernan...@datalab.es To: Gluster Devel gluster-devel@gluster.org Sent: Tuesday, June 23, 2015 12:44:47 PM Subject: [Gluster-devel] /tests/bugs/quota/bug-1153964.t is consistently failing

Hi, the quota test bug-1153964.t is failing consistently for a totally unrelated patch. Is this a known issue? http://build.gluster.org/job/rackspace-regression-2GB-triggered/11142/consoleFull http://build.gluster.org/job/rackspace-regression-2GB-triggered/11165/consoleFull http://build.gluster.org/job/rackspace-regression-2GB-triggered/11172/consoleFull http://build.gluster.org/job/rackspace-regression-2GB-triggered/11191/consoleFull Xavi

-- Raghavendra G -- ~Atin ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] /tests/bugs/quota/bug-1153964.t is consistently failing
Seems like it's a memory corruption caused by: http://review.gluster.org/11311 I've reverted the patch at: http://review.gluster.org/11360

- Original Message - From: Xavier Hernandez xhernan...@datalab.es To: Gluster Devel gluster-devel@gluster.org Sent: Tuesday, June 23, 2015 12:44:47 PM Subject: [Gluster-devel] /tests/bugs/quota/bug-1153964.t is consistently failing

Hi, the quota test bug-1153964.t is failing consistently for a totally unrelated patch. Is this a known issue? http://build.gluster.org/job/rackspace-regression-2GB-triggered/11142/consoleFull http://build.gluster.org/job/rackspace-regression-2GB-triggered/11165/consoleFull http://build.gluster.org/job/rackspace-regression-2GB-triggered/11172/consoleFull http://build.gluster.org/job/rackspace-regression-2GB-triggered/11191/consoleFull Xavi ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Unable to send patches to gerrit
Now I am not able to open/review/merge any patches on review.gluster.org either.

- Original Message - From: Avra Sengupta aseng...@redhat.com To: Pranith Kumar Karampuri pkara...@redhat.com, Anoop C S achir...@redhat.com, gluster-devel@gluster.org, Kaushal M kshlms...@gmail.com, Vijay Bellur vbel...@redhat.com, gluster-infra gluster-in...@gluster.org Sent: Thursday, 11 June, 2015 11:16:15 AM Subject: Re: [Gluster-devel] Unable to send patches to gerrit

+Adding gluster-infra

On 06/11/2015 10:39 AM, Pranith Kumar Karampuri wrote: Last time this happened, Kaushal/Vijay fixed it if I remember correctly. +kaushal +Vijay Pranith

On 06/11/2015 10:38 AM, Anoop C S wrote: On 06/11/2015 10:33 AM, Ravishankar N wrote: I'm unable to push a patch on release-3.6, getting different errors every time: This happens for master too. I continuously get the following error: error: unpack failed: error No space left on device

[ravi@tuxpad glusterfs]$ ./rfc.sh
[detached HEAD a59646a] afr: honour selfheal enable/disable volume set options
Date: Sat May 30 10:23:33 2015 +0530
3 files changed, 108 insertions(+), 4 deletions(-)
create mode 100644 tests/basic/afr/client-side-heal.t
Successfully rebased and updated refs/heads/3.6_honour_heal_options.
Counting objects: 11, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (11/11), done.
Writing objects: 100% (11/11), 1.77 KiB | 0 bytes/s, done.
Total 11 (delta 9), reused 0 (delta 0)
error: unpack failed: error No space left on device
fatal: Unpack error, check server log
To ssh://itisr...@git.gluster.org/glusterfs.git
!
[remote rejected] HEAD - refs/for/release-3.6/bug-1230259 (n/a (unpacker error))
error: failed to push some refs to 'ssh://itisr...@git.gluster.org/glusterfs.git'

[ravi@tuxpad glusterfs]$ ./rfc.sh
[detached HEAD 8b28efd] afr: honour selfheal enable/disable volume set options
Date: Sat May 30 10:23:33 2015 +0530
3 files changed, 108 insertions(+), 4 deletions(-)
create mode 100644 tests/basic/afr/client-side-heal.t
Successfully rebased and updated refs/heads/3.6_honour_heal_options.
fatal: internal server error
fatal: Could not read from remote repository.
Please make sure you have the correct access rights and the repository exists.

Anybody else facing problems? -Ravi ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
[Gluster-devel] Only netbsd regressions seem to be triggered
All, It seems only NetBSD regressions are being triggered; Linux-based regressions seem not to be triggered. I've observed this with two patches [1][2]. Pranith also feels the same. Have any of you seen a similar issue? [1] http://review.gluster.org/#/c/10943/ [2] http://review.gluster.org/#/c/10834/ regards, Raghavendra ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Unable to send patches to release 3.7 branch.
- Original Message - From: Avra Sengupta aseng...@redhat.com To: Gluster Devel gluster-devel@gluster.org Sent: Friday, May 29, 2015 9:11:22 AM Subject: [Gluster-devel] Unable to send patches to release 3.7 branch.

Hi, Usually when a patch is backported to the release-3.7 branch it contains the following from the patch already merged in master:

Change-Id: Ib878f39814af566b9250cf6b8ed47da0ca5b1128
BUG: 1226120
Signed-off-by: Avra Sengupta aseng...@redhat.com
Reviewed-on: http://review.gluster.org/10641
Reviewed-by: Rajesh Joseph rjos...@redhat.com
Tested-by: NetBSD Build System

Remove this line. NetBSD Build System is not a valid email id.

Reviewed-by: Kaushal M kaus...@redhat.com

While trying to send this patch from the release-3.7 branch I am getting the following error from checkpatch.pl, which is not letting me send the patch; this is new, because it didn't happen earlier.

# ./extras/checkpatch.pl 0001-glusterd-snapshot-Return-correct-errno-in-events-of-.patch
Use of uninitialized value $gerrit_url in regexp compilation at ./extras/checkpatch.pl line 1958.
Use options --gerrit-url: ./extras/checkpatch.pl --gerrit-url review.gluster.org
ERROR: Unrecognized url address: 'http://review.gluster.org/10313' #16: Reviewed-on: http://review.gluster.org/10313
ERROR: Unrecognized email address: 'NetBSD Build System' #19: Tested-by: NetBSD Build System
Patch not according to coding guidelines! please fix. total: 2 errors, 0 warnings, 326 lines checked

Why is checkpatch unable to reference the gerrit url? Regards, Avra ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Crash in dht_fsync()
During a graph change we migrate all the fds opened on the old subvol. In this case we were trying to migrate fds opened on the virtual .meta subtree. This resulted in a crash, as these inodes and fds have to be handled differently from the ones in real volumes. - Original Message - From: Raghavendra Gowdappa rgowd...@redhat.com To: Raghavendra Gowdappa rgowd...@redhat.com Cc: Vijay Bellur vbel...@redhat.com, Gluster Devel gluster-devel@gluster.org Sent: Tuesday, May 26, 2015 12:02:46 PM Subject: Re: [Gluster-devel] Crash in dht_fsync() Is there a way to get a coredump? - Raghavendra Gowdappa rgowd...@redhat.com wrote: Will take a look. - Original Message - From: Vijay Bellur vbel...@redhat.com To: Gluster Devel gluster-devel@gluster.org Sent: Monday, May 25, 2015 11:08:46 PM Subject: [Gluster-devel] Crash in dht_fsync() While running tests/performance/open-behind.t in a loop on mainline, I observe the crash at [1]. The backtrace seems to point to dht_fsync(). Can one of the dht developers please take a look in? Thanks, Vijay [1] http://fpaste.org/225375/ ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Huge memory consumption with quota-marker
- Original Message - From: Krishnan Parthasarathi kpart...@redhat.com To: Raghavendra Gowdappa rgowd...@redhat.com Cc: Pranith Kumar Karampuri pkara...@redhat.com, Vijay Bellur vbel...@redhat.com, Vijaikumar M vmall...@redhat.com, Gluster Devel gluster-devel@gluster.org, Nagaprasad Sathyanarayana nsath...@redhat.com Sent: Thursday, July 2, 2015 11:27:34 AM Subject: Re: Huge memory consumption with quota-marker
Yes. PROC_MAX is the maximum no. of 'worker' threads that would be spawned for a given syncenv. So, if we create a new syncenv with a smaller stack size, threads spawned in that syncenv will add to the number of threads in the process. However, if you create synctasks with a stacksize different from the default env->stacksize, the tasks will have a smaller stack while still utilizing the same threads of the default syncenv.
- Original Message - - Original Message - From: Krishnan Parthasarathi kpart...@redhat.com To: Pranith Kumar Karampuri pkara...@redhat.com Cc: Vijay Bellur vbel...@redhat.com, Vijaikumar M vmall...@redhat.com, Gluster Devel gluster-devel@gluster.org, Raghavendra Gowdappa rgowd...@redhat.com, Nagaprasad Sathyanarayana nsath...@redhat.com Sent: Thursday, July 2, 2015 10:54:44 AM Subject: Re: Huge memory consumption with quota-marker
Yes, we could take the synctask stack size as an argument for synctask_create. The increase in synctask threads is not really a problem; it can't grow beyond 16 (SYNCENV_PROC_MAX). That is, it cannot grow beyond PROC_MAX in a _single_ syncenv, I suppose.
- Original Message - On 07/02/2015 10:40 AM, Krishnan Parthasarathi wrote: - Original Message - On Wednesday 01 July 2015 08:41 AM, Vijaikumar M wrote: Hi, The new marker xlator uses the syncop framework to update quota-size in the background; it uses one synctask per write FOP. If there are 100 parallel writes with all different inodes but in the same directory '/dir', there will be ~100 txns waiting in a queue to acquire a lock on its parent, i.e. '/dir'.
Each of these txns uses a synctask, and each synctask allocates a stack of 2M (the default size), for a total of ~200M of usage. This usage can increase depending on the load. I am thinking of reducing the synctask stack size to 256k; will this much memory be sufficient, given that we perform very limited operations within a synctask during marker updates? Seems like a good idea to me. Do we need a 256k stack size, or can we live with something even smaller? It was 16K when synctask was introduced. This is a property of the syncenv. We could create a separate syncenv for marker transactions which has smaller stacks. env->stacksize (and SYNCTASK_DEFAULT_STACKSIZE) was increased to 2MB to support pump-xlator-based data migration for replace-brick. For the no. of stack frames a marker transaction could use at any given time, we could use much less, say 16K. Does that make sense? Creating one more syncenv will lead to extra sync-threads; maybe we can take the stacksize as an argument. Pranith ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
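A rough back-of-the-envelope check of the numbers in this thread (the figures 2M, 256k and 100 parallel writes come from the emails above; the helper below is purely illustrative arithmetic, not GlusterFS code):

```python
# Illustrative only: model the stack memory cost of N concurrent synctasks,
# each of which allocates its own stack of stack_size bytes.
# 2 MB is the SYNCTASK_DEFAULT_STACKSIZE mentioned in the thread; 256 KB is
# the smaller stack proposed for marker transactions.

def synctask_stack_bytes(n_tasks, stack_size):
    """Total stack memory used by n_tasks concurrent synctasks."""
    return n_tasks * stack_size

MB = 1024 * 1024
KB = 1024

default_cost = synctask_stack_bytes(100, 2 * MB)    # 100 parallel writes today
proposed_cost = synctask_stack_bytes(100, 256 * KB)  # with a 256 KB stack

print(default_cost // MB, "MB")   # 200 MB, matching the thread's estimate
print(proposed_cost // MB, "MB")  # 25 MB
```

This is why shrinking the per-task stack matters far more than the (bounded) number of worker threads: the stacks scale with in-flight transactions, the threads do not.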
Re: [Gluster-devel] Build and Regression failure in master branch!
Thanks Kotresh. - Original Message - From: Kotresh Hiremath Ravishankar khire...@redhat.com To: Atin Mukherjee atin.mukherje...@gmail.com Cc: Gluster Devel gluster-devel@gluster.org Sent: Sunday, June 28, 2015 1:57:24 PM Subject: Re: [Gluster-devel] Build and Regression failure in master branch!
Yes, Atin. You are right, header files are missing in the Makefiles. The build broke because of commit 3741804bec65a33d400af38dcc80700c8a668b81. I have sent a patch for the same: http://review.gluster.org/#/c/11451/ Please could someone review and merge it. Thanks and Regards, Kotresh H R
- Original Message - From: Atin Mukherjee atin.mukherje...@gmail.com To: Kotresh Hiremath Ravishankar khire...@redhat.com Cc: Gluster Devel gluster-devel@gluster.org Sent: Sunday, June 28, 2015 12:56:21 PM Subject: Re: [Gluster-devel] Build and Regression failure in master branch! -Atin Sent from one plus one On Jun 28, 2015 12:01 PM, Kotresh Hiremath Ravishankar khire...@redhat.com wrote:
Hi, The rpm build is consistently failing for the patch (http://review.gluster.org/#/c/11443/) with the following error, whereas it passes in a local setup. ...
Making all in performance
Making all in write-behind
Making all in src
CC write-behind.lo
write-behind.c:24:35: fatal error: write-behind-messages.h: No such file or directory
#include "write-behind-messages.h"
^
compilation terminated.
make[5]: *** [write-behind.lo] Error 1
make[4]: *** [all-recursive] Error 1
make[3]: *** [all-recursive] Error 1
make[2]: *** [all-recursive] Error 1
make[1]: *** [all-recursive] Error 1
make: *** [all] Error 2
RPM build errors: error: Bad exit status from /var/tmp/rpm-tmp.8QmLg0 (%build) Bad exit status from /var/tmp/rpm-tmp.8QmLg0 (%build)
This means the entry of this file is missing in the respective Makefile. Regression failures: ./tests/basic/afr/client-side-heal.t The above test case is consistently failing for the patch.
http://build.gluster.org/job/rackspace-netbsd7-regression-triggered/7596/consoleFull http://build.gluster.org/job/rackspace-regression-2GB-triggered/11641/consoleFull Are there known issues? Thanks and Regards, Kotresh H R ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
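For reference, the shape of the fix Kotresh describes (adding the missing header to the translator's Makefile.am so it ships in the dist tarball used by the rpm build) is roughly the following. The file path and variable contents here are a sketch, not the actual GlusterFS Makefile.am:

```makefile
# Hypothetical sketch of xlators/performance/write-behind/src/Makefile.am:
# listing the message header in noinst_HEADERS makes automake include it
# in "make dist", so the rpm build no longer fails with
# "write-behind-messages.h: No such file or directory".
noinst_HEADERS = write-behind-mem-types.h write-behind-messages.h
```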
Re: [Gluster-devel] Failure in tests/basic/tier/bug-1214222-directories_miising_after_attach_tier.t
I've reverted [1] which brought the change allow-insecure to be on by default. The patch seems to have issues which will be addressed and merged later. The revert can be found at [2]. [1] http://review.gluster.org/11274 [2] http://review.gluster.org/11507 Please let me know if the regressions are still failing. regards, Raghavendra. - Original Message - From: Joseph Fernandes josfe...@redhat.com To: Atin Mukherjee atin.mukherje...@gmail.com Cc: Gluster Devel gluster-devel@gluster.org Sent: Thursday, July 2, 2015 9:49:16 PM Subject: Re: [Gluster-devel] Failure in tests/basic/tier/bug-1214222-directories_miising_after_attach_tier.t Yep.. Thanks Guys. - Original Message - From: Atin Mukherjee atin.mukherje...@gmail.com To: Joseph Fernandes josfe...@redhat.com Cc: kpart...@redhat.com, Atin Mukherjee amukh...@redhat.com, Gluster Devel gluster-devel@gluster.org Sent: Thursday, July 2, 2015 9:45:01 PM Subject: Re: [Gluster-devel] Failure in tests/basic/tier/bug-1214222-directories_miising_after_attach_tier.t Joe, Please refer to Prasanna's mail. He has uploaded a patch to solve it. -Atin Sent from one plus one On Jul 2, 2015 9:42 PM, Joseph Fernandes josfe...@redhat.com wrote: Hi All, This is the same issue as the previous tiering regression failure. 
Volume brick not able to start brick because port is busy [2015-07-02 10:20:20.601372] [run.c:190:runner_log] (-- /build/install/lib/libglusterfs.so.0(_gf_log_callingfn+0x240)[0x7f05e080bc32] (-- /build/install/lib/libglusterfs.so.0(runner_log+0x192)[0x7f05e08754ce] (-- /build/install/lib/glusterfs/3.8dev/xlator/mgmt/glusterd.so(glusterd_volume_start_glusterfs+0xae7)[0x7f05d5c935d7] (-- /build/install/lib/glusterfs/3.8dev/xlator/mgmt/glusterd.so(glusterd_brick_start+0x151)[0x7f05d5c9d4e3] (-- /build/install/lib/glusterfs/3.8dev/xlator/mgmt/glusterd.so(glusterd_op_perform_add_bricks+0x8fe)[0x7f05d5d10661] ) 0-: Starting GlusterFS: /build/install/sbin/glusterfsd -s slave33.cloud.gluster.org --volfile-id patchy.slave33.cloud.gluster.org.d-backends-patchy5 -p /var/lib/glusterd/vols/patchy/run/slave33.cloud.gluster.org-d-backends-patchy5.pid -S /var/run/gluster/ca5f5a89aa3a24f0a54852590ab82ad5.socket --brick-name /d/backends/patchy5 -l /var/log/glusterfs/bricks/d-backends-patchy5.log --xlator-option *-posix.glusterd-uuid=da011de8-9103-4cf2-9f4b-03707d0019d0 --brick-port 49167 --xlator-option patchy-server.listen-port=49167 [2015-07-02 10:20:20.624297] I [MSGID: 106144] [glusterd-pmap.c:269:pmap_registry_remove] 0-pmap: removing brick (null) on port 49167 [2015-07-02 10:20:20.625315] E [MSGID: 106005] [glusterd-utils.c:4448:glusterd_brick_start] 0-management: Unable to start brick slave33.cloud.gluster.org:/d/backends/patchy5 [2015-07-02 10:20:20.625354] E [MSGID: 106074] [glusterd-brick-ops.c:2096:glusterd_op_add_brick] 0-glusterd: Unable to add bricks [2015-07-02 10:20:20.625368] E [MSGID: 106123] [glusterd-syncop.c:1416:gd_commit_op_phase] 0-management: Commit of operation 'Volume Add brick' failed on localhost Brick Log: [2015-07-02 10:20:20.608547] I [MSGID: 100030] [glusterfsd.c:2296:main] 0-/build/install/sbin/glusterfsd: Started running /build/install/sbin/glusterfsd version 3.8dev (args: /build/install/sbin/glusterfsd -s slave33.cloud.gluster.org --volfile-id 
patchy.slave33.cloud.gluster.org.d-backends-patchy5 -p /var/lib/glusterd/vols/patchy/run/slave33.cloud.gluster.org-d-backends-patchy5.pid -S /var/run/gluster/ca5f5a89aa3a24f0a54852590ab82ad5.socket --brick-name /d/backends/patchy5 -l /var/log/glusterfs/bricks/d-backends-patchy5.log --xlator-option *-posix.glusterd-uuid=da011de8-9103-4cf2-9f4b-03707d0019d0 --brick-port 49167 --xlator-option patchy-server.listen-port=49167) [2015-07-02 10:20:20.617113] I [MSGID: 101190] [event-epoll.c:627:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1 [2015-07-02 10:20:20.623097] I [MSGID: 101173] [graph.c:268:gf_add_cmdline_options] 0-patchy-server: adding option 'listen-port' for volume 'patchy-server' with value '49167' [2015-07-02 10:20:20.623135] I [MSGID: 101173] [graph.c:268:gf_add_cmdline_options] 0-patchy-posix: adding option 'glusterd-uuid' for volume 'patchy-posix' with value 'da011de8-9103-4cf2-9f4b-03707d0019d0' [2015-07-02 10:20:20.623358] I [MSGID: 115034] [server.c:392:_check_for_auth_option] 0-/d/backends/patchy5: skip format check for non-addr auth option auth.login./d/backends/patchy5.allow [2015-07-02 10:20:20.623374] I [MSGID: 115034] [server.c:392:_check_for_auth_option] 0-/d/backends/patchy5: skip format check for non-addr auth option auth.login.96bcb872-559b-4f19-84ad-a735dc6068f6.password [2015-07-02 10:20:20.623568] I [rpcsvc.c:2210:rpcsvc_set_outstanding_rpc_limit] 0-rpc-service: Configured rpc.outstanding-rpc-limit with value 64 [2015-07-02 10:20:20.623633] W [MSGID: 101002]
Re: [Gluster-devel] Crash in dht_fsync()
Will take a look. - Original Message - From: Vijay Bellur vbel...@redhat.com To: Gluster Devel gluster-devel@gluster.org Sent: Monday, May 25, 2015 11:08:46 PM Subject: [Gluster-devel] Crash in dht_fsync() While running tests/performance/open-behind.t in a loop on mainline, I observe the crash at [1]. The backtrace seems to point to dht_fsync(). Can one of the dht developers please take a look in? Thanks, Vijay [1] http://fpaste.org/225375/ ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Crash in dht_fsync()
Is there a way to get coredump? - Raghavendra Gowdappa rgowd...@redhat.com wrote: Will take a look. - Original Message - From: Vijay Bellur vbel...@redhat.com To: Gluster Devel gluster-devel@gluster.org Sent: Monday, May 25, 2015 11:08:46 PM Subject: [Gluster-devel] Crash in dht_fsync() While running tests/performance/open-behind.t in a loop on mainline, I observe the crash at [1]. The backtrace seems to point to dht_fsync(). Can one of the dht developers please take a look in? Thanks, Vijay [1] http://fpaste.org/225375/ ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Moratorium on new patch acceptance
Status on the crash in dht_fsync reported by Vijay Bellur: I am not able to reproduce the issue. However, I am consistently hitting a failure in the last test of ./tests/performance/open-behind.t (test 18). The test reads as:
gluster volume top $V0 open | grep -w $F0 > /dev/null 2>&1
TEST [ $? -eq 0 ];
gluster volume top is not giving the file name in the output, causing grep to fail and the test to fail. regards, Raghavendra.
- Original Message - From: Vijaikumar M vmall...@redhat.com To: Gluster Devel gluster-devel@gluster.org Sent: Tuesday, May 26, 2015 4:43:23 PM Subject: Re: [Gluster-devel] Moratorium on new patch acceptance
Here is the status on the quota test-case spurious failures. There were 3 issues:
1) Quota exceeding the limit because of parallel writes - merged upstream, patch submitted to release-3.7 #10910 ./tests/bugs/quota/bug-1038598.t ./tests/bugs/distribute/bug-1161156.t
2) Quota accounting going wrong - patch submitted #10918 ./tests/basic/ec/quota.t ./tests/basic/quota-nfs.t
3) Quota with anonymous FDs on NetBSD: this is an NFS client caching issue on NetBSD. Sachin and I are working on this issue. ./tests/basic/quota-anon-fd-nfs.t
Thanks, Vijay
On Friday 22 May 2015 11:45 PM, Vijay Bellur wrote: On 05/21/2015 12:07 AM, Vijay Bellur wrote: On 05/19/2015 11:56 PM, Vijay Bellur wrote: On 05/18/2015 08:03 PM, Vijay Bellur wrote: On 05/16/2015 03:34 PM, Vijay Bellur wrote: I will send daily status updates from Monday (05/18) about this so that we are clear about where we are and what needs to be done to remove this moratorium. Appreciate your help in having a clean set of regression tests going forward! We have made some progress since Saturday. The problem with glupy.t has been fixed - thanks to Niels!
All but following tests have developers looking into them: ./tests/basic/afr/entry-self-heal.t ./tests/bugs/replicate/bug-976800.t ./tests/bugs/replicate/bug-1015990.t ./tests/bugs/quota/bug-1038598.t ./tests/basic/ec/quota.t ./tests/basic/quota-nfs.t ./tests/bugs/glusterd/bug-974007.t Can submitters of these test cases or current feature owners pick these up and start looking into the failures please? Do update the spurious failures etherpad [1] once you pick up a particular test. [1] https://public.pad.fsfe.org/p/gluster-spurious-failures Update for today - all tests that are known to fail have owners. Thanks everyone for chipping in! I think we should be able to lift this moratorium and resume normal patch acceptance shortly. Today's update - Pranith fixed a bunch of failures in erasure coding and Avra removed a test that was not relevant anymore - thanks for that! Quota, afr, snapshot tiering tests are being looked into. Will provide an update on where we are with these tomorrow. A few tests have not been readily reproducible. Of the remaining tests, all but the following have either been root caused or we have patches in review: ./tests/basic/mount-nfs-auth.t ./tests/performance/open-behind.t ./tests/basic/ec/ec-5-2.t ./tests/basic/quota-nfs.t With some reviews and investigations of failing tests happening over the weekend, I am optimistic about being able to accept patches as usual from early next week. Thanks, Vijay ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Enhancing Quota enforcement during parallel writes
- Original Message - From: Raghavendra Gowdappa rgowd...@redhat.com To: Gluster Devel gluster-devel@gluster.org Cc: Vijaikumar Mallikarjuna vmall...@redhat.com, Sachin Pandit span...@redhat.com Sent: Friday, 22 May, 2015 10:50:14 AM Subject: Enhancing Quota enforcement during parallel writes
All, As pointed out by [1], parallel writes can result in incorrect quota enforcement. [2] was an (unsuccessful) attempt to solve the issue. Some points about [2]: in_progress_writes is updated _after_ we fetch the size. Because of this, two writes can see the same size, so the issue is not solved. What we should be doing is to update in_progress_writes even before we fetch the size. If we do this, it is guaranteed that at least one write sees the other's size accounted in in_progress_writes. This approach has two issues:
1. Since we have added the current write's size to in_progress_writes, the current write would already be accounted in the size of the directory. This is a minor issue and can be solved by subtracting the size of the current write from the resultant cluster-wide in-progress size of the directory.
2. We might prematurely fail writes even though there is some space available. Assume there is 5MB of free space. If two 5MB writes are issued in parallel, both might fail, as both might see the other's size already accounted, though neither has succeeded. Of course, we could live with this limitation, since we are erring on the conservative side, if the following logic seems too complicated.
To solve this issue, I am proposing the following algo:
* We assign each write an identity that is unique across the cluster - say a uuid.
* Among all the in-progress writes we pick one write. The policy used can be an arbitrary but deterministic criterion, like the smallest of all the uuids. So, each brick selects a candidate among its own in-progress writes _AND_ the incoming candidate (see the pseudocode of get_dir_size below for more clarity). It sends back this candidate along with the size of the directory.
The brick also remembers the last candidate it approved. Clustering translators like dht pick one write among these replies, using the same logic the bricks used. Now, along with the size, we also get a candidate to choose from the in-progress writes. However, there might be a new write on the brick in the time-window where we try to fetch the size, which could be the candidate. We should therefore compare the resultant cluster-wide candidate with the per-brick candidate. So, the enforcement logic will be as below:

/* Both enforcer and get_dir_size are executed in the brick process. I've left
   out the logic of get_dir_size in cluster translators like dht */

enforcer ()
{
        /* Note that this logic is executed independently for each directory on
           which a quota limit is set. All the in-progress writes, sizes and
           candidates are valid in the context of that directory */

        my_delta = iov_length (input_iovec, input_count);
        my_id = getuuid ();

        add_my_delta_to_in_progress_size ();

        get_dir_size (my_id, size, in_progress_size, cluster_candidate);

        in_progress_size -= my_delta;

        if (((size + my_delta) <= quota_limit)
            && ((size + in_progress_size + my_delta) > quota_limit)) {
                /* we've to choose among in-progress writes */

                brick_candidate = least_of_uuids (directory->in_progress_write_list,
                                                  directory->last_winning_candidate);

                if ((my_id == cluster_candidate) && (my_id == brick_candidate)) {
                        /* 1. subtract my_delta from per-brick in-progress writes
                           2. add my_delta to per-brick sizes of all parents
                           3. allow-write
                           getting brick_candidate above, 1 and 2 should be done
                           atomically */
                } else {
                        /* 1. subtract my_delta from per-brick in-progress writes
                           2. fail_write */
                }
        } else if ((size + my_delta) <= quota_limit) {
                /* 1. subtract my_delta from per-brick in-progress writes
                   2. add my_delta to per-brick sizes of all parents
                   3. allow-write
                   1 and 2 should be done atomically */
        } else {
                fail_write ();
        }
}

get_dir_size (IN incoming_candidate_id, IN directory, OUT *winning_candidate, ...)
{
        directory->last_winning_candidate = winning_candidate
                = least_uuid (directory->in_progress_write_list,
                              incoming_candidate_id);
}

Comments?
[1] http://www.gluster.org/pipermail/gluster-devel/2015-May/045194.html
[2] http://review.gluster.org/#/c/6220/
regards, Raghavendra. ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
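A tiny model of the candidate-selection rule proposed above (every node independently picks the smallest uuid and therefore converges on the same winner). This is illustrative Python, not the brick-side C:

```python
# Model of the "smallest uuid wins" policy from the proposal: each brick
# picks a candidate among its own in-progress writes plus the incoming
# candidate, and cluster translators like dht reduce the brick replies
# using the same rule.

def pick_candidate(in_progress_ids, incoming_id):
    """Deterministically pick one write: the least uuid."""
    return min(list(in_progress_ids) + [incoming_id])

# Two bricks, each with their own in-progress writes; "w1" is the
# incoming write's id.
brick_a = pick_candidate(["w3", "w5"], "w1")   # -> "w1"
brick_b = pick_candidate(["w2"], "w1")         # -> "w1"

# The dht-style reduction over brick replies applies the same rule, so
# every node converges on a single winner regardless of reply order.
cluster_winner = min([brick_a, brick_b])
print(cluster_winner)  # w1
```

Because the rule is deterministic and total (uuids are unique), at most one in-progress write can be the winner everywhere, which is what lets exactly one write proceed when the limit would otherwise be breached.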
Re: [Gluster-devel] rm -rf issues in Geo-replication
- Original Message - From: Aravinda avish...@redhat.com To: Gluster Devel gluster-devel@gluster.org Cc: Raghavendra Gowdappa rgowd...@redhat.com Sent: Friday, 22 May, 2015 12:42:11 PM Subject: rm -rf issues in Geo-replication
Problem: Each geo-rep worker processes the Changelogs available in its own brick; if a worker sees an RMDIR, it tries to remove that directory recursively. Since the rmdir is recorded in all the bricks, rm -rf is executed in parallel. Due to DHT's open issues with parallel rm -rf, some of the directories will not get deleted in the Slave Volume (stale directory layout). If a directory with the same name is then created on the Master, Geo-rep will end up in an inconsistent state, since the GFID is different for the new directory while the old directory still exists on the Slave.
Solution - Fix in DHT: Hold a lock during rmdir, so that parallel rmdirs get blocked and there are no stale layouts.
Solution - Fix in Geo-rep: Temporarily we can fix this in Geo-rep until DHT fixes the issue. Since a Meta Volume is available with each cluster, Geo-rep can hold a lock for the GFID of the directory to be deleted.
If it fixes a currently pressing problem in geo-rep, we can use this. However, please note that the underlying problem is directory self-heal done during lookup racing with rmdir. So, theoretically, any path-based operation (like stat, opendir, chmod, chown, etc., and not just rmdir) can result in this bug. So, even with this solution you can see the issue (like doing find dir while doing rmdir dir). There is an age-old patch supposed to resolve this issue at [1], but it was not merged for various reasons (one being the synchronization required to prevent stale layouts being stored in the inode-ctx).
[1] http://review.gluster.org/#/c/4846/
For example, when handling rmdir:

    while True:
        try:
            # fcntl lock in Meta volume: $METAVOL/.rmdirlocks/GFID
            get_lock(GFID)
            recursive_delete()
            release_and_del_lock_file()
            break
        except (EACCES, EAGAIN):
            continue

One worker will succeed and all other workers will get ENOENT/ESTALE, which can be safely ignored. Let us know your thoughts.
-- regards Aravinda ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
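A concrete (hypothetical) shape for the get_lock()/release step in the pseudocode above, using an exclusive fcntl-style lock on a per-GFID file. The lock directory path and helper names are illustrative, not geo-rep code; a real implementation would place the lock files on the mounted meta volume ($METAVOL/.rmdirlocks):

```python
import errno
import fcntl
import os

LOCK_DIR = "/tmp/rmdirlocks"  # stands in for $METAVOL/.rmdirlocks


def try_rmdir_with_lock(gfid, recursive_delete):
    """Take an exclusive, non-blocking lock on a per-GFID file; only the
    winning worker runs recursive_delete(). Returns True if we won."""
    os.makedirs(LOCK_DIR, exist_ok=True)
    path = os.path.join(LOCK_DIR, gfid)
    fd = os.open(path, os.O_CREAT | os.O_RDWR)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError as e:
        if e.errno in (errno.EACCES, errno.EAGAIN):
            os.close(fd)
            return False  # another worker holds the lock; skip
        raise
    try:
        # Losing workers later see ENOENT/ESTALE from the deleted tree,
        # which the thread says can be safely ignored.
        recursive_delete()
        return True
    finally:
        os.unlink(path)  # release_and_del_lock_file()
        os.close(fd)
```

Note this only serializes geo-rep's own workers; as Raghavendra points out above, it does not fix the underlying DHT race with other path-based operations.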
Re: [Gluster-devel] on patch #11553
+gluster-devel - Original Message - From: Raghavendra Gowdappa rgowd...@redhat.com To: Krishnan Parthasarathi kpart...@redhat.com Cc: Nithya Balachandran nbala...@redhat.com, Anoop C S achir...@redhat.com Sent: Tuesday, 7 July, 2015 11:32:01 AM Subject: on patch #11553
KP, Though the crash caused by the lack of init while fops are in progress is solved, the concerns addressed by [1] are still valid. Basically, what we need to guarantee is knowing when it is safe to wind fops through a particular subvol of protocol/server. So, if some xlators do things in events like CHILD_UP (like trash), server_setvolume should wait for CHILD_UP on a particular subvol before accepting a client. So, [1] is necessary, but the following changes need to be made:
1. protocol/server _can_ have multiple subvols as children. In that case we should track whether the exported subvol has received CHILD_UP, and only after a successful CHILD_UP on that subvol can connections to it be accepted.
2. It is valid (though not common for a brick process) that some subvols are up while others are down. So, child readiness should be localised to each subvol instead of being tracked at the protocol/server level.
So, please revive [1] and send it with corrections, and I'll merge it.
[1] http://review.gluster.org/11553
regards, Raghavendra. ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
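The per-subvol readiness tracking suggested in points 1 and 2 above could look roughly like this. This is a model to make the design concrete, not the actual protocol/server code (class and method names are invented for illustration):

```python
# Model of points 1 and 2: track CHILD_UP per exported subvolume and
# refuse setvolume for a subvol that is not yet up, instead of keeping a
# single server-wide "ready" flag.

class Server:
    def __init__(self, subvols):
        # One readiness flag per child subvol (point 2: localised state).
        self.child_up = {name: False for name in subvols}

    def notify_child_up(self, subvol):
        self.child_up[subvol] = True

    def notify_child_down(self, subvol):
        self.child_up[subvol] = False

    def setvolume(self, subvol):
        # Point 1: accept the client only after CHILD_UP on *this* subvol.
        # Point 2: other subvols being down must not block this one.
        return self.child_up.get(subvol, False)


srv = Server(["brick-a", "brick-b"])
srv.notify_child_up("brick-a")
print(srv.setvolume("brick-a"))  # True: brick-a has seen CHILD_UP
print(srv.setvolume("brick-b"))  # False: brick-b is not up yet
```

The key design point is that readiness is a per-child property, so a brick process exporting several subvols can serve the healthy ones while a lagging child finishes its CHILD_UP work (e.g. trash initialization).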
Re: [Gluster-devel] Inconsistent behavior due to lack of lookup on entry followed by readdirp
- Original Message - From: Krutika Dhananjay kdhan...@redhat.com To: Raghavendra Gowdappa rgowd...@redhat.com Cc: Mohammed Rafi K C rkavu...@redhat.com, Gluster Devel gluster-devel@gluster.org, Dan Lambright dlamb...@redhat.com, Nithya Balachandran nbala...@redhat.com, Ben Turner btur...@redhat.com, Ben England bengl...@redhat.com, Manoj Pillai mpil...@redhat.com, Pranith Kumar Karampuri pkara...@redhat.com, Ravishankar Narayanankutty ranar...@redhat.com, xhernan...@datalab.es Sent: Thursday, August 13, 2015 9:53:41 AM Subject: Re: Inconsistent behavior due to lack of lookup on entry followed by readdirp - Original Message - From: Raghavendra Gowdappa rgowd...@redhat.com To: Krutika Dhananjay kdhan...@redhat.com Cc: Mohammed Rafi K C rkavu...@redhat.com, Gluster Devel gluster-devel@gluster.org, Dan Lambright dlamb...@redhat.com, Nithya Balachandran nbala...@redhat.com, Ben Turner btur...@redhat.com, Ben England bengl...@redhat.com, Manoj Pillai mpil...@redhat.com, Pranith Kumar Karampuri pkara...@redhat.com, Ravishankar Narayanankutty ranar...@redhat.com, xhernan...@datalab.es Sent: Thursday, August 13, 2015 9:06:37 AM Subject: Re: Inconsistent behavior due to lack of lookup on entry followed by readdirp - Original Message - From: Krutika Dhananjay kdhan...@redhat.com To: Mohammed Rafi K C rkavu...@redhat.com Cc: Gluster Devel gluster-devel@gluster.org, Dan Lambright dlamb...@redhat.com, Nithya Balachandran nbala...@redhat.com, Raghavendra Gowdappa rgowd...@redhat.com, Ben Turner btur...@redhat.com, Ben England bengl...@redhat.com, Manoj Pillai mpil...@redhat.com, Pranith Kumar Karampuri pkara...@redhat.com, Ravishankar Narayanankutty ranar...@redhat.com, xhernan...@datalab.es Sent: Wednesday, August 12, 2015 9:02:44 PM Subject: Re: Inconsistent behavior due to lack of lookup on entry followed by readdirp I faced the same issue with the sharding translator. 
I fixed it by making its readdirp callback initialize individual entries' inode ctx, some of these being xattr values which are filled in entry->dict by the posix translator. Here is the patch that got merged recently: http://review.gluster.org/11854 Would that be as easy to do in DHT as well? The problem is not just filling out state in the inode. The bigger problem is healing, which is supposed to maintain a directory/file in a state consistent with our design before a successful reply to lookup. The operations can involve creating directories on missing subvols, setting the appropriate layout, etc. Effectively, for readdirp to replace lookup, it should be calling dht_lookup on each of the dentries it is passing back to the application. OK. As far as AFR is concerned, it indirectly forces LOOKUP on entries which are being retrieved for the first time through a READDIRP (and as a result do not have their inode ctx etc. initialised yet) by setting entry->inode to NULL. See afr_readdir_transform_entries(). Hmm. Then we have already disabled readdirp through code :). Without an inode corresponding to an entry, readdirp is effectively readdir, stripping any performance benefit of having readdirp act as a batched lookup (of all the dentries). No. Not every single READDIRP will be transformed into a READDIR by AFR. AFR resets the inode corresponding to an entry, before responding to its parent, _only_ under the following two conditions: 1) if the entry in question is being retrieved by this client for the first time through a READDIRP. In other words, this client has not _yet_ performed a LOOKUP on it. 2) if the sub-volume of AFR on which the parent directory is being READDIRP'd (remember AFR only needs to serve inode and directory reads from one of the replicas) does _not_ contain a good copy of the entry. In other words, this entry needs to be healed on the parent's read child.
This is because we do not want the caching translators or the application itself to get incorrect entry attributes. Thanks Krutika. We'll be borrowing the idea of setting entry->inode to NULL, when dht determines that inode needs to be healed. Since afr is already doing that for all dentries during first READDIRP (barring any lookups on that inode before), I don't think doing this will have any further performance degradation (As most of the setups will be distributed-replicated). This means that more often than not, AFR _would_ be leaving the inode corresponding to the entry as it is, and not setting it to NULL. This is the default behavior which is being made optional as part of http://review.gluster.org/#/c/11846/ which is still under review (see BZ 1250803, a performance bug :) ). If it is made optional, when we enable setting entry->inode we still see consistency issues. Also, it seems to me that there is no point in having each individual xlator
Re: [Gluster-devel] Inconsistent behavior due to lack of lookup on entry followed by readdirp
- Original Message - From: Krutika Dhananjay kdhan...@redhat.com To: Mohammed Rafi K C rkavu...@redhat.com Cc: Gluster Devel gluster-devel@gluster.org, Dan Lambright dlamb...@redhat.com, Nithya Balachandran nbala...@redhat.com, Raghavendra Gowdappa rgowd...@redhat.com, Ben Turner btur...@redhat.com, Ben England bengl...@redhat.com, Manoj Pillai mpil...@redhat.com, Pranith Kumar Karampuri pkara...@redhat.com, Ravishankar Narayanankutty ranar...@redhat.com, xhernan...@datalab.es Sent: Wednesday, August 12, 2015 9:02:44 PM Subject: Re: Inconsistent behavior due to lack of lookup on entry followed by readdirp I faced the same issue with the sharding translator. I fixed it by making its readdirp callback initialize individual entries' inode ctx, some of these being xattr values, which are filled in entry->dict by the posix translator. Here is the patch that got merged recently: http://review.gluster.org/11854 Would that be as easy to do in DHT as well? The problem is not just filling out state in the inode. The bigger problem is healing, which is supposed to maintain a directory/file to be in state consistent with our design before a successful reply to lookup. The operations can involve creating directories on missing subvols, setting appropriate layout, etc. Effectively for readdirp to replace lookup, it should be calling dht_lookup on each of the dentries it is passing back to the application. As far as AFR is concerned, it indirectly forces LOOKUP on entries which are being retrieved for the first time through a READDIRP (and as a result do not have their inode ctx etc initialised yet) by setting entry->inode to NULL. See afr_readdir_transform_entries(). Hmm. Then we already have disabled readdirp through code :). Without an inode corresponding to entry, readdirp will be effectively readdir stripping any performance benefits by having readdirp as a batched lookup (of all the dentries).
This is the default behavior, which is being made optional as part of http://review.gluster.org/#/c/11846/, still under review (see BZ 1250803, a performance bug :) ). If it is made optional, when we enable setting entry->inode we still see consistency issues. Also, it seems to me that there is no point in having an individual xlator option controlling this behaviour in each xlator. Instead we can make each xlator behave in compliance with the global mount option --use-readdirp=yes/no. Is there any specific reason to have an option controlling this behaviour in afr? -Krutika - Original Message - From: Mohammed Rafi K C rkavu...@redhat.com To: Gluster Devel gluster-devel@gluster.org Cc: Dan Lambright dlamb...@redhat.com, Nithya Balachandran nbala...@redhat.com, Raghavendra Gowdappa rgowd...@redhat.com, Ben Turner btur...@redhat.com, Ben England bengl...@redhat.com, Manoj Pillai mpil...@redhat.com, Pranith Kumar Karampuri pkara...@redhat.com, Ravishankar Narayanankutty ranar...@redhat.com, kdhan...@redhat.com, xhernan...@datalab.es Sent: Wednesday, August 12, 2015 7:29:48 PM Subject: Inconsistent behavior due to lack of lookup on entry followed by readdirp Hi All, We are facing some inconsistent behavior for fops like rename, unlink etc. due to the lack of a lookup following a readdirp. More specifically, if inodes/gfids are populated via a readdirp call and this nodeid is shared with the kernel, md-cache will cache this based on the base-name. A subsequent named lookup will then be served from md-cache and wind back immediately. So there is a chance of an FOP being triggered without a lookup ever having happened on the entry. DHT does a lot of things during lookup, like creating link files and populating the inode_ctx. In such a scenario it is a must to have at least one lookup happen on an entry. Since readdirp prevents the lookup, it has been very hard for fops to proceed without a first lookup on the entry. We also suspect similar problems with afr/ec self-healing.
So if we remove readdirp from md-cache ([1], [2]), it causes an additional hop for the first lookup of every entry. I'm mostly concerned with this one extra network call, and the performance degradation it causes. Now with this, the only advantage of readdirp is that it removes one context switch between kernel and userspace. Is it really worth sacrificing this for consistency? What do you think about removing the readdirp functionality? Please provide your input/suggestions/ideas. [1] : http://review.gluster.org/#/c/11892/ [2] : http://review.gluster.org/#/c/11894/ Thanks in Advance Rafi KC ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
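The race described in this thread can be sketched as a toy model (NOT GlusterFS code - the class and attribute names below are illustrative only): readdirp fills the client-side cache with inodes that lower xlators like dht never initialized via a named lookup, while AFR's workaround of stripping entry->inode forces the lookup to happen.

```python
# Toy model of the consistency gap: readdirp populates md-cache with
# inodes whose per-xlator ctx was never initialized by a named lookup.

class Inode:
    def __init__(self, gfid):
        self.gfid = gfid
        self.ctx_initialized = False  # stands in for dht's inode_ctx setup

class Client:
    def __init__(self):
        self.md_cache = {}            # basename -> Inode (or None)

    def readdirp(self, entries, strip_inodes=False):
        for name, inode in entries:
            # AFR-style workaround: drop entry->inode so the kernel is
            # forced to issue a real named lookup later.
            self.md_cache[name] = None if strip_inodes else inode

    def lookup(self, name, backend):
        cached = self.md_cache.get(name)
        if cached is not None:
            return cached             # served from md-cache; dht never sees it
        inode = backend[name]
        inode.ctx_initialized = True  # init happens only on a real lookup
        self.md_cache[name] = inode
        return inode

backend1 = {"f": Inode("gfid-1")}
c1 = Client()
c1.readdirp([("f", backend1["f"])])
# a later fop finds the cached inode, but its ctx was never initialized
assert not c1.lookup("f", backend1).ctx_initialized

backend2 = {"f": Inode("gfid-1")}
c2 = Client()
c2.readdirp([("f", backend2["f"])], strip_inodes=True)
# with entry->inode stripped, the named lookup reaches the backend
assert c2.lookup("f", backend2).ctx_initialized
```

The second half of the sketch is why, as Raghavendra notes, stripping the inode turns readdirp into plain readdir: every entry pays for a full lookup anyway.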
Re: [Gluster-devel] Locking behavior vs rmdir/unlink of a directory/file
- Original Message - From: Raghavendra Gowdappa rgowd...@redhat.com To: Gluster Devel gluster-devel@gluster.org Cc: Sakshi Bansal saban...@redhat.com Sent: Thursday, August 20, 2015 10:24:46 AM Subject: [Gluster-devel] Locking behavior vs rmdir/unlink of a directory/file Hi all, Most of the code currently treats the inode table (and the dentry structure associated with it) as the correct representation of the underlying backend file-system. While this is correct for most cases, the representation might be out of sync for small time-windows (like a file deleted on disk whose dentry and inode are not yet removed from our inode table, etc.). While working on locking directories in dht for better consistency, we ran into one such issue. The issue is basically to make rmdir and directory creation during dht-selfheal mutually exclusive. The idea is to take a blocking inodelk on the inode before proceeding with rmdir or directory self-heal. However, consider the following scenario: 1. (dht_)rmdir acquires a lock. 2. lookup-selfheal tries to acquire a lock, but is blocked on the lock acquired by rmdir. 3. rmdir deletes the directory and releases the lock. It's possible for the inode to remain in the inode table and be searchable through its gfid as long as there is a positive reference count on it. In this case the lock request (by lookup) and the granted lock (to rmdir) keep the inode in the inode table even after rmdir, as both of them hold a refcount on the inode. 4. The lock request issued by lookup is granted. Note that at step 4 it's still possible that rmdir is in progress from dht's perspective (it has just completed on one node). However, this is precisely the situation we wanted to avoid, i.e., we wanted to block and fail dht-selfheal instead of allowing it to proceed. In this scenario, at step 4 the directory is removed on the backend file-system, but its representation is still present in the inode table. We tried to solve this by doing a lookup on the gfid before granting a lock [1]. However, because of [1]: 1. we no longer treat the inode table as the source of truth, as opposed to other non-lookup code 2. there is a performance hit in terms of a lookup on the backend file-system for _every_ granted lock. This may not be as big considering that there is no network call involved. There are other ways dht could've avoided the above scenario altogether, with different trade-offs we didn't want to make. A few alternatives would've been: 1. Use entrylk during lookup-selfheal and rmdir. This fits naturally as both are entry operations. However, dht-selfheal also sets layouts, which should be synchronized with other operations where we don't have name information. tl;dr: we wanted to avoid using entrylk for reasons that are out of scope for this problem. 2. Use a non-blocking inodelk in dht during lookup-selfheal. This solves the problem for most practical cases, but theoretically the race can still exist. To summarize, the problem of granted locks vs. unlink/rmdir still remains and I am not sure what exactly the behavior of posix-locks should be in that scenario. Inputs by way of review on [1] are greatly appreciated. [1] http://review.gluster.org/#/c/11916/ regards, Raghavendra. ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
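The four-step scenario can be reproduced in a toy refcounting model (not GlusterFS code; names are illustrative): the pending lock request itself keeps the inode alive past the rmdir, so a grant that trusts only the inode table succeeds on a directory that is already gone from disk.

```python
# Toy model of steps 1-4: a pending lock request holds a ref on the
# inode, so the inode outlives rmdir and a naive grant still succeeds.

class Inode:
    def __init__(self, gfid, table):
        self.gfid, self.table, self.refs = gfid, table, 0

def ref(i):
    i.refs += 1

def unref(i):
    i.refs -= 1
    if i.refs == 0:
        del i.table[i.gfid]     # only now does it leave the inode table

backend = {"gfid-dir"}          # set of directories that exist on disk
table = {}
inode = table.setdefault("gfid-dir", Inode("gfid-dir", table))

ref(inode)                      # step 1: granted lock (rmdir) holds a ref
ref(inode)                      # step 2: blocked lock request (lookup-selfheal)

backend.discard("gfid-dir")     # step 3: rmdir deletes on the backend...
unref(inode)                    # ...and releases its lock
assert "gfid-dir" in table      # inode is still searchable by gfid

# step 4: a grant that consults only the inode table succeeds anyway
granted_naive = "gfid-dir" in table
# the fix proposed in [1]: also verify on the backend before granting
granted_checked = "gfid-dir" in table and "gfid-dir" in backend
assert granted_naive and not granted_checked
```

This is exactly trade-off 2 from the mail: the checked grant restores correctness at the cost of one backend lookup per grant.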
Re: [Gluster-devel] Locking behavior vs rmdir/unlink of a directory/file
To put the problem in simple words, A lock is granted by posix-locks xlator even after a directory is deleted on backend. - Original Message - From: Raghavendra Gowdappa rgowd...@redhat.com To: Gluster Devel gluster-devel@gluster.org Cc: Sakshi Bansal saban...@redhat.com Sent: Thursday, August 20, 2015 10:31:55 AM Subject: Re: [Gluster-devel] Locking behavior vs rmdir/unlink of a directory/file - Original Message - From: Raghavendra Gowdappa rgowd...@redhat.com To: Gluster Devel gluster-devel@gluster.org Cc: Sakshi Bansal saban...@redhat.com Sent: Thursday, August 20, 2015 10:24:46 AM Subject: [Gluster-devel] Locking behavior vs rmdir/unlink of a directory/file Hi all, Most of the code currently treats inode table (and dentry structure associated with that) as the correct representative of underlying backend file-system. While this is correct for most of the cases, the representation might be out of sync for small time-windows (like file deleted on disk, but dentry and inode is not removed in our inode table etc). While working on locking directories in dht for better consistency we ran into one such issue. The issue is basically to make rmdir and directory creation during dht-selfheal mutually exclusive. The idea is to have a blocking inodelk on inode before proceeding with rmdir or directory self-heal. However, consider following scenario: 1. (dht_)rmdir acquires a lock. 2. lookup-selfheal tries to acquire a lock, but is blocked on lock acquired by rmdir. 3. rmdir deletes directory and unlocks the lock. Its possible for inode to remain in inode table and searchable through gfid till there is a positive reference count on it. In this case lock-request (by lookup) and granted-lock (to rmdir) makes the inode to remain in inode table even after rmdir. as both of them have a refcount each on inode. 4. lock request issued by lookup is granted. Note that at step 4, its still possible rmdir might be in progress from dht perspective (it just completed on one node). 
However, this is precisely the situation we wanted to avoid i.e., we wanted to block and fail dht-selfheal instead of allowing it to proceed. In this scenario at step 4, the directory is removed on backend file-system, but its representation is still present in inode table. We tried to solve this by doing a lookup on gfid before granting a lock [1]. However, because of [1] 1. we no longer treat inode table as source of truth as opposed to other non-lookup code 2. performance hit in terms of a lookup on backend-filesystem for _every_ granted lock. This may not be as big considering that there is no network call involved. There are other ways where dht could've avoided above scenario altogether with different trade-offs we didn't want to make. Few alternatives would've been, 1. use entrylk during lookup-selfheal and rmdir. This fits naturally as both are entry operations. However, dht-selfheal also sets layouts which should be synchronized other operations where we don't have name information. tl;dr we wanted to avoid using entrylk for reasons that are out of scope for this problem. 2. Use non-blocking inodelk by dht during lookup-selfheal. This solves the problem for most of the practical cases, but theoretically race can still exist. To summarize, the problem of granted-locks and unlink/rmdir still remains and I am not sure what exactly should be the behavior of posix-locks in that scenario. Inputs in way of review on [1] are greatly appreciated. [1] http://review.gluster.org/#/c/11916/ regards, Raghavendra. ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
[Gluster-devel] Serialization of fops acting on same dentry on server
All, Pranith and I were discussing the implementation of compound operations like create + lock, mkdir + lock, open + lock etc. These operations are useful in situations like: 1. To prevent locking on all subvols during directory creation as part of self-heal in dht. Currently we follow the approach of locking _all_ subvols in both rmdir and lookup-heal [1]. 2. To lock a file in advance so that there is less of a performance hit during transactions in afr. While thinking about implementing such compound operations, it occurred to me that one of the problems would be how we handle a racing mkdir/create and a (named lookup - simply referred to as lookup from now on - followed by lock). This is because 1. creation of the directory/file on the backend and 2. linking of the inode with the gfid corresponding to that file/directory are not atomic. It is not guaranteed that the inode passed down during the mkdir/create call is the one that survives in the inode table. Since the posix-locks xlator maintains all the lock-state in the inode, it would be a problem if a different inode is linked into the inode table than the one passed during mkdir/create. One way to solve this problem is to serialize fops (like mkdir/create, lookup, rename, rmdir, unlink) that are happening on a particular dentry. This serialization would also solve other bugs like: 1. issues solved by [2][3], and possibly many similar issues. 2. Stale dentries left behind in bricks' inode tables because of racing lookup and dentry-modification ops (like rmdir, unlink, rename etc). The initial idea I have now is to track fops in-progress on a dentry in the parent inode (maybe in the resolver code in protocol/server). Based on this we can serialize the operations. Since we need to serialize _only_ operations on a dentry (we don't serialize nameless lookups), it is guaranteed that we always have a parent inode.
[1] http://review.gluster.org/11725 [2] http://review.gluster.org/9913 [3] http://review.gluster.org/5240 regards, Raghavendra. ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
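The proposal above - track in-progress fops per dentry in the parent inode and serialize them - could look roughly like this (an assumed design sketched with Python threading primitives, not the actual protocol/server resolver code):

```python
# Sketch: serialize fops on the same (parent, basename) dentry by
# recording in-progress ops in the parent inode. Nameless lookups are
# not routed through this, matching the proposal in the mail.
import threading

class ParentInode:
    def __init__(self):
        self._lock = threading.Lock()
        self._in_progress = {}      # basename -> Condition of the running fop

    def dentry_op(self, name, fop):
        """Run fop() exclusively w.r.t. other fops on the same dentry."""
        with self._lock:
            while name in self._in_progress:
                self._in_progress[name].wait()   # another fop owns this dentry
            self._in_progress[name] = threading.Condition(self._lock)
        try:
            return fop()            # mkdir/create/lookup/rename/rmdir/unlink body
        finally:
            with self._lock:
                self._in_progress.pop(name).notify_all()

p = ParentInode()
assert p.dentry_op("conf.file", lambda: "ok") == "ok"
```

Operations on different basenames under the same parent proceed in parallel; only same-dentry ops queue up, which is what closes the mkdir/create vs. lookup inode-linking race.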
Re: [Gluster-devel] Serialization of fops acting on same dentry on server
- Original Message - From: Raghavendra Gowdappa rgowd...@redhat.com To: Gluster Devel gluster-devel@gluster.org Cc: Sakshi Bansal saban...@redhat.com Sent: Monday, 17 August, 2015 10:39:38 AM Subject: [Gluster-devel] Serialization of fops acting on same dentry on server All, Pranith and I were discussing the implementation of compound operations like create + lock, mkdir + lock, open + lock etc. These operations are useful in situations like: 1. To prevent locking on all subvols during directory creation as part of self-heal in dht. Currently we follow the approach of locking _all_ subvols in both rmdir and lookup-heal [1]. Correction. It should've been: to prevent locking on all subvols during rmdir. The lookup self-heal should lock on all subvols (with a compound mkdir + lock if the directory is not present on a subvol). With this, rmdir/rename can lock on just any one subvol, and since lookup-heal locks all subvols, the two remain mutually exclusive. 2. To lock a file in advance so that there is less of a performance hit during transactions in afr. While thinking about implementing such compound operations, it occurred to me that one of the problems would be how we handle a racing mkdir/create and a (named lookup - simply referred to as lookup from now on - followed by lock). This is because 1. creation of the directory/file on the backend and 2. linking of the inode with the gfid corresponding to that file/directory are not atomic. It is not guaranteed that the inode passed down during the mkdir/create call is the one that survives in the inode table. Since the posix-locks xlator maintains all the lock-state in the inode, it would be a problem if a different inode is linked into the inode table than the one passed during mkdir/create. One way to solve this problem is to serialize fops (like mkdir/create, lookup, rename, rmdir, unlink) that are happening on a particular dentry. This serialization would also solve other bugs like: 1. issues solved by [2][3], and possibly many similar issues. 2. Stale dentries left behind in bricks' inode tables because of racing lookup and dentry-modification ops (like rmdir, unlink, rename etc). The initial idea I have now is to track fops in-progress on a dentry in the parent inode (maybe in the resolver code in protocol/server). Based on this we can serialize the operations. Since we need to serialize _only_ operations on a dentry (we don't serialize nameless lookups), it is guaranteed that we always have a parent inode. Any comments/discussion on this would be appreciated. [1] http://review.gluster.org/11725 [2] http://review.gluster.org/9913 [3] http://review.gluster.org/5240 regards, Raghavendra. ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
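The mutual-exclusion claim in the correction is a simple lock-set invariant and can be checked mechanically (toy sketch, not GlusterFS code): if lookup-heal locks every subvol and rmdir/rename locks any one, their lock sets always intersect.

```python
# Sketch of the invariant behind the correction: lookup-heal locks ALL
# subvols; rmdir/rename locks ANY ONE; so the two lock sets always
# intersect and the operations can never run concurrently.
subvols = ["subvol-0", "subvol-1", "subvol-2"]

heal_locks = set(subvols)            # compound mkdir+lock on every subvol
for chosen in subvols:               # rmdir may pick any single subvol
    rmdir_locks = {chosen}
    # non-empty intersection => lock manager serializes the two ops
    assert heal_locks & rmdir_locks
```

This is why rmdir no longer needs to lock all subvols: one lock is enough as long as the heal side keeps locking everything.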
Re: [Gluster-devel] Serialization of fops acting on same dentry on server
- Original Message - From: Niels de Vos nde...@redhat.com To: Raghavendra Gowdappa rgowd...@redhat.com Cc: Gluster Devel gluster-devel@gluster.org, Sakshi Bansal saban...@redhat.com Sent: Monday, 17 August, 2015 11:14:18 AM Subject: Re: [Gluster-devel] Serialization of fops acting on same dentry on server On Mon, Aug 17, 2015 at 01:09:38AM -0400, Raghavendra Gowdappa wrote: All, Pranith and me were discussing about implementation of compound operations like create + lock, mkdir + lock, open + lock etc. These operations are useful in situations like: 1. To prevent locking on all subvols during directory creation as part of self heal in dht. Currently we are following approach of locking _all_ subvols by both rmdir and lookup-heal [1]. 2. To lock a file in advance so that there is less performance hit during transactions in afr. I have an interest in compound/composite procedures too. My use-case is a little different, and I (was and still) am planning to send more details about it soon. Basically, there are certain cases where libgfapi will not be able to automatically pass the uid/gid in the RPC-header. A design for supporting Kerberos will mainly use the standardized RPCSEC_GSS. If there is no option to use the Kerberos credentials of the user doing I/O (remote client, not using Kerberos to talk to samba/ganesha), the username (or uid/gid) needs to be passed to the storage servers. A compound/composite procedure would then look like this: [RPC header] [AUTH_GSS + Kerberos principal for libgfapi/samba/ganesha/...] [GlusterFS COMPOUND] [SETFSUID] [SETLOCKOWNER] [${FOP}] [.. more FOPs?] This idea has not been reviewed/commented on with some of the Kerberos experts that I want to involve. A more complete description about the plans to support Kerberos will follow. Do you think that this matches your ideas on compound operations? The thing we had in mind was more of compounding more than one Gluster fops. 
We really didn't think at the granularity of setfsuid, setlkowner etc. But, yes its not something fundamentally different from what we had in mind. Thanks, Niels While thinking about implementing such compound operations, it occurred to me that one of the problems would be how do we handle a racing mkdir/create and a (named lookup - simply referred as lookup from now on - followed by lock). This is because, 1. creation of directory/file on backend 2. linking of the inode with the gfid corresponding to that file/directory are not atomic. It is not guaranteed that inode passed down during mkdir/create call need not be the one that survives in inode table. Since posix-locks xlator maintains all the lock-state in inode, it would be a problem if a different inode is linked in inode table than the one passed during mkdir/create. One way to solve this problem is to serialize fops (like mkdir/create, lookup, rename, rmdir, unlink) that are happening on a particular dentry. This serialization would also solve other bugs like: 1. issues solved by [2][3] and possibly many such issues. 2. Stale dentries left out in bricks' inode table because of a racing lookup and dentry modification ops (like rmdir, unlink, rename etc). Initial idea I've now is to maintain fops in-progress on a dentry in parent inode (may be resolver code in protocol/server). Based on this we can serialize the operations. Since we need to serialize _only_ operations on a dentry (we don't serialize nameless lookups), it is guaranteed that we do have a parent inode always. Any comments/discussion on this would be appreciated. [1] http://review.gluster.org/11725 [2] http://review.gluster.org/9913 [3] http://review.gluster.org/5240 regards, Raghavendra. ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
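Niels' framing - one RPC carrying [SETFSUID] [SETLOCKOWNER] [FOP] ... - can be sketched as a server-side dispatcher that applies sub-operations in order, so credentials and the fop travel together (op names and the dict-based context are illustrative assumptions, not the real GlusterFS wire format):

```python
# Hypothetical sketch of a compound procedure handler: credential-setting
# sub-ops mutate the request context; FOP sub-ops execute under whatever
# context has been accumulated so far.

def server_handle_compound(ops, ctx):
    results = []
    for op, arg in ops:
        if op == "SETFSUID":
            ctx["fsuid"] = arg            # applies to the following fops
        elif op == "SETLOCKOWNER":
            ctx["lk_owner"] = arg
        elif op == "FOP":
            results.append(f"{arg} as uid={ctx['fsuid']} owner={ctx['lk_owner']}")
        else:
            raise ValueError(f"unknown sub-op {op}")
    return results

out = server_handle_compound(
    [("SETFSUID", 1000), ("SETLOCKOWNER", "0xabc"), ("FOP", "MKDIR /d")],
    {"fsuid": 0, "lk_owner": None})
assert out == ["MKDIR /d as uid=1000 owner=0xabc"]
```

The same dispatcher shape covers both use cases in the thread: Kerberos-style credential forwarding (SETFSUID before the fop) and fop compounding (mkdir + lock as two FOP entries in one request).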
[Gluster-devel] Locking behavior vs rmdir/unlink of a directory/file
Hi all, Most of the code currently treats inode table (and dentry structure associated with that) as the correct representative of underlying backend file-system. While this is correct for most of the cases, the representation might be out of sync for small time-windows (like file deleted on disk, but dentry and inode is not removed in our inode table etc). While working on locking directories in dht for better consistency we ran into one such issue. The issue is basically to make rmdir and directory creation during dht-selfheal mutually exclusive. The idea is to have a blocking inodelk on inode before proceeding with rmdir or directory self-heal. However, consider following scenario: 1. (dht_)rmdir acquires a lock. 2. lookup-selfheal tries to acquire a lock, but is blocked on lock acquired by rmdir. 3. rmdir deletes directory and unlocks the lock. Its possible for inode to remain in inode table and searchable through gfid till there is a positive reference count on it. In this case lock-request (by lookup) and granted-lock (to rmdir) makes the inode to remain in inode table even after rmdir. 4. lock request issued by lookup is granted. Note that at step 4, its still possible rmdir might be in progress from dht perspective (it just completed on one node). However, this is precisely the situation we wanted to avoid i.e., we wanted to block and fail dht-selfheal instead of allowing it to proceed. In this scenario at step 4, the directory is removed on backend file-system, but its representation is still present in inode table. We tried to solve this by doing a lookup on gfid before granting a lock [1]. However, because of [1] 1. we no longer treat inode table as source of truth as opposed to other non-lookup code 2. performance hit in terms of a lookup on backend-filesystem for _every_ granted lock. This may not be as big considering that there is no network call involved. 
There are other ways where dht could've avoided above scenario altogether with different trade-offs we didn't want to make. Few alternatives would've been, 1. use entrylk during lookup-selfheal and rmdir. This fits naturally as both are entry operations. However, dht-selfheal also sets layouts which should be synchronized other operations where we don't have name information. tl;dr we wanted to avoid using entrylk for reasons that are out of scope for this problem. 2. Use non-blocking inodelk by dht during lookup-selfheal. This solves the problem for most of the practical cases, but theoretically race can still exist. To summarize, the problem of granted-locks and unlink/rmdir still remains and I am not sure what exactly should be the behavior of posix-locks in that scenario. Inputs in way of review on [1] are greatly appreciated. [1] http://review.gluster.org/#/c/11916/ regards, Raghavendra. ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] SSL enabled glusterd crash
There is a race between gf_timer_call_cancel and the firing of the timer, addressed by [1]. Can this be the cause? Also note that [1] is not sufficient on its own: callers of gf_timer_call_cancel should check the return value, and should not free the opaque pointer they passed to gf_timer_call_after during timer registration when gf_timer_call_cancel returns -1. Note that [1] is not in 3.7.3. [1] http://review.gluster.org/6459 - Original Message - From: Emmanuel Dreyfus m...@netbsd.org To: gluster-devel@gluster.org Sent: Thursday, August 6, 2015 3:29:36 PM Subject: [Gluster-devel] SSL enabled glusterd crash On 3.7.3 with SSL enabled, restarting glusterd is quite unreliable, with peers and bricks showing up or not in gluster status outputs. And results can be different on different peers, and even not symmetrical: a peer sees the bricks of another but not the other way around. After playing a bit, I managed to get a real crash on restarting glusterd on all peers. 3 of them crash here:

Program terminated with signal 11, Segmentation fault.
#0  0xbbbda1f4 in rpc_clnt_reconnect (conn_ptr=0xb9ce5150) at rpc-clnt.c:409
409             gf_timer_call_cancel (clnt->ctx,
#0  0xbbbda1f4 in rpc_clnt_reconnect (conn_ptr=0xb9ce5150) at rpc-clnt.c:409
#1  0xbbb33d0c in gf_timer_proc (ctx=Cannot access memory at address 0xba9fffd8) at timer.c:194
(gdb) list
404             if (!trans) {
405                     pthread_mutex_unlock (&conn->lock);
406                     return;
407             }
408             if (conn->reconnect)
409                     gf_timer_call_cancel (clnt->ctx,
410                                           conn->reconnect);
411             conn->reconnect = 0;
412
413             if ((conn->connected == 0) && !clnt->disabled) {
(gdb) print clnt
$1 = (struct rpc_clnt *) 0x39bb
(gdb) print conn
$2 = (rpc_clnt_connection_t *) 0xb9ce5150
(gdb) print conn->lock
$3 = {ptm_magic = 51200, ptm_errorcheck = 0 '\000', ptm_pad1 = 0Q\316, ptm_interlock = 185 '\271', ptm_pad2 = \336\300\255, ptm_owner = 0x6af000de, ptm_waiters = 0x39bb, ptm_recursed = 51200, ptm_spare2 = 0xce513000}

ptm_magic is wrong.
NetBSD libpthread sets it as 0x0003 when created and as 0xDEAD0003 when destroyed. This means we either have memory corruption, or the mutex was never initialized. The last one crashes somewhere else:

Program terminated with signal 11, Segmentation fault
#0  0xbbb33e60 in gf_timer_registry_init (ctx=0x80) at timer.c:241
241             if (!ctx->timer) {
(gdb) bt
#0  0xbbb33e60 in gf_timer_registry_init (ctx=0x80) at timer.c:241
#1  0xbbb339ce in gf_timer_call_cancel (ctx=0x80, event=0xb9dffb24) at timer.c:121
#2  0xbbbda206 in rpc_clnt_reconnect (conn_ptr=0xb9ce9150) at rpc-clnt.c:409
#3  0xbbb33d0c in gf_timer_proc (ctx=Cannot access memory at address 0xba9fffd8) at timer.c:194
(gdb) print ctx
$1 = (glusterfs_ctx_t *) 0x80
(gdb) frame 2
#2  0xbbbda206 in rpc_clnt_reconnect (conn_ptr=0xb9ce9150) at rpc-clnt.c:409
409             gf_timer_call_cancel (clnt->ctx,
(gdb) print clnt
$2 = (struct rpc_clnt *) 0xb9dffd94
(gdb) print clnt->lock.ptm_magic
$3 = 1

Here again, corrupted or not initialized. I kept the cores for further investigation if needed. -- Emmanuel Dreyfus m...@netbsd.org ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
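The rule Raghavendra states - only free the callback's opaque data when the cancel actually succeeded - can be illustrated outside of GlusterFS with Python's threading.Timer (a sketch of the ownership pattern, not the gf_timer API; the fired-flag check is an illustrative stand-in for gf_timer_call_cancel's return value):

```python
# Sketch of the cancel-then-free rule: if cancel loses the race (the
# timer already fired or is firing), ownership of the opaque data stays
# with the callback; freeing it in the canceller would be the C-level
# use-after-free behind the crashes above.
import threading

def safe_cancel(timer, fired, free):
    timer.cancel()
    if fired.is_set():
        return -1        # too late: the callback owns the data
    free()               # cancel won the race: the caller releases the data
    return 0

data = {"alive": True}
fired = threading.Event()
t = threading.Timer(60.0, fired.set)   # far in the future: cancel wins
t.start()
rc = safe_cancel(t, fired, lambda: data.update(alive=False))
assert rc == 0 and data["alive"] is False
```

In the gf_timer case the equivalent discipline is: check gf_timer_call_cancel's return value, and on -1 let the timer callback remain responsible for the opaque pointer.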
Re: [Gluster-devel] Patch merge request-3.7 branch: http://review.gluster.org/#/c/11858/
- Original Message - From: Ravishankar N ravishan...@redhat.com To: Gluster Devel gluster-devel@gluster.org Sent: Wednesday, August 12, 2015 6:01:16 PM Subject: [Gluster-devel] Patch merge request-3.7 branch: http://review.gluster.org/#/c/11858/ Could some one with merge rights take http://review.gluster.org/#/c/11858/ in for the 3.7 branch? This backport has +2 from the maintainer and has passed regressions. Done. Thanks in advance :-) Ravi ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
[Gluster-devel] Lack of named lookups during resolution of inodes after graph switch (was Discuss: http://review.gluster.org/#/c/11368/)
+gluster-devel - Original Message - From: Dan Lambright dlamb...@redhat.com To: Raghavendra Gowdappa rgowd...@redhat.com Cc: Shyam srang...@redhat.com, Nithya Balachandran nbala...@redhat.com, Sakshi Bansal saban...@redhat.com Sent: Monday, July 20, 2015 8:23:16 AM Subject: Re: Discuss: http://review.gluster.org/#/c/11368/ I am posting another version of the patch to discuss. Here is a summary in its simplest form: The fix tries to address problems we have with tiered volumes and fix-layout. If we try to use both the hot and cold tier before fix-layout has completed, we get many stale file errors; the new hot tier does not have layouts for the inodes. To avoid such problems, we only use the cold tier until fix-layout is done (subvolume count = 1). When we detect that fix-layout is done, we do a graph switch which creates new layouts on demand. We would like to switch to using both tiers (subvolume_cnt = 2) only once the graph switch is done. There is a hole in that solution. If we make a directory after fix-layout has passed its parent, fix-layout will not copy the new directory to the new tier. If we try to access such directories, the code fails (dht_access does not have a cached subvolume). So, we detect such directories when we do a lookup/revalidate/discover, and store their peculiar state in the layout if they are only accessible on the cold tier. Eventually a self-heal will happen, and this state will age out. I have a unit test and a system test for this. Basically my question is: what is the cleanest way to invoke the graph switch from using just the cold tier to using both the cold and hot tier? This is a long-standing problem which needs a fix very badly. I think the client/mount cannot rely on the rebalance/tier process for directory creation, since I/O on the client is independent and there is no way to synchronize it with rebalance directory heal.
The culprit here is the lack of hierarchical named lookups from the root down to that directory after a graph switch in the mount process. If named lookups are sent, dht is quite capable of creating directories on newly added subvols. So, I am proposing some solutions below. Interface layers (fuse-bridge, gfapi, nfs etc.) should make sure that the entire directory hierarchy up to the root is looked up at least once before sending fops on an inode after a graph switch. For dht, it is sufficient if only the inodes associated with directories are looked up in this fashion. However, non-directory inodes might also benefit from this, since VFS essentially would've done a hierarchical lookup before doing fops. It's only glusterfs which has introduced nameless lookups, but much of the logic is designed around named hierarchical lookups. Now, to address the question of whether it's possible for interface layers to figure out the ancestry of an inode: * With fuse-bridge, the entire dentry structure is preserved (at least in the first graph which witnessed named lookups from the kernel, and we can migrate this structure to newer graphs too). We can use the dentry structure from the older graph to send these named lookups and build a similar dentry structure in the newer graph. This resolution is still on-demand when a fop is sent on an inode (like the existing code, the change being that instead of one nameless lookup on the inode, we do named lookups of the parents and the inode in the newer graph). So, named lookups can be sent for all inodes, irrespective of whether the inode corresponds to a directory or a non-directory. * I am assuming gfapi is similar to fuse-bridge. I would need verification from the people maintaining gfapi whether my assumption is correct. * The NFS-v3 server allows the client to just pass a file-handle and can construct the relevant state to access the files (one of the reasons why nameless lookups were introduced in the first place). Since it relies heavily on nameless lookups, the dentry structure need not always be present in the NFS server process.
However, we can borrow some ideas from [1]. If it seems that maintaining the list of parents of a file in xattrs is overkill (basically we are constructing a reverse dentry tree), at least for the problems faced by dht/tier it is good enough if we get this hierarchy for directory inodes. With a gfid-based backend, we can always get the path/hierarchy for a directory from the gfid of its inode using the .glusterfs directory (within .glusterfs there is a symbolic link named after the gfid whose contents can get us the ancestry up to the root). This solution works for _all_ interface layers. I suspect it is not just dht, but also other cluster xlators like EC and afr, and non-cluster entities like quota and geo-rep, which face this issue. I am aware of at least one problem in afr - difficulty in identifying gfid mismatch of an entry across subvols after a graph switch. Geo-replication too is using some form of gfid-to-path conversion. So, comments from other maintainers/developers are highly appreciated. [1] http
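The ".glusterfs symlink" idea for directories can be sketched as a walk: on a brick, the gfid-named entry for a directory is a symlink whose target encodes the parent's gfid and the directory's basename, so following it repeatedly reconstructs the ancestry up to the root. The layout below (.glusterfs/<g0g1>/<g2g3>/<gfid> pointing at ../../<p0p1>/<p2p3>/<pargfid>/<basename>) is an assumption about the gfid-based backend and should be verified against the gluster version in use.

```python
# Sketch: recover a directory's ancestry from its gfid using the
# .glusterfs symlink structure on a brick. Not production code.
import os

ROOT_GFID = "00000000-0000-0000-0000-000000000001"

def gfid_path(brick, gfid):
    return os.path.join(brick, ".glusterfs", gfid[0:2], gfid[2:4], gfid)

def ancestry(brick, gfid):
    """Return [(gfid, basename), ...] from the directory up to the root."""
    chain = []
    while gfid != ROOT_GFID:
        # assumed target shape: ../../aa/bb/<parent-gfid>/<basename>
        target = os.readlink(gfid_path(brick, gfid))
        parts = target.split("/")
        pargfid, name = parts[-2], parts[-1]
        chain.append((gfid, name))
        gfid = pargfid
    return chain
```

With such a walk available on the server, any interface layer could issue the hierarchical named lookups the mail asks for, without maintaining a reverse dentry tree in xattrs.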
Re: [Gluster-devel] Gerrit review, submit type and Jenkins testing
- Original Message - > From: "Raghavendra Talur"> To: "Gluster Devel" > Sent: Tuesday, November 10, 2015 3:10:34 AM > Subject: [Gluster-devel] Gerrit review, submit type and Jenkins testing > > Hi, > > While trying to understand how our gerrit+jenkins setup works, I realized there is a possibility of allowing bugs to get in. > > Currently, our gerrit is set up to use cherry-pick as the submit type. Now consider a case where: > > Dev1 sends a commit B with parent commit A (A is already merged). > Dev2 sends a commit C with parent commit A (A is already merged). > > Both patches get +2 from Jenkins. > > A maintainer merges commit B from Dev1. > Another maintainer merges commit C from Dev2. > > If the two commits B and C changed code which had no merge conflicts but which conflicts in logic, then we have a master which has bugs. > > If Dev3 now sends a commit D with the rebased master as parent, we have the following cases: > > 1. If the bug introduced above is not racy, tests always fail for Dev3 on commit D. The tests that fail would be from components that commits B and C changed. Dev3 has no idea how to fix them and has to enlist help from Dev1 and Dev2. > > 2. If the bug introduced above is racy, then there is a probability that Dev3 escapes this trouble and someone else will bear it later. Even if the racy code is hit and a test fails, Dev3 will probably re-trigger the tests, given that they failed for a component which is not related to his/her code, and the bug stays in the code longer. > > The most obvious but not practical solution to the above problem is to change the submit type in gerrit to "fast-forward only". It would then ensure that once commit B is merged, Dev2 has to rebase and re-run the tests on commit C with commit B as parent before it can be merged. It is not practical because it will cause all patches in review to get rebased and re-triggered whenever a patch is merged.
> > A little modification to the above solution would be to > > > * change submit type to fast-forward only > * don't run any jenkins job on patches till they get +2 from reviewers > * once a +2 is given, run jenkins job on patch and automatically submit > it if test passes. > * automatically rebase all patches on review with new master and mark > conflict if merge conflict arises. Seems like a good suggestion. How about a slight variation to the above process? Can we run one initial set of regression immediately after submission, but before any reviews? That way reviewers can prioritize those patches that have passed regression over the ones that have failed? Flip side is that minimum two sets of regressions are needed to merge any patch. I am making this suggestion with the assumption that dev/reviewer time is more precious than machine time. Of course, this will have issues with patches that need to get in urgently (user/customer hot fix etc) where time is a constraint. But that can be worked around on a case-by-case basis. > > As a side effect of this, Dev would now be forced to run a complete > regression on dev machine before sending a patch for review. > > Any thoughts on the above solutions or other suggestions? > > Thanks, > Raghavendra Talur > > > > > > > > > > ___ > Gluster-devel mailing list > Gluster-devel@gluster.org > http://www.gluster.org/mailman/listinfo/gluster-devel ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
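The core difference between the two submit types can be reduced to one predicate (a toy sketch, not Gerrit's implementation): under fast-forward-only, a patch is mergeable only if it was built - and therefore tested - against the current branch head, which is exactly what forces the rebase + retest that closes the semantic-conflict hole.

```python
# Sketch of fast-forward-only submit: merging is allowed only when the
# patch's parent is the current branch head, so every merged patch was
# tested against exactly the tree it lands on.

def can_submit_fast_forward(patch_parent, branch_head):
    return patch_parent == branch_head

head = "A"
# B and C were both written (and regression-tested) on top of A
assert can_submit_fast_forward("A", head)       # B merges cleanly
head = "B"                                      # head moves to B
# C's tests ran against A, not B: fast-forward-only rejects it...
assert not can_submit_fast_forward("A", head)
# ...until C is rebased onto B and retested
assert can_submit_fast_forward("B", head)
```

Cherry-pick submit, by contrast, accepts C in the second step, which is the scenario where B and C conflict in logic without any textual merge conflict.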
Re: [Gluster-devel] not able to open gerrit (review.glustster.org)
It's loading now. - Original Message - > From: "Gaurav Garg"> To: "Gluster Devel" > Sent: Monday, October 19, 2015 11:11:37 AM > Subject: [Gluster-devel] not able to open gerrit (review.glustster.org) > > Anybody facing the same issue ? > > Thanx, > > ~Gaurav > ___ > Gluster-devel mailing list > Gluster-devel@gluster.org > http://www.gluster.org/mailman/listinfo/gluster-devel > ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] Need advice re some major issues with glusterfind
Hi John, - Original Message - > From: "John Sincock [FLCPTY]"> To: gluster-devel@gluster.org > Sent: Wednesday, October 21, 2015 5:53:23 AM > Subject: [Gluster-devel] Need advice re some major issues with glusterfind > > Hi Everybody, > > We have recently upgraded our 220 TB gluster to 3.7.4, and we've been trying > to use the new glusterfind feature but have been having some serious > problems with it. Overall the glusterfind looks very promising, so I don't > want to offend anyone by raising these issues. > > If these issues can be resolved or worked around, glusterfind will be a great > feature. So I would really appreciate any information or advice: > > 1) What can be done about the vast number of tiny changelogs? We are seeing > often 5+ small 89 byte changelog files per minute on EACH brick. Larger > files if busier. We've been generating these changelogs for a few weeks and > have in excess of 10,000 or 12,000 on most bricks. This makes glusterfinds > very, very slow, especially on a node which has a lot of bricks, and looks > unsustainable in the long run. Why are these files so small, and why are > there so many of them, and how are they supposed to be managed in the long > run? The sheer number of these files looks sure to impact performance in the > long run. > > 2) Pgfid xattribute is wreaking havoc with our backup scheme - when gluster > adds this extended attribute to files it changes the ctime, which we were > using to determine which files need to be archived. There should be a > warning added to release notes & upgrade notes, so people can make a plan to > manage this if required. > > Also, we ran a rebalance immediately after the 3.7.4 upgrade, and the > rebalance took 5 days or so to complete, which looks like a major speed > improvement over the more serial rebalance algorithm, so that's good. 
But I > was hoping that the rebalance would also have had the side-effect of > triggering all files to be labelled with the pgfid attribute by the time the > rebalance completed, or failing that, after creation of an mlocate database > across our entire gluster (which would have accessed every file, unless it > is getting the info it needs only from directory inodes). Now it looks like > ctimes are still being modified, and I think this can only be caused by > files still being labelled with pgfids. > > How can we force gluster to get this pgfid labelling over and done with, for > all files that are already on the volume? We can't have gluster continuing > to add pgfids in bursts here and there, eg when files are read for the first > time since the upgrade. We need to get it over and done with. We have just > had to turn off pgfid creation on the volume until we can force gluster to > get it over and done with in one go. We are looking into pgfid xattr issue. Its a long weekend here in India. So, kindly expect a delay on update on this issue. > > 3) Files modified just before a glusterfind pre are often not included in the > changed files list, unless pre command is run again a bit later - I think > changelogs are missing very recent changes and need to be flushed or > something before the pre command uses them? > > 4) BUG: Glusterfind follows symlinks off bricks and onto NFS mounted > directories (and will cause these shares to be mounted if you have autofs > enabled). Glusterfind should definitely not follow symlinks, but it does. > For now, we are getting around this by turning off autofs when re run > glusterfinds, but this should not be necessary. Glusterfind must be fixed so > it never follows symlinks and never leaves the brick it is currently > searching. 
> > 5) We have one of our nodes with 16 bricks, and on this machine, glusterfind > pre command seems to get stuck pegging all 8 cores to 100%, an strace of an > offending processes gives an endless stream of these lseeks and reads and > very little else. What is going on here? It doesn't look right... : > > lseek(13, 17188864, SEEK_SET) = 17188864 > read(13, > "\r\0\0\0\4\0J\0\3\25\2\"\0013\0J\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1024) > = 1024 > lseek(13, 17189888, SEEK_SET) = 17189888 > read(13, > "\r\0\0\0\4\0\"\0\3\31\0020\1#\0\"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., > 1024) = 1024 > lseek(13, 17190912, SEEK_SET) = 17190912 > read(13, > "\r\0\0\0\3\0\365\0\3\1\1\372\0\365\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., > 1024) = 1024 > lseek(13, 17191936, SEEK_SET) = 17191936 > read(13, > "\r\0\0\0\4\0F\0\3\17\2\"\0017\0F\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1024) > = 1024 > lseek(13, 17192960, SEEK_SET) = 17192960 > read(13, > "\r\0\0\0\4\0006\0\2\371\2\4\1\31\0006\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., > 1024) = 1024 > lseek(13, 17193984, SEEK_SET) = 17193984 > read(13, > "\r\0\0\0\4\0L\0\3\31\2\36\1/\0L\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1024) > = 1024 > > I saved one of these straces for 20 or 30 secs or so, and then doing a quick
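Regarding point 2 above (forcing the pgfid labelling in one pass): assuming the xattr is added lazily during a named lookup on each file, a one-shot walk that stats every entry on the mount would trigger all the labelling (and the resulting ctime changes) in a single controlled window instead of in bursts. This is a hedged workaround sketch, not a documented Gluster procedure:

```python
import os

def force_lookup(root):
    # Walk a (hypothetical) Gluster mount point and stat every file,
    # forcing a named lookup on each entry.  Assumption: lookup is what
    # triggers the lazy pgfid labelling, so one pass gets the ctime
    # changes over and done with.
    count = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                os.stat(os.path.join(dirpath, name))
                count += 1
            except OSError:
                pass  # file vanished mid-walk; skip it
    return count
```

Run once, off-hours, before re-enabling any ctime-based archiving so the backup scheme sees a single burst of ctime changes rather than an open-ended trickle.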
Re: [Gluster-devel] Testcase '/tests/bugs/snapshot/bug-1109889.t' failing
- Original Message - From: Nithya Balachandran nbala...@redhat.com To: Vijaikumar M vmall...@redhat.com Cc: Gluster Devel gluster-devel@gluster.org Sent: Wednesday, July 8, 2015 12:17:29 PM Subject: Re: [Gluster-devel] Testcase '/tests/bugs/snapshot/bug-1109889.t' failing Also failing on: http://build.gluster.org/job/rackspace-regression-2GB-triggered/12069/consoleFull Regards, Nithya - Original Message - From: Vijaikumar M vmall...@redhat.com To: Gluster Devel gluster-devel@gluster.org, Avra Sengupta aseng...@redhat.com, Rajesh Joseph rjos...@redhat.com Sent: Wednesday, 8 July, 2015 11:09:14 AM Subject: [Gluster-devel] Testcase '/tests/bugs/snapshot/bug-1109889.t' failing Hi, Testcase '/tests/bugs/snapshot/bug-1109889.t' is failing consistently. Earlier, this test caused a brick crash because the brick accepted fops even before its xlator graph was initialised. A recent fix makes the server reject any client connections till the xlator graph is initialised. I think stat is failing with ENOTCONN because of that. Now the question is why/how the client is trying to connect before the server is initialised. Hope that helps. http://build.gluster.org/job/rackspace-regression-2GB-triggered/12048/consoleFull I think the test below at line# 72 needs to be changed: TEST stat $M0/.snaps; To EXPECT_WITHIN $PROCESS_UP_TIMEOUT 0 STAT $M0/.snaps Thanks, Vijay ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
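The fix suggested above swaps a one-shot stat for a retrying check. The retry-until-timeout pattern behind EXPECT_WITHIN can be sketched as follows (function names and the poll interval here are illustrative, not the test framework's actual internals):

```python
import time

def expect_within(timeout, expected, check, interval=0.1):
    # Poll check() until it returns the expected value or the timeout
    # elapses -- the pattern behind the test framework's EXPECT_WITHIN.
    deadline = time.monotonic() + timeout
    while True:
        if check() == expected:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

With a retrying check, a mount whose server briefly rejects connections while the xlator graph initialises makes the stat succeed on a later attempt instead of failing the test outright.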
Re: [Gluster-devel] Locking behavior vs rmdir/unlink of a directory/file
- Original Message - From: Vijay Bellur vbel...@redhat.com To: Raghavendra Gowdappa rgowd...@redhat.com, Gluster Devel gluster-devel@gluster.org Cc: Sakshi Bansal saban...@redhat.com Sent: Monday, August 24, 2015 3:52:09 PM Subject: Re: [Gluster-devel] Locking behavior vs rmdir/unlink of a directory/file On Thursday 20 August 2015 10:24 AM, Raghavendra Gowdappa wrote: Hi all, Most of the code currently treats inode table (and dentry structure associated with that) as the correct representative of underlying backend file-system. While this is correct for most of the cases, the representation might be out of sync for small time-windows (like file deleted on disk, but dentry and inode is not removed in our inode table etc). While working on locking directories in dht for better consistency we ran into one such issue. The issue is basically to make rmdir and directory creation during dht-selfheal mutually exclusive. The idea is to have a blocking inodelk on inode before proceeding with rmdir or directory self-heal. However, consider following scenario: 1. (dht_)rmdir acquires a lock. 2. lookup-selfheal tries to acquire a lock, but is blocked on lock acquired by rmdir. 3. rmdir deletes directory and unlocks the lock. Its possible for inode to remain in inode table and searchable through gfid till there is a positive reference count on it. In this case lock-request (by lookup) and granted-lock (to rmdir) makes the inode to remain in inode table even after rmdir. 4. lock request issued by lookup is granted. Note that at step 4, its still possible rmdir might be in progress from dht perspective (it just completed on one node). However, this is precisely the situation we wanted to avoid i.e., we wanted to block and fail dht-selfheal instead of allowing it to proceed. In this scenario at step 4, the directory is removed on backend file-system, but its representation is still present in inode table. We tried to solve this by doing a lookup on gfid before granting a lock [1]. 
However, because of [1] 1. we no longer treat inode table as source of truth as opposed to other non-lookup code 2. performance hit in terms of a lookup on backend-filesystem for _every_ granted lock. This may not be as big considering that there is no network call involved. Can we not mark the in memory inode as having been unlinked in posix_rmdir() and use this information to determine whether a lock request can be processed? Yes. Nithya suggested the same. But this seemed like a hacky fix. Reason is: 1. Currently we don't really differentiate in inode management based on inode type. The code (dentry management, inode management) is agnostic to type. With this fix we are bringing such explicit differentiation. Note that only for directories we can have such a flag indicating that inode is removed from backend. Since, files have hardlinks it would be difficult (if not impossible, as unlink_cbk doesn't carry iatt to figure out whether current unlink is on last link). This makes the solution only applicable for directories. For lock requests on files we still need to lookup on the backend (for our use-case this is fine, since we are not locking on files). Not a show-stopper, but something in terms of aesthetics. The whole thing about locks during/after file/directory is removed seems to be not well defined as of now IMHO. a. We can acquire lock because inode exists in inode-table. b. inode-exists in inode-table because there are some locks on inode holding reference. Of course, this situation will be fixed with maintaining a flag indicating file/directory removal (or by doing a lookup). But some clarifications needed - If we are going to have some information indicating file/directory is removed, what should be the behaviour of future lock/unlock calls? should we fail them? For lock calls, we can fail them. But for unlock there are two choices: a. Let the consumer send an unlock even after remove b. Or clear out the locks during unlink/rmdir. I prefer approach a. 
here. Comments? stat() calls can be significantly expensive if the disk seek times happen to be high. It would be better if we can avoid an additional stat() for every granted lock. Regards, Vijay ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
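The flag-based proposal discussed above, with option a for unlocks (lock requests fail after removal, unlocks still succeed), can be modelled minimally as follows. All names here are illustrative, not GlusterFS's actual inode or lock structures:

```python
class Inode:
    def __init__(self, gfid):
        self.gfid = gfid
        self.removed = False   # set by rmdir; inode may outlive the dentry
        self.locks = set()

def posix_rmdir(inode):
    # Mark the in-memory inode as gone from the backend.  Held references
    # (e.g. granted locks) keep it in the inode table regardless.
    inode.removed = True

def inodelk(inode, owner):
    # New lock requests on a removed directory are refused...
    if inode.removed:
        raise OSError("ESTALE: directory already removed")
    inode.locks.add(owner)

def inodeunlk(inode, owner):
    # ...but unlocks are still honoured, so existing holders can clean up.
    inode.locks.discard(owner)
```

In the rmdir-vs-selfheal race above, the lookup's lock request at step 4 would then fail with ESTALE instead of being granted on a directory that no longer exists on disk, at the cost of a stat() per granted lock being avoided.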
Re: [Gluster-devel] requesting for review
It's been reviewed and merged. - Original Message - > From: "Hari Gowtham"> To: "Gluster Devel" > Sent: Monday, August 31, 2015 6:13:44 PM > Subject: [Gluster-devel] requesting for review > > Hi, > > Could anyone review this patch? it has passed the regression and netbsd. > > http://review.gluster.org/#/c/11906/ > > > -- > Regards, > Hari. > > ___ > Gluster-devel mailing list > Gluster-devel@gluster.org > http://www.gluster.org/mailman/listinfo/gluster-devel > ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] GlusterFS cache architecture
- Original Message - > From: "Oleksandr Natalenko"> To: gluster-devel@gluster.org > Sent: Monday, August 31, 2015 7:37:51 PM > Subject: [Gluster-devel] GlusterFS cache architecture > > Hello. > > I'm trying to investigate how GlusterFS manages cache on both server and > client side, but unfortunately cannot find any exhaustive, appropriate > and up > to date information. > > The disposition is that we have, saying, 2 GlusterFS nodes (server_a and > server_b) with replicated volume some_volume. Also we have several > clients > (saying client_1 and client_2) that mount some_volume and do some > manipulation > with files on it (lets assume some_volume contains web-related assets, > and > client_1/client_2 are web-servers). Also there is client_3 that does > web- > related deploying on some_volume (lets assume that client_3 is > web-developer). > > We would like to use multilayered cache scheme that involves filesystem > cache > (on both client/server sides) as well as web server cache. > > So, my questions are: > > 1) does caching-related items (performance.cache-size, > performance.cache-min- > file-size, performance.cache-max-file-size etc.) affect server side > only? Actually, caching is on the client side (this caching aims to beat network and disk latency to add up into our fop - file operation - latency). There is no server side caching in glusterfs as of now (except for what ever caching underlying OS/drivers provide in backend). > 2) are there any tunables that affect client side caching? Yes. Basic tunables one need to be aware of are the ones affecting cache-sizes. There are some tunables which define glusterfs behaviour for better/lesser consistency (with a possible trade-off of performance). These consistency related tunables are mostly (but not limited to) in write-behind (like strict-ordering, flush-behind etc). There are various timeouts in each xlator that can be configured to tune cache-coherency. 
"gluster volume set help" should give you a starting point. > 3) how client-side caching (we are talking about read cache only, write > cache > is not interesting to us) is performed (if it is at all)? client side read-caching is done across multiple xlators: 1. read-ahead: to boost performance during sequential reads. We read "ahead" of the application, so that data can be in our read-cache by the time application requests it. 2. io-cache: to boost performance if application "re-reads" same region of file. We cache after application has requested some data, so that subsequent accesses are served from io-cache. 3. quick-read (in conjunction with open-behind): to boost reads on small files. Quick read caches the entire file during lookup. Any further opens are "faked" by open-behind, assuming that the application is doing open solely to read the file (which is anyways cached already). If the application does a different fop, then an fd is opened and fop is performed after successful open. Quick read aims to save time spent in open, multiple reads and a release over network. 4. md-cache (or stat-prefetch): Caches metadata (like iatt - gluster equivalent of stat, user xattrs etc). 5. readdir-ahead: similar to read-ahead, but for directory entries during readdir. This helps to boost performance of readdir. > 4) how and in what cases client cache is discarded (and how that relates > to > upcall framework)? As of now read-cache is discarded based on the availability of free space in cache and timeouts (age of data in cache). Currently upcall is not used to address cache-coherency issues, but can be used in future. > > Ideally, there should be some documentation that covers general > GlusterFS > cache workflow. > > Any info would be appreciated. Thanks. 
> > -- > Oleksandr post-factum Natalenko, MSc > pf-kernel community > https://natalenko.name/ > ___ > Gluster-devel mailing list > Gluster-devel@gluster.org > http://www.gluster.org/mailman/listinfo/gluster-devel > ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] GlusterFS cache architecture
- Original Message - > From: "Raghavendra Gowdappa" <rgowd...@redhat.com> > To: "Oleksandr Natalenko" <oleksa...@natalenko.name> > Cc: gluster-devel@gluster.org > Sent: Tuesday, September 1, 2015 9:20:04 AM > Subject: Re: [Gluster-devel] GlusterFS cache architecture > > > > - Original Message - > > From: "Oleksandr Natalenko" <oleksa...@natalenko.name> > > To: gluster-devel@gluster.org > > Sent: Monday, August 31, 2015 7:37:51 PM > > Subject: [Gluster-devel] GlusterFS cache architecture > > > > Hello. > > > > I'm trying to investigate how GlusterFS manages cache on both server and > > client side, but unfortunately cannot find any exhaustive, appropriate > > and up > > to date information. > > > > The disposition is that we have, saying, 2 GlusterFS nodes (server_a and > > server_b) with replicated volume some_volume. Also we have several > > clients > > (saying client_1 and client_2) that mount some_volume and do some > > manipulation > > with files on it (lets assume some_volume contains web-related assets, > > and > > client_1/client_2 are web-servers). Also there is client_3 that does > > web- > > related deploying on some_volume (lets assume that client_3 is > > web-developer). > > > > We would like to use multilayered cache scheme that involves filesystem > > cache > > (on both client/server sides) as well as web server cache. > > > > So, my questions are: > > > > 1) does caching-related items (performance.cache-size, > > performance.cache-min- > > file-size, performance.cache-max-file-size etc.) affect server side > > only? > > Actually, caching is on the client side (this caching aims to beat network > and disk latency to add up into our fop - file operation - latency). There > is no server side caching in glusterfs as of now (except for what ever > caching underlying OS/drivers provide in backend). > > > 2) are there any tunables that affect client side caching? > > Yes. Basic tunables one need to be aware of are the ones affecting > cache-sizes. 
There are some tunables which define glusterfs behaviour for > better/lesser consistency (with a possible trade-off of performance). These > consistency related tunables are mostly (but not limited to) in write-behind > (like strict-ordering, flush-behind etc). There are various timeouts in each > xlator that can be configured to tune cache-coherency. "gluster volume set > help" should give you a starting point. If you don't find documentation anywhere, you can look into source code of each of the xlators for a global definition of array "options" which is of type "struct volume_options" :). They also carry basic few line description of what the option is supposed to do. > > > 3) how client-side caching (we are talking about read cache only, write > > cache > > is not interesting to us) is performed (if it is at all)? > > client side read-caching is done across multiple xlators: > > 1. read-ahead: to boost performance during sequential reads. We read "ahead" > of the application, so that data can be in our read-cache by the time > application requests it. > > 2. io-cache: to boost performance if application "re-reads" same region of > file. We cache after application has requested some data, so that subsequent > accesses are served from io-cache. > > 3. quick-read (in conjunction with open-behind): to boost reads on small > files. Quick read caches the entire file during lookup. Any further opens > are "faked" by open-behind, assuming that the application is doing open > solely to read the file (which is anyways cached already). If the > application does a different fop, then an fd is opened and fop is performed > after successful open. Quick read aims to save time spent in open, multiple > reads and a release over network. > > 4. md-cache (or stat-prefetch): Caches metadata (like iatt - gluster > equivalent of stat, user xattrs etc). > > 5. readdir-ahead: similar to read-ahead, but for directory entries during > readdir. This helps to boost performance of readdir. 
> > > > 4) how and in what cases client cache is discarded (and how that relates > > to > > upcall framework)? > > As of now read-cache is discarded based on the availability of free space in > cache and timeouts (age of data in cache). Currently upcall is not used to > address cache-coherency issues, but can be used in future. > > > > > Ideally, there should be some documentation that covers general > > GlusterFS > > cache workflow. > > > > Any info would be appreciated. Thanks. > > > > -- > > Oleksandr post-factum Natalenko, MSc > > pf-kernel community > > https://natalenko.name/ > > ___ > > Gluster-devel mailing list > > Gluster-devel@gluster.org > > http://www.gluster.org/mailman/listinfo/gluster-devel > > > ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] FOP ratelimit?
- Original Message - > From: "Pranith Kumar Karampuri"> To: "Emmanuel Dreyfus" , gluster-devel@gluster.org > Sent: Wednesday, September 2, 2015 2:04:32 PM > Subject: Re: [Gluster-devel] FOP ratelimit? > > > > On 09/02/2015 01:59 PM, Emmanuel Dreyfus wrote: > > Hi > > > > Yesterday I experienced the problem of a single user bringing down > > a glusterfs cluster to its knees because of a high amount of rename > > operations. > > > > I understand rename on DHT can be very costly because data really have > > to be moved from a brick to another one just for a file name change. > > Is there a workaround for this behavior? > This is not true. Data is not moved across bricks during rename. So, maybe something else is causing the issue. Were you running rebalance while these renames were being done? > > > > And more generally, do we have a way to ratelimit FOPs per client, so > > that one client cannot make the cluster unusable for the others? > Do you have profile data? > > Raghavendra G is working on some QoS-related enhancements in gluster. > Please let us know if you have any inputs here. Thanks Pranith. @Manu and others, It's helpful if you can give some pointers on what parameters (like latency, throughput etc) you want us to consider for QoS. Also, any ideas (like an interface for QoS) in this area are welcome. From my very basic search, it seems there are not many filesystems with QoS functionality. regards, Raghavendra. > > Pranith > > > > ___ > Gluster-devel mailing list > Gluster-devel@gluster.org > http://www.gluster.org/mailman/listinfo/gluster-devel > ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] FOP ratelimit?
- Original Message - > From: "Emmanuel Dreyfus" <m...@netbsd.org> > To: "Raghavendra Gowdappa" <rgowd...@redhat.com>, "Pranith Kumar Karampuri" > <pkara...@redhat.com> > Cc: gluster-devel@gluster.org > Sent: Wednesday, September 2, 2015 8:12:37 PM > Subject: Re: [Gluster-devel] FOP ratelimit? > > Raghavendra Gowdappa <rgowd...@redhat.com> wrote: > > > Its helpful if you can give some pointers on what parameters (like > > latency, throughput etc) you want us to consider for QoS. > > Full blown QoS would be nice, but a first line of defense against > resource hogs seems just badly required. > > A bare minimum could be to process client's FOP in a round robin > fashion. That way even if one client sends a lot of FOPs, there is > always some window for others to slip in. > > Any opinion? As of now we depend on epoll/poll events informing servers about incoming messages. All sockets are put in the same event-pool represented by a single poll-control fd. So, the order of our processing of msgs from various clients really depends on how epoll/poll picks events across multiple sockets. Do poll/epoll have any sort of scheduling? or is it random? Any pointers on this are appreciated. > > -- > Emmanuel Dreyfus > http://hcpnet.free.fr/pubz > m...@netbsd.org > ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
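Independent of how epoll orders events across sockets, the "first line of defense" Emmanuel asks for is usually implemented as a per-client token bucket in front of the fop-processing path. A minimal sketch of that classic technique follows; this is not an existing GlusterFS feature, and the names are illustrative:

```python
class TokenBucket:
    # Each client gets `rate` fops/second with bursts up to `burst`;
    # a client that exhausts its bucket is throttled, leaving a window
    # for other clients' fops to be processed.
    def __init__(self, rate, burst):
        self.rate, self.burst = rate, burst
        self.tokens = float(burst)
        self.last = 0.0

    def allow(self, now):
        # Refill proportionally to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # throttle: queue or delay this client's fop
```

One bucket per connected client, consulted before a request is handed to the fop path, would cap a rename-heavy client without needing epoll itself to provide any fairness guarantees.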
[Gluster-devel] [posix-compliance] unlink and access to file through open fd
All, POSIX allows access to a file through open fds even if the name associated with the file is deleted. While this works for glusterfs in most cases, there are some corner cases where we fail. 1. Reboot of brick: === With the reboot of a brick, the fd is lost. unlink would've deleted both the gfid and path links to the file and we would lose the file. As a solution, perhaps we should create a hardlink to the file (say in .glusterfs) which gets deleted only when the last fd is closed? 2. Graph switch: = The issue is captured in bz 1259995 [1]. Pasting the content from bz verbatim: Consider the following sequence of operations: 1. fd = open ("/mnt/glusterfs/file"); 2. unlink ("/mnt/glusterfs/file"); 3. Do a graph-switch, let's say by adding a new brick to the volume. 4. Migration of the fd to the new graph fails. This is because as part of migration we do a lookup and open. But the lookup fails as the file is already deleted, and hence migration fails and the fd is marked bad. In fact this test case is already present in our regression tests, though the test only checks whether the fd is marked as bad. But the expectation of filing this bug is that migration should succeed. This is possible since there is an fd opened on the brick through the old graph, which can be duped using the dup syscall. Of course, the solution outlined here doesn't cover the case where the file is not present on a brick at all. For example, a new brick was added to a replica set and that new brick doesn't contain the file. Now, since the file is deleted, how does replica heal that file to another brick, etc.? But at least this can be solved for those cases where the file was present on a brick and an fd was already opened. 3. Open-behind and unlink from a different client: == While open-behind handles unlink from the same client (through which the open was performed), if unlink and open are done from two different clients, the file is lost. I cannot think of any good solution for this.
I wanted to know whether these problems are real enough to channel our efforts to fix these issues. Comments are welcome in terms of solutions or other possible scenarios which can lead to this issue. [1] https://bugzilla.redhat.com/show_bug.cgi?id=1259995 regards, Raghavendra. ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
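For reference, the POSIX behaviour the whole thread builds on can be demonstrated against a local filesystem; the corner cases above (brick reboot, graph switch, open-behind) are exactly where glusterfs can fail to preserve this guarantee:

```python
import os
import tempfile

def read_after_unlink(data):
    # POSIX: an open fd keeps the file's data reachable even after the
    # last name is unlinked; the inode dies only when the last fd closes.
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, data)
        os.unlink(path)                  # name gone, inode still alive
        assert not os.path.exists(path)
        os.lseek(fd, 0, os.SEEK_SET)
        return os.read(fd, len(data))    # still readable through the fd
    finally:
        os.close(fd)
```

The .glusterfs hardlink idea in case 1 is essentially a way to give the brick an on-disk analogue of this in-kernel reference count.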
Re: [Gluster-devel] Handling Failed flushes in write-behind
+ gluster-devel > > On Tuesday 29 September 2015 04:45 PM, Raghavendra Gowdappa wrote: > > Hi All, > > > > Currently, on failure to flush the writeback cache, we mark the fd bad. > > The rationale behind this is that since the application doesn't know which > > of the cached writes failed, the fd is in a bad state and cannot > > possibly do a meaningful/correct read. However, this approach (though > > posix-compliant) is not acceptable for long-standing applications like > > QEMU [1]. So, a two-part solution was decided: > > > > 1. No longer mark the fd bad during failures while flushing data to the backend > > from the write-behind cache. > > 2. Retry the writes. > > > > As far as 2 goes, the application can checkpoint by doing fsync and, on write > > failures, roll back to the last checkpoint and replay writes from that > > checkpoint. Or, glusterfs can retry the writes on behalf of the > > application. However, glusterfs retrying writes cannot be a complete > > solution, as the error condition we've run into might never get resolved > > (e.g., running out of space). So, glusterfs has to give up after some > > time. > > > > It would be helpful if you give your inputs on how other writeback systems > > (e.g., kernel page-cache, nfs, samba, ceph, lustre etc.) behave in this > > scenario and what would be a sane policy for glusterfs. > > > > [1] https://bugzilla.redhat.com/show_bug.cgi?id=1200862 > > > > regards, > > Raghavendra > > > > ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
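The application-side half of option 2 above, checkpoint with fsync and replay on failure, could look roughly like this. It is a sketch of the contract only, not QEMU's or any real application's code:

```python
import os

def write_with_replay(fd, chunks, max_attempts=3):
    # fsync() is the checkpoint: POSIX only guarantees durability up to a
    # successful fsync.  On an fsync failure, seek back to the checkpoint
    # offset and replay every write issued since it, giving up after a
    # few attempts (the error, e.g. ENOSPC, may never clear).
    checkpoint = os.lseek(fd, 0, os.SEEK_CUR)
    for chunk in chunks:
        os.write(fd, chunk)
    for _ in range(max_attempts):
        try:
            os.fsync(fd)
            return True                      # new checkpoint established
        except OSError:
            os.lseek(fd, checkpoint, os.SEEK_SET)
            for chunk in chunks:             # replay since last checkpoint
                os.write(fd, chunk)
    return False
```

Note this only works if the filesystem stops marking the fd bad on flush failure, which is exactly part 1 of the proposed solution.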
Re: [Gluster-devel] Handling Failed flushes in write-behind
- Original Message - > From: "Prashanth Pai" <p...@redhat.com> > To: "Raghavendra Gowdappa" <rgowd...@redhat.com> > Cc: "Gluster Devel" <gluster-devel@gluster.org>, "Thiago da Silva" > <thi...@redhat.com> > Sent: Wednesday, September 30, 2015 11:38:38 AM > Subject: Re: [Gluster-devel] Handling Failed flushes in write-behind > > > > > As for as 2, goes, application can checkpoint by doing fsync and on > > > > write > > > > failures, roll-back to last checkpoint and replay writes from that > > > > checkpoint. Or, glusterfs can retry the writes on behalf of the > > > > application. However, glusterfs retrying writes cannot be a complete > > > > solution as the error-condition we've run into might never get resolved > > > > (For eg., running out of space). So, glusterfs has to give up after > > > > some > > > > time. > > The application should not be expected to replay writes. glusterfs must be > retrying the failed write. Well, failed writes can fail due to two categories of errors: 1. The error condition can be transient or file-system can do something to alleviate the error. 2. The error condition can be permanent or file-system has no control over how to recover from the failure condition. For eg., Network failure. The best a file-system can do in scenario 1 is: 1. try to do things to alleviate the error. 2. retry the writes For eg., ext4 on seeing a writeback failure with ENOSPC, tries to free some space by freeing some extents (again extents are managed by filesystem) and retries. Again this retry is only once after failure. After that page is marked with error. As far as failure scenarios 2, there is no point in retrying and it is difficult to have a well defined policy on how long we can keep retrying. The purpose of this mail is to identify errors that fall into scenario 1 above and have a recovery policy. I am afraid, glusterfs cannot do much in scenario 2. If you've ideas that can help for scenario 2, I am open to incorporate them. 
I took a quick look at how various filesystems handle writeback failures (this is not extensive research and hence there might be some incorrectness): 1. FUSE: == FUSE implemented write-back from kernel version 3.15. In its current version, it doesn't replay the writes at all on writeback failure. 2. xfs: xfs seems to have an intelligent failure-handling mechanism on writeback failure. It marks the pages as dirty again after a writeback failure for some errors. For other errors, it doesn't retry. I couldn't look into the details of which errors are retried and which are not. 3. ext4: = Only ENOSPC errors are retried. That too, only once. Also, please note that to the best of my knowledge, POSIX only guarantees writes that are checkpointed by fsync to have been persisted. Given the above constraints, I am curious to know how applications handle similar issues on other filesystems. > In gluster-swift, we had hit into a case where the application would get EIO > but the write had actually failed because of ENOSPC. From the Linux kernel source tree:

static inline void mapping_set_error(struct address_space *mapping, int error)
{
    if (unlikely(error)) {
        if (error == -ENOSPC)
            set_bit(AS_ENOSPC, &mapping->flags);
        else
            set_bit(AS_EIO, &mapping->flags);
    }
}

Seems like only ENOSPC is stored. The rest of the errors are transformed into EIO. Again, we are ready to comply with whatever is the standard practice.
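The ext4 policy summarised above (a single retry on ENOSPC, immediate failure for everything else) translates to a small wrapper; a sketch, with `write_fn` standing in for the actual page-writeback path:

```python
import errno

def writeback(write_fn, page):
    # Mirror the ext4 behaviour described above: one retry for ENOSPC
    # (space may have been freed in the meantime), no retry otherwise.
    try:
        return write_fn(page)
    except OSError as e:
        if e.errno == errno.ENOSPC:
            return write_fn(page)    # single retry, then the error propagates
        raise
```

A comparable write-behind policy in glusterfs would bound retries the same way, so a permanent error condition cannot keep the cache replaying forever.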
> https://bugzilla.redhat.com/show_bug.cgi?id=986812 > > Regards, > -Prashanth Pai > > - Original Message - > > From: "Raghavendra Gowdappa" <rgowd...@redhat.com> > > To: "Vijay Bellur" <vbel...@redhat.com> > > Cc: "Gluster Devel" <gluster-devel@gluster.org>, "Ben Turner" > > <btur...@redhat.com>, "Ira Cooper" <icoo...@redhat.com> > > Sent: Tuesday, September 29, 2015 4:56:33 PM > > Subject: Re: [Gluster-devel] Handling Failed flushes in write-behind > > > > + gluster-devel > > > > > > > > On Tuesday 29 September 2015 04:45 PM, Raghavendra Gowdappa wrote: > > > > Hi All, > > > > > > > > Currently on failure of flushing of writeback cache, we mark the fd > > > > bad. > > > > The rationale behind this is that since the application doesn't know > > > > which > > > > of the writes that are cached failed, fd is in a bad state and cannot > > > > possibly do a meaningful/correct read. However, this approach (though > > > > posix-complaint) is
Re: [Gluster-devel] Handling Failed flushes in write-behind
+kevin - Original Message - > From: "Raghavendra Gowdappa" <rgowd...@redhat.com> > To: "Prashanth Pai" <p...@redhat.com> > Cc: "Gluster Devel" <gluster-devel@gluster.org>, "Thiago da Silva" > <thi...@redhat.com> > Sent: Monday, October 5, 2015 11:37:00 AM > Subject: Re: [Gluster-devel] Handling Failed flushes in write-behind > > > > - Original Message - > > From: "Prashanth Pai" <p...@redhat.com> > > To: "Raghavendra Gowdappa" <rgowd...@redhat.com> > > Cc: "Gluster Devel" <gluster-devel@gluster.org>, "Thiago da Silva" > > <thi...@redhat.com> > > Sent: Wednesday, September 30, 2015 11:38:38 AM > > Subject: Re: [Gluster-devel] Handling Failed flushes in write-behind > > > > > > > As far as 2 goes, the application can checkpoint by doing fsync and on > > > > > write > > > > > failures, roll back to the last checkpoint and replay writes from that > > > > > checkpoint. Or, glusterfs can retry the writes on behalf of the > > > > > application. However, glusterfs retrying writes cannot be a complete > > > > > solution as the error-condition we've run into might never get > > > > > resolved > > > > > (For eg., running out of space). So, glusterfs has to give up after > > > > > some > > > > > time. > > > > The application should not be expected to replay writes. glusterfs must be > > retrying the failed write. > > Well, writes can fail due to two categories of errors: > > 1. The error condition can be transient or the file-system can do something to > alleviate the error. > 2. The error condition can be permanent or the file-system has no control over > how to recover from the failure condition. For eg., Network failure. > > The best a file-system can do in scenario 1 is: > 1. try to do things to alleviate the error. > 2. retry the writes > > For eg., ext4 on seeing a writeback failure with ENOSPC, tries to free some > space by freeing some extents (again extents are managed by the filesystem) and > retries. Again this retry is only once after failure. 
After that page is > marked with error. > > As far as failure scenario 2 goes, there is no point in retrying and it is > difficult to have a well defined policy on how long we can keep retrying. > The purpose of this mail is to identify errors that fall into scenario 1 > above and have a recovery policy. I am afraid, glusterfs cannot do much in > scenario 2. If you've ideas that can help for scenario 2, I am open to > incorporate them. > > I did a quick look at how various filesystems handle writeback failures (this > is not extensive research and hence there might be some incorrectness): > > 1. FUSE: >== > FUSE implemented write-back from kernel version 3.15. In its current > version, it doesn't replay the writes at all on writeback failure. > > 2. xfs: > > xfs seems to have an intelligent failure-handling mechanism on writeback > failure. It marks the pages as dirty again after writeback failure for some > errors. For other errors, it doesn't retry. I couldn't look into details of > which errors are retried and which are not. > > 3. ext4: >= > Only ENOSPC errors are retried. That too, only once. > > Also, please note that to the best of my knowledge, POSIX only guarantees > writes that are checkpointed by fsync to have been persisted. Given the > above constraints I am curious to know how the applications handle similar > issues on other filesystems. > > > In gluster-swift, we had hit into a case where the application would get > > EIO > > but the write had actually failed because of ENOSPC. > > From the linux kernel source tree, > > static inline void mapping_set_error(struct address_space *mapping, int > error) > { > if (unlikely(error)) { > if (error == -ENOSPC) > set_bit(AS_ENOSPC, &mapping->flags); > else > set_bit(AS_EIO, &mapping->flags); > } > } > > Seems like only ENOSPC is stored. The rest of the errors are transformed into > EIO. Again, we are ready to comply with whatever is the standard practice. 
> > > https://bugzilla.redhat.com/show_bug.cgi?id=986812 > > > > Regards, > > -Prashanth Pai > > > > - Original Message - > > > From: "Raghavendra Gowdappa" <rgowd...@redhat.com> > > > To: "Vijay Bellur" <vbel...@redhat.com> > > > Cc: "Gluster Devel&qu
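The checkpoint-and-replay approach suggested in this thread (fsync as the checkpoint, with the application rolling back and replaying its writes on a flush failure) can be sketched in C. This is illustrative only, not glusterfs code; the names (`ckpt_writer`, `ckpt_write`, `ckpt_commit`) and the fixed batch/chunk sizes are invented for the example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BATCH_MAX 16
#define CHUNK_MAX 512

/* Replay log of writes issued since the last successful fsync(). */
struct ckpt_writer {
    int fd;
    int nbatch;
    struct { off_t off; size_t len; char buf[CHUNK_MAX]; } batch[BATCH_MAX];
};

/* Record the write for possible replay, then issue it. */
static int ckpt_write(struct ckpt_writer *w, off_t off, const void *buf,
                      size_t len)
{
    if (w->nbatch == BATCH_MAX || len > CHUNK_MAX)
        return -1;
    w->batch[w->nbatch].off = off;
    w->batch[w->nbatch].len = len;
    memcpy(w->batch[w->nbatch].buf, buf, len);
    w->nbatch++;
    return pwrite(w->fd, buf, len, off) == (ssize_t)len ? 0 : -1;
}

/* Checkpoint. If the flush fails, roll back to the last checkpoint by
 * replaying the batch, giving up after max_retries attempts (much like
 * ext4 gives up after one retry on ENOSPC). */
static int ckpt_commit(struct ckpt_writer *w, int max_retries)
{
    for (int tries = 0; tries <= max_retries; tries++) {
        if (fsync(w->fd) == 0) {
            w->nbatch = 0;   /* checkpoint reached, drop the replay log */
            return 0;
        }
        for (int i = 0; i < w->nbatch; i++)
            if (pwrite(w->fd, w->batch[i].buf, w->batch[i].len,
                       w->batch[i].off) != (ssize_t)w->batch[i].len)
                return -1;
    }
    return -1;
}
```

The point of the sketch is that the replay log lives in the application, so it survives a failed flush; glusterfs marking the fd bad instead forces exactly this bookkeeping onto the caller.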
Re: [Gluster-devel] compound fop design first cut
> > On 12/08/2015 09:02 AM, Pranith Kumar Karampuri wrote: > > > > > > On 12/08/2015 02:53 AM, Shyam wrote: > >> Hi, > >> > >> Why not think along the lines of new FOPs like fop_compound(_cbk) > >> where, the inargs to this FOP is a list of FOPs to execute (either in > >> order or any order)? > > That is the intent. The question is how do we specify the fops that we > > want to do and the arguments to the fop. In this approach, for example > > xl_fxattrop_writev() is a new FOP. List of fops that need to be done > > are fxattrop, writev in that order and the arguments are a union of > > the arguments needed to perform the fops fxattrop, writev. The reason > > why this fop is not implemented throughout the graph is to not change > > most of the stack on the brick side in the first cut of the > > implementation. i.e. quota/barrier/geo-rep/io-threads > > priorities/bit-rot may have to implement these new compound fops. We > > still get the benefit of avoiding the network round trips. > >> > >> With a scheme like the above we could, > >> - compound any set of FOPs (of course, we need to take care here, > >> but still the feasibility exists) > > It still exists but the fop space will be blown up for each of the > > combinations. > >> - Each xlator can inspect the compound relation and choose to > >> uncompound them. So if an xlator cannot perform FOPA+B as a single > >> compound FOP, it can choose to send FOPA and then FOPB and chain up > >> the responses back to the compound request sent to it. Also, the > >> intention here would be to leverage existing FOP code in any xlator, > >> to appropriately modify the inargs > >> - The RPC payload is constructed based on existing FOP RPC > >> definitions, but compounded based on the compound FOP RPC definition > > This will be done in phase-3 after learning a bit more about how best > > to implement it to prevent stuffing arguments in xdata in future as > > much as possible. 
After which we can choose to retire > > compound-fop-sender and receiver xlators. > >> > >> Possibly on the brick graph as well, pass these down as compounded > >> FOPs, till someone decides to break it open and do it in phases > >> (ultimately POSIX xlator). > > This will be done in phase-2. At the moment we are not giving any > > choice for the xlators on the brick side. > >> > >> The intention would be to break a compound FOP in case an xlator in > >> between cannot support it or, even expand a compound FOP request, say > >> the fxattropAndWrite is an AFR compounding decision, but a compound > >> request to AFR maybe WriteandClose, hence AFR needs to extend this > >> compound request. > > Yes. There was a discussion with krutika where if shard wants to do > > write then xattrop in a single fop, then we need dht to implement > > dht_writev_fxattrop which should look somewhat similar to > > dht_writev(), and afr will need to implement afr_writev_fxattrop() as > > a full-blown transaction where it needs to take data+metadata domain > > locks then do data+metadata pre-op then wind to > > compound_fop_sender_writev_fxattrop() and then data+metadata post-op > > then unlocks. > > > > If we were to do writev, fxattrop separately, fops will be (In > > unoptimized case): > > 1) finodelk for write > > 2) fxattrop for preop of write. > > 3) write > > 4) fxattrop for post op of write > > 5) unlock for write > > 6) finodelk for fxattrop > > 7) fxattrop for preop of shard-fxattrop > > 8) shard-fxattrop > > 9) fxattrop for post op of shard fxattrop > > 10) unlock for fxattrop > > > > If AFR chooses to implement writev_fxattrop: means data+metadata > > transaction. > > 1) finodelk in data, metadata domain simultaneously (just like we take > > multiple locks in rename) > > 2) preop for data, metadata parts as part of the compound fop > > 3) writev+fxattrop > > 4) postop for data, metadata parts as part of the compound fop > > 5) unlocks simultaneously. 
> > > > So it is still 2x reduction of the number of network fops except for > > may be locking. > >> > >> The above is just an off-the-cuff thought on the same. > > We need to arrive at a consensus about how to specify the list of fops > > and their arguments. The reason why I went against list_of_fops is to > > make discovery of possible optimizations we can do easier per > > compound fop (Inspired by ec's implementation of multiplications by > > all possible elements in the Galois field, where multiplication with > > a different number has a different optimization). Could you elaborate > > more about the idea you have about list_of_fops and its arguments? Maybe > > we can come up with combinations of fops where we can employ this > > technique of just list_of_fops and wind. I think the rest of the solutions > > you mentioned are where it will converge over time. The intention > > is to avoid network round trips without waiting for the whole stack to > > change as much as possible. > Maybe I am overthinking it. Not a lot of combinations could be > transactions. In any
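The two encodings being compared, named compound fops such as `xl_fxattrop_writev()` versus a generic list of fops, can be contrasted with a sketch of the latter. None of these types exist in glusterfs; they are invented here to illustrate the "list_of_fops" idea, where a compound request carries an ordered array of tagged argument unions and any xlator can walk the list and choose to uncompound it.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

/* Tags for the individual operations a compound request can carry. */
enum cfop_type { CFOP_FXATTROP, CFOP_WRITEV };

struct cfop {
    enum cfop_type type;
    union {
        struct { int flags; /* xattr dict elided */ } fxattrop;
        struct { off_t off; size_t len; const void *buf; } writev;
    } args;
};

/* An ordered list of fops; an xlator that cannot handle the compound can
 * iterate and wind each element as a plain fop ("uncompounding"). */
struct cfop_compound {
    int nops;
    struct cfop *ops;
};

/* Example of an xlator inspecting the compound relation. */
static int cfop_count(const struct cfop_compound *c, enum cfop_type t)
{
    int n = 0;
    for (int i = 0; i < c->nops; i++)
        if (c->ops[i].type == t)
            n++;
    return n;
}
```

The trade-off discussed above shows up directly: a named fop fixes the combination at compile time (easy to optimize per combination), while the list encoding keeps the fop space small but pushes per-combination optimization discovery to runtime inspection.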
Re: [Gluster-devel] intermittent test failure: tests/bugs/tier/bug-1279376-rename-demoted-file.t
- Original Message - > From: "Michael Adam"> To: gluster-devel@gluster.org > Sent: Wednesday, December 9, 2015 1:46:32 PM > Subject: [Gluster-devel] intermittent test failure: > tests/bugs/tier/bug-1279376-rename-demoted-file.t > > Hi, > > found another one. See > > https://build.gluster.org/job/rackspace-regression-2GB-triggered/16603/consoleFull > > Run by http://review.gluster.org/#/c/12830/ > which should not change any test result. A bug has been filed at: https://bugzilla.redhat.com/show_bug.cgi?id=1289845 > > Michael > > ___ > Gluster-devel mailing list > Gluster-devel@gluster.org > http://www.gluster.org/mailman/listinfo/gluster-devel ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
[Gluster-devel] libgfapi compound operations - multiple writes
Forking off since it muddles the original conversation. I have some questions: 1. Why do multiple writes need to be compounded together? 2. If the reason is aggregation, can't we tune write-behind to do the same? regards, Raghavendra. ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
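For context on question 2: aggregation in write-behind amounts to coalescing adjacent cached writes into a single request before winding it over the network. A minimal sketch of that merge step (the names and the fixed buffer size are invented for illustration; this is not the actual write-behind code):

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>

#define WB_BUF_MAX 128

/* One write sitting in the write-behind cache. */
struct cached_write {
    off_t off;
    size_t len;
    char buf[WB_BUF_MAX];
};

/* Merge b into a when b starts exactly where a ends, so that one larger
 * request is wound instead of two. Returns 0 on merge, -1 otherwise. */
static int wb_coalesce(struct cached_write *a, const struct cached_write *b)
{
    if (a->off + (off_t)a->len != b->off || a->len + b->len > WB_BUF_MAX)
        return -1;
    memcpy(a->buf + a->len, b->buf, b->len);
    a->len += b->len;
    return 0;
}
```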
Re: [Gluster-devel] intermittent test failure: tests/basic/tier/record-metadata-heat.t ?
> Looks like the run failed due to: > > /tests/bugs/fuse/bug-924726.t (Wstat: 0 Tests: 20 Failed: 1) >Failed test: 20 > > Raghavendra - this test has been reported previously too as affecting > other regression runs. Can you please take a look in as you are the > original author of this test unit? I tried reproducing the problem in my > local setup a few times but that does not seem to happen easily. A fix has been sent to: http://review.gluster.org/12906 > > Thanks, > Vijay > > ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] libgfapi compound operations - multiple writes
- Original Message - > From: "Jeff Darcy" <jda...@redhat.com> > To: "Raghavendra Gowdappa" <rgowd...@redhat.com>, "Poornima Gurusiddaiah" > <pguru...@redhat.com> > Cc: "Gluster Devel" <gluster-devel@gluster.org> > Sent: Wednesday, December 9, 2015 10:36:43 PM > Subject: Re: [Gluster-devel] libgfapi compound operations - multiple writes > > > > > On December 9, 2015 at 10:31:03 AM, Raghavendra Gowdappa > (rgowd...@redhat.com) wrote: > > forking off since it muddles the original conversation. I've some > > questions: > > > > 1. Why do multiple writes need to be compounded together? > > 2. If the reason is aggregation, cant we tune write-behind to do the same? > > I think compounding (as we’ve been discussing it) is only necessary when > there’s a dependency between operations. For example, if the first > creates a value (e.g. file descriptor) used by the second, or if the > second should not proceed unless the first (e.g. a lock) succeeded. If > multiple operations are completely independent of one another, as is the > case for writes without fsync, then I think we should rely on > write-behind or something similar instead. Compounding is likely to be > the wrong solution here for two reasons: > > * Correctness: if the writes are independent, there’s no reason why > failure of the first should cause the second not to be issued (as > would be the case with compounding). > > * Performance: compounding would keep the writes separate, whereas > write-behind can reduce overhead even more by coalescing them into a > single request. Yes. I had similar thoughts while asking the question. Thanks for elaborating. > > There is, however, one case where compounding would be the right answer: > when there really is a dependency between the writes. There’s no way to > specify this through the POSIX/VFS interface (more’s the pity), but it’s > easy to imagine GFAPI or internal use cases where a second write should > not overtake or continue without the first - e.g. 
a key/value store > that writes new data followed by an index update pointing to that data. > The strictly-sequential behavior of a compound operation might be just > the right match for such cases. We have one such use-case already, i.e., O_APPEND writes. In fact write-behind has enough logic to address dependencies like conflicting writes, read, stat etc. on just-written regions (Of course, we would lose performance gains as write-behind still winds calls across the network for dependent ops. But again, if the write-behind cache is sufficient, this latency is not witnessed by the application). So, I am wondering whether we can pass these dependency requirements down the stack and let write-behind handle them. @Poornima and others, Did you have any such use-cases in mind when you proposed compounding? regards, Raghavendra ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
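The dependency tracking mentioned above, where write-behind holds back a request that touches a just-written region, reduces to a byte-range overlap test. A sketch (illustrative, not the actual write-behind code): a new request must wait for a cached write to be flushed if their ranges overlap.

```c
#include <assert.h>
#include <sys/types.h>

/* A request depends on a cached write if their byte ranges overlap;
 * write-behind must flush the cached write before winding the request.
 * Half-open ranges: [off, off + len). */
static int wb_ranges_overlap(off_t off1, size_t len1, off_t off2, size_t len2)
{
    return off1 < off2 + (off_t)len2 && off2 < off1 + (off_t)len1;
}
```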
Re: [Gluster-devel] Help needed in understanding GlusterFS logs and debugging elasticsearch failures
- Original Message - > From: "Sachidananda URS"> To: "Gluster Devel" > Sent: Friday, December 11, 2015 8:56:04 PM > Subject: [Gluster-devel] Help needed in understanding GlusterFS logs and > debugging elasticsearch failures > > Hi, > > I was trying to use GlusterFS as a backend filesystem for storing the > elasticsearch indices on GlusterFS mount. > > The filesystem operations as far as I can understand is, lucene engine > does a lot of renames on the index files. And multiple threads read > from the same file concurrently. > > While writing index, elasticsearch/lucene complains of index corruption and > the > health of the cluster goes to red, and all the operations on the index fail > hereafter. > > === > > [2015-12-10 02:43:45,614][WARN ][index.engine ] [client-2] > [logstash-2015.12.09][3] failed engine [merge failed] > org.apache.lucene.index.MergePolicy$MergeException: > org.apache.lucene.index.CorruptIndexException: checksum failed (hardware > problem?) : expected=0 actual=6d811d06 > (resource=BufferedChecksumIndexInput(NIOFSIndexInput(path="/mnt/gluster2/rhs/nodes/0/indices/logstash-2015.12.09/3/index/_a7.cfs") > [slice=_a7_Lucene50_0.doc])) > at > > org.elasticsearch.index.engine.InternalEngine$EngineMergeScheduler$1.doRun(InternalEngine.java:1233) > at > > org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) > at > > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed Maybe read returned different data than expected? The logs don't indicate anything suspicious. > (hardware problem?) 
: expected=0 actual=6d811d06 > (resource=BufferedChecksumIndexInput(NIOFSIndexInput(path="/mnt/gluster2/rhs/nodes/0/indices/logstash-2015.12.09/3/index/_a7.cfs") > [slice=_a7_Lucene50_0.doc])) > > = > > > Server logs does not have anything. The client logs is full of messages like: > > > > [2015-12-03 18:44:17.882032] I [MSGID: 109066] [dht-rename.c:1410:dht_rename] > 0-esearch-dht: renaming > /rhs/nodes/0/indices/logstash-2015.12.03/1/translog/translog-61881676454442626.tlog > (hash=esearch-replicate-0/cache=esearch-replicate-0) => > /rhs/nodes/0/indices/logstash-2015.12.03/1/translog/translog-311.ckp > (hash=esearch-replicate-1/cache=) > [2015-12-03 18:45:31.276316] I [MSGID: 109066] [dht-rename.c:1410:dht_rename] > 0-esearch-dht: renaming > /rhs/nodes/0/indices/logstash-2015.12.03/1/translog/translog-2384654015514619399.tlog > (hash=esearch-replicate-0/cache=esearch-replicate-0) => > /rhs/nodes/0/indices/logstash-2015.12.03/1/translog/translog-312.ckp > (hash=esearch-replicate-0/cache=) > [2015-12-03 18:45:31.587660] I [MSGID: 109066] [dht-rename.c:1410:dht_rename] > 0-esearch-dht: renaming > /rhs/nodes/0/indices/logstash-2015.12.03/4/translog/translog-4957943728738197940.tlog > (hash=esearch-replicate-0/cache=esearch-replicate-0) => > /rhs/nodes/0/indices/logstash-2015.12.03/4/translog/translog-312.ckp > (hash=esearch-replicate-0/cache=) > [2015-12-03 18:46:48.424605] I [MSGID: 109066] [dht-rename.c:1410:dht_rename] > 0-esearch-dht: renaming > /rhs/nodes/0/indices/logstash-2015.12.03/1/translog/translog-1731620600607498012.tlog > (hash=esearch-replicate-1/cache=esearch-replicate-1) => > /rhs/nodes/0/indices/logstash-2015.12.03/1/translog/translog-313.ckp > (hash=esearch-replicate-1/cache=) > [2015-12-03 18:46:48.466558] I [MSGID: 109066] [dht-rename.c:1410:dht_rename] > 0-esearch-dht: renaming > /rhs/nodes/0/indices/logstash-2015.12.03/4/translog/translog-5214949393126318982.tlog > (hash=esearch-replicate-1/cache=esearch-replicate-1) => > 
/rhs/nodes/0/indices/logstash-2015.12.03/4/translog/translog-313.ckp > (hash=esearch-replicate-1/cache=) > [2015-12-03 18:48:06.314138] I [MSGID: 109066] [dht-rename.c:1410:dht_rename] > 0-esearch-dht: renaming > /rhs/nodes/0/indices/logstash-2015.12.03/4/translog/translog-9110755229226773921.tlog > (hash=esearch-replicate-0/cache=esearch-replicate-0) => > /rhs/nodes/0/indices/logstash-2015.12.03/4/translog/translog-314.ckp > (hash=esearch-replicate-1/cache=) > [2015-12-03 18:48:06.332919] I [MSGID: 109066] [dht-rename.c:1410:dht_rename] > 0-esearch-dht: renaming > /rhs/nodes/0/indices/logstash-2015.12.03/1/translog/translog-5193443717817038271.tlog > (hash=esearch-replicate-1/cache=esearch-replicate-1) => > /rhs/nodes/0/indices/logstash-2015.12.03/1/translog/translog-314.ckp > (hash=esearch-replicate-1/cache=) > [2015-12-03 18:49:24.694263] I [MSGID: 109066] [dht-rename.c:1410:dht_rename] > 0-esearch-dht: renaming > /rhs/nodes/0/indices/logstash-2015.12.03/1/translog/translog-2750483795035758522.tlog >
Re: [Gluster-devel] Is there any advantage or disadvantage to multiple cpu cores?
- Original Message - > From: "Joe Julian"> To: "Gluster Devel" > Sent: Monday, December 14, 2015 2:40:14 PM > Subject: [Gluster-devel] Is there any advantage or disadvantage to multiple > cpu cores? > > Does the code take advantage of multiple cpu cores? On the client: * We have multiple threads to receive replies from bricks in parallel (multithreaded epoll). * The thread that reads from /dev/fuse doesn't generally process replies, so request and reply processing can happen in parallel. On bricks: * io-threads enables parallelism, so all requests/replies are processed in parallel. So, we have multiple threads that can execute on multiple cores simultaneously. However, we don't really assign threads to cores. > If I assigned a single core to gluster, would it have an effect on > performance? A long time back there was a proposal to make sure a request gets assigned to a thread executing on the same core on which the application issued the syscall. The idea was to minimize process context switches on a core and thereby preserve the relevance of the cpu cache. But nothing concrete happened towards that goal. > If yes, explain so I can determine a sane number of cores to allocate > per server. > > ___ > Gluster-devel mailing list > Gluster-devel@gluster.org > http://www.gluster.org/mailman/listinfo/gluster-devel > ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
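For reference, the old proposal of tying request-processing threads to cores would have relied on CPU affinity. A minimal Linux sketch of pinning the calling thread to one core (again, glusterfs does not actually do this; `pin_self_to_core` is an invented name):

```c
#define _GNU_SOURCE
#include <assert.h>
#include <pthread.h>
#include <sched.h>

/* Pin the calling thread to a single core. Returns 0 on success,
 * or an errno value from pthread_setaffinity_np(). */
static int pin_self_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}
```

Pinning trades scheduler flexibility for cache locality, which is exactly the balance the proposal above was weighing.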
Re: [Gluster-devel] Design for lookup-optimize made default
> > > > > > Sakshi, > > > > > > In the doc. there is reference to the fact that when a client fixes a > > > layout it assigns the same dircommit hash to the layout which is > > > equivalent to the vol commit hash. I think this assumption is incorrect, > > > when a client heals the layout, the commit hash is set to 1 > > > (DHT_LAYOUT_HASH_INVALID) [1]. > > > > Yes. You are correct. Thats an oversight on my part. Sorry about it :). > > > > > > > > What the above basically means is that when anyone other than rebalance > > > changes the layout of an existing directory, it's commit-hash will start > > > disagreeing with the volume commit hash. So that part is already handled > > > (unless I am missing something, which case it is a bug and we need it > > > fixed). > > > > > > The other part of the self-heal, I would consider *not* needed. If a > > > client heals a layout, it is because a previous layout creator > > > (rebalance or mkdir) was incomplete, and hence the client needs to set > > > the layout. If this was by rebalance, the rebalance process would have > > > failed and hence would need to be rerun. For abnormal failures on > > > directory creations, I think the proposed solution is heavy weight, as > > > lookup-optimize is an *optimization* and so it can devolve into > > > non-optimized modes in such cases. IOW, I am stating we do not need to > > > do this self healing. > > > > If this happens to a large number of directories, then performance hit can > > be > > large (and its not an optimization in the sense that hashing should've > > helped us to conclusively say when a file is absent and its basic design of > > dht, which we had strayed away because of bugs). However, the question as > > you pointed out is, can it happen often enough? As of now, healing can be > > triggered because of following reasons: > > > > 1. As of now, no synchronization between rename (src, dst) and healing. > > There > > are two cases here: > >a. 
healing of src by a racing lookup on src. This falls in the class of > >bugs similar to lookup-heal creating directories deleted by a racing > >rmdir and hence will be fixed when we fix that class of bugs (the > >solution to which is ready, implementation is pending). > >b. Healing of destination (as layout of src breaks the continuum of dst > >layout). But again this is not a problem rename overwrites dst only if > >its an empty directory and no children need to be healed for empty > >directory. > > > > 2. Race b/w fix layout from a rebalance process and lookup-heal from a > > client. > >We don't have synchronization b/w these two as of now and *might* end up > >with too many directories with DHT_LAYOUT_HASH_INVALID set resulting in > >poor performance. > > > > 3. Any failures in layout setting (because of node going down after we > > choose > > to heal layout, setxattr failures etc). > > > > Given the above considerations, I conservatively chose to heal children of > > directories. I am not sure whether these considerations are just > > theoretical > > or something realistic that can be hit in field. With the above details, do > > you still think healing from selfheal daemon is not worth the effort? > > And the other thing to note that, once a directory ends up with > DHT_LAYOUT_HASH_INVALID (in non add/remove-brick) scenario, its stays in > that state till there is a fix layout is run or for the entire lifetime of > the directory. Another case where we might end up with DHT_LAYOUT_HASH_INVALID for a directory is a race b/w lookup on a directory and mkdir of the same name. In this race if lookup wins the race and sets the layout, we'll have invalid_hash set on the layout. I was worried about these unknowns and I tried to solve this by having a fall-back option in terms of heal by self-heal daemon in case if we end up with invalid-hash. It seemed easier 1. to identify scenarios where we might heal and add directory to index. 2. 
poll the index, heal the children of the entries found, and remove each entry from the index. This fall-back option, I think, helps us recover instead of assuming that not many use cases lead to an invalid hash on a directory. > > > > > > > > > I think we still need to handle stale layouts and the lookup (and other > > > problems). > > > > Yes, the more we avoid spurious heals, the less we need healing from > > self-heal daemon. In fact we need healing from self-heal daemon only for > > those directories self-heal was triggered spuriously. > > > > > > > > [1] > > > https://github.com/gluster/glusterfs/blob/master/xlators/cluster/dht/src/dht-selfheal.c#L1685 > > > > > > On 12/11/2015 06:08 AM, Sakshi Bansal wrote: > > > > The above link may not be accessible to all. In that case please refer > > > > to > > > > this: > > > > https://public.pad.fsfe.org/p/dht_lookup_optimize > > > > ___ > > > > Gluster-devel mailing list > > > > Gluster-devel@gluster.org
Re: [Gluster-devel] Design for lookup-optimize made default
- Original Message - > From: "Shyam"> To: "Sakshi Bansal" , "Gluster Devel" > > Sent: Monday, December 14, 2015 10:40:09 PM > Subject: Re: [Gluster-devel] Design for lookup-optimize made default > > Sakshi, > > In the doc. there is reference to the fact that when a client fixes a > layout it assigns the same dircommit hash to the layout which is > equivalent to the vol commit hash. I think this assumption is incorrect, > when a client heals the layout, the commit hash is set to 1 > (DHT_LAYOUT_HASH_INVALID) [1]. Yes. You are correct. That's an oversight on my part. Sorry about it :). > > What the above basically means is that when anyone other than rebalance > changes the layout of an existing directory, its commit-hash will start > disagreeing with the volume commit hash. So that part is already handled > (unless I am missing something, in which case it is a bug and we need it > fixed). > > The other part of the self-heal, I would consider *not* needed. If a > client heals a layout, it is because a previous layout creator > (rebalance or mkdir) was incomplete, and hence the client needs to set > the layout. If this was by rebalance, the rebalance process would have > failed and hence would need to be rerun. For abnormal failures on > directory creations, I think the proposed solution is heavy weight, as > lookup-optimize is an *optimization* and so it can devolve into > non-optimized modes in such cases. IOW, I am stating we do not need to > do this self healing. If this happens to a large number of directories, then the performance hit can be large (and it's not just an optimization, in the sense that hashing should've helped us conclusively say when a file is absent; it's the basic design of dht, from which we had strayed because of bugs). However, the question as you pointed out is, can it happen often enough? As of now, healing can be triggered because of the following reasons: 1. As of now, no synchronization between rename (src, dst) and healing. There are two cases here: a. 
healing of src by a racing lookup on src. This falls in the class of bugs similar to lookup-heal creating directories deleted by a racing rmdir and hence will be fixed when we fix that class of bugs (the solution to which is ready, implementation is pending). b. Healing of destination (as layout of src breaks the continuum of dst layout). But again this is not a problem rename overwrites dst only if its an empty directory and no children need to be healed for empty directory. 2. Race b/w fix layout from a rebalance process and lookup-heal from a client. We don't have synchronization b/w these two as of now and *might* end up with too many directories with DHT_LAYOUT_HASH_INVALID set resulting in poor performance. 3. Any failures in layout setting (because of node going down after we choose to heal layout, setxattr failures etc). Given the above considerations, I conservatively chose to heal children of directories. I am not sure whether these considerations are just theoretical or something realistic that can be hit in field. With the above details, do you still think healing from selfheal daemon is not worth the effort? > > I think we still need to handle stale layouts and the lookup (and other > problems). Yes, the more we avoid spurious heals, the less we need healing from self-heal daemon. In fact we need healing from self-heal daemon only for those directories self-heal was triggered spuriously. > > [1] > https://github.com/gluster/glusterfs/blob/master/xlators/cluster/dht/src/dht-selfheal.c#L1685 > > On 12/11/2015 06:08 AM, Sakshi Bansal wrote: > > The above link may not be accessible to all. 
In that case please refer to > > this: > > https://public.pad.fsfe.org/p/dht_lookup_optimize > > ___ > > Gluster-devel mailing list > > Gluster-devel@gluster.org > > http://www.gluster.org/mailman/listinfo/gluster-devel > > > ___ > Gluster-devel mailing list > Gluster-devel@gluster.org > http://www.gluster.org/mailman/listinfo/gluster-devel > ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
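The behavior under discussion reduces to one predicate: an ENOENT from the hashed subvolume is conclusive only while the directory's commit hash still matches the volume's, and a healed layout carries DHT_LAYOUT_HASH_INVALID (1), which forces a lookup on all subvolumes. A sketch of that check (illustrative; the real logic lives in dht's lookup path and this function name is invented):

```c
#include <assert.h>
#include <stdint.h>

#define DHT_LAYOUT_HASH_INVALID 1

/* With lookup-optimize on, dht may skip the "lookup everywhere" fallback
 * only when the directory layout was committed by a full rebalance. */
static int dht_needs_lookup_everywhere(uint32_t dir_commit_hash,
                                       uint32_t vol_commit_hash)
{
    if (dir_commit_hash == DHT_LAYOUT_HASH_INVALID)
        return 1;                      /* layout was healed, not rebalanced */
    return dir_commit_hash != vol_commit_hash;  /* add/remove-brick since */
}
```

This is why the thread worries about directories ending up with the invalid hash: once set, every negative lookup on that directory pays the lookup-everywhere cost until a fix-layout runs.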
Re: [Gluster-devel] Design for lookup-optimize made default
- Original Message - > From: "Raghavendra Gowdappa" <rgowd...@redhat.com> > To: "Shyam" <srang...@redhat.com> > Cc: "Gluster Devel" <gluster-devel@gluster.org> > Sent: Tuesday, December 15, 2015 11:38:03 AM > Subject: Re: [Gluster-devel] Design for lookup-optimize made default > > > > - Original Message - > > From: "Shyam" <srang...@redhat.com> > > To: "Sakshi Bansal" <saban...@redhat.com>, "Gluster Devel" > > <gluster-devel@gluster.org> > > Sent: Monday, December 14, 2015 10:40:09 PM > > Subject: Re: [Gluster-devel] Design for lookup-optimize made default > > > > Sakshi, > > > > In the doc. there is reference to the fact that when a client fixes a > > layout it assigns the same dircommit hash to the layout which is > > equivalent to the vol commit hash. I think this assumption is incorrect, > > when a client heals the layout, the commit hash is set to 1 > > (DHT_LAYOUT_HASH_INVALID) [1]. > > Yes. You are correct. Thats an oversight on my part. Sorry about it :). > > > > > What the above basically means is that when anyone other than rebalance > > changes the layout of an existing directory, it's commit-hash will start > > disagreeing with the volume commit hash. So that part is already handled > > (unless I am missing something, which case it is a bug and we need it > > fixed). > > > > The other part of the self-heal, I would consider *not* needed. If a > > client heals a layout, it is because a previous layout creator > > (rebalance or mkdir) was incomplete, and hence the client needs to set > > the layout. If this was by rebalance, the rebalance process would have > > failed and hence would need to be rerun. For abnormal failures on > > directory creations, I think the proposed solution is heavy weight, as > > lookup-optimize is an *optimization* and so it can devolve into > > non-optimized modes in such cases. IOW, I am stating we do not need to > > do this self healing. 
> > If this happens to a large number of directories, then performance hit can be > large (and its not an optimization in the sense that hashing should've > helped us to conclusively say when a file is absent and its basic design of > dht, which we had strayed away because of bugs). However, the question as > you pointed out is, can it happen often enough? As of now, healing can be > triggered because of following reasons: > > 1. As of now, no synchronization between rename (src, dst) and healing. There > are two cases here: >a. healing of src by a racing lookup on src. This falls in the class of >bugs similar to lookup-heal creating directories deleted by a racing >rmdir and hence will be fixed when we fix that class of bugs (the >solution to which is ready, implementation is pending). >b. Healing of destination (as layout of src breaks the continuum of dst >layout). But again this is not a problem rename overwrites dst only if >its an empty directory and no children need to be healed for empty >directory. > > 2. Race b/w fix layout from a rebalance process and lookup-heal from a > client. >We don't have synchronization b/w these two as of now and *might* end up >with too many directories with DHT_LAYOUT_HASH_INVALID set resulting in >poor performance. > > 3. Any failures in layout setting (because of node going down after we choose > to heal layout, setxattr failures etc). > > Given the above considerations, I conservatively chose to heal children of > directories. I am not sure whether these considerations are just theoretical > or something realistic that can be hit in field. With the above details, do > you still think healing from selfheal daemon is not worth the effort? And the other thing to note is that once a directory ends up with DHT_LAYOUT_HASH_INVALID (in a non add/remove-brick scenario), it stays in that state till a fix-layout is run, or for the entire lifetime of the directory. 
> > > > > I think we still need to handle stale layouts and the lookup (and other > > problems). > > Yes, the more we avoid spurious heals, the less we need healing from > self-heal daemon. In fact we need healing from self-heal daemon only for > those directories self-heal was triggered spuriously. > > > > > [1] > > https://github.com/gluster/glusterfs/blob/master/xlators/cluster/dht/src/dht-selfheal.c#L1685 > > > > On 12/11/2015 06:08 AM, Sakshi Bansal wrote: > > > The above link may not be accessible to all. In that case please refer to > > > this: > > > https://pu
Re: [Gluster-devel] quota.t hangs on NetBSD machines
- Original Message - > From: "Emmanuel Dreyfus" <m...@netbsd.org> > To: "Gluster Devel" <gluster-devel@gluster.org> > Cc: "Raghavendra Gowdappa" <rgowd...@redhat.com>, "Raghavendra Talur" > <rta...@redhat.com> > Sent: Monday, January 4, 2016 2:35:23 PM > Subject: Re: [Gluster-devel] quota.t hangs on NetBSD machines > > On Mon, Jan 04, 2016 at 09:00:43AM +, Emmanuel Dreyfus wrote: > > gluster volume info/status seem to hang too. > > No, sorry, I was not running the right binary. > I now have a statedump, but how can I use it to conclude anything > about who dropped the request? Can you send the statedump? Please look for frames with "complete=0". This indicates that the frame is not unwound. > > -- > Emmanuel Dreyfus > m...@netbsd.org > ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
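The "complete=0" check above can be scripted. Assuming a statedump with one `[global.callpool.stack.N.frame.M]` section per frame, as quoted later in this thread (the fragment below is a hand-made mock, not verbatim dump output):

```shell
# Build a small mock statedump fragment in the frame-section format quoted
# later in this thread (section header, frame pointer, translator, complete flag).
cat > /tmp/mock-statedump <<'EOF'
[global.callpool.stack.1.frame.1]
frame=0xb80775f0
translator=patchy-write-behind
complete=0
[global.callpool.stack.1.frame.8]
frame=0xb8077070
translator=fuse
complete=1
EOF

# Count frames that were never unwound (complete=0): each one is a fop that
# some translator is still holding on to.
grep -c '^complete=0' /tmp/mock-statedump
```

For the mock fragment above this prints `1`; on a real statedump, pairing it with `grep -B3 '^complete=0'` shows which translator each stuck frame belongs to.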
Re: [Gluster-devel] quota.t hangs on NetBSD machines
- Original Message - > From: "Emmanuel Dreyfus" <m...@netbsd.org> > To: "Raghavendra Gowdappa" <rgowd...@redhat.com> > Cc: "Raghavendra Talur" <rta...@redhat.com>, "Gluster Devel" > <gluster-devel@gluster.org> > Sent: Thursday, December 31, 2015 6:32:22 PM > Subject: Re: [Gluster-devel] quota.t hangs on NetBSD machines > > Raghavendra Gowdappa <rgowd...@redhat.com> wrote: > > > We saw a similar bt on the test process. At that time we took a statedump of > > the client process. While we were going through the statedump, surprisingly the > > test program resumed and completed. > > That suggests the problem would be in the glusterfs client, doesn't it? Not conclusively. If there is a frame-loss, most likely attaching gdb has no effect. But the client can be one of the potential areas where the problem is. > > -- > Emmanuel Dreyfus > http://hcpnet.free.fr/pubz > m...@netbsd.org
Re: [Gluster-devel] quota.t hangs on NetBSD machines
- Original Message - > From: "Emmanuel Dreyfus" <m...@netbsd.org> > To: "Raghavendra Gowdappa" <rgowd...@redhat.com> > Cc: "Emmanuel Dreyfus" <m...@netbsd.org>, "Gluster Devel" > <gluster-devel@gluster.org>, "Raghavendra Talur" > <rta...@redhat.com> > Sent: Monday, January 4, 2016 4:03:22 PM > Subject: Re: [Gluster-devel] quota.t hangs on NetBSD machines > > On Mon, Jan 04, 2016 at 05:16:16AM -0500, Raghavendra Gowdappa wrote: > > Can you send the statedump? Please look for frames with "complete=0". This > > indicates that the frame is not unwound. > > Here it is. No unwound frame? Is this a statedump? Statedumps carry lots of other information too. This seems more like profile output from io-stats. Steps to obtain a statedump from my previous mail: 1. Add "all=yes" to /var/run/gluster/glusterdump.options 2. kill -SIGUSR1 <glusterfs-pid> 3. The statedump can be found in /var/run/gluster/*dump* Also, please take the statedump only after you find the application process is in "uninterruptible sleep" ('D' on Linux) state. 
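A sketch of those three steps as a script. The paths are as given above; the demo writes the options file under /tmp so it can run anywhere, and `<glusterfs-pid>` is a placeholder for the client process id:

```shell
# Step 1: ask for a full statedump ("all=yes" enables every section,
# including the inode tables). On a real node this file must be
# /var/run/gluster/glusterdump.options; /tmp is used here for the demo.
OPTS=/tmp/glusterdump.options
echo "all=yes" > "$OPTS"

# Steps 2 and 3 on a real node (commented out: they need a live client):
#   kill -SIGUSR1 <glusterfs-pid>    # trigger the dump
#   ls /var/run/gluster/*dump*       # collect the result

grep -q '^all=yes' "$OPTS" && echo "options file ready"
```

This prints `options file ready` once step 1 is in place; steps 2 and 3 only make sense against a running glusterfs client that is stuck in 'D' state.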
> > { > "gluster.patchy.aggr.read_1b": "0", > "gluster.patchy.aggr.write_1b": "0", > "gluster.patchy.aggr.read_2b": "0", > "gluster.patchy.aggr.write_2b": "0", > "gluster.patchy.aggr.read_4b": "0", > "gluster.patchy.aggr.write_4b": "0", > "gluster.patchy.aggr.read_8b": "0", > "gluster.patchy.aggr.write_8b": "0", > "gluster.patchy.aggr.read_16b": "0", > "gluster.patchy.aggr.write_16b": "0", > "gluster.patchy.aggr.read_32b": "0", > "gluster.patchy.aggr.write_32b": "0", > "gluster.patchy.aggr.read_64b": "0", > "gluster.patchy.aggr.write_64b": "0", > "gluster.patchy.aggr.read_128b": "0", > "gluster.patchy.aggr.write_128b": "0", > "gluster.patchy.aggr.read_256b": "0", > "gluster.patchy.aggr.write_256b": "0", > "gluster.patchy.aggr.read_512b": "0", > "gluster.patchy.aggr.write_512b": "0", > "gluster.patchy.aggr.read_1kb": "0", > "gluster.patchy.aggr.write_1kb": "0", > "gluster.patchy.aggr.read_2kb": "0", > "gluster.patchy.aggr.write_2kb": "0", > "gluster.patchy.aggr.read_4kb": "0", > "gluster.patchy.aggr.write_4kb": "0", > "gluster.patchy.aggr.read_8kb": "0", > "gluster.patchy.aggr.write_8kb": "0", > "gluster.patchy.aggr.read_16kb": "0", > "gluster.patchy.aggr.write_16kb": "0", > "gluster.patchy.aggr.read_32kb": "0", > "gluster.patchy.aggr.write_32kb": "357", > "gluster.patchy.aggr.read_64kb": "0", > "gluster.patchy.aggr.write_64kb": "0", > "gluster.patchy.aggr.read_128kb": "0", > "gluster.patchy.aggr.write_128kb": "0", > "gluster.patchy.aggr.read_256kb": "0", > "gluster.patchy.aggr.write_256kb": "0", > "gluster.patchy.aggr.read_512kb": "0", > "gluster.patchy.aggr.write_512kb": "0", > "gluster.patchy.aggr.read_1mb": "0", > "gluster.patchy.aggr.write_1mb": "0", > "gluster.patchy.aggr.read_2mb": "0", > "gluster.patchy.aggr.write_2mb": "0", > "gluster.patchy.aggr.read_4mb": "0", > "gluster.patchy.aggr.write_4mb": "0", > "gluster.patchy.aggr.read_8mb": "0", > "gluster.patchy.aggr.write_8mb": "0", > "gluster.patchy.aggr.read_16mb": "0", > "gluster.patchy.aggr.write_16mb": 
"0", > "gluster.patchy.aggr.read_32mb": "0", > "gluster.patchy.aggr.write_32mb": "0", > "gluster.patchy.aggr.read_64mb": "0", > "gluster.patchy.aggr.write_64mb": "0", > "gluster.patchy.aggr.read_128mb": "0", > "gluster.patchy.aggr.write_128mb": "0", > "gluster.patchy.aggr.read_256mb": "0", > "gluster.patchy.aggr.write_256mb": "0", > "gluster.patchy.aggr.read_512mb": "0", > "gluster.patchy.aggr.write_512mb": "
Re: [Gluster-devel] quota.t hangs on NetBSD machines
Thanks. There is a write call which is not unwound by write-behind as can be seen below: [.WRITE] request-ptr=0xb80a4830 refcount=2 wound=no generation-number=90 req->op_ret=32768 req->op_errno=0 sync-attempts=0 sync-in-progress=no size=32768 offset=11665408 lied=0 append=0 fulfilled=0 go=0 note, the request is not wound (wound=no and sync-in-progress=no), not unwound (lied=0). I am yet to figure out the RCA. will be sending a patch soon. - Original Message - > From: "Manikandan Selvaganesh" <mselv...@redhat.com> > To: "Raghavendra Gowdappa" <rgowd...@redhat.com> > Cc: "Vijaikumar Mallikarjuna" <vmall...@redhat.com>, "Emmanuel Dreyfus" > <m...@netbsd.org>, "Raghavendra Talur" > <rta...@redhat.com>, "Gluster Devel" <gluster-devel@gluster.org> > Sent: Tuesday, January 5, 2016 11:43:55 AM > Subject: Re: [Gluster-devel] quota.t hangs on NetBSD machines > > Hi Raghavendra, > > Yeah, we have taken the statedump when the test program was in 'D' state. I > have enabled statedump of inodes too. > > Attaching the entire statedump file. > > Thank you :-) > > -- > Regards, > Manikandan Selvaganesh. > > - Original Message - > From: "Raghavendra Gowdappa" <rgowd...@redhat.com> > To: "Manikandan Selvaganesh" <mselv...@redhat.com> > Cc: "Emmanuel Dreyfus" <m...@netbsd.org>, "Gluster Devel" > <gluster-devel@gluster.org>, "Vijaikumar Mallikarjuna" <vmall...@redhat.com> > Sent: Monday, January 4, 2016 11:56:03 PM > Subject: Re: [Gluster-devel] quota.t hangs on NetBSD machines > > > > - Original Message - > > From: "Manikandan Selvaganesh" <mselv...@redhat.com> > > To: "Raghavendra Gowdappa" <rgowd...@redhat.com> > > Cc: "Emmanuel Dreyfus" <m...@netbsd.org>, "Gluster Devel" > > <gluster-devel@gluster.org>, "Vijaikumar Mallikarjuna" > > <vmall...@redhat.com> > > Sent: Monday, January 4, 2016 7:00:16 PM > > Subject: Re: [Gluster-devel] quota.t hangs on NetBSD machines > > > > Hi, > > > > We have taken statedump of fuse client process, quotad and bricks. 
> > Apparently, we could not find any stack information in brick's statedump. > > Below is the client statedump state information: > > Thanks Manikandan :). You took this statedump once the test program was in > 'D' state, right? Otherwise these can be just in-transit fops. > > > > > [global.callpool.stack.1.frame.1] > > frame=0xb80775f0 > > ref_count=0 > > translator=patchy-write-behind > > complete=0 > > parent=patchy-read-ahead > > wind_from=ra_writev > > wind_to=FIRST_CHILD(this)->fops->writev > > unwind_to=ra_writev_cbk > > As I suspected, write-behind seems to be the culprit. Can you upload the > entire statedump file? Also, make sure you've enabled statedump of inodes > (by setting "all=yes" in glusterdump.options as explained in my previous > mail). > > > > > [global.callpool.stack.1.frame.2] > > frame=0xb8077540 > > ref_count=1 > > translator=patchy-read-ahead > > complete=0 > > parent=patchy-io-cache > > wind_from=ioc_writev > > wind_to=FIRST_CHILD(this)->fops->writev > > unwind_to=ioc_writev_cbk > > > > [global.callpool.stack.1.frame.3] > > frame=0xb8077490 > > ref_count=1 > > translator=patchy-io-cache > > complete=0 > > parent=patchy-quick-read > > wind_from=qr_writev > > wind_to=FIRST_CHILD (this)->fops->writev > > unwind_to=default_writev_cbk > > > > [global.callpool.stack.1.frame.4] > > frame=0xb80773e0 > > ref_count=1 > > translator=patchy-quick-read > > complete=0 > > parent=patchy-open-behind > > wind_from=default_writev_resume > > wind_to=FIRST_CHILD(this)->fops->writev > > unwind_to=default_writev_cbk > > > > [global.callpool.stack.1.frame.5] > > frame=0xb8077330 > > ref_count=1 > > translator=patchy-open-behind > > complete=0 > > parent=patchy-md-cache > > wind_from=mdc_writev > > wind_to=FIRST_CHILD(this)->fops->writev > > unwind_to=mdc_writev_cbk > > > > [global.callpool.stack.1.frame.6] > > frame=0xb80771d0 > > ref_count=1 > > translator=patchy-md-cache > > complete=0 > > parent=patchy > > wind_from=io_stats_writev > > 
wind_to=FIRST_CHILD(this)->fops->writev > > unwind_to=io_stats_writev_cbk > > > > [global.callpool.stack.1.f
Re: [Gluster-devel] quota.t hangs on NetBSD machines
- Original Message - > From: "Raghavendra Gowdappa" <rgowd...@redhat.com> > To: "Manikandan Selvaganesh" <mselv...@redhat.com> > Cc: "Vijaikumar Mallikarjuna" <vmall...@redhat.com>, "Emmanuel Dreyfus" > <m...@netbsd.org>, "Raghavendra Talur" > <rta...@redhat.com>, "Gluster Devel" <gluster-devel@gluster.org> > Sent: Tuesday, January 5, 2016 12:16:27 PM > Subject: Re: [Gluster-devel] quota.t hangs on NetBSD machines > > Thanks. There is a write call which is not unwound by write-behind as can be > seen below: > > [.WRITE] > request-ptr=0xb80a4830 > refcount=2 > wound=no > generation-number=90 > req->op_ret=32768 > req->op_errno=0 > sync-attempts=0 > sync-in-progress=no > size=32768 > offset=11665408 > lied=0 > append=0 > fulfilled=0 > go=0 > > note, the request is not wound (wound=no and sync-in-progress=no), not > unwound (lied=0). I am yet to figure out the RCA. will be sending a patch > soon. I figured out this issue occurs when "trickling-writes" is on in write-behind (by default its on). Unfortunately this option cannot be turned off using cli as of now (one can edit volfiles though). > > - Original Message - > > From: "Manikandan Selvaganesh" <mselv...@redhat.com> > > To: "Raghavendra Gowdappa" <rgowd...@redhat.com> > > Cc: "Vijaikumar Mallikarjuna" <vmall...@redhat.com>, "Emmanuel Dreyfus" > > <m...@netbsd.org>, "Raghavendra Talur" > > <rta...@redhat.com>, "Gluster Devel" <gluster-devel@gluster.org> > > Sent: Tuesday, January 5, 2016 11:43:55 AM > > Subject: Re: [Gluster-devel] quota.t hangs on NetBSD machines > > > > Hi Raghavendra, > > > > Yeah, we have taken the statedump when the test program was in 'D' state. I > > have enabled statedump of inodes too. > > > > Attaching the entire statedump file. > > > > Thank you :-) > > > > -- > > Regards, > > Manikandan Selvaganesh. 
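Since the option cannot be toggled from the cli, editing the volfile means changing the write-behind stanza by hand. A sketch of what that stanza might look like with the option disabled — the volume/translator names follow the test volume ("patchy") in this thread, the option name is taken from the message above, and the `subvolumes` line is an assumption about the client graph:

```
volume patchy-write-behind
    type performance/write-behind
    option trickling-writes off
    subvolumes patchy-dht
end-volume
```

After such an edit, the client would need to pick up the new graph (e.g. by remounting) for the change to take effect.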
> > > > - Original Message - > > From: "Raghavendra Gowdappa" <rgowd...@redhat.com> > > To: "Manikandan Selvaganesh" <mselv...@redhat.com> > > Cc: "Emmanuel Dreyfus" <m...@netbsd.org>, "Gluster Devel" > > <gluster-devel@gluster.org>, "Vijaikumar Mallikarjuna" > > <vmall...@redhat.com> > > Sent: Monday, January 4, 2016 11:56:03 PM > > Subject: Re: [Gluster-devel] quota.t hangs on NetBSD machines > > > > > > > > - Original Message - > > > From: "Manikandan Selvaganesh" <mselv...@redhat.com> > > > To: "Raghavendra Gowdappa" <rgowd...@redhat.com> > > > Cc: "Emmanuel Dreyfus" <m...@netbsd.org>, "Gluster Devel" > > > <gluster-devel@gluster.org>, "Vijaikumar Mallikarjuna" > > > <vmall...@redhat.com> > > > Sent: Monday, January 4, 2016 7:00:16 PM > > > Subject: Re: [Gluster-devel] quota.t hangs on NetBSD machines > > > > > > Hi, > > > > > > We have taken statedump of fuse client process, quotad and bricks. > > > Apparently, we could not find any stack information in brick's statedump. > > > Below is the client statedump state information: > > > > Thanks Manikandan :). You took this statedump once the test program was in > > 'D' state, right? Otherwise these can be just in-transit fops. > > > > > > > > [global.callpool.stack.1.frame.1] > > > frame=0xb80775f0 > > > ref_count=0 > > > translator=patchy-write-behind > > > complete=0 > > > parent=patchy-read-ahead > > > wind_from=ra_writev > > > wind_to=FIRST_CHILD(this)->fops->writev > > > unwind_to=ra_writev_cbk > > > > As I suspected, write-behind seems to be the culprit. Can you upload the > > entire statedump file? Also, make sure you've enabled statedump of inodes > > (by setting "all=yes" in glusterdump.options as explained in my previous > > mail). 
> > > > > > > > [global.callpool.stack.1.frame.2] > > > frame=0xb8077540 > > > ref_count=1 > > > translator=patchy-read-ahead > > > complete=0 > > > parent=patchy-io-cache > > > wind_from=ioc_writev > > > wind_to=FIRST_CHILD(this)->fops->writev > > > unwind_to=ioc_writev_cbk > > > > > > [global.callpool.stack.1.frame.3] > &g
Re: [Gluster-devel] quota.t hangs on NetBSD machines
- Original Message - > From: "Manikandan Selvaganesh" <mselv...@redhat.com> > To: "Raghavendra G" <raghaven...@gluster.com> > Cc: "Gluster Devel" <gluster-devel@gluster.org> > Sent: Wednesday, January 6, 2016 7:54:32 PM > Subject: Re: [Gluster-devel] quota.t hangs on NetBSD machines > > Hi, > We are debugging the issue. With the patch[1], the quota.t doesn't seem to > hang and all the test passes successfully but it throws an error(while > running test #24 in quota.t) "perfused: perfuse_node_inactive: > perfuse_node_fsync failed error = 69: Resource temporarily unavailable". I > have attached a tar file which contains the logs while the test(quota.t) is > being run. > Thanks to Raghavendra Talur and Vijay for helping :) > > [1] http://review.gluster.org/#/c/13177/ I've merged this patch. If there are any tests hung, please kill them and retrigger. > > Thank you :-) > > -- > Regards, > Manikandan Selvaganesh. > > - Original Message - > From: "Manikandan Selvaganesh" <mselv...@redhat.com> > To: "Raghavendra G" <raghaven...@gluster.com> > Cc: "Gluster Devel" <gluster-devel@gluster.org> > Sent: Wednesday, January 6, 2016 11:04:23 AM > Subject: Re: [Gluster-devel] quota.t hangs on NetBSD machines > > Hi Raghavendra, > I will check out this with the fix and update you soon on this :) > > -- > Regards, > Manikandan Selvaganesh. > > - Original Message - > From: "Raghavendra G" <raghaven...@gluster.com> > To: "Raghavendra Gowdappa" <rgowd...@redhat.com> > Cc: "Manikandan Selvaganesh" <mselv...@redhat.com>, "Gluster Devel" > <gluster-devel@gluster.org> > Sent: Tuesday, January 5, 2016 11:15:57 PM > Subject: Re: [Gluster-devel] quota.t hangs on NetBSD machines > > Manikandan, > > Can you test this fix since you've the setup ready? 
> > regards, > Raghavendra > > On Tue, Jan 5, 2016 at 11:11 PM, Raghavendra G <raghaven...@gluster.com> > wrote: > > > A fix has been sent to: > > http://review.gluster.org/13177 > > > > On Tue, Jan 5, 2016 at 2:24 PM, Raghavendra Gowdappa <rgowd...@redhat.com> > > wrote: > > > >> > >> > >> - Original Message - > >> > From: "Raghavendra Gowdappa" <rgowd...@redhat.com> > >> > To: "Manikandan Selvaganesh" <mselv...@redhat.com> > >> > Cc: "Vijaikumar Mallikarjuna" <vmall...@redhat.com>, "Emmanuel > >> Dreyfus" <m...@netbsd.org>, "Raghavendra Talur" > >> > <rta...@redhat.com>, "Gluster Devel" <gluster-devel@gluster.org> > >> > Sent: Tuesday, January 5, 2016 12:16:27 PM > >> > Subject: Re: [Gluster-devel] quota.t hangs on NetBSD machines > >> > > >> > Thanks. There is a write call which is not unwound by write-behind as > >> can be > >> > seen below: > >> > > >> > [.WRITE] > >> > request-ptr=0xb80a4830 > >> > refcount=2 > >> > wound=no > >> > generation-number=90 > >> > req->op_ret=32768 > >> > req->op_errno=0 > >> > sync-attempts=0 > >> > sync-in-progress=no > >> > size=32768 > >> > offset=11665408 > >> > lied=0 > >> > append=0 > >> > fulfilled=0 > >> > go=0 > >> > > >> > note, the request is not wound (wound=no and sync-in-progress=no), not > >> > unwound (lied=0). I am yet to figure out the RCA. will be sending a > >> patch > >> > soon. > >> > >> I figured out this issue occurs when "trickling-writes" is on in > >> write-behind (by default its on). Unfortunately this option cannot be > >> turned off using cli as of now (one can edit volfiles though). 
> >> > >> > > >> > - Original Message - > >> > > From: "Manikandan Selvaganesh" <mselv...@redhat.com> > >> > > To: "Raghavendra Gowdappa" <rgowd...@redhat.com> > >> > > Cc: "Vijaikumar Mallikarjuna" <vmall...@redhat.com>, "Emmanuel > >> Dreyfus" > >> > > <m...@netbsd.org>, "Raghavendra Talur" > >> > > <rta...@redhat.com>, "Gluster Devel" <gluster-devel@gluster.org> > >> > > Sent: Tuesday, January 5, 2016 11:43:55 AM > >> > >
Re: [Gluster-devel] NetBSD tests not running to completion.
> On 01/07/2016 02:39 PM, Emmanuel Dreyfus wrote: > > On Wed, Jan 06, 2016 at 05:49:04PM +0530, Ravishankar N wrote: > >> I re-triggered NetBSD regressions for > >> http://review.gluster.org/#/c/13041/3 > >> but they are being run in silent mode and are not completing. Can someone > >> from the infra-team take a look? The last 22 tests in > >> https://build.gluster.org/job/rackspace-netbsd7-regression-triggered/ have > >> failed. Highly unlikely that something is wrong with all those patches. > > I note your latest test completed with an error in mount-nfs-auth.t: > > https://build.gluster.org/job/rackspace-netbsd7-regression-triggered/13260/consoleFull > > > > Would you have the jenkins build that did not complete so that I can have a > > look at it? > > > > Generally speaking, I have to point out that NetBSD regression does shed light > > on generic bugs; we had a recent example with quota-nfs.t. For now there > > are no other well-supported platforms, but if you want glusterfs to > > be really portable, removing mandatory NetBSD regression is not a good > > idea: > > portability bugs will crop up. > > > > Even a daily or weekly regression run seems a bad idea to me. If you do not > > prevent integration of patches that break NetBSD regression, they will get > > in, and tests will break one by one over time. I have first-hand > > experience of this situation, from when I was actually trying to catch up with > > NetBSD regression. Many times I reached something reliable enough to become > > mandatory, and got broken by a new patch before it became actually > > mandatory. > > > > IMO, relaxing the NetBSD regression requirement means the project drops the > > goal > > of being portable. > > > hi Emmanuel, > This Sunday I have some time I can spend helping in making > tests better for NetBSD. I have seen bugs that are caught only by NetBSD > regression just recently, so I see value in making NetBSD more reliable. +1. 
As Manu and Ravi's conversation pointed out, it's better to take a call based on data (how many tests are failing, how many are spurious). As my recent work on quota-nfs.t shows, I was actively trying to seek a reproducer for the write-behind issue, but the reproducer seemed elusive. We were able to hit the bug very inconsistently. Couple that with the pressure to take things to closure, and a tendency to push things under the carpet creeps in. Having said that, you can find some of my commits where netbsd results are skipped (or not waited for completion of netbsd runs). Knowing that the infra is stable and there are fewer false positives (of bugs) will shift responsibility onto developers to own the issue and fix it. > Please let me know what are the things we can work on. It would help if > you give me something specific to glusterfs to make it more valuable in > the short term. Over time I would like to learn enough to share the load > with you however little it may be (Please bear with me, I sometimes go > quiet). Here are the initial things I would like to know to begin with: I can try to help out here too. But mostly on a best-effort basis, as there are other responsibilities where I am evaluated directly. 
> 4) How can we make debugging better in NetBSD? In the worst case we can > make all tests execute in trace/debug mode on NetBSD. > > I really want to appreciate the fine job you have done so far in making > sure glusterfs is stable on NetBSD. ++1. I appreciate Emmanuel's effort/support from such a long time and will try to chip in to whatever extent I can. > > Infra team, > I think we need to make some improvements to our infra. We need > to get information about health of linux, NetBSD regression builds. > 1) Something like, in the last 100 builds how many builds succeeded on > Linux, how many succeeded on NetBSD. > 2) What are the tests that failed in the last 100 builds and how many > times on both Linux and NetBSD. (I actually wrote this part in some > parts, but the whole command output has changed making my scripts stale) > Any other ideas you guys have? > 3) Which components have highest number of spurious failures. > 4) How many builds did not complete/manually aborted etc. > > Once we start measuring these things, next
Re: [Gluster-devel] quota.t hangs on NetBSD machines
- Original Message - > From: "Manikandan Selvaganesh" <mselv...@redhat.com> > To: "Raghavendra Gowdappa" <rgowd...@redhat.com> > Cc: "Emmanuel Dreyfus" <m...@netbsd.org>, "Gluster Devel" > <gluster-devel@gluster.org>, "Vijaikumar Mallikarjuna" > <vmall...@redhat.com> > Sent: Monday, January 4, 2016 7:00:16 PM > Subject: Re: [Gluster-devel] quota.t hangs on NetBSD machines > > Hi, > > We have taken statedump of fuse client process, quotad and bricks. > Apparently, we could not find any stack information in brick's statedump. > Below is the client statedump state information: Thanks Manikandan :). You took this statedump once the test program was in 'D' state, right? Otherwise these can be just in-transit fops. > > [global.callpool.stack.1.frame.1] > frame=0xb80775f0 > ref_count=0 > translator=patchy-write-behind > complete=0 > parent=patchy-read-ahead > wind_from=ra_writev > wind_to=FIRST_CHILD(this)->fops->writev > unwind_to=ra_writev_cbk As I suspected, write-behind seems to be the culprit. Can you upload the entire statedump file? Also, make sure you've enabled statedump of inodes (by setting "all=yes" in glusterdump.options as explained in my previous mail). 
> > [global.callpool.stack.1.frame.2] > frame=0xb8077540 > ref_count=1 > translator=patchy-read-ahead > complete=0 > parent=patchy-io-cache > wind_from=ioc_writev > wind_to=FIRST_CHILD(this)->fops->writev > unwind_to=ioc_writev_cbk > > [global.callpool.stack.1.frame.3] > frame=0xb8077490 > ref_count=1 > translator=patchy-io-cache > complete=0 > parent=patchy-quick-read > wind_from=qr_writev > wind_to=FIRST_CHILD (this)->fops->writev > unwind_to=default_writev_cbk > > [global.callpool.stack.1.frame.4] > frame=0xb80773e0 > ref_count=1 > translator=patchy-quick-read > complete=0 > parent=patchy-open-behind > wind_from=default_writev_resume > wind_to=FIRST_CHILD(this)->fops->writev > unwind_to=default_writev_cbk > > [global.callpool.stack.1.frame.5] > frame=0xb8077330 > ref_count=1 > translator=patchy-open-behind > complete=0 > parent=patchy-md-cache > wind_from=mdc_writev > wind_to=FIRST_CHILD(this)->fops->writev > unwind_to=mdc_writev_cbk > > [global.callpool.stack.1.frame.6] > frame=0xb80771d0 > ref_count=1 > translator=patchy-md-cache > complete=0 > parent=patchy > wind_from=io_stats_writev > wind_to=FIRST_CHILD(this)->fops->writev > unwind_to=io_stats_writev_cbk > > [global.callpool.stack.1.frame.7] > frame=0xb8077120 > ref_count=1 > translator=patchy > complete=0 > parent=fuse > wind_from=fuse_write_resume > wind_to=FIRST_CHILD(this)->fops->writev > unwind_to=fuse_writev_cbk > > [global.callpool.stack.1.frame.8] > frame=0xb8077070 > ref_count=1 > translator=fuse > complete=0 > > [global.callpool.stack.2] > stack=0xba420040 > uid=0 > gid=0 > pid=0 > unique=0 > lk-owner= > op=stack > type=0 > cnt=1 > > [global.callpool.stack.2.frame.1] > frame=0xba43c6d0 > ref_count=0 > translator=glusterfs > complete=0 > > [fuse] > > Below is the statedump information for quotad > > [global.callpool.stack.1.frame.1] > frame=0xbb1d0620 > ref_count=0 > translator=glusterfs > complete=0 > > Thank you :-) > > -- > Regards, > Manikandan Selvaganesh. 
> > - Original Message - > From: "Raghavendra Gowdappa" <rgowd...@redhat.com> > To: "Emmanuel Dreyfus" <m...@netbsd.org> > Cc: "Gluster Devel" <gluster-devel@gluster.org> > Sent: Monday, January 4, 2016 6:11:24 PM > Subject: Re: [Gluster-devel] quota.t hangs on NetBSD machines > > > > - Original Message - > > From: "Emmanuel Dreyfus" <m...@netbsd.org> > > To: "Raghavendra Gowdappa" <rgowd...@redhat.com> > > Cc: "Emmanuel Dreyfus" <m...@netbsd.org>, "Gluster Devel" > > <gluster-devel@gluster.org>, "Raghavendra Talur" > > <rta...@redhat.com> > > Sent: Monday, January 4, 2016 4:03:22 PM > > Subject: Re: [Gluster-devel] quota.t hangs on NetBSD machines > > > > On Mon, Jan 04, 2016 at 05:16:16AM -0500, Raghavendra Gowdappa wrote: > > > Can you send the statedump? Please look for frames with "complete=0". > > > This > > > indicates that the frame is not unwound. > > > > Here it is. No unwound frame? > > Is this statedump? Statedumps carry lots of other information too. This seems > more li
Re: [Gluster-devel] FreeBSD port of GlusterFS racks up a lot of CPU usage
- Original Message - > From: "Rick Macklem"> To: "Jeff Darcy" > Cc: "Raghavendra G" , "freebsd-fs" > , "Hubbard Jordan" > , "Xavier Hernandez" , "Gluster > Devel" > Sent: Saturday, January 9, 2016 7:29:59 AM > Subject: Re: [Gluster-devel] FreeBSD port of GlusterFS racks up a lot of CPU > usage > > Jeff Darcy wrote: > > > > I don't know anything about gluster's poll implementation so I may > > > > be totally wrong, but would it be possible to use an eventfd (or a > > > > pipe if eventfd is not supported) to signal the need to add more > > > > file descriptors to the poll call ? > > > > > > > > > > > > The poll call should listen on this new fd. When we need to change > > > > the fd list, we should simply write to the eventfd or pipe from > > > > another thread. This will cause the poll call to return and we will > > > > be able to change the fd list without having a short timeout nor > > > > having to decide on any trade-off. > > > > > > > > > Thats a nice idea. Based on my understanding of why timeouts are being > > > used, this approach can work. > > > > The own-thread code which preceded the current poll implementation did > > something similar, using a pipe fd to be woken up for new *outgoing* > > messages. That code still exists, and might provide some insight into > > how to do this for the current poll code. > I took a look at event-poll.c and found something interesting... > - A pipe called "breaker" is already set up by event_pool_new_poll() and > closed by event_pool_destroy_poll(), however it never gets used for > anything. I did a check on history, but couldn't find any information on why it was removed. Can you send this patch to http://review.gluster.org ? We can review and merge the patch over there. 
If you are not aware, development work flow can be found at: http://www.gluster.org/community/documentation/index.php/Developers > > So, I added a few lines of code that writes a byte to it whenever the list of > file descriptors is changed and read when poll() returns, if its revents is > set. > I also changed the timeout to -1 (infinity) and it seems to work for a > trivial > test. > --> Btw, I also noticed the "changed" variable gets set to 1 on a change, but > never reset to 0. I didn't change this, since it looks "racey". (ie. I > think you could easily get a race between a thread that clears it and one > that adds a new fd.) > > A slightly safer version of the patch would set a long (100msec ??) timeout > instead > of -1. > > Anyhow, I've attached the patch in case anyone would like to try it and will > create a bug report for this after I've had more time to test it. > (I only use a couple of laptops, so my testing will be minimal.) > > Thanks for all the help, rick > > > ___ > > freebsd...@freebsd.org mailing list > > https://lists.freebsd.org/mailman/listinfo/freebsd-fs > > To unsubscribe, send any mail to "freebsd-fs-unsubscr...@freebsd.org" > > > ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] volume-snapshot.t failures
Seems like a snapshot failure on build machines. Found another failure: https://build.gluster.org/job/rackspace-regression-2GB-triggered/17066/console Test failed: ./tests/basic/tier/tier-snapshot.t Debug-msg: ++ gluster --mode=script --wignore snapshot create snap2 patchy no-timestamp snapshot create: failed: Pre-validation failed on localhost. Please check log file for details + test_footer + RET=1 + local err= + '[' 1 -eq 0 ']' + echo 'not ok 11 ' not ok 11 + '[' x0 = x0 ']' + echo 'FAILED COMMAND: gluster --mode=script --wignore snapshot create snap2 patchy no-timestamp' FAILED COMMAND: gluster --mode=script --wignore snapshot create snap2 patchy no-timestamp - Original Message - > From: "Raghavendra Gowdappa" <rgowd...@redhat.com> > To: "Rajesh Joseph" <rjos...@redhat.com> > Sent: Tuesday, December 22, 2015 10:05:22 AM > Subject: volume-snapshot.t failures > > Hi Rajesh > > There is a failure of volume-snapshot.t on build machine: > https://build.gluster.org/job/rackspace-regression-2GB-triggered/17048/consoleFull > > > However, on my local machine test succeeds always. Is it a known case of > spurious failure? > > regards, > Raghavendra.
Re: [Gluster-devel] volume-snapshot.t failures
Both these tests succeed on my local machine. - Original Message - > From: "Raghavendra Gowdappa" <rgowd...@redhat.com> > To: "Rajesh Joseph" <rjos...@redhat.com> > Cc: "Gluster Devel" <gluster-devel@gluster.org> > Sent: Tuesday, December 22, 2015 12:05:24 PM > Subject: Re: [Gluster-devel] volume-snapshot.t failures > > Seems like a snapshot failure on build machines. Found another failure: > https://build.gluster.org/job/rackspace-regression-2GB-triggered/17066/console > > Test failed: > ./tests/basic/tier/tier-snapshot.t > > Debug-msg: > ++ gluster --mode=script --wignore snapshot create snap2 patchy no-timestamp > snapshot create: failed: Pre-validation failed on localhost. Please check log > file for details > + test_footer > + RET=1 > + local err= > + '[' 1 -eq 0 ']' > + echo 'not ok 11 ' > not ok 11 > + '[' x0 = x0 ']' > + echo 'FAILED COMMAND: gluster --mode=script --wignore snapshot create snap2 > patchy no-timestamp' > FAILED COMMAND: gluster --mode=script --wignore snapshot create snap2 patchy > no-timestamp > > - Original Message - > > From: "Raghavendra Gowdappa" <rgowd...@redhat.com> > > To: "Rajesh Joseph" <rjos...@redhat.com> > > Sent: Tuesday, December 22, 2015 10:05:22 AM > > Subject: volume-snapshot.t failures > > > > Hi Rajesh > > > > There is a failure of volume-snapshot.t on build machine: > > https://build.gluster.org/job/rackspace-regression-2GB-triggered/17048/consoleFull > > > > > > However, on my local machine test succeeds always. Is it a known case of > > spurious failure? > > > > regards, > > Raghavendra. > ___ > Gluster-devel mailing list > Gluster-devel@gluster.org > http://www.gluster.org/mailman/listinfo/gluster-devel > ___ Gluster-devel mailing list Gluster-devel@gluster.org http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] volume-snapshot.t failures
Thanks Dan!!

- Original Message -
> From: "Dan Lambright" <dlamb...@redhat.com>
> To: "Raghavendra Gowdappa" <rgowd...@redhat.com>
> Cc: "Rajesh Joseph" <rjos...@redhat.com>, "Gluster Devel" <gluster-devel@gluster.org>
> Sent: Tuesday, December 22, 2015 12:16:05 PM
> Subject: Re: [Gluster-devel] volume-snapshot.t failures
>
> They fail on RHEL6 machines due to an issue with sqlite. The test has
> already been moved to the ignore list (13056).
> I'd like to look into having tests that do not run on RHEL6, only RHEL7+.
> I've been running it on RHEL7 in a loop for the last few hours, successfully.
>
> - Original Message -
> > From: "Raghavendra Gowdappa" <rgowd...@redhat.com>
> > To: "Rajesh Joseph" <rjos...@redhat.com>
> > Cc: "Gluster Devel" <gluster-devel@gluster.org>
> > Sent: Tuesday, December 22, 2015 1:42:13 AM
> > Subject: Re: [Gluster-devel] volume-snapshot.t failures
> >
> > Both these tests succeed on my local machine.
> >
> > - Original Message -
> > > From: "Raghavendra Gowdappa" <rgowd...@redhat.com>
> > > To: "Rajesh Joseph" <rjos...@redhat.com>
> > > Cc: "Gluster Devel" <gluster-devel@gluster.org>
> > > Sent: Tuesday, December 22, 2015 12:05:24 PM
> > > Subject: Re: [Gluster-devel] volume-snapshot.t failures
> > >
> > > Seems like a snapshot failure on build machines. Found another failure:
> > > https://build.gluster.org/job/rackspace-regression-2GB-triggered/17066/console
> > >
> > > Test failed:
> > > ./tests/basic/tier/tier-snapshot.t
> > >
> > > Debug-msg:
> > > ++ gluster --mode=script --wignore snapshot create snap2 patchy no-timestamp
> > > snapshot create: failed: Pre-validation failed on localhost. Please check
> > > log file for details
> > > + test_footer
> > > + RET=1
> > > + local err=
> > > + '[' 1 -eq 0 ']'
> > > + echo 'not ok 11 '
> > > not ok 11
> > > + '[' x0 = x0 ']'
> > > + echo 'FAILED COMMAND: gluster --mode=script --wignore snapshot create
> > > snap2 patchy no-timestamp'
> > > FAILED COMMAND: gluster --mode=script --wignore snapshot create snap2
> > > patchy no-timestamp
> > >
> > > - Original Message -
> > > > From: "Raghavendra Gowdappa" <rgowd...@redhat.com>
> > > > To: "Rajesh Joseph" <rjos...@redhat.com>
> > > > Sent: Tuesday, December 22, 2015 10:05:22 AM
> > > > Subject: volume-snapshot.t failures
> > > >
> > > > Hi Rajesh,
> > > >
> > > > There is a failure of volume-snapshot.t on a build machine:
> > > > https://build.gluster.org/job/rackspace-regression-2GB-triggered/17048/consoleFull
> > > >
> > > > However, the test always succeeds on my local machine. Is this a known
> > > > case of spurious failure?
> > > >
> > > > regards,
> > > > Raghavendra.
Re: [Gluster-devel] Lot of Netbsd regressions 'Waiting for the next available executor'
- Original Message -
> From: "Raghavendra Gowdappa" <rgowd...@redhat.com>
> To: "Ravishankar N" <ravishan...@redhat.com>
> Cc: "Gluster Devel" <gluster-devel@gluster.org>, "gluster-infra" <gluster-in...@gluster.org>
> Sent: Thursday, December 24, 2015 12:11:46 PM
> Subject: Re: [Gluster-devel] Lot of Netbsd regressions 'Waiting for the next
> available executor'
>
> https://build.gluster.org/job/rackspace-netbsd7-regression-triggered/12961/consoleFull
>
> Seems to be hung. Maybe a hung syscall? I've tried to kill it, but it seems
> it's not dead. Maybe patch #12594 is causing some issues on NetBSD. It has
> passed gluster regression.

s/gluster/Linux/

> - Original Message -
> > From: "Ravishankar N" <ravishan...@redhat.com>
> > To: "Gluster Devel" <gluster-devel@gluster.org>, "gluster-infra" <gluster-in...@gluster.org>
> > Sent: Thursday, December 24, 2015 9:27:53 AM
> > Subject: [Gluster-devel] Lot of Netbsd regressions 'Waiting for the next
> > available executor'
> >
> > $subject.
> > Since yesterday.
> > The build queue is growing. Something's wrong.
> >
> > "If you see a little black clock icon in the build queue as shown below,
> > it is an indication that your job is sitting in the queue unnecessarily."
> > is what it says.
Re: [Gluster-devel] Lot of Netbsd regressions 'Waiting for the next available executor'
https://build.gluster.org/job/rackspace-netbsd7-regression-triggered/12961/consoleFull

Seems to be hung. Maybe a hung syscall? I've tried to kill it, but it seems
it's not dead. Maybe patch #12594 is causing some issues on NetBSD. It has
passed gluster regression.

- Original Message -
> From: "Ravishankar N"
> To: "Gluster Devel", "gluster-infra"
> Sent: Thursday, December 24, 2015 9:27:53 AM
> Subject: [Gluster-devel] Lot of Netbsd regressions 'Waiting for the next
> available executor'
>
> $subject.
> Since yesterday.
> The build queue is growing. Something's wrong.
>
> "If you see a little black clock icon in the build queue as shown below, it
> is an indication that your job is sitting in the queue unnecessarily." is
> what it says.
Re: [Gluster-devel] Native RDMA in libgfapi
- Original Message -
> From: "Piotr Rybicki"
> To: "Gluster Devel"
> Sent: Friday, November 20, 2015 9:00:28 PM
> Subject: [Gluster-devel] Native RDMA in libgfapi
>
> Hi All.
>
> Are there any plans for this feature?
>
> Just tested the latest glusterfs (3.7.6), and it still doesn't work (as
> expected, since there was no info in the changelog about it).

What errors did you get? Is it possible to send across the log files (bricks,
clients and glusterd logs)? It works for fuse mounts, so theoretically it
should work for gfapi too.

> Native RDMA transport should give a significant boost in performance,
> based on my observations with a fuse mount.
>
> If that is of any help, I'm more than happy to test patches ;-)
>
> I'm using Mellanox QDR cards and OFED 3.12.
>
> Best regards,
> Piotr Rybicki
[Gluster-devel] [Review request] write-behind to retry failed syncs
Hi all,

[1] adds retry logic for failed syncs (to the backend). It would be helpful if
you could comment on:

1. Interface
2. Design
3. Implementation

[1] review.gluster.org/#/c/12594/7

regards,
Raghavendra.
Re: [Gluster-devel] [Review request] write-behind to retry failed syncs
For ease of access, I am posting the summary from the commit message below:

1. When a sync fails, the cached write is still preserved, unless there is a
   flush/fsync waiting on it.
2. When a sync fails and there is a flush/fsync waiting on the cached write,
   the cache is thrown away and no further retries are made. In other words,
   flush/fsync act as barriers for all previous writes: every earlier write is
   either successfully synced to the backend or forgotten in case of an error.
   Without such a barrier fop (especially flush, which is issued prior to a
   close), we would end up retrying forever, even after the fd is closed.
3. If a fop is waiting on a cached write and syncing it to the backend fails,
   the waiting fop is failed.
4. Sync failures when no fop is waiting are ignored and are not propagated to
   the application.
5. The effect of repeated sync failures is that there will be no cache for
   future writes, so they cannot be written behind.

The above algorithm handles transient errors (EDQUOT, ENOSPC, ENOTCONN).
Handling of non-transient errors is slightly different:

1. Throw away the write buffer, so that the cache is freed. This means no
   retries are made for non-transient errors. Also, since the cache is freed,
   future writes can be written behind.
2. Retain the request till an fsync or flush. This means all future operations
   on the failed regions will fail till an fsync/flush. This is conservative
   error handling that forces the application to learn that a written-behind
   write has failed, and to take remedial action such as rolling back to the
   last fsync and retrying all writes from that point.
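The two rule sets above can be modelled as a small state machine. The
following is a minimal, hypothetical Python sketch of that behaviour, written
only to illustrate the commit message; the class, method names and the
`barrier` flag are invented here and do not correspond to the actual
write-behind translator code under review:

```python
import errno

# Errors the algorithm treats as transient (retryable), per the summary above.
TRANSIENT = {errno.EDQUOT, errno.ENOSPC, errno.ENOTCONN}

class WriteBehind:
    def __init__(self, sync_fn):
        self.sync_fn = sync_fn   # backend writer: (offset, data) -> None, may raise OSError
        self.cache = []          # cached writes not yet synced: (offset, data)
        self.failed = []         # regions hit by a non-transient error: (offset, length)

    def write(self, offset, data):
        # Non-transient rule 2: operations on a failed region keep failing
        # until the application issues a flush/fsync.
        for off, length in self.failed:
            if offset < off + length and off < offset + len(data):
                raise OSError(errno.EIO, "earlier write-behind failure; flush/fsync required")
        self.cache.append((offset, data))

    def sync(self, barrier=False):
        # barrier=True models a flush/fsync waiting on the cached writes.
        err, retry = None, []
        for offset, data in self.cache:
            try:
                self.sync_fn(offset, data)
            except OSError as e:
                if e.errno in TRANSIENT and not barrier:
                    retry.append((offset, data))   # rule 1: keep the cache, retry later
                elif e.errno in TRANSIENT:
                    err = e                        # rules 2/3: drop the write, fail the fop
                else:
                    # non-transient rule 1: free the cache, poison the region
                    self.failed.append((offset, len(data)))
                    err = e
        self.cache = retry
        if barrier:
            self.failed = []     # the barrier fop informs the app; region cleared
            if err:
                raise err
        # rule 4: with no fop waiting, the failure is not propagated

    def flush(self):
        self.sync(barrier=True)  # barrier for all previous writes
```

With a backend that fails once with ENOTCONN, a plain `sync()` leaves the
write cached for a later retry, while the same failure under `flush()` drops
the cache and propagates the error to the caller.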