Re: [Gluster-devel] Regarding doing away with refkeeper in locks xlator

2014-06-04 Thread Pranith Kumar Karampuri


On 06/04/2014 11:37 AM, Krutika Dhananjay wrote:

Hi,

Recently there was a crash in locks translator (BZ 1103347, BZ 
1097102) with the following backtrace:

(gdb) bt
#0  uuid_unpack (in=0x8 <Address 0x8 out of bounds>, uu=0x7fffea6c6a60) at ../../contrib/uuid/unpack.c:44
#1  0x7feeba9e19d6 in uuid_unparse_x (uu=<value optimized out>, out=0x2350fc0 "081bbc7a-7551-44ac-85c7-aad5e2633db9",
    fmt=0x7feebaa08e00 "%08x-%04x-%04x-%02x%02x-%02x%02x%02x%02x%02x%02x") at ../../contrib/uuid/unparse.c:55
#2  0x7feeba9be837 in uuid_utoa (uuid=0x8 <Address 0x8 out of bounds>) at common-utils.c:2138
#3  0x7feeb06e8a58 in pl_inodelk_log_cleanup (this=0x230d910, ctx=0x7fee700f0c60) at inodelk.c:396
#4  pl_inodelk_client_cleanup (this=0x230d910, ctx=0x7fee700f0c60) at inodelk.c:428
#5  0x7feeb06ddf3a in pl_client_disconnect_cbk (this=0x230d910, client=<value optimized out>) at posix.c:2550
#6  0x7feeba9fa2dd in gf_client_disconnect (client=0x27724a0) at client_t.c:368
#7  0x7feeab77ed48 in server_connection_cleanup (this=0x2316390, client=0x27724a0, flags=<value optimized out>) at server-helpers.c:354
#8  0x7feeab77ae2c in server_rpc_notify (rpc=<value optimized out>, xl=0x2316390, event=<value optimized out>, data=0x2bf51c0) at server.c:527
#9  0x7feeba775155 in rpcsvc_handle_disconnect (svc=0x2325980, trans=0x2bf51c0) at rpcsvc.c:720
#10 0x7feeba776c30 in rpcsvc_notify (trans=0x2bf51c0, mydata=<value optimized out>, event=<value optimized out>, data=0x2bf51c0) at rpcsvc.c:758
#11 0x7feeba778638 in rpc_transport_notify (this=<value optimized out>, event=<value optimized out>, data=<value optimized out>) at rpc-transport.c:512
#12 0x7feeb115e971 in socket_event_poll_err (fd=<value optimized out>, idx=<value optimized out>, data=0x2bf51c0, poll_in=<value optimized out>, poll_out=0, poll_err=0) at socket.c:1071
#13 socket_event_handler (fd=<value optimized out>, idx=<value optimized out>, data=0x2bf51c0, poll_in=<value optimized out>, poll_out=0, poll_err=0) at socket.c:2240
#14 0x7feeba9fc6a7 in event_dispatch_epoll_handler (event_pool=0x22e2d00) at event-epoll.c:384
#15 event_dispatch_epoll (event_pool=0x22e2d00) at event-epoll.c:445
#16 0x00407e93 in main (argc=19, argv=0x7fffea6c7f88) at glusterfsd.c:2023

(gdb) f 4
#4  pl_inodelk_client_cleanup (this=0x230d910, ctx=0x7fee700f0c60) at inodelk.c:428
428             pl_inodelk_log_cleanup (l);
(gdb) p l->pl_inode->refkeeper
$1 = (inode_t *) 0x0
(gdb)

pl_inode->refkeeper was found to be NULL even though there were still some
blocked inodelks in a certain domain of the inode; when the epoll thread
dereferenced it in the cleanup codepath, the process crashed.


On inspecting the code (for want of a consistent reproducer), three 
things were found:


1. The function where the crash happens (pl_inodelk_log_cleanup()) makes an
attempt to resolve the inode to a path, as can be seen below. But the way
inode_path() itself works is to first construct the path from the given
inode's ancestry and place it in the buffer provided; and if all else fails,
the gfid of the inode is placed in the buffer in a fixed format ("<gfid:%s>").
This eliminates the need for the statements on lines 4 through 7 below,
thereby preventing the dereference of pl_inode->refkeeper.
Now, although this change prevents the crash altogether, it still does not
fix the race that caused pl_inode->refkeeper to become NULL, and it comes at
the cost of printing "(null)" in the log message on line 9 every time
pl_inode->refkeeper is found to be NULL, rendering the logged messages
somewhat useless.


<code>
  0 pl_inode = lock->pl_inode;
  1
  2 inode_path (pl_inode->refkeeper, NULL, &path);
  3
  4 if (path)
  5         file = path;
  6 else
  7         file = uuid_utoa (pl_inode->refkeeper->gfid);
  8
  9 gf_log (THIS->name, GF_LOG_WARNING,
 10         "releasing lock on %s held by "
 11         "{client=%p, pid=%"PRId64" lk-owner=%s}",
 12         file, lock->client, (uint64_t) lock->client_pid,
 13         lkowner_utoa (&lock->owner));
</code>
I think this logging code is from the days before the gfid-handle concept
existed, when inode_path() did not return "<gfid:gfid-str>" in cases where
the path is not present in the dentries. I believe the else block can be
deleted safely now.
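
For illustration only, here is a minimal sketch (not the actual patch) of what
pl_inodelk_log_cleanup() could reduce to once the else block is gone. It
assumes inode_path() either fills the buffer (with the path or a "<gfid:...>"
string) or, when refkeeper is NULL, leaves it NULL so the log line prints
"(null)"; the GF_FREE at the end is also an assumption about buffer ownership.

<code>
static void
pl_inodelk_log_cleanup (pl_inode_lock_t *lock)
{
        pl_inode_t *pl_inode = NULL;
        char       *path     = NULL;

        pl_inode = lock->pl_inode;

        /* Fills 'path' with the resolved path, or with "<gfid:...>" when
         * the ancestry is unknown; a NULL refkeeper leaves it NULL. */
        inode_path (pl_inode->refkeeper, NULL, &path);

        gf_log (THIS->name, GF_LOG_WARNING,
                "releasing lock on %s held by "
                "{client=%p, pid=%"PRId64" lk-owner=%s}",
                path, lock->client, (uint64_t) lock->client_pid,
                lkowner_utoa (&lock->owner));

        GF_FREE (path);
}
</code>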


Pranith


2. There is at least one codepath found that can lead to this crash:
   Imagine an inode on which an inodelk operation is attempted by a client and is successfully granted.
   Now, between the time the lock was granted and the time pl_update_refkeeper() was called by this thread, the client could send a DISCONNECT event,
   causing the cleanup codepath to be executed, where the epoll thread crashes on dereferencing pl_inode->refkeeper, which is STILL NULL at this point.
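
   To make the window clearer, here is that interleaving sketched as a
   timeline (function names are taken from the backtrace and the description
   above; this is an illustration, not code from the tree):

<code>
/*
 * fop thread (inodelk)                    epoll thread (client DISCONNECT)
 * -------------------------------------   --------------------------------
 * inodelk granted, lock added to a
 * domain list of pl_inode
 * (pl_inode->refkeeper still NULL)
 *                                         pl_client_disconnect_cbk()
 *                                           pl_inodelk_client_cleanup()
 *                                             pl_inodelk_log_cleanup()
 *                                               uuid_utoa (pl_inode->refkeeper->gfid)
 *                                               NULL dereference ==> crash
 * pl_update_refkeeper (pl_inode)          runs only after the crash window
 */
</code>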


   Besides, there are still places in locks xlator where the refkeeper 
is NOT updated whenever the lists are modified - for instance in the 
cleanup codepath from a 

Re: [Gluster-devel] All builds are failing with BUILD ERROR

2014-06-04 Thread Niels de Vos
On Wed, Jun 04, 2014 at 11:45:17AM +0530, Kaleb KEITHLEY wrote:
 And since doing this, regression runs seem to be proceeding without issues.

I remember that this used to be an issue when someone cancelled/stopped 
a running regression test through Jenkins. Maybe that was done? I don't 
know if anyone ever looked into solving that particular issue.

Niels

 
 
 
 On 06/04/2014 09:59 AM, Kaleb KEITHLEY wrote:
 On 06/03/2014 04:42 PM, Pranith Kumar Karampuri wrote:
  Guys, it's failing again with the same error:
 
 Please proceed with configuring, compiling, and installing.
 rm: cannot remove `/build/install/var/run/gluster/patchy': Device or
 resource busy
 + RET=1
 + '[' 1 '!=' 0 ']'
 + VERDICT='BUILD FAILURE'
 
 
 Has someone changed the way builds are cleaned up?
 
 Recent regressions are now failing with
 
 ...
 Running automake...
 Running autogen.sh in argp-standalone ...
 
 Please proceed with configuring, compiling, and installing.
 configure: error: source directory already configured; run make
 distclean there first
 + RET=1
 + '[' 1 '!=' 0 ']'
 + VERDICT='BUILD FAILURE'
 ...
 
 
 
 I looked and found that /var/lib/jenkins/jobs/regression/workspace
 contained what looked like a _previous_ successful regression build.
 
 I manually cleaned the directory (moved it and created a new workspace
 dir actually) and now a regression is running.
 
 What's going on? !!!
 
 --
 
 Kaleb
 
 
 
 ___
 Gluster-devel mailing list
 Gluster-devel@gluster.org
 http://supercolony.gluster.org/mailman/listinfo/gluster-devel
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] autodelete in snapshots

2014-06-04 Thread Raghavendra Bhat

On Wednesday 04 June 2014 11:23 AM, Rajesh Joseph wrote:


- Original Message -

From: M S Vishwanath Bhat msvb...@gmail.com
To: Rajesh Joseph rjos...@redhat.com
Cc: Vijay Bellur vbel...@redhat.com, Seema Naik sen...@redhat.com, Gluster 
Devel
gluster-devel@gluster.org
Sent: Tuesday, June 3, 2014 5:55:27 PM
Subject: Re: [Gluster-devel] autodelete in snapshots

On 3 June 2014 15:21, Rajesh Joseph rjos...@redhat.com wrote:



- Original Message -
From: M S Vishwanath Bhat msvb...@gmail.com
To: Vijay Bellur vbel...@redhat.com
Cc: Seema Naik sen...@redhat.com, Gluster Devel 
gluster-devel@gluster.org
Sent: Tuesday, June 3, 2014 1:02:08 AM
Subject: Re: [Gluster-devel] autodelete in snapshots




On 2 June 2014 20:22, Vijay Bellur  vbel...@redhat.com  wrote:



On 04/23/2014 05:50 AM, Vijay Bellur wrote:


On 04/20/2014 11:42 PM, Lalatendu Mohanty wrote:


On 04/16/2014 11:39 AM, Avra Sengupta wrote:


The whole purpose of introducing the soft-limit is that, at any point in
time, the number of snaps should not exceed the hard limit. If we trigger
auto-delete on hitting the hard-limit, then that purpose itself is lost,
because at that point we would be taking a snap, bringing the count to
hard-limit + 1, and only then triggering auto-delete, which violates the
sanctity of the hard-limit. (For example, with a hard-limit of 10, the 11th
create would momentarily leave 11 snaps before the oldest is reaped.)
Also, what happens when we are at hard-limit + 1 and another snap is issued
while auto-delete is yet to process the first delete? At that point we end
up at hard-limit + 2. And what happens if the auto-delete fails for a
particular snap?

We should see the hard-limit as something set by the admin with resource
consumption in mind, and at no point should we cross this limit, come what
may. If we hit this limit, the create command should fail, asking the user
to delete snaps using the snapshot delete command.

The two options Raghavendra mentioned are applicable to the soft-limit only;
on hitting the soft-limit we either

1. Trigger auto-delete

or

2. Log a warning message for the user, saying the number of snaps is
exceeding the snap-limit, and display the number of available snaps.

Now, which of these should happen also depends on the user, because the
auto-delete option is configurable.

So if the auto-delete option is set to true, auto-delete should be
triggered and the above message should also be logged.

But if the option is set to false, only the message should be logged.

This is the behaviour as designed. Adding Rahul and Seema to the mail,
to reflect upon the behaviour as well.

Regards,
Avra

This sounds correct. However, we need to make sure that the usage and
documentation around this are good enough, so that users understand each
of the limits correctly.


It might be better to avoid the usage of the term soft-limit.
soft-limit as used in quota and other places generally has an alerting
connotation. Something like auto-deletion-limit might be better.


I still see references to soft-limit and auto deletion seems to get
triggered upon reaching soft-limit.

Why is the ability to auto delete not configurable? It does seem pretty
nasty to go about deleting snapshots without obtaining explicit consent
from the user.

I agree with Vijay here. It's not good to delete a snap (even the oldest one)
without explicit consent from the user.

FYI, it took me more than 2 weeks to figure out that my snaps were getting
auto-deleted after reaching the soft-limit. As far as I knew, I had not done
anything, and yet my snap restores were failing.

I propose to remove the terms soft and hard limit. I believe there should be
a single limit (just "limit") after which all snapshot creates should fail
with proper error messages, and there can be a water-mark after which the
user should get warning messages. So below is my proposal.

auto-delete + snap-limit: If the snap-limit is set to n, the next snap create
(the (n+1)th) will succeed only if auto-delete is set to on/true/1, in which
case the oldest snap will get deleted automatically. If auto-delete is set to
off/false/0, the (n+1)th snap create will fail with a proper error message
from the gluster CLI command. But again, by default auto-delete should be off.

snap-water-mark: This should come into the picture only if auto-delete is
turned off; it should have no meaning if auto-delete is turned ON. Basically,
its purpose is to warn the user that the limit is almost reached and that it
is time for the admin to decide which snaps should be deleted (or which
should be kept).

*my two cents*

-MS
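
For illustration only, a rough sketch of the decision logic being proposed
above; the function and option names below (snap_create_precheck, snap_limit,
water_mark, auto_delete) are hypothetical and are not existing gluster
options or code.

<code>
#include <stdbool.h>

/* Returns 0 if the next snapshot create may proceed, -1 if it must fail. */
static int
snap_create_precheck (int snap_count, int snap_limit, int water_mark,
                      bool auto_delete)
{
        if (snap_count >= snap_limit) {
                if (auto_delete)
                        return 0;   /* delete the oldest snap, then create */
                return -1;          /* fail with a proper error from the CLI */
        }

        /* The water-mark only matters when auto-delete is off. */
        if (!auto_delete && snap_count >= water_mark) {
                /* log a warning: the limit is almost reached; the admin
                 * should decide which snaps to delete (or keep) */
        }

        return 0;
}
</code>

With snap_limit = 10 and auto_delete off, the 11th create would be refused
outright instead of silently reaping the oldest snapshot.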


The reason for having a hard-limit is to stop snapshot creation once we have
reached this limit. This helps keep resource consumption under control.
Therefore, if we only have this one limit (as snap-limit), then there is no
question of auto-delete: auto-delete can only be triggered once the count
crosses the limit. That is why we introduced the concept of a soft-limit and
a hard-limit. As the name suggests, once the hard-limit is reached no more
snaps will be created.


Perhaps I could have been clearer. The auto-delete value does come into

Re: [Gluster-devel] [Gluster-users] Need testers for GlusterFS 3.4.4

2014-06-04 Thread Justin Clift
On 04/06/2014, at 6:33 AM, Pranith Kumar Karampuri wrote:
 On 06/04/2014 01:35 AM, Ben Turner wrote:
 Sent: Thursday, May 29, 2014 6:12:40 PM
snip
 FSSANITY_TEST_LIST: arequal bonnie glusterfs_build compile_kernel dbench dd 
 ffsb fileop fsx fs_mark iozone locks ltp multiple_files posix_compliance 
 postmark read_large rpc syscallbench tiobench
 
 I am starting on NFS now, I'll have results tonight or tomorrow morning.  
 I'll look at updating the component scripts to work and run them as well.
 Thanks a lot for this, Ben.
 
 Justin, Ben,
 Do you think we can automate running of these scripts without a lot of 
 human intervention? If yes, how can I help?
 
 We can use that just before making any release in future :-).


It's a decent idea.  :)

Do you have time to get this up and running?

+ Justin

--
Open Source and Standards @ Red Hat

twitter.com/realjustinclift

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] [Gluster-users] Need testers for GlusterFS 3.4.4

2014-06-04 Thread Justin Clift
On 03/06/2014, at 9:05 PM, Ben Turner wrote:
snip
 So far so good on 3.4.4, sorry for the delay here.  I had to fix my 
 downstream test suites to run outside of RHS / downstream gluster.  I did 
 basic sanity testing on glusterfs mounts including:
 
 FSSANITY_TEST_LIST: arequal bonnie glusterfs_build compile_kernel dbench dd 
 ffsb fileop fsx fs_mark iozone locks ltp multiple_files posix_compliance 
 postmark read_large rpc syscallbench tiobench
 
 I am starting on NFS now, I'll have results tonight or tomorrow morning.  
 I'll look at updating the component scripts to work and run them as well.


Out of curiosity, do you have the time/inclination to test 3.5.1beta1
as well? :)

+ Justin

--
Open Source and Standards @ Red Hat

twitter.com/realjustinclift

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] rackspace-regression job history disappeared?

2014-06-04 Thread Justin Clift
Good news.  After reloading the Jenkins configuration from disk the other
day, the complete job history isn't disappearing any more.

+ Justin

On 30/05/2014, at 8:44 PM, Justin Clift wrote:
 As an FYI, there weren't any jobs running on build.gluster.org, so
 I hit the "reload configuration from disk" button.  All of the historical
 jobs for rackspace-regression are now visible through the UI.
 
 No idea how long for though. ;)
 
 + Justin

--
Open Source and Standards @ Red Hat

twitter.com/realjustinclift

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] [Gluster-users] Need testers for GlusterFS 3.4.4

2014-06-04 Thread Justin Clift
On 04/06/2014, at 3:14 PM, Ben Turner wrote:
 - Original Message -
 From: Justin Clift jus...@gluster.org
 To: Pranith Kumar Karampuri pkara...@redhat.com
 Cc: Ben Turner btur...@redhat.com, gluster-us...@gluster.org, Gluster 
 Devel gluster-devel@gluster.org
 Sent: Wednesday, June 4, 2014 9:35:47 AM
 Subject: Re: [Gluster-users] [Gluster-devel] Need testers for GlusterFS 3.4.4
 
 On 04/06/2014, at 6:33 AM, Pranith Kumar Karampuri wrote:
 On 06/04/2014 01:35 AM, Ben Turner wrote:
 Sent: Thursday, May 29, 2014 6:12:40 PM
 snip
 FSSANITY_TEST_LIST: arequal bonnie glusterfs_build compile_kernel dbench
 dd ffsb fileop fsx fs_mark iozone locks ltp multiple_files
 posix_compliance postmark read_large rpc syscallbench tiobench
 
 I am starting on NFS now, I'll have results tonight or tomorrow morning.
 I'll look at updating the component scripts to work and run them as well.
 Thanks a lot for this, Ben.
 
 Justin, Ben,
Do you think we can automate running of these scripts without a lot of
human intervention? If yes, how can I help?
 
 We can use that just before making any release in future :-).
 
 
 It's a decent idea.  :)
 
 Do you have time to get this up and running?
 
 Yep, can do.  I'll see what else I can get going as well, I'll start with the 
 sanity tests I mentioned above and go from there.  How often do we want these 
 run?  Daily?  Weekly?  On GIT checkin?  Only on RC?

As often as practical, given the hardware resources available atm.

On git checkout would be great, but may not be practical (unsure). ;)

+ Justin

--
Open Source and Standards @ Red Hat

twitter.com/realjustinclift

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] [Gluster-users] Need testers for GlusterFS 3.4.4

2014-06-04 Thread Pranith Kumar Karampuri


On 06/04/2014 07:44 PM, Ben Turner wrote:

- Original Message -

From: Justin Clift jus...@gluster.org
To: Pranith Kumar Karampuri pkara...@redhat.com
Cc: Ben Turner btur...@redhat.com, gluster-us...@gluster.org, Gluster Devel 
gluster-devel@gluster.org
Sent: Wednesday, June 4, 2014 9:35:47 AM
Subject: Re: [Gluster-users] [Gluster-devel] Need testers for GlusterFS 3.4.4

On 04/06/2014, at 6:33 AM, Pranith Kumar Karampuri wrote:

On 06/04/2014 01:35 AM, Ben Turner wrote:

Sent: Thursday, May 29, 2014 6:12:40 PM

snip

FSSANITY_TEST_LIST: arequal bonnie glusterfs_build compile_kernel dbench
dd ffsb fileop fsx fs_mark iozone locks ltp multiple_files
posix_compliance postmark read_large rpc syscallbench tiobench

I am starting on NFS now, I'll have results tonight or tomorrow morning.
I'll look at updating the component scripts to work and run them as well.

Thanks a lot for this, Ben.

Justin, Ben,
 Do you think we can automate running of these scripts without a lot of
 human intervention? If yes, how can I help?

We can use that just before making any release in future :-).


It's a decent idea.  :)

Do you have time to get this up and running?

Yep, can do.  I'll see what else I can get going as well, I'll start with the 
sanity tests I mentioned above and go from there.  How often do we want these 
run?  Daily?  Weekly?  On GIT checkin?  Only on RC?


How long does it take to run them?

Pranith


-b
  

+ Justin

--
Open Source and Standards @ Red Hat

twitter.com/realjustinclift




___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


[Gluster-devel] Reminder: Weekly Gluster Community meeting is in 27 mins

2014-06-04 Thread Justin Clift
Reminder!!!

The weekly Gluster Community meeting is in 30 mins, in
#gluster-meeting on IRC.

This is a completely public meeting; everyone is encouraged
to attend and be a part of it. :)

To add Agenda items
***

Just add them to the main text of the Etherpad, and be at
the meeting. :)

 https://public.pad.fsfe.org/p/gluster-community-meetings

Regards and best wishes,

Justin Clift

--
Open Source and Standards @ Red Hat

twitter.com/realjustinclift
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] [Gluster-users] Need testers for GlusterFS 3.4.4

2014-06-04 Thread Ben Turner
- Original Message -
 From: Justin Clift jus...@gluster.org
 To: Pranith Kumar Karampuri pkara...@redhat.com
 Cc: Ben Turner btur...@redhat.com, gluster-us...@gluster.org, Gluster 
 Devel gluster-devel@gluster.org
 Sent: Wednesday, June 4, 2014 9:35:47 AM
 Subject: Re: [Gluster-users] [Gluster-devel] Need testers for GlusterFS 3.4.4
 
 On 04/06/2014, at 6:33 AM, Pranith Kumar Karampuri wrote:
  On 06/04/2014 01:35 AM, Ben Turner wrote:
  Sent: Thursday, May 29, 2014 6:12:40 PM
 snip
  FSSANITY_TEST_LIST: arequal bonnie glusterfs_build compile_kernel dbench
  dd ffsb fileop fsx fs_mark iozone locks ltp multiple_files
  posix_compliance postmark read_large rpc syscallbench tiobench
  
  I am starting on NFS now, I'll have results tonight or tomorrow morning.
  I'll look at updating the component scripts to work and run them as well.
  Thanks a lot for this, Ben.
  
  Justin, Ben,
  Do you think we can automate running of these scripts without a lot of
  human intervention? If yes, how can I help?
  
  We can use that just before making any release in future :-).
 
 
 It's a decent idea.  :)
 
 Do you have time to get this up and running?

Yep, can do.  I'll see what else I can get going as well, I'll start with the 
sanity tests I mentioned above and go from there.  How often do we want these 
run?  Daily?  Weekly?  On GIT checkin?  Only on RC?

-b
 
 + Justin
 
 --
 Open Source and Standards @ Red Hat
 
 twitter.com/realjustinclift
 
 
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel


[Gluster-devel] Need reviewers for these two 3.5.1 beta 2 patches

2014-06-04 Thread Justin Clift
Hi all,

We need some people to review:

  http://review.gluster.org/#/c/7963/

and:

  http://review.gluster.org/#/c/7978/

If that gets done, we can release 3.5.1 beta 2 this week.

  e.g. if anyone has the time, that would be directly helpful :)

+ Justin

--
Open Source and Standards @ Red Hat

twitter.com/realjustinclift

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-devel