Re: [Gluster-devel] netbsd regression logs

2015-05-02 Thread Emmanuel Dreyfus
Atin Mukherjee amukh...@redhat.com wrote:

 Is this reproducible on NetBSD every time? If yes, I would need a VM to
 debug it further.

nbslave78.cloud.gluster.org

Note that it failed a lot of jobs yesterday. I do not know why, but I am
not sure the system is the culprit: nbslave7a exhibited the same
behavior and is now fine even though I did nothing to it.

 I am guessing the other failure, from tests/geo-rep/georep-setup.t, has
 the same cause. Is it a new regression failure?

Yes, I would say it started on April 30th, but it is not obvious to tell
as NetBSD regression was already broken by the ENOKEY change.

-- 
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
m...@netbsd.org
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] spurious failures in tests/basic/afr/sparse-file-self-heal.t

2015-05-02 Thread Pranith Kumar Karampuri


On 05/02/2015 10:14 AM, Krishnan Parthasarathi wrote:

 If glusterd itself fails to come up, of course the test will fail :-). Is it
 still happening?

 Pranith,

 Did you get a chance to see glusterd logs and find why glusterd didn't come up?
 Please paste the relevant logs in this thread.

No :-(. The etherpad doesn't have any links :-(.
Justin, any help here?

Pranith






Re: [Gluster-devel] netbsd regression logs

2015-05-02 Thread Atin Mukherjee


On 05/02/2015 09:08 AM, Atin Mukherjee wrote:
 
 
 On 05/02/2015 08:54 AM, Emmanuel Dreyfus wrote:
 Pranith Kumar Karampuri pkara...@redhat.com wrote:

 Seems like glusterd failure from the looks of it: +glusterd folks.

 Running tests in file ./tests/basic/cdc.t
 volume delete: patchy: failed: Another transaction is in progress for
 patchy. Please try again after sometime.
 [18:16:40] ./tests/basic/cdc.t ..
 not ok 52

 This is a volume stop that fails. Logs say a lock is held by a UUID
 which happens to be the volume's own UUID. 

 I tried git bisect and it seems to be related to
 http://review.gluster.org/9918 but I am not completely sure (I may have
 botched my git bisect)
 
 I'm looking into this.
Looking at the logs, here are the findings:

- gluster volume stop timed out at the CLI, because of which
cmd_history.log didn't capture it.
- glusterd acquired the volume lock in volume stop but somehow didn't
release it, as gluster v delete failed saying another transaction is in
progress.
- For the gluster volume stop transaction I could see glusterd_nfssvc_stop
was triggered, but after that it didn't log anything for almost two
minutes. The catch here is that by this time volinfo->status should
have been marked as stopped and persisted to disk, but gluster v
info didn't reflect the same.

Is this reproducible on NetBSD every time? If yes, I would need a VM to
debug it further. I am guessing the other failure, from
tests/geo-rep/georep-setup.t, has the same cause. Is it a new regression failure?

~Atin

 

-- 
~Atin


Re: [Gluster-devel] netbsd regression logs

2015-05-02 Thread Emmanuel Dreyfus
Emmanuel Dreyfus m...@netbsd.org wrote:

 Note that it failed a lot of jobs yesterday. I do not know why, but I am
 not sure the system is the culprit: nbslave7a exhibited the same
 behavior and is now fine even though I did nothing to it.

I think it is git.gluster.org that misbehaved: I started
/autobuild/autobuild.sh on nbslave78 and it seems fine.

-- 
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
m...@netbsd.org


Re: [Gluster-devel] netbsd regression update : georep-setup.t

2015-05-02 Thread Atin Mukherjee

On 05/02/2015 11:59 AM, Atin Mukherjee wrote:
 
 
 On 05/02/2015 09:08 AM, Atin Mukherjee wrote:


 On 05/02/2015 08:54 AM, Emmanuel Dreyfus wrote:
 Pranith Kumar Karampuri pkara...@redhat.com wrote:

 Seems like glusterd failure from the looks of it: +glusterd folks.

 Running tests in file ./tests/basic/cdc.t
 volume delete: patchy: failed: Another transaction is in progress for
 patchy. Please try again after sometime.
 [18:16:40] ./tests/basic/cdc.t ..
 not ok 52

 This is a volume stop that fails. Logs say a lock is held by a UUID
 which happens to be the volume's own UUID. 

 I tried git bisect and it seems to be related to
 http://review.gluster.org/9918 but I am not completely sure (I may have
 botched my git bisect)

 I'm looking into this.
 Looking at the logs, here are the findings:
 
 - gluster volume stop timed out at the CLI, because of which
 cmd_history.log didn't capture it.
 - glusterd acquired the volume lock in volume stop but somehow didn't
 release it, as gluster v delete failed saying another transaction is in
 progress.
 - For the gluster volume stop transaction I could see glusterd_nfssvc_stop
 was triggered, but after that it didn't log anything for almost two
 minutes. The catch here is that by this time volinfo->status should
 have been marked as stopped and persisted to disk, but gluster v
 info didn't reflect the same.
 
 Is this reproducible on NetBSD every time? If yes, I would need a VM to
 debug it further. I am guessing the other failure, from
 tests/geo-rep/georep-setup.t, has the same cause. Is it a new regression failure?
Although I couldn't reproduce the cdc.t failure, georep-setup.t failed
consistently, and the glusterd backtrace showed that it hangs in gverify.sh
when gsync_create is executed. Since this script was called through the
runner framework, the big lock had been released by that time, and the same
thread didn't reacquire the big lock, so it eventually didn't release the
cluster-wide lock. Because of this, subsequent glusterd commands failed
with another transaction is in progress.

CCing the Geo-rep team for further analysis. Backtrace for your reference:

Thread 3 (LWP 5):
#0  0xbb35e577 in _sys___wait450 () from /usr/lib/libc.so.12
#1  0xbb689e71 in __wait450 () from /usr/lib/libpthread.so.1
#2  0xbb3cba3b in waitpid () from /usr/lib/libc.so.12
#3  0xbb798f0f in runner_end_reuse (runner=0xb86fd828) at run.c:345
#4  0xbb798fa4 in runner_end (runner=0xb86fd828) at run.c:366
#5  0xbb799043 in runner_run_generic (runner=0xb86fd828, rfin=0xbb798f72
runner_end)
at run.c:386
#6  0xbb799088 in runner_run (runner=0xb86fd828) at run.c:392
#7  0xb922d1dc in glusterd_verify_slave (volname=0xb8216cb0 "master",
slave_url=0xb8201e90 "nbslave70.cloud.gluster.org",
slave_vol=0xb821b170 "slave",
op_errstr=0xb86ff5ec, is_force_blocker=0xb86fd92c) at
glusterd-geo-rep.c:2075
#8  0xb922ddfb in glusterd_op_stage_gsync_create (dict=0xb9c07ad0,
op_errstr=0xb86ff5ec)
at glusterd-geo-rep.c:2300
#9  0xb91cfcb6 in glusterd_op_stage_validate (op=GD_OP_GSYNC_CREATE,
dict=0xb9c07ad0,
op_errstr=0xb86ff5ec, rsp_dict=0xb9c07b80) at glusterd-op-sm.c:4932
#10 0xb9255a34 in gd_stage_op_phase (op=GD_OP_GSYNC_CREATE,
op_ctx=0xb9c077b8,
req_dict=0xb9c07ad0, op_errstr=0xb86ff5ec, txn_opinfo=0xb86ff598)
at glusterd-syncop.c:1182
#11 0xb92570d3 in gd_sync_task_begin (op_ctx=0xb9c077b8, req=0xb8f40040)
at glusterd-syncop.c:1745
#12 0xb9257309 in glusterd_op_begin_synctask (req=0xb8f40040,
op=GD_OP_GSYNC_CREATE,
dict=0xb9c077b8) at glusterd-syncop.c:1804
#13 0xb9227bbc in __glusterd_handle_gsync_set (req=0xb8f40040) at
glusterd-geo-rep.c:334
#14 0xb91b29c1 in glusterd_big_locked_handler (req=0xb8f40040,
actor_fn=0xb92275e4 __glusterd_handle_gsync_set) at
glusterd-handler.c:83
#15 0xb9227cb9 in glusterd_handle_gsync_set (req=0xb8f40040) at
glusterd-geo-rep.c:362
#16 0xbb78992f in synctask_wrap (old_task=0xb8d3d000) at syncop.c:375
#17 0xbb385630 in ?? () from /usr/lib/libc.so.12


Emmanuel,

If you happen to see the cdc.t failure again, please ring a bell :)

~Atin
 
-- 
~Atin


Re: [Gluster-devel] netbsd regression update : georep-setup.t

2015-05-02 Thread Emmanuel Dreyfus
Atin Mukherjee amukh...@redhat.com wrote:

 it hangs on gverify.sh

While there, for the sake of portability it should not depend on bash.
While it is reasonable to expect bash to be installed for running the
test suite, IMO the non-test stuff should try to minimize dependencies,
and bash is easy to avoid.

Here I see 3 points:

1) Remove the function keyword. POSIX shell defines a function like this:

foo() {
    ...
}

2) Avoid [[ ]] evaluations. They are the same as [ ] with the locale
applied, and it is not obvious a locale is needed in that script.
Replacing [[ ]] with [ ] should do it.
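Taken together, points 1 and 2 amount to a mechanical rewrite. A minimal sketch (the function name and the string test are illustrative, not taken from gverify.sh):

```shell
#!/bin/sh
# bash-only form (non-portable):
#   function is_ok { [[ $1 == OK ]] && echo yes || echo no; }
# POSIX-portable equivalent: no "function" keyword, plain [ ] test.
is_ok() {
    if [ "$1" = "OK" ]; then
        echo yes
    else
        echo no
    fi
}

is_ok OK      # prints "yes"
is_ok FAIL    # prints "no"
```

This runs identically under bash, NetBSD's /bin/sh, dash, and ksh.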

3) Avoid /dev/tcp usage.

Here it is not obvious how, without introducing another dependency (on
netcat, for instance). But since we probe the port before trying to run
ssh, we could just give up on the probe and run ssh with -oConnectTimeout
so that it does not hang forever.
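Point 3 could look something like the following sketch. The host variable and default are placeholders; the real gverify.sh logic differs, and -oBatchMode is an extra assumption to keep ssh from prompting for a password:

```shell
#!/bin/sh
# bash-only reachability probe (non-portable):
#   (exec 3<>"/dev/tcp/$host/22") 2>/dev/null
# Portable alternative: drop the probe entirely and bound ssh itself
# with a connect timeout so it cannot hang forever on a dead host.
host="${1:-slave.invalid}"   # placeholder default for illustration
if ssh -oConnectTimeout=5 -oBatchMode=yes "$host" true 2>/dev/null; then
    echo "ssh to $host succeeded"
else
    echo "ssh to $host failed or timed out"
fi
```

The timeout applies only to connection establishment, which is exactly what the /dev/tcp probe was checking.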

-- 
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
m...@netbsd.org