Re: [Gluster-devel] netbsd regression logs
Atin Mukherjee <amukh...@redhat.com> wrote:

> Is this reproducible in netbsd every time? If yes, I would need a VM
> to further debug it.

nbslave78.cloud.gluster.org

Note that it failed a lot of jobs yesterday. I do not know why, but I
am not sure the system is the culprit: nbslave7a exhibited the same
behavior and is now fine, while I did nothing for it.

> I am guessing that the reason for the other failure, from
> tests/geo-rep/georep-setup.t, is the same. Is it a new regression
> failure?

Yes, I would say it started on April 30th, but it is not obvious to
tell, as NetBSD regression was already broken by the ENOKEY change.

-- 
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
m...@netbsd.org

___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] spurious failures in tests/basic/afr/sparse-file-self-heal.t
On 05/02/2015 10:14 AM, Krishnan Parthasarathi wrote:
> If glusterd itself fails to come up, of course the test will fail :-).
> Is it still happening?
> Pranith,
> Did you get a chance to look at the glusterd logs and find why
> glusterd didn't come up? Please paste the relevant logs in this
> thread.

No :-(. The etherpad doesn't have any links :-(. Justin, any help here?

Pranith
Re: [Gluster-devel] netbsd regression logs
On 05/02/2015 09:08 AM, Atin Mukherjee wrote:
> On 05/02/2015 08:54 AM, Emmanuel Dreyfus wrote:
>> Pranith Kumar Karampuri <pkara...@redhat.com> wrote:
>>> Seems like glusterd failure from the looks of it: +glusterd folks.
>>> Running tests in file ./tests/basic/cdc.t
>>> volume delete: patchy: failed: Another transaction is in progress
>>> for patchy. Please try again after sometime.
>>> [18:16:40] ./tests/basic/cdc.t .. not ok 52
>>
>> This is a volume stop that fails. Logs say a lock is held by a UUID
>> which happens to be the volume's own UUID. I tried git bisect and it
>> seems to be related to http://review.gluster.org/9918, but I am not
>> completely sure (I may have botched my git bisect).
>
> I'm looking into this.

Looking at the logs, here are the findings:

- gluster volume stop timed out at the cli, because of which
  cmd_history.log didn't capture it.
- glusterd acquired the volume lock in volume stop but somehow didn't
  release it, as gluster v delete failed saying another transaction is
  in progress.
- For the gluster volume stop transaction I could see that
  glusterd_nfssvc_stop was triggered, but after that nothing was
  logged for almost two minutes. The catch here is that by this time
  volinfo->status should have been marked as stopped and persisted to
  disk, but gluster v info didn't reflect the same.

Is this reproducible in netbsd every time? If yes, I would need a VM
to further debug it. I am guessing that the reason for the other
failure, from tests/geo-rep/georep-setup.t, is the same. Is it a new
regression failure?

~Atin
Re: [Gluster-devel] netbsd regression logs
Emmanuel Dreyfus <m...@netbsd.org> wrote:

> Note that it failed a lot of jobs yesterday. I do not know why, but I
> am not sure the system is the culprit: nbslave7a exhibited the same
> behavior and is now fine, while I did nothing for it.

I think it is git.gluster.org that misbehaved: I started
/autobuild/autobuild.sh on nbslave78 and it seems fine.

-- 
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
m...@netbsd.org
Re: [Gluster-devel] netbsd regression update : georep-setup.t
On 05/02/2015 11:59 AM, Atin Mukherjee wrote:
> On 05/02/2015 09:08 AM, Atin Mukherjee wrote:
>> On 05/02/2015 08:54 AM, Emmanuel Dreyfus wrote:
>>> Pranith Kumar Karampuri <pkara...@redhat.com> wrote:
>>>> Seems like glusterd failure from the looks of it: +glusterd folks.
>>>> Running tests in file ./tests/basic/cdc.t
>>>> volume delete: patchy: failed: Another transaction is in progress
>>>> for patchy. Please try again after sometime.
>>>> [18:16:40] ./tests/basic/cdc.t .. not ok 52
>>>
>>> This is a volume stop that fails. Logs say a lock is held by a UUID
>>> which happens to be the volume's own UUID. I tried git bisect and
>>> it seems to be related to http://review.gluster.org/9918, but I am
>>> not completely sure (I may have botched my git bisect).
>>
>> I'm looking into this.
>
> Looking at the logs, here are the findings:
> - gluster volume stop timed out at the cli, because of which
>   cmd_history.log didn't capture it.
> - glusterd acquired the volume lock in volume stop but somehow didn't
>   release it, as gluster v delete failed saying another transaction
>   is in progress.
> - For the gluster volume stop transaction I could see that
>   glusterd_nfssvc_stop was triggered, but after that nothing was
>   logged for almost two minutes. The catch here is that by this time
>   volinfo->status should have been marked as stopped and persisted to
>   disk, but gluster v info didn't reflect the same.
>
> Is this reproducible in netbsd every time? If yes, I would need a VM
> to further debug it. I am guessing that the reason for the other
> failure, from tests/geo-rep/georep-setup.t, is the same. Is it a new
> regression failure?

Although I couldn't reproduce the cdc.t failure, georep-setup.t failed
consistently, and the glusterd backtrace showed that it hangs in
gverify.sh when gsync_create is executed. Since this script was called
through the runner framework, the big lock had been released by that
time, but the same thread didn't reacquire the big lock and eventually
didn't release the cluster-wide lock. Because of this, subsequent
glusterd commands failed with "another transaction is in progress".
Ccing the Geo-rep team for further analysis. Backtrace for your
reference:

Thread 3 (LWP 5):
#0  0xbb35e577 in _sys___wait450 () from /usr/lib/libc.so.12
#1  0xbb689e71 in __wait450 () from /usr/lib/libpthread.so.1
#2  0xbb3cba3b in waitpid () from /usr/lib/libc.so.12
#3  0xbb798f0f in runner_end_reuse (runner=0xb86fd828) at run.c:345
#4  0xbb798fa4 in runner_end (runner=0xb86fd828) at run.c:366
#5  0xbb799043 in runner_run_generic (runner=0xb86fd828,
    rfin=0xbb798f72 <runner_end>) at run.c:386
#6  0xbb799088 in runner_run (runner=0xb86fd828) at run.c:392
#7  0xb922d1dc in glusterd_verify_slave (volname=0xb8216cb0 "master",
    slave_url=0xb8201e90 "nbslave70.cloud.gluster.org",
    slave_vol=0xb821b170 "slave", op_errstr=0xb86ff5ec,
    is_force_blocker=0xb86fd92c) at glusterd-geo-rep.c:2075
#8  0xb922ddfb in glusterd_op_stage_gsync_create (dict=0xb9c07ad0,
    op_errstr=0xb86ff5ec) at glusterd-geo-rep.c:2300
#9  0xb91cfcb6 in glusterd_op_stage_validate (op=GD_OP_GSYNC_CREATE,
    dict=0xb9c07ad0, op_errstr=0xb86ff5ec, rsp_dict=0xb9c07b80)
    at glusterd-op-sm.c:4932
#10 0xb9255a34 in gd_stage_op_phase (op=GD_OP_GSYNC_CREATE,
    op_ctx=0xb9c077b8, req_dict=0xb9c07ad0, op_errstr=0xb86ff5ec,
    txn_opinfo=0xb86ff598) at glusterd-syncop.c:1182
#11 0xb92570d3 in gd_sync_task_begin (op_ctx=0xb9c077b8,
    req=0xb8f40040) at glusterd-syncop.c:1745
#12 0xb9257309 in glusterd_op_begin_synctask (req=0xb8f40040,
    op=GD_OP_GSYNC_CREATE, dict=0xb9c077b8) at glusterd-syncop.c:1804
#13 0xb9227bbc in __glusterd_handle_gsync_set (req=0xb8f40040)
    at glusterd-geo-rep.c:334
#14 0xb91b29c1 in glusterd_big_locked_handler (req=0xb8f40040,
    actor_fn=0xb92275e4 <__glusterd_handle_gsync_set>)
    at glusterd-handler.c:83
#15 0xb9227cb9 in glusterd_handle_gsync_set (req=0xb8f40040)
    at glusterd-geo-rep.c:362
#16 0xbb78992f in synctask_wrap (old_task=0xb8d3d000) at syncop.c:375
#17 0xbb385630 in ?? () from /usr/lib/libc.so.12

Emmanuel, if you happen to see the cdc.t failure again, please ring a
bell :)

~Atin
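[Editor's note] The failure mode described above — a thread takes the cluster-wide lock and then blocks indefinitely in an external script, so the unlock path is never reached — can be sketched as a toy model. This is illustrative Python, not glusterd code; the names are hypothetical:

```python
import threading
import time

# Stands in for glusterd's cluster-wide volume lock.
cluster_lock = threading.Lock()

def gsync_create_txn():
    """Toy model of the hung gsync_create transaction."""
    cluster_lock.acquire()   # take the cluster-wide lock
    time.sleep(3600)         # stands in for waitpid() on gverify.sh hanging
    cluster_lock.release()   # unlock path never reached in practice

t = threading.Thread(target=gsync_create_txn, daemon=True)
t.start()
time.sleep(0.2)  # give the transaction time to grab the lock

# Any subsequent transaction now fails to get the lock, which is what
# the CLI reports as "Another transaction is in progress".
print(cluster_lock.acquire(blocking=False))  # prints False
```

The fix direction implied by the analysis is to either reacquire and release the lock on every path, or bound the external call so it cannot block forever.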
Re: [Gluster-devel] netbsd regression update : georep-setup.t
Atin Mukherjee <amukh...@redhat.com> wrote:

> it hangs on gverify.sh

While there, for the sake of portability it should not depend on bash.
While it is reasonable to expect bash to be installed for running the
test suite, IMO the non-test stuff should try to lower dependencies,
and bash is easy to avoid. Here I see 3 points:

1) Remove the function keyword. POSIX shell defines a function like
   this: foo() { }

2) Avoid [[ ]] evaluations. They are the same as [ ] with the locale
   used, and it is not obvious the locale is needed in that script.
   Replacing [[ ]] by [ ] should do it.

3) Avoid /dev/tcp usage. Here it is not obvious how, without
   introducing another dependency (on netcat, for instance). But since
   we probe the port before trying to run ssh, we could just give up
   on the probe and run ssh with -oConnectTimeout so that it does not
   hang forever.

-- 
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
m...@netbsd.org
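[Editor's note] The three portability points above can be sketched in POSIX sh. The function name and argument handling below are hypothetical, not code from the actual gverify.sh:

```shell
#!/bin/sh
# Sketch of the proposed bash-to-POSIX fixes.

# 1) POSIX function definition: no bash-only "function" keyword.
slave_reachable() {
    slave_host=$1

    # 2) POSIX single-bracket test instead of bash's [[ ]].
    if [ -z "$slave_host" ]; then
        echo "usage: slave_reachable <host>" >&2
        return 2
    fi

    # 3) No bash-only /dev/tcp port probe: skip the probe entirely and
    #    let ssh give up on its own instead of hanging forever.
    ssh -oConnectTimeout=5 -oBatchMode=yes "$slave_host" true
}
```

With a bounded ConnectTimeout, an unreachable slave makes ssh fail after a few seconds instead of blocking the calling glusterd thread indefinitely.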