On Wed, 30 Aug 2017 at 00:23, Shwetha Panduranga <spand...@redhat.com> wrote:
> Hi Shyam, we are already doing it. We wait for the rebalance status to be
> complete. We loop, checking whether the status is complete, for '20'
> minutes or so.

Are you saying that in this test rebalance status was executed multiple
times until it succeeded? If yes, then the test shouldn't have failed. Can
I get access to the complete set of logs?

> -Shwetha
>
> On Tue, Aug 29, 2017 at 7:04 PM, Shyam Ranganathan <srang...@redhat.com>
> wrote:
>
>> On 08/29/2017 09:31 AM, Atin Mukherjee wrote:
>>>
>>> On Tue, Aug 29, 2017 at 4:13 AM, Shyam Ranganathan
>>> <srang...@redhat.com> wrote:
>>>
>>>     Nigel, Shwetha,
>>>
>>>     The latest Glusto run [a] that was started by Nigel, after fixing
>>>     the prior timeout issue, failed again (much later in the run,
>>>     though).
>>>
>>>     I took a look at the logs and my analysis is here [b].
>>>
>>>     @atin, @kaushal, @ppai, can you take a look and see whether the
>>>     analysis is correct?
>>>
>>> I took a look at the logs and here is my theory:
>>>
>>> glusterd starts the rebalance daemon through the runner framework in
>>> nowait mode, which essentially means that even though glusterd reports
>>> success back to the CLI for rebalance start, one of the nodes might
>>> take some additional time to start the rebalance process and establish
>>> the rpc connection. In this case we hit a race: while one of the nodes
>>> was still trying to start the rebalance process, a rebalance status
>>> command was triggered, which failed on that node because the rpc
>>> connection wasn't yet up, and the originator glusterd's commit op
>>> failed with "Received commit RJT from uuid:
>>> 6f9524e6-9f9e-44aa-b2f4-393404adfd9d". To avoid such spurious timeout
>>> issues we check the status in a loop until a certain timeout. Isn't
>>> that the case in Glusto? If my analysis is correct, you shouldn't see
>>> this failure on a second attempt, as it's a race.
>>
>> Thanks Atin.
>>
>> In this case there is no second check or a timed check (a sleep, or an
>> EXPECT_WITHIN-like construct).
>>
>> @Shwetha, can we fix up this test and give it another go?
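For concreteness, here is a minimal sketch of such a loop-until-timeout
check, shelling out to the gluster CLI directly (the helper name, timeouts,
and string matching are illustrative, not Glusto's actual API):

    import subprocess
    import time

    def wait_for_rebalance_complete(volname, timeout=1200, interval=10):
        """Poll 'gluster volume rebalance <vol> status' until no node is
        still running, or until 'timeout' seconds pass.

        A failed status call (e.g. a transient 'Received commit RJT ...'
        while a rebalance daemon is still establishing its rpc
        connection) is treated as retryable rather than fatal.
        """
        deadline = time.time() + timeout
        while time.time() < deadline:
            proc = subprocess.run(
                ["gluster", "volume", "rebalance", volname, "status"],
                capture_output=True, text=True)
            # Succeed only when the command worked and no node is still
            # rebalancing; a stricter version would parse --xml output.
            if (proc.returncode == 0
                    and "completed" in proc.stdout
                    and "in progress" not in proc.stdout):
                return True
            time.sleep(interval)
        return False

With timeout=1200 this matches the 20-minute loop Shwetha describes, and
the transient RJT rejection above would just trigger another iteration
instead of failing the run.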
>>> In short, glusterd got an error when checking the rebalance status
>>> from one of the nodes:
>>> "Received commit RJT from uuid: 6f9524e6-9f9e-44aa-b2f4-393404adfd9d"
>>>
>>> and the rebalance daemon on the node with that UUID was not really
>>> ready to serve requests when the status was requested, hence I am
>>> assuming this is what caused the error. But it needs a once-over by
>>> one of you folks.
>>>
>>> @Shwetha, can we add a further timeout between rebalance start and
>>> checking the status, just so that we avoid this timing issue on these
>>> nodes?
>>>
>>> Thanks,
>>> Shyam
>>>
>>> [a] glusto run:
>>> https://ci.centos.org/view/Gluster/job/gluster_glusto/377/
>>>
>>> [b] analysis of the failure:
>>> https://paste.fedoraproject.org/paste/mk6ynJ0B9AH6H9ncbyru5w
>>>
>>> On 08/25/2017 04:29 PM, Shyam Ranganathan wrote:
>>>
>>>     Nigel was kind enough to kick off a Glusto run on the 3.12 head a
>>>     couple of days back. The status can be seen here [1].
>>>
>>>     The run failed, but managed to get past what Glusto does on master
>>>     (see [2]). Not that this is a consolation, but just stating the
>>>     fact.
>>>
>>>     The run [1] failed at,
>>>     17:05:57 functional/bvt/test_cvt.py::TestGlusterHealSanity_dispersed_glusterfs::test_self_heal_when_io_in_progress FAILED
>>>
>>>     The test case failed due to,
>>>     17:10:28 E AssertionError: ('Volume %s : All process are not
>>>     online', 'testvol_dispersed')
>>>
>>>     The test case can be seen here [3], and the reason for the failure
>>>     is that Glusto did not wait long enough for the down brick to come
>>>     up (it waited for 10 seconds, but the brick came up after 12
>>>     seconds, or within the same second as the check for it being up).
>>>     The log snippets pointing to this problem are here [4]. In short,
>>>     there is no real bug or issue found to have caused the failure as
>>>     yet.
>>>
>>>     Glusto as a gating factor for this release was desirable, but
>>>     having got this far on 3.12 does help.
>>>
>>>     @nigel, we could increase the timeout between bringing the brick
>>>     up and checking that it is up (a sketch of such a wait loop
>>>     follows the thread), and try another run; let me know if that
>>>     works, and what is needed from me to get this going.
>>>
>>>     Shyam
>>>
>>>     [1] Glusto 3.12 run:
>>>     https://ci.centos.org/view/Gluster/job/gluster_glusto/365/
>>>
>>>     [2] Glusto on master:
>>>     https://ci.centos.org/view/Gluster/job/gluster_glusto/360/testReport/functional.bvt.test_cvt/
>>>
>>>     [3] Failed test case:
>>>     https://ci.centos.org/view/Gluster/job/gluster_glusto/365/testReport/functional.bvt.test_cvt/TestGlusterHealSanity_dispersed_glusterfs/test_self_heal_when_io_in_progress/
>>>
>>>     [4] Log analysis pointing to the failed check:
>>>     https://paste.fedoraproject.org/paste/znTPiFLrc2~vsWuoYRToZA
>>>
>>>     "Releases are made better together"

--
- Atin (atinm)
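Similarly, for the brick-up check in the quoted thread above ([3], [4]),
an EXPECT_WITHIN-style wait would tolerate a brick that takes 12 seconds
to come back, instead of asserting after a fixed 10-second sleep. A
minimal sketch, assuming the usual 'gluster volume status <vol> --xml'
layout (one <status> element per brick/daemon row, '1' meaning online);
the helper names are illustrative, not the glusto-tests API:

    import subprocess
    import time
    from xml.etree import ElementTree

    def all_processes_online(volname):
        # True if every process row in 'gluster volume status <vol>
        # --xml' (bricks, self-heal daemons, ...) reports <status>1</status>.
        proc = subprocess.run(
            ["gluster", "volume", "status", volname, "--xml"],
            capture_output=True, text=True)
        if proc.returncode != 0:
            return False
        root = ElementTree.fromstring(proc.stdout)
        statuses = [node.text for node in root.iter("status")]
        return bool(statuses) and all(s == "1" for s in statuses)

    def expect_within(timeout, predicate, *args, interval=2):
        # Re-evaluate the predicate until it holds or the deadline
        # passes, mirroring EXPECT_WITHIN from the shell-based tests.
        deadline = time.time() + timeout
        while time.time() < deadline:
            if predicate(*args):
                return True
            time.sleep(interval)
        return False

    # e.g. allow 60 seconds instead of a fixed 10-second sleep:
    # assert expect_within(60, all_processes_online, "testvol_dispersed")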
_______________________________________________ Gluster-devel mailing list Gluster-devel@gluster.org http://lists.gluster.org/mailman/listinfo/gluster-devel