On Tue, Feb 2, 2021 at 8:14 PM Michael Scherer <msche...@redhat.com> wrote:
> Hi,
>
> so we finally found the cause of the georep failure, after several days
> of work from Deepshika and me.
>
> Short story:
> ============
>
> side effect of adding libtirpc-devel on EL 7:
> https://github.com/gluster/project-infrastructure/issues/115

Looking at
https://github.com/gluster/glusterfs-patch-acceptance-tests/pull/191 - we
weren't supposed to use it?
From
https://github.com/gluster/glusterfs/blob/d1d7a6f35c816822fab51c820e25023863c239c1/glusterfs.spec.in#L61
:

# Do not use libtirpc on EL6, it does not have xdr_uint64_t() and xdr_uint32_t
# Do not use libtirpc on EL7, it does not have xdr_sizeof()
%if ( 0%{?rhel} && 0%{?rhel} <= 7 )
%global _without_libtirpc --without-libtirpc
%endif

CentOS 7 has an ancient version, CentOS 8 has a newer version, so perhaps
just on CentOS 8 slaves?
Y.

>
> Long story:
> ===========
>
> So we first puzzled over why it was failing just on some builders and not
> others, especially since it was working fine on softserve VMs.
>
> We tried to look for the usual suspects, rebooted, reinstalled, searched
> for something weird (too many ssh keys, not enough inodes, some
> hardware issue), but found nothing obvious.
>
> After trying to find my way in the log files and a few weird leads
> (like, why was gsyncd running gcc? (answer: ctypes)), I was left with
> a rather cryptic message:
>
> [2021-02-02 15:19:00.040817 +0000] I [socket.c:929:__socket_server_bind] 0-socket.gfchangelog: closing (AF_UNIX) reuse check socket 18
> [2021-02-02 15:19:02.041641 +0000] W [xdr-rpcclnt.c:68:rpc_request_to_xdr] 0-rpc: failed to encode call msg
> [2021-02-02 15:19:02.041673 +0000] E [rpc-clnt.c:1507:rpc_clnt_record_build_record] 0-gfchangelog: Failed to build record header
> [2021-02-02 15:19:02.041683 +0000] W [rpc-clnt.c:1664:rpc_clnt_submit] 0-gfchangelog: cannot build rpc-record
> [2021-02-02 15:19:02.041692 +0000] E [MSGID: 132023] [gf-changelog.c:285:gf_changelog_setup_rpc] 0-gfchangelog: Could not initiate probe RPC, bailing out!!!
> [2021-02-02 15:19:02.041809 +0000] E [MSGID: 132022] [gf-changelog.c:583:gf_changelog_register_generic] 0-gfchangelog: Error registering with changelog xlator
>
> Given that everything in gluster is built around RPC, it would be unlikely
> that rpc is broken, but those are the only messages we had.
>
>
> We also found that the only builder that was working was builder 210.
> Upon looking, we found that 210 failed to be updated with ansible, due
> to some debugging we forgot to revert, which made this task fail:
>
> https://github.com/gluster/gluster.org_ansible_configuration/blob/master/roles/gluster_qa_scripts/tasks/main.yml#L7
>
> But it wasn't clear how that would change anything, since the only diff
> was a "set -e" that wasn't removed.
>
> Then Deepshika started to test more than georep, and she noticed that a
> lot of other tests were failing, with the exact same message about
> rpc.
>
> And she started to wonder if anything was recently changed. And indeed:
>
> # rpm -qa --last | head -n 15
> yum-plugin-auto-update-debug-info-1.1.31-54.el7_8.noarch  mar. 02 févr. 2021 14:04:59 UTC
> python3-debuginfo-3.6.8-18.el7.x86_64                     mar. 02 févr. 2021 14:04:58 UTC
> glibc-debuginfo-2.17-317.el7.x86_64                       mar. 02 févr. 2021 14:04:57 UTC
> glibc-debuginfo-common-2.17-317.el7.x86_64                mar. 02 févr. 2021 14:04:53 UTC
> gpg-pubkey-b6792c39-53c4fbdd                              mar. 02 févr. 2021 14:04:34 UTC
> tzdata-java-2021a-1.el7.noarch                            mer. 27 janv. 2021 09:09:27 UTC
> tzdata-2021a-1.el7.noarch                                 mer. 27 janv. 2021 09:09:26 UTC
> sudo-1.8.23-10.el7_9.1.x86_64                             mer. 27 janv. 2021 09:09:26 UTC
> libtirpc-devel-0.2.4-0.16.el7.x86_64                      mar. 26 janv. 2021 12:53:45 UTC
> java-1.8.0-openjdk-1.8.0.282.b08-1.el7_9.x86_64           mar. 26 janv. 2021 05:06:44 UTC
>
> We added libtirpc-devel on 26/01.
>
> libtirpc-devel would, as the name implies, change something around the
> rpc subsystem.
>
> It happened around last week, when we started to notice the problem.
>
> It was not applied to 210, because 210 failed before it got to that
> point (since ansible stops as soon as the git update fails, and the jenkins
> builder role comes after the gluster-qa-script update).
>
> It was not applied to the softserve-provided VMs either, so tests were
> working fine there.
>
> And indeed, once the package got removed, the tests were working again.
>
> Follow up
> =========
>
> So, I would like to know exactly what should be tested. Is gluster not
> compatible with libtirpc on C7 (as it works on C8), or is there some
> weird issue? (because from what I remember, the RPC format is supposed to
> be compatible and covered by a specification)
>
> Should we test on C8 only?
>
>
> --
> Michael Scherer / He/Il/Er/Él
> Sysadmin, Community Infrastructure
>
>
>
> -------
>
> Community Meeting Calendar:
> Schedule -
> Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
> Bridge: https://meet.google.com/cpu-eiue-hvk
>
> Gluster-devel mailing list
> gluster-de...@gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-devel
>
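
As a side note on the debugging approach above: comparing the package set of a working builder (such as 210) against a failing one might surface a stray package faster than reading `rpm -qa --last` by eye. This is only a rough sketch, assuming each list is the output of `rpm -qa` collected on the respective builder. The function name `packages_only_on` and the sample lists are hypothetical and do not reflect the actual state of any builder.

```python
def packages_only_on(failing, working):
    """Return packages present on the failing builder but absent from the working one."""
    return sorted(set(failing) - set(working))


# Illustrative input only; real lists would come from `rpm -qa` on each builder.
working_builder = [
    "sudo-1.8.23-10.el7_9.1.x86_64",
    "tzdata-2021a-1.el7.noarch",
]
failing_builder = working_builder + ["libtirpc-devel-0.2.4-0.16.el7.x86_64"]

suspects = packages_only_on(failing_builder, working_builder)
print(suspects)
```

The comparison is on full package strings, so a version bump on one side would also show up as a difference, which is probably what one wants when hunting for a recent change.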