On Wed, Jan 11, 2023 at 1:08 PM Madison Kelly <mke...@alteeve.com> wrote:
>
> On 2023-01-11 15:55, Reid Wahl wrote:
> > On Wed, Jan 11, 2023 at 12:48 PM Madison Kelly <mke...@alteeve.com> wrote:
> >>
> >> On 2023-01-11 01:59, Reid Wahl wrote:
> >>> On Tue, Jan 10, 2023 at 10:14 PM Vladislav Bogdanov
> >>> <bub...@hoster-ok.com> wrote:
> >>>>
> >>>> I suspect that the validate action is run as a non-root user.
> >>>
> >>> As far as I know, both the direct command and crm_resource **should**
> >>> be running the agent as the same user, as long as Madison is running
> >>> both commands as the same user.
> >>>
> >>> For what it's worth, I copied your test script to my machine (Fedora
> >>> 36 using the current upstream main of Pacemaker) and it worked fine
> >>> both directly and via crm_resource. At the moment I'm not able to dig
> >>> very deeply, but I do wonder if it's either a bug that's since been
> >>> fixed, or perhaps an environment issue.
> >>>
> >>> To try to rule out the former, do you have a test environment where
> >>> you can try to reproduce it on the latest Pacemaker from upstream?
> >>
> >> I built the pacemaker source RPM from Fedora 37, then realized I'm
> >> already running 2.1.5 on CS8, so I'm already on the latest release.
> >> Looking at git, 2.1.5 is the latest tagged release... Are you running
> >> newer than that?
> >
> > I'm running on the current main, which contains commits that came
> > after the 2.1.5 release. I don't really expect this to be a Pacemaker
> > bug, especially with how recent your version is, but I would like to
> > rule that out if possible.
>
> Would you have either the src.rpm or the ./configure options you used
> offhand?

Running `make -C rpm rpm` like Ken said is probably the easiest way. I
normally build via `./autogen.sh && ./configure && make && sudo make
install`, but with an RPM, cleanup and the like are taken care of for
you.

>
> >>>> Madison Kelly <mke...@alteeve.com> wrote on January 11, 2023 at 07:06:55:
> >>>>
> >>>>> On 2023-01-11 00:21, Madison Kelly wrote:
> >>>>>>
> >>>>>> On 2023-01-11 00:14, Madison Kelly wrote:
> >>>>>>>
> >>>>>>> Hi all,
> >>>>>>>
> >>>>>>> Edit: Last message was in HTML format, sorry about that.
> >>>>>>>
> >>>>>>> I've got a hell of a weird problem, and I am absolutely stumped on
> >>>>>>> what's going on.
> >>>>>>>
> >>>>>>> The short of it is: if my RA is called from the command line, it's
> >>>>>>> fine. If a resource exists, monitor, enable, disable, all that stuff
> >>>>>>> works just fine. If I try to create a resource, it hangs on the
> >>>>>>> validate stage. Specifically, it hangs when 'pcs' calls:
> >>>>>>>
> >>>>>>> crm_resource --validate --output-as xml --class ocf --agent server --provider alteeve --option name=<resource_name>
> >>>>>>>
> >>>>>>> Specifically, it hangs when it tries to make a shell call (to
> >>>>>>> virsh, specifically, but that doesn't matter). So to debug, I started
> >>>>>>> stripping my RA down further and further until I was left with the
> >>>>>>> most basic of programs:
> >>>>>>>
> >>>>>>> https://pastebin.com/VtSpkwMr
> >>>>>>>
> >>>>>>> That is literally the simplest program I could write that made the
> >>>>>>> shell call. The 'open()' call is where it hangs.
> >>>>>>>
> >>>>>>> When I call directly:
> >>>>>>>
> >>>>>>> time /usr/lib/ocf/resource.d/alteeve/server --validate-all --server srv04-test; echo rc:$?
> >>>>>>>
> >>>>>>> ====
> >>>>>>> real 0m0.061s
> >>>>>>> user 0m0.037s
> >>>>>>> sys 0m0.014s
> >>>>>>> rc:0
> >>>>>>> ====
> >>>>>>>
> >>>>>>> It's just fine. I can see in the log the output from the 'virsh' call
> >>>>>>> as well.
> >>>>>>>
> >>>>>>> However, when I call from crm_resource:
> >>>>>>>
> >>>>>>> time crm_resource --validate --output-as xml --class ocf --agent server --provider alteeve --option name=srv04-test; echo rc:$?
> >>>>>>>
> >>>>>>> ====
> >>>>>>> <pacemaker-result api-version="2.25" request="crm_resource --validate --output-as xml --class ocf --agent server --provider alteeve --option name=srv04-test">
> >>>>>>>   <resource-agent-action action="validate" class="ocf" type="server" provider="alteeve">
> >>>>>>>     <overrides/>
> >>>>>>>     <agent-status code="1" message="error" execution_code="2" execution_message="Timed Out" reason="Resource agent did not exit within specified timeout"/>
> >>>>>>>   </resource-agent-action>
> >>>>>>>   <status code="1" message="Error occurred">
> >>>>>>>     <errors>
> >>>>>>>       <error>crm_resource: Error performing operation: Error occurred</error>
> >>>>>>>     </errors>
> >>>>>>>   </status>
> >>>>>>> </pacemaker-result>
> >>>>>>>
> >>>>>>> real 0m20.521s
> >>>>>>> user 0m0.022s
> >>>>>>> sys 0m0.010s
> >>>>>>> rc:1
> >>>>>>> ====
> >>>>>>>
> >>>>>>> In the log file, I see (from line 20 of the super-simple-test-script):
> >>>>>>>
> >>>>>>> ====
> >>>>>>> Calling: [/usr/bin/virsh dumpxml --inactive srv04-test 2>&1; /usr/bin/echo return_code:0 |]
> >>>>>>> ====
> >>>>>>>
> >>>>>>> Then nothing else.
> >>>>>>>
> >>>>>>> The strace output is: https://pastebin.com/raw/UCEUdBeP
> >>>>>>>
> >>>>>>> Environment:
> >>>>>>>
> >>>>>>> * selinux is permissive
> >>>>>>> * Pacemaker 2.1.5-4.el8
> >>>>>>> * pcs 0.10.15
> >>>>>>> * 4.18.0-408.el8.x86_64
> >>>>>>> * CentOS Stream release 8
> >>>>>>>
> >>>>>>> Any help is appreciated, I am stumped. :/
> >>>>>>
> >>>>>>
> >>>>>> After sending this, I tried having my "RA" call 'hostname', and that
> >>>>>> worked fine. I switched back to 'virsh list --all', and that hangs. So
> >>>>>> it seems to somehow be related to calling 'virsh' specifically.
> >>>>>>
> >>>>>
> >>>>> OK, so more info... Knowing now that it's a problem with the virsh call
> >>>>> specifically (but only when validating, existing VMs monitor, enable,
> >>>>> disable fine, all which repeatedly call virsh), I noticed a few things.
> >>>>>
> >>>>> First, I see in the logs:
> >>>>>
> >>>>> ====
> >>>>> Jan 11 00:30:43 mk-a07n02.digimer.ca libvirtd[2937]: Cannot recv data: Connection reset by peer
> >>>>> ====
> >>>>>
> >>>>> So with this, I further simplified my test script to this:
> >>>>>
> >>>>> https://pastebin.com/Ey8FdL1t
> >>>>>
> >>>>> Then when I ran my test script directly, the strace output is:
> >>>>>
> >>>>> Good: https://pastebin.com/Trbq67ub
> >>>>>
> >>>>> When my script is called via crm_resource, the strace is this:
> >>>>>
> >>>>> Bad: https://pastebin.com/jtbzHrUM
> >>>>>
> >>>>> The first difference I can see happens around line 929 in the good
> >>>>> paste, the line "futex(0x7f48b0001ca0, FUTEX_WAKE_PRIVATE, 1) = 0"
> >>>>> exists, which doesn't in the bad paste. Shortly after, I start seeing:
> >>>>>
> >>>>> ====
> >>>>> line: [write(4, "\1\0\0\0\0\0\0\0", 8) = 8]
> >>>>> line: [brk(NULL) = 0x562b7877d000]
> >>>>> line: [brk(0x562b787aa000) = 0x562b787aa000]
> >>>>> line: [write(4, "\1\0\0\0\0\0\0\0", 8) = 8]
> >>>>> ====
> >>>>>
> >>>>> Around line 959 in the bad paste. There are more brk() lines, and not
> >>>>> long after the output stops.
> >>>>>
> >>>>> --
> >>>>> Madison Kelly
> >>>>> Alteeve's Niche!
> >>>>> Chief Technical Officer
> >>>>> c: +1-647-471-0951
> >>>>> https://alteeve.com/
> >>>>>
> >>>>> _______________________________________________
> >>>>> Manage your subscription:
> >>>>> https://lists.clusterlabs.org/mailman/listinfo/users
> >>>>>
> >>>>> ClusterLabs home: https://www.clusterlabs.org/

--
Regards,

Reid Wahl (He/Him)
Senior Software Engineer, Red Hat
RHEL High Availability - Pacemaker
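
One rough way to probe the "environment issue" possibility raised in this thread, outside of Pacemaker, is to run the agent under a time limit with a cleared environment and see whether it still returns promptly. The sketch below is not from the thread and makes several assumptions: `run_limited` is a hypothetical helper name, the `PATH` value is a guess at a sane default, and clearing the environment does not reproduce whatever environment crm_resource actually sets up. It relies on coreutils `timeout`, which exits with status 124 when the limit is reached.

```shell
#!/bin/sh
# Hypothetical helper (not part of Pacemaker or the thread): run a command
# with a time limit and a minimal environment, then report how it ended.
run_limited() {
    limit="$1"
    shift
    # env -i drops all inherited variables; PATH is re-added (assumed value)
    # so that timeout and the target command can still be found.
    env -i PATH="/usr/sbin:/usr/bin:/sbin:/bin" timeout "$limit" "$@"
    rc=$?
    if [ "$rc" -eq 124 ]; then
        # coreutils timeout uses 124 to signal that the limit was hit.
        echo "timed out after ${limit}s"
    else
        echo "rc:$rc"
    fi
    return "$rc"
}
```

For example, `run_limited 20 /usr/lib/ocf/resource.d/alteeve/server --validate-all --server srv04-test` would apply a 20-second limit, chosen only because the crm_resource run above gave up after roughly 20 seconds. If the agent hangs here but not in a normal shell, that would point toward something it inherits from the calling environment; if it still succeeds, the difference likely lies elsewhere.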