[ClusterLabs] Antw: [EXT] Re: RA hangs when called by crm_resource (resending text format)

2023-01-11 Thread Ulrich Windl
>>> Madison Kelly wrote on 11.01.2023 at 22:06 in message
<8a2f2d45-0419-8e97-1805-2998a9b83...@alteeve.com>:
> On 2023-01-11 01:13, Vladislav Bogdanov wrote:
>> I suspect that the validate action is run as a non-root user.
> 
> I modified the script to log the real and effective UIDs and it's 
> running as root in both instances.
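
A minimal sketch of the UID check described above, written in Python purely for illustration. The actual resource agent is a separate script that is not shown in this thread, and `describe_identity` is a hypothetical helper name. It reports the real and effective user IDs so the same line can be logged from both the direct call and the crm_resource call.

```python
import os
import pwd


def describe_identity() -> str:
    """Return a one-line summary of the real and effective user IDs."""
    ruid, euid = os.getuid(), os.geteuid()

    def name(uid: int) -> str:
        # Map a numeric UID to a user name; fall back if there is no passwd entry.
        try:
            return pwd.getpwuid(uid).pw_name
        except KeyError:
            return "unknown"

    return f"ruid={ruid}({name(ruid)}) euid={euid}({name(euid)})"
```

If both invocations log the same line with UID 0, as reported here, a privilege difference is unlikely to explain the hang.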

I'm not running Red Hat, but could it be one of the additional security
features (SELinux)?
If possible, try disabling those for the test, or run your RA on a
non-Red Hat system (just for testing).

Regards,
Ulrich

> 
>> Madison Kelly wrote on 11 January 2023 at 07:06:55:
>> 
>>> On 2023-01-11 00:21, Madison Kelly wrote:
>>>> On 2023-01-11 00:14, Madison Kelly wrote:
>>>>> Hi all,
>>>>>
>>>>> Edit: Last message was in HTML format, sorry about that.
>>>>>
>>>>> I've got a hell of a weird problem, and I am absolutely stumped on
>>>>> what's going on.
>>>>>
>>>>> The short of it is; if my RA is called from the command line, it's
>>>>> fine. If a resource exists, monitor, enable, disable, all that stuff
>>>>> works just fine. If I try to create a resource, it hangs on the
>>>>> validate stage. Specifically, it hangs when 'pcs' calls:
>>>>>
>>>>> crm_resource --validate --output-as xml --class ocf --agent server
>>>>> --provider alteeve --option name=
>>>>>
>>>>> Specifically, it hangs when it tries to make a shell call (to
>>>>> virsh, specifically, but that doesn't matter). So to debug, I started
>>>>> stripping down my RA simpler and simpler until I was left with the
>>>>> very most basic of programs;
>>>>>
>>>>> https://pastebin.com/VtSpkwMr
>>>>>
>>>>> That is literally the simplest program I could write that made the
>>>>> shell call. The 'open()' call is where it hangs.
>>>>>
>>>>> When I call directly;
>>>>>
>>>>> time /usr/lib/ocf/resource.d/alteeve/server --validate-all --server
>>>>> srv04-test; echo rc:$?
>>>>>
>>>>>
>>>>> real 0m0.061s
>>>>> user 0m0.037s
>>>>> sys  0m0.014s
>>>>> rc:0
>>>>>
>>>>>
>>>>> It's just fine. I can see in the log the output from the 'virsh' call
>>>>> as well. However, when I call from crm_resource;
>>>>>
>>>>> time crm_resource --validate --output-as xml --class ocf --agent
>>>>> server --provider alteeve --option name=srv04-test; echo rc:$?
>>>>>
>>>>>
>>>>>
>>>>> provider="alteeve">
>>>>>
>>>>>   execution_message="Timed Out" reason="Resource agent did not exit
>>>>> within specified timeout"/>
>>>>>
>>>>>
>>>>>
>>>>> crm_resource: Error performing operation: Error
>>>>> occurred
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> real 0m20.521s
>>>>> user 0m0.022s
>>>>> sys  0m0.010s
>>>>> rc:1
>>>>>
>>>>>
>>>>> In the log file, I see (from line 20 of the super-simple-test-script):
>>>>>
>>>>>
>>>>> Calling: [/usr/bin/virsh dumpxml --inactive srv04-test 2>&1;
>>>>> /usr/bin/echo return_code:0 |]
>>>>>
>>>>>
>>>>> Then nothing else.
>>>>>
>>>>> The strace output is: https://pastebin.com/raw/UCEUdBeP
>>>>>
>>>>> Environment;
>>>>>
>>>>> * selinux is permissive
>>>>> * Pacemaker 2.1.5-4.el8
>>>>> * pcs 0.10.15
>>>>> * 4.18.0-408.el8.x86_64
>>>>> * CentOS Stream release 8
>>>>>
>>>>> Any help is appreciated, I am stumped. :/
>>>>
>>>> After sending this, I tried having my "RA" call 'hostname', and that
>>>> worked fine. I switched back to 'virsh list --all', and that hangs. So
>>>> it seems to somehow be related to call 'virsh' specifically.
>>>>
>>>
>>> OK, so more info... Knowing now that it's a problem with the virsh call
>>> specifically (but only when validating, existing VMs monitor, enable,
>>> disable fine, all which repeatedly call virsh), I noticed a few things.
>>>
>>> First, I see in the logs:
>>>
>>> 
>>> Jan 11 00:30:43 mk-a07n02.digimer.ca libvirtd[2937]: Cannot recv data:
>>> Connection reset by peer
>>> 
>>>
>>> So with this, I further simplified my test script to this:
>>>
>>> https://pastebin.com/Ey8FdL1t 
>>>
>>> Then when I ran my test script directly, the strace output is:
>>>
>>> Good: https://pastebin.com/Trbq67ub 
>>>
>>> When my script is called via crm_resource, the strace is this:
>>>
>>> Bad: https://pastebin.com/jtbzHrUM 
>>>
>>> The first difference I can see happens around line 929 in the good
>>> paste, the line "futex(0x7f48b0001ca0, FUTEX_WAKE_PRIVATE, 1) = 0"
>>> exists, which doesn't in the bad paste. Shortly after, I start seeing:
>>>
>>> 
>>> line: [write(4, "\1\0\0\0\0\0\0\0", 8) = 8]
>>> line: [brk(NULL)   = 0x562b7877d000]
>>> line: [brk(0x562b787aa000) = 0x562b787aa000]
>>> line: [write(4, "\1\0\0\0\0\0\0\0", 8) = 8]
>>> 
>>>
>>> Around line 959 in the bad paste. There are more brk() lines, and not
>>> long after the output stops.
>>>
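
The manual comparison of the "good" and "bad" strace pastes above can be roughly automated. The sketch below is an illustration only, not something used in this thread: `normalise` and `first_divergence` are hypothetical helper names, and the approach assumes both traces were captured with the same strace options. It masks hexadecimal addresses and a leading PID column, since those differ between any two runs, and then reports the first line where the two traces disagree.

```python
import re
from itertools import zip_longest

_HEX = re.compile(r"0x[0-9a-fA-F]+")
_PID = re.compile(r"^\d+\s+")  # leading PID column written by `strace -f`


def normalise(line: str) -> str:
    """Mask values that legitimately differ between two runs."""
    line = _PID.sub("", line.strip())
    return _HEX.sub("0xADDR", line)


def first_divergence(good, bad):
    """Return (line_number, good_line, bad_line) for the first mismatch, or None.

    A side is None when that trace ended before the other one did.
    """
    for number, (g, b) in enumerate(zip_longest(good, bad), start=1):
        if g is None or b is None or normalise(g) != normalise(b):
            return number, g, b
    return None
```

A strictly positional comparison like this is brittle: thread scheduling can reorder otherwise identical syscalls, so the first reported mismatch is a starting point for reading the traces, not proof of the cause.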
>>> -- 
>>> Madison Kelly
>>> Alteeve's Niche!
>>> Chief Technical Officer
>>> c: +1-647-471-0951
>>> https://alteeve.com/ 
>>>
>>> ___
>>> Manage your 

Re: [ClusterLabs] Antw: [EXT] Re: RA hangs when called by crm_resource (resending text format)

2023-01-10 Thread Madison Kelly



On January 11, 2023 2:26:57 a.m. EST, Ulrich Windl wrote:
>>>> Madison Kelly wrote on 11.01.2023 at 06:21 in message
><74df2c8e-1cff-ba07-7f4a-070be296b...@alteeve.com>:
>> On 2023-01-11 00:14, Madison Kelly wrote:
>>> Hi all,
>>> 
>>> Edit: Last message was in HTML format, sorry about that.
>>> 
>>>I've got a hell of a weird problem, and I am absolutely stumped on 
>>> what's going on.
>>> 
>>>The short of it is; if my RA is called from the command line, it's 
>>> fine. If a resource exists, monitor, enable, disable, all that stuff 
>>> works just fine. If I try to create a resource, it hangs on the validate 
>>> stage. Specifically, it hangs when 'pcs' calls:
>>> 
>>> crm_resource --validate --output-as xml --class ocf --agent server 
>>> --provider alteeve --option name=
>>> 
>>>Specifically, it hangs when it tries to make a shell call (to virsh, 
>>> specifically, but that doesn't matter). So to debug, I started stripping 
>>> down my RA simpler and simpler until I was left with the very most basic 
>>> of programs;
>>> 
>>> https://pastebin.com/VtSpkwMr 
>>> 
>>>That is literally the simplest program I could write that made the 
>>> shell call. The 'open()' call is where it hangs.
>>> 
>>> When I call directly;
>>> 
>>> time /usr/lib/ocf/resource.d/alteeve/server --validate-all --server 
>>> srv04-test; echo rc:$?
>>> 
>>> 
>>> real 0m0.061s
>>> user 0m0.037s
>>> sys  0m0.014s
>>> rc:0
>>> 
>>> 
>>> It's just fine. I can see in the log the output from the 'virsh' call as 
>>> well. However, when I call from crm_resource;
>>> 
>>> time crm_resource --validate --output-as xml --class ocf --agent server 
>>> --provider alteeve --option name=srv04-test; echo rc:$?
>>> 
>>> 
>>> 
>>>>> provider="alteeve">
>>>  
>>>  >> execution_message="Timed Out" reason="Resource agent did not exit within 
>>> specified timeout"/>
>>>
>>>
>>>  
>>>crm_resource: Error performing operation: Error 
>>> occurred
>>>  
>>>
>>> 
>>> 
>>> real 0m20.521s
>>> user 0m0.022s
>>> sys  0m0.010s
>>> rc:1
>>> 
>>> 
>>> In the log file, I see (from line 20 of the super-simple-test-script):
>>> 
>>> 
>>> Calling: [/usr/bin/virsh dumpxml --inactive srv04-test 2>&1; 
>>> /usr/bin/echo return_code:0 |]
>>> 
>
>In VirtualDomain RA I found a similar command (assuming that works):
> virsh $VIRSH_OPTIONS dumpxml --inactive --security-info ${DOMAIN_NAME} >
> ${CFGTMP}
>
>virsh is somewhat strange; libvirtd is running, right?

Yes. I can call the RA directly and then immediately call crm_resource, or do
it in the reverse order, and the results are always the same.

Again, the same calls work fine when enabling, disabling, etc. So weird...

>>> 
>>> Then nothing else.
>>> 
>>> The strace output is: https://pastebin.com/raw/UCEUdBeP 
>>> 
>>> Environment;
>>> 
>>> * selinux is permissive
>>> * Pacemaker 2.1.5-4.el8
>>> * pcs 0.10.15
>>> * 4.18.0-408.el8.x86_64
>>> * CentOS Stream release 8
>>> 
>>> Any help is appreciated, I am stumped. :/
>> 
>> After sending this, I tried having my "RA" call 'hostname', and that 
>> worked fine. I switched back to 'virsh list --all', and that hangs. So 
>> it seems to somehow be related to call 'virsh' specifically.
>> 
>> -- 
>> Madison Kelly
>> Alteeve's Niche!
>> Chief Technical Officer
>> c: +1-647-471-0951
>> https://alteeve.com/ 
>> 
>> ___
>> Manage your subscription:
>> https://lists.clusterlabs.org/mailman/listinfo/users 
>> 
>> ClusterLabs home: https://www.clusterlabs.org/ 


[ClusterLabs] Antw: [EXT] Re: RA hangs when called by crm_resource (resending text format)

2023-01-10 Thread Ulrich Windl
>>> Madison Kelly wrote on 11.01.2023 at 06:21 in message
<74df2c8e-1cff-ba07-7f4a-070be296b...@alteeve.com>:
> On 2023-01-11 00:14, Madison Kelly wrote:
>> Hi all,
>> 
>> Edit: Last message was in HTML format, sorry about that.
>> 
>>I've got a hell of a weird problem, and I am absolutely stumped on 
>> what's going on.
>> 
>>The short of it is; if my RA is called from the command line, it's 
>> fine. If a resource exists, monitor, enable, disable, all that stuff 
>> works just fine. If I try to create a resource, it hangs on the validate 
>> stage. Specifically, it hangs when 'pcs' calls:
>> 
>> crm_resource --validate --output-as xml --class ocf --agent server 
>> --provider alteeve --option name=
>> 
>>Specifically, it hangs when it tries to make a shell call (to virsh, 
>> specifically, but that doesn't matter). So to debug, I started stripping 
>> down my RA simpler and simpler until I was left with the very most basic 
>> of programs;
>> 
>> https://pastebin.com/VtSpkwMr 
>> 
>>That is literally the simplest program I could write that made the 
>> shell call. The 'open()' call is where it hangs.
>> 
>> When I call directly;
>> 
>> time /usr/lib/ocf/resource.d/alteeve/server --validate-all --server 
>> srv04-test; echo rc:$?
>> 
>> 
>> real 0m0.061s
>> user 0m0.037s
>> sys  0m0.014s
>> rc:0
>> 
>> 
>> It's just fine. I can see in the log the output from the 'virsh' call as 
>> well. However, when I call from crm_resource;
>> 
>> time crm_resource --validate --output-as xml --class ocf --agent server 
>> --provider alteeve --option name=srv04-test; echo rc:$?
>> 
>> 
>> 
>>> provider="alteeve">
>>  
>>  > execution_message="Timed Out" reason="Resource agent did not exit within 
>> specified timeout"/>
>>
>>
>>  
>>crm_resource: Error performing operation: Error 
>> occurred
>>  
>>
>> 
>> 
>> real 0m20.521s
>> user 0m0.022s
>> sys  0m0.010s
>> rc:1
>> 
>> 
>> In the log file, I see (from line 20 of the super-simple-test-script):
>> 
>> 
>> Calling: [/usr/bin/virsh dumpxml --inactive srv04-test 2>&1; 
>> /usr/bin/echo return_code:0 |]
>> 

In VirtualDomain RA I found a similar command (assuming that works):
 virsh $VIRSH_OPTIONS dumpxml --inactive --security-info ${DOMAIN_NAME} >
 ${CFGTMP}

virsh is somewhat strange; libvirtd is running, right?
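
The logged call quoted above runs virsh, merges stderr into stdout, and appends the return code. A minimal Python sketch of the same idea, with a timeout so that a hung child fails fast instead of running into the 20-second crm_resource limit, might look like this. It is an illustration under assumptions and not the agent's actual code: `run_with_timeout` is a hypothetical helper, and the timeout value is arbitrary.

```python
import subprocess
import sys


def run_with_timeout(argv, timeout_s=10.0):
    """Run argv with stderr merged into stdout; return (output, return_code).

    return_code is None when the command did not finish within timeout_s.
    """
    try:
        done = subprocess.run(
            argv,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,   # same effect as the shell's 2>&1
            stdin=subprocess.DEVNULL,   # the child cannot wait on inherited input
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired as exc:
        # Whatever was captured before the child was killed may be bytes or None.
        partial = exc.stdout or ""
        if isinstance(partial, bytes):
            partial = partial.decode(errors="replace")
        return partial, None
    return done.stdout, done.returncode
```

For example, `run_with_timeout(["/usr/bin/virsh", "dumpxml", "--inactive", "srv04-test"])` would return the XML and 0 on success, or any partial output and `None` if virsh hangs. Whether detaching stdin has any bearing on this particular hang is not established anywhere in the thread.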

>> 
>> Then nothing else.
>> 
>> The strace output is: https://pastebin.com/raw/UCEUdBeP 
>> 
>> Environment;
>> 
>> * selinux is permissive
>> * Pacemaker 2.1.5-4.el8
>> * pcs 0.10.15
>> * 4.18.0-408.el8.x86_64
>> * CentOS Stream release 8
>> 
>> Any help is appreciated, I am stumped. :/
> 
> After sending this, I tried having my "RA" call 'hostname', and that 
> worked fine. I switched back to 'virsh list --all', and that hangs. So 
> it seems to somehow be related to call 'virsh' specifically.
> 
> -- 
> Madison Kelly
> Alteeve's Niche!
> Chief Technical Officer
> c: +1-647-471-0951
> https://alteeve.com/ 
> 