Hi, As i don't have much time to dig into this pacemaker vs systemd problem, i decided to dump systemd.
For apache resource i replaced it with ocf::heartbeat:apache, openvpn i replaced with ocf::heartbeat:anything and for the other resources that need some more elaborated start/stop script i created /etc/init.d/ scripts and used lsb resource type. Everything is working perfectly now. On 20/02/2020 23:10, Maverick wrote: > Hi, > > I'm using Fedora 31 (x86_64). > > For apache i can use the ocf agent sure, but i have other resources for > who don't exist an ocf agent, so for them i need to use systemd. > > All ocf and lsb type resources start ok on boot, only systemd resources > have this problem. > > I already enabled debug for httpd and openvpn-server systemd units, but > i don't see any debug on /var/log/messages or journal about any of these > units. > > Here some of the systemd units: > > Apache: > > [Unit] > Description=The Apache HTTP Server > Wants=httpd-init.service > After=network.target remote-fs.target nss-lookup.target httpd-init.service > Documentation=man:httpd.service(8) > > [Service] > Type=notify > Environment=LANG=C > Environment=SYSTEMD_LOG_LEVEL=debug > > ExecStart=/usr/sbin/httpd $OPTIONS -DFOREGROUND > ExecReload=/usr/sbin/httpd $OPTIONS -k graceful > # Send SIGWINCH for graceful stop > KillSignal=SIGWINCH > KillMode=mixed > PrivateTmp=true > > [Install] > WantedBy=multi-user.target > > ----------------- > > OpenVPN: > > [Unit] > Description=OpenVPN service for %I > After=syslog.target network-online.target > Wants=network-online.target > Documentation=man:openvpn(8) > Documentation=https://community.openvpn.net/openvpn/wiki/Openvpn24ManPage > Documentation=https://community.openvpn.net/openvpn/wiki/HOWTO > > [Service] > Type=notify > PrivateTmp=true > WorkingDirectory=/etc/openvpn/server > Environment=SYSTEMD_LOG_LEVEL=debug > ExecStart=/usr/sbin/openvpn --status %t/openvpn-server/status-%i.log > --status-version 2 --suppress-timestamps --cipher AES-256-GCM > --ncp-ciphers AES-256-GCM:AES-128-GCM:AES-256-CBC:AES-128-CBC:BF-CBC > --config %i.conf > CapabilityBoundingSet=CAP_IPC_LOCK CAP_NET_ADMIN CAP_NET_BIND_SERVICE > CAP_NET_RAW CAP_SETGID CAP_SETUID CAP_SYS_CHROOT CAP_DAC_OVERRIDE > CAP_AUDIT_WRITE > LimitNPROC=10 > DeviceAllow=/dev/null rw > DeviceAllow=/dev/net/tun rw > ProtectSystem=true > ProtectHome=true > KillMode=process > RestartSec=5s > Restart=on-failure > > [Install] > WantedBy=multi-user.target > > --------------------------------- > > Zabbix Server: > > [Unit] > Description=Zabbix Server with Oracle DB > After=syslog.target network.target > > [Service] > Type=simple > Environment="LD_LIBRARY_PATH=/opt/oracle/lib" > ExecStart=/usr/sbin/zabbix_server -f > User=zabbixsrv > > [Install] > WantedBy=multi-user.target > > > > On 20/02/2020 22:29, Strahil Nikolov wrote: >> On February 20, 2020 10:29:54 PM GMT+02:00, Maverick <m...@sapo.pt> wrote: >>>> Hi Maverick, >>>> >>>> >>>> According this thread: >>>> >>> https://lists.clusterlabs.org/pipermail/users/2016-December/021053.html >>>> You have 'startup-fencing' is set to false. >>>> >>>> Check it out - maybe this is your reason. >>>> >>>> Best Regards, >>>> Strahil Nikolov >>> Yes, i have stonith disabled, because as soon as the resources startup >>> fail on boot, node was rebooted. >>> >>> >>> Anyway, i was checking the pacemaker logs and the journal log, and i >>> see >>> that the service actually starts ok but for some reason pacemaker >>> thinks >>> it has timeout and then because of that tries to stop and also thinks >>> it >>> has timeout but actually stops it: >>> >>> pacemaker.log: >>> >>> Feb 20 19:39:52 boss1 pacemaker-execd [1499] (log_execute) info: >>> executing - rsc:apache action:start call_id:25 >>> Feb 20 19:39:52 boss1 pacemaker-execd [1499] (systemd_unit_exec) >>> debug: Performing asynchronous start op on systemd unit httpd named >>> 'apache' >>> Feb 20 19:39:52 boss1 pacemaker-execd [1499] >>> (systemd_unit_exec_with_unit) debug: Calling StartUnit for apache: >>> /org/freedesktop/systemd1/unit/httpd_2eservice >>> Feb 20 19:39:52 boss1 pacemaker-execd [1499] (action_complete) >>> notice: Giving up on apache start (rc=0): timeout (elapsed=248199ms, >>> remaining=-148199ms) >>> Feb 20 19:39:52 boss1 pacemaker-execd [1499] (log_finished) >>> debug: finished - rsc:apache action:monitor call_id:25 exit-code:198 >>> exec-time:248205ms queue-time:216ms >>> >>> Feb 20 19:40:00 boss1 pacemaker-execd [1499] (log_execute) info: >>> executing - rsc:apache action:stop call_id:81 >>> Feb 20 19:40:00 boss1 pacemaker-execd [1499] (systemd_unit_exec) >>> debug: Performing asynchronous stop op on systemd unit httpd named >>> 'apache' >>> Feb 20 19:40:00 boss1 pacemaker-execd [1499] >>> (systemd_unit_exec_with_unit) debug: Calling StopUnit for apache: >>> /org/freedesktop/systemd1/unit/httpd_2eservice >>> Feb 20 19:40:01 boss1 pacemaker-execd [1499] (action_complete) >>> notice: Giving up on apache stop (rc=0): timeout (elapsed=304539ms, >>> remaining=-204539ms) >>> Feb 20 19:40:01 boss1 pacemaker-execd [1499] (log_finished) >>> debug: finished - rsc:apache action:monitor call_id:81 exit-code:198 >>> exec-time:304545ms queue-time:240ms >>> >>> >>> system journal: >>> >>> Feb 20 19:39:52 boss1 systemd[1]: Starting Cluster Controlled httpd... >>> Feb 20 19:39:53 boss1 systemd[1]: Started Cluster Controlled httpd. >>> Feb 20 19:39:53 boss1 httpd[2145]: Server configured, listening on: >>> port >>> 443, port 80 >>> >>> Feb 20 19:40:01 boss1 systemd[1]: Stopping The Apache HTTP Server... >>> Feb 20 19:40:02 boss1 systemd[1]: httpd.service: Succeeded. >>> Feb 20 19:40:02 boss1 systemd[1]: Stopped The Apache HTTP Server. >>> >>> >>> >>> >>> On 20/02/2020 21:02, Strahil Nikolov wrote: >>>> On February 20, 2020 9:35:07 PM GMT+02:00, Maverick <m...@sapo.pt> >>> wrote: >>>>> Manually it starts ok, no problems: >>>>> >>>>> pcs resource debug-start apache --full >>>>> (unpack_config) warning: Blind faith: not fencing unseen nodes >>>>> Operation start for apache (systemd::httpd) returned: 'ok' (0) >>>>> >>>>> >>>>> On 20/02/2020 16:46, Strahil Nikolov wrote: >>>>>> On February 20, 2020 12:49:43 PM GMT+02:00, Maverick <m...@sapo.pt> >>>>> wrote: >>>>>>>> You really need to debug the start & stop of tthe resource . >>>>>>>> >>>>>>>> Please try the debug procedure and provide the output: >>>>>>>> https://wiki.clusterlabs.org/wiki/Debugging_Resource_Failures >>>>>>>> >>>>>>>> Best Regards, >>>>>>>> Strahil Nikolov >>>>>>> Hi, >>>>>>> >>>>>>> Correct me if i'm wrong, but i think that procedure doesn't work >>> for >>>>>>> systemd class resources, i don't know which OCF script is >>>>> responsible >>>>>>> for handling systemd class resources. >>>>>>> >>>>>>> Also crm command doesn't exist in RHEL/Fedora, i've seen the crm >>>>>>> command >>>>>>> only in SUSE. >>>>>>> >>>>>>> >>>>>>> >>>>>>> On 19/02/2020 19:23, Strahil Nikolov wrote: >>>>>>>> On February 19, 2020 7:21:12 PM GMT+02:00, Maverick >>> <m...@sapo.pt> >>>>>>> wrote: >>>>>>>>> How is it possible that pacemaker is reporting that takes 4.2 >>>>>>> minutes >>>>>>>>> (254930ms) to execute the start of httpd systemd unit? >>>>>>>>> >>>>>>>>> Feb 19 17:04:09 boss1 pacemaker-execd [1514] (log_execute) >>> >>>>>>>>> info: >>>>>>>>> executing - rsc:apache action:start call_id:25 >>>>>>>>> Feb 19 17:04:09 boss1 pacemaker-execd [1514] >>>>> (systemd_unit_exec) >>>>>>>>> >>>>>>>>> debug: Performing asynchronous start op on systemd unit httpd >>>>> named >>>>>>>>> 'apache' >>>>>>>>> Feb 19 17:04:09 boss1 pacemaker-execd [1514] >>>>>>>>> (systemd_unit_exec_with_unit) debug: Calling StartUnit for >>>>>>> apache: >>>>>>>>> /org/freedesktop/systemd1/unit/httpd_2eservice >>>>>>>>> Feb 19 17:04:10 boss1 pacemaker-execd [1514] >>> (action_complete) >>>>>>> >>>>>>>>> notice: Giving up on apache start (rc=0): timeout >>>>> (elapsed=254930ms, >>>>>>>>> remaining=-154930ms) >>>>>>>>> Feb 19 17:04:10 boss1 pacemaker-execd [1514] (log_finished) >>>>> >>>>>>>>> debug: finished - rsc:apache action:monitor call_id:25 >>>>>>> exit-code:198 >>>>>>>>> exec-time:254935ms queue-time:235ms >>>>>>>>> >>>>>>>>> >>>>>>>>> Starting manually works fine and fast: >>>>>>>>> >>>>>>>>> # time systemctl start httpd >>>>>>>>> real 0m0.144s >>>>>>>>> user 0m0.005s >>>>>>>>> sys 0m0.008s >>>>>>>>> >>>>>>>>> >>>>>>>>> On 17/02/2020 22:47, Mvrk wrote: >>>>>>>>>> In attachment the pacemaker.log. On the log i can see that the >>>>>>>>> cluster >>>>>>>>>> tries to start, the start fails, then tries to stop, and the >>> stop >>>>>>>>> also >>>>>>>>>> fails also. >>>>>>>>>> >>>>>>>>>> One more thing, my cluster was working fine on Fedora 28, i >>>>> started >>>>>>>>>> having this problem after upgrade to Fedora 31. >>>>>>>>>> >>>>>>>>>> On 17/02/2020 21:30, Ricardo Esteves wrote: >>>>>>>>>>> Hi, >>>>>>>>>>> >>>>>>>>>>> Yes, i also don't understand why is trying to stop them first. >>>>>>>>>>> >>>>>>>>>>> SELinux is disabled: >>>>>>>>>>> >>>>>>>>>>> # getenforce >>>>>>>>>>> Disabled >>>>>>>>>>> >>>>>>>>>>> All systemd services controlled by the cluster are disabled >>> from >>>>>>>>>>> starting at boot: >>>>>>>>>>> >>>>>>>>>>> # systemctl is-enabled httpd >>>>>>>>>>> disabled >>>>>>>>>>> >>>>>>>>>>> # systemctl is-enabled openvpn-server@01-server >>>>>>>>>>> disabled >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> On 17/02/2020 20:28, Ken Gaillot wrote: >>>>>>>>>>>> On Mon, 2020-02-17 at 17:35 +0000, Maverick wrote: >>>>>>>>>>>>> Hi, >>>>>>>>>>>>> >>>>>>>>>>>>> When i start my cluster, most of my systemd resources won't >>>>>>> start: >>>>>>>>>>>>> Failed Resource Actions: >>>>>>>>>>>>> * apache_stop_0 on boss1 'OCF_TIMEOUT' (198): call=82, >>>>>>>>>>>>> status='Timed Out', exitreason='', >>> last-rc-change='1970-01-01 >>>>>>>>>>>>> 01:00:54 +01:00', queued=29ms, exec=197799ms >>>>>>>>>>>>> * openvpn_stop_0 on boss1 'OCF_TIMEOUT' (198): call=61, >>>>>>>>>>>>> status='Timed Out', exitreason='', >>> last-rc-change='1970-01-01 >>>>>>>>>>>>> 01:00:54 +01:00', queued=1805ms, exec=198841ms >>>>>>>>>>>> These show that attempts to stop failed, rather than start. >>>>>>>>>>>> >>>>>>>>>>>>> So everytime i reboot my node, i need to start the resources >>>>>>>>> manually >>>>>>>>>>>>> using systemd, for example: >>>>>>>>>>>>> >>>>>>>>>>>>> systemd start apache >>>>>>>>>>>>> >>>>>>>>>>>>> and then pcs resource cleanup >>>>>>>>>>>>> >>>>>>>>>>>>> Resources configuration: >>>>>>>>>>>>> >>>>>>>>>>>>> Clone: apache-clone >>>>>>>>>>>>> Meta Attrs: maintenance=false >>>>>>>>>>>>> Resource: apache (class=systemd type=httpd) >>>>>>>>>>>>> Meta Attrs: maintenance=false >>>>>>>>>>>>> Operations: monitor interval=60 timeout=100 >>>>> (apache-monitor- >>>>>>>>>>>>> interval-60) >>>>>>>>>>>>> start interval=0s timeout=100 >>>>>>>>> (apache-start-interval- >>>>>>>>>>>>> 0s) >>>>>>>>>>>>> stop interval=0s timeout=100 >>>>>>>>> (apache-stop-interval-0s) >>>>>>>>>>>>> Resource: openvpn (class=systemd >>>>> type=openvpn-server@01-server) >>>>>>>>>>>>> Meta Attrs: maintenance=false >>>>>>>>>>>>> Operations: monitor interval=60 timeout=100 >>>>> (openvpn-monitor- >>>>>>>>>>>>> interval-60) >>>>>>>>>>>>> start interval=0s timeout=100 >>>>>>>>> (openvpn-start-interval- >>>>>>>>>>>>> 0s) >>>>>>>>>>>>> stop interval=0s timeout=100 >>>>>>>>> (openvpn-stop-interval- >>>>>>>>>>>>> 0s) >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> Btw, if i try a debug-start / debug-stop the mentioned >>>>> resources >>>>>>>>>>>>> start and stop ok. >>>>>>>>>>>> Based on that, my first guess would be SELinux. Check the >>>>> SELinux >>>>>>>>> logs >>>>>>>>>>>> for denials. >>>>>>>>>>>> >>>>>>>>>>>> Also, make sure your systemd services are not enabled in >>>>> systemd >>>>>>>>> itself >>>>>>>>>>>> (e.g. via systemctl enable). Clustered systemd services >>> should >>>>> be >>>>>>>>>>>> managed by the cluster only. >>>>>>>>> _______________________________________________ >>>>>>>>> Manage your subscription: >>>>>>>>> https://lists.clusterlabs.org/mailman/listinfo/users >>>>>>>>> >>>>>>>>> ClusterLabs home: https://www.clusterlabs.org/ >>>>>>>> You really need to debug the start & stop of tthe resource . >>>>>>>> >>>>>>>> Please try the debug procedure and provide the output: >>>>>>>> https://wiki.clusterlabs.org/wiki/Debugging_Resource_Failures >>>>>>>> >>>>>>>> Best Regards, >>>>>>>> Strahil Nikolov >>>>>> Hi Maverick, >>>>>> >>>>>> >>>>>> you can replace 'crm resource stop' with 'pcs resource disable'. >>>>>> The rest is working, but sadly not for systemd. >>>>>> >>>>>> You can try to: >>>>>> 'pcs resource debug-start <resource> --full' >>>>>> Another approach is to: >>>>>> 1. Copy service to /etc/systemd/system >>>>>> 2. In '[service]' section add this: >>>>>> Environment=SYSTEMD_LOG_LEVEL=debug >>>>>> 3. Reload systemd: >>>>>> systemctl daemon_reload >>>>>> Note: I assume you got downtime for debugging the issue >>>>>> 4. Use 'debug-start --full' >>>>>> >>>>>> Note: Don't forget to remove the debug, or your journal will get >>>>> full. >>>>>> Best Regards, >>>>>> Strahil Nikolov >>>> Hi Maverick, >>>> >>>> >>>> According this thread: >>>> >>> https://lists.clusterlabs.org/pipermail/users/2016-December/021053.html >>>> You have 'startup-fencing' is set to false. >>>> >>>> Check it out - maybe this is your reason. >>>> >>>> Best Regards, >>>> Strahil Nikolov >> Hi Maverick, >> >> Can you share your systemd service ? >> What distribution are you using and what is the reason for using systemd >> instead of the ocf resource for apache ? >> >> Could you enable the DEBUG for the systemd service ? >> >> >> Best Regards, >> Strahil Nikolov _______________________________________________ Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users ClusterLabs home: https://www.clusterlabs.org/