Re: [ClusterLabs] Failed 'virsh' call when test RA run by crm_resource (con't)

2023-01-12 Thread Keisuke MORI
Hi,

Just a guess but could it be the same issue with this?

https://serverfault.com/questions/1105733/virsh-command-hangs-when-script-runs-in-the-background
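
If that is what is going on, the terminal job-control signals are worth a
look. A rough sketch of the mechanism (my assumption, run from an interactive
shell with job control, not tested on your cluster):

```
# A background job that reads from the terminal is stopped with SIGTTIN by
# default; plain writes only raise SIGTTOU when the "tostop" flag is set.
stty -a | grep tostop           # "-tostop" = disabled, "tostop" = enabled
stty tostop                     # enable it for this terminal
( echo "background write" ) &   # this job should now be stopped with SIGTTOU
jobs -l                         # listed as "Stopped (tty output)"
stty -tostop                    # restore the default
# A background job that tries to *change* terminal modes (tcsetattr, e.g.
# anything allocating a tty for ssh) gets SIGTTOU even with -tostop.
```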

On Thu, Jan 12, 2023 at 15:36 Madison Kelly :
>
> On 2023-01-12 01:26, Reid Wahl wrote:
> > On Wed, Jan 11, 2023 at 10:21 PM Madison Kelly  wrote:
> >>
> >> On 2023-01-12 01:12, Reid Wahl wrote:
> >>> On Wed, Jan 11, 2023 at 8:11 PM Madison Kelly  wrote:
> >>>>
> >>>> Hi all,
> >>>>
> >>>>  There were a lot of sub-threads, so I figured it would be helpful to
> >>>> start a new thread with a summary so far. For context: I have a super simple
> >>>> perl script that pretends to be an RA for the sake of debugging.
> >>>>
> >>>> https://pastebin.com/9z314TaB
> >>>>
> >>>>  I've had variations log environment variables and confirmed that all
> >>>> the variables present in the direct call (which works) are also present in
> >>>> the crm_resource-triggered call. There are no selinux issues logged in
> >>>> audit.log, and selinux is permissive. The script logs the real and effective
> >>>> UID and GID, and they are the same in both instances. Calling other shell
> >>>> programs (tested with 'hostname') works fine; this is specifically the
> >>>> crm_resource -> test RA -> virsh call.
> >>>>
> >>>>  I ran strace on the virsh call from inside my test script (changing
> >>>> 'virsh.good' to 'virsh.bad' between running directly and via
> >>>> crm_resource). The strace runs produced six files each time. Below are
> >>>> pastebin links with the outputs of the six runs in one paste; each
> >>>> file's output is in its own block (search for "file:" to see the
> >>>> different file outputs).
> >>>>
> >>>> Good/direct run of the test RA:
> >>>> - https://pastebin.com/xtqe9NSG
> >>>>
> >>>> Bad/crm_resource triggered run of the test RA:
> >>>> - https://pastebin.com/vBiLVejW
> >>>>
> >>>> Still absolutely stumped.
> >>>
> >>> The strace outputs show that your bad runs are all getting stopped
> >>> with SIGTTOU. If you've never heard of that, me either.
> >>
> >> The hell?! This is new to me also.
> >>
> >>> https://www.gnu.org/software/libc/manual/html_node/Job-Control-Signals.html
> >>>
> >>> Macro: int SIGTTOU
> >>>
> >>>   This is similar to SIGTTIN, but is generated when a process in a
> >>> background job attempts to write to the terminal or set its modes.
> >>> Again, the default action is to stop the process. SIGTTOU is only
> >>> generated for an attempt to write to the terminal if the TOSTOP output
> >>> mode is set; see Output Modes.
> >>>
> >>>
> >>> Maybe this has something to do with the buffer settings in the perl
> >>> script(?). It might be worth trying a version that doesn't fiddle with
> >>> the outputs and buffer settings.
> >>
> >> I tried removing the $|, and then I changed the script to be entirely a
> >> bash script; it still hangs. I tried 'virsh --connect <method> list
> >> --all' where <method> was qemu:///system, qemu:///session, and
> >> ssh+qemu:///root@localhost/system; all of them hang, in bash or perl.
> >>
> >>> I don't know which difference between your environment and mine is
> >>> relevant here, such that I can't reproduce the issue using your test
> >>> script. It works perfectly fine for me.
> >>>
> >>> Can you run `stty -a | grep tostop`? If there's a minus sign
> >>> ("-tostop"), it's disabled; if it's present without a minus sign
> >>> ("tostop"), it's enabled, as best I can tell.
> >>
> >> -tostop is there
> >>
> >> 
> >> [root@mk-a07n02 ~]# stty -a | grep tostop
> >> isig icanon iexten echo echoe echok -echonl -noflsh -xcase -tostop -echoprt
> >> [root@mk-a07n02 ~]#
> >> 
> >>
> >>> I'm just spitballing here. It's disabled by default on my machine...
> >>> but even when I enable it, crm_resource --validate works fine. It may
> >>> be set differently when running under crm_resource.
> >>
> >> How do you enable it?
> >
> > With `stty tostop`
> >
> > It's 100% possible that this whole thing is a red herring by the way.
> > I'm looking for anything that might explain the discrepancy. SIGTTOU
> > may not be directly tied to the root cause.
>
> Appreciate the stab, didn't stop the hang though :(
>
> --
> Madison Kelly
> Alteeve's Niche!
> Chief Technical Officer
> c: +1-647-471-0951
> https://alteeve.com/
>



-- 
Keisuke MORI


Re: [ClusterLabs] Announcing ClusterLabs Summit 2020

2019-11-07 Thread Keisuke MORI
Hi,

On Tue, Nov 5, 2019 at 11:08 Ken Gaillot :
>
> Hi all,
>
> A reminder: We are still interested in ideas for talks, and rough
> estimates of potential attendees. "Maybe" is perfectly fine at this
> stage. It will let us negotiate hotel rates and firm up the location
> details.
>

I would like to join the Summit. Two of us from NTT will be there: myself and
one more person.

I don't have a specific topic to talk about right now, but I could possibly talk
about PostgreSQL 12 support in the pgsql resource agent, or share our test results
and the issues we have found from a user's point of view.

Looking forward to seeing you guys.
Thanks,

-- 
Keisuke MORI


> On Tue, 2019-10-15 at 16:42 -0500, Ken Gaillot wrote:
> > I'm happy to announce that we have a date and location for the next
> > ClusterLabs Summit: Wednesday, Feb. 5, and Thursday, Feb. 6, 2020, in
> > Brno, Czechia. This year's host is Red Hat.
> >
> > Details will be given on this wiki page as they become available:
> >
> >   http://plan.alteeve.ca/index.php/HA_Cluster_Summit_2020
> >
> > We are still in the early stages of organizing, and need your input.
> >
> > Most importantly, we need a good idea of how many people will attend,
> > to ensure we have an appropriate conference room and amenities. The
> > wiki page has a section where you can say how many people from your
> > organization expect to attend. We don't need a firm commitment or an
> > immediate response, just let us know once you have a rough idea.
> >
> > We also invite you to propose a talk, whether it's a talk you want to
> > give or something you are interested in hearing more about. The wiki
> > page has a section for that, too. Anything related to open-source
> > clustering is welcome: new features and plans for the cluster
> > software projects, how-to's and case histories for integrating
> > specific services into a cluster, utilizing specific
> > stonith/networking/etc. technologies in a cluster, tips for
> > administering a cluster, and so forth.
> >
> > I'm excited about the chance for developers and users to meet in
> > person. Past summits have been helpful for shaping the direction of
> > the
> > projects and strengthening the community. I look forward to seeing
> > many
> > of you there!
> --
> Ken Gaillot 
>



-- 
Keisuke MORI

Re: [ClusterLabs] Replicated PGSQL woes

2016-10-19 Thread Keisuke MORI
> [...] master go back later cleanly, we make sure
> no one could be promoted in the meantime.

Yes, that is correct, but the issue described in the slide is not
related to the Timeline ID issue, and the issue in the slide could
still happen with recent PostgreSQL releases too, as far as
I understand.

>
> Note that considering this issue and how the RA tries to avoid it, this test 
> on
> slave being shutdown before master is quite weak anyway...
>
> Last but not least, the two PostgreSQL limitations the RA is messing with have
> been fixed a long time ago in 9.3:
>   * https://www.postgresql.org/docs/current/static/release-9-3.html#AEN138909
>   *
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=985bd7d49726c9f178558491d31a570d47340459
>
> ...but it requires PostgreSQL 9.3+ for the timeline issue. By the way, I 
> suspect
> this is related to the "restart_on_promote" parameter of the RA.

Yes, "restart_on_promote" parameter was introduced upon users requests
to avoid the Timeline ID issue when PostgreSQL 9.1 (I've never used
the option though), and  could be deprecated as of 9.3+, but that's a
different issue from the lock file and I think that the lock file
handling is still valid.
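
For reference, it is set like any other pgsql RA parameter. A minimal sketch
in crm shell syntax (the resource name, paths and the other parameters here
are placeholders, not taken from this thread):

```
# Hypothetical example only: restart_on_promote tells the RA to restart
# PostgreSQL on promotion to work around the pre-9.3 Timeline ID issue.
crm configure primitive pgsql ocf:heartbeat:pgsql \
    params pgctl="/usr/pgsql-9.1/bin/pg_ctl" \
           pgdata="/var/lib/pgsql/9.1/data" \
           rep_mode="sync" \
           restart_on_promote="true" \
    op monitor interval="10s" role="Master" \
    op monitor interval="15s" role="Slave"
crm configure ms ms-pgsql pgsql meta notify="true"
```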



>
> 2) from a recent discussion on this list (or maybe on -dev), RA devs should 
> not
> rely on OCF_RESKEY_CRM_meta_notify_* vars outside of "notify" actions.
>
>> > [...]
>> >>>> What can I do to fix this? What troubleshooting steps can I follow?
>> >>>> Thanks.
>> >
>> > I can not find the result of the stop operation in your log files, maybe 
>> > the
>> > log from CentTest2 would be more useful.
>>
>> Sure. I was looking at centtest1 because I was trying to figure out why it
>> wouldn't promote, but if centtest2 never really stopped (properly) that could
>> explain things. Here's the log from 2 when calling pcs cluster stop:
>>
>> [log log log]
>
> Well this is a normal shutdown and the master was shut down cleanly. As you
> pointed out, the lock file stayed there because some slaves were still up.
>
> I **guess** if you really want such a shutdown to occur, you need to simulate a
> real failure, not shut down the first node cleanly. Try killing corosync.
>
>> > but I can find this:
>> >
>> >  Oct 13 08:29:41 CentTest1 pengine[30095]:   notice: Scheduling Node
>> >  centtest2.ravnalaska.net for shutdown
>> >  ...
>> >  Oct 13 08:29:41 CentTest1 pengine[30095]:   notice: Scheduling Node
>> >  centtest2.ravnalaska.net for shutdown
>> >
>> > Which means the stop operation probably raised an error, leading to a
>> > fencing of the node. In this circumstance, I bet PostgreSQL wasn't able to
>> > stop correctly and the lock file stayed in place.
>> >
>> > Could you please show us your full cluster setup?
>>
>> Sure: how? pcs status shows this, but I suspect that's not what you are
>> asking about:
>
> "pcs config" would do the trick.
>
>
> --
> Jehan-Guillaume de Rorthais
> Dalibo
>



-- 
Keisuke MORI



Re: [ClusterLabs] Antw: Replicated PGSQL woes

2016-10-14 Thread Keisuke MORI
2016-10-14 16:36 GMT+09:00 Ulrich Windl <ulrich.wi...@rz.uni-regensburg.de>:
>>>> Israel Brewster <isr...@ravnalaska.net> wrote on 13.10.2016 at 19:04 in
> message <34091524-d35e-4e28-9c3e-dda6c6a1e...@ravnalaska.net>:
> [...]
>> Oct 13 08:29:39 CentTest1 crmd[30096]:   notice: State transition S_IDLE ->
>> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL
>> origin=abort_transition_graph ]
>> Oct 13 08:29:39 CentTest1 pengine[30095]:   notice: On loss of CCM Quorum:
>> Ignore
>> Oct 13 08:29:39 CentTest1 pengine[30095]:   notice: Stop
>> virtual_ip#011(centtest2.ravnalaska.net)
>> Oct 13 08:29:39 CentTest1 pengine[30095]:   notice: Demote
>> pgsql_96:0#011(Master -> Stopped centtest2.ravnalaska.net)
>> Oct 13 08:29:39 CentTest1 pengine[30095]:   notice: Calculated Transition
>> 193: /var/lib/pacemaker/pengine/pe-input-500.bz2
>
>> Oct 13 08:29:39 CentTest1 crmd[30096]:   notice: Initiating action 43:
>> notify pgsql_96_pre_notify_demote_0 on centtest2.ravnalaska.net
>> Oct 13 08:29:39 CentTest1 crmd[30096]:   notice: Initiating action 45:
>> notify pgsql_96_pre_notify_demote_0 on centtest1.ravnalaska.net (local)
>
> The above section looks wrong, because if one resource is master and the
> other is slave, both cannot be demoted (AFAIK). I'm also surprised that the
> cluster tries to demote a failed master; maybe you have no fencing configured?


Those are "notification" operations before the demote and it is
correct being sent to all the nodes in the cluster.


>
>> Oct 13 08:29:39 CentTest1 crmd[30096]:   notice: Operation
>> pgsql_96_notify_0: ok (node=centtest1.ravnalaska.net, call=230, rc=0,
>> cib-update=0, confirmed=true)
>> Oct 13 08:29:39 CentTest1 crmd[30096]:   notice: Initiating action 6: demote
>> pgsql_96_demote_0 on centtest2.ravnalaska.net
>
> "action 6": Where does it come from? We had 43 and 45!

This is the actual "demote" operation; the numbers are simply action IDs in
the calculated transition graph, so they are not sequential per resource.
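
If you want to see where those numbers come from, the transition can be
replayed from the pe-input file named in the log (a sketch; the path is the
one from the log above and may differ on your node):

```
# Replay the calculated transition; the output lists the numbered actions
# (notify, demote, stop, ...) that crmd then initiates one by one.
crm_simulate -S -x /var/lib/pacemaker/pengine/pe-input-500.bz2
```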


-- 
Keisuke MORI



Re: [ClusterLabs] Replicated PGSQL woes

2016-10-14 Thread Keisuke MORI
2016-10-14 2:04 GMT+09:00 Israel Brewster <isr...@ravnalaska.net>:
> Summary: Two-node cluster setup with latest pgsql resource agent. Postgresql
> starts initially, but failover never happens.

> Oct 13 08:29:47 CentTest1 pgsql(pgsql_96)[19602]: INFO: Master does not
> exist.
> Oct 13 08:29:47 CentTest1 pgsql(pgsql_96)[19602]: WARNING: My data is
> out-of-date. status=DISCONNECT
> Oct 13 08:29:51 CentTest1 pgsql(pgsql_96)[19730]: INFO: Master does not
> exist.
> Oct 13 08:29:51 CentTest1 pgsql(pgsql_96)[19730]: WARNING: My data is
> out-of-date. status=DISCONNECT
>
> Those last two lines repeat indefinitely, but there is no indication that
> the cluster ever tries to promote centtest1 to master. Even if I completely
> shut down the cluster, and bring it back up only on centtest1, pacemaker
> refuses to start postgresql on centtest1 as a master.

This is because the data on centtest1 is considered "out-of-date"
(as the message says :) and promoting that node to master might corrupt your
database.

>
> What can I do to fix this? What troubleshooting steps can I follow? Thanks.
>

It seems that the latest data is only on centtest2, so the
recovery steps should be something like:
 - start centtest2 as master
 - take a base backup from centtest2 to centtest1
 - start centtest1 as slave
 - make sure the replication is working properly

see below for details.
http://clusterlabs.org/wiki/PgSQL_Replicated_Cluster
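
Very roughly, the command side of those steps could look like this (just a
sketch: the data directory, replication user and the "9.6" version are my
assumptions based on the resource name, and the pgsql RA can also generate
recovery.conf itself from its parameters):

```
# On centtest1 (the out-of-date node), after centtest2 is up as master.
# Double-check the data directory before wiping it!
rm -rf /var/lib/pgsql/9.6/data/*
pg_basebackup -h centtest2 -U replicator -D /var/lib/pgsql/9.6/data -X stream -P
# Pre-PostgreSQL-12 style standby configuration (skip if the RA writes it):
cat > /var/lib/pgsql/9.6/data/recovery.conf <<'EOF'
standby_mode = 'on'
primary_conninfo = 'host=centtest2 user=replicator'
EOF
# Then clear the failure state so the cluster starts it as a slave:
pcs resource cleanup pgsql_96
```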


Also, it would be helpful to check the 'pgsql-data-status' and
'pgsql-status' attributes displayed by 'crm_mon -A' to diagnose
whether the replication is working properly or not.

The slave node should have attributes like the ones below; otherwise
something is wrong with the replication and the node will never be
promoted because it does not have the proper data.

```
* Node node2:
    + master-pgsql      : 100
    + pgsql-data-status : STREAMING|SYNC
    + pgsql-status      : HS:sync
```



-- 
Keisuke MORI



Re: [ClusterLabs] [Q] crmsh release plan for pacemaker-1.1.14?

2016-01-07 Thread Keisuke MORI
2016-01-07 17:21 GMT+09:00 Kristoffer Grönlund <kgronl...@suse.com>:
> Keisuke MORI <keisuke.mori...@gmail.com> writes:
>
>> Hi,
>>
>> 2016-01-07 16:35 GMT+09:00 Kristoffer Grönlund <kgronl...@suse.com>:
>>> Keisuke MORI <keisuke.mori...@gmail.com> writes:
>>>
>>>> Hi,
>>>>
>>>> I would like to know if there is any plan for a new crmsh release
>>>> that works together with Pacemaker-1.1.14.
>>>>
>>>> The latest release of crmsh-2.1.4 does not work well with
>>>> Pacemaker-1.1.14-rc4 because of the mismatched schema.
>>>>
>>>> 
>>>> # crm configure load update sample.crm
>>>> ERROR: CIB not supported: validator 'pacemaker-2.4', release '3.0.10'
>>>> ERROR: You may try the upgrade command
>>>> ERROR: configure: Missing requirements
>>>> #
>>>> 
>>>>
>>>> Regards,
>>>> --
>>>> Keisuke MORI
>>>
>>> Hello,
>>>
>>> Yes, I am planning a new release of crmsh very soon. The development
>>> version of crmsh should work well with 1.1.14, so I would recommend using
>>> that for now.
>>>
>>> There are a few issues that I would like to investigate before the
>>> release, but regardless I will release a new version soon.
>>
>> Great!
>> I would look forward to it.
>>
>> Will it be 2.1.5 or 2.2.0 based on the master branch?
>> I'm concerned that 2.2.0 seems to require an additional dependency on
>> python-parallax.
>>
>
> I will release both 2.1.5 and 2.2.0.
>
> Yes, 2.2.0 will require python-parallax. There are packages for
> python-parallax available on the OBS [1]. It is also installable via
> PyPI [2].
>
> 2.1.5 will be based on the current 2.1.4 branch with additional bug
> fixes, and will not require python-parallax.
>
> [1]: 
> https://build.opensuse.org/package/show/devel:languages:python/python-parallax
> [2]: https://pypi.python.org/pypi/parallax/

Thank you for your detailed answer!
Everything is clear to me now.

Regards,
-- 
Keisuke MORI



Re: [ClusterLabs] [OCF] Pacemaker reports a multi-state clone resource instance as running while it is not in fact

2016-01-06 Thread Keisuke MORI
Hi,

2016-01-06 22:57 GMT+09:00 Jan Pokorný <jpoko...@redhat.com>:
> Hello,
>
> On 04/01/16 17:33 +0100, Bogdan Dobrelya wrote:

>> Note, that it seems the very import action causes the issue, not the
>> ocf_run or ocf_log code itself.
>>
>> [0] https://github.com/ClusterLabs/resource-agents/issues/734
>
> Have to wonder if there is any correlation with the issue discussed
> recently:
>
> http://oss.clusterlabs.org/pipermail/users/2015-November/001806.html
>
> Note that ocf:pacemaker:ClusterMon resource also sources
> ocf-shellfuncs set of helper functions.

I think that it's unlikely to be relevant.

In this case, /bin/dash is apparently the cause of the fork bomb, due to a
bug in handling the PS4 shell variable, judging from the test results in
the GitHub issue 734.
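
For illustration, the mechanism is roughly this (a generic sketch in bash,
not the exact ocf-shellfuncs code from the issue):

```
# When xtrace is on, PS4 is expanded before every traced command, including
# any command substitution it contains, so every traced line forks an extra
# process. A shell that also traces that substitution can recurse into a
# fork bomb, which is reportedly what the dash bug amounts to.
PS4='+ $(date "+%T") '
set -x
echo "each traced command now also forks a date process"
set +x
```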

In the ClusterMon case you mentioned, crm_mon is the one forking
repeatedly, not the ClusterMon shell script, and it is not invoked via
/bin/dash either.

Regards,
-- 
Keisuke MORI
