Re: [Pacemaker] booth is the state of "started" on pacemaker before booth write ticket info in cib.

2013-06-18 Thread Jiaju Zhang
On Thu, 2013-06-13 at 19:06 +0900, Yuichi SEINO wrote:
> Hi Jiaju,
> 
> Could you merge RA and Xia patch ?

Merged.

> 
> If the problem happened, I think that I want to fix it after this patch 
> merged.

Thanks!

Regards,
Jiaju

> 
> Sincerely,
> Yuichi
> 
> 
> 2013/4/10 Yuichi SEINO :
> > Hi,
> >
> > I still should not accept a reply from anyone. Hopefully, I think that
> > I want to early fix this issue.
> >
> > Sincerely,
> > Yuichi
> >
> > 2013/3/19 Yuichi SEINO :
> >> Hi Xia and Jiaju,
> >>
> >> Because RA may read an unintended file, I think that it is better to
> >> check the existence of lockfile in RA. I detailed a previous mail.
> >>
> >> What do you think about this?
> >> If you agrees to this, Could you fix RA?
> >>
> >> Sincerely,
> >> Yuichi
> >>
> >> 2013/2/25 Yuichi SEINO :
> >>> Hi Jiaju,
> >>>
> >>> 2013/2/22 Jiaju Zhang :
> >>>> On Wed, 2013-02-20 at 16:26 +0900, Yuichi SEINO wrote:
> >>>>> Hi Jiaju,
> >>>>>
> >>>>> I am testing this patch.
> >>>>> When a lockfile was removed, it seems that the stop of RA isn't a
> >>>>> intended behavior.
> >>>>
> >>>> I'm just curious how the lockfile was removed. Basically the existence
> >>>> of the lockfile shows one boothd is started, and prevent being wrongly
> >>>> started again. So the lockfile should not be removed intentionally by
> >>>> the admin.
> >>>
> >>> I used how to run "mv" to the pid file.
> >>>
> >>>  The other case also is the same situation. When we already run
> >>> "boothd -l other.pid" on node, the lockfile exists in the other place.
> >>> So, $lockfile doesn't exist in the start and stop of RA.
> >>>
> >>>  I think that it is better to take account of  the existence of
> >>> lockfile or $pidnum, because /proc/cmdline may happen to fulfill this
> >>> if. For example, anything RA includes the check if pid is the empty.
> >>>
> >>> anything_status() {
> >>> if test -f "$pidfile"
> >>> then
> >>> if pid=`getpid $pidfile` && [ "$pid" ] && kill -s 0 $pid
> >>> then
> >>> return $OCF_SUCCESS
> >>> else
> >>> # pidfile w/o process means the process died
> >>> return $OCF_ERR_GENERIC
> >>> fi
> >>> else
> >>> return $OCF_NOT_RUNNING
> >>> fi
> >>> }
> >>>
> >>>>
> >>>> Thanks,
> >>>> Jiaju
> >>>>
> >>>
> >>> Sincerely,
> >>> Yuichi
> >>>
> >>
> >
> > --
> > Yuichi SEINO
> > METROSYSTEMS CORPORATION
> > E-mail:seino.clust...@gmail.com
> 
> 
> 



___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] booth is the state of "started" on pacemaker before booth write ticket info in cib.

2013-06-13 Thread Yuichi SEINO
Hi Jiaju,

Could you merge the RA patch and Xia's patch?

If a problem turns up, I would like to fix it after this patch is merged.

Sincerely,
Yuichi


2013/4/10 Yuichi SEINO :
> Hi,
>
> I still should not accept a reply from anyone. Hopefully, I think that
> I want to early fix this issue.
>
> Sincerely,
> Yuichi
>
> 2013/3/19 Yuichi SEINO :
>> Hi Xia and Jiaju,
>>
>> Because RA may read an unintended file, I think that it is better to
>> check the existence of lockfile in RA. I detailed a previous mail.
>>
>> What do you think about this?
>> If you agrees to this, Could you fix RA?
>>
>> Sincerely,
>> Yuichi
>>
>> 2013/2/25 Yuichi SEINO :
>>> Hi Jiaju,
>>>
>>> 2013/2/22 Jiaju Zhang :
>>>> On Wed, 2013-02-20 at 16:26 +0900, Yuichi SEINO wrote:
>>>>> Hi Jiaju,
>>>>>
>>>>> I am testing this patch.
>>>>> When a lockfile was removed, it seems that the stop of RA isn't a
>>>>> intended behavior.
>>>>
>>>> I'm just curious how the lockfile was removed. Basically the existence
>>>> of the lockfile shows one boothd is started, and prevent being wrongly
>>>> started again. So the lockfile should not be removed intentionally by
>>>> the admin.
>>>
>>> I used how to run "mv" to the pid file.
>>>
>>>  The other case also is the same situation. When we already run
>>> "boothd -l other.pid" on node, the lockfile exists in the other place.
>>> So, $lockfile doesn't exist in the start and stop of RA.
>>>
>>>  I think that it is better to take account of  the existence of
>>> lockfile or $pidnum, because /proc/cmdline may happen to fulfill this
>>> if. For example, anything RA includes the check if pid is the empty.
>>>
>>> anything_status() {
>>> if test -f "$pidfile"
>>> then
>>> if pid=`getpid $pidfile` && [ "$pid" ] && kill -s 0 $pid
>>> then
>>> return $OCF_SUCCESS
>>> else
>>> # pidfile w/o process means the process died
>>> return $OCF_ERR_GENERIC
>>> fi
>>> else
>>> return $OCF_NOT_RUNNING
>>> fi
>>> }
>>>

>>>> Thanks,
>>>> Jiaju

>>>
>>> Sincerely,
>>> Yuichi
>>>
>>
>
> --
> Yuichi SEINO
> METROSYSTEMS CORPORATION
> E-mail:seino.clust...@gmail.com



-- 
Yuichi SEINO
METROSYSTEMS CORPORATION
E-mail:seino.clust...@gmail.com



Re: [Pacemaker] booth is the state of "started" on pacemaker before booth write ticket info in cib.

2013-04-09 Thread Yuichi SEINO
Hi,

I have not received a reply from anyone yet. I would like to get this
issue fixed soon.

Sincerely,
Yuichi

2013/3/19 Yuichi SEINO :
> Hi Xia and Jiaju,
>
> Because RA may read an unintended file, I think that it is better to
> check the existence of lockfile in RA. I detailed a previous mail.
>
> What do you think about this?
> If you agrees to this, Could you fix RA?
>
> Sincerely,
> Yuichi
>
> 2013/2/25 Yuichi SEINO :
>> Hi Jiaju,
>>
>> 2013/2/22 Jiaju Zhang :
>>> On Wed, 2013-02-20 at 16:26 +0900, Yuichi SEINO wrote:
>>>> Hi Jiaju,
>>>>
>>>> I am testing this patch.
>>>> When a lockfile was removed, it seems that the stop of RA isn't a
>>>> intended behavior.
>>>
>>> I'm just curious how the lockfile was removed. Basically the existence
>>> of the lockfile shows one boothd is started, and prevent being wrongly
>>> started again. So the lockfile should not be removed intentionally by
>>> the admin.
>>
>> I used how to run "mv" to the pid file.
>>
>>  The other case also is the same situation. When we already run
>> "boothd -l other.pid" on node, the lockfile exists in the other place.
>> So, $lockfile doesn't exist in the start and stop of RA.
>>
>>  I think that it is better to take account of  the existence of
>> lockfile or $pidnum, because /proc/cmdline may happen to fulfill this
>> if. For example, anything RA includes the check if pid is the empty.
>>
>> anything_status() {
>> if test -f "$pidfile"
>> then
>> if pid=`getpid $pidfile` && [ "$pid" ] && kill -s 0 $pid
>> then
>> return $OCF_SUCCESS
>> else
>> # pidfile w/o process means the process died
>> return $OCF_ERR_GENERIC
>> fi
>> else
>> return $OCF_NOT_RUNNING
>> fi
>> }
>>
>>>
>>> Thanks,
>>> Jiaju
>>>
>>
>> Sincerely,
>> Yuichi
>>
>

--
Yuichi SEINO
METROSYSTEMS CORPORATION
E-mail:seino.clust...@gmail.com



Re: [Pacemaker] booth is the state of "started" on pacemaker before booth write ticket info in cib.

2013-03-18 Thread Yuichi SEINO
Hi Xia and Jiaju,

Because the RA may read an unintended file, I think it would be better to
check for the existence of the lockfile in the RA. I described the details
in a previous mail.

What do you think about this?
If you agree, could you fix the RA?

Sincerely,
Yuichi

2013/2/25 Yuichi SEINO :
> Hi Jiaju,
>
> 2013/2/22 Jiaju Zhang :
>> On Wed, 2013-02-20 at 16:26 +0900, Yuichi SEINO wrote:
>>> Hi Jiaju,
>>>
>>> I am testing this patch.
>>> When a lockfile was removed, it seems that the stop of RA isn't a
>>> intended behavior.
>>
>> I'm just curious how the lockfile was removed. Basically the existence
>> of the lockfile shows one boothd is started, and prevent being wrongly
>> started again. So the lockfile should not be removed intentionally by
>> the admin.
>
> I used how to run "mv" to the pid file.
>
>  The other case also is the same situation. When we already run
> "boothd -l other.pid" on node, the lockfile exists in the other place.
> So, $lockfile doesn't exist in the start and stop of RA.
>
>  I think that it is better to take account of  the existence of
> lockfile or $pidnum, because /proc/cmdline may happen to fulfill this
> if. For example, anything RA includes the check if pid is the empty.
>
> anything_status() {
> if test -f "$pidfile"
> then
> if pid=`getpid $pidfile` && [ "$pid" ] && kill -s 0 $pid
> then
> return $OCF_SUCCESS
> else
> # pidfile w/o process means the process died
> return $OCF_ERR_GENERIC
> fi
> else
> return $OCF_NOT_RUNNING
> fi
> }
>
>>
>> Thanks,
>> Jiaju
>>
>
> Sincerely,
> Yuichi
>

--
Yuichi SEINO
METROSYSTEMS CORPORATION
E-mail:seino.clust...@gmail.com



Re: [Pacemaker] booth is the state of "started" on pacemaker before booth write ticket info in cib.

2013-02-24 Thread Yuichi SEINO
Hi Jiaju,

2013/2/22 Jiaju Zhang :
> On Wed, 2013-02-20 at 16:26 +0900, Yuichi SEINO wrote:
>> Hi Jiaju,
>>
>> I am testing this patch.
>> When a lockfile was removed, it seems that the stop of RA isn't a
>> intended behavior.
>
> I'm just curious how the lockfile was removed. Basically the existence
> of the lockfile shows one boothd is started, and prevent being wrongly
> started again. So the lockfile should not be removed intentionally by
> the admin.

I ran "mv" on the pid file.

The other case leads to the same situation. When "boothd -l other.pid" is
already running on the node, the lockfile exists in a different place.
So $lockfile does not exist when the RA runs start or stop.

I think it would be better to take the existence of the lockfile, or an
empty $pidnum, into account, because /proc/cmdline may happen to satisfy
this "if". For example, the anything RA includes a check for whether the
pid is empty.

anything_status() {
    if test -f "$pidfile"
    then
        if pid=`getpid $pidfile` && [ "$pid" ] && kill -s 0 $pid
        then
            return $OCF_SUCCESS
        else
            # pidfile w/o process means the process died
            return $OCF_ERR_GENERIC
        fi
    else
        return $OCF_NOT_RUNNING
    fi
}
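
The check suggested above could look roughly like the following sketch. It
is not the booth-site RA itself: the helper name read_lockfile and its
return codes are hypothetical, and it only assumes the "pid state" lockfile
layout written by the patch under discussion.

```shell
#!/bin/sh
# Hypothetical helper (not part of booth): read "<pid> <state>" from a lockfile.
# Returns 1 if the lockfile is missing, 2 if the pid field is empty or
# not numeric, and 0 (printing both fields) otherwise.
read_lockfile() {
    lockfile=$1
    pidnum=
    daemonstate=

    # A moved or never-created lockfile must not fall through to /proc checks.
    [ -f "$lockfile" ] || return 1

    # read returns non-zero when the file has no trailing newline,
    # but it still fills in the variables, so its status is ignored here.
    read -r pidnum daemonstate < "$lockfile" || :

    case "$pidnum" in
        ''|*[!0-9]*) return 2 ;;
    esac

    printf '%s %s\n' "$pidnum" "$daemonstate"
}

# Usage with a temporary file standing in for the real lockfile.
tmp=$(mktemp)
printf '17036 0' > "$tmp"
read_lockfile "$tmp"                      # prints: 17036 0
: > "$tmp"
read_lockfile "$tmp" || echo "rc=$?"      # prints: rc=2
rm -f "$tmp"
read_lockfile "$tmp" || echo "rc=$?"      # prints: rc=1
```

Separating "lockfile missing" from "pid empty" lets a caller log which of
the two cases it hit before deciding what to return.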

>
> Thanks,
> Jiaju
>

Sincerely,
Yuichi

--
Yuichi SEINO
METROSYSTEMS CORPORATION
E-mail:seino.clust...@gmail.com



Re: [Pacemaker] booth is the state of "started" on pacemaker before booth write ticket info in cib.

2013-02-21 Thread Jiaju Zhang
On Wed, 2013-02-20 at 16:26 +0900, Yuichi SEINO wrote:
> Hi Jiaju,
> 
> I am testing this patch.
> When a lockfile was removed, it seems that the stop of RA isn't a
> intended behavior. 

I'm just curious how the lockfile was removed. Basically the existence
of the lockfile shows one boothd is started, and prevent being wrongly
started again. So the lockfile should not be removed intentionally by
the admin.

Thanks,
Jiaju

> Currently, If "pidnum" is empty, RA run "cat
> /proc//cmdline". /proc/cmdline is boot parameter file. So, I added the
> check about a existence of lockfile.
> 
> diff --git a/script/ocf/booth-site b/script/ocf/booth-site
> index 2575643..7c775dc 100755
> --- a/script/ocf/booth-site
> +++ b/script/ocf/booth-site
> @@ -116,6 +116,10 @@ booth_check_daemon_state(){
> 
> case $rc in
> $OCF_SUCCESS)
> +   if [ ! -f $lockfile ]; then
> +   ocf_log err "lockfile not exists.(${lockfile})"
> +   return $BOOTH_DAEMON_EXIST;
> +   fi
> pidnum=$(cat $lockfile |awk '{print $1}')
> daemonstate=$(cat $lockfile |awk '{print $2}')
> if cat /proc/$pidnum/cmdline |grep $OCF_RESKEY_type
> >/dev/null 2>&1; then
> 
> When this happened, I got "crm resource trace booth"
> 
> + 21:09:48: 223: '[' '!' ']'
> + 21:09:48: 224: OCF_RESKEY_daemon=boothd
> + 21:09:48: 227: '[' '!' ']'
> + 21:09:48: 228: OCF_RESKEY_type=site
> + 21:09:48: 231: case $__OCF_ACTION in
> + 21:09:48: 236: booth_stop
> + 21:09:48: booth_stop:166: booth_check_daemon_state
> + 21:09:48: booth_check_daemon_state:115: booth_check_daemon_exist
> + 21:09:48: booth_check_daemon_exist:105: killall -0 boothd
> + 21:09:48: booth_check_daemon_exist:105: rc=0
> + 21:09:48: booth_check_daemon_exist:107: case $rc in
> + 21:09:48: booth_check_daemon_exist:108: return 0
> + 21:09:48: booth_check_daemon_state:115: rc=0
> + 21:09:48: booth_check_daemon_state:117: case $rc in
> + 21:09:48: booth_check_daemon_state:117: case $rc in
> ++ 21:09:48: booth_check_daemon_state:119: awk '{print $1}'
> + 21:09:48: booth_check_daemon_state:117: case $rc in
> + 21:09:48: booth_check_daemon_state:117: case $rc in
> ++ 21:09:48: booth_check_daemon_state:119: cat /var/run/booth.pid
> + 21:09:48: booth_check_daemon_state:117: case $rc in
> + 21:09:48: booth_check_daemon_state:117: case $rc in
> + 21:09:48: booth_check_daemon_state:119: pidnum=
> ++ 21:09:48: booth_check_daemon_state:120: awk '{print $2}'
> ++ 21:09:48: booth_check_daemon_state:120: cat /var/run/booth.pid
> + 21:09:48: booth_check_daemon_state:120: daemonstate=
> + 21:09:48: booth_check_daemon_state:121: grep site
> + 21:09:48: booth_check_daemon_state:121: cat /proc//cmdline
> + 21:09:48: booth_check_daemon_state:122: case $daemonstate in
> + 21:09:48: booth_check_daemon_state:125: return 4
> + 21:09:48: booth_stop:166: rc=4
> + 21:09:48: booth_stop:168: case $rc in
> + 21:09:48: booth_stop:173: return 1
> + 21:09:48: 246: rc=1
> + 21:09:48: 248: exit 1
> 
> 
> 
> 
> 2013/2/19 Jiaju Zhang :
> > Hi Yuichi,
> >
> > On Tue, 2013-02-19 at 10:27 +0900, Yuichi SEINO wrote:
> >> Hi Xia,
> >>
> >> I have a question about the following part. The write man explain that
> >> "errno" is set appropriately if the write return -1. So, if "rv" is
> >> equal to 0, strerror(errno) may not output the correct message. What
> >> do you think about it?
> >
> > Good catch, I think we should differentiate the cases of rv == -1 or rv
> > == 0. Maybe setting errno to ENOSPC when rv == 0.
> >
> > BTW, apart from that, does this patch fix your original issue?
> >
> > Thanks,
> > Jiaju
> >
> 
> 
> 
> --
> Yuichi SEINO
> METROSYSTEMS CORPORATION
> E-mail:seino.clust...@gmail.com





Re: [Pacemaker] booth is the state of "started" on pacemaker before booth write ticket info in cib.

2013-02-19 Thread Yuichi SEINO
Hi Jiaju,

I am testing this patch.
When the lockfile has been removed, the RA's stop action does not behave as
intended. Currently, if "pidnum" is empty, the RA runs "cat
/proc//cmdline", and /proc/cmdline is the kernel boot parameter file. So I
added a check for the existence of the lockfile.

diff --git a/script/ocf/booth-site b/script/ocf/booth-site
index 2575643..7c775dc 100755
--- a/script/ocf/booth-site
+++ b/script/ocf/booth-site
@@ -116,6 +116,10 @@ booth_check_daemon_state(){

case $rc in
$OCF_SUCCESS)
+   if [ ! -f $lockfile ]; then
+   ocf_log err "lockfile not exists.(${lockfile})"
+   return $BOOTH_DAEMON_EXIST;
+   fi
pidnum=$(cat $lockfile |awk '{print $1}')
daemonstate=$(cat $lockfile |awk '{print $2}')
if cat /proc/$pidnum/cmdline |grep $OCF_RESKEY_type >/dev/null 2>&1; then

When this happened, I got the following output from "crm resource trace booth":

+ 21:09:48: 223: '[' '!' ']'
+ 21:09:48: 224: OCF_RESKEY_daemon=boothd
+ 21:09:48: 227: '[' '!' ']'
+ 21:09:48: 228: OCF_RESKEY_type=site
+ 21:09:48: 231: case $__OCF_ACTION in
+ 21:09:48: 236: booth_stop
+ 21:09:48: booth_stop:166: booth_check_daemon_state
+ 21:09:48: booth_check_daemon_state:115: booth_check_daemon_exist
+ 21:09:48: booth_check_daemon_exist:105: killall -0 boothd
+ 21:09:48: booth_check_daemon_exist:105: rc=0
+ 21:09:48: booth_check_daemon_exist:107: case $rc in
+ 21:09:48: booth_check_daemon_exist:108: return 0
+ 21:09:48: booth_check_daemon_state:115: rc=0
+ 21:09:48: booth_check_daemon_state:117: case $rc in
+ 21:09:48: booth_check_daemon_state:117: case $rc in
++ 21:09:48: booth_check_daemon_state:119: awk '{print $1}'
+ 21:09:48: booth_check_daemon_state:117: case $rc in
+ 21:09:48: booth_check_daemon_state:117: case $rc in
++ 21:09:48: booth_check_daemon_state:119: cat /var/run/booth.pid
+ 21:09:48: booth_check_daemon_state:117: case $rc in
+ 21:09:48: booth_check_daemon_state:117: case $rc in
+ 21:09:48: booth_check_daemon_state:119: pidnum=
++ 21:09:48: booth_check_daemon_state:120: awk '{print $2}'
++ 21:09:48: booth_check_daemon_state:120: cat /var/run/booth.pid
+ 21:09:48: booth_check_daemon_state:120: daemonstate=
+ 21:09:48: booth_check_daemon_state:121: grep site
+ 21:09:48: booth_check_daemon_state:121: cat /proc//cmdline
+ 21:09:48: booth_check_daemon_state:122: case $daemonstate in
+ 21:09:48: booth_check_daemon_state:125: return 4
+ 21:09:48: booth_stop:166: rc=4
+ 21:09:48: booth_stop:168: case $rc in
+ 21:09:48: booth_stop:173: return 1
+ 21:09:48: 246: rc=1
+ 21:09:48: 248: exit 1
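
The trace above shows "pidnum" being empty, so the path becomes
/proc//cmdline. Repeated slashes are treated as a single slash during path
resolution, so this is the same file as /proc/cmdline. A minimal sketch of
the effect and of one possible guard follows; both function names are
hypothetical and are not taken from the RA.

```shell
#!/bin/sh
# Build the /proc path the same way the RA does, for a given pid string.
proc_cmdline_path() {
    printf '/proc/%s/cmdline\n' "$1"
}

# Guarded variant (hypothetical): refuse to build a path unless the pid
# is a non-empty string of digits.
safe_proc_cmdline_path() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;
    esac
    printf '/proc/%s/cmdline\n' "$1"
}

proc_cmdline_path ""                                # prints: /proc//cmdline
proc_cmdline_path "17036"                           # prints: /proc/17036/cmdline
safe_proc_cmdline_path "" || echo "empty pid rejected"
```

With the guard, an empty pid is reported as an error instead of silently
matching against the kernel boot parameters.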




2013/2/19 Jiaju Zhang :
> Hi Yuichi,
>
> On Tue, 2013-02-19 at 10:27 +0900, Yuichi SEINO wrote:
>> Hi Xia,
>>
>> I have a question about the following part. The write man explain that
>> "errno" is set appropriately if the write return -1. So, if "rv" is
>> equal to 0, strerror(errno) may not output the correct message. What
>> do you think about it?
>
> Good catch, I think we should differentiate the cases of rv == -1 or rv
> == 0. Maybe setting errno to ENOSPC when rv == 0.
>
> BTW, apart from that, does this patch fix your original issue?
>
> Thanks,
> Jiaju
>



--
Yuichi SEINO
METROSYSTEMS CORPORATION
E-mail:seino.clust...@gmail.com



Re: [Pacemaker] booth is the state of "started" on pacemaker before booth write ticket info in cib.

2013-02-18 Thread Jiaju Zhang
Hi Yuichi,

On Tue, 2013-02-19 at 10:27 +0900, Yuichi SEINO wrote:
> Hi Xia,
> 
> I have a question about the following part. The write man explain that
> "errno" is set appropriately if the write return -1. So, if "rv" is
> equal to 0, strerror(errno) may not output the correct message. What
> do you think about it?

Good catch, I think we should differentiate the cases of rv == -1 or rv
== 0. Maybe setting errno to ENOSPC when rv == 0.

BTW, apart from that, does this patch fix your original issue?

Thanks,
Jiaju




Re: [Pacemaker] booth is the state of "started" on pacemaker before booth write ticket info in cib.

2013-02-18 Thread Yuichi SEINO
Hi Xia,

I have a question about the following part. The write(2) man page explains
that "errno" is set appropriately if write returns -1. So, if "rv" is
equal to 0, strerror(errno) may not output the correct message. What
do you think about it?

+ rv = write(fd, buf, strlen(buf));
+ if (rv <= 0) {
+ log_error("write to fd(%d) error, return(%d), message(%s)",
+  fd, rv, strerror(errno));
+ rv = -1;
+ return rv;
+ }

Sincerely,
Yuichi

2013/2/16 Xia Li :
> Hi Yuichi
>
> >>> On 2/5/2013 at 10:46 AM, Yuichi SEINO wrote:
>> Hi Xia,
>>
>> I watched your patch. Probably, I think that this patch has a problem
>> with the following check. Do you need if "rv" is equal to 0? The lseek
>> man describes that the lseek return -1 if the lseek error happens.
>> Otherwise, lseek return the number of bytes. So, 0 may not be error.
>> And, this patch includes several "rv <= 0". What do you think about
>> it?
>
> Sorry, this is my miss. Thanks for your check.
> I have modified it and recreated the patch.
>
> Regards,
>  Xia Li
>
>
>



-- 
Yuichi SEINO
METROSYSTEMS CORPORATION
E-mail:seino.clust...@gmail.com



Re: [Pacemaker] booth is the state of "started" on pacemaker before booth write ticket info in cib.

2013-02-16 Thread Xia Li
Hi Yuichi

>>> On 2/5/2013 at 10:46 AM, Yuichi SEINO wrote:
> Hi Xia, 
>  
> I watched your patch. Probably, I think that this patch has a problem 
> with the following check. Do you need if "rv" is equal to 0? The lseek 
> man describes that the lseek return -1 if the lseek error happens. 
> Otherwise, lseek return the number of bytes. So, 0 may not be error. 
> And, this patch includes several "rv <= 0". What do you think about 
> it? 

Sorry, this was my mistake. Thanks for checking.
I have modified it and recreated the patch.

Regards,
 Xia Li



Index: booth/src/main.c
===
--- booth.orig/src/main.c
+++ booth/src/main.c
@@ -60,6 +60,12 @@ static int client_size = 0;
 struct client *client = NULL;
 struct pollfd *pollfd = NULL;
 
+typedef enum 
+{
+   BOOTHD_STARTED=0,
+   BOOTHD_STARTING
+} BOOTH_DAEMON_STATE;
+
 int poll_timeout = -1;
 
 typedef enum {
@@ -447,7 +453,34 @@ static int setup_timer(void)
return timerlist_init();
 }
 
-static int loop(void)
+static int write_daemon_state(int fd, int state)
+{
+   char buf[16];
+   int rv=0;
+   
+   memset(buf, 0, sizeof(buf));
+   snprintf(buf, sizeof(buf), "%d %d", getpid(), state);
+
+   rv = lseek(fd, 0, SEEK_SET);
+   if (rv < 0) {
+   log_error("lseek set fd(%d) offset to 0 error, return(%d), message(%s)",
+   fd, rv, strerror(errno));
+   rv = -1;
+   return rv;
+   } 
+   
+   rv = write(fd, buf, strlen(buf));
+   if (rv <= 0) {
+   log_error("write to fd(%d) error, return(%d), message(%s)",
+  fd, rv, strerror(errno));
+   rv = -1;
+   return rv;
+   }
+
+   rv = 0;
+   return rv;
+}
+static int loop(int fd)
 {
void (*workfn) (int ci);
void (*deadfn) (int ci);
@@ -470,6 +503,18 @@ static int loop(void)
goto fail;
client_add(rv, process_listener, NULL);
 
+   rv = write_daemon_state(fd, BOOTHD_STARTED);
+   if (rv != 0) {
+   log_error("write daemon state %d to lockfile error %s: %s",
+  BOOTHD_STARTED, cl.lockfile, strerror(errno));
+   goto fail;
+   }
+
+   if (cl.type == ARBITRATOR)
+   log_info("BOOTH arbitrator daemon started");
+   else if (cl.type == SITE)
+   log_info("BOOTH cluster site daemon started");
+
 while (1) {
 rv = poll(pollfd, client_maxi + 1, poll_timeout);
 if (rv == -1 && errno == EINTR)
@@ -677,9 +722,10 @@ static int do_revoke(void)
return do_command(BOOTHC_CMD_REVOKE);
 }
 
+
+
 static int lockfile(void)
 {
-   char buf[16];
struct flock lock;
int fd, rv;
 
@@ -687,39 +733,36 @@ static int lockfile(void)
if (fd < 0) {
 log_error("lockfile open error %s: %s",
   cl.lockfile, strerror(errno));
-return -1;
-}   
-
-lock.l_type = F_WRLCK;
-lock.l_start = 0;
-lock.l_whence = SEEK_SET;
-lock.l_len = 0;
-
-rv = fcntl(fd, F_SETLK, &lock);
-if (rv < 0) {
-log_error("lockfile setlk error %s: %s",
-  cl.lockfile, strerror(errno));
-goto fail;
-}
-
-rv = ftruncate(fd, 0);
-if (rv < 0) {
-log_error("lockfile truncate error %s: %s",
-  cl.lockfile, strerror(errno));
-goto fail;
-}
+return -1;
+}   
 
-memset(buf, 0, sizeof(buf));
-snprintf(buf, sizeof(buf), "%d\n", getpid());
-
-rv = write(fd, buf, strlen(buf));
-if (rv <= 0) {
-log_error("lockfile write error %s: %s",
-  cl.lockfile, strerror(errno));
-goto fail;
-}
+lock.l_type = F_WRLCK;
+lock.l_start = 0;
+lock.l_whence = SEEK_SET;
+lock.l_len = 0;
+
+rv = fcntl(fd, F_SETLK, &lock);
+if (rv < 0) {
+log_error("lockfile setlk error %s: %s",
+  cl.lockfile, strerror(errno));
+goto fail;
+}
+
+rv = ftruncate(fd, 0);
+if (rv < 0) {
+log_error("lockfile truncate error %s: %s",
+  cl.lockfile, strerror(errno));
+goto fail;
+}
+
+rv = write_daemon_state(fd, BOOTHD_STARTING);
+if (rv != 0) {
+   log_error("write daemon state %d to lockfile error %s: %s",
+  BOOTHD_STARTING, cl.lockfile, strerror(errno));
+   goto fail;
+}
 
-return fd;
+return fd;
  fail:
 close(fd);
 return -1;
@@ -953,14 +996,14 @@ static int do_server(int type)
return fd;
 
if (type == ARBITRATOR)
-   log_info("BOOTH arbitrator daem

Re: [Pacemaker] booth is the state of "started" on pacemaker before booth write ticket info in cib.

2013-02-04 Thread Yuichi SEINO
Hi Xia,

I looked at your patch. I think this patch has a problem with the
following check. Do you need to treat "rv" equal to 0 as an error? The
lseek man page says that lseek returns -1 if an error occurs.
Otherwise, lseek returns the resulting offset in bytes. So 0 may not be
an error. Also, this patch includes several "rv <= 0" checks. What do you
think about it?


+ rv = lseek(fd, 0, SEEK_SET);
+ if (rv <= 0) {
+ log_error("lseek set offset to 0 error: %d: %s",
+ fd, strerror(errno));
+ }

Sincerely,
Yuichi


2013/2/1 Yuichi SEINO :
> Hi Xia,
>
> Thanks for the patch. I have a question.
>
> Following errors always be output. Are you correct about the log level?
> If this isn't a error, it would be better to output as a log info.
> And, I attached the log.
>
> ERROR: lseek set offset to 0 error: 4: Success
> or
> ERROR: lseek set offset to 0 error: 4: Operation now in progress
>
> Sincerely,
> Yuichi
>


--
Yuichi SEINO
METROSYSTEMS CORPORATION
E-mail:seino.clust...@gmail.com



Re: [Pacemaker] booth is the state of "started" on pacemaker before booth write ticket info in cib.

2013-02-01 Thread Yuichi SEINO
Hi Xia,

Thanks for the patch. I have a question.

The following errors are always output. Is the log level correct?
If this isn't an error, it would be better to log it at info level.
I have attached the log.

ERROR: lseek set offset to 0 error: 4: Success
or
ERROR: lseek set offset to 0 error: 4: Operation now in progress

Sincerely,
Yuichi

2013/1/31 Xia Li :
> Hi Yuichi
>
> >>> On 1/31/2013 at 02:05 PM, in message
> <510a7a3502156...@soto.provo.novell.com>, "Xia Li" wrote:
>> Hi Yuichi
>>
>> I create two patches trying to fix this issue.
>>
>> In these patches, expand lockfile() to let it not only record the daemon
>> pid,
>> but also record daemon starting status(include "starting" and "started") .
>> At the same time, modify the logic of the controld RA, so that it can read
>
> Sorry, it's typo, it should be boothd RAs.
>
> Regards,
>  Xia Li
>



--
Yuichi SEINO
METROSYSTEMS CORPORATION
E-mail:seino.clust...@gmail.com

# booth site -D
booth-site[17036]: 2013/02/01_17:01:17 ERROR: lseek set offset to 0 error: 4: Success
booth-site[17036]: 2013/02/01_17:01:17 info: BOOTH cluster site daemon is starting.
booth-site[17036]: 2013/02/01_17:01:17 info: command: 'crm_ticket -t ticketA -G owner --quiet' was executed
booth-site[17036]: 2013/02/01_17:01:17 info: command: 'crm_ticket -t ticketA -G expires --quiet' was executed
booth-site[17036]: 2013/02/01_17:01:17 info: command: 'crm_ticket -t ticketA -G ballot --quiet' was executed
booth-site[17036]: 2013/02/01_17:01:17 debug: attempting catchup from 192.168.201.166
booth-site[17036]: 2013/02/01_17:01:17 debug: connected to 192.168.201.166
booth-site[17036]: 2013/02/01_17:01:17 debug: sent catchup command to 192.168.201.166
booth-site[17036]: 2013/02/01_17:01:17 debug: attempting catchup from 192.168.201.167
booth-site[17036]: 2013/02/01_17:01:17 debug: connected to 192.168.201.167
booth-site[17036]: 2013/02/01_17:01:17 debug: sent catchup command to 192.168.201.167
booth-site[17036]: 2013/02/01_17:01:17 debug: attempting catchup from 192.168.201.168
booth-site[17036]: 2013/02/01_17:01:17 debug: connected to 192.168.201.168
booth-site[17036]: 2013/02/01_17:01:17 debug: sent catchup command to 192.168.201.168
booth-site[17036]: 2013/02/01_17:01:17 debug: attempting catchup from 192.168.201.169
booth-site[17036]: 2013/02/01_17:01:17 debug: connected to 192.168.201.169
booth-site[17036]: 2013/02/01_17:01:17 debug: sent catchup command to 192.168.201.169
booth-site[17036]: 2013/02/01_17:01:17 info: command: 'crm_ticket -t ticketA -S owner -v 2' was executed
booth-site[17036]: 2013/02/01_17:01:17 info: command: 'crm_ticket -t ticketA -S expires -v 1359705709' was executed
booth-site[17036]: 2013/02/01_17:01:17 info: command: 'crm_ticket -t ticketA -S ballot -v 2' was executed
booth-site[17036]: 2013/02/01_17:01:17 info: command: 'crm_ticket -t ticketA -r --force' was executed
booth-site[17036]: 2013/02/01_17:01:17 debug: catchup result: name: ticketA, owner: 2, ballot: 2, expires: 1359705709
booth-site[17036]: 2013/02/01_17:01:17 info: command: 'crm_ticket -t ticketA -S owner -v 2' was executed
booth-site[17036]: 2013/02/01_17:01:17 info: command: 'crm_ticket -t ticketA -S expires -v 1359705709' was executed
booth-site[17036]: 2013/02/01_17:01:17 info: command: 'crm_ticket -t ticketA -S ballot -v 2' was executed
booth-site[17036]: 2013/02/01_17:01:17 info: command: 'crm_ticket -t ticketB -G owner --quiet' was executed
booth-site[17036]: 2013/02/01_17:01:17 info: command: 'crm_ticket -t ticketB -G expires --quiet' was executed
booth-site[17036]: 2013/02/01_17:01:17 info: command: 'crm_ticket -t ticketB -G ballot --quiet' was executed
booth-site[17036]: 2013/02/01_17:01:17 debug: attempting catchup from 192.168.201.166
booth-site[17036]: 2013/02/01_17:01:17 debug: connected to 192.168.201.166
booth-site[17036]: 2013/02/01_17:01:17 debug: sent catchup command to 192.168.201.166
booth-site[17036]: 2013/02/01_17:01:17 debug: attempting catchup from 192.168.201.167
booth-site[17036]: 2013/02/01_17:01:17 debug: connected to 192.168.201.167
booth-site[17036]: 2013/02/01_17:01:17 debug: sent catchup command to 192.168.201.167
booth-site[17036]: 2013/02/01_17:01:17 debug: attempting catchup from 192.168.201.168
booth-site[17036]: 2013/02/01_17:01:17 debug: connected to 192.168.201.168
booth-site[17036]: 2013/02/01_17:01:17 debug: sent cat

Re: [Pacemaker] booth is the state of "started" on pacemaker before booth write ticket info in cib.

2013-01-30 Thread Xia Li
Hi Yuichi

>>> On 1/31/2013 at 02:05 PM, in message
<510a7a3502156...@soto.provo.novell.com>, "Xia Li" wrote:
> Hi Yuichi 
>   
> I create two patches trying to fix this issue.  
>  
> In these patches, expand lockfile() to let it not only record the daemon  
> pid,  
> but also record daemon starting status(include "starting" and "started") . 
> At the same time, modify the logic of the controld RA, so that it can read 
 
Sorry, that is a typo; it should read "boothd RAs".

Regards,
 Xia Li


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] booth is the state of "started" on pacemaker before booth write ticket info in cib.

2013-01-30 Thread Xia Li
Hi Yuichi
 
I have created two patches that try to fix this issue.

These patches extend lockfile() so that it records not only the daemon pid
but also the daemon's startup status ("starting" or "started").
They also modify the logic of the controld RA so that it can read
that status and return a more precise result.
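
As a rough sketch of the lockfile idea described above (not the attached
patch itself): the daemon could overwrite the lockfile with a
"<pid> <state>" record each time its state changes. The enum names, the
numeric state values and the record format are assumptions made for
illustration only.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical state codes; the numeric values are illustrative only. */
enum daemon_state { DAEMON_STARTING = 0, DAEMON_STARTED = 1 };

/*
 * Overwrite the lockfile content with "<pid> <state>\n".
 * Returns 0 on success, -1 on error.
 */
static int write_daemon_state(int fd, pid_t pid, enum daemon_state state)
{
    char buf[32];
    int len;

    assert(fd >= 0);

    len = snprintf(buf, sizeof(buf), "%ld %d\n", (long)pid, (int)state);
    if (len < 0 || (size_t)len >= sizeof(buf))
        return -1;

    /* lseek() returns the new offset, which is 0 on success here,
     * so only a negative value signals an error. */
    if (lseek(fd, 0, SEEK_SET) < 0)
        return -1;

    /* Drop any longer, older record before writing the new one. */
    if (ftruncate(fd, 0) < 0)
        return -1;

    if (write(fd, buf, (size_t)len) != (ssize_t)len)
        return -1;

    return 0;
}
```

The daemon would call this once with DAEMON_STARTING right after taking
the lock, and again with DAEMON_STARTED once initialization has finished.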

Would you mind testing it to see if it works for you?

Regards,
 Xia Li

>>> On 1/23/2013 at 12:43 PM, Yuichi SEINO wrote:
> Hi Jiaju, 
>  
> I understood about the complete solution. 
> However because this issue causes the critical problem that multiple 
> resources start, Could you apply this request or simply revert a 
> commit to tentatively handle this issue until you are resolved at the 
> summer? I think that we are difficult to avoid  this issue by the 
> operation unlike booth deadlock etc. If booth does not start at the 
> same time, then booth can avoid deadlock. 
>  
> This issue caused following things. 
> * Multiple resources start. 
> * When booth causes deadlock, the resource timeout does not happen.
> Previous, we could watch timeout on crm_mon. Currently, timeout 
> happens after booth was daemon. 
>  
> Sincerely, 
> Yuichi 
>  
> 2013/1/21 Jiaju Zhang : 
> > Hi Yuichi, 
> > 
> > On Fri, 2013-01-18 at 17:02 +0900, Yuichi SEINO wrote: 
> >> Hi Jiaju, 
> >> 
> >> I try fixing this issue by reverting a commit. What do you think about it? 
> >> https://github.com/jjzhang/booth/pull/48 
> > 
> > Moving the whole setup stage before daemonizing seems not to be a sane
> > solution. setup_ticket() needs to get the latest ticket information by 
> > communicating with other nodes. Currently it was there and using TCP, 
> > but long term and sane solution would be to move it to the main poll(), 
> > asynchronously waiting for catch-up result. Before catching-up was 
> > ready, booth can still response, it can participate in Paxos as a 
> > non-voting member. 
> > 
> > To fix this issue, how do you think if we remove the stale ticket 
> > information in the CIB once booth was starting? We already have the APIs 
> > in pacemaker.c which can clear the ticket information in the CIB. This 
> > step is reasonable because the tickets at that moment is really stale 
> > data. 
> > 
> > About the implementation, I have not thought it in very detail but one 
> > idea that came into my mind is that maybe we can expand lockfile() (or 
> > some wrapper to lockfile()) to let it do more things, not only record 
> > the daemon pid, but also record daemon starting status, like "starting", 
> > "started", thus, the controld RA can read that status and return more 
> > precise result. 
> > 
> > I'll have Xia to look into this problem in more detail. 
> > 
> > Thanks, 
> > Jiaju 
> > 
> > 
>  
>  
>  
> -- 
> Yuichi SEINO 
> METROSYSTEMS CORPORATION 
> E-mail:seino.clust...@gmail.com 
>  
>  


Index: booth/src/main.c
===================================================================
--- booth.orig/src/main.c
+++ booth/src/main.c
@@ -63,6 +63,12 @@ static int client_size = 0;
 struct client *client = NULL;
 struct pollfd *pollfd = NULL;
 
+typedef enum 
+{
+   BOOTHD_STARTED=0,
+   BOOTHD_STARTING
+} BOOTH_DAEMON_STATE;
+
 int poll_timeout = -1;
 
 typedef enum {
@@ -450,7 +456,30 @@ static int setup_timer(void)
return timerlist_init();
 }
 
-static int loop(void)
+static int write_daemon_state(int fd, int state)
+{
+   char buf[16];
+   int rv=1;
+   
+   memset(buf, 0, sizeof(buf));
+   snprintf(buf, sizeof(buf), "%d %d", getpid(), state);
+
+   rv = lseek(fd, 0, SEEK_SET);
+   if (rv < 0) {  /* lseek() returns the new offset (0) on success */
+       log_error("lseek set offset to 0 error: %d: %s",
+           fd, strerror(errno));
+   }
+   
+   rv = write(fd, buf, strlen(buf));
+   if (rv <= 0) {
+   log_error("lockfile write error %d: %s",
+  fd, strerror(errno));
+   return rv;
+   }
+
+   return rv;
+}
+static int loop(int fd)
 {
void (*workfn) (int ci);
void (*deadfn) (int ci);
@@ -473,6 +502,18 @@ static int loop(void)
goto fail;
client_add(rv, process_listener, NULL);
 
+   rv = write_daemon_state(fd, BOOTHD_STARTED);
+   if (rv <= 0) {
+   log_error("lockfile write state %d error %s: %s",
+  BOOTHD_STARTED, cl.lockfile, strerror(errno));
+   goto fail;
+   }
+
+   if (cl.type == ARBITRATOR)
+   log_info("BOOTH arbitrator daemon started");
+   else if (cl.type == SITE)
+   log_info("BOOTH cluster site daemon started");
+
 while (1) {
 rv = poll(pollfd, client_maxi + 1, poll_timeout);
 if (rv == -1 && errno == EINTR)
@@ -681,9 +722,10 @@ static int do_revoke(void)
return do_command(BOOTHC_CMD_REVOKE);
 }
 
+
+
 static int lockfile(void)
 {
-   char buf[16];
struct flock lock;
int fd, r

Re: [Pacemaker] booth is the state of "started" on pacemaker before booth write ticket info in cib.

2013-01-22 Thread Yuichi SEINO
Hi Jiaju,

I understand the complete solution.
However, this issue causes a critical problem: multiple resources can
start. Could you apply this request, or simply revert the commit, as a
tentative fix until you resolve the issue in the summer? Unlike the
booth deadlock, this issue is difficult to avoid operationally. The
deadlock can be avoided by not starting booth on all nodes at the same
time.

This issue has the following consequences.
* Multiple resources start.
* When booth deadlocks, the resource timeout does not happen.
Previously, we could see the timeout in crm_mon. Currently, the timeout
happens after booth has become a daemon.

Sincerely,
Yuichi

2013/1/21 Jiaju Zhang :
> Hi Yuichi,
>
> On Fri, 2013-01-18 at 17:02 +0900, Yuichi SEINO wrote:
>> Hi Jiaju,
>>
>> I try fixing this issue by reverting a commit. What do you think about it?
>> https://github.com/jjzhang/booth/pull/48
>
> Moving the whole setup stage before daemonizing seems not to be a sane
> solution. setup_ticket() needs to get the latest ticket information by
> communicating with other nodes. Currently it was there and using TCP,
> but long term and sane solution would be to move it to the main poll(),
> asynchronously waiting for catch-up result. Before catching-up was
> ready, booth can still response, it can participate in Paxos as a
> non-voting member.
>
> To fix this issue, how do you think if we remove the stale ticket
> information in the CIB once booth was starting? We already have the APIs
> in pacemaker.c which can clear the ticket information in the CIB. This
> step is reasonable because the tickets at that moment is really stale
> data.
>
> About the implementation, I have not thought it in very detail but one
> idea that came into my mind is that maybe we can expand lockfile() (or
> some wrapper to lockfile()) to let it do more things, not only record
> the daemon pid, but also record daemon starting status, like "starting",
> "started", thus, the controld RA can read that status and return more
> precise result.
>
> I'll have Xia to look into this problem in more detail.
>
> Thanks,
> Jiaju
>
>



--
Yuichi SEINO
METROSYSTEMS CORPORATION
E-mail:seino.clust...@gmail.com



Re: [Pacemaker] booth is the state of "started" on pacemaker before booth write ticket info in cib.

2013-01-20 Thread Jiaju Zhang
Hi Yuichi,

On Fri, 2013-01-18 at 17:02 +0900, Yuichi SEINO wrote:
> Hi Jiaju,
> 
> I try fixing this issue by reverting a commit. What do you think about it?
> https://github.com/jjzhang/booth/pull/48

Moving the whole setup stage before daemonizing does not seem to be a
sane solution. setup_ticket() needs to get the latest ticket information
by communicating with the other nodes. Currently it does that in the
setup stage, over TCP, but the sane long-term solution would be to move
it into the main poll() loop and wait asynchronously for the catch-up
result. Until the catch-up has finished, booth can still respond: it can
participate in Paxos as a non-voting member.
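
A minimal sketch of what waiting for the catch-up result inside a poll()
loop could look like. This is not booth code: the struct, the function
name and the one-byte "reply" are hypothetical stand-ins, and the Paxos
handling is left out entirely.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

struct loop_state {
    int catchup_done;    /* set once the catch-up reply has been read */
    int requests_served; /* client requests handled so far */
};

/*
 * Run one iteration of a poll() loop over two descriptors: one carries
 * the (simulated) catch-up reply, the other carries client requests.
 * Client requests are served whether or not catch-up has finished.
 * Returns the poll() result.
 */
static int loop_once(int catchup_fd, int client_fd,
                     struct loop_state *st, int timeout_ms)
{
    struct pollfd fds[2];
    char byte;
    int rv;

    assert(st != NULL);

    /* A negative fd is ignored by poll(), so stop watching after catch-up. */
    fds[0].fd = st->catchup_done ? -1 : catchup_fd;
    fds[0].events = POLLIN;
    fds[0].revents = 0;
    fds[1].fd = client_fd;
    fds[1].events = POLLIN;
    fds[1].revents = 0;

    rv = poll(fds, 2, timeout_ms);
    if (rv <= 0)
        return rv;

    if ((fds[0].revents & POLLIN) && read(catchup_fd, &byte, 1) == 1)
        st->catchup_done = 1;

    if ((fds[1].revents & POLLIN) && read(client_fd, &byte, 1) == 1)
        st->requests_served++;

    return rv;
}
```

The point of the design is that the daemon never blocks on the catch-up
reply; it only changes behavior once catchup_done is set.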

To fix this issue, what do you think about removing the stale ticket
information from the CIB when booth is starting? We already have APIs
in pacemaker.c that can clear the ticket information in the CIB. This
step is reasonable because the tickets at that moment really are stale
data.

About the implementation, I have not thought it through in detail, but
one idea that came to mind is to expand lockfile() (or some wrapper
around lockfile()) so that it does more: it would record not only the
daemon pid but also the daemon's startup status, such as "starting" or
"started". The controld RA could then read that status and return a more
precise result.
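
To illustrate the reading side of that idea, here is a small sketch of
how a "<pid> <state>" record could be classified. The real RA is a shell
script; this C version only shows the decision logic. The record format,
the state encoding (0 = starting, 1 = started) and all names are
assumptions, not taken from booth.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical result codes for a status check; not taken from booth. */
enum probe_result { PROBE_NOT_RUNNING, PROBE_STARTING, PROBE_STARTED };

/*
 * Classify a lockfile record of the form "<pid> <state>", where
 * state 0 means "starting" and 1 means "started" (an assumed encoding).
 * Anything that does not parse is treated as "not running".
 */
static enum probe_result classify_lockfile_record(const char *record)
{
    long pid;
    int state;

    assert(record != NULL);

    if (sscanf(record, "%ld %d", &pid, &state) != 2 || pid <= 0)
        return PROBE_NOT_RUNNING;

    if (state == 0)
        return PROBE_STARTING;
    if (state == 1)
        return PROBE_STARTED;
    return PROBE_NOT_RUNNING;
}
```

A real check would also confirm that the recorded pid still belongs to a
live boothd process before trusting the state.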

I'll have Xia look into this problem in more detail.

Thanks,
Jiaju





Re: [Pacemaker] booth is the state of "started" on pacemaker before booth write ticket info in cib.

2013-01-18 Thread Yuichi SEINO
Hi Jiaju,

I tried to fix this issue by reverting a commit. What do you think about it?
https://github.com/jjzhang/booth/pull/48

Sincerely,
Yuichi

2013/1/11 Yuichi SEINO :
> Hi Jiaju,
>
> 2013/1/11 Jiaju Zhang :
>> Hi Yuichi,
>>
>> On Wed, 2013-01-09 at 19:39 +0900, Yuichi SEINO wrote:
>>> Hi Jiaju,
>>>
>>> I have a suggestion about this issue. Could you revert the following commit?
>>> https://github.com/jjzhang/booth/commit/4b00d46480f45a205f2550ff0760c8b372009f7f
>>>
>>> Because the suitable log can't be output, we took in this fix. Even If
>>> this commit is reverted,  the behavior of booth should not be affected.
>>> And, this issue is more important than this commit.
>>
>> The main reason we took that fix is to make the initialization to be
>> under the protection of lockfile(), so that there are no two booth
>> instances trying to start on the same node.
>
>  Originally, there was lockfile check after setup_listener(). So, I
> think that two booth instances can't start. However, booth usually
> stop at the instant of setup_transport() before lockfile check because
> booth uses the same IP. Therefore, the suitable log could not be
> output.
>  In my opinion,  Reverting this commit is simpler than making the new
> function to initialize cib.
>
> Sincerely,
> Yuichi
>

--
Yuichi SEINO
METROSYSTEMS CORPORATION
E-mail:seino.clust...@gmail.com



Re: [Pacemaker] booth is the state of "started" on pacemaker before booth write ticket info in cib.

2013-01-10 Thread Yuichi SEINO
Hi Jiaju,

2013/1/11 Jiaju Zhang :
> Hi Yuichi,
>
> On Wed, 2013-01-09 at 19:39 +0900, Yuichi SEINO wrote:
>> Hi Jiaju,
>>
>> I have a suggestion about this issue. Could you revert the following commit?
>> https://github.com/jjzhang/booth/commit/4b00d46480f45a205f2550ff0760c8b372009f7f
>>
>> Because the suitable log can't be output, we took in this fix. Even If
>> this commit is reverted,  the behavior of booth should not be affected.
>> And, this issue is more important than this commit.
>
> The main reason we took that fix is to make the initialization to be
> under the protection of lockfile(), so that there are no two booth
> instances trying to start on the same node.

 Originally, there was a lockfile check after setup_listener(), so I
think that two booth instances could not start anyway. However, booth
usually stopped at setup_transport(), before the lockfile check, because
both instances use the same IP. Therefore, a suitable log message could
not be output.
 In my opinion, reverting this commit is simpler than writing a new
function to initialize the CIB.

Sincerely,
Yuichi

--
Yuichi SEINO
METROSYSTEMS CORPORATION
E-mail:seino.clust...@gmail.com



Re: [Pacemaker] booth is the state of "started" on pacemaker before booth write ticket info in cib.

2013-01-10 Thread Jiaju Zhang
Hi Yuichi,

On Wed, 2013-01-09 at 19:39 +0900, Yuichi SEINO wrote:
> Hi Jiaju,
> 
> I have a suggestion about this issue. Could you revert the following commit?
> https://github.com/jjzhang/booth/commit/4b00d46480f45a205f2550ff0760c8b372009f7f
> 
> Because the suitable log can't be output, we took in this fix. Even If
> this commit is reverted,  the behavior of booth should not be affected.
> And, this issue is more important than this commit.

The main reason we took that fix was to put the initialization under
the protection of lockfile(), so that two booth instances cannot try to
start on the same node.
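
For readers unfamiliar with this kind of protection, here is a minimal
sketch of a lockfile guard built on fcntl() advisory locks. It is not
booth's lockfile(); the function names and file mode are hypothetical.
The second function only simulates a second instance by forking a child
that tries to take the same lock.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Try to take an exclusive advisory lock on the whole file at `path`.
 * Returns the open descriptor (keep it open to hold the lock), or -1 if
 * the file cannot be opened or another process already holds the lock.
 */
static int acquire_lockfile(const char *path)
{
    struct flock lock;
    int fd;

    assert(path != NULL);

    fd = open(path, O_CREAT | O_WRONLY, 0644);
    if (fd < 0)
        return -1;

    memset(&lock, 0, sizeof(lock));
    lock.l_type = F_WRLCK;
    lock.l_whence = SEEK_SET;
    lock.l_start = 0;
    lock.l_len = 0; /* 0 means "to end of file", i.e. the whole file */

    if (fcntl(fd, F_SETLK, &lock) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/*
 * Simulate a second instance: fork a child that tries to take the lock.
 * Returns 1 if the child got the lock, 0 if it was refused, -1 on error.
 */
static int second_instance_gets_lock(const char *path)
{
    int status;
    pid_t child = fork();

    if (child < 0)
        return -1;
    if (child == 0) {
        int fd = acquire_lockfile(path);
        _exit(fd >= 0 ? 1 : 0);
    }
    if (waitpid(child, &status, 0) != child || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

Because the lock belongs to the process, any initialization done after
acquire_lockfile() succeeds cannot run concurrently in a second instance
on the same node.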

Thanks,
Jiaju 





Re: [Pacemaker] booth is the state of "started" on pacemaker before booth write ticket info in cib.

2013-01-09 Thread Yuichi SEINO
Hi Jiaju,

I have a suggestion about this issue. Could you revert the following commit?
https://github.com/jjzhang/booth/commit/4b00d46480f45a205f2550ff0760c8b372009f7f

We took in this fix because a suitable log message could not be output.
Even if this commit is reverted, the behavior of booth should not be
affected. And this issue is more important than that commit.

Sincerely,
Yuichi

2012/12/21 Jiaju Zhang :
> On Fri, 2012-12-21 at 15:44 +0900, Yuichi SEINO wrote:
>> Hi Jiaju,
>>
>> 2012/12/18 Jiaju Zhang :
>> > On Mon, 2012-12-17 at 10:40 +0900, Yuichi SEINO wrote:
>> >> Hi Jiaju,
>> >>
>> >> >> >>
>> >> >> >> Perhaps,  this problem didn't happen before the following commit.
>> >> >> >> https://github.com/jjzhang/booth/commit/4b00d46480f45a205f2550ff0760c8b372009f7f
>> >> >> >
>> >> >> > Currently when all of the initialization (including loading the new
>> >> >> > ticket information) finished, booth should be regarded as ready. So 
>> >> >> > if
>> >> >> > you encounter some problem here, I guess we should improve the RA to
>> >> >> > better reflect the booth startup status, but not moving the
>> >> >> > initialization order, since it may introduce other regression as we 
>> >> >> > have
>> >> >> > encountered before;)
>> >> >> >
>> >> >>
>> >> >> I am not still sure which we should fix RA or booth.
>> >> >
>> >> > I suggest to add a new function to clear the old ticket info in the CIB,
>> >> > and call that function when booth just run but before daemonized. So,
>> >> > before booth_start in the RA returned, the stale data has been cleared.
>> >> > What do you think about this?;)
>> >> >
>> >>
>> >> In the case of using cib info, Can you implement it? For example,
>> >> booth is fail-over on local. Then, booth need to get the ticket in
>> >> cib. If there is no this problem, I can agree to it.
>> >
>> > OK, I'll implement it;)
>> >
>> > Thanks,
>> > Jiaju
>> >
>> >
>>
>> OK, thanks.
>> Are you going to implement it in the next development ?
>
> Sure, I will;)
>
> Thanks,
> Jiaju
>



--
Yuichi SEINO
METROSYSTEMS CORPORATION
E-mail:seino.clust...@gmail.com



Re: [Pacemaker] booth is the state of "started" on pacemaker before booth write ticket info in cib.

2012-12-21 Thread Jiaju Zhang
On Fri, 2012-12-21 at 15:44 +0900, Yuichi SEINO wrote:
> Hi Jiaju,
> 
> 2012/12/18 Jiaju Zhang :
> > On Mon, 2012-12-17 at 10:40 +0900, Yuichi SEINO wrote:
> >> Hi Jiaju,
> >>
> >> >> >>
> >> >> >> Perhaps,  this problem didn't happen before the following commit.
> >> >> >> https://github.com/jjzhang/booth/commit/4b00d46480f45a205f2550ff0760c8b372009f7f
> >> >> >
> >> >> > Currently when all of the initialization (including loading the new
> >> >> > ticket information) finished, booth should be regarded as ready. So if
> >> >> > you encounter some problem here, I guess we should improve the RA to
> >> >> > better reflect the booth startup status, but not moving the
> >> >> > initialization order, since it may introduce other regression as we 
> >> >> > have
> >> >> > encountered before;)
> >> >> >
> >> >>
> >> >> I am not still sure which we should fix RA or booth.
> >> >
> >> > I suggest to add a new function to clear the old ticket info in the CIB,
> >> > and call that function when booth just run but before daemonized. So,
> >> > before booth_start in the RA returned, the stale data has been cleared.
> >> > What do you think about this?;)
> >> >
> >>
> >> In the case of using cib info, Can you implement it? For example,
> >> booth is fail-over on local. Then, booth need to get the ticket in
> >> cib. If there is no this problem, I can agree to it.
> >
> > OK, I'll implement it;)
> >
> > Thanks,
> > Jiaju
> >
> >
> 
> OK, thanks.
> Are you going to implement it in the next development ?

Sure, I will;)

Thanks,
Jiaju




Re: [Pacemaker] booth is the state of "started" on pacemaker before booth write ticket info in cib.

2012-12-20 Thread Yuichi SEINO
Hi Jiaju,

2012/12/18 Jiaju Zhang :
> On Mon, 2012-12-17 at 10:40 +0900, Yuichi SEINO wrote:
>> Hi Jiaju,
>>
>> >> >>
>> >> >> Perhaps,  this problem didn't happen before the following commit.
>> >> >> https://github.com/jjzhang/booth/commit/4b00d46480f45a205f2550ff0760c8b372009f7f
>> >> >
>> >> > Currently when all of the initialization (including loading the new
>> >> > ticket information) finished, booth should be regarded as ready. So if
>> >> > you encounter some problem here, I guess we should improve the RA to
>> >> > better reflect the booth startup status, but not moving the
>> >> > initialization order, since it may introduce other regression as we have
>> >> > encountered before;)
>> >> >
>> >>
>> >> I am not still sure which we should fix RA or booth.
>> >
>> > I suggest to add a new function to clear the old ticket info in the CIB,
>> > and call that function when booth just run but before daemonized. So,
>> > before booth_start in the RA returned, the stale data has been cleared.
>> > What do you think about this?;)
>> >
>>
>> In the case of using cib info, Can you implement it? For example,
>> booth is fail-over on local. Then, booth need to get the ticket in
>> cib. If there is no this problem, I can agree to it.
>
> OK, I'll implement it;)
>
> Thanks,
> Jiaju
>
>

OK, thanks.
Are you going to implement it in the next development cycle?

Sincerely,
Yuichi

--
Yuichi SEINO
METROSYSTEMS CORPORATION
E-mail:seino.clust...@gmail.com



Re: [Pacemaker] booth is the state of "started" on pacemaker before booth write ticket info in cib.

2012-12-17 Thread Jiaju Zhang
On Mon, 2012-12-17 at 10:40 +0900, Yuichi SEINO wrote:
> Hi Jiaju,
> 
> >> >>
> >> >> Perhaps,  this problem didn't happen before the following commit.
> >> >> https://github.com/jjzhang/booth/commit/4b00d46480f45a205f2550ff0760c8b372009f7f
> >> >
> >> > Currently when all of the initialization (including loading the new
> >> > ticket information) finished, booth should be regarded as ready. So if
> >> > you encounter some problem here, I guess we should improve the RA to
> >> > better reflect the booth startup status, but not moving the
> >> > initialization order, since it may introduce other regression as we have
> >> > encountered before;)
> >> >
> >>
> >> I am not still sure which we should fix RA or booth.
> >
> > I suggest to add a new function to clear the old ticket info in the CIB,
> > and call that function when booth just run but before daemonized. So,
> > before booth_start in the RA returned, the stale data has been cleared.
> > What do you think about this?;)
> >
> 
> In the case of using cib info, Can you implement it? For example,
> booth is fail-over on local. Then, booth need to get the ticket in
> cib. If there is no this problem, I can agree to it.

OK, I'll implement it;)

Thanks,
Jiaju





Re: [Pacemaker] booth is the state of "started" on pacemaker before booth write ticket info in cib.

2012-12-16 Thread Yuichi SEINO
Hi Jiaju,

>> >>
>> >> Perhaps,  this problem didn't happen before the following commit.
>> >> https://github.com/jjzhang/booth/commit/4b00d46480f45a205f2550ff0760c8b372009f7f
>> >
>> > Currently when all of the initialization (including loading the new
>> > ticket information) finished, booth should be regarded as ready. So if
>> > you encounter some problem here, I guess we should improve the RA to
>> > better reflect the booth startup status, but not moving the
>> > initialization order, since it may introduce other regression as we have
>> > encountered before;)
>> >
>>
>> I am not still sure which we should fix RA or booth.
>
> I suggest to add a new function to clear the old ticket info in the CIB,
> and call that function when booth just run but before daemonized. So,
> before booth_start in the RA returned, the stale data has been cleared.
> What do you think about this?;)
>

Can you implement it so that the case where booth uses the CIB
information still works? For example, when booth fails over locally, it
needs to get the ticket from the CIB. If that case is not a problem, I
agree.


Sincerely,
Yuichi


--
Yuichi SEINO
METROSYSTEMS CORPORATION
E-mail:seino.clust...@gmail.com



Re: [Pacemaker] booth is the state of "started" on pacemaker before booth write ticket info in cib.

2012-12-14 Thread Jiaju Zhang
On Thu, 2012-12-13 at 12:01 +0900, Yuichi SEINO wrote:
> Hi Jiaju,
> 
> 2012/12/12 Jiaju Zhang :
> > On Tue, 2012-12-11 at 20:15 +0900, Yuichi SEINO wrote:
> >> Hi Jiaju,
> >>
> >> Currently, booth is the state of "started" on pacemaker before booth
> >> writes ticket information in cib. So, If the old ticket information is
> >> included in cib, a resource relating to the ticket may start before
> >> booth resets the ticket. I think that this problem is when to be
> >> daemon in booth.
> >
> > The resource should not be started before the booth daemon is ready. We
> > suggest to configure an ordering constraint for the booth daemon and the
> > managed resources by that ticket. That being said, if the ticket is in
> > the CIB but booth daemon has not been started, the resources would not
> > be started.
> >
> 
> booth RA finishes booth_start when booth changed the daemon from the
> foreground process.(To be exact, "sleep 1" is included). The current
> booth change daemon before catchup. On the other hand, the previous
> booth change daemon after catchup. catchup write a ticket in cib.
>  Even if an ordering constraint is set, as shown below, the related
> resource can start when booth changes the state of "started" on
> pacemaker. At this point, the current booth still may not finish
> catchup.

Oh, I think I understand your problem now, thanks!

> 
> crm_mon paste.
> ...
> booth   (ocf::pacemaker:booth-site):    Started multi-site-a-1
> ...
> 
> >>
> >> Perhaps,  this problem didn't happen before the following commit.
> >> https://github.com/jjzhang/booth/commit/4b00d46480f45a205f2550ff0760c8b372009f7f
> >
> > Currently when all of the initialization (including loading the new
> > ticket information) finished, booth should be regarded as ready. So if
> > you encounter some problem here, I guess we should improve the RA to
> > better reflect the booth startup status, but not moving the
> > initialization order, since it may introduce other regression as we have
> > encountered before;)
> >
> 
> I am not still sure which we should fix RA or booth.

I suggest adding a new function to clear the old ticket information in
the CIB, and calling it when booth has just started running, before it
is daemonized. That way, the stale data has already been cleared by the
time booth_start in the RA returns. What do you think about this? ;)
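
As a hedged sketch of one way such a function could prepare its work:
the booth logs earlier in this thread show booth running
'crm_ticket -t ticketA -r --force', so a clearing step might format one
such revoke command per configured ticket before daemonizing. Whether a
revoke is the right way to clear stale data is an assumption here, and
the function name is hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Format the crm_ticket revoke command for one ticket into buf.
 * The ticket name is assumed to come from trusted configuration.
 * Returns 0 on success, -1 if the name is empty or the buffer is too small.
 */
static int format_revoke_command(char *buf, size_t size, const char *ticket)
{
    int len;

    assert(buf != NULL && ticket != NULL);

    if (strlen(ticket) == 0)
        return -1;

    len = snprintf(buf, size, "crm_ticket -t %s -r --force", ticket);
    if (len < 0 || (size_t)len >= size)
        return -1;

    return 0;
}
```

The daemon would run each formatted command before it forks into the
background, so that booth_start in the RA cannot return while stale
ticket data is still in the CIB.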

Thanks,
Jiaju

> 
> > Thanks,
> > Jiaju
> >
> >>
> >> Sincerely,
> >> Yuichi
> >>
> 
> 
> 
> 
> --
> Yuichi SEINO
> METROSYSTEMS CORPORATION
> E-mail:seino.clust...@gmail.com





Re: [Pacemaker] booth is the state of "started" on pacemaker before booth write ticket info in cib.

2012-12-12 Thread Yuichi SEINO
Hi Jiaju,

2012/12/12 Jiaju Zhang :
> On Tue, 2012-12-11 at 20:15 +0900, Yuichi SEINO wrote:
>> Hi Jiaju,
>>
>> Currently, booth is the state of "started" on pacemaker before booth
>> writes ticket information in cib. So, If the old ticket information is
>> included in cib, a resource relating to the ticket may start before
>> booth resets the ticket. I think that this problem is when to be
>> daemon in booth.
>
> The resource should not be started before the booth daemon is ready. We
> suggest to configure an ordering constraint for the booth daemon and the
> managed resources by that ticket. That being said, if the ticket is in
> the CIB but booth daemon has not been started, the resources would not
> be started.
>

The booth RA finishes booth_start once booth has changed from a
foreground process into a daemon (to be exact, a "sleep 1" is included).
The current booth daemonizes before catchup, whereas the previous booth
daemonized after catchup. catchup writes the ticket to the CIB.
 Even if an ordering constraint is set, the related resource can start
as soon as booth enters the "started" state in pacemaker, as shown
below. At this point, the current booth may not have finished catchup
yet.

crm_mon paste.
...
booth   (ocf::pacemaker:booth-site):    Started multi-site-a-1
...

>>
>> Perhaps,  this problem didn't happen before the following commit.
>> https://github.com/jjzhang/booth/commit/4b00d46480f45a205f2550ff0760c8b372009f7f
>
> Currently when all of the initialization (including loading the new
> ticket information) finished, booth should be regarded as ready. So if
> you encounter some problem here, I guess we should improve the RA to
> better reflect the booth startup status, but not moving the
> initialization order, since it may introduce other regression as we have
> encountered before;)
>

I am still not sure which we should fix, the RA or booth.

> Thanks,
> Jiaju
>
>>
>> Sincerely,
>> Yuichi
>>




--
Yuichi SEINO
METROSYSTEMS CORPORATION
E-mail:seino.clust...@gmail.com



Re: [Pacemaker] booth is the state of "started" on pacemaker before booth write ticket info in cib.

2012-12-12 Thread Jiaju Zhang
On Tue, 2012-12-11 at 20:15 +0900, Yuichi SEINO wrote:
> Hi Jiaju,
> 
> Currently, booth is the state of "started" on pacemaker before booth
> writes ticket information in cib. So, If the old ticket information is
> included in cib, a resource relating to the ticket may start before
> booth resets the ticket. I think that this problem is when to be
> daemon in booth.

The resource should not be started before the booth daemon is ready. We
suggest configuring an ordering constraint between the booth daemon and
the resources managed by that ticket. That way, if the ticket is in
the CIB but the booth daemon has not been started, the resources will
not be started.

> 
> Perhaps,  this problem didn't happen before the following commit.
> https://github.com/jjzhang/booth/commit/4b00d46480f45a205f2550ff0760c8b372009f7f

Currently, booth should be regarded as ready once all of the
initialization (including loading the new ticket information) has
finished. So if you encounter a problem here, I think we should improve
the RA to better reflect the booth startup status rather than change the
initialization order, since that may introduce other regressions like
the ones we have encountered before ;)

Thanks,
Jiaju

> 
> Sincerely,
> Yuichi
> 
> --
> Yuichi SEINO
> METROSYSTEMS CORPORATION
> E-mail:seino.clust...@gmail.com





[Pacemaker] booth is the state of "started" on pacemaker before booth write ticket info in cib.

2012-12-11 Thread Yuichi SEINO
Hi Jiaju,

Currently, booth enters the "started" state in pacemaker before booth
writes the ticket information to the CIB. So, if old ticket information
is still in the CIB, a resource related to the ticket may start before
booth resets the ticket. I think that the problem lies in when booth
becomes a daemon.

Perhaps this problem did not happen before the following commit.
https://github.com/jjzhang/booth/commit/4b00d46480f45a205f2550ff0760c8b372009f7f

Sincerely,
Yuichi

--
Yuichi SEINO
METROSYSTEMS CORPORATION
E-mail:seino.clust...@gmail.com
