Re: [Pacemaker] booth is the state of "started" on pacemaker before booth write ticket info in cib.
On Thu, 2013-06-13 at 19:06 +0900, Yuichi SEINO wrote:
> Hi Jiaju,
>
> Could you merge the RA and Xia's patch?

Merged.

> If a problem occurs, I would like to fix it after this patch is
> merged.

Thanks!

Regards,
Jiaju

> Sincerely,
> Yuichi
>
> 2013/4/10 Yuichi SEINO :
> > Hi,
> >
> > I still have not received a reply from anyone. I would like to get
> > this issue fixed soon.
> >
> > Sincerely,
> > Yuichi
> >
> > 2013/3/19 Yuichi SEINO :
> >> Hi Xia and Jiaju,
> >>
> >> Because the RA may read an unintended file, I think it is better
> >> to check for the existence of the lockfile in the RA. I gave the
> >> details in a previous mail.
> >>
> >> What do you think about this?
> >> If you agree, could you fix the RA?
> >>
> >> Sincerely,
> >> Yuichi
> >>
> >> 2013/2/25 Yuichi SEINO :
> >>> Hi Jiaju,
> >>>
> >>> 2013/2/22 Jiaju Zhang :
> >>>> On Wed, 2013-02-20 at 16:26 +0900, Yuichi SEINO wrote:
> >>>>> Hi Jiaju,
> >>>>>
> >>>>> I am testing this patch.
> >>>>> When the lockfile has been removed, the stop action of the RA
> >>>>> does not behave as intended.
> >>>>
> >>>> I'm just curious how the lockfile was removed. Basically the
> >>>> existence of the lockfile shows that one boothd is started, and
> >>>> prevents it from being wrongly started again. So the lockfile
> >>>> should not be removed intentionally by the admin.
> >>>
> >>> I removed it by running "mv" on the pid file.
> >>>
> >>> Another case leads to the same situation. When
> >>> "boothd -l other.pid" is already running on the node, the
> >>> lockfile exists in a different place, so $lockfile does not exist
> >>> during the start and stop of the RA.
> >>>
> >>> I think it is better to check for the existence of the lockfile
> >>> or of $pidnum, because /proc/cmdline may happen to satisfy this
> >>> "if". For example, the anything RA includes a check for an empty
> >>> pid.
> >>>
> >>> anything_status() {
> >>>     if test -f "$pidfile"
> >>>     then
> >>>         if pid=`getpid $pidfile` && [ "$pid" ] && kill -s 0 $pid
> >>>         then
> >>>             return $OCF_SUCCESS
> >>>         else
> >>>             # pidfile w/o process means the process died
> >>>             return $OCF_ERR_GENERIC
> >>>         fi
> >>>     else
> >>>         return $OCF_NOT_RUNNING
> >>>     fi
> >>> }
> >>>
> >>>> Thanks,
> >>>> Jiaju
> >>>
> >>> Sincerely,
> >>> Yuichi
> >>
> >
> > --
> > Yuichi SEINO
> > METROSYSTEMS CORPORATION
> > E-mail:seino.clust...@gmail.com

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
Re: [Pacemaker] booth is the state of "started" on pacemaker before booth write ticket info in cib.
Hi Jiaju,

Could you merge the RA and Xia's patch?

If a problem occurs, I would like to fix it after this patch is
merged.

Sincerely,
Yuichi

2013/4/10 Yuichi SEINO :
> Hi,
>
> I still have not received a reply from anyone. I would like to get
> this issue fixed soon.
>
> Sincerely,
> Yuichi
>
> 2013/3/19 Yuichi SEINO :
>> Hi Xia and Jiaju,
>>
>> Because the RA may read an unintended file, I think it is better
>> to check for the existence of the lockfile in the RA. I gave the
>> details in a previous mail.
>>
>> What do you think about this?
>> If you agree, could you fix the RA?
>>
>> Sincerely,
>> Yuichi
>>
>> 2013/2/25 Yuichi SEINO :
>>> Hi Jiaju,
>>>
>>> 2013/2/22 Jiaju Zhang :
>>>> On Wed, 2013-02-20 at 16:26 +0900, Yuichi SEINO wrote:
>>>>> Hi Jiaju,
>>>>>
>>>>> I am testing this patch.
>>>>> When the lockfile has been removed, the stop action of the RA
>>>>> does not behave as intended.
>>>>
>>>> I'm just curious how the lockfile was removed. Basically the
>>>> existence of the lockfile shows that one boothd is started, and
>>>> prevents it from being wrongly started again. So the lockfile
>>>> should not be removed intentionally by the admin.
>>>
>>> I removed it by running "mv" on the pid file.
>>>
>>> Another case leads to the same situation. When
>>> "boothd -l other.pid" is already running on the node, the
>>> lockfile exists in a different place, so $lockfile does not exist
>>> during the start and stop of the RA.
>>>
>>> I think it is better to check for the existence of the lockfile
>>> or of $pidnum, because /proc/cmdline may happen to satisfy this
>>> "if". For example, the anything RA includes a check for an empty
>>> pid.
>>>
>>> anything_status() {
>>>     if test -f "$pidfile"
>>>     then
>>>         if pid=`getpid $pidfile` && [ "$pid" ] && kill -s 0 $pid
>>>         then
>>>             return $OCF_SUCCESS
>>>         else
>>>             # pidfile w/o process means the process died
>>>             return $OCF_ERR_GENERIC
>>>         fi
>>>     else
>>>         return $OCF_NOT_RUNNING
>>>     fi
>>> }
>>>
>>>> Thanks,
>>>> Jiaju
>>>
>>> Sincerely,
>>> Yuichi
>>
>
> --
> Yuichi SEINO
> METROSYSTEMS CORPORATION
> E-mail:seino.clust...@gmail.com

--
Yuichi SEINO
METROSYSTEMS CORPORATION
E-mail:seino.clust...@gmail.com
Re: [Pacemaker] booth is the state of "started" on pacemaker before booth write ticket info in cib.
Hi,

I still have not received a reply from anyone. I would like to get
this issue fixed soon.

Sincerely,
Yuichi

2013/3/19 Yuichi SEINO :
> Hi Xia and Jiaju,
>
> Because the RA may read an unintended file, I think it is better
> to check for the existence of the lockfile in the RA. I gave the
> details in a previous mail.
>
> What do you think about this?
> If you agree, could you fix the RA?
>
> Sincerely,
> Yuichi
>
> 2013/2/25 Yuichi SEINO :
>> Hi Jiaju,
>>
>> 2013/2/22 Jiaju Zhang :
>>> On Wed, 2013-02-20 at 16:26 +0900, Yuichi SEINO wrote:
>>>> Hi Jiaju,
>>>>
>>>> I am testing this patch.
>>>> When the lockfile has been removed, the stop action of the RA
>>>> does not behave as intended.
>>>
>>> I'm just curious how the lockfile was removed. Basically the
>>> existence of the lockfile shows that one boothd is started, and
>>> prevents it from being wrongly started again. So the lockfile
>>> should not be removed intentionally by the admin.
>>
>> I removed it by running "mv" on the pid file.
>>
>> Another case leads to the same situation. When
>> "boothd -l other.pid" is already running on the node, the
>> lockfile exists in a different place, so $lockfile does not exist
>> during the start and stop of the RA.
>>
>> I think it is better to check for the existence of the lockfile
>> or of $pidnum, because /proc/cmdline may happen to satisfy this
>> "if". For example, the anything RA includes a check for an empty
>> pid.
>>
>> anything_status() {
>>     if test -f "$pidfile"
>>     then
>>         if pid=`getpid $pidfile` && [ "$pid" ] && kill -s 0 $pid
>>         then
>>             return $OCF_SUCCESS
>>         else
>>             # pidfile w/o process means the process died
>>             return $OCF_ERR_GENERIC
>>         fi
>>     else
>>         return $OCF_NOT_RUNNING
>>     fi
>> }
>>
>>> Thanks,
>>> Jiaju
>>
>> Sincerely,
>> Yuichi
>

--
Yuichi SEINO
METROSYSTEMS CORPORATION
E-mail:seino.clust...@gmail.com
Re: [Pacemaker] booth is the state of "started" on pacemaker before booth write ticket info in cib.
Hi Xia and Jiaju,

Because the RA may read an unintended file, I think it is better to
check for the existence of the lockfile in the RA. I gave the details
in a previous mail.

What do you think about this?
If you agree, could you fix the RA?

Sincerely,
Yuichi

2013/2/25 Yuichi SEINO :
> Hi Jiaju,
>
> 2013/2/22 Jiaju Zhang :
>> On Wed, 2013-02-20 at 16:26 +0900, Yuichi SEINO wrote:
>>> Hi Jiaju,
>>>
>>> I am testing this patch.
>>> When the lockfile has been removed, the stop action of the RA
>>> does not behave as intended.
>>
>> I'm just curious how the lockfile was removed. Basically the
>> existence of the lockfile shows that one boothd is started, and
>> prevents it from being wrongly started again. So the lockfile
>> should not be removed intentionally by the admin.
>
> I removed it by running "mv" on the pid file.
>
> Another case leads to the same situation. When
> "boothd -l other.pid" is already running on the node, the
> lockfile exists in a different place, so $lockfile does not exist
> during the start and stop of the RA.
>
> I think it is better to check for the existence of the lockfile
> or of $pidnum, because /proc/cmdline may happen to satisfy this
> "if". For example, the anything RA includes a check for an empty
> pid.
>
> anything_status() {
>     if test -f "$pidfile"
>     then
>         if pid=`getpid $pidfile` && [ "$pid" ] && kill -s 0 $pid
>         then
>             return $OCF_SUCCESS
>         else
>             # pidfile w/o process means the process died
>             return $OCF_ERR_GENERIC
>         fi
>     else
>         return $OCF_NOT_RUNNING
>     fi
> }
>
>> Thanks,
>> Jiaju
>
> Sincerely,
> Yuichi

--
Yuichi SEINO
METROSYSTEMS CORPORATION
E-mail:seino.clust...@gmail.com
Re: [Pacemaker] booth is the state of "started" on pacemaker before booth write ticket info in cib.
Hi Jiaju,

2013/2/22 Jiaju Zhang :
> On Wed, 2013-02-20 at 16:26 +0900, Yuichi SEINO wrote:
>> Hi Jiaju,
>>
>> I am testing this patch.
>> When the lockfile has been removed, the stop action of the RA
>> does not behave as intended.
>
> I'm just curious how the lockfile was removed. Basically the
> existence of the lockfile shows that one boothd is started, and
> prevents it from being wrongly started again. So the lockfile
> should not be removed intentionally by the admin.

I removed it by running "mv" on the pid file.

Another case leads to the same situation. When "boothd -l other.pid"
is already running on the node, the lockfile exists in a different
place, so $lockfile does not exist during the start and stop of the RA.

I think it is better to check for the existence of the lockfile or of
$pidnum, because /proc/cmdline may happen to satisfy this "if". For
example, the anything RA includes a check for an empty pid.

anything_status() {
    if test -f "$pidfile"
    then
        if pid=`getpid $pidfile` && [ "$pid" ] && kill -s 0 $pid
        then
            return $OCF_SUCCESS
        else
            # pidfile w/o process means the process died
            return $OCF_ERR_GENERIC
        fi
    else
        return $OCF_NOT_RUNNING
    fi
}

> Thanks,
> Jiaju

Sincerely,
Yuichi

--
Yuichi SEINO
METROSYSTEMS CORPORATION
E-mail:seino.clust...@gmail.com
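The guard used by the anything RA above can also be sketched in C, the language of the booth daemon (the RA itself is a shell script, so this is only an illustration of the idea, not booth code). The helper names `pid_from_text` and `pid_is_alive` are hypothetical. The point is that an empty or non-numeric pid is rejected before it is used, so a path such as /proc//cmdline is never built from it.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <limits.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>

/* Hypothetical helper: parse the leading pid from lockfile text.
 * Returns the pid, or -1 if the text is empty or does not start
 * with a positive number. */
pid_t pid_from_text(const char *text)
{
    char *end;
    long val;

    if (text == NULL || *text == '\0')
        return -1;

    errno = 0;
    val = strtol(text, &end, 10);
    if (errno != 0 || end == text || val <= 0 || val > INT_MAX)
        return -1;

    return (pid_t)val;
}

/* Hypothetical helper: the C counterpart of "kill -s 0 $pid".
 * Returns 1 if the process exists, 0 otherwise. */
int pid_is_alive(pid_t pid)
{
    if (pid <= 0)
        return 0;
    /* EPERM means the process exists but belongs to another user. */
    return kill(pid, 0) == 0 || errno == EPERM;
}
```

With these two helpers, a status check would first call `pid_from_text()` on the lockfile content and only then look at the process, mirroring the `[ "$pid" ] && kill -s 0 $pid` test in the shell function.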
Re: [Pacemaker] booth is the state of "started" on pacemaker before booth write ticket info in cib.
On Wed, 2013-02-20 at 16:26 +0900, Yuichi SEINO wrote:
> Hi Jiaju,
>
> I am testing this patch.
> When the lockfile has been removed, the stop action of the RA
> does not behave as intended.

I'm just curious how the lockfile was removed. Basically the existence
of the lockfile shows that one boothd is started, and prevents it from
being wrongly started again. So the lockfile should not be removed
intentionally by the admin.

Thanks,
Jiaju

> Currently, if "pidnum" is empty, the RA runs "cat /proc//cmdline",
> and /proc/cmdline is the boot parameter file. So I added a check
> for the existence of the lockfile.
>
> diff --git a/script/ocf/booth-site b/script/ocf/booth-site
> index 2575643..7c775dc 100755
> --- a/script/ocf/booth-site
> +++ b/script/ocf/booth-site
> @@ -116,6 +116,10 @@ booth_check_daemon_state(){
>
>      case $rc in
>      $OCF_SUCCESS)
> +        if [ ! -f $lockfile ]; then
> +            ocf_log err "lockfile not exists.(${lockfile})"
> +            return $BOOTH_DAEMON_EXIST;
> +        fi
>          pidnum=$(cat $lockfile |awk '{print $1}')
>          daemonstate=$(cat $lockfile |awk '{print $2}')
>          if cat /proc/$pidnum/cmdline |grep $OCF_RESKEY_type >/dev/null 2>&1; then
>
> When this happened, I got the following from
> "crm resource trace booth":
>
> + 21:09:48: 223: '[' '!' ']'
> + 21:09:48: 224: OCF_RESKEY_daemon=boothd
> + 21:09:48: 227: '[' '!' ']'
> + 21:09:48: 228: OCF_RESKEY_type=site
> + 21:09:48: 231: case $__OCF_ACTION in
> + 21:09:48: 236: booth_stop
> + 21:09:48: booth_stop:166: booth_check_daemon_state
> + 21:09:48: booth_check_daemon_state:115: booth_check_daemon_exist
> + 21:09:48: booth_check_daemon_exist:105: killall -0 boothd
> + 21:09:48: booth_check_daemon_exist:105: rc=0
> + 21:09:48: booth_check_daemon_exist:107: case $rc in
> + 21:09:48: booth_check_daemon_exist:108: return 0
> + 21:09:48: booth_check_daemon_state:115: rc=0
> + 21:09:48: booth_check_daemon_state:117: case $rc in
> + 21:09:48: booth_check_daemon_state:117: case $rc in
> ++ 21:09:48: booth_check_daemon_state:119: awk '{print $1}'
> + 21:09:48: booth_check_daemon_state:117: case $rc in
> + 21:09:48: booth_check_daemon_state:117: case $rc in
> ++ 21:09:48: booth_check_daemon_state:119: cat /var/run/booth.pid
> + 21:09:48: booth_check_daemon_state:117: case $rc in
> + 21:09:48: booth_check_daemon_state:117: case $rc in
> + 21:09:48: booth_check_daemon_state:119: pidnum=
> ++ 21:09:48: booth_check_daemon_state:120: awk '{print $2}'
> ++ 21:09:48: booth_check_daemon_state:120: cat /var/run/booth.pid
> + 21:09:48: booth_check_daemon_state:120: daemonstate=
> + 21:09:48: booth_check_daemon_state:121: grep site
> + 21:09:48: booth_check_daemon_state:121: cat /proc//cmdline
> + 21:09:48: booth_check_daemon_state:122: case $daemonstate in
> + 21:09:48: booth_check_daemon_state:125: return 4
> + 21:09:48: booth_stop:166: rc=4
> + 21:09:48: booth_stop:168: case $rc in
> + 21:09:48: booth_stop:173: return 1
> + 21:09:48: 246: rc=1
> + 21:09:48: 248: exit 1
>
> 2013/2/19 Jiaju Zhang :
> > Hi Yuichi,
> >
> > On Tue, 2013-02-19 at 10:27 +0900, Yuichi SEINO wrote:
> >> Hi Xia,
> >>
> >> I have a question about the following part. The write man page
> >> explains that "errno" is set appropriately if write returns -1.
> >> So, if "rv" is equal to 0, strerror(errno) may not output the
> >> correct message. What do you think about it?
> >
> > Good catch. I think we should differentiate the cases of rv == -1
> > and rv == 0, maybe by setting errno to ENOSPC when rv == 0.
> >
> > BTW, apart from that, does this patch fix your original issue?
> >
> > Thanks,
> > Jiaju
>
> --
> Yuichi SEINO
> METROSYSTEMS CORPORATION
> E-mail:seino.clust...@gmail.com
Re: [Pacemaker] booth is the state of "started" on pacemaker before booth write ticket info in cib.
Hi Jiaju,

I am testing this patch.
When the lockfile has been removed, the stop action of the RA does not
behave as intended.

Currently, if "pidnum" is empty, the RA runs "cat /proc//cmdline", and
/proc/cmdline is the boot parameter file. So I added a check for the
existence of the lockfile.

diff --git a/script/ocf/booth-site b/script/ocf/booth-site
index 2575643..7c775dc 100755
--- a/script/ocf/booth-site
+++ b/script/ocf/booth-site
@@ -116,6 +116,10 @@ booth_check_daemon_state(){

     case $rc in
     $OCF_SUCCESS)
+        if [ ! -f $lockfile ]; then
+            ocf_log err "lockfile not exists.(${lockfile})"
+            return $BOOTH_DAEMON_EXIST;
+        fi
         pidnum=$(cat $lockfile |awk '{print $1}')
         daemonstate=$(cat $lockfile |awk '{print $2}')
         if cat /proc/$pidnum/cmdline |grep $OCF_RESKEY_type >/dev/null 2>&1; then

When this happened, I got the following from "crm resource trace booth":

+ 21:09:48: 223: '[' '!' ']'
+ 21:09:48: 224: OCF_RESKEY_daemon=boothd
+ 21:09:48: 227: '[' '!' ']'
+ 21:09:48: 228: OCF_RESKEY_type=site
+ 21:09:48: 231: case $__OCF_ACTION in
+ 21:09:48: 236: booth_stop
+ 21:09:48: booth_stop:166: booth_check_daemon_state
+ 21:09:48: booth_check_daemon_state:115: booth_check_daemon_exist
+ 21:09:48: booth_check_daemon_exist:105: killall -0 boothd
+ 21:09:48: booth_check_daemon_exist:105: rc=0
+ 21:09:48: booth_check_daemon_exist:107: case $rc in
+ 21:09:48: booth_check_daemon_exist:108: return 0
+ 21:09:48: booth_check_daemon_state:115: rc=0
+ 21:09:48: booth_check_daemon_state:117: case $rc in
+ 21:09:48: booth_check_daemon_state:117: case $rc in
++ 21:09:48: booth_check_daemon_state:119: awk '{print $1}'
+ 21:09:48: booth_check_daemon_state:117: case $rc in
+ 21:09:48: booth_check_daemon_state:117: case $rc in
++ 21:09:48: booth_check_daemon_state:119: cat /var/run/booth.pid
+ 21:09:48: booth_check_daemon_state:117: case $rc in
+ 21:09:48: booth_check_daemon_state:117: case $rc in
+ 21:09:48: booth_check_daemon_state:119: pidnum=
++ 21:09:48: booth_check_daemon_state:120: awk '{print $2}'
++ 21:09:48: booth_check_daemon_state:120: cat /var/run/booth.pid
+ 21:09:48: booth_check_daemon_state:120: daemonstate=
+ 21:09:48: booth_check_daemon_state:121: grep site
+ 21:09:48: booth_check_daemon_state:121: cat /proc//cmdline
+ 21:09:48: booth_check_daemon_state:122: case $daemonstate in
+ 21:09:48: booth_check_daemon_state:125: return 4
+ 21:09:48: booth_stop:166: rc=4
+ 21:09:48: booth_stop:168: case $rc in
+ 21:09:48: booth_stop:173: return 1
+ 21:09:48: 246: rc=1
+ 21:09:48: 248: exit 1

2013/2/19 Jiaju Zhang :
> Hi Yuichi,
>
> On Tue, 2013-02-19 at 10:27 +0900, Yuichi SEINO wrote:
>> Hi Xia,
>>
>> I have a question about the following part. The write man page
>> explains that "errno" is set appropriately if write returns -1.
>> So, if "rv" is equal to 0, strerror(errno) may not output the
>> correct message. What do you think about it?
>
> Good catch. I think we should differentiate the cases of rv == -1
> and rv == 0, maybe by setting errno to ENOSPC when rv == 0.
>
> BTW, apart from that, does this patch fix your original issue?
>
> Thanks,
> Jiaju

--
Yuichi SEINO
METROSYSTEMS CORPORATION
E-mail:seino.clust...@gmail.com
Re: [Pacemaker] booth is the state of "started" on pacemaker before booth write ticket info in cib.
Hi Yuichi,

On Tue, 2013-02-19 at 10:27 +0900, Yuichi SEINO wrote:
> Hi Xia,
>
> I have a question about the following part. The write man page
> explains that "errno" is set appropriately if write returns -1.
> So, if "rv" is equal to 0, strerror(errno) may not output the
> correct message. What do you think about it?

Good catch. I think we should differentiate the cases of rv == -1 and
rv == 0, maybe by setting errno to ENOSPC when rv == 0.

BTW, apart from that, does this patch fix your original issue?

Thanks,
Jiaju
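The distinction between rv == -1 and rv == 0 could look roughly like the sketch below. It is not the code that was merged into booth; `write_all` is a hypothetical helper name, and the loop that retries short writes is an addition beyond what the thread discusses. The ENOSPC choice for the rv == 0 case follows the suggestion in this mail.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: write exactly len bytes to fd.
 * Returns 0 on success and -1 on failure. errno is meaningful in
 * every failure case, so strerror(errno) can be logged safely. */
int write_all(int fd, const char *buf, size_t len)
{
    size_t done = 0;

    while (done < len) {
        ssize_t rv = write(fd, buf + done, len - done);

        if (rv == -1) {
            if (errno == EINTR)
                continue;   /* interrupted before writing, retry */
            return -1;      /* errno was set by write() */
        }
        if (rv == 0) {
            /* write() made no progress and did not set errno,
             * so set one explicitly (assumption: ENOSPC, as
             * proposed in the thread). */
            errno = ENOSPC;
            return -1;
        }
        done += (size_t)rv;
    }
    return 0;
}
```

A caller would then log `strerror(errno)` only when `write_all()` returns -1, which avoids messages such as "Success" being attached to an error line.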
Re: [Pacemaker] booth is the state of "started" on pacemaker before booth write ticket info in cib.
Hi Xia,

I have a question about the following part. The write man page
explains that "errno" is set appropriately if write returns -1. So, if
"rv" is equal to 0, strerror(errno) may not output the correct message.
What do you think about it?

+    rv = write(fd, buf, strlen(buf));
+    if (rv <= 0) {
+        log_error("write to fd(%d) error, return(%d), message(%s)",
+            fd, rv, strerror(errno));
+        rv = -1;
+        return rv;
+    }

Sincerely,
Yuichi

2013/2/16 Xia Li :
> Hi Yuichi
>
> >>> On 2/5/2013 at 10:46 AM, in message , Yuichi SEINO wrote:
>> Hi Xia,
>>
>> I looked at your patch. I think it has a problem with the
>> following check. Do you need the case where "rv" is equal to 0?
>> The lseek man page says that lseek returns -1 if an error happens.
>> Otherwise, lseek returns the resulting offset in bytes. So 0 may
>> not be an error. Also, this patch includes several "rv <= 0"
>> checks. What do you think about it?
>
> Sorry, this is my mistake. Thanks for checking.
> I have modified it and recreated the patch.
>
> Regards,
> Xia Li

--
Yuichi SEINO
METROSYSTEMS CORPORATION
E-mail:seino.clust...@gmail.com
Re: [Pacemaker] booth is the state of "started" on pacemaker before booth write ticket info in cib.
Hi Yuichi

>>> On 2/5/2013 at 10:46 AM, in message , Yuichi SEINO wrote:
> Hi Xia,
>
> I looked at your patch. I think it has a problem with the
> following check. Do you need the case where "rv" is equal to 0?
> The lseek man page says that lseek returns -1 if an error happens.
> Otherwise, lseek returns the resulting offset in bytes. So 0 may
> not be an error. Also, this patch includes several "rv <= 0"
> checks. What do you think about it?

Sorry, this is my mistake. Thanks for checking.
I have modified it and recreated the patch.

Regards,
Xia Li

Index: booth/src/main.c
===
--- booth.orig/src/main.c
+++ booth/src/main.c
@@ -60,6 +60,12 @@ static int client_size = 0;
 struct client *client = NULL;
 struct pollfd *pollfd = NULL;

+typedef enum
+{
+    BOOTHD_STARTED=0,
+    BOOTHD_STARTING
+} BOOTH_DAEMON_STATE;
+
 int poll_timeout = -1;

 typedef enum {
@@ -447,7 +453,34 @@ static int setup_timer(void)
     return timerlist_init();
 }

-static int loop(void)
+static int write_daemon_state(int fd, int state)
+{
+    char buf[16];
+    int rv=0;
+
+    memset(buf, 0, sizeof(buf));
+    snprintf(buf, sizeof(buf), "%d %d", getpid(), state);
+
+    rv = lseek(fd, 0, SEEK_SET);
+    if (rv < 0) {
+        log_error("lseek set fd(%d) offset to 0 error, return(%d), message(%s)",
+            fd, rv, strerror(errno));
+        rv = -1;
+        return rv;
+    }
+
+    rv = write(fd, buf, strlen(buf));
+    if (rv <= 0) {
+        log_error("write to fd(%d) error, return(%d), message(%s)",
+            fd, rv, strerror(errno));
+        rv = -1;
+        return rv;
+    }
+
+    rv = 0;
+    return rv;
+}
+static int loop(int fd)
 {
     void (*workfn) (int ci);
     void (*deadfn) (int ci);
@@ -470,6 +503,18 @@ static int loop(void)
         goto fail;
     client_add(rv, process_listener, NULL);

+    rv = write_daemon_state(fd, BOOTHD_STARTED);
+    if (rv != 0) {
+        log_error("write daemon state %d to lockfile error %s: %s",
+            BOOTHD_STARTED, cl.lockfile, strerror(errno));
+        goto fail;
+    }
+
+    if (cl.type == ARBITRATOR)
+        log_info("BOOTH arbitrator daemon started");
+    else if (cl.type == SITE)
+        log_info("BOOTH cluster site daemon started");
+
     while (1) {
         rv = poll(pollfd, client_maxi + 1, poll_timeout);
         if (rv == -1 && errno == EINTR)
@@ -677,9 +722,10 @@ static int do_revoke(void)
     return do_command(BOOTHC_CMD_REVOKE);
 }

+
+
 static int lockfile(void)
 {
-    char buf[16];
     struct flock lock;
     int fd, rv;

@@ -687,39 +733,36 @@ static int lockfile(void)
     if (fd < 0) {
         log_error("lockfile open error %s: %s",
             cl.lockfile, strerror(errno));
-        return -1;
-    }
-
-    lock.l_type = F_WRLCK;
-    lock.l_start = 0;
-    lock.l_whence = SEEK_SET;
-    lock.l_len = 0;
-
-    rv = fcntl(fd, F_SETLK, &lock);
-    if (rv < 0) {
-        log_error("lockfile setlk error %s: %s",
-            cl.lockfile, strerror(errno));
-        goto fail;
-    }
-
-    rv = ftruncate(fd, 0);
-    if (rv < 0) {
-        log_error("lockfile truncate error %s: %s",
-            cl.lockfile, strerror(errno));
-        goto fail;
-    }
+        return -1;
+    }

-    memset(buf, 0, sizeof(buf));
-    snprintf(buf, sizeof(buf), "%d\n", getpid());
-
-    rv = write(fd, buf, strlen(buf));
-    if (rv <= 0) {
-        log_error("lockfile write error %s: %s",
-            cl.lockfile, strerror(errno));
-        goto fail;
-    }
+    lock.l_type = F_WRLCK;
+    lock.l_start = 0;
+    lock.l_whence = SEEK_SET;
+    lock.l_len = 0;
+
+    rv = fcntl(fd, F_SETLK, &lock);
+    if (rv < 0) {
+        log_error("lockfile setlk error %s: %s",
+            cl.lockfile, strerror(errno));
+        goto fail;
+    }
+
+    rv = ftruncate(fd, 0);
+    if (rv < 0) {
+        log_error("lockfile truncate error %s: %s",
+            cl.lockfile, strerror(errno));
+        goto fail;
+    }
+
+    rv = write_daemon_state(fd, BOOTHD_STARTING);
+    if (rv != 0) {
+        log_error("write daemon state %d to lockfile error %s: %s",
+            BOOTHD_STARTING, cl.lockfile, strerror(errno));
+        goto fail;
+    }

-    return fd;
+    return fd;
 fail:
     close(fd);
     return -1;
@@ -953,14 +996,14 @@ static int do_server(int type)
         return fd;

     if (type == ARBITRATOR)
-        log_info("BOOTH arbitrator daem
Re: [Pacemaker] booth is the state of "started" on pacemaker before booth write ticket info in cib.
Hi Xia,

I looked at your patch. I think it has a problem with the following
check. Do you need the case where "rv" is equal to 0? The lseek man
page says that lseek returns -1 if an error happens. Otherwise, lseek
returns the resulting offset in bytes. So 0 may not be an error. Also,
this patch includes several "rv <= 0" checks. What do you think about
it?

+    rv = lseek(fd, 0, SEEK_SET);
+    if (rv <= 0) {
+        log_error("lseek set offset to 0 error: %d: %s",
+            fd, strerror(errno));
+    }

Sincerely,
Yuichi

2013/2/1 Yuichi SEINO :
> Hi Xia,
>
> Thanks for the patch. I have a question.
>
> The following errors are always output. Is the log level correct?
> If this isn't an error, it would be better to output it at the
> info level. I have attached the log.
>
> ERROR: lseek set offset to 0 error: 4: Success
> or
> ERROR: lseek set offset to 0 error: 4: Operation now in progress
>
> Sincerely,
> Yuichi

--
Yuichi SEINO
METROSYSTEMS CORPORATION
E-mail:seino.clust...@gmail.com
Re: [Pacemaker] booth is the state of "started" on pacemaker before booth write ticket info in cib.
Hi Xia,

Thanks for the patch. I have a question.

The following errors are always output. Is the log level correct?
If this isn't an error, it would be better to output it at the info
level. I have attached the log.

ERROR: lseek set offset to 0 error: 4: Success
or
ERROR: lseek set offset to 0 error: 4: Operation now in progress

Sincerely,
Yuichi

2013/1/31 Xia Li :
> Hi Yuichi
>
> >>> On 1/31/2013 at 02:05 PM, in message
> <510a7a3502156...@soto.provo.novell.com>, "Xia Li" wrote:
>> Hi Yuichi
>>
>> I created two patches to try to fix this issue.
>>
>> These patches expand lockfile() so that it records not only the
>> daemon pid but also the daemon's start status ("starting" or
>> "started"). At the same time, they modify the logic of the
>> controld RA so that it can read
>
> Sorry, that is a typo; it should be the boothd RAs.
>
> Regards,
> Xia Li

--
Yuichi SEINO
METROSYSTEMS CORPORATION
E-mail:seino.clust...@gmail.com

# booth site -D
booth-site[17036]: 2013/02/01_17:01:17 ERROR: lseek set offset to 0 error: 4: Success
booth-site[17036]: 2013/02/01_17:01:17 info: BOOTH cluster site daemon is starting.
booth-site[17036]: 2013/02/01_17:01:17 info: command: 'crm_ticket -t ticketA -G owner --quiet' was executed
booth-site[17036]: 2013/02/01_17:01:17 info: command: 'crm_ticket -t ticketA -G expires --quiet' was executed
booth-site[17036]: 2013/02/01_17:01:17 info: command: 'crm_ticket -t ticketA -G ballot --quiet' was executed
booth-site[17036]: 2013/02/01_17:01:17 debug: attempting catchup from 192.168.201.166
booth-site[17036]: 2013/02/01_17:01:17 debug: connected to 192.168.201.166
booth-site[17036]: 2013/02/01_17:01:17 debug: sent catchup command to 192.168.201.166
booth-site[17036]: 2013/02/01_17:01:17 debug: attempting catchup from 192.168.201.167
booth-site[17036]: 2013/02/01_17:01:17 debug: connected to 192.168.201.167
booth-site[17036]: 2013/02/01_17:01:17 debug: sent catchup command to 192.168.201.167
booth-site[17036]: 2013/02/01_17:01:17 debug: attempting catchup from 192.168.201.168
booth-site[17036]: 2013/02/01_17:01:17 debug: connected to 192.168.201.168
booth-site[17036]: 2013/02/01_17:01:17 debug: sent catchup command to 192.168.201.168
booth-site[17036]: 2013/02/01_17:01:17 debug: attempting catchup from 192.168.201.169
booth-site[17036]: 2013/02/01_17:01:17 debug: connected to 192.168.201.169
booth-site[17036]: 2013/02/01_17:01:17 debug: sent catchup command to 192.168.201.169
booth-site[17036]: 2013/02/01_17:01:17 info: command: 'crm_ticket -t ticketA -S owner -v 2' was executed
booth-site[17036]: 2013/02/01_17:01:17 info: command: 'crm_ticket -t ticketA -S expires -v 1359705709' was executed
booth-site[17036]: 2013/02/01_17:01:17 info: command: 'crm_ticket -t ticketA -S ballot -v 2' was executed
booth-site[17036]: 2013/02/01_17:01:17 info: command: 'crm_ticket -t ticketA -r --force' was executed
booth-site[17036]: 2013/02/01_17:01:17 debug: catchup result: name: ticketA, owner: 2, ballot: 2, expires: 1359705709
booth-site[17036]: 2013/02/01_17:01:17 info: command: 'crm_ticket -t ticketA -S owner -v 2' was executed
booth-site[17036]: 2013/02/01_17:01:17 info: command: 'crm_ticket -t ticketA -S expires -v 1359705709' was executed
booth-site[17036]: 2013/02/01_17:01:17 info: command: 'crm_ticket -t ticketA -S ballot -v 2' was executed
booth-site[17036]: 2013/02/01_17:01:17 info: command: 'crm_ticket -t ticketB -G owner --quiet' was executed
booth-site[17036]: 2013/02/01_17:01:17 info: command: 'crm_ticket -t ticketB -G expires --quiet' was executed
booth-site[17036]: 2013/02/01_17:01:17 info: command: 'crm_ticket -t ticketB -G ballot --quiet' was executed
booth-site[17036]: 2013/02/01_17:01:17 debug: attempting catchup from 192.168.201.166
booth-site[17036]: 2013/02/01_17:01:17 debug: connected to 192.168.201.166
booth-site[17036]: 2013/02/01_17:01:17 debug: sent catchup command to 192.168.201.166
booth-site[17036]: 2013/02/01_17:01:17 debug: attempting catchup from 192.168.201.167
booth-site[17036]: 2013/02/01_17:01:17 debug: connected to 192.168.201.167
booth-site[17036]: 2013/02/01_17:01:17 debug: sent catchup command to 192.168.201.167
booth-site[17036]: 2013/02/01_17:01:17 debug: attempting catchup from 192.168.201.168
booth-site[17036]: 2013/02/01_17:01:17 debug: connected to 192.168.201.168
booth-site[17036]: 2013/02/01_17:01:17 debug: sent cat
Re: [Pacemaker] booth is the state of "started" on pacemaker before booth write ticket info in cib.
Hi Yuichi

>>> On 1/31/2013 at 02:05 PM, in message
<510a7a3502156...@soto.provo.novell.com>, "Xia Li" wrote:
> Hi Yuichi
>
> I created two patches to try to fix this issue.
>
> These patches expand lockfile() so that it records not only the
> daemon pid but also the daemon's start status ("starting" or
> "started"). At the same time, they modify the logic of the
> controld RA so that it can read

Sorry, that is a typo; it should be the boothd RAs.

Regards,
Xia Li
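For illustration, reading back a lockfile that holds a pid and a start state might look like the C sketch below. The "<pid> <state>" layout and the two state values are taken from the patches in this thread; the function name `parse_lockfile_content` is hypothetical, and the real RAs do this parsing in shell with awk. The sketch rejects empty or malformed content instead of passing it on.

```c
#include <stdio.h>

/* State values as defined in the patch posted in this thread. */
typedef enum {
    BOOTHD_STARTED = 0,
    BOOTHD_STARTING
} BOOTH_DAEMON_STATE;

/* Hypothetical helper: parse lockfile content of the form
 * "<pid> <state>". Returns 0 and fills *pid and *state on success,
 * or -1 if the content is empty, malformed, or has an unknown state. */
int parse_lockfile_content(const char *text, int *pid,
                           BOOTH_DAEMON_STATE *state)
{
    int p, s;

    if (text == NULL || sscanf(text, "%d %d", &p, &s) != 2)
        return -1;
    if (p <= 0)
        return -1;
    if (s != BOOTHD_STARTED && s != BOOTHD_STARTING)
        return -1;

    *pid = p;
    *state = (BOOTH_DAEMON_STATE)s;
    return 0;
}
```

A monitor action built on this would report the resource as running only when the state is BOOTHD_STARTED, which is the more precise result the patches aim for.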
Re: [Pacemaker] booth is the state of "started" on pacemaker before booth write ticket info in cib.
Hi Yuichi,

I created two patches to try to fix this issue.

The patches extend lockfile() so that it records not only the daemon pid but also the daemon start-up status ("starting" or "started"). They also modify the logic of the controld RA so that it can read that status and return a more precise result.

Would you mind testing them to see whether they work for you?

Regards,
Xia Li

Index: booth/src/main.c
===
--- booth.orig/src/main.c
+++ booth/src/main.c
@@ -63,6 +63,12 @@ static int client_size = 0;
 struct client *client = NULL;
 struct pollfd *pollfd = NULL;
 
+typedef enum
+{
+	BOOTHD_STARTED=0,
+	BOOTHD_STARTING
+} BOOTH_DAEMON_STATE;
+
 int poll_timeout = -1;
 
 typedef enum {
@@ -450,7 +456,30 @@ static int setup_timer(void)
 	return timerlist_init();
 }
 
-static int loop(void)
+static int write_daemon_state(int fd, int state)
+{
+	char buf[16];
+	int rv=1;
+
+	memset(buf, 0, sizeof(buf));
+	snprintf(buf, sizeof(buf), "%d %d", getpid(), state);
+
+	rv = lseek(fd, 0, SEEK_SET);
+	if (rv <= 0) {
+		log_error("lseek set offset to 0 error: %d: %s",
+			fd, strerror(errno));
+	}
+
+	rv = write(fd, buf, strlen(buf));
+	if (rv <= 0) {
+		log_error("lockfile write error %d: %s",
+			fd, strerror(errno));
+		return rv;
+	}
+
+	return rv;
+}
+static int loop(int fd)
 {
 	void (*workfn) (int ci);
 	void (*deadfn) (int ci);
@@ -473,6 +502,18 @@ static int loop(void)
 		goto fail;
 	client_add(rv, process_listener, NULL);
 
+	rv = write_daemon_state(fd, BOOTHD_STARTED);
+	if (rv <= 0) {
+		log_error("lockfile write state %d error %s: %s",
+			BOOTHD_STARTED, cl.lockfile, strerror(errno));
+		goto fail;
+	}
+
+	if (cl.type == ARBITRATOR)
+		log_info("BOOTH arbitrator daemon started");
+	else if (cl.type == SITE)
+		log_info("BOOTH cluster site daemon started");
+
 	while (1) {
 		rv = poll(pollfd, client_maxi + 1, poll_timeout);
 		if (rv == -1 && errno == EINTR)
@@ -681,9 +722,10 @@ static int do_revoke(void)
 	return do_command(BOOTHC_CMD_REVOKE);
 }
 
+
+
 static int lockfile(void)
 {
-	char buf[16];
 	struct flock lock;
 	int fd, r
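One detail in write_daemon_state() is worth checking when testing the patch: lseek() returns the resulting offset, so a successful seek to offset 0 returns 0, and the "rv <= 0" test would then log an error on success. The following is a minimal sketch of the same "pid plus state" idea with that check adjusted. It is not the code that was merged. The name write_state_sketch, the enum names, the trailing newline and the ftruncate() call are assumptions made for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical state values; the numbering mirrors the enum in the patch. */
enum daemon_state { DAEMON_STARTED = 0, DAEMON_STARTING = 1 };

/*
 * Rewrite the lockfile so that it holds "<pid> <state>\n".
 * Returns 0 on success and -1 on failure.
 */
static int write_state_sketch(int fd, pid_t pid, enum daemon_state state)
{
    char buf[32];
    int len;

    assert(fd >= 0);

    len = snprintf(buf, sizeof(buf), "%ld %d\n", (long)pid, (int)state);
    if (len < 0 || (size_t)len >= sizeof(buf)) {
        errno = EOVERFLOW;
        return -1;
    }

    /* Drop any older, possibly longer, content first. */
    if (ftruncate(fd, 0) < 0)
        return -1;

    /* lseek() returns the new offset, so 0 is the success value here;
     * only (off_t)-1 signals an error. */
    if (lseek(fd, 0, SEEK_SET) == (off_t)-1)
        return -1;

    if (write(fd, buf, (size_t)len) != (ssize_t)len)
        return -1;

    return 0;
}
```

A daemon following this sketch would call write_state_sketch(fd, getpid(), DAEMON_STARTING) right after taking the lock and call it again with DAEMON_STARTED once initialization has finished.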
Re: [Pacemaker] booth is the state of "started" on pacemaker before booth write ticket info in cib.
Hi Jiaju,

I understand the complete solution. However, because this issue causes the critical problem that multiple resources start, could you apply this request, or simply revert the commit, as a tentative fix until you resolve it in the summer? Unlike the booth deadlock and similar problems, I think this issue is difficult to avoid operationally. The deadlock can be avoided as long as the booth daemons do not start at the same time.

This issue causes the following:
* Multiple resources start.
* When booth deadlocks, the resource timeout does not happen. Previously, we could see the timeout in crm_mon. Currently, the timeout happens after booth has become a daemon.

Sincerely,
Yuichi

--
Yuichi SEINO
METROSYSTEMS CORPORATION
E-mail:seino.clust...@gmail.com
Re: [Pacemaker] booth is the state of "started" on pacemaker before booth write ticket info in cib.
Hi Yuichi,

On Fri, 2013-01-18 at 17:02 +0900, Yuichi SEINO wrote:
> Hi Jiaju,
>
> I try fixing this issue by reverting a commit. What do you think about it?
> https://github.com/jjzhang/booth/pull/48

Moving the whole setup stage before daemonizing does not seem to be a sane solution. setup_ticket() needs to get the latest ticket information by communicating with the other nodes. Currently it does that there, using TCP, but the sane long-term solution would be to move it into the main poll() loop and wait asynchronously for the catch-up result. Until the catch-up is complete, booth can still respond; it can participate in Paxos as a non-voting member.

To fix this issue, what do you think about removing the stale ticket information from the CIB when booth starts? We already have APIs in pacemaker.c that can clear the ticket information in the CIB. This step is reasonable because the tickets at that moment really are stale data.

As for the implementation, I have not thought it through in detail, but one idea that came to mind is to extend lockfile() (or some wrapper around lockfile()) so that it records not only the daemon pid but also the daemon start-up status, such as "starting" or "started". The controld RA could then read that status and return a more precise result.

I'll have Xia look into this problem in more detail.

Thanks,
Jiaju
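The reading side of the lockfile idea can be sketched as follows. The real RA is a shell script; the logic is shown in C here only to stay in the language of the patch posted in this thread. The "pid state" format follows that patch, the OCF codes are the standard OCF return values, and the function name monitor_from_lockfile and the MONITOR_STILL_STARTING result are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>

/* Return codes defined by the OCF resource agent API. */
enum { OCF_SUCCESS = 0, OCF_ERR_GENERIC = 1, OCF_NOT_RUNNING = 7 };

/* Hypothetical state values; the numbering follows the patch in this thread. */
enum { STATE_STARTED = 0, STATE_STARTING = 1 };

/* Hypothetical extra result: process is alive but start-up has not finished. */
enum { MONITOR_STILL_STARTING = -1 };

/* Decide a monitor result from the text stored in the lockfile. */
static int monitor_from_lockfile(const char *content)
{
    long pid;
    int state;

    assert(content != NULL);

    /* An empty or malformed lockfile must never count as "running". */
    if (sscanf(content, "%ld %d", &pid, &state) != 2 || pid <= 0)
        return OCF_NOT_RUNNING;

    /* Signal 0 only checks for existence. EPERM still means the process
     * exists, so only other failures are treated as "process gone". */
    if (kill((pid_t)pid, 0) != 0 && errno != EPERM)
        return OCF_ERR_GENERIC;

    return state == STATE_STARTED ? OCF_SUCCESS : MONITOR_STILL_STARTING;
}
```

With a result like this, a start action could keep polling while the state is still "starting" and return success only once the daemon has reported "started".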
Re: [Pacemaker] booth is the state of "started" on pacemaker before booth write ticket info in cib.
Hi Jiaju,

I tried fixing this issue by reverting a commit. What do you think about it?
https://github.com/jjzhang/booth/pull/48

Sincerely,
Yuichi

--
Yuichi SEINO
METROSYSTEMS CORPORATION
E-mail:seino.clust...@gmail.com
Re: [Pacemaker] booth is the state of "started" on pacemaker before booth write ticket info in cib.
Hi Jiaju,

2013/1/11 Jiaju Zhang:
> Hi Yuichi,
>
> On Wed, 2013-01-09 at 19:39 +0900, Yuichi SEINO wrote:
>> Hi Jiaju,
>>
>> I have a suggestion about this issue. Could you revert the following commit?
>> https://github.com/jjzhang/booth/commit/4b00d46480f45a205f2550ff0760c8b372009f7f
>>
>> Because the suitable log can't be output, we took in this fix. Even If
>> this commit is reverted, the behavior of booth should not be affect.
>> And, this issue is more important than this commit.
>
> The main reason we took that fix is to make the initialization to be
> under the protection of lockfile(), so that there are no two booth
> instances trying to start on the same node.

Originally, the lockfile check came after setup_listener(), so I think two booth instances could not start even then. However, a second booth instance usually stopped at setup_transport(), before reaching the lockfile check, because it used the same IP address. Therefore, the appropriate log message could not be output.
In my opinion, reverting this commit is simpler than writing a new function to initialize the CIB.

Sincerely,
Yuichi

--
Yuichi SEINO
METROSYSTEMS CORPORATION
E-mail:seino.clust...@gmail.com
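The single-instance protection discussed above is commonly built on an fcntl() record lock. The patch in this thread shows that booth's lockfile() uses a struct flock, but the rest of that function is not visible here. The following is therefore only a generic sketch of the technique, with a hypothetical function name, and not booth's actual code.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Try to take an exclusive write lock on `path`.
 * Returns an open descriptor that must stay open for the life of the daemon,
 * or -1 if the file cannot be opened or another process holds the lock.
 */
static int acquire_lockfile_sketch(const char *path)
{
    struct flock lock;
    int fd;

    assert(path != NULL);

    fd = open(path, O_CREAT | O_WRONLY, 0644);
    if (fd < 0)
        return -1;

    memset(&lock, 0, sizeof(lock));
    lock.l_type = F_WRLCK;
    lock.l_whence = SEEK_SET;
    lock.l_start = 0;
    lock.l_len = 0;   /* 0 means "to the end of the file" */

    /* F_SETLK does not block: it fails with EACCES or EAGAIN when another
     * process already holds a conflicting lock. */
    if (fcntl(fd, F_SETLK, &lock) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

Because the kernel releases the lock when the holding process exits, a stale file left behind by a crashed daemon does not block the next start. Where this call sits relative to the network setup determines which error a second instance reports first, which is the trade-off debated in this message.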
Re: [Pacemaker] booth is the state of "started" on pacemaker before booth write ticket info in cib.
Hi Yuichi,

On Wed, 2013-01-09 at 19:39 +0900, Yuichi SEINO wrote:
> Hi Jiaju,
>
> I have a suggestion about this issue. Could you revert the following commit?
> https://github.com/jjzhang/booth/commit/4b00d46480f45a205f2550ff0760c8b372009f7f
>
> Because the suitable log can't be output, we took in this fix. Even If
> this commit is reverted, the behavior of booth should not be affect.
> And, this issue is more important than this commit.

The main reason we took that fix was to put the initialization under the protection of lockfile(), so that two booth instances cannot try to start on the same node.

Thanks,
Jiaju
Re: [Pacemaker] booth is the state of "started" on pacemaker before booth write ticket info in cib.
Hi Jiaju,

I have a suggestion about this issue. Could you revert the following commit?
https://github.com/jjzhang/booth/commit/4b00d46480f45a205f2550ff0760c8b372009f7f

We took in that fix because the appropriate log message could not be output. Even if the commit is reverted, the behavior of booth should not be affected. And this issue is more important than that commit.

Sincerely,
Yuichi

--
Yuichi SEINO
METROSYSTEMS CORPORATION
E-mail:seino.clust...@gmail.com
Re: [Pacemaker] booth is the state of "started" on pacemaker before booth write ticket info in cib.
On Fri, 2012-12-21 at 15:44 +0900, Yuichi SEINO wrote:
> OK, thanks.
> Are you going to implement it in the next development ?

Sure, I will;)

Thanks,
Jiaju
Re: [Pacemaker] booth is the state of "started" on pacemaker before booth write ticket info in cib.
Hi Jiaju,

2012/12/18 Jiaju Zhang:
> OK, I'll implement it;)

OK, thanks.
Are you going to implement it in the next development cycle?

Sincerely,
Yuichi

--
Yuichi SEINO
METROSYSTEMS CORPORATION
E-mail:seino.clust...@gmail.com
Re: [Pacemaker] booth is the state of "started" on pacemaker before booth write ticket info in cib.
On Mon, 2012-12-17 at 10:40 +0900, Yuichi SEINO wrote:
> In the case of using cib info, Can you implement it? For example,
> booth is fail-over on local. Then, booth need to get the ticket in
> cib. If there is no this problem, I can agree to it.

OK, I'll implement it;)

Thanks,
Jiaju
Re: [Pacemaker] booth is the state of "started" on pacemaker before booth write ticket info in cib.
Hi Jiaju,

> I suggest to add a new function to clear the old ticket info in the CIB,
> and call that function when booth just run but before deamonized. So,
> before booth_start in the RA returned, the stale data has been cleared.
> What do you think about this?;)

Can you implement it so that it also covers the case where the ticket information in the CIB is still needed? For example, when booth fails over locally, booth needs to get the ticket from the CIB. If that case is not a problem, I can agree to it.

Sincerely,
Yuichi

--
Yuichi SEINO
METROSYSTEMS CORPORATION
E-mail:seino.clust...@gmail.com
Re: [Pacemaker] booth is the state of "started" on pacemaker before booth write ticket info in cib.
On Thu, 2012-12-13 at 12:01 +0900, Yuichi SEINO wrote:
> Hi Jiaju,
>
> 2012/12/12 Jiaju Zhang :
> > The resouce should not be started before the booth daemon is ready. We
> > suggest to configure an ordering constraint for the booth daemon and the
> > managed resources by that ticket. That being said, if the ticket is in
> > the CIB but booth daemon has not been started, the resources would not
> > be started.
>
> booth RA finishes booth_start when booth changed the daemon from the
> foreground process.(To be exact, "sleep 1" is included). The current
> booth change daemon before catchup. On the other hand, the previous
> booth change daemon after catchup. catchup write a ticket in cib.
> Even if an ordering constraint is set, as shown below, the related
> resource can start when booth changes the state of "started" on
> pacemaker. At this point, the current booth still may not finish
> catchup.

Oh, I think I understand your problem now, thanks!

> crm_mon paste.
> ...
> booth(ocf::pacemaker:booth-site):Started multi-site-a-1
> ...
>
> > Currently when all of the initialization (including loading the new
> > ticket information) finished, booth should be regarded as ready. So if
> > you encounter some problem here, I guess we should improve the RA to
> > better reflect the booth startup status, but not moving the
> > initialization order, since it may introduce other regression as we have
> > encountered before;)
>
> I am not still sure which we should fix RA or booth.

I suggest adding a new function that clears the old ticket information in the CIB, and calling it when booth has just started but before it daemonizes. That way, the stale data has already been cleared by the time booth_start in the RA returns. What do you think about this?;)

Thanks,
Jiaju
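The ordering problem described in this message (the start action returns as soon as the process has daemonized, although initialization is still running) is often handled with a readiness pipe: the parent returns only after the child has reported that its initialization is complete. The following is a minimal sketch of that general technique. It is not what booth does, and every name in it is hypothetical. The two demo_init functions stand in for a step such as the catch-up.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical stand-ins for a daemon's initialization step. */
static int demo_init_ok(void)   { return 0; }
static int demo_init_fail(void) { return -1; }

/*
 * Fork a child, let it run init(), and return only after the child has
 * reported the result. Returns 0 if init() succeeded in the child and
 * -1 otherwise.
 */
static int start_and_wait_ready(int (*init)(void))
{
    int pipefd[2];
    char status = 1;
    pid_t pid;

    assert(init != NULL);

    if (pipe(pipefd) < 0)
        return -1;

    pid = fork();
    if (pid < 0) {
        close(pipefd[0]);
        close(pipefd[1]);
        return -1;
    }

    if (pid == 0) {
        /* Child: finish initialization first, then report the result. */
        close(pipefd[0]);
        status = (init() == 0) ? 0 : 1;
        if (write(pipefd[1], &status, 1) != 1)
            _exit(2);
        close(pipefd[1]);
        /* A real daemon would call setsid() and enter its main loop here;
         * the sketch simply exits. */
        _exit(status);
    }

    /* Parent: block until the child reports, or until it dies without
     * reporting (read() then returns 0). */
    close(pipefd[1]);
    if (read(pipefd[0], &status, 1) != 1)
        status = 1;
    close(pipefd[0]);
    return status == 0 ? 0 : -1;
}
```

With this shape, whatever launches the daemon and waits for the foreground process to return cannot observe success before the initialization step has completed.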
Re: [Pacemaker] booth is the state of "started" on pacemaker before booth write ticket info in cib.
Hi Jiaju,

2012/12/12 Jiaju Zhang:
> On Tue, 2012-12-11 at 20:15 +0900, Yuichi SEINO wrote:
>> Hi Jiaju,
>>
>> Currently, booth is the state of "started" on pacemaker before booth
>> writes ticket information in cib. So, If the old ticket information is
>> included in cib, a resource relating to the ticket may start before
>> booth resets the ticket. I think that this problem is when to be
>> daemon in booth.
>
> The resouce should not be started before the booth daemon is ready. We
> suggest to configure an ordering constraint for the booth daemon and the
> managed resources by that ticket. That being said, if the ticket is in
> the CIB but booth daemon has not been started, the resources would not
> be started.

The booth RA finishes booth_start as soon as booth has changed from a foreground process into a daemon (to be exact, a "sleep 1" is also included). The current booth daemonizes before the catch-up, whereas the previous booth daemonized after the catch-up. The catch-up is what writes the ticket to the CIB.
Even if an ordering constraint is set, the related resource can start as soon as booth reaches the "started" state on pacemaker, as shown below. At that point, the current booth may not have finished the catch-up yet.

crm_mon paste.
...
booth(ocf::pacemaker:booth-site):Started multi-site-a-1
...

>> Perhaps, this problem didn't happen before the following commit.
>> https://github.com/jjzhang/booth/commit/4b00d46480f45a205f2550ff0760c8b372009f7f
>
> Currently when all of the initialization (including loading the new
> ticket information) finished, booth should be regarded as ready. So if
> you encounter some problem here, I guess we should improve the RA to
> better reflect the booth startup status, but not moving the
> initialization order, since it may introduce other regression as we have
> encountered before;)

I am still not sure which one we should fix, the RA or booth.

Sincerely,
Yuichi

--
Yuichi SEINO
METROSYSTEMS CORPORATION
E-mail:seino.clust...@gmail.com
Re: [Pacemaker] booth is the state of "started" on pacemaker before booth write ticket info in cib.
On Tue, 2012-12-11 at 20:15 +0900, Yuichi SEINO wrote:
> Hi Jiaju,
>
> Currently, booth is the state of "started" on pacemaker before booth
> writes ticket information in cib. So, If the old ticket information is
> included in cib, a resource relating to the ticket may start before
> booth resets the ticket. I think that this problem is when to be
> daemon in booth.

The resource should not be started before the booth daemon is ready. We suggest configuring an ordering constraint between the booth daemon and the resources managed by that ticket. With that in place, if the ticket is in the CIB but the booth daemon has not been started, the resources will not be started.

> Perhaps, this problem didn't happen before the following commit.
> https://github.com/jjzhang/booth/commit/4b00d46480f45a205f2550ff0760c8b372009f7f

Currently, booth should be regarded as ready once all of the initialization (including loading the new ticket information) has finished. So if you encounter a problem here, I think we should improve the RA to better reflect the booth start-up status rather than change the initialization order, since that may introduce other regressions like the ones we have encountered before;)

Thanks,
Jiaju
[Pacemaker] booth is the state of "started" on pacemaker before booth write ticket info in cib.
Hi Jiaju,

Currently, booth is in the "started" state on pacemaker before booth writes the ticket information to the CIB. So, if old ticket information is still in the CIB, a resource that depends on the ticket may start before booth resets the ticket. I think the problem is the point at which booth becomes a daemon.

Perhaps this problem did not happen before the following commit.
https://github.com/jjzhang/booth/commit/4b00d46480f45a205f2550ff0760c8b372009f7f

Sincerely,
Yuichi

--
Yuichi SEINO
METROSYSTEMS CORPORATION
E-mail:seino.clust...@gmail.com