Re: [Pacemaker] [Linux-HA] [ha-wg-technical] [RFC] Organizing HA Summit 2015

2015-01-13 Thread Yusuke Iida
Thank you for adding me to the list!

Introduction: Japanese Pacemaker developer (and user). I belong to
Linux-HA Japan. @yuusuke on GitHub.

Please use this.

Thanks,
Yusuke

2015-01-14 13:38 GMT+09:00 Digimer :
> Woohoo!!
>
> Will be very nice to see you. :)
>
> I've added you. Can you give me a short sentence to introduce yourself to
> people who haven't met you?
>
> Madi
>
>
> On 13/01/15 11:33 PM, Yusuke Iida wrote:
>>
>> Hi Digimer,
>>
>> I am Iida; I will be participating from NTT along with Mori.
>> Please add me to the list of participants.
>>
>> I'm sorry for the late reply.
>>
>> Regards,
>> Yusuke
>>
>> 2014-12-23 2:13 GMT+09:00 Digimer :
>>>
>>> It will be very nice to see you again! Will Ikeda-san be there as well?
>>>
>>> digimer
>>>
>>> On 22/12/14 03:35 AM, Keisuke MORI wrote:
>>>>
>>>>
>>>> Hi all,
>>>>
>>>> Really late response but,
>>>> I will be joining the HA summit, with a few colleagues from NTT.
>>>>
>>>> See you guys in Brno,
>>>> Thanks,
>>>>
>>>>
>>>> 2014-12-08 22:36 GMT+09:00 Jan Pokorný :
>>>>>
>>>>>
>>>>> Hello,
>>>>>
>>>>> it occurred to me that if you want to use the opportunity and double
>>>>> as a tourist while being in Brno, it's about the right time to
>>>>> consider reservations/ticket purchases this early.
>>>>> At least in some cases it is a must, e.g., Villa Tugendhat:
>>>>>
>>>>>
>>>>>
>>>>> http://rezervace.spilberk.cz/langchange.aspx?mrsname=&languageId=2&returnUrl=%2Flist
>>>>>
>>>>> On 08/09/14 12:30 +0200, Fabio M. Di Nitto wrote:
>>>>>>
>>>>>>
>>>>>> DevConf will start Friday the 6th of Feb 2015 in Red Hat Brno offices.
>>>>>>
>>>>>> My suggestion would be to have a 2 days dedicated HA summit the 4th
>>>>>> and
>>>>>> the 5th of February.
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Jan
>>>>>
>>>>> ___
>>>>> ha-wg-technical mailing list
>>>>> ha-wg-techni...@lists.linux-foundation.org
>>>>> https://lists.linuxfoundation.org/mailman/listinfo/ha-wg-technical
>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Digimer
>>> Papers and Projects: https://alteeve.ca/w/
>>> What if the cure for cancer is trapped in the mind of a person without
>>> access to education?
>>> ___
>>> Linux-HA mailing list
>>> linux...@lists.linux-ha.org
>>> http://lists.linux-ha.org/mailman/listinfo/linux-ha
>>> See also: http://linux-ha.org/ReportingProblems
>>
>>
>>
>>
>
>
> --
> Digimer
> Papers and Projects: https://alteeve.ca/w/
> What if the cure for cancer is trapped in the mind of a person without
> access to education?
> ___
> Linux-HA mailing list
> linux...@lists.linux-ha.org
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems



-- 

METRO SYSTEMS CO., LTD

Yusuke Iida
Mail: yusk.i...@gmail.com


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] [Linux-HA] [ha-wg-technical] [RFC] Organizing HA Summit 2015

2015-01-13 Thread Yusuke Iida
Hi Digimer,

I am Iida; I will be participating from NTT along with Mori.
Please add me to the list of participants.

I'm sorry for the late reply.

Regards,
Yusuke

2014-12-23 2:13 GMT+09:00 Digimer :
> It will be very nice to see you again! Will Ikeda-san be there as well?
>
> digimer
>
> On 22/12/14 03:35 AM, Keisuke MORI wrote:
>>
>> Hi all,
>>
>> Really late response but,
>> I will be joining the HA summit, with a few colleagues from NTT.
>>
>> See you guys in Brno,
>> Thanks,
>>
>>
>> 2014-12-08 22:36 GMT+09:00 Jan Pokorný :
>>>
>>> Hello,
>>>
>>> it occurred to me that if you want to use the opportunity and double
>>> as a tourist while being in Brno, it's about the right time to
>>> consider reservations/ticket purchases this early.
>>> At least in some cases it is a must, e.g., Villa Tugendhat:
>>>
>>>
>>> http://rezervace.spilberk.cz/langchange.aspx?mrsname=&languageId=2&returnUrl=%2Flist
>>>
>>> On 08/09/14 12:30 +0200, Fabio M. Di Nitto wrote:
>>>>
>>>> DevConf will start Friday the 6th of Feb 2015 in Red Hat Brno offices.
>>>>
>>>> My suggestion would be to have a 2 days dedicated HA summit the 4th and
>>>> the 5th of February.
>>>
>>>
>>> --
>>> Jan
>>>
>>> ___
>>> ha-wg-technical mailing list
>>> ha-wg-techni...@lists.linux-foundation.org
>>> https://lists.linuxfoundation.org/mailman/listinfo/ha-wg-technical
>>>
>>
>>
>>
>
>
> --
> Digimer
> Papers and Projects: https://alteeve.ca/w/
> What if the cure for cancer is trapped in the mind of a person without
> access to education?
> ___
> Linux-HA mailing list
> linux...@lists.linux-ha.org
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems



-- 

METRO SYSTEMS CO., LTD

Yusuke Iida
Mail: yusk.i...@gmail.com


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] The problem with which queue between cib and stonith-ng overflows

2014-06-03 Thread Yusuke Iida
Jun  4 10:47:09 vm02 stonith-ng[2971]:   notice:
stonith_device_register: Added 'prmStonith_helper03' to the device
list (3 active devices)
Jun  4 10:47:09 vm02 stonith-ng[2971]:   notice:
stonith_device_register: Added 'prmStonith_libvirt03' to the device
list (4 active devices)
Jun  4 10:47:09 vm02 stonith-ng[2971]:   notice:
stonith_device_register: Added 'prmStonith_helper04' to the device
list (5 active devices)
Jun  4 10:47:09 vm02 stonith-ng[2971]:   notice:
stonith_device_register: Added 'prmStonith_libvirt04' to the device
list (6 active devices)
Jun  4 10:47:09 vm02 stonith-ng[2971]:   notice:
stonith_device_register: Added 'prmStonith_helper05' to the device
list (7 active devices)
Jun  4 10:47:09 vm02 stonith-ng[2971]:   notice:
stonith_device_register: Added 'prmStonith_libvirt05' to the device
list (8 active devices)
Jun  4 10:47:09 vm02 stonith-ng[2971]:   notice:
stonith_device_register: Added 'prmStonith_helper06' to the device
list (9 active devices)
Jun  4 10:47:09 vm02 stonith-ng[2971]:   notice:
stonith_device_register: Added 'prmStonith_libvirt06' to the device
list (10 active devices)
Jun  4 10:47:09 vm02 stonith-ng[2971]:   notice:
stonith_device_register: Added 'prmStonith_helper07' to the device
list (11 active devices)
Jun  4 10:47:09 vm02 stonith-ng[2971]:   notice:
stonith_device_register: Added 'prmStonith_libvirt07' to the device
list (12 active devices)
Jun  4 10:47:09 vm02 stonith-ng[2971]:   notice:
stonith_device_register: Added 'prmStonith_helper08' to the device
list (13 active devices)
Jun  4 10:47:09 vm02 stonith-ng[2971]:   notice:
stonith_device_register: Added 'prmStonith_libvirt08' to the device
list (14 active devices)

2014-06-04 8:18 GMT+09:00 Andrew Beekhof :
>
> On 4 Jun 2014, at 8:11 am, Andrew Beekhof  wrote:
>
>>
>> On 3 Jun 2014, at 11:26 am, Yusuke Iida  wrote:
>>
>>> Hi, Andrew
>>>
>>> About 15 seconds are the time taken in the whole device construction.
>>> I think that it cannot receive the message from cib during device
>>> construction since stonith-ng does not return to mainloop.
>>
>> I'm reasonably sure this is because we do synchronous metadata calls when a 
>> device is added.
>> I'll have a patch which creates a per-agent metadata cache (instead of per 
>> device) for you to test later today.
>
> Can you try this please:
>
>http://paste.fedoraproject.org/106995/18374851
>
>>
>>>
>>> Jun  2 11:34:02 vm04 stonith-ng[4891]: info: init_cib_cache_cb:
>>> Updating device list from the cib: init
>>> Jun  2 11:34:02 vm04 stonith-ng[4891]: info: stonith_level_remove:
>>> Node vm01 not found (0 active entries)
>>> Jun  2 11:34:02 vm04 stonith-ng[4891]: info:
>>> stonith_level_register: Node vm01 has 1 active fencing levels
>>> Jun  2 11:34:02 vm04 stonith-ng[4891]: info:
>>> stonith_level_register: Node vm01 has 2 active fencing levels
>>> Jun  2 11:34:02 vm04 stonith-ng[4891]: info: stonith_level_remove:
>>> Node vm02 not found (1 active entries)
>>> Jun  2 11:34:02 vm04 stonith-ng[4891]: info:
>>> stonith_level_register: Node vm02 has 1 active fencing levels
>>> Jun  2 11:34:02 vm04 stonith-ng[4891]: info:
>>> stonith_level_register: Node vm02 has 2 active fencing levels
>>> Jun  2 11:34:02 vm04 stonith-ng[4891]: info: stonith_level_remove:
>>> Node vm03 not found (2 active entries)
>>> Jun  2 11:34:02 vm04 stonith-ng[4891]: info:
>>> stonith_level_register: Node vm03 has 1 active fencing levels
>>> Jun  2 11:34:02 vm04 stonith-ng[4891]: info:
>>> stonith_level_register: Node vm03 has 2 active fencing levels
>>> Jun  2 11:34:02 vm04 stonith-ng[4891]: info: stonith_level_remove:
>>> Node vm04 not found (3 active entries)
>>> Jun  2 11:34:02 vm04 stonith-ng[4891]: info:
>>> stonith_level_register: Node vm04 has 1 active fencing levels
>>> Jun  2 11:34:02 vm04 stonith-ng[4891]: info:
>>> stonith_level_register: Node vm04 has 2 active fencing levels
>>> Jun  2 11:34:02 vm04 stonith-ng[4891]: info: stonith_level_remove:
>>> Node vm05 not found (4 active entries)
>>> Jun  2 11:34:02 vm04 stonith-ng[4891]: info:
>>> stonith_level_register: Node vm05 has 1 active fencing levels
>>> Jun  2 11:34:02 vm04 stonith-ng[4891]: info:
>>> stonith_level_register: Node vm05 has 2 active fencing levels
>>> Jun  2 11:34:02 vm04 stonith-ng[4891]: info: stonith_level_remove:
>>> Node vm06 not found (5 active entries)
>>> Jun  2 11:34:02 vm04 ston
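
The per-agent metadata cache that Andrew describes above can be pictured
roughly as follows. This is only a minimal sketch of the idea, not the
actual patch; the function and variable names are hypothetical. Metadata
is fetched once per agent type instead of once per registered device, so
registering many devices that share an agent no longer triggers repeated
synchronous agent calls:

/* Minimal sketch of a per-agent metadata cache (hypothetical names). */
#include <glib.h>
#include <stdio.h>

static GHashTable *metadata_cache = NULL;

/* Stand-in for the expensive synchronous call to the fence agent. */
static char *fetch_agent_metadata(const char *agent)
{
    printf("querying agent %s for metadata (expensive)\n", agent);
    return g_strdup_printf("<resource-agent name=\"%s\"/>", agent);
}

/* Return cached metadata for this agent type, querying it only once. */
static const char *get_agent_metadata(const char *agent)
{
    char *xml = NULL;

    if (metadata_cache == NULL) {
        metadata_cache = g_hash_table_new_full(g_str_hash, g_str_equal,
                                               g_free, g_free);
    }
    xml = g_hash_table_lookup(metadata_cache, agent);
    if (xml == NULL) {
        xml = fetch_agent_metadata(agent);
        g_hash_table_insert(metadata_cache, g_strdup(agent), xml);
    }
    return xml;
}

int main(void)
{
    int i;

    /* 16 devices but a single agent type: the agent is queried only once. */
    for (i = 0; i < 16; i++) {
        get_agent_metadata("fence_virsh");
    }
    return 0;
}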

Re: [Pacemaker] The problem with which queue between cib and stonith-ng overflows

2014-06-02 Thread Yusuke Iida
91]: info: cib_device_update:
Device prmStonith_helper04 has been disabled on vm04: score=-INFINITY
Jun  2 11:34:08 vm04 stonith-ng[4891]: info: cib_device_update:
Device prmStonith_libvirt04 has been disabled on vm04: score=-INFINITY
Jun  2 11:34:09 vm04 stonith-ng[4891]:   notice:
stonith_device_register: Added 'prmStonith_helper05' to the device
list (7 active devices)
Jun  2 11:34:10 vm04 stonith-ng[4891]:   notice:
stonith_device_register: Added 'prmStonith_libvirt05' to the device
list (8 active devices)
Jun  2 11:34:11 vm04 stonith-ng[4891]:   notice:
stonith_device_register: Added 'prmStonith_helper06' to the device
list (9 active devices)
Jun  2 11:34:12 vm04 stonith-ng[4891]:   notice:
stonith_device_register: Added 'prmStonith_libvirt06' to the device
list (10 active devices)
Jun  2 11:34:13 vm04 stonith-ng[4891]:   notice:
stonith_device_register: Added 'prmStonith_helper07' to the device
list (11 active devices)
Jun  2 11:34:14 vm04 stonith-ng[4891]:   notice:
stonith_device_register: Added 'prmStonith_libvirt07' to the device
list (12 active devices)
Jun  2 11:34:15 vm04 stonith-ng[4891]:   notice:
stonith_device_register: Added 'prmStonith_helper08' to the device
list (13 active devices)
Jun  2 11:34:16 vm04 stonith-ng[4891]:   notice:
stonith_device_register: Added 'prmStonith_libvirt08' to the device
list (14 active devices)

Regards,
Yusuke

2014-06-02 20:31 GMT+09:00 Andrew Beekhof :
>
> On 2 Jun 2014, at 3:05 pm, Yusuke Iida  wrote:
>
>> Hi, Andrew
>>
>> I am using the newest 1.1 branch and testing with eight nodes.
>>
>> Although the problem was settled once, the queue overflow between cib
>> and stonithd has now recurred.
>>
>> As an example, I paste the log of the DC node.
>> The problem is occurring on all nodes.
>>
>> Jun  2 11:34:02 vm04 cib[3940]:error: crm_ipcs_flush_events:
>> Evicting slow client 0x250afe0[3941]: event queue reached 638 entries
>> Jun  2 11:34:02 vm04 stonith-ng[3941]:error: crm_ipc_read:
>> Connection to cib_rw failed
>> Jun  2 11:34:02 vm04 stonith-ng[3941]:error:
>> mainloop_gio_callback: Connection to cib_rw[0x662510] closed (I/O
>> condition=17)
>> Jun  2 11:34:02 vm04 stonith-ng[3941]:   notice:
>> cib_connection_destroy: Connection to the CIB terminated. Shutting
>> down.
>> Jun  2 11:34:02 vm04 stonith-ng[3941]: info: stonith_shutdown:
>> Terminating with  2 clients
>> Jun  2 11:34:02 vm04 stonith-ng[3941]: info: qb_ipcs_us_withdraw:
>> withdrawing server sockets
>>
>> After loading the resource configuration, stonithd takes a long time
>> to build its device information.
>> It has taken about 15 seconds.
>
> 15 seconds!! Yikes. I'll investigate tomorrow.
>
>> It seems that cib diff messages accumulate in the meantime.
>>
>> Are there any plans to improve on this issue?
>>
>> I attach a report when a problem occurs.
>> https://drive.google.com/file/d/0BwMFJItoO-fVUEFEN1NlelNWRjg/edit?usp=sharing
>>
>> Regards,
>> Yusuke
>> --
>> 
>> METRO SYSTEMS CO., LTD
>>
>> Yusuke Iida
>> Mail: yusk.i...@gmail.com
>> 
>>
>> ___
>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
>
>
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>



-- 

METRO SYSTEMS CO., LTD

Yusuke Iida
Mail: yusk.i...@gmail.com


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[Pacemaker] The problem with which queue between cib and stonith-ng overflows

2014-06-01 Thread Yusuke Iida
Hi, Andrew

I am using the newest 1.1 branch and testing with eight nodes.

Although the problem was settled once, the queue overflow between cib
and stonithd has now recurred.

As an example, I paste the log of the DC node.
The problem is occurring on all nodes.

Jun  2 11:34:02 vm04 cib[3940]:error: crm_ipcs_flush_events:
Evicting slow client 0x250afe0[3941]: event queue reached 638 entries
Jun  2 11:34:02 vm04 stonith-ng[3941]:error: crm_ipc_read:
Connection to cib_rw failed
Jun  2 11:34:02 vm04 stonith-ng[3941]:error:
mainloop_gio_callback: Connection to cib_rw[0x662510] closed (I/O
condition=17)
Jun  2 11:34:02 vm04 stonith-ng[3941]:   notice:
cib_connection_destroy: Connection to the CIB terminated. Shutting
down.
Jun  2 11:34:02 vm04 stonith-ng[3941]: info: stonith_shutdown:
Terminating with  2 clients
Jun  2 11:34:02 vm04 stonith-ng[3941]: info: qb_ipcs_us_withdraw:
withdrawing server sockets

After loading the resource configuration, stonithd takes a long time
to build its device information, about 15 seconds.
It seems that cib diff messages accumulate in the meantime.

Are there any plans to improve on this issue?
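
As a rough illustration of why such a stall overflows the queue: while
stonith-ng is blocked building devices, the cib keeps emitting diff
notifications and the backlog quickly passes the eviction limit. A
minimal sketch; the eviction threshold of 500 comes from the log messages
in this thread, the notification rate is an assumption for illustration:

/* Back-of-the-envelope model of the backlog during a 15 second stall. */
#include <stdio.h>

int main(void)
{
    const double stall_seconds = 15.0;      /* stonith-ng busy building devices       */
    const double diffs_per_second = 40.0;   /* assumed rate of cib diff notifications */
    const int eviction_threshold = 500;     /* "event queue reached N entries" limit  */

    double backlog = stall_seconds * diffs_per_second;

    printf("backlog after the stall: %.0f events (threshold %d)\n",
           backlog, eviction_threshold);
    if (backlog > eviction_threshold) {
        printf("-> the cib would evict stonith-ng as a slow client\n");
    }
    return 0;
}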

I attach a report from when the problem occurs.
https://drive.google.com/file/d/0BwMFJItoO-fVUEFEN1NlelNWRjg/edit?usp=sharing

Regards,
Yusuke
-- 

METRO SYSTEMS CO., LTD

Yusuke Iida
Mail: yusk.i...@gmail.com


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] If 256 resources are load(ed), crmd will reboot.

2014-05-29 Thread Yusuke Iida
Hi, Andrew

2014-05-29 15:30 GMT+09:00 Andrew Beekhof :
>
> On 29 May 2014, at 3:40 pm, Yusuke Iida  wrote:
>
>> Hi, Andrew
>>
>> 2014-05-29 14:00 GMT+09:00 Andrew Beekhof :
>>>
>>> On 29 May 2014, at 12:28 pm, Yusuke Iida  wrote:
>>>
>>>> Hi, Andrew
>>>>
>>>> I'm sorry.
>>>> It seems that the notation of the node name became another by syslog.
>>>> In order to dispel misunderstanding, the report was newly acquired.
>>>> I think that the signs are appearing in vm02/ha-log.
>>>
>>> Got it :)
>>>
>>> Ok, step 1 - stop logging debug.
>>> Debug is accounting for 30% of the logs and all that writing to disk would 
>>> be adding significantly to the cluster's workload.
>> I understand.
>>
>>>
>>> Question:  How have you got logging configured? Anything in 
>>> /etc/sysconfig/pacemaker ?
>>>
>>> I ask because pacemaker.log appears to have a jumble of syslog and regular 
>>> file output:
>>>
>>> May 29 10:45:26 vm02 cib[25603]: info: cib_perform_op: +  /cib:  
>>> @num_updates=1295
>>> May 29 10:45:26 [25603] vm02cib: info: cib_perform_op:  +  
>>> /cib:  @num_updates=1295
>> The position of pid is different although seldom cared.
>> I attach the /etc/sysconfig/pacemaker of my environment.
>
> The format isn't a problem, it just indicates that there are two mechanisms 
> logging to the same place.
> So its redundant.
>
> The question is... how, your configs look fine to me :-/
This was a mistake in my configuration:
syslog was set up to send "local1.*" to "/var/log/pacemaker.log".
I am sorry for the confusion.

>
>>
>>>
>>>
>>> Step 2 - can you try this patch:
>>>
>>> diff --git a/crmd/te_callbacks.c b/crmd/te_callbacks.c
>>> index 4d330a6..eba5f11 100644
>>> --- a/crmd/te_callbacks.c
>>> +++ b/crmd/te_callbacks.c
>>> @@ -381,12 +381,15 @@ te_update_diff(const char *event, xmlNode * msg)
>>>
>>> } else if(strstr(xpath, "/cib/configuration")) {
>>> abort_transition(INFINITY, tg_restart, "Non-status change", 
>>> change);
>>> +break; /* Wont be packaged with any resource operations we may 
>>> be waiting for */
>>>
>>> } else if(strstr(xpath, "/"XML_CIB_TAG_TICKETS) || 
>>> safe_str_eq(name, XML_CIB_TAG_TICKETS)) {
>>> abort_transition(INFINITY, tg_restart, "Ticket attribute 
>>> change", change);
>>> +break; /* Wont be packaged with any resource operations we may 
>>> be waiting for */
>>>
>>> } else if(strstr(xpath, "/"XML_TAG_TRANSIENT_NODEATTRS"[") || 
>>> safe_str_eq(name, XML_TAG_TRANSIENT_NODEATTRS)) {
>>> abort_transition(INFINITY, tg_restart, "Transient attribute 
>>> change", change);
>>> +break; /* Wont be packaged with any resource operations we may 
>>> be waiting for */
>>>
>>> } else if(strstr(xpath, "/"XML_LRM_TAG_RSC_OP"[") && 
>>> safe_str_eq(op, "delete")) {
>>> crm_action_t *cancel = NULL;
>>
>> Thank you for the patch.
>> I will reply after verifying its behavior.
>
> Do you mean it works now?
I think the patch is working without any problems.
When the configuration was loaded, abort_transition() was now called
only once.
I would like this fix to be included in Pacemaker-1.1.12.

A report taken with the patch applied is attached.
https://drive.google.com/file/d/0BwMFJItoO-fVWWV0VmxqclMzT2M/edit?usp=sharing


Regards,
Yusuke
>
>>
>> Regards,
>> Yusuke
>>>
>>>
>>>>
>>>> May 29 10:43:37 vm02 crmd[25608]:error: config_query_callback:
>>>> Local CIB query resulted in an error: Timer expired
>>>> May 29 10:43:37 vm02 crmd[25608]: info: register_fsa_error_adv:
>>>> Resetting the current action list
>>>> May 29 10:43:37 vm02 crmd[25608]:error: do_log: FSA: Input I_ERROR
>>>> from config_query_callback() received in state S_POLICY_ENGINE
>>>> May 29 10:43:37 vm02 crmd[25608]:  warning: do_state_transition: State
>>>> transition S_POLICY_ENGINE -> S_RECOVERY [ input=I_ERROR
>>>> cause=C_FSA_INTERNAL origin=config_query_callback ]
>>>> May 29 10:43:37 vm02 crmd[25608]:  warning: do_recover: Fast-tracking
>>>> s

Re: [Pacemaker] If 256 resources are load(ed), crmd will reboot.

2014-05-28 Thread Yusuke Iida
Hi, Andrew

2014-05-29 14:00 GMT+09:00 Andrew Beekhof :
>
> On 29 May 2014, at 12:28 pm, Yusuke Iida  wrote:
>
>> Hi, Andrew
>>
>> I'm sorry.
>> It seems that the notation of the node name became another by syslog.
>> In order to dispel misunderstanding, the report was newly acquired.
>> I think that the signs are appearing in vm02/ha-log.
>
> Got it :)
>
> Ok, step 1 - stop logging debug.
> Debug is accounting for 30% of the logs and all that writing to disk would be 
> adding significantly to the cluster's workload.
I understand.

>
> Question:  How have you got logging configured? Anything in 
> /etc/sysconfig/pacemaker ?
>
> I ask because pacemaker.log appears to have a jumble of syslog and regular 
> file output:
>
> May 29 10:45:26 vm02 cib[25603]: info: cib_perform_op: +  /cib:  
> @num_updates=1295
> May 29 10:45:26 [25603] vm02cib: info: cib_perform_op:  +  
> /cib:  @num_updates=1295
I had not paid much attention to it, but the position of the pid differs.
I attach the /etc/sysconfig/pacemaker from my environment.

>
>
> Step 2 - can you try this patch:
>
> diff --git a/crmd/te_callbacks.c b/crmd/te_callbacks.c
> index 4d330a6..eba5f11 100644
> --- a/crmd/te_callbacks.c
> +++ b/crmd/te_callbacks.c
> @@ -381,12 +381,15 @@ te_update_diff(const char *event, xmlNode * msg)
>
>  } else if(strstr(xpath, "/cib/configuration")) {
>  abort_transition(INFINITY, tg_restart, "Non-status change", 
> change);
> +break; /* Wont be packaged with any resource operations we may 
> be waiting for */
>
>  } else if(strstr(xpath, "/"XML_CIB_TAG_TICKETS) || safe_str_eq(name, 
> XML_CIB_TAG_TICKETS)) {
>  abort_transition(INFINITY, tg_restart, "Ticket attribute 
> change", change);
> +break; /* Wont be packaged with any resource operations we may 
> be waiting for */
>
>  } else if(strstr(xpath, "/"XML_TAG_TRANSIENT_NODEATTRS"[") || 
> safe_str_eq(name, XML_TAG_TRANSIENT_NODEATTRS)) {
>  abort_transition(INFINITY, tg_restart, "Transient attribute 
> change", change);
> +break; /* Wont be packaged with any resource operations we may 
> be waiting for */
>
>  } else if(strstr(xpath, "/"XML_LRM_TAG_RSC_OP"[") && safe_str_eq(op, 
> "delete")) {
>  crm_action_t *cancel = NULL;

Thank you for the patch.
I will reply after verifying its behavior.

Regards,
Yusuke
>
>
>>
>> May 29 10:43:37 vm02 crmd[25608]:error: config_query_callback:
>> Local CIB query resulted in an error: Timer expired
>> May 29 10:43:37 vm02 crmd[25608]: info: register_fsa_error_adv:
>> Resetting the current action list
>> May 29 10:43:37 vm02 crmd[25608]:error: do_log: FSA: Input I_ERROR
>> from config_query_callback() received in state S_POLICY_ENGINE
>> May 29 10:43:37 vm02 crmd[25608]:  warning: do_state_transition: State
>> transition S_POLICY_ENGINE -> S_RECOVERY [ input=I_ERROR
>> cause=C_FSA_INTERNAL origin=config_query_callback ]
>> May 29 10:43:37 vm02 crmd[25608]:  warning: do_recover: Fast-tracking
>> shutdown in response to errors
>> May 29 10:43:37 vm02 crmd[25608]:  warning: do_election_vote: Not
>> voting in election, we're in state S_RECOVERY
>>
>> https://drive.google.com/file/d/0BwMFJItoO-fVSEd2MkRiOGxkelk/edit?usp=sharing
>>
>> Regards,
>> Yusuke
>>
>> 2014-05-29 10:26 GMT+09:00 Andrew Beekhof :
>>>
>>> On 28 May 2014, at 6:42 pm, Yusuke Iida  wrote:
>>>
>>>> Hi, Andrew
>>>>
>>>> I made the cluster load a setup to which 256 resources are started using 
>>>> crmsh.
>>>> At this time, crmd changed into the S_RECOVERY state and rebooted.
>>>>
>>>> May 28 17:08:00 [14194] vm02   crmd:error:
>>>> config_query_callback: Local CIB query resulted in an error: Timer
>>>> expired
>>>> May 28 17:08:00 [14194] vm02   crmd: info:
>>>> register_fsa_error_adv: Resetting the current action list
>>>> May 28 17:08:00 [14194] vm02   crmd:error: do_log: FSA: Input
>>>> I_ERROR from config_query_callback() received in state S_POLICY_ENGINE
>>>> May 28 17:08:00 [14194] vm02   crmd:  warning:
>>>> do_state_transition: State transition S_POLICY_ENGINE -> S_RECOVERY [
>>>> input=I_ERROR cause=C_FSA_INTERNAL origin=config_query_callback ]
>>>> May 28 17:08:00 [14194] vm02   crmd:  warning: do_recover:
>>>> Fast-tracking shutdow

Re: [Pacemaker] If 256 resources are load(ed), crmd will reboot.

2014-05-28 Thread Yusuke Iida
Hi, Andrew

I'm sorry.
It seems that syslog recorded the node name in a different form.
To avoid any misunderstanding, I captured a new report.
I think the symptoms appear in vm02/ha-log.

May 29 10:43:37 vm02 crmd[25608]:error: config_query_callback:
Local CIB query resulted in an error: Timer expired
May 29 10:43:37 vm02 crmd[25608]: info: register_fsa_error_adv:
Resetting the current action list
May 29 10:43:37 vm02 crmd[25608]:error: do_log: FSA: Input I_ERROR
from config_query_callback() received in state S_POLICY_ENGINE
May 29 10:43:37 vm02 crmd[25608]:  warning: do_state_transition: State
transition S_POLICY_ENGINE -> S_RECOVERY [ input=I_ERROR
cause=C_FSA_INTERNAL origin=config_query_callback ]
May 29 10:43:37 vm02 crmd[25608]:  warning: do_recover: Fast-tracking
shutdown in response to errors
May 29 10:43:37 vm02 crmd[25608]:  warning: do_election_vote: Not
voting in election, we're in state S_RECOVERY

https://drive.google.com/file/d/0BwMFJItoO-fVSEd2MkRiOGxkelk/edit?usp=sharing

Regards,
Yusuke

2014-05-29 10:26 GMT+09:00 Andrew Beekhof :
>
> On 28 May 2014, at 6:42 pm, Yusuke Iida  wrote:
>
>> Hi, Andrew
>>
>> I made the cluster load a setup to which 256 resources are started using 
>> crmsh.
>> At this time, crmd changed into the S_RECOVERY state and rebooted.
>>
>> May 28 17:08:00 [14194] vm02   crmd:error:
>> config_query_callback: Local CIB query resulted in an error: Timer
>> expired
>> May 28 17:08:00 [14194] vm02   crmd: info:
>> register_fsa_error_adv: Resetting the current action list
>> May 28 17:08:00 [14194] vm02   crmd:error: do_log: FSA: Input
>> I_ERROR from config_query_callback() received in state S_POLICY_ENGINE
>> May 28 17:08:00 [14194] vm02   crmd:  warning:
>> do_state_transition: State transition S_POLICY_ENGINE -> S_RECOVERY [
>> input=I_ERROR cause=C_FSA_INTERNAL origin=config_query_callback ]
>> May 28 17:08:00 [14194] vm02   crmd:  warning: do_recover:
>> Fast-tracking shutdown in response to errors
>> May 28 17:08:00 [14194] vm02   crmd:  warning: do_election_vote:
>> Not voting in election, we're in state S_RECOVERY
>>
>> I think that query performed in large quantities cannot be processed.
>> Before implementing cib_performance, abort_transition() was called only once.
>>
>> Is this corrected?
>>
>> report when a problem occurs is attached.
>> https://drive.google.com/file/d/0BwMFJItoO-fVX0gxM1ptcE52WWs/edit?usp=sharing
>
> That doesn't appear to match the symptoms above.
>
>>
>> Regards,
>> Yusuke
>> --
>> 
>> METRO SYSTEMS CO., LTD
>>
>> Yusuke Iida
>> Mail: yusk.i...@gmail.com
>> 
>>
>> ___
>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
>
>
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>



-- 

METRO SYSTEMS CO., LTD

Yusuke Iida
Mail: yusk.i...@gmail.com


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[Pacemaker] If 256 resources are load(ed), crmd will reboot.

2014-05-28 Thread Yusuke Iida
Hi, Andrew

I had the cluster load a configuration that starts 256 resources, using crmsh.
At that point, crmd changed to the S_RECOVERY state and restarted.

May 28 17:08:00 [14194] vm02   crmd:error:
config_query_callback: Local CIB query resulted in an error: Timer
expired
May 28 17:08:00 [14194] vm02   crmd: info:
register_fsa_error_adv: Resetting the current action list
May 28 17:08:00 [14194] vm02   crmd:error: do_log: FSA: Input
I_ERROR from config_query_callback() received in state S_POLICY_ENGINE
May 28 17:08:00 [14194] vm02   crmd:  warning:
do_state_transition: State transition S_POLICY_ENGINE -> S_RECOVERY [
input=I_ERROR cause=C_FSA_INTERNAL origin=config_query_callback ]
May 28 17:08:00 [14194] vm02   crmd:  warning: do_recover:
Fast-tracking shutdown in response to errors
May 28 17:08:00 [14194] vm02   crmd:  warning: do_election_vote:
Not voting in election, we're in state S_RECOVERY

I think the queries issued in large quantities cannot be processed in time.
Before the cib_performance work, abort_transition() was called only once.

Has this been corrected?

A report from when the problem occurs is attached.
https://drive.google.com/file/d/0BwMFJItoO-fVX0gxM1ptcE52WWs/edit?usp=sharing

Regards,
Yusuke
-- 

METRO SYSTEMS CO., LTD

Yusuke Iida
Mail: yusk.i...@gmail.com


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] crmd does abort if a stopped node is specified

2014-05-08 Thread Yusuke Iida
Hi, Andrew

I read the code.
In the current implementation, the "startup-fencing" setting is read
only once, at startup.
https://github.com/ClusterLabs/pacemaker/blob/master/lib/pengine/unpack.c#L455

In Pacemaker-1.0, the setting was read every time unpack_nodes() was called.
https://github.com/ClusterLabs/pacemaker-1.0/blob/master/lib/pengine/unpack.c#L194

While the cluster is running, a change to the "startup-fencing" setting
never takes effect.
This looks like a regression.

I made a correction for this problem below.
https://github.com/ClusterLabs/pacemaker/pull/512

Is this fix acceptable?
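
For illustration, the difference amounts to caching the option once
versus re-reading it each time nodes are unpacked. The sketch below uses
hypothetical names and is not the actual unpack.c code; it only shows why
a value cached at startup never notices a later change to
"startup-fencing":

/* Hypothetical sketch: cached-once versus re-read-per-unpack option. */
#include <glib.h>
#include <stdio.h>

/* Regressed behavior: the option is read only once, on first use. */
static gboolean startup_fencing_cached(GHashTable *options)
{
    static int initialized = 0;
    static gboolean value = TRUE;

    if (!initialized) {
        const char *s = g_hash_table_lookup(options, "startup-fencing");

        value = (s == NULL) || (g_ascii_strcasecmp(s, "false") != 0);
        initialized = 1;
    }
    return value;
}

/* Pacemaker-1.0-like behavior: read the current value on every unpack. */
static gboolean startup_fencing_current(GHashTable *options)
{
    const char *s = g_hash_table_lookup(options, "startup-fencing");

    return (s == NULL) || (g_ascii_strcasecmp(s, "false") != 0);
}

int main(void)
{
    GHashTable *options = g_hash_table_new(g_str_hash, g_str_equal);

    startup_fencing_cached(options);           /* defaults to true, cached */
    g_hash_table_insert(options, (gpointer) "startup-fencing",
                        (gpointer) "false");   /* option changed later     */

    printf("cached:  %d\n", startup_fencing_cached(options));  /* still 1 (stale)  */
    printf("current: %d\n", startup_fencing_current(options)); /* 0, as configured */
    return 0;
}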

Regards,
Yusuke

2014-05-08 15:59 GMT+09:00 Yusuke Iida :
> Hi, Andrew
>
> I used the method shown above and had the configuration read.
>
> crmd was able to add the node as OFFLINE without dumping core.
>
> However, the OFFLINE node that was added has been fenced even though
> "startup-fencing=false" was set.
> I do not expect fencing here.
> Why is "startup-fencing=false" not taking effect?
>
> I attach crm_report when a problem occurs.
>
> The version of used Pacemaker is as follows.
> https://github.com/ClusterLabs/pacemaker/commit/9fa1ed36e373768e84bee47b5d21b0bf80f608b7
>
> Regards,
> Yusuke
>
> 2014-05-08 8:58 GMT+09:00 Andrew Beekhof :
>>
>> On 7 May 2014, at 7:53 pm, Yusuke Iida  wrote:
>>
>>> Hi, Andrew
>>>
>>> I would also like to describe a node that has not yet joined the
>>> cluster in a crmsh file.
>>>
>>> I understood that uuid was required for a setup of a node as follows
>>> from this mail thread.
>>>
>>> # cat node.crm
>>> ### Cluster Option ###
>>> property no-quorum-policy="ignore" \
>>>stonith-enabled="true" \
>>>startup-fencing="false" \
>>>crmd-transition-delay="2s"
>>>
>>> node $id=131 vm01
>>> node $id=132 vm02
>>> (snip)
>>>
>>> Is the method of setting up ID of the node which has not participated
>>> in a cluster using a corosync stack like this?
>>
>> I don't know how crmsh works, sorry
>>
>>> Is it sufficient to describe the nodelist and nodeid in corosync.conf?
>>
>> That is my understanding, yes.
>>
>>>
>>> # cat corosync.conf
>>> (snip)
>>> nodelist {
>>>  node {
>>>ring0_addr: 192.168.101.131
>>>ring1_addr: 192.168.102.131
>>>nodeid: 131
>>>  }
>>>  node {
>>>ring0_addr: 192.168.101.132
>>>ring1_addr: 192.168.101.132
>>>nodeid: 132
>>>  }
>>> }
>>>
>>> Regards,
>>> Yusuke
>>>
>>> 2014-04-24 12:33 GMT+09:00 Kazunori INOUE :
>>>> 2014-04-23 19:32 GMT+09:00 Andrew Beekhof :
>>>>>
>>>>> On 23 Apr 2014, at 7:17 pm, Kazunori INOUE  
>>>>> wrote:
>>>>>
>>>>>> 2014-04-22 0:45 GMT+09:00 David Vossel :
>>>>>>>
>>>>>>> - Original Message -
>>>>>>>> From: "Kazunori INOUE" 
>>>>>>>> To: "pm" 
>>>>>>>> Sent: Friday, April 18, 2014 4:49:42 AM
>>>>>>>> Subject: [Pacemaker] crmd does abort if a stopped node is specified
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> crmd does abort if I load CIB which specified a stopped node.
>>>>>>>>
>>>>>>>> # crm_mon -1
>>>>>>>> Last updated: Fri Apr 18 11:51:36 2014
>>>>>>>> Last change: Fri Apr 18 11:51:30 2014
>>>>>>>> Stack: corosync
>>>>>>>> Current DC: pm103 (3232261519) - partition WITHOUT quorum
>>>>>>>> Version: 1.1.11-cf82673
>>>>>>>> 1 Nodes configured
>>>>>>>> 0 Resources configured
>>>>>>>>
>>>>>>>> Online: [ pm103 ]
>>>>>>>>
>>>>>>>> # cat test.cli
>>>>>>>> node pm103
>>>>>>>> node pm104
>>>>>>>>
>>>>>>>> # crm configure load update test.cli
>>>>>>>>
>>>>>>>> Apr 18 11:52:42 pm103 crmd[11672]:error: crm_int_helper:
>>>>>>>> Characters left over after parsing 'pm104': 'pm104'
>>>>>>

Re: [Pacemaker] crmd does abort if a stopped node is specified

2014-05-07 Thread Yusuke Iida
f302461d1bc in cib_native_notify (data=0x10ef750,
>>>>> user_data=0x1137660) at cib_utils.c:733
>>>>> #8  0x0033db83d6bc in g_list_foreach () from /lib64/libglib-2.0.so.0
>>>>> #9  0x7f3024620191 in cib_native_dispatch_internal
>>>>> (buffer=0xe61ea8 ">>>> cib_op=\"cib_apply_diff\" cib_rc=\"0\"
>>>>> cib_object_type=\"diff\">>>>> num_updates=\"0\" admin_epoch=\"0\" validate-with=\"pacem"...,
>>>>> length=1708, userdata=0xe5eb90) at cib_native.c:123
>>>>> #10 0x7f30241dee72 in mainloop_gio_callback (gio=0xf61ea0,
>>>>> condition=G_IO_IN, data=0xe601b0) at mainloop.c:639
>>>>> #11 0x0033db83feb2 in g_main_context_dispatch () from
>>>>> /lib64/libglib-2.0.so.0
>>>>> #12 0x0033db843d68 in ?? () from /lib64/libglib-2.0.so.0
>>>>> #13 0x0033db844275 in g_main_loop_run () from /lib64/libglib-2.0.so.0
>>>>> #14 0x00406469 in crmd_init () at main.c:154
>>>>> #15 0x004062b0 in main (argc=1, argv=0x7fff908829f8) at main.c:121
>>>>>
>>>>> Is this all right?
>>>>>
>>>>> Best Regards,
>>>>> Kazunori INOUE
>>>>>
>>>>> ___
>>>>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>>>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>>>
>>>>> Project Home: http://www.clusterlabs.org
>>>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>>> Bugs: http://bugs.clusterlabs.org
>>>>>
>>>>
>>>> ___
>>>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>>
>>>> Project Home: http://www.clusterlabs.org
>>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>> Bugs: http://bugs.clusterlabs.org
>>>
>>> ___
>>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>
>>> Project Home: http://www.clusterlabs.org
>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>> Bugs: http://bugs.clusterlabs.org
>>
>>
>> ___
>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
>
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org



-- 

METRO SYSTEMS CO., LTD

Yusuke Iida
Mail: yusk.i...@gmail.com


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] What is the reason which the node in which failure has not occurred carries out "lost"?

2014-03-31 Thread Yusuke Iida
Hi, Andrew

crm_mon still has logic that refreshes the cib to the newest version
when pcmk_err_old_data is received.

Since this processing can be considered unnecessary, like the equivalent
processing that was changed in stonithd, I have corrected it.

Please merge the following if it looks good.
https://github.com/ClusterLabs/pacemaker/pull/477

Regards,
Yusuke

2014-03-18 9:56 GMT+09:00 Andrew Beekhof :
>
> On 12 Mar 2014, at 1:45 pm, Yusuke Iida  wrote:
>
>> Hi, Andrew
>> 2014-03-12 6:37 GMT+09:00 Andrew Beekhof :
>>>> Mar 07 13:24:14 [2528] vm01   crmd: (te_callbacks:493   )   error:
>>>> te_update_diff: Ingoring create operation for /cib 0xf91c10,
>>>> configuration
>>>
>>> Thats interesting... is that with the fixes mentioned above?
>> I'm sorry.
>> The above-mentioned log is not outputted by the newest Pacemaker.
>> The following logs come out in the newest thing.
>>
>> Mar 12 10:43:38 [6124] vm02   crmd: (te_callbacks:377   )   trace:
>> te_update_diff:  Handling create operation for /cib/configuration
>> 0x1c37c60, fencing-topology
>> Mar 12 10:43:38 [6124] vm02   crmd: (te_callbacks:493   )   error:
>> te_update_diff:  Ingoring create operation for /cib/configuration
>> 0x1c37c60, fencing-topology
>> Mar 12 10:43:38 [6124] vm02   crmd: (te_callbacks:377   )   trace:
>> te_update_diff:  Handling create operation for /cib/configuration
>> 0x1c397a0, rsc_defaults
>> Mar 12 10:43:38 [6124] vm02   crmd: (te_callbacks:493   )   error:
>> te_update_diff:  Ingoring create operation for /cib/configuration
>> 0x1c397a0, rsc_defaults
>>
>> I checked code of te_update_diff.
>> Should not the next judgment be changed if change of fencing-topology
>> or rsc_defaults is processed as a configuration subordinate's change?
>
> Perfect!
>
>   https://github.com/beekhof/pacemaker/commit/1c285ac
>
> Thanks to everyone for giving the new CIB a pounding, we should be in very 
> good shape for a release soon :-)
>
>>
>> diff --git a/crmd/te_callbacks.c b/crmd/te_callbacks.c
>> index dd57660..f97bab5 100644
>> --- a/crmd/te_callbacks.c
>> +++ b/crmd/te_callbacks.c
>> @@ -378,7 +378,7 @@ te_update_diff(const char *event, xmlNode * msg)
>> if(xpath == NULL) {
>> /* Version field, ignore */
>>
>> -} else if(strstr(xpath, "/cib/configuration/")) {
>> +} else if(strstr(xpath, "/cib/configuration")) {
>> abort_transition(INFINITY, tg_restart, "Non-status
>> change", change);
>>
>> } else if(strstr(xpath, "/"XML_CIB_TAG_TICKETS"[") ||
>> safe_str_eq(name, XML_CIB_TAG_TICKETS)) {
>>
>> How is such change?
>>
>> I attach report at this time.
>> The trace log of te_update_diff is also contained.
>> https://drive.google.com/file/d/0BwMFJItoO-fVeVVEemVsZVBoUWc/edit?usp=sharing
>>
>> Regards,
>> Yusuke
>>>
>>>>
>>>>>
>>>>>> but it looks like crmsh is doing something funny with its updates... 
>>>>>> does anyone know what command it is running?
>>>>
>>>> The execution result of the following commands remained in 
>>>> /var/log/messages.
>>>>
>>>> Mar  7 13:24:14 vm01 cibadmin[2555]:   notice: crm_log_args: Invoked:
>>>> cibadmin -p -R --force
>>>
>>> I'm somewhat confused at this point if crmsh is using --replace, then 
>>> why is it doing diff calculations?
>>> Or are replace operations only for the load operation?
>>
>>
>>
>>
>> --
>> 
>> METRO SYSTEMS CO., LTD
>>
>> Yusuke Iida
>> Mail: yusk.i...@gmail.com
>> 
>>
>> _______
>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
>
>
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>



-- 

METRO SYSTEMS CO., LTD

Yusuke Iida
Mail: yusk.i...@gmail.com


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] What is the reason which the node in which failure has not occurred carries out "lost"?

2014-03-11 Thread Yusuke Iida
Hi, Andrew
2014-03-12 6:37 GMT+09:00 Andrew Beekhof :
>> Mar 07 13:24:14 [2528] vm01   crmd: (te_callbacks:493   )   error:
>> te_update_diff: Ingoring create operation for /cib 0xf91c10,
>> configuration
>
> Thats interesting... is that with the fixes mentioned above?
I'm sorry.
The log above is not output by the newest Pacemaker.
The following logs appear with the latest build.

Mar 12 10:43:38 [6124] vm02   crmd: (te_callbacks:377   )   trace:
te_update_diff:  Handling create operation for /cib/configuration
0x1c37c60, fencing-topology
Mar 12 10:43:38 [6124] vm02   crmd: (te_callbacks:493   )   error:
te_update_diff:  Ingoring create operation for /cib/configuration
0x1c37c60, fencing-topology
Mar 12 10:43:38 [6124] vm02   crmd: (te_callbacks:377   )   trace:
te_update_diff:  Handling create operation for /cib/configuration
0x1c397a0, rsc_defaults
Mar 12 10:43:38 [6124] vm02   crmd: (te_callbacks:493   )   error:
te_update_diff:  Ingoring create operation for /cib/configuration
0x1c397a0, rsc_defaults

I checked the code of te_update_diff.
Shouldn't the following condition be changed so that changes to
fencing-topology or rsc_defaults are handled as changes under the
configuration section?

diff --git a/crmd/te_callbacks.c b/crmd/te_callbacks.c
index dd57660..f97bab5 100644
--- a/crmd/te_callbacks.c
+++ b/crmd/te_callbacks.c
@@ -378,7 +378,7 @@ te_update_diff(const char *event, xmlNode * msg)
 if(xpath == NULL) {
 /* Version field, ignore */

-} else if(strstr(xpath, "/cib/configuration/")) {
+} else if(strstr(xpath, "/cib/configuration")) {
 abort_transition(INFINITY, tg_restart, "Non-status
change", change);

 } else if(strstr(xpath, "/"XML_CIB_TAG_TICKETS"[") ||
safe_str_eq(name, XML_CIB_TAG_TICKETS)) {

How does this change look?

I attach a report from this occurrence.
The trace log of te_update_diff is also included.
https://drive.google.com/file/d/0BwMFJItoO-fVeVVEemVsZVBoUWc/edit?usp=sharing

Regards,
Yusuke
>
>>
>>>
>>>> but it looks like crmsh is doing something funny with its updates... does 
>>>> anyone know what command it is running?
>>
>> The execution result of the following commands remained in /var/log/messages.
>>
>> Mar  7 13:24:14 vm01 cibadmin[2555]:   notice: crm_log_args: Invoked:
>> cibadmin -p -R --force
>
> I'm somewhat confused at this point if crmsh is using --replace, then why 
> is it doing diff calculations?
> Or are replace operations only for the load operation?




-- 

METRO SYSTEMS CO., LTD

Yusuke Iida
Mail: yusk.i...@gmail.com


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] What is the reason which the node in which failure has not occurred carries out "lost"?

2014-03-11 Thread Yusuke Iida
Hi, Andrew

2014-03-11 14:21 GMT+09:00 Andrew Beekhof :
>
> On 11 Mar 2014, at 4:14 pm, Andrew Beekhof  wrote:
>
> [snip]
>
>> If I do this however:
>>
>> # cp start.xml 1.xml;  tools/cibadmin --replace -o configuration --xml-file 
>> replace.some -V
>>
>> I start to see what you see:
>>
>> (   xml.c:4985  )info: validate_with_relaxng: Creating RNG 
>> parser context
>> (  cib_file.c:268   )info: cib_file_perform_op_delegate:  cib_replace on 
>> configuration
>> ( cib_utils.c:338   )   trace: cib_perform_op:Begin cib_replace op
>> (   xml.c:1487  )   trace: cib_perform_op:-- /configuration
>> (   xml.c:1490  )   trace: cib_perform_op:+  > num_updates="14" admin_epoch="0" validate-with="pacemaker-1.2" 
>> crm_feature_set="3.0.9" cib-last-written="Fri Mar  7 13:24:07 2014" 
>> update-origin="vm01" update-client="crmd" update-user="hacluster" 
>> have-quorum="1" dc-uuid="3232261507"/>
>> (   xml.c:1490  )   trace: cib_perform_op:++   
>> (   xml.c:1490  )   trace: cib_perform_op:++ 
>>
>> Fixed in https://github.com/beekhof/pacemaker/commit/7d3b93b ,
>
> And now with improved change detection: 
> https://github.com/beekhof/pacemaker/commit/6f364db

I confirmed that the problem where crm_mon does not display updates
has been solved.

BTW, the following logs have started to appear recently.
Operation seems unaffected, but is there any problem when these logs
appear?

Mar 07 13:24:14 [2528] vm01   crmd: (te_callbacks:493   )   error:
te_update_diff: Ingoring create operation for /cib 0xf91c10,
configuration

>
>> but it looks like crmsh is doing something funny with its updates... does 
>> anyone know what command it is running?

The invocation of the following command was recorded in /var/log/messages.

Mar  7 13:24:14 vm01 cibadmin[2555]:   notice: crm_log_args: Invoked:
cibadmin -p -R --force

I am using crmsh-1.2.6-rc3.

Thanks,
Yusuke
>
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>



-- 

METRO SYSTEMS CO., LTD

Yusuke Iida
Mail: yusk.i...@gmail.com


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] What is the reason which the node in which failure has not occurred carries out "lost"?

2014-03-10 Thread Yusuke Iida
Yusuke

2014-03-11 10:26 GMT+09:00 Andrew Beekhof :
>
> On 7 Mar 2014, at 5:35 pm, Yusuke Iida  wrote:
>
>> Hi, Andrew
>> 2014-03-07 11:43 GMT+09:00 Andrew Beekhof :
>>> I don't understand... crm_mon doesn't look for changes to resources or 
>>> constraints and it should already be using the new faster diff format.
>>>
>>> [/me reads attachment]
>>>
>>> Ah, but perhaps I do understand afterall :-)
>>>
>>> This is repeated over and over:
>>>
>>>  notice: crm_diff_update:  [cib_diff_notify] Patch aborted: Application 
>>> of an update diff failed (-206)
>>>  notice: xml_patch_version_check:  Current num_updates is too high (885 
>>> > 67)
>>>
>>> That would certainly drive up CPU usage and cause crm_mon to get left 
>>> behind.
>>> Happily the fix for that should be: 
>>> https://github.com/beekhof/pacemaker/commit/6c33820
>>
>> I think that refreshment of cib is no longer repeated when a version
>> has a difference.
>> Thank you cope.
>>
>> Now, I see another problem.
>>
>> If "crm configure load update" is performed, with crm_mon started,
>> information will no longer be displayed.
>> Information will be displayed if crm_mon is restarted.
>>
>> I executed the following commands and took the log of crm_mon.
>> # crm_mon --disable-ncurses -VV >crm_mon.log 2>&1
>>
>> I am observing the cib information inside crm_mon after load was performed.
>>
>> Two configuration sections exist in cib after load.
>>
>> It seems that this is the next processing, and it remains since it
>> failed in deletion of the configuration section.
>>   trace: cib_native_dispatch_internal: cib-reply
>> 
>>
>> A little following is the debugging log acquired by old pacemaker.
>> It is not found in order that <(null) > may try to look for
>> path=/configuration from the document tree of top.
>> Should not path be path=/cib/configuration essentially?
>
> Yes.  Could you send me the cib as well as the update you're trying to load?
>
>>
>> notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:   <(null)>
>> notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: > epoch="2" num_updates="6" admin_epoch="0"
>> validate-with="pacemaker-1.2" crm_feature_set="3.0.9"
>> cib-last-written="Tue Mar  4 11:32:36 2014"
>> update-origin="rhel64rpmbuild" update-client="crmd" have-quorum="1"
>> dc-uuid="3232261524">
>> notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:   
>> 
>> notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: 
>> notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:
>> 
>> notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:
>> > value="1.1.10-2dbaf19"/>
>> notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:
>> > name="cluster-infrastructure" value="corosync"/>
>> notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:
>> 
>> notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: 
>> 
>> notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: 
>> notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:
>> 
>> notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: 
>> notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: 
>> notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: 
>> 
>> notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:   
>> 
>> notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:   
>> notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:
>> > crmd="online" crm-debug-origin="do_state_transition" join="member"
>> expected="member">
>> notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:
>> 
>> notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:
>> 
>> notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:   
>> notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:
>> 
>> notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:
>> 
>> notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:
>> 
>> notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:
>> > value="true"/>
>> notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:
>> 
>> notice  Mar 04 11:33:10 __xml_f

Re: [Pacemaker] What is the reason which the node in which failure has not occurred carries out "lost"?

2014-03-06 Thread Yusuke Iida
Hi, Andrew
2014-03-07 11:43 GMT+09:00 Andrew Beekhof :
> I don't understand... crm_mon doesn't look for changes to resources or 
> constraints and it should already be using the new faster diff format.
>
> [/me reads attachment]
>
> Ah, but perhaps I do understand afterall :-)
>
> This is repeated over and over:
>
>   notice: crm_diff_update:  [cib_diff_notify] Patch aborted: Application 
> of an update diff failed (-206)
>   notice: xml_patch_version_check:  Current num_updates is too high (885 
> > 67)
>
> That would certainly drive up CPU usage and cause crm_mon to get left behind.
> Happily the fix for that should be: 
> https://github.com/beekhof/pacemaker/commit/6c33820

I think the cib refresh is no longer repeated when the versions differ.
Thank you for handling it.

Now, I see another problem.

If "crm configure load update" is performed, with crm_mon started,
information will no longer be displayed.
Information will be displayed if crm_mon is restarted.

I executed the following commands and took the log of crm_mon.
# crm_mon --disable-ncurses -VV >crm_mon.log 2>&1

I examined the cib held inside crm_mon after the load was performed.

Two configuration sections exist in cib after load.

It seems this comes from the following processing; the old section
remains because deletion of the configuration section failed.
   trace: cib_native_dispatch_internal: cib-reply


A little further down is a debug log acquired with an older Pacemaker.
The element is not found because <(null)> looks up path=/configuration
from the top of the document tree.
Shouldn't the path essentially be path=/cib/configuration?
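
As a small demonstration of the path question, using libxml2 directly
(not the Pacemaker diff code): an absolute lookup for /configuration
matches nothing because the document root element is <cib>, while
/cib/configuration does match:

/* XPath addressing demo: /configuration vs. /cib/configuration. */
#include <stdio.h>
#include <string.h>
#include <libxml/parser.h>
#include <libxml/xpath.h>

static int count_matches(xmlDocPtr doc, const char *path)
{
    xmlXPathContextPtr ctx = xmlXPathNewContext(doc);
    xmlXPathObjectPtr obj = xmlXPathEvalExpression((const xmlChar *) path, ctx);
    int n = (obj && obj->nodesetval) ? obj->nodesetval->nodeNr : 0;

    xmlXPathFreeObject(obj);
    xmlXPathFreeContext(ctx);
    return n;
}

int main(void)
{
    const char *xml = "<cib><configuration/><status/></cib>";
    xmlDocPtr doc = xmlReadMemory(xml, (int) strlen(xml), "cib.xml", NULL, 0);

    printf("/configuration     -> %d match(es)\n",
           count_matches(doc, "/configuration"));
    printf("/cib/configuration -> %d match(es)\n",
           count_matches(doc, "/cib/configuration"));

    xmlFreeDoc(doc);
    return 0;
}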

notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:   <(null)>
notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: 
notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:   
notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: 
notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:

notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:

notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:

notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:

notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: 
notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: 
notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:

notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: 
notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: 
notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: 
notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:   
notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:   
notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:

notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:

notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:

notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:   
notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:

notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:

notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:

notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:

notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:

notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:

notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: 
notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:   
notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG: 
notice  Mar 04 11:33:10 __xml_find_path(1294):0: IDEBUG:   


Is this an already known problem?

I attach the report from when this occurred, together with the crm_mon log.

- crm_report
https://drive.google.com/file/d/0BwMFJItoO-fVWEw4Qnp0aHIzSm8/edit?usp=sharing
- crm_mon.log
https://drive.google.com/file/d/0BwMFJItoO-fVRDRMTGtUUEdBc1E/edit?usp=sharing

Regards,
Yusuke


-- 

METRO SYSTEMS CO., LTD

Yusuke Iida
Mail: yusk.i...@gmail.com


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] What is the reason which the node in which failure has not occurred carries out "lost"?

2014-02-20 Thread yusuke iida
Hi, Andrew

2014-02-20 17:28 GMT+09:00 Andrew Beekhof :
> Who was pid 16243?
> Doesn't look like a pacemaker daemon.
pid 16243 is crm_mon.
crm_mon was started on vm01 to check the cluster state.

If any other information is needed for analysis, I will collect it.

Regards,
Yusuke
>
>>
>> Overflow of queue of vm09 has taken place between cib and stonithd.
>> Feb 20 14:20:22 [15519] vm09cib: (   ipc.c:506   )
>> trace: crm_ipcs_flush_events:  Sent 36 events (530 remaining) for
>> 0x105ec10[15520]: Resource temporarily unavailable (-11)
>> Feb 20 14:20:22 [15519] vm09cib: (   ipc.c:515   )
>> error: crm_ipcs_flush_events:  Evicting slow client 0x105ec10[15520]:
>> event queue reached 530 entries
>>
>> Although I checked the code of the problem part, it was not understood
>> by which it would be solved.
>>
>> Is it less likelihood of sending a message of 100 at a time?
>> Does calculation of the waiting time after message transmission have a 
>> problem?
>> Threshold of 500 may be too low?
>
> being 500 behind is really quite a long way.




-- 

METRO SYSTEMS CO., LTD

Yusuke Iida
Mail: yusk.i...@gmail.com


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] What is the reason which the node in which failure has not occurred carries out "lost"?

2014-02-19 Thread yusuke iida
Hi, Andrew

I tested in the following environments.

KVM virtual 16 machines
CPU: 1
memory: 2048MB
OS: RHEL6.4
Pacemaker-1.1.11(709b36b)
corosync-2.3.2
libqb-0.16.0

It looks like performance is much better on the whole.

However, during the 16-node test, a queue overflowed on some of the nodes.
It happened on vm01 and vm09.

On vm01, the queue between cib and crm_mon overflowed.
Feb 20 14:21:02 [16211] vm01cib: (   ipc.c:506   )   trace:
crm_ipcs_flush_events:  Sent 40 events (729 remaining) for
0x1cd1850[16243]: Resource temporarily unavailable (-11)
Feb 20 14:21:02 [16211] vm01cib: (   ipc.c:515   )
error: crm_ipcs_flush_events:  Evicting slow client 0x1cd1850[16243]:
event queue reached 729 entries

On vm09, the queue between cib and stonithd overflowed.
Feb 20 14:20:22 [15519] vm09cib: (   ipc.c:506   )
trace: crm_ipcs_flush_events:  Sent 36 events (530 remaining) for
0x105ec10[15520]: Resource temporarily unavailable (-11)
Feb 20 14:20:22 [15519] vm09cib: (   ipc.c:515   )
error: crm_ipcs_flush_events:  Evicting slow client 0x105ec10[15520]:
event queue reached 530 entries

Although I examined the code around the problem, I could not work out
how it should be solved.

Is the limit of sending 100 messages at a time too small?
Is there a problem with how the wait time after sending is calculated?
Or is the threshold of 500 simply too low?
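
To make the question concrete, the behavior described by these logs can
be modelled roughly as follows. The batch size of 100 and the eviction
threshold of 500 come from this thread; the structure, names and arrival
rate are illustrative and not the actual libqb/Pacemaker code:

/* Simplified model of batched event flushing with a slow-client limit. */
#include <stdio.h>

#define FLUSH_BATCH        100  /* events sent per flush attempt          */
#define EVICTION_THRESHOLD 500  /* backlog at which the client is evicted */

/* Returns the new backlog, or -1 if the client was evicted. */
static int flush_events(int backlog, int sent_ok)
{
    int to_send = (backlog < FLUSH_BATCH) ? backlog : FLUSH_BATCH;

    if (sent_ok) {
        backlog -= to_send;     /* the client kept up this interval */
    }
    if (backlog > EVICTION_THRESHOLD) {
        printf("Evicting slow client: event queue reached %d entries\n", backlog);
        return -1;
    }
    return backlog;
}

int main(void)
{
    int backlog = 0;
    int i;

    /* Assume 60 new events arrive per flush interval while the socket is
     * congested, so nothing is actually delivered (sent_ok == 0). */
    for (i = 0; i < 20 && backlog >= 0; i++) {
        backlog += 60;
        backlog = flush_events(backlog, 0);
    }
    return 0;
}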

I attach a crm_report from when the problem occurs.
https://drive.google.com/file/d/0BwMFJItoO-fVeGZuWkFnZTFWTDQ/edit?usp=sharing

Regards,
Yusuke
2014-02-18 19:53 GMT+09:00 yusuke iida :
> Hi, Andrew and Digimer
>
> Thank you for the comment.
>
> I solved with reference to other mailing list about this problem.
> https://bugzilla.redhat.com/show_bug.cgi?id=880035
>
> It seems that the kernel of my environment was old when said from the
> conclusion.
> It updated to the newest kernel now.
> kernel-2.6.32-431.5.1.el6.x86_64.rpm
>
> The following parameters are set to bridge which is letting
> communication of corosync pass now.
> As a result, "Retransmit List" no longer occur almost.
> # echo 1 > /sys/class/net//bridge/multicast_querier
> # echo 0 > /sys/class/net//bridge/multicast_snooping
>
> 2014-02-18 9:49 GMT+09:00 Andrew Beekhof :
>>
>> On 31 Jan 2014, at 6:20 pm, yusuke iida  wrote:
>>
>>> Hi, all
>>>
>>> I measure the performance of Pacemaker in the following combinations.
>>> Pacemaker-1.1.11.rc1
>>> libqb-0.16.0
>>> corosync-2.3.2
>>>
>>> All nodes are KVM virtual machines.
>>>
>>>  stopped the node of vm01 compulsorily from the inside, after starting 14 
>>> nodes.
>>> "virsh destroy vm01" was used for the stop.
>>> Then, in addition to the compulsorily stopped node, other nodes are 
>>> separated from a cluster.
>>>
>>> The log of "Retransmit List:" is then outputted in large quantities from 
>>> corosync.
>>
>> Probably best to poke the corosync guys about this.
>>
>> However, <= .11 is known to cause significant CPU usage with that many nodes.
>> I can easily imagine this starving corosync of resources and causing breakage.
>>
>> I would _highly_ recommend retesting with the current git master of 
>> pacemaker.
>> I merged the new cib code last week which is faster by _two_ orders of 
>> magnitude and uses significantly less CPU.
>>
>> I'd be interested to hear your feedback.
> Since I am very interested in this, I would like to test, although the
> problem of "Retransmit List" was solved.
> Please wait for a result a little.
>
> Thanks,
> Yusuke
>
>>
>>>
>>> What is the reason which the node in which failure has not occurred carries 
>>> out "lost"?
>>>
>>> Please advise, if there is a problem in a setup in something.
>>>
>>> I attached the report when the problem occurred.
>>> https://drive.google.com/file/d/0BwMFJItoO-fVMkFWWWlQQldsSFU/edit?usp=sharing
>>>
>>> Regards,
>>> Yusuke
>>> --
>>> 
>>> METRO SYSTEMS CO., LTD
>>>
>>> Yusuke Iida
>>> Mail: yusk.i...@gmail.com
>>> 
>>> ___
>>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>
>>> Project Home: http://www.clusterlabs.org
>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>> Bugs: http://b

Re: [Pacemaker] What is the reason which the node in which failure has not occurred carries out "lost"?

2014-02-18 Thread yusuke iida
Hi, Andrew and Digimer

Thank you for the comments.

I solved this problem by referring to a discussion on another mailing list.
https://bugzilla.redhat.com/show_bug.cgi?id=880035

In short, the kernel in my environment was too old.
I have now updated it to the latest kernel.
kernel-2.6.32-431.5.1.el6.x86_64.rpm

The following parameters are now set on the bridge that carries the
corosync traffic.
As a result, "Retransmit List" messages hardly occur any more.
# echo 1 > /sys/class/net//bridge/multicast_querier
# echo 0 > /sys/class/net//bridge/multicast_snooping
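
To make these settings persistent across reboots, they could for example
be added to /etc/rc.d/rc.local. This is only a sketch: the bridge name
br0 is a placeholder, since the real bridge name is omitted in the
commands above.

# excerpt for /etc/rc.d/rc.local
BRIDGE=br0   # placeholder; use the bridge that carries the corosync traffic
echo 1 > /sys/class/net/$BRIDGE/bridge/multicast_querier
echo 0 > /sys/class/net/$BRIDGE/bridge/multicast_snooping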

2014-02-18 9:49 GMT+09:00 Andrew Beekhof :
>
> On 31 Jan 2014, at 6:20 pm, yusuke iida  wrote:
>
>> Hi, all
>>
>> I measure the performance of Pacemaker in the following combinations.
>> Pacemaker-1.1.11.rc1
>> libqb-0.16.0
>> corosync-2.3.2
>>
>> All nodes are KVM virtual machines.
>>
>>  stopped the node of vm01 compulsorily from the inside, after starting 14 
>> nodes.
>> "virsh destroy vm01" was used for the stop.
>> Then, in addition to the compulsorily stopped node, other nodes are 
>> separated from a cluster.
>>
>> The log of "Retransmit List:" is then outputted in large quantities from 
>> corosync.
>
> Probably best to poke the corosync guys about this.
>
> However, <= .11 is known to cause significant CPU usage with that many nodes.
> I can easily imagine this starving corosync of resources and causing breakage.
>
> I would _highly_ recommend retesting with the current git master of pacemaker.
> I merged the new cib code last week which is faster by _two_ orders of 
> magnitude and uses significantly less CPU.
>
> I'd be interested to hear your feedback.
Although the "Retransmit List" problem is solved, I am very interested
in this, so I would like to test it.
Please wait a little for the results.

Thanks,
Yusuke

>
>>
>> What is the reason which the node in which failure has not occurred carries 
>> out "lost"?
>>
>> Please advise, if there is a problem in a setup in something.
>>
>> I attached the report when the problem occurred.
>> https://drive.google.com/file/d/0BwMFJItoO-fVMkFWWWlQQldsSFU/edit?usp=sharing
>>
>> Regards,
>> Yusuke
>> --
>> 
>> METRO SYSTEMS CO., LTD
>>
>> Yusuke Iida
>> Mail: yusk.i...@gmail.com
>> 
>> ___
>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
>
>
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>



-- 

METRO SYSTEMS CO., LTD

Yusuke Iida
Mail: yusk.i...@gmail.com


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[Pacemaker] What is the reason which the node in which failure has not occurred carries out "lost"?

2014-01-30 Thread yusuke iida
Hi, all

I am measuring the performance of Pacemaker with the following combination.
Pacemaker-1.1.11.rc1
libqb-0.16.0
corosync-2.3.2

All nodes are KVM virtual machines.

After starting 14 nodes, I forcibly stopped the node vm01.
"virsh destroy vm01" was used for the stop.
Then, in addition to the forcibly stopped node, other nodes were also
separated from the cluster.

Corosync then outputs a large number of "Retransmit List:" log messages.
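
For reference, a rough sketch of the reproduction and observation steps
(the log path is an assumption and depends on the logging configuration):

# on the hypervisor, after all nodes have started
virsh destroy vm01
# on one of the surviving nodes, count the retransmit messages
grep -c "Retransmit List" /var/log/messages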

Why are nodes on which no failure has occurred reported as "lost"?

Please advise if there is a problem somewhere in my setup.

I have attached the report from when the problem occurred.
https://drive.google.com/file/d/0BwMFJItoO-fVMkFWWWlQQldsSFU/edit?usp=sharing

Regards,
Yusuke
-- 
----
METRO SYSTEMS CO., LTD

Yusuke Iida
Mail: yusk.i...@gmail.com

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] The larger cluster is tested.

2013-11-20 Thread yusuke iida
Hi, Andrew

I understand.

On the other hand, with a lower batch-limit there is a possibility that
the operation of the cluster becomes too slow.
I will look into avoiding that by adjusting the parameters or by changing
the startup method.
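
For example (only a sketch; suitable values depend on the environment),
both options can be adjusted as cluster properties with crm_attribute:

# limit the number of actions executed in parallel cluster-wide
crm_attribute --type crm_config --name batch-limit --update 2
# let the throttling react earlier than the default 80% CPU usage
crm_attribute --type crm_config --name load-threshold --update 40%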

Thank you for various adjustments.
Yusuke
2013/11/19 Andrew Beekhof :
>
> On 16 Nov 2013, at 12:22 am, yusuke iida  wrote:
>
>> Hi, Andrew
>>
>> Thanks for the suggestion variety.
>>
>> I fixed and tested the value of batch-limit by 1, 2, 3, and 4 from the
>> beginning, in order to confirm what batch-limit is suitable.
>>
>> It was something like the following in my environment.
>> Timeout did not occur batch-limit=1 and 2.
>> batch-limit = 3 was 1 timeout.
>> batch-limit = 4 was 5 timeout.
>>
>> I think the limit is still high in; From the above results, "limit =
>> QB_MAX (1, peers / 4)".
>
> Remember these results are specific to your (virtual) hardware and configured 
> timeouts.
> I would argue that 5 timeouts out of 2853 actions is actually quite 
> impressive for a default value in this sort of situation.[1]
>
> Some tuning in a cluster of this kind is to be expected.
>
> [1] It took crm_simulate 4 minutes to even pretend to perform all those 
> operations.
>
>>
>> So I have created a fix to fixed to 2 batch-limit when it became a
>> state of extreme.
>> https://github.com/yuusuke/pacemaker/commit/efe2d6ebc55be39b8be43de38e7662f039b61dec
>>
>> Results of the test several times, it seems to work without problems.
>>
>> When batch-limit is fixed and tested, below has a report.
>> batch-limit=1
>> https://drive.google.com/file/d/0BwMFJItoO-fVNk8wTGlYNjNnSHc/edit?usp=sharing
>> batch-limit=2
>> https://drive.google.com/file/d/0BwMFJItoO-fVTnc4bXY2YXF2M2M/edit?usp=sharing
>> batch-limit=3
>> https://drive.google.com/file/d/0BwMFJItoO-fVYl9Gbks2VlJMR0k/edit?usp=sharing
>> batch-limit=4
>> https://drive.google.com/file/d/0BwMFJItoO-fVZnJIazd5MFQ1aGs/edit?usp=sharing
>>
>> The report at the time of making it operate by my test code is the following.
>> https://drive.google.com/file/d/0BwMFJItoO-fVbzB0NjFLeVY3Zmc/edit?usp=sharing
>>
>> Regards,
>> Yusuke
>>
>> 2013/11/13 Andrew Beekhof :
>>> Did you look at the load numbers in the logs?
>>> The CPUs are being slammed for over 20 minutes.
>>>
>>> The automatic tuning can only help so much, you're simply asking the 
>>> cluster to do more work than it is capable of.
>>> Giving more priority to cib operations the come via IPC is one option, but 
>>> as I explained earlier, it comes at the cost of correctness.
>>>
>>> Given the huge mismatch between the nodes' capacity and the tasks you're 
>>> asking them to achieve, your best path forward is probably setting a 
>>> load-threshold < 40% or a batch-limit <= 8.
>>> Or we could try a patch like the one below if we think that the defaults 
>>> are not aggressive enough.
>>>
>>> diff --git a/crmd/throttle.c b/crmd/throttle.c
>>> index d77195a..7636d4a 100644
>>> --- a/crmd/throttle.c
>>> +++ b/crmd/throttle.c
>>> @@ -611,14 +611,14 @@ throttle_get_total_job_limit(int l)
>>> switch(r->mode) {
>>>
>>> case throttle_extreme:
>>> -if(limit == 0 || limit > peers/2) {
>>> -limit = peers/2;
>>> +if(limit == 0 || limit > peers/4) {
>>> +limit = QB_MAX(1, peers/4);
>>> }
>>> break;
>>>
>>> case throttle_high:
>>> -if(limit == 0 || limit > peers) {
>>> -limit = peers;
>>> +if(limit == 0 || limit > peers/2) {
>>> +limit = QB_MAX(1, peers/2);
>>> }
>>> break;
>>> default:
>>>
>>> This may also be worthwhile:
>>>
>>> diff --git a/crmd/throttle.c b/crmd/throttle.c
>>> index d77195a..586513a 100644
>>> --- a/crmd/throttle.c
>>> +++ b/crmd/throttle.c
>>> @@ -387,22 +387,36 @@ static bool throttle_io_load(float *load, unsigned 
>>> int *blocked)
>>> }
>>>
>>> static enum throttle_state_e
>>> -throttle_handle_load(float load, const char *desc)
>>> +throttle_handle_load(float load, const char *desc, int cores)
>>> {
>>> -if(load > THROTTLE_FACTOR_HIGH * throttle_load_target) {
>>> +   

Re: [Pacemaker] The larger cluster is tested.

2013-11-15 Thread yusuke iida
vg(&load)) {
> -float simple = load / cores;
> -mode |= throttle_handle_load(simple, "CPU load");
> +mode |= throttle_handle_load(load, "CPU load", cores);
>  }
>
>  if(throttle_io_load(&load, &blocked)) {
> -float blocked_ratio = 0.0;
> -
> -mode |= throttle_handle_load(load, "IO load");
> -
> -if(cores) {
> -blocked_ratio = blocked / cores;
> -} else {
> -blocked_ratio = blocked;
> -}
> -
> -mode |= throttle_handle_load(blocked_ratio, "blocked IO ratio");
> +mode |= throttle_handle_load(load, "IO load", 0);
> +mode |= throttle_handle_load(blocked, "blocked IO ratio", cores);
>  }
>
>  if(mode & throttle_extreme) {
>
>
>
>
> On 12 Nov 2013, at 3:25 pm, yusuke iida  wrote:
>
>> Hi, Andrew
>>
>> I'm sorry.
>> This report was a thing when two cores were assigned to the virtual machine.
>> https://drive.google.com/file/d/0BwMFJItoO-fVdlIwTVdFOGRkQ0U/edit?usp=sharing
>>
>> I'm sorry to be misleading.
>>
>> This is the report acquired with one core.
>> https://drive.google.com/file/d/0BwMFJItoO-fVSlo0dE0xMzNORGc/edit?usp=sharing
>>
>> It does not define the LRMD_MAX_CHILDREN on any node.
>> load-threshold is still default.
>> cib_max_cpu is set to 0.4 by the following processing.
>>
>>if(cores == 1) {
>>cib_max_cpu = 0.4;
>>}
>>
>> since -- if it exceeds 60%, it will be in the state of Extreme.
>> Nov 08 11:08:31 [2390] vm01   crmd: (  throttle.c:441   )  notice:
>> throttle_mode:Extreme CIB load detected: 0.67
>>
>> From the state of a bit, DC is detecting that vm01 is in the state of 
>> Extreme.
>> Nov 08 11:08:32 [2387] vm13   crmd: (  throttle.c:701   )   debug:
>> throttle_update: Host vm01 supports a maximum of 2 jobs and
>> throttle mode 1000.  New job limit is 1
>>
>> From the following log, a dynamic change of batch-limit also seems to
>> process satisfactorily.
>> # grep "throttle_get_total_job_limit" pacemaker.log
>> (snip)
>> Nov 08 11:08:31 [2387] vm13   crmd: (  throttle.c:629   )   trace:
>> throttle_get_total_job_limit:No change to batch-limit=0
>> Nov 08 11:08:32 [2387] vm13   crmd: (  throttle.c:632   )   trace:
>> throttle_get_total_job_limit:Using batch-limit=8
>> (snip)
>> Nov 08 11:10:32 [2387] vm13   crmd: (  throttle.c:632   )   trace:
>> throttle_get_total_job_limit:Using batch-limit=16
>>
>> The above shows that it is not solved even if it restricts the whole
>> number of jobs by batch-limit.
>> Are there any other methods of reducing a synchronous message?
>>
>> Internal IPC message is not so much.
>> Do not be able to handle even a little it on the way to handle the
>> synchronization message?
>>
>> Regards,
>> Yusuke
>>
>> 2013/11/12 Andrew Beekhof :
>>>
>>> On 11 Nov 2013, at 11:48 pm, yusuke iida  wrote:
>>>
>>>> Execution of the graph was also checked.
>>>> Since the number of pending(s) is restricted to 16 from the middle, it
>>>> is judged that batch-limit is effective.
>>>> Observing here, even if a job is restricted by batch-limit, two or
>>>> more jobs are always fired(ed) in 1 second.
>>>> These performed jobs return a result and the synchronous message of
>>>> CIB generates them.
>>>> The node which continued receiving a synchronous message processes
>>>> there preferentially, and postpones an internal IPC message.
>>>> I think that it caused timeout.
>>>
>>> What load-threshold were you running this with?
>>>
>>> I see this in the logs:
>>> "Host vm10 supports a maximum of 4 jobs and throttle mode 0100.  New job 
>>> limit is 1"
>>>
>>> Have you set LRMD_MAX_CHILDREN=4 on these nodes?
>>> I wouldn't recommend that for a single core VM.  I'd let the default of 
>>> 2*cores be used.
>>>
>>>
>>> Also, I'm not seeing "Extreme CIB load detected".  Are these still single 
>>> core machines?
>>> If so it would suggest that something about:
>>>
>>>if(cores == 1) {
>>>cib_max_cpu = 0.4;
>>>}
>>>if(throttle_load_target > 0.0 && throttle_load_target < cib_max_cpu) 
>>> {
>>

Re: [Pacemaker] The larger cluster is tested.

2013-11-11 Thread yusuke iida
Hi, Andrew

I'm sorry.
That report was taken when two cores were assigned to each virtual machine.
https://drive.google.com/file/d/0BwMFJItoO-fVdlIwTVdFOGRkQ0U/edit?usp=sharing

I'm sorry for the confusion.

This is the report taken with one core.
https://drive.google.com/file/d/0BwMFJItoO-fVSlo0dE0xMzNORGc/edit?usp=sharing

LRMD_MAX_CHILDREN is not defined on any node.
load-threshold is still the default.
cib_max_cpu is set to 0.4 by the following code.

if(cores == 1) {
cib_max_cpu = 0.4;
}

Because of this, if the load exceeds 60% (1.5 * 0.4), the node goes into the Extreme state.
Nov 08 11:08:31 [2390] vm01   crmd: (  throttle.c:441   )  notice:
throttle_mode:Extreme CIB load detected: 0.67

From the state shortly afterwards, the DC detects that vm01 is in the Extreme state.
Nov 08 11:08:32 [2387] vm13   crmd: (  throttle.c:701   )   debug:
throttle_update: Host vm01 supports a maximum of 2 jobs and
throttle mode 1000.  New job limit is 1

From the following log, the dynamic change of batch-limit also seems to
be working properly.
# grep "throttle_get_total_job_limit" pacemaker.log
(snip)
Nov 08 11:08:31 [2387] vm13   crmd: (  throttle.c:629   )   trace:
throttle_get_total_job_limit:No change to batch-limit=0
Nov 08 11:08:32 [2387] vm13   crmd: (  throttle.c:632   )   trace:
throttle_get_total_job_limit:Using batch-limit=8
(snip)
Nov 08 11:10:32 [2387] vm13   crmd: (  throttle.c:632   )   trace:
throttle_get_total_job_limit:Using batch-limit=16

The above shows that the problem is not solved even when the total
number of jobs is restricted by batch-limit.
Are there any other ways to reduce the synchronization messages?

There are not that many internal IPC messages.
Could they be handled, even a little at a time, in between processing
the synchronization messages?

Regards,
Yusuke

2013/11/12 Andrew Beekhof :
>
> On 11 Nov 2013, at 11:48 pm, yusuke iida  wrote:
>
>> Execution of the graph was also checked.
>> Since the number of pending(s) is restricted to 16 from the middle, it
>> is judged that batch-limit is effective.
>> Observing here, even if a job is restricted by batch-limit, two or
>> more jobs are always fired(ed) in 1 second.
>> These performed jobs return a result and the synchronous message of
>> CIB generates them.
>> The node which continued receiving a synchronous message processes
>> there preferentially, and postpones an internal IPC message.
>> I think that it caused timeout.
>
> What load-threshold were you running this with?
>
> I see this in the logs:
> "Host vm10 supports a maximum of 4 jobs and throttle mode 0100.  New job 
> limit is 1"
>
> Have you set LRMD_MAX_CHILDREN=4 on these nodes?
> I wouldn't recommend that for a single core VM.  I'd let the default of 
> 2*cores be used.
>
>
> Also, I'm not seeing "Extreme CIB load detected".  Are these still single 
> core machines?
> If so it would suggest that something about:
>
> if(cores == 1) {
> cib_max_cpu = 0.4;
> }
> if(throttle_load_target > 0.0 && throttle_load_target < cib_max_cpu) {
> cib_max_cpu = throttle_load_target;
> }
>
> if(load > 1.5 * cib_max_cpu) {
> /* Can only happen on machines with a low number of cores */
> crm_notice("Extreme %s detected: %f", desc, load);
> mode |= throttle_extreme;
>
> is wrong.
>
> What was load-threshold configured as?
>
>
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>



-- 

METRO SYSTEMS CO., LTD

Yusuke Iida
Mail: yusk.i...@gmail.com


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] The larger cluster is tested.

2013-11-11 Thread yusuke iida
   crmd: ( graph.c:336   )   debug:
run_graph:   Transition 1 (Complete=590, Pending=22, Fired=0,
Skipped=0, Incomplete=0,
Source=/var/lib/pacemaker/pengine/pe-input-67.bz2): In-progress
Nov 08 15:27:14 [2473] vm12   crmd: ( graph.c:277   )   debug:
run_graph:   Throttling output: batch limit (16) reached
Nov 08 15:27:14 [2473] vm12   crmd: ( graph.c:336   )   debug:
run_graph:   Transition 1 (Complete=592, Pending=20, Fired=0,
Skipped=0, Incomplete=0,
Source=/var/lib/pacemaker/pengine/pe-input-67.bz2): In-progress
Nov 08 15:27:15 [2473] vm12   crmd: ( graph.c:277   )   debug:
run_graph:   Throttling output: batch limit (16) reached
Nov 08 15:27:15 [2473] vm12   crmd: ( graph.c:336   )   debug:
run_graph:   Transition 1 (Complete=594, Pending=18, Fired=0,
Skipped=0, Incomplete=0,
Source=/var/lib/pacemaker/pengine/pe-input-67.bz2): In-progress
Nov 08 15:27:15 [2473] vm12   crmd: ( graph.c:277   )   debug:
run_graph:   Throttling output: batch limit (16) reached
Nov 08 15:27:15 [2473] vm12   crmd: ( graph.c:336   )   debug:
run_graph:   Transition 1 (Complete=596, Pending=16, Fired=0,
Skipped=0, Incomplete=0,
Source=/var/lib/pacemaker/pengine/pe-input-67.bz2): In-progress
Nov 08 15:27:15 [2473] vm12   crmd: ( graph.c:277   )   debug:
run_graph:   Throttling output: batch limit (16) reached
Nov 08 15:27:15 [2473] vm12   crmd: ( graph.c:336   )   debug:
run_graph:   Transition 1 (Complete=598, Pending=16, Fired=2,
Skipped=0, Incomplete=228,
Source=/var/lib/pacemaker/pengine/pe-input-67.bz2): In-progress
Nov 08 15:27:16 [2473] vm12   crmd: ( graph.c:277   )   debug:
run_graph:   Throttling output: batch limit (16) reached
Nov 08 15:27:16 [2473] vm12   crmd: ( graph.c:336   )   debug:
run_graph:   Transition 1 (Complete=600, Pending=16, Fired=2,
Skipped=0, Incomplete=233,
Source=/var/lib/pacemaker/pengine/pe-input-67.bz2): In-progress
Nov 08 15:27:16 [2473] vm12   crmd: ( graph.c:277   )   debug:
run_graph:   Throttling output: batch limit (16) reached
Nov 08 15:27:16 [2473] vm12   crmd: ( graph.c:336   )   debug:
run_graph:   Transition 1 (Complete=603, Pending=16, Fired=3,
Skipped=0, Incomplete=233,
Source=/var/lib/pacemaker/pengine/pe-input-67.bz2): In-progress
Nov 08 15:27:16 [2473] vm12   crmd: ( graph.c:277   )   debug:
run_graph:   Throttling output: batch limit (16) reached
Nov 08 15:27:16 [2473] vm12   crmd: ( graph.c:336   )   debug:
run_graph:   Transition 1 (Complete=605, Pending=16, Fired=2,
Skipped=0, Incomplete=241,
Source=/var/lib/pacemaker/pengine/pe-input-67.bz2): In-progress
Nov 08 15:27:17 [2473] vm12   crmd: ( graph.c:277   )   debug:
run_graph:   Throttling output: batch limit (16) reached
Nov 08 15:27:17 [2473] vm12   crmd: ( graph.c:336   )   debug:
run_graph:   Transition 1 (Complete=609, Pending=16, Fired=4,
Skipped=0, Incomplete=272,
Source=/var/lib/pacemaker/pengine/pe-input-67.bz2): In-progress
Nov 08 15:27:17 [2473] vm12   crmd: ( graph.c:277   )   debug:
run_graph:   Throttling output: batch limit (16) reached
Nov 08 15:27:17 [2473] vm12   crmd: ( graph.c:336   )   debug:
run_graph:   Transition 1 (Complete=611, Pending=16, Fired=2,
Skipped=0, Incomplete=243,
Source=/var/lib/pacemaker/pengine/pe-input-67.bz2): In-progress

Regards,
Yusuke
2013/11/11 Andrew Beekhof :
>
> On 11 Nov 2013, at 5:08 pm, yusuke iida  wrote:
>
>> Hi, Andrew
>>
>> I tested by the following versions.
>> https://github.com/yuusuke/pacemaker/commit/3b90af1b11a4389f8b4a95a20ef12b8c259e73dc
>>
>> However, the problem has not been solved yet.
>>
>> I do not think that this problem can cope with it by batch-limit.
>> Execution of a job is interrupted by batch-limit temporarily.
>> However, graph will be immediately resumed by trigger_graph called in
>> match_graph_event.
>
> batch-limit controls how many in-flight jobs can be performed (and therefor 
> how busy the CIB can be).
> If batch-limit=10 and there are still 10 jobs in progress, then calling 
> trigger_graph() over and over does nothing until there are 9 jobs (or less).
> At which point one more can be scheduled.
>
> So if "synchronous message of CIB is sent now ceaseless", then there is a bug 
> somewhere.
> Did you confirm that throttle_get_total_job_limit() was returning an 
> appropriate value?
>
>> Since the synchronous message of CIB is sent now ceaseless, the IPC
>> message sent from crmd cannot be processed.
>>
>> The following methods can be considered to solve a problem for this
>> CPG message sent continuously.
>>
>> In order to make the time when a CPG message is processed, it stops
>> that DC sends job for a definite period of time.
>>
>> Or I th

Re: [Pacemaker] The larger cluster is tested.

2013-11-10 Thread yusuke iida
Hi, Andrew

I tested with the following version.
https://github.com/yuusuke/pacemaker/commit/3b90af1b11a4389f8b4a95a20ef12b8c259e73dc

However, the problem has not been solved yet.

I do not think this problem can be handled by batch-limit.
Job execution is only interrupted by batch-limit temporarily.
The graph is immediately resumed by trigger_graph(), which is called from
match_graph_event().
Since the CIB synchronization messages are sent ceaselessly, the IPC
messages sent from crmd cannot be processed.

The following approaches could be considered to deal with these
continuously sent CPG messages.

To create time in which the CPG messages can be processed, the DC could
stop sending jobs for a fixed period of time.

Or I think the priority of CPG messages needs to be made the same as the
G_PRIORITY_DEFAULT used by gio_poll_dispatch_add().

I attach the report from this test.
https://drive.google.com/file/d/0BwMFJItoO-fVdlIwTVdFOGRkQ0U/edit?usp=sharing

Regards,
Yusuke

2013/11/8 Andrew Beekhof :
>
> On 8 Nov 2013, at 12:10 am, yusuke iida  wrote:
>
>> Hi, Andrew
>>
>> The shown code seems not to process correctly.
>> I wrote correction.
>> Please check.
>> https://github.com/yuusuke/pacemaker/commit/3b90af1b11a4389f8b4a95a20ef12b8c259e73dc
>
> Ah, yes that looks better.
> Did it help at all?
>
>>
>> Regards,
>> Yusuke
>>
>> 2013/11/7 Andrew Beekhof :
>>>
>>> On 7 Nov 2013, at 12:43 pm, yusuke iida  wrote:
>>>
>>>> Hi, Andrew
>>>>
>>>> 2013/11/7 Andrew Beekhof :
>>>>>
>>>>> On 6 Nov 2013, at 4:48 pm, yusuke iida  wrote:
>>>>>
>>>>>> Hi, Andrew
>>>>>>
>>>>>> I tested by the following versions.
>>>>>> https://github.com/ClusterLabs/pacemaker/commit/3492fec7fe58a6fd94071632df27d3fd3fc3ffe3
>>>>>>
>>>>>> load-threshold was checked at 60%, 40%, and 20%.
>>>>>>
>>>>>> However, the problem was not solved.
>>>>>> It will not change but timeout will occur.
>>>>>
>>>>> That is extremely surprising.  I will have a look at your logs today.
>>>>> How many cores do these machines have btw?
>>>>
>>>> The machine which I am using by the test is a virtual machine of KVM.
>>>> There are four physical servers. Four virtual machines are started on
>>>> each server.
>>>> Has four core physical server, I am assigned a core of separate to the
>>>> virtual machine.
>>>> The number of CPUs currently assigned to the virtual machine is one piece.
>>>> The memory is assigning 2048 MB per set.
>>>
>>> I think I understand whats happening...
>>>
>>> The throttling code is designed to keep the cib's CPU usage from reaching 
>>> 100% (ie. 1 core completely busy).
>>> In a single core setup, thats already much too late, and with 16 nodes I 
>>> can easily imagine that even 1 job per machine is going to be too much for 
>>> an underpowered CPU.
>>>
>>> I'm currently experimenting with:
>>>
>>>   http://paste.fedoraproject.org/52283/37994581
>>>
>>> which may help on both fronts.
>>>
>>> Essentially it is trying to dynamically infer a "good" value for 
>>> batch-limit when the CIB is using too much CPU.
>>>
>>>
>>>
>>> ___
>>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>
>>> Project Home: http://www.clusterlabs.org
>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>> Bugs: http://bugs.clusterlabs.org
>>
>>
>>
>> --
>> 
>> METRO SYSTEMS CO., LTD
>>
>> Yusuke Iida
>> Mail: yusk.i...@gmail.com
>> 
>>
>> ___
>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
>
>
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org



-- 

METRO SYSTEMS CO., LTD

Yusuke Iida
Mail: yusk.i...@gmail.com


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] The larger cluster is tested.

2013-11-07 Thread yusuke iida
Hi, Andrew

The code you showed does not seem to work correctly.
I wrote a correction.
Please have a look.
https://github.com/yuusuke/pacemaker/commit/3b90af1b11a4389f8b4a95a20ef12b8c259e73dc

Regards,
Yusuke

2013/11/7 Andrew Beekhof :
>
> On 7 Nov 2013, at 12:43 pm, yusuke iida  wrote:
>
>> Hi, Andrew
>>
>> 2013/11/7 Andrew Beekhof :
>>>
>>> On 6 Nov 2013, at 4:48 pm, yusuke iida  wrote:
>>>
>>>> Hi, Andrew
>>>>
>>>> I tested by the following versions.
>>>> https://github.com/ClusterLabs/pacemaker/commit/3492fec7fe58a6fd94071632df27d3fd3fc3ffe3
>>>>
>>>> load-threshold was checked at 60%, 40%, and 20%.
>>>>
>>>> However, the problem was not solved.
>>>> It will not change but timeout will occur.
>>>
>>> That is extremely surprising.  I will have a look at your logs today.
>>> How many cores do these machines have btw?
>>
>> The machine which I am using by the test is a virtual machine of KVM.
>> There are four physical servers. Four virtual machines are started on
>> each server.
>> Has four core physical server, I am assigned a core of separate to the
>> virtual machine.
>> The number of CPUs currently assigned to the virtual machine is one piece.
>> The memory is assigning 2048 MB per set.
>
> I think I understand whats happening...
>
> The throttling code is designed to keep the cib's CPU usage from reaching 
> 100% (ie. 1 core completely busy).
> In a single core setup, thats already much too late, and with 16 nodes I can 
> easily imagine that even 1 job per machine is going to be too much for an 
> underpowered CPU.
>
> I'm currently experimenting with:
>
>http://paste.fedoraproject.org/52283/37994581
>
> which may help on both fronts.
>
> Essentially it is trying to dynamically infer a "good" value for batch-limit 
> when the CIB is using too much CPU.
>
>
>
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org



-- 

METRO SYSTEMS CO., LTD

Yusuke Iida
Mail: yusk.i...@gmail.com


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] The larger cluster is tested.

2013-11-06 Thread yusuke iida
Hi, Andrew

2013/11/7 Andrew Beekhof :
>
> On 6 Nov 2013, at 4:48 pm, yusuke iida  wrote:
>
>> Hi, Andrew
>>
>> I tested by the following versions.
>> https://github.com/ClusterLabs/pacemaker/commit/3492fec7fe58a6fd94071632df27d3fd3fc3ffe3
>>
>> load-threshold was checked at 60%, 40%, and 20%.
>>
>> However, the problem was not solved.
>> It will not change but timeout will occur.
>
> That is extremely surprising.  I will have a look at your logs today.
> How many cores do these machines have btw?

The machines I am using for the test are KVM virtual machines.
There are four physical servers, and four virtual machines are started on
each server.
Each physical server has four cores, and I assign a separate core to each
virtual machine.
The number of CPUs currently assigned to each virtual machine is one.
Each virtual machine is assigned 2048 MB of memory.
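
For reference, the core assignment is done with CPU pinning, roughly like
this (a sketch only; the VM name and CPU numbers are just an example):

# pin vCPU 0 of vm01 to host CPU 0, persistently
virsh vcpupin vm01 0 0 --config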

Regards,
Yusuke
-- 

METRO SYSTEMS CO., LTD

Yusuke Iida
Mail: yusk.i...@gmail.com


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] The larger cluster is tested.

2013-11-05 Thread yusuke iida
Hi, Andrew

I tested with the following version.
https://github.com/ClusterLabs/pacemaker/commit/3492fec7fe58a6fd94071632df27d3fd3fc3ffe3

load-threshold was checked at 60%, 40%, and 20%.

However, the problem was not solved.
Nothing changed; timeouts still occur.

The restriction on the number of jobs seems to be applied correctly.
However, since the CIB synchronization messages are sent ceaselessly,
they are processed preferentially.
As a result, the internal IPC messages are kept waiting.

I think I need to change the priority of message processing in order to
solve this problem.
Alternatively, when the load is high, having the DC stop sending jobs
would be effective.
The accumulated messages could then be processed while job transmission
is stopped.
However, the operation of the whole cluster is expected to become slower
in that case.

In what kind of cases could problems occur if the priority is changed?
And if it is known, please tell me what tests I should run.

load-threshold 60% test report
https://drive.google.com/file/d/0BwMFJItoO-fVOHB5S1ROOUJrams/edit?usp=sharing
load-threshold 40% test report
https://drive.google.com/file/d/0BwMFJItoO-fVemlqVUU2QkhEMW8/edit?usp=sharing
load-threshold 20% test report
https://drive.google.com/file/d/0BwMFJItoO-fVTWFTU2pqOF9pcms/edit?usp=sharing

The report from the test with the commit that changes the priority is also included.
https://github.com/yuusuke/pacemaker/commit/17a7cbe67c455f5f6d36a1e1bc255b4ab0039dd8

load-threshold 80% and CPG G_PRIORITY_DEFAULT test report
https://drive.google.com/file/d/0BwMFJItoO-fVV1BoTjVQMk52WEU/edit?usp=sharing

2013/11/6 Andrew Beekhof :
>
> On 5 Nov 2013, at 12:48 pm, yusuke iida  wrote:
>
>> Hi, Andrew
>>
>> I tested by this commitment.
>> https://github.com/beekhof/pacemaker/commit/145c782e432d8108ca865f994640cf5a62406363
>>
>> However, the problem has not improved.
>> It seems that it will be preferentially processed since the message of
>> CPG is set as G_PRIORITY_MED.
>>
>> I suggest that you lower the priority of CPG instead.
>
> I worry about this change.
> It may allow ipc clients to read out of date information (the pending cpg 
> messages almost certainly contain updates) and could result in updates being 
> lost (because they're not being made to the latest config+status).
>
> Could you try reducing the value of load-threshold? The default (80%) could 
> be too high.
>
>> How is this?
>> https://github.com/yuusuke/pacemaker/commit/22a14318cc740b3043106609923f47039c3aa407
>>
>> I did not find the method of lowering only the priority of the CPG
>> message of a CIB process.
>>
>> Reports when the error came out were collected.
>> I want you to note that it is delayed that an IPC message is processed
>> as follows.
>>
>> Nov 01 21:53:52 [9246] vm01   crmd: (cib_native.c:397   )   trace:
>> cib_native_perform_op_delegate:  Async call, returning 32
>> (snip)
>> Nov 01 21:55:57 [9241] vm01cib: ( callbacks.c:688   )info:
>> cib_process_request: Forwarding cib_modify operation for section
>> status to master (origin=local/crmd/32)
>>
>> Since size is large, I want you to download from the following.
>> https://drive.google.com/file/d/0BwMFJItoO-fVWDg1Sjc2WXltUjQ/edit?usp=sharing
>>
>> Regards,
>> Yusuke
>>
>> 2013/10/31 Andrew Beekhof :
>>>
>>> On 29 Oct 2013, at 12:12 am, yusuke iida  wrote:
>>>
>>>> Hi, Andrew
>>>>
>>>> I tested using following commit.
>>>> https://github.com/beekhof/pacemaker/commit/b6fa1e650f64b1ba73fdb143f41323aa8cb3544e
>>>>
>>>> However, timeout of operation has still occurred.
>>>>
>>>> I analyzed the log.
>>>>
>>>> I am noting that it is late that the ipc message transmitted to cib
>>>> from crmd of local is processed.
>>>> Since the CIB synchronous message by which the CIB process came from
>>>> the outside will have priority and will be processed, this happens?
>>>>
>>>>
>>>> I made the following corrections so that the priority of the message
>>>> which CIB processes might be changed.
>>>> In this case, timeout does not occur.
>>>>
>>>> diff --git a/lib/cluster/cpg.c b/lib/cluster/cpg.c
>>>> index 8522cbf..3a67998 100644
>>>> --- a/lib/cluster/cpg.c
>>>> +++ b/lib/cluster/cpg.c
>>>> @@ -212,7 +212,7 @@ pcmk_cpg_dispatch(gpointer user_data)
>>>>int rc = 0;
>>>>crm_cluster_t *cluster = (crm_cluster_t*) user_data;
>>>>
>>>> -r

Re: [Pacemaker] The larger cluster is tested.

2013-11-04 Thread yusuke iida
Hi, Andrew

I tested with this commit.
https://github.com/beekhof/pacemaker/commit/145c782e432d8108ca865f994640cf5a62406363

However, the problem has not improved.
It seems the CPG messages are still processed preferentially, since they
are set to G_PRIORITY_MED.

I suggest lowering the priority of CPG instead.
How is this?
https://github.com/yuusuke/pacemaker/commit/22a14318cc740b3043106609923f47039c3aa407

I could not find a way to lower the priority of only the CPG
messages of the cib process.

I collected reports from when the error occurred.
Please note that the processing of an IPC message is delayed, as shown
below.

Nov 01 21:53:52 [9246] vm01   crmd: (cib_native.c:397   )   trace:
cib_native_perform_op_delegate:  Async call, returning 32
(snip)
Nov 01 21:55:57 [9241] vm01cib: ( callbacks.c:688   )info:
cib_process_request: Forwarding cib_modify operation for section
status to master (origin=local/crmd/32)

Since the file is large, please download it from the following link.
https://drive.google.com/file/d/0BwMFJItoO-fVWDg1Sjc2WXltUjQ/edit?usp=sharing

Regards,
Yusuke

2013/10/31 Andrew Beekhof :
>
> On 29 Oct 2013, at 12:12 am, yusuke iida  wrote:
>
>> Hi, Andrew
>>
>> I tested using following commit.
>> https://github.com/beekhof/pacemaker/commit/b6fa1e650f64b1ba73fdb143f41323aa8cb3544e
>>
>> However, timeout of operation has still occurred.
>>
>> I analyzed the log.
>>
>> I am noting that it is late that the ipc message transmitted to cib
>> from crmd of local is processed.
>> Since the CIB synchronous message by which the CIB process came from
>> the outside will have priority and will be processed, this happens?
>>
>>
>> I made the following corrections so that the priority of the message
>> which CIB processes might be changed.
>> In this case, timeout does not occur.
>>
>> diff --git a/lib/cluster/cpg.c b/lib/cluster/cpg.c
>> index 8522cbf..3a67998 100644
>> --- a/lib/cluster/cpg.c
>> +++ b/lib/cluster/cpg.c
>> @@ -212,7 +212,7 @@ pcmk_cpg_dispatch(gpointer user_data)
>> int rc = 0;
>> crm_cluster_t *cluster = (crm_cluster_t*) user_data;
>>
>> -rc = cpg_dispatch(cluster->cpg_handle, CS_DISPATCH_ALL);
>> +rc = cpg_dispatch(cluster->cpg_handle, CS_DISPATCH_ONE);
>> if (rc != CS_OK) {
>> crm_err("Connection to the CPG API failed: %s (%d)",
>> ais_error2text(rc), rc);
>> cluster->cpg_handle = 0;
>> diff --git a/lib/common/mainloop.c b/lib/common/mainloop.c
>> index 18a67e6..d605288 100644
>> --- a/lib/common/mainloop.c
>> +++ b/lib/common/mainloop.c
>> @@ -482,7 +482,7 @@ gio_poll_dispatch_add(enum qb_loop_priority p,
>> int32_t fd, int32_t evts,
>> adaptor->p = p;
>> adaptor->is_used = QB_TRUE;
>> adaptor->source =
>> -g_io_add_watch_full(channel, G_PRIORITY_DEFAULT, evts,
>> gio_read_socket, adaptor,
>> +g_io_add_watch_full(channel, G_PRIORITY_MEDIUM, evts,
>> gio_read_socket, adaptor,
>> gio_poll_destroy);
>>
>> /* Now that mainloop now holds a reference to channel,
>>
>> I do not know this fix is correct.
>> Can't the comment to this correction be got?
>
> The CS_DISPATCH_ONE change looks ok: 
> https://github.com/beekhof/pacemaker/commit/6384053
> Did you try with just that?  I'd like to avoid the mainloop priority change 
> if possible.
>
>>
>> Regards,
>> Yusuke
>>
>> 2013/10/20 Andrew Beekhof :
>>>
>>> On 18/10/2013, at 10:12 PM, yusuke iida  wrote:
>>>
>>>> Hi, Andrew
>>>>
>>>> Now, I am testing the configuration of one standby node and active node of 
>>>> 15.
>>>> About 10 Dummy resources are started per node.
>>>>
>>>> If all the nodes are started with this composition, before all the
>>>> resources start, it will take the time for about 20 minutes.
>>>>
>>>> And some resources have caused start timeout.
>>>> probe is performed all at once by all the nodes at a start-up.
>>>> The result is written in cib and synchronizes with all the nodes.
>>>> This processing requires very high load.
>>>> I think that timeout has occurred owing to it.
>>>
>>> More than likely, yes.
>>>
>>>>
>>>> I am very interested in whether this problem is solvable, if you use
>>>> throttle created now.
>>>
>>> I have been using it, I have found it more effective than batch-limit for 
>>>

Re: [Pacemaker] The larger cluster is tested.

2013-10-28 Thread yusuke iida
Hi, Andrew

I tested using the following commit.
https://github.com/beekhof/pacemaker/commit/b6fa1e650f64b1ba73fdb143f41323aa8cb3544e

However, operation timeouts still occurred.

I analyzed the log.

I noticed that the IPC messages sent to cib from the local crmd are
processed late.
Does this happen because the CIB synchronization messages coming from
outside are given priority and processed first?


I made the following changes so that the priority of the messages
processed by the cib is changed.
With this change, the timeout does not occur.

diff --git a/lib/cluster/cpg.c b/lib/cluster/cpg.c
index 8522cbf..3a67998 100644
--- a/lib/cluster/cpg.c
+++ b/lib/cluster/cpg.c
@@ -212,7 +212,7 @@ pcmk_cpg_dispatch(gpointer user_data)
 int rc = 0;
 crm_cluster_t *cluster = (crm_cluster_t*) user_data;

-rc = cpg_dispatch(cluster->cpg_handle, CS_DISPATCH_ALL);
+rc = cpg_dispatch(cluster->cpg_handle, CS_DISPATCH_ONE);
 if (rc != CS_OK) {
 crm_err("Connection to the CPG API failed: %s (%d)",
ais_error2text(rc), rc);
 cluster->cpg_handle = 0;
diff --git a/lib/common/mainloop.c b/lib/common/mainloop.c
index 18a67e6..d605288 100644
--- a/lib/common/mainloop.c
+++ b/lib/common/mainloop.c
@@ -482,7 +482,7 @@ gio_poll_dispatch_add(enum qb_loop_priority p,
int32_t fd, int32_t evts,
 adaptor->p = p;
 adaptor->is_used = QB_TRUE;
 adaptor->source =
-g_io_add_watch_full(channel, G_PRIORITY_DEFAULT, evts,
gio_read_socket, adaptor,
+g_io_add_watch_full(channel, G_PRIORITY_MEDIUM, evts,
gio_read_socket, adaptor,
 gio_poll_destroy);

 /* Now that mainloop now holds a reference to channel,

I do not know whether this fix is correct.
Could I get comments on this correction?

Regards,
Yusuke

2013/10/20 Andrew Beekhof :
>
> On 18/10/2013, at 10:12 PM, yusuke iida  wrote:
>
>> Hi, Andrew
>>
>> Now, I am testing the configuration of one standby node and active node of 
>> 15.
>> About 10 Dummy resources are started per node.
>>
>> If all the nodes are started with this composition, before all the
>> resources start, it will take the time for about 20 minutes.
>>
>> And some resources have caused start timeout.
>> probe is performed all at once by all the nodes at a start-up.
>> The result is written in cib and synchronizes with all the nodes.
>> This processing requires very high load.
>> I think that timeout has occurred owing to it.
>
> More than likely, yes.
>
>>
>> I am very interested in whether this problem is solvable, if you use
>> throttle created now.
>
> I have been using it, I have found it more effective than batch-limit for 
> bounding CPU usage and avoiding timeouts.
> I would be interested to hear your feedback if you have the time to do some 
> testing.
>
>> When is throttle due to be merged into the repository of ClusterLabs?
>
> It is queued up behind a compatibility patch that is needed for some changes 
> I made to the pacemaker-remote wire protocol.
>
>>
>> Best Regards,
>>
>> --
>> 
>> METRO SYSTEMS CO., LTD
>>
>> Yusuke Iida
>> Mail: yusk.i...@gmail.com
>> 
>>
>> ___
>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
>
>
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>



-- 

METRO SYSTEMS CO., LTD

Yusuke Iida
Mail: yusk.i...@gmail.com


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[Pacemaker] The larger cluster is tested.

2013-10-18 Thread yusuke iida
Hi, Andrew

I am now testing a configuration with one standby node and 15 active nodes.
About 10 Dummy resources are started per node.
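
A sketch of how such a test configuration can be generated with the crm
shell (the resource names and total count are just an example):

# roughly 10 Dummy resources per node across the cluster
for i in $(seq 1 160); do
    crm configure primitive dummy-$i ocf:pacemaker:Dummy \
        op monitor interval=30s
done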

When all the nodes are started with this configuration, it takes about
20 minutes before all the resources have started.
And some resources have hit their start timeout.
At start-up, the probes are performed all at once on all the nodes.
The results are written into the CIB and synchronized to all the nodes.
This processing creates a very high load.
I think that is what causes the timeouts.

I am very interested in whether this problem can be solved by using the
throttling code you have been creating.
When is the throttling work due to be merged into the ClusterLabs repository?

Best Regards,

-- 

METRO SYSTEMS CO., LTD

Yusuke Iida
Mail: yusk.i...@gmail.com


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Pacemaker still may include memory leaks

2013-07-21 Thread yusuke iida
' manner, so crmd does not stop for a long time, and the rest of
>>>>>>>> pacemaker does not see it 'hanged'. Again, I did not try that, and I do
>>>>>>>> not know if it's even possible to do that with crmd.
>>>>>>>>
>>>>>>>> And, as pacemaker heavily utilizes glib, which has own memory allocator
>>>>>>>> (slices), it is better to switch it to a 'standard' malloc/free for
>>>>>>>> debugging with G_SLICE=always-malloc env var.
>>>>>>>>
>>>>>>>> Last, I did memleak checks for a 'static' (i.e. no operations except
>>>>>>>> monitors are performed) cluster for ~1.1.8, and did not find any. It
>>>>>>>> would be interesting to see if that is true for an 'active' one, which
>>>>>>>> starts/stops resources, handles failures, etc.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Sincerely,
>>>>>>>>> Yuichi
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Also, the measurements are in pages... could you run "getconf 
>>>>>>>>>>> PAGESIZE" and let us know the result?
>>>>>>>>>>> I'm guessing 4096 bytes.
>>>>>>>>>>>
>>>>>>>>>>> On 23/05/2013, at 5:47 PM, Yuichi SEINO  
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> I retry the test after we updated packages to the latest tag and 
>>>>>>>>>>>> OS.
>>>>>>>>>>>> glue and booth is latest.
>>>>>>>>>>>>
>>>>>>>>>>>> * Environment
>>>>>>>>>>>> OS:RHEL 6.4
>>>>>>>>>>>> cluster-glue:latest(commit:2755:8347e8c9b94f) +
>>>>>>>>>>>> patch[detail:http://www.gossamer-threads.com/lists/linuxha/dev/85787]
>>>>>>>>>>>> resource-agent:v3.9.5
>>>>>>>>>>>> libqb:v0.14.4
>>>>>>>>>>>> corosync:v2.3.0
>>>>>>>>>>>> pacemaker:v1.1.10-rc2
>>>>>>>>>>>> crmsh:v1.2.5
>>>>>>>>>>>> booth:latest(commit:67e1208973de728958432aaba165766eac1ce3a0)
>>>>>>>>>>>>
>>>>>>>>>>>> * Test procedure
>>>>>>>>>>>> we regularly switch a ticket. The previous test also used the same 
>>>>>>>>>>>> way.
>>>>>>>>>>>> And, There was no a memory leak when we tested pacemaker-1.1 before
>>>>>>>>>>>> pacemaker use libqb.
>>>>>>>>>>>>
>>>>>>>>>>>> * Result
>>>>>>>>>>>> As a result, I think that crmd may cause the memory leak.
>>>>>>>>>>>>
>>>>>>>>>>>> crmd smaps(a total of each addresses)
>>>>>>>>>>>> In detail, we attached smaps of  start and end. And, I recorded 
>>>>>>>>>>>> smaps
>>>>>>>>>>>> every 1 minutes.
>>>>>>>>>>>>
>>>>>>>>>>>> Start
>>>>>>>>>>>> RSS: 7396
>>>>>>>>>>>> SHR(Shared_Clean+Shared_Dirty):3560
>>>>>>>>>>>> Private(Private_Clean+Private_Dirty):3836
>>>>>>>>>>>>
>>>>>>>>>>>> Interbal(about 30h later)
>>>>>>>>>>>> RSS:18464
>>>>>>>>>>>> SHR:14276
>>>>>>>>>>>> Private:4188
>>>>>>>>>>>>
>>>>>>>>>>>> End(about 70h later)
>>>>>>>>>>>> RSS:19104
>>>>>>>>>>>> SHR:14336
>>>>>>>>>>>> Private:4768
>>>>>>>>>>>>
>>>>>>>>>>>> Sincerely,
>>>>>>>>>>>> Yuichi
>>>>>>>>>>>>
>>>>>>>>>>>> 2013/5/15 Yuichi SEINO :
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I ran the test for about two days.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Environment
>>>>>>>>>>>>>
>>>>>>>>>>>>> OS:RHEL 6.3
>>>>>>>>>>>>> pacemaker-1.1.9-devel (commit 
>>>>>>>>>>>>> 138556cb0b375a490a96f35e7fbeccc576a22011)
>>>>>>>>>>>>> corosync-2.3.0
>>>>>>>>>>>>> cluster-glue 
>>>>>>>>>>>>> latest+patch(detail:http://www.gossamer-threads.com/lists/linuxha/dev/85787)
>>>>>>>>>>>>> libqb- 0.14.4
>>>>>>>>>>>>>
>>>>>>>>>>>>> There may be a memory leak in crmd and lrmd. I regularly got rss 
>>>>>>>>>>>>> of ps.
>>>>>>>>>>>>>
>>>>>>>>>>>>> start-up
>>>>>>>>>>>>> crmd:5332
>>>>>>>>>>>>> lrmd:3625
>>>>>>>>>>>>>
>>>>>>>>>>>>> interval(about 30h later)
>>>>>>>>>>>>> crmd:7716
>>>>>>>>>>>>> lrmd:3744
>>>>>>>>>>>>>
>>>>>>>>>>>>> ending(about 60h later)
>>>>>>>>>>>>> crmd:8336
>>>>>>>>>>>>> lrmd:3780
>>>>>>>>>>>>>
>>>>>>>>>>>>> I still don't run a test that pacemaker-1.1.10-rc2 use. So, I 
>>>>>>>>>>>>> will run its test.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Sincerely,
>>>>>>>>>>>>> Yuichi
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Yuichi SEINO
>>>>>>>>>>>>> METROSYSTEMS CORPORATION
>>>>>>>>>>>>> E-mail:seino.clust...@gmail.com
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Yuichi SEINO
>>>>>>>>>>>> METROSYSTEMS CORPORATION
>>>>>>>>>>>> E-mail:seino.clust...@gmail.com
>>>>>>>>>>>> ___
>>>>>>>>>>>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>>>>>>>>>>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>>>>>>>>>>
>>>>>>>>>>>> Project Home: http://www.clusterlabs.org
>>>>>>>>>>>> Getting started: 
>>>>>>>>>>>> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>>>>>>>>>> Bugs: http://bugs.clusterlabs.org
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> ___
>>>>>>>>>>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>>>>>>>>>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>>>>>>>>>
>>>>>>>>>>> Project Home: http://www.clusterlabs.org
>>>>>>>>>>> Getting started: 
>>>>>>>>>>> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>>>>>>>>> Bugs: http://bugs.clusterlabs.org
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> ___
>>>>>>>>>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>>>>>>>>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>>>>>>>>
>>>>>>>>>> Project Home: http://www.clusterlabs.org
>>>>>>>>>> Getting started: 
>>>>>>>>>> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>>>>>>>> Bugs: http://bugs.clusterlabs.org
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Yuichi SEINO
>>>>>>>>> METROSYSTEMS CORPORATION
>>>>>>>>> E-mail:seino.clust...@gmail.com
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> ___
>>>>>>>>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>>>>>>>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>>>>>>>
>>>>>>>>> Project Home: http://www.clusterlabs.org
>>>>>>>>> Getting started: 
>>>>>>>>> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>>>>>>> Bugs: http://bugs.clusterlabs.org
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> ___
>>>>>>>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>>>>>>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>>>>>>
>>>>>>>> Project Home: http://www.clusterlabs.org
>>>>>>>> Getting started: 
>>>>>>>> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>>>>>> Bugs: http://bugs.clusterlabs.org
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> ___
>>>>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>>>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>>>
>>>>> Project Home: http://www.clusterlabs.org
>>>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>>> Bugs: http://bugs.clusterlabs.org
>>>>
>>>>
>>>>
>>>> --
>>>> Yuichi SEINO
>>>> METROSYSTEMS CORPORATION
>>>> E-mail:seino.clust...@gmail.com
>>>
>>> --
>>> Yuichi SEINO
>>> METROSYSTEMS CORPORATION
>>> E-mail:seino.clust...@gmail.com
>>> ___
>>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>
>>> Project Home: http://www.clusterlabs.org
>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>> Bugs: http://bugs.clusterlabs.org
>>
>>
>> ___
>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
>
>
>
> --
> Yuichi SEINO
> METROSYSTEMS CORPORATION
> E-mail:seino.clust...@gmail.com
>
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org



-- 

METRO SYSTEMS CO., LTD

Yusuke Iida
Mail: yusk.i...@gmail.com


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Is there any character which must not be used for an attribute name?

2013-06-17 Thread yusuke iida
Hi, Andrew

I used a libqb installed from source.
The version is tag v0.14.4.

I read the Pacemaker code.
"default_ping_set(1)" gets prefixed with "CRM_meta_" and becomes
"CRM_meta_default_ping_set(1)".
Parsing then fails when this is passed to xmlCtxtReadDoc().

Jun  5 14:43:13 vm1 crmd[22669]:error: crm_xml_err: XML Error:
Entity: line 1: parser error : Specification mandate value for
attribute CRM_meta_default_ping_set
Jun  5 14:43:13 vm1 crmd[22669]:error: crm_xml_err: XML Error: 2"
on_node="vm2" on_node_uuid="3232261508">https://github.com/asalkeld/libqb/commit/0532fa56ce3e19184a592c3ae9660e8e9fcc4c54

Regards,
Yusuke

2013/6/12 Andrew Beekhof :
> What version of libqb is installed?
> It doesn't appear to have been installed with yum/rpm.
>
> On 05/06/2013, at 7:56 PM, yusuke iida  wrote:
>
>> Hi, Andrew
>>
>> crmd took out core in the environment which I am using, and the
>> phenomenon of stopping occurred.
>>
>> Pacemaker currently used is the following.
>> Pacemaker-1.1.10-rc3(changeset 7209c02a0e0435024a104d40e2165a18b6288dec)
>>
>> If STONITH occurs in the state where "(" is attached to the attribute
>> of the node, this problem will occur.
>>
>> question:
>> Is there any character which must not be used for an attribute name?
>>
>> Regards,
>> Yusuke
>> --
>> 
>> METRO SYSTEMS CO., LTD
>>
>> Yusuke Iida
>> Mail: yusk.i...@gmail.com
>> 
>> ___
>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
>
>
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org



--

METRO SYSTEMS CO., LTD

Yusuke Iida
Mail: yusk.i...@gmail.com


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[Pacemaker] Implement the ability to change the display options during operation.

2013-06-07 Thread yusuke iida
Hi, Andrew

I implemented a feature that changes the display options while crm_mon
is running.

With this feature, the effort of changing an option and restarting
crm_mon can be saved.
Furthermore, I plan to extend the feature later and would also like to
add the ability to hide the display of individual resources.

How to use:
While crm_mon is running, press the key of the short option that
corresponds to each display option.
The display option is then toggled.
The supported display options are "c f n o r t A".

If the "?" key is pressed, the display-option change screen shown below
appears.
Options marked with "*" are the ones currently enabled.
Option changes by key are accepted on this screen as well.
To return to the original screen, press any key other than an option.
---
Display option change mode

* c:Display cluster tickets
* f:Display resource fail counts
  n:Group resources by node
  o:Display resource operation history
* r:Display inactive resources
  t:Display resource operation history with timing details
* A:Display node attributes

Toggle fields via field letter, type any other key to return
---
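
As a rough sketch of the mechanism (not the actual crm_mon code; the
flag names below are illustrative, not Pacemaker's real ones), the
keypress-driven toggling could look like this:

#include <stdio.h>

#define SHOW_TICKETS    (1 << 0)  /* c */
#define SHOW_FAILCOUNTS (1 << 1)  /* f */
#define GROUP_BY_NODE   (1 << 2)  /* n */
#define SHOW_OPERATIONS (1 << 3)  /* o */
#define SHOW_INACTIVE   (1 << 4)  /* r */
#define SHOW_TIMING     (1 << 5)  /* t */
#define SHOW_ATTRIBUTES (1 << 6)  /* A */

static unsigned toggle_option(unsigned flags, int key)
{
    switch (key) {
        case 'c': return flags ^ SHOW_TICKETS;
        case 'f': return flags ^ SHOW_FAILCOUNTS;
        case 'n': return flags ^ GROUP_BY_NODE;
        case 'o': return flags ^ SHOW_OPERATIONS;
        case 'r': return flags ^ SHOW_INACTIVE;
        case 't': return flags ^ SHOW_TIMING;
        case 'A': return flags ^ SHOW_ATTRIBUTES;
        default:  return flags;   /* any other key: no change */
    }
}

int main(void)
{
    unsigned flags = SHOW_INACTIVE | SHOW_ATTRIBUTES;
    int key;

    /* In crm_mon this would be driven by the curses main loop; here
     * keys are simply read from stdin and the resulting mask printed. */
    while ((key = getchar()) != EOF) {
        flags = toggle_option(flags, key);
        printf("display flags: 0x%02x\n", flags);
    }
    return 0;
}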

This feature is in the pull request below:
https://github.com/ClusterLabs/pacemaker/pull/307

Please merge it if there are no problems.

Best regards,
Yusuke

--
----
METRO SYSTEMS CO., LTD

Yusuke Iida
Mail: yusk.i...@gmail.com


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[Pacemaker] About Paxos algorithm

2013-04-24 Thread yusuke iida
Hi,

I previously asked about the Paxos patents here:
http://www.gossamer-threads.com/lists/linuxha/pacemaker/84309#84309

and now I would like to give an update on our investigation.
Booth uses the Paxos algorithm, which appears to be covered by several US patents.

I believe booth is a very useful piece of software.
I'm really interested in using and developing booth, and also in
confirming that it does not infringe the related patents.

If booth only uses the Paxos algorithm as described in patent US5261085,
we can use booth without concern about patent infringement, because
US5261085 has already expired.

If anyone has information concerning these issues, please let me know.


FYI: our investigation summary

Leslie B. Lamport invented Paxos, and he holds all of the patents below.
He previously worked at DEC and is now at Microsoft.

I searched Google for patents related to Paxos, using the keyword
'Lamport Paxos', and got the following results.

 [Patents related Paxos]
  US5261085  Fault-tolerant system and method for implementing a
distributed state machine
  US7565433  Byzantine paxos
  US7558883  Fast transaction commit
  US7620680  Fast byzantine paxos
  US8005888  Conflict fast consensus
  US7711825  Simplified Paxos
  US7249280  Cheap paxos
  US7856502  Cheap paxos
  US716  Fast Paxos recovery
  US7698465  Generalized Paxos
  US8046413  Automatic commutativity detection for generalized paxos
  US7921424  Systems and methods for the repartitioning of data
  US7797457  Leaderless byzantine consensus
  US7849223  Virtually synchronous Paxos
  US8073897  Selecting values in a distributed computing system
  US20120239722  Read-only operations processing in a paxos replication system


Lamport has published many papers about Paxos, so the Paxos algorithm
is well known. However, that alone is no guarantee that we can use
Paxos freely, because he obtained patents on the work in those papers.

Patent US5261085 is the basis for Paxos. The other Paxos-related
patents appear to be improvements on US5261085.

According to advice from a patent attorney, US5261085 has expired.
In the US, patents are valid for 20 years after they are registered.

Regards,
Yusuke
--

METRO SYSTEMS CO., LTD

Yusuke Iida
Mail: yusk.i...@gmail.com


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Improvement for the communication failure of booth

2013-03-25 Thread yusuke iida
Hi, Jiaju

Sorry for the slow reply.

2013/2/12 Jiaju Zhang :
> Hi Yusuke,
>
>
> Just look at the patch, it seems to me that it wanted to differentiate
> every state like "init", "waiting promise", "promised", "waiting accept"
> and "accepted", etc ... However I'm afraid in this way, it can only
> differentiate "accepted" or "not accepted" (for the "not accepted" case
> here, it will shows "waiting accept").

Yes. I wanted to display the communication state of Paxos, because I
thought it would be useful when analysing communication problems.
I understood the Paxos states as follows.

State "init" means that no Paxos communication has taken place.

State "waiting promise" means that the proposer has transmitted a
"PREPARING" message and is waiting for a "PROMISING" message from an
acceptor.

State "promised" means that the "PROMISING" message has been received
from the acceptor.

"waiting promise" and "promised" are displayed only on the proposer.

State "waiting accept" means that an acceptor has received a
"PROPOSING" message and is waiting for "ACCEPTING" messages from the
other acceptors.

State "accepted" means that the "ACCEPTING" message has been received
from the acceptor.

Is my understanding wrong somewhere?


>
> In acceptor_accepted function,
>
> +   pi->state_monitoring = 1;
> +   for (i = 0; i < booth_conf->node_count; i++)
> +   pi->node[i].connect_state = PROPOSING;
> +
>
> For this acceptor, every node was set to PROPOSING state here, but we cannot
> make sure what state other nodes were in at that moment.

I added this processing because I wanted to put the Paxos state into
"waiting accept".
I reused the PROPOSING enum value to save the effort of defining a new
one. Perhaps a new value such as WAITING_ACCEPT should have been
defined here instead.
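
As a sketch of that idea (illustrative only; these are not booth's
actual enum values or structures), the new value and the per-node
state update could look like this:

#include <stdio.h>

enum paxos_state {
    INIT,
    PREPARING,
    PROMISING,
    PROPOSING,
    ACCEPTING,
    WAITING_ACCEPT,   /* proposed new value instead of reusing PROPOSING */
};

struct node_state {
    int node_id;
    enum paxos_state connect_state;
};

/* When an acceptor accepts a proposal, mark every node as waiting for
 * its ACCEPTING message, instead of reusing the PROPOSING value. */
static void mark_waiting_accept(struct node_state *nodes, int count)
{
    int i;

    for (i = 0; i < count; i++)
        nodes[i].connect_state = WAITING_ACCEPT;
}

int main(void)
{
    struct node_state nodes[3] = { {0, ACCEPTING}, {1, ACCEPTING}, {2, ACCEPTING} };

    mark_waiting_accept(nodes, 3);
    printf("node 0 state: %d (WAITING_ACCEPT is %d)\n",
           nodes[0].connect_state, WAITING_ACCEPT);
    return 0;
}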

Regards,
Yusuke
>
> Thanks,
> Jiaju
>
>
>



--

METRO SYSTEMS CO., LTD

Yusuke Iida
Mail: yusk.i...@gmail.com


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] The correction request of the log of booth

2013-03-06 Thread yusuke iida
Hi, Jiaju

2013/3/6 Jiaju Zhang :
> On Wed, 2013-03-06 at 15:13 +0900, yusuke iida wrote:
>> Hi, Jiaju
>>
>> There is a request about the log of booth.
>>
>> I want you to change a log level when a ticket expires into "info" from 
>> "debug".
>>
>> I think that this log is important since it means what occurred.
>>
>> And I want you to add the following information to log.
>>  * Which ticket is it?
>>  * Who had a ticket?
>>
>> For example, I want you to use the following forms.
>> info: lease expires ... owner [0] ticket [ticketA]
>
> Sounds great, will improve that;)
Thank you for accepting.

Many thanks!
Yusuke

>
> Thanks,
> Jiaju
>



--

METRO SYSTEMS CO., LTD

Yusuke Iida
Mail: yusk.i...@gmail.com


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[Pacemaker] The correction request of the log of booth

2013-03-05 Thread yusuke iida
Hi, Jiaju

I have a request about booth's logging.

I would like the log level of the message written when a ticket
expires to be changed from "debug" to "info".

I think this log message is important because it shows what happened.

I would also like the following information added to the log:
 * Which ticket is it?
 * Who had the ticket?

For example, I would like it to use the following form:
info: lease expires ... owner [0] ticket [ticketA]

diff --git a/src/paxos_lease.c b/src/paxos_lease.c
index 74b41b1..8681ecd 100644
--- a/src/paxos_lease.c
+++ b/src/paxos_lease.c
@@ -153,7 +153,8 @@ static void lease_expires(unsigned long data)
pl_handle_t plh = (pl_handle_t)pl;
struct paxos_lease_result plr;

-   log_debug("lease expires ...");
+   log_info("lease expires ... owner [%d] ticket [%s]",
+   pl->owner, pl->name);
pl->owner = -1;
strcpy(plr.name, pl->name);
plr.owner = -1;


Regards,
Yusuke

--
----
METRO SYSTEMS CO., LTD

Yusuke Iida
Mail: yusk.i...@gmail.com


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Suggestion to improve movement of booth

2013-02-26 Thread yusuke iida
Hi, Jiaju

2013/2/19 Jiaju Zhang :
> On Tue, 2013-02-19 at 18:26 +0900, yusuke iida wrote:
>> Hi, Jiaju
>>
>> Thank you for merging!
>>
>> BTW, is there any comment about the patent that I heard before?
>
> I don't think there are any patent related problems here. There are
> quite a lot of open source projects which are based on Paxos or Paxos
> variant algorithms, such as zookeeper, Ceph monitor cluster, etc.
> Booth used some similar concepts, but didn't follow any specific paper
> or specific implementation. And it should be a yet another
> implementation of Paxos variant, also add some own logic that we think
> are needed in Geo-cluster use cases. So from what I have known, there is
> no this kind of issue, however if there are issues that I don't know,
> please contact me.

Understood.
I will contact you if anything comes up.

Thanks,
Yusuke

>
> Thanks,
> Jiaju
>



--

METRO SYSTEMS CO., LTD

Yusuke Iida
Mail: yusk.i...@gmail.com


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] [Booth] Howto grant ticket to other site, if resource fails

2013-02-20 Thread yusuke iida
Hi, Michael

2013/2/15 Michael Wagner :
> Hi,
>
> I am currently working on a geo-redundant setup with two sites and one 
> arbitrator, using booth as ticket manager.
>
> In case of network-outages, or a whole site is down, booth correctly grants 
> the ticket to the other site (where pacemaker then starts all dependent 
> resources).
>
> What I'm now trying to achieve is, that also in case a resource fails (and 
> can't be restarted) on one site, the ticket should be revoked from this site, 
> and granted to the other. (I.e. also a fail-over in case of issues with 
> single resources, and not only in case of whole sites or networks between 
> them).
>
> I found an old thread regarding that topic, that suggests implementing the 
> behavior in an own daemon:
> http://oss.clusterlabs.org/pipermail/pacemaker/2012-March/013393.html
> Does such a daemon exist, or is someone working on it? (in that case I'd 
> rather contribute, than re-inventing the wheel).
> Or is there any other way to achieve such a behavior with "built 
> in"-functions in pacemaker and booth?

After that discussion, I created a daemon to achieve this goal.

A brief explanation of the daemon (a rough sketch of its main loop
follows the list):
- The daemon monitors the resources that depend on the ticket.
- When a resource that depends on a ticket can no longer be started at
a site, the ticket is moved to another site.
- Pacemaker and Booth are required in order to use the daemon.
- For the moment, it only works in combination with Pacemaker 1.1 from
around April 2012.
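
The sketch below shows the general shape of the daemon's main loop.
It is an illustration only: resource_is_startable() and
move_ticket_to_other_site() are hypothetical stand-ins for a query to
Pacemaker and a call to booth, not real Pacemaker or booth APIs.

#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical: would ask Pacemaker whether the ticket-dependent
 * resource can still be started somewhere in this site. */
static bool resource_is_startable(const char *resource)
{
    (void)resource;
    return true;    /* stub for the sketch */
}

/* Hypothetical: would ask booth to revoke the ticket here and grant
 * it to the other site. */
static void move_ticket_to_other_site(const char *ticket)
{
    printf("moving ticket %s to the other site\n", ticket);
}

int main(void)
{
    const char *ticket   = "ticketA";
    const char *resource = "dummy";

    for (;;) {
        if (!resource_is_startable(resource))
            move_ticket_to_other_site(ticket);
        sleep(10);  /* illustrative poll interval */
    }
    return 0;
}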

If that older version is enough, I think I can publish the source by
the end of March.
In that case, I hope it can be incorporated into the Booth repository.

In the future, we plan to modify it to work in combination with the
latest Pacemaker.

Regards,
Yusuke

>
> Best Regards,
> Michael
>
> ___
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org



--

METRO SYSTEMS CO., LTD

Yusuke Iida
Mail: yusk.i...@gmail.com


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Suggestion to improve movement of booth

2013-02-19 Thread yusuke iida
Hi, Jiaju

Thank you for merging!

BTW, is there any comment on the patent question I asked about before?

Thanks,
Yusuke

2013/2/18 Jiaju Zhang :
> On Mon, 2013-02-18 at 14:12 +0900, yusuke iida wrote:
>> Hi, Jiaju
>>
>> It understood leaving processing of is_prepared().
>> I did "pull request" of judgment processing.
>> Please confirm it.
>>
>> https://github.com/jjzhang/booth/pull/50
>
> Pulled, thanks!
>
> Thanks,
> Jiaju
>



--
----
METRO SYSTEMS CO., LTD

Yusuke Iida
Mail: yusk.i...@gmail.com


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Suggestion to improve movement of booth

2013-02-17 Thread yusuke iida
Hi, Jiaju

I understand about keeping the is_prepared() processing.
I have opened a pull request with the judgment processing.
Please take a look at it.

https://github.com/jjzhang/booth/pull/50

Regards,
Yusuke

2013/2/18 Jiaju Zhang :
> On Thu, 2013-02-14 at 11:36 +0900, yusuke iida wrote:
>> Hi, Jiaju
>>
>> Thank you for reply!
>>
>> 2013/2/12 Jiaju Zhang :
>> > Hi Yusuke,
>> >
>> > On Tue, 2013-02-12 at 13:08 +0900, yusuke iida wrote:
>> >> Hi, Jiaju
>> >>
>> >> Could you please reply to this message?
>> >
>> > My suggestion is no need to add additional configuration options, in
>> > this case, the behavior of booth is definitive, return a negative value
>> > in lease_promise.
>>
>> I understood it.
>> According to your suggestion, I cancel the additional setting option.
>> I change the judgment of current lease_promise to the form according
>> to "master lease".
>>
>> When this logic is taken in, "hdr->leased = 1" thinks that processing
>> of lease_is_prepare becomes unnecessary since it will not be
>> transmitted.
>
> Yes, actually it would be necessary.
>
>> Is this processing still required?
>
> The processing of judging hdr->leased in is_prepared function is not
> required any more, but I'd like to keep the is_prepared callback there,
> in case we need it in the future.
>
> Thanks,
> Jiaju
>
>



-- 

METRO SYSTEMS CO., LTD

Yusuke Iida
Mail: yusk.i...@gmail.com



___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Suggestion to improve movement of booth

2013-02-13 Thread yusuke iida
Hi, Jiaju

Thank you for reply!

2013/2/12 Jiaju Zhang :
> Hi Yusuke,
>
> On Tue, 2013-02-12 at 13:08 +0900, yusuke iida wrote:
>> Hi, Jiaju
>>
>> Could you please reply to this message?
>
> My suggestion is no need to add additional configuration options, in
> this case, the behavior of booth is definitive, return a negative value
> in lease_promise.

Understood.
Following your suggestion, I will drop the additional configuration
option and change the current lease_promise judgment to the "master
lease" form.

Once this logic is in place, "hdr->leased = 1" will no longer be
transmitted, so I think the lease_is_prepare processing becomes
unnecessary.
Is this processing still required?

Regards,
Yusuke.
>
> That being said, for your patch, we only need the block
>
> +   if (hdr->leased == 1) {
> +   log_error("the proposal collided.");
> +   return -1;
> +   }
> +
>
> Thanks,
> Jiaju
>



--

METRO SYSTEMS CO., LTD

Yusuke Iida
Mail: yusk.i...@gmail.com


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Suggestion to improve movement of booth

2013-02-11 Thread yusuke iida
Hi, Jiaju

Could you please reply to this message?

Regards,
Yusuke.

2013/1/28 yusuke iida :
> Hi, Jiaju
>
> 2013/1/15 Jiaju Zhang :
>> On Tue, 2013-01-15 at 11:28 +0900, yusuke iida wrote:
>>> Hi, Jiaju
>>>
>>> 2013/1/11 Jiaju Zhang :
>>> > Hi Yusuke,
>>> >
>>> > Sorry for the late reply;)
>>> >
>>> > On Mon, 2013-01-07 at 13:50 +0900, yusuke iida wrote:
>>> >> Hi, Jiaju
>>> >>
>>> >> When the proposal was conflict, I want to keep on the site of the
>>> >> original lease.
>>> >> I do not want to generate a revoke when maintained.
>>> >>
>>> >>
>>> >> I made a patch according to a thought of "5.2 Master lease" described
>>> >> in the next article.
>>> >> It means that it prevents you from accepting new suggestion until a
>>> >> time limit of lease expires.
>>> >
>>> > Exactly.
>>> >
>>> >>
>>> >> http://www.read.seas.harvard.edu/~kohler/class/08w-dsi/chandra07paxos.pdf
>>> >>
>>> >> Is there anything wrong with this idea?
>>> >
>>> > This idea is totally right. But we have already implemented it. When the
>>> > master exists and is still in its lease, if some other one wants to be
>>> > the master and sent the "prepare" message, the acceptor will told him
>>> > "oh, we have already had a master" by setting "hdr->leased = 1" in his
>>> > respond message, actually this is a rejection, then the one trying to be
>>> > master won't succeed.
>>> I understand these specifications.
>>> However, by the present specification, when returning "hdr->leased =
>>> 1", "highest_promised" is updated by ballot of new "prepare".
>>> When "highest_promised" is updated, reaccession of lease is carried
>>> out in original masters.
>>> Since revoke is performed at this time, the node which the resource
>>> was start(ing) is STONITH(ed) by loss-policy=fence.
>>> As for this, the stop of temporary service happens.
>>> To avoid this, I've implemented the change not to do to re-acquire the 
>>> lease.
>>
>> Understood, this is an important fix. However, it seems that there is an
>> easier way to fix this, just change the return value of "lease_promise",
>> that is to say, return -1 when having leased.
>>
>> I'm inclined to do so because the new function "lease_is_leased"
>> basically did the same thing with "lease_promise", but "lease_promise"
>> returned a wrong value currently. What do you think?;)
> I reconsidered this processing.
> To be sure, "lease_is_leased" is redundant.
> I moved a function of "lease_is_leased" to "lease_promise".
> I'm not sure I should leave the "ability to get re-lease" existing.
> With the patch which I made, a function is changed by the configuration file.
>
> The patch which I made again is here.
> https://github.com/yuusuke/booth/commit/f99e74f38b51b89bf2deccb0eb2249a12fb45bb6
>
>>
>> BTW, don't forget to add your Signed-off-by line and check the patch
>> with checkpatch.pl, this will make booth to use the same coding style;)
> I understood it.
>
> BTW, we have noticed that some patents relevant to paxos exist, while
> investigating about the technology of paxos.
> Has the paxos algorithm used by booth cleared the problem about such a patent?
>
> Regards,
> Yusuke
>>
>> Thanks,
>> Jiaju
>>
>>
>>
>
> --
> 
> METRO SYSTEMS CO., LTD
>
> Yusuke Iida
> Mail: yusk.i...@gmail.com
> 



-- 

METRO SYSTEMS CO., LTD

Yusuke Iida
Mail: yusk.i...@gmail.com


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Improvement for the communication failure of booth

2013-02-11 Thread yusuke iida
Hi, Jiaju

2012/12/18 Jiaju Zhang :
> Good suggestion! I think it may need to introduce a notifier callback so
> that the failure of communicating with the problematic node can be
> notified to the "active" node. This makes sense for the active node,
> because it will make the admin know how many healthy "passive" nodes
> currently there are and any potential issues might be resolved in
> advance.
>
> Regarding the implementation of this feature, I think it is doable
> although it may need a lot of changes;)

I have implemented the status display.

This feature displays the Paxos communication state of each site.
To view the status, we extended the output of the "booth client list"
command.

The display format is as follows.
# booth client list
ticket: ticketA, owner: None, expires: INF
site: <IP address>, state: <state>

Below are examples of the status display.

Configuration: 3 sites, 1 ticket
siteA: 192.168.201.131(grant)
siteB: 192.168.201.132
siteC: 192.168.201.133

"""initial state"""
siteA # booth client list
ticket: ticketA, owner: 192.168.201.131, expires: 2013/02/08 20:21:54
site: 192.168.201.131, state: accepted
site: 192.168.201.132, state: accepted
site: 192.168.201.133, state: accepted

siteB # booth client list
ticket: ticketA, owner: 192.168.201.131, expires: 2013/02/08 20:21:53
site: 192.168.201.131, state: accepted
site: 192.168.201.132, state: accepted
site: 192.168.201.133, state: accepted

siteC # booth client list
ticket: ticketA, owner: 192.168.201.131, expires: 2013/02/08 20:21:54
site: 192.168.201.131, state: accepted
site: 192.168.201.132, state: accepted
site: 192.168.201.133, state: accepted

"""siteA is down"""
siteB # booth client list
ticket: ticketA, owner: 192.168.201.133, expires: 2013/02/12 11:00:45
site: 192.168.201.131, state: waiting accept
site: 192.168.201.132, state: accepted
site: 192.168.201.133, state: accepted

siteC # booth client list
ticket: ticketA, owner: 192.168.201.133, expires: 2013/02/12 11:00:46
site: 192.168.201.131, state: waiting accept
site: 192.168.201.132, state: accepted
site: 192.168.201.133, state: accepted

"""siteB is down"""
siteA # booth client list
ticket: ticketA, owner: 192.168.201.131, expires: 2013/02/12 11:14:41
site: 192.168.201.131, state: accepted
site: 192.168.201.132, state: waiting accept
site: 192.168.201.133, state: accepted

siteC # booth client list
ticket: ticketA, owner: 192.168.201.131, expires: 2013/02/12 11:14:41
site: 192.168.201.131, state: accepted
site: 192.168.201.132, state: waiting accept
site: 192.168.201.133, state: accepted

"""communication blockade between siteB-siteC"""
siteA # booth client list
ticket: ticketA, owner: 192.168.201.131, expires: 2013/02/12 11:33:20
site: 192.168.201.131, state: accepted
site: 192.168.201.132, state: accepted
site: 192.168.201.133, state: accepted

siteB # booth client list
ticket: ticketA, owner: 192.168.201.131, expires: 2013/02/12 11:33:19
site: 192.168.201.131, state: accepted
site: 192.168.201.132, state: accepted
site: 192.168.201.133, state: waiting accept

siteC # booth client list
ticket: ticketA, owner: 192.168.201.131, expires: 2013/02/12 11:33:19
site: 192.168.201.131, state: accepted
site: 192.168.201.132, state: waiting accept
site: 192.168.201.133, state: accepted

I would like this merged into the repository if there are no problems.
https://github.com/jjzhang/booth/pull/49

Best regards,
Yusuke.

--

METRO SYSTEMS CO., LTD

Yusuke Iida
Mail: yusk.i...@gmail.com


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Suggestion to improve movement of booth

2013-01-27 Thread yusuke iida
Hi, Jiaju

2013/1/15 Jiaju Zhang :
> On Tue, 2013-01-15 at 11:28 +0900, yusuke iida wrote:
>> Hi, Jiaju
>>
>> 2013/1/11 Jiaju Zhang :
>> > Hi Yusuke,
>> >
>> > Sorry for the late reply;)
>> >
>> > On Mon, 2013-01-07 at 13:50 +0900, yusuke iida wrote:
>> >> Hi, Jiaju
>> >>
>> >> When the proposal was conflict, I want to keep on the site of the
>> >> original lease.
>> >> I do not want to generate a revoke when maintained.
>> >>
>> >>
>> >> I made a patch according to a thought of "5.2 Master lease" described
>> >> in the next article.
>> >> It means that it prevents you from accepting new suggestion until a
>> >> time limit of lease expires.
>> >
>> > Exactly.
>> >
>> >>
>> >> http://www.read.seas.harvard.edu/~kohler/class/08w-dsi/chandra07paxos.pdf
>> >>
>> >> Is there anything wrong with this idea?
>> >
>> > This idea is totally right. But we have already implemented it. When the
>> > master exists and is still in its lease, if some other one wants to be
>> > the master and sent the "prepare" message, the acceptor will told him
>> > "oh, we have already had a master" by setting "hdr->leased = 1" in his
>> > respond message, actually this is a rejection, then the one trying to be
>> > master won't succeed.
>> I understand these specifications.
>> However, by the present specification, when returning "hdr->leased =
>> 1", "highest_promised" is updated by ballot of new "prepare".
>> When "highest_promised" is updated, reaccession of lease is carried
>> out in original masters.
>> Since revoke is performed at this time, the node which the resource
>> was start(ing) is STONITH(ed) by loss-policy=fence.
>> As for this, the stop of temporary service happens.
>> To avoid this, I've implemented the change not to do to re-acquire the lease.
>
> Understood, this is an important fix. However, it seems that there is an
> easier way to fix this, just change the return value of "lease_promise",
> that is to say, return -1 when having leased.
>
> I'm inclined to do so because the new function "lease_is_leased"
> basically did the same thing with "lease_promise", but "lease_promise"
> returned a wrong value currently. What do you think?;)
I have reconsidered this processing.
Indeed, "lease_is_leased" is redundant, so I moved its function into
"lease_promise".
I'm not sure whether the "ability to re-acquire the lease" should be
kept; with the patch I made, the behaviour can be switched via the
configuration file.

The revised patch is here:
https://github.com/yuusuke/booth/commit/f99e74f38b51b89bf2deccb0eb2249a12fb45bb6
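
As a sketch of the behaviour under discussion (illustrative only, not
booth's actual code; the structure fields are made up for the
example): while a lease is still held, a competing "prepare" is
rejected and highest_promised is left unchanged, so the current owner
never has to re-acquire the lease.

#include <stdio.h>

struct lease_state {
    int owner;              /* -1 when not leased             */
    int leased;             /* 1 while the lease is valid     */
    int highest_promised;   /* highest ballot promised so far */
};

static int lease_promise(struct lease_state *l, int ballot)
{
    if (l->leased) {
        /* Reject the competing proposer; highest_promised stays
         * unchanged so the original owner can keep renewing. */
        return -1;
    }
    if (ballot <= l->highest_promised)
        return -1;

    l->highest_promised = ballot;
    return 0;
}

int main(void)
{
    struct lease_state l = { .owner = 0, .leased = 1, .highest_promised = 5 };

    /* A competing prepare is rejected while the lease is held. */
    printf("promise while leased: %d\n", lease_promise(&l, 6));

    /* After the lease expires, a higher ballot is promised again. */
    l.leased = 0;
    printf("promise after expiry: %d\n", lease_promise(&l, 6));
    return 0;
}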

>
> BTW, don't forget to add your Signed-off-by line and check the patch
> with checkpatch.pl, this will make booth to use the same coding style;)
Understood.

BTW, while investigating the Paxos technology, we noticed that some
patents related to Paxos exist.
Is the Paxos algorithm used by booth clear of any issues with such patents?

Regards,
Yusuke
>
> Thanks,
> Jiaju
>
>
>

--

METRO SYSTEMS CO., LTD

Yusuke Iida
Mail: yusk.i...@gmail.com


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Suggestion to improve movement of booth

2013-01-14 Thread yusuke iida
Hi, Jiaju

2013/1/11 Jiaju Zhang :
> Hi Yusuke,
>
> Sorry for the late reply;)
>
> On Mon, 2013-01-07 at 13:50 +0900, yusuke iida wrote:
>> Hi, Jiaju
>>
>> When the proposal was conflict, I want to keep on the site of the
>> original lease.
>> I do not want to generate a revoke when maintained.
>>
>>
>> I made a patch according to a thought of "5.2 Master lease" described
>> in the next article.
>> It means that it prevents you from accepting new suggestion until a
>> time limit of lease expires.
>
> Exactly.
>
>>
>> http://www.read.seas.harvard.edu/~kohler/class/08w-dsi/chandra07paxos.pdf
>>
>> Is there anything wrong with this idea?
>
> This idea is totally right. But we have already implemented it. When the
> master exists and is still in its lease, if some other one wants to be
> the master and sent the "prepare" message, the acceptor will told him
> "oh, we have already had a master" by setting "hdr->leased = 1" in his
> respond message, actually this is a rejection, then the one trying to be
> master won't succeed.
I understand that specification.
However, in the current implementation, even when "hdr->leased = 1" is
returned, "highest_promised" is updated to the ballot of the new
"prepare".
When "highest_promised" is updated, the original master re-acquires
the lease.
Since a revoke is performed at that point, the node where the resource
was running is fenced (STONITH) by loss-policy=fence, which causes a
temporary service outage.
To avoid this, I implemented a change so that the lease does not have
to be re-acquired.

Regards,
Yusuke
>
> It seems your patch is also doing this, right?
>
> Thanks,
> Jiaju
>



--

METRO SYSTEMS CO., LTD

Yusuke Iida
Mail: yusk.i...@gmail.com


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Suggestion to improve movement of booth

2013-01-06 Thread yusuke iida
Hi, Jiaju

When proposals conflict, I want the lease to stay at the site that
originally holds it.
I do not want a revoke to be generated while the lease is maintained.

I made a patch based on the idea of "5.2 Master lease" described in
the following paper.
It prevents new proposals from being accepted until the lease expires.

http://www.read.seas.harvard.edu/~kohler/class/08w-dsi/chandra07paxos.pdf

Is there anything wrong with this idea?

Regards,
Yusuke

2012/12/12 Jiaju Zhang :
> Hi Yusuke,
>
> On Fri, 2012-11-30 at 21:30 +0900, yusuke iida wrote:
>> Hi, Jiaju
>>
>>
>>
>> When communication of a part of proposer and acceptor goes out,
>> re-acquisition of lease is temporarily performed by proposer.
>>
>> Since a ticket will be temporarily revoke(d) at this time, service
>> will stop temporarily.
>>
>> I think that this is a problem.
>>
>> I hope that lease of the ticket is held.
>
> This is what I wanted to do as well;) That is to say, the lease should
> keep renewing on the original site successfully unless it was down.
> Current implementation is to let the original site renew the ticket
> before ticket lease expires (only when lease expires the ticket will be
> revoked), hence, before other sites tries to acquire the ticket, the
> original site has renewed the ticket already, so the result is the
> ticket is still on that site.
>
> I'm not quite understand your problem here. Is that the lease not
> keeping in the original site?
>
> Thanks,
> Jiaju
>
>>
>>
>>
>> I thought about a plan to prevent movement to become the reaccession
>> of lease.
>>
>> When proposer continues updating lease, I think that you should refuse
>> a message from new proposer.
>>
>> In order to remain at the existing behavior, I want to be switched
>> according to the setting.
>>
>>
>>
>> I wrote the patch about this proposal.
>>
>> https://github.com/yuusuke/booth/commit/6b82fda7b4220c418ff906a9cf8152fe88032566
>>
>>
>>
>> What do you think about this proposal?
>>
>>
>>
>> Best regards,
>> Yuusuke
>> --
>> 
>> METRO SYSTEMS CO., LTD
>>
>> Yuusuke Iida
>> Mail: yusk.i...@gmail.com
>> 
>
>



--

METRO SYSTEMS CO., LTD

Yusuke Iida
Mail: yusk.i...@gmail.com



___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[Pacemaker] Improvement for the communication failure of booth

2012-12-17 Thread yusuke iida
Hi, Jiaju

I would like to add a feature that displays the communication state in
booth.
In the current booth, when communication between sites is lost, no
error is reported, so the user cannot notice the problem.
To solve this, I think a new variable should be defined that records
the communication state of Paxos, and I want to display that state via
the client command.
Is this idea realistic?
Is there any other good idea?

Regards,
Yusuke
--

METRO SYSTEMS CO., LTD

Yusuke Iida
Mail: yusk.i...@gmail.com


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Suggestion to improve movement of booth

2012-12-14 Thread yusuke iida
Hi, Jiaju

Thanks for the reply.

2012/12/12 Jiaju Zhang :
> This is what I wanted to do as well;) That is to say, the lease should
> keep renewing on the original site successfully unless it was down.
> Current implementation is to let the original site renew the ticket
> before ticket lease expires (only when lease expires the ticket will be
> revoked), hence, before other sites tries to acquire the ticket, the
> original site has renewed the ticket already, so the result is the
> ticket is still on that site.
>
> I'm not quite understand your problem here. Is that the lease not
> keeping in the original site?

When the lease is re-acquired, the ticket is temporarily revoked, so
loss-policy takes effect.

For example, when loss-policy is fence, the node on which the resource
was started is fenced (STONITH) by this temporary revoke.

At that point, service stops until it is restarted on another node or site.

I would like to prevent this behavior.

My modification is that while the original site keeps renewing the
lease, "prepare" messages from other sites are not promised.

I think this correction avoids the unnecessary revoke.

Regards,
Yusuke

--
----
METRO SYSTEMS CO., LTD

Yusuke Iida
Mail: yusk.i...@gmail.com


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[Pacemaker] Suggestion to improve movement of booth

2012-11-30 Thread yusuke iida
Hi, Jiaju

When communication between the proposer and some of the acceptors is
lost, the proposer temporarily re-acquires the lease.
Since the ticket is temporarily revoked at that point, service stops
temporarily.
I think this is a problem.
I would like the lease on the ticket to be held.

I have thought of a way to prevent the behavior that leads to
re-acquisition of the lease.
While the proposer keeps renewing the lease, I think messages from a
new proposer should be refused.
To allow the existing behavior to be kept, I want this to be
switchable via the configuration.

I wrote a patch for this proposal:
https://github.com/yuusuke/booth/commit/6b82fda7b4220c418ff906a9cf8152fe88032566

What do you think about this proposal?

Best regards,
Yuusuke
-- 

METRO SYSTEMS CO., LTD

Yuusuke Iida
Mail: yusk.i...@gmail.com

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org