Re: [Pacemaker] Resource capacity limit
On Fri, Oct 30, 2009 at 12:41 PM, Yan Gao wrote:
> Hi Andrew and Lars,
> The attachment is the first try to implement "Resource capacity limit",
> which was filed by Lars at:
> https://fate.novell.com/303384
>
> Description:
> We need a mechanism for the PE to take resource weight into account to
> prevent nodes from being overloaded.
>
> Resources would require certain minimal values for node attributes
> (this is available right now); however, they would also "consume" them,
> reducing the value of the node attributes for further resource placement.
> (This could be a special flag in the rsc_location rule, for example.)
> If a node does not have enough capacity available, it is not considered.
> ..
>
> Use case:
> Xen guests have memory requirements; nodes cannot host more guests than
> the node has physical memory installed.
>
> Configuration example:
>
> node yingying \
>     attributes capacity="100"
> primitive dummy0 ocf:heartbeat:Dummy \
>     meta weight="90" priority="2"
> primitive dummy1 ocf:heartbeat:Dummy \
>     meta weight="60" priority="1"
> ..
> property $id="cib-bootstrap-options" \
>     limit-capacity="true"
> ..
>
> Because dummy0 has the higher priority, it will run on node "yingying".
> Since that node then has only "10" (100 - 90) capacity remaining, dummy1
> cannot run on it. If there is no other node it can run on, dummy1 will
> be stopped.
>
> If we don't want to enable the capacity limit, we can set the property
> "limit-capacity" to "false", or leave it at its default.
>
> What do you think about the way it's implemented? Did I do it right?

Just one question: why the new cluster property? Didn't we already have
placement-strategy for that purpose?

> I also noticed a likely similar planned feature described in
> http://clusterlabs.org/wiki/Planned_Features
>
> "Implement adaptive service placement (based on the RAM, CPU etc.
> required by the service and made available by the nodes)"
>
> Indeed, this try only supports a single kind of capacity, and it's not
> adaptive... Do you already have a thorough design in mind for this
> feature?
> Any comments or suggestions are appreciated. Thanks!
>
> Regards,
> Yan
> --
> y...@novell.com
> Software Engineer
> China Server Team, OPS Engineering
>
> Novell, Inc.
> Making IT Work As One™

___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Re: [Pacemaker] Resource capacity limit
Hi Andrew,
I added utilization support for crm_attribute and crm_resource. Attached
the patch. Please let me know if you have any comments or suggestions on
that.

BTW, LF#2351 has been fixed. Attached the patch there:
http://developerbugs.linux-foundation.org/show_bug.cgi?id=2351

Thanks,
Yan
--
Yan Gao
Software Engineer
China Server Team, OPS Engineering, Novell, Inc.

# HG changeset patch
# User Yan Gao
# Date 1267696081 -28800
# Node ID 692f8d2fa65b1e956f450ccb0664fc50f6f8b7bb
# Parent  a6b66ac53fa969658860964b186548bc514c9455
Dev: Tools: Add utilization support for crm_attribute and crm_resource

diff -r a6b66ac53fa9 -r 692f8d2fa65b crmd/control.c
--- a/crmd/control.c	Thu Mar 04 08:40:17 2010 +0100
+++ b/crmd/control.c	Thu Mar 04 17:48:01 2010 +0800
@@ -86,7 +86,7 @@
     } else {
 	int rc = update_attr(
 	    fsa_cib_conn, cib_quorum_override|cib_scope_local|cib_inhibit_notify,
-	    XML_CIB_TAG_CRMCONFIG, NULL, NULL, NULL, XML_ATTR_EXPECTED_VOTES, votes, FALSE);
+	    XML_CIB_TAG_CRMCONFIG, NULL, NULL, NULL, NULL, XML_ATTR_EXPECTED_VOTES, votes, FALSE);
 	crm_info("Setting expected votes to %s", votes);
 	if(cib_ok > rc) {
diff -r a6b66ac53fa9 -r 692f8d2fa65b crmd/election.c
--- a/crmd/election.c	Thu Mar 04 08:40:17 2010 +0100
+++ b/crmd/election.c	Thu Mar 04 17:48:01 2010 +0800
@@ -444,10 +444,10 @@
     add_cib_op_callback(fsa_cib_conn, rc, FALSE, NULL, feature_update_callback);

     update_attr(fsa_cib_conn, cib_none, XML_CIB_TAG_CRMCONFIG,
-		NULL, NULL, NULL, "dc-version", VERSION"-"BUILD_VERSION, FALSE);
+		NULL, NULL, NULL, NULL, "dc-version", VERSION"-"BUILD_VERSION, FALSE);

     update_attr(fsa_cib_conn, cib_none, XML_CIB_TAG_CRMCONFIG,
-		NULL, NULL, NULL, "cluster-infrastructure", cluster_type, FALSE);
+		NULL, NULL, NULL, NULL, "cluster-infrastructure", cluster_type, FALSE);

     mainloop_set_trigger(config_read);
     free_xml(cib);
diff -r a6b66ac53fa9 -r 692f8d2fa65b crmd/lrm.c
--- a/crmd/lrm.c	Thu Mar 04 08:40:17 2010 +0100
+++ b/crmd/lrm.c	Thu Mar 04 17:48:01 2010 +0800
@@ -1156,7 +1156,7 @@
 	     from_sys, rsc->id);
     update_attr(fsa_cib_conn, cib_none, XML_CIB_TAG_CRMCONFIG,
-		NULL, NULL, NULL, "last-lrm-refresh", now_s, FALSE);
+		NULL, NULL, NULL, NULL, "last-lrm-refresh", now_s, FALSE);
     crm_free(now_s);
 }
diff -r a6b66ac53fa9 -r 692f8d2fa65b include/crm/cib_util.h
--- a/include/crm/cib_util.h	Thu Mar 04 08:40:17 2010 +0100
+++ b/include/crm/cib_util.h	Thu Mar 04 17:48:01 2010 +0800
@@ -50,21 +50,21 @@
 extern enum cib_errors update_attr(
 	cib_t *the_cib, int call_options,
-	const char *section, const char *node_uuid, const char *set_name,
+	const char *section, const char *node_uuid, const char *set_type, const char *set_name,
 	const char *attr_id, const char *attr_name, const char *attr_value, gboolean to_console);

 extern enum cib_errors find_nvpair_attr(
-cib_t *the_cib, const char *attr, const char *section, const char *node_uuid, const char *set_name,
-const char *attr_id, const char *attr_name, gboolean to_console, char **value);
+cib_t *the_cib, const char *attr, const char *section, const char *node_uuid, const char *set_type,
+const char *set_name, const char *attr_id, const char *attr_name, gboolean to_console, char **value);

 extern enum cib_errors read_attr(
 	cib_t *the_cib,
-	const char *section, const char *node_uuid, const char *set_name,
+	const char *section, const char *node_uuid, const char *set_type, const char *set_name,
 	const char *attr_id, const char *attr_name, char **attr_value, gboolean to_console);

 extern enum cib_errors delete_attr(
 	cib_t *the_cib, int options,
-	const char *section, const char *node_uuid, const char *set_name,
+	const char *section, const char *node_uuid, const char *set_type, const char *set_name,
 	const char *attr_id, const char *attr_name, const char *attr_value, gboolean to_console);

 extern enum cib_errors query_node_uuid(
diff -r a6b66ac53fa9 -r 692f8d2fa65b lib/cib/cib_attrs.c
--- a/lib/cib/cib_attrs.c	Thu Mar 04 08:40:17 2010 +0100
+++ b/lib/cib/cib_attrs.c	Thu Mar 04 17:48:01 2010 +0800
@@ -47,8 +47,8 @@
 extern enum cib_errors find_nvpair_attr(
-cib_t *the_cib, const char *attr, const char *section, const char *node_uuid, const char *set_name,
-const char *attr_id, const char *attr_name, gboolean to_console, char **value)
+cib_t *the_cib, const char *attr, const char *section, const char *node_uuid, const char *attr_set_type,
+const char *set_name, const char *attr_id, const char *attr_name, gboolean to_console, char **value)
 {
     int offset = 0;
     static int xpath_max = 1024;
@@ -56,7 +56,13 @@
     char *xpath_string = NULL;
     xmlNode *xml_search = NULL;

-    const char *set_type = XML_TAG_ATTR_SETS;
+    const char *set_type = NULL;
+
+    if (attr_set_type) {
+	set_type = attr_set_type;
+    } else {
+	set_type = XML_TAG_ATTR_SETS;
+    }

     CRM_ASSERT(value != NULL);
     *value = NULL;
@@
Re: [Pacemaker] Resource capacity limit
Hi Andrew,

Yan Gao wrote:
> On 12/10/09 12:56, Yan Gao wrote:
>> Hi Andrew,
>> Attached the hg export patch against the devel branch for that. Hope
>> that's easier to be merged:-)
> And the patch including the test cases.
The *.score files were given the wrong file name extension rather than
".scores". :-\ Attached the patches to rename them. Sorry about that!

Thanks,
Yan
--
y...@novell.com
Software Engineer
China Server Team, OPS Engineering

Novell, Inc.
Making IT Work As One™

# HG changeset patch
# User Yan Gao
# Date 1260773202 -28800
# Node ID ab416ebb0734839d6e0e51e7f2fb9dac4832a50f
# Parent  f182beaeedab79278301ffb1bb2207e20f25f87f
Low: PE: Repair the file name extensions for several test scores files

diff -r f182beaeedab -r ab416ebb0734 pengine/test10/balanced.score
--- a/pengine/test10/balanced.score	Fri Dec 11 20:19:24 2009 +0100
+++ /dev/null	Thu Jan 01 00:00:00 1970 +0000
@@ -1,5 +0,0 @@
-Allocation scores:
-native_color: rsc1 allocation score on host1: 0
-native_color: rsc1 allocation score on host2: 0
-native_color: rsc2 allocation score on host1: 0
-native_color: rsc2 allocation score on host2: 0
diff -r f182beaeedab -r ab416ebb0734 pengine/test10/balanced.scores
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/pengine/test10/balanced.scores	Mon Dec 14 14:46:42 2009 +0800
@@ -0,0 +1,5 @@
+Allocation scores:
+native_color: rsc1 allocation score on host1: 0
+native_color: rsc1 allocation score on host2: 0
+native_color: rsc2 allocation score on host1: 0
+native_color: rsc2 allocation score on host2: 0
diff -r f182beaeedab -r ab416ebb0734 pengine/test10/minimal.score
--- a/pengine/test10/minimal.score	Fri Dec 11 20:19:24 2009 +0100
+++ /dev/null	Thu Jan 01 00:00:00 1970 +0000
@@ -1,5 +0,0 @@
-Allocation scores:
-native_color: rsc1 allocation score on host1: 0
-native_color: rsc1 allocation score on host2: 0
-native_color: rsc2 allocation score on host1: 0
-native_color: rsc2 allocation score on host2: 0
diff -r f182beaeedab -r ab416ebb0734 pengine/test10/minimal.scores
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/pengine/test10/minimal.scores	Mon Dec 14 14:46:42 2009 +0800
@@ -0,0 +1,5 @@
+Allocation scores:
+native_color: rsc1 allocation score on host1: 0
+native_color: rsc1 allocation score on host2: 0
+native_color: rsc2 allocation score on host1: 0
+native_color: rsc2 allocation score on host2: 0
diff -r f182beaeedab -r ab416ebb0734 pengine/test10/utilization.score
--- a/pengine/test10/utilization.score	Fri Dec 11 20:19:24 2009 +0100
+++ /dev/null	Thu Jan 01 00:00:00 1970 +0000
@@ -1,5 +0,0 @@
-Allocation scores:
-native_color: rsc2 allocation score on host1: 0
-native_color: rsc2 allocation score on host2: 0
-native_color: rsc1 allocation score on host1: 0
-native_color: rsc1 allocation score on host2: 0
diff -r f182beaeedab -r ab416ebb0734 pengine/test10/utilization.scores
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/pengine/test10/utilization.scores	Mon Dec 14 14:46:42 2009 +0800
@@ -0,0 +1,5 @@
+Allocation scores:
+native_color: rsc2 allocation score on host1: 0
+native_color: rsc2 allocation score on host2: 0
+native_color: rsc1 allocation score on host1: 0
+native_color: rsc1 allocation score on host2: 0
Re: [Pacemaker] Resource capacity limit
Andrew Beekhof wrote:
> On Thu, Dec 10, 2009 at 9:52 AM, Yan Gao wrote:
>> On 12/10/09 12:56, Yan Gao wrote:
>>> Hi Andrew,
>>> Attached the hg export patch against the devel branch for that. Hope
>>> that's easier to be merged:-)
>> And the patch including the test cases.
>
> Done. Thanks for your efforts!
Thanks for taking care of them!

--
Regards,
Yan Gao
y...@novell.com
Software Engineer
China Server Team, OPS Engineering

Novell, Inc.
Making IT Work As One™
Re: [Pacemaker] Resource capacity limit
On Thu, Dec 10, 2009 at 9:52 AM, Yan Gao wrote:
> On 12/10/09 12:56, Yan Gao wrote:
>> Hi Andrew,
>> Attached the hg export patch against the devel branch for that. Hope
>> that's easier to be merged:-)
> And the patch including the test cases.

Done. Thanks for your efforts!
Re: [Pacemaker] Resource capacity limit
On 12/10/09 12:56, Yan Gao wrote:
> Hi Andrew,
> Attached the hg export patch against the devel branch for that. Hope
> that's easier to be merged:-)
And the patch including the test cases.

Thanks,
Yan
--
y...@novell.com
Software Engineer
China Server Team, OPS Engineering

Novell, Inc.
Making IT Work As One™

# HG changeset patch
# User Yan Gao
# Date 1260434891 -28800
# Node ID c8013f5b53f018eb5ca2667e0170810e45257489
# Parent  456f25dc72b805e12e5dc32fb23ea8dbe5b8103c
Low: PE: Add regression tests for the new placement strategies

diff -r 456f25dc72b8 -r c8013f5b53f0 pengine/regression.sh
--- a/pengine/regression.sh	Thu Dec 10 12:37:46 2009 +0800
+++ b/pengine/regression.sh	Thu Dec 10 16:48:11 2009 +0800
@@ -328,5 +328,10 @@
 do_test systemhealthp3 "System Health (Progessive) #3"

 echo ""
+do_test utilization "Placement Strategy - utilization"
+do_test minimal "Placement Strategy - minimal"
+do_test balanced "Placement Strategy - balanced"
+
+echo ""
 test_results
diff -r 456f25dc72b8 -r c8013f5b53f0 pengine/test10/balanced.dot
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/pengine/test10/balanced.dot	Thu Dec 10 16:48:11 2009 +0800
@@ -0,0 +1,19 @@
+digraph "g" {
+"probe_complete host1" -> "probe_complete" [ style = bold]
+"probe_complete host1" [ style=bold color="green" fontcolor="black" ]
+"probe_complete host2" -> "probe_complete" [ style = bold]
+"probe_complete host2" [ style=bold color="green" fontcolor="black" ]
+"probe_complete" -> "rsc1_start_0 host2" [ style = bold]
+"probe_complete" -> "rsc2_start_0 host1" [ style = bold]
+"probe_complete" [ style=bold color="green" fontcolor="orange" ]
+"rsc1_monitor_0 host1" -> "probe_complete host1" [ style = bold]
+"rsc1_monitor_0 host1" [ style=bold color="green" fontcolor="black" ]
+"rsc1_monitor_0 host2" -> "probe_complete host2" [ style = bold]
+"rsc1_monitor_0 host2" [ style=bold color="green" fontcolor="black" ]
+"rsc1_start_0 host2" [ style=bold color="green" fontcolor="black" ]
+"rsc2_monitor_0 host1" -> "probe_complete host1" [ style = bold]
+"rsc2_monitor_0 host1" [ style=bold color="green" fontcolor="black" ]
+"rsc2_monitor_0 host2" -> "probe_complete host2" [ style = bold]
+"rsc2_monitor_0 host2" [ style=bold color="green" fontcolor="black" ]
+"rsc2_start_0 host1" [ style=bold color="green" fontcolor="black" ]
+}
diff -r 456f25dc72b8 -r c8013f5b53f0 pengine/test10/balanced.exp
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/pengine/test10/balanced.exp	Thu Dec 10 16:48:11 2009 +0800
@@ -0,0 +1,110 @@
[the 110 added XML lines of balanced.exp were stripped by the archive]
diff -r 456f25dc72b8 -r c8013f5b53f0 pengine/test10/balanced.score
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/pengine/test10/balanced.score	Thu Dec 10 16:48:11 2009 +0800
@@ -0,0 +1,5 @@
+Allocation scores:
+native_color: rsc1 allocation score on host1: 0
+native_color: rsc1 allocation score on host2: 0
+native_color: rsc2 allocation score on host1: 0
+native_color: rsc2 allocation score on host2: 0
diff -r 456f25dc72b8 -r c8013f5b53f0 pengine/test10/balanced.xml
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/pengine/test10/balanced.xml	Thu Dec 10 16:48:11 2009 +0800
@@ -0,0 +1,44 @@
[the 44 added XML lines of balanced.xml were stripped by the archive]
diff -r 456f25dc72b8 -r c8013f5b53f0 pengine/test10/minimal.dot
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/pengine/test10/minimal.dot	Thu Dec 10 16:48:11 2009 +0800
@@ -0,0 +1,19 @@
+digraph "g" {
+"probe_complete host1" -> "probe_complete" [ style = bold]
+"probe_complete host1" [ style=bold color="green" fontcolor="black" ]
+"probe_complete host2" -> "probe_complete" [ style = bold]
+"probe_complete host2" [ style=bold color="green" fontcolor="black" ]
+"probe_complete" -> "rsc1_start_0 host1" [ style = bold]
+"probe_complete" -> "rsc2_start_0 host1" [ style = bold]
+"probe_complete" [ style=bold color="green" fontcolor="orange" ]
+"rsc1_monitor_0 host1" -> "probe_complete host1" [ style = bold]
Re: [Pacemaker] Resource capacity limit
Hi Andrew,
Attached the hg export patch against the devel branch for that. Hope
that's easier to be merged:-)

Thanks,
Yan
--
y...@novell.com
Software Engineer
China Server Team, OPS Engineering

Novell, Inc.
Making IT Work As One™

# HG changeset patch
# User Yan Gao
# Date 1260419866 -28800
# Node ID 456f25dc72b805e12e5dc32fb23ea8dbe5b8103c
# Parent  0fbf9c62b0555d4d105f6a038f6846af93c1d9e3
Dev: PE: Implement more resource placement strategies: utilization, minimal and balanced

diff -r 0fbf9c62b055 -r 456f25dc72b8 include/crm/msg_xml.h
--- a/include/crm/msg_xml.h	Mon Aug 10 13:57:42 2009 +0200
+++ b/include/crm/msg_xml.h	Thu Dec 10 12:37:46 2009 +0800
@@ -130,6 +130,7 @@
 #define XML_TAG_ATTRS "attributes"
 #define XML_TAG_PARAMS "parameters"
 #define XML_TAG_PARAM "param"
+#define XML_TAG_UTILIZATION "utilization"

 #define XML_TAG_RESOURCE_REF "resource_ref"
 #define XML_CIB_TAG_RESOURCE "primitive"
diff -r 0fbf9c62b055 -r 456f25dc72b8 include/crm/pengine/status.h
--- a/include/crm/pengine/status.h	Mon Aug 10 13:57:42 2009 +0200
+++ b/include/crm/pengine/status.h	Thu Dec 10 12:37:46 2009 +0800
@@ -68,6 +68,7 @@
 	char *dc_uuid;
 	node_t *dc_node;
 	const char *stonith_action;
+	const char *placement_strategy;

 	unsigned long long flags;
@@ -116,6 +117,8 @@
 	GHashTable *attrs;	/* char* => char* */
 	enum node_type type;
+
+	GHashTable *utilization;
 };

 struct node_s {
@@ -186,6 +189,7 @@

 	GHashTable *meta;
 	GHashTable *parameters;
+	GHashTable *utilization;

 	GListPtr children;	/* resource_t* */
 };
diff -r 0fbf9c62b055 -r 456f25dc72b8 lib/pengine/common.c
--- a/lib/pengine/common.c	Mon Aug 10 13:57:42 2009 +0200
+++ b/lib/pengine/common.c	Thu Dec 10 12:37:46 2009 +0800
@@ -80,6 +80,24 @@
 	return FALSE;
 }

+static gboolean
+check_placement_strategy(const char *value)
+{
+    if(safe_str_eq(value, "default")) {
+	return TRUE;
+
+    } else if(safe_str_eq(value, "utilization")) {
+	return TRUE;
+
+    } else if(safe_str_eq(value, "minimal")) {
+	return TRUE;
+
+    } else if(safe_str_eq(value, "balanced")) {
+	return TRUE;
+    }
+    return FALSE;
+}
+
 pe_cluster_option pe_opts[] = {
 	/* name, old-name, validate, default, description */
 	{ "no-quorum-policy", "no_quorum_policy", "enum", "stop, freeze, ignore, suicide", "stop", &check_quorum,
@@ -147,6 +165,10 @@
 	{ "node-health-red", NULL, "integer", NULL, "-INFINITY", &check_number,
 	  "The score 'red' translates to in rsc_location constraints", "Only used when node-health-strategy is set to custom or progressive." },
+
+	/* Placement Strategy */
+	{ "placement-strategy", NULL, "enum", "default, utilization, minimal, balanced", "default", &check_placement_strategy,
+	  "The strategy to determine resource placement", NULL},
 };

 void
diff -r 0fbf9c62b055 -r 456f25dc72b8 lib/pengine/complex.c
--- a/lib/pengine/complex.c	Mon Aug 10 13:57:42 2009 +0200
+++ b/lib/pengine/complex.c	Thu Dec 10 12:37:46 2009 +0800
@@ -371,6 +371,12 @@
 	if(safe_str_eq(class, "stonith")) {
 	    set_bit_inplace(data_set->flags, pe_flag_have_stonith_resource);
 	}
+
+	(*rsc)->utilization = g_hash_table_new_full(
+	    g_str_hash, g_str_equal, g_hash_destroy_str, g_hash_destroy_str);
+
+	unpack_instance_attributes(data_set->input, (*rsc)->xml, XML_TAG_UTILIZATION, NULL,
+				   (*rsc)->utilization, NULL, FALSE, data_set->now);

 /* 	data_set->resources = g_list_append(data_set->resources, (*rsc)); */
 	return TRUE;
@@ -451,6 +457,9 @@
 	if(rsc->meta != NULL) {
 	    g_hash_table_destroy(rsc->meta);
 	}
+	if(rsc->utilization != NULL) {
+	    g_hash_table_destroy(rsc->utilization);
+	}
 	if(rsc->parent == NULL && is_set(rsc->flags, pe_rsc_orphan)) {
 	    free_xml(rsc->xml);
 	}
diff -r 0fbf9c62b055 -r 456f25dc72b8 lib/pengine/status.c
--- a/lib/pengine/status.c	Mon Aug 10 13:57:42 2009 +0200
+++ b/lib/pengine/status.c	Thu Dec 10 12:37:46 2009 +0800
@@ -159,6 +159,9 @@
 	if(details->attrs != NULL) {
 	    g_hash_table_destroy(details->attrs);
 	}
+	if(details->utilization != NULL) {
+	    g_hash_table_destroy(details->utilization);
+	}
 	pe_free_shallow_adv(details->running_rsc, FALSE);
 	pe_free_shallow_adv(details->allocated_rsc, FALSE);
 	crm_free(details);
diff -r 0fbf9c62b055 -r 456f25dc72b8 lib/pengine/unpack.c
--- a/lib/pengine/unpack.c	Mon Aug 10 13:57:42 2009 +0200
+++ b/lib/pengine/unpack.c	Thu Dec 10 12:37:46 2009 +0800
@@ -165,6 +165,9 @@
 	crm_info("Node scores: 'red' = %s, 'yellow' = %s, 'green' = %s",
 		 score2char(node_score_red), score2char(node_score_yellow),
 		 score2char(node_score_green));
+
+	data_set->placement_strategy = pe_pref(data_set->config_hash, "placement-strategy");
+	crm_debug_2("Placement strategy: %s", data_set->placement_strategy);

 	return TRUE;
 }
@@ -233,6 +236,9 @@
 	new_node->details->attrs = g_hash_table_new_full(
 	    g_str_hash, g_str_equal, g_hash_destroy_str, g_hash_destroy_str);
+
+	new_node->details->utilization = g_hash_table_new_full(
+	    g_str_hash, g_str_equal,
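The patch above parses a `utilization` section (`XML_TAG_UTILIZATION`) on both nodes and resources. In the CIB, such sections would look roughly like the fragment below; the `id` values and the attribute name `capacity` are illustrative, chosen to match the thread's earlier example, and any attribute names can be used as long as node and resource agree.

```xml
<!-- Node advertises its total capacity -->
<node id="node1" uname="yingying" type="normal">
  <utilization id="yingying-utilization">
    <nvpair id="yingying-utilization-capacity" name="capacity" value="100"/>
  </utilization>
</node>

<!-- Resource declares how much of that capacity it consumes -->
<primitive id="dummy0" class="ocf" provider="heartbeat" type="Dummy">
  <utilization id="dummy0-utilization">
    <nvpair id="dummy0-utilization-capacity" name="capacity" value="90"/>
  </utilization>
</primitive>
```

With `placement-strategy` set to `utilization`, `minimal`, or `balanced`, the PE would only consider nodes whose remaining capacity covers the resource's declared utilization.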
Re: [Pacemaker] Resource capacity limit
Hi Andrew,

On 11/20/09 04:10, Andrew Beekhof wrote:
> Btw. You're still missing some test cases ;-)
Oh, right:-) I created some. Hope I created them in the correct way.
Sorry for so many attachments...

Thanks,
Yan
--
y...@novell.com
Software Engineer
China Server Team, OPS Engineering

Novell, Inc.
Making IT Work As One™

digraph "g" {
"probe_complete host1" -> "probe_complete" [ style = bold]
"probe_complete host1" [ style=bold color="green" fontcolor="black" ]
"probe_complete host2" -> "probe_complete" [ style = bold]
"probe_complete host2" [ style=bold color="green" fontcolor="black" ]
"probe_complete" -> "rsc1_start_0 host2" [ style = bold]
"probe_complete" -> "rsc2_start_0 host1" [ style = bold]
"probe_complete" [ style=bold color="green" fontcolor="orange" ]
"rsc1_monitor_0 host1" -> "probe_complete host1" [ style = bold]
"rsc1_monitor_0 host1" [ style=bold color="green" fontcolor="black" ]
"rsc1_monitor_0 host2" -> "probe_complete host2" [ style = bold]
"rsc1_monitor_0 host2" [ style=bold color="green" fontcolor="black" ]
"rsc1_start_0 host2" [ style=bold color="green" fontcolor="black" ]
"rsc2_monitor_0 host1" -> "probe_complete host1" [ style = bold]
"rsc2_monitor_0 host1" [ style=bold color="green" fontcolor="black" ]
"rsc2_monitor_0 host2" -> "probe_complete host2" [ style = bold]
"rsc2_monitor_0 host2" [ style=bold color="green" fontcolor="black" ]
"rsc2_start_0 host1" [ style=bold color="green" fontcolor="black" ]
}

Allocation scores:
native_color: rsc1 allocation score on host1: 0
native_color: rsc1 allocation score on host2: 0
native_color: rsc2 allocation score on host1: 0
native_color: rsc2 allocation score on host2: 0

digraph "g" {
"probe_complete host1" -> "probe_complete" [ style = bold]
"probe_complete host1" [ style=bold color="green" fontcolor="black" ]
"probe_complete host2" -> "probe_complete" [ style = bold]
"probe_complete host2" [ style=bold color="green" fontcolor="black" ]
"probe_complete" -> "rsc1_start_0 host1" [ style = bold]
"probe_complete" -> "rsc2_start_0 host1" [ style = bold]
"probe_complete" [ style=bold color="green" fontcolor="orange" ]
"rsc1_monitor_0 host1" -> "probe_complete host1" [ style = bold]
"rsc1_monitor_0 host1" [ style=bold color="green" fontcolor="black" ]
"rsc1_monitor_0 host2" -> "probe_complete host2" [ style = bold]
"rsc1_monitor_0 host2" [ style=bold color="green" fontcolor="black" ]
"rsc1_start_0 host1" [ style=bold color="green" fontcolor="black" ]
"rsc2_monitor_0 host1" -> "probe_complete host1" [ style = bold]
"rsc2_monitor_0 host1" [ style=bold color="green" fontcolor="black" ]
"rsc2_monitor_0 host2" -> "probe_complete host2" [ style = bold]
"rsc2_monitor_0 host2" [ style=bold color="green" fontcolor="black" ]
"rsc2_start_0 host1" [ style=bold color="green" fontcolor="black" ]
}

Allocation scores:
native_color: rsc1 allocation score on host1: 0
native_color: rsc1 allocation score on host2: 0
native_color: rsc2 allocation score on host1: 0
native_color: rsc2 allocation score on host2: 0

digraph "
Re: [Pacemaker] Resource capacity limit
Btw. You're still missing some test cases ;-)

On Fri, Nov 13, 2009 at 8:23 AM, Yan Gao wrote:
> Hi Andrew, Lars,
>
> Andrew Beekhof wrote:
>> I'd like to see the while-block from native_color() be a function that
>> is called from native_assign_node().
> It seems to be too late to filter out the nodes without enough capacity from
> native_assign_node(). I wrote a have_enough_capacity() function which is
> called from native_choose_node() to achieve that.
>
>> And instead of a limit-utilization option, we'd have
>> placement-strategy=(default|utilization|minimal)
> Done. And added a "balanced" option as Lars advised.
>
>> Default ::= what we do now
>> Utilization ::= what you've implemented
>> Minimal ::= what you've implemented _without_ the load balancing we
>> currently do.
>>
>> (Maybe the names could be improved, but hopefully you get the idea).
>>
>> The last one is interesting because it allows us to concentrate
>> services on the minimum number of required nodes (and potentially
>> power some of the others down).
> Done.
>
> Minimal:
> Consider the utilization of nodes and resources. If a resource has
> the same score for several available nodes, do _not_ balance the load.
> That implies that the resources will be concentrated on a minimal
> number of nodes.
>
> Balanced:
> Consider the utilization of nodes and resources. If a resource has
> the same score for several available nodes:
> * First, balance the load according to the remaining capacity of nodes
>   (implemented in compare_capacity()).
> * If the nodes still have equal remaining capacity, balance the load
>   according to the number of resources each node will run.
>
> The strategies are determined mainly in sort_node_weight(), so I changed
> the prototypes of some functions a bit.
>
> Please help to review and test it.
> Any comments and suggestions are welcome:-)
>
> Thanks,
> Yan
>
> --
> y...@novell.com
> Software Engineer
> China Server Team, OPS Engineering
>
> Novell, Inc.
> Making IT Work As One™
Re: [Pacemaker] Resource capacity limit
On 2009-11-13T15:23:20, Yan Gao wrote:

> Minimal:
> Consider the utilization of nodes and resources. While if a resource has
> the same score for several available nodes, do _not_ balance the load.
> That implies that the resources will be concentrated to minimal number of
> nodes.
>
> Balanced:
> Consider the utilization of nodes and resources. If a resource has
> the same score for several available nodes:
> * First, balance the load according to the remaining capacity of nodes.
> (implemented from compare_capacity())
> * If the nodes still have the equal remaining capacity, then balance
> the load according to the numbers of resources that the nodes will run.
>
> The strategies are determined mainly from sort_node_weight(), so I changed the
> prototypes of some functions a bit.

Hi Yan Gao,

great work! But Minimal and Balanced don't quite do what is described
above. A linear assignment doesn't provide anything close to an optimal
solution, in particular when combined with (anti-)colocation rules;
solving this optimally is NP-complete (the knapsack problem for the
Minimal policy, for example), though heuristics exist that get close in
sane time. At the least, this is worth understanding and documenting as
a limitation of the current algorithm.

Regards,
    Lars

--
Architect Storage/HA, OPS Engineering, Novell, Inc.
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde
Re: [Pacemaker] Resource capacity limit
On Thu, Nov 12, 2009 at 11:58 PM, Lars Marowsky-Bree wrote:
> On 2009-11-12T14:53:24, Andrew Beekhof wrote:
>
>> At this point in time, I can't see us going back to the way heartbeat
>> releases were done.
>> If there was a single thing that I'd credit Pacemaker's current
>> reliability to, it would be our release strategy.
>
> Well, exactly, and that's what pacemaker has been doing, right? Phasing
> in features over time? Successfully? ;-)
>
>>> With increasing coverage of the regression tests, the existing
>>> functionality is protected; which is really the important bit. This
>>> encourages a smooth forward transition.
>> One simply can't test everything.
>
> True, but we do a pretty good job of it.
>
> Or are there any fundamental changes you've queued up?

Yes, stonith and possibly the lrmd will be seeing some changes in the
near future. There are also a number of configuration changes I want to
make.

>>> There's a point in having a devel tree (similar to linux-next) before
>>> merging back major features into the trunk, but I don't really subscribe
>>> to the major version flow. That just means that there's a lot of testing
>>> that needs to happen at once, which means more things slip through than
>>> with incremental testing. In my experience, major updates make them a
>>> royal PITA for users.
>> Noted. But for now, I don't think we'll go in that direction.
>
> So you want to change away from a successful model (as in the 1.0.x
> series so far) to a more disruptive one? ;-)

No, I'm suggesting that we won't be changing from what we do now. I'd
just document it.

> If you're saying we don't have resources for people to test a
> development tree, that's true both for one that periodically gets
> merged back into "mainline" and for one that gets merged back at
> much larger intervals. In fact, I'd predict it'll be worse for the
> latter model.

Except that no-one's putting a gun to people's heads making them use the
new stuff. That's the point of cutting off development at some point, so
that there is always something stable to use while we (and other people
who must have whatever cool new features we added) get the next series
into shape.

You'd have a point if 0.6 was deleted the second 1.0 came out, but it's
been a year and I've still not turned away a 0.6 bug yet.
Re: [Pacemaker] Resource capacity limit
On Fri, Nov 13, 2009 at 8:23 AM, Yan Gao wrote: > Hi Andrew, Lars, > > Andrew Beekhof wrote: >> I'd like to see the while-block from native_color() be a function that >> is called from native_assign_node(). > It seems to be too late to filter out the nodes without enough capacity from > native_assign_node(). I wrote a have_enough_capacity() function which is > called from native_choose_node() to achieve that. Ah, yes, thats what I meant. Well done interpreting my vague design :-) > >> And instead of a limit-utilization option, we'd have >> placement-strategy=(default|utilization|minimal) > Done. And added a "balanced" option as Lars advised. > >> >> Default ::= what we do now >> Utilization ::= what you've implemented >> Minimal ::= what you've implemented _without_ the load balancing we >> currently do. >> >> (Maybe the names could be improved, but hopefully you get the idea). >> >> The last one is interesting because it allows us to concentrate >> services on the minimum number of required nodes (and potentially >> power some of the others down). > Done. > > Minimal: > Consider the utilization of nodes and resources. While if a resource has > the same score for several available nodes, do _not_ balance the load. > That implies that the resources will be concentrated to minimal number of > nodes. > > Balanced: > Consider the utilization of nodes and resources. If a resource has > the same score for several available nodes: > * First, balance the load according to the remaining capacity of nodes. > (implemented from compare_capacity()) > * If the nodes still have the equal remaining capacity, then balance > the load according to the numbers of resources that the nodes will run. > > The strategies are determined mainly from sort_node_weight(), so I changed the > prototypes of some functions a bit. > > Please help to review and test it. Any comments and suggestions are welcome:-) Will do. Thanks! 
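The "balanced" tie-breaking described above (first remaining capacity, then resource count) can be sketched as a standalone comparator. This is an illustrative model with made-up node data, not the actual sort_node_weight() code from the patch:

```python
import functools

def compare_balanced(node_a, node_b):
    # Prefer the node with more remaining capacity.
    if node_a["remaining"] != node_b["remaining"]:
        return node_b["remaining"] - node_a["remaining"]
    # Capacities equal: prefer the node that will run fewer resources.
    return node_a["num_resources"] - node_b["num_resources"]

nodes = [
    {"name": "n1", "remaining": 40, "num_resources": 3},
    {"name": "n2", "remaining": 60, "num_resources": 5},
    {"name": "n3", "remaining": 60, "num_resources": 2},
]
ranked = sorted(nodes, key=functools.cmp_to_key(compare_balanced))
# n2 and n3 tie on remaining capacity; n3 wins because it runs fewer resources
print([n["name"] for n in ranked])  # -> ['n3', 'n2', 'n1']
```

Under a "minimal" strategy the same comparator would simply not be applied, so ties stay in score order and load concentrates instead of spreading.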
Re: [Pacemaker] Resource capacity limit
Hi Andrew, Lars, Andrew Beekhof wrote: > I'd like to see the while-block from native_color() be a function that > is called from native_assign_node(). It seems to be too late to filter out the nodes without enough capacity from native_assign_node(). I wrote a have_enough_capacity() function which is called from native_choose_node() to achieve that. > And instead of a limit-utilization option, we'd have > placement-strategy=(default|utilization|minimal) Done. And added a "balanced" option as Lars advised. > > Default ::= what we do now > Utilization ::= what you've implemented > Minimal ::= what you've implemented _without_ the load balancing we > currently do. > > (Maybe the names could be improved, but hopefully you get the idea). > > The last one is interesting because it allows us to concentrate > services on the minimum number of required nodes (and potentially > power some of the others down). Done. Minimal: Consider the utilization of nodes and resources, but if a resource has the same score for several available nodes, do _not_ balance the load. That implies that the resources will be concentrated on a minimal number of nodes. Balanced: Consider the utilization of nodes and resources. If a resource has the same score for several available nodes: * First, balance the load according to the remaining capacity of nodes. (implemented in compare_capacity()) * If the nodes still have equal remaining capacity, then balance the load according to the number of resources that the nodes will run. The strategies are implemented mainly in sort_node_weight(), so I changed the prototypes of some functions a bit. Please help to review and test it. Any comments and suggestions are welcome :-) Thanks, Yan -- y...@novell.com Software Engineer China Server Team, OPS Engineering Novell, Inc.
Making IT Work As One™ diff -r f49a0cab20aa include/crm/msg_xml.h --- a/include/crm/msg_xml.h Thu Nov 12 12:18:10 2009 +0100 +++ b/include/crm/msg_xml.h Fri Nov 13 14:08:16 2009 +0800 @@ -130,6 +130,7 @@ #define XML_TAG_ATTRS "attributes" #define XML_TAG_PARAMS "parameters" #define XML_TAG_PARAM "param" +#define XML_TAG_UTILIZATION "utilization" #define XML_TAG_RESOURCE_REF "resource_ref" #define XML_CIB_TAG_RESOURCE "primitive" diff -r f49a0cab20aa include/crm/pengine/status.h --- a/include/crm/pengine/status.h Thu Nov 12 12:18:10 2009 +0100 +++ b/include/crm/pengine/status.h Fri Nov 13 14:08:16 2009 +0800 @@ -68,6 +68,7 @@ char *dc_uuid; node_t *dc_node; const char *stonith_action; + const char *placement_strategy; unsigned long long flags; @@ -116,6 +117,8 @@ GHashTable *attrs; /* char* => char* */ enum node_type type; + + GHashTable *utilization; }; struct node_s { @@ -186,6 +189,7 @@ GHashTable *meta; GHashTable *parameters; + GHashTable *utilization; GListPtr children; /* resource_t* */ }; diff -r f49a0cab20aa lib/pengine/common.c --- a/lib/pengine/common.c Thu Nov 12 12:18:10 2009 +0100 +++ b/lib/pengine/common.c Fri Nov 13 14:08:16 2009 +0800 @@ -80,6 +80,24 @@ return FALSE; } +static gboolean +check_placement_strategy(const char *value) +{ + if(safe_str_eq(value, "default")) { + return TRUE; + + } else if(safe_str_eq(value, "utilization")) { + return TRUE; + + } else if(safe_str_eq(value, "minimal")) { + return TRUE; + + } else if(safe_str_eq(value, "balanced")) { + return TRUE; + } + return FALSE; +} + pe_cluster_option pe_opts[] = { /* name, old-name, validate, default, description */ { "no-quorum-policy", "no_quorum_policy", "enum", "stop, freeze, ignore, suicide", "stop", &check_quorum, @@ -147,6 +165,10 @@ { "node-health-red", NULL, "integer", NULL, "-INFINITY", &check_number, "The score 'red' translates to in rsc_location constraints", "Only used when node-health-strategy is set to custom or progressive." 
}, + + /*Placement Strategy*/ + { "placement-strategy", NULL, "enum", "default, utilization, minimal, balanced", "default", &check_placement_strategy, + "The strategy to determine resource placement", NULL}, }; void diff -r f49a0cab20aa lib/pengine/complex.c --- a/lib/pengine/complex.c Thu Nov 12 12:18:10 2009 +0100 +++ b/lib/pengine/complex.c Fri Nov 13 14:08:16 2009 +0800 @@ -371,6 +371,12 @@ if(safe_str_eq(class, "stonith")) { set_bit_inplace(data_set->flags, pe_flag_have_stonith_resource); } + + (*rsc)->utilization = g_hash_table_new_full( + g_str_hash, g_str_equal, g_hash_destroy_str, g_hash_destroy_str); + + unpack_instance_attributes(data_set->input, (*rsc)->xml, XML_TAG_UTILIZATION, NULL, + (*rsc)->utilization, NULL, FALSE, data_set->now); /* data_set->resources = g_list_append(data_set->resources, (*rsc)); */ return TRUE; @@ -451,6 +457,9 @@ if(rsc->meta != NULL) { g_hash_table_destroy(rsc->meta); } + if(rsc->utilization != NULL) { + g_hash_table_destroy(rsc->utilization); + } if(rsc->parent == NULL && is_set(rsc->flags, pe_rsc_orphan)) { free_xm
Re: [Pacemaker] Resource capacity limit
On Thu, 2009-11-12 at 14:53 +0100, Andrew Beekhof wrote: > On Wed, Nov 11, 2009 at 1:36 PM, Lars Marowsky-Bree wrote: > > On 2009-11-05T14:45:36, Andrew Beekhof wrote: > > > >> Lastly, I would really like to defer this for 1.2 > >> I know I've bent the rules a bit for 1.0 in the past, but it's really > >> late in the game now. > > > > Personally, I think the Linux kernel model works really well. i.e., no > > "major releases" any more, but bugfixes and features alike get merged > > over time and constantly. > > That's a great model if you've got hordes of developers and testers. > Of which we have neither. > > At this point in time, I can't see us going back to the way heartbeat > releases were done. > If there was a single thing that I'd credit Pacemaker's current > reliability to, it would be our release strategy. Maintaining corosync and openais, I'd surely like to only have one tree where all work is done and never have a "stable" branch. Andrew is right though, this model only works if there is large downstream adoption and support and distros take on the work of stabilizing the efforts of the trunk development. Talking with distros, I know this is generally not the case with any package other than kernel.org and maybe some related bits like xen/kvm (which has forced this model upon them). Regards -steve
Re: [Pacemaker] Resource capacity limit
On 2009-11-12T14:53:24, Andrew Beekhof wrote: > At this point in time, I can't see us going back to the way heartbeat > releases were done. > If there was a single thing that I'd credit Pacemaker's current > reliability to, it would be our release strategy. Well, exactly, and that's what Pacemaker has been doing, right? Phasing in features over time? Successfully? ;-) > > With increasing coverage of the regression tests, the existing > functionality is protected; which is really the important bit. This > encourages a smooth forward transition. > One simply can't test everything. True, but we do a pretty good job of it. Or are there any fundamental changes you've queued up? > > There's a point in having a devel tree (similar to linux-next) before > merging back major features into the trunk, but I don't really subscribe > to the major version flow. That just means that there's a lot of testing > that needs to happen at once, which means more things slip through than > with incremental testing. In my experience, major updates are a > royal PITA for users. > Noted. But for now, I don't think we'll go in that direction. So you want to change away from a successful model (as in the 1.0.x series so far) to a more disruptive one? ;-) If you're saying we don't have resources for people to test a development tree, that's true both for one that periodically gets merged back into "mainline" and for one that gets merged back in much larger intervals. In fact, I'd predict it'll be worse for the latter model. I mean, sure, it's your project, but I really wonder if it's a good direction to go. Having done this for over a decade, I can honestly say that major upgrades are always a pain. They are never smooth. Many small steps over time are better. Just consider that and make the best choices ;-) Regards, Lars -- Architect Storage/HA, OPS Engineering, Novell, Inc.
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg) "Experience is the name everyone gives to their mistakes." -- Oscar Wilde
Re: [Pacemaker] Resource capacity limit
On Wed, Nov 11, 2009 at 1:36 PM, Lars Marowsky-Bree wrote: > On 2009-11-05T14:45:36, Andrew Beekhof wrote: > >> Lastly, I would really like to defer this for 1.2 >> I know I've bent the rules a bit for 1.0 in the past, but it's really >> late in the game now. > > Personally, I think the Linux kernel model works really well. i.e., no > "major releases" any more, but bugfixes and features alike get merged > over time and constantly. That's a great model if you've got hordes of developers and testers. Of which we have neither. At this point in time, I can't see us going back to the way heartbeat releases were done. If there was a single thing that I'd credit Pacemaker's current reliability to, it would be our release strategy. > > With increasing coverage of the regression tests, the existing > functionality is protected; which is really the important bit. This > encourages a smooth forward transition. One simply can't test everything. > There's a point in having a devel tree (similar to linux-next) before > merging back major features into the trunk, but I don't really subscribe > to the major version flow. That just means that there's a lot of testing > that needs to happen at once, which means more things slip through than > with incremental testing. In my experience, major updates are a > royal PITA for users. Noted. But for now, I don't think we'll go in that direction.
Re: [Pacemaker] Resource capacity limit
On Wed, Nov 11, 2009 at 1:42 PM, Lars Marowsky-Bree wrote: > On 2009-11-06T12:45:17, Andrew Beekhof wrote: > >> And instead of a limit-utilization option, we'd have >> placement-strategy=(default|utilization|minimal) >> >> Default ::= what we do now >> Utilization ::= what you've implemented > > These two are obvious, since we can already do them with existing code. > > The following: > >> Minimal ::= what you've implemented _without_ the load balancing we >> currently do. > > (Basically, concentrate load on as few nodes as possible. Rucksack > problem.) > > To this I'd like to add > > Balanced ::= try to spread the load as evenly as possible. This is hard > to define - perhaps "maximise average free resources on nodes". > > These latter two are harder, and basically require a linear optimization > engine to be integrated. But I'd, of course, love to see them. No question there; just trying to at least be prepared for it so that we don't have to change the option name(s).
Re: [Pacemaker] Resource capacity limit
On 2009-11-09T10:52:03, Michael Schwartzkopff wrote: > I just think it would be a cool solution to make the cluster itself > do the work if configured to do so. So the CRM (or the RAs) should > have the ability to monitor the resource consumption of resources > dynamically. This automation would make the lives of admins much easier > and they would not be forced to do the scripting on their own. Automatically, and possibly dynamically, figuring out the load incurred by a specific resource, and its min/avg/peak limits, is an extremely hard problem. Yes, it is very cool, but outside the scope of Pacemaker itself. With these patches to take the load into account, Pacemaker is equipped to take such input from monitoring frameworks, but I don't think Pacemaker itself should be this monitoring tool. For VMs, this is somewhat easier (compared to individual resources) to monitor, since the hypervisor/Dom0 has access to this data: memory consumption, CPU utilization over N minutes, disk IO/network etc. I'd very much love to see this added. Perhaps the "monitor" op for the RA could handle this. Regards, Lars -- Architect Storage/HA, OPS Engineering, Novell, Inc. SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg) "Experience is the name everyone gives to their mistakes." -- Oscar Wilde
Re: [Pacemaker] Resource capacity limit
On 2009-11-06T12:45:17, Andrew Beekhof wrote: > And instead of a limit-utilization option, we'd have > placement-strategy=(default|utilization|minimal) > > Default ::= what we do now > Utilization ::= what you've implemented These two are obvious, since we can already do them with existing code. The following: > Minimal ::= what you've implemented _without_ the load balancing we > currently do. (Basically, concentrate load on as few nodes as possible. Rucksack problem.) To this I'd like to add Balanced ::= try to spread the load as evenly as possible. This is hard to define - perhaps "maximise average free resources on nodes". These latter two are harder, and basically require a linear optimization engine to be integrated. But I'd, of course, love to see them. (Automatically powering down nodes is not that trivial, since we'd need some way to wake them up in time; STONITH actually can do that, but it needs some thinking to get right. At least though those nodes could go to power savings mode, so it'd definitely help.) With those, Pacemaker would be a full-scale replacement for certain data center management and automation frameworks ;-) Regards, Lars -- Architect Storage/HA, OPS Engineering, Novell, Inc. SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg) "Experience is the name everyone gives to their mistakes." -- Oscar Wilde
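The difference between the "minimal" and "balanced" strategies being discussed can be shown with a toy greedy placer. This is a hypothetical model with a single capacity value and made-up data; the real PE works on scores and constraints, not this loop:

```python
def place(resources, capacities, strategy):
    # Greedy single-capacity placement model. "minimal" packs each
    # resource onto the fullest node that still fits (concentrating load,
    # so idle nodes could be powered down); "balanced" picks the node
    # with the most remaining capacity.
    remaining = dict(capacities)
    assignment = {}
    for rsc, need in resources:
        fits = [n for n in remaining if remaining[n] >= need]
        if not fits:
            assignment[rsc] = None  # no node has enough capacity: stopped
            continue
        if strategy == "minimal":
            node = min(fits, key=lambda n: remaining[n])
        else:  # "balanced"
            node = max(fits, key=lambda n: remaining[n])
        remaining[node] -= need
        assignment[rsc] = node
    return assignment

caps = {"n1": 100, "n2": 100}
rscs = [("vm1", 30), ("vm2", 30), ("vm3", 30)]
print(place(rscs, caps, "minimal"))   # everything ends up on n1
print(place(rscs, caps, "balanced"))  # load spread across n1 and n2
```

The `None` branch is the behaviour described earlier in the thread: a resource that fits nowhere is stopped rather than placed on an overloaded node.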
Re: [Pacemaker] Resource capacity limit
On 2009-11-05T14:45:36, Andrew Beekhof wrote: > Lastly, I would really like to defer this for 1.2 > I know I've bent the rules a bit for 1.0 in the past, but its really > late in the game now. Personally, I think the Linux kernel model works really well. i.e., no "major releases" any more, but bugfixes and features alike get merged over time and constantly. With increasing coverage of the regression tests, the existing functionality is protected; which is really the important bit. This encourages a smooth forward transition. There's a point in having a devel tree (similar to linux-next) before merging back major features into the trunk, but I don't really subscribe to the major version flow. That just means that there's a lot of testing that needs to happen at once, which means more things slip through than with incremental testing. In my experience, major updates are a royal PITA for users. Just my few euro cents ;-) Regards, Lars -- Architect Storage/HA, OPS Engineering, Novell, Inc. SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg) "Experience is the name everyone gives to their mistakes." -- Oscar Wilde
Re: [Pacemaker] Resource capacity limit
On Friday, 6 November 2009 08:49:06, Andrew Beekhof wrote: > On Fri, Nov 6, 2009 at 8:31 AM, Michael Schwartzkopff wrote: > > On Thursday, 5 November 2009 21:37:23, Andrew Beekhof wrote: > >> On Thu, Nov 5, 2009 at 8:50 PM, Michael Schwartzkopff > >> > > > > wrote: > >> > Hi, > >> > > >> > on the list was a discussion about resource capacity limits. Yan Gao > >> > also implemented it. > >> > > >> > As far as I understood the discussion the solution is to attach nodes > >> > and resources capacity limits. Resources are distributed on the nodes > >> > of a cluster according to its capacity needs. They would be migrated > >> > or shut down if the capacity limits on the node are not met. > >> > > >> > My question is: Can the capacity figures of the resources be made > >> > dynamic? > >> > >> You can, but you probably don't want to. > >> For example, free RAM and CPU load are two things that absolutely make > >> no sense to include in such calculations. > >> > >> Consider how it works: > >> > >> - The node starts and reports 2Gb of RAM > >> - We place a service there that reserves 512Mb > >> - The cluster knows there is 1.5Gb remaining > >> - We place two more services there that also reserve 512Mb each > >> > >> If the amount of RAM at the beginning was the amount free, then when > >> you updated it to be 512Mb the PE would run and stop two of the > >> resources! > > > > Stop. I do not want to make the capacity of the nodes dynamic but the > > actual resource consumption of the resources (i.e. the database). > > Ah, ok, that makes more sense. > Well, like any part of the configuration it can be changed at any time. > What we don't have though is a nice cli tool like crm_attribute for doing > so. I know the crm_attribute command. Of course someone can always do some scripting to achieve the aim I described above. I just think it would be a cool solution to make the cluster itself do the work if configured to do so.
So the CRM (or the RAs) should have the ability to monitor the resource consumption of resources dynamically. This automation would make the lives of admins much easier and they would not be forced to do the scripting on their own. Greetings, -- Dr. Michael Schwartzkopff MultiNET Services GmbH Addresse: Bretonischer Ring 7; 85630 Grasbrunn; Germany Tel: +49 - 89 - 45 69 11 0 Fax: +49 - 89 - 45 69 11 21 mob: +49 - 174 - 343 28 75 mail: mi...@multinet.de web: www.multinet.de Sitz der Gesellschaft: 85630 Grasbrunn Registergericht: Amtsgericht München HRB 114375 Geschäftsführer: Günter Jurgeneit, Hubert Martens --- PGP Fingerprint: F919 3919 FF12 ED5A 2801 DEA6 AA77 57A4 EDD8 979B Skype: misch42
Re: [Pacemaker] Resource capacity limit
Hi, Andrew Beekhof wrote: > > I've been thinking about this more and while this will work, I think > we can make it better. > > I'd like to see the while-block from native_color() be a function that > is called from native_assign_node(). Okay. > And instead of a limit-utilization option, we'd have > placement-strategy=(default|utilization|minimal) > > Default ::= what we do now > Utilization ::= what you've implemented > Minimal ::= what you've implemented _without_ the load balancing we > currently do. > > (Maybe the names could be improved, but hopefully you get the idea). Great! There would be more policy options for users. > > The last one is interesting because it allows us to concentrate > services on the minimum number of required nodes (and potentially > power some of the others down). Right. I'll look into it. Thanks, Yan -- y...@novell.com Software Engineer China Server Team, OPS Engineering Novell, Inc. Making IT Work As One™
Re: [Pacemaker] Resource capacity limit
Hi, On Thu, Nov 05, 2009 at 07:28:16PM +0100, Andrew Beekhof wrote: > On Thu, Nov 5, 2009 at 5:25 PM, Dejan Muhamedagic wrote: > > Hi, > >> Which reminds me, I need to get devel sorted out... > > > > While you're at it, perhaps it would be good to rethink the > > release policy. For me in particular it would be great to know at > > least one week in advance when there'll be a release. For the > > general public as well, since they'll have a chance to do some > > testing of the new code. > > The idea is that people can always pull from stable-1.0, in theory it > should never be broken. > If you're in the middle of stuff, keep it locally until you're done. > > Generally though, I start testing on the 15th of every month. > I thought I said that somewhere... I know the releases page indicates > the month (if it's delayed as it was due to my move). > > But I'm thinking of moving to a bi-monthly cycle. Thoughts? Agreed. In principle, a somewhat slower release process should result in better releases. > > I know that you do test before > > releasing, but the more people test in various environments, the > > more bugs found. Also, it may be good to introduce and announce > > the feature freeze point, after which only bug fixes will be > > accepted. > > Well, in theory that point is x.y.0 > I've been turning a blind eye to your changes in the shell because it's > still very immature (I don't mean that negatively, it's just new code). That should of course change as soon as the shell supports all CIB constructs (which is not far away). But we all understood that it made no sense to keep those changes out :) Thanks, Dejan > Though I've a history of allowing isolated, non-invasive features if > we've not yet planned the next stable series (basically what happened > for the node health stuff from IBM). It's a case-by-case thing, but > I'd agree that we could do with documenting this.
Re: [Pacemaker] Resource capacity limit
On Fri, Nov 6, 2009 at 8:31 AM, Michael Schwartzkopff wrote: > Am Donnerstag, 5. November 2009 21:37:23 schrieb Andrew Beekhof: >> On Thu, Nov 5, 2009 at 8:50 PM, Michael Schwartzkopff > wrote: >> > Hi, >> > >> > on the list was a discussion about resource capacity limits. Yan Gao also >> > implemented it. >> > >> > As far as I understood the discussion the solution is to attach nodes and >> > resources capacity limits. Resources are distributed on the nodes of a >> > cluster according to its capacatiy needs. They would be migrated or >> > shutdown if the capacity limits on the node are not met. >> > >> > My question is: Can the capacity figures of the resources be made >> > dynamically? >> >> You can, but you probably don't want to. >> For example, free RAM and CPU load are two things that absolutely make >> no sense to include in such calculations. >> >> Consider how it works: >> >> - The node starts and reports 2Gb of RAM >> - We place a service there that reserves 512Mb >> - The cluster knows there is 1.5Gb remaining >> - We place two more services there that also reserve 512Mb each >> >> If the amount of RAM at the beginning was the amount free, then when >> you updated it to be 512Mb the PE would run and stop two of the >> resources! > > Stop. I do not want to make the capacity of the nodes dynamically but the > actual resource consumption of the resources (i.e the database). Ah, ok, that makes more sense. Well like any part of the configuration it can be changed at any time. What we don't have though is a nice cli tool like crm_attribute for doing so. > > It sometimes happens that resource (i.e. RAM) consumption of resources varies > and I wanted to make that number dynamical. So after some kind of damping the > node that started swapping out the resources could migrate resources to a node > with free capacities. > >> You always want to feed the cluster the total amount of RAM installed >> which, at most, you'd query when the cluster starts on that node. 
> > The amount of capacity of any resource (RAM, CPU, ...) of a node should be > fixed. That makes sense because a node does not get more resources if switched > on. > > Greetings, > > Michael.
Re: [Pacemaker] Resource capacity limit
On Thursday, 5 November 2009 21:37:23, Andrew Beekhof wrote: > On Thu, Nov 5, 2009 at 8:50 PM, Michael Schwartzkopff wrote: > > Hi, > > > > on the list was a discussion about resource capacity limits. Yan Gao also > > implemented it. > > > > As far as I understood the discussion the solution is to attach nodes and > > resources capacity limits. Resources are distributed on the nodes of a > > cluster according to its capacity needs. They would be migrated or > > shut down if the capacity limits on the node are not met. > > > > My question is: Can the capacity figures of the resources be made > > dynamic? > > You can, but you probably don't want to. > For example, free RAM and CPU load are two things that absolutely make > no sense to include in such calculations. > > Consider how it works: > > - The node starts and reports 2Gb of RAM > - We place a service there that reserves 512Mb > - The cluster knows there is 1.5Gb remaining > - We place two more services there that also reserve 512Mb each > > If the amount of RAM at the beginning was the amount free, then when > you updated it to be 512Mb the PE would run and stop two of the > resources! Stop. I do not want to make the capacity of the nodes dynamic but the actual resource consumption of the resources (i.e. the database). It sometimes happens that resource (i.e. RAM) consumption of resources varies and I wanted to make that number dynamic. So after some kind of damping, a node that started swapping could migrate its resources to a node with free capacity. > You always want to feed the cluster the total amount of RAM installed > which, at most, you'd query when the cluster starts on that node. The amount of capacity of any resource (RAM, CPU, ...) of a node should be fixed. That makes sense because a node does not get more resources if switched on. Greetings, Michael.
Re: [Pacemaker] Resource capacity limit
Hi Andrew, Andrew Beekhof wrote: > On Tue, Nov 3, 2009 at 12:15 PM, Yan Gao wrote: >> Hi again, >> >> Yan Gao wrote: >> >> XML sample: >> .. >> .. >> .. >> Please kindly review it... >> Any suggestions are appreciated! > > What's the behavior if a node has either no utilization block or no > value for that attribute? In that case, the node would be regarded as having no capacity at all, or as lacking that specific capacity. > Can a resource with a utilization block still be placed there? No, unless the utilization block is blank. As long as any attribute is set in a resource's utilization (meaning the resource requests some kind of capacity), a node without that capacity would not be considered. An interesting case: if a resource has no utilization block, it would be regarded as consuming no capacity, so it could be placed on any node, even a node that has no utilization block (no capacity at all). Do you think the behavior is reasonable? Thanks, Yan -- y...@novell.com Software Engineer China Server Team, OPS Engineering Novell, Inc. Making IT Work As One™
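The semantics described here (a missing node attribute counts as no capacity, while an empty resource utilization fits anywhere) can be modelled roughly as follows. The function name is borrowed from the patch, but this is a simplified sketch, not the patched pengine code:

```python
def have_enough_capacity(node_util, rsc_util):
    # node_util / rsc_util: capacity name -> amount; an absent or empty
    # <utilization> block is just an empty dict. A capacity the node does
    # not declare is treated as 0, so any positive requirement rules the
    # node out; a resource with no requirements fits on any node.
    return all(node_util.get(name, 0) >= needed
               for name, needed in rsc_util.items())

assert have_enough_capacity({}, {})                   # nothing required: placeable
assert have_enough_capacity({"memory": 1024}, {})     # still placeable anywhere
assert not have_enough_capacity({}, {"memory": 512})  # node lacks the capacity
assert have_enough_capacity({"memory": 1024}, {"memory": 512})
```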
Re: [Pacemaker] Resource capacity limit
Hi Andrew, Thanks for your reply! Andrew Beekhof wrote: > On Wed, Nov 4, 2009 at 5:41 PM, Lars Marowsky-Bree wrote: >> On 2009-11-03T19:15:59, Yan Gao wrote: >> >>> XML sample: >>> .. >>> .. >>> .. >>> Please kindly review it... >>> Any suggestions are appreciated! >> I think this is exactly what we need. Great job! >> >> Code looks good too. >> >> Andrew? > > Four things... > > Do we still need the limit-utilization option? > I guess it might be nice to be able to turn it off globally... was > that the intention here? Sorry, I missed it in the sample, though it has been implemented in the code :-) Yes, it's the "limit-utilization" property, and it defaults to "false". So the working XML sample should be: .. .. > > The next one is minor, there should at least be a debug message when > we filter out a node in native_color() > That's the sort of thing that's going to mess with people :-) Indeed :-) Added one and attached the revised patch. > > There also needs to be some PE regression tests for this (and be sure > to run the existing ones to make sure they don't break). Right. > > Lastly, I would really like to defer this for 1.2 Agreed too. > I know I've bent the rules a bit for 1.0 in the past, but it's really > late in the game now. > > Which reminds me, I need to get devel sorted out... :-) Thanks again! Best regards, Yan -- y...@novell.com Software Engineer China Server Team, OPS Engineering Novell, Inc.
Making IT Work As One™ diff -r c81e55653fba include/crm/msg_xml.h --- a/include/crm/msg_xml.h Fri Oct 16 14:26:27 2009 +0200 +++ b/include/crm/msg_xml.h Fri Nov 06 15:20:33 2009 +0800 @@ -130,6 +130,7 @@ #define XML_TAG_ATTRS "attributes" #define XML_TAG_PARAMS "parameters" #define XML_TAG_PARAM "param" +#define XML_TAG_UTILIZATION "utilization" #define XML_TAG_RESOURCE_REF "resource_ref" #define XML_CIB_TAG_RESOURCE "primitive" diff -r c81e55653fba include/crm/pengine/status.h --- a/include/crm/pengine/status.h Fri Oct 16 14:26:27 2009 +0200 +++ b/include/crm/pengine/status.h Fri Nov 06 15:20:33 2009 +0800 @@ -58,6 +58,8 @@ #define pe_flag_start_failure_fatal 0x1000ULL #define pe_flag_remove_after_stop 0x2000ULL +#define pe_flag_limit_utilization 0x0001ULL + typedef struct pe_working_set_s { @@ -116,6 +118,8 @@ GHashTable *attrs; /* char* => char* */ enum node_type type; + + GHashTable *utilization; }; struct node_s { @@ -186,6 +190,7 @@ GHashTable *meta; GHashTable *parameters; + GHashTable *utilization; GListPtr children; /* resource_t* */ }; diff -r c81e55653fba lib/pengine/common.c --- a/lib/pengine/common.c Fri Oct 16 14:26:27 2009 +0200 +++ b/lib/pengine/common.c Fri Nov 06 15:20:33 2009 +0800 @@ -147,6 +147,10 @@ { "node-health-red", NULL, "integer", NULL, "-INFINITY", &check_number, "The score 'red' translates to in rsc_location constraints", "Only used when node-health-strategy is set to custom or progressive." 
}, + + /*Resource utilization*/ + { "limit-utilization", NULL, "boolean", NULL, "false", &check_boolean, + "Limit the resource utilization of nodes to avoid being overloaded", NULL}, }; void diff -r c81e55653fba lib/pengine/complex.c --- a/lib/pengine/complex.c Fri Oct 16 14:26:27 2009 +0200 +++ b/lib/pengine/complex.c Fri Nov 06 15:20:33 2009 +0800 @@ -371,6 +371,12 @@ if(safe_str_eq(class, "stonith")) { set_bit_inplace(data_set->flags, pe_flag_have_stonith_resource); } + + (*rsc)->utilization = g_hash_table_new_full( + g_str_hash, g_str_equal, g_hash_destroy_str, g_hash_destroy_str); + + unpack_instance_attributes(data_set->input, (*rsc)->xml, XML_TAG_UTILIZATION, NULL, + (*rsc)->utilization, NULL, FALSE, data_set->now); /* data_set->resources = g_list_append(data_set->resources, (*rsc)); */ return TRUE; @@ -451,6 +457,9 @@ if(rsc->meta != NULL) { g_hash_table_destroy(rsc->meta); } + if(rsc->utilization != NULL) { + g_hash_table_destroy(rsc->utilization); + } if(rsc->parent == NULL && is_set(rsc->flags, pe_rsc_orphan)) { free_xml(rsc->xml); } diff -r c81e55653fba lib/pengine/status.c --- a/lib/pengine/status.c Fri Oct 16 14:26:27 2009 +0200 +++ b/lib/pengine/status.c Fri Nov 06 15:20:33 2009 +0800 @@ -159,6 +159,9 @@ if(details->attrs != NULL) { g_hash_table_destroy(details->attrs); } + if(details->utilization != NULL) { +g_hash_table_destroy(details->utilization); + } pe_free_shallow_adv(details->running_rsc, FALSE); pe_free_s
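The XML samples in this thread were stripped by the list archive. Based on the XML_TAG_UTILIZATION tag the patch introduces, a configuration of the kind being discussed might look like the following; every id, attribute name, and value here is illustrative, not Yan's original sample:

```xml
<node id="node1" uname="node1" type="normal">
  <utilization id="node1-utilization">
    <nvpair id="node1-utilization-memory" name="memory" value="4096"/>
  </utilization>
</node>
<!-- ... elsewhere in the CIB ... -->
<primitive id="dummy0" class="ocf" provider="heartbeat" type="Dummy">
  <utilization id="dummy0-utilization">
    <nvpair id="dummy0-utilization-memory" name="memory" value="1024"/>
  </utilization>
</primitive>
```

With the capacity limit enabled, placing dummy0 on node1 would subtract its 1024 from node1's 4096 when considering further resources.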
Re: [Pacemaker] Resource capacity limit
On Thu, Nov 5, 2009 at 8:50 PM, Michael Schwartzkopff wrote:
> Hi,
>
> on the list was a discussion about resource capacity limits. Yan Gao also
> implemented it.
>
> As far as I understood the discussion, the solution is to attach capacity
> limits to nodes and resources. Resources are distributed on the nodes of a
> cluster according to their capacity needs. They would be migrated or shut
> down if the capacity limits on the node are not met.
>
> My question is: Can the capacity figures of the resources be updated
> dynamically?

You can, but you probably don't want to. For example, free RAM and CPU load
are two things that make absolutely no sense to include in such calculations.

Consider how it works:
- The node starts and reports 2Gb of RAM
- We place a service there that reserves 512Mb
- The cluster knows there is 1.5Gb remaining
- We place two more services there that also reserve 512Mb each

If the amount of RAM at the beginning was the amount free, then when you
updated it to be 512Mb the PE would run and stop two of the resources!

You always want to feed the cluster the total amount of RAM installed which,
at most, you'd query when the cluster starts on that node.

> So, e.g., for every monitor operation the CRM updates the capacity usage
> figures of the resource. So the cluster could react dynamically to the
> actual capacity of a resource.
>
> Greetings,
>
> Michael.
[Pacemaker] Resource capacity limit
Hi,

on the list was a discussion about resource capacity limits. Yan Gao also
implemented it.

As far as I understood the discussion, the solution is to attach capacity
limits to nodes and resources. Resources are distributed on the nodes of a
cluster according to their capacity needs. They would be migrated or shut
down if the capacity limits on the node are not met.

My question is: Can the capacity figures of the resources be updated
dynamically? So, e.g., for every monitor operation the CRM updates the
capacity usage figures of the resource. So the cluster could react
dynamically to the actual capacity of a resource.

Greetings,

Michael.
Re: [Pacemaker] Resource capacity limit
On 2009-11-03T19:15:59, Yan Gao wrote:
> XML sample:
> [XML sample not preserved in the list archive]
>
> Please kindly review it...
> Any suggestions are appreciated!

I think this is exactly what we need. Great job! Code looks good too.
Andrew?

For added kicks, which may be something Andrew can add more readily, I
wonder if utilization should also be subject to time-based evaluation.
Think of a database needing more horsepower on weekdays - but perhaps that
is something that should wait until dynamic load balancing happens.

Regards,
Lars

--
Architect Storage/HA, OPS Engineering, Novell, Inc.
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde
Re: [Pacemaker] Resource capacity limit
Hi again,

Yan Gao wrote:
> Hi Lars,
> Thanks for the great suggestions!
>
> Lars Marowsky-Bree wrote:
>> On 2009-10-30T19:41:35, Yan Gao wrote:
>>> Configuration example:
>>>
>>> node yingying \
>>>         attributes capacity="100"
>>> primitive dummy0 ocf:heartbeat:Dummy \
>>>         meta weight="90" priority="2"
>>> primitive dummy1 ocf:heartbeat:Dummy \
>>>         meta weight="60" priority="1"
>>> ..
>>> property $id="cib-bootstrap-options" \
>>>         limit-capacity="true"
>>
>> First, I would prefer not to contaminate the regular node attribute
>> namespace; the word "capacity" might already be used. Second, the
>> "weight" is just one dimension, which is somewhat difficult.
>>
>> I'd propose to introduce a new XML element, "resource_utilization" (name
>> to be decided ;-) containing a "nvset", which can be used in a node
>> element or a resource primitive.
>>
>> This creates a new namespace, avoiding clashes, and distinguishes the
>> utilization parameters from the other various attributes.
>>
>> Further, it trivially allows for several user-defined metrics.
> Right, great idea! I'll try to implement it if Andrew is OK with that as
> well :)

Done and attached.

XML sample:
[XML sample not preserved in the list archive]

Please kindly review it...
Any suggestions are appreciated!

Thanks,
Yan
--
y...@novell.com
Software Engineer
China Server Team, OPS Engineering

Novell, Inc.
Making IT Work As One™

diff -r c81e55653fba include/crm/msg_xml.h
--- a/include/crm/msg_xml.h	Fri Oct 16 14:26:27 2009 +0200
+++ b/include/crm/msg_xml.h	Tue Nov 03 19:02:22 2009 +0800
@@ -130,6 +130,7 @@
 #define XML_TAG_ATTRS		"attributes"
 #define XML_TAG_PARAMS		"parameters"
 #define XML_TAG_PARAM		"param"
+#define XML_TAG_UTILIZATION	"utilization"
 
 #define XML_TAG_RESOURCE_REF	"resource_ref"
 #define XML_CIB_TAG_RESOURCE	"primitive"
diff -r c81e55653fba include/crm/pengine/status.h
--- a/include/crm/pengine/status.h	Fri Oct 16 14:26:27 2009 +0200
+++ b/include/crm/pengine/status.h	Tue Nov 03 19:02:22 2009 +0800
@@ -58,6 +58,8 @@
 #define pe_flag_start_failure_fatal	0x1000ULL
 #define pe_flag_remove_after_stop	0x2000ULL
 
+#define pe_flag_limit_utilization	0x0001ULL
+
 typedef struct pe_working_set_s
 {
@@ -116,6 +118,8 @@
 	GHashTable *attrs;	/* char* => char* */
 
 	enum node_type type;
+
+	GHashTable *utilization;
 };
 
 struct node_s {
@@ -186,6 +190,7 @@
 	GHashTable *meta;
 	GHashTable *parameters;
+	GHashTable *utilization;
 
 	GListPtr children;	/* resource_t* */
 };
diff -r c81e55653fba lib/pengine/common.c
--- a/lib/pengine/common.c	Fri Oct 16 14:26:27 2009 +0200
+++ b/lib/pengine/common.c	Tue Nov 03 19:02:22 2009 +0800
@@ -147,6 +147,10 @@
 	{ "node-health-red", NULL, "integer", NULL, "-INFINITY", &check_number,
 	  "The score 'red' translates to in rsc_location constraints",
 	  "Only used when node-health-strategy is set to custom or progressive." },
+
+	/*Resource utilization*/
+	{ "limit-utilization", NULL, "boolean", NULL, "false", &check_boolean,
+	  "Limit the resource utilization of nodes to avoid being overloaded", NULL},
 };
 
 void
diff -r c81e55653fba lib/pengine/complex.c
--- a/lib/pengine/complex.c	Fri Oct 16 14:26:27 2009 +0200
+++ b/lib/pengine/complex.c	Tue Nov 03 19:02:22 2009 +0800
@@ -371,6 +371,12 @@
 	if(safe_str_eq(class, "stonith")) {
 	    set_bit_inplace(data_set->flags, pe_flag_have_stonith_resource);
 	}
+
+	(*rsc)->utilization = g_hash_table_new_full(
+	    g_str_hash, g_str_equal, g_hash_destroy_str, g_hash_destroy_str);
+
+	unpack_instance_attributes(data_set->input, (*rsc)->xml, XML_TAG_UTILIZATION, NULL,
+				   (*rsc)->utilization, NULL, FALSE, data_set->now);
 
 /*	data_set->resources = g_list_append(data_set->resources, (*rsc)); */
 	return TRUE;
@@ -451,6 +457,9 @@
 	if(rsc->meta != NULL) {
 	    g_hash_table_destroy(rsc->meta);
 	}
+	if(rsc->utilization != NULL) {
+	    g_hash_table_destroy(rsc->utilization);
+	}
 	if(rsc->parent == NULL && is_set(rsc->flags, pe_rsc_orphan)) {
 	    free_xml(rsc->xml);
 	}
diff -r c81e55653fba lib/pengine/status.c
--- a/lib/pengine/status.c	Fri Oct 16 14:26:27 2009 +0200
+++ b/lib/pengine/status.c	Tue Nov 03 19:02:22 2009 +0800
@@ -159,6 +159,9 @@
 	if(details->attrs != NULL) {
 	    g_hash_table_destroy(details->attrs);
 	}
+	if(details->utilization != NULL) {
+	    g_hash_table_destroy(details->utilization);
+	}
 	pe_free_shallow_adv(details->running_rsc, FALSE);
 	pe_free_shallow_adv(details->allocated_rsc, FALSE);
 	crm_free(details);
diff -r c81e55653fba lib/pengine/unpack.c
--- a/lib/pengine/unpack.c	Fri Oct 16 14:26:27 2009 +0200
+++ b/lib/pengine/unpack.c	Tue Nov 03 19:02:22 2009 +0800
@@ -165,6 +165,10 @@
 	crm_info("Node scores: 'red' = %s, 'yellow' = %s, 'green' = %s",
 		 score2char(node_score_red),score2char(node_score_yellow),
 		 score2char(node_s
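The XML sample in Yan's mail did not survive the list archive. Given the `XML_TAG_UTILIZATION` ("utilization") nvset that the patch above unpacks for both nodes and primitives, the configuration presumably looked roughly like the following. The element ids and values here are illustrative guesses, not the original sample:

```xml
<node id="yingying" uname="yingying" type="normal">
  <utilization id="yingying-utilization">
    <nvpair id="yingying-utilization-memory" name="memory" value="4096"/>
    <nvpair id="yingying-utilization-cpu" name="cpu" value="8"/>
  </utilization>
</node>
...
<primitive id="dummy0" class="ocf" provider="heartbeat" type="Dummy">
  <utilization id="dummy0-utilization">
    <nvpair id="dummy0-utilization-memory" name="memory" value="2048"/>
    <nvpair id="dummy0-utilization-cpu" name="cpu" value="2"/>
  </utilization>
</primitive>
```

The key point is that utilization lives in its own nvset element rather than in the regular attribute namespace, as Lars proposed.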
Re: [Pacemaker] Resource capacity limit
Hi Lars,
Thanks for the great suggestions!

Lars Marowsky-Bree wrote:
> On 2009-10-30T19:41:35, Yan Gao wrote:
>>
>> Configuration example:
>>
>> node yingying \
>>         attributes capacity="100"
>> primitive dummy0 ocf:heartbeat:Dummy \
>>         meta weight="90" priority="2"
>> primitive dummy1 ocf:heartbeat:Dummy \
>>         meta weight="60" priority="1"
>> ..
>> property $id="cib-bootstrap-options" \
>>         limit-capacity="true"
>
> First, I would prefer not to contaminate the regular node attribute
> namespace; the word "capacity" might already be used. Second, the
> "weight" is just one dimension, which is somewhat difficult.
>
> I'd propose to introduce a new XML element, "resource_utilization" (name
> to be decided ;-) containing a "nvset", which can be used in a node
> element or a resource primitive.
>
> This creates a new namespace, avoiding clashes, and distinguishes the
> utilization parameters from the other various attributes.
>
> Further, it trivially allows for several user-defined metrics.

Right, great idea! I'll try to implement it if Andrew is OK with that as
well :)

> node hex-0 \
>         utilization memory="4096" cpu="8"
> ...
> primitive dummy0 ocf:heartbeat:Dummy \
>         meta priority="2"
>         utilization memory="2048" cpu="2"
> primitive dummy1 ocf:heartbeat:Dummy \
>         utilization memory="3012"
> primitive dummy2 ocf:heartbeat:Dummy \
>         utilization cpu="6"
>
> dummy0 + dummy2 could both be placed on hex-0, or dummy1 + dummy2, but not
> dummy0 + dummy1.
>
> "Placement allowed where none of the utilization parameters would become
> negative." (ie, iterate over the utilization attributes specified for
> the resource.)
>
>> I also noticed a likely similar planned feature described in
>> http://clusterlabs.org/wiki/Planned_Features
>>
>> "Implement adaptive service placement (based on the RAM, CPU etc.
>> required by the service and made available by the nodes)"
>>
>> Indeed, this try only supports a single kind of capacity, and it's not
>> adaptive... Do you already have a thorough consideration about this
>> feature?
>
> I think this is a two-phase feature for the PE: The first phase is what
> you propose - make sure we do not overload any given node, basically
> implementing hard limits.
>
> The second phase would be for the PE to actually try to "optimize"
> placement, and try to solve the constraints imposed by the utilization
> versus capacity scores to a) place as many resources as possible
> successfully, and b) to either spread them thinly (load distribution) or
> condensed (load concentration, think power savings by being able to put
> some nodes to sleep).
>
> The first phase should, IMHO, be quite easy to implement. The second one
> is significantly more difficult, and we'd need to pull in an
> optimization library to solve this for us. It's conceivable that for
> this to happen, we'd need to disable the normal "rsc_location" rules
> altogether because they'd interfere badly. (And interesting to note that
> the rsc_collocation constraints can be mapped into this scheme and
> entirely handled by this solver.)
>
> There is the "adaptive" bit, of course, where the utilization of the
> resources and the nodes is automatically determined and adjusted based
> on utilization monitoring. This is even more challenging and frequently
> considered a research problem.
>
> In summary, I think phase one is urgently needed; thankfully, it is
> straightforward to solve too, and the admin can influence placement with
> priorities and scoring sufficiently to avoid resources being offlined
> due to resource collisions too frequently.
>
> Phase two is a "solved problem" from an algorithmic point of view, but
> implementing it is probably not quite as trivial. I'd welcome to see
> this happening too.

Thanks for the information!

Best regards,
Yan
--
y...@novell.com
Software Engineer
China Server Team, OPS Engineering

Novell, Inc.
Making IT Work As One™
Re: [Pacemaker] Resource capacity limit
hi,

On 10/30/2009 01:20 PM, Lars Marowsky-Bree wrote:
> I think this is a two phase feature for the PE: The first phase is what
> you propose - make sure we do not overload any given node, basically
> implementing hard limits.
>
> The second phase would be for the PE to actually try to "optimize"
> placement, and try to solve the constraints imposed by the utilization
> versus capacity scores to a) place as many resources as possible
> successfully, and b) to either spread them thinly (load distribution) or
> condensed (load concentration, think power savings by being able to put
> some nodes to sleep).

i just want to let you know that i think that this is a marvelous addition
to pacemaker!

cheers,
raoul
--
DI (FH) Raoul Bhatia M.Sc.        email. r.bha...@ipax.at
Technischer Leiter
IPAX - Aloy Bhatia Hava OEG       web. http://www.ipax.at
Barawitzkagasse 10/2/2/11         email. off...@ipax.at
1190 Wien                         tel. +43 1 3670030
FN 277995t HG Wien                fax. +43 1 3670030 15
Re: [Pacemaker] Resource capacity limit
On 2009-10-30T19:41:35, Yan Gao wrote:

Hi Yan Gao,

excellent! Before reviewing the code, let's review the
interface/configuration though.

> Use case:
> Xen guests have memory requirements; nodes cannot host more guests than
> the node has physical memory installed.
>
>
> Configuration example:
>
> node yingying \
>         attributes capacity="100"
> primitive dummy0 ocf:heartbeat:Dummy \
>         meta weight="90" priority="2"
> primitive dummy1 ocf:heartbeat:Dummy \
>         meta weight="60" priority="1"
> ..
> property $id="cib-bootstrap-options" \
>         limit-capacity="true"

First, I would prefer not to contaminate the regular node attribute
namespace; the word "capacity" might already be used. Second, the
"weight" is just one dimension, which is somewhat difficult.

I'd propose to introduce a new XML element, "resource_utilization" (name
to be decided ;-) containing a "nvset", which can be used in a node
element or a resource primitive.

This creates a new namespace, avoiding clashes, and distinguishes the
utilization parameters from the other various attributes.

Further, it trivially allows for several user-defined metrics.

node hex-0 \
        utilization memory="4096" cpu="8"
...
primitive dummy0 ocf:heartbeat:Dummy \
        meta priority="2"
        utilization memory="2048" cpu="2"
primitive dummy1 ocf:heartbeat:Dummy \
        utilization memory="3012"
primitive dummy2 ocf:heartbeat:Dummy \
        utilization cpu="6"

dummy0 + dummy2 could both be placed on hex-0, or dummy1 + dummy2, but not
dummy0 + dummy1.

"Placement allowed where none of the utilization parameters would become
negative." (ie, iterate over the utilization attributes specified for
the resource.)

> If we don't want to enable the capacity limit, we could set property
> "limit-capacity" to "false", or leave it at the default.

Right, a cluster property to globally disable/enable this is a very good
idea.
> I also noticed a likely similar planned feature described in
> http://clusterlabs.org/wiki/Planned_Features
>
> "Implement adaptive service placement (based on the RAM, CPU etc.
> required by the service and made available by the nodes)"
>
> Indeed, this try only supports a single kind of capacity, and it's not
> adaptive... Do you already have a thorough consideration about this
> feature?

I think this is a two-phase feature for the PE: The first phase is what
you propose - make sure we do not overload any given node, basically
implementing hard limits.

The second phase would be for the PE to actually try to "optimize"
placement, and try to solve the constraints imposed by the utilization
versus capacity scores to a) place as many resources as possible
successfully, and b) to either spread them thinly (load distribution) or
condensed (load concentration, think power savings by being able to put
some nodes to sleep).

The first phase should, IMHO, be quite easy to implement. The second one
is significantly more difficult, and we'd need to pull in an
optimization library to solve this for us. It's conceivable that for
this to happen, we'd need to disable the normal "rsc_location" rules
altogether because they'd interfere badly. (And interesting to note that
the rsc_collocation constraints can be mapped into this scheme and
entirely handled by this solver.)

There is the "adaptive" bit, of course, where the utilization of the
resources and the nodes is automatically determined and adjusted based
on utilization monitoring. This is even more challenging and frequently
considered a research problem.

In summary, I think phase one is urgently needed; thankfully, it is
straightforward to solve too, and the admin can influence placement with
priorities and scoring sufficiently to avoid resources being offlined
due to resource collisions too frequently.

Phase two is a "solved problem" from an algorithmic point of view, but
implementing it is probably not quite as trivial.
I'd welcome seeing this happen too.

Adaptive placement ... anyone who wants to write a master or PhD thesis
around it? ;-)

Best,
Lars

--
Architect Storage/HA, OPS Engineering, Novell, Inc.
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde
[Pacemaker] Resource capacity limit
Hi Andrew and Lars,
The attachment is the first try to implement "Resource capacity limit",
which was filed by Lars at:
https://fate.novell.com/303384

Description:
We need a mechanism for the PE to take resource weight into account to
prevent nodes from being overloaded.

Resources would require certain minimal values for node attributes
(this is available right now); however, they would also "consume" them,
reducing the value of the node attributes for further resource placement.
(This could be a special flag in the rsc_location rule, for example.)
If a node does not have enough capacity available, it is not considered.
..

Use case:
Xen guests have memory requirements; nodes cannot host more guests than
the node has physical memory installed.


Configuration example:

node yingying \
        attributes capacity="100"
primitive dummy0 ocf:heartbeat:Dummy \
        meta weight="90" priority="2"
primitive dummy1 ocf:heartbeat:Dummy \
        meta weight="60" priority="1"
..
property $id="cib-bootstrap-options" \
        limit-capacity="true"
..

Because dummy0 has the higher priority, it'll run on node "yingying".
That node then has only "10" (100-90) capacity remaining, so dummy1
cannot run on it. If there's no other node it can run on, dummy1 will be
stopped.

If we don't want to enable the capacity limit, we can set the property
"limit-capacity" to "false", or leave it at its default.

What do you think about the way it's implemented? Did I do it right?

I also noticed a likely similar planned feature described in
http://clusterlabs.org/wiki/Planned_Features

"Implement adaptive service placement (based on the RAM, CPU etc.
required by the service and made available by the nodes)"

Indeed, this try only supports a single kind of capacity, and it's not
adaptive... Do you already have a thorough consideration about this
feature?
Any comments or suggestions are appreciated. Thanks!

Regards,
Yan
--
y...@novell.com
Software Engineer
China Server Team, OPS Engineering

Novell, Inc.
Making IT Work As One™

diff -r 462f1569a437 include/crm/msg_xml.h
--- a/include/crm/msg_xml.h	Mon Aug 10 16:42:41 2009 +0200
+++ b/include/crm/msg_xml.h	Fri Oct 30 19:06:34 2009 +0800
@@ -155,6 +155,7 @@
 #define XML_RSC_ATTR_FAIL_TIMEOUT	"failure-timeout"
 #define XML_RSC_ATTR_MULTIPLE		"multiple-active"
 #define XML_RSC_ATTR_PRIORITY		"priority"
+#define XML_RSC_ATTR_WEIGHT		"weight"
 #define XML_OP_ATTR_ON_FAIL		"on-fail"
 #define XML_OP_ATTR_START_DELAY		"start-delay"
 #define XML_OP_ATTR_ALLOW_MIGRATE	"allow-migrate"
diff -r 462f1569a437 include/crm/pengine/status.h
--- a/include/crm/pengine/status.h	Mon Aug 10 16:42:41 2009 +0200
+++ b/include/crm/pengine/status.h	Fri Oct 30 19:06:34 2009 +0800
@@ -58,6 +58,8 @@
 #define pe_flag_start_failure_fatal	0x1000ULL
 #define pe_flag_remove_after_stop	0x2000ULL
 
+#define pe_flag_limit_capacity	0x0001ULL
+
 typedef struct pe_working_set_s
 {
@@ -111,6 +113,7 @@
 	gboolean expected_up;
 	gboolean is_dc;
 	int	num_resources;
+	int	remain_capacity;
 	GListPtr running_rsc;	/* resource_t* */
 	GListPtr allocated_rsc;	/* resource_t* */
@@ -168,6 +171,7 @@
 	int	failure_timeout;
 	int	effective_priority;
 	int	migration_threshold;
+	int	weight;
 
 	unsigned long long flags;
diff -r 462f1569a437 lib/pengine/common.c
--- a/lib/pengine/common.c	Mon Aug 10 16:42:41 2009 +0200
+++ b/lib/pengine/common.c	Fri Oct 30 19:06:34 2009 +0800
@@ -147,6 +147,10 @@
 	{ "node-health-red", NULL, "integer", NULL, "-INFINITY", &check_number,
 	  "The score 'red' translates to in rsc_location constraints",
 	  "Only used when node-health-strategy is set to custom or progressive." },
+
+	/*Capacity*/
+	{ "limit-capacity", NULL, "boolean", NULL, "false", &check_boolean,
+	  "Limit the capacity of nodes to avoid being overloaded", NULL},
 };
 
 void
diff -r 462f1569a437 lib/pengine/complex.c
--- a/lib/pengine/complex.c	Mon Aug 10 16:42:41 2009 +0200
+++ b/lib/pengine/complex.c	Fri Oct 30 19:06:34 2009 +0800
@@ -352,6 +352,11 @@
 	    /* call crm_get_msec() and convert back to seconds */
 	    (*rsc)->failure_timeout = (crm_get_msec(value) / 1000);
 	}
+
+	value = g_hash_table_lookup((*rsc)->meta, XML_RSC_ATTR_WEIGHT);
+	if(value != NULL) {
+	    (*rsc)->weight = crm_parse_int(value, "0");
+	}
 
 	value = g_hash_table_lookup((*rsc)->meta, XML_RSC_ATTR_TARGET_ROLE);
 	if(is_set(data_set->flags, pe_flag_stop_everything)) {
diff -r 462f1569a437 lib/pengine/unpack.c
--- a/lib/pengine/unpack.c	Mon Aug 10 16:42:41 2009 +0200
+++ b/lib/pengine/unpack.c	Fri Oct 30 19:06:34 2009 +0800
@@ -165,6 +165,10 @@
 	crm_info("Node scores: 'red' = %s, 'yellow' = %s, 'green' = %s",
 		 score2char(node_score_red),score2char(node_score_yellow),
 		 score2char(node_score_green));
+
+	set_config_flag(data_set, "limit-capacity", pe_flag_limit_capacity);
+	crm_debug_2("Limit capacity: %s",
+		    is_set(data_set->flags, pe_flag_limit_capacity)?"true":"false");
 
 	retur