On 21 Jul 2014, at 11:07 pm, Vladislav Bogdanov <bub...@hoster-ok.com> wrote:
> 21.07.2014 13:37, Andrew Beekhof wrote:
>>
>> On 21 Jul 2014, at 3:09 pm, Vladislav Bogdanov <bub...@hoster-ok.com> wrote:
>>
>>> 21.07.2014 06:21, Andrew Beekhof wrote:
>>>>
>>>> On 18 Jul 2014, at 5:16 pm, Vladislav Bogdanov <bub...@hoster-ok.com> wrote:
>>>>
>>>>> Hi Andrew, all,
>>>>>
>>>>> I have a task which seems to be easily solvable with the use of a
>>>>> globally-unique clone: start a huge number of specific virtual machines
>>>>> to provide load to a connection multiplexer.
>>>>>
>>>>> I decided to look at how pacemaker behaves in such a setup with the
>>>>> Dummy resource agent, and found that handling of every instance in an
>>>>> "initial" transition (probe+start) slows down as clone-max increases.
>>>>
>>>> "yep"
>>>>
>>>> for non-unique clones the number of probes needed is N, where N is the
>>>> number of nodes.
>>>> for unique clones, we must test every instance and node combination, or
>>>> N*M, where M is clone-max.
>>>>
>>>> And that's just the running of the probes... just figuring out which
>>>> nodes need to be probed is incredibly resource intensive (run
>>>> crm_simulate and it will be painfully obvious).
>>>>
>>>>> E.g. for 256 instances the transition took 225 seconds, ~0.88s per
>>>>> instance. After I added 768 more instances (set clone-max to 1024)
>>
>> How many nodes though?
>
> Two nodes run in VMs.
>
>> Assuming 3, that's still only ~1s per operation (including the time taken
>> to send the operation across the network twice and update the cib).
>>
>>>>> together with increasing batch-limit to 512, the transition took almost
>>>>> an hour (3507 seconds), or ~4.57s per added instance. Even if I take
>>>>> into account that monitoring of already-started instances consumes some
>>>>> resources, the last number seems rather big,
>>>
>>> I believe this ^ is the main point.
>>> If with N instances probe/start of _each_ instance takes X time slots,
>>> then with 4*N instances probe/start of _each_ instance takes ~5*X time
>>> slots. In an ideal world, I would expect it to remain constant.
>>
>> Unless you have 512 cores in the cluster, increasing the batch-limit in
>> this way is certainly not going to give you the results you're looking
>> for. Firing more tasks at a machine just ends up producing more context
>> switches as the kernel tries to juggle the various tasks.
>>
>> More context switches == more CPU wasted == more time taken overall ==
>> completely consistent with your results.
>
> Thanks to oprofile, I was able to gain a speedup of 8-9% with the following
> patch:
> =========
> diff --git a/crmd/te_utils.c b/crmd/te_utils.c
> index 2167370..c612718 100644
> --- a/crmd/te_utils.c
> +++ b/crmd/te_utils.c
> @@ -374,8 +374,6 @@ te_graph_trigger(gpointer user_data)
>      graph_rc = run_graph(transition_graph);
>      transition_graph->batch_limit = limit;  /* Restore the configured value */
>
> -    print_graph(LOG_DEBUG_3, transition_graph);
> -

This one can go... it gets called every time an action finishes.

>      if (graph_rc == transition_active) {
>          crm_trace("Transition not yet complete");
>          return TRUE;
> diff --git a/crmd/tengine.c b/crmd/tengine.c
> index 765628c..ec0e1d4 100644
> --- a/crmd/tengine.c
> +++ b/crmd/tengine.c
> @@ -221,7 +221,6 @@ do_te_invoke(long long action,
>      }
>
>      trigger_graph();
> -    print_graph(LOG_DEBUG_2, transition_graph);

This is once per transition though... shouldn't hurt much.

>
>      if (graph_data != input->xml) {
>          free_xml(graph_data);
> =========
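For what it's worth, if you'd rather keep the dump available for debugging
than delete it outright, wrapping it in a log-level check should avoid almost
all of the cost in the common case. Something like this untested sketch (the
exact variable and macro names for the level check are from memory, so verify
them against libcrmcommon before relying on it):

/* Untested sketch: only walk and dump the graph when that log level is
 * actually enabled, so the per-completed-action overhead goes away. */
if (crm_log_level >= LOG_DEBUG_3) {
    print_graph(LOG_DEBUG_3, transition_graph);
}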
> Results this time are measured only for the clean start op, after probes
> are done (add the stopped clone, wait for probes to complete, then start
> the clone).
>
> 256 (vanilla):  09:51:50 - 09:53:17 =>  1:27 =   87s => 0.33984375 s per instance
> 1024 (vanilla): 10:17:10 - 10:34:34 => 17:24 = 1044s => 1.01953125 s per instance
> 1024 (patched): 11:59:26 - 12:15:12 => 15:46 =  946s => 0.92382813 s per instance
>
> So, still not perfect, but better.
>
> Unfortunately, my binaries are built with optimization, so I'm not able to
> get call graphs yet.
>
> Also, as I run in VMs, no hardware support for oprofile is available, so
> the results may be a bit inaccurate.
>
> Here is the system-wide opreport top for the unpatched crmd with 1024 instances:
>
> CPU: CPU with timer interrupt, speed 0 MHz (estimated)
> Profiling through timer interrupt
> samples  %        image name               app name                 symbol name
> 429963   41.3351  no-vmlinux               no-vmlinux               /no-vmlinux
> 129533   12.4528  libxml2.so.2.7.6         libxml2.so.2.7.6         /usr/lib64/libxml2.so.2.7.6
> 101326    9.7411  libc-2.12.so             libc-2.12.so             __strcmp_sse42
> 42524     4.0881  libtransitioner.so.2.0.1 libtransitioner.so.2.0.1 print_synapse
> 37062     3.5630  libc-2.12.so             libc-2.12.so             malloc_consolidate
> 23268     2.2369  libcrmcommon.so.3.2.0    libcrmcommon.so.3.2.0    find_entity
> 21416     2.0589  libc-2.12.so             libc-2.12.so             _int_malloc
> 18950     1.8218  libcrmcommon.so.3.2.0    libcrmcommon.so.3.2.0    crm_element_value
> 17482     1.6807  libfreebl3.so            libfreebl3.so            /lib64/libfreebl3.so
> 15350     1.4757  libc-2.12.so             libc-2.12.so             vfprintf
> 15016     1.4436  libqb.so.0.16.0          libqb.so.0.16.0          /usr/lib64/libqb.so.0.16.0
> 13189     1.2679  bash                     bash                     /bin/bash
> 11375     1.0936  libc-2.12.so             libc-2.12.so             _int_free
> 10762     1.0346  libtotem_pg.so.5.0.0     libtotem_pg.so.5.0.0     /usr/lib64/libtotem_pg.so.5.0.0
> 10345     0.9945  libc-2.12.so             libc-2.12.so             _IO_default_xsputn
> ...
>
> And with the patch:
>
> CPU: CPU with timer interrupt, speed 0 MHz (estimated)
> Profiling through timer interrupt
> samples  %        image name               app name                 symbol name
> 434810   46.2143  no-vmlinux               no-vmlinux               /no-vmlinux
> 125397   13.3280  libxml2.so.2.7.6         libxml2.so.2.7.6         /usr/lib64/libxml2.so.2.7.6
> 85259     9.0619  libc-2.12.so             libc-2.12.so             __strcmp_sse42
> 33563     3.5673  libc-2.12.so             libc-2.12.so             malloc_consolidate
> 18885     2.0072  libc-2.12.so             libc-2.12.so             _int_malloc
> 16714     1.7765  libcrmcommon.so.3.2.0    libcrmcommon.so.3.2.0    crm_element_value
> 14966     1.5907  libfreebl3.so            libfreebl3.so            /lib64/libfreebl3.so
> 14510     1.5422  libc-2.12.so             libc-2.12.so             vfprintf
> 13664     1.4523  bash                     bash                     /bin/bash
> 13505     1.4354  libcrmcommon.so.3.2.0    libcrmcommon.so.3.2.0    find_entity
> 12605     1.3397  libqb.so.0.16.0          libqb.so.0.16.0          /usr/lib64/libqb.so.0.16.0
> 10855     1.1537  libc-2.12.so             libc-2.12.so             _int_free
> 9857      1.0477  libc-2.12.so             libc-2.12.so             _IO_default_xsputn
> ...
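To put the scaling into perspective: the per-instance figures are simply
wall-clock time divided by clone-max, and the probe count for a
globally-unique clone is nodes * clone-max versus just nodes for an anonymous
one. A rough standalone sanity check (numbers copied from this thread, two
nodes, nothing measured independently):

/* compile with: gcc -o scaling-check scaling-check.c */
#include <stdio.h>

int main(void)
{
    const int nodes = 2;                     /* two-node VM cluster used here */
    const int sizes[2] = { 256, 1024 };
    const char *labels[3] = { "256 vanilla ", "1024 vanilla", "1024 patched" };
    const double start_seconds[3] = { 87.0, 1044.0, 946.0 };
    const int counts[3] = { 256, 1024, 1024 };
    int i;

    /* Probes: anonymous clones need one per node, globally-unique
     * clones need one per node per instance (N*M). */
    for (i = 0; i < 2; i++) {
        printf("clone-max=%4d: anonymous probes=%d  unique probes=%d\n",
               sizes[i], nodes, nodes * sizes[i]);
    }

    /* Start-only wall-clock times quoted above, divided by instance count */
    for (i = 0; i < 3; i++) {
        printf("%s: %6.0fs / %4d = %.6f s per instance\n",
               labels[i], start_seconds[i], counts[i],
               start_seconds[i] / counts[i]);
    }
    return 0;
}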
>>> Otherwise we have an issue with scalability in this direction.
>>>
>>>>> Main CPU consumer on the DC while the transition is running is crmd.
>>>>> Its memory footprint is around 85 MB, and the resulting CIB size
>>>>> together with the status section is around 2 MB,
>>>>
>>>> You said CPU and then listed RAM...
>>>
>>> Something wrong with that? :)
>>> Those are just three distinct facts.
>>
>> I was expecting quantification of the relative CPU usage.
>> I was also expecting the PE to have massive spikes whenever a new
>> transition is calculated.
>>
>>>>> Could it be possible to optimize this use case, in your opinion, with
>>>>> minimal effort? Could it be optimized with just configuration? Or might
>>>>> it be a trivial development task, e.g. replacing one GList with a
>>>>> GHashTable somewhere?
>>>>
>>>> Optimize: yes, Minimal: no
>>>>
>>>>> Sure, I can look deeper and get any additional information, e.g. crmd
>>>>> profiling results, if it is hard to get an answer just off the top of
>>>>> your head.
>>>>
>>>> Perhaps start looking in clone_create_probe()
>>>
>>> Got it, thanks for the pointer!
>>>
>>>>> Best,
>>>>> Vladislav
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org