Thanks Thomas for putting this together. The things that come to mind as
concerns regarding Ganeti performance are:

   - Level of concurrency for all jobs. Naturally, create operations
   (including inter-cluster moves) and replace-disks operations tend to
   block the job queue for extended periods of time.
   - Responsiveness of RAPI and the CLI tools when there are 30+ jobs in
   the queue (perhaps related to the above point).
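A toy model of that second concern: if queries contend on a lock that
long-running jobs hold, query latency degrades as soon as the queue is
busy. The lock and the timings below are invented for illustration
(Ganeti's real locking is finer-grained); only the measurement pattern
is the point:

```python
import threading
import time

# Invented stand-in for a coarse shared lock taken by both jobs and
# queries; this is only a model, not Ganeti's actual locking.
config_lock = threading.Lock()

def long_job(duration):
    """Simulate a job that holds the shared lock while it runs."""
    with config_lock:
        time.sleep(duration)

def timed_query():
    """Simulate a read-only query; return how long it had to wait."""
    start = time.monotonic()
    with config_lock:
        pass  # a real query would read cluster state here
    return time.monotonic() - start

# Baseline: query latency with an idle queue.
idle_latency = timed_query()

# Query latency while a "job" is holding the lock.
job = threading.Thread(target=long_job, args=(0.2,))
job.start()
time.sleep(0.05)  # let the job acquire the lock first
busy_latency = timed_query()
job.join()

print(f"idle: {idle_latency * 1000:.1f} ms, "
      f"busy: {busy_latency * 1000:.1f} ms")
```

The benchmarks could record exactly this pair of numbers (query latency
with an empty vs. a loaded queue) to catch regressions.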

The amount of time we wait on disk wipes and disk syncs completely eclipses
any amount of time we spend waiting on Ganeti *when the job is running.* We
do, of course, want more job concurrency, so the parallel job execution
benchmarking you're planning to do should measure that well.

I would suggest adding a few more instance operations, such as reboots
and reinstalls, to the list of operations you perform in the second of
your parallel job execution benchmarks. It's common for our reboots to
be blocked on other long-running tasks. The benchmarking will hopefully
reveal whether different jobs have different locking characteristics
that impact concurrency.
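A minimal harness for that kind of benchmark might look like the
following. The `fake_reboot` callable is a placeholder for a real job
submission (e.g. a reboot or reinstall followed by waiting for job
completion); only the serial-vs-parallel timing pattern is meant
literally:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_parallel(op, args_list, workers):
    """Run op over args_list with the given parallelism; return
    (total_wall_time, per_call_durations)."""
    durations = []

    def timed(arg):
        start = time.monotonic()
        op(arg)
        # list.append is atomic in CPython, safe across threads
        durations.append(time.monotonic() - start)

    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(timed, args_list))
    return time.monotonic() - start, durations

# Placeholder operation; a real benchmark would submit a reboot or
# reinstall job here and block until the job finishes.
def fake_reboot(instance):
    time.sleep(0.05)

instances = ["inst%d" % i for i in range(8)]
serial, _ = run_parallel(fake_reboot, instances, workers=1)
parallel, _ = run_parallel(fake_reboot, instances, workers=8)
print(f"serial: {serial:.2f}s, parallel: {parallel:.2f}s")
```

If jobs serialize on a common lock, the "parallel" total will approach
the serial one, which is exactly the kind of regression these tests
should surface.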

Jonathan


On Wed, Apr 16, 2014 at 11:12 AM, Thomas Thrainer <[email protected]> wrote:

> Hi,
>
> I forgot to add the new design doc to Makefile.am, so I'd like to include
> the following interdiff:
>
> diff --git a/Makefile.am b/Makefile.am
> index fbeb9f2..2ce5b24 100644
> --- a/Makefile.am
> +++ b/Makefile.am
> @@ -534,6 +534,7 @@ docinput = \
>         doc/design-optables.rst \
>         doc/design-ovf-support.rst \
>         doc/design-partitioned.rst \
> +       doc/design-performance-tests.rst \
>         doc/design-query-splitting.rst \
>         doc/design-query2.rst \
>         doc/design-reason-trail.rst \
>
>
> On Wed, Apr 16, 2014 at 2:47 PM, Thomas Thrainer <[email protected]> wrote:
>
>> This design doc describes which tests are added in order to test the
>> performance of Ganeti, specifically when handling multiple jobs in
>> parallel.
>>
>> Note that this design doc is submitted to stable-2.10 so performance
>> changes over different Ganeti versions can be captured.
>>
>> Signed-off-by: Thomas Thrainer <[email protected]>
>> ---
>>
>> If you have additional test scenarios in mind, please share them
>> with me. Ideally, also include a rationale for why a scenario is
>> relevant.
>>
>>  doc/design-draft.rst             |  1 +
>>  doc/design-performance-tests.rst | 96
>> ++++++++++++++++++++++++++++++++++++++++
>>  2 files changed, 97 insertions(+)
>>  create mode 100644 doc/design-performance-tests.rst
>>
>> diff --git a/doc/design-draft.rst b/doc/design-draft.rst
>> index 35f6c96..81faa80 100644
>> --- a/doc/design-draft.rst
>> +++ b/doc/design-draft.rst
>> @@ -19,6 +19,7 @@ Design document drafts
>>     design-ceph-ganeti-support.rst
>>     design-daemons.rst
>>     design-hsqueeze.rst
>> +   design-performance-tests.rst
>>
>>  .. vim: set textwidth=72 :
>>  .. Local Variables:
>> diff --git a/doc/design-performance-tests.rst
>> b/doc/design-performance-tests.rst
>> new file mode 100644
>> index 0000000..1f804e0
>> --- /dev/null
>> +++ b/doc/design-performance-tests.rst
>> @@ -0,0 +1,96 @@
>> +========================
>> +Performance tests for QA
>> +========================
>> +
>> +.. contents:: :depth: 4
>> +
>> +This design document describes performance tests to be added to QA in
>> +order to measure performance changes over time.
>> +
>> +Current state and shortcomings
>> +==============================
>> +
>> +Currently, only functional QA tests are performed. Those tests verify
>> +the correct behaviour of Ganeti in various configurations, but are not
>> +designed to continuously monitor the performance of Ganeti.
>> +
>> +The current QA tests don't execute multiple tasks/jobs in parallel.
>> +Therefore, the locking part of Ganeti does not really receive any
>> +testing, either functionally or performance-wise.
>> +
>> +On the plus side, Ganeti's QA code does already measure the runtime of
>> +individual tests, which is leveraged in this design.
>> +
>> +Proposed changes
>> +================
>> +
>> +The tests to be added in the context of this design document focus on
>> +two areas:
>> +
>> +  * Job queue performance. How does Ganeti handle a lot of submitted
>> +    jobs?
>> +  * Parallel job execution performance. How well does Ganeti
>> +    parallelize jobs?
>> +
>> +In order to make it easier to recognize performance related tests, all
>> +tests added in the context of this design get a description with a
>> +"PERFORMANCE: " prefix.
>> +
>> +Job queue performance
>> +---------------------
>> +
>> +Tests targeting the job queue should eliminate external factors (like
>> +network/disk performance or hypervisor delays) as much as possible, so
>> +they are designed to run in a vcluster QA environment.
>> +
>> +The following tests are added to the QA:
>> +
>> +  * Submit the maximum number of instance create jobs in parallel. As
>> +    soon as a creation job starts to run, submit a removal job for this
>> +    instance.
>> +  * Submit as many instance create jobs as there are nodes in the
>> +    cluster in parallel (for non-redundant instances). Removal jobs
>> +    as above.
>> +  * For the maximum number of instances in the cluster, submit modify
>> +    jobs (modify hypervisor and backend parameters) in parallel.
>> +  * For the maximum number of instances in the cluster, submit stop,
>> +    start, reboot and reinstall jobs in parallel.
>> +  * For the maximum number of instances in the cluster, submit multiple
>> +    list and info jobs in parallel.
>> +  * For the maximum number of instances in the cluster, submit move
>> +    jobs in parallel.
>> +  * For the maximum number of instances in the cluster, submit add-,
>> +    remove- and list-tags jobs.
>> +
>> +Parallel job execution performance
>> +----------------------------------
>> +
>> +Tests targeting the performance of parallel execution of "real" jobs
>> +in close-to-production clusters should actually perform all operations,
>> +such as creating disks and starting instances. This way, real world
>> +locking or waiting issues can be reproduced. Performing all those
>> +operations does require quite some time though, so only a smaller
>> +number of instances and parallel jobs can be tested realistically.
>> +
>> +The following tests are added to the QA:
>> +
>> +  * Submit twice as many instance creation requests as there are
>> +    nodes in the cluster, using DRBD as disk template. As soon as a
>> +    creation job starts to run, submit a removal job for this instance.
>> +  * Create an instance using DRBD. Fail it over, migrate it, recreate
>> +    its disk and change its secondary node while creating an additional
>> +    instance in parallel to each of those operations.
>> +
>> +Future work
>> +===========
>> +
>> +Based on test results of the tests listed above, additional tests can
>> +be added to cover more real-world use-cases. Also, based on user
>> +requests, specially crafted performance tests modeling those workloads
>> +can be added too.
>> +
>> +.. vim: set textwidth=72 :
>> +.. Local Variables:
>> +.. mode: rst
>> +.. fill-column: 72
>> +.. End:
>> --
>> 1.9.1.423.g4596e3a
>>
>>
>
>
> --
> Thomas Thrainer | Software Engineer | [email protected] |
>
> Google Germany GmbH
> Dienerstr. 12
> 80331 München
>
> Registergericht und -nummer: Hamburg, HRB 86891
> Sitz der Gesellschaft: Hamburg
> Geschäftsführer: Graham Law, Christine Elizabeth Flores
>



-- 
Jonathan Woodbury
Ganeti SRE - NYC
Google Inc.
