[389-devel] Re: Performance discussion

2019-11-19 Thread William Brown


> On 14 Nov 2019, at 22:33, Ludwig Krispenz  wrote:
> 
> 
> On 11/14/2019 12:17 PM, William Brown wrote:
>> 
>>> On 14 Nov 2019, at 19:06, Ludwig Krispenz  wrote:
>>> 
>>> 
>>> On 11/14/2019 09:29 AM, William Brown wrote:
>>>>> On 14 Nov 2019, at 18:22, Ludwig Krispenz  wrote:
>>>>> 
>>>>> Hi William,
>>>>> 
>>>>> before further thinking about this, I need some clarification, or maybe I 
>>>>> just missed this. When you talk about 1..16 threads do you mean worker 
>>>>> threads ?
>>>> Server worker threads. ldclt is set to only use 10 client threads - which 
>>>> is surprising that with 10 client threads we see a decline when workers > 
>>>> 10 (one would assume it should stabilise).
>>>> 
>>>>> Or concurrent client connection threads in ldclt/rsearch/ - how many 
>>>>> concurrent connections do you have and how does varying this number 
>>>>> change results ?
>>>> I will add more tests to this to allow varying the ldclt numbers.
>>> ok, and I assume that you are using a version with nunc-stans removed, 
>>> could you please also verify the effect of turbo-mode on/off ?
>> Correct, I'm using git master. Yes I'll check that also. I plan to add 
>> permutations like this to the test harness so it's easier for us to repeat 
>> in the future when we make changes.
>> 
>> I also need to find a way to wire in perf/stap so we can generate 
>> flamegraphs from each test run too for later analysis.
>> 
>> Thanks for the great ideas :)
> Thanks, and one more idea ;-)
> Can you separate the client and the server onto two different machines? I've 
> seen ldclt or other clients impact CPU usage a lot. There will be some 
> network overhead, but this should be OK (and more realistic).

That was the original goal, but I can't separate it (yet) because we restart to 
change settings ... 

I'm not sure of the best way to do it - maybe have the tests act as a generator 
and then run ldclt from a separate machine? Not sure, really; I need to think 
about what it should look like.

I know Viktor did some work on pytest over multiple hosts, so perhaps that could 
help coordinate here too? I think he was also speaking about Ansible as well ... 
maybe he should comment if he has ideas.
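
As a rough starting point, something like the sketch below could let the harness 
keep restarting and reconfiguring the server locally while the search load runs 
from a second box. The hostnames, credentials and the exact ldclt flags are 
assumptions to be checked against the real harness and the ldclt man page.

import subprocess

LOAD_HOST = "loadgen.example.com"    # assumed remote load-generator box
SERVER_HOST = "ldap.example.com"     # assumed directory server host

def run_remote_ldclt(client_threads=10, runs=10):
    """Start an ldclt search load on the remote box and return its raw output."""
    ldclt_cmd = (
        "ldclt -h {h} -p 389 -D 'cn=Directory Manager' -w password "
        "-b 'ou=people,dc=example,dc=com' "
        "-e esearch,random -f 'uid=userXXXX' -r0 -R5999 "
        "-n {threads} -N {runs}"
    ).format(h=SERVER_HOST, threads=client_threads, runs=runs)
    # Plain ssh keeps the sketch simple; pytest fixtures or ansible could own this.
    proc = subprocess.run(["ssh", LOAD_HOST, ldclt_cmd],
                          capture_output=True, text=True, check=True)
    return proc.stdout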



>> 
> Regards,
> Ludwig
> 
> On 11/14/2019 03:34 AM, William Brown wrote:
>> Hi all,
>> 
>> After our catch up, we were discussing performance matters. I decided to 
>> start on this while waiting for some of my tickets to be reviewed and to 
>> see what's going on.
>> 
>> These tests were carried out on a virtual machine configured with access 
>> to 6 CPUs for the s6 runs, and 12 CPUs for s12. Both machines had access 
>> to 8GB of RAM.
>> 
>> The hardware is an i7 2.2GHz with 6 cores (12 threads) and 32GB of RAM, 
>> with NVMe storage provided.
>> 
>> The rows are the VM CPUs available, and the columns are the number of 
>> threads in nsslapd-threadnumber. No other variables were changed. The 
>> database has 6000 users and 4000 groups. The instance was restarted 
>> before each test. The search was a randomised uid equality test with a 
>> single result. I provided the thread 6 and 12 columns to try to match 
>> the VM and host specs rather than just the traditional base 2 sequence 
>> we see.
>> 
>> I've attached a screenshot of the results, but I have some initial 
>> thoughts to provide on this. What's interesting is our initial 1-thread 
>> performance and how steeply it ramps up towards 4 threads. Bear in mind 
>> it's not a linear increase: per thread on s6 we go from ~3800 to ~2500 
>> ops per second, and a similar ratio exists in s12. What is stark is that 
>> after t4 we immediately see a per-thread *decline* despite the greater 
>> amount of available compute resources. This indicates that poor locking 
>> and thread coordination are causing a rapid decline in performance. This 
>> was true on both s6 and s12. The decline intensifies rapidly once we 
>> exceed the CPUs available on the host (s6 between t6 and t12), but it 
>> still declines even when we do have the hardware threads available in s12.
>> 
>> I will perform some testing between the t1 and t6 versions to see if I 
>> can isolate which functions grow in time consumption.
>> 
>> For now an early recommendation is that we alter our default CPU 
>> auto-tuning. Currently we use a curve which starts at 16 threads for 1 
>> to 4 cores, and then tapers until 512 cores maps to 512 threads - 
>> however, at almost all of these autotuned points we have more threads 
>> than our core count. The graph would indicate that this decision only 
>> hurts our performance rather than improving it. I suggest we change our 
>> thread autotuning to a 1:1 ratio of threads to cores to prevent 
>> over-contention on lock resources.
>> 
>> 

[389-devel] Re: Performance discussion

2019-11-14 Thread Ludwig Krispenz


On 11/14/2019 12:17 PM, William Brown wrote:



On 14 Nov 2019, at 19:06, Ludwig Krispenz  wrote:


On 11/14/2019 09:29 AM, William Brown wrote:

On 14 Nov 2019, at 18:22, Ludwig Krispenz  wrote:

Hi William,

before further thinking about this, I need some clarification, or maybe I just 
missed this. When you talk about 1..16 threads do you mean worker threads ?

Server worker threads. ldclt is set to only use 10 client threads - which is 
surprising that with 10 client threads we see a decline when workers > 10 (one 
would assume it should stabilise).


Or concurrent client connection threads in ldclt/rsearch/ - how many 
concurrent connections do you have and how does varying this number change 
results ?

I will add more tests to this to allow varying the ldclt numbers.

ok, and I assume that you are using a version with nunc-stans removed, could 
you please also verify the effect of turbo-mode on/off ?

Correct, I'm using git master. Yes I'll check that also. I plan to add 
permutations like this to the test harness so it's easier for us to repeat in 
the future when we make changes.

I also need to find a way to wire in perf/stap so we can generate flamegraphs 
from each test run too for later analysis.

Thanks for the great ideas :)

Thanks, and one more idea ;-)
Can you separate the client and the server onto two different machines? 
I've seen ldclt or other clients impact CPU usage a lot. There will 
be some network overhead, but this should be OK (and more realistic).



Regards,
Ludwig

On 11/14/2019 03:34 AM, William Brown wrote:

Hi all,

After our catch up, we were discussing performance matters. I decided to start 
on this while waiting for some of my tickets to be reviewed and to see what's 
going on.

These tests were carried out on a virtual machine configured with access to 6 
CPUs for the s6 runs, and 12 CPUs for s12. Both machines had access to 8GB of 
RAM.

The hardware is an i7 2.2GHz with 6 cores (12 threads) and 32GB of RAM, with 
NVMe storage provided.

The rows are the VM CPUs available, and the columns are the number of threads 
in nsslapd-threadnumber. No other variables were changed. The database has 6000 
users and 4000 groups. The instance was restarted before each test. The search 
was a randomised uid equality test with a single result. I provided the thread 
6 and 12 columns to try to match the VM and host specs rather than just the 
traditional base 2 sequence we see.
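
For illustration, the shape of that load in python-ldap looks roughly like the 
sketch below; the bind DN, password and base DN are assumptions, and the actual 
numbers in this thread came from ldclt rather than this script.

import random
import time
import ldap

URI = "ldap://localhost:389"
BASE = "ou=people,dc=example,dc=com"    # assumed suffix layout
USERS = 6000                            # matches the 6000 users in the test database

def search_loop(seconds=10):
    conn = ldap.initialize(URI)
    conn.simple_bind_s("cn=Directory Manager", "password")
    ops = 0
    deadline = time.monotonic() + seconds
    while time.monotonic() < deadline:
        uid = "user%04d" % random.randrange(USERS)
        # Randomised uid equality filter; each search returns a single entry.
        conn.search_s(BASE, ldap.SCOPE_SUBTREE, "(uid=%s)" % uid)
        ops += 1
    conn.unbind_s()
    return ops / float(seconds)         # ops per second for this one client thread

if __name__ == "__main__":
    print("%.0f ops/sec" % search_loop())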

I've attached a screenshot of the results, but I have some initial thoughts to 
provide on this. What's interesting is our initial 1-thread performance and how 
steeply it ramps up towards 4 threads. Bear in mind it's not a linear increase: 
per thread on s6 we go from ~3800 to ~2500 ops per second, and a similar ratio 
exists in s12. What is stark is that after t4 we immediately see a per-thread 
*decline* despite the greater amount of available compute resources. This 
indicates that poor locking and thread coordination are causing a rapid 
decline in performance. This was true on both s6 and s12. The decline 
intensifies rapidly once we exceed the CPUs available on the host (s6 between 
t6 and t12), but it still declines even when we do have the hardware threads 
available in s12.
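
As a quick calculation with the approximate s6 figures (nothing here beyond the 
numbers already quoted above):

t1_per_thread = 3800.0                 # ~ops/sec per thread at 1 worker thread (s6)
t4_per_thread = 2500.0                 # ~ops/sec per thread at 4 worker threads (s6)
t4_total = 4 * t4_per_thread           # implied aggregate throughput at t4 (~10000 ops/sec)
efficiency = t4_per_thread / t1_per_thread
print("t4 aggregate ~%.0f ops/sec, per-thread efficiency ~%.0f%% of t1"
      % (t4_total, efficiency * 100))  # each thread runs at ~66% of the t1 rate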

I will perform some testing between the t1 and t6 versions to see if I can 
isolate which functions grow in time consumption.

For now an early recommendation is that we alter our default CPU auto-tuning. 
Currently we use a curve which starts at 16 threads for 1 to 4 cores, and then 
tapers until 512 cores maps to 512 threads - however, at almost all of these 
autotuned points we have more threads than our core count. The graph would 
indicate that this decision only hurts our performance rather than improving 
it. I suggest we change our thread autotuning to a 1:1 ratio of threads to 
cores to prevent over-contention on lock resources.
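
A minimal sketch of what the 1:1 policy would mean in practice is below; 
nsslapd-threadnumber on cn=config is the real attribute, while the URI and 
credentials are placeholders. Something like "dsconf <instance> config replace 
nsslapd-threadnumber=<cores>" should do the same from the CLI.

import os
import ldap

def tune_threads_one_to_one(uri="ldap://localhost:389", dm_password="password"):
    cores = os.cpu_count() or 1
    conn = ldap.initialize(uri)
    conn.simple_bind_s("cn=Directory Manager", dm_password)
    # One worker thread per core, instead of the current 16-thread starting curve.
    conn.modify_s("cn=config",
                  [(ldap.MOD_REPLACE, "nsslapd-threadnumber", [str(cores).encode()])])
    conn.unbind_s()
    return cores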

Thanks, more to come once I set up this profiling on a real machine so I can 
generate flamegraphs.



—
Sincerely,

William Brown

Senior Software Engineer, 389 Directory Server
SUSE Labs



--
Red Hat GmbH,
http://www.de.redhat.com/
, Registered seat: Grasbrunn,
Commercial register: Amtsgericht Muenchen, HRB 153243,
Managing Directors: Charles Cachera, Michael Cunningham, Michael O'Neill, Eric 
Shander

[389-devel] Re: Performance discussion

2019-11-14 Thread William Brown


> On 14 Nov 2019, at 19:06, Ludwig Krispenz  wrote:
> 
> 
> On 11/14/2019 09:29 AM, William Brown wrote:
>> 
>>> On 14 Nov 2019, at 18:22, Ludwig Krispenz  wrote:
>>> 
>>> Hi William,
>>> 
>>> before further thinking about this, I need some clarification, or maybe I 
>>> just missed this. When you talk about 1..16 threads do you mean worker 
>>> threads ?
>> Server worker threads. ldclt is set to only use 10 client threads - which is 
>> surprising that with 10 client threads we see a decline when workers > 10 
>> (one would assume it should stabilise).
>> 
>>> Or concurrent client connection threads in ldclt/rsearch/ - how many 
>>> concurrent connections do you have and how does varying this number change 
>>> results ?
>> I will add more tests to this to allow varying the ldclt numbers.
> ok, and I assume that you are using a version with nunc-stans removed, could 
> you please also verify the effect of turbo-mode on/off ?

Correct, I'm using git master. Yes I'll check that also. I plan to add 
permutations like this to the test harness so it's easier for us to repeat in 
the future when we make changes. 

I also need to find a way to wire in perf/stap so we can generate flamegraphs 
from each test run too for later analysis.
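
One possible shape for that wiring is sketched below, assuming a FlameGraph 
checkout on the box and that the harness already knows the slapd pid; the paths 
are placeholders, not anything that exists in the harness today.

import subprocess

def record_flamegraph(slapd_pid, seconds=30, out_svg="run.svg",
                      flamegraph_dir="/opt/FlameGraph"):
    # Sample the server at 99Hz with call graphs while the load runs.
    subprocess.run(["perf", "record", "-F", "99", "-g", "-p", str(slapd_pid),
                    "--", "sleep", str(seconds)], check=True)
    script = subprocess.run(["perf", "script"], capture_output=True, check=True)
    folded = subprocess.run([flamegraph_dir + "/stackcollapse-perf.pl"],
                            input=script.stdout, capture_output=True, check=True)
    with open(out_svg, "wb") as f:
        subprocess.run([flamegraph_dir + "/flamegraph.pl"],
                       input=folded.stdout, stdout=f, check=True)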

Thanks for the great ideas :) 

>> 
>>> Regards,
>>> Ludwig
>>> 
>>> On 11/14/2019 03:34 AM, William Brown wrote:
 Hi all,
 
 After our catch up, we were discussing performance matters. I decided to 
 start on this while waiting for some of my tickets to be reviewed and to 
 see what's going on.
 
 These tests were carried out on a virtual machine configured with access to 
 6 CPUs for the s6 runs, and 12 CPUs for s12. Both machines had access to 
 8GB of RAM.

 The hardware is an i7 2.2GHz with 6 cores (12 threads) and 32GB of RAM, 
 with NVMe storage provided.

 The rows are the VM CPUs available, and the columns are the number of 
 threads in nsslapd-threadnumber. No other variables were changed. The 
 database has 6000 users and 4000 groups. The instance was restarted before 
 each test. The search was a randomised uid equality test with a single 
 result. I provided the thread 6 and 12 columns to try to match the VM and 
 host specs rather than just the traditional base 2 sequence we see.

 I've attached a screenshot of the results, but I have some initial 
 thoughts to provide on this. What's interesting is our initial 1-thread 
 performance and how steeply it ramps up towards 4 threads. Bear in mind 
 it's not a linear increase: per thread on s6 we go from ~3800 to ~2500 ops 
 per second, and a similar ratio exists in s12. What is stark is that after 
 t4 we immediately see a per-thread *decline* despite the greater amount of 
 available compute resources. This indicates that poor locking and thread 
 coordination are causing a rapid decline in performance. This was true on 
 both s6 and s12. The decline intensifies rapidly once we exceed the CPUs 
 available on the host (s6 between t6 and t12), but it still declines even 
 when we do have the hardware threads available in s12.

 I will perform some testing between the t1 and t6 versions to see if I can 
 isolate which functions grow in time consumption.

 For now an early recommendation is that we alter our default CPU 
 auto-tuning. Currently we use a curve which starts at 16 threads for 1 to 
 4 cores, and then tapers until 512 cores maps to 512 threads - however, at 
 almost all of these autotuned points we have more threads than our core 
 count. The graph would indicate that this decision only hurts our 
 performance rather than improving it. I suggest we change our thread 
 autotuning to a 1:1 ratio of threads to cores to prevent over-contention 
 on lock resources.

 Thanks, more to come once I set up this profiling on a real machine so I 
 can generate flamegraphs.
 
 
 
 —
 Sincerely,
 
 William Brown
 
 Senior Software Engineer, 389 Directory Server
 SUSE Labs
 
 
 
>>> -- 
>>> Red Hat GmbH,
>>> http://www.de.redhat.com/
>>> , Registered seat: Grasbrunn,
>>> Commercial register: Amtsgericht Muenchen, HRB 153243,
>>> Managing Directors: Charles Cachera, Michael Cunningham, Michael O'Neill, 
>>> Eric Shander
>>> 
>>> 

[389-devel] Re: Performance discussion

2019-11-14 Thread Ludwig Krispenz


On 11/14/2019 09:29 AM, William Brown wrote:



On 14 Nov 2019, at 18:22, Ludwig Krispenz  wrote:

Hi William,

before further thinking about this, I need some clarification, or maybe I just 
missed this. When you talk about 1..16 threads do you mean worker threads ?

Server worker threads. ldclt is set to only use 10 client threads - which is 
surprising that with 10 client threads we see a decline when workers > 10 (one 
would assume it should stabilise).


Or concurrent client connection threads in ldclt/rsearch/ - how many 
concurrent connections do you have and how does varying this number change 
results ?

I will add more tests to this to allow varying the ldclt numbers.
ok, and I assume that you are using a version with nunc-stans removed, 
could you please also verify the effect of turbo-mode on/off ?



Regards,
Ludwig

On 11/14/2019 03:34 AM, William Brown wrote:

Hi all,

After our catch up, we were discussing performance matters. I decided to start 
on this while waiting for some of my tickets to be reviewed and to see what's 
going on.

These tests were carried out on a virtual machine configured with access to 6 
CPUs for the s6 runs, and 12 CPUs for s12. Both machines had access to 8GB of 
RAM.

The hardware is an i7 2.2GHz with 6 cores (12 threads) and 32GB of RAM, with 
NVMe storage provided.

The rows are the VM CPUs available, and the columns are the number of threads 
in nsslapd-threadnumber. No other variables were changed. The database has 6000 
users and 4000 groups. The instance was restarted before each test. The search 
was a randomised uid equality test with a single result. I provided the thread 
6 and 12 columns to try to match the VM and host specs rather than just the 
traditional base 2 sequence we see.

I've attached a screenshot of the results, but I have some initial thoughts to 
provide on this. What's interesting is our initial 1-thread performance and how 
steeply it ramps up towards 4 threads. Bear in mind it's not a linear increase: 
per thread on s6 we go from ~3800 to ~2500 ops per second, and a similar ratio 
exists in s12. What is stark is that after t4 we immediately see a per-thread 
*decline* despite the greater amount of available compute resources. This 
indicates that poor locking and thread coordination are causing a rapid 
decline in performance. This was true on both s6 and s12. The decline 
intensifies rapidly once we exceed the CPUs available on the host (s6 between 
t6 and t12), but it still declines even when we do have the hardware threads 
available in s12.

I will perform some testing between the t1 and t6 versions to see if I can 
isolate which functions grow in time consumption.

For now an early recommendation is that we alter our default CPU auto-tuning. 
Currently we use a curve which starts at 16 threads for 1 to 4 cores, and then 
tapers until 512 cores maps to 512 threads - however, at almost all of these 
autotuned points we have more threads than our core count. The graph would 
indicate that this decision only hurts our performance rather than improving 
it. I suggest we change our thread autotuning to a 1:1 ratio of threads to 
cores to prevent over-contention on lock resources.

Thanks, more to come once I set up this profiling on a real machine so I can 
generate flamegraphs.



—
Sincerely,

William Brown

Senior Software Engineer, 389 Directory Server
SUSE Labs




--
Red Hat GmbH,
http://www.de.redhat.com/
, Registered seat: Grasbrunn,
Commercial register: Amtsgericht Muenchen, HRB 153243,
Managing Directors: Charles Cachera, Michael Cunningham, Michael O'Neill, Eric 
Shander



[389-devel] Re: Performance discussion

2019-11-14 Thread William Brown


> On 14 Nov 2019, at 18:22, Ludwig Krispenz  wrote:
> 
> Hi William,
> 
> before further thinking about this, I need some clarification, or maybe I 
> just missed this. When you talk about 1..16 threads do you mean worker 
> threads ?

Server worker threads. ldclt is set to only use 10 client threads - which is 
surprising that with 10 client threads we see a decline when workers > 10 (one 
would assume it should stabilise). 

> Or concurrent client connection threads in ldclt/rsearch/ - how many 
> concurrent connections do you have and how does varying this number change 
> results ?

I will add more tests to this to allow varying the ldclt numbers. 
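
Roughly, the permutation sweep could look like the sketch below; the three 
callbacks are placeholders the harness would provide, not existing test code.

import itertools

WORKER_THREADS = [1, 2, 4, 6, 8, 12, 16]   # nsslapd-threadnumber values to sweep
CLIENT_THREADS = [5, 10, 20, 40]           # ldclt client-thread counts to sweep

def run_matrix(set_worker_threads, restart_instance, run_load):
    results = {}
    for workers, clients in itertools.product(WORKER_THREADS, CLIENT_THREADS):
        set_worker_threads(workers)    # update nsslapd-threadnumber
        restart_instance()             # restart before each run, as in the tests above
        results[(workers, clients)] = run_load(client_threads=clients)  # ops/sec
    return results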

> 
> Regards,
> Ludwig
> 
> On 11/14/2019 03:34 AM, William Brown wrote:
>> Hi all,
>> 
>> After our catch up, we were discussing performance matters. I decided to 
>> start on this while waiting for some of my tickets to be reviewed and to see 
>> what's going on.
>> 
>> These tests were carried out on a virtual machine configured with access 
>> to 6 CPUs for the s6 runs, and 12 CPUs for s12. Both machines had access 
>> to 8GB of RAM.
>> 
>> The hardware is an i7 2.2GHz with 6 cores (12 threads) and 32GB of RAM, 
>> with NVMe storage provided.
>> 
>> The rows are the VM CPUs available, and the columns are the number of 
>> threads in nsslapd-threadnumber. No other variables were changed. The 
>> database has 6000 users and 4000 groups. The instance was restarted 
>> before each test. The search was a randomised uid equality test with a 
>> single result. I provided the thread 6 and 12 columns to try to match 
>> the VM and host specs rather than just the traditional base 2 sequence 
>> we see.
>> 
>> I've attached a screenshot of the results, but I have some initial 
>> thoughts to provide on this. What's interesting is our initial 1-thread 
>> performance and how steeply it ramps up towards 4 threads. Bear in mind 
>> it's not a linear increase: per thread on s6 we go from ~3800 to ~2500 
>> ops per second, and a similar ratio exists in s12. What is stark is that 
>> after t4 we immediately see a per-thread *decline* despite the greater 
>> amount of available compute resources. This indicates that poor locking 
>> and thread coordination are causing a rapid decline in performance. This 
>> was true on both s6 and s12. The decline intensifies rapidly once we 
>> exceed the CPUs available on the host (s6 between t6 and t12), but it 
>> still declines even when we do have the hardware threads available in s12.
>> 
>> I will perform some testing between the t1 and t6 versions to see if I 
>> can isolate which functions grow in time consumption.
>> 
>> For now an early recommendation is that we alter our default CPU 
>> auto-tuning. Currently we use a curve which starts at 16 threads for 1 
>> to 4 cores, and then tapers until 512 cores maps to 512 threads - 
>> however, at almost all of these autotuned points we have more threads 
>> than our core count. The graph would indicate that this decision only 
>> hurts our performance rather than improving it. I suggest we change our 
>> thread autotuning to a 1:1 ratio of threads to cores to prevent 
>> over-contention on lock resources.
>> 
>> Thanks, more to come once I set up this profiling on a real machine so I 
>> can generate flamegraphs.
>> 
>> 
>> 
>> —
>> Sincerely,
>> 
>> William Brown
>> 
>> Senior Software Engineer, 389 Directory Server
>> SUSE Labs
>> 
>> 
>> 
> 
> -- 
> Red Hat GmbH, 
> http://www.de.redhat.com/
> , Registered seat: Grasbrunn, 
> Commercial register: Amtsgericht Muenchen, HRB 153243,
> Managing Directors: Charles Cachera, Michael Cunningham, Michael O'Neill, 
> Eric Shander
> 

—
Sincerely,

William Brown

Senior Software Engineer, 389 Directory Server
SUSE Labs

[389-devel] Re: Performance discussion

2019-11-14 Thread Ludwig Krispenz

Hi William,

before further thinking about this, I need some clarification, or maybe 
I just missed this. When you talk about 1..16 threads do you mean worker 
threads ? Or concurrent client connection threads in ldclt/rsearch/ 
- how many concurrent connections do you have and how does varying this 
number change results ?


Regards,
Ludwig

On 11/14/2019 03:34 AM, William Brown wrote:

Hi all,

After our catch up, we were discussing performance matters. I decided 
to start on this while waiting for some of my tickets to be reviewed 
and to see what's going on.


These tests were carried out on a virtual machine configured with access to 
6 CPUs for the s6 runs, and 12 CPUs for s12. Both machines had access to 
8GB of RAM.

The hardware is an i7 2.2GHz with 6 cores (12 threads) and 32GB of RAM, 
with NVMe storage provided.

The rows are the VM CPUs available, and the columns are the number of 
threads in nsslapd-threadnumber. No other variables were changed. The 
database has 6000 users and 4000 groups. The instance was restarted 
before each test. The search was a randomised uid equality test with a 
single result. I provided the thread 6 and 12 columns to try to match 
the VM and host specs rather than just the traditional base 2 sequence 
we see.

I've attached a screenshot of the results, but I have some initial 
thoughts to provide on this. What's interesting is our initial 1-thread 
performance and how steeply it ramps up towards 4 threads. Bear in mind 
it's not a linear increase: per thread on s6 we go from ~3800 to ~2500 
ops per second, and a similar ratio exists in s12. What is stark is that 
after t4 we immediately see a per-thread *decline* despite the greater 
amount of available compute resources. This indicates that poor locking 
and thread coordination are causing a rapid decline in performance. This 
was true on both s6 and s12. The decline intensifies rapidly once we 
exceed the CPUs available on the host (s6 between t6 and t12), but it 
still declines even when we do have the hardware threads available in s12.

I will perform some testing between the t1 and t6 versions to see if I 
can isolate which functions grow in time consumption.

For now an early recommendation is that we alter our default CPU 
auto-tuning. Currently we use a curve which starts at 16 threads for 1 
to 4 cores, and then tapers until 512 cores maps to 512 threads - 
however, at almost all of these autotuned points we have more threads 
than our core count. The graph would indicate that this decision only 
hurts our performance rather than improving it. I suggest we change our 
thread autotuning to a 1:1 ratio of threads to cores to prevent 
over-contention on lock resources.

Thanks, more to come once I set up this profiling on a real machine so 
I can generate flamegraphs.



—
Sincerely,

William Brown

Senior Software Engineer, 389 Directory Server
SUSE Labs





--
Red Hat GmbH, http://www.de.redhat.com/, Registered seat: Grasbrunn,
Commercial register: Amtsgericht Muenchen, HRB 153243,
Managing Directors: Charles Cachera, Michael Cunningham, Michael O'Neill, Eric 
Shander
