[389-devel] Re: System tap performance results,

2019-12-11 Thread William Brown


> On 11 Dec 2019, at 20:29, thierry bordaz  wrote:
> 
> 
> 
> On 12/11/19 1:21 AM, William Brown wrote:
>> 
>>> On 10 Dec 2019, at 19:15, thierry bordaz  wrote:
>>> 
>>> Hi William,
>>> 
>>> Thanks for these very interesting results.
>>> It would help if you could provide the stap scripts, so we can be sure how you are
>>> accounting for the latency.
>> Yes, I plan to put them into a PR soon once I have done a bit more data 
>> collection and polishing of the script setup.
>> 
>>> Also, just to be sure, is latency a synonym for response time?
>> Yep, here I mean the execution time of a single operation.
>> 
>>> Regarding the comparison of the tails for 1 client vs. 16 clients: it looks to me that the
>>> tails are equivalent. The more clients we have, the more long latencies we see.
>>> So as a first approximation I would rule out a contention effect.
>> I disagree: the tail is much more pronounced in the 16 client version, and
>> there is a significant shift of response times from the 32768 bucket to the
>> 65536 bucket, indicating that many operations are now being "held up".
> 
> You are probably right.
> There is a response time shift, but I would expect a major contention effect
> to shift it further and flatten the graph toward higher response times.
> Whatever the answer is, I think the important point is to agree on the method,
> and the reports do show something suspicious.

In a highly concurrent system like DS we shouldn't be seeing tail latency or
spikes; we should be seeing all responses within a really small window,
especially when the searches are all the "same query" for a single uid. So
that's why I'm looking at the spikes first.

> 
>> 
>>> Regarding the ratio of shorter to longer latencies (assuming the searches are
>>> equivalent), it is interesting to know why we see such an effect. One of the
>>> possible causes I was thinking of is the impact of the DB threads (checkpointing
>>> or trickling). But even if a similar long tail exists in ldbm_back, its
>>> absolute value is much lower than that of opshare_search: in short, ldbm_back
>>> accounted for at most 4ms while opshared accounted for 67ms. So there is something
>>> else (aci, network, frontend, ...).
>>> 
>>> Regarding USDT, I think it is a very good idea. However, just to share some
>>> recent stap experience, I found it intrusive: in short, I did not get the same
>>> throughput with and without it. In my case that was not a problem, as I wanted
>>> to investigate a reproducible contention. But if we want
>>> support/users/customers to use it, the performance impact in production will
>>> be a valid point.
>> I haven't noticed any "intrusiveness" from USDT so far? How were you running 
>> the stap scripts?
> I did not add probes in DS. I was just using stap to gather the duration of
> specific functions per operation (plugin, backend, or core).
> 
> I do not recall the level of "intrusiveness", as I was looking more for
> contention data than for performance numbers.

Hmmm okay. I've found stap extremely fast and it doesn't get in the way, so
unless I saw your stap scripts I'm not sure.
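
For context on why USDT should be cheap: the marks compile down to a single nop at the
probe site and are only patched when a stap session actually attaches, so when nothing is
listening there is essentially no cost. As a rough illustration, consuming such a mark from
stap could look like the sketch below; the probe names and the binary path are made up for
the example, not the actual probe points in DS.

  global op_start

  # entry mark: record a start timestamp per thread
  probe process("/usr/sbin/ns-slapd").mark("op_search_entry") {
      op_start[tid()] = gettimeofday_ns()
  }

  # exit mark: report the elapsed time for this thread's operation
  probe process("/usr/sbin/ns-slapd").mark("op_search_return") {
      if (op_start[tid()]) {
          printf("search on tid %d took %d ns\n", tid(), gettimeofday_ns() - op_start[tid()])
          delete op_start[tid()]
      }
  }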

> 
>> 
>>> best regards
>>> thierry
>>> 
>>> 
>>> 
>>> On 12/9/19 6:16 AM, William Brown wrote:
 Hi all,
 
 Following last week's flamegraph runs, I wanted to find more details on
 exactly what was happening. While the flamegraphs highlighted that a change
 existed between single- and multithreaded servers, they did not help to
 isolate where the change was occurring.
 
 To understand this I have started to work on a set of systemtap scripts 
 that we can use to profile 389ds. These will be included in a future PR.
 
 Here are the histograms from an initial test of profiling "do_search":
 
 1 thread
 
 stap test_do_search.stp
 ^CDistribution of latencies (in nanoseconds) for 441148 samples
 max/avg/min: 35911542/85694/38927
     value | count
      8192 |      0
     16384 |      0
     32768 | 167285
     65536 | 239280
    131072 |  25788
    262144 |   8530
    524288 |    252
   1048576 |      6
   2097152 |      1
   4194304 |      3
   8388608 |      0
  16777216 |      2
  33554432 |      1
  67108864 |      0
 134217728 |      0

[389-devel] Re: System tap performance results,

2019-12-11 Thread thierry bordaz



On 12/11/19 1:21 AM, William Brown wrote:



On 10 Dec 2019, at 19:15, thierry bordaz  wrote:

Hi William,

Thanks for these very interesting results.
It would help if you could provide the stap scripts, so we can be sure how you are
accounting for the latency.

Yes, I plan to put them into a PR soon once I have done a bit more data 
collection and polishing of the script setup.


Also, just to be sure, is latency a synonym for response time?

Yep, here I mean the execution time of a single operation.


Regarding the comparison of the tails for 1 client vs. 16 clients: it looks to me that the
tails are equivalent. The more clients we have, the more long latencies we see. So as a
first approximation I would rule out a contention effect.

I disagree: the tail is much more pronounced in the 16 client version, and there is a
significant shift of response times from the 32768 bucket to the 65536 bucket, indicating
that many operations are now being "held up".


You are probably right.
There is a response time shift, but I would expect a major contention
effect to shift it further and flatten the graph toward higher response times.
Whatever the answer is, I think the important point is to agree on the
method, and the reports do show something suspicious.





Regarding the ratio of shorter to longer latencies (assuming the searches are equivalent),
it is interesting to know why we see such an effect. One of the possible causes
I was thinking of is the impact of the DB threads (checkpointing or trickling). But
even if a similar long tail exists in ldbm_back, its absolute value is much lower
than that of opshare_search: in short, ldbm_back accounted for at most 4ms while
opshared accounted for 67ms. So there is something else (aci, network, frontend, ...).

Regarding USDT, I think it is a very good idea. However, just to share some recent
stap experience, I found it intrusive: in short, I did not get the same throughput
with and without it. In my case that was not a problem, as I wanted to investigate a
reproducible contention. But if we want support/users/customers to use it, the
performance impact in production will be a valid point.

I haven't noticed any "intrusiveness" from USDT so far? How were you running 
the stap scripts?
I did not add probes in DS. I was just using stap to gather the duration of
specific functions per operation (plugin, backend, or core).


I do not recall the level of "intrusiveness", as I was looking more for
contention data than for performance numbers.





best regards
thierry



On 12/9/19 6:16 AM, William Brown wrote:

Hi all,

Following last week's flamegraph runs, I wanted to find more details on exactly
what was happening. While the flamegraphs highlighted that a change existed
between single- and multithreaded servers, they did not help to isolate where
the change was occurring.

To understand this I have started to work on a set of systemtap scripts that we 
can use to profile 389ds. These will be included in a future PR.

Here are the histograms from an initial test of profiling "do_search":

1 thread

stap test_do_search.stp
^CDistribution of latencies (in nanoseconds) for 441148 samples
max/avg/min: 35911542/85694/38927
     value | count
      8192 |      0
     16384 |      0
     32768 | 167285
     65536 | 239280
    131072 |  25788
    262144 |   8530
    524288 |    252
   1048576 |      6
   2097152 |      1
   4194304 |      3
   8388608 |      0
  16777216 |      2
  33554432 |      1
  67108864 |      0
 134217728 |      0

16 thread

  stap test_do_search.stp
^CDistribution of latencies (in nanoseconds) for 407806 samples
max/avg/min: 100315928/112407/39368
     value | count
      8192 |      0
     16384 |      0
     32768 | 100281
     65536 | 249656
    131072 |  37837
    262144 |  18322
    524288 |   1171
   1048576 |    203
   2097152 |     90

[389-devel] Re: System tap performance results,

2019-12-10 Thread William Brown


> On 10 Dec 2019, at 19:15, thierry bordaz  wrote:
> 
> Hi William,
> 
> Thanks for these very interesting results.
> It would help if you could provide the stap scripts, so we can be sure how you are
> accounting for the latency.

Yes, I plan to put them into a PR soon once I have done a bit more data 
collection and polishing of the script setup. 

> Also, just to be sure, is latency a synonym for response time?

Yep, here I mean the execution time of a single operation. 

> 
> Regarding the comparison of the tails for 1 client vs. 16 clients: it looks to me that the
> tails are equivalent. The more clients we have, the more long latencies we see.
> So as a first approximation I would rule out a contention effect.

I disagree: the tail is much more pronounced in the 16 client version, and
there is a significant shift of response times from the 32768 bucket to the 65536 bucket,
indicating that many operations are now being "held up".

> 
> Regarding the ratio of shorter to longer latencies (assuming the searches are
> equivalent), it is interesting to know why we see such an effect. One of the
> possible causes I was thinking of is the impact of the DB threads (checkpointing or
> trickling). But even if a similar long tail exists in ldbm_back, its absolute
> value is much lower than that of opshare_search: in short, ldbm_back accounted for
> at most 4ms while opshared accounted for 67ms. So there is something else (aci,
> network, frontend, ...).
> 
> Regarding USDT, I think it is a very good idea. However, just to share some
> recent stap experience, I found it intrusive: in short, I did not get the same
> throughput with and without it. In my case that was not a problem, as I wanted to
> investigate a reproducible contention. But if we want support/users/customers
> to use it, the performance impact in production will be a valid point.

I haven't noticed any "intrusiveness" from USDT so far? How were you running 
the stap scripts? 

> 
> best regards
> thierry
> 
> 
> 
> On 12/9/19 6:16 AM, William Brown wrote:
>> Hi all,
>> 
>> Following last week's flamegraph runs, I wanted to find more details on
>> exactly what was happening. While the flamegraphs highlighted that a change
>> existed between single- and multithreaded servers, they did not help to isolate
>> where the change was occurring.
>> 
>> To understand this I have started to work on a set of systemtap scripts that 
>> we can use to profile 389ds. These will be included in a future PR.
>> 
>> Here are the histograms from an initial test of profiling "do_search":
>> 
>> 1 thread
>> 
>> stap test_do_search.stp
>> ^CDistribution of latencies (in nanoseconds) for 441148 samples
>> max/avg/min: 35911542/85694/38927
>>      value | count
>>       8192 |      0
>>      16384 |      0
>>      32768 | 167285
>>      65536 | 239280
>>     131072 |  25788
>>     262144 |   8530
>>     524288 |    252
>>    1048576 |      6
>>    2097152 |      1
>>    4194304 |      3
>>    8388608 |      0
>>   16777216 |      2
>>   33554432 |      1
>>   67108864 |      0
>>  134217728 |      0
>> 
>> 16 thread
>> 
>>  stap test_do_search.stp
>> ^CDistribution of latencies (in nanoseconds) for 407806 samples
>> max/avg/min: 100315928/112407/39368
>>      value | count
>>       8192 |      0
>>      16384 |      0
>>      32768 | 100281
>>      65536 | 249656
>>     131072 |  37837
>>     262144 |  18322
>>     524288 |   1171
>>    1048576 |    203
>>    2097152 |     90
>>    4194304 |     74
>>    8388608 |     83
>>   16777216 |     58
>>   33554432 |     21
>>   67108864 |     10

[389-devel] Re: System tap performance results,

2019-12-10 Thread William Brown


> On 11 Dec 2019, at 02:20, Howard Chu  wrote:
> 
> 389-devel-requ...@lists.fedoraproject.org wrote:
> 
>> My next steps from here will be:
>> 
>> * To add USDT probes to the logs and back_ldbm to get finer detail
>> about what is causing the excessive latency.
>>  * these probes are also needed to resolve what appears to be a symbol 
>> resolution issue with systemtap when optimisations are enabled.
>> * To add probes in other parts of the code base to get a better visualisation
>> of where time is spent over the course of an operation.
>> * To run a lock contention profile (I was unable to do this today due to 
>> bugs in systemtap)
>> * To document the setup procedures
>> * Commit all these into a PR
>> * Document some actionable improvements we can make to improve the server 
>> performance. 
> 
> mutrace will give you a good audit of lock contention, and it's simple to 
> use, takes no setup.
> 
> http://git.0pointer.net/mutrace.git/

Thanks, that's a good tip, I'll look into that too. Appreciate it,

> 
> -- 
>  -- Howard Chu
>  CTO, Symas Corp.   http://www.symas.com
>  Director, Highland Sun http://highlandsun.com/hyc/
>  Chief Architect, OpenLDAP  http://www.openldap.org/project/

—
Sincerely,

William Brown

Senior Software Engineer, 389 Directory Server
SUSE Labs
___
389-devel mailing list -- 389-devel@lists.fedoraproject.org
To unsubscribe send an email to 389-devel-le...@lists.fedoraproject.org
Fedora Code of Conduct: 
https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: 
https://lists.fedoraproject.org/archives/list/389-devel@lists.fedoraproject.org


[389-devel] Re: System tap performance results,

2019-12-10 Thread Howard Chu
389-devel-requ...@lists.fedoraproject.org wrote:

> My next steps from here will be:
> 
> * To add USDT probes to the logs and back_ldbm to get finer detail
> about what is causing the excessive latency.
>   * these probes are also needed to resolve what appears to be a symbol 
> resolution issue with systemtap when optimisations are enabled.
> * To add probes in other parts of the code base to get a better visualisation
> of where time is spent over the course of an operation.
> * To run a lock contention profile (I was unable to do this today due to bugs 
> in systemtap)
> * To document the setup procedures
> * Commit all these into a PR
> * Document some actionable improvements we can make to improve the server 
> performance. 

mutrace will give you a good audit of lock contention, and it's simple to use, 
takes no setup.

http://git.0pointer.net/mutrace.git/

-- 
  -- Howard Chu
  CTO, Symas Corp.   http://www.symas.com
  Director, Highland Sun http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/


[389-devel] Re: System tap performance results,

2019-12-10 Thread thierry bordaz

Hi William,

Thanks for these very interesting results.
It would help if you could provide the stap scripts, so we can be sure how you
are accounting for the latency.

Also, just to be sure, is latency a synonym for response time?

Regarding the comparison of the tails for 1 client vs. 16 clients: it looks to me that
the tails are equivalent. The more clients we have, the more long latencies we see.
So as a first approximation I would rule out a contention effect.


Regarding the ratio of shorter to longer latencies (assuming the searches are
equivalent), it is interesting to know why we see such an effect. One of
the possible causes I was thinking of is the impact of the DB threads
(checkpointing or trickling). But even if a similar long tail exists in
ldbm_back, its absolute value is much lower than that of opshare_search: in
short, ldbm_back accounted for at most 4ms while opshared accounted for 67ms. So
there is something else (aci, network, frontend, ...).


Regarding USDT, I think it is a very good idea. However, just to share some
recent stap experience, I found it intrusive: in short, I did not get the
same throughput with and without it. In my case that was not a problem, as I
wanted to investigate a reproducible contention. But if we want
support/users/customers to use it, the performance impact in production
will be a valid point.


best regards
thierry



On 12/9/19 6:16 AM, William Brown wrote:

Hi all,

Following last week's flamegraph runs, I wanted to find more details on exactly
what was happening. While the flamegraphs highlighted that a change existed
between single- and multithreaded servers, they did not help to isolate where
the change was occurring.

To understand this I have started to work on a set of systemtap scripts that we 
can use to profile 389ds. These will be included in a future PR.

Here are the histograms from an initial test of profiling "do_search":

1 thread

stap test_do_search.stp
^CDistribution of latencies (in nanoseconds) for 441148 samples
max/avg/min: 35911542/85694/38927
     value | count
      8192 |      0
     16384 |      0
     32768 | 167285
     65536 | 239280
    131072 |  25788
    262144 |   8530
    524288 |    252
   1048576 |      6
   2097152 |      1
   4194304 |      3
   8388608 |      0
  16777216 |      2
  33554432 |      1
  67108864 |      0
 134217728 |      0

16 thread

  stap test_do_search.stp
^CDistribution of latencies (in nanoseconds) for 407806 samples
max/avg/min: 100315928/112407/39368
     value | count
      8192 |      0
     16384 |      0
     32768 | 100281
     65536 | 249656
    131072 |  37837
    262144 |  18322
    524288 |   1171
   1048576 |    203
   2097152 |     90
   4194304 |     74
   8388608 |     83
  16777216 |     58
  33554432 |     21
  67108864 |     10
 134217728 |      0
 268435456 |      0


It's interesting to note the tail latency here: in the 16 thread version we see about 67000
fewer operations in the 32768 bucket, shifting mostly into the 131072 and 262144 buckets, as
well as a much greater number of calls in the tail. In the 1 thread run no operation made it
to 67108864, but 10 did in the 16 thread run, along with ~200 more that are higher than 1048576,
and ~1500 more that are greater than 524288. This kind of tailing means we have
"spikes" of latency throughout the execution, which then have a minor flow-on effect,
causing other operations' latency to increase.
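
For anyone who wants to try this kind of measurement before the scripts land in the PR, a
minimal sketch of the approach is below. It assumes the server binary is at
/usr/sbin/ns-slapd (adjust for your install, and note this is not the exact script that
produced the numbers above):

  global op_start, latencies

  # record a start time per thread when do_search is entered
  probe process("/usr/sbin/ns-slapd").function("do_search") {
      op_start[tid()] = gettimeofday_ns()
  }

  # on return, add the elapsed time to the aggregate
  probe process("/usr/sbin/ns-slapd").function("do_search").return {
      if (op_start[tid()]) {
          latencies <<< gettimeofday_ns() - op_start[tid()]
          delete op_start[tid()]
      }
  }

  # on ^C, print a log2 histogram like the ones quoted above
  probe end {
      printf("Distribution of latencies (in nanoseconds) for %d samples\n", @count(latencies))
      printf("max/avg/min: %d/%d/%d\n", @max(latencies), @avg(latencies), @min(latencies))
      print(@hist_log(latencies))
  }

Run it while the load is applied (e.g. "stap test_do_search.stp"), then stop it with ^C to
print the histogram.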

These are all in