On 10/24/18 6:02 AM, William Brown wrote:

On 24 Oct 2018, at 17:56, thierry bordaz <tbor...@redhat.com> wrote:



On 10/21/2018 03:55 AM, William Brown wrote:
Thanks for this write up Thierry,

On 19 Oct 2018, at 17:15, thierry bordaz <tbor...@redhat.com> wrote:

Hi,

C10K is a scalability problem that a server can face when dealing with events 
of thousands of connections (i.e. clients) at the same time. Events can be new 
connections, new operations on the established connections, closure of 
connection (from client or server)
For 389-ds, C10K problem was resolved with a new framework Nunc-Stans [1]. 
Nunc-stans was first enabled in RHDS 7.4 and improved/fixed in 7.5. Robustness 
issues [2] and [3] were reported in 7.5 and it was decided to disable 
Nunc-stans. It is not known if those issues exist or not in 7.4.

William posted a PR to fix those two issues [4]. Nunc-stans is a complex 
framework, with its own dynamic. Review of this PR is not easy and even a 
careful review may not guaranty it will fix [2] and [3] and may not introduce 
others unexpected side effects.

 From there we discussed two options (but there may be others):

        • Review and merge the PR [4], then later run some intensive tests 
aiming to verify [2],[3] and checking the robustness in order to reenable NS
I think this is the best solution. If NS is “not working” today, then merging 
the PR won’t make it “not work” any less ;)

Given we have a reliable way to disabled it at build time, I think that 
merging, testing, and then eventually discussion and decision of enabling by 
default is the best plan.
Hi William,

The existing set of tests reveal a single failure[1]. That shows the PR is 
really promising but at the same time that tests are crucial with NS. Like the 
two robustness bugs (conn leak and deadlock) only happening is some specific 
conditions.

I think review on large NS PR is bit formal step and the most important are 
tests/measure. We may agree to push the NS patches base on existing tests 
results.  New tests/measure (for benefits and [2] and [3]) on top of the 
current PR could be the option. Let's wait for additional feedback
Sure. I will work on Viktor's comments and found issues with paged search, and 
I’ll commit some more time to running the tests if I can.

The reason I’m pushing to “just merge and test/fix later” is that at the moment 
I have very little time, and I don’t want to see my work bit-rot over time, and 
get further and further away. It’s going to make future fixes and improvements 
to the space easier once it’s merged.

Saying that I think it would be good to have a deadline or a list of people we 
want comments from. Mark, Simon, Ludwig, Viktor, yourself and myself come to 
mind … we need to know what our approach is, and this ties back to the “feature 
gating” discussion that I hope you saw. We need a way to have features we are 
working on, so we can test and reason about them, but without letting them out 
in production - and having ways to roll the back.
Coming in late to the discussion...  So, in this case with nunc-stans(the new PR) I think we should do some basic sanity tests on it (which are occurring right now, and one issue was found which you are already aware of).  Once these basic sanity tests pass then we should merge it.  After that we can do the more extensive testing on it from master branch, but I don't want to commit something that is completely broken (even though the current state of NS isn't great).  Anyway, I think that's a fair compromise :-)

Thanks,


[1] https://pagure.io/389-ds-base/pull-request/49636#comment-66941

[2] https://bugzilla.redhat.com/show_bug.cgi?id=1608746 deadlock
[3] https://bugzilla.redhat.com/show_bug.cgi?id=1605554 connection leaks

best regards
thierry

        • Build some tests for
                • measure the benefit of NS as [2] and [3] do not prevent some 
performance tests
                • identify possible reproducers for [2] and [3]
                • create robustness and long duration NS specific tests
                • review and merge the PR [4]
As PR [4] is not intended for perf improvement, the step 2.1 will impact the 
priority according to the performance benefits.
I think that the test plan is good. 2 and 3 will be hard to reproduce I think, 
they seem like they require complex state and interactions. The goal of the PR 
I have open is to reduce the complexity of NS integration into DS, so that 
tests for situations like 2 and 3 can be more easily created (and hopefully the 
simpler integration is going to resolve some issues).

The “benefit” of NS is hard to say - because the other parts of the server core 
are still needing threading improvement, NS may help some aspects (connection 
acceptance performance for example), we won’t see all the of the benefits until 
we improve other parts. There are many bottlenecks to fix! An example is lib 
globs high use of atomics (should be COW), and the plugin architecture.

At a theoretical level, NS will be faster as we don’t need to lock and poll 
over the connection table, but today the CT may not be the bottleneck, so NS 
may not make this “faster” just by enabling.

Comments are welcomed

Regarding 2.1 plan we made the following notes for the test plan:
The benefit of Nunc-Stans can only be measure with a large number of 
connections (i.e. client) above 1000. That means a set of clients (sometime 
all) should keep their connection opened. Clients should run on several hosts 
so that clients are not the bootleneck.

For the two types of events (new connection and new operations), the 
measurement could be

        • Event: New connections
                • Start all clients in parallel to establish connections 
(keeping them opened) take the duration to get 1000, 2000, ... 10000 
connections and check there are drop or not
                • Establish 1000 connections and monitor during to open 100 
more, the same starting with 2000, 10000
                • Client should not run any operations during the monitoring
        • Event: New operations
                • Start all clients and when 1000 connections are established, launch 
simple operations (e.g. search -s base -b "" objectclass) and monitor how many 
of them can be handled. The same with 2000, ... 10000.
                • response time and workqueue length could be monitored to be 
sure the bottleneck are not the worker.
At one point I started (but did not finish) ldclt integration with lib389. It 
would be great to have a series of stress and acceptance tests like these 
available in lib389 for our release validations, so I believe that their 
creation is valuable regardless of NS.

[1] http://www.port389.org/docs/389ds/design/nunc-stans.html

[2] https://bugzilla.redhat.com/show_bug.cgi?id=1608746 deadlock
[3] https://bugzilla.redhat.com/show_bug.cgi?id=1605554 connection leaks

[4] https://pagure.io/389-ds-base/pull-request/49636

_______________________________________________
389-devel mailing list -- 389-devel@lists.fedoraproject.org
To unsubscribe send an email to 389-devel-le...@lists.fedoraproject.org
Fedora Code of Conduct: https://getfedora.org/code-of-conduct.html
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: 
https://lists.fedoraproject.org/archives/list/389-devel@lists.fedoraproject.org
—
Sincerely,

William


_______________________________________________
389-devel mailing list -- 389-devel@lists.fedoraproject.org
To unsubscribe send an email to 389-devel-le...@lists.fedoraproject.org
Fedora Code of Conduct: https://getfedora.org/code-of-conduct.html
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: 
https://lists.fedoraproject.org/archives/list/389-devel@lists.fedoraproject.org
_______________________________________________
389-devel mailing list -- 389-devel@lists.fedoraproject.org
To unsubscribe send an email to 389-devel-le...@lists.fedoraproject.org
Fedora Code of Conduct: https://getfedora.org/code-of-conduct.html
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: 
https://lists.fedoraproject.org/archives/list/389-devel@lists.fedoraproject.org
—
Sincerely,

William


_______________________________________________
389-devel mailing list -- 389-devel@lists.fedoraproject.org
To unsubscribe send an email to 389-devel-le...@lists.fedoraproject.org
Fedora Code of Conduct: https://getfedora.org/code-of-conduct.html
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: 
https://lists.fedoraproject.org/archives/list/389-devel@lists.fedoraproject.org
_______________________________________________
389-devel mailing list -- 389-devel@lists.fedoraproject.org
To unsubscribe send an email to 389-devel-le...@lists.fedoraproject.org
Fedora Code of Conduct: https://getfedora.org/code-of-conduct.html
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: 
https://lists.fedoraproject.org/archives/list/389-devel@lists.fedoraproject.org

Reply via email to