> On 10 May 2019, at 23:59, Rich Megginson <rmegg...@redhat.com> wrote: > > On 5/9/19 9:13 PM, William Brown wrote: >> Hi all, >> >> So I think it's time for me to write some logging code to improve the >> situation. Relevant links before we start: >> >> https://pagure.io/389-ds-base/issue/49415 >> http://www.port389.org/docs/389ds/design/logging-performance-improvement.html >> https://pagure.io/389-ds-base/issue/50350 >> https://pagure.io/389-ds-base/issue/50361 >> >> >> All of these links touch on issues around logging, and I think they all >> combine to create three important points: >> >> * The performance of logging should be improved >> * The amount of details (fine grain) and information in logs should improve >> * The structure of the log content should be improved to aid interaction >> (possibly even machine parsable) >> >> I will turn this into a design document, but there are some questions I >> would like some input to help answer as part of this process to help set the >> direction and tasks to achieve. >> >> -- Should our logs as they exist today, continue to exist? >> >> I think that my view on this is "no". I think if we make something better, >> we have little need to continue to support our legacy interfaces. Of course, >> this would be a large change and it may not sit comfortably with people. >> >> A large part of this thinking is that the "new" log interface I want to add >> is focused on *operations* rather than auditing accesses or changes, or over >> looking at errors. The information of both the current access/audit/error >> would largely be melded into a single operation log, and then with tools >> like logconv, we >> could parse and extract information that would behave the same way as >> access/error/audit. >> >> At the same time - I can see how people *may* want a "realtime" audit of >> operations as they occur (IE access log), but this still today is already >> limited by having to "wait" for operations to complete. >> >> In a crash scenario, we would be able to still view the logs that are >> queued, so I think there are not so many concerns about losing information >> in these cases (in fact we'd probably have more). >> >> -- What should the operation log look like? >> >> I think it should be structured, and should be whole-units of information, >> related to a single operation. IE only at the conclusion of the operation is >> it logged (thus the async!). It should support arbitrary, nested timers, and >> would *not* support log levels - it's a detailed log of the process each >> query goes through. >> >> An example could be something like: >> >> [timestamp] - [conn=id op=id] - start operation >> [timestamp] - [conn=id op=id] - start time = time ... >> [timestamp] - [conn=id op=id] - started internal search '(some=filter)' >> [timestamp] - [conn=id op=id parentop=id] - start nested operation >> [timestamp] - [conn=id op=id parentop=id] - start time = time ... >> ... >> [timestamp] - [conn=id op=id parentop=id] - end time = time... >> [timestamp] - [conn=id op=id parentop=id] - duration = diff end - start >> [timestamp] - [conn=id op=id parentop=id] - end nested operation - result -> >> ... >> [timestamp] - [conn=id op=id] - ended internal search '(some=filter)' >> ... >> [timestamp] - [conn=id op=id] - end time = time >> [timestamp] - [conn=id op=id] - duration = diff end - start >> >> >> Due to the structured - blocked nature, there would be no interleaving of >> operation messages. therefor the log would appear as: >> >> [timestamp] - [conn=00 op=00] - start operation >> [timestamp] - [conn=00 op=00] - start time = time ... >> [timestamp] - [conn=00 op=00] - started internal search '(some=filter)' >> [timestamp] - [conn=00 op=00 parentop=01] - start nested operation >> [timestamp] - [conn=00 op=00 parentop=01] - start time = time ... >> ... >> [timestamp] - [conn=00 op=00 parentop=01] - end time = time... >> [timestamp] - [conn=00 op=00 parentop=01] - duration = diff end - start >> [timestamp] - [conn=00 op=00 parentop=01] - end nested operation - result -> >> ... >> [timestamp] - [conn=00 op=00] - ended internal search '(some=filter)' >> ... >> [timestamp] - [conn=00 op=00] - end time = time >> [timestamp] - [conn=00 op=00] - duration = diff end - start >> [timestamp] - [conn=22 op=00] - start operation >> [timestamp] - [conn=22 op=00] - start time = time ... >> [timestamp] - [conn=22 op=00] - started internal search '(some=filter)' >> [timestamp] - [conn=22 op=00 parentop=01] - start nested operation >> [timestamp] - [conn=22 op=00 parentop=01] - start time = time ... >> ... >> [timestamp] - [conn=22 op=00 parentop=01] - end time = time... >> [timestamp] - [conn=22 op=00 parentop=01] - duration = diff end - start >> [timestamp] - [conn=22 op=00 parentop=01] - end nested operation - result -> >> ... >> [timestamp] - [conn=22 op=00] - ended internal search '(some=filter)' >> ... >> [timestamp] - [conn=22 op=00] - end time = time >> [timestamp] - [conn=22 op=00] - duration = diff end - start >> >> An alternate method for structuring could be a machine readable format like >> json: >> >> { >> 'timestamp': 'time', >> 'duration': ...., >> 'bind': 'dn of who initiated operation', >> 'events': [ >> 'debug': 'msg', >> 'internal_search': { >> 'timestamp': 'time', >> 'duration': ...., >> } >> ], >> } >> >> This would make writing tools like logconv much easier, as we wouldn't need >> regex or others to parse and understand these logs. > > +1 > > I would just like developers to keep in mind a couple of points > > - traceability - the ability to trace operations through all of the various > parts of the server, and eventually even system calls/kernel, for use in > debugging and performance improvements (identify bottlenecks, etc.) - how to > make it easy to trace operations using the log files > > - json - json also makes it easy to consume these logs - just about all > log/event collectors use a single-line json format
For certain - I have another project I work on which has json logs of almost this format, and I have never needed gdb once due to the high level of tracing and detail - it's basically an analysis of all branches and server state, as well as timing and performance data. It's been infinitely valuable, and I'd love to see that in 389 too. > > >> >> >> -- Should our log interface from slapi_log_err etc change? >> >> This is a super tricky question ... I think the answer is yes - we are using >> log_err for messages that should be associated to an operation - and >> additionally, we often disable or lower the level of error messages due to >> the low performance. It would be hard to shift behaviour and expectations >> around the behaviour of this log, so a better idea would be to fix things >> oourselves by adjusting how we log in plugins. We'll probably always need to >> support log_err, but then those developers would not gain the benefit of our >> improvements. >> >> We could translate log_err to emit simple text messages into the events >> section of the log, but for full benefit, using the operation log interfaces >> would be best. >> >> -- Should we still improve the performance of logs as they exist today? >> >> Probably yes - but I think there is a good reason to approach this >> differently. >> >> I think we should be log_async.c and log_operation.c in parallel to existing >> logging, and have a similer feature gating flag in use like we have done for >> other features. We can then build this in parallel and gain confidence and >> experience with it. Once complete, we could then change the log_access, >> log_audit and log_err macros/functions to call related log_op calls, or >> rewrite those logs to have log_async.c execution backends. >> >> >> >> SUMMARY: >> >> * Make slapi_log_op and an async log system in parallel to existing log infra >> * Determine a deprecation or improvement plan for other log types >> * Port log types to async, or remove them. >> >> >> >> -- >> Sincerely, >> >> William >> _______________________________________________ >> 389-devel mailing list -- 389-devel@lists.fedoraproject.org >> To unsubscribe send an email to 389-devel-le...@lists.fedoraproject.org >> Fedora Code of Conduct: https://getfedora.org/code-of-conduct.html >> List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines >> List Archives: >> https://lists.fedoraproject.org/archives/list/389-devel@lists.fedoraproject.org > > _______________________________________________ > 389-devel mailing list -- 389-devel@lists.fedoraproject.org > To unsubscribe send an email to 389-devel-le...@lists.fedoraproject.org > Fedora Code of Conduct: https://getfedora.org/code-of-conduct.html > List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines > List Archives: > https://lists.fedoraproject.org/archives/list/389-devel@lists.fedoraproject.org — Sincerely, William Brown Senior Software Engineer, 389 Directory Server SUSE Labs _______________________________________________ 389-devel mailing list -- 389-devel@lists.fedoraproject.org To unsubscribe send an email to 389-devel-le...@lists.fedoraproject.org Fedora Code of Conduct: https://getfedora.org/code-of-conduct.html List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines List Archives: https://lists.fedoraproject.org/archives/list/389-devel@lists.fedoraproject.org