> On 10 May 2019, at 23:59, Rich Megginson <rmegg...@redhat.com> wrote:
> 
> On 5/9/19 9:13 PM, William Brown wrote:
>> Hi all,
>> 
>> So I think it's time for me to write some logging code to improve the 
>> situation. Relevant links before we start:
>> 
>> https://pagure.io/389-ds-base/issue/49415
>> http://www.port389.org/docs/389ds/design/logging-performance-improvement.html
>> https://pagure.io/389-ds-base/issue/50350
>> https://pagure.io/389-ds-base/issue/50361
>> 
>> 
>> All of these links touch on issues around logging, and I think they all 
>> combine to create three important points:
>> 
>> * The performance of logging should be improved
>> * The amount of detail (fine-grained information) in logs should improve
>> * The structure of the log content should be improved to aid interaction 
>> (possibly even machine-parsable)
>> 
>> I will turn this into a design document, but there are some questions I 
>> would like input on as part of this process, to help set the direction and 
>> the tasks to achieve.
>> 
>> -- Should our logs as they exist today, continue to exist?
>> 
>> I think that my view on this is "no". I think if we make something better, 
>> we have little need to continue to support our legacy interfaces. Of course, 
>> this would be a large change and it may not sit comfortably with people.
>> 
>> A large part of this thinking is that the "new" log interface I want to add 
>> is focused on *operations* rather than auditing accesses or changes, or 
>> logging errors. The information of the current access/audit/error logs 
>> would largely be melded into a single operation log, and then with tools 
>> like logconv we could parse and extract information that would behave the 
>> same way as access/error/audit.
>> 
>> At the same time - I can see how people *may* want a "realtime" audit of 
>> operations as they occur (i.e. the access log), but even today this is 
>> limited by having to "wait" for operations to complete.
>> 
>> In a crash scenario, we would still be able to view the logs that are 
>> queued, so I think there are not many concerns about losing information 
>> in these cases (in fact, we'd probably have more).
>> 
>> -- What should the operation log look like?
>> 
>> I think it should be structured, and should be whole units of information 
>> related to a single operation, i.e. it is only logged at the conclusion of 
>> the operation (thus the async!). It should support arbitrary, nested 
>> timers, and would *not* support log levels - it's a detailed log of the 
>> process each query goes through.
>> 
>> An example could be something like:
>> 
>> [timestamp] - [conn=id op=id] - start operation
>> [timestamp] - [conn=id op=id] - start time = time ...
>> [timestamp] - [conn=id op=id] - started internal search '(some=filter)'
>> [timestamp] - [conn=id op=id parentop=id] - start nested operation
>> [timestamp] - [conn=id op=id parentop=id] - start time = time ...
>> ...
>> [timestamp] - [conn=id op=id parentop=id] - end time = time...
>> [timestamp] - [conn=id op=id parentop=id] - duration = diff end - start
>> [timestamp] - [conn=id op=id parentop=id] - end nested operation - result -> 
>> ...
>> [timestamp] - [conn=id op=id] - ended internal search '(some=filter)'
>> ...
>> [timestamp] - [conn=id op=id] - end time = time
>> [timestamp] - [conn=id op=id] - duration = diff end - start
>> 
>> 
>> Due to the structured, blocked nature, there would be no interleaving of 
>> operation messages; therefore the log would appear as:
>> 
>> [timestamp] - [conn=00 op=00] - start operation
>> [timestamp] - [conn=00 op=00] - start time = time ...
>> [timestamp] - [conn=00 op=00] - started internal search '(some=filter)'
>> [timestamp] - [conn=00 op=00 parentop=01] - start nested operation
>> [timestamp] - [conn=00 op=00 parentop=01] - start time = time ...
>> ...
>> [timestamp] - [conn=00 op=00 parentop=01] - end time = time...
>> [timestamp] - [conn=00 op=00 parentop=01] - duration = diff end - start
>> [timestamp] - [conn=00 op=00 parentop=01] - end nested operation - result -> 
>> ...
>> [timestamp] - [conn=00 op=00] - ended internal search '(some=filter)'
>> ...
>> [timestamp] - [conn=00 op=00] - end time = time
>> [timestamp] - [conn=00 op=00] - duration = diff end - start
>> [timestamp] - [conn=22 op=00] - start operation
>> [timestamp] - [conn=22 op=00] - start time = time ...
>> [timestamp] - [conn=22 op=00] - started internal search '(some=filter)'
>> [timestamp] - [conn=22 op=00 parentop=01] - start nested operation
>> [timestamp] - [conn=22 op=00 parentop=01] - start time = time ...
>> ...
>> [timestamp] - [conn=22 op=00 parentop=01] - end time = time...
>> [timestamp] - [conn=22 op=00 parentop=01] - duration = diff end - start
>> [timestamp] - [conn=22 op=00 parentop=01] - end nested operation - result -> 
>> ...
>> [timestamp] - [conn=22 op=00] - ended internal search '(some=filter)'
>> ...
>> [timestamp] - [conn=22 op=00] - end time = time
>> [timestamp] - [conn=22 op=00] - duration = diff end - start
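A minimal sketch in Python of the whole-unit buffering idea - every name here (OperationLog, event, etc.) is purely illustrative, not a proposed API. Each operation collects its lines privately and flushes them as one block at the end, which is what rules out interleaving between connections:

```python
import time

class OperationLog:
    """Buffer every event for one operation; emit a single block at the end."""

    def __init__(self, conn_id, op_id, parent_op=None):
        self.conn_id = conn_id
        self.op_id = op_id
        self.parent_op = parent_op
        self.lines = []
        self.start = None

    def _prefix(self):
        ids = f"conn={self.conn_id} op={self.op_id}"
        if self.parent_op is not None:
            ids += f" parentop={self.parent_op}"
        return f"[{time.time():.6f}] - [{ids}]"

    def event(self, msg):
        self.lines.append(f"{self._prefix()} - {msg}")

    def __enter__(self):
        self.start = time.monotonic()
        self.event("start operation")
        return self

    def __exit__(self, *exc):
        duration = time.monotonic() - self.start
        self.event(f"end operation - duration = {duration:.6f}s")
        # One write per completed operation: no interleaving between threads.
        print("\n".join(self.lines))
        return False

with OperationLog("00", "00") as op:
    op.event("started internal search '(some=filter)'")
    op.event("ended internal search '(some=filter)'")
```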
>> 
>> An alternate method for structuring could be a machine-readable format like 
>> JSON:
>> 
>> {
>>     'timestamp': 'time',
>>     'duration': ....,
>>     'bind': 'dn of who initiated operation',
>>     'events': [
>>         'debug': 'msg',
>>         'internal_search': {
>>              'timestamp': 'time',
>>              'duration': ....,
>>         }
>>     ],
>> }
>> 
>> This would make writing tools like logconv much easier, as we wouldn't need 
>> regexes or other ad-hoc parsing to understand these logs.
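To make the point concrete: with single-line JSON records, a logconv-style tool is just `json.loads` plus dictionary access. The record below is a hand-made sample of roughly the shape above, not real server output:

```python
import json

# Hypothetical single-line JSON operation record (sample data only).
line = ('{"timestamp": "2019-05-10T12:00:00Z", "duration": 0.003, '
        '"bind": "cn=dm", "events": [{"debug": "msg"}]}')

record = json.loads(line)          # the parser does all the work - no regex
slow = record["duration"] > 0.001  # e.g. flag operations over 1ms
```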
> 
> +1
> 
> I would just like developers to keep in mind a couple of points
> 
> - traceability - the ability to trace operations through all of the various 
> parts of the server, and eventually even system calls/kernel, for use in 
> debugging and performance improvements (identify bottlenecks, etc.) - how to 
> make it easy to trace operations using the log files
> 
> - json - json also makes it easy to consume these logs - just about all 
> log/event collectors use a single-line json format

For certain - I have another project I work on which has JSON logs of almost 
this format, and I have never needed gdb once, due to the high level of tracing 
and detail - it's basically an analysis of all branches and server state, as 
well as timing and performance data. It's been invaluable, and I'd love to see 
that in 389 too. 

> 
> 
>> 
>> 
>> -- Should our log interface from slapi_log_err etc change?
>> 
>> This is a super tricky question ... I think the answer is yes - we are 
>> using log_err for messages that should be associated with an operation - 
>> and additionally, we often disable or lower the level of error messages due 
>> to the low performance. It would be hard to shift behaviour and 
>> expectations around this log, so a better idea would be to fix things 
>> ourselves by adjusting how we log in plugins. We'll probably always need to 
>> support log_err, but those developers would not gain the benefit of our 
>> improvements.
>> 
>> We could translate log_err to emit simple text messages into the events 
>> section of the log, but for full benefit, using the operation log interfaces 
>> would be best.
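A rough sketch of what such a translation could look like - every name here (log_err_shim, the severity/component arguments) is hypothetical, intended only to show legacy message-style calls landing in the operation's events list:

```python
# Hypothetical shim: route a legacy printf-style log_err call into the
# current operation's event list, so old call sites still appear in the
# new structured log as simple text events.
def log_err_shim(op_events, severity, component, fmt, *args):
    op_events.append({"debug": f"{severity} {component}: {fmt % args}"})

events = []
log_err_shim(events, "ERR", "be_txn", "failed to lock %s", "id2entry")
```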
>> 
>> -- Should we still improve the performance of logs as they exist today?
>> 
>> Probably yes - but I think there is a good reason to approach this 
>> differently.
>> 
>> I think we should build log_async.c and log_operation.c in parallel to the 
>> existing logging, and have a similar feature-gating flag in use like we 
>> have done for other features. We can then build this in parallel and gain 
>> confidence and experience with it. Once complete, we could then change the 
>> log_access, log_audit and log_err macros/functions to call related log_op 
>> calls, or rewrite those logs to have log_async.c execution backends.
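In principle, the async execution backend is just a queue plus a dedicated writer thread: worker threads enqueue completed operation records and never block on disk I/O. A minimal illustrative sketch (real code would also handle buffering, rotation, and fsync policy):

```python
import queue
import threading

log_q = queue.Queue()
written = []  # stand-in for the log file

def writer():
    # Single writer thread drains the queue and does all the I/O.
    while True:
        rec = log_q.get()
        if rec is None:      # shutdown sentinel
            break
        written.append(rec)  # real code: write/flush to the log file

t = threading.Thread(target=writer)
t.start()

# The operation path only enqueues - it never waits on disk.
log_q.put('{"conn": 0, "op": 0, "duration": 0.002}')
log_q.put(None)              # flush and stop at shutdown
t.join()
```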
>> 
>> 
>> 
>> SUMMARY:
>> 
>> * Make slapi_log_op and an async log system in parallel to existing log infra
>> * Determine a deprecation or improvement plan for other log types
>> * Port log types to async, or remove them.
>> 
>> 
>> 
>> --
>> Sincerely,
>> 
>> William
>> _______________________________________________
>> 389-devel mailing list -- 389-devel@lists.fedoraproject.org
>> To unsubscribe send an email to 389-devel-le...@lists.fedoraproject.org
>> Fedora Code of Conduct: https://getfedora.org/code-of-conduct.html
>> List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
>> List Archives: 
>> https://lists.fedoraproject.org/archives/list/389-devel@lists.fedoraproject.org
> 

—
Sincerely,

William Brown

Senior Software Engineer, 389 Directory Server
SUSE Labs
