[389-users] [@all] Request for sanitised access log data

William Brown Wed, 14 Apr 2021 22:03:02 -0700

Hi everyone,

At the moment I have been helping a student with their higher education thesis. 
As part of this we need to understand realistic work loads from 389-ds servers 
in production.


To assist, I would like to ask if anyone is able to or willing to volunteer to 
submit sanitised content of their access logs to us for this. We can 
potentially also use these for 389-ds for benchmarking and simulation in the 
future.

An example of the sanitised log output is below. All DN's are replaced with 
UUID's that are randomly generated so that no data can be reversed from the 
content of the access log. The script uses a combination of filter and basedn 
uniqueness, and nentries to make a virtual tree, and then substitute in extra 
data as required. All rtimes are relative times of "when" the event occured 
relative to the start of the log, so we also do not see information about the 
time of accesses.

This data will likely be used in a public setting, so assume that it will be 
released if provided. Of course I encourage you to review the content of the 
sanitisation script, and the sanitised output so that you are comfortable to 
run this tool. It's worth noting the tool will likely use a lot of RAM, so you 
should NOT run it on your production server - rather you should copy the 
production access log to another machine and process it there.

1 hour to 24 hours of processed output from a production server would help a 
lot!

Please send the output as a response to this mail, or directly to me ( wbrown 
at suse dot de )

Thanks,

[
   {
       "etime": "0.005077600",
       "ids": [
           "9b207c6e-f7a2-4cd8-984a-415ad5e0960f"
       ],
       "rtime": "0:00:36.219942",
       "type": "add"
   },
   {
       "etime": "0.000433300",
       "ids": [
           "9b207c6e-f7a2-4cd8-984a-415ad5e0960f"
       ],
       "rtime": "0:00:36.225207",
       "type": "bind"
   },
   {
       "etime": "0.000893100",
       "ids": [
           "9b207c6e-f7a2-4cd8-984a-415ad5e0960f",
           "eb2139a1-a0f3-41cf-bdbe-d213a75c6bb7"
       ],
       "rtime": "0:00:40.165807",
       "type": "srch"
   },
]

USAGE:
python3 access_pattern_extract.py /path/to/log/access /path/to/output.json

#!/usr/bin/python3
import sys
import json
import re
import uuid
import random
from datetime import datetime

# 'time': { },
dirty_events = []

# This is how we track event state as we process.
ev_state = {}

MONTH_LOOKUP = {
    'Jan': 1,
    'Feb': 2,
    'Mar': 3,
    'Apr': 4,
    'May': 5,
    'Jun': 6,
    'Jul': 7,
    'Aug': 8,
    'Oct': 9,
    'Sep': 10,
    'Nov': 11,
    'Dec': 12,
}

events = []
dnmap = {}

prog_m1 = re.compile(r'^(?P<timestamp>\[.*\])\sconn=(?P<conn>\d*)\sop=(?P<op>\d*)\s(?P<action>\w*)\s(?P<rem>.*)\n?')
prog_timestamp = re.compile(r'\[(?P<day>\d*)\/(?P<month>\w*)\/(?P<year>\d*):(?P<hour>\d*):(?P<minute>\d*):(?P<second>\d*)(.(?P<nanosecond>\d*))+\s(?P<tz>[\+\-]\d*)')
prog_basedn = re.compile(r'base="(?P<dn>[^"]*)"\sscope=(?P<scope>\d)\sfilter="(?P<filt>[^"]*)".*')
prog_etime = re.compile(r'.*nentries=(?P<nentries>\d*).*etime=(?P<etime>\d*\.\d*).*')
prog_dn = re.compile(r'dn="(?P<dn>[^"]*)"')


def parse_timestamp(ts):
    timedata = prog_timestamp.match(ts).groupdict()
    # Now, have to convert month to an int.
    dt_str = '{YEAR}-{MONTH}-{DAY} {HOUR}-{MINUTE}-{SECOND}.{MS:06} {TZ}'.format(
        YEAR=timedata['year'],
        MONTH=MONTH_LOOKUP[timedata['month']],
        DAY=timedata['day'],
        HOUR=timedata['hour'],
        MINUTE=timedata['minute'],
        SECOND=timedata['second'],
        TZ=timedata['tz'],
        MS=int(int(timedata['nanosecond']) / 1000),
        )
    dt = datetime.strptime(dt_str, "%Y-%m-%d %H-%M-%S.%f %z")
    return dt


def process_event(estate, rem):
    # Get the time to something relative
    dt = parse_timestamp(estate['time'])

    #   srch? what dn and scope? (0 base, 1 one, 2 sub)
    if estate['type'] == 'srch':
        m = prog_basedn.match(estate['data'])
        # from estate.data extract base.
        dn = m.groupdict()['dn']
        if dn == '' or dn.endswith('cn=config') or dn.endswith('cn=monitor') or dn.endswith('cn=schema'):
            # nope out
            return

        filt = m.groupdict()['filt']
        scope = m.groupdict()['scope']

        m = prog_etime.match(rem)
        etime = m.groupdict()['etime']
        nentries = m.groupdict()['nentries']

        dirty_events.append({
            'time': dt,
            'type': estate['type'],
            'dn': dn,
            'filt': filt,
            'scope': scope,
            'etime': etime,
            'nentries': int(nentries),
        })
    else:
        m = prog_dn.match(estate['data'])
        # from estate.data extract base.
        dn = m.groupdict()['dn']
        if dn.endswith('cn=config') or dn.endswith('cn=monitor') or dn.endswith('cn=schema'):
            # nope out
            return

        m = prog_etime.match(rem)
        etime = m.groupdict()['etime']
        # nentries = m.groupdict()['nentries']

        dirty_events.append({
            'time': dt,
            'type': estate['type'],
            'dn': dn,
            'etime': etime,
            'nentries': 1,
        })


def process_line(line):
    result = prog_m1.match(line)
    if result:
        gd = result.groupdict()

        if gd['action'] == 'SRCH':
            ev_state[(gd['conn'], gd['op'])] = {
                'time': gd['timestamp'],
                'type': 'srch',
                'data': gd['rem']
            }
        elif gd['action'] == 'BIND':
            ev_state[(gd['conn'], gd['op'])] = {
                'time': gd['timestamp'],
                'type': 'bind',
                'data': gd['rem']
            }
        elif gd['action'] == 'DEL':
            ev_state[(gd['conn'], gd['op'])] = {
                'time': gd['timestamp'],
                'type': 'del',
                'data': gd['rem']
            }
        elif gd['action'] == 'MOD':
            ev_state[(gd['conn'], gd['op'])] = {
                'time': gd['timestamp'],
                'type': 'mod',
                'data': gd['rem']
            }
        elif gd['action'] == 'ADD':
            ev_state[(gd['conn'], gd['op'])] = {
                'time': gd['timestamp'],
                'type': 'add',
                'data': gd['rem']
            }
        elif gd['action'] == 'RESULT':
            estate = ev_state.get((gd['conn'], gd['op']), None)
            if estate is not None:
                # =====================================
                # Now we have enough info to process this.
                # =====================================
                process_event(estate, gd['rem'])

def remap(dn):
    x = dnmap.get(dn, None)
    if not x:
        x = str(uuid.uuid4())
        dnmap[dn] = x
    return x


def clean_events():
    # Generated Stuff
    generated = {}
    known_suffixes = {}
    created_dns = set()
    c_known_dns = set()
    # First build our db view estimate.
    dns = []
    for ev in dirty_events:
        dns.append(ev['dn'])

    # All the unique dns we know about.
    dns = sorted(set(dns))
    # build a kinda tree.
    for dn in dns:
        sub = [ x for x in dns if x.endswith(dn)]
        known_suffixes[dn] = sub
        generated[dn] = []

    c_known_dns = set(known_suffixes.keys())

    # Find the lowest timestamp.
    ltime = dirty_events[0]['time']
    for ev in dirty_events:
        if ev['time'] < ltime:
            ltime = ev['time']
    # ltime is now the minimum event.

    for ev in dirty_events:
        reltime = ev['time'] - ltime

        accessed_dns = []
        if ev['type'] == 'srch':
            known_dns = known_suffixes[ev['dn']]
            supplemental_dns = []
            if len(known_dns) < ev['nentries']:
                missing = ev['nentries'] - len(known_dns)
                if len(generated[ev['dn']]) < missing:
                    # Generate some extra stuff.
                    a = [ str(uuid.uuid4()) for i in range(0, missing - len(generated[ev['dn']]))]
                    c_known_dns.union(set(a))
                    generated[ev['dn']] = generated[ev['dn']] + a
                supplemental_dns = random.sample(generated[ev['dn']], missing)

            accessed_dns = known_dns + supplemental_dns
        elif ev['type'] == 'add':
            accessed_dns = [ev['dn']]
            created_dns.add(ev['dn'])
        else:
            accessed_dns = [ev['dn']]

        accessed_ids = [ remap(x) for x in accessed_dns ]

        events.append({
            'rtime': str(reltime),
            'etime': ev['etime'],
            'type': ev['type'],
            'ids': accessed_ids,
        })

    precreate = [
        remap(x) for x in list(c_known_dns - created_dns)
    ]
    events.insert(0, {
            'rtime': str(ltime - ltime),
            'etime': 0,
            'type': 'precreate',
            'ids': precreate,
    })


def do_main(access_file, out_file):
    sys.stdout.write(f"Extracting access pattern data from '{access_file}' to '{out_file}' . This may take a while ")
    count = 0
    with open(out_file, 'w') as outfile:
        with open(access_file, 'r') as logfile:
            for line in logfile:
                count = count + 1
                if (count % 500) == 0:
                    sys.stdout.write(".")
                process_line(line)

        sys.stdout.write("â\n")
        print(f"Complete: Processed {count} lines into {len(events)} events")
        # Process dirty events to clean ones.
        clean_events()
        json.dump(events, outfile, sort_keys=True, indent=4)


if __name__ == '__main__':
    if len(sys.argv) < 3:
        print(f"{sys.argv[0]} /path/to/log/access /path/to/output.json")
        sys.exit(1)

    do_main(sys.argv[1], sys.argv[2])



—
Sincerely,

William Brown

Senior Software Engineer, 389 Directory Server
SUSE Labs, Australia

_______________________________________________
389-users mailing list -- 389-users@lists.fedoraproject.org
To unsubscribe send an email to 389-users-le...@lists.fedoraproject.org
Fedora Code of Conduct: 
https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: 
https://lists.fedoraproject.org/archives/list/389-users@lists.fedoraproject.org
Do not reply to spam on the list, report it: 
https://pagure.io/fedora-infrastructure

[389-users] [@all] Request for sanitised access log data

Reply via email to