[389-devel] Re: Please review new Replication Diff Tool

2017-04-12 Thread Mark Reynolds


On 04/12/2017 06:58 PM, William Brown wrote:
> Hey mate, 
>
> The tool looks great, awesome work on this. Really impressive that you
> got it to 70 minutes for 10 million entries.
And in reality it's 20 million entries (master + replica).
>
> How responsive is the server during this process? We aren't going to
> cause some odd resource exhaustion? 
Not really  :)  We don't have to rely on server-side sorting, and it's
just a paged results search, so it breaks up the load (slightly).  It's
still expensive because it returns every entry, but I didn't see any
extreme CPU usage.
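
For anyone unfamiliar with paged searches, here is a minimal sketch of the cookie loop, not the tool's actual code. The client asks for one page, gets back a cookie, and repeats until the cookie comes back empty. In python-ldap the cookie travels in a SimplePagedResultsControl; here fetch_page and make_fake_server are hypothetical stand-ins so the loop runs without a directory server.

```python
def iter_pages(fetch_page):
    """Yield entries one page at a time.

    fetch_page(cookie) is a hypothetical callable returning
    (entries, next_cookie); an empty cookie means "no more pages".
    """
    cookie = ''
    while True:
        entries, cookie = fetch_page(cookie)
        for entry in entries:
            yield entry
        if not cookie:
            break


def make_fake_server(entries, page_size):
    """In-memory stand-in for a server; the cookie is just the next offset."""
    def fetch_page(cookie):
        start = int(cookie) if cookie else 0
        end = start + page_size
        next_cookie = str(end) if end < len(entries) else ''
        return entries[start:end], next_cookie
    return fetch_page


dns = ['uid=user%d,dc=example,dc=com' % i for i in range(5)]
paged = list(iter_pages(make_fake_server(dns, page_size=2)))
print(len(paged))
```

Only one page of results is held at a time per request, which is why the load is spread out even though every entry is still returned in the end.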
>
> import optparse
>
> In Python, optparse is deprecated. Can we use argparse instead? It's
> nearly identical. There are lots of examples of this in dsctl.
Easy, no problem.
>
> When connecting to replicas, some sites may only have ldaps (provided by a
> load balancer). So our scripts should really take an LDAP URL, a
> certdir, and a starttls flag, because ldaps://localhost + certdir is a
> valid option, but if we force a call to start_tls_s(), we break it.
Good idea
> As well,
> someone may use ldapi:// etc. It also saves on port options and more
> flags to the cli because we can do ldap://localhost:30389 etc. 
>
> Hope that helps, I'll be happy to review again later!
>
> For now, I think our strategy with this should be to add it to
> 389-ds-base, and later we can move this into lib389 when we can. How
> does that sound? 
I think it should always be a standalone tool, but we can tie it in with
lib389 (move the main guts out of the tool and into lib389)
>
>
>
> ___
> 389-devel mailing list -- 389-devel@lists.fedoraproject.org
> To unsubscribe send an email to 389-devel-le...@lists.fedoraproject.org



[389-devel] Re: Please review new Replication Diff Tool

2017-04-12 Thread William Brown

Hey mate, 

The tool looks great, awesome work on this. Really impressive that you
got it to 70 minutes for 10 million entries.

How responsive is the server during this process? We aren't going to
cause some odd resource exhaustion? 

import optparse

In Python, optparse is deprecated. Can we use argparse instead? It's
nearly identical. There are lots of examples of this in dsctl.
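
A rough sketch of what the switch could look like, covering only a subset of the options and not meant as the final code. One thing to watch: argparse claims -h for help by default, so keeping -h for the master host means passing add_help=False and re-adding help under --help. The build_parser name, the dest names, and the port 636 in the sample call are just illustrative assumptions.

```python
import argparse


def build_parser():
    # add_help=False frees up -h, which the tool uses for the master host.
    parser = argparse.ArgumentParser(prog='repl-diff.py', add_help=False)
    parser.add_argument('--help', action='help',
                        help='Show this help message and exit')
    parser.add_argument('-v', '--verbose', action='store_true',
                        help='Verbose output')
    parser.add_argument('-D', '--binddn', help='The Bind DN')
    parser.add_argument('-w', '--bindpw', help='The Bind password')
    parser.add_argument('-h', '--master_host', dest='mhost',
                        default='localhost', help='The Master host')
    parser.add_argument('-p', '--master_port', dest='mport', type=int,
                        default=389, help='The Master port')
    parser.add_argument('-H', '--replica_host', dest='rhost',
                        help='The Replica host')
    parser.add_argument('-P', '--replica_port', dest='rport', type=int,
                        help='The Replica port')
    parser.add_argument('-b', '--basedn', dest='suffix',
                        help='Replicated suffix')
    parser.add_argument('-l', '--lagtime', dest='lag', type=int, default=300,
                        help='The amount of time to ignore inconsistencies')
    return parser


# 636 is an arbitrary example port, not a value from the original post.
args = build_parser().parse_args(
    ['-D', 'cn=directory manager', '-w', 'PASSWORD',
     '-H', 'remotehost', '-P', '636', '-b', 'dc=example,dc=com'])
print(args.mhost, args.mport, args.rhost, args.rport, args.lag)
```

Unlike optparse, argparse can also convert types (type=int above), so the port and lag values arrive as integers.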

When connecting to replicas, some sites may only have ldaps (provided by a
load balancer). So our scripts should really take an LDAP URL, a
certdir, and a starttls flag, because ldaps://localhost + certdir is a
valid option, but if we force a call to start_tls_s(), we break it. Someone
may also use ldapi:// etc. It also saves on port options and extra
flags in the CLI, because we can do ldap://localhost:30389 etc.
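
To make that concrete, here is a small sketch of the decision logic, using only the standard library. The connection_plan helper and its return shape are hypothetical; with python-ldap the URL would go straight to ldap.initialize(), and start_tls_s() would only be called when the plan says so. Silently skipping StartTLS for ldaps:// and ldapi:// is one possible choice, raising an error would be another.

```python
from urllib.parse import urlparse

# Conventional default ports; ldapi is a local socket and has none.
DEFAULT_PORTS = {'ldap': 389, 'ldaps': 636, 'ldapi': None}


def connection_plan(url, starttls=False):
    """Work out host, port and whether to call StartTLS for an LDAP URL."""
    parsed = urlparse(url)
    scheme = parsed.scheme.lower()
    if scheme not in DEFAULT_PORTS:
        raise ValueError('unsupported scheme: %r' % scheme)
    return {
        'scheme': scheme,
        'host': parsed.hostname,
        'port': parsed.port or DEFAULT_PORTS[scheme],
        # StartTLS upgrades a plain ldap:// connection; ldaps:// is
        # already encrypted, so it must not be attempted there.
        'call_start_tls': starttls and scheme == 'ldap',
    }


print(connection_plan('ldap://localhost:30389', starttls=True))
print(connection_plan('ldaps://localhost', starttls=True))
```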

Hope that helps, I'll be happy to review again later!

For now, I think our strategy with this should be to add it to
389-ds-base, and later we can move this into lib389 when we can. How
does that sound? 

-- 
Sincerely,

William Brown
Software Engineer
Red Hat, Australia/Brisbane





[389-devel] Please review new Replication Diff Tool

2017-04-12 Thread Mark Reynolds
Hello,

This is a beta version of a replication diff tool written in Python.

Design page (this needs updating - I hope to get that done tonight)

http://www.port389.org/docs/389ds/design/repl-diff-tool-design.html


Current usage:

  -v, --verbose         Verbose output
  -o FILE, --outfile=FILE
                        The output file
  -D BINDDN, --binddn=BINDDN
                        The Bind DN (REQUIRED)
  -w BINDPW, --bindpw=BINDPW
                        The Bind password (REQUIRED)
  -h MHOST, --master_host=MHOST
                        The Master host (default localhost)
  -p MPORT, --master_port=MPORT
                        The Master port (default 389)
  -H RHOST, --replica_host=RHOST
                        The Replica host (REQUIRED)
  -P RPORT, --replica_port=RPORT
                        The Replica port (REQUIRED)
  -b SUFFIX, --basedn=SUFFIX
                        Replicated suffix (REQUIRED)
  -l LAG, --lagtime=LAG
                        The amount of time to ignore inconsistencies (default
                        300 seconds)
  -Z CERTDIR, --certdir=CERTDIR
                        The certificate database directory for startTLS
                        connections
  -i IGNORE, --ignore=IGNORE
                        Comma separated list of attributes to ignore
  -M MLDIF, --mldif=MLDIF
                        Master LDIF file (offline mode)
  -R RLDIF, --rldif=RLDIF
                        Replica LDIF file (offline mode)

Examples:

  python repl-diff.py -D "cn=directory manager" -w PASSWORD -h localhost -p 389 -H remotehost -P  -b "dc=example,dc=com"

  python repl-diff.py -D "cn=directory manager" -w PASSWORD -h localhost -p 389 -H remotehost -P  -b "dc=example,dc=com" -Z /etc/dirsrv/slapd-localhost

  python repl-diff.py -M /tmp/master.ldif -R /tmp/replica.ldif
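
A note on how -l/--lagtime can be thought of, as a sketch under assumptions rather than the tool's exact logic. The attached script's extract_time() reads the first 8 hex digits of a CSN as seconds since the epoch; a difference whose newest CSN is younger than the lag window may simply not have replicated yet, so it is ignored. The within_lag name, the single combined regex, and the sample state string below are illustrative only.

```python
import re
import time

# One pattern covering the four CSN types the script looks for.
CSN_PATTERN = re.compile(r';(?:vucsn|vdcsn|mdcsn|adcsn)-([A-Fa-f0-9]+)')


def newest_csn_time(stateinfo):
    """Return the most recent CSN timestamp (epoch seconds), or 0 if none."""
    times = [int(csn[:8], 16) for csn in CSN_PATTERN.findall(stateinfo)]
    return max(times) if times else 0


def within_lag(stateinfo, lag=300, now=None):
    """True if the newest change is recent enough to be ignored for now."""
    if now is None:
        now = time.time()
    return now - newest_csn_time(stateinfo) < lag


# Hypothetical state info; 0x58ee9a5d is 1492032093 seconds since the epoch.
state = 'description;vucsn-58ee9a5d000000010000: test'
print(newest_csn_time(state))
print(within_lag(state, lag=300, now=1492032093 + 100))
```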


How long the tool takes to run depends on the number of entries per
database.  See the performance numbers below:

Entries per Replica    Time
-------------------    -----------
100k                   40 seconds
500k                   3m 30secs
1 million              7m 30secs
2 million              14 minutes
10 million             ~70 minutes



I'd be very interested in feedback, RFEs, and bugs.

Thanks,
Mark
# --- BEGIN COPYRIGHT BLOCK ---
# Copyright (C) 2017 Red Hat, Inc.
# All rights reserved.
#
# License: GPL (version 3 or any later version).
# See LICENSE for details.
# --- END COPYRIGHT BLOCK ---
#
import re
import time
import ldap
import optparse
from ldap.ldapobject import SimpleLDAPObject
from ldap.cidict import cidict
from ldap.controls import SimplePagedResultsControl

VERSION = "1.0"
RUV_FILTER = '(&(nsuniqueid=---)(objectclass=nstombstone))'
vucsn_pattern = re.compile(';vucsn-([A-Fa-f0-9]+)')
vdcsn_pattern = re.compile(';vdcsn-([A-Fa-f0-9]+)')
mdcsn_pattern = re.compile(';mdcsn-([A-Fa-f0-9]+)')
adcsn_pattern = re.compile(';adcsn-([A-Fa-f0-9]+)')


class Entry(object):
    ''' This is a stripped down version of Entry from python-lib389.
    Once python-lib389 is released on RHEL this class will go away.
    '''
    def __init__(self, entrydata):
        if entrydata:
            self.dn = entrydata[0]
            self.data = cidict(entrydata[1])

    def __getitem__(self, name):
        return self.__getattr__(name)

    def __getattr__(self, name):
        if name == 'dn' or name == 'data':
            return self.__dict__.get(name, None)
        # NOTE: getValue() is not defined on this stripped down class
        return self.getValue(name)


def get_entry(entries, dn):
    ''' Loop over entries looking for a matching dn
    '''
    for entry in entries:
        if entry.dn == dn:
            return entry
    return None


def remove_entry(rentries, dn):
    ''' Remove an entry from the array of entries
    '''
    for entry in rentries:
        if entry.dn == dn:
            rentries.remove(entry)
            break


def extract_time(stateinfo):
    ''' Take the nscpEntryWSI attribute and get the most recent timestamp from
    one of the csns (vucsn, vdcsn, mdcsn, adcsn)

    Return the timestamp in decimal
    '''
    timestamp = 0
    for pattern in [vucsn_pattern, vdcsn_pattern, mdcsn_pattern, adcsn_pattern]:
        csntime = pattern.search(stateinfo)
        if csntime:
            hextime = csntime.group(1)[:8]
            dectime = int(hextime, 16)
            if dectime > timestamp:
                timestamp = dectime

    return timestamp


def convert_timestamp(timestamp):
    ''' Convert createtimestamp to ctime: 20170405184656Z -> Wed Apr  5 19:46:56 2017
    '''
    time_tuple = (int(timestamp[:4]), int(timestamp[4:6]), int(timestamp[6:8]),
                  int(timestamp[8:10]), int(timestamp[10:12]), int(timestamp[12:14]),
                  0, 0, 0)
    secs = time.mktime(time_tuple)
    return time.ctime(secs)


def convert_entries(entries):
    '''Convert and normalize the ldap entries
    '''
    new_entries = []
    for entry in entries:
        new_entry = Entry(entry)
        new_entry.data = {k.lower(): v