[389-users] Re: Determining max CSN of running server

2024-03-01 Thread William Faulk
One problem with reinitializing that replica is that since it's successfully 
receiving changes from everywhere else and not sending its changes outward, 
it's the only one that has the most up-to-date data.

For what it's worth, the topology is that at each of my PoPs, I have a pair of 
replicas that are replicating with each other, and each of the pair is 
replicating with one of the pair at the neighbor PoPs. The PoP topology is 
basically a ring of 9 PoPs, call them A through I. Then there are another two 
PoPs that connect A and E. Then there are leaf PoPs that hang off of B, C, H, 
and I.

If that's not clear, let me know and I can draw a diagram.

-- 
William Faulk
--
___
389-users mailing list -- 389-users@lists.fedoraproject.org
To unsubscribe send an email to 389-users-le...@lists.fedoraproject.org
Fedora Code of Conduct: 
https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: 
https://lists.fedoraproject.org/archives/list/389-users@lists.fedoraproject.org
Do not reply to spam, report it: 
https://pagure.io/fedora-infrastructure/new_issue


[389-users] Re: Determining max CSN of running server

2024-03-01 Thread William Faulk
It's on a VM.

I don't have enough archived logs to show the progression of the serial number. 
However, I do have a text dump of the cldb, and I can filter it down to just 
the CSNs, and then to just the CSNs originated on this replica. The busiest 
timestamp has 752 CSNs, and, of the 3323 unique timestamps, only 13 have more 
than 100 CSNs, only 267 have 10 or more, and 1299 have just a single change.
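
The per-timestamp tally described above can be sketched roughly as follows. This is a minimal sketch, assuming the usual 20-hex-digit CSN layout (8 digits of timestamp, 4 of sequence number, 4 of replica ID, 4 of sub-sequence number). The helper names, the sample CSNs, and the replica IDs 4 and 7 are made up for illustration; they are not taken from the real changelog dump.

```python
from collections import Counter


def make_csn(ts, seq, rid, sub=0):
    """Build a CSN string from its parts (used only to create sample data)."""
    return f"{ts:08x}{seq:04x}{rid:04x}{sub:04x}"


def split_csn(csn):
    """Split a 20-hex-digit CSN into (timestamp, seqnum, replica_id, subseq)."""
    csn = csn.strip().lower()
    if len(csn) != 20:
        raise ValueError(f"unexpected CSN length: {csn!r}")
    return (int(csn[0:8], 16), int(csn[8:12], 16),
            int(csn[12:16], 16), int(csn[16:20], 16))


def csns_per_timestamp(csns, replica_id):
    """Count how many CSNs originated by replica_id share each timestamp."""
    counts = Counter()
    for csn in csns:
        ts, _seq, rid, _sub = split_csn(csn)
        if rid == replica_id:
            counts[ts] += 1
    return counts


def summarize(counts):
    """Reduce the per-timestamp counts to the figures quoted in the message."""
    values = list(counts.values())
    return {
        "unique_timestamps": len(values),
        "max_per_timestamp": max(values, default=0),
        "over_100": sum(1 for n in values if n > 100),
        "ten_or_more": sum(1 for n in values if n >= 10),
        "single": sum(1 for n in values if n == 1),
    }


# Hypothetical sample: three changes from replica 4 in one second, one in the
# next second, and one change from replica 7 that gets filtered out.
sample = [
    make_csn(0x65E1F000, 0, 4),
    make_csn(0x65E1F000, 1, 4),
    make_csn(0x65E1F000, 2, 4),
    make_csn(0x65E1F001, 0, 4),
    make_csn(0x65E1F000, 0, 7),
]
print(summarize(csns_per_timestamp(sample, 4)))
```

Feeding it the real list of CSNs (one per line from the filtered dump) should reproduce the counts above, if the layout assumption holds.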

Here's the list, if you really want to look: https://pastebin.com/muegmwzV

I can't come up with a rationale for the numbers, honestly. They should just 
start at zero for each unique timestamp, right?

> IIUC the consumer is currently catching up. Is the RUV, on the consumer, 
> evolving ?

Based on the one set of debug logs, yes, but I'm not sure if that's an anomaly 
or not. I haven't been able to see it move since then, but I'm keeping an eye 
on it.

> Do you have fractional replication ?

Yes. This is actually part of an IdM/FreeIPA installation, so the regular 
things that are stripped out there:

nsds5ReplicaStripAttrs: modifiersName modifyTimestamp internalModifiersName 
internalModifyTimestamp
nsDS5ReplicatedAttributeList: (objectclass=*) $ EXCLUDE memberof idnssoaserial 
entryusn krblastsuccessfulauth krblastfailedauth krbloginfailedcount
nsDS5ReplicatedAttributeListTotal: (objectclass=*) $ EXCLUDE entryusn 
krblastsuccessfulauth krblastfailedauth krbloginfailedcount

-- 
William Faulk


[389-users] Re: Determining max CSN of running server

2024-02-29 Thread William Faulk
> FYI: There is a list of pending operations to ensure that the RUV is not
> updated while an older operation is not yet completed. And I suspect that
> you hit a bug about this list. I remember that we fixed something in that
> area a few years ago ...

I think I found it, or something closely related.

https://github.com/389ds/389-ds-base/pull/4553


[389-users] Re: Determining max CSN of running server

2024-02-29 Thread William Faulk
Thanks, Pierre and Thierry.

After quite some time of poring over these debug logs, I've found some 
anomalies and they seem like they're matching up with the idea that the 
affected replica isn't updating its own RUV correctly.

The logs show a change being made, and it lists the CSN of the change. The 
first anomalies are here, but they probably aren't terribly significant. The 
CSN includes a timestamp, and the timestamp on this CSN is 11 hours into the 
future from when the change was made and logged. Also, the next part of the CSN 
is supposed to be a serial number for when there are changes made during the 
same second of the timestamp. In the case I was looking at, that serial was 
0xb231. I'm certain that this replica didn't record another 45000 changes in 
that second.
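
For anyone who wants to pick a CSN apart the same way, here is a minimal sketch. It assumes the usual 20-hex-digit layout (8 digits of Unix timestamp, 4 of sequence number, 4 of replica ID, 4 of sub-sequence number). The example CSN is hypothetical: only the serial 0xb231 and the 11-hour offset come from the logs described above, and the replica ID and logged time are invented.

```python
from datetime import datetime, timezone


def decode_csn(csn):
    """Decode a 20-hex-digit CSN into its four fields."""
    csn = csn.strip().lower()
    if len(csn) != 20:
        raise ValueError(f"unexpected CSN length: {csn!r}")
    return {
        "time": datetime.fromtimestamp(int(csn[0:8], 16), tz=timezone.utc),
        "seqnum": int(csn[8:12], 16),
        "replica_id": int(csn[12:16], 16),
        "subseqnum": int(csn[16:20], 16),
    }


def hours_ahead(csn, logged_at):
    """Hours by which the CSN timestamp leads the time the change was logged."""
    delta = decode_csn(csn)["time"] - logged_at
    return delta.total_seconds() / 3600


# Hypothetical log time and replica ID (4); the CSN is built so that its
# timestamp is 11 hours ahead and its serial is 0xb231.
logged = datetime(2024, 2, 29, 12, 0, 0, tzinfo=timezone.utc)
csn_time = int(logged.timestamp()) + 11 * 3600
example = f"{csn_time:08x}b23100040000"

info = decode_csn(example)
print(info["seqnum"])                # 45617
print(hours_ahead(example, logged))  # 11.0
```

The serial 0xb231 is 45617 in decimal, which is where the "another 45000 changes" figure comes from.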

Then it shows the server committing the change to the changelog. It shows it 
"processing data" for over 16000 other CSNs, and it takes about 25 seconds to 
complete.

It then starts a replication session with the peer and prints out the peer's 
(consumer's) RUV and then its own (supplier's) RUV. The RUV it prints out for 
itself shows the maxCSN for itself with a timestamp from almost 4 months ago. 
It is greater than the maxCSN for itself in the consumer's RUV, though, by a 
little. (The replicagenerations are equal, though.)

It then claims to send 7 changes, all of which are skipped because "empty". It 
then claims that there are "No more updates to send" and releases the consumer 
and eventually closes the connection.

I like the idea that there's a list of pending operations that's blocking RUV 
updates. Is there any way for me to examine this list? That said, I do think it 
updated its own maxCSN in its own RUV by a few hours. The peer I'm looking at 
does seem to reflect the increased maxCSN for the bad replica in the RUV I can 
see in the "mapping tree". I've tried to reproduce this small update, but 
haven't been able to yet.

I also have another replica that seems to be experiencing the same problem, and 
I've restarted it with no improvement in symptoms. It might be different, 
though. It doesn't look like it discarded its changelog.

I definitely don't relish reinitializing from this bad replica, though. I'd 
have to perform a rolling reinitialization throughout our whole environment, 
and it takes ages and a lot of effort.

-- 
William Faulk


[389-users] Re: Determining max CSN of running server

2024-02-28 Thread William Faulk
> Might be worth re-reading

Well, I still don't really know the details of the replication process.

I have deduced that changes originated on a replica seem to prompt that replica 
to start a replication process with its peers, but I don't really know what 
happens then. There's a comparison of the RUVs of the two replicas, but does 
the initiating system send its RUV to the receiver, or does it go the other 
way, or do both happen? Does the comparison prompt the comparing system to send 
the changes it thinks the other system needs, or does it cause the comparing 
system to request new changes from the other? Maybe none of this really makes 
much difference, but the lack of technical detail around this makes me just 
question everything.

> It doesn't send a single CSN, the replication compares the RUVs and
> determines the range of CSNs that are missing from the consumer.

Sure, but notionally any changes that originated on that replica would be 
reflected in the max CSN for itself in the RUV that is used to compare. And at 
least one side is sending its RUV to the other during the replication process.

> It's also not immediate. Between the server accepting a change (add, mod
> etc), the change is associated to a CSN. But then there may be a delay
> before the two nodes actually communicate and exchange data.

Sure, but the changes originated on this replica haven't made it to other 
replicas in weeks. This isn't a mere delay in replication.

> Generally you'd need replication logging (errorloglevel 8192). But it's very
> noisy and can be hard to read. What you need to see is the ranges that they
> agree to send.

Okay. I've done that and haven't had a chance to pore through them yet.

> Also remember CSN's are a monotonic lamport clock. This means they only ever
> advance and can never step backwards. So they have some different properties
> to what you may expect. If they ever go backwards I think the replication
> handler throws a pretty nasty error.

I don't think it's going backwards. What I'm trying to rule out is that the 
replica is failing to advance its max CSN in the RUV being used to compare.

> I *think* so. It's been a while since I had to look. The nsds50ruv shows the
> ruv of the server, and I think the other replica entries are "what the peers
> ruv was last time".

Well, it's nice to hear that my guess at least isn't asinine. :)

> replication monitoring code in newer versions does this for you, so I'd
> probably advise you attempt to upgrade your environment. 1.3 is really old
> at this point

I've been trying to get the current environment stable enough that I feel 
comfortable going through the relatively lengthy upgrade process. I think I'm 
going to have to adjust my comfort level.

> I'm not sure if even RH or SUSE still support that version anymore).

Red Hat does, as it's what's in RHEL7.9, which is supported for another, uh, 4 
months. They're working on this with me. I'm still just trying to understand 
the system better so that I can try to be productive while I'm waiting on them 
to come up with ideas.

> The problem here is that to read the RUV's and then compare them, you need to
> read each RUV from each server and then check if they are advancing (not that
> they are equal).

The problem is that the changes in my environment are few enough that all the 
replicas' RUVs _are_ equal the majority of the time. I'm not in front of that 
system as I respond right now, so my details might be wrong, but I'm asking 
about all of this because every RUV I see in all of the replicas is the same, 
and it shows a max CSN for this one replica that's much older than the CSNs I 
see it reference in the logs about changes originating on the replica. The CSNs 
I see in the logs when a new change is made are referencing the current time in 
them, while the max CSN I see in the RUVs is from 4 months ago.

Maybe it *did* go backwards somehow and that's why it's not working. Not that 
that would really help me understand what actually went wrong any better than I 
do now.

> If you want to assert that "Some change I made at CSN X is on all servers"
> then you would need to read and parse the ruv and ensure that all of them are
> at or past that CSN for that replica id.

Well, you'd think so. I've got that problem, too, where some CSNs just seem to 
get missed, but the max CSN in the RUV is well past that. But that's a 
different problem and not the one I'm working on now.
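
The "at or past that CSN" check from the quoted advice could be sketched like this. It is a sketch under assumptions: each server's RUV has already been fetched and reduced to a mapping of replica ID to max CSN, and CSNs are fixed-width hex strings, so plain string comparison orders them by timestamp first. The server names and CSN values are hypothetical. As noted above, passing this check does not prove that every earlier CSN was actually applied.

```python
def lagging_servers(ruvs, replica_id, csn):
    """Return servers whose max CSN for replica_id is missing or older than csn.

    ruvs maps a server name to {replica_id: max_csn_string}.
    """
    target = csn.lower()
    behind = []
    for server, maxcsns in sorted(ruvs.items()):
        seen = maxcsns.get(replica_id)
        # Fixed-width lowercase hex strings sort in the same order as the
        # (timestamp, seqnum, replica id, subseqnum) tuples they encode.
        if seen is None or seen.lower() < target:
            behind.append(server)
    return behind


# Hypothetical data: replica 4 originated change X; one peer has caught up,
# one is still months behind, and one has no element for replica 4 at all.
csn_x = "65e1f000000100040000"
ruvs = {
    "pop-a1": {4: "65e1f000000200040000"},
    "pop-b1": {4: "653f0000000000040000"},
    "pop-c1": {},
}
print(lagging_servers(ruvs, 4, csn_x))
```

An empty result means every listed server reports a max CSN for that replica ID at or past X.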

Thanks for the input.

-- 
William Faulk

[389-users] Determining max CSN of running server

2024-02-28 Thread William Faulk
I'm having another replication problem where changes made on a particular 
server are not being replicated outward at all. Right now, I'm trying to 
determine what's going on during the replication process.

(Caveat: I'm still running an old version of 389ds: v1.3.10. In particular, the 
dsconf utility does not exist.)

My understanding is that when a server receives a change from a client, it 
wraps it up as a CSN and starts a replication session with its peers, during 
which it sends a message that states the greatest CSN that it originated. First 
off, is that a correct understanding?

If so, how can I determine what CSN a particular server is telling its 
replication peers during those sessions? I have a feeling that this server is, 
for some reason, sending an inaccurate number.

In the cn=replica,cn=...,cn=mapping tree,cn=config tree, there are entries for 
each of the server's topology peers, and they contain nsds50ruv attributes that 
seem to be the RUVs that the server has received from those peers, right? But 
the nsds50ruv attribute also exists directly in the cn=replica entry if you 
explicitly ask for it. Is it possible that this is the server's own RUV?

Can I rely on the nsds50ruv values in the cn=replica entries on this server's 
peers to be an accurate reflection of what this server is sending as its CSN in 
replication sessions?

Any other way to see what's going on in a replication session? (I'm even trying 
to decrypt a network capture, but I'm not having any luck with that yet.)

In particular, the max CSN I see for this server in all of these RUVs is less 
than the CSNs recorded in the server's own log files.

-- 
William Faulk


[389-users] Re: Solving naming conflicts in replicated environment

2024-01-18 Thread William Faulk
I completed this last night. I found that deleting the active entry did not 
automatically promote the conflict entry. I still had to perform the modrdn 
operation.

In addition to deleting the "nsds5ReplConflict" attribute, I also manually 
deleted the "ConflictCSN" attribute, and the "ldapsubentry" value from the 
"objectclass" attribute.

And it didn't magically get added to the groups that the formerly active entry 
and the same entry on the other IdM replicas were in. I had to add it to them 
manually, using IdM utilities, on the replica where this change took place. (I 
actually only had to add it to one group; the other memberships were based on 
that one group, so adding it to that group added it to the others.)

After that, though, the entry on this server matched the entries on the other 
replicas except for "entryusn", "entryid", and "modifyTimestamp", which I 
believe are all normal variances.

Thanks for your help.

By the way, Red Hat support spent four days failing to even understand the 
question that you answered for me in half an hour: that deleting the active 
entry here wouldn't delete it on the other replicas. I asked them three or four 
times, each time getting a response that either explained to me how to delete 
the conflict entry, or failing to address the idea that it might delete the 
entry on the replicas, until I was finally told that it was impossible to 
promote the conflict entry, despite the documentation providing a procedure 
exactly for that, and that I would have to reinitialize the data on that 
replica.

If anyone has any suggestions for a vendor that can provide decent IdM support, 
I'd love to hear it.

Again, many thanks to everyone here.


[389-users] Re: Solving naming conflicts in replicated environment

2024-01-12 Thread William Faulk
I was prepping to make this change and realized there's a part of the 
documentation I don't understand.

It says to delete the active entry, then perform a modrdn on the conflict 
entry, then delete the old RDN value of the naming attribute.

That last step can't be correct in this case, right? The naming attribute isn't 
changing.

Their actual example is:

# ldapmodify -D "cn=Directory Manager" -W -p 389 -h server.example.com -x
dn: nsuniqueid=66446001-1dd211b2+uid=adamss,dc=example,dc=com
changetype: modrdn
newrdn: uid=NewValue
deleteoldrdn: 0

# ldapmodify -D "cn=Directory Manager" -W -p 389 -h server.example.com -x
dn: uid=NewValue,dc=example,dc=com
changetype: modify
delete: uid
uid: adamss
-
delete: nsds5ReplConflict
-

But if you're trying to promote the conflict entry to replace the bad active 
entry, the naming attribute value isn't changing. That is, the "NewValue" in 
their example is the same as the old value: "adamss". Surely following these 
directions naively is going to result in deleting the naming attribute 
altogether. Unless maybe the schema prevents it from deleting the last value?

Am I correct in thinking I should just skip that part, while continuing to 
delete the nsds5ReplConflict attribute?


[389-users] Re: Solving naming conflicts in replicated environment

2024-01-12 Thread William Faulk
Thanks for the confirmation.

I'll follow up with the results, just in case anyone in the future comes across 
this thread, and to let folks know how the membership gets handled upon rename 
of the conflict entry.


[389-users] Re: Solving naming conflicts in replicated environment

2024-01-11 Thread William Faulk
Sorry. I did confirm that the nsuniqueid of the bad replica's active entry is 
different from the other replicas' entries and I forgot to say that. (The 
conflict entry's nsuniqueid and the entries on the good replicas match, too.) 
Here are the entries, with names and crypto stuff redacted, but everything else 
verbatim:

good: https://pastebin.com/N2AZNXAH
bad: https://pastebin.com/MMMzqwN3

My concern is that the access logs seem to contradict what Pierre said: that 
replicated deletes are basing the delete on the nsuniqueid. If I can get a 
confirmation that the logs are lying to me, that's fine. I just want to be 
doubly sure.

That said, I then have a concern about the group memberships on the conflict 
entry once it's renamed. I can't imagine that it will acquire the correct 
groups just by being renamed. Am I going to just need to fix that up manually? 
(That may be outside the scope of this mailing list.)


[389-users] Re: Solving naming conflicts in replicated environment

2024-01-11 Thread William Faulk
Oh, that's surprising to me.

The LDAP spec seems to indicate that the only possible argument for a delete 
operation is a DN, and, while I still can't reproduce the problem with 
unimportant entries, access logs on replicas where deletes are being replicated 
to seem to imply that the remote server is just requesting a normal delete 
operation specifying the DN, and the access logs don't seem to show any sort of 
search to determine the DN from the nsuniqueid (or anything else).

So, and I'm sorry to say this, but: Are you sure? Keep in mind that I'm running 
an old version of 389-ds: v1.3.11, I think. Maybe the replication protocol is 
handled in such a way that access logs are showing an action that is ultimately 
what's happening, even if it's not exactly how the request was actually made?

(I genuinely do appreciate the input.)


[389-users] Solving naming conflicts in replicated environment

2024-01-11 Thread William Faulk
I have an IdM/FreeIPA installation with around 30 replicas. I have an entry for 
a computer that exists across all of those replicas. However, one of the 
replicas has incorrect data in the DN, with the correct data found in a 
conflict entry. (It appears that that entry was created on that replica, 
somehow didn't get replicated anywhere else, and then the entry was created 
again on a different replica.)

I would like to resolve this naming conflict. The documentation (RHDS 10 Admin 
Guide, §15.26.1) states that the correct way to "promote" a conflict entry to 
the active entry is to first delete the active entry and then rename the 
conflict entry. (I'm running an old version of IdM that uses a 389-ds that 
doesn't include the dsconf utility.)

But it seems to me that if I send a delete operation to the replica with the 
bad data, it's just going to replicate that delete operation to all the other 
replicas, deleting the correct data from all the other replicas, which seems 
like an awfully dramatic action to take. To reiterate, the correct data exists 
on all of the other replicas in an entry with the same DN as the entry with the 
bad data on the "bad" replica.

I have tried to recreate this situation with a new DN that doesn't reference 
active systems, but I have been unsuccessful.

Can someone confirm that deleting the bad entry from the bad replica will cause 
the good entries on all the good replicas to also be deleted? If so, is there a 
better way to resolve this conflict? (At the moment, I'm inclined to just 
reinitialize the data on this one replica.)


[389-users] Re: Documentation as to how replication works

2023-11-17 Thread William Faulk
> I noticed there is code to dump the changelog to a flat file, but
> it isn't clear to me how to call it

Aha! I poked through the code and figured it out:

Perform an ldapmodify against "cn=replica,cn=...,cn=mapping tree,cn=config" 
adding the attribute "nsds5Task" with the value "CL2LDIF". It then writes the 
LDIF file to the same directory that contains the changelog database files, 
which is defined in the "nsslapd-changelogdir" attribute of 
"cn=changelog5,cn=config", which, for me, is 
"/var/lib/dirsrv/slapd-/cldb".

To be clear, here's the ldapmodify LDIF that worked for me:

dn: cn=replica,cn=...,cn=mapping tree,cn=config
changetype: modify
add: nsds5Task
nsds5Task: CL2LDIF

The LDIF that's created shows the actual changed data and not just a blob, 
which certainly helps.


[389-users] Re: Documentation as to how replication works

2023-11-16 Thread William Faulk
> I suspect the CSN is available as an operational attribute on
> each entry

If it is, I can't find it. Plus, a CSN seems to be associated with a change, 
not an entry. Like, if I changed a user's city and then changed their initials, 
that would be two different changes, each with its own CSN. Would the entry 
contain both? How would you know what changes each entailed?

> I thought the changelog was queryable via LDAP, somehow

Since asking the question, I've been doing some research and found that the 
"cn=changelog" tree is populated by the "Retro Changelog Plugin", and on my 
systems, that has a config that limits it to the "cn=dns" subtree in my domain. 
I guess that's the default config either for the plugin itself or for IdM. I 
did temporarily change the config on a test server, and it started reporting 
new CSNs as they came in, and it shows the target DN for each CSN, but the 
change itself is encapsulated in a blob.

The cn=changelog5,cn=config entry contains the on-disk location of the 
changelog, where it's saved as a Berkeley DB. It's almost as easy to pull the 
same data out of there.

It's good to know that I'm not just missing something obvious, though. Thanks.


[389-users] Re: Documentation as to how replication works

2023-11-16 Thread William Faulk
> What you are wondering about is attribute level conflicts

I don't *think* I am. The one problem I'm trying to understand right now is 
based on a simple password change. That password change generates many 
attribute changes on a single entry: password history, various krb attributes, 
etc. What I saw from audit logs is that those various attribute changes on the 
one entry got split into two ldap modifications. The audit log shows that all 
of my servers got one of the modifications, but a few failed to get the other.

The hypothesis I've been pursuing is this: if both modifications had the same 
CSN, since they were created at the same time on the same replica, then it's 
possible that one of my replicas got an update that contained only one of the 
modifications and recorded it as the most recent CSN from that replica. A 
second attempt to push the other modification would then see that the replica 
already had the most recent update, and fail to make that other change.

I recognize that that's a lot of weirdness. Everything I read claims that CSNs 
aren't inextricably tied to timestamp, in order to make sure that they're 
unique, so that would suppose a bug in that system. And then the idea that one 
of those updates would be carried separately from the other seems like an odd 
situation, at best. The more I understand about the replication system, the 
less likely this hypothesis seems. But I'm having a hard time coming up with 
another.


[389-users] Re: Documentation as to how replication works

2023-11-16 Thread William Faulk
Makes sense. I'll try to read some more documentation/source about the actual 
communication.

Do you know how I can find mappings between CSNs and changes? Or even just how 
to see the changelog at all?


[389-users] Re: Documentation as to how replication works

2023-11-16 Thread William Faulk
I'm currently just using the Directory Manager credentials for my monitoring; 
sorry.


[389-users] Re: Documentation as to how replication works

2023-11-16 Thread William Faulk
This was helpful; thanks. I think my biggest misunderstanding was that the RUV 
was just the most recent CSN, when it's actually a list of the most recent CSNs 
from each replica.


[389-users] Re: Documentation as to how replication works

2023-11-16 Thread William Faulk
> A CSN is generated with each externally applied modification, not for a 
> replicated operation

This is very useful information; thank you.

> The RUV is a vector of CSNs for all replicaids a specific replica has 
> seen

So each replica has its own RUV, which ideally should be the same across all 
replicas but which may temporarily differ as replication occurs. And the RUV 
lists every replica it knows about, along with the most recent CSN it has seen 
from each one.

I think part of my confusion is that the RUV for a replica seems to be hidden. 
I think I've discovered that it's in the cn=replica,cn=...,cn=mapping tree, 
cn=config entry as the "nsds50ruv" multivalued attribute, but I have to 
explicitly request that attribute. Neither "*" nor "+" returns it, nor does a 
search for "(nsds50ruv=*)", which makes it hard to find. Adding to my confusion 
was the fact that "nsds50ruv" attributes do show up in the replication 
agreement entries that are children of that entry, and they seem to contain 
cached values of the remote replicas' RUVs as of, I'm guessing, the last time 
they initiated a replication event.
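
Once the attribute has been requested by name, its values can be turned into 
something easier to compare between servers. The following is a minimal Python 
sketch, assuming each value looks like "{replica <id> <url>} <min CSN> <max 
CSN>" (with the two CSNs absent for a replica that has originated nothing) and 
that there is one extra "{replicageneration}" value. The host names and CSN 
strings are invented for illustration, and parse_ruv is a hypothetical helper, 
not part of any 389-ds tool.

```python
import re
from typing import Dict, Iterable, Optional, Tuple

# Assumed layout of one RUV element: "{replica <id> <url>} <min CSN> <max CSN>".
# The two CSNs are optional in this sketch.
_RUV_ELEMENT = re.compile(
    r"^\{replica (\d+) ([^\s}]+)\}(?: ([0-9a-fA-F]{20}) ([0-9a-fA-F]{20}))?$"
)


def parse_ruv(values: Iterable[str]) -> Dict[int, Tuple[str, Optional[str]]]:
    """Map replica ID -> (URL, max CSN or None) from nsds50ruv values."""
    ruv: Dict[int, Tuple[str, Optional[str]]] = {}
    for value in values:
        match = _RUV_ELEMENT.match(value.strip())
        if match is None:
            # Skips "{replicageneration} ..." and anything unrecognized.
            continue
        replica_id = int(match.group(1))
        ruv[replica_id] = (match.group(2), match.group(4))
    return ruv


# Invented sample values, for illustration only.
sample = [
    "{replicageneration} 5f3c1a2b000000040000",
    "{replica 4 ldap://a.example.test:389} 5f3c1a2b000000040000 65e1f0a2000100040000",
    "{replica 7 ldap://b.example.test:389}",
]
for replica_id, (url, max_csn) in sorted(parse_ruv(sample).items()):
    print(replica_id, url, max_csn)
```

Running the same parse against the RUV read from each server gives one 
dictionary per server, and any replica ID whose max CSN differs between two 
dictionaries is a place where those servers disagree about what they've seen.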

Ultimately, I think I mostly understand now. A change happens on a replica, 
which assigns it a CSN and updates its own RUV to indicate that that's now the 
newest CSN it has. Then a replication event occurs with its peers and those 
peers basically say "you have something newer; send me everything you 
originated after this last CSN from you that I know about". And then a 
replication event happens with their peers, which see that there's something 
new from that replica, etc.
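
To check that understanding, here is a rough Python sketch of one replication 
session as I currently picture it. It is an illustration under stated 
assumptions, not the server's actual algorithm: CSNs are simplified to 
(timestamp, sequence number, replica ID) tuples, a replicated change keeps the 
CSN assigned by the replica where it originated, and the supplier sends every 
change in its changelog that is newer than what the consumer's RUV records for 
that originating replica. All names here are hypothetical.

```python
from typing import Dict, List, Tuple

Csn = Tuple[int, int, int]          # (timestamp, seqnum, replica_id), simplified
Change = Tuple[Csn, str]            # (CSN, description of the modification)
Ruv = Dict[int, Csn]                # replica_id -> newest CSN seen from it


def changes_to_send(changelog: List[Change], consumer_ruv: Ruv) -> List[Change]:
    """Pick changes the consumer has not seen, judged per originating replica."""
    pending = []
    for csn, change in sorted(changelog):
        origin = csn[2]
        known = consumer_ruv.get(origin)
        if known is None or csn > known:
            pending.append((csn, change))
    return pending


def apply_changes(changelog: List[Change], ruv: Ruv, pending: List[Change]) -> None:
    """Record received changes and advance the RUV entry for each origin."""
    for csn, change in pending:
        changelog.append((csn, change))
        origin = csn[2]
        if origin not in ruv or csn > ruv[origin]:
            ruv[origin] = csn


# Replica 1 originates a change; it travels 1 -> 2 -> 3 with its CSN unchanged.
log1: List[Change] = [((100, 0, 1), "modify uid=someuser")]
log2: List[Change] = []
log3: List[Change] = []
ruv2: Ruv = {}
ruv3: Ruv = {}

apply_changes(log2, ruv2, changes_to_send(log1, ruv2))
apply_changes(log3, ruv3, changes_to_send(log2, ruv3))
print(ruv3)                          # replica 3 now records a CSN from replica 1
print(changes_to_send(log1, ruv2))   # a second session has nothing left to send
```

In this model, the last replica learns about the change even though it never 
talks to the replica that originated it, and repeating a session is harmless 
because the consumer's RUV already covers the change.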

I think the biggest thing I don't understand now is how to associate changes 
with CSNs. It's supposed to be in the changelog, but the only changes I see in 
"cn=changelog" are for "idnsname" DNs, and there are definitely more changes 
going on than that. 

> Now assume that the updates 100x have been conflicting

I'm not really concerned at the moment with conflicting updates. I get why 
that's a problem and I generally understand the "+nsuniqueid" conflict 
resolution method. My problem is occurring without conflicting updates.


[389-users] Re: Documentation as to how replication works

2023-11-15 Thread William Faulk
Do you think those variables could add up to lags of weeks?

Also, are there known replication bugs in the earlier versions shipped with 
older RHEL releases? I am definitely very downrev, unfortunately. (I'm 
embarrassed to say I'm still on 7.9.) I need to upgrade soon, since that's 
going EoS in less than a year, but if there are known issues, I can get that 
work prioritized.


[389-users] Re: Documentation as to how replication works

2023-11-15 Thread William Faulk
> The explanation below looks excellent to me

Things that I currently know I don't know include:

* When/where a new CSN is generated. If a piece of data is changed on a 
particular replica, that must obviously create a new CSN. When that data is 
replicated, does the accepting replica create its own CSN for that change or 
does it copy the initiating replica's CSN? I think it's the former, but I'm not 
sure, because:
* How are CSNs compared? Since the CSN contains a replica ID, it seems like 
there's the potential for one replica's updates to prevent others' updates from 
propagating. Unless that isn't really used in the comparison. In which case, 
what's it doing in there?
* How a replica knows what data to send based on CSN comparison.

I'm sure that there are things that I don't yet know that I don't know, but 
that knowledge feels like it's gated partially by the answers to these 
questions.
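
To make the comparison question concrete, here is a small Python sketch. It 
assumes a CSN is written as 20 hex digits (8 for the timestamp, 4 for a 
sequence number, 4 for the replica ID, 4 for a sub-sequence number) and that 
CSNs are ordered field by field in that order. Both the layout and the 
ordering are assumptions on my part, and parse_csn is a hypothetical helper.

```python
from typing import NamedTuple


class CSN(NamedTuple):
    """Fields in assumed comparison order; tuples compare left to right."""
    timestamp: int
    seqnum: int
    replica_id: int
    subseqnum: int


def parse_csn(text: str) -> CSN:
    """Split a 20-hex-digit CSN string into its four assumed fields."""
    if len(text) != 20 or any(c not in "0123456789abcdefABCDEF" for c in text):
        raise ValueError(f"not a 20-hex-digit CSN: {text!r}")
    return CSN(
        timestamp=int(text[0:8], 16),
        seqnum=int(text[8:12], 16),
        replica_id=int(text[12:16], 16),
        subseqnum=int(text[16:20], 16),
    )


# Two invented CSNs with the same timestamp and sequence number but
# different originating replicas (4 and 5).
a = parse_csn("65e1f0a2000100040000")
b = parse_csn("65e1f0a2000100050000")
print(a.replica_id, b.replica_id, a < b)
```

Under these assumptions, the replica ID only acts as a tie-breaker so that two 
changes made in the same second on different replicas still have a definite 
order. Whether that order is also what decides which changes get sent is 
exactly the part I can't work out.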

> A key element is that there is no synchronous 
> replication, an update is not sync immediately to all replicas.

To be clear, I'm not saying that sometimes it takes minutes or hours for the 
replicas to become synchronized. I'm saying that occasionally some random data 
change never synchronizes, even over weeks or months. For example, I have a 
user who changed his password three weeks ago, and parts of that change are 
still missing from a few of my replicas. All the changes that have happened 
since then (of which there are many) have successfully replicated to all of my 
replicas.

One of the reasons that I'm running down this path is what the audit logs 
show. This password change, which involves changes to many values within a 
single entry, was, for some reason, apparently split into two separate modify 
operations: one is a change to "krbExtraData", and the other contains changes 
to a bunch of other attributes. All replicas show the former in the audit log, 
but a small number of replicas don't show the latter at all. Since those 
changes happened at exactly the same time, I'm looking into how replication 
uses timestamps and replica IDs to determine what data needs to be replicated. 
I feel like it's unlikely that this is the problem, but I also don't have 
enough data to rule it out.


[389-users] Re: Documentation as to how replication works

2023-11-15 Thread William Faulk
> it isn't necessary to keep track of a list of CSNs

If it doesn't keep track of the CSNs, how does it know what data needs to be 
replicated?

That is, imagine replica A, whose latest CSN is 48, talks to replica B, whose 
latest CSN is 40. Clearly replica A should send some data to replica B. But if 
it isn't keeping track of what data is associated with CSNs 41 through 48, how 
does it know what data to send?

> by asking the other node for its current ruv
> can determine which if any of the changes it has need to be propagated to the 
> peer.

In addition, the CSNs are apparently a timestamp and replica ID. So imagine a 
simple ring topology of replicas, A-B-C-D-E-(A), all in sync. Now imagine 
simultaneous changes on replicas A and C. C has a new CSN of, say, 100C, and it 
replicates that to B and D. At the same time, A replicates its new CSN of 100A 
to B and E. Now E has a new CSN. Is it 100A or 101E?

If E's new max CSN is 100A, then when it checks with D, D has a latest CSN of 
100C, which is greater than 100A, so the algorithm would seem to imply that 
there's nothing to replicate and the change that started at A doesn't get 
replicated to D.

If E's max CSN is 101E, then, when D checks in with its 101D, it thinks it 
doesn't have anything to send. I suppose in this scenario that the data would 
get there coming from the other direction. But if E's max CSN is 101E, 
eventually it's going to check in with A, which has a max CSN of 100A, so it 
would think that it needed to replicate that same data back to A, but it's 
already there. This is an obvious infinite loop.

I'm certain I'm missing something or misunderstanding something, but I don't 
understand what, and these details are what I'm trying to unravel.
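
Here is the E-to-D step of that scenario as a small Python sketch, to show 
where I get stuck. CSNs are simplified to (timestamp, replica letter) pairs, 
so 100A becomes (100, "A"), and the starting state where everyone already has 
90A and 90C is invented. The first function is the single-max-CSN comparison I 
described above. The second is a guess at an alternative in which the newest 
CSN is tracked separately for each originating replica. Neither is taken from 
the server's code.

```python
from typing import Dict, Set, Tuple

Csn = Tuple[int, str]  # (timestamp, originating replica), e.g. (100, "A")


def missing_by_single_max(supplier: Set[Csn], consumer: Set[Csn]) -> Set[Csn]:
    """Send only what is newer than the consumer's single largest CSN."""
    if not consumer:
        return set(supplier)
    consumer_max = max(consumer)
    return {csn for csn in supplier if csn > consumer_max}


def missing_by_origin(supplier: Set[Csn], consumer: Set[Csn]) -> Set[Csn]:
    """Send what is newer than the consumer's newest CSN from the same origin."""
    newest: Dict[str, int] = {}
    for timestamp, origin in consumer:
        newest[origin] = max(newest.get(origin, 0), timestamp)
    return {(ts, origin) for ts, origin in supplier if ts > newest.get(origin, 0)}


# E has received 100A from A; D has received 100C from C.
e_has = {(90, "A"), (90, "C"), (100, "A")}
d_has = {(90, "A"), (90, "C"), (100, "C")}

print(missing_by_single_max(e_has, d_has))  # empty: 100A is not above 100C
print(missing_by_origin(e_has, d_has))      # 100A, since D only has 90 from A
print(missing_by_origin(d_has, e_has))      # 100C, going the other way
```

With the single-max comparison, 100A never reaches D from E, which is the 
problem I described. With per-origin tracking, both changes cross in one 
exchange, and once E and D hold the same set, neither has anything left to 
send, so there's no loop either.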


[389-users] Documentation as to how replication works

2023-11-15 Thread William Faulk
I am running a Red Hat IdM environment and am having regular problems with 
missed replications. I want to better understand how it's supposed to work so 
that I can make reasonable hypotheses to test, but I cannot seem to find any 
in-depth documentation for it. Every time I think I start to piece together an 
understanding, experimentation makes it fall apart. Can someone either point me 
to some documentation or help me understand how it works?

In particular, IdM implements multimaster replication, and I'm initially trying 
to understand how changes are replicated in that environment. What I think I 
understand is that changes beget CSNs, which are composed of a timestamp and a 
replica ID, and some sort of comparison is made between the most recent CSNs in 
order to determine what changes need to be sent to the remote side. Does each 
replica keep a list of CSNs that have been sent to each other replica? Just the 
replicas that it peers with? Can I see this data? (I thought it might be in the 
nsds5replicationagreement entries, but the nsds50ruv values there don't seem to 
change.) But it feels like it doesn't keep that data, because then what would 
be the point of comparing the CSN values? Anyway, these are the types of 
questions I'm looking to answer. Can anyone help, please?

-- 
William Faulk