[389-users] Re: Determining max CSN of running server

William Faulk Wed, 28 Feb 2024 20:12:48 -0800

> Might be worth re-reading

Well, I still don't really know the details of the replication process.


I have deduced that a change originating on a replica seems to prompt that 
replica to start a replication process with its peers, but I don't really know 
what happens next. The RUVs of the two replicas get compared, but does the 
initiating system send its RUV to the receiver, does it go the other way, or 
do both happen? Does the comparison prompt the comparing system to send the 
changes it thinks the other system needs, or does it cause the comparing 
system to request new changes from the other? Maybe none of this makes much 
difference, but the lack of technical detail makes me question everything.

> It doesn't send a single CSN, the replication compares the RUVs and
> determines the range of CSNs that are missing from the consumer.

Sure, but notionally any changes that originated on that replica would be 
reflected in the max CSN for itself in the RUV that is used to compare. And at 
least one side is sending its RUV to the other during the replication process.
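
As a rough illustration of the comparison described in the quote, here is a 
minimal Python sketch. It assumes each RUV has already been reduced to a 
mapping of replica ID to max CSN, and that CSNs are 20-hex-digit strings, so 
comparing them as fixed-width strings orders them by timestamp first. The 
function name and the sample values are hypothetical, and the sketch says 
nothing about which side actually performs the comparison.

```python
from typing import Dict, Optional, Tuple

# Replica ID -> max CSN seen for that replica ID (20 hex digits).
Ruv = Dict[int, str]


def missing_ranges(supplier: Ruv, consumer: Ruv) -> Dict[int, Tuple[Optional[str], str]]:
    """For each replica ID where the consumer is behind, return
    (consumer_max, supplier_max). consumer_max is None when the consumer
    has never seen a change from that replica ID."""
    ranges = {}
    for rid, supplier_max in supplier.items():
        consumer_max = consumer.get(rid)
        # Fixed-width hex strings sort the same way as their numeric fields.
        if consumer_max is None or consumer_max.lower() < supplier_max.lower():
            ranges[rid] = (consumer_max, supplier_max)
    return ranges


# Hypothetical values: the consumer is behind only for replica ID 3.
supplier_ruv = {1: "65a00000000000010000", 3: "65dfe5a0000000030000"}
consumer_ruv = {1: "65a00000000000010000", 3: "653f0000000000030000"}

for rid, (have, want) in missing_ranges(supplier_ruv, consumer_ruv).items():
    print(f"replica {rid}: consumer has {have}, supplier has {want}")
```

If the max CSN for the originating replica never advanced in the RUV being 
compared, a comparison like this would report nothing to send for it.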

> It's also not immediate. Between the server accepting a change (add, mod
> etc), the change is associated to a CSN. But then there may be a delay
> before the two nodes actually communicate and exchange data.

Sure, but the changes originating on this replica haven't made it to the other 
replicas in weeks. This isn't a mere delay in replication.

> Generally you'd need replication logging (errorloglevel 8192). But it's
> very noisy and can be hard to read. What you need to see is the ranges that
> they agree to send.

Okay. I've done that but haven't had a chance to pore over the logs yet.

> Also remember CSNs are a monotonic Lamport clock. This means they only ever
> advance and can never step backwards. So they have some different
> properties to what you may expect. If they ever go backwards I think the
> replication handler throws a pretty nasty error.

I don't think it's going backwards. What I'm trying to rule out is that the 
replica is failing to advance its max CSN in the RUV being used to compare.

> I *think* so. It's been a while since I had to look. The nsds50ruv shows
> the RUV of the server, and I think the other replica entries are "what the
> peer's RUV was last time".

Well, it's nice to hear that my guess at least isn't asinine. :)

> replication monitoring code in newer versions does this for you, so I'd
> probably advise you attempt to upgrade your environment. 1.3 is really old
> at this point

I've been trying to get the current environment stable enough that I feel 
comfortable going through the relatively lengthy upgrade process. I think I'm 
going to have to adjust my comfort level.

> (I'm not sure if even RH or SUSE still support that version anymore).

Red Hat does, as it's what's in RHEL 7.9, which is supported for another, uh, 4 
months. They're working on this with me. I'm still just trying to understand 
the system better so that I can try to be productive while I'm waiting on them 
to come up with ideas.

> The problem here is that to read the RUVs and then compare them, you need
> to read each RUV from each server and then check if they are advancing (not
> that they are equal).

The problem is that changes in my environment are infrequent enough that all 
the replicas' RUVs _are_ equal most of the time. I'm not in front of that 
system as I respond right now, so my details might be wrong, but I'm asking 
about all of this because every RUV I see on every replica is the same, and 
each shows a max CSN for this one replica that's much older than the CSNs the 
logs reference for changes originating on that replica. The CSNs I see in the 
logs when a new change is made carry the current time, while the max CSN I see 
in the RUVs is from 4 months ago.
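
To make that kind of comparison concrete, here is a small Python sketch that 
decodes the time embedded in a CSN. It assumes the usual 20-hex-digit layout 
(8 digits of Unix timestamp, then 4 each for sequence number, replica ID, and 
sub-sequence number). The helper names and both sample CSNs are made up for 
illustration and are not values from my servers.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class CSN(NamedTuple):
    timestamp: int   # seconds since the Unix epoch
    seqnum: int
    replica_id: int
    subseqnum: int


def parse_csn(text: str) -> CSN:
    """Split a 20-hex-digit CSN string into its four numeric fields."""
    if len(text) != 20:
        raise ValueError(f"expected 20 hex digits, got {text!r}")
    bounds = ((0, 8), (8, 12), (12, 16), (16, 20))
    return CSN(*(int(text[start:end], 16) for start, end in bounds))


def csn_gap_days(older: str, newer: str) -> float:
    """Difference between the timestamps of two CSNs, in days."""
    return (parse_csn(newer).timestamp - parse_csn(older).timestamp) / 86400


# Hypothetical CSNs: a max CSN from the RUV and a CSN seen in the error log.
ruv_max = "653f0000000000030000"
log_csn = "65dfe5a0000000030000"

for label, text in (("RUV max", ruv_max), ("log CSN", log_csn)):
    csn = parse_csn(text)
    when = datetime.fromtimestamp(csn.timestamp, tz=timezone.utc)
    print(f"{label}: replica {csn.replica_id}, {when.isoformat()}")

print(f"gap: {csn_gap_days(ruv_max, log_csn):.1f} days")
```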

Maybe it *did* go backwards somehow and that's why it's not working. Not that 
that would really help me understand what actually went wrong any better than I 
do now.

> If you want to assert that "Some change I made at CSN X is on all servers"
> then you would need to read and parse the RUV and ensure that all of them
> are at or past that CSN for that replica id.

Well, you'd think so. I've got that problem, too, where some CSNs just seem to 
get missed, but the max CSN in the RUV is well past that. But that's a 
different problem and not the one I'm working on now.
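
For reference, the check described in the quote could be sketched in Python 
roughly as follows. This assumes the nsds50ruv values have already been read 
from each server (fetching them is left out), and that each replica element 
looks like "{replica ID URL} minCSN maxCSN". The function names, hostnames, 
and CSNs are hypothetical. As noted above, passing this check only says the 
RUV is at or past the CSN, not that every earlier change actually arrived.

```python
import re
from typing import Dict, Iterable, List

# Matches e.g. "{replica 3 ldap://host:389} <min CSN> <max CSN>".
RUV_ELEMENT = re.compile(
    r"^\{replica (\d+)[^}]*\}\s+([0-9a-fA-F]{20})\s+([0-9a-fA-F]{20})"
)


def max_csns(nsds50ruv_values: Iterable[str]) -> Dict[int, str]:
    """Map replica ID -> max CSN for one server's nsds50ruv values."""
    result = {}
    for value in nsds50ruv_values:
        match = RUV_ELEMENT.match(value)
        if match:  # skips {replicageneration} and elements without CSNs
            result[int(match.group(1))] = match.group(3).lower()
    return result


def servers_behind(ruvs_by_server: Dict[str, Iterable[str]], csn: str) -> List[str]:
    """Return the servers whose RUV is not yet at or past the given CSN."""
    rid = int(csn[12:16], 16)  # replica ID field of the CSN
    target = csn.lower()
    behind = []
    for server, values in ruvs_by_server.items():
        seen = max_csns(values).get(rid)
        if seen is None or seen < target:
            behind.append(server)
    return behind


# Hypothetical data for two servers.
ruvs = {
    "a.example.com": [
        "{replicageneration} 5f000000000000000000",
        "{replica 3 ldap://c.example.com:389} 60000000000000030000 65dfe5a0000000030000",
    ],
    "b.example.com": [
        "{replicageneration} 5f000000000000000000",
        "{replica 3 ldap://c.example.com:389} 60000000000000030000 653f0000000000030000",
    ],
}
print(servers_behind(ruvs, "65d00000000100030000"))
```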

Thanks for the input.

-- 
William Faulk