Bonno Bloksma wrote:
Hi Pete,
I get what you said. But:
I'm nowhere near your timezone, I'm at GMT+1 or +2. So should there
not have been a problem long before where my system would see older
files at your system several times a day when in fact there would be a
newer one?
Does that mean my system has been getting only two or three updates
a day where it should have gotten over a dozen?
If two systems agree on the time, and then only one of them advances
its clock by an hour, the two clocks will differ no matter which time
zones they are in. Anyway, we've learned more since then (below).
I've switched curl so everything should work ok by now. According to
my logs I'm getting a new rulebase about every hour.
Once per hour is just about right.
Pacing is currently set to 55 minutes.
---
More that has been learned (technical stuff) and a story (skip if you
like, but some might find this interesting):
Yesterday while working on this problem and testing on one of our
inbound spamtrap processors I noticed that things still weren't quite
right. This discovery led me to break a paradigm in my thinking and
begin to see another problem (perhaps the key problem).
Paradigm: I had been very focused on the one hour time difference, DST,
and the obvious coincidence with the "DST storm" -- Our countermeasures
at the server and deployment of the new getRulebase script had
essentially mitigated the problem... so I was expecting everything to
work fine.
Having loaded the new getRulebase script on the system I was monitoring,
it didn't make sense that there was still a problem. Even worse, the
telemetry was showing timestamps that were close, but off by a few
minutes -- as if the server had picked up the time shifted file instead
of the original posting... but that didn't make sense. I wondered if
something else was going on, so I pulled up the current UTC time as a reference:
http://www.worldtimeserver.com/current_time_in_UTC.aspx
To my amazement, the telemetry I was looking at showed the UTC
timestamp for the rulebase on the server as one hour in the future!
"That can't be right", I said to myself, and then I checked the
timestamp again on the delivery server. I rechecked the math and sure
enough the timestamp on the delivery server was correct! I hate a mystery.
I went to the main SYNC server to see if something had happened to it --
Why would it report the file's timestamp in the future when the
timestamp on the file system is correct? We hadn't made any changes to
the software. The only thing that had happened was DST.
I made my priority getting the reported timestamp correct, and I made
the assumption that there might be some obscure DST bug in this version
of RedHat or one of the libraries that I would solve later. I began
looking for a way to tweak the SYNC server code to adjust the time stamp
before reporting it when these conditions were detected... A way to work
around the bug. I would fix the bug later.
Of course, to do this tweak I would need to find a way to detect the
condition so I started to look for ways to do that reliably. I know it's
a funny notion -- looking for a reliable way to leverage a system that
you have already determined is unreliable... but that is the nature of
what we do. Nothing is perfect and a lot of software development for
high availability is figuring out how to "stay solid on shifting
sands"... but I digress.
One of the first things I did was list a directory of the rulebase files
from that system... Then I saw something weird that started to break my
paradigm. Breaking a paradigm always requires new information in some
form ;-)
Some of the files listed had times -- and others only had dates! I'd
never seen that before. Digging deeper, I determined that the ones that
had only dates had current timestamps at the delivery server and the ones
that had times had been pushed back. (Recall that one of the things we
did to mitigate this problem was to push the timestamp on a rulebase
back by one hour after it had been posted for 5 minutes. This would
prevent systems with DST conflicts from seeing the files as perpetually
in the future after (at most) one or two downloads.)
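For anyone who wants to picture that push-back step, here is a rough sketch in Python. It is not the actual server code: the function name, the constants, and the rule used to avoid pushing the same file back twice (only touch files between five minutes and one hour old) are all assumptions made for illustration.

```python
import os
import time

PUSH_BACK_SECONDS = 3600   # one hour, as described above
SETTLE_SECONDS = 5 * 60    # leave a freshly posted file alone for 5 minutes


def push_back_if_settled(path, now=None):
    """Shift a file's mtime back one hour once it has been posted 5 minutes.

    Returns True if the timestamp was changed. The "age under one hour"
    test is an assumed guard so a file is only pushed back once.
    """
    now = time.time() if now is None else now
    st = os.stat(path)
    age = now - st.st_mtime
    if SETTLE_SECONDS <= age < PUSH_BACK_SECONDS:
        # Keep the access time, move only the modification time.
        os.utime(path, (st.st_atime, st.st_mtime - PUSH_BACK_SECONDS))
        return True
    return False
```

A periodic job could call this for every rulebase file in the delivery directory; once a file has been pushed back its apparent age exceeds an hour, so later passes skip it.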
Now the paradigm started to unravel... The timestamps that were seen at
the SYNC server were one hour in the future... So the files seen without
times (only dates) might be so far in the future that the ls software
can't make sense of them (or something like that anyway).
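A likely explanation for the date-only entries, as far as I can tell: GNU ls prints the time of day only for "recent" files, meaning less than about six months old and not dated in the future. Anything else gets the year in place of the time. The sketch below mimics that rule; the function name and the sample dates are made up for illustration, and the real ls logic has more locale and clock handling than this.

```python
from datetime import datetime, timedelta

# Roughly half an average Gregorian year, the cutoff ls uses for "recent".
SIX_MONTHS = timedelta(seconds=31556952 // 2)


def ls_style_stamp(mtime, now):
    """Format a timestamp approximately the way `ls -l` would."""
    recent = (now - SIX_MONTHS) < mtime < now
    if recent:
        return mtime.strftime("%b %d %H:%M")   # month, day, time of day
    return mtime.strftime("%b %d  %Y")         # month, day, year (no time)


now = datetime(2007, 3, 12, 14, 5)                        # hypothetical "now"
print(ls_style_stamp(datetime(2007, 3, 12, 13, 0), now))  # in the past
print(ls_style_stamp(datetime(2007, 3, 12, 15, 0), now))  # in the future
```

Under this rule a file whose timestamp appears even a few minutes ahead of the local clock loses its time column, which matches what the listing showed.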
Now I was onto a different path. Why would the SYNC server see the
timestamps differently?
Some technical background (it does matter, bear with me..):
In order to deliver rulebase files at high volumes I had decided that
the best solution would be a shared RAM drive that could be fed by the
rulebase compiler bots and consumed by the delivery servers. No sense
putting a rulebase on a physical disc when it would be thrown away
minutes later -- let alone thousands of them!
At the time I was planning this upgrade it was determined that using a
hardware based RAM drive was too experimental for the hosting guys. We
could use it-- but they would not support it. I hit upon an alternate
solution: tmpfs!
Any directory on a Linux box can be turned into a RAM drive by mounting
a tmpfs file system on it. It's a fantastic piece of software and it's
ubiquitous in Linux distros.
Trouble is -- NFS cannot export a tmpfs file system -- or at least it
couldn't at the time. I haven't checked recently.
I discovered that samba / cifs CAN export tmpfs file systems -- SO a new
solution was born. We built a system with plenty of RAM, set up a tmpfs
file system to hold rulebase files, and exported that to our cluster of
servers via samba over our private network.
It works beautifully and everything is "off the shelf" so ordinary
hosting folks know how to manage it.
Here is where that becomes important.
There is a bug in samba! (It took some "googling" to find this)
Samba apparently calculates the difference between the local clock and
UTC when it starts up and then NEVER CHECKS IT AGAIN. As a result, if
samba is started before DST begins, then when DST starts samba will
report file timestamps one hour into the future!
Presumably the opposite is also true-- If samba is started during DST
then when DST ends samba will report file timestamps one hour in the
past. We shall see this fall -- or rather, we won't see it because I
plan to make sure we restart samba at the close of DST so that it has no
impact, of course :-)
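To make the arithmetic concrete, here is a toy model of a server that caches its UTC offset at startup. This is not Samba's code, only an illustration of the behavior described above; the class name and the UTC-5 / UTC-4 offsets are hypothetical.

```python
from datetime import datetime, timedelta

STANDARD = timedelta(hours=-5)  # hypothetical zone: UTC-5 outside DST
DAYLIGHT = timedelta(hours=-4)  # UTC-4 while DST is in effect


class CachedOffsetServer:
    """Converts local file times to UTC using an offset fixed at startup."""

    def __init__(self, offset_at_startup):
        self.offset = offset_at_startup  # computed once, never refreshed

    def report_utc(self, true_utc, current_offset):
        # The file's local wall-clock time under the offset in force now...
        local_wall = true_utc + current_offset
        # ...converted back to UTC with the stale startup offset.
        return local_wall - self.offset


true_utc = datetime(2007, 3, 12, 18, 0)  # made-up file timestamp

started_before_dst = CachedOffsetServer(STANDARD)
print(started_before_dst.report_utc(true_utc, DAYLIGHT) - true_utc)  # 1:00:00

restarted = CachedOffsetServer(DAYLIGHT)
print(restarted.report_utc(true_utc, DAYLIGHT) - true_utc)           # 0:00:00
```

Swapping the two offsets (started during DST, queried after it ends) gives a result one hour in the past, which is the fall case predicted above.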
In case you missed it-- that was the fix. Restarting the samba server
software caused it to re-calculate its time reference and as a result
it began reporting the correct timestamp. The SYNC server software got
accurate timestamps; the telemetry returned to normal; and everything
has been fine since.
Best,
_M