At 08:26 PM 5/23/2004, you wrote:
Does this mean that WinX machines should ensure that they are rebooted at
least every 24 days to avoid overflowing the clock() value until the next
version is available?

No.

* Win32 machines do not appear to be affected (so far there are no reports, and I haven't been able to replicate the problem on NT, W2K, or XP).

* The clock() function should reset itself whenever the application begins running.

* The clock() function provides high resolution timing information in an "implementation specific" way. (It's that last part that will 'get ya'.) Rebooting has nothing to do with this function. It is intended to return the number of "clock ticks" that a program has been running.

Before folks become overly concerned I want to explain this "overflow" bug.

The non-technical part is - if your system is running ok then you don't have to worry about it - it will continue to run fine.

(Fellow Code Dwellers Read On)

The technical part is as follows:

There are a number of places in Message Sniffer where we use code similar to the following to create a high resolution timer:

if(PersistentServerMode) {        // Persistent servers hardwire the clock.

  if(ForcedServerLife > 0) {      // If there is a forced value, use that.

    ServerDeathClock = clock() + (ForcedServerLife * CLOCKS_PER_SEC);

  } else {                        // If no force then use the standard.

    ServerDeathClock = clock() + (StdPersistence * CLOCKS_PER_SEC / 1000);

  }
}

The variable ServerDeathClock is of type clock_t, which on most systems resolves to an integer - specifically (and this is the problem) a 32-bit signed integer.

CLOCKS_PER_SEC defines how many 'clock ticks' clock() counts in one second.

StdPersistence is set to a default value of 600000, measured in milliseconds. This is 600 seconds, or 10 minutes.

In some cases this code:

(StdPersistence * CLOCKS_PER_SEC / 1000)

results in a negative number because StdPersistence * CLOCKS_PER_SEC is larger than the largest value a 32-bit signed integer can hold. (That's the overflow.)
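
To see the arithmetic concretely, here is a minimal sketch (illustrative values only - I'm assuming a POSIX-style CLOCKS_PER_SEC of 1,000,000 and a 32-bit signed clock_t, which is roughly the difference between the affected and unaffected platforms):

#include <cstdint>
#include <cstdio>

int main() {
    // Illustrative values: StdPersistence in milliseconds, and a
    // POSIX-style CLOCKS_PER_SEC of 1,000,000 (Win32 compilers typically
    // use 1,000, which is why those builds escape the problem).
    const int64_t StdPersistence = 600000;   // 10 minutes in milliseconds
    const int64_t clocksPerSec   = 1000000;

    // Done in 64 bits the result is what we intended...
    int64_t wide = StdPersistence * clocksPerSec / 1000;   // 600,000,000

    // ...but done in a 32-bit signed clock_t the intermediate product
    // (600,000,000,000) blows far past 2,147,483,647 and wraps, typically
    // leaving a negative number behind.
    int32_t narrow = (int32_t)(StdPersistence * clocksPerSec) / 1000;

    printf("64-bit calculation: %lld\n", (long long)wide);
    printf("32-bit calculation: %d\n",   (int)narrow);
    return 0;
}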

As a result, later on when we evaluate this timer it seems that it has always expired:

inline bool CheckServerClock(){         // Return true as long as the
  return (clock() < ServerDeathClock);  // Server clock has not run out.
}

This happens because clock() returns a positive integer and a positive integer is never less than (is always greater than) a negative one.
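
With illustrative numbers plugged in (made up, but representative of what the wrapped calculation produces):

#include <cstdio>

int main() {
    // Made-up but representative values: clock() has just returned a small
    // positive tick count, while the overflowed setup left the death clock
    // negative.
    long now              = 5000;       // a fresh clock() reading
    long ServerDeathClock = -1295421;   // the wrapped, negative result

    // CheckServerClock() evaluates (clock() < ServerDeathClock), so the
    // server looks expired from its very first tick.
    printf("Server still alive? %s\n", (now < ServerDeathClock) ? "yes" : "no");
    return 0;
}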

The reported bug was a race condition set up by this particular failure... The persistent server is supposed to run for a period defined by ServerDeathClock, after which the server expires and drops back to the main program, where it is immediately re-launched. This causes it to load the rulebase file, re-evaluate the .cfg file, etc...

The race condition occurs because ServerDeathClock always appears to have expired immediately (which should never happen in real life). As a result the program does the following: launch server, load .cfg, load rulebase, drop server, repeat... as fast as it can. This is bad.

The short-term fix was to modify the code so that StdPersistence is represented in seconds rather than milliseconds. The calculation then becomes:

(StdPersistence * CLOCKS_PER_SEC)
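
As a sketch of the idea (simplified - the real code has more context around it):

#include <ctime>

// Sketch of the interim fix: StdPersistence now holds seconds, so even
// with CLOCKS_PER_SEC at 1,000,000 the product is 600,000,000, which is
// well inside a 32-bit signed clock_t.
static const long StdPersistence = 600;   // seconds (10 minutes)

clock_t MakeServerDeathClock() {
    return clock() + (StdPersistence * CLOCKS_PER_SEC);
}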

---

Some of you might be asking the burning question - why was StdPersistence represented in milliseconds in the first place?

A: Nearly every timing issue in the program was handled using clock() and values measured in milliseconds - for example, poll timing values, lock wait values, etc... It seemed less confusing at first to simply represent all timing information in milliseconds and use the same engineering template throughout. As it turns out, this was a bad choice.

Version 2-3.1 will implement long-term timers using time_t and time(), which resolve in seconds.
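
As a rough sketch of the approach (my own illustration, not the actual 2-3.1 code):

#include <ctime>

// Sketch of a long-term timer built on time()/time_t. time() advances in
// wall-clock seconds no matter how much CPU the process uses, so it avoids
// both of the clock() problems described here.
static const time_t StdPersistence = 600;   // seconds (10 minutes)
static time_t ServerDeathTime = 0;

void ArmServerDeathClock() {
    ServerDeathTime = time(NULL) + StdPersistence;
}

inline bool CheckServerClock() {              // Return true as long as the
    return (time(NULL) < ServerDeathTime);    // server clock has not run out.
}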

In addition, short-term timers will largely be replaced with code using gettimeofday() rather than clock() due to another troublesome discovery.
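Here's the kind of thing I mean - a sketch only, assuming a platform where gettimeofday() is available, with made-up helper names (the real 2-3.1 code may differ):

#include <sys/time.h>

// Sketch of a millisecond wall-clock timestamp built on gettimeofday().
// Unlike clock(), this keeps advancing even while the process sleeps.
long long NowMilliseconds() {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (long long) tv.tv_sec * 1000 + tv.tv_usec / 1000;
}

// Example use: hold off re-checking for command files for at least one
// second at a time (the same job the command-check timer described below
// is meant to do).
bool TimeToCheckCommands(long long& nextCheckMs) {
    long long now = NowMilliseconds();
    if (now < nextCheckMs) return false;   // Not yet - skip the directory scan.
    nextCheckMs = now + 1000;              // Schedule the next check.
    return true;
}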

Two bugs for the price of one!

It turns out that clock() counts ticks only when the application is actually running (consuming CPU time). This is usually fine when you are spinning in a tight loop, but it can be a real problem if you expand that loop to do other things that might Sleep() for long periods or otherwise give up large chunks of computing time to the system. In those cases clock() might move so slowly as to appear stopped.
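
A quick way to see this for yourself (a sketch assuming a POSIX system where sleep() is available):

#include <cstdio>
#include <ctime>
#include <unistd.h>

int main() {
    // Sleep for a few wall-clock seconds and compare the two clocks.
    // clock() measures CPU time consumed by this process, so it barely
    // moves while we sleep; time() tracks wall-clock seconds.
    clock_t c0 = clock();
    time_t  t0 = time(NULL);

    sleep(3);   // Give up the CPU entirely for three seconds.

    printf("clock() advanced by %ld ticks\n",   (long)(clock() - c0));
    printf("time()  advanced by %ld seconds\n", (long)(time(NULL) - t0));
    return 0;
}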

This was discovered during testing of the fix for the first bug. At the top of the program there is a "Command Check" function which is called whenever the persistent server checks for a job. This might happen many times per second - so rather than scan the directory for a stop, reload, or rotate message file every time it is called, the function sets up a timer that holds off checking for these messages for at least 1 second at a time.

What I discovered was that the function would check for commands once at startup but (apparently) never again! When I dug deeper and monitored the output of clock() I found that even when it was called many seconds apart the value returned *NEVER CHANGED*! My theory is that since the other parts of the program were sleeping most of the time, the process never received enough clock ticks to move the output of clock().

I also discovered that this possibility is not well covered in most documentation. In fact, most texts on the use of clock() indicate that it will reliably return the number of clock ticks that have passed... You are left to infer that this is a reliable measure of time and that the only important implementation-specific component is how many 'ticks' there are per second... this is misleading.

I finally stumbled across a reference to clock() that was very explicit about the "implementation specific" and "time used by the process" aspects of the function - then I put two and fifteen together. Specifically, the "time used by the process" may (as in this case) mean the CPU time used, not the wall-clock time.

In future I will be very cautious about using clock() and I recommend the same for others!

Also, I think there is a looming challenge for all programs that might use clock() for any relatively long-term timing functions. As CPU clock rates climb past several GHz, the number of clock ticks rises also. I fear there is a good possibility that as CPU speeds rise, many 32-bit programs will begin to fail inexplicably due to the same bug we discovered here.

Wherever clock_t is defined as a signed int and the time to be measured is long enough, a high CLOCKS_PER_SEC may cause the program to suddenly calculate negative numbers, causing inexplicable timing problems. Even worse, these problems are likely to show up only on faster systems, and only some of the time, depending upon the starting values from clock() for a given instance of the program.
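
A quick back-of-the-envelope sketch of that headroom (assuming a 32-bit signed clock_t and an illustrative tick rate):

#include <climits>
#include <cstdio>

int main() {
    // With a 32-bit signed clock_t, the longest interval that can be
    // expressed in ticks is INT_MAX / CLOCKS_PER_SEC seconds. At a
    // POSIX-style 1,000,000 ticks per second that is only about 2,147
    // seconds (roughly 35 minutes) of headroom.
    const long clocksPerSec = 1000000;   // illustrative value
    long maxSeconds = INT_MAX / clocksPerSec;

    printf("Longest safe interval: %ld seconds (~%ld minutes)\n",
           maxSeconds, maxSeconds / 60);
    return 0;
}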

Given the ubiquity of examples using clock() in precisely the way I did, I suspect that this particular bug may exist in some form on many systems out there that have always run perfectly up to now.

Best,
_M
