I wrote:
> The problem looks to be an underestimation of SampleOverhead, and on
> reflection it's clear why: what CalculateSampleOverhead is measuring
> isn't the total overhead, but the time between the two gettimeofday
> calls.  Which is probably about half the true overhead.
On further thought, I take that back: the "true overhead" is not the point here.  The time elapsed during a plan node execution when sampling can be broken down into three phases:

	* time before first gettimeofday call
	* time between gettimeofday calls
	* time after second gettimeofday call

Only the second part is actually measured by the instrumentation system; the other parts are overhead that has never been counted by EXPLAIN ANALYZE, sampling version or no.  Moreover, most of the runtime of InstrStartNode and InstrStopNode falls into the first or third parts.  What we would actually like to set SampleOverhead to is the portion of the second-part runtime that doesn't occur when sampling = false.

Assuming that gettimeofday() has consistent runtime and the actual time reported is measured at a consistent instant within that runtime, I believe that we should take the SampleOverhead as just equal to the runtime of a single gettimeofday() call.  The added or removed second-part time within InstrStartNode is just the tail time of gettimeofday, and the added or removed second-part time within InstrStopNode is basically just the head time of gettimeofday.  (To make this as true as possible, we need to change the order of operations so that gettimeofday is invoked *immediately* after the "if (sampling)" test, but that's easy.)

So this line of thought leads to the conclusion that CalculateSampleOverhead is actually overestimating SampleOverhead a bit, and we should simplify it to just time INSTR_TIME_SET_CURRENT().

But that still leaves me with a problem, because my machine is clearly overestimating the correction needed.  I added some printouts and got

	raw totaltime = 0.370937
	per_iter = 0.000156913, SampleOverhead = 3.28e-06
	adj totaltime = 1.82976
	sampling = 0
	starttime = 0/000000
	counter = 0/370937
	firsttuple = 0.258321
	tuplecount = 10000
	itercount = 10001
	samplecount = 704
	nextsample = 10011
	startup = 0.258321
	total = 1.82976
	ntuples = 10000
	nloops = 1

on a run with an actual elapsed time near 750 msec.  Clearly the sampling adjustment is wrong, but why?

I have a theory about this, and it's not pleasant at all.  What I think is that we have a Heisenberg problem here: the act of invoking gettimeofday() actually changes what is measured.  That is, the runtime of the "second part" of ExecProcNode is actually longer when we sample than when we don't, not merely due to the extra time spent in gettimeofday().  It's not very hard to guess at reasons why, either.  The kernel entry is probably flushing some part of the CPU's state, such as virtual/physical address mapping for the userland address space.  After returning from the kernel call, the time to reload that state shows up as more execution time within the "second part".

This theory explains two observations that otherwise are hard to explain.  One, that the effect is platform-specific: your machine may avoid flushing as much state during a kernel call as mine does.  And two, that upper plan nodes seem much more affected than lower ones.  That makes sense because the execution cycle of an upper node will involve touching more userspace data than a lower node, and therefore more of the flushed TLB entries will need to be reloaded.

If this theory is correct, then the entire notion of EXPLAIN ANALYZE sampling has just crashed and burned.  We can't ship a measurement tool that is accurate on some platforms and not others.
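For concreteness, the simplification suggested above amounts to timing a bare clock read.  A minimal standalone sketch of that measurement, using plain gettimeofday() rather than the backend's INSTR_TIME_SET_CURRENT() macro (this is illustrative only, not the actual CalculateSampleOverhead code):

#include <stdio.h>
#include <sys/time.h>

/*
 * Rough estimate of the cost of one gettimeofday() call, obtained by
 * averaging a tight loop of calls.
 */
static double
time_one_clock_read(int niters)
{
	struct timeval start,
				stop,
				dummy;
	double		elapsed;
	int			i;

	gettimeofday(&start, NULL);
	for (i = 0; i < niters; i++)
		gettimeofday(&dummy, NULL);
	gettimeofday(&stop, NULL);

	elapsed = (stop.tv_sec - start.tv_sec) +
		(stop.tv_usec - start.tv_usec) / 1000000.0;

	return elapsed / niters;	/* seconds per clock read */
}

int
main(void)
{
	printf("one gettimeofday() ~ %g sec\n", time_one_clock_read(100000));
	return 0;
}

Of course, if the theory above is right, no calibration of this sort can capture the indirect cost: the state-reload penalty depends on how much userspace data the calling plan node touches after the kernel call, which a tight calibration loop doesn't exercise.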
I'm wondering if it's at all practical to go over to a gprof-like measurement method for taking EXPLAIN ANALYZE runtime measurements; that is, take an interrupt every so often and bump the count for the currently active plan node.  This would spread the TLB-flush overhead more evenly and thus avoid introducing that bias.  There may be too much platform dependency in this idea, though, and I also wonder if it'd break the ability to do normal gprof profiling of the backend.

			regards, tom lane
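For what it's worth, a bare-bones sketch of the interrupt-driven idea, built on the POSIX profiling timer.  The tie-in to the executor is only hinted at by a hypothetical pointer to the active node's counter; none of this is existing backend code:

#include <signal.h>
#include <stddef.h>
#include <sys/time.h>

/*
 * Hypothetical hook: ExecProcNode would point this at the active plan
 * node's tick counter on entry and restore the caller's value on exit.
 */
static volatile sig_atomic_t *current_node_ticks = NULL;

static void
profile_tick(int signo)
{
	(void) signo;
	if (current_node_ticks != NULL)
		(*current_node_ticks)++;
}

static void
start_profile_timer(void)
{
	struct sigaction sa;
	struct itimerval it;

	sa.sa_handler = profile_tick;
	sigemptyset(&sa.sa_mask);
	sa.sa_flags = SA_RESTART;
	sigaction(SIGPROF, &sa, NULL);

	/* fire once per millisecond of CPU time consumed */
	it.it_interval.tv_sec = 0;
	it.it_interval.tv_usec = 1000;
	it.it_value = it.it_interval;
	setitimer(ITIMER_PROF, &it, NULL);
}

Note that this claims ITIMER_PROF/SIGPROF, which is exactly the gprof-compatibility worry: on platforms where profil() is implemented with the same timer and signal, installing our own handler would clobber normal profiling of the backend.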