Jim C. Nasby wrote:
> On Tue, Aug 22, 2006 at 11:56:17AM -0700, Mark Dilger wrote:
>> I proposed something like this quite a bit up-thread. I was hoping we
>> could have a mode in which the system would run the second, third, fourth,
>> ... best plans rather than just the best looking one, and then determine
>> from actual runtime statistics which was best. (The proposal also included
>> the ability to output the best plan and read that in at a later time in
>> lieu of a SQL query, but that part of it can be ignored if you like.) The
>> posting didn't generate much response, so I'm not sure what people thought
>> of it. The only major problem I see is getting the planner to keep track
>> of alternate plans. I don't know the internals of it very well, but I
>> think the genetic query optimizer doesn't have a concept of "runner-up #1",
>> "runner-up #2", etc., which it would need to have.
> I think the biggest issue is that you'd have to account for varying load
> on the box. If we assume that the database is the only thing running on
> the box, we might be able to do that by looking at things like how much
> IO traffic we generated (though of course OS caching will screw with
> that).
>
> Actually, that's another issue... any plans run after the first one will
> show up as being artificially fast, since there will be a lot of extra
> cached data.
Yes, caching issues prevent you from using wall-clock time. Instead, we
could instrument the executor to count the actual number of rows produced
at each internal join versus the number the planner predicted, and
generate new cost estimates from those counts.
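
Here is a rough sketch of the sort of bookkeeping I mean (Python, purely
illustrative; none of these names or structures exist in the backend):

    # Hypothetical feedback store: join_key -> accumulated row counts.

    def record_join_feedback(stats, join_key, estimated_rows, actual_rows):
        # Accumulate predicted-vs-actual row counts for one join.
        entry = stats.setdefault(join_key, {"est": 0, "act": 0, "runs": 0})
        entry["est"] += estimated_rows
        entry["act"] += actual_rows
        entry["runs"] += 1

    def corrected_selectivity(stats, join_key, base_selectivity):
        # Scale the planner's base selectivity by the observed error
        # ratio, falling back to the base estimate when we have no data.
        entry = stats.get(join_key)
        if entry is None or entry["est"] == 0:
            return base_selectivity
        return base_selectivity * (entry["act"] / entry["est"])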
Perhaps you can check my reasoning for me: I'm imagining a query that joins
AxBxCxD, where A, B, C, and D are actual tables, and a planner that always
chooses AxB first, then joins C, then joins D. (It does so because the
single-table statistics suggest this as the best course of action.) It
might be that AxD is a really small intermediate result, much smaller than
the independent statistics for A and D would lead you to estimate, while
AxB is pretty much the size you would expect given the independent
statistics for A and B. So we need some way for the system to stumble upon
that fact. If we only ever calculate cross-join statistics for the plans
the system actually chooses, we will only ever confirm that AxB is about
the size we expected. And since the actual size of AxB is nearly equal to
its estimate, the system will continue to choose the same plan for future
queries, totally ignorant of the advantage of doing AxD first.
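
To make the failure mode concrete, with invented numbers:

    # Per-table statistics imply a join selectivity of 1/100,000 for
    # AxD, so the planner expects 100,000 rows out of that join.
    rows_a, rows_d = 100_000, 100_000
    est_axd = rows_a * rows_d * (1.0 / 100_000)   # -> 100,000 rows

    # But A and D happen to be strongly correlated, and the join is tiny:
    act_axd = 50

    # No executed plan ever computes AxD, so act_axd is never observed
    # and the planner's estimate is never corrected.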
That last paragraph is my reasoning for suggesting that the system have a
mode in which it runs the "runner-up #1", "runner-up #2", etc. plans. Such
a mode could force the system down alternate paths where it might pick up
interesting statistics that it wouldn't find otherwise.
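
Such a mode might look roughly like this (illustrative Python again;
"plans" is assumed to be the planner's candidate list, best first, and
the executor and plan-node APIs are invented):

    def explore_plans(plans, k, executor, feedback):
        # Run the k best-ranked plans rather than only plans[0].  Every
        # plan produces the same result set; the point is to harvest the
        # actual row count at each join node along the way, reusing
        # record_join_feedback() from the earlier sketch.
        for plan in plans[:k]:
            result = executor.run(plan)
            for node in plan.join_nodes():
                record_join_feedback(feedback, node.key(),
                                     node.estimated_rows, node.actual_rows)
        return result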
This idea could be scaled back somewhat: rather than actually running the
other plans, we could just extract from them which alternate joins they
contain, and consider calculating statistics for those joins as well.
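
For example (same caveat about invented plan-node APIs):

    def candidate_join_pairs(plans, k):
        # Collect every pair of base relations joined anywhere in the
        # top-k plans; these are the joins worth gathering cross-join
        # statistics for, without ever executing the runner-up plans.
        pairs = set()
        for plan in plans[:k]:
            for node in plan.join_nodes():
                pairs.add(frozenset((node.outer_rel, node.inner_rel)))
        return pairs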
mark