Jim C. Nasby wrote:
On Tue, Aug 22, 2006 at 11:56:17AM -0700, Mark Dilger wrote:
I proposed something like this quite a bit up-thread. I was hoping we could have a mode in which the system would run the second-, third-, fourth-, ... best plans rather than just the best-looking one, and then determine from actual runtime statistics which was truly best. (The proposal also included the ability to output the best plan and read it back in later in lieu of a SQL query, but that part can be ignored if you like.) The posting didn't generate much response, so I'm not sure what people thought of it. The only major problem I see is getting the planner to keep track of alternate plans. I don't know the internals very well, but I believe the genetic query optimizer has no concept of "runner-up #1", "runner-up #2", etc., which it would need for this.

I think the biggest issue is that you'd have to account for varying load
on the box. If we assume the database is the only thing running on
the box, we might be able to do that by looking at things like how much
I/O traffic we generated (though of course OS caching will screw with
that).

Actually, that's another issue... any plans run after the first one will
show up as being artificially fast, since there will be a lot of extra
cached data.

Yes, caching issues prevent you from using wall-clock time. We could instead instrument the code to count the actual number of rows versus the number predicted at each internal join, and generate new cost estimates from those counts.
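As a minimal sketch of the idea (purely illustrative Python, not PostgreSQL internals — the function name, the `damping` factor, and the blending rule are all my own assumptions), a per-join selectivity estimate could be nudged toward what an execution actually observed:

```python
# Hypothetical sketch: adjust a join's selectivity estimate using the
# ratio of actual to predicted row counts from an instrumented run.
# Nothing here corresponds to a real PostgreSQL API.

def adjusted_selectivity(est_selectivity, est_rows, actual_rows, damping=0.5):
    """Blend the planner's selectivity with the observed error ratio.

    damping < 1 keeps a single (possibly noisy) execution from
    swinging the estimate too far in one step.
    """
    if est_rows == 0:
        # No basis for a correction; keep the original estimate.
        return est_selectivity
    error_ratio = actual_rows / est_rows
    # Move the estimate part of the way toward what the run observed.
    return (est_selectivity * (1 - damping)
            + est_selectivity * error_ratio * damping)

# Example: the planner predicted 1000 rows but the join produced 4000,
# so the selectivity is pulled halfway toward a 4x larger value.
new_sel = adjusted_selectivity(0.001, 1000, 4000)  # -> 0.0025
```

A damping factor like this is one way to avoid thrashing the plan cache when successive runs disagree; it is a design assumption, not anything the thread proposes concretely.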

Perhaps you can check my reasoning for me. Imagine a query that computes AxBxCxD, where A, B, C, and D are actual tables, and imagine that the planner always chooses AxB first, then joins on C, then joins on D. (It does so because the single-table statistics suggest this as the best course of action.) It might be that AxD is a really small intermediate result, much smaller than would be estimated from the statistics of A and D taken independently, while AxB is pretty much what you would expect given the independent statistics for A and B. We need some way for the system to stumble upon that fact. If we only ever calculate cross-join statistics for plans the system actually chooses, we will only discover that AxB is about the size we expected. So, as long as the actual size of AxB stays close to its estimate, the system will keep choosing the same plan for future queries, totally ignorant of the advantage of doing AxD first.
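To make that concrete, here is a toy Python illustration (not PostgreSQL code; the estimate formula |A|*|D|/max(ndistinct) is the standard textbook simplification for an equi-join, and the data is invented) of how per-table statistics can badly overestimate a join whose column domains barely overlap:

```python
from collections import Counter

def estimated_join_rows(a_vals, d_vals):
    """Textbook equi-join estimate: assumes the value sets overlap fully."""
    nd = max(len(set(a_vals)), len(set(d_vals)))
    return len(a_vals) * len(d_vals) / nd

def actual_join_rows(a_vals, d_vals):
    """True equi-join cardinality: sum of per-value frequency products."""
    ca, cd = Counter(a_vals), Counter(d_vals)
    return sum(ca[v] * cd[v] for v in ca)

# A.x and D.x each have 10 distinct values, but only the value 9 overlaps.
a_vals = [i % 10 for i in range(1000)]       # values 0..9
d_vals = [9 + i % 10 for i in range(1000)]   # values 9..18

est = estimated_join_rows(a_vals, d_vals)    # 100000.0
act = actual_join_rows(a_vals, d_vals)       # 10000
```

Here the independence-style estimate is ten times too large, which is exactly the kind of discrepancy that only surfaces if the system ever measures AxD directly.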

That last paragraph is my reasoning for suggesting that the system have a mode in which it runs the "runner-up #1", "runner-up #2", etc., plans. Such a mode could force it down alternate paths where it might pick up interesting statistics it wouldn't find otherwise.

This idea could also be scaled back somewhat: rather than actually running the other plans, we could just extract from them which alternate joins they contain, and consider calculating statistics for those joins as well.
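That cheaper variant could be sketched like this (Python, with plans reduced to nested tuples of table names — an assumed toy representation, not the planner's actual plan trees): the candidate joins are those built by a runner-up plan but not by the chosen one.

```python
def intermediate_joins(plan):
    """Return (relations, join_sets) for a nested-tuple plan such as
    ((('A', 'B'), 'C'), 'D'): the set of base relations it covers,
    and the relation set produced at each join node."""
    if isinstance(plan, str):
        return frozenset([plan]), set()
    left, right = plan
    l_rels, l_joins = intermediate_joins(left)
    r_rels, r_joins = intermediate_joins(right)
    rels = l_rels | r_rels
    return rels, l_joins | r_joins | {rels}

chosen = ((("A", "B"), "C"), "D")      # AxB first, then C, then D
runner_up = ((("A", "D"), "C"), "B")   # AxD first, then C, then B

_, chosen_joins = intermediate_joins(chosen)
_, alt_joins = intermediate_joins(runner_up)

# Joins worth gathering statistics for: those only the runner-up builds.
candidates = alt_joins - chosen_joins
# -> {frozenset({'A', 'D'}), frozenset({'A', 'C', 'D'})}
```

The full-query join {A,B,C,D} appears in both plans and drops out of the difference, so only the genuinely novel intermediate joins (AxD and AxDxC here) would get measured.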

mark
