Tom Lane wrote:
> Mark Dilger <[EMAIL PROTECTED]> writes:
>> Tom Lane wrote:
>>> Would a simple constant value be workable, or do we need some more
>>> complex model (and if so what)?
>>
>> Consider:
>> ANALYZE myfunc(integer) ON (SELECT myfunc(7)) WITH RATIO 0.03;
>> ...
>> It seems to me that the above system would work perfectly well for
>> collecting the number of rows returned from a set returning function,
>> not just the run times.
>
> I don't really think that data collection is the bottleneck.

Ahh, I'm not just thinking about data collection. I'm thinking about usability for non-hackers who know enough plpgsql to write a function and then want to train the system to plan for it appropriately. It is a much easier task for a novice user to say "go away and figure out how expensive this thing is" than to think about things like statistical variance. We don't demand that kind of knowledge from users who write queries or run ANALYZE on tables, so why should they need it to write a function?

> If a constant estimate isn't good enough for you, then you need some kind of
> model of how the runtime or number of rows varies with the function's
> inputs ... and I hardly see how something like the above is likely to
> figure out how to fit a good model.  Or at least, if you think it can,
> then you skipped all the interesting bits.

I am (perhaps naively) imagining that the user will train the database on the same query that will actually be used most often in production. If that query modifies the table, the user could train on a copy of the table instead. The data collected during the analyze phase would just be scalar summaries such as the average and standard deviation of the per-call run time (and row count, for set-returning functions). That would make the job of the planner / cost estimator easier, right? It could treat the function as a constant-cost function.
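For concreteness, a hypothetical training session under that model might look like this (the ANALYZE ... ON ... WITH RATIO syntax is just my earlier proposal, not anything PostgreSQL implements, and the table and function names are made up):

    -- Train on a copy so the production table is untouched (names are illustrative).
    CREATE TABLE orders_copy AS SELECT * FROM orders;

    -- Proposed (not implemented) syntax: sample roughly 3% of the calls made
    -- by this query and store per-call averages for myfunc.
    ANALYZE myfunc(integer)
        ON (SELECT myfunc(order_id) FROM orders_copy)
        WITH RATIO 0.03;

    -- The planner would then treat myfunc as constant-cost, using the stored
    -- average run time (and average row count, for set-returning functions).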

> One other point is that we already know that sampling overhead and
> measurement error are significant problems when trying to measure
> intervals on the order of one Plan-node execution.  I'm afraid that
> would get a great deal worse if we try to use a similar approach to
> timing individual function calls.

The query could be run with the arguments passed to myfunc being recorded to a temporary table. After the query completes (and the temporary table is populated), data from the temp table could be pulled into memory in batches, with myfunc run on them again in a tight loop. The loop itself could be timed, rather than each iteration. The sum of the timings for the various loops would then be the total runtime, which would be divided by the total number of rows to get the average runtime per call. The downside is that I don't see how you retrieve the standard deviation. (I also don't know whether the planner knows how to use standard-deviation information, so perhaps this is a non-issue.)
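As a rough sketch of that replay idea (purely illustrative; the argument-capture step and all names are made up, and nothing like this exists in the backend today):

    -- Arguments captured while the original query ran (hypothetical capture step).
    CREATE TEMP TABLE myfunc_args(arg1 integer);

    CREATE OR REPLACE FUNCTION time_myfunc_replay() RETURNS interval AS $$
    DECLARE
        t0    timestamptz;
        n     bigint;
        total interval;
    BEGIN
        SELECT count(*) INTO n FROM myfunc_args;
        t0 := clock_timestamp();
        PERFORM myfunc(arg1) FROM myfunc_args;   -- re-run the whole batch
        total := clock_timestamp() - t0;         -- time the batch, not each call
        RAISE NOTICE 'average per call: %', total / n;
        RETURN total;
    END;
    $$ LANGUAGE plpgsql;

This only ever times whole batches, so per-call measurement error washes out, which is the point; but as noted above it also means the per-call variance is lost.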

A further refinement would be to batch the inputs based on properties of the input data. For text, you could run a batch of short strings first, a batch of medium-length strings second, and a batch of long strings last, and then use a best-fit (least-squares) line to estimate how run time varies with input length. I'm not sure how such a refinement would work for fixed-size datatypes, and for some text functions the run time won't vary with length but with some other property anyway.
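The fit itself is easy once per-batch timings exist; something like the following could do it with PostgreSQL's built-in regression aggregates (myfunc_timings is a hypothetical table of per-batch samples):

    -- Fit elapsed_ms ~ intercept + slope * input_len by least squares.
    SELECT regr_intercept(elapsed_ms, input_len) AS fixed_cost_ms,
           regr_slope(elapsed_ms, input_len)     AS cost_per_char_ms
    FROM   myfunc_timings;   -- hypothetical (input_len, elapsed_ms) samples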

mark
