Tom Lane wrote:
> Mark Dilger <[EMAIL PROTECTED]> writes:
>> Tom Lane wrote:
>>> Would a simple constant value be workable, or do we need some more
>>> complex model (and if so what)?
>>
>> Consider:
>> ANALYZE myfunc(integer) ON (SELECT myfunc(7)) WITH RATIO 0.03;
>> ...
>> It seems to me that the above system would work perfectly well for
>> collecting the number of rows returned from a set returning function,
>> not just the run times.
>
> I don't really think that data collection is the bottleneck.

Ahh, I'm not just thinking about data collection. I'm thinking about usability for non-hackers who know enough plpgsql to write a function and then want to train the system to plan for it appropriately. It is a much easier task for a novice user to say "go away and figure out how expensive this thing is" than to think about things like statistical variance. We don't demand that kind of knowledge from users who write queries or run ANALYZE on tables, so why should they need it to write a function?

> If a constant estimate isn't good enough for you, then you need some kind of
> model of how the runtime or number of rows varies with the function's
> inputs ... and I hardly see how something like the above is likely to
> figure out how to fit a good model.  Or at least, if you think it can,
> then you skipped all the interesting bits.

I am (perhaps naively) imagining that the user will train the database on the same query that will actually be used most often in production. If that query modifies the table, the user could train on a copy of the table instead. The data collected during the analyze phase would just be scalar summaries such as the average and standard deviation of the per-call run time (and row count, for set-returning functions). That would make the job of the planner / cost estimator easier, right? It could treat the function as a constant-cost function.
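For concreteness, a hypothetical training session under that model might look like this (the ANALYZE ... ON ... WITH RATIO syntax is just my earlier proposal, not anything PostgreSQL implements, and the table and function names are made up):

    -- Train on a copy so the production table is untouched (names are illustrative).
    CREATE TABLE orders_copy AS SELECT * FROM orders;

    -- Proposed (not implemented) syntax: sample roughly 3% of the calls made
    -- by this query and store per-call averages for myfunc.
    ANALYZE myfunc(integer)
        ON (SELECT myfunc(order_id) FROM orders_copy)
        WITH RATIO 0.03;

    -- The planner would then treat myfunc as constant-cost, using the stored
    -- average run time (and average row count, for set-returning functions).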

> One other point is that we already know that sampling overhead and
> measurement error are significant problems when trying to measure
> intervals on the order of one Plan-node execution.  I'm afraid that
> would get a great deal worse if we try to use a similar approach to
> timing individual function calls.

The query could be run with the arguments passed to myfunc being recorded to a temporary table. After the query completes (and the temporary table is populated), data from the temp table could be pulled into memory in batches, with myfunc run on them again in a tight loop. The loop itself could be timed, rather than each iteration. The sum of the timings for the various loops would then be the total runtime, which would be divided by the total number of rows to get the average runtime per call. The downside is that I don't see how you retrieve the standard deviation. (I also don't know whether the planner knows how to use standard-deviation information, so perhaps this is a non-issue.)
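As a rough sketch of that replay idea (purely illustrative; the argument-capture step and all names are made up, and nothing like this exists in the backend today):

    -- Arguments captured while the original query ran (hypothetical capture step).
    CREATE TEMP TABLE myfunc_args(arg1 integer);

    CREATE OR REPLACE FUNCTION time_myfunc_replay() RETURNS interval AS $$
    DECLARE
        t0    timestamptz;
        n     bigint;
        total interval;
    BEGIN
        SELECT count(*) INTO n FROM myfunc_args;
        t0 := clock_timestamp();
        PERFORM myfunc(arg1) FROM myfunc_args;   -- re-run the whole batch
        total := clock_timestamp() - t0;         -- time the batch, not each call
        RAISE NOTICE 'average per call: %', total / n;
        RETURN total;
    END;
    $$ LANGUAGE plpgsql;

This only ever times whole batches, so per-call measurement error washes out, which is the point; but as noted above it also means the per-call variance is lost.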

A further refinement would be to batch the inputs based on properties of the input data. For text, you could run a batch of short strings first, a batch of medium-length strings second, and a batch of long strings last, and then use a best-fit (least-squares) line to estimate how run time varies with input length. I'm not sure how such a refinement would work for fixed-size datatypes, and for some text functions the run time won't vary with length but with some other property anyway.
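The fit itself is easy once per-batch timings exist; something like the following could do it with PostgreSQL's built-in regression aggregates (myfunc_timings is a hypothetical table of per-batch samples):

    -- Fit elapsed_ms ~ intercept + slope * input_len by least squares.
    SELECT regr_intercept(elapsed_ms, input_len) AS fixed_cost_ms,
           regr_slope(elapsed_ms, input_len)     AS cost_per_char_ms
    FROM   myfunc_timings;   -- hypothetical (input_len, elapsed_ms) samples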

mark
