Matthew Schumacher wrote: > Matthew Schumacher wrote: > >>Tom Lane wrote: >> >> >> >>>I don't really see why you think that this path is going to lead to >>>better performance than where you were before. Manipulation of the >>>temp table is never going to be free, and IN (sub-select) is always >>>inherently not fast, and NOT IN (sub-select) is always inherently >>>awful. Throwing a pile of simple queries at the problem is not >>>necessarily the wrong way ... especially when you are doing it in >>>plpgsql, because you've already eliminated the overhead of network >>>round trips and repeated planning of the queries. >>> >>> regards, tom lane >> >> >>The reason why I think this may be faster is because I would avoid >>running an update on data that needs to be inserted which saves >>searching though the table for a matching token. >> >>Perhaps I should do the insert first, then drop those tokens from the >>temp table, then do my updates in a loop. >> >>I'll have to do some benchmarking... >> >>schu > > > Tom, I think your right, whenever I do a NOT IN it does a full table > scan against bayes_token and since that table is going to get very big > doing the simple query in a loop that uses an index seems a bit faster. > > John, thanks for your help, it was worth a try, but it looks like the > looping is just faster. > > Here is what I have so far in case anyone else has ideas before I > abandon it:
Surely this isn't what you have. You have *no* loop here, and you have
stuff like:
AND
(bayes_token_tmp) NOT IN (SELECT token FROM bayes_token);
I'm guessing this isn't your last version of the function.
As far as putting the CREATE TEMP TABLE inside the function, I think the
problem is that the first time it runs, it compiles the function, and
when it gets to the UPDATE/INSERT with the temporary table name, at
compile time it hard-codes that table id.
I tried getting around it by using "EXECUTE" which worked, but it made
the function horribly slow. So I don't recommend it.
Anyway, if you want us to evaluate it, you really need to send us the
real final function.
John
=:->
>
> CREATE OR REPLACE FUNCTION put_tokens(inuserid INTEGER,
> intokenary BYTEA[],
> inspam_count INTEGER,
> inham_count INTEGER,
> inatime INTEGER)
> RETURNS VOID AS '
> DECLARE
> _token BYTEA;
> BEGIN
>
> UPDATE
> bayes_token
> SET
> spam_count = greatest_int(spam_count + inspam_count, 0),
> ham_count = greatest_int(ham_count + inham_count , 0),
> atime = greatest_int(atime, inatime)
> WHERE
> id = inuserid
> AND
> (token) IN (SELECT bayes_token_tmp FROM bayes_token_tmp(intokenary));
>
> UPDATE
> bayes_vars
> SET
> token_count = token_count + (
> SELECT
> count(bayes_token_tmp)
> FROM
> bayes_token_tmp(intokenary)
> WHERE
> bayes_token_tmp NOT IN (SELECT token FROM bayes_token)),
> newest_token_age = greatest_int(newest_token_age, inatime),
> oldest_token_age = least_int(oldest_token_age, inatime)
> WHERE
> id = inuserid;
>
> INSERT INTO
> bayes_token
> SELECT
> inuserid,
> bayes_token_tmp,
> inspam_count,
> inham_count,
> inatime
> FROM
> bayes_token_tmp(intokenary)
> WHERE
> (inspam_count > 0 OR inham_count > 0)
> AND
> (bayes_token_tmp) NOT IN (SELECT token FROM bayes_token);
>
> RETURN;
> END;
> ' LANGUAGE 'plpgsql';
>
> CREATE OR REPLACE FUNCTION bayes_token_tmp(intokenary BYTEA[]) RETURNS
> SETOF bytea AS
> '
> BEGIN
> for i in array_lower(intokenary, 1) .. array_upper(intokenary, 1)
> LOOP
> return next intokenary[i];
> END LOOP;
> RETURN;
> end
> '
> language 'plpgsql';
>
> CREATE OR REPLACE FUNCTION greatest_int (integer, integer)
> RETURNS INTEGER
> IMMUTABLE STRICT
> AS 'SELECT CASE WHEN $1 < $2 THEN $2 ELSE $1 END;'
> LANGUAGE SQL;
>
> CREATE OR REPLACE FUNCTION least_int (integer, integer)
> RETURNS INTEGER
> IMMUTABLE STRICT
> AS 'SELECT CASE WHEN $1 < $2 THEN $1 ELSE $2 END;'
> LANGUAGE SQL;
>
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Have you searched our list archives?
>
> http://archives.postgresql.org
>
signature.asc
Description: OpenPGP digital signature
