(A possible performance bug is mentioned at the end of this email; the rest is 
further discussion regarding materialised views.)

I wrote
> Pavan Deolasee wrote
> > On Thu, Mar 20, 2008 at 3:24 AM, Stephen Denne
> > <[EMAIL PROTECTED]> wrote:
> > 
> > >
> > >  Pavan also refers to deferred triggers, which has got me 
> > thinking about another possible solution:
> > >
> > >  Instead of inserting a delta row, that will be updated a 
> > lot of times, create an on commit drop temp table named after 
> > the txid and the grouping value (which is an integer in my 
> > case, perhaps hash it if you're grouping by something that 
> > doesn't easily convert to part of a table name),
> > >  create an after insert initially deferred constraint to 
> > call a function which will move the 'at commit' values of the 
> > rows in the temp table to the summary table.
> > >  The temp table will only have one row inserted, updated 
> > many times, then on commit the trigger is fired once, and the 
> > temp table is dropped.
> > >
> > >  Does anyone think this will or won't work for some reason?
> > 
> > 
> > I think this should work, although you may need to take some 
> > extra steps
> > to manage the summary table. Also, I think a single temp 
> > table per transaction
> > should suffice. The temp table would have one row per "group 
> > by" or "where"
> > condition on which you want to track the count. The 
> > corresponding row will
> > be updated as and when the corresponding count changes. You 
> would need
> > INSERT/DELETE/UPDATE triggers to do that. If there are any 
> > subtransaction
> > aborts, that will be taken care by MVCC.
> 
> Thanks for that. I had gone ahead and tried out the idea, and 
> it was working 'ok'.
> 
> Using one table per transaction has the benefit of less temp 
> tables (but the same number of triggers waiting to run). It 
> also removes the grouping key from the table name.
> 
> I was using a single table per grouping key, with a single 
> updated row in it.
> The benefit was simpler queries, and I could create an ON 
> INSERT trigger that would be triggered only once when the 
> temp table was created, and a 'zero' row was inserted, 
> thereby separating the setup of the trigger from the 
> maintenance of the delta.

One temp table per grouping key would presumably allocate at least one disk 
page per grouping key, which might result in pretty poor performance. On the 
other hand, if the number of updates per grouping key is large, HOT would have 
plenty of room on the page to write new row versions.
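
As a rough check of that per-table overhead (a hedged example; the table name 
temp_delta_12345_1 is made up for illustration), pg_relation_size() reports how 
much space a given temp delta table occupies:

   SELECT pg_relation_size('temp_delta_12345_1') AS bytes,
          pg_relation_size('temp_delta_12345_1') / 8192 AS pages;  -- assuming 8 kB blocks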

Does creation & dropping of temp tables result in system catalog bloat?

> I took a wild guess at a way of finding out whether the temp 
> table already exists:
>       not exists(select tablename from pg_catalog.pg_tables 
> where tablename=temp_delta_txid_group)
> Is there a better/safer way?

Answering my own question:
There is a better way to do what I was doing (though I'm not sure about a 
better way to check for the existence of a temp table)...

A custom variable class can be set up and used to record whether the current 
transaction has already been set up. (Thanks to Andreas Kretschmer for pointing 
these out in another thread on -general.)

Alter this setting within postgresql.conf to add 'mv':
custom_variable_classes = 'mv'

Add this setting to postgresql.conf:
mv.initialized = 'false'

Then set it to true only locally to the transaction.

Usage within a trigger:

   IF NOT (current_setting('mv.initialized')::boolean) THEN
      -- trigger a deferred constraint function:
      INSERT INTO mv_txid_doc_type_summary VALUES (txid_current());
      PERFORM set_config('mv.initialized', 'true', true);
   END IF;
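
For completeness, the deferred part can be wired up roughly like this (a sketch 
only; mv_apply_deltas is a made-up name for a trigger function that folds the 
accumulated deltas into the summary table):

   -- Fires once per transaction, at commit time, because it is initially
   -- deferred and only one row per txid is inserted above.
   CREATE CONSTRAINT TRIGGER mv_txid_doc_type_summary_trigger
      AFTER INSERT ON mv_txid_doc_type_summary
      DEFERRABLE INITIALLY DEFERRED
      FOR EACH ROW
      EXECUTE PROCEDURE mv_apply_deltas();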

The custom variable class can also be used to aggregate the deltas within a 
transaction, though there are some minor difficulties (see the sketch after 
this list):

1) They only store text.
2) You can only find out that a value has not been set by catching an exception.
3) There is no way to list the settings.
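
For example, deltas could be accumulated per grouping key along these lines (a 
minimal sketch only: the function name mv_add_delta and the 'mv.delta_' naming 
are made up for illustration, and I'm assuming the unset-variable error maps to 
the undefined_object condition):

   CREATE OR REPLACE FUNCTION mv_add_delta(grp integer, delta bigint)
   RETURNS void AS $$
   DECLARE
      cur bigint;
   BEGIN
      BEGIN
         -- difficulty 1: the value is stored as text, so cast it back
         cur := current_setting('mv.delta_' || grp)::bigint;
      EXCEPTION WHEN undefined_object THEN
         -- difficulty 2: an unset variable can only be detected via an exception
         cur := 0;
      END;
      -- 'true' makes the new value local to the current transaction
      PERFORM set_config('mv.delta_' || grp, (cur + delta)::text, true);
   END;
   $$ LANGUAGE plpgsql;

Row-level triggers on the base table would call this with +1/-1 (or the amount 
changed), and the deferred function could read the final per-group values with 
current_setting() at commit.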

The main benefit is that changing a variable's setting does not write a new row 
version.
Creating new transactionally scoped variables seems to take around 150 to 200 
bytes of process RAM per variable (depending on the size of the value stored).
The rate at which they can be created for the first time in a connection seems 
to vary a bit, between 200 and 1000 per minute. No IO is being performed, and 
CPU is at 100%.
These statistics were gathered while trying variations of this query:

select count(*)
from (select set_config('mv.test11.' || s1.s1, s2.s2::text, true)
      from (select s1 from generate_series(1,1000) as s1) as s1,
           (select s2 from generate_series(1,1000) as s2) as s2) as s3;

This sets each of 1000 variables to 1000 different values. It can take a few 
minutes, but strangely a second connection starting the same query just after 
the first one can finish in seconds.

Regards,
Stephen Denne.
