Re: [HACKERS] count(*) performance improvement ideas

2008-06-23 Thread Bruce Momjian

Added to TODO:

* Allow custom variables to appear in pg_settings()


---

Tom Lane wrote:
 Magnus Hagander [EMAIL PROTECTED] writes:
  Tom Lane wrote:
  Really?  [ pokes around ... ]  Hm, you're right, because
  add_placeholder_variable() sets the GUC_NO_SHOW_ALL flag, and in this
  usage it'll never be cleared.  I wonder if we should change that.
  
  The whole thing is a bit of an abuse of what the mechanism was
  intended for, and so I'm not sure we should rejigger GUC's behavior
  to make it more pleasant, but on the other hand if we're not ready to
  provide a better substitute ...
 
  While I agree with that part, is there any actual *reason* why we
  shouldn't have the custom variables included in pg_settings?
 
 IIRC, the motivation for doing that was to not expose a completely bogus
 set of attributes for a variable whose defining C-module hadn't been
 loaded yet.
 
 I thought about this in the shower just now, and ISTM that if we want to
 turn this into an actual feature rather than a kluge, there needs to be
 some sort of define variable command that sets up a custom variable
 and specifies its type (and whatever other properties seem worth
 setting).  IOW expose the DefineCustomFooVariable functions to SQL users.
 
 I'd be a bit inclined to restrict the namespace that can be set up that
 way, eg allow only local. or session. as the prefix.  Maybe
 that's just being too anal, but we could guarantee not to introduce
 colliding built-in GUCs in future releases, whereas people trying to
 define variables with any random name would definitely be at risk.
 
 Comments?
 
   regards, tom lane
 
 -- 
 Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-hackers

-- 
  Bruce Momjian  [EMAIL PROTECTED]http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] count(*) performance improvement ideas

2008-04-17 Thread PFC
On Thu, 17 Apr 2008 02:48:37 +0200, Stephen Denne  
[EMAIL PROTECTED] wrote:



PFC wrote:

Let's try this quick  dirty implementation of a local
count-delta cache
using a local in-memory hashtable (ie. {}).



CREATE OR REPLACE FUNCTION update_count( key TEXT, delta INTEGER )
   RETURNS INTEGER
AS $$
 if key in GD:
 GD[key] += delta
 else:
 GD[key] = delta
 return GD[key]
$$ LANGUAGE plpythonu;


Thanks for the code, this seems to be very much what I was looking for.

I don't know plpythonu (nor python), just read a few docs now:


Learn Python, it is a really useful language ;)

The global dictionary SD is available to store data between function  
calls. This variable is private static data. The global dictionary GD is  
public data, available to all Python functions within a session. Use  
with care.


Does session == transaction or connection?
I don't understand the difference between SD and GD, private and public.  
Where are the context boundaries?


There is no sharing between processes, so
	- both SD and GD are limited to the current session (connection, postgres  
process), no shared memory is involved

- GD is global between all python functions (global)
- SD is specific to each python function (static)

	The big gotcha is that these are all non-transactional : if you rollback,  
GD and SD stay the same, and when you issue a query, you can assume the  
state of SD and GD is random (due to previous queries) unless you  
initialize them to a known value.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] count(*) performance improvement ideas

2008-04-17 Thread Tom Lane
Joe Conway [EMAIL PROTECTED] writes:
 Tom Lane wrote:
 I thought about this in the shower just now, and ISTM that if we want to
 turn this into an actual feature rather than a kluge, there needs to be
 some sort of define variable command that sets up a custom variable
 and specifies its type (and whatever other properties seem worth
 setting).  IOW expose the DefineCustomFooVariable functions to SQL users.
 
 I'd be a bit inclined to restrict the namespace that can be set up that
 way, eg allow only local. or session. as the prefix.  Maybe
 that's just being too anal, but we could guarantee not to introduce
 colliding built-in GUCs in future releases, whereas people trying to
 define variables with any random name would definitely be at risk.

 Would it make sense to have built-in GUCs belong to pg_catalog. and 
 user defined GUCs default to public.?

[ after a bit of reflection... ]  I don't think that we want to tie
GUC names to schemas, especially not schemas that might not be there
(remember public is droppable).  The existing scheme for qualified
GUC names considers that the prefix is the name of a loadable module,
which isn't typically tied to any particular schema.

I kinda like session as the prefix since it helps remind people that
these things will have session lifespan.

OTOH, there's a possibility for confusion with the SET SESSION syntax
(hm, did you mean SET SESSION foo = ... or SET session.foo = ...?).
local has got the same issue.  Maybe temp?

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] count(*) performance improvement ideas

2008-04-17 Thread Stephen Denne
PFC wrote
 [EMAIL PROTECTED] wrote:
  I don't know plpythonu (nor python), just read a few docs now:
 
   Learn Python, it is a really useful language ;)

My wife has a snake phobia, besides, I've just started learning Scala.


   There is no sharing between processes, so
   - both SD and GD are limited to the current session 
 (connection, postgres  
 process), no shared memory is involved
   - GD is global between all python functions (global)
   - SD is specific to each python function (static)

Thanks.

   The big gotcha is that these are all non-transactional 
 : if you rollback,  
 GD and SD stay the same, and when you issue a query, you can 
 assume the  
 state of SD and GD is random (due to previous queries) unless you  
 initialize them to a known value.

Using txid_current() as a key should alleviate that.

Regards,
Stephen Denne.

Disclaimer:
At the Datamail Group we value team commitment, respect, achievement, customer 
focus, and courage. This email with any attachments is confidential and may be 
subject to legal privilege.  If it is not intended for you please advise by 
reply immediately, destroy it and do not copy, disclose or use it in any way.
__
  This email has been scanned by the DMZGlobal Business Quality
  Electronic Messaging Suite.
Please see http://www.dmzglobal.com/dmzmessaging.htm for details.
__


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] count(*) performance improvement ideas

2008-04-17 Thread Stephen Denne
  The big gotcha is that these are all non-transactional 
  : if you rollback,  
  GD and SD stay the same, and when you issue a query, you can 
  assume the  
  state of SD and GD is random (due to previous queries) unless you  
  initialize them to a known value.
 
 Using txid_current() as a key should alleviate that.

No... hold on, it is per session, and a session can't have two or more 
transactions active at once can it?

(Though I hear of things called sub-transactions)

So the problem is that other functions may be using GD themselves, and your own 
code is at the mercy of the other functions. Conversely you shouldn't clear GD, 
as some other function may be using it.

So you're better off using a single function for everything, and using SD 
within it?

There isn't any way of telling whether the function is being called for the 
first time in a transaction. You don't know when to clear it.

Regards,
Stephen Denne.

Disclaimer:
At the Datamail Group we value team commitment, respect, achievement, customer 
focus, and courage. This email with any attachments is confidential and may be 
subject to legal privilege.  If it is not intended for you please advise by 
reply immediately, destroy it and do not copy, disclose or use it in any way.
__
  This email has been scanned by the DMZGlobal Business Quality
  Electronic Messaging Suite.
Please see http://www.dmzglobal.com/dmzmessaging.htm for details.
__


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] count(*) performance improvement ideas

2008-04-17 Thread PFC



My wife has a snake phobia, besides, I've just started learning Scala.


	Just had a look at Scala, it looks nice. Slightly Lispish (like all good  
languages)...



txid_current()
No... hold on, it is per session, and a session can't have two or more  
transactions active at once can it?


It could be used to detect rollback.

So the problem is that other functions may be using GD themselves, and  
your own code is at the mercy of the other functions. Conversely you  
shouldn't clear GD, as some other function may be using it.


Exactly.

So you're better off using a single function for everything, and using  
SD within it?


	Since the purpose is to store counts for rows matching a certain criteria  
in a set of tables, you could build a hashtable of hashtables, like :


GD[table name][criteria name][criteria value] = count

	This would add complexity, about half a line of code. But you'd have to  
create lots of plpgsql trigger functions to wrap it.


There isn't any way of telling whether the function is being called for  
the first time in a transaction. You don't know when to clear it.


The first time in a session, GD will be empty.
	Clearing it at the start of a transaction would not be useful (clearing  
it at ROLLBACK would).
	It is updating the real summary table with the contents of this hash  
that is the problem, also.


	So, basically, if you connect, do one insert, and disconnect, this would  
be useless.
	But, if you do a zillion inserts, caching the counts deltas in RAM would  
be faster.
	And if you use persistent connections, you could update the counts in the  
real table only every N minutes, for instance, but this would need some  
complicity from the backend.



Regards,
Stephen Denne.

Disclaimer:
At the Datamail Group we value team commitment, respect, achievement,  
customer focus, and courage. This email with any attachments is  
confidential and may be subject to legal privilege.  If it is not  
intended for you please advise by reply immediately, destroy it and do  
not copy, disclose or use it in any way.

__
  This email has been scanned by the DMZGlobal Business Quality
  Electronic Messaging Suite.
Please see http://www.dmzglobal.com/dmzmessaging.htm for details.
__






--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] count(*) performance improvement ideas

2008-04-16 Thread Magnus Hagander
Tom Lane wrote:
 Stephen Denne [EMAIL PROTECTED] writes:
  From: Tom Lane [mailto:[EMAIL PROTECTED]
  As for 2) and 3), can't you look into the pg_settings view?
 
  pg_settings view doesn't contain custom variables created on the
  fly,
 
 Really?  [ pokes around ... ]  Hm, you're right, because
 add_placeholder_variable() sets the GUC_NO_SHOW_ALL flag, and in this
 usage it'll never be cleared.  I wonder if we should change that.
 
 The whole thing is a bit of an abuse of what the mechanism was
 intended for, and so I'm not sure we should rejigger GUC's behavior
 to make it more pleasant, but on the other hand if we're not ready to
 provide a better substitute ...

While I agree with that part, is there any actual *reason* why we
shouldn't have the custom variables included in pg_settings? ISTM that
it would be usable in cases that aren't an abuse as well...

//Magnus

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] count(*) performance improvement ideas

2008-04-16 Thread Tom Lane
Magnus Hagander [EMAIL PROTECTED] writes:
 Tom Lane wrote:
 Really?  [ pokes around ... ]  Hm, you're right, because
 add_placeholder_variable() sets the GUC_NO_SHOW_ALL flag, and in this
 usage it'll never be cleared.  I wonder if we should change that.
 
 The whole thing is a bit of an abuse of what the mechanism was
 intended for, and so I'm not sure we should rejigger GUC's behavior
 to make it more pleasant, but on the other hand if we're not ready to
 provide a better substitute ...

 While I agree with that part, is there any actual *reason* why we
 shouldn't have the custom variables included in pg_settings?

IIRC, the motivation for doing that was to not expose a completely bogus
set of attributes for a variable whose defining C-module hadn't been
loaded yet.

I thought about this in the shower just now, and ISTM that if we want to
turn this into an actual feature rather than a kluge, there needs to be
some sort of define variable command that sets up a custom variable
and specifies its type (and whatever other properties seem worth
setting).  IOW expose the DefineCustomFooVariable functions to SQL users.

I'd be a bit inclined to restrict the namespace that can be set up that
way, eg allow only local. or session. as the prefix.  Maybe
that's just being too anal, but we could guarantee not to introduce
colliding built-in GUCs in future releases, whereas people trying to
define variables with any random name would definitely be at risk.

Comments?

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] count(*) performance improvement ideas

2008-04-16 Thread Joe Conway

Tom Lane wrote:

Magnus Hagander [EMAIL PROTECTED] writes:

Tom Lane wrote:

Really?  [ pokes around ... ]  Hm, you're right, because
add_placeholder_variable() sets the GUC_NO_SHOW_ALL flag, and in this
usage it'll never be cleared.  I wonder if we should change that.

The whole thing is a bit of an abuse of what the mechanism was
intended for, and so I'm not sure we should rejigger GUC's behavior
to make it more pleasant, but on the other hand if we're not ready to
provide a better substitute ...



While I agree with that part, is there any actual *reason* why we
shouldn't have the custom variables included in pg_settings?


I've needed it myself before -- I think it is a good idea.


IIRC, the motivation for doing that was to not expose a completely bogus
set of attributes for a variable whose defining C-module hadn't been
loaded yet.

I thought about this in the shower just now, and ISTM that if we want to
turn this into an actual feature rather than a kluge, there needs to be
some sort of define variable command that sets up a custom variable
and specifies its type (and whatever other properties seem worth
setting).  IOW expose the DefineCustomFooVariable functions to SQL users.

I'd be a bit inclined to restrict the namespace that can be set up that
way, eg allow only local. or session. as the prefix.  Maybe
that's just being too anal, but we could guarantee not to introduce
colliding built-in GUCs in future releases, whereas people trying to
define variables with any random name would definitely be at risk.

Comments?


Would it make sense to have built-in GUCs belong to pg_catalog. and 
user defined GUCs default to public.?


Joe

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] count(*) performance improvement ideas

2008-04-16 Thread Stephen Denne
Tom Lane wrote
 Stephen Denne [EMAIL PROTECTED] writes:
  From: Tom Lane [mailto:[EMAIL PROTECTED]
  As for 2) and 3), can't you look into the pg_settings view?
 
  pg_settings view doesn't contain custom variables created 
 on the fly,
 
 Really?  [ pokes around ... ]  Hm, you're right, because
 add_placeholder_variable() sets the GUC_NO_SHOW_ALL flag, and in this
 usage it'll never be cleared.  I wonder if we should change that.
 
 The whole thing is a bit of an abuse of what the mechanism 
 was intended
 for, and so I'm not sure we should rejigger GUC's behavior to make it
 more pleasant, but on the other hand if we're not ready to provide a
 better substitute ...

In my experiments with materialized views, I identified these problems as 
minor difficulties. Resolving them would allow further abuse ;)

Aside: It is currently more cumbersome to get a function to run, if needed, at 
commit. Ideal solution would be something like EXECUTE ON COMMIT 
my_function() or maybe SAVEPOINT my_name ON COMMIT my_function(), but these 
suggestions are made without investigating what provision the SQL standard has 
made to address this need.

My use of mv.initialized means I can create variables when initializing a 
transaction, and afterwards know that they have values, but what I can't easily 
do is use those variables to identify which grouping keys have been updated. To 
do that I select  conditionally insert to a table for that explicit purpose. 
If select doesn't find the key, then I create variables named after that key, 
with zero values.

Performance and efficiency-wise which would be better way of keeping track 
of grouping keys used in a transaction?:
1) Create a temp table, on commit drop, for the transaction, storing grouping 
keys affected.
2) Use a persistent table, storing txid and grouping keys affected, deleting 
txid rows at commit.
3) Use pg_settings, storing tx local grouping keys affected, existence check 
via catching an exception, listing via checking existence for all possible 
values (a possibility in my scenario).

Speed is my priority, low disk IO is a probable means to that end, which is why 
I investigated using variables.

Basically, (3) isn't a viable option, so what are the trade-offs between 
creating a temporary table per transaction, or using rows in a permanent table 
with a txid column?

Here are some more plpgsql code fragments:

   mv := 'mv.' || view_name || '.' || key_value || '.';

When recording a grouping key as being affected by the transaction, create the 
variables with zeroes:

   PERFORM set_config(mv||'documents', '0', true);
   PERFORM set_config(mv||'last_addition', 'null', true);

In an insert trigger:

   PERFORM set_config(mv||'documents', 
(current_setting(mv||'documents')::bigint + 1)::text, true);
   PERFORM set_config(mv||'last_addition', now()::text, true);

In the defferred till commit trigger:

  UPDATE materialized_view set 
 documents=documents+current_setting(mv||'documents')::bigint, 
 
last_addition=greatest(last_addition,nullif(current_setting(mv||'last_addition'),'null')::timestamp)
 where 
 group_id = key_values.key_value;


Regards,
Stephen Denne.

Disclaimer:
At the Datamail Group we value team commitment, respect, achievement, customer 
focus, and courage. This email with any attachments is confidential and may be 
subject to legal privilege.  If it is not intended for you please advise by 
reply immediately, destroy it and do not copy, disclose or use it in any way.
__
  This email has been scanned by the DMZGlobal Business Quality
  Electronic Messaging Suite.
Please see http://www.dmzglobal.com/dmzmessaging.htm for details.
__



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] count(*) performance improvement ideas

2008-04-16 Thread Tom Lane
Stephen Denne [EMAIL PROTECTED] writes:
 Aside: It is currently more cumbersome to get a function to run, if needed, 
 at commit. Ideal solution would be something like EXECUTE ON COMMIT 
 my_function() or maybe SAVEPOINT my_name ON COMMIT my_function(), but 
 these suggestions are made without investigating what provision the SQL 
 standard has made to address this need.

There is none, and the reason seems pretty obvious to me.  What if your
on commit function fails?  Or if you have two, and the second one
fails?  Or even more to the point, the second one does something that
the first one expected to see the effects of?

Transaction commit is an exceedingly subtle and carefully structured
thing.  Throwing random user-defined code into it ain't gonna happen.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] count(*) performance improvement ideas

2008-04-16 Thread PFC



The whole thing is a bit of an abuse of what the mechanism
was intended
for, and so I'm not sure we should rejigger GUC's behavior to make it
more pleasant, but on the other hand if we're not ready to provide a
better substitute ...


In my experiments with materialized views, I identified these problems  
as minor difficulties. Resolving them would allow further abuse ;)


Let's try this quick  dirty implementation of a local count-delta cache  
using a local in-memory hashtable (ie. {}).
Writing the results to stable storage in an ON COMMIT trigger is left as  
an exercise to the reader ;)

Performance isn't that bad, calling the trigger takes about 50 us.
Oldskool implementation with a table is at the end, it's about 10x slower.

Example :

INSERT INTO victim1 (key) VALUES ('one'),('two'),('two');
INSERT 0 3
Temps : 1,320 ms
test=# SELECT * FROM get_count();
 key | cnt
-+-
 two |   2
 one |   1


CREATE OR REPLACE FUNCTION clear_count(  )
  RETURNS VOID
AS $$
GD.clear()
$$ LANGUAGE plpythonu;

CREATE OR REPLACE FUNCTION update_count( key TEXT, delta INTEGER )
  RETURNS INTEGER
AS $$
if key in GD:
GD[key] += delta
else:
GD[key] = delta
return GD[key]
$$ LANGUAGE plpythonu;

CREATE TYPE count_data AS ( key TEXT, cnt INTEGER );

CREATE OR REPLACE FUNCTION get_count( )
RETURNS SETOF count_data
AS $$
return GD.iteritems()
$$ LANGUAGE plpythonu;


CREATE TABLE victim( id SERIAL PRIMARY KEY, key TEXT NOT NULL );
INSERT INTO victim (key) SELECT (random() * 300)::INTEGER::TEXT FROM  
generate_series( 1,10 );


CREATE TABLE victim1( id SERIAL PRIMARY KEY, key TEXT NOT NULL );

\timing
INSERT INTO victim1 SELECT * FROM victim;
TRUNCATE TABLE victim1;

SELECT clear_count();
INSERT INTO victim1 SELECT * FROM victim RETURNING update_count( key, 1 );
SELECT * FROM get_count();
TRUNCATE TABLE victim1;

CREATE OR REPLACE FUNCTION counter_trigger_f()
RETURNS TRIGGER
LANGUAGE plpgsql
AS
$$
DECLARE
BEGIN
IF TG_OP = 'INSERT' THEN
PERFORM update_count( NEW.key, 1 );
RETURN NEW;
ELSEIF TG_OP = 'UPDATE' THEN
-- update topic
IF NEW.key != OLD.key THEN
PERFORM update_count( OLD.key, -1 ), update_count( NEW.key, 1  
);

END IF;
RETURN NEW;
ELSE-- DELETE
PERFORM update_count( OLD.key, -1 );
RETURN OLD;
END IF;
END;
$$;

CREATE TRIGGER count_trigger BEFORE INSERT OR UPDATE OR DELETE ON victim1  
FOR EACH ROW EXECUTE PROCEDURE counter_trigger_f();


SELECT clear_count();
INSERT INTO victim1 SELECT * FROM victim;
SELECT * FROM get_count();

SELECT clear_count();
TRUNCATE TABLE victim1;
INSERT INTO victim1 (key) VALUES ('one'),('two'),('two');
SELECT * FROM get_count();
DELETE FROM victim1 WHERE key='two';
SELECT * FROM get_count();
UPDATE victim1 SET key='three' WHERE key='one';
SELECT * FROM get_count();
DELETE FROM victim1;
SELECT * FROM get_count();


CREATE TABLE counts( key TEXT PRIMARY KEY, total INTEGER NOT NULL DEFAULT  
0 );


CREATE OR REPLACE FUNCTION table_counter_trigger_f()
RETURNS TRIGGER
LANGUAGE plpgsql
AS
$$
DECLARE
BEGIN
IF TG_OP = 'INSERT' THEN
UPDATE counts SET total=total+1 WHERE key=NEW.key;
IF NOT FOUND THEN INSERT INTO counts (key,total) VALUES  
(NEW.key,1); END IF;

RETURN NEW;
ELSEIF TG_OP = 'UPDATE' THEN
-- update topic
IF NEW.key != OLD.key THEN
UPDATE counts SET total=total-1 WHERE key=OLD.key;
UPDATE counts SET total=total+1 WHERE key=NEW.key;
IF NOT FOUND THEN INSERT INTO counts (key,total) VALUES  
(NEW.key,1); END IF;

END IF;
RETURN NEW;
ELSE-- DELETE
UPDATE counts SET total=total-1 WHERE key=OLD.key;
RETURN OLD;
END IF;
END;
$$;

CREATE TABLE victim2( id SERIAL PRIMARY KEY, key TEXT NOT NULL );

CREATE TRIGGER table_count_trigger BEFORE INSERT OR UPDATE OR DELETE ON  
victim2 FOR EACH ROW EXECUTE PROCEDURE table_counter_trigger_f();

SELECT * FROM counts;
TRUNCATE TABLE victim2;
INSERT INTO victim2 SELECT * FROM victim;



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] count(*) performance improvement ideas

2008-04-16 Thread Stephen Denne
Tom Lane wrote
 Transaction commit is an exceedingly subtle and carefully structured
 thing.  Throwing random user-defined code into it ain't gonna happen.

Deferred constraint triggers currently run random user-defined code. This'll do 
me.

Regards,
Stephen Denne.

Disclaimer:
At the Datamail Group we value team commitment, respect, achievement, customer 
focus, and courage. This email with any attachments is confidential and may be 
subject to legal privilege.  If it is not intended for you please advise by 
reply immediately, destroy it and do not copy, disclose or use it in any way.
__
  This email has been scanned by the DMZGlobal Business Quality
  Electronic Messaging Suite.
Please see http://www.dmzglobal.com/dmzmessaging.htm for details.
__



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] count(*) performance improvement ideas

2008-04-16 Thread Stephen Denne
PFC wrote:
 Let's try this quick  dirty implementation of a local 
 count-delta cache  
 using a local in-memory hashtable (ie. {}).

 CREATE OR REPLACE FUNCTION update_count( key TEXT, delta INTEGER )
RETURNS INTEGER
 AS $$
  if key in GD:
  GD[key] += delta
  else:
  GD[key] = delta
  return GD[key]
 $$ LANGUAGE plpythonu;

Thanks for the code, this seems to be very much what I was looking for.

I don't know plpythonu (nor python), just read a few docs now:
The global dictionary SD is available to store data between function calls. 
This variable is private static data. The global dictionary GD is public data, 
available to all Python functions within a session. Use with care.

Does session == transaction or connection?
I don't understand the difference between SD and GD, private and public. Where 
are the context boundaries?

Regards,
Stephen Denne.

Disclaimer:
At the Datamail Group we value team commitment, respect, achievement, customer 
focus, and courage. This email with any attachments is confidential and may be 
subject to legal privilege.  If it is not intended for you please advise by 
reply immediately, destroy it and do not copy, disclose or use it in any way.
__
  This email has been scanned by the DMZGlobal Business Quality
  Electronic Messaging Suite.
Please see http://www.dmzglobal.com/dmzmessaging.htm for details.
__


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] count(*) performance improvement ideas

2008-04-15 Thread Stephen Denne
(There is a possible performance bug mentioned at the end of the email, the 
rest is further discussion regarding materialised views)

I wrote
 Pavan Deolasee wrote
  On Thu, Mar 20, 2008 at 3:24 AM, Stephen Denne
  [EMAIL PROTECTED] wrote:
  
  
Pavan also refers to deferred triggers, which has got me 
  thinking about another possible solution:
  
Instead of inserting a delta row, that will be updated a 
  lot of times, create an on commit drop temp table named after 
  the txid and the grouping value (which is an integer in my 
  case, perhaps hash it if you're grouping by something that 
  doesn't easily convert to part of a table name),
create an after insert initially deferred constraint to 
  call a function which will move the 'at commit' values of the 
  rows in the temp table to the summary table.
The temp table will only have one row inserted, updated 
  many times, then on commit the trigger is fired once, and the 
  temp table is dropped.
  
Does anyone think this will or won't work for some reason?
  
  
  I think this should work, although you may need to take some 
  extra steps
  to manage the summary table. Also, I think a single temp 
  table per transaction
  should suffice. The temp table would have one row per group 
  by or where
  condition on which you want to track the count. The 
  corresponding row will
  be updated as and when the corresponding count changes. You 
 would need
  INSERT/DELETE/UPDATE triggers to do that. If there are any 
  subtransaction
  aborts, that will be taken care by MVCC.
 
 Thanks for that. I had gone ahead and tried out the idea, and 
 it was working 'ok'.
 
 Using one table per transaction has the benefit of less temp 
 tables (but the same number of triggers waiting to run). It 
 also removes the grouping key from the table name.
 
 I was using a single table per grouping key, with a single 
 updated row in it.
 The benefit was simpler queries, and I could create an ON 
 INSERT trigger that would be triggered only once when the 
 temp table was created, and a 'zero' row was inserted, 
 thereby separating the setup of the trigger from the 
 maintenance of the delta.

One temp table per grouping key would presumably allocate at least one disk 
page per grouping key.
This might result in pretty poor performance. Though if the number of updates 
per grouping key is large, HOT would have plenty of room on the page to write 
new row versions.

Does creation  dropping of temp tables result in system catalog bloat?

 I took a wild guess at a way of finding out whether the temp 
 table already exists:
   not exists(select tablename from pg_catalog.pg_tables 
 where tablename=temp_delta_txid_group)
 Is there a better/safer way?

Answering my own question:
There is a better way to do what I was doing, (not sure about a better way to 
check existence of a temp table though)...

A custom variable class can be set up and used to record whether the 
transaction in question has been set up. (Thanks to Andreas Kretschmer for 
pointing those out in another thread on -general)

Alter this setting within postgresql.conf to add 'mv':
custom_variable_classes = 'mv'

Add this setting to postgresql.conf:
mv.initialized = 'false'

Then only set it to true local to the transaction.

Usage within a trigger:

   IF NOT (current_setting('mv.initialized')::boolean) THEN
  -- trigger a deferred constraint function:
  INSERT INTO mv_txid_doc_type_summary VALUES (txid_current());
  PERFORM set_config('mv.initialized', 'true', true);
   END IF;

The custom variable class can also be used to aggregate the deltas within a 
transaction, though there are some minor difficulties:

1) They only store text
2) You can only find out that a value has not been set by catching an exception
3) There is no way to list the settings.

The main benefit is that changing a variable's setting does not write a new row 
version.
Creating new transactionally scoped variables seems to take around 150 to 200 
bytes of the process ram per variable, (depending on the size of the value 
stored).
The time to create them for the first time for a connection seems to vary a 
bit, between 200 and 1000 per minute. No IO is being performed, CPU is at 100%
These statistics gathered when trying variations of this query:

select count(*) from (select set_config('mv.test11.' || s1.s1, s2.s2::text, 
true) from (select s1 from generate_series(1,1000) as s1) as s1, (select s2 
from generate_series(1,1000) as s2) as s2) as s3;

Which sets each one of 1000 variables to 1000 different values. This can take a 
few minutes, but strangely a second connection starting the same query just 
after the first one, can finish in seconds.

Regards,
Stephen Denne.

Disclaimer:
At the Datamail Group we value team commitment, respect, achievement, customer 
focus, and courage. This email with any attachments is confidential and may be 
subject to legal privilege.  If it is not intended for you please 

Re: [HACKERS] count(*) performance improvement ideas

2008-04-15 Thread Tom Lane
Stephen Denne [EMAIL PROTECTED] writes:
 The custom variable class can also be used to aggregate the deltas within a 
 transaction, though there are some minor difficulties:

 1) They only store text
 2) You can only find out that a value has not been set by catching an 
 exception
 3) There is no way to list the settings.

As for 2) and 3), can't you look into the pg_settings view?

 The time to create them for the first time for a connection seems to
vary a bit, between 200 and 1000 per minute.

Yeah, that path is not optimized at all because it was never considered
performance-critical.  Updating an existing variable should be cheaper.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] count(*) performance improvement ideas

2008-04-15 Thread Stephen Denne
(apologies for the email format  top posting, I've had to temporarily switch 
to using a web interface from home that doesn't seem to know what plain text is)
 
pg_settings view doesn't contain custom variables created on the fly, (nor, 
from memory, ones defined in postgresql.conf. I'm not able to check and confirm 
that at the moment).
Fixing that would satisfy 2  3 nicely.
The docs on them say the ones in postgresql.conf are to be used by modules when 
they initialize, as the values to use when setting up actual real server 
variables, (able to store more types than just text), which I presume would 
appear in pg_settings.
 
Updating existing variables is much quicker, over 10 updates per second.
 
Regards,
Stephen Denne.



From: Tom Lane [mailto:[EMAIL PROTECTED]
Sent: Wed 16/04/2008 1:55 p.m.

Stephen Denne [EMAIL PROTECTED] writes:
 The custom variable class can also be used to aggregate the deltas within a 
 transaction, though there are some minor difficulties:

 1) They only store text
 2) You can only find out that a value has not been set by catching an 
 exception
 3) There is no way to list the settings.

As for 2) and 3), can't you look into the pg_settings view?

 The time to create them for the first time for a connection seems to
vary a bit, between 200 and 1000 per minute.

Yeah, that path is not optimized at all because it was never considered
performance-critical.  Updating an existing variable should be cheaper.

regards, tom lane


Disclaimer:
At the Datamail Group we value team commitment, respect, achievement, customer 
focus, and courage. This email with any attachments is confidential and may be 
subject to legal privilege.  If it is not intended for you please advise by 
reply immediately, destroy it and do not copy, disclose or use it in any way.
__
  This email has been scanned by the DMZGlobal Business Quality
  Electronic Messaging Suite.
Please see http://www.dmzglobal.com/dmzmessaging.htm for details.
__



Re: [HACKERS] count(*) performance improvement ideas

2008-04-15 Thread Tom Lane
Stephen Denne [EMAIL PROTECTED] writes:
 From: Tom Lane [mailto:[EMAIL PROTECTED]
 As for 2) and 3), can't you look into the pg_settings view?

 pg_settings view doesn't contain custom variables created on the fly,

Really?  [ pokes around ... ]  Hm, you're right, because
add_placeholder_variable() sets the GUC_NO_SHOW_ALL flag, and in this
usage it'll never be cleared.  I wonder if we should change that.

The whole thing is a bit of an abuse of what the mechanism was intended
for, and so I'm not sure we should rejigger GUC's behavior to make it
more pleasant, but on the other hand if we're not ready to provide a
better substitute ...

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] count(*) performance improvement ideas

2008-03-24 Thread Stephen Denne
Pavan Deolasee wrote
 On Thu, Mar 20, 2008 at 3:24 AM, Stephen Denne
 [EMAIL PROTECTED] wrote:
 
 
   Pavan also refers to deferred triggers, which has got me 
 thinking about another possible solution:
 
   Instead of inserting a delta row, that will be updated a 
 lot of times, create an on commit drop temp table named after 
 the txid and the grouping value (which is an integer in my 
 case, perhaps hash it if you're grouping by something that 
 doesn't easily convert to part of a table name),
   create an after insert initially deferred constraint to 
 call a function which will move the 'at commit' values of the 
 rows in the temp table to the summary table.
   The temp table will only have one row inserted, updated 
 many times, then on commit the trigger is fired once, and the 
 temp table is dropped.
 
   Does anyone think this will or won't work for some reason?
 
 
 I think this should work, although you may need to take some 
 extra steps
 to manage the summary table. Also, I think a single temp 
 table per transaction
 should suffice. The temp table would have one row per group 
 by or where
 condition on which you want to track the count. The 
 corresponding row will
 be updated as and when the corresponding count changes. You would need
 INSERT/DELETE/UPDATE triggers to do that. If there are any 
 subtransaction
 aborts, that will be taken care by MVCC.

Thanks for that. I had gone ahead and tried out the idea, and it was working 
'ok'.

Using one table per transaction has the benefit of less temp tables (but the 
same number of triggers waiting to run). It also removes the grouping key from 
the table name.

I was using a single table per grouping key, with a single updated row in it.
The benefit was simpler queries, and I could create an ON INSERT trigger that 
would be triggered only once when the temp table was created, and a 'zero' row 
was inserted, thereby separating the setup of the trigger from the maintenance 
of the delta.

I haven't explored the transactional implications of updating vs inserting 
delta rows in the summary table at the time of transaction commit. The code 
below updates the summary table, which I think could lead to a large delay or 
deadlocks if there are other styles of updates on that table (other than 
on-commit triggers)?

I also hadn't considered sub-transactions.

Below is a cut-down version of what I tried out.

I was summarizing more than just the number of documents shown below, I was 
storing a sum, and two maximums of timestamps (using the 'greatest' function 
for aggregating each record). These were extra fields in both the summary table 
and the temp tables.
This is able to be made more generic by changing get_temp_table_name() to take 
an additional couple of parameters specifying the name of the function to run 
at commit, and an identifying key (eg 'doc_count' vs 'doc_size_sum'), or 
perhaps including the delta value too, (which looks like it would simplify the 
triggers on the tables whose changes we wish to summarize, except that it 
doesn't cater for storing greatest or least aggregates.)

I took a wild guess at a way of finding out whether the temp table already 
exists:
not exists(select tablename from pg_catalog.pg_tables where 
tablename=temp_delta_txid_group)
Is there a better/safer way?

Here's part of the code I've got at the moment (edited here to cut it down to 
the core example, so it may contain errors):

CREATE TABLE doc_type_summary (
  document_type_id integer NOT NULL,
  documents bigint NOT NULL DEFAULT 0,
  CONSTRAINT pk_doc_type_summary PRIMARY KEY (document_type_id)
);

CREATE OR REPLACE FUNCTION process_delta() RETURNS TRIGGER AS
$$
   BEGIN
  EXECUTE 'UPDATE doc_type_summary set 
documents=doc_type_summary.documents+d.documents from ' || TG_ARGV[0] || ' as d 
where document_type_id = ' || TG_ARGV[1];
  RETURN NULL;
   END;
$$
LANGUAGE 'plpgsql';

CREATE OR REPLACE FUNCTION get_temp_table_name(document_type_id integer)
RETURNS text AS
$$
DECLARE
   temp_delta_txid_group text;
BEGIN
   temp_delta_txid_group := 'temp_delta_' || txid_current() || '_' || 
document_type_id;
   IF not exists(select tablename from pg_catalog.pg_tables where 
tablename=temp_delta_txid_group) THEN
  EXECUTE 'CREATE TEMP TABLE ' || temp_delta_txid_group || '(documents 
bigint NOT NULL DEFAULT 0) ON COMMIT DROP';
  EXECUTE 'CREATE CONSTRAINT TRIGGER ' || temp_delta_txid_group || '_trig 
AFTER INSERT ON ' || temp_delta_txid_group || ' DEFERRABLE INITIALLY DEFERRED 
FOR EACH ROW EXECUTE PROCEDURE process_delta (' || temp_delta_txid_group || 
', ' || document_type_id || ')';
  EXECUTE 'INSERT INTO ' || temp_delta_txid_group || ' DEFAULT VALUES';
   END IF;
   RETURN temp_delta_txid_group;
END;
$$
LANGUAGE 'plpgsql';

CREATE OR REPLACE FUNCTION document_count_trig()
RETURNS TRIGGER AS
$$
DECLARE
   temp_delta_txid_group text;
BEGIN
   IF TG_OP = 'INSERT' THEN
  temp_delta_txid_group := 

Re: [HACKERS] count(*) performance improvement ideas

2008-03-20 Thread Pavan Deolasee
On Thu, Mar 20, 2008 at 3:24 AM, Stephen Denne
[EMAIL PROTECTED] wrote:


  Pavan also refers to deferred triggers, which has got me thinking about 
 another possible solution:

  Instead of inserting a delta row, that will be updated a lot of times, 
 create an on commit drop temp table named after the txid and the grouping 
 value (which is an integer in my case, perhaps hash it if you're grouping by 
 something that doesn't easily convert to part of a table name),
  create an after insert initially deferred constraint to call a function 
 which will move the 'at commit' values of the rows in the temp table to the 
 summary table.
  The temp table will only have one row inserted, updated many times, then on 
 commit the trigger is fired once, and the temp table is dropped.

  Does anyone think this will or won't work for some reason?


I think this should work, although you may need to take some extra steps
to manage the summary table. Also, I think a single temp table per transaction
should suffice. The temp table would have one row per group by or where
condition on which you want to track the count. The corresponding row will
be updated as and when the corresponding count changes. You would need
INSERT/DELETE/UPDATE triggers to do that. If there are any subtransaction
aborts, that will be taken care by MVCC.

As you said, a single deferred trigger would then merge the temp table with
the summary table. Here we need to be extra careful because serializable
transactions may fail to update the same row in the summary table. One solution
I can think of is (and I haven't checked the archives, so somebody might have
already suggested this before):

Each summary table will have one summary row per group by or where condition
(same as temp table). In addition to that, it can have zero or more temporary
rows for the conditions.

   - select summary_rows from summary table for update nowait;
   - if lock is not available, insert our deltas into the summary table as a new
 row. These rows will be deleted as soon as some other transaction gets lock
 on the summary rows and merge our deltas with them
   - if lock is available
  - merge our deltas with the summary rows
  - check for other temporary deltas and merge them with the summary rows
 and delete those temporary rows

I guess we can write the generic triggers as contrib module. What needs to done
is to let user specify the tables and the conditions on which they want to track
count(*) and then apply those conditions in the generic triggers.


Thanks,
Pavan
-- 
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] count(*) performance improvement ideas

2008-03-19 Thread Stephen Denne
Mark Mielke wrote
 This returns to the question of whether count of the whole table is useful, 
 or whether 
 count of a GROUP BY or WHERE is useful.
 If GROUP BY or WHERE is useful, then trigger on UPDATE becomes necessary.

True... for the example I gave I should have had an update trigger on my table.
I neglected it as I based the example of a more complex multi-table example 
from my application, where I know that the value I'm grouping by doesn't change.

 What is the direction here? Is it count of the whole table only? (-- not 
 interesting to me)
 Or count of more practical real life examples, which I completely agree with 
 Greg, 
 that this gets into the materialized view realm, and becomes very interesting.
 In my current db project, I never count all of the rows in a table. However, 
 I do use count(*) with GROUP BY and WHERE.

I'm trying to figure out how close I can come to a useful efficient 
materialized view with current production builds of postgresql,
and identifying areas where changes to postgresql could make it easier.

Currently I can see three areas of concern:

1) Turning it on
My query to initially populate the materialized view table takes 6 hours on my 
(limited hardware) development system, which highlights the problem of when do 
you turn on the triggers.
An outage is one way to guarantee that there are neither missing details nor 
double counted details.
Would turning on the triggers and then running my initial population query in 
the same transaction work?

2) Table bloat
I'm trying to group together a large number of +1 or -1 deltas into a single 
delta per transaction.
This creates as many dead rows as there are updates.
This is also a problem with Pavan's suggestion of maintaining a counter table.
They can all be HOT updates in 8.3, but they still all create new row versions.

Tom says Collapsing a transaction's changes
into a single row would require keeping some intra-transaction state,
which is do-able at the C-code level but not from plpgsql.

So can I hope that this problem is solvable as a contrib module that will work 
with at least 8.3?
I'd still want to write plpgsql trigger function myself, as I know the 
aggregation rules, and call contributed functions to integrate with the 
collapsing of the transaction's changes into a single row. (Expect I'd need a 
small number of rows per transaction, as I need to create delta rows for each 
value of my grouping field involved in the transaction.

3) How to incorporate the deltas.
With my technique, if the transaction rolls back the delta record becomes a 
dead row, if it commits, the delta is then visible to transaction started after 
this time.
I need to have a regular process run to sum and remove the deltas, rewriting 
the summary rows.
I'd like to be able to write an after-commit trigger that fires after changes 
in dependent tables that I identify, which can add the transaction's deltas to 
the summary table. I would want it to effectively be a new, small transaction.

I think that if these three areas are addressed, then before even considering 
writing code to automatically convert any given view into a materialized view.

2) and 3) could perhaps be implemented with a per transaction map from 
my_custom_key to a prepared statement and a list of parameter values.
Provide access to the values, making them updateable within the transaction. 
Have the statement automatically executed on commit.

Pavan also refers to deferred triggers, which has got me thinking about another 
possible solution:

Instead of inserting a delta row, that will be updated a lot of times, create 
an on commit drop temp table named after the txid and the grouping value (which 
is an integer in my case, perhaps hash it if you're grouping by something that 
doesn't easily convert to part of a table name), 
create an after insert initially deferred constraint to call a function which 
will move the 'at commit' values of the rows in the temp table to the summary 
table.
The temp table will only have one row inserted, updated many times, then on 
commit the trigger is fired once, and the temp table is dropped.

Does anyone think this will or won't work for some reason?

Regards,
Stephen Denne.

Disclaimer:
At the Datamail Group we value team commitment, respect, achievement, customer 
focus, and courage. This email with any attachments is confidential and may be 
subject to legal privilege.  If it is not intended for you please advise by 
reply immediately, destroy it and do not copy, disclose or use it in any way.
__
  This email has been scanned by the DMZGlobal Business Quality
  Electronic Messaging Suite.
Please see http://www.dmzglobal.com/dmzmessaging.htm for details.
__



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:

Re: [HACKERS] count(*) performance improvement ideas

2008-03-18 Thread Gregory Stark
Tom Lane [EMAIL PROTECTED] writes:

 Stephen Denne [EMAIL PROTECTED] writes:

 However I'm not after a fast count(*) from table, but more like a fast
  select grouping_id, count(*) from my_table group by grouping_id

 You could apply the same technique across each group id, though this
 certainly is getting beyond what any built-in feature might offer.

At that point you're talking about materialized views. Which makes it a whole
lot more interesting imho.

 Does this idea apply with the same efficiency in pre 8.3, non-HOT 
 implementations?

 I didn't claim it was amazingly efficient in any implementation ;-).
 HOT in particular is nearly useless since most rows in the count
 table will never be updated, only inserted and eventually deleted.
 You might get some mileage on the base row, but that'd be about it.
 The count table will need frequent vacuums as well as frequent
 aggregation scans.

It might be better not to update this delta table in normal transactional
updates. After all the tuples you're deleting are precisely the ones that
nobody should be interested in any more. If you locked the table and magically
deleted those tuples and updated the master tuple using the global xmin
instead of your real xid people would get the same result and you could
reclaim the space much much sooner. Locking the table kind of sucks though.
And crash recovery would be a problem.


-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com
  Ask me about EnterpriseDB's PostGIS support!

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] count(*) performance improvement ideas

2008-03-18 Thread Stephen Denne
(Sorry for the repeat email Tom, I forgot the cc's)

Tom Lane wrote:
 Stephen Denne [EMAIL PROTECTED] writes:
  I initially thought that what you meant by having 
  transactions enter delta entries was that I have a trigger 
  that would create a row each time it was called, writing how 
  many records where inserted or deleted. I didn't understand 
  how this would be much of an improvement, as each of my rows 
  would contain either +1 or -1.
 
 Well, ideally you'd aggregate all the deltas caused by a particular
 transaction into one entry in the counting table.  

Yes, that's what I was attempting to do, but without changing the application 
code.

Using txid_current() can do that, so each of a large number of individual 
inserts or deletes within a transaction updates the same delta row for that 
transaction. I haven't found any references to this being a solution, and 
thought it was worth recording.

  However I'm not after a fast count(*) from table, but more 
 like a fast
  select grouping_id, count(*) from my_table group by grouping_id
 
 You could apply the same technique across each group id, though this
 certainly is getting beyond what any built-in feature might offer.

Agreed. I've tried it out now, and am fairly happy with what I've got.

  Can you clarify the lack of MVCC problems?
 
 The point there is that the right answer actually depends on the
 observer, since each observer might have a different snapshot and
 therefore be able to see a different set of committed rows in the
 underlying table.  The multiple-delta table handles this 
 automatically,
 because you can see a delta entry if and only if you could see
 the underlying-table changes it represents.
 
  Does this idea apply with the same efficiency in pre 8.3, 
 non-HOT implementations?
 
 I didn't claim it was amazingly efficient in any implementation ;-).
 HOT in particular is nearly useless since most rows in the count
 table will never be updated, only inserted and eventually deleted.
 You might get some mileage on the base row, but that'd be about it.
 The count table will need frequent vacuums as well as frequent
 aggregation scans.
 
 It should beat scanning a large underlying table, but it's hardly
 gonna be free.

What I was asking about with those questions, is if a single row is inserted in 
transaction A, and updated 1000 times still within transaction A, then 
transaction A commits... does a single row version get written, or 1001, 1000 
of which are removable?

I tested this with a small summary table freshly vacuum full'ed.

10 transactions, one after the other, each transaction inserted three delta 
rows, and updates one of them three times, and the other two five times each.
So 3 inserts and 13 updates per transaction.
The updates only affect non-indexed fields in rows created in the same 
transaction.

The next vacuum full found 52 removable row versions.

I repeated the test, and got 13 removable row versions.

I repeated the test again, and got 13 removable row versions.

I repeated just one of the ten transactions, 13 removable row versions.

All inserts and updates are probably in the one page that has a fair bit of 
free space.

Is it possible to update the HOT code to re-use row versions on the same page 
if they were created in the same transaction?

Conclusion: making use of txid_current(), I can get single delta rows with 
deltas of 1, but doing so creates 1 dead row versions.

Regards,
Stephen Denne.

Disclaimer:
At the Datamail Group we value team commitment, respect, achievement, customer 
focus, and courage. This email with any attachments is confidential and may be 
subject to legal privilege.  If it is not intended for you please advise by 
reply immediately, destroy it and do not copy, disclose or use it in any way.
__
  This email has been scanned by the DMZGlobal Business Quality
  Electronic Messaging Suite.
Please see http://www.dmzglobal.com/dmzmessaging.htm for details.
__



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] count(*) performance improvement ideas

2008-03-18 Thread Stephen Denne
Tom Lane wrote
 Stephen Denne [EMAIL PROTECTED] writes:
  What I was asking about with those questions, is if a 
  single row is inserted in transaction A, and updated 1000 
  times still within transaction A, then transaction A 
  commits... does a single row version get written, or 1001, 
  1000 of which are removable?
 
 Umm ... AFAICS there is no need for an UPDATE to touch the count table
 at all.  You'd only need ON INSERT and ON DELETE triggers.

I'm not referring to updates of my base table... the single row inserted was 
referring to the delta row...

I'm trying to group together a large number of +1 or -1 deltas into a single 
delta per transaction.

A cut down example:

CREATE TABLE document_type_summary_detail
(
  document_type_id integer NOT NULL,
  txid bigint NOT NULL DEFAULT 0,
  documents bigint NOT NULL DEFAULT 0,
  CONSTRAINT pk_document_type_summary PRIMARY KEY (document_type_id, txid)
);

CREATE OR REPLACE FUNCTION document_count_trig()
RETURNS TRIGGER AS
$$
   DECLARE
   BEGIN
   IF TG_OP = 'INSERT' THEN
  UPDATE document_type_summary_detail set documents=documents+1 where 
document_type_id = NEW.document_type_id and txid=txid_current();
  IF NOT FOUND THEN
 INSERT INTO document_type_summary_detail 
(document_type_id,documents,txid) VALUES(NEW.document_type_id,1,txid_current());
  END IF;
  RETURN NEW;
   ELSIF TG_OP = 'DELETE' THEN
  UPDATE document_type_summary_detail set documents=documents-1 where 
document_type_id = OLD.document_type_id and txid=txid_current();
  IF NOT FOUND THEN
 INSERT INTO document_type_summary_detail 
(document_type_id,documents,txid) 
VALUES(OLD.document_type_id,-1,txid_current());
  END IF;
  RETURN OLD;
   END IF;
   END;
$$
LANGUAGE 'plpgsql';

create trigger document_count_trig before insert or delete on document for each 
row execute procedure document_count_trig();

--one off setup:
insert into document_type_summary_detail (document_type_id,txid,documents)
select dt.id, 0, count(d.id) from document_type dt left join document d on 
d.document_type_id = dt.id
group by 1,2;


--useful view:
CREATE OR REPLACE VIEW document_type_summary AS SELECT document_type_id, 
sum(documents) AS documents FROM document_type_summary_detail GROUP BY 
document_type_id;


--scheduled cleanup:
CREATE OR REPLACE FUNCTION document_type_summary_aggregate() RETURNS void AS
$$
   BEGIN
  INSERT INTO document_type_summary_detail(document_type_id) select 
distinct document_type_id from document_type_summary_detail except select 
document_type_id from document_type_summary_detail where txid=0;
  UPDATE document_type_summary_detail set documents=v.documents from 
document_type_summary as v where document_type_summary_detail.document_type_id 
= v.document_type_id and document_type_summary_detail.txid=0 and exists (select 
1 from document_type_summary_detail ss where ss.document_type_id = 
document_type_summary_detail.document_type_id and ss.txid  0);
  DELETE FROM document_type_summary_detail where txid  0;
   END;
$$
LANGUAGE 'plpgsql';


My assumption is that this solves the locking causes serialisation of 
transactions problem as the only rows updated are those inserted by the same 
transaction.

Regards,
Stephen Denne.

Disclaimer:
At the Datamail Group we value team commitment, respect, achievement, customer 
focus, and courage. This email with any attachments is confidential and may be 
subject to legal privilege.  If it is not intended for you please advise by 
reply immediately, destroy it and do not copy, disclose or use it in any way.
__
  This email has been scanned by the DMZGlobal Business Quality
  Electronic Messaging Suite.
Please see http://www.dmzglobal.com/dmzmessaging.htm for details.
__



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] count(*) performance improvement ideas

2008-03-18 Thread Tom Lane
Stephen Denne [EMAIL PROTECTED] writes:
 Tom Lane wrote
 Umm ... AFAICS there is no need for an UPDATE to touch the count table
 at all.  You'd only need ON INSERT and ON DELETE triggers.

 I'm not referring to updates of my base table... the single row inserted was 
 referring to the delta row...

 I'm trying to group together a large number of +1 or -1 deltas into a single 
 delta per transaction.

Oh.  You can't realistically do that from the level of individual ON
INSERT/DELETE triggers --- as you've found out, you end up with about
the same number of dead rows anyway.  Collapsing a transaction's changes
into a single row would require keeping some intra-transaction state,
which is do-able at the C-code level but not from plpgsql.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] count(*) performance improvement ideas

2008-03-18 Thread Tom Lane
Stephen Denne [EMAIL PROTECTED] writes:
 What I was asking about with those questions, is if a single row is inserted 
 in transaction A, and updated 1000 times still within transaction A, then 
 transaction A commits... does a single row version get written, or 1001, 1000 
 of which are removable?

Umm ... AFAICS there is no need for an UPDATE to touch the count table
at all.  You'd only need ON INSERT and ON DELETE triggers.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] count(*) performance improvement ideas

2008-03-18 Thread Mark Mielke

Tom Lane wrote:

Stephen Denne [EMAIL PROTECTED] writes:
  

What I was asking about with those questions, is if a single row is inserted in 
transaction A, and updated 1000 times still within transaction A, then 
transaction A commits... does a single row version get written, or 1001, 1000 
of which are removable?



Umm ... AFAICS there is no need for an UPDATE to touch the count table
at all.  You'd only need ON INSERT and ON DELETE triggers.
  


This returns to the question of whether count of the whole table is 
useful, or whether count of a GROUP BY or WHERE is useful. If GROUP BY 
or WHERE is useful, then trigger on UPDATE becomes necessary.


What is the direction here? Is it count of the whole table only? (-- 
not interesting to me)


Or count of more practical real life examples, which I completely agree 
with Greg, that this gets into the materialized view realm, and becomes 
very interesting.


In my current db project, I never count all of the rows in a table. 
However, I do use count(*) with GROUP BY and WHERE.


Cheers,
mark

--
Mark Mielke [EMAIL PROTECTED]



Re: [HACKERS] count(*) performance improvement ideas

2008-03-17 Thread Stephen Denne
 

Tom Lane wrote
 Pavan Deolasee [EMAIL PROTECTED] writes:
  I am sure this must have been discussed before.
 
 Indeed.  Apparently you didn't find the threads in which the idea of
 having transactions enter delta entries was discussed?  Solves both
 the locking and the MVCC problems, at the cost that you need to make
 cleanup passes over the counter table fairly often.

I can't find any posts that directly address what I was looking for.

In my situation I have a small number of concurrent transactions with each 
transaction running a large number of single row inserts or deletes.

However I'm not after a fast count(*) from table, but more like a fast
select grouping_id, count(*) from my_table group by grouping_id

I initially thought that what you meant by having transactions enter delta 
entries was that I have a trigger that would create a row each time it was 
called, writing how many records where inserted or deleted. I didn't understand 
how this would be much of an improvement, as each of my rows would contain 
either +1 or -1.

But I just realised you might mean to include a txid row in my table of deltas, 
and in my trigger insert or update that row where txid = txid_current()
(per grouping_id)

Is that what is recommended?

No locking problems as each transaction is only updating its own rows.

Can you clarify the lack of MVCC problems?
Do new versions of rows get created if the original version of the row was 
created in the current transaction?
Does this idea apply with the same efficiency in pre 8.3, non-HOT 
implementations?
Any advice on creating or avoiding indexes on the tables in question?

I can think of two options for a performing the cleanup passes using current 
functionality:
1) Using Listen/Notify
Issue a notify whenever a new row is inserted for a transaction. They get 
delivered post transaction commit don't they? And not at all on rollback? Have 
an application listening for them, performing the aggregation  cleanup work.
2) Use a schedule, based on local knowledge of expected number of transactions 
over time.

So I'd rephrase Pavan's suggestion as a request to have post-transaction commit 
triggers that have access to (at least) the txid of the transaction that was 
committed.

Suggested syntax is to add the option TRANSACTION (or perhaps COMMIT) to 
the CREATE TRIGGER statement:

CREATE TRIGGER name AFTER INSERT OR UPDATE OR DELETE ON table FOR EACH 
TRANSACTION EXECUTE PROCEDURE funcname ( arguments );

Any of the triggering actions on the specified table ensure that the function 
is called once if the transaction commits.
Requires a new TG_LEVEL.
TG_OP could be the first action triggered.

Would implementing this be extremely difficult due to transferring information 
from within the transaction to outside the transaction?
If so, perhaps I'd get the same benefit from having a trigger set up to fire 
pre-commit (or pre-prepare), and be a part of the transaction.
Would the locking difficulties be reduced as the lock would not be required 
till late in the game, and not held for long?

Regards,
Stephen Denne.

Disclaimer:
At the Datamail Group we value team commitment, respect, achievement, customer 
focus, and courage. This email with any attachments is confidential and may be 
subject to legal privilege.  If it is not intended for you please advise by 
reply immediately, destroy it and do not copy, disclose or use it in any way.

__
  This email has been scanned by the DMZGlobal Business Quality 
  Electronic Messaging Suite.
Please see http://www.dmzglobal.com/services/bqem.htm for details.
__



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] count(*) performance improvement ideas

2008-03-17 Thread Tom Lane
Stephen Denne [EMAIL PROTECTED] writes:
 I initially thought that what you meant by having transactions enter delta 
 entries was that I have a trigger that would create a row each time it was 
 called, writing how many records where inserted or deleted. I didn't 
 understand how this would be much of an improvement, as each of my rows would 
 contain either +1 or -1.

Well, ideally you'd aggregate all the deltas caused by a particular
transaction into one entry in the counting table.  Whether or not that
happens, though, the point of the concept is that some background task
aggregates all the deltas from long-gone transactions into just one base
row, and then deletes the old delta entries.  To get a valid value of
COUNT(*), what onlookers must do is SUM() the base row and delta records
from all transactions that they can see under MVCC rules.  The amount
of work involved is proportional to the number of recent updates, not
the total size of the underlying table.

 However I'm not after a fast count(*) from table, but more like a fast
   select grouping_id, count(*) from my_table group by grouping_id

You could apply the same technique across each group id, though this
certainly is getting beyond what any built-in feature might offer.

 Can you clarify the lack of MVCC problems?

The point there is that the right answer actually depends on the
observer, since each observer might have a different snapshot and
therefore be able to see a different set of committed rows in the
underlying table.  The multiple-delta table handles this automatically,
because you can see a delta entry if and only if you could see
the underlying-table changes it represents.

 Does this idea apply with the same efficiency in pre 8.3, non-HOT 
 implementations?

I didn't claim it was amazingly efficient in any implementation ;-).
HOT in particular is nearly useless since most rows in the count
table will never be updated, only inserted and eventually deleted.
You might get some mileage on the base row, but that'd be about it.
The count table will need frequent vacuums as well as frequent
aggregation scans.

It should beat scanning a large underlying table, but it's hardly
gonna be free.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] count(*) performance improvement ideas

2008-03-12 Thread Tom Lane
Pavan Deolasee [EMAIL PROTECTED] writes:
 I am sure this must have been discussed before.

Indeed.  Apparently you didn't find the threads in which the idea of
having transactions enter delta entries was discussed?  Solves both
the locking and the MVCC problems, at the cost that you need to make
cleanup passes over the counter table fairly often.

I don't see this as material for the core DB but it would make a great
contrib module.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] count(*) performance improvement ideas

2008-03-12 Thread Bruce Momjian
Tom Lane wrote:
 Pavan Deolasee [EMAIL PROTECTED] writes:
  I am sure this must have been discussed before.
 
 Indeed.  Apparently you didn't find the threads in which the idea of
 having transactions enter delta entries was discussed?  Solves both
 the locking and the MVCC problems, at the cost that you need to make
 cleanup passes over the counter table fairly often.
 
 I don't see this as material for the core DB but it would make a great
 contrib module.

The TODO list has lots of details on this.

-- 
  Bruce Momjian  [EMAIL PROTECTED]http://momjian.us
  EnterpriseDB http://postgres.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] count(*) performance improvement ideas

2008-03-12 Thread Pavan Deolasee
On Wed, Mar 12, 2008 at 9:01 PM, Tom Lane [EMAIL PROTECTED] wrote:
 Pavan Deolasee [EMAIL PROTECTED] writes:
   I am sure this must have been discussed before.

  Indeed.  Apparently you didn't find the threads in which the idea of
  having transactions enter delta entries was discussed?  Solves both
  the locking and the MVCC problems, at the cost that you need to make
  cleanup passes over the counter table fairly often.


Ok. I would surely look at those threads. Hopefully HOT will considerably
solve the counter table cleanup issue.

  I don't see this as material for the core DB but it would make a great
  contrib module.


I guess we will need some backend hooks to make it really work, no ?
At the minimum we need to track the deltas at the transaction level
and the ability to do some extra processing at the commit time.
May be I should first read those threads and I will find the answers.

Thanks,
Pavan

-- 
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] count(*) performance improvement ideas

2008-03-12 Thread Mark Mielke

Pavan Deolasee wrote:

I am reading discussion about improving count(*) performance. I have
also seen a TODO for this.

Many people have suggested TRIGGER based solution to the slow count(*)
problem. I looked at the following link which presents the solution
neatly.

http://www.varlena.com/GeneralBits/120.php

But how does that really work for SERIALIZABLE transactions ? If
two concurrent transactions INSERT/DELETE rows from a table,
the trigger execution of one of the transactions is bound to fail
because of concurrent access. Even for READ COMMITTED transactions,
the trigger execution would wait if the other transaction has executed
the trigger on the same table. Well, I think the READ COMMITTED case
can be handled with DEFERRED triggers, but that may require queuing up
too many triggers if there are many inserts/deletes in a transaction.

Running trigger for every insert/delete seems too expensive. I wonder
if we can have a separate counter table (as suggested in the TRIGGER
based solution) and track total number of tuples inserted and deleted
in a transaction (and all the committed subtransactions). We then
execute a single UPDATE at the end of the transaction. With HOT,
updating the counter table should not be a big pain since all these
updates can potentially be HOT updates. Also since the update of
the counter table happens at the commit time, other transactions
inserting/deleting from the same user table may need to wait for a
very small period on the counter table tuple.

This still doesn't solve the serializable transaction problem
though. But I am sure we can figure out some solution for that case
as well if we agree on the general approach.

I am sure this must have been discussed before. So what are the
objections 


If you are talking about automatically doing this for every table - I 
have an objection that the performance impact seems unwarranted against 
the gain. We are still talking about every insert or update updating 
some counter table, with the only mitigating factor being that the 
trigger would be coded deeper into PostgreSQL theoretically making it 
cheaper?


You can already today create a trigger on insert that will append to a 
summary table of some sort, whose only purpose is to maintain counts. At 
the simplest, it is as somebody else suggested where you might have the 
other table only store the primary keys with foreign key references back 
to the main table for handling deletes and updates. Storing transaction 
numbers and such might allow the data to be reduced in terms of size 
(and therefore elapsed time to scan), but it seems complex.


If this really is a problem that must be solved - I prefer the 
suggestion from the past of keeping track of live rows per block for a 
certain transaction range, and any that fall within this range can check 
off this block quickly with an exact count, then the exceptional blocks 
(the ones being changed) can be scanned to be sure. But, it's still 
pretty complicated to implement right and maintain, for what is probably 
limited gain. I don't personally buy into the need to do exact count(*) 
on a whole table quickly. I know people ask for it - but I find these 
same people either confused, or trying to use this functionality to 
accomplish some other end, under the assumption that because they can 
get counts faster from other databases, therefore PostgreSQL should do 
it as well. I sometimes wonder whether these people would even notice if 
PostgreSQL translated count(*) on the whole table to query reltuples. :-)


Cheers,
mark

--
Mark Mielke [EMAIL PROTECTED]


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] count(*) performance improvement ideas

2008-03-12 Thread Pavan Deolasee
On Wed, Mar 12, 2008 at 9:14 PM, Mark Mielke [EMAIL PROTECTED] wrote:

  If you are talking about automatically doing this for every table - I
  have an objection that the performance impact seems unwarranted against
  the gain. We are still talking about every insert or update updating
  some counter table, with the only mitigating factor being that the
  trigger would be coded deeper into PostgreSQL theoretically making it
  cheaper?


No, I am not suggesting that. If you read proposal carefully, its one UPDATE
per transaction. With HOT, I am hoping that the counter table may be
completely cached in memory and won't bloat much.

Also, we can always have a GUC (like pgstats) to control the overhead.

Thanks,
Pavan


-- 
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] count(*) performance improvement ideas

2008-03-12 Thread Mark Mielke

Pavan Deolasee wrote:

On Wed, Mar 12, 2008 at 9:14 PM, Mark Mielke [EMAIL PROTECTED] wrote:
  

 If you are talking about automatically doing this for every table - I
 have an objection that the performance impact seems unwarranted against
 the gain. We are still talking about every insert or update updating
 some counter table, with the only mitigating factor being that the
 trigger would be coded deeper into PostgreSQL theoretically making it
 cheaper?



No, I am not suggesting that. If you read proposal carefully, its one UPDATE
per transaction. With HOT, I am hoping that the counter table may be
completely cached in memory and won't bloat much.

Also, we can always have a GUC (like pgstats) to control the overhead.
  


Fine - once per transaction instead of once per insert. Still, if there 
is overhead to this (updating a secondary summary table), does it really 
make sense to have it for every table? Most of my tables do not require 
count(*) on the whole table (actually - none of them do). For the same 
reason as I don't want oid, I don't think I would want fast count 
capabilities to impact my regular queries. Again, I don't think count(*) 
on the whole table is a particularly useful case. count(*) on particular 
subsets of the data may be, but of the whole table?


If you can make a secondary summary table to be used for count(*) 
optional, great. If using HOT makes the secondary table more efficient, 
great.


Cheers,
mark

--
Mark Mielke [EMAIL PROTECTED]



Re: [HACKERS] count(*) performance improvement ideas

2008-03-12 Thread Tom Lane
Mark Mielke [EMAIL PROTECTED] writes:
 Fine - once per transaction instead of once per insert. Still, if there 
 is overhead to this (updating a secondary summary table), does it really 
 make sense to have it for every table?

We certainly wouldn't accept a patch that imposed this overhead on every
table.

One of the advantages of framing it as an explicit set of triggers is
that then you have a natural way of indicating which table(s) you want
the feature for (and are willing to pay the overhead to get it).

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] count(*) performance improvement ideas

2008-03-12 Thread Pavan Deolasee
On Wed, Mar 12, 2008 at 9:53 PM, Mark Mielke [EMAIL PROTECTED] wrote:



  Fine - once per transaction instead of once per insert. Still, if there is
 overhead to this (updating a secondary summary table), does it really make
 sense to have it for every table? Most of my tables do not require count(*)
 on the whole table (actually - none of them do). For the same reason as I
 don't want oid, I don't think I would want fast count capabilities to
 impact my regular queries. Again, I don't think count(*) on the whole table
 is a particularly useful case. count(*) on particular subsets of the data
 may be, but of the whole table?


ISTM that you are complaining because we never had an *fast* count(*)
and adding that now comes at a cost. Had it been there from day one with
the same overhead as we are talking about now, nobody would have
complained :-)

Anyways, your point is taken and it would be great if can make it configurable,
if not table level then at least globally.


Thanks,
Pavan


-- 
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] count(*) performance improvement ideas

2008-03-12 Thread Mark Mielke

Pavan Deolasee wrote:

On Wed, Mar 12, 2008 at 9:53 PM, Mark Mielke [EMAIL PROTECTED] wrote:
  


 Fine - once per transaction instead of once per insert. Still, if there is
overhead to this (updating a secondary summary table), does it really make
sense to have it for every table? Most of my tables do not require count(*)
on the whole table (actually - none of them do). For the same reason as I
don't want oid, I don't think I would want fast count capabilities to
impact my regular queries. Again, I don't think count(*) on the whole table
is a particularly useful case. count(*) on particular subsets of the data
may be, but of the whole table?




ISTM that you are complaining because we never had an *fast* count(*)
and adding that now comes at a cost. Had it been there from day one with
the same overhead as we are talking about now, nobody would have
complained :-)
  


Quite possibly - but, similarly, if PostgreSQL was generally slower due 
to such things, I might not have chosen PostgreSQL as my favoured db. 
:-) I think there is a reason PostgreSQL doesn't come with a fast 
count(*), and it's that the developers leading up to this point shared 
the opinion that this feature was not critical.


I somewhat disagree about my personal reaction, though. I complain about 
many things, even configurable things, such as LISTEN/NOTIFY. I 
occasionally wish I had such a function, but given it's implementation, 
I would choose to use a non-PostgreSQL mechanism for implementation if 
available. Luckily, it's also only suffer the performance cost if you 
CHOOSE to use it. Seems like a historically valuable trend.



Anyways, your point is taken and it would be great if can make it configurable,
if not table level then at least globally


If table level, I might even try it out. If global level, I will not be 
trying it out. I'm only one person - but perhaps this will help point 
you in a direction favourable to many?


If you are REALLY feeling clever, the queries that I would find it to 
benefit the MOST on, would include WHERE conditions. Just like indexes 
with WHERE conditions. If you get to this point, I think you are 
reaching something that will have far more universal benefit to existing 
applications. CREATE COUNT INDEX blah ON blah WHERE ... I would use this 
in many places.


Cheers,
mark

--
Mark Mielke [EMAIL PROTECTED]



Re: [HACKERS] count(*) performance improvement ideas

2008-03-12 Thread Gregory Stark

Tom Lane [EMAIL PROTECTED] writes:

 Mark Mielke [EMAIL PROTECTED] writes:
 Fine - once per transaction instead of once per insert. Still, if there 
 is overhead to this (updating a secondary summary table), does it really 
 make sense to have it for every table?

 We certainly wouldn't accept a patch that imposed this overhead on every
 table.

If you look at this at the right angle it's actually a degenerate case of
materialized views. I think think it would be more useful to approach it from
that direction even if it only supported a very limited set of expressions.

In an ideal world I would love to be able to do something like:

CREATE MATERIALIZED VIEW foo AS (select count(*) from bar) WITH INCREMENTAL 
UPDATES;

and have that automatically create both a heap to store the count and another
to store the incremental changes. To do this would require some magic to know
what incremental changes means for each aggregate where it's meaningful
though.

Then it would require some magic in the optimizer to recognize when piece of
the query can be satisfied by a materialized view.

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com
  Ask me about EnterpriseDB's Slony Replication support!

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] count(*) performance improvement ideas

2008-03-12 Thread Mark Mielke

Gregory Stark wrote:

In an ideal world I would love to be able to do something like:

CREATE MATERIALIZED VIEW foo AS (select count(*) from bar) WITH INCREMENTAL 
UPDATES;

and have that automatically create both a heap to store the count and another
to store the incremental changes. To do this would require some magic to know
what incremental changes means for each aggregate where it's meaningful
though.

Then it would require some magic in the optimizer to recognize when piece of
the query can be satisfied by a materialized view.
  


Hehe... +5 if you think it can be done in the next 12 - 24 months. :-)

Cheers,
mark

--
Mark Mielke [EMAIL PROTECTED]


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers