Re: [HACKERS] RFC: changing autovacuum_naptime semantics

2007-03-09 Thread Grzegorz Jaskiewicz


On Mar 9, 2007, at 6:42 AM, Tom Lane wrote:


Alvaro Herrera [EMAIL PROTECTED] writes:

Now regarding your restartable vacuum work.  I think that stopping a
vacuum at some point and being able to restart it later is very cool and
may get you some hot chicks, but I'm not sure it's really useful.


Too true :-(


Yeah.
Wouldn't a 'divide and conquer' kind of approach make it better? I.e. let
vacuum work on some part of the table/db, then stop, pick up another
part later, vacuum it, etc, etc?


--
Grzegorz Jaskiewicz
[EMAIL PROTECTED]






Re: [HACKERS] Auto creation of Partitions

2007-03-09 Thread Zeugswetter Andreas ADI SD

  Since partition is inheritance-based, a simple DROP or NO INHERIT
  will do the job to deal with the partition. Do we want to reinvent
  additional syntax when these are around and are documented?
 
 Well, if the syntax for adding a new partition eventually 
 ends up as ALTER TABLE ADD PARTITION, then it would make more 
 sense that you remove a partition via ALTER TABLE DROP PARTITION.

But DROP PARTITION usually moves the data from this partition to other
partitions, so it is something different.

Andreas



Re: [HACKERS] Calculated view fields (8.1 != 8.2)

2007-03-09 Thread Gaetano Mendola

Gaetano Mendola wrote:
 Tom Lane wrote:
 Gaetano Mendola [EMAIL PROTECTED] writes:
 [ 8.2 evaluates volatile functions in the targetlist of a view ]
 If I mark the function as STABLE or IMMUTABLE then even with version
 8.2 the function is not evaluated. Is this the intended behavior?
 Yes; people complained that we needed to be more careful about the
 number of times volatile functions get evaluated.
 
 I suspect that functions are evaluated also for records discarded due to
 joins. Is that the case?
 
 Like:
 
 SELECT * FROM ta JOIN tb USING (id) where tb.foo = 4;
 
 If ta is a view with some calculated fields, are the functions on ta
 evaluated only for records matching the filters, or in some cases (like
 a full scan on ta) also for the records discarded due to the join?
 

I did a check on 8.2 and I can confirm my suspicion:

kalman=#  create table ta ( a integer, b integer );
CREATE TABLE
kalman=#  create table tb ( b integer, c integer );
CREATE TABLE
kalman=#
kalman=#  CREATE OR REPLACE FUNCTION sp_test ( INTEGER )
kalman-#  RETURNS INTEGER AS'
kalman'#  DECLARE
kalman'# a_id ALIAS FOR $1;
kalman'#  BEGIN
kalman'# RAISE NOTICE ''here'';
kalman'#
kalman'# return 3;
kalman'#  END;
kalman'#  ' LANGUAGE 'plpgsql';
CREATE FUNCTION
kalman=#
kalman=#  CREATE OR REPLACE VIEW v_ta AS
kalman-#SELECT
kalman-#   sp_test(a) AS a,
kalman-#   b  AS b
kalman-#FROM
kalman-#   ta c
kalman-#  ;
CREATE VIEW
kalman=#
kalman=#  insert into ta values (2,3);
INSERT 0 1
kalman=#  insert into ta values (3,4);
INSERT 0 1
kalman=#  insert into tb values (4,5);
INSERT 0 1
kalman=#
kalman=#  select * from v_ta join tb using (b) where c = 5;
NOTICE:  here
NOTICE:  here
 b | a | c
---+---+---
 4 | 3 | 5
(1 row)


Is this really what we want? I did a migration 8.0.x => 8.2.3 and in the
first hour of service up I had a lot of queries blocked due to this;
consider that in my case v_ta has millions of records and usually that
join extracts 1 row. Is there a way to disable this behaviour until I
have checked all of my huge schema?

Regards
Gaetano Mendola






[HACKERS] A naive question about the Makefile

2007-03-09 Thread Cao Yu

Hi all,
I am adding some new files into the system, so I wonder what I need to do
to update the Makefiles.
Is the system smart enough to recognize my new files and add
corresponding entries into the Makefiles?

Thanks.

Regards
Yu



Re: [HACKERS] A naive question about the Makefile

2007-03-09 Thread Peter Eisentraut
On Friday, 9 March 2007 10:38, Cao Yu wrote:
  Is the system smart enough to recognize my new files and add
 corresponding entries into the Makefiles?

No, you need to add it yourself.  Look for where the other files are listed.
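A minimal sketch of what that looks like, assuming a hypothetical new file
foo.c in src/backend/commands/: each directory's Makefile carries an
explicit object list that you extend by hand, then rebuild:

# abridged excerpt from src/backend/commands/Makefile; foo.o is the new entry
OBJS = aggregatecmds.o alter.o analyze.o async.o cluster.o \
       comment.o copy.o foo.o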

-- 
Peter Eisentraut
http://developer.postgresql.org/~petere/



Re: [HACKERS] Auto creation of Partitions

2007-03-09 Thread Simon Riggs
On Fri, 2007-03-09 at 11:48 +0530, NikhilS wrote:
 Hi, 
 
 On 3/9/07, Shane Ambler [EMAIL PROTECTED] wrote:

   Note to Nikhil: Make sure the new syntax doesn't prevent partitions
   from being placed upon multiple tablespaces in some manner, at
   CREATE TABLE time.

 What if the syntax was something like -

 CREATE TABLE tabname (
  ...
  ...
 ) PARTITION BY
   HASH(expr)
 | RANGE(expr)
 | LIST(expr)
 [PARTITIONS num_partitions] /* will apply to HASH only for now */
 [PARTITION partition_name CHECK(...) [USING TABLESPACE tblspcname],
  PARTITION partition_name CHECK(...) [USING TABLESPACE tblspcname]
  ...
 ];

 And (if we use the ALTER TABLE to add partitions)

 ALTER TABLE tabname
 ADD PARTITION partition_name CHECK(...)
 [USING TABLESPACE tblspcname];

 We could as well drop the USING part.

Why would we support HASH partitions?
If you did, the full syntax for hash clusters should be supported.

If we do the CHECK clauses like that then we still don't have
guaranteed non-overlap between partitions. It would be easier to use
Oracle syntax and then construct the CHECK clauses from that.

Also, the syntax needs to be fairly complex to allow for a mixture of
modes, e.g. range and list partitioning. That is currently possible
today, and the syntax for doing that is IMHO much simpler than the
Oracle way of specifying it.

An alternative is to provide a partitioning function which decides which
partition each value goes into.

PARTITION FUNCTION which_partition(date_col)

The partition function must return an unsigned integer > 0, which would
correspond to particular partitions. Partitions would be numbered 1..N,
and named tablename_partM where 1 <= M <= N.

The input and contents of the partition function would be up to the
user. e.g. 

CREATE FUNCTION range_partition(date date_col)
{
    if (date_col < D1)
        return 1;
    else if (date_col < D2)
        return 2;
    else if (date_col < D3)
        return 3;

    return 4;
}

Doing it this way would allow us to easily join two tables based upon a
common partition function.

In time, I would suggest we support both ways: declarative and
functional.

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com





[HACKERS] CLUSTER and MVCC

2007-03-09 Thread Heikki Linnakangas
Is there a particular reason why CLUSTER isn't MVCC-safe? It seems to me 
that it would be trivial to fix, by using SnapshotAny instead of 
SnapshotNow, and not overwriting the xmin/xmax with the xid of the 
cluster command.


I feel that I must be missing something, or someone would've already fixed
it a long time ago...


Csaba, you mentioned recently 
(http://archives.postgresql.org/pgsql-hackers/2007-03/msg00027.php) that 
you're actually using the MVCC-violation to clean up tables during a 
backup. Can you tell us a bit more about that? Would you be upset if we 
shut that backdoor?


In any case, the MVCC-violation needs to be documented. I'll send a doc 
patch to pgsql-patches shortly.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] [COMMITTERS] pgsql: Remove unsafe calling of WSAStartup and WSA Cleanup from DllMain.

2007-03-09 Thread Magnus Hagander
On Fri, Mar 09, 2007 at 08:16:12AM +, Dave Page wrote:
 Magnus Hagander wrote:
 For example, do you recall us being confused
 when we found we needed to call it in slon.exe?
 
 Eh, no, actually not. Sorry.
 
 Well, it was only a couple of years ago!! Seriously though, from what I 
 recall that was the origin of this code - you left it out because libpq 
 called WSAStartup on your system, and I added it because on mine it 
 didn't. Or something like that. I remember us getting confused about it 
 on IM anyway.

Hmm. Was that actually for the libpq stuff, though? I don't recall it
clearly, but something tells me the problem was around the pipe
emulation and not around libpq.

In which case it can simply be because I was building against a libpq
built with MSVC => it had the broken startup code, and you used a mingw
one, which didn't have it.

Per the docs, an application like slon *should* make its own call to
WSAStartup() because it uses socket functions directly (port/pipe.c)...


Another question related to backpatching - should I backpatch this to
8.1 and 8.0 as well? I know we said we more or less don't maintain the
win32 port back there because it was too new, but this is all code in
the client libpq, which has been around on win32 much longer. The reason
I'm asking is that the original reporter of this problem is on 8.1...

I'm leaning towards yes, but would like to hear further comments...

//Magnus



Re: [HACKERS] Auto creation of Partitions

2007-03-09 Thread NikhilS

Hi,



Why would we support HASH partitions?
If you did, the full syntax for hash clusters should be supported.



In MySQL and Oracle, the syntax for HASH partitions seems to be similar to
the one mentioned. I do not know much about hash clusters though.

If we do the CHECK clauses like that then we still don't have

guaranteed non-overlap between partitions. It would be easier to use
Oracle syntax and then construct the CHECK clauses from that.



Again, Oracle and MySQL use the VALUES LESS THAN (expr) format for RANGE
partitions. So you mean that they end up creating ranges like MININT -
Range1, Range1+1 - Range2 etc. for each of the partitions?

I think Postgres users are used to the CHECK clauses and I still feel that
the onus of distinct partitions lies on the partition creator.
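
For illustration (a sketch in today's inheritance style; the sales table,
column and date ranges are made up), VALUES LESS THAN boundaries would
simply be generated as adjacent CHECK clauses:

CREATE TABLE sales_2007q1 (
    CHECK ( sale_date >= DATE '2007-01-01' AND sale_date < DATE '2007-04-01' )
) INHERITS (sales);

CREATE TABLE sales_2007q2 (
    CHECK ( sale_date >= DATE '2007-04-01' AND sale_date < DATE '2007-07-01' )
) INHERITS (sales);

Adjacent boundaries give the guaranteed non-overlap that hand-written
CHECK clauses leave to the partition creator.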

Also, the syntax needs to be fairly complex to allow for a mixture of

modes, e.g. range and list partitioning. That is currently possible
today, and the syntax for doing that is IMHO much simpler than the
Oracle way of specifying it.



Subpartitioning is not being targeted right now, but could be put on the
TODO list for further enhancements.

An alternative is to provide a partitioning function which decides which

partition each value goes into.

PARTITION FUNCTION which_partition(date_col)

The partition function must return an unsigned integer > 0, which would
correspond to particular partitions. Partitions would be numbered 1..N,
and named tablename_partM where 1 <= M <= N.

The input and contents of the partition function would be up to the
user. e.g.

CREATE FUNCTION range_partition(date date_col)
{
    if (date_col < D1)
        return 1;
    else if (date_col < D2)
        return 2;
    else if (date_col < D3)
        return 3;

    return 4;
}

Doing it this way would allow us to easily join two tables based upon a
common partition function.

In time, I would suggest we support both ways: declarative and
functional.



For now, we are going the declarative way.

Regards,
Nikhils
--
EnterpriseDB   http://www.enterprisedb.com


Re: [HACKERS] who gets paid for this

2007-03-09 Thread Andrew Dunstan

Tom Lane wrote:


Even more to the point, "getting paid for" has almost nothing to do
with "has commit privileges".  At least on this project.

  


Darn. So the cheque isn't really in the mail?

cheers

andrew



Re: [HACKERS] CLUSTER and MVCC

2007-03-09 Thread Csaba Nagy
On Fri, 2007-03-09 at 12:29, Heikki Linnakangas wrote:
 Csaba, you mentioned recently 
 (http://archives.postgresql.org/pgsql-hackers/2007-03/msg00027.php) that 
 you're actually using the MVCC-violation to clean up tables during a 
 backup. Can you tell us a bit more about that? Would you be upset if we 
 shut that backdoor?

My use case: a queue-like table (in fact a 'task' table) which is very
frequently inserted/updated/deleted. This table tends to be bloated in
the presence of any long running transaction... the only transactional
behavior we need from this table is to make sure that when we insert
something in this table in a transaction (possibly together with other
actions) and then commit/rollback, it commits/rolls back the insert.
CLUSTER's violation of MVCC does not affect this, as CLUSTER will not be
able to lock the table if another transaction inserted something in it
(the inserting transaction will have a lock on the table). Selections on
this table are not critical for us, it just doesn't matter which job
processor is getting which task and in what order... (actually it does
matter, but CLUSTER won't affect that either).

So what I do is execute CLUSTER once every 5 minutes on this table. This
works just fine, and keeps the table size small even if I have long
running transactions in progress. The DB backup is one such unavoidable
long running transaction, and I use the table exclusion switch to exclude
this task table from the backup so it won't get locked by it and CLUSTER
can still do its job (I had a rudimentary patch to do this even before
the feature was introduced to pg_dump). The table can be dumped
separately, which is a brief operation, but I would have to clear it on
a crash anyway...
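
For reference, that kind of split dump looks roughly like this with 8.2's
exclusion switch (database and table names here are made up):

pg_dump --exclude-table=task_table mydb > mydb.dump
pg_dump --table=task_table mydb > task_table.dump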

Now I could try disabling the CLUSTER cron job and see if I get
problems; the last time it was disabled we were on postgres 7.4, so maybe
something has changed in between... but I can tell for sure that the last
time I enabled it, it really fixed our load on the DB server...

Wouldn't it be possible to do it like Simon (IIRC) suggested, and add a
parameter to enable/disable the current behavior, and use the MVCC
behavior as default?

Cheers,
Csaba.





Re: [HACKERS] CLUSTER and MVCC

2007-03-09 Thread Hannu Krosing
On Friday, 2007-03-09 at 11:29, Heikki Linnakangas wrote:
 Is there a particular reason why CLUSTER isn't MVCC-safe? It seems to me 
 that it would be trivial to fix, by using SnapshotAny instead of 
 SnapshotNow, and not overwriting the xmin/xmax with the xid of the 
 cluster command.
 
 I feel that I must be missing something, or someone would've already
 fixed it a long time ago...

Probably it is not MVCC-safe because the relation is swapped out from
under pg_class.

That is, it is possible that older and newer transactions read different
data files, and so simple MVCC does not work.


 Csaba, you mentioned recently 
 (http://archives.postgresql.org/pgsql-hackers/2007-03/msg00027.php) that 
 you're actually using the MVCC-violation to clean up tables during a 
 backup. Can you tell us a bit more about that? Would you be upset if we 
 shut that backdoor?
 
 In any case, the MVCC-violation needs to be documented. I'll send a doc 
 patch to pgsql-patches shortly.
 
-- 

Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com





Re: [HACKERS] RFC: changing autovacuum_naptime semantics

2007-03-09 Thread Gregory Stark
Tom Lane [EMAIL PROTECTED] writes:

 Er, why not just finish out the scan at the reduced I/O rate?  Any sort
 of abort behavior is going to create net inefficiency, eg doing an
 index scan to remove only a few tuples.  ISTM that the vacuum ought to
 just continue along its existing path at a slower I/O rate.

I think the main motivation to abort a vacuum scan is so we can switch
to some more urgent scan. So if, in the middle of a 1-hour long vacuum
of some big warehouse table, we realize that a small hot table is long
overdue for a vacuum, we want to be able to remove the tuples we've
found so far, switch to the hot table, and, when we don't have more
urgent tables to vacuum, resume the large warehouse table vacuum.

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com



Re: [HACKERS] Auto creation of Partitions

2007-03-09 Thread Luke Lonergan
Simon,

What happens to the data when the function is dropped or replaced?

- Luke

Msg is shrt cuz m on ma treo



Re: [HACKERS] CLUSTER and MVCC

2007-03-09 Thread Gregory Stark
Csaba Nagy [EMAIL PROTECTED] writes:

 Wouldn't it be possible to do it like Simon (IIRC) suggested, and add a
 parameter to enable/disable the current behavior, and use the MVCC
 behavior as default?

Doing it in CLUSTER would be weird. However perhaps it would be useful to have
some sort of stand-alone tool that just bumped all the xmin/xmax's. It would
have to be super-user-only and carry big warning labels saying it breaks MVCC.

But it would be useful any time you want to exempt a particular table
from serializable snapshots. Basically a per-table way to force a
read-committed snapshot on. Though, actually it's not quite a
read-committed snapshot, is it? Anyone using an old serializable snapshot
will see what, no tuples at all?

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com



Re: [HACKERS] RFC: changing autovacuum_naptime semantics

2007-03-09 Thread Alvaro Herrera
Gregory Stark wrote:
 Tom Lane [EMAIL PROTECTED] writes:
 
  Er, why not just finish out the scan at the reduced I/O rate?  Any sort
  of abort behavior is going to create net inefficiency, eg doing an
  index scan to remove only a few tuples.  ISTM that the vacuum ought to
  just continue along its existing path at a slower I/O rate.
 
 I think the main motivation to abort a vacuum scan is so we can switch
 to some more urgent scan. So if, in the middle of a 1-hour long vacuum
 of some big warehouse table, we realize that a small hot table is long
 overdue for a vacuum, we want to be able to remove the tuples we've
 found so far, switch to the hot table, and, when we don't have more
 urgent tables to vacuum, resume the large warehouse table vacuum.

Why not just let another autovac worker do the hot table?

-- 
Alvaro Herrerahttp://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] CLUSTER and MVCC

2007-03-09 Thread Csaba Nagy
On Fri, 2007-03-09 at 13:42, Gregory Stark wrote:
 Csaba Nagy [EMAIL PROTECTED] writes:
 
  Wouldn't it be possible to do it like Simon (IIRC) suggested, and add a
  parameter to enable/disable the current behavior, and use the MVCC
  behavior as default?
 
 Doing it in CLUSTER would be weird. However perhaps it would be useful to have
 some sort of stand-alone tool that just bumped all the xmin/xmax's. It would
 have to be super-user-only and carry big warning labels saying it breaks MVCC.

Well, the current behavior of CLUSTER is just perfect for what I'm using
it for. If anything else would do the job, I would be happy to use it
instead...

 But it would be useful any time you want to exempt a particular table
 from serializable snapshots. Basically a per-table way to force a
 read-committed snapshot on. Though, actually it's not quite a
 read-committed snapshot, is it? Anyone using an old serializable snapshot
 will see what, no tuples at all?

I'm afraid what I need has nothing to do with serializable snapshots...
I still want the table to be completely transactional, except that if
somebody can get an exclusive lock on it, it can be compacted regardless
of other running transactions. I'm not sure how to express this in
another way... it means something like: no transaction cares about the
content of the table until it gets some kind of lock on it. In other
words, the table's state is not connected with the state of other tables
until I actually do something on it...

Cheers,
Csaba.





Re: [HACKERS] CLUSTER and MVCC

2007-03-09 Thread Heikki Linnakangas

Csaba Nagy wrote:

On Fri, 2007-03-09 at 12:29, Heikki Linnakangas wrote:
Csaba, you mentioned recently 
(http://archives.postgresql.org/pgsql-hackers/2007-03/msg00027.php) that 
you're actually using the MVCC-violation to clean up tables during a 
backup. Can you tell us a bit more about that? Would you be upset if we 
shut that backdoor?


My use case: a queue-like table (in fact a 'task' table) which is very
frequently inserted/updated/deleted. This table tends to be bloated in
the presence of any long running transaction... the only transactional
behavior we need from this table is to make sure that when we insert
something in this table in a transaction (possibly together with other
actions) and then commit/rollback, it commits/rolls back the insert.
CLUSTER's violation of MVCC does not affect this, as CLUSTER will not be
able to lock the table if another transaction inserted something in it
(the inserting transaction will have a lock on the table). Selections on
this table are not critical for us, it just doesn't matter which job
processor is getting which task and in what order... (actually it does
matter, but CLUSTER won't affect that either).


Hmm. You could use something along these lines instead:

0. LOCK TABLE queue_table
1. SELECT * INTO queue_table_new FROM queue_table
2. DROP TABLE queue_table
3. ALTER TABLE queue_table_new RENAME TO queue_table

After all, it's not that you care about the clustering of the table, you 
just want to remove old tuples.
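
Spelled out as a single transaction, that's roughly (a sketch only; note
that indexes, defaults and permissions of the old table would have to be
recreated by hand):

BEGIN;
LOCK TABLE queue_table IN ACCESS EXCLUSIVE MODE;
-- copy the surviving rows, then swap the tables
CREATE TABLE queue_table_new AS SELECT * FROM queue_table;
DROP TABLE queue_table;
ALTER TABLE queue_table_new RENAME TO queue_table;
COMMIT;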


As a long term solution, it would be nice if we had more fine-grained 
bookkeeping of snapshots that are in use in the system. In your case, 
there's a lot of tuples that are not visible to pg_dump because xmin is 
too new, and also not visible to any other transaction because xmax is 
too old. If we had a way to recognize situations like that, and vacuum 
those tuples, much of the problem with long-running transactions would 
go away.



Wouldn't it be possible to do it like Simon (IIRC) suggested, and add a
parameter to enable/disable the current behavior, and use the MVCC
behavior as default?


I guess we could, but I don't see why we should encourage using CLUSTER
for that. A more aggressive, MVCC-breaking version of VACUUM would make
more sense to me, but I don't like the idea of adding break-MVCC flags
to any commands.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] CLUSTER and MVCC

2007-03-09 Thread Alvaro Herrera
Gregory Stark wrote:
 Csaba Nagy [EMAIL PROTECTED] writes:
 
  Wouldn't it be possible to do it like Simon (IIRC) suggested, and add a
  parameter to enable/disable the current behavior, and use the MVCC
  behavior as default?
 
 Doing it in CLUSTER would be weird. However perhaps it would be useful to have
 some sort of stand-alone tool that just bumped all the xmin/xmax's. It would
 have to be super-user-only and carry big warning labels saying it breaks MVCC.
 
 But it would be useful any time you want to exempt a particular table
 from serializable snapshots. Basically a per-table way to force a
 read-committed snapshot on. Though, actually it's not quite a
 read-committed snapshot, is it? Anyone using an old serializable snapshot
 will see what, no tuples at all?

Unless you used FrozenTransactionId ...

But I'm not really seeing the problem here.  Why isn't Csaba's problem
fixed by the fact that HOT reduces the number of dead tuples in the
first place?  If it does, then he no longer needs the CLUSTER
workaround, or at least, he needs it to a much lesser extent.

-- 
Alvaro Herrerahttp://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] CLUSTER and MVCC

2007-03-09 Thread Zeugswetter Andreas ADI SD

  Is there a particular reason why CLUSTER isn't MVCC-safe? It seems to
  me that it would be trivial to fix, by using SnapshotAny instead of
  SnapshotNow, and not overwriting the xmin/xmax with the xid of the
  cluster command.
 
 It's trivial to fix now in this way, but it would break HOT,
 since an indexscan only returns one row per index entry.

Well, with SnapshotAny, HOT should probably return all possibly visible
tuples with an indexscan. (Btw, does CLUSTER really do an index scan?
It seems that for reading a whole table a seq scan and sort is usually
cheaper, at least when the clustering is so bad that a CLUSTER is
needed.)
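
The seq-scan-and-sort variant is roughly what you could do by hand today
(a sketch; mytable and cluster_col are made up, and indexes would have to
be rebuilt on the result):

CREATE TABLE mytable_sorted AS
    SELECT * FROM mytable ORDER BY cluster_col;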

Andreas



Re: [HACKERS] CLUSTER and MVCC

2007-03-09 Thread Csaba Nagy
On Fri, 2007-03-09 at 14:00, Alvaro Herrera wrote:
 But I'm not really seeing the problem here.  Why isn't Csaba's problem
 fixed by the fact that HOT reduces the number of dead tuples in the
 first place?  If it does, then he no longer needs the CLUSTER
 workaround, or at least, he needs it to a much lesser extent.

Is this actually true in the case of HOT + long running transactions ? I
was supposing HOT has the same problems in the presence of long running
transactions...

Cheers,
Csaba.





Re: [HACKERS] CLUSTER and MVCC

2007-03-09 Thread Csaba Nagy
 Hmm. You could use something along these lines instead:
 
 0. LOCK TABLE queue_table
 1. SELECT * INTO queue_table_new FROM queue_table
 2. DROP TABLE queue_table
 3. ALTER TABLE queue_table_new RENAME TO queue_table
 
 After all, it's not that you care about the clustering of the table, you 
 just want to remove old tuples.

... and then restart the app so all my pooled connections drop their
cached plans ;-)

Seriously, that won't work. If a session tries to insert a new row after
I lock the table to clean it up, I still want it to be able to insert
after the cleanup is finished... if I drop the table it tries to insert
into, it will fail.

 As a long term solution, it would be nice if we had more fine-grained 
 bookkeeping of snapshots that are in use in the system. In your case, 
 there's a lot of tuples that are not visible to pg_dump because xmin is 
 too new, and also not visible to any other transaction because xmax is 
 too old. If we had a way to recognize situations like that, and vacuum 
 those tuples, much of the problem with long-running transactions would 
 go away.

In the general case that won't work either in a strict MVCC sense... if
you have an old transaction, you should never clean up a dead tuple
which could still be visible to it.

  Wouldn't it be possible to do it like Simon (IIRC) suggested, and add a
  parameter to enable/disable the current behavior, and use the MVCC
  behavior as default?
 
 I guess we could, but I don't see why should encourage using CLUSTER for 
 that. A more aggressive, MVCC-breaking version of VACUUM would make more 
 sense to me, but I don't like the idea of adding break-MVCC flags to 
 any commands.

Well, if there were any other way to avoid the table bloat, I would
agree.

Cheers,
Csaba.





Re: [HACKERS] Log levels for checkpoint/bgwriter monitoring

2007-03-09 Thread Greg Smith

On Fri, 9 Mar 2007, ITAGAKI Takahiro wrote:

Pinned means bufHdr->refcount > 0 and you don't distinguish pinned or
recently-used (bufHdr->usage_count > 0) buffers in your patch.


Thank you, I will revise the terminology used accordingly.  I was using 
"pinned" as a shortcut for "will be ignored by skip_pinned", which was 
sloppy of me.  As I said, I was trying to show how the buffer cache looks 
from the perspective of the background writer, and therefore lumping them 
together, because that's how SyncOneBuffer views them.  A buffer cache full 
of either type will be largely ignored by the LRU writer, and that's what 
I've been finding when running insert/update heavy workloads like pgbench.


If I might suggest a terminology change to avoid this confusion in the 
future, I'd like to rename the SyncOneBuffer skip_pinned parameter to 
something like skip_active, which is closer to the real behavior.  I 
know Oracle refers to these as hot and cold LRU entries.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [HACKERS] Auto creation of Partitions

2007-03-09 Thread Simon Riggs
On Fri, 2007-03-09 at 07:40 -0500, Luke Lonergan wrote:

 What happens to the data when the function is dropped or replaced?
 
Well, that wouldn't happen because you build in a dependency.

I'm not working on this, so don't expect lots of detail. The idea is
essentially to implement things the way SQLServer does it.

The function would need some care. It might even need a function that
writes a function.

e.g. 
CreateRangePartitionFunction('{date1,date2,date3}', funcname);
ReplaceRangePartitionFunction('{date1,date2,date3}', funcname);

which would then give a properly designed function called funcname to be
used for partitioning, which would include tests to make sure a
partition wasn't inadvertently excluded from the list of existing
partitions of any table that used it.
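
A minimal sketch of such a function-writing function in PL/pgSQL (the name
mirrors Simon's hypothetical CreateRangePartitionFunction; the generated
tests and the error handling are simplified):

CREATE OR REPLACE FUNCTION create_range_partition_function(
    bounds date[], funcname text) RETURNS void AS $$
DECLARE
    body text := '';
    n    integer := array_upper(bounds, 1);
    i    integer;
BEGIN
    -- one test per boundary: values below bounds[i] fall into partition i
    FOR i IN 1 .. n LOOP
        body := body || ' IF d < DATE ' || quote_literal(bounds[i]::text)
                     || ' THEN RETURN ' || i::text || '; END IF;';
    END LOOP;
    -- values at or above the last boundary fall into partition n + 1
    EXECUTE 'CREATE OR REPLACE FUNCTION ' || quote_ident(funcname)
         || '(d date) RETURNS integer AS $f$ BEGIN' || body
         || ' RETURN ' || (n + 1)::text
         || '; END; $f$ LANGUAGE plpgsql IMMUTABLE';
END;
$$ LANGUAGE plpgsql;

-- e.g. SELECT create_range_partition_function('{2007-01-01,2007-07-01}',
--                                             'which_partition');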

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com





[HACKERS] Interaction of PITR backups and Bulk operations avoiding WAL

2007-03-09 Thread Simon Riggs
Reviewing earlier threads, I realised that there was a potential
bug/loophole in PITR backups in conjunction with avoiding WAL for bulk
operations. This would be rare, but should be fixed.
http://archives.postgresql.org/pgsql-hackers/2006-05/msg01113.php

Say you issue COPY, CREATE INDEX etc., and while that bulk operation runs:
  pg_start_backup()
  ... base backup is taken ...
  pg_stop_backup()
...and only then does the bulk operation end.
This will result in a base backup that does not contain the data written
during the bulk operation, and the changes aren't in WAL either.

I propose to fix this by making two new calls
 bool RequestBulkCommandUseNoWAL(void)
 void ResetBulkCommandUseNoWAL(void)

so we would use it like this
 use_wal = RequestBulkCommandUseNoWAL();
and then at end of operation
 if (!use_wal)
     ResetBulkCommandUseNoWAL();

The routine would record a flag in the shmem ControlFile data that would
prevent the pg_start_backup functions from executing while a bulk
operation was in progress. It would also prevent a bulk operation from
using no WAL while a backup was in progress, as is already the case,
since the backup can only take place while archiving is enabled.

A new entry point pg_start_backup(text, bool) would allow the user to
specify whether to wait for bulk ops to finish, or not. The old entry
point would always wait, to ensure safety in all cases.

Thoughts?

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com





[HACKERS] Re: [COMMITTERS] pgsql: Remove unsafe calling of WSAStartup and WSA Cleanup from DllMain.

2007-03-09 Thread Dave Page

Magnus Hagander wrote:


In which case it can simply be because I was building against a libpq
built with MSVC => it had the broken startup code, and you used a mingw
one, which didn't have it.


Maybe - but it does imply it's potentially easy to break code with this 
change.



Per the docs, an application like slon *should* make its own call to
WSAStartup() because it uses socket functions directly (port/pipe.c)...


Another question related to backpatching - should I backpatch this to
8.1 and 8.0 as well? I know we said we more or less don't maintain the
win32 port back there because it was too new, but this is all code in
the client libpq, which has been around on win32 much longer. The reason
I'm asking is that the original reporter of this problem is on 8.1...

I'm leaning towards yes, but would like to hear further comments...


I'm far from convinced it should be backpatched at all.

/D



Re: [HACKERS] Calculated view fields (8.1 != 8.2)

2007-03-09 Thread Martijn van Oosterhout
On Fri, Mar 09, 2007 at 10:59:56AM +0100, Gaetano Mendola wrote:
 Is this really what we want? I did a migration 8.0.x => 8.2.3 and in the
 first hour of service up I had a lot of queries blocked due to this;
 consider that in my case v_ta has millions of records and usually that
 join extracts 1 row. Is there a way to disable this behaviour until I
 have checked all of my huge schema?

Most people figured it was an improvement. It's configured per function
now, which wasn't the case before. I don't think there was ever any
discussion about having a global switch.
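
For the per-function route, a minimal sketch (sp_test is the function from
Gaetano's example; this assumes your version's ALTER FUNCTION accepts
volatility attributes - otherwise re-issue CREATE OR REPLACE FUNCTION with
the STABLE keyword):

ALTER FUNCTION sp_test(integer) STABLE;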

Have a nice day,
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 From each according to his ability. To each according to his ability to 
 litigate.




Re: [HACKERS] Estimating seq_page_fetch and random_page_fetch

2007-03-09 Thread Martijn van Oosterhout
On Thu, Mar 08, 2007 at 07:01:17PM -0500, Umar Farooq Minhas wrote:
 displayed, I want cpu cost and io cost displayed separately when I run
 EXPLAIN on a particular query. Till now I haven't been able to figure
 out a 'clean' way of doing this. Can anyone tell me how much time
 I should expect to spend making such a change? And where should
 I start? costsize.c?

That's going to be a lot of work. You need to duplicate the variable
and every usage of that variable. And I can't imagine why you'd be
interested anyway...

 I have another question. Looking at the optimizer code, it pretty
 much looks insensitive to the memory factor. The only parameters
 being utilized are effective_cache_size (in estimating index
 cost only) and work_mem (for sorts, aggregation, groups, hash/merge
 joins). Are these the only memory factors that DIRECTLY affect the
 cost estimates of the planner/optimizer?

Sure, what other factors were you considering?

Have a nice day,
-- 
Martijn van Oosterhout   kleptog@svana.org   http://svana.org/kleptog/
 From each according to his ability. To each according to his ability to 
 litigate.




Re: [HACKERS] CLUSTER and MVCC

2007-03-09 Thread Heikki Linnakangas

Csaba Nagy wrote:

On Fri, 2007-03-09 at 14:00, Alvaro Herrera wrote:

But I'm not really seeing the problem here.  Why isn't Csaba's problem
fixed by the fact that HOT reduces the number of dead tuples in the
first place?  If it does, then he no longer needs the CLUSTER
workaround, or at least, he needs it to a much lesser extent.


Is this actually true in the case of HOT + long running transactions ? I
was supposing HOT has the same problems in the presence of long running
transactions...


It does, HOT won't help you here. A long-running transaction is just as 
much of a problem with HOT as without. Besides, I don't recall that 
you're doing updates in the first place.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Calculated view fields (8.1 != 8.2)

2007-03-09 Thread Tom Lane
Martijn van Oosterhout kleptog@svana.org writes:
 Most people figured it was an improvement. It's configured per function
 now, which wasn't the case before. I don't think there was ever any
 discussion about having a global switch.

Volatile functions that are not at the top level of a query are *always*
going to be a risk factor, in that you don't know quite where the
planner is going to evaluate them.  While I'm not by any means wedded to
the 8.2 no-flattening patch, it seems to me to be reasonable because it
reduces that uncertainty a bit.  The fact that Gaetano's code depended
on the uncertainty being resolved in a different direction is
unfortunate, but I think his code is really to blame, because postponing
the function eval like that couldn't be guaranteed anyway across all
queries.

regards, tom lane



Re: [HACKERS] Calculated view fields (8.1 != 8.2)

2007-03-09 Thread Florian G. Pflug

Martijn van Oosterhout wrote:

On Fri, Mar 09, 2007 at 10:59:56AM +0100, Gaetano Mendola wrote:

 Is this really what we want? I did a migration 8.0.x => 8.2.3 and in the
 first hour of service up I had a lot of queries blocked due to this;
 consider that in my case v_ta has millions of records and usually that
 join extracts 1 row. Is there a way to disable this behaviour until I
 have checked all of my huge schema?


 Most people figured it was an improvement. It's configured per function
 now, which wasn't the case before. I don't think there was ever any
 discussion about having a global switch.


If your function is already marked immutable or stable, then nothing
changes for you. If you *did* call volatile functions inside your select,
then you now get consistent behaviour. Since you don't want your function
to be evaluated in all cases, I assume that it shouldn't be marked
volatile in the first place.

I think a lot of people forget to mark their functions
volatile/stable/immutable correctly, or don't know about the implications
of these flags.

Maybe there should be a guc force_explicit_sideeffect_declaration
(defaulting to on) which makes specifying either volatile, stable or
immutable mandatory. Then people would (hopefully) read the relevant part
of the docs before creating a function, and probably get the declaration
right in the first place.

greetings, Florian Pflug



Re: [HACKERS] CLUSTER and MVCC

2007-03-09 Thread Florian G. Pflug

Heikki Linnakangas wrote:

Csaba Nagy wrote:

On Fri, 2007-03-09 at 14:00, Alvaro Herrera wrote:

But I'm not really seeing the problem here.  Why isn't Csaba's problem
fixed by the fact that HOT reduces the number of dead tuples in the
first place?  If it does, then he no longer needs the CLUSTER
workaround, or at least, he needs it to a much lesser extent.


Is this actually true in the case of HOT + long running transactions ? I
was supposing HOT has the same problems in the presence of long running
transactions...


It does, HOT won't help you here. A long-running transaction is just as 
much of a problem with HOT as without. Besides, I don't recall that 
you're doing updates in the first place.


Couldn't HOT in principle deal with this? Let's say you have two
long-running transactions, which see row versions A and D. While those
transactions are running, the row is constantly updated, leading to row
versions B, C (before the second long-running transaction started), D,
E, F, ..., Z. Now, versions B, C, E, F, ..., Z could be removed by HOT
or vacuum, because they are not currently visible, nor will they ever
become visible, because they are already deleted.

greetings, Florian Pflug



Re: [HACKERS] Calculated view fields (8.1 != 8.2)

2007-03-09 Thread Gaetano Mendola
Martijn van Oosterhout wrote:
 On Fri, Mar 09, 2007 at 10:59:56AM +0100, Gaetano Mendola wrote:
 Is this really what we want? I did a migration 8.0.x => 8.2.3 and in the
 first hour of service up I had a lot of queries blocked due to this;
 consider that in my case v_ta has millions of records and usually that
 join extracts 1 row. Is there a way to disable this behaviour until I
 have checked all of my huge schema?
 
 Most people figured it was an improvement. It's configured per function
 now, which wasn't the case before. I don't think there was ever any
 discussion about having a global switch.

Well, it's not an improvement in terms of performance but a performance
degradation in the best case, and in the worst case it can be
devastating:


create table ta ( a integer, b integer );
CREATE TABLE
create table tb ( b integer, c integer );
CREATE TABLE

CREATE OR REPLACE FUNCTION sp_delete_selected_row ( INTEGER )
RETURNS INTEGER AS'
DECLARE
   a_id ALIAS FOR $1;
BEGIN
   DELETE FROM ta where a = a_id;
   return 0;
END;
' LANGUAGE 'plpgsql';
CREATE FUNCTION

CREATE OR REPLACE VIEW v_ta AS
  SELECT
 sp_delete_selected_row(a) AS a,
 b AS b
  FROM
 ta
;
CREATE VIEW

insert into ta values (2,3);
INSERT 0 1
insert into ta values (3,4);
INSERT 0 1
insert into tb values (4,5);
INSERT 0 1

select * from v_ta join tb using (b) where c = 5;

 b | a | c
---+---+---
 4 | 0 | 5
(1 row)

select * from ta;
 a | b
---+---
(0 rows)


All rows are gone instead of only the one extracted by that query. IMHO
this is an undesired side effect.
In my case I destroyed my application statistics on how many times a
certain row was extracted.


Regards
Gaetano Mendola







Re: [HACKERS] CLUSTER and MVCC

2007-03-09 Thread Heikki Linnakangas

Csaba Nagy wrote:

Hmm. You could use something along these lines instead:

0. LOCK TABLE queue_table
1. SELECT * INTO queue_table_new FROM queue_table
2. DROP TABLE queue_table
3. ALTER TABLE queue_table_new RENAME TO queue_table

After all, it's not that you care about the clustering of the table, you 
just want to remove old tuples.


... and then restart the app so all my pooled connections drop their
cached plans ;-)


Yeah, though Tom's working on plan invalidation for 8.3, so that 
wouldn't be an issue.



Seriously, that won't work. If a session tries to insert a new row after
I lock the table to clean it up, I still want it to be able to insert
after the cleanup is finished... if I drop the table it tries to insert
into, it will fail.


Hmm. How about:

1. LOCK TABLE queue_table
2. SELECT * INTO temp_table FROM queue_table
3. TRUNCATE queue_table
4. INSERT INTO queue_table SELECT * FROM temp_table

That way you're copying the rows twice, but if there aren't many live 
tuples it shouldn't matter too much.
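
As a transaction, that second variant might look like this (a sketch; the
temp table name is made up):

BEGIN;
LOCK TABLE queue_table IN ACCESS EXCLUSIVE MODE;
-- stash the live rows, clear the bloated table, put them back
CREATE TEMP TABLE queue_live AS SELECT * FROM queue_table;
TRUNCATE queue_table;
INSERT INTO queue_table SELECT * FROM queue_live;
DROP TABLE queue_live;
COMMIT;

Unlike the drop-and-rename version, this keeps the table's OID, indexes
and permissions intact.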


As a long term solution, it would be nice if we had more fine-grained 
bookkeeping of snapshots that are in use in the system. In your case, 
there's a lot of tuples that are not visible to pg_dump because xmin is 
too new, and also not visible to any other transaction because xmax is 
too old. If we had a way to recognize situations like that, and vacuum 
those tuples, much of the problem with long-running transactions would 
go away.


In the general case that won't work either in a strict MVCC sense... if
you have an old transaction, you should never clean up a dead tuple
which could be still visible to it.


We wouldn't clean up tuples that are visible to a transaction, but if
you have one long-running transaction like pg_dump in a database with
otherwise short transactions, you'll have a lot of tuples that are not
vacuumable because of the long-running process, but are not in fact
visible to any transaction. Those are tuples that were inserted too late
to be seen by the old transaction, and deleted too long ago to be seen
by any other transaction. Let me illustrate this with a timeline:


 xmin1xmax1
 ||
-+--X-X+X-+ooXoXoXoXXo+--now
   |  |
   xmin2  xmax2

xmin1 and xmax1 are the xmin and xmax of an old, long-running 
serializable transaction, like pg_dump. The Xs between them are xids of 
transactions that the old transaction sees as in-progress, IOW the 
SnapshotData.xip-array.


xmin2 and xmax2 are the xmin and xmax of a newer transaction. Because of 
the old-running transaction, xmin2 is far behind xmax2, but there's a 
wide gap between that and the next transaction that the newer 
transaction sees as in-progress.


The current rule to determine if a tuple is dead or not is to check that 
the tuple's xmax < oldestxmin. Oldestxmin is in this case xmin1. But in 
addition to that, any tuple with xmin > xmax1 and an xmax that's not in 
the xip-array of any snapshot in use (marked with o above) isn't 
visible to any current or future transaction and can therefore be safely 
vacuumed.


The implementation problem is that we don't have a global view of all 
snapshots in the system. If we solve that, we can be more aggressive 
with vacuuming in the presence of long-running transactions. It's not an 
easy problem; we don't want to add a lot of accounting overhead, but 
maybe we could have some kind of approximation of the global state 
with little overhead that would give most of the benefit.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] CLUSTER and MVCC

2007-03-09 Thread Alvaro Herrera
Zeugswetter Andreas ADI SD wrote:
 
   Is there a particular reason why CLUSTER isn't MVCC-safe? It seems to
   me that it would be trivial to fix, by using SnapshotAny instead of
   SnapshotNow, and not overwriting the xmin/xmax with the xid of the
   cluster command.
  
  It's trivial to fix now in this way, but it would break HOT,
  since an indexscan only returns one row per index entry.
 
 Well, with SnapshotAny, HOT should probably return all possibly visible
 tuples with an indexscan. (Btw, does CLUSTER really do an index scan?
 It seems that for reading a whole table a seq scan and sort is usually
 cheaper, at least when the clustering is so bad that a CLUSTER is
 needed.)

Yes, it does an indexscan (last time I checked, at least).  I think if a
performance improvement is demonstrated, we would accept a patch ...

-- 
Alvaro Herrerahttp://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] CLUSTER and MVCC

2007-03-09 Thread Heikki Linnakangas

Florian G. Pflug wrote:
 Couldn't HOT in principle deal with this? Let's say you have two
 long-running transactions, which see row versions A and D. While those
 transactions are running, the row is constantly updated, leading to row
 versions B, C (before the second long-running transaction started), D,
 E, F, ..., Z. Now, versions B, C, E, F, ..., Z could be removed by HOT
 or vacuum, because they are not currently visible, nor will they ever
 become visible, because they are already deleted.


Yes, you could detect that but you'd need a global view of all snapshots 
in the system. I just posted a reply in this thread with more details..


It's not just with HOT, it's the way we determine that a tuple is 
vacuumable in general.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] CLUSTER and MVCC

2007-03-09 Thread Mike Rylander

On 3/9/07, Florian G. Pflug [EMAIL PROTECTED] wrote:

Heikki Linnakangas wrote:
 Csaba Nagy wrote:
 On Fri, 2007-03-09 at 14:00, Alvaro Herrera wrote:
 But I'm not really seeing the problem here.  Why isn't Csaba's problem
 fixed by the fact that HOT reduces the number of dead tuples in the
 first place?  If it does, then he no longer needs the CLUSTER
 workaround, or at least, he needs it to a much lesser extent.

 Is this actually true in the case of HOT + long running transactions ? I
 was supposing HOT has the same problems in the presence of long running
 transactions...

 It does, HOT won't help you here. A long-running transaction is just as
 much of a problem with HOT as without. Besides, I don't recall that
 you're doing updates in the first place.

Couldn't HOT in principle deal with this? Let's say you have two long-running
transactions, which see row versions A and D. While those transactions
are running, the row is constantly updated, leading to row versions B, C (before
the second long-running transaction started), D, E, F, ... Z.
Now, the versions B,C,E,F,...Z could be removed by HOT or vacuum, because they
are not currently visible, nor will they ever become visible because they are
already deleted.


Couldn't they (or at least one of them) become visible due to
a SAVEPOINT rollback?



greetings, Florian Pflug





--
Mike Rylander
[EMAIL PROTECTED]
GPLS -- PINES Development
Database Developer
http://open-ils.org



Re: [HACKERS] CLUSTER and MVCC

2007-03-09 Thread Heikki Linnakangas

Mike Rylander wrote:
 On 3/9/07, Florian G. Pflug [EMAIL PROTECTED] wrote:
  Couldn't HOT in principle deal with this? Let's say you have two
  long-running transactions, which see row versions A and D. While those
  transactions are running, the row is constantly updated, leading to row
  versions B, C (before the second long-running transaction started), D,
  E, F, ..., Z. Now, versions B, C, E, F, ..., Z could be removed by HOT
  or vacuum, because they are not currently visible, nor will they ever
  become visible, because they are already deleted.
 
 Couldn't they (or at least one of them) become visible due to
 a SAVEPOINT rollback?


You wouldn't remove tuples with an uncommitted xmax, of course.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Calculated view fields (8.1 != 8.2)

2007-03-09 Thread Florian G. Pflug

Gaetano Mendola wrote:

 [ sp_delete_selected_row example quoted in full upthread ]

 All rows are gone instead of only the one extracted by that query. IMHO
 this is an undesired side effect.
 In my case I destroyed my application statistics on how many times a
 certain row was extracted.


This is insane. Whoever creates a view like that on a production system
should *immediately* be carried away from his keyboard, to prevent further
damage. Imagine someone using View Data on this view in pgadmin.. I don't
wanna be near him when he clicks Refresh and suddenly all data is gone...


Maybe calling volatile functions in selects and views should be forbidden
entirely, except for volatile functions in the top-level select clause,
to support things like SELECT ..., nextval('seq') FROM ...

But it's probably not worth the effort - there will always be creative
ways to shoot yourself in the foot.

greetings, Florian Pflug



Re: [HACKERS] CLUSTER and MVCC

2007-03-09 Thread Tom Lane
Heikki Linnakangas [EMAIL PROTECTED] writes:
 Is there a particular reason why CLUSTER isn't MVCC-safe? It seems to me 
 that it would be trivial to fix, by using SnapshotAny instead of 
 SnapshotNow, and not overwriting the xmin/xmax with the xid of the 
 cluster command.

The reason it's not trivial is that you also have to preserve the t_ctid
links of update chains.  If you look into VACUUM FULL, a very large part
of its complexity is that it moves update chains as a unit to make that
possible.  (BTW, I believe the problem Pavan Deolasee reported yesterday
is a bug somewhere in there --- it looks to me like sometimes the same
update chain is getting copied multiple times.)

regards, tom lane



Re: [HACKERS] Re: [COMMITTERS] pgsql: Remove unsafe calling of WSAStartup and WSACleanup from DllMain.

2007-03-09 Thread Tom Lane
Dave Page [EMAIL PROTECTED] writes:
 Magnus Hagander wrote:
 Another question related to backpatching - should I backpatch this to
 8.1 and 8.0 as well?

 I'm far from convinced it should be backpatched at all.

I tend to agree with Dave --- I think this change needs to go through a
beta-testing cycle before we unleash it on the world.

regards, tom lane



Re: [HACKERS] who gets paid for this

2007-03-09 Thread Bruno Wolff III
On Thu, Mar 08, 2007 at 12:10:22 -0800,
  Christian Bird [EMAIL PROTECTED] wrote:
 
 I'm a grad student at UC Davis studying the postgres community and I
 wanted to know if someone on this list could help me out.  I'm studying
 the factors that affect people graduating from being a mailing list
 participant to a developer with write access to the repository.  Is it
 possible to find out who is being employed to work on postgres and who
 is doing it on their own time?  Some of my data points to there being
 two ways that people make the jump.  More specifically, could those
 who worked on postgres as some aspect of their job prior to getting repo
 access let me know?  Or if there are devs who know this information
 about others, I'd be really appreciative to get it.  Thanks a lot.

Si Chen from Open Source Strategies talked to a number of mailing list
contributors (which is different from code contributors) a year or two
ago. They are supposed to have a web page about this at
http://www.opensourcestrategies.com/pgsurvey/control/main
but I am getting a 500 error right now. The rest of their web pages are
working, so they may still be there. There is a contact link on their main
page which you might use to contact them and see if you can get access to
those results.



Re: [HACKERS] CLUSTER and MVCC

2007-03-09 Thread Simon Riggs
On Fri, 2007-03-09 at 12:48 -0300, Alvaro Herrera wrote:
 Zeugswetter Andreas ADI SD wrote:
  
    Is there a particular reason why CLUSTER isn't MVCC-safe? It seems to
    me that it would be trivial to fix, by using SnapshotAny instead of
    SnapshotNow, and not overwriting the xmin/xmax with the xid of the
    cluster command.
   
   It's trivial to fix now in this way, but it would break HOT, 
   since an indexscan only returns one row per index entry.
  
  Well, with SnapshotAny HOT should probably return all possibly visible
  tuples
  with an indexscan. (Btw, does CLUSTER really do an index scan ? Seems
  for reading a whole table a seq scan and sort is usually cheaper, at
  least when the clustering is so bad that a CLUSTER is needed.)
 
 Yes, it does an indexscan (last time I checked, at least).  I think if a
 performance improvement is demonstrated, we would accept a patch ...

Again, right now, most things people do here will break HOT. At this
late stage before freeze, please everybody be careful to look and plan
for patch conflicts. (That isn't "stay away", just "be careful".) Thanks.

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com





Re: [HACKERS] Interaction of PITR backups and Bulk operations avoiding WAL

2007-03-09 Thread Tom Lane
Simon Riggs [EMAIL PROTECTED] writes:
 Say you issue COPY, CREATE INDEX etc..
 pg_start_backup()
 pg_stop_backup()
 ...then bulk operation ends.
 This will result in a base backup that does not contain the data written
 during the bulk operation and the changes aren't in WAL either.

Uh, no.  The state of XLogArchivingActive() isn't affected by that.

It strikes me that allowing archive_command to be changed on the fly
might not be such a good idea though, or at least it shouldn't be
possible to flip it from empty to nonempty during live operation.

regards, tom lane



Re: [HACKERS] CLUSTER and MVCC

2007-03-09 Thread Heikki Linnakangas

Tom Lane wrote:

Heikki Linnakangas [EMAIL PROTECTED] writes:
Is there a particular reason why CLUSTER isn't MVCC-safe? It seems to me 
that it would be trivial to fix, by using SnapshotAny instead of 
SnapshotNow, and not overwriting the xmin/xmax with the xid of the 
cluster command.


The reason it's not trivial is that you also have to preserve the t_ctid
links of update chains.  If you look into VACUUM FULL, a very large part
of its complexity is that it moves update chains as a unit to make that
possible.  (BTW, I believe the problem Pavan Deolasee reported yesterday
is a bug somewhere in there --- it looks to me like sometimes the same
update chain is getting copied multiple times.)


Ah, that's it. Thanks.

The easiest solution I can think of is to skip newer versions of updated 
rows when scanning the old relation, and to fetch and copy all tuples in 
the update chain to the new relation whenever you encounter the first 
tuple in the chain.


To get a stable view of what's the first tuple in chain, you need to get 
the oldest xmin once at the beginning, and use that throughout the 
operation. Since we take an exclusive lock on the table, no-one can 
insert new updated tuples during the operation, and all updaters are 
finished before the lock is granted.


Those tuples wouldn't be in the cluster order, though, but that's not a 
big deal.
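
To make that concrete, here is a toy, self-contained C model of the
chain-copying idea (all structures are invented for illustration, with a
plain integer index standing in for t_ctid; this is not PostgreSQL source):

/*
 * A tuple here is just xmin/xmax plus an integer "ctid" index of its
 * next version (-1 = end of chain).  A tuple is a chain head if no
 * other tuple's ctid points at it; chains are copied as a unit
 * starting from their heads, preserving the links.
 */
#include <stdbool.h>
#include <stdio.h>

typedef struct
{
    int xmin;
    int xmax;   /* 0 = not deleted/updated */
    int ctid;   /* index of the next version, or -1 */
} Tuple;

#define NTUP 4

int
main(void)
{
    /* one update chain 0 -> 2 -> 3, plus an unrelated tuple 1 */
    Tuple old[NTUP] = {
        { 100, 105,  2 },
        { 101,   0, -1 },
        { 105, 110,  3 },
        { 110,   0, -1 },
    };
    bool is_chain_member[NTUP] = { false };

    /* mark every tuple that some other tuple links to */
    for (int i = 0; i < NTUP; i++)
        if (old[i].ctid >= 0)
            is_chain_member[old[i].ctid] = true;

    for (int i = 0; i < NTUP; i++)
    {
        if (is_chain_member[i])
            continue;       /* will be copied when its chain head is met */
        for (int t = i; t >= 0; t = old[t].ctid)
            printf("copy tuple %d (xmin %d, xmax %d), keeping its link\n",
                   t, old[t].xmin, old[t].xmax);
    }
    return 0;
}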


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Interaction of PITR backups and Bulk operations avoiding WAL

2007-03-09 Thread Simon Riggs
On Fri, 2007-03-09 at 11:15 -0500, Tom Lane wrote:
 Simon Riggs [EMAIL PROTECTED] writes:
  Say you issue COPY, CREATE INDEX etc..
  pg_start_backup()
  pg_stop_backup()
  ...then bulk operation ends.
  This will result in a base backup that does not contain the data written
  during the bulk operation and the changes aren't in WAL either.
 
 Uh, no.  The state of XLogArchivingActive() isn't affected by that.

Sorry, error case should have been

Say you issue COPY, CREATE INDEX etc..
set archive_command
pg_ctl reload 
pg_start_backup()
pg_stop_backup()
...then bulk operation ends.

 It strikes me that allowing archive_command to be changed on the fly
 might not be such a good idea though, or at least it shouldn't be
 possible to flip it from empty to nonempty during live operation.

As long as we allow it to be turned on/off during normal operation then
there is a current window of error.

I'd rather fix it the proposed way than force a restart. ISTM wrong to
have an availability feature cause downtime.

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com





Re: [HACKERS] CLUSTER and MVCC

2007-03-09 Thread Tom Lane
Heikki Linnakangas [EMAIL PROTECTED] writes:
 We wouldn't clean up tuples that are visible to a transaction, but if 
 you have one long-running transaction like pg_dump in a database with 
 otherwise short transaction, you'll have a lot of tuples that are not 
 vacuumable because of the long-running process, but are not in fact 
 visible to any transaction.

It sounds to me like you are proposing to remove the middles of update
chains, which would break READ-COMMITTED updates initiated by the older
transactions.  Now admittedly pg_dump isn't going to issue any such
updates, but VACUUM doesn't know that.

regards, tom lane



Re: [HACKERS] CLUSTER and MVCC

2007-03-09 Thread Heikki Linnakangas

Tom Lane wrote:

Heikki Linnakangas [EMAIL PROTECTED] writes:
We wouldn't clean up tuples that are visible to a transaction, but if 
you have one long-running transaction like pg_dump in a database with 
otherwise short transaction, you'll have a lot of tuples that are not 
vacuumable because of the long-running process, but are not in fact 
visible to any transaction.


It sounds to me like you are proposing to remove the middles of update
chains, which would break READ-COMMITTED updates initiated by the older
transactions.  Now admittedly pg_dump isn't going to issue any such
updates, but VACUUM doesn't know that.


I was thinking of inserts+deletes. Updates are harder, you'd need to 
change the ctid of the old version to skip the middle part of the chain, 
atomically, but I suppose they could be handled as well.


Isolation level doesn't really matter. We just need a global view of 
in-use *snapshots* in the system, serializable or not. Not that that's 
an easy thing to do...


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] CLUSTER and MVCC

2007-03-09 Thread Florian G. Pflug

Heikki Linnakangas wrote:
The implementation problem is that we don't have a global view of all 
snapshots in the system. If we solve that, we can be more aggressive 
with vacuuming in presence of long-running transactions. It's not an 
easy problem, we don't want to add a lot of accounting overhead, but 
maybe we could have some kind of an approximation of the global state 
with little overhead, that would give most of the benefit.


Hm... Maybe there could be a fixed-size list of xids together with
a usecount in shared memory.
If a transaction puts an xid into its snapshot, it increments the
usecount of that xid in the global list (inserting it if it's not
already in the list). If there is no free space in the list, it
first removes all xids with xid < oldestxmin. If there is still no
free space, it does nothing.
When the transaction is done with the snapshot, it decrements all
the usecounts of xids it incremented before.

You then know that an xid is *not* viewed as in-progress by any
transaction if the xid is in that list and has a usecount of zero.
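
As a minimal sketch of that bookkeeping (hypothetical names throughout; a
real version would keep the array in shared memory and lock around each
operation):

#include <stdint.h>
#include <stdio.h>

typedef uint32_t TransactionId;

#define XID_LIST_SIZE 64

typedef struct
{
    TransactionId xid;
    int           usecount;  /* number of snapshots referencing this xid */
    int           valid;
} XidUseEntry;

static XidUseEntry xid_list[XID_LIST_SIZE];

/* Entries below oldestxmin cannot be in-progress for anyone: drop them. */
static void
xid_list_prune(TransactionId oldestxmin)
{
    for (int i = 0; i < XID_LIST_SIZE; i++)
        if (xid_list[i].valid && xid_list[i].xid < oldestxmin)
            xid_list[i].valid = 0;
}

/* A snapshot includes xid: bump its usecount; may do nothing if full. */
static void
xid_list_pin(TransactionId xid, TransactionId oldestxmin)
{
    int free_slot = -1;

    for (int i = 0; i < XID_LIST_SIZE; i++)
    {
        if (xid_list[i].valid && xid_list[i].xid == xid)
        {
            xid_list[i].usecount++;
            return;
        }
        if (!xid_list[i].valid)
            free_slot = i;
    }
    if (free_slot < 0)
    {
        xid_list_prune(oldestxmin);             /* try to make room */
        for (int i = 0; i < XID_LIST_SIZE; i++)
            if (!xid_list[i].valid)
            {
                free_slot = i;
                break;
            }
    }
    if (free_slot >= 0)      /* otherwise the list is full: do nothing */
    {
        xid_list[free_slot].xid = xid;
        xid_list[free_slot].usecount = 1;
        xid_list[free_slot].valid = 1;
    }
}

/* The snapshot goes away: decrement what we incremented. */
static void
xid_list_unpin(TransactionId xid)
{
    for (int i = 0; i < XID_LIST_SIZE; i++)
        if (xid_list[i].valid && xid_list[i].xid == xid)
        {
            xid_list[i].usecount--;
            return;
        }
}

int
main(void)
{
    xid_list_pin(100, 90);
    xid_list_pin(100, 90);
    xid_list_pin(123, 90);
    xid_list_unpin(100);

    for (int i = 0; i < XID_LIST_SIZE; i++)
        if (xid_list[i].valid)
            printf("xid %u: usecount %d\n",
                   (unsigned) xid_list[i].xid, xid_list[i].usecount);
    return 0;
}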

greetings, Florian Pflug



Re: [HACKERS] Interaction of PITR backups and Bulk operations avoiding WAL

2007-03-09 Thread Tom Lane
Simon Riggs [EMAIL PROTECTED] writes:
 On Fri, 2007-03-09 at 11:15 -0500, Tom Lane wrote:
 It strikes me that allowing archive_command to be changed on the fly
 might not be such a good idea though, or at least it shouldn't be
 possible to flip it from empty to nonempty during live operation.

 I'd rather fix it the proposed way than force a restart. ISTM wrong to
 have an availability feature cause downtime.

I don't think that people are very likely to need to turn archiving on
and off on-the-fly.  Your proposed solution introduces a great deal of
complexity (and risk of future bugs-of-omission, to say nothing of race
conditions) to solve a non-problem.  We have better things to be doing
with our development time.

regards, tom lane



Re: [HACKERS] CLUSTER and MVCC

2007-03-09 Thread Florian G. Pflug

Tom Lane wrote:

Heikki Linnakangas [EMAIL PROTECTED] writes:
We wouldn't clean up tuples that are visible to a transaction, but if 
you have one long-running transaction like pg_dump in a database with 
otherwise short transaction, you'll have a lot of tuples that are not 
vacuumable because of the long-running process, but are not in fact 
visible to any transaction.


It sounds to me like you are proposing to remove the middles of update
chains, which would break READ-COMMITTED updates initiated by the older
transactions.  Now admittedly pg_dump isn't going to issue any such
updates, but VACUUM doesn't know that.


You could restrict this to serializable transactions, or even to
read-only transactions. Or maybe the tuple could be reduced to
just its header - doesn't HOT do something similar?

greetings, Florian Pflug




Re: [HACKERS] Interaction of PITR backups and Bulk operations avoiding WAL

2007-03-09 Thread Simon Riggs
On Fri, 2007-03-09 at 11:47 -0500, Tom Lane wrote:
 Simon Riggs [EMAIL PROTECTED] writes:
  On Fri, 2007-03-09 at 11:15 -0500, Tom Lane wrote:
  It strikes me that allowing archive_command to be changed on the fly
  might not be such a good idea though, or at least it shouldn't be
  possible to flip it from empty to nonempty during live operation.
 
  I'd rather fix it the proposed way than force a restart. ISTM wrong to
  have an availability feature cause downtime.
 
 I don't think that people are very likely to need to turn archiving on
 and off on-the-fly.  Your proposed solution introduces a great deal of
 complexity (and risk of future bugs-of-omission, to say nothing of race
 conditions) to solve a non-problem.  We have better things to be doing
 with our development time.

It's certainly a quicker fix. Unless others object, I'll set
archive_command to only be changeable at server startup.

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com





Re: [HACKERS] Interaction of PITR backups and Bulk operations avoiding WAL

2007-03-09 Thread Andreas Pflug
Tom Lane wrote:
 Simon Riggs [EMAIL PROTECTED] writes:
 On Fri, 2007-03-09 at 11:15 -0500, Tom Lane wrote:
 It strikes me that allowing archive_command to be changed on the fly
 might not be such a good idea though, or at least it shouldn't be
 possible to flip it from empty to nonempty during live operation.

 I'd rather fix it the proposed way than force a restart. ISTM wrong to
 have an availability feature cause downtime.

 I don't think that people are very likely to need to turn archiving on
 and off on-the-fly.  Your proposed solution introduces a great deal of
 complexity (and risk of future bugs-of-omission, to say nothing of race
 conditions) to solve a non-problem.  We have better things to be doing
 with our development time.

So how does one do a file-based backup without permanent archiving? If
pg_start_backup would turn on archiving temporarily, forcing the archiving
of all WAL files that contain open transactions, this would be possible.
This is what's requested for sites where PITR isn't needed, just
filesystem-level backup. Currently, this can be mimicked somewhat by
turning on archiving on-the-fly and hoping that all transactions are in
the WAL archive when pg_start_backup is issued (Simon's mail shows how
this will fail).

Regards,
Andreas




Re: [HACKERS] Interaction of PITR backups and Bulk operations avoiding WAL

2007-03-09 Thread Csaba Nagy
On Fri, 2007-03-09 at 17:47, Tom Lane wrote:
 I don't think that people are very likely to need to turn archiving on
 and off on-the-fly.

We did occasionally need to turn archiving on on-the-fly. It happened
that I started up a new DB machine before the log archive was available,
so I had to postpone configuring it; the machine went on-line before the
archive machine was ready, and later I had to switch on archiving. It was
very convenient that I could do it without a restart.

It's true that this has been a rare occasion; more often you just need to
change the archive command (e.g. to archive to a different location if
the archive repository goes down).

It's somewhat moot for us as we changed to use Slony (which is a heavy
beast but once it works it's great).

Cheers,
Csaba.






Re: [HACKERS] Estimating seq_page_fetch and random_page_fetch

2007-03-09 Thread Grzegorz Jaskiewicz

It would be interesting to see some code here.
Maybe this would be a great opportunity to start some sort of
'auto-tune' (as an option) in this area.



--
GJ

C/C++/SQL freelance to hire.



Re: [HACKERS] Bug: Buffer cache is not scan resistant

2007-03-09 Thread Luke Lonergan
Cool!

- Luke

Msg is shrt cuz m on ma treo

 -Original Message-
From:   Simon Riggs [mailto:[EMAIL PROTECTED]
Sent:   Friday, March 09, 2007 02:32 PM Eastern Standard Time
To: Luke Lonergan; ITAGAKI Takahiro
Cc: Sherry Moore; Tom Lane; Mark Kirkwood; Pavan Deolasee; Gavin Sherry; 
PGSQL Hackers; Doug Rady
Subject:Re: [HACKERS] Bug: Buffer cache is not scan resistant

On Tue, 2007-03-06 at 22:32 -0500, Luke Lonergan wrote:
 Incidentally, we tried triggering NTA (L2 cache bypass)
 unconditionally and in various patterns and did not see the
 substantial gain as with reducing the working set size.
 
 My conclusion: Fixing the OS is not sufficient to alleviate the issue.
 We see a 2x penalty (1700MB/s versus 3500MB/s) at the higher data
 rates due to this effect.
 
I've implemented buffer recycling, as previously described, patch being
posted now to -patches as scan_recycle_buffers.

This version includes buffer recycling

- for SeqScans larger than shared buffers, with the objective of
improving L2 cache efficiency *and* reducing the effects of shared
buffer cache spoiling (both as previously discussed on this thread)

- for VACUUMs of any size, with the objective of reducing WAL thrashing
whilst keeping VACUUM's behaviour of not spoiling the buffer cache (as
originally suggested by Itagaki-san, just with a different
implementation).

Behaviour is not activated by default in this patch. To request buffer
recycling, set the USERSET GUC
SET scan_recycle_buffers = N
tested with 1,4,8,16, but only  8 seems sensible, IMHO.

The patch affects StrategyGetBuffer, so it only affects the disk-cache path.
The idea is that if the page is already in the shared buffer cache then we
get substantial benefit already and nothing else is needed. So for the
general case, the patch adds a single if-test into the I/O path.

The parameter is picked up at the start of SeqScan and VACUUM
(currently). Any change mid-scan will be ignored.

IMHO it's possible to do this and to allow Synch Scans at the same time,
with some thought. There is no need for us to rely on cache spoiling
behaviour of scans to implement that feature as well.

Independent performance tests requested, so that we can discuss this
objectively.

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com





Re: [HACKERS] who gets paid for this

2007-03-09 Thread Lukas Kahwe Smith

Andrew Dunstan wrote:

Tom Lane wrote:


Even more to the point, "getting paid for" has almost nothing to do
with "has commit privileges".  At least on this project.

  


Darn. So the cheque isn't really in the mail?


I think his question was just what fraction of developers work on 
PostgreSQL on company time.


regards,
Lukas



Re: [HACKERS] Log levels for checkpoint/bgwriter monitoring

2007-03-09 Thread Jim Nasby

On Mar 8, 2007, at 11:51 PM, Greg Smith wrote:
almost everything that's dirty is also pinned during pgbench, and  
the LRU is lucky to find anything it can write as a result


I'm wondering if pg_bench is a good test of this stuff. ISTM it's  
unrealistically write-heavy, which is going to tend to not only put a  
lot of dirty buffers into the pool, but also keep them pinned enough  
that you can't write them.


Perhaps you should either modify pg_bench to do a lot more selects  
out of the various tables or look towards a different benchmark.

--
Jim Nasby[EMAIL PROTECTED]
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)





Re: [HACKERS] Log levels for checkpoint/bgwriter monitoring

2007-03-09 Thread Jim Nasby

On Mar 9, 2007, at 7:57 AM, Greg Smith wrote:

On Fri, 9 Mar 2007, ITAGAKI Takahiro wrote:

Pinned means bufHdr->refcount > 0, and you don't distinguish
pinned from recently-used (bufHdr->usage_count > 0) buffers in your
patch.


Thank you, I will revise the terminology used accordingly.  I was  
using pinned as a shortcut for will be ignored by skip_pinned  
which was sloppy of me.  As I said, I was trying to show how the  
buffer cache looks from the perspective of the background writer,  
and therefore lumping them together because that's how  
SyncOneBuffer views them.  A buffer cache full of either type will  
be largely ignored by the LRU writer, and that's what I've been  
finding when running insert/update heavy workloads like pgbench.


If I might suggest a terminology change to avoid this confusion in  
the future, I'd like to rename the SyncOneBuffer skip_pinned  
parameter to something like skip_active, which is closer to the  
real behavior.  I know Oracle refers to these as hot and cold  
LRU entries.


Well, AIUI, whether the buffer is actually pinned or not is almost
inconsequential (other than that if a buffer *is* pinned then its usage
count is about to become > 0, so there's no reason to consider
writing it).


What that parameter really does is control whether you're going to  
follow the LRU semantics or not...

--
Jim Nasby[EMAIL PROTECTED]
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)





Re: [HACKERS] Bug in VACUUM FULL ?

2007-03-09 Thread Tom Lane
Pavan Deolasee [EMAIL PROTECTED] writes:
 The problem mentioned before is hard to reproduce with the
 suggested change, but its not completely gone away. I have
 seen that again on CVS HEAD with the patch applied.
 I am facing another issue with VACUUM FULL. This
 problem gets reproduced with HOT very easily, but takes
 few attempts to reproduce with CVS HEAD, but it
 certainly exists.

I've been banging away on this since yesterday, and I think I've
achieved a full understanding of what's going on.  There are three or
four different-looking pathologies but they all seem to arise from
the same problem: the update-chain-moving code assumes that
RECENTLY_DEAD tuples will never have update successors that are entirely
DEAD (according to HeapTupleSatisfiesVacuum).  When faced with an
update chain in which that does happen, it can move the chain multiple
times, neglect to remove index entries for tuples that get truncated
away, crash on an Assert, or other interesting stuff.  Here's an example
from some debug printouts I inserted into repair_frag:

chain forward branches 17/174 to 17/183 (x 1993046 1993057) RECENTLY_DEAD
chain forward branches 17/183 to 15/109 (x 1993057 1993055) RECENTLY_DEAD
chain forward branches 15/109 to 15/111 (x 1993055 1993045) DEAD
chain forward branches 15/111 to 15/114 (x 1993045 1993025) DEAD
chain forward branches 15/114 to 15/116 (x 1993025 1993096) RECENTLY_DEAD
chain forward branches 15/116 to 15/119 (x 1993096 1993107) RECENTLY_DEAD
chain forward branches 15/119 to 15/121 (x 1993107 1993120) RECENTLY_DEAD
chain forward branches 15/121 to 15/125 (x 1993120 1993121) RECENTLY_DEAD
chain forward branches 15/125 to 15/128 (x 1993121 1993122) RECENTLY_DEAD
chain forward branches 15/128 to 15/131 (x 1993122 1993092) RECENTLY_DEAD
chain forward branches 15/131 to 15/133 (x 1993092 1993145) RECENTLY_DEAD
chain forward branches 15/133 to 15/139 (x 1993145 1993182) RECENTLY_DEAD
chain forward branches 15/139 to 15/141 (x 1993182 1993183) RECENTLY_DEAD
chain forward branches 15/141 to 15/147 (x 1993183 1993155) RECENTLY_DEAD
chain forward branches 15/147 to 15/150 (x 1993155 1993167) LIVE
chain back stops at branches 15/114: xmin 1993025 < 1993050
moved branches 15/150 to 0/69; next 0/69
moved branches 15/147 to 0/70; next 0/69
moved branches 15/141 to 0/71; next 0/70
moved branches 15/139 to 0/72; next 0/71
moved branches 15/133 to 0/73; next 0/72
moved branches 15/131 to 0/74; next 0/73
moved branches 15/128 to 0/75; next 0/74
moved branches 15/125 to 0/76; next 0/75
moved branches 15/121 to 0/77; next 0/76
moved branches 15/119 to 0/78; next 0/77
moved branches 15/116 to 0/79; next 0/78
moved branches 15/114 to 0/80; next 0/79

Since TIDs 17/174 and 17/183 didn't get moved, when the repair_frag
search arrives at 17/183 it will copy this chain again, leading to
duplicate copies of the LIVE tuple at the chain end, leading to trouble.

It's not surprising that tuples could have xmax less than xmin, since
transactions can commit in orders different than they start; when using
READ COMMITTED updates it's not at all surprising that a transaction
might update rows after a later-numbered transaction does.  However, in
looking at this code previously I'd assumed that the OldestXmin cutoff
could never fall between two such transactions, and so the above
scenario wouldn't happen.  I'm not real sure why I thought that.
For the cases that VACUUM FULL is interested in, both XIDs mentioned
in a DEAD tuple must have committed before OldestXmin was computed, but
there doesn't seem to be a compelling reason why OldestXmin might not
have been determined by an unrelated third transaction with a number
between those two.

I believe it's the case that any update chain members appearing before
a DEAD entry must in fact also be dead (ie, not listed as live in any
active snapshot) but a test based on OldestXmin hasn't got enough
resolution to prove that they are dead.

Does anyone want to argue that this is an error in the calculation of
OldestXmin (and if so, how would you propose to fix it)?  If not, I'll
set to work on fixing the chain-moving logic.  I think the correct
behavior is that we shouldn't consider DEAD tuples part of a movable
chain, and so in the above example there are two separate chains to move
not one.  Alternatively we could try to recognize that the older part of
the chain is really dead and removable, but that seems complicated and
likely to introduce new bugs.

I wonder whether this has any implications for HOT ...

regards, tom lane



Re: [HACKERS] Auto creation of Partitions

2007-03-09 Thread Jim Nasby

On Mar 9, 2007, at 3:31 AM, Zeugswetter Andreas ADI SD wrote:

Since partition is inheritance-based, a simple DROP or  NO

INHERIT

will do the job to deal with the partition. Do we want to reinvent
additional syntax when these are around and are documented?


Well, if the syntax for adding a new partition eventually
ends up as ALTER TABLE ADD PARTITION, then it would make more
sense that you remove a partition via ALTER TABLE DROP PARTITION.


But DROP PARTITION usually moves the data from this partition to other
partitions,
so it is something different.


It does? IIRC, in every partitioning system I've seen, DROP PARTITION
drops the data as well. It's up to you to move it somewhere else if
you want to keep it.

--
Jim Nasby[EMAIL PROTECTED]
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)





Re: [HACKERS] Auto creation of Partitions

2007-03-09 Thread Hannu Krosing
On Friday, 2007-03-09 at 15:41, Jim Nasby wrote:
 On Mar 9, 2007, at 3:31 AM, Zeugswetter Andreas ADI SD wrote:
  Since partition is inheritance-based, a simple DROP or  NO
  INHERIT
  will do the job to deal with the partition. Do we want to reinvent
  additional syntax when these are around and are documented?
 
  Well, if the syntax for adding a new partition eventually
  ends up as ALTER TABLE ADD PARTITION, then it would make more
  sense that you remove a partition via ALTER TABLE DROP PARTITION.
 
  But DROP PARTITION usually moves the data from this partition to other
  partitions,
  so it is something different.
 
 It does? IIRC every partitioning system I've seen DROP PARTITION  
 drops the data as well. It's up to you to move it somewhere else if  
 you want to keep it.

Will this proposed DROP PARTITION just disassociate the table from the
master, or will it actually drop the partition's table from the whole
database?

-- 

Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me:  callto:hkrosing
Get Skype for free:  http://www.skype.com





Re: [HACKERS] who gets paid for this

2007-03-09 Thread Joshua D. Drake
Christian Bird wrote:
 Hi all,
 
 I'm a grad student at UC Davis studying the postgres community and I
 wanted to know if someone on this list could help me out.  I'm studying
 the factors that affect people graduating from being a mailing list
 participant to a developer with write access to the repository. 

It is done on a meritocratic basis; whether the person is paid to work
on PostgreSQL or not has no bearing.

I believe (would need verification) that it is -core who decides who
gets actual commit privileges.

Joshua D. Drake



-- 

  === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive  PostgreSQL solutions since 1997
 http://www.commandprompt.com/

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
PostgreSQL Replication: http://www.commandprompt.com/products/




Re: [HACKERS] Bug in VACUUM FULL ?

2007-03-09 Thread Simon Riggs
On Fri, 2007-03-09 at 16:40 -0500, Tom Lane wrote:
 I wonder whether this has any implications for HOT ...

My general feeling, expressed in a number of recent posts was that the
VACUUM FULL code really isn't worth the trouble it causes. Especially
when CLUSTER does a better job anyway?

I've proposed a number of different proposals for changing VACUUM FULL,
and Hannu posted some really cool ideas.

Please can we spend time doing something useful, rather than trying to
fix up a bag of worms that nobody ever runs? C'mon guys, this isn't a
challenge, it's a lost cause. I don't really mean to be radical, but I
just think VACUUM FULL's time has come. A better utility could be
written in the time it takes to fix and be certain of a fix.

Yes, we need a utility that compacts a table, but isn't there a faster,
safer way of doing that than the current VACUUM FULL algorithm and code?
We can still *call* it VACUUM FULL. Modular replacement has been done
numerous times over the years with great success, e.g. tuplesort, index
build... let's do the same thing now and kiss goodbye to some code whose
time has come. 

Put it another way: if anybody submitted a patch that does what VACUUM
FULL does, coded in the way it is, it would never be applied, now.

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com





[HACKERS] autovacuum next steps, take 3

2007-03-09 Thread Alvaro Herrera
Here is a low-level, very detailed description of the implementation of
the autovacuum ideas we have so far.

launcher's dealing with databases
-

We'll add a new member nexttime to the autovac_dbase struct, which
will be the time_t of the next time a worker needs to process that DB.
Initially, those times will be 0 for all databases.  The launcher will
keep that list in memory, and on each iteration it will fetch the entry
that has the earliest time, and sleep until that time.  When it awakens,
it will start a worker on that database and set the nexttime to
now+naptime.

The list will be a Dllist so that it's easy to keep it sorted by
increasing time and picking the head of the list each time, and then
putting that node as a new tail.

Every so often, the launcher will call autovac_get_database_list
and compare that list with the list it has on memory.  If a new database
is in the list, it will assign a nexttime between the current instant
and the time of the head of the Dllist.  Then it'll put it as the new
head.  The new database will thus be put as the next database to be
processed.

When a node with nexttime=0 is found, the amount of time to sleep will
be determined as Min(naptime/num_elements, 1), so that initially
databases will be distributed roughly evenly in the naptime interval.

When a nexttime in the past is detected, the launcher will start a
worker either right away or as soon as possible (read below).
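
A toy C model of that launcher loop may help; it is under stated
assumptions: no shared memory or Dllist, a constant naptime, and the
Min() formula above read as a one-second floor on naptime/num_elements:

#include <stdio.h>
#include <time.h>

#define NAPTIME 60                      /* autovacuum_naptime, seconds */
#define NDB     4

typedef struct
{
    unsigned int dboid;
    time_t       nexttime;              /* 0 = never scheduled yet */
} avdb;

static avdb dbs[NDB] = { {16384, 0}, {16385, 0}, {16386, 0}, {16387, 0} };

int
main(void)
{
    for (int iter = 0; iter < 8; iter++)
    {
        time_t now = time(NULL);
        int    earliest = 0;

        /* pick the entry with the earliest nexttime (head of the list) */
        for (int i = 1; i < NDB; i++)
            if (dbs[i].nexttime < dbs[earliest].nexttime)
                earliest = i;

        if (dbs[earliest].nexttime == 0)
        {
            /* spread unscheduled databases across the naptime interval */
            int sleeptime = NAPTIME / NDB;

            if (sleeptime < 1)
                sleeptime = 1;
            printf("db %u not yet scheduled: sleep %d s, then start worker\n",
                   dbs[earliest].dboid, sleeptime);
        }
        else if (dbs[earliest].nexttime <= now)
            printf("db %u is overdue: start a worker now\n",
                   dbs[earliest].dboid);
        else
            printf("sleep until %ld, then start a worker on db %u\n",
                   (long) dbs[earliest].nexttime, dbs[earliest].dboid);

        /* pretend the worker was launched; move node to the list tail */
        dbs[earliest].nexttime = now + NAPTIME;
    }
    return 0;
}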


launcher and worker interactions


The launcher PID will be in shared memory, so that workers can signal
it.  We will also keep worker information in shared memory as an array
of WorkerInfo structs:

typedef struct
{
    Oid     wi_dboid;
    Oid     wi_tableoid;
    int     wi_workerpid;
    bool    wi_finished;
} WorkerInfo;

We will use SIGUSR1 to communicate between workers and launcher.  When
the launcher wants to start a worker, it sets the dboid field and
signals the postmaster.  Then goes back to sleep.  When a worker has
started up and is about to start vacuuming, it will store its PID in
workerpid, and then send a SIGUSR1 to the launcher.  If the schedule
says that there's no need to run a new worker, the launcher will go back
to sleeping.

We cannot call SendPostmasterSignal a second time just after calling it;
the second call would be lost.  So it is important that the launcher
does not try to start a worker until there's no worker starting.  So if
the launcher wakes up for any reason and detects that there is a
WorkerInfo entry with valid dboid but workerpid is zero, it will go back
to sleep.  Since the starting worker will send a signal as soon as it
finishes starting up, the launcher will wake up, detect this condition
and then it can start a second worker.

Also, the launcher cannot start new workers when there are
autovacuum_max_workers already running.  So if there are that many when
it wakes up, it cannot do anything else but go back to sleep again.
When one of those workers finishes, it will wake the launcher by setting
the finished flag on its WorkerInfo, and sending SIGUSR1 to the
launcher.  The launcher then wakes up, resets the WorkerInfo struct, and
can start another worker if needed.

There is an additional problem if, for some reason, a worker starts and
is not able to finish its task correctly.  It will not be able to set
its finished flag, so the launcher will believe that it's still starting
up.  To prevent this problem, we check the PGPROCs of worker processes,
and clean them up if we find they are not actually running (or the PIDs
correspond to processes that are not autovacuum workers).  We only do it
if all WorkerInfo structures are in use, thus frequently enough so that
this problem doesn't cause any starvation, but seldom enough so that
it's not a performance hit.
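
A single-process mock of that handshake follows; plain function calls
stand in for SIGUSR1 and SendPostmasterSignal (marked in comments), and
everything except the WorkerInfo fields is invented:

#include <stdbool.h>
#include <stdio.h>

typedef unsigned int Oid;

typedef struct
{
    Oid  wi_dboid;       /* set by launcher to request a worker */
    Oid  wi_tableoid;
    int  wi_workerpid;   /* set by the worker once it has started */
    bool wi_finished;    /* set by the worker when it is done */
} WorkerInfo;

#define MAX_WORKERS 3
static WorkerInfo workers[MAX_WORKERS];

/*
 * Launcher side: returns false either if some worker is still starting
 * (dboid set but workerpid still zero) or if all slots are busy.
 */
static bool
launcher_start_worker(Oid dboid)
{
    for (int i = 0; i < MAX_WORKERS; i++)
        if (workers[i].wi_dboid != 0 && workers[i].wi_workerpid == 0)
            return false;            /* a worker is starting: back to sleep */

    for (int i = 0; i < MAX_WORKERS; i++)
        if (workers[i].wi_dboid == 0)
        {
            workers[i].wi_dboid = dboid;
            /* here: SendPostmasterSignal() so the postmaster forks one */
            return true;
        }
    return false;                    /* autovacuum_max_workers reached */
}

/* Worker side, just after startup. */
static void
worker_started(int slot, int pid)
{
    workers[slot].wi_workerpid = pid;
    /* here: SIGUSR1 to the launcher, so it may start the next worker */
}

/* Worker side, at exit; the launcher resets the slot when it sees this. */
static void
worker_finished(int slot)
{
    workers[slot].wi_finished = true;
    /* here: SIGUSR1 to the launcher */
}

int
main(void)
{
    printf("first request accepted: %d\n", launcher_start_worker(16384));
    printf("request while starting: %d\n", launcher_start_worker(16385));
    worker_started(0, 4242);
    printf("request after startup signal: %d\n", launcher_start_worker(16385));
    worker_finished(0);
    return 0;
}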


worker to-do list
-

When each worker starts, it determines which tables to process in the
usual fashion: get pg_autovacuum and pgstat data and compute the
equations.

The worker then takes a snapshot of what's currently going on in the
database, by storing worker PIDs, the corresponding table OID that's
being currently worked, and the to-do list for each worker.

It removes from its to-do list the tables being processed.  Finally, it
writes the list to disk.

The table list will be written to a file in
PGDATA/vacuum/database-oid/todo.worker-pid
The file will consist of table OIDs, in the order in which they are
going to be vacuumed.

At this point, vacuuming can begin.

Before processing each table, it scans the WorkerInfos to see if there's
a new worker, in which case it reads its to-do list to memory.

Then it again fetches the tables being processed by other workers in the
same database, and for each other worker, removes from its own in-memory
to-do all those tables mentioned in the other lists that appear earlier
than the current table being processed (inclusive).  Then it picks the
next non-removed table in the list.  All of this must be done with the
Autovacuum LWLock grabbed in exclusive mode, so that no other worker can
pick the same table (no IO takes place here, because the whole lists
were saved in memory at the start.)

Re: [HACKERS] Bug in VACUUM FULL ?

2007-03-09 Thread Tom Lane
Simon Riggs [EMAIL PROTECTED] writes:
 On Fri, 2007-03-09 at 16:40 -0500, Tom Lane wrote:
 I wonder whether this has any implications for HOT ...

 My general feeling, expressed in a number of recent posts was that the
 VACUUM FULL code really isn't worth the trouble it causes. Especially
 when CLUSTER does a better job anyway?

Point A: we have to fix the back branches anyway.
Point B: until we have an MVCC-safe CLUSTER, that is not a substitute.

regards, tom lane



Re: [HACKERS] Bug in VACUUM FULL ?

2007-03-09 Thread Joshua D. Drake

 Put it another way: if anybody submitted a patch that does what VACUUM
 FULL does, coded in the way it is, it would never be applied, now.

Have an opinion do we? How about we just alias VACUUM FULL to cluster
and add the reporting stuff from VERBOSE?

Joshua D. Drake




-- 

  === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive  PostgreSQL solutions since 1997
 http://www.commandprompt.com/

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
PostgreSQL Replication: http://www.commandprompt.com/products/




Re: [HACKERS] Log levels for checkpoint/bgwriter monitoring

2007-03-09 Thread Tom Lane
Jim Nasby [EMAIL PROTECTED] writes:
 On Mar 8, 2007, at 11:51 PM, Greg Smith wrote:
 almost everything that's dirty is also pinned during pgbench, and  
 the LRU is lucky to find anything it can write as a result

 I'm wondering if pg_bench is a good test of this stuff.

On reflection I think that Greg's result is probably unsurprising, and
furthermore does not indicate that anything is wrong.

What it shows (now that we got past the terminology) is that only about
half of the buffer pool is subject to replacement during any given clock
sweep.  For low-usage pages that's about what you'd expect: a page is
sucked in on demand (using a buffer returned by the clock sweep), and
when we're done with it it'll have usage_count = 1.  If it's not touched
again then when the clock sweep returns to it it'll be decremented to
usage_count 0, and on the next visit it'll be recycled for use as
something else.  Thus for low-usage pages you'd fully expect that about
half of the buffer population has usage_count 1 and the rest has usage
count 0; which is strikingly close to Greg's measurement that 48.8%
of the population has usage_count 0.

What this seems to tell us is that pgbench's footprint of heavily used
pages (those able to achieve usage_counts above 1) is very small.  Which
is probably right unless you've used a very large scale factor.  I'd be
interested to know what scale factor and shared_buffers setting led to
the above measurement.

It strikes me that the patch would be more useful if it produced a
histogram of the observed usage_counts, rather than merely the count
for usage_count = 0.
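
A sketch of what that could look like, run here over a mocked pool rather
than the real BufferDesc array; the 0..5 bucket range assumes the buffer
manager's usage_count clamp:

#include <stdio.h>

#define NBUFFERS        1024
#define MAX_USAGE_COUNT 5

static int usage_count[NBUFFERS];    /* stand-in for buf->usage_count */

int
main(void)
{
    int hist[MAX_USAGE_COUNT + 1] = {0};

    /* fake data: half the pool at 0, half at 1, a few hot buffers */
    for (int i = 0; i < NBUFFERS; i++)
        usage_count[i] = (i % 2) ? 1 : 0;
    for (int i = 0; i < 16; i++)
        usage_count[i] = MAX_USAGE_COUNT;

    for (int i = 0; i < NBUFFERS; i++)
        hist[usage_count[i]]++;

    for (int u = 0; u <= MAX_USAGE_COUNT; u++)
        printf("usage_count=%d: %5d buffers (%.1f%%)\n",
               u, hist[u], 100.0 * hist[u] / NBUFFERS);
    return 0;
}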

regards, tom lane



Re: [HACKERS] autovacuum next steps, take 3

2007-03-09 Thread Matthew T. O'Connor
My initial reaction is that this looks good to me, but still a few 
comments below.


Alvaro Herrera wrote:

Here is a low-level, very detailed description of the implementation of
the autovacuum ideas we have so far.

launcher's dealing with databases
-


[ Snip ]


launcher and worker interactions


[Snip]


worker to-do list
-
When each worker starts, it determines which tables to process in the
usual fashion: get pg_autovacuum and pgstat data and compute the
equations.

The worker then takes a snapshot of what's currently going on in the
database, by storing worker PIDs, the corresponding table OID that's
being currently worked, and the to-do list for each worker.


Does a new worker really care about the PID of other workers or what 
table they are currently working on?



It removes from its to-do list the tables being processed.  Finally, it
writes the list to disk.


Just to be clear, the new worker removes from its todo list all the 
tables mentioned in the todo lists of all the other workers?



The table list will be written to a file in
PGDATA/vacuum/database-oid/todo.worker-pid
The file will consist of table OIDs, in the order in which they are
going to be vacuumed.

At this point, vacuuming can begin.


This all sounds good to me so far.


Before processing each table, it scans the WorkerInfos to see if there's
a new worker, in which case it reads its to-do list to memory.


It's not clear to me why a worker cares that there is a new worker, 
since the new worker is going to ignore all the tables that are already 
claimed by all worker todo lists.



Then it again fetches the tables being processed by other workers in the
same database, and for each other worker, removes from its own in-memory
to-do all those tables mentioned in the other lists that appear earlier
than the current table being processed (inclusive).  Then it picks the
next non-removed table in the list.  All of this must be done with the
Autovacuum LWLock grabbed in exclusive mode, so that no other worker can
pick the same table (no IO takes places here, because the whole lists
were saved in memory at the start.)


Again, it's not clear to me what this is gaining us.  It seems to me that 
when a worker starts up and writes out its to-do list, it should just do 
it; I don't see the value in workers constantly updating their todo 
lists.  Maybe I'm just missing something; can you enlighten me?



other things to consider


This proposal doesn't deal with the hot tables stuff at all, but that is
very easy to bolt on later: just change the first phase, where the
initial to-do list is determined, to exclude cold tables.  That way,
the vacuuming will be fast.  Determining what is a cold table is still
an exercise to the reader ...


I think we can make this algorithm naturally favor small / hot tables 
with one small change: have workers remove tables that they have just 
vacuumed from their to-do lists and re-write their todo lists to disk. 
Assuming the todo lists are ordered by ascending size, smaller tables 
will become available for inspection by newer workers sooner rather 
than later.





Re: [HACKERS] who gets paid for this

2007-03-09 Thread Dave Page

Joshua D. Drake wrote:

Christian Bird wrote:

Hi all,

I'm a grad student at UC Davis studying the postgres community and I
wanted to know if some on this list could help me out.  I'm studying
the factors that affect people graduating from being mailing list
participant to developers with write access to the repository. 


It is done on a meritocracy basis and has zero bearing if the person is
paid to work on PostgreSQL or not.

I believe (would need verification) that it is -core who decides who
gets actual commit privileges.


It is. Note also that not all of core are committers.

/D





Re: [HACKERS] autovacuum next steps, take 3

2007-03-09 Thread Tom Lane
Matthew T. O'Connor matthew@zeut.net writes:
 Does a new worker really care about the PID of other workers or what 
 table they are currently working on?

As written, it needs the PIDs so it can read in the other workers' todo
lists (which are in files named by PID).

 It's not clear to me why a worker cares that there is a new worker, 
 since the new worker is going to ignore all the tables that are already 
 claimed by all worker todo lists.

That seems wrong to me, since it means that new workers will ignore
tables that are scheduled for processing by an existing worker, no
matter how far in the future that schedule extends.  As an example,
suppose you have half a dozen large tables in need of vacuuming.
The first worker in will queue them all up, and subsequent workers
will do nothing useful, at least not till the first worker is done
with the first table.  Having the first worker update its todo
list file after each table allows the earlier tables to be exposed
for reconsideration, but that's expensive and it does nothing for
later tables.

I suggest that maybe we don't need exposed TODO lists at all.  Rather
the workers could have internal TODO lists that are priority-sorted
in some way, and expose only their current table OID in shared memory.
Then the algorithm for processing each table in your list is

1. Grab the AutovacSchedule LWLock exclusively.
2. Check to see if another worker is currently processing
   that table; if so drop LWLock and go to next list entry.
3. Recompute whether table needs vacuuming; if not,
   drop LWLock and go to next entry.  (This test covers the
   case where someone vacuumed the table since you made your
   list.)
4. Put table OID into shared memory, drop LWLock, then
   vacuum table.
5. Clear current-table OID from shared memory, then
   repeat for next list entry.

This creates a behavior of whoever gets to it first rather than
allowing workers to claim tables that they actually won't be able
to service any time soon.
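
A compact C rendering of those five steps, with a pthread mutex standing
in for the AutovacSchedule LWLock and stub functions for the pgstat
recheck and the vacuum itself (all names hypothetical):

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

typedef unsigned int Oid;

#define MAX_WORKERS 3

static pthread_mutex_t AutovacSchedule = PTHREAD_MUTEX_INITIALIZER;
static Oid current_table[MAX_WORKERS];      /* 0 = this worker is idle */

/* Stand-in for recomputing the vacuum equations from pgstat data. */
static bool
table_needs_vacuum(Oid tableoid)
{
    (void) tableoid;
    return true;
}

static void
vacuum_table(Oid tableoid)
{
    printf("vacuuming table %u\n", tableoid);
}

static void
process_table(int myslot, Oid tableoid)
{
    pthread_mutex_lock(&AutovacSchedule);        /* 1. grab the lock */

    for (int i = 0; i < MAX_WORKERS; i++)
        if (i != myslot && current_table[i] == tableoid)
        {
            pthread_mutex_unlock(&AutovacSchedule);
            return;                              /* 2. someone else has it */
        }

    if (!table_needs_vacuum(tableoid))           /* 3. vacuumed since we */
    {                                            /*    built our list?   */
        pthread_mutex_unlock(&AutovacSchedule);
        return;
    }

    current_table[myslot] = tableoid;            /* 4. claim it ... */
    pthread_mutex_unlock(&AutovacSchedule);

    vacuum_table(tableoid);                      /* ... and vacuum */

    current_table[myslot] = 0;                   /* 5. release the claim */
}

int
main(void)
{
    process_table(0, 16384);
    return 0;
}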

regards, tom lane



Re: [HACKERS] Log levels for checkpoint/bgwriter monitoring

2007-03-09 Thread Greg Smith

On Fri, 9 Mar 2007, Jim Nasby wrote:

I'm wondering if pg_bench is a good test of this stuff. ISTM it's 
unrealistically write-heavy, which is going to tend to not only put a 
lot of dirty buffers into the pool, but also keep them pinned enough 
that you can't write them.


Whether it's unrealistically write-heavy kind of depends on what your 
real app is.  The standard pgbench is a bit weird because it does so many 
updates to tiny tables, which adds a level of locking contention that 
doesn't really reflect many real-world situations.  But the no-branch mode 
(update/select to accounts, insert into history) isn't too dissimilar from 
some insert-heavy logging applications I've seen.


The main reason I brought this all up was because Itagaki seemed to be 
using pgbench for some of his performance tests.  I just wanted to point 
out that the LRU background writer specifically tends to be very 
underutilized when using pgbench.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [HACKERS] Bug in VACUUM FULL ?

2007-03-09 Thread Gregory Stark
Tom Lane [EMAIL PROTECTED] writes:

 It's not surprising that tuples could have xmax less than xmin, since
 transactions can commit in orders different than they start; when using
 READ COMMITTED updates it's not at all surprising that a transaction
 might update rows after a later-numbered transaction does.  However, in
 looking at this code previously I'd assumed that the OldestXmin cutoff
 could never fall between two such transactions, and so the above
 scenario wouldn't happen.  I'm not real sure why I thought that.
 For the cases that VACUUM FULL is interested in, both XIDs mentioned
 in a DEAD tuple must have committed before OldestXmin was computed, but
 there doesn't seem to be a compelling reason why OldestXmin might not
 have been determined by an unrelated third transaction with a number
 between those two.

No commentary, but in case anyone else is having trouble following, here is
the diagram (I think this is what you're describing?) I had to make before I
fully understood it:

TXN 1   TXN 2   TXN 3   TXN 4   VACUUM

START
.               START
.       START   .
.       UPDATE  .
.       COMMIT  .
DELETE          .
COMMIT          .
                .       START
                COMMIT  .
                        .       START

So txn 4's xmin is txn 3, leaving the global OldestXmin = txn 3 which lies
between txn 1 and txn 2.

And the tuple chain consists of two tuples. The original which has xmax
younger than OldestXmin and so is RECENTLY_DEAD. And the updated tuple which
has xmax older than OldestXmin and so is DEAD even though it has xmin younger
than OldestXmin.


Hm, I wonder if you could just notice that xmin is younger than OldestXmin. 
In a more complex example you could have lots of DEAD tuples in the chain and
some RECENTLY_DEAD mixed in randomly. But I think all the DEAD tuples
following a RECENTLY_DEAD would have to have xmin younger than OldestXmin. 
Or maybe I'm making the same mistake again. Gosh, this is confusing.


-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com



Re: [HACKERS] autovacuum next steps, take 3

2007-03-09 Thread Matthew T. O'Connor

Tom Lane wrote:

Matthew T. O'Connor matthew@zeut.net writes:
It's not clear to me why a worker cares that there is a new worker, 
since the new worker is going to ignore all the tables that are already 
claimed by all worker todo lists.


That seems wrong to me, since it means that new workers will ignore
tables that are scheduled for processing by an existing worker, no
matter how far in the future that schedule extends.  As an example,
suppose you have half a dozen large tables in need of vacuuming.
The first worker in will queue them all up, and subsequent workers
will do nothing useful, at least not till the first worker is done
with the first table.  Having the first worker update its todo
list file after each table allows the earlier tables to be exposed
for reconsideration, but that's expensive and it does nothing for
later tables.


Well, the big problem that we have is not that large tables are being 
starved, so this doesn't bother me too much; besides, there is only so much 
IO, so one worker working sequentially through the big tables seems OK 
to me.



I suggest that maybe we don't need exposed TODO lists at all.  Rather
the workers could have internal TODO lists that are priority-sorted
in some way, and expose only their current table OID in shared memory.
Then the algorithm for processing each table in your list is

1. Grab the AutovacSchedule LWLock exclusively.
2. Check to see if another worker is currently processing
   that table; if so drop LWLock and go to next list entry.
3. Recompute whether table needs vacuuming; if not,
   drop LWLock and go to next entry.  (This test covers the
   case where someone vacuumed the table since you made your
   list.)
4. Put table OID into shared memory, drop LWLock, then
   vacuum table.
5. Clear current-table OID from shared memory, then
   repeat for next list entry.

This creates a behavior of whoever gets to it first rather than
allowing workers to claim tables that they actually won't be able
to service any time soon.


Right, but you could wind up with as many workers working concurrently 
as you have tables in a database, which doesn't seem like a good idea 
either.  One thing I like about the todo list setup Alvaro had is that 
new workers will be assigned fewer tables to work on and hence exit 
sooner.  We are going to fire off a new worker every autovac_naptime so 
availability of new workers isn't going to be a problem.





Re: [HACKERS] Bug in VACUUM FULL ?

2007-03-09 Thread Tom Lane
Gregory Stark [EMAIL PROTECTED] writes:
 So txn 4's xmin is txn 3, leaving the global OldestXmin = txn 3 which lies
 between txn 1 and txn 2.

 And the tuple chain consists of two tuples. The original which has xmax
 younger than OldestXmin and so is RECENTLY_DEAD. And the updated tuple which
 has xmax older than OldestXmin and so is DEAD even though it has xmin younger
 than OldestXmin.

Right.

 Hm, I wonder if you could just notice that xmin is younger than OldestXmin. 

You can see that at the newer tuple, but the problem is to propagate the
knowledge back to the older tuple(s).  Or were you suggesting that we
treat the newer tuple as RECENTLY_DEAD instead of DEAD?  That seems a
step backwards in space-reclamation ability.  It'd be hard to implement
in any case, because one of the problem cases is where VACUUM has
already recycled the DEAD tuple before visiting the RECENTLY_DEAD tuple
that chains to it.

regards, tom lane



Re: [HACKERS] Log levels for checkpoint/bgwriter monitoring

2007-03-09 Thread Greg Smith

On Fri, 9 Mar 2007, Tom Lane wrote:

I'd be interested to know what scale factor and shared_buffers setting 
led to the above measurement.


That was just a trivial example with 1 client, scale=10 (~160MB database), 
and shared_buffers=24MB.  Where things really get interesting with pgbench 
is on a system with enough horsepower+clients to dirty the whole buffer 
cache well before a checkpoint.  I regularly see 75% of the cache dirty 
and blocked from LRU writes with pgbench's slightly pathological workload 
in that situation.


You're correct that these results aren't particularly surprising or 
indicative of a problem to be corrected.  But they do shed some light on 
what pgbench is and isn't appropriate for testing.



It strikes me that the patch would be more useful if it produced a
histogram of the observed usage_counts, rather than merely the count
for usage_count = 0.


I'll start working in that direction.  With the feedback everyone has 
given me on how few of the buffers are truly pinned via the correct 
usage of the term, I'm going to revisit the usage details and revise that 
section.  I'm happy with how I'm reporting the checkpoint details now, 
still some work left to do on the bgwriter activity.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [HACKERS] Bug in VACUUM FULL ?

2007-03-09 Thread Gregory Stark
Tom Lane [EMAIL PROTECTED] writes:

 Hm, I wonder if you could just notice that xmin is younger than OldestXmin. 

 You can see that at the newer tuple, but the problem is to propagate the
 knowledge back to the older tuple(s).  Or were you suggesting that we
 treat the newer tuple as RECENTLY_DEAD instead of DEAD?  That seems a
 step backwards in space-reclamation ability.  It'd be hard to implement
 in any case, because one of the problem cases is where VACUUM has
 already recycled the DEAD tuple before visiting the RECENTLY_DEAD tuple
 that chains to it.

I think I was suggesting treating the newer tuple as RECENTLY_DEAD. Ie, not
vacuuming a tuple if either xmin or xmax is younger than OldestXmin. It's a
step backwards in space-reclamation but really, how often can it happen?

But I'm not entirely sure that's enough really. Any number of old transactions
could come along and update the head of the tuple chain, setting both xmin and
xmax to old values. I guess only the one tuple immediately following the
RECENTLY_DEAD tuple would have the young xmin. That doesn't really help you
identify the later DEAD tuples.

Breaking the chain up into pieces seems weird. It seems like it's obviously
bogus and only works because we're sure the tuples are dead anyways so it
doesn't really matter what we do with them. If we're not sure they're dead it
seems like the right thing to do is either to keep the whole chain or to
relink the chain after removing the dead intervening tuples.

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] Bug in VACUUM FULL ?

2007-03-09 Thread Tom Lane
Gregory Stark [EMAIL PROTECTED] writes:
 Breaking the chain up into pieces seems weird. It seems like it's obviously
 bogus and only works because we're sure the tuples are dead anyways so it
 doesn't really matter what we do with them.

Yup, exactly.  If we wanted to be tense about this we'd try to get rid
of the nominally RECENTLY_DEAD tuples that precede any DEAD tuple in the
chain.  However, I concur with Simon to the extent that I don't want to
do any more work to fix this bug than necessary, and trying to recognize
such tuples seems like a lot more work than necessary.

Also, we know this case works because it already is working: in the
situation where VACUUM happens to visit and remove the DEAD tuple(s)
before reaching the RECENTLY_DEAD tuples that link forward to them,
it treats the RECENTLY_DEAD tuples as a disconnected chain and moves
them as-is.  I saw tons of this in the traces I was making today, and
it doesn't seem to create any bad effects.  (My attention was drawn to
it because I saw move_chain_tuple being used to move single-member
chains, which looks impossible when you first look at the code --- the
is-it-a-chain test seems to ensure that we can link either forward or
backward.  But not so if t_ctid points to an already-removed tuple.)

regards, tom lane

---(end of broadcast)---
TIP 6: explain analyze is your friend


[HACKERS] My honours project - databases using dynamically attached entity-properties

2007-03-09 Thread Edward Stanley
Hi,


Was wondering if people would mind having a read over what I plan to do for my 
undergraduate honours project - you can get the proposal here:

http://www.mcs.vuw.ac.nz/~eddie/489_Proposal.pdf

What I'd basically like to know is:
a) Is this problem worth solving? 
b) Is there already a good solution (particularly, within PostgreSQL)?
c) Any other feedback or comments you may have.


Regards
Eddie Stanley


---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] Calculated view fields (8.1 != 8.2)

2007-03-09 Thread Gaetano Mendola
Florian G. Pflug wrote:
 Gaetano Mendola wrote:
 Martijn van Oosterhout wrote:
 On Fri, Mar 09, 2007 at 10:59:56AM +0100, Gaetano Mendola wrote:
 Is this really what we want? I did a migration 8.0.x => 8.2.3 and in
 the first hour of service I had lots of queries blocked due to this;
 consider that in my case v_ta has millions of records and usually
 that join extracts 1 row. Is there a way to disable this behaviour
 until I have checked my whole (huge) schema?
 Most people figured it was an improvement. It's configured per function
 now, which wasn't the case before. I don't think there was ever any
 discussion about having a global switch.

 Well, it's not an improvement in terms of performance but a performance
 degradation in the best case, and in the worst it can be devastating:

 create table ta ( a integer, b integer );
 CREATE TABLE
 create table tb ( b integer, c integer );
 CREATE TABLE

 CREATE OR REPLACE FUNCTION sp_delete_selected_row ( INTEGER )
 RETURNS INTEGER AS'
 DECLARE
 a_id    ALIAS FOR $1;
 BEGIN
DELETE FROM ta where a = a_id;
return 0;
 END;
 ' LANGUAGE 'plpgsql';
 CREATE FUNCTION

 CREATE OR REPLACE VIEW v_ta AS
   SELECT
  sp_delete_selected_row(a) AS a,
  b AS b
   FROM
  ta
 ;
 CREATE VIEW

 insert into ta values (2,3);
 INSERT 0 1
 insert into ta values (3,4);
 INSERT 0 1
 insert into tb values (4,5);
 INSERT 0 1

 select * from v_ta join tb using (b) where c = 5;

  b | a | c
 ---+---+---
  4 | 0 | 5
 (1 row)

 select * from ta;
  a | b
 ---+---
 (0 rows)


 All rows are gone instead of only the one extracted by that query.
 IMHO this is an undesired side effect.
 In my case it destroyed my application statistics on how many times a
 certain row was extracted.
 
 This is insane. Whoever creates a view like that on a production system
 should *immediately* be carried away from his keyboard, to prevent
 further damage. Imagine someone using View Data on this view in
 pgAdmin... I don't wanna be near him when he clicks Refresh, and
 suddenly all data is gone...
 
 Maybe calling volatile functions in selects and views should be
 forbidden entirely, except for volatile functions in the top-level
 select clause,
 to support things like select ..., nextval('seq') from 
 
 But it's probably not worth the effort - there will always be creative
 ways to shoot yourself into your foot.

I fully agree with this; that was just an extreme example of a hidden undesired
call. In my framework, by coding rule, no function with side effects is applied
to view fields; however, I had some functions not correctly marked as STABLE
(mea culpa) that degraded performance until I realized what was going on. I'm
of the opinion that it is not sane to call a function not marked
stable/immutable for a discarded column (that I can in some way accept), and
above all not for discarded rows.
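
For anyone bitten by the same migration issue: when a function truly has
no side effects, fixing the marking is a one-line change. A sketch, with
a made-up function:

CREATE OR REPLACE FUNCTION sp_score ( INTEGER )
RETURNS INTEGER AS'
BEGIN
   RETURN $1 * 10;  -- no side effects, so STABLE is safe
END;
' LANGUAGE 'plpgsql' STABLE;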

Regards
Gaetano Mendola

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] who gets paid for this

2007-03-09 Thread Christian Bird

I didn't mean to imply that getting paid is correlated with getting
commit privileges.  However, there is literature that supports the
idea that those who are employed to help with OSS projects may
behave differently from those who contribute in their free time
(check out 
http://gsyc.info/~jjamor/research/papers/2006-gsd-herraiz-robles-amor-romera-barahona.pdf).
We're trying to get an idea if there are perhaps two different
phenomena in our data.  We're trying to separate those who have commit
privileges into those employed by a company to help out as part of
their job and those who do so in their free time at the time of their
first commit.  I really appreciate any help that you can provide.  If
it appears that I'm making incorrect assumptions about how the
community works, please feel free to correct me or point me to
resources.  Thanks.

-- Chris

On 3/8/07, Tom Lane [EMAIL PROTECTED] wrote:

Josh Berkus josh@agliodbs.com writes:
 Christian,
 More specifically, could those
 who worked on apache as some aspect of their job prior to getting repo
 access let me know?  Or if there are devs who know this information
 about others, I'd be really appreciative to get it.

 Hmmm.  Wrong project.  And I think you're making the (incorrect) assumption
 that granting commit rights works the same way in all projects.   It does
 not.

Even more to the point, "getting paid for" has almost nothing to do
with "has commit privileges".  At least on this project.

regards, tom lane




--
Christian Bird
[EMAIL PROTECTED]

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] Calculated view fields (8.1 != 8.2)

2007-03-09 Thread Gaetano Mendola
Florian G. Pflug wrote:
 Martijn van Oosterhout wrote:
 On Fri, Mar 09, 2007 at 10:59:56AM +0100, Gaetano Mendola wrote:
 Is this really what we want? I did a migration 8.0.x => 8.2.3 and in
 the first hour of service I had lots of queries blocked due to this;
 consider that in my case v_ta has millions of records and usually
 that join extracts 1 row. Is there a way to disable this behaviour
 until I have checked my whole (huge) schema?

 Most people figured it was an improvement. It's configured per function
 now, which wasn't the case before. I don't think there was ever any
 discussion about having a global switch.
 
 If your function is already marked immutable or stable, then nothing
 changes for you. If you *did* call volatile functions inside your
 select, then you now get consistent behaviour. Since you don't want
 your function to be evaluated in all cases, I assume that it shouldn't
 be marked volatile in the first place.

Well, some functions are volatile and cannot be marked as stable.

We develop our applications by layering modules, and we had no reason to
forbid, as a coding rule, function calls in views at the low-level layers.
After all, views are there also to build up your schema by layering the
information.

I can imagine a case where a lower module exports a view to an upper layer,
stating the interface as a list of fields:

first_name, last_name, 

with a *hidden* field that is a function call updating the statistics on
how many times a given record was selected; this technique can no longer
be used starting with 8.2.x.

The above is not my case, but it is a possible scenario (I admit not a sane
one).
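
Concretely, the pattern would look something like this (all names are
invented for illustration):

create table people ( id integer, first_name text, last_name text );
create table select_stats ( person_id integer, hits integer );

CREATE OR REPLACE FUNCTION sp_bump_hits ( INTEGER )
RETURNS INTEGER AS'
BEGIN
   UPDATE select_stats SET hits = hits + 1 WHERE person_id = $1;
   RETURN $1;
END;
' LANGUAGE 'plpgsql';  -- volatile, and genuinely so

CREATE OR REPLACE VIEW v_people AS
  SELECT
     first_name,
     last_name,
     sp_bump_hits(id) AS touched  -- the *hidden* side-effect field
  FROM
     people;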

Regards
Gaetano Mendola

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] msvc failure in largeobject regression test

2007-03-09 Thread Andrew Dunstan

Magnus Hagander wrote:

On Tue, Jan 23, 2007 at 11:39:23AM -0800, Jeremy Drake wrote:

On Tue, 23 Jan 2007, Magnus Hagander wrote:

On Tue, Jan 23, 2007 at 09:31:40AM -0500, Andrew Dunstan wrote:

Magnus Hagander wrote:

Hi!

I get failures for the largeobject regression tests on my vc++ build. I
don't think this has ever worked, given that those tests are fairly new.
Any quick ideas on what's wrong before I dig deeper?

[snip]

I wonder if this is a line-end issue? Assuming you are working from CVS,
does your client turn \n into \r\n ? I see that other windows boxes are
happily passing this test on the buildfarm, and of course the mingw cvs
doesn't adjust line endings.

Bingo!

That's it. I copied the file in binary mode from a linux box and now it
passes.

I thought about that when I wrote it, and thus tried it under mingw and
cygwin without issue ;)  I don't think the regression tests were in a
position of running on the msvc build at the time...  My thought for what
to do if this did run into a problem would be an alternate output file
that is also acceptable (I don't know what they're called but other tests
have them IIRC).

Either that, or we require a checkout using Unix style linefeeds. I've
confirmed that removing the file and checking it back out with cvs --lf
update tenk.data works - tests pass fine.

Yet another option might be to flag that file as binary in cvs, in which
case I think cvsnt shouldn't go mess with it.


I have just run into this today while trying to get buildfarm working
for MSVC. After some consideration I think an alternative result file is
the best solution. I have looked at switches for cvsnt, but they are
likely to be fragile at best.


cheers

andrew


---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] Auto creation of Partitions

2007-03-09 Thread Robert Treat
On Friday 09 March 2007 01:23, NikhilS wrote:
 Hi,

  This follows on from the suggestion I made - taken along the lines of
  the subject auto creation of partitions where I suggested the syntax
  of partition check(month of mydatecol) and have a new partition created
  as data was entered. With this scenario dropping the partition when it
  was empty would complement the creation of a new partition as needed.
 
  Given that there seems to be no real support for going with auto
  maintenance where new partitions are added as needed, then the auto
  dropping of empty partitions would also not apply.
 
  Leaving us with only specific add partition / drop partition commands.
  And have the parent table pick up rows not matching any partition check
  criteria.

 I was thinking along the lines of what Jim had suggested earlier regarding
 overflow partition. Instead of dumping unmatched rows to the master table,
 we could put them into a default DUMP/DUMB partition.


I'm quite content dumping unmatched rows into the master table. This makes it 
very easy to scan partitioned tables for busted partition setups.  Having a 
DUMP table seems different only semantically, so why learn new semantics? 
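
For instance, with today's inheritance scheme, a trivial query over a
hypothetical parent table already finds the strays, no new syntax needed:

SELECT * FROM ONLY measurements;  -- rows that matched no child's CHECK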

 Given that Simon wants to do away with having the master table APPENDed in
 the planning phase, this would be better.


ISTM you're trading appending the master table for appending the DUMP 
partition, which afaict would give you no gain.  

-- 
Robert Treat
Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
   choose an index scan if your joining column's datatypes do not
   match


Re: [HACKERS] Auto creation of Partitions

2007-03-09 Thread NikhilS

Hi,

On 3/10/07, Hannu Krosing [EMAIL PROTECTED] wrote:


On one fine day, Fri, 2007-03-09 at 15:41, Jim Nasby wrote:
 On Mar 9, 2007, at 3:31 AM, Zeugswetter Andreas ADI SD wrote:
  Since partition is inheritance-based, a simple DROP or  NO
  INHERIT
  will do the job to deal with the partition. Do we want to reinvent
  additional syntax when these are around and are documented?
 
  Well, if the syntax for adding a new partition eventually
  ends up as ALTER TABLE ADD PARTITION, then it would make more
  sense that you remove a partition via ALTER TABLE DROP PARTITION.
 
  But DROP PARTITION usually moves the data from this partition to other
  partitions,
  so it is something different.

 It does? IIRC every partitioning system I've seen DROP PARTITION
 drops the data as well. It's up to you to move it somewhere else if
 you want to keep it.

Will this proposed DROP PARTITION just disassociate the table from the
master, or will it actually drop the partition's table from the whole
database?



That's why I would prefer the existing mechanism, where a DROP on the child
removes it and a NO INHERIT disassociates it. There might be situations
where we would want to just disassociate and not drop.
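
To spell out the existing mechanism (a sketch; the table names are made
up):

CREATE TABLE measurements ( logdate date, value integer );
CREATE TABLE measurements_2007_03 (
   CHECK ( logdate >= '2007-03-01' AND logdate < '2007-04-01' )
) INHERITS (measurements);

-- Disassociate only: the child lives on as a standalone table.
ALTER TABLE measurements_2007_03 NO INHERIT measurements;

-- Or remove the partition and its data outright.
DROP TABLE measurements_2007_03;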

Regards,
Nikhils

--
EnterpriseDB   http://www.enterprisedb.com


Re: [HACKERS] Auto creation of Partitions

2007-03-09 Thread NikhilS

Hi,


 Given that Simon wants to do away with having the master table APPENDed
 in the planning phase, this would be better.


ISTM you're trading appending the master table for appending the DUMP
partition, which afaict would give you no gain.




If there are entries in the master table, I think it would get appended for
all queries regardless of whether we need to examine its contents or not.
Segregating dump data into a partition will avoid that.

I have seen examples in some other databases wherein a partition specifies a
range of someval - MAXINT for instance, to catch such cases.

That again means that the onus is on the partition creator most of the
time.
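
Something along these lines, with invented names and an open-ended upper
bound standing in for MAXINT:

CREATE TABLE measurements_overflow (
   CHECK ( logdate >= '2008-01-01' )  -- everything past the last explicit range
) INHERITS (measurements);
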
Regards,
Nikhils

--
EnterpriseDB   http://www.enterprisedb.com


Re: [HACKERS] msvc failure in largeobject regression test

2007-03-09 Thread Tom Lane
Andrew Dunstan [EMAIL PROTECTED] writes:
 Magnus Hagander wrote:
 I wonder if this is a line-end issue? Assuming you are working from CVS,
 does your client turn \n into \r\n ?

 I have just run into this today while trying to get buildfarm working 
 for MSVC. After some consideration I think an alternative result file is 
 the best solution. I have looked at switches for cnsnt, but they are 
 likely to be fragile at best.

Are you proposing an alternate result file that has a different linefeed
style?  I would really really rather that we not go there, because it
will be impossibly fragile to maintain.  Or are you willing to accept
that the Windows builds will break every time someone changes that
regression test, until someone else with a Windows machine fixes the
result file?

I would find it preferable to make pg_regress compensate for this
issue somehow ...

regards, tom lane

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate


Re: [HACKERS] Calculated view fields (8.1 != 8.2)

2007-03-09 Thread Tom Lane
Gaetano Mendola [EMAIL PROTECTED] writes:
 I can imagine a case where a lower module exports a view to an upper layer,
 stating the interface as a list of fields:

 first_name, last_name, 

 with a *hidden* field that is a function call updating the statistics on
 how many times a given record was selected; this technique can no longer
 be used starting with 8.2.x.

You're living in a dream world if you think that works reliably in *any*
version of Postgres.  But for starters, what is your definition of
"selected" --- pulled from the physical table?  Accumulated into an
aggregate?  Delivered as a recognizable row to the client?  Delivered N
times to the client due to joining N times to some other table?

regards, tom lane

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq