Re: [HACKERS] RFC: changing autovacuum_naptime semantics
On Mar 9, 2007, at 6:42 AM, Tom Lane wrote: Alvaro Herrera [EMAIL PROTECTED] writes: Now regarding your restartable vacuum work. I think that stopping a vacuum at some point and being able to restart it later is very cool and may get you some hot chicks, but I'm not sure it's really useful. Too true :-( Yeah. Wouldn't a 'divide and conquer' kind of approach make it better? I.e. let vacuum work on some part of the table/db, then stop, pick up another part later, vacuum it, and so on. -- Grzegorz Jaskiewicz [EMAIL PROTECTED] ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Auto creation of Partitions
Since partitioning is inheritance-based, a simple DROP or NO INHERIT will do the job of dealing with a partition. Do we want to reinvent additional syntax when these are around and are documented? Well, if the syntax for adding a new partition eventually ends up as ALTER TABLE ADD PARTITION, then it would make more sense to remove a partition via ALTER TABLE DROP PARTITION. But DROP PARTITION usually moves the data from this partition to other partitions, so it is something different. Andreas
Re: [HACKERS] Calculated view fields (8.1 != 8.2)
Gaetano Mendola wrote: Tom Lane wrote: Gaetano Mendola [EMAIL PROTECTED] writes: [ 8.2 evaluates volatile functions in the targetlist of a view ] If I mark the function as STABLE or IMMUTABLE then even with version 8.2 the function is not evaluated. Is this the intended behavior? Yes; people complained that we needed to be more careful about the number of times volatile functions get evaluated. I suspect that functions are evaluated also for records discarded due to joins. Is that the case? Like: SELECT * FROM ta JOIN tb USING (id) WHERE tb.foo = 4; If ta is a view with some calculated fields, are the functions on ta evaluated only for records matching the filters, or in some cases (like a full scan on ta) also for the records discarded due to the join? I did a check on 8.2 and I can confirm my suspicions:

kalman=# create table ta ( a integer, b integer );
CREATE TABLE
kalman=# create table tb ( b integer, c integer );
CREATE TABLE
kalman=# CREATE OR REPLACE FUNCTION sp_test ( INTEGER )
kalman-# RETURNS INTEGER AS '
kalman'# DECLARE
kalman'#    a_id ALIAS FOR $1;
kalman'# BEGIN
kalman'#    RAISE NOTICE ''here'';
kalman'#    return 3;
kalman'# END;
kalman'# ' LANGUAGE 'plpgsql';
CREATE FUNCTION
kalman=# CREATE OR REPLACE VIEW v_ta AS
kalman-#    SELECT
kalman-#       sp_test(a) AS a,
kalman-#       b AS b
kalman-#    FROM
kalman-#       ta c;
CREATE VIEW
kalman=# insert into ta values (2,3);
INSERT 0 1
kalman=# insert into ta values (3,4);
INSERT 0 1
kalman=# insert into tb values (4,5);
INSERT 0 1
kalman=# select * from v_ta join tb using (b) where c = 5;
NOTICE: here
NOTICE: here
 b | a | c
---+---+---
 4 | 3 | 5
(1 row)

Is this really what we want? I did a migration 8.0.x -> 8.2.3 and in the first hour of service I had a lot of queries blocked due to this; consider that in my case v_ta has millions of records and usually that join extracts 1 row.
Is there a way to disable this behaviour globally, until I have had a chance to check my entire (huge) schema? Regards Gaetano Mendola
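For readers hitting the same migration issue: the per-function fix Tom alludes to above is to declare the view's function STABLE (or IMMUTABLE) when it truly has no side effects. A hypothetical sketch, reusing the sp_test example from this thread (note the real sp_test raises a NOTICE, a side effect that would have to be dropped first):

```sql
-- Assumes sp_test really is side-effect free; with the STABLE marking,
-- 8.2 is again free to evaluate it only for rows that survive the join.
CREATE OR REPLACE FUNCTION sp_test(a_id integer)
RETURNS integer AS $$
BEGIN
    RETURN 3;
END;
$$ LANGUAGE plpgsql STABLE;
```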
[HACKERS] A naive question about the Makefile
Hi all, I am adding some new files to the system, and I wonder what I need to do to update the Makefiles. Is the build system smart enough to recognize my new files and add corresponding entries to the Makefiles? Thanks. Regards Yu
Re: [HACKERS] A naive question about the Makefile
On Friday, 9 March 2007 at 10:38, Cao Yu wrote: Is the system smart enough to recognize my new files and add corresponding entries into the Makefiles? No, you need to add them yourself. Look for where the other files are listed. -- Peter Eisentraut http://developer.postgresql.org/~petere/
Re: [HACKERS] Auto creation of Partitions
On Fri, 2007-03-09 at 11:48 +0530, NikhilS wrote: Hi, On 3/9/07, Shane Ambler [EMAIL PROTECTED] wrote: Note to Nikhil: Make sure the new syntax doesn't prevent partitions from being placed upon multiple tablespaces in some manner, at CREATE TABLE time. What if the syntax was something like:

CREATE TABLE tabname (
   ...
   ...
) PARTITION BY HASH(expr) | RANGE(expr) | LIST(expr)
[PARTITIONS num_partitions] /* will apply to HASH only for now */
[PARTITION partition_name CHECK(...) [USING TABLESPACE tblspcname],
 PARTITION partition_name CHECK(...) [USING TABLESPACE tblspcname]
 ...
];

And (if we use ALTER TABLE to add partitions):

ALTER TABLE tabname ADD PARTITION partition_name CHECK(...) [USING TABLESPACE tblspcname];

We could as well drop the USING part. Why would we support HASH partitions? If you did, the full syntax for hash clusters should be supported. If we do the CHECK clauses like that then we still don't have guaranteed non-overlap between partitions. It would be easier to use Oracle syntax and then construct the CHECK clauses from that. Also, the syntax needs to be fairly complex to allow for a mixture of modes, e.g. range and list partitioning. That is currently possible today, and the syntax for doing that is IMHO much simpler than the Oracle way of specifying it. An alternative is to provide a partitioning function which decides which partition each value goes into: PARTITION FUNCTION which_partition(date_col). The partition function must return an unsigned integer > 0, which would correspond to particular partitions. Partitions would be numbered 1..N, and named tablename_partM where 1 <= M <= N. The input and contents of the partition function would be up to the user, e.g.

CREATE FUNCTION range_partition(date date_col) {
   if (date_col < D1) return 1;
   else if (date_col < D2) return 2;
   else if (date_col < D3) return 3;
   return 4;
}

Doing it this way would allow us to easily join two tables based upon a common partition function.
In time, I would suggest we support both ways: declarative and functional. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com
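For context, a sketch of what the proposed declarative syntax would presumably expand to today: the inheritance-plus-CHECK scheme, where non-overlap is entirely up to whoever writes the constraints. Table and tablespace names here are invented for illustration:

```sql
CREATE TABLE measurement (
    logdate  date NOT NULL,
    peaktemp integer
);
-- One child table per range; note nothing stops these CHECKs
-- from overlapping -- exactly the objection raised above.
CREATE TABLE measurement_y2006 (
    CHECK (logdate >= DATE '2006-01-01' AND logdate < DATE '2007-01-01')
) INHERITS (measurement);
CREATE TABLE measurement_y2007 (
    CHECK (logdate >= DATE '2007-01-01' AND logdate < DATE '2008-01-01')
) INHERITS (measurement) TABLESPACE archive_space;
```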
[HACKERS] CLUSTER and MVCC
Is there a particular reason why CLUSTER isn't MVCC-safe? It seems to me that it would be trivial to fix, by using SnapshotAny instead of SnapshotNow, and not overwriting the xmin/xmax with the xid of the cluster command. I feel that I must be missing something, or someone would've already fixed it a long time ago... Csaba, you mentioned recently (http://archives.postgresql.org/pgsql-hackers/2007-03/msg00027.php) that you're actually using the MVCC-violation to clean up tables during a backup. Can you tell us a bit more about that? Would you be upset if we shut that backdoor? In any case, the MVCC-violation needs to be documented. I'll send a doc patch to pgsql-patches shortly. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] [COMMITTERS] pgsql: Remove unsafe calling of WSAStartup and WSA Cleanup from DllMain.
On Fri, Mar 09, 2007 at 08:16:12AM +0000, Dave Page wrote: Magnus Hagander wrote: For example, do you recall us being confused when we found we needed to call it in slon.exe? Eh, no, actually not. Sorry. Well, it was only a couple of years ago!! Seriously though, from what I recall that was the origin of this code - you left it out because libpq called WSAStartup on your system, and I added it because on mine it didn't. Or something like that. I remember us getting confused about it on IM anyway. Hmm. Was that actually for the libpq stuff, though? I don't recall it clearly, but something tells me the problem was around the pipe emulation and not around libpq. In which case it can simply be because I was building against a libpq built with MSVC, so it had the broken startup code, and you used a mingw one, which didn't have it. Per the docs, an application like slon *should* make its own call to WSAStartup() because it uses socket functions directly (port/pipe.c)... Another question related to backpatching - should I backpatch this to 8.1 and 8.0 as well? I know we said we more or less don't maintain the win32 port back there because it was too new, but this is all code in the client libpq, which has been around on win32 much longer. The reason I'm asking is that the original reporter of this problem is on 8.1... I'm leaning towards yes, but would like to hear further comments... //Magnus
Re: [HACKERS] Auto creation of Partitions
Hi, Why would we support HASH partitions? If you did, the full syntax for hash clusters should be supported. In MySQL and Oracle, the syntax for HASH partitions seems to be similar to the one mentioned. I do not know much about hash clusters though. If we do the CHECK clauses like that then we still don't have guaranteed non-overlap between partitions. It would be easier to use Oracle syntax and then construct the CHECK clauses from that. Again, Oracle and MySQL use the VALUES LESS THAN (expr) format for RANGE partitions. So you mean that they end up creating ranges like MININT - Range1, Range1+1 - Range2 etc. for each of the partitions? I think Postgres users are used to CHECK clauses, and I still feel that the onus of keeping partitions distinct lies on the partition creator. Also, the syntax needs to be fairly complex to allow for a mixture of modes, e.g. range and list partitioning. That is currently possible today, and the syntax for doing that is IMHO much simpler than the Oracle way of specifying it. Subpartitioning is not being targeted right now, but could be put on the TODO list for further enhancements. An alternative is to provide a partitioning function which decides which partition each value goes into: PARTITION FUNCTION which_partition(date_col). The partition function must return an unsigned integer > 0, which would correspond to particular partitions. Partitions would be numbered 1..N, and named tablename_partM where 1 <= M <= N. The input and contents of the partition function would be up to the user, e.g.

CREATE FUNCTION range_partition(date date_col) {
   if (date_col < D1) return 1;
   else if (date_col < D2) return 2;
   else if (date_col < D3) return 3;
   return 4;
}

Doing it this way would allow us to easily join two tables based upon a common partition function. In time, I would suggest we support both ways: declarative and functional. Till now, we are going the declarative way. Regards, Nikhils -- EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] who gets paid for this
Tom Lane wrote: Even more to the point, getting paid for has almost nothing to do with has commit privileges. At least on this project. Darn. So the cheque isn't really in the mail? cheers andrew
Re: [HACKERS] CLUSTER and MVCC
On Fri, 2007-03-09 at 12:29, Heikki Linnakangas wrote: Csaba, you mentioned recently (http://archives.postgresql.org/pgsql-hackers/2007-03/msg00027.php) that you're actually using the MVCC-violation to clean up tables during a backup. Can you tell us a bit more about that? Would you be upset if we shut that backdoor? My use case: a queue-like table (in fact a 'task' table) which is very frequently inserted/updated/deleted. This table tends to be bloated in the presence of any long running transaction... The only transactional behavior we need from this table is to make sure that when we insert something into it in a transaction (possibly together with other actions) and then commit/rollback, it commits/rolls back the insert. CLUSTER's violation of MVCC does not affect this, as CLUSTER will not be able to lock the table if another transaction has inserted something into it (the inserting transaction will hold a lock on the table). Selections on this table are not critical for us; it just doesn't matter which job processor gets which task and in what order... (actually it does matter, but CLUSTER won't affect that either). So what I do is execute CLUSTER once every 5 minutes on this table. This works just fine, and keeps the table size small even if I have long running transactions in progress. The DB backup is one such unavoidable long running transaction, and I use the table exclusion switch to exclude this task table from the backup so it won't get locked by it, letting CLUSTER still do its job (I had a rudimentary patch to do this even before the feature was introduced to pg_dump). The table can be dumped separately, which is a brief operation, but I would have to clear it on a crash anyway... Now I could try disabling the CLUSTER cron job and see if I get problems; the last time it was disabled we were on postgres 7.4, so maybe something has changed in between... but I can tell for sure that the last time I enabled it, it really fixed our load on the DB server.
Wouldn't it be possible to do it like Simon (IIRC) suggested, and add a parameter to enable/disable the current behavior, using the MVCC behavior as default? Cheers, Csaba.
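Csaba's periodic compaction boils down to something like the following cron-driven statement (the table and index names are invented for illustration):

```sql
-- Run every 5 minutes.  CLUSTER takes an exclusive lock, rewrites the
-- table in index order, and -- in the behavior under discussion --
-- discards dead tuples even while an unrelated long-running
-- transaction such as pg_dump is still open.
CLUSTER task_pkey ON task;
```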
Re: [HACKERS] CLUSTER and MVCC
On Fri, 2007-03-09 at 11:29, Heikki Linnakangas wrote: Is there a particular reason why CLUSTER isn't MVCC-safe? It seems to me that it would be trivial to fix, by using SnapshotAny instead of SnapshotNow, and not overwriting the xmin/xmax with the xid of the cluster command. I feel that I must be missing something, or someone would've already fixed it a long time ago... Probably it is not MVCC-safe because the relation is swapped out from under pg_class. That is, it is possible that older and newer transactions read different data files, and so simple MVCC does not work. Csaba, you mentioned recently (http://archives.postgresql.org/pgsql-hackers/2007-03/msg00027.php) that you're actually using the MVCC-violation to clean up tables during a backup. Can you tell us a bit more about that? Would you be upset if we shut that backdoor? In any case, the MVCC-violation needs to be documented. I'll send a doc patch to pgsql-patches shortly. -- Hannu Krosing Database Architect Skype Technologies OÜ Akadeemia tee 21 F, Tallinn, 12618, Estonia Skype me: callto:hkrosing Get Skype for free: http://www.skype.com
Re: [HACKERS] RFC: changing autovacuum_naptime semantics
Tom Lane [EMAIL PROTECTED] writes: Er, why not just finish out the scan at the reduced I/O rate? Any sort of abort behavior is going to create net inefficiency, eg doing an index scan to remove only a few tuples. ISTM that the vacuum ought to just continue along its existing path at a slower I/O rate. I think the main motivation to abort a vacuum scan is so we can switch to some more urgent scan. So if in the middle of a 1-hour long vacuum of some big warehouse table we realize that a small hot table is long overdue for a vacuum, we want to be able to remove the tuples we've found so far, switch to the hot table, and when we don't have more urgent tables to vacuum, resume the large warehouse table vacuum. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Auto creation of Partitions
Simon, What happens to the data when the function is dropped or replaced? - Luke Msg is shrt cuz m on ma treo -Original Message- From: Simon Riggs [mailto:[EMAIL PROTECTED] Sent: Friday, March 09, 2007 06:20 AM Eastern Standard Time To: NikhilS Cc: Shane Ambler; Luke Lonergan; Zeugswetter Andreas ADI SD; Peter Eisentraut; pgsql-hackers@postgresql.org Subject: Re: [HACKERS] Auto creation of Partitions On Fri, 2007-03-09 at 11:48 +0530, NikhilS wrote: Hi, On 3/9/07, Shane Ambler [EMAIL PROTECTED] wrote: Note to Nikhil: Make sure the new syntax doesn't prevent partitions from being placed upon multiple tablespaces in some manner, at CREATE TABLE time. What if the syntax was something like:

CREATE TABLE tabname (
   ...
   ...
) PARTITION BY HASH(expr) | RANGE(expr) | LIST(expr)
[PARTITIONS num_partitions] /* will apply to HASH only for now */
[PARTITION partition_name CHECK(...) [USING TABLESPACE tblspcname],
 PARTITION partition_name CHECK(...) [USING TABLESPACE tblspcname]
 ...
];

And (if we use ALTER TABLE to add partitions):

ALTER TABLE tabname ADD PARTITION partition_name CHECK(...) [USING TABLESPACE tblspcname];

We could as well drop the USING part. Why would we support HASH partitions? If you did, the full syntax for hash clusters should be supported. If we do the CHECK clauses like that then we still don't have guaranteed non-overlap between partitions. It would be easier to use Oracle syntax and then construct the CHECK clauses from that. Also, the syntax needs to be fairly complex to allow for a mixture of modes, e.g. range and list partitioning. That is currently possible today, and the syntax for doing that is IMHO much simpler than the Oracle way of specifying it. An alternative is to provide a partitioning function which decides which partition each value goes into: PARTITION FUNCTION which_partition(date_col). The partition function must return an unsigned integer > 0, which would correspond to particular partitions. Partitions would be numbered 1..N, and named tablename_partM where 1 <= M <= N. The input and contents of the partition function would be up to the user, e.g.

CREATE FUNCTION range_partition(date date_col) {
   if (date_col < D1) return 1;
   else if (date_col < D2) return 2;
   else if (date_col < D3) return 3;
   return 4;
}

Doing it this way would allow us to easily join two tables based upon a common partition function. In time, I would suggest we support both ways: declarative and functional. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] CLUSTER and MVCC
Csaba Nagy [EMAIL PROTECTED] writes: Wouldn't it be possible to do it like Simon (IIRC) suggested, and add a parameter to enable/disable the current behavior, using the MVCC behavior as default? Doing it in CLUSTER would be weird. However perhaps it would be useful to have some sort of stand-alone tool that just bumped all the xmin/xmax's. It would have to be super-user-only and carry big warning labels saying it breaks MVCC. But it would be useful any time you want to exempt a particular table from serializable snapshots. Basically a per-table way to force a read-committed snapshot on. Though, actually it's not quite a read-committed snapshot, is it? Anyone using an old serializable snapshot will see what, no tuples at all? -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] RFC: changing autovacuum_naptime semantics
Gregory Stark wrote: Tom Lane [EMAIL PROTECTED] writes: Er, why not just finish out the scan at the reduced I/O rate? Any sort of abort behavior is going to create net inefficiency, eg doing an index scan to remove only a few tuples. ISTM that the vacuum ought to just continue along its existing path at a slower I/O rate. I think the main motivation to abort a vacuum scan is so we can switch to some more urgent scan. So if in the middle of a 1-hour long vacuum of some big warehouse table we realize that a small hot table is long overdue for a vacuum, we want to be able to remove the tuples we've found so far, switch to the hot table, and when we don't have more urgent tables to vacuum, resume the large warehouse table vacuum. Why not just let another autovac worker do the hot table? -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Re: [HACKERS] CLUSTER and MVCC
On Fri, 2007-03-09 at 13:42, Gregory Stark wrote: Csaba Nagy [EMAIL PROTECTED] writes: Wouldn't it be possible to do it like Simon (IIRC) suggested, and add a parameter to enable/disable the current behavior, using the MVCC behavior as default? Doing it in CLUSTER would be weird. However perhaps it would be useful to have some sort of stand-alone tool that just bumped all the xmin/xmax's. It would have to be super-user-only and carry big warning labels saying it breaks MVCC. Well, the current behavior of CLUSTER is just perfect for what I'm using it for. If anything else would do the job, I would be happy to use it instead... But it would be useful any time you want to exempt a particular table from serializable snapshots. Basically a per-table way to force a read-committed snapshot on. Though, actually it's not quite a read-committed snapshot, is it? Anyone using an old serializable snapshot will see what, no tuples at all? I'm afraid what I need has nothing to do with serializable snapshots... I still want the table to be completely transactional, except that if somebody can get an exclusive lock on it, it can be compacted regardless of other running transactions. I'm not sure how to express this another way... it means something like: no transaction cares about the content of the table until it gets some kind of lock on it. In other words, the table's state is not connected with the state of other tables until I actually do something on it... Cheers, Csaba.
Re: [HACKERS] CLUSTER and MVCC
Csaba Nagy wrote: On Fri, 2007-03-09 at 12:29, Heikki Linnakangas wrote: Csaba, you mentioned recently (http://archives.postgresql.org/pgsql-hackers/2007-03/msg00027.php) that you're actually using the MVCC-violation to clean up tables during a backup. Can you tell us a bit more about that? Would you be upset if we shut that backdoor? My use case: a queue-like table (in fact a 'task' table) which is very frequently inserted/updated/deleted. This table tends to be bloated in the presence of any long running transaction... The only transactional behavior we need from this table is to make sure that when we insert something into it in a transaction (possibly together with other actions) and then commit/rollback, it commits/rolls back the insert. CLUSTER's violation of MVCC does not affect this, as CLUSTER will not be able to lock the table if another transaction has inserted something into it (the inserting transaction will hold a lock on the table). Selections on this table are not critical for us; it just doesn't matter which job processor gets which task and in what order... (actually it does matter, but CLUSTER won't affect that either). Hmm. You could use something along these lines instead:

0. LOCK TABLE queue_table
1. SELECT * INTO queue_table_new FROM queue_table
2. DROP TABLE queue_table
3. ALTER TABLE queue_table_new RENAME TO queue_table

After all, it's not that you care about the clustering of the table; you just want to remove old tuples. As a long term solution, it would be nice if we had more fine-grained bookkeeping of the snapshots in use in the system. In your case, there are a lot of tuples that are not visible to pg_dump because xmin is too new, and also not visible to any other transaction because xmax is too old. If we had a way to recognize situations like that, and vacuum those tuples, much of the problem with long-running transactions would go away.
Wouldn't it be possible to do it like Simon (IIRC) suggested, and add a parameter to enable/disable the current behavior, using the MVCC behavior as default? I guess we could, but I don't see why we should encourage using CLUSTER for that. A more aggressive, MVCC-breaking version of VACUUM would make more sense to me, but I don't like the idea of adding break-MVCC flags to any commands. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
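The lock-copy-drop-rename workaround proposed earlier in this thread, written out as a sketch (queue_table is a placeholder name). Wrapping all four steps in one transaction makes concurrent sessions wait on the lock, although, as the follow-up in the thread points out, statements whose cached plans reference the old table's OID would still fail after the swap:

```sql
BEGIN;
-- Block all concurrent access to the bloated table.
LOCK TABLE queue_table IN ACCESS EXCLUSIVE MODE;
-- Copy the live rows only; dead tuples are left behind.
CREATE TABLE queue_table_new AS SELECT * FROM queue_table;
-- Swap the fresh copy into place.
DROP TABLE queue_table;
ALTER TABLE queue_table_new RENAME TO queue_table;
COMMIT;
```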
Re: [HACKERS] CLUSTER and MVCC
Gregory Stark wrote: Csaba Nagy [EMAIL PROTECTED] writes: Wouldn't it be possible to do it like Simon (IIRC) suggested, and add a parameter to enable/disable the current behavior, using the MVCC behavior as default? Doing it in CLUSTER would be weird. However perhaps it would be useful to have some sort of stand-alone tool that just bumped all the xmin/xmax's. It would have to be super-user-only and carry big warning labels saying it breaks MVCC. But it would be useful any time you want to exempt a particular table from serializable snapshots. Basically a per-table way to force a read-committed snapshot on. Though, actually it's not quite a read-committed snapshot, is it? Anyone using an old serializable snapshot will see what, no tuples at all? Unless you used FrozenTransactionId ... But I'm not really seeing the problem here. Why isn't Csaba's problem fixed by the fact that HOT reduces the number of dead tuples in the first place? If it does, then he no longer needs the CLUSTER workaround, or at least he needs it to a much lesser extent. -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Re: [HACKERS] CLUSTER and MVCC
Is there a particular reason why CLUSTER isn't MVCC-safe? It seems to me that it would be trivial to fix, by using SnapshotAny instead of SnapshotNow, and not overwriting the xmin/xmax with the xid of the cluster command. It's trivial to fix now in this way, but it would break HOT, since an indexscan only returns one row per index entry. Well, with SnapshotAny, HOT should probably return all possibly visible tuples from an indexscan. (Btw, does CLUSTER really do an index scan? It seems that for reading a whole table a seq scan and sort is usually cheaper, at least when the clustering is so bad that a CLUSTER is needed.) Andreas
Re: [HACKERS] CLUSTER and MVCC
On Fri, 2007-03-09 at 14:00, Alvaro Herrera wrote: But I'm not really seeing the problem here. Why isn't Csaba's problem fixed by the fact that HOT reduces the number of dead tuples in the first place? If it does, then he no longer needs the CLUSTER workaround, or at least he needs it to a much lesser extent. Is this actually true in the case of HOT + long running transactions? I was under the impression that HOT has the same problems in the presence of long running transactions... Cheers, Csaba.
Re: [HACKERS] CLUSTER and MVCC
Hmm. You could use something along these lines instead:

0. LOCK TABLE queue_table
1. SELECT * INTO queue_table_new FROM queue_table
2. DROP TABLE queue_table
3. ALTER TABLE queue_table_new RENAME TO queue_table

After all, it's not that you care about the clustering of the table; you just want to remove old tuples. ... and then restart the app so all my pooled connections drop their cached plans ;-) Seriously, that won't work. If a session tries to insert a new row after I lock the table to clean it up, I still want it to be able to insert after the cleanup is finished... if I drop the table it tries to insert into, it will fail. As a long term solution, it would be nice if we had more fine-grained bookkeeping of the snapshots in use in the system. In your case, there are a lot of tuples that are not visible to pg_dump because xmin is too new, and also not visible to any other transaction because xmax is too old. If we had a way to recognize situations like that, and vacuum those tuples, much of the problem with long-running transactions would go away. In the general case that won't work either, in a strict MVCC sense... if you have an old transaction, you should never clean up a dead tuple which could still be visible to it. Wouldn't it be possible to do it like Simon (IIRC) suggested, and add a parameter to enable/disable the current behavior, using the MVCC behavior as default? I guess we could, but I don't see why we should encourage using CLUSTER for that. A more aggressive, MVCC-breaking version of VACUUM would make more sense to me, but I don't like the idea of adding break-MVCC flags to any commands. Well, if there were any other way to avoid the table bloat, I would agree. Cheers, Csaba.
Re: [HACKERS] Log levels for checkpoint/bgwriter monitoring
On Fri, 9 Mar 2007, ITAGAKI Takahiro wrote: Pinned means bufHdr->refcount > 0, and you don't distinguish pinned from recently-used (bufHdr->usage_count > 0) buffers in your patch. Thank you, I will revise the terminology accordingly. I was using pinned as a shortcut for will be ignored by skip_pinned, which was sloppy of me. As I said, I was trying to show how the buffer cache looks from the perspective of the background writer, and therefore lumping them together because that's how SyncOneBuffer views them. A buffer cache full of either type will be largely ignored by the LRU writer, and that's what I've been finding when running insert/update heavy workloads like pgbench. If I might suggest a terminology change to avoid this confusion in the future, I'd like to rename the SyncOneBuffer skip_pinned parameter to something like skip_active, which is closer to the real behavior. I know Oracle refers to these as hot and cold LRU entries. -- * Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD
Re: [HACKERS] Auto creation of Partitions
On Fri, 2007-03-09 at 07:40 -0500, Luke Lonergan wrote: What happens to the data when the function is dropped or replaced? Well, that wouldn't happen, because you build in a dependency. I'm not working on this, so don't expect lots of detail. The idea is essentially to implement things the way SQL Server does. The function would need some care. It might even need a function that writes a function, e.g. CreateRangePartitionFunction('{date1,date2,date3}', funcname); ReplaceRangePartitionFunction('{date1,date2,date3}', funcname); which would then give a properly designed function called funcname to be used for partitioning, which would include tests to make sure a partition wasn't inadvertently excluded from the list of existing partitions of any table that used it. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com
[HACKERS] Interaction of PITR backups and Bulk operations avoiding WAL
Reviewing earlier threads, I realised that there is a potential bug/loophole in PITR backups in conjunction with avoiding WAL for bulk operations. This would be rare, but should be fixed. http://archives.postgresql.org/pgsql-hackers/2006-05/msg01113.php Say you issue COPY, CREATE INDEX etc., then pg_start_backup() and pg_stop_backup() both run, and then the bulk operation ends. This will result in a base backup that does not contain the data written during the bulk operation, and the changes aren't in WAL either. I propose to fix this by adding two new calls:

bool RequestBulkCommandUseNoWAL(void)
void ResetBulkCommandUseNoWAL(void)

so we would use them like this: use_wal = RequestBulkCommandUseNoWAL() and then at end of operation if (!use_wal) ResetBulkCommandUseNoWAL(); The routine would record a flag in the shmem ControlFile data that would prevent the pg_start_backup functions from executing while a bulk operation was in progress. It would also prevent a bulk operation from using no WAL while a backup was in progress, as is already the case, since the backup can only take place while archiving is enabled. A new entry point pg_start_backup(text, bool) would allow the user to specify whether to wait for bulk ops to finish, or not. The old entry point would always wait, to ensure safety in all cases. Thoughts? -- Simon Riggs EnterpriseDB http://www.enterprisedb.com
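The race described above might look like this as a two-session timeline. This is purely illustrative: bigtable, the file path, and the backup label are invented, and the exact conditions under which the WAL-skipping path is taken are as described in the linked thread, not restated here.

```sql
-- Session 1: a bulk operation whose decision to skip WAL-logging was
-- made once, at its start:
BEGIN;
CREATE TABLE bigtable (id integer);
COPY bigtable FROM '/tmp/bigtable.dat';   -- long-running...

-- Session 2, while the COPY is still in progress:
SELECT pg_start_backup('base');
-- ... filesystem-level copy of the data directory ...
SELECT pg_stop_backup();

-- Session 1 commits afterwards: its rows end up in neither the base
-- backup nor the archived WAL.
COMMIT;
```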
[HACKERS] Re: [COMMITTERS] pgsql: Remove unsafe calling of WSAStartup and WSA Cleanup from DllMain.
Magnus Hagander wrote: In which case it can simply be because I was building against a libpq built with MSVC => it had the broken startup code, and you used a mingw one, which didn't have it. Maybe - but it does imply it's potentially easy to break code with this change. Per the docs, an application like slon *should* make its own call to WSAStartup() because it uses socket functions directly (port/pipe.c)... Another question related to backpatching - should I backpatch this to 8.1 and 8.0 as well? I know we said we more or less don't maintain the win32 port back there because it was too new, but this is all code in the client libpq, which has been around on win32 much longer. The reason I'm asking is that the original reporter of this problem is on 8.1... I'm leaning towards yes, but would like to hear further comments... I'm far from convinced it should be backpatched at all. /D ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [HACKERS] Calculated view fields (8.1 != 8.2)
On Fri, Mar 09, 2007 at 10:59:56AM +0100, Gaetano Mendola wrote: Is this really what we want? I did a migration 8.0.x => 8.2.3 and in the first hour of service I had a lot of queries blocked due to this; consider that in my case I have in v_ta millions of records and usually that join extracts 1 row. Is there a way to disable this behaviour until I have checked my whole (huge) schema? Most people figured it was an improvement. It's configured per function now, which wasn't the case before. I don't think there was ever any discussion about having a global switch. Have a nice day, -- Martijn van Oosterhout kleptog@svana.org http://svana.org/kleptog/ From each according to his ability. To each according to his ability to litigate. signature.asc Description: Digital signature
Re: [HACKERS] Estimating seq_page_fetch and random_page_fetch
On Thu, Mar 08, 2007 at 07:01:17PM -0500, Umar Farooq Minhas wrote: displayed, I want cpu cost and io cost displayed separately when I run EXPLAIN on a particular query. Till now I haven't been able to figure out a 'clean' way of doing this. Can anyone tell me how much time I should expect to spend making such a change? And from where should I start? costsize.c? That's going to be a lot of work. You need to duplicate the variable and every usage of that variable. And I can't imagine why you'd be interested anyway... I have another question. Looking at the optimizer code, it pretty much looks insensitive to the memory factor. The only parameters being utilized are effective_cache_size (in estimating index cost only) and work_mem (for sort, aggregation, groups, hash/merge joins). Are these the only memory factors that DIRECTLY affect the cost estimates of the planner/optimizer? Sure, what other factors were you considering? Have a nice day, -- Martijn van Oosterhout kleptog@svana.org http://svana.org/kleptog/ From each according to his ability. To each according to his ability to litigate. signature.asc Description: Digital signature
Re: [HACKERS] CLUSTER and MVCC
Csaba Nagy wrote: On Fri, 2007-03-09 at 14:00, Alvaro Herrera wrote: But I'm not really seeing the problem here. Why isn't Csaba's problem fixed by the fact that HOT reduces the number of dead tuples in the first place? If it does, then he no longer needs the CLUSTER workaround, or at least, he needs it to a much lesser extent. Is this actually true in the case of HOT + long running transactions ? I was supposing HOT has the same problems in the presence of long running transactions... It does, HOT won't help you here. A long-running transaction is just as much of a problem with HOT as without. Besides, I don't recall that you're doing updates in the first place. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [HACKERS] Calculated view fields (8.1 != 8.2)
Martijn van Oosterhout kleptog@svana.org writes: Most people figured it was a improvment. It's configured per function now, which wasn't the case before. I dont't think there was ever any discussion about having a global switch. Volatile functions that are not at the top level of a query are *always* going to be a risk factor, in that you don't know quite where the planner is going to evaluate them. While I'm not by any means wedded to the 8.2 no-flattening patch, it seems to me to be reasonable because it reduces that uncertainty a bit. The fact that Gaetano's code depended on the uncertainty being resolved in a different direction is unfortunate, but I think his code is really to blame, because postponing the function eval like that couldn't be guaranteed anyway across all queries. regards, tom lane ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Calculated view fields (8.1 != 8.2)
Martijn van Oosterhout wrote: On Fri, Mar 09, 2007 at 10:59:56AM +0100, Gaetano Mendola wrote: Is this really what we want? I did a migration 8.0.x => 8.2.3 and in the first hour of service I had a lot of queries blocked due to this; consider that in my case I have in v_ta millions of records and usually that join extracts 1 row. Is there a way to disable this behaviour until I have checked my whole (huge) schema? Most people figured it was an improvement. It's configured per function now, which wasn't the case before. I don't think there was ever any discussion about having a global switch. If your function is already marked immutable or stable, then nothing changes for you. If you *did* call volatile functions inside your select, then you now get consistent behaviour. Since you don't want your function to be evaluated in all cases, I assume that it shouldn't be marked volatile in the first place. I think a lot of people forget to mark their functions volatile/stable/immutable correctly, or don't know about the implications of these flags. Maybe there should be a guc force_explicit_sideeffect_declaration (defaulting to on) which makes specifying either volatile, stable or immutable mandatory. Then people would (hopefully) read the relevant part of the docs before creating a function, and probably get the declaration right in the first place. greetings, Florian Pflug ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] CLUSTER and MVCC
Heikki Linnakangas wrote: Csaba Nagy wrote: On Fri, 2007-03-09 at 14:00, Alvaro Herrera wrote: But I'm not really seeing the problem here. Why isn't Csaba's problem fixed by the fact that HOT reduces the number of dead tuples in the first place? If it does, then he no longer needs the CLUSTER workaround, or at least, he needs it to a much lesser extent. Is this actually true in the case of HOT + long running transactions ? I was supposing HOT has the same problems in the presence of long running transactions... It does, HOT won't help you here. A long-running transaction is just as much of a problem with HOT as without. Besides, I don't recall that you're doing updates in the first place. Couldn't HOT in principle deal with this? Let's say you have two long-running transactions, which see row versions A and D. While those transactions are running, the row is constantly updated, leading to row versions B, C (before the second long-running transaction started), D, E, F, ... Z. Now, the versions B,C,E,F,...Z could be removed by HOT or vacuum, because they are not currently visible, nor will they ever become visible because they are already deleted. greetings, Florian Pflug ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] Calculated view fields (8.1 != 8.2)
Martijn van Oosterhout wrote: On Fri, Mar 09, 2007 at 10:59:56AM +0100, Gaetano Mendola wrote: Is this really what we want? I did a migration 8.0.x => 8.2.3 and in the first hour of service I had a lot of queries blocked due to this; consider that in my case I have in v_ta millions of records and usually that join extracts 1 row. Is there a way to disable this behaviour until I have checked my whole (huge) schema? Most people figured it was an improvement. It's configured per function now, which wasn't the case before. I don't think there was ever any discussion about having a global switch. Well, it's not an improvement in terms of performance but a performance degradation in the best case, and in the worst case it can be devastating:

create table ta ( a integer, b integer );
CREATE TABLE
create table tb ( b integer, c integer );
CREATE TABLE

CREATE OR REPLACE FUNCTION sp_delete_selected_row ( INTEGER )
RETURNS INTEGER AS '
DECLARE
   a_id ALIAS FOR $1;
BEGIN
   DELETE FROM ta WHERE a = a_id;
   return 0;
END;
' LANGUAGE 'plpgsql';
CREATE FUNCTION

CREATE OR REPLACE VIEW v_ta AS
   SELECT sp_delete_selected_row(a) AS a,
          b AS b
   FROM ta;
CREATE VIEW

insert into ta values (2,3);
INSERT 0 1
insert into ta values (3,4);
INSERT 0 1
insert into tb values (4,5);
INSERT 0 1

select * from v_ta join tb using (b) where c = 5;
 b | a | c
---+---+---
 4 | 0 | 5
(1 row)

select * from ta;
 a | b
---+---
(0 rows)

All rows are gone instead of only the one extracted by that query. IMHO it is an undesired side effect. In my case I destroyed my application statistics on how many times a certain row was extracted. Regards Gaetano Mendola ---(end of broadcast)--- TIP 6: explain analyze is your friend
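Gaetano's complaint boils down to *where* in the plan a volatile function is evaluated. The following toy model (plain Python, not PostgreSQL; the two list comprehensions only mimic the two possible plan shapes, and the side effect is a harmless counter rather than a DELETE) shows why evaluating the view's targetlist before the join filter multiplies the side effects:

```python
calls = []                      # records every invocation of the function

def sp_test(a):                 # "volatile": has an observable side effect
    calls.append(a)
    return 3

ta = [(2, 3), (3, 4)]           # rows of ta: (a, b)
tb = [(4, 5)]                   # rows of tb: (b, c)

# 8.2-style plan shape: the view's targetlist runs for every row of ta,
# before the join condition and WHERE clause filter anything out.
calls.clear()
v_ta = [(sp_test(a), b) for a, b in ta]
result = [(b, a, c) for a, b in v_ta for b2, c in tb if b == b2 and c == 5]
print(result, len(calls))       # [(4, 3, 5)] 2  -- two evaluations

# Function pulled up above the join: it runs only for the surviving row.
calls.clear()
joined = [(a, b, c) for a, b in ta for b2, c in tb if b == b2 and c == 5]
result = [(b, sp_test(a), c) for a, b, c in joined]
print(result, len(calls))       # [(4, 3, 5)] 1  -- one evaluation
```

Both plan shapes return the same row, which is exactly Tom's point in this thread: for a truly volatile function neither evaluation count is guaranteed, so code must not depend on one or the other.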
Re: [HACKERS] CLUSTER and MVCC
Csaba Nagy wrote: Hmm. You could use something along these lines instead:

0. LOCK TABLE queue_table
1. SELECT * INTO queue_table_new FROM queue_table
2. DROP TABLE queue_table
3. ALTER TABLE queue_table_new RENAME TO queue_table

After all, it's not that you care about the clustering of the table, you just want to remove old tuples. ... and then restart the app so all my pooled connections drop their cached plans ;-) Yeah, though Tom's working on plan invalidation for 8.3, so that wouldn't be an issue. Seriously, that won't work. If a session tries to insert a new row after I lock the table to clean it up, I still want it to be able to insert after the cleanup is finished... if I drop the table it tries to insert to, it will fail. Hmm. How about:

1. LOCK TABLE queue_table
2. SELECT * INTO temp_table FROM queue_table
3. TRUNCATE queue_table
4. INSERT INTO queue_table SELECT * FROM temp_table

That way you're copying the rows twice, but if there aren't many live tuples it shouldn't matter too much. As a long term solution, it would be nice if we had more fine-grained bookkeeping of snapshots that are in use in the system. In your case, there's a lot of tuples that are not visible to pg_dump because xmin is too new, and also not visible to any other transaction because xmax is too old. If we had a way to recognize situations like that, and vacuum those tuples, much of the problem with long-running transactions would go away. In the general case that won't work either in a strict MVCC sense... if you have an old transaction, you should never clean up a dead tuple which could be still visible to it. We wouldn't clean up tuples that are visible to a transaction, but if you have one long-running transaction like pg_dump in a database with otherwise short transactions, you'll have a lot of tuples that are not vacuumable because of the long-running process, but are not in fact visible to any transaction.
Those are tuples that were inserted too late to be seen by the old transaction, and deleted too long ago to still be seen by any other transaction. Let me illustrate this with a timeline:

 xmin1       xmax1
   |           |
 --+--X-X+X----+--ooXoXoXoXXo--+---- now
         |                     |
       xmin2                 xmax2

xmin1 and xmax1 are the xmin and xmax of an old, long-running serializable transaction, like pg_dump. The Xs between them are xids of transactions that the old transaction sees as in-progress, IOW the SnapshotData.xip-array. xmin2 and xmax2 are the xmin and xmax of a newer transaction. Because of the old long-running transaction, xmin2 is far behind xmax2, but there's a wide gap between that and the next transaction that the newer transaction sees as in-progress. The current rule to determine whether a tuple is dead or not is to check that the tuple's xmax < oldestxmin. Oldestxmin is in this case xmin1. But in addition to that, any tuple with an xmin > xmax1 and an xmax that's not in the xip-array of any snapshot in use (marked with o above) isn't visible to any current or future transaction, and can therefore be safely vacuumed. The implementation problem is that we don't have a global view of all snapshots in the system. If we solve that, we can be more aggressive with vacuuming in the presence of long-running transactions. It's not an easy problem; we don't want to add a lot of accounting overhead, but maybe we could have some kind of an approximation of the global state with little overhead that would give most of the benefit. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
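The rule sketched above can be stated compactly: a tuple whose inserter and deleter have both committed is vacuumable iff no in-use snapshot sees the insert without also seeing the delete. Here is a toy model in Python (invented names; a snapshot is reduced to a (xmin, xmax, xip) triple, not the real SnapshotData):

```python
# A snapshot is modelled as (xmin, xmax, xip): everything below xmin is
# finished, everything at or above xmax is still in progress, and xip is
# the set of in-progress xids in between (the xip-array in the text).

def xid_in_progress(snap, xid):
    xmin, xmax, xip = snap
    if xid < xmin:
        return False            # finished before the snapshot was taken
    if xid >= xmax:
        return True             # started after the snapshot was taken
    return xid in xip

def tuple_vacuumable(snapshots, t_xmin, t_xmax):
    """For a tuple whose inserting and deleting xids both committed:
    vacuumable iff no in-use snapshot sees the insert but not the
    delete. Future snapshots will always see both, so they are safe."""
    for snap in snapshots:
        insert_seen = not xid_in_progress(snap, t_xmin)
        delete_seen = not xid_in_progress(snap, t_xmax)
        if insert_seen and not delete_seen:
            return False        # still visible to this snapshot
    return True

# The timeline's scenario: an old snapshot (pg_dump) and a newer one
# whose xmin is held back by it.
old = (10, 20, {12, 15, 18})        # xmin1 = 10, xmax1 = 20
new = (10, 100, {10, 95, 97})       # xmin2 = 10, xmax2 = 100

# Inserted and deleted entirely inside the "o" gap: invisible to both
# snapshots, hence vacuumable even though its xmax >= oldestxmin (10).
print(tuple_vacuumable([old, new], 30, 40))    # True

# Deleted by xid 95, which the newer snapshot still sees as in-progress.
print(tuple_vacuumable([old, new], 30, 95))    # False
```

Under the current xmax < oldestxmin rule, neither of these tuples could be vacuumed; the per-snapshot check is what recovers the first one.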
Re: [HACKERS] CLUSTER and MVCC
Zeugswetter Andreas ADI SD wrote: Is there a particular reason why CLUSTER isn't MVCC-safe? It seems to me that it would be trivial to fix, by using SnapshotAny instead of SnapshotNow, and not overwriting the xmin/xmax with the xid of the cluster command. It's trivial to fix now in this way, but it would break HOT, since an indexscan only returns one row per index entry. Well, with SnapshotAny HOT should probably return all possibly visible tuples with an indexscan. (Btw, does CLUSTER really do an index scan? Seems for reading a whole table a seq scan and sort is usually cheaper, at least when the clustering is so bad that a CLUSTER is needed.) Yes, it does an indexscan (last time I checked, at least). I think if a performance improvement is demonstrated, we would accept a patch ... -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [HACKERS] CLUSTER and MVCC
Florian G. Pflug wrote: Couldn't HOT in principle deal with this? Let's say you have two long-running transactions, which see row versions A and D. While those transactions are running, the row is constantly updated, leading to row versions B, C (before the second long-running transaction started), D, E, F, ... Z. Now, the versions B,C,E,F,...Z could be removed by HOT or vacuum, because they are not currently visible, nor will they ever become visible because they are already deleted. Yes, you could detect that but you'd need a global view of all snapshots in the system. I just posted a reply in this thread with more details.. It's not just with HOT, it's the way we determine that a tuple is vacuumable in general. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] CLUSTER and MVCC
On 3/9/07, Florian G. Pflug [EMAIL PROTECTED] wrote: Heikki Linnakangas wrote: Csaba Nagy wrote: On Fri, 2007-03-09 at 14:00, Alvaro Herrera wrote: But I'm not really seeing the problem here. Why isn't Csaba's problem fixed by the fact that HOT reduces the number of dead tuples in the first place? If it does, then he no longer needs the CLUSTER workaround, or at least, he needs it to a much lesser extent. Is this actually true in the case of HOT + long running transactions ? I was supposing HOT has the same problems in the presence of long running transactions... It does, HOT won't help you here. A long-running transaction is just as much of a problem with HOT as without. Besides, I don't recall that you're doing updates in the first place. Couldn't HOT in principle deal with this? Let's say you have two long-running transactions, which see row versions A and D. While those transactions are running, the row is constantly updated, leading to row versions B, C (before the second long-running transaction started), D, E, F, ... Z. Now, the versions B,C,E,F,...Z could be removed by HOT or vacuum, because they are not currently visible, nor will they ever become visible because they are already deleted. Couldn't they (or at least one of them) become visible due to SAVEPOINT rollback? greetings, Florian Pflug ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq -- Mike Rylander [EMAIL PROTECTED] GPLS -- PINES Development Database Developer http://open-ils.org ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] CLUSTER and MVCC
Mike Rylander wrote: On 3/9/07, Florian G. Pflug [EMAIL PROTECTED] wrote: Couldn't HOT in principle deal with this? Let's say you have two long-running transactions, which see row versions A and D. While those transactions are running, the row is constantly updated, leading to row versions B, C (before the second long-running transaction started), D, E, F, ... Z. Now, the versions B,C,E,F,...Z could be removed by HOT or vacuum, because they are not currently visible, nor will they ever become visible because they are already deleted. Couldn't they (or at least one of them) become visible due to SAVEPOINT rollback? You wouldn't remove tuples with an uncommited xmax, of course. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Calculated view fields (8.1 != 8.2)
Gaetano Mendola wrote: Martijn van Oosterhout wrote: On Fri, Mar 09, 2007 at 10:59:56AM +0100, Gaetano Mendola wrote: Is this really what we want? I did a migration 8.0.x => 8.2.3 and in the first hour of service I had a lot of queries blocked due to this; consider that in my case I have in v_ta millions of records and usually that join extracts 1 row. Is there a way to disable this behaviour until I have checked my whole (huge) schema? Most people figured it was an improvement. It's configured per function now, which wasn't the case before. I don't think there was ever any discussion about having a global switch. Well, it's not an improvement in terms of performance but a performance degradation in the best case, and in the worst case it can be devastating:

create table ta ( a integer, b integer );
CREATE TABLE
create table tb ( b integer, c integer );
CREATE TABLE

CREATE OR REPLACE FUNCTION sp_delete_selected_row ( INTEGER )
RETURNS INTEGER AS '
DECLARE
   a_id ALIAS FOR $1;
BEGIN
   DELETE FROM ta WHERE a = a_id;
   return 0;
END;
' LANGUAGE 'plpgsql';
CREATE FUNCTION

CREATE OR REPLACE VIEW v_ta AS
   SELECT sp_delete_selected_row(a) AS a,
          b AS b
   FROM ta;
CREATE VIEW

insert into ta values (2,3);
INSERT 0 1
insert into ta values (3,4);
INSERT 0 1
insert into tb values (4,5);
INSERT 0 1

select * from v_ta join tb using (b) where c = 5;
 b | a | c
---+---+---
 4 | 0 | 5
(1 row)

select * from ta;
 a | b
---+---
(0 rows)

All rows are gone instead of only the one extracted by that query. IMHO it is an undesired side effect. In my case I destroyed my application statistics on how many times a certain row was extracted. This is insane. Whoever creates a view like that on a production system should *immediately* be carried away from his keyboard, to prevent further damage. Imagine someone using View Data on this view in pgadmin.. I don't wanna be near him when he clicks Refresh, and suddenly all data is gone... 
Maybe calling volatile functions in selects and views should be forbidden entirely, except for volatile functions in the top-level select clause, to support things like select ..., nextval('seq') from ... But it's probably not worth the effort - there will always be creative ways to shoot yourself in the foot. greetings, Florian Pflug ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [HACKERS] CLUSTER and MVCC
Heikki Linnakangas [EMAIL PROTECTED] writes: Is there a particular reason why CLUSTER isn't MVCC-safe? It seems to me that it would be trivial to fix, by using SnapshotAny instead of SnapshotNow, and not overwriting the xmin/xmax with the xid of the cluster command. The reason it's not trivial is that you also have to preserve the t_ctid links of update chains. If you look into VACUUM FULL, a very large part of its complexity is that it moves update chains as a unit to make that possible. (BTW, I believe the problem Pavan Deolasee reported yesterday is a bug somewhere in there --- it looks to me like sometimes the same update chain is getting copied multiple times.) regards, tom lane ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Re: [COMMITTERS] pgsql: Remove unsafe calling of WSAStartup and WSA Cleanup from DllMain.
Dave Page [EMAIL PROTECTED] writes: Magnus Hagander wrote: Another question related to backpatching - should I backpatch this to 8.1 and 8.0 as well? I'm far from convinced it should be backpatched at all. I tend to agree with Dave --- I think this change needs to go through a beta-testing cycle before we unleash it on the world. regards, tom lane ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] who gets paid for this
On Thu, Mar 08, 2007 at 12:10:22 -0800, Christian Bird [EMAIL PROTECTED] wrote: I'm a grad student at UC Davis studying the postgres community and I wanted to know if someone on this list could help me out. I'm studying the factors that affect people graduating from being a mailing list participant to a developer with write access to the repository. Is it possible to find out who is being employed to work on postgres and who is doing it on their own time? Some of my data points to there being two ways that people make the jump. More specifically, could those who worked on postgres as some aspect of their job prior to getting repo access let me know? Or if there are devs who know this information about others, I'd be really appreciative to get it. Thanks a lot. Si Chen from Open Source Strategies talked to a number of mailing list contributors (which is different from code contributors) a year or two ago. They are supposed to have a web page about this at http://www.opensourcestrategies.com/pgsurvey/control/main but I am getting a 500 error right now. The rest of their web pages are working, so they may still be there. There is a contact link on their main page which you might use to contact them and see if you can get access to those results. ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] CLUSTER and MVCC
On Fri, 2007-03-09 at 12:48 -0300, Alvaro Herrera wrote: Zeugswetter Andreas ADI SD wrote: Is there a particular reason why CLUSTER isn't MVCC-safe? It seems to me that it would be trivial to fix, by using SnapshotAny instead of SnapshotNow, and not overwriting the xmin/xmax with the xid of the cluster command. It's trivial to fix now in this way, but it would break HOT, since an indexscan only returns one row per index entry. Well, with SnapshotAny HOT should probably return all possibly visible tuples with an indexscan. (Btw, does CLUSTER really do an index scan ? Seems for reading a whole table a seq scan and sort is usually cheaper, at least when the clustering is so bad that a CLUSTER is needed.) Yes, it does an indexscan (last time I checked, at least). I think if a performance improvement is demonstrated, we would accept a patch ... Again, right now, most things people do here will break HOT. At this late stage before freeze, please everybody be careful to look and plan for patch conflicts. (That isn't stay away, just be careful). Thanks. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Interaction of PITR backups and Bulk operations avoiding WAL
Simon Riggs [EMAIL PROTECTED] writes:

Say you issue COPY, CREATE INDEX etc..
	pg_start_backup()
	pg_stop_backup()
...then bulk operation ends.

This will result in a base backup that does not contain the data written during the bulk operation and the changes aren't in WAL either. Uh, no. The state of XLogArchivingActive() isn't affected by that. It strikes me that allowing archive_command to be changed on the fly might not be such a good idea though, or at least it shouldn't be possible to flip it from empty to nonempty during live operation. regards, tom lane ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] CLUSTER and MVCC
Tom Lane wrote: Heikki Linnakangas [EMAIL PROTECTED] writes: Is there a particular reason why CLUSTER isn't MVCC-safe? It seems to me that it would be trivial to fix, by using SnapshotAny instead of SnapshotNow, and not overwriting the xmin/xmax with the xid of the cluster command. The reason it's not trivial is that you also have to preserve the t_ctid links of update chains. If you look into VACUUM FULL, a very large part of its complexity is that it moves update chains as a unit to make that possible. (BTW, I believe the problem Pavan Deolasee reported yesterday is a bug somewhere in there --- it looks to me like sometimes the same update chain is getting copied multiple times.) Ah, that's it. Thanks. The easiest solution I can think of is to skip newer versions of updated rows when scanning the old relation, and to fetch and copy all tuples in the update chain to the new relation whenever you encounter the first tuple in the chain. To get a stable view of what's the first tuple in chain, you need to get the oldest xmin once at the beginning, and use that throughout the operation. Since we take an exclusive lock on the table, no-one can insert new updated tuples during the operation, and all updaters are finished before the lock is granted. Those tuples wouldn't be in the cluster order, though, but that's not a big deal. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
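The scan order Heikki suggests can be sketched as follows. This is a toy model in Python with an invented tuple layout: each tuple carries its own ctid and a t_ctid-style link to the next version in its update chain (None at the end of the chain). A tuple that something else points to is a newer version and is skipped; when a chain head is found, the whole chain is copied as one contiguous unit so the links can be rebuilt in the new relation:

```python
def copy_preserving_chains(tuples):
    """Emit ctids in an order that keeps every update chain together,
    starting from its head (the version nothing else points to)."""
    by_ctid = {t["ctid"]: t for t in tuples}
    pointed_to = {t["next"] for t in tuples if t["next"] is not None}
    out = []
    for t in tuples:
        if t["ctid"] in pointed_to:
            continue            # newer version: emitted with its chain head
        cur = t
        while cur is not None:  # walk old -> new along the t_ctid links
            out.append(cur["ctid"])
            cur = by_ctid.get(cur["next"])
    return out

chain = [
    {"ctid": (0, 2), "next": (0, 3)},   # middle version of a chain
    {"ctid": (0, 1), "next": (0, 2)},   # chain head: nothing points to it
    {"ctid": (0, 3), "next": None},     # latest version
    {"ctid": (1, 1), "next": None},     # an un-updated row
]
print(copy_preserving_chains(chain))
# [(0, 1), (0, 2), (0, 3), (1, 1)] -- each chain emitted as one unit
```

As the message notes, the tail of a chain then lands wherever its head does rather than in strict cluster order, which is the accepted trade-off.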
Re: [HACKERS] Interaction of PITR backups and Bulk operationsavoiding WAL
On Fri, 2007-03-09 at 11:15 -0500, Tom Lane wrote: Simon Riggs [EMAIL PROTECTED] writes: Say you issue COPY, CREATE INDEX etc.. pg_start_backup() pg_stop_backup() ...then bulk operation ends. This will result in a base backup that does not contain the data written during the bulk operation and the changes aren't in WAL either. Uh, no. The state of XLogArchivingActive() isn't affected by that. Sorry, the error case should have been:

Say you issue COPY, CREATE INDEX etc..
	set archive_command
	pg_ctl reload
	pg_start_backup()
	pg_stop_backup()
...then bulk operation ends.

It strikes me that allowing archive_command to be changed on the fly might not be such a good idea though, or at least it shouldn't be possible to flip it from empty to nonempty during live operation. As long as we allow it to be turned on/off during normal operation, there is a current window of error. I'd rather fix it the proposed way than force a restart. ISTM wrong to have an availability feature cause downtime. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] CLUSTER and MVCC
Heikki Linnakangas [EMAIL PROTECTED] writes: We wouldn't clean up tuples that are visible to a transaction, but if you have one long-running transaction like pg_dump in a database with otherwise short transaction, you'll have a lot of tuples that are not vacuumable because of the long-running process, but are not in fact visible to any transaction. It sounds to me like you are proposing to remove the middles of update chains, which would break READ-COMMITTED updates initiated by the older transactions. Now admittedly pg_dump isn't going to issue any such updates, but VACUUM doesn't know that. regards, tom lane ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] CLUSTER and MVCC
Tom Lane wrote: Heikki Linnakangas [EMAIL PROTECTED] writes: We wouldn't clean up tuples that are visible to a transaction, but if you have one long-running transaction like pg_dump in a database with otherwise short transaction, you'll have a lot of tuples that are not vacuumable because of the long-running process, but are not in fact visible to any transaction. It sounds to me like you are proposing to remove the middles of update chains, which would break READ-COMMITTED updates initiated by the older transactions. Now admittedly pg_dump isn't going to issue any such updates, but VACUUM doesn't know that. I was thinking of inserts+deletes. Updates are harder, you'd need to change the ctid of the old version to skip the middle part of the chain, atomically, but I suppose they could be handled as well. Isolation level doesn't really matter. We just need a global view of in-use *snapshots* in the system, serializable or not. Not that that's an easy thing to do... -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [HACKERS] CLUSTER and MVCC
Heikki Linnakangas wrote: The implementation problem is that we don't have a global view of all snapshots in the system. If we solve that, we can be more aggressive with vacuuming in presence of long-running transactions. It's not an easy problem, we don't want to add a lot of accounting overhead, but maybe we could have some kind of an approximation of the global state with little overhead, that would give most of the benefit. Hm.. Maybe there could be a fixed-sized list of xids together with a usecount in shared memory. If a transaction puts an xid into its snapshot, it increments the usecount of that xid in the global list (inserting it if it's not already in the list). If there is no free space in the list, it first removes all xids with xid < oldestxmin. If there is still no free space, it does nothing. When the transaction is done with the snapshot, it decrements all the usecounts of xids it incremented before. You then know that an xid is *not* viewed as in-progress by any transaction if the xid is in that list and has a usecount of zero. greetings, Florian Pflug ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
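Florian's bookkeeping idea can be sketched like this. It is a toy model in Python (a dict stands in for the fixed-size shared-memory array, all names are invented, and locking is omitted); the important property is that the structure is only ever an approximation, erring on the side of "don't know", which is safe for vacuum:

```python
class XidUseList:
    """Toy stand-in for a fixed-size shared list of (xid, usecount)."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.counts = {}                 # xid -> usecount

    def _prune(self, oldest_xmin):
        # Entries below oldest_xmin with no remaining users can never
        # matter again, so they may be dropped to make room.
        for xid in [x for x, c in self.counts.items()
                    if c == 0 and x < oldest_xmin]:
            del self.counts[xid]

    def snapshot_taken(self, xip, oldest_xmin):
        """Bump the usecount of every xid the new snapshot sees as
        in-progress; returns the xids actually tracked (the list may be
        full, in which case we simply lose precision)."""
        tracked = []
        for xid in xip:
            if xid not in self.counts and len(self.counts) >= self.capacity:
                self._prune(oldest_xmin)
            if xid in self.counts:
                self.counts[xid] += 1
                tracked.append(xid)
            elif len(self.counts) < self.capacity:
                self.counts[xid] = 1
                tracked.append(xid)
        return tracked

    def snapshot_released(self, tracked):
        for xid in tracked:
            self.counts[xid] -= 1

    def not_in_progress_anywhere(self, xid):
        """True only when we positively know no in-use snapshot still
        sees this xid as in-progress."""
        return self.counts.get(xid) == 0

lst = XidUseList(capacity=4)
t1 = lst.snapshot_taken({12, 15}, oldest_xmin=10)
t2 = lst.snapshot_taken({15, 30}, oldest_xmin=10)
lst.snapshot_released(t1)
print(lst.not_in_progress_anywhere(12))   # True: only the first held it
print(lst.not_in_progress_anywhere(15))   # False: still held by t2
```

An untracked xid answers False ("don't know"), so an overflowed list merely makes vacuum less aggressive, never unsafe.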
Re: [HACKERS] Interaction of PITR backups and Bulk operationsavoiding WAL
Simon Riggs [EMAIL PROTECTED] writes: On Fri, 2007-03-09 at 11:15 -0500, Tom Lane wrote: It strikes me that allowing archive_command to be changed on the fly might not be such a good idea though, or at least it shouldn't be possible to flip it from empty to nonempty during live operation. I'd rather fix it the proposed way than force a restart. ISTM wrong to have an availability feature cause downtime. I don't think that people are very likely to need to turn archiving on and off on-the-fly. Your proposed solution introduces a great deal of complexity (and risk of future bugs-of-omission, to say nothing of race conditions) to solve a non-problem. We have better things to be doing with our development time. regards, tom lane ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] CLUSTER and MVCC
Tom Lane wrote: Heikki Linnakangas [EMAIL PROTECTED] writes: We wouldn't clean up tuples that are visible to a transaction, but if you have one long-running transaction like pg_dump in a database with otherwise short transactions, you'll have a lot of tuples that are not vacuumable because of the long-running process, but are not in fact visible to any transaction. It sounds to me like you are proposing to remove the middles of update chains, which would break READ-COMMITTED updates initiated by the older transactions. Now admittedly pg_dump isn't going to issue any such updates, but VACUUM doesn't know that. You could restrict this to serializable transactions, or even to read-only transactions. Or maybe the tuple could be reduced to just its header - doesn't HOT do something similar? greetings, Florian Pflug ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] Interaction of PITR backups and Bulk operations avoiding WAL
On Fri, 2007-03-09 at 11:47 -0500, Tom Lane wrote: Simon Riggs [EMAIL PROTECTED] writes: On Fri, 2007-03-09 at 11:15 -0500, Tom Lane wrote: It strikes me that allowing archive_command to be changed on the fly might not be such a good idea though, or at least it shouldn't be possible to flip it from empty to nonempty during live operation. I'd rather fix it the proposed way than force a restart. ISTM wrong to have an availability feature cause downtime. I don't think that people are very likely to need to turn archiving on and off on-the-fly. Your proposed solution introduces a great deal of complexity (and risk of future bugs-of-omission, to say nothing of race conditions) to solve a non-problem. We have better things to be doing with our development time. It's certainly a quicker fix. Unless others object, I'll set archive_command to only be changeable at server startup. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] Interaction of PITR backups and Bulk operations avoiding WAL
Tom Lane wrote: Simon Riggs [EMAIL PROTECTED] writes: On Fri, 2007-03-09 at 11:15 -0500, Tom Lane wrote: It strikes me that allowing archive_command to be changed on the fly might not be such a good idea though, or at least it shouldn't be possible to flip it from empty to nonempty during live operation. I'd rather fix it the proposed way than force a restart. ISTM wrong to have an availability feature cause downtime. I don't think that people are very likely to need to turn archiving on and off on-the-fly. Your proposed solution introduces a great deal of complexity (and risk of future bugs-of-omission, to say nothing of race conditions) to solve a non-problem. We have better things to be doing with our development time. So how to do a file based backup without permanent archiving? If pg_start_backup turned on archiving temporarily, forcing the archiving of all WAL files that contain open transactions, this would be possible. This is what's requested for sites where PITR isn't needed, just filesystem level backup. Currently, this can be mimicked somehow by turning on archiving on-the-fly, hoping that all xactions are in the WAL archive when pg_start_backup is issued (Simon's mail shows how this will fail). Regards, Andreas ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Interaction of PITR backups and Bulk operations avoiding WAL
On Fri, 2007-03-09 at 17:47, Tom Lane wrote: I don't think that people are very likely to need to turn archiving on and off on-the-fly. We did occasionally need to turn archiving on on-the-fly. It did happen that I started up a new DB machine before the log archive was available, so I had to postpone configuring that; the machine went on-line before the archive machine was ready... and then later I had to switch on archiving. It was very convenient that I could do it without a restart. It's true that this has been a rare occasion; more often you just need to change the archive command (e.g. to archive to a different location if the archive repository goes down). It's somewhat moot for us as we changed to use Slony (which is a heavy beast, but once it works it's great). Cheers, Csaba. ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] Estimating seq_page_fetch and random_page_fetch
It would be interesting to see some code here. Maybe this would be a great opportunity to start some sort of 'auto-tune' (as an option) in the area. -- GJ C/C++/SQL freelance to hire. ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Bug: Buffer cache is not scan resistant
Cool! - Luke Msg is shrt cuz m on ma treo -Original Message- From: Simon Riggs [mailto:[EMAIL PROTECTED] Sent: Friday, March 09, 2007 02:32 PM Eastern Standard Time To: Luke Lonergan; ITAGAKI Takahiro Cc: Sherry Moore; Tom Lane; Mark Kirkwood; Pavan Deolasee; Gavin Sherry; PGSQL Hackers; Doug Rady Subject: Re: [HACKERS] Bug: Buffer cache is not scan resistant On Tue, 2007-03-06 at 22:32 -0500, Luke Lonergan wrote: Incidentally, we tried triggering NTA (L2 cache bypass) unconditionally and in various patterns and did not see the substantial gain as with reducing the working set size. My conclusion: fixing the OS is not sufficient to alleviate the issue. We see a 2x penalty (1700MB/s versus 3500MB/s) at the higher data rates due to this effect. I've implemented buffer recycling, as previously described, patch being posted now to -patches as scan_recycle_buffers. This version includes buffer recycling - for SeqScans larger than shared buffers, with the objective of improving L2 cache efficiency *and* reducing the effects of shared buffer cache spoiling (both as previously discussed on this thread) - for VACUUMs of any size, with the objective of reducing WAL thrashing whilst keeping VACUUM's behaviour of not spoiling the buffer cache (as originally suggested by Itagaki-san, just with a different implementation). Behaviour is not activated by default in this patch. To request buffer recycling, set the USERSET GUC scan_recycle_buffers = N (tested with 1, 4, 8 and 16, but only 8 seems sensible, IMHO). The patch affects StrategyGetBuffer, so it only affects the disk-cache path. The idea is that if it's already in shared buffer cache then we get substantial benefit already and nothing else is needed. So for the general case, the patch adds a single if test into the I/O path. The parameter is picked up at the start of SeqScan and VACUUM (currently). Any change mid-scan will be ignored. IMHO it's possible to do this and to allow Synch Scans at the same time, with some thought. 
There is no need for us to rely on cache spoiling behaviour of scans to implement that feature as well. Independent performance tests requested, so that we can discuss this objectively. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com
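The recycling idea Simon describes (reuse a small, fixed set of buffers for a large scan instead of spoiling the whole cache) can be illustrated with a toy ring. RingState and both function names are invented for illustration; this is not the actual scan_recycle_buffers patch.

```c
/* Toy ring of buffer ids that a large seq scan reuses, keeping its
 * L2-cache footprint small.  Invented names; not the real patch. */
#define SCAN_RECYCLE_BUFFERS 8   /* the value Simon found sensible */

typedef struct {
    int buffers[SCAN_RECYCLE_BUFFERS];
    int nfilled;   /* slots holding a previously used buffer */
    int next;      /* next slot to recycle */
} RingState;

/* Where StrategyGetBuffer would be consulted: return a buffer id to
 * reuse, or -1 meaning "take a fresh buffer from the clock sweep". */
int ring_get_buffer(RingState *ring)
{
    if (ring->nfilled < SCAN_RECYCLE_BUFFERS)
        return -1;                        /* ring not yet primed */
    int buf = ring->buffers[ring->next];
    ring->next = (ring->next + 1) % SCAN_RECYCLE_BUFFERS;
    return buf;
}

/* Remember a buffer the scan just used so it can be recycled later. */
void ring_note_buffer(RingState *ring, int buf)
{
    if (ring->nfilled < SCAN_RECYCLE_BUFFERS)
        ring->buffers[ring->nfilled++] = buf;
}
```

This matches the "single if test into the I/O path" claim: buffers already in cache are untouched, and only the miss path consults the ring.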
Re: [HACKERS] who gets paid for this
Andrew Dunstan wrote: Tom Lane wrote: Even more to the point, getting paid for has almost nothing to do with has commit privileges. At least on this project. Darn. So the cheque isn't really in the mail? I think his question was just what fraction of developers work on PostgreSQL on company time. regards, Lukas ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] Log levels for checkpoint/bgwriter monitoring
On Mar 8, 2007, at 11:51 PM, Greg Smith wrote: almost everything that's dirty is also pinned during pgbench, and the LRU is lucky to find anything it can write as a result I'm wondering if pg_bench is a good test of this stuff. ISTM it's unrealistically write-heavy, which is going to tend to not only put a lot of dirty buffers into the pool, but also keep them pinned enough that you can't write them. Perhaps you should either modify pg_bench to do a lot more selects out of the various tables or look towards a different benchmark. -- Jim Nasby[EMAIL PROTECTED] EnterpriseDB http://enterprisedb.com 512.569.9461 (cell) ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] Log levels for checkpoint/bgwriter monitoring
On Mar 9, 2007, at 7:57 AM, Greg Smith wrote: On Fri, 9 Mar 2007, ITAGAKI Takahiro wrote: Pinned means bufHdr->refcount > 0, and you don't distinguish pinned or recently-used (bufHdr->usage_count > 0) buffers in your patch. Thank you, I will revise the terminology used accordingly. I was using pinned as a shortcut for will be ignored by skip_pinned, which was sloppy of me. As I said, I was trying to show how the buffer cache looks from the perspective of the background writer, and therefore lumping them together because that's how SyncOneBuffer views them. A buffer cache full of either type will be largely ignored by the LRU writer, and that's what I've been finding when running insert/update heavy workloads like pgbench. If I might suggest a terminology change to avoid this confusion in the future, I'd like to rename the SyncOneBuffer skip_pinned parameter to something like skip_active, which is closer to the real behavior. I know Oracle refers to these as hot and cold LRU entries. Well, AIUI, whether the buffer is actually pinned or not is almost inconsequential (other than that if a buffer *is* pinned then its usage count is about to become > 0, so there's no reason to consider writing it). What that parameter really does is control whether you're going to follow the LRU semantics or not... -- Jim Nasby [EMAIL PROTECTED] EnterpriseDB http://enterprisedb.com 512.569.9461 (cell) ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
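The distinction Jim is drawing could be sketched like this. It is a toy model, not the real SyncOneBuffer code in bufmgr.c: BufferDesc here is simplified and the names are invented, with "skip_active" standing for the proposed rename of skip_pinned.

```c
/* Toy model of the decision: with skip_active set, the LRU writer
 * passes over both pinned (refcount > 0) and recently-used
 * (usage_count > 0) buffers -- the two cases being conflated above. */
#include <stdbool.h>

typedef struct {
    int refcount;      /* pin count: someone is actively using the page */
    int usage_count;   /* clock-sweep popularity counter */
    bool dirty;
} BufferDesc;

/* Should the LRU background writer skip this buffer? */
bool lru_writer_skips(const BufferDesc *buf, bool skip_active)
{
    if (!buf->dirty)
        return true;                      /* nothing to write */
    if (skip_active && (buf->refcount > 0 || buf->usage_count > 0))
        return true;                      /* pinned or recently used */
    return false;
}
```

With skip_active false the writer follows simple dirty-page semantics; with it true it follows the LRU semantics, which is Jim's point about what the parameter really controls.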
Re: [HACKERS] Bug in VACUUM FULL ?
Pavan Deolasee [EMAIL PROTECTED] writes: The problem mentioned before is hard to reproduce with the suggested change, but it's not completely gone away. I have seen it again on CVS HEAD with the patch applied. I am facing another issue with VACUUM FULL. This problem gets reproduced with HOT very easily, but takes a few attempts to reproduce with CVS HEAD; it certainly exists, though. I've been banging away on this since yesterday, and I think I've achieved a full understanding of what's going on. There are three or four different-looking pathologies but they all seem to arise from the same problem: the update-chain-moving code assumes that RECENTLY_DEAD tuples will never have update successors that are entirely DEAD (according to HeapTupleSatisfiesVacuum). When faced with an update chain in which that does happen, it can move the chain multiple times, neglect to remove index entries for tuples that get truncated away, crash on an Assert, or other interesting stuff. Here's an example from some debug printouts I inserted into repair_frag:

chain forward branches 17/174 to 17/183 (x 1993046 1993057) RECENTLY_DEAD
chain forward branches 17/183 to 15/109 (x 1993057 1993055) RECENTLY_DEAD
chain forward branches 15/109 to 15/111 (x 1993055 1993045) DEAD
chain forward branches 15/111 to 15/114 (x 1993045 1993025) DEAD
chain forward branches 15/114 to 15/116 (x 1993025 1993096) RECENTLY_DEAD
chain forward branches 15/116 to 15/119 (x 1993096 1993107) RECENTLY_DEAD
chain forward branches 15/119 to 15/121 (x 1993107 1993120) RECENTLY_DEAD
chain forward branches 15/121 to 15/125 (x 1993120 1993121) RECENTLY_DEAD
chain forward branches 15/125 to 15/128 (x 1993121 1993122) RECENTLY_DEAD
chain forward branches 15/128 to 15/131 (x 1993122 1993092) RECENTLY_DEAD
chain forward branches 15/131 to 15/133 (x 1993092 1993145) RECENTLY_DEAD
chain forward branches 15/133 to 15/139 (x 1993145 1993182) RECENTLY_DEAD
chain forward branches 15/139 to 15/141 (x 1993182 1993183) RECENTLY_DEAD
chain forward branches 15/141 to 15/147 (x 1993183 1993155) RECENTLY_DEAD
chain forward branches 15/147 to 15/150 (x 1993155 1993167) LIVE
chain back stops at branches 15/114: xmin 1993025 < 1993050
moved branches 15/150 to 0/69; next 0/69
moved branches 15/147 to 0/70; next 0/69
moved branches 15/141 to 0/71; next 0/70
moved branches 15/139 to 0/72; next 0/71
moved branches 15/133 to 0/73; next 0/72
moved branches 15/131 to 0/74; next 0/73
moved branches 15/128 to 0/75; next 0/74
moved branches 15/125 to 0/76; next 0/75
moved branches 15/121 to 0/77; next 0/76
moved branches 15/119 to 0/78; next 0/77
moved branches 15/116 to 0/79; next 0/78
moved branches 15/114 to 0/80; next 0/79

Since TIDs 17/174 and 17/183 didn't get moved, when the repair_frag search arrives at 17/183 it will copy this chain again, leading to duplicate copies of the LIVE tuple at the chain end, leading to trouble. It's not surprising that tuples could have xmax less than xmin, since transactions can commit in orders different than they start; when using READ COMMITTED updates it's not at all surprising that a transaction might update rows after a later-numbered transaction does. However, in looking at this code previously I'd assumed that the OldestXmin cutoff could never fall between two such transactions, and so the above scenario wouldn't happen. I'm not real sure why I thought that. For the cases that VACUUM FULL is interested in, both XIDs mentioned in a DEAD tuple must have committed before OldestXmin was computed, but there doesn't seem to be a compelling reason why OldestXmin might not have been determined by an unrelated third transaction with a number between those two. I believe it's the case that any update chain members appearing before a DEAD entry must in fact also be dead (ie, not listed as live in any active snapshot) but a test based on OldestXmin hasn't got enough resolution to prove that they are dead. 
Does anyone want to argue that this is an error in the calculation of OldestXmin (and if so, how would you propose to fix it)? If not, I'll set to work on fixing the chain-moving logic. I think the correct behavior is that we shouldn't consider DEAD tuples part of a movable chain, and so in the above example there are two separate chains to move not one. Alternatively we could try to recognize that the older part of the chain is really dead and removable, but that seems complicated and likely to introduce new bugs. I wonder whether this has any implications for HOT ... regards, tom lane ---(end of broadcast)--- TIP 6: explain analyze is your friend
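Tom's classification can be modeled in a few lines. This is a toy version of the HeapTupleSatisfiesVacuum outcome under the simplifying assumption that every transaction involved has committed; the names are invented. Using the xids from the debug dump above (cutoff 1993050), the tuple at 15/109 (xmin 1993055, xmax 1993045) comes out DEAD with xmax < xmin, immediately after a RECENTLY_DEAD predecessor, which is exactly the case the chain-moving code does not expect.

```c
/* Toy classification of update-chain members, assuming all xids shown
 * have committed.  Not the real visibility code. */
#include <stdint.h>

typedef uint32_t TransactionId;

typedef enum { TUPLE_LIVE, TUPLE_RECENTLY_DEAD, TUPLE_DEAD } TupleState;

typedef struct {
    TransactionId xmin;   /* creating transaction (committed) */
    TransactionId xmax;   /* deleting transaction (0 = not deleted) */
} ToyTuple;

TupleState classify(ToyTuple t, TransactionId oldest_xmin)
{
    if (t.xmax == 0)
        return TUPLE_LIVE;
    /* Deleter committed: removable only if no snapshot can still see it,
     * which an OldestXmin-based test approximates as xmax < cutoff. */
    return (t.xmax < oldest_xmin) ? TUPLE_DEAD : TUPLE_RECENTLY_DEAD;
}
```

Note that nothing in the test looks at xmin at all, which is why a cutoff that happens to fall between a low-numbered deleter and a high-numbered creator produces a DEAD tuple in the middle of an otherwise RECENTLY_DEAD chain.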
Re: [HACKERS] Auto creation of Partitions
On Mar 9, 2007, at 3:31 AM, Zeugswetter Andreas ADI SD wrote: Since partition is inheritance-based, a simple DROP or NO INHERIT will do the job to deal with the partition. Do we want to reinvent additional syntax when these are around and are documented? Well, if the syntax for adding a new partition eventually ends up as ALTER TABLE ADD PARTITION, then it would make more sense that you remove a partition via ALTER TABLE DROP PARTITION. But DROP PARTITION usually moves the data from this partition to other partitions, so it is something different. It does? IIRC, in every partitioning system I've seen, DROP PARTITION drops the data as well. It's up to you to move it somewhere else if you want to keep it. -- Jim Nasby [EMAIL PROTECTED] EnterpriseDB http://enterprisedb.com 512.569.9461 (cell) ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] Auto creation of Partitions
On Fri, 2007-03-09 at 15:41, Jim Nasby wrote: On Mar 9, 2007, at 3:31 AM, Zeugswetter Andreas ADI SD wrote: Since partition is inheritance-based, a simple DROP or NO INHERIT will do the job to deal with the partition. Do we want to reinvent additional syntax when these are around and are documented? Well, if the syntax for adding a new partition eventually ends up as ALTER TABLE ADD PARTITION, then it would make more sense that you remove a partition via ALTER TABLE DROP PARTITION. But DROP PARTITION usually moves the data from this partition to other partitions, so it is something different. It does? IIRC every partitioning system I've seen DROP PARTITION drops the data as well. It's up to you to move it somewhere else if you want to keep it. Will this proposed DROP PARTITION just disassociate the table from the master, or will it actually drop the partition's table from the whole database? -- Hannu Krosing Database Architect Skype Technologies OÜ Akadeemia tee 21 F, Tallinn, 12618, Estonia Skype me: callto:hkrosing Get Skype for free: http://www.skype.com ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] who gets paid for this
Christian Bird wrote: Hi all, I'm a grad student at UC Davis studying the postgres community and I wanted to know if some on this list could help me out. I'm studying the factors that affect people graduating from being mailing list participant to developers with write access to the repository. It is done on a meritocracy basis, and whether the person is paid to work on PostgreSQL has no bearing on it. I believe (this would need verification) that it is -core who decides who gets actual commit privileges. Joshua D. Drake -- === The PostgreSQL Company: Command Prompt, Inc. === Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240 Providing the most comprehensive PostgreSQL solutions since 1997 http://www.commandprompt.com/ Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate PostgreSQL Replication: http://www.commandprompt.com/products/ ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Bug in VACUUM FULL ?
On Fri, 2007-03-09 at 16:40 -0500, Tom Lane wrote: I wonder whether this has any implications for HOT ... My general feeling, expressed in a number of recent posts, was that the VACUUM FULL code really isn't worth the trouble it causes. Especially when CLUSTER does a better job anyway? I've made a number of different proposals for changing VACUUM FULL, and Hannu posted some really cool ideas. Please can we spend time doing something useful, rather than trying to fix up a bag of worms that nobody ever runs? C'mon guys, this isn't a challenge, it's a lost cause. I don't really mean to be radical, but I just think VACUUM FULL's time has come. A better utility could be written in the time it takes to fix and be certain of a fix. Yes, we need a utility that compacts a table, but isn't there a faster, safer way of doing that than the current VACUUM FULL algorithm and code? We can still *call* it VACUUM FULL. Modular replacement has been done numerous times over the years with great success, e.g. tuplesort, index build... let's do the same thing now and kiss goodbye to some code whose time has come. Put it another way: if anybody submitted a patch that does what VACUUM FULL does, coded in the way it is, it would never be applied, now. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
[HACKERS] autovacuum next steps, take 3
Here is a low-level, very detailed description of the implementation of the autovacuum ideas we have so far. launcher's dealing with databases - We'll add a new member nexttime to the autovac_dbase struct, which will be the time_t of the next time a worker needs to process that DB. Initially, those times will be 0 for all databases. The launcher will keep that list in memory, and on each iteration it will fetch the entry that has the earliest time, and sleep until that time. When it awakens, it will start a worker on that database and set the nexttime to now+naptime. The list will be a Dllist so that it's easy to keep it sorted by increasing time, picking the head of the list each time and then putting that node back as the new tail. Every so often, the launcher will call autovac_get_database_list and compare that list with the list it has in memory. If a new database is in the list, it will assign a nexttime between the current instant and the time of the head of the Dllist. Then it'll put it as the new head. The new database will thus be put as the next database to be processed. When a node with nexttime=0 is found, the amount of time to sleep will be determined as Min(naptime/num_elements, 1), so that initially databases will be distributed roughly evenly in the naptime interval. When a nexttime in the past is detected, the launcher will start a worker either right away or as soon as possible (read below). launcher and worker interactions - The launcher PID will be in shared memory, so that workers can signal it. We will also keep worker information in shared memory as an array of WorkerInfo structs:

typedef struct
{
    Oid     wi_dboid;
    Oid     wi_tableoid;
    int     wi_workerpid;
    bool    wi_finished;
} WorkerInfo;

We will use SIGUSR1 to communicate between workers and launcher. When the launcher wants to start a worker, it sets the dboid field and signals the postmaster. Then it goes back to sleep. 
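The launcher's database scheduling described above could be sketched as follows, using a plain array rather than the proposal's Dllist. AutovacDB, pick_next_db and reschedule_db are invented names for illustration only.

```c
/* Toy sketch of the launcher's per-database schedule: pick the entry
 * with the earliest nexttime, launch a worker on it, then push it one
 * naptime into the future.  nexttime = 0 (never scheduled) naturally
 * sorts first, matching the proposal's initial state. */
#include <time.h>

typedef struct {
    unsigned int db_oid;
    time_t nexttime;      /* 0 = never scheduled yet */
} AutovacDB;

/* Return the index of the database to process next (earliest nexttime),
 * or -1 if the list is empty. */
int pick_next_db(const AutovacDB *dbs, int ndbs)
{
    int best = -1;
    for (int i = 0; i < ndbs; i++)
        if (best < 0 || dbs[i].nexttime < dbs[best].nexttime)
            best = i;
    return best;
}

/* After launching a worker on dbs[i], schedule its next run. */
void reschedule_db(AutovacDB *dbs, int i, time_t now, int naptime)
{
    dbs[i].nexttime = now + naptime;
}
```

The Dllist in the proposal buys the same behavior without the linear scan: the head is always the earliest entry, and rescheduling is a move-to-tail.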
When a worker has started up and is about to start vacuuming, it will store its PID in workerpid, and then send a SIGUSR1 to the launcher. If the schedule says that there's no need to run a new worker, the launcher will go back to sleeping. We cannot call SendPostmasterSignal a second time just after calling it; the second call would be lost. So it is important that the launcher does not try to start a worker until there's no worker starting. So if the launcher wakes up for any reason and detects that there is a WorkerInfo entry with valid dboid but workerpid is zero, it will go back to sleep. Since the starting worker will send a signal as soon as it finishes starting up, the launcher will wake up, detect this condition and then it can start a second worker. Also, the launcher cannot start new workers when there are autovacuum_max_workers already running. So if there are that many when it wakes up, it cannot do anything else but go back to sleep again. When one of those workers finishes, it will wake the launcher by setting the finished flag on its WorkerInfo, and sending SIGUSR1 to the launcher. The launcher then wakes up, resets the WorkerInfo struct, and can start another worker if needed. There is an additional problem if, for some reason, a worker starts and is not able to finish its task correctly. It will not be able to set its finished flag, so the launcher will believe that it's still starting up. To prevent this problem, we check the PGPROCs of worker processes, and clean them up if we find they are not actually running (or the PIDs correspond to processes that are not autovacuum workers). We only do it if all WorkerInfo structures are in use, thus frequently enough so that this problem doesn't cause any starvation, but seldom enough so that it's not a performance hit. worker to-do list - When each worker starts, it determines which tables to process in the usual fashion: get pg_autovacuum and pgstat data and compute the equations. 
The worker then takes a snapshot of what's currently going on in the database, by storing worker PIDs, the corresponding table OID that's being currently worked, and the to-do list for each worker. It removes from its to-do list the tables being processed. Finally, it writes the list to disk. The table list will be written to a file in PGDATA/vacuum/database-oid/todo.worker-pid The file will consist of table OIDs, in the order in which they are going to be vacuumed. At this point, vacuuming can begin. Before processing each table, it scans the WorkerInfos to see if there's a new worker, in which case it reads its to-do list into memory. Then it again fetches the tables being processed by other workers in the same database, and for each other worker, removes from its own in-memory to-do list all those tables mentioned in the other lists that appear earlier than the current table being processed (inclusive). Then it picks the next non-removed table in the list.
Re: [HACKERS] Bug in VACUUM FULL ?
Simon Riggs [EMAIL PROTECTED] writes: On Fri, 2007-03-09 at 16:40 -0500, Tom Lane wrote: I wonder whether this has any implications for HOT ... My general feeling, expressed in a number of recent posts was that the VACUUM FULL code really isn't worth the trouble it causes. Especially when CLUSTER does a better job anyway? Point A: we have to fix the back branches anyway. Point B: until we have an MVCC-safe CLUSTER, that is not a substitute. regards, tom lane ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Bug in VACUUM FULL ?
Put it another way: if anybody submitted a patch that does what VACUUM FULL does, coded in the way it is, it would never be applied, now. Have an opinion do we? How about we just alias VACUUM FULL to cluster and add the reporting stuff from VERBOSE? Joshua D. Drake -- === The PostgreSQL Company: Command Prompt, Inc. === Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240 Providing the most comprehensive PostgreSQL solutions since 1997 http://www.commandprompt.com/ Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate PostgreSQL Replication: http://www.commandprompt.com/products/ ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [HACKERS] Log levels for checkpoint/bgwriter monitoring
Jim Nasby [EMAIL PROTECTED] writes: On Mar 8, 2007, at 11:51 PM, Greg Smith wrote: almost everything that's dirty is also pinned during pgbench, and the LRU is lucky to find anything it can write as a result I'm wondering if pg_bench is a good test of this stuff. On reflection I think that Greg's result is probably unsurprising, and furthermore does not indicate that anything is wrong. What it shows (now that we got past the terminology) is that only about half of the buffer pool is subject to replacement during any given clock sweep. For low-usage pages that's about what you'd expect: a page is sucked in on demand (using a buffer returned by the clock sweep), and when we're done with it it'll have usage_count = 1. If it's not touched again then when the clock sweep returns to it it'll be decremented to usage_count 0, and on the next visit it'll be recycled for use as something else. Thus for low-usage pages you'd fully expect that about half of the buffer population has usage_count 1 and the rest has usage count 0; which is strikingly close to Greg's measurement that 48.8% of the population has usage_count 0. What this seems to tell us is that pgbench's footprint of heavily used pages (those able to achieve usage_counts above 1) is very small. Which is probably right unless you've used a very large scale factor. I'd be interested to know what scale factor and shared_buffers setting led to the above measurement. It strikes me that the patch would be more useful if it produced a histogram of the observed usage_counts, rather than merely the count for usage_count = 0. regards, tom lane ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
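The histogram Tom suggests could look like the sketch below. The function and the ceiling constant are invented here; real instrumentation would walk the buffer descriptors during the sweep rather than take an array.

```c
/* Tally every usage_count value seen, instead of reporting only the
 * usage_count == 0 population.  Invented names; a sketch only. */
#define MAX_USAGE_COUNT 5   /* usage_count saturates at a small ceiling */

/* counts[i] receives the number of buffers with usage_count == i;
 * anything above the ceiling is clamped into the top bucket. */
void usage_count_histogram(const int *usage_counts, int nbuffers,
                           int counts[MAX_USAGE_COUNT + 1])
{
    for (int i = 0; i <= MAX_USAGE_COUNT; i++)
        counts[i] = 0;
    for (int i = 0; i < nbuffers; i++) {
        int u = usage_counts[i];
        if (u > MAX_USAGE_COUNT)
            u = MAX_USAGE_COUNT;
        counts[u]++;
    }
}
```

With the full distribution in hand, the "footprint of heavily used pages" Tom describes is simply the mass in the buckets above 1, rather than something inferred from the usage_count = 0 fraction alone.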
Re: [HACKERS] autovacuum next steps, take 3
My initial reaction is that this looks good to me, but still a few comments below. Alvaro Herrera wrote: Here is a low-level, very detailed description of the implementation of the autovacuum ideas we have so far. launcher's dealing with databases - [ Snip ] launcher and worker interactions [Snip] worker to-do list - When each worker starts, it determines which tables to process in the usual fashion: get pg_autovacuum and pgstat data and compute the equations. The worker then takes a snapshot of what's currently going on in the database, by storing worker PIDs, the corresponding table OID that's being currently worked, and the to-do list for each worker. Does a new worker really care about the PID of other workers or what table they are currently working on? It removes from its to-do list the tables being processed. Finally, it writes the list to disk. Just to be clear, the new worker removes from its todo list all the tables mentioned in the todo lists of all the other workers? The table list will be written to a file in PGDATA/vacuum/database-oid/todo.worker-pid The file will consist of table OIDs, in the order in which they are going to be vacuumed. At this point, vacuuming can begin. This all sounds good to me so far. Before processing each table, it scans the WorkerInfos to see if there's a new worker, in which case it reads its to-do list to memory. It's not clear to me why a worker cares that there is a new worker, since the new worker is going to ignore all the tables that are already claimed by all worker todo lists. Then it again fetches the tables being processed by other workers in the same database, and for each other worker, removes from its own in-memory to-do all those tables mentioned in the other lists that appear earlier than the current table being processed (inclusive). Then it picks the next non-removed table in the list. 
All of this must be done with the Autovacuum LWLock grabbed in exclusive mode, so that no other worker can pick the same table (no IO takes place here, because the whole lists were saved in memory at the start.) Again, it's not clear to me what this is gaining us. It seems to me that when a worker starts up and writes out its to-do list, it should just do it; I don't see the value in workers constantly updating their todo lists. Maybe I'm just missing something; can you enlighten me? other things to consider This proposal doesn't deal with the hot tables stuff at all, but that is very easy to bolt on later: just change the first phase, where the initial to-do list is determined, to exclude cold tables. That way, the vacuuming will be fast. Determining what is a cold table is still an exercise for the reader ... I think we can make this algorithm naturally favor small / hot tables with one small change: have workers remove tables that they have just vacuumed from their to-do lists and re-write those lists to disk. Assuming the todo lists are ordered by size ascending, smaller tables will be made available for inspection by newer workers sooner rather than later. ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] who gets paid for this
Joshua D. Drake wrote: Christian Bird wrote: Hi all, I'm a grad student at UC Davis studying the postgres community and I wanted to know if some on this list could help me out. I'm studying the factors that affect people graduating from being mailing list participant to developers with write access to the repository. It is done on a meritocracy basis and has zero bearing if the person is paid to work on PostgreSQL or not. I believe (would need verification) that it is -core who decides who gets actual commit privileges. It is. Note also that not all of core are committers. /D ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] autovacuum next steps, take 3
Matthew T. O'Connor matthew@zeut.net writes: Does a new worker really care about the PID of other workers or what table they are currently working on? As written, it needs the PIDs so it can read in the other workers' todo lists (which are in files named by PID). It's not clear to me why a worker cares that there is a new worker, since the new worker is going to ignore all the tables that are already claimed by all worker todo lists. That seems wrong to me, since it means that new workers will ignore tables that are scheduled for processing by an existing worker, no matter how far in the future that schedule extends. As an example, suppose you have half a dozen large tables in need of vacuuming. The first worker in will queue them all up, and subsequent workers will do nothing useful, at least not till the first worker is done with the first table. Having the first worker update its todo list file after each table allows the earlier tables to be exposed for reconsideration, but that's expensive and it does nothing for later tables. I suggest that maybe we don't need exposed TODO lists at all. Rather the workers could have internal TODO lists that are priority-sorted in some way, and expose only their current table OID in shared memory. Then the algorithm for processing each table in your list is 1. Grab the AutovacSchedule LWLock exclusively. 2. Check to see if another worker is currently processing that table; if so drop LWLock and go to next list entry. 3. Recompute whether table needs vacuuming; if not, drop LWLock and go to next entry. (This test covers the case where someone vacuumed the table since you made your list.) 4. Put table OID into shared memory, drop LWLock, then vacuum table. 5. Clear current-table OID from shared memory, then repeat for next list entry. This creates a behavior of whoever gets to it first rather than allowing workers to claim tables that they actually won't be able to service any time soon. 
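The five numbered steps above could be sketched as below. The lock stubs, the shared current-table array, and the threshold and vacuum helpers are all invented stand-ins for whatever the real implementation would use.

```c
/* Toy sketch of Tom's per-table claim loop.  Invented names throughout. */
#include <stdbool.h>

#define MAX_WORKERS 8

/* Shared memory: the table each worker is vacuuming (0 = none). */
static unsigned int current_table[MAX_WORKERS];

static void lock_autovac_schedule(void)   { /* LWLockAcquire(...) */ }
static void unlock_autovac_schedule(void) { /* LWLockRelease(...) */ }

/* Stub for re-running the vacuum-threshold equations on one table. */
static bool table_still_needs_vacuum(unsigned int tableoid)
{
    (void) tableoid;
    return true;    /* a real worker recomputes the thresholds here */
}

static int tables_vacuumed = 0;
static void vacuum_table(unsigned int tableoid)
{
    (void) tableoid;
    tables_vacuumed++;    /* stub standing in for the actual VACUUM */
}

/* Returns true if this worker vacuumed the table, false if it skipped. */
bool try_vacuum_table(int my_worker, unsigned int tableoid)
{
    lock_autovac_schedule();                          /* step 1 */
    for (int i = 0; i < MAX_WORKERS; i++) {
        if (i != my_worker && current_table[i] == tableoid) {
            unlock_autovac_schedule();                /* step 2: claimed */
            return false;
        }
    }
    if (!table_still_needs_vacuum(tableoid)) {
        unlock_autovac_schedule();                    /* step 3: stale */
        return false;
    }
    current_table[my_worker] = tableoid;              /* step 4 */
    unlock_autovac_schedule();
    vacuum_table(tableoid);
    current_table[my_worker] = 0;                     /* step 5 */
    return true;
}
```

Only the currently claimed table lives in shared memory; each worker's full list stays private, which is what removes the need for the exposed on-disk TODO files.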
regards, tom lane
Re: [HACKERS] Log levels for checkpoint/bgwriter monitoring
On Fri, 9 Mar 2007, Jim Nasby wrote: I'm wondering if pg_bench is a good test of this stuff. ISTM it's unrealistically write-heavy, which is going to tend to not only put a lot of dirty buffers into the pool, but also keep them pinned enough that you can't write them. Whether it's unrealistically write-heavy kind of depends on what your real app is. The standard pgbench is a bit weird because it does so many updates to tiny tables, which adds a level of locking contention that doesn't really reflect many real-world situations. But the no-branch mode (update/select to accounts, insert into history) isn't too dissimilar from some insert-heavy logging applications I've seen. The main reason I brought this all up was because Itagaki seemed to be using pgbench for some of his performance tests. I just wanted to point out that the LRU background writer specifically tends to be very underutilized when using pgbench. -- * Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [HACKERS] Bug in VACUUM FULL ?
Tom Lane [EMAIL PROTECTED] writes: It's not surprising that tuples could have xmax less than xmin, since transactions can commit in orders different than they start; when using READ COMMITTED updates it's not at all surprising that a transaction might update rows after a later-numbered transaction does. However, in looking at this code previously I'd assumed that the OldestXmin cutoff could never fall between two such transactions, and so the above scenario wouldn't happen. I'm not real sure why I thought that. For the cases that VACUUM FULL is interested in, both XIDs mentioned in a DEAD tuple must have committed before OldestXmin was computed, but there doesn't seem to be a compelling reason why OldestXmin might not have been determined by an unrelated third transaction with a number between those two. No commentary but in case anyone else is having trouble following I had to make the following diagram (I think this is what you're describing?) before I fully understood what you were describing:

TXN 1     TXN 2     TXN 3     TXN 4     VACUUM
START
  .                 START
  .       START       .
  .       UPDATE      .
  .       COMMIT      .
DELETE                .
COMMIT                .
                      .       START
                    COMMIT      .
                                .       START

So txn 4's xmin is txn 3, leaving the global OldestXmin = txn 3 which lies between txn 1 and txn 2. And the tuple chain consists of two tuples. The original which has xmax younger than OldestXmin and so is RECENTLY_DEAD. And the updated tuple which has xmax older than OldestXmin and so is DEAD even though it has xmin younger than OldestXmin. Hm, I wonder if you could just notice that xmin is younger than OldestXmin. In a more complex example you could have lots of DEAD tuples in the chain and some RECENTLY_DEAD mixed in randomly. But I think all the DEAD tuples following a RECENTLY_DEAD would have to have xmin younger than OldestXmin. Or maybe I'm making the same mistake again. Gosh, this is confusing. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
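The RECENTLY_DEAD/DEAD distinction above reduces to comparing the deleting XID against OldestXmin. A simplified sketch (the function name is invented; it assumes both XIDs have committed, treats XIDs as plain integers, and ignores wraparound):

```python
def dead_tuple_status(xmax, oldest_xmin):
    """Classify an already-deleted tuple the way the discussion above does,
    by comparing the deleting XID (xmax) against OldestXmin. Simplified
    illustration, not the real HeapTupleSatisfiesVacuum logic."""
    if xmax >= oldest_xmin:
        return "RECENTLY_DEAD"   # some snapshot may still see this tuple
    return "DEAD"                # no live snapshot can see it

# The chain from the diagram: start order is txn1, txn3, txn2, so the XIDs
# satisfy txn1 < txn3 < txn2, and OldestXmin = txn3 falls between the two
# XIDs stored in the chain.
txn1, txn3, txn2 = 100, 101, 102
oldest_xmin = txn3

original_tuple = {"xmin": 50,   "xmax": txn2}  # updated by txn2: xmax > OldestXmin
updated_tuple  = {"xmin": txn2, "xmax": txn1}  # deleted by txn1: xmax < OldestXmin < xmin
```

The updated tuple comes out DEAD even though its xmin is younger than OldestXmin, which is exactly the surprising case in the thread.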
Re: [HACKERS] autovacuum next steps, take 3
Tom Lane wrote: Matthew T. O'Connor matthew@zeut.net writes: It's not clear to me why a worker cares that there is a new worker, since the new worker is going to ignore all the tables that are already claimed by all worker todo lists. That seems wrong to me, since it means that new workers will ignore tables that are scheduled for processing by an existing worker, no matter how far in the future that schedule extends. As an example, suppose you have half a dozen large tables in need of vacuuming. The first worker in will queue them all up, and subsequent workers will do nothing useful, at least not till the first worker is done with the first table. Having the first worker update its todo list file after each table allows the earlier tables to be exposed for reconsideration, but that's expensive and it does nothing for later tables. Well the big problem that we have is not that large tables are being starved, so this doesn't bother me too much, plus there is only so much IO, so one worker working sequentially through the big tables seems OK to me. I suggest that maybe we don't need exposed TODO lists at all. Rather the workers could have internal TODO lists that are priority-sorted in some way, and expose only their current table OID in shared memory. Then the algorithm for processing each table in your list is 1. Grab the AutovacSchedule LWLock exclusively. 2. Check to see if another worker is currently processing that table; if so drop LWLock and go to next list entry. 3. Recompute whether table needs vacuuming; if not, drop LWLock and go to next entry. (This test covers the case where someone vacuumed the table since you made your list.) 4. Put table OID into shared memory, drop LWLock, then vacuum table. 5. Clear current-table OID from shared memory, then repeat for next list entry. This creates a behavior of whoever gets to it first rather than allowing workers to claim tables that they actually won't be able to service any time soon. 
Right, but you could wind up with as many workers working concurrently as you have tables in a database which doesn't seem like a good idea either. One thing I like about the todo list setup Alvaro had is that new workers will be assigned fewer tables to work on and hence exit sooner. We are going to fire off a new worker every autovac_naptime so availability of new workers isn't going to be a problem. ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Bug in VACUUM FULL ?
Gregory Stark [EMAIL PROTECTED] writes: So txn 4's xmin is txn 3, leaving the global OldestXmin = txn 3 which lies between txn 1 and txn 2. And the tuple chain consists of two tuples. The original which has xmax younger than OldestXmin and so is RECENTLY_DEAD. And the updated tuple which has xmax older than OldestXmin and so is DEAD even though it has xmin younger than OldestXmin. Right. Hm, I wonder if you could just notice that xmin is younger than OldestXmin. You can see that at the newer tuple, but the problem is to propagate the knowledge back to the older tuple(s). Or were you suggesting that we treat the newer tuple as RECENTLY_DEAD instead of DEAD? That seems a step backwards in space-reclamation ability. It'd be hard to implement in any case, because one of the problem cases is where VACUUM has already recycled the DEAD tuple before visiting the RECENTLY_DEAD tuple that chains to it. regards, tom lane ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] Log levels for checkpoint/bgwriter monitoring
On Fri, 9 Mar 2007, Tom Lane wrote: I'd be interested to know what scale factor and shared_buffers setting led to the above measurement. That was just a trivial example with 1 client, scale=10 (~160MB database), and shared_buffers=24MB. Where things really get interesting with pgbench is on a system with enough horsepower+clients to dirty the whole buffer cache well before a checkpoint. I regularly see 75% of the cache dirty and blocked from LRU writes with pgbench's slightly pathological workload in that situation. You're correct that these results aren't particularly surprising or indicative of a problem to be corrected. But they do shed some light on what pgbench is and isn't appropriate for testing. It strikes me that the patch would be more useful if it produced a histogram of the observed usage_counts, rather than merely the count for usage_count = 0. I'll start working in that direction. With the feedback everyone has given me on how few of the buffers are truly pinned via the correct usage of the term, I'm going to revisit the usage details and revise that section. I'm happy with how I'm reporting the checkpoint details now, still some work left to do on the bgwriter activity. -- * Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD ---(end of broadcast)--- TIP 6: explain analyze is your friend
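The histogram Tom suggests is straightforward to sketch. A minimal illustration (the function and the iterable-of-counts input are invented for this sketch, not the real bufmgr data structures):

```python
from collections import Counter

def usage_count_histogram(buffers):
    """Build a histogram of observed buffer usage_counts, as suggested
    above, rather than reporting only the usage_count == 0 total.
    `buffers` is any iterable of integer usage_counts."""
    hist = Counter(buffers)
    evictable = hist.get(0, 0)   # only usage_count 0 buffers are candidates for LRU writes
    return dict(sorted(hist.items())), evictable
```

The full histogram shows at a glance how much of the cache is pinned high on the clock sweep versus sitting at zero and available to the LRU background writer.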
Re: [HACKERS] Bug in VACUUM FULL ?
Tom Lane [EMAIL PROTECTED] writes: Hm, I wonder if you could just notice that xmin is younger than OldestXmin. You can see that at the newer tuple, but the problem is to propagate the knowledge back to the older tuple(s). Or were you suggesting that we treat the newer tuple as RECENTLY_DEAD instead of DEAD? That seems a step backwards in space-reclamation ability. It'd be hard to implement in any case, because one of the problem cases is where VACUUM has already recycled the DEAD tuple before visiting the RECENTLY_DEAD tuple that chains to it. I think I was suggesting treating the newer tuple as RECENTLY_DEAD. Ie, not vacuuming a tuple if either xmin or xmax is younger than OldestXmin. It's a step backwards in space-reclamation but really, how often can it happen? But I'm not entirely sure that's enough really. Any number of old transactions could come along and update the head of the tuple chain, setting both xmin and xmax to old values. I guess only the one tuple immediately following the RECENTLY_DEAD tuple would have the young xmin. That doesn't really help you identify the later DEAD tuples. Breaking the chain up into pieces seems weird. It seems like it's obviously bogus and only works because we're sure the tuples are dead anyways so it doesn't really matter what we do with them. If we're not sure they're dead it seems like the right thing to do is either to keep the whole chain or to relink the chain after removing the dead intervening tuples. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Bug in VACUUM FULL ?
Gregory Stark [EMAIL PROTECTED] writes: Breaking the chain up into pieces seems weird. It seems like it's obviously bogus and only works because we're sure the tuples are dead anyways so it doesn't really matter what we do with them. Yup, exactly. If we wanted to be tense about this we'd try to get rid of the nominally RECENTLY_DEAD tuples that precede any DEAD tuple in the chain. However, I concur with Simon to the extent that I don't want to do any more work to fix this bug than necessary, and trying to recognize such tuples seems like a lot more work than necessary. Also, we know this case works because it already is working: in the situation where VACUUM happens to visit and remove the DEAD tuple(s) before reaching the RECENTLY_DEAD tuples that link forward to them, it treats the RECENTLY_DEAD tuples as a disconnected chain and moves them as-is. I saw tons of this in the traces I was making today, and it doesn't seem to create any bad effects. (My attention was drawn to it because I saw move_chain_tuple being used to move single-member chains, which looks impossible when you first look at the code --- the is-it-a-chain test seems to ensure that we can link either forward or backward. But not so if t_ctid points to an already-removed tuple.) regards, tom lane ---(end of broadcast)--- TIP 6: explain analyze is your friend
[HACKERS] My honours project - databases using dynamically attached entity-properties
Hi,

Was wondering if people would mind having a read over what I plan to do for my undergraduate honours project - you can get the proposal here: http://www.mcs.vuw.ac.nz/~eddie/489_Proposal.pdf

What I'd basically like to know is:
a) Is this problem worth solving?
b) Is there already a good solution (particularly, within PostgreSQL)?
c) Any other feedback or comments you may have.

Regards Eddie Stanley
Re: [HACKERS] Calculated view fields (8.1 != 8.2)
Florian G. Pflug wrote: Gaetano Mendola wrote: Martijn van Oosterhout wrote: On Fri, Mar 09, 2007 at 10:59:56AM +0100, Gaetano Mendola wrote: Is this really what we want? I did a migration 8.0.x -> 8.2.3 and in the first hour of service I had a lot of queries blocked due to this; consider that in my case I have millions of records in v_ta and usually that join extracts 1 row. Is there a way, until I have checked all of my huge schema, to disable this behaviour? Most people figured it was an improvement. It's configured per function now, which wasn't the case before. I don't think there was ever any discussion about having a global switch. Well, it's not an improvement in terms of performance but a performance degradation in the best case, and in the worst case it can be devastating:

create table ta ( a integer, b integer );
CREATE TABLE
create table tb ( b integer, c integer );
CREATE TABLE

CREATE OR REPLACE FUNCTION sp_delete_selected_row ( INTEGER )
RETURNS INTEGER AS '
DECLARE
    a_id ALIAS FOR $1;
BEGIN
    DELETE FROM ta where a = a_id;
    return 0;
END;
' LANGUAGE 'plpgsql';
CREATE FUNCTION

CREATE OR REPLACE VIEW v_ta AS
    SELECT sp_delete_selected_row(a) AS a,
           b AS b
    FROM ta;
CREATE VIEW

insert into ta values (2,3);
INSERT 0 1
insert into ta values (3,4);
INSERT 0 1
insert into tb values (4,5);
INSERT 0 1

select * from v_ta join tb using (b) where c = 5;
 b | a | c
---+---+---
 4 | 0 | 5
(1 row)

select * from ta;
 a | b
---+---
(0 rows)

All rows are gone instead of only the one extracted by that query. IMHO this is an undesired side effect. In my case it destroyed my application statistics on how many times a certain row was selected. This is insane. Whoever creates a view like that on a production system should *immediately* be carried away from his keyboard, to prevent further damage. Imagine someone using View Data on this view in pgadmin.. I don't wanna be near him when he clicks Refresh, and suddenly all data is gone...
Maybe calling volatile functions in selects and views should be forbidden entirely, except for volatile functions in the top-level select clause, to support things like select ..., nextval('seq') from ... But it's probably not worth the effort - there will always be creative ways to shoot yourself in the foot. I fully agree with this; that was just an extreme example of a hidden undesired call. By coding rule, my framework does not apply any function with side effects to view fields; however, I had some functions not correctly marked as STABLE (mea culpa) that degraded performance until I realized what was going on. I'm of the opinion that it is not sane to call a function not marked stable/immutable for a discarded column (that I can in some way accept), and above all not for discarded rows. Regards Gaetano Mendola
Re: [HACKERS] who gets paid for this
I didn't mean to imply that getting paid is correlated with getting commit privileges. However, there is literature that supports the idea that those who are under employ to help in OSS projects may behave differently than those who are contributing in their free time (check out http://gsyc.info/~jjamor/research/papers/2006-gsd-herraiz-robles-amor-romera-barahona.pdf). We're trying to get an idea if there are perhaps two different phenomena in our data. We're trying to separate those who have commit privileges into those employed by a company to help out as part of their job and those who do so in their free time at the time of their first commit. I really appreciate any help that you can provide. If it appears that I'm making incorrect assumptions about how the community works, please feel free to correct me or point me to resources. Thanks. -- Chris On 3/8/07, Tom Lane [EMAIL PROTECTED] wrote: Josh Berkus josh@agliodbs.com writes: Christian, More specifically, could those who worked on apache as some aspect of their job prior to getting repo access let me know? Or if there are devs who know this information about others, I'd be really appreciative to get it. Hmmm. Wrong project. And I think you're making the (incorrect) assumption that granting commit rights works the same way in all projects. It does not. Even more to the point, getting paid for has almost nothing to do with has commit privileges. At least on this project. regards, tom lane -- Christian Bird [EMAIL PROTECTED] ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] Calculated view fields (8.1 != 8.2)
Florian G. Pflug wrote: Martijn van Oosterhout wrote: On Fri, Mar 09, 2007 at 10:59:56AM +0100, Gaetano Mendola wrote: Is this really what we want? I did a migration 8.0.x -> 8.2.3 and in the first hour of service I had a lot of queries blocked due to this; consider that in my case I have millions of records in v_ta and usually that join extracts 1 row. Is there a way, until I have checked all of my huge schema, to disable this behaviour? Most people figured it was an improvement. It's configured per function now, which wasn't the case before. I don't think there was ever any discussion about having a global switch. If your function is already marked immutable or stable, then nothing changes for you. If you *did* call volatile functions inside your select, then you now get consistent behaviour. Since you don't want your function to be evaluated in all cases, I assume that it shouldn't be marked volatile in the first place. Well, some functions are volatile and cannot be marked as stable. We develop our applications by layering the modules, and we didn't have any reason for a coding rule forbidding function calls in views in low-level layers. After all, views are also there to build up your schema by layering the information. I can imagine a case where a lower module exports a view to an upper layer, stating the interface as a list of fields: first_name, last_name, with a *hidden* field that is a function call updating the statistics on how many times a given record was selected; this technique cannot be used anymore starting with 8.2.x. The above is not my case, but it is a possible scenario (I admit, not a sane one). Regards Gaetano Mendola
Re: [HACKERS] msvc failure in largeobject regression test
Magnus Hagander wrote: On Tue, Jan 23, 2007 at 11:39:23AM -0800, Jeremy Drake wrote: On Tue, 23 Jan 2007, Magnus Hagander wrote: On Tue, Jan 23, 2007 at 09:31:40AM -0500, Andrew Dunstan wrote: Magnus Hagander wrote: Hi! I get failures for the largeobject regression tests on my vc++ build. I don't think this has ever worked, given that those tests are fairly new. Any quick ideas on what's wrong before I dig deeper? [snip] I wonder if this is a line-end issue? Assuming you are working from CVS, does your client turn \n into \r\n ? I see that other windows boxes are happily passing this test on the buildfarm, and of course the mingw cvs doesn't adjust line endings. Bingo! That's it. I copied the file in binary mode from a linux box and now it passes. I thought about that when I wrote it, and thus tried it under mingw and cygwin without issue ;) I don't think the regression tests were in a position of running on the msvc build at the time... My thought for what to do if this did run into a problem would be an alternate output file that is also acceptable (I don't know what they're called but other tests have them IIRC). Either that, or we require a checkout using Unix style linefeeds. I've confirmed that removing the file and checking it back out with cvs --lf update tenk.data works - tests pass fine. Yet another option might be to flag that file as binary in cvs, in which case I think cvsnt shouldn't go mess with it. I have just run into this today while trying to get buildfarm working for MSVC. After some consideration I think an alternative result file is the best solution. I have looked at switches for cvsnt, but they are likely to be fragile at best. cheers andrew
Re: [HACKERS] Auto creation of Partitions
On Friday 09 March 2007 01:23, NikhilS wrote: Hi, This follows on from the suggestion I made - taken along the lines of the subject auto creation of partitions, where I suggested the syntax of partition check(month of mydatecol) and having a new partition created as data was entered. With this scenario, dropping the partition when it was empty would complement the creation of a new partition as needed. Given that there seems to be no real support for going with auto maintenance where new partitions are added as needed, the auto dropping of empty partitions would also not apply, leaving us with only specific add partition / drop partition commands, and having the parent table pick up rows not matching any partition check criteria. I was thinking along the lines of what Jim had suggested earlier regarding an overflow partition. Instead of dumping unmatched rows to the master table, we could put them into a default DUMP/DUMB partition. I'm quite content dumping unmatched rows into the master table. This makes it very easy to scan partitioned tables for busted partition setups. Having a DUMP table seems only semantically different, so why learn new semantics? Given that Simon wants to do away with having the master table APPENDed in the planning phase, this would be better. ISTM you're trading appending the master table for appending the DUMP partition, which afaict would give you no gain. -- Robert Treat Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL
Re: [HACKERS] Auto creation of Partitions
Hi, On 3/10/07, Hannu Krosing [EMAIL PROTECTED] wrote: On Fri, 2007-03-09 at 15:41, Jim Nasby wrote: On Mar 9, 2007, at 3:31 AM, Zeugswetter Andreas ADI SD wrote: Since partition is inheritance-based, a simple DROP or NO INHERIT will do the job to deal with the partition. Do we want to reinvent additional syntax when these are around and are documented? Well, if the syntax for adding a new partition eventually ends up as ALTER TABLE ADD PARTITION, then it would make more sense that you remove a partition via ALTER TABLE DROP PARTITION. But DROP PARTITION usually moves the data from this partition to other partitions, so it is something different. It does? IIRC in every partitioning system I've seen, DROP PARTITION drops the data as well. It's up to you to move it somewhere else if you want to keep it. Will this proposed DROP PARTITION just disassociate the table from the master, or will it actually drop the partition's table from the whole database? That's why I would prefer the existing mechanism: there a DROP on the child removes it, and a NO INHERIT disassociates it. There might be situations where we would want to just disassociate and not drop. Regards, Nikhils -- Hannu Krosing Database Architect Skype Technologies OÜ Akadeemia tee 21 F, Tallinn, 12618, Estonia Skype me: callto:hkrosing Get Skype for free: http://www.skype.com -- EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Auto creation of Partitions
Hi, Given that Simon wants to do away with having the master table APPENDed in the planning phase, this would be better. ISTM you're trading appending the master table for appending the DUMP partition, which afaict would give you no gain. If there are entries in the master table, I think it would get appended for all queries regardless of whether we need to examine its contents or not. Segregating dump data into a partition will avoid that. I have seen examples in some other databases wherein a partition specifies a range of someval - MAXINT, for instance, to catch such cases. That again means that the onus is on the partition creator most of the time. Regards, Nikhils -- Robert Treat Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL -- EnterpriseDB http://www.enterprisedb.com
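The two designs being compared (fall through to the master table vs. a dedicated overflow/DUMP partition, possibly expressed as an explicit someval-to-MAXINT range) differ only in where unmatched rows land. A hedged sketch, with invented names and a list-of-predicates stand-in for check constraints:

```python
def route_row(value, partition_checks, overflow="parent"):
    """Route a row to the first partition whose check constraint accepts it.
    Rows matching no check land in `overflow` -- the master/parent table in
    one design discussed above, a dedicated DUMP partition in the other.
    All names and the predicate representation are illustrative."""
    for name, check in partition_checks:
        if check(value):
            return name
    return overflow

# Hypothetical range partitioning on a year column, with an explicit
# catch-all range (someval .. MAXINT) as the last partition -- putting
# the onus on the partition creator, as noted above.
MAXINT = 2**31 - 1
checks = [
    ("y2006", lambda y: y < 2007),
    ("y2007", lambda y: y == 2007),
    ("future", lambda y: 2008 <= y <= MAXINT),  # creator-supplied catch-all
]
```

Dropping the "future" entry shows the fall-through behavior: unmatched rows end up in the parent.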
Re: [HACKERS] msvc failure in largeobject regression test
Andrew Dunstan [EMAIL PROTECTED] writes: Magnus Hagander wrote: I wonder if this is a line-end issue? Assuming you are working from CVS, does your client turn \n into \r\n ? I have just run into this today while trying to get buildfarm working for MSVC. After some consideration I think an alternative result file is the best solution. I have looked at switches for cnsnt, but they are likely to be fragile at best. Are you proposing an alternate result file that has a different linefeed style? I would really really rather that we not go there, because it will be impossibly fragile to maintain. Or are you willing to accept that the Windows builds will break every time someone changes that regression test, until someone else with a Windows machine fixes the result file? I would find it preferable to make pg_regress compensate for this issue somehow ... regards, tom lane ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
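One way pg_regress could "compensate" as Tom suggests is to normalize line endings on both sides before comparing expected and actual output, so a single result file serves checkouts with either \n or \r\n. A sketch only - the real pg_regress is C, and this helper is invented for illustration:

```python
def outputs_match(expected: bytes, actual: bytes) -> bool:
    """Compare regression output while ignoring CRLF/LF differences, so an
    MSVC build with a CRLF checkout passes against the canonical expected
    file. Hypothetical helper, not actual pg_regress code."""
    def normalize(data: bytes) -> bytes:
        return data.replace(b"\r\n", b"\n")   # fold Windows line endings to Unix
    return normalize(expected) == normalize(actual)
```

This avoids maintaining a parallel result file that differs only in line endings, which would indeed be fragile.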
Re: [HACKERS] Calculated view fields (8.1 != 8.2)
Gaetano Mendola [EMAIL PROTECTED] writes: I can imagine a case where a lower module exports a view to an upper layer, stating the interface as a list of fields: first_name, last_name, with a *hidden* field that is a function call that updates the statistics on how many times a given record was selected; this technique cannot be used anymore starting with 8.2.x. You're living in a dream world if you think that works reliably in *any* version of Postgres. But for starters, what is your definition of "selected" --- pulled from the physical table? Accumulated into an aggregate? Delivered as a recognizable row to the client? Delivered N times to the client due to joining N times to some other table? regards, tom lane