Re: [HACKERS] Proposal: Commit timestamp
I would be *very* concerned that system time is not a guaranteed monotonic entity. Surely a counter or other internally managed mechanism would be a better solution.

Furthermore, what would be the ramifications of master and slave system times being out of sync? Finally, what if system time is rolled forward a few minutes as part of a correction and there were transactions completed in that time? There is a chance, albeit small, that two transactions will have the same timestamp. More importantly, this will throw all kinds of issues in when the slave sees transactions in the future. Even with regular NTP syncs, drift can cause a clock to be rolled forward a few milliseconds, possibly resulting in duplicate transaction IDs.

In summary, I don't think the use of system time has any place in PostgreSQL's internal consistency mechanisms; it is too unreliable an environment property. Why can't a counter be used for this instead?

- Naz.

Jan Wieck wrote:
> For a future multimaster replication system, I will need a couple of features in the PostgreSQL server itself. I will submit separate proposals per feature so that discussions can be kept focused on one feature per thread.
>
> For conflict resolution purposes in an asynchronous multimaster system, the "last update" definition often comes into play. For this to work, the system must provide a monotonically increasing timestamp taken at the commit of a transaction. During replication, the replication process must be able to provide the remote node's timestamp so that the replicated data will be as of the time it was written on the remote node, and not the current local time of the replica, which by the definition of "asynchronous" is later.
>
> To provide this data, I would like to add another "log" directory, pg_tslog. The files in this directory will be similar to the clog, but contain arrays of timestamptz values. On commit, the current system time will be taken. As long as this time is lower than or equal to the last taken time in this PostgreSQL instance, the value will be increased by one microsecond. The resulting time will be added to the commit WAL record and written into the pg_tslog file.
>
> If a per-database configurable tslog_priority is given, the timestamp will be truncated to milliseconds and the increment logic is done on milliseconds. The priority is added to the timestamp. This guarantees that no two timestamps for commits will ever be exactly identical, even across different servers.
>
> The COMMIT syntax will get extended to
>
>     COMMIT [TRANSACTION] [WITH TIMESTAMP timestamptz];
>
> The extension is limited to superusers and will override the normally generated commit timestamp. This will be used to give the replicating transaction on the replica the exact same timestamp it got on the originating master node.
>
> The pg_tslog segments will be purged like the clog segments, after all transactions belonging to them have been stamped frozen. A frozen xid by definition has a timestamp of epoch. To ensure a system using this timestamp feature has enough time to perform its work, a new GUC variable defining an interval will prevent vacuum from freezing xids that are younger than that.
>
> A function get_commit_timestamp(xid) returning timestamptz will return the commit time of a transaction as recorded by this feature.
>
> Comments, changes, additions?
>
> Jan
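[For readers following the mechanism Jan describes, here is a minimal sketch of the increment rule in C. It assumes a single value protected by a lock that serializes commit timestamp generation; the names are hypothetical, not actual PostgreSQL symbols.]

    #include <stdint.h>
    #include <sys/time.h>

    /* Last commit timestamp handed out, in microseconds since the epoch.
     * In the real proposal this would live in shared memory under a lock. */
    static int64_t last_commit_ts = 0;

    static int64_t
    next_commit_timestamp(void)
    {
        struct timeval tv;
        int64_t now;

        gettimeofday(&tv, NULL);
        now = (int64_t) tv.tv_sec * 1000000 + tv.tv_usec;

        /* If the clock stalled or moved backwards, step one microsecond
         * past the last value, so commit timestamps stay strictly increasing. */
        if (now <= last_commit_ts)
            now = last_commit_ts + 1;

        last_commit_ts = now;
        return now;
    }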
Re: [HACKERS] tsearch in core patch, for inclusion
Andrew Dunstan wrote:
> I am constantly running into this:
> Q. Does PostgreSQL have full text indexing?
> A. Yes, it is in contrib.
> Q. But that isn't part of core.
> A. *sigh*

Where on the website can I see what plugins are included with PostgreSQL? Where on the website can I see the Official PostgreSQL Documentation for Full Text Indexing? With TSearch2 in core, will that fix the many upgrade problems associated with using TSearch2?

"contrib" is a horrible misnomer. Can we maybe bite the bullet and call it something else? After years of PG use, I am still afraid to use contrib modules because it just *feels* like voodoo. I have spent much time reading this mailing list and on IRC with PG users, and I know that contrib modules are on the whole tested and safe, but the lack of web documentation and any indication of what they do (other than checking the notes that come with the source) makes me feel like they are a "use and cross fingers" type of thing.

I don't know how hard it would be to implement, but perhaps contrib modules could be compiled in a similar way to Apache modules, e.g. ./configure --with-modulename, with the onus for packaging them appropriately falling onto the shoulders of the module authors. I feel that even a basic module management system like this would greatly increase awareness of and confidence in the contrib modules.

Oh, and +1 on renaming contrib, +1 on the need for a comprehensive list of them, and +1 on the need for more doc on the website about each of them, onus falling on module authors; perhaps require at least a basic doc patch as a requirement for /contrib inclusion.

- Naz
Re: [HACKERS] Piggybacking vacuum I/O
I'd like to see still more evidence that it's a problem before we start changing that piece of code. It has served us well for years.

Bruce Momjian wrote:
> Is there a TODO here?
>
> Heikki Linnakangas wrote:
>> Pavan Deolasee wrote:
>>> Another simpler solution for VACUUM would be to read the entire CLOG file in local memory. Most of the transaction status queries can be satisfied from this local copy and the normal CLOG is consulted only when the status is unknown (TRANSACTION_STATUS_IN_PROGRESS).
>> The clog is only for finished (committed/aborted/crashed) transactions. If a transaction is in progress, the clog is never consulted. Anyway, that'd only be reasonable for vacuums, and I'm actually more worried if we had normal backends thrashing the clog buffers.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
Re: [HACKERS] autovacuum process handling
Hi,

Alvaro Herrera wrote:
> Yeah. For what I need, the launcher just needs to know when a worker has finished and how many workers there are.

Oh, so it's not all that less communication. My replication manager also needs to know when a worker dies. You said you are using a signal from manager to postmaster to request a worker to be forked. How do you do the other part, where the postmaster needs to tell the launcher which worker terminated?

>> For Postgres-R, I'm currently questioning if I shouldn't merge the replication manager process with the postmaster. Of course, that would violate the "postmaster does not touch shared memory" constraint.
> I suggest you don't. Reliability from Postmaster is very important.

Yes, so? As long as I can't restart the replication manager, but operation of the whole DBMS relies on it, I have to take the postmaster down as soon as it detects a crashed replication manager. So I still argue that reliability is getting better than the status quo if I'm merging these two processes (because of less code for communication between the two).

Of course, the other way to gain reliability would be to make the replication manager restartable. But restarting the replication manager means recovering data from other nodes in the cluster, thus a lot of network traffic. Needless to say, this is quite an expensive operation. That's why I'm questioning if that's the behavior we want. Isn't it better to force the administrators to look into the issue and probably replace a broken node, instead of having one node going amok by requesting recovery over and over again, possibly forcing crashes of other nodes, too, because of the additional load for recovery?

>> But it would make some things a lot easier:
>> * What if the launcher/manager dies (but you potentially still have active workers)? Maybe, for autovacuum, you can simply restart the launcher and that one detects workers from shmem. With replication, I certainly have to take down the postmaster as well, as we are certainly out of sync and can't simply restart the replication manager. So in that case, no postmaster can run without a replication manager and vice versa. Why not make it one single process, then?
> Well, the point of the postmaster is that it can notice when one process dies and take appropriate action. When a backend dies, the postmaster closes all others. But if the postmaster crashes due to a bug in the manager (due to both being integrated in a single process), how do you close the backends? There's no one to do it.

That's a point. But again, as long as the replication manager won't be able to restart, you gain nothing by closing backends on a crashed node.

> In my case, the launcher is not critical. It can die and the postmaster should just start a new one without much noise. A worker is critical because it's connected to tables; it's as critical as a regular backend. So if a worker dies, the postmaster must take everyone down and cause a restart. This is pretty easy to do.

Yeah, that's the main difference, and I see why your approach makes perfect sense for the autovacuum case. In contrast, the replication manager is critical (to one node), and a restart is expensive (for the whole cluster).

>> * Startup races: depending on how you start workers, the launcher/manager may get a "database is starting up" error when requesting the postmaster to fork backends. That probably also applies to autovacuum, as those workers shouldn't work concurrently to a startup process. But maybe there are other means of ensuring that no autovacuum gets triggered during startup?
> Oh, this is very easy as well. In my case the launcher just sets a database OID to be processed in shared memory, and then calls SendPostmasterSignal with a particular value. The postmaster must only check this signal within ServerLoop, which means it won't act on it (i.e., won't start a worker) until the startup process has finished.

It seems like your launcher is perfectly fine with requesting workers and not getting them. The replication manager currently isn't. Maybe I should make it more fault tolerant in that regard...

> I guess your problem is that the manager's task is quite a lot more involved than my launcher's. But in that case, it's even more important to have them separate.

More involved with what? It does not touch shared memory; it mainly keeps track of the backends' states (by getting a notice from the postmaster) and does all the necessary forwarding of messages between the communication system and the backends. Its main loop is similar to the postmaster's, mainly consisting of a select().

> I don't understand why the manager talks to postmaster. If it doesn't, well, then there's no concurrency issue gone, because the remote backends will be talking to *somebody* anyway; be it postmaster, or manager.

As with your launcher, I only send one message: the worker request.
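[To make the worker-request handshake Alvaro describes concrete, here is a hedged sketch of the launcher side. SendPostmasterSignal() is the real postmaster-signaling primitive from storage/pmsignal.h; the shared structure and the signal reason name are assumptions for illustration, not the actual autovacuum code.]

    #include "postgres.h"
    #include "storage/pmsignal.h"

    /* Hypothetical shared-memory area the launcher and postmaster agree on. */
    typedef struct
    {
        Oid         av_startWorkerDB;   /* database the launcher wants processed */
    } AutoVacuumShmemStruct;

    extern AutoVacuumShmemStruct *AutoVacuumShmem;

    /*
     * Publish the target database OID in shared memory, then ping the
     * postmaster. The postmaster only checks the flag from ServerLoop,
     * so no worker is forked until the startup process has finished.
     * (The PMSIGNAL_* reason name here is an assumption for this sketch.)
     */
    static void
    request_worker(Oid dboid)
    {
        AutoVacuumShmem->av_startWorkerDB = dboid;
        SendPostmasterSignal(PMSIGNAL_START_AUTOVAC_WORKER);
    }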
Re: [HACKERS] Proposal: Commit timestamp
Hi,

Jan Wieck wrote:
> The replication system I have in mind will have another field type of the "balance" nature, where it will never communicate the current value but only deltas that get applied regardless of the two timestamps.

I'd favor a more generally usable conflict resolution function interface, on top of which you can implement both the "last update wins" and the "balance" conflict resolution types. Passing the last common ancestor and the two conflicting heads to the conflict resolution function (CRF) should be enough. That would easily allow implementing the balance type (as you can calculate both deltas). And if you want to rely on something as arbitrary as a timestamp, you'd simply have to add a timestamp column to your table and let the CRF decide upon that.

This would allow pretty much any type of conflict resolution, for example: higher-priority cleanup transactions, which change lots of tuples and should better not be aborted later on. Those could be implemented by adding a priority column and having the CRF respect that one, too.

To find the last common ancestor tuple, transaction ids and MVCC are enough. You wouldn't need to add timestamps. You'd only have to make sure VACUUM doesn't delete tuples you still need.

Regards

Markus
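[One possible shape for the CRF interface Markus sketches, written as a C typedef. This is entirely hypothetical, not an existing PostgreSQL hook; it only illustrates that seeing the ancestor plus both heads is enough to compute deltas for a "balance" column, or to compare a user-supplied timestamp or priority column.]

    #include "postgres.h"
    #include "access/htup.h"
    #include "utils/rel.h"

    /*
     * Hypothetical conflict resolution hook: given the last common ancestor
     * version of a row and the two conflicting heads, return the tuple that
     * should win (or a merged tuple, e.g. ancestor plus both deltas for a
     * "balance" column).
     */
    typedef HeapTuple (*ConflictResolutionFunc) (Relation rel,
                                                 HeapTuple ancestor,
                                                 HeapTuple local_head,
                                                 HeapTuple remote_head);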
Re: [HACKERS] Proposal: Commit timestamp
Jan Wieck wrote:
> But it is a datum that needs to be collected at the moment where basically the clog entry is made ... I don't think any external module can do that ever.

How atomic does it need to be? External modules can register callbacks that get called right after the clog update and removing the xid from the MyProc entry. That's about as close to making the clog entry as you can get.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
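[For reference, the hook Heikki is alluding to looks roughly like this. RegisterXactCallback() is real PostgreSQL API (src/include/access/xact.h); the module body is just a minimal sketch.]

    #include "postgres.h"
    #include "access/xact.h"

    /* Called during transaction end, shortly after the clog is updated. */
    static void
    commit_timestamp_hook(XactEvent event, void *arg)
    {
        if (event == XACT_EVENT_COMMIT)
        {
            /* a module could record its own notion of commit time here */
        }
    }

    /* Loadable-module entry point: register the callback once. */
    void
    _PG_init(void)
    {
        RegisterXactCallback(commit_timestamp_hook, NULL);
    }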
Re: [HACKERS] Recursive Queries
The CONNECT BY patch from Evgen Potemkin has been ported to PG 8.2... and it's now under the BSD license... I will test it on our test environment.

On Thursday, 25 January 2007, at 11:08 +0000, Gregory Stark wrote:
> Hm, having skimmed through Evgen Potemkin's recursive queries patch, I find it quite different from what I was expecting. My own thinking was headed off in a different direction.
>
> Basically the existing patch reimplements a new kind of join which implements a kind of nested loop join (with newer versions adding a kind of hash join) which feeds a new kind of tuplestore called a tupleconn. I was thinking to have a new node above a standard join. The new node would repeatedly feed back down to the join the results of the previous iteration and reexecute the join to get the next generation.
>
> I think my approach is more in line with the DB2/ANSI WITH style query, which is expected to do a breadth-first search. The Oracle CONNECT BY syntax is expected to do a depth-first search.
>
> I have two major issues with the repeated-join model though.
>
> a) Ideally we would want to switch between nested loop, merge join, and hash join depending on the size of the previous generation. That means the join node wouldn't be the same type of join for all the iterations. This is important since in most applications you're traversing either up or down a tree and are likely starting with very few nodes but often ending up with very broad levels with many nodes. No single type of join will be appropriate for the whole plan execution.
>
> b) I do want to be able to support depth-first searching too. I'm not sure how to reconcile that with the repeated-join conceptual model. We could always re-sort the entire result set after generating it, but that seems like an unsatisfactory solution.
Re: [HACKERS] Proposal: Change of pg_trigger.tg_enabled and adding
Hi,

Nice proposal. I'd support that enhancement and could make use of such triggers in Postgres-R as well, at least to provide these triggers to the user.

Jan Wieck wrote:
> Good question. I don't know. I'd rather err on the safe side and make it multiple states; for now I only have Normal and Replica mode.

Are these triggers intended to help implement async replication, or are these for users to be able to take action on remote replay of a transaction (i.e. on the replica)? Does that give a further distinction? In Postgres-R, I mostly use the terms 'local' and 'remote'.

Also, "normal" mode can easily be confused with non-replicated mode, thus I'd not mix that with replicated, local transaction mode (even if it's mostly equal, as in this case). My naming proposal would thus be:

    A   fires always (i.e. fires N times, where N = nr of nodes)
    L   fires on the transaction-local node (i.e. only exactly once)
    R   fires on the remote nodes only (i.e. (N - 1) times)
    0   fires never

'1' for "fires on both nodes" seems confusing as well, because it's not like in single-node DB operation, in that one event can fire the trigger multiple times (on different nodes). The current, single-node PostgreSQL should thus use '0' or 'L'.

Regards

Markus
Re: [HACKERS] crash on 8.2 and cvshead - failed to add item to the
Tom Lane wrote:
> Heikki Linnakangas [EMAIL PROTECTED] writes:
>> I'm afraid the bug has been there for ages, but the 90%-fillfactor on rightmost page patch made it much more likely to get triggered.
> But that patch has been there for ages too; the only new thing in 8.2 is that the fillfactor is configurable, but its default is the same. So I'm still wondering why the bug isn't seen in 8.1. (Joe, did you try anything older than 8.1?)

The hardcoded fillfactor was 90% when building an index, and that's still the default. However, when inserting to an existing index, the fillfactor on the rightmost page was 2/3. It was changed to use the user-configurable fillfactor, which now defaults to 90%.

Hmm. Now that I think of it, we might have the same bug in nbtsort.c. I'll have a look...

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
Re: [HACKERS] Proposal: Snapshot cloning
On Thu, 2007-01-25 at 22:19 -0500, Jan Wieck wrote:
> The idea is to clone an existing serializable transaction's snapshot visibility information from one backend to another. The semantics would be like this:
>
>     backend1: start transaction;
>     backend1: set transaction isolation level serializable;
>     backend1: select pg_backend_pid();
>     backend1: select publish_snapshot(); -- will block

Great idea. It can also be used by pg_dump to publish its snapshot, so that we can make VACUUM continue to process effectively while pg_dump is running.

Two questions:
- why does it have to block? I don't see any reason - the first process can begin doing useful work. The second process might fail or itself be blocked by something.
- why just serializable snapshots?

--
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
Re: [HACKERS] Proposal: Snapshot cloning
Jan Wieck [EMAIL PROTECTED] writes:
> backend1: select publish_snapshot(); -- will block
>
> backend2: start transaction;
> backend2: set transaction isolation level serializable;
> backend2: select clone_snapshot(pid); -- will unblock backend1

It seems simpler to have a current_snapshot() function that returns a bytea, or a new snapshot data type, which set_current_snapshot(bytea) took to change your snapshot. Then you could use tables or out-of-band communication to pass around your snapshots however you please.

set_current_snapshot() would have to sanity check that the xmin of the new snapshot isn't older than the current global oldest xmin.

That could be handy for debugging purposes too.

--
Gregory Stark
EnterpriseDB   http://www.enterprisedb.com
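[A hedged sketch of the sanity check Gregory describes. GetOldestXmin() and TransactionIdPrecedes() are real server functions (signatures as of roughly this era); deserialize_snapshot() and the function shape itself are assumptions for illustration only.]

    #include "postgres.h"
    #include "access/transam.h"
    #include "storage/procarray.h"
    #include "utils/tqual.h"

    /* Assumed helper: unpack a Snapshot from its bytea wire form. */
    extern Snapshot deserialize_snapshot(bytea *snapdata);

    static void
    set_current_snapshot_internal(bytea *snapdata)
    {
        Snapshot      snap = deserialize_snapshot(snapdata);
        TransactionId oldest = GetOldestXmin(true, false);

        /* Refuse snapshots whose tuples may already have been vacuumed away. */
        if (TransactionIdPrecedes(snap->xmin, oldest))
            elog(ERROR, "snapshot xmin %u precedes global oldest xmin %u",
                 snap->xmin, oldest);

        /* ... install snap as the transaction's serializable snapshot ... */
    }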
Re: [HACKERS] Proposal: Commit timestamp
Jan, et al.,

On Jan 26, 2007, at 2:37 AM, Naz Gassiep wrote:
> I would be *very* concerned that system time is not a guaranteed monotonic entity. Surely a counter or other internally managed mechanism would be a better solution.

As you should be concerned. Looking on my desk through the last few issues of IEEE Transactions on Parallel and Distributed Systems, I see no time-synch stuff for clusters of machines that is actually based on time. Almost all rely on something like a Lamport timestamp or some relaxation thereof. A few are based off a tree-based pulse. Using actual times is fraught with problems and is typically inappropriate for cluster synchronization needs.

> Furthermore, what would be the ramifications of master and slave system times being out of sync?

I'm much more concerned with the overall approach. The algorithm for replication should be published in theoretic style with a thorough analysis of its assumptions and a proof of correctness based on those assumptions. Databases and replication therein are definitely technologies that aren't off-the-cuff, and they require rigorous academic discussion and acceptance before they will get adopted. People generally will not adopt technologies to store mission-critical data until they are confident that they will both work as designed and work as implemented; the second is far less important, as the weaknesses there are simply bugs.

I'm not implying that this rigorous dissection of replication design hasn't happened, but I didn't see it referenced anywhere in this thread. Can you point me to it? I've reviewed many of these papers and would like to better understand what you are aiming at.

Best regards,

Theo Schlossnagle
Re: [HACKERS] Proposal: Commit timestamp
On 1/26/2007 2:37 AM, Naz Gassiep wrote:
> I would be *very* concerned that system time is not a guaranteed monotonic entity. Surely a counter or other internally managed mechanism would be a better solution.

Such a counter has only local relevance. How do you plan to compare the two separate counters on different machines to tell which transaction happened last? Even if the system clock isn't monotonically increasing, the described increment system guarantees the timestamp used to appear so. Granted, this system will not work too well on a platform that doesn't allow slewing the system clock.

> Furthermore, what would be the ramifications of master and slave system times being out of sync?

The origin of a transaction must scan all tuples it updates and make sure that the timestamp it uses for commit appears in the future with respect to them.

> Finally, what if system time is rolled forward a few minutes as part of a correction and there were transactions completed in that time? There is a chance, albeit small, that two transactions will have the same timestamp. More importantly, this will throw all kinds of issues in when the slave sees transactions in the future. Even with regular NTP syncs, drift can cause a clock to be rolled forward a few milliseconds, possibly resulting in duplicate transaction IDs. In summary, I don't think the use of system time has any place in PostgreSQL's internal consistency mechanisms, it is too unreliable an environment property. Why can't a counter be used for this instead?

This is nothing used for PostgreSQL's consistency. It is a vehicle intended to be used to synchronize the "last update wins" decision process of an asynchronous multimaster system. If not with a timestamp, how would you make sure that the replication processes of two different nodes will come to the same conclusion as to which update was last? Especially considering that the replication might take place hours after the original transaction happened.

Jan

--
# It's easier to get forgiveness for being wrong than for being right.
# Let's break this rule - forgive me.
# [EMAIL PROTECTED]
Re: [HACKERS] Proposal: Commit timestamp
On Thu, 2007-01-25 at 18:16 -0500, Jan Wieck wrote:
> To provide this data, I would like to add another "log" directory, pg_tslog. The files in this directory will be similar to the clog, but contain arrays of timestamptz values. On commit, the current system time will be taken. As long as this time is lower than or equal to the last taken time in this PostgreSQL instance, the value will be increased by one microsecond. The resulting time will be added to the commit WAL record and written into the pg_tslog file.

A transaction time table/log has other uses as well, so it's fairly interesting to have this.

> COMMIT [TRANSACTION] [WITH TIMESTAMP timestamptz];
>
> The extension is limited to superusers and will override the normally generated commit timestamp.

I don't think it's acceptable to override the normal timestamp. That could lead to non-monotonic time values, which could screw up PITR. My view is that you still need PITR even when you are using replication, because the former provides recoverability and the latter provides availability.

--
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
Re: [HACKERS] Proposal: Snapshot cloning
On 1/26/2007 8:06 AM, Gregory Stark wrote:
> Jan Wieck [EMAIL PROTECTED] writes:
>> backend1: select publish_snapshot(); -- will block
>>
>> backend2: start transaction;
>> backend2: set transaction isolation level serializable;
>> backend2: select clone_snapshot(pid); -- will unblock backend1
>
> It seems simpler to have a current_snapshot() function that returns a bytea, or a new snapshot data type, which set_current_snapshot(bytea) took to change your snapshot. Then you could use tables or out-of-band communication to pass around your snapshots however you please.
>
> set_current_snapshot() would have to sanity check that the xmin of the new snapshot isn't older than the current global oldest xmin.

That would solve the backend-to-backend IPC problem nicely.

Jan

--
# It's easier to get forgiveness for being wrong than for being right.
# Let's break this rule - forgive me.
# [EMAIL PROTECTED]
Re: [HACKERS] Proposal: Commit timestamp
On 1/26/2007 8:26 AM, Simon Riggs wrote:
> On Thu, 2007-01-25 at 18:16 -0500, Jan Wieck wrote:
>> To provide this data, I would like to add another "log" directory, pg_tslog. The files in this directory will be similar to the clog, but contain arrays of timestamptz values. On commit, the current system time will be taken. As long as this time is lower than or equal to the last taken time in this PostgreSQL instance, the value will be increased by one microsecond. The resulting time will be added to the commit WAL record and written into the pg_tslog file.
>
> A transaction time table/log has other uses as well, so it's fairly interesting to have this.
>
>> COMMIT [TRANSACTION] [WITH TIMESTAMP timestamptz];
>>
>> The extension is limited to superusers and will override the normally generated commit timestamp.
>
> I don't think it's acceptable to override the normal timestamp. That could lead to non-monotonic time values, which could screw up PITR. My view is that you still need PITR even when you are using replication, because the former provides recoverability and the latter provides availability.

Without that it is rendered useless for conflict resolution purposes. The timestamp used does not necessarily have much to do with the real time at commit, although I'd like it to be as close as possible. This timestamp marks the age of the new datum in an update. Since the replication is asynchronous, the update on the remote systems will happen later, but the timestamp recorded with that datum must be the timestamp of the original transaction, not the current time when it is replicated remotely. All we have to determine that is the xmin in the row's tuple header, so that xmin must resolve to the original transaction's timestamp.

Jan

--
# It's easier to get forgiveness for being wrong than for being right.
# Let's break this rule - forgive me.
# [EMAIL PROTECTED]
[HACKERS] HAVING push-down
I've just read a paper that says PostgreSQL doesn't do this. My reading of the code is that we *do* evaluate the HAVING clause prior to calculating the aggregates for it. I thought I'd check to resolve the confusion.

- - -

If not, it seems fairly straightforward to push down some or all of a HAVING clause so that the qual clause is tested prior to aggregation, not after aggregation. This could, for certain queries, significantly reduce the amount of effort that the final Agg node performs.

We might think about deeper push-down within the query, but since the Agg node already has the havingQual, it seems a straightforward act to decide whether to apply it before or after the aggregation. We already do find_unaggregated_cols(), so little additional analysis seems required.

--
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
Re: [HACKERS] Proposal: Snapshot cloning
On Fri, 2007-01-26 at 12:25, Simon Riggs wrote:
> On Thu, 2007-01-25 at 22:19 -0500, Jan Wieck wrote:
>> The idea is to clone an existing serializable transaction's snapshot visibility information from one backend to another. The semantics would be like this:
>>
>>     backend1: start transaction;
>>     backend1: set transaction isolation level serializable;
>>     backend1: select pg_backend_pid();
>>     backend1: select publish_snapshot(); -- will block
>
> Great idea. It can also be used by pg_dump to publish its snapshot so that we can make VACUUM continue to process effectively while pg_dump is running.

Do you mean that vacuum would clean up tuples still visible to pg_dump?

> Two questions:
> - why does it have to block? I don't see any reason - the first process can begin doing useful work. The second process might fail or itself be blocked by something.

As I see it, it has to block so that its transaction would not end, so that the system knows that it can't yet remove tuples in that snapshot. And it should block until all its consumers have ended their use of the published snapshot.

> - why just serializable snapshots?

There's probably no point in acquiring it into a read-committed transaction when the next command will revert to its own snapshot anyway.

--
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me: callto:hkrosing
Get Skype for free: http://www.skype.com
Re: [HACKERS] Piggybacking vacuum I/O
Heikki Linnakangas wrote:
> I'd like to see still more evidence that it's a problem before we start changing that piece of code. It has served us well for years.

So the TODO could be "investigate whether caching pg_clog and/or pg_subtrans in local memory can be useful for vacuum performance".

> Bruce Momjian wrote:
>> Is there a TODO here?
>>
>> Heikki Linnakangas wrote:
>>> Pavan Deolasee wrote:
>>>> Another simpler solution for VACUUM would be to read the entire CLOG file in local memory. Most of the transaction status queries can be satisfied from this local copy and the normal CLOG is consulted only when the status is unknown (TRANSACTION_STATUS_IN_PROGRESS).
>>> The clog is only for finished (committed/aborted/crashed) transactions. If a transaction is in progress, the clog is never consulted. Anyway, that'd only be reasonable for vacuums, and I'm actually more worried if we had normal backends thrashing the clog buffers.

--
Alvaro Herrera                           http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
Re: [HACKERS] [pgsql-patches] pg_dump pretty_print
-----BEGIN PGP SIGNED MESSAGE-----
Hash: RIPEMD160

Peter Eisentraut replied:
> The harm here is that under undefined circumstances a dump file will not be a proper and robust representation of the original database, which would add significant confusion and potential for error.

What undefined circumstances are we talking here? If there is a chance that pg_get_viewdef and company do not output a version that can be read again by the database because we simply changed the whitespace, that sounds like a serious bug to be fixed, not a reason to reject this optional flag.

- --
Greg Sabino Mullane [EMAIL PROTECTED]
PGP Key: 0x14964AC8 200701251003
http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8

-----BEGIN PGP SIGNATURE-----

iD8DBQFFuXd2vJuQZxSWSsgRA9VDAJ9S1b+4DJomO3Bmij4wvida9wtgfgCeID16
qeoNrrehtTGIeJeL8T+mx9M=
=VecV
-----END PGP SIGNATURE-----
Re: [HACKERS] Proposal: Snapshot cloning
On Fri, 2007-01-26 at 16:09 +0200, Hannu Krosing wrote:
> On Fri, 2007-01-26 at 12:25, Simon Riggs wrote:
>> Two questions:
>> - why does it have to block? I don't see any reason - the first process can begin doing useful work. The second process might fail or itself be blocked by something.
>
> As I see it, it has to block so that its transaction would not end, so that the system knows that it can't yet remove tuples in that snapshot. And it should block until all its consumers have ended their use of the published snapshot.

Agreed that the Snapshot must be visible to all, but that's no reason why the original call has to block, just that we must do something to prevent the Snapshot from disappearing from view.

>> - why just serializable snapshots?
>
> There's probably no point in acquiring it into a read-committed transaction when the next command will revert to its own snapshot anyway.

But the stated use case was to share snapshots, which seems valid whatever the type of Snapshot. One of the stated cases was parallel query...

--
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
Re: [HACKERS] autovacuum process handling
Markus Schiltknecht wrote:
> Hi,
>
> Alvaro Herrera wrote:
>> Yeah. For what I need, the launcher just needs to know when a worker has finished and how many workers there are.
>
> Oh, so it's not all that less communication. My replication manager also needs to know when a worker dies. You said you are using a signal from manager to postmaster to request a worker to be forked. How do you do the other part, where the postmaster needs to tell the launcher which worker terminated?

I haven't done that yet, since the current incarnation does not need it. But I have considered using some signal like SIGUSR1 to mean "something changed in your processes, look into your shared memory". The autovacuum shared memory area would contain PIDs (or maybe PGPROC pointers?) of workers; so when the launcher goes to check, it notices that one worker is no longer there, meaning that it must have terminated its job.

>>> For Postgres-R, I'm currently questioning if I shouldn't merge the replication manager process with the postmaster. Of course, that would violate the "postmaster does not touch shared memory" constraint.
>> I suggest you don't. Reliability from Postmaster is very important.
>
> Yes, so? As long as I can't restart the replication manager, but operation of the whole DBMS relies on it, I have to take the postmaster down as soon as it detects a crashed replication manager.

Sure. But you also need to take down all regular backends, and bgwriter as well. If the postmaster just dies, this won't work cleanly.

> That's why I'm questioning if that's the behavior we want. Isn't it better to force the administrators to look into the issue and probably replace a broken node, instead of having one node going amok by requesting recovery over and over again, possibly forcing crashes of other nodes, too, because of the additional load for recovery?

Maybe what you want, then, is that when the replication manager dies, the postmaster should close all processes and then shut itself down. This also can be arranged easily. But just crashing the postmaster because the manager sees something wrong is certainly not a good idea.

>>> Well, the point of the postmaster is that it can notice when one process dies and take appropriate action. When a backend dies, the postmaster closes all others. But if the postmaster crashes due to a bug in the manager (due to both being integrated in a single process), how do you close the backends? There's no one to do it.
>
> That's a point. But again, as long as the replication manager won't be able to restart, you gain nothing by closing backends on a crashed node.

Sure you do -- they won't corrupt anything :-) Plus, what use are running backends in a multimaster environment if they can't communicate with the outside? Much better would be, AFAICS, to shut everyone down so that the users can connect to a working node.

>>> I guess your problem is that the manager's task is quite a lot more involved than my launcher's. But in that case, it's even more important to have them separate.
>
> More involved with what? It does not touch shared memory; it mainly keeps track of the backends' states (by getting a notice from the postmaster) and does all the necessary forwarding of messages between the communication system and the backends. Its main loop is similar to the postmaster's, mainly consisting of a select().

I meant more complicated. And if it has to listen on a socket and forward messages to remote backends, it certainly is a lot more complicated than the current autovac launcher.

>>> I don't understand why the manager talks to postmaster. If it doesn't, well, then there's no concurrency issue gone, because the remote backends will be talking to *somebody* anyway; be it postmaster, or manager.
>
> As with your launcher, I only send one message: the worker request. But the other way around, from the postmaster to the replication manager, there are also some messages: a "database is ready" message and "worker terminated" messages. Thinking about handling the restarting cycle, I would need to add a "database is restarting" message, which has to be followed by another "database is ready" message. For sure, the replication manager needs to keep running during a restarting cycle. And it needs to know the database's state, so as to be able to decide if it can request workers or not.

I think this would be pretty easy to do if you made the remote backends keep state in shared memory. The manager just needs to get a signal to know that it should check the shared memory. This can be arranged easily: just have the remote backends signal the postmaster, and have the postmaster signal the manager. Alternatively, have the manager PID stored in shared memory and have the remote backends signal (SIGUSR1 or some such) the manager. (bgwriter does this: it announces its PID in shared memory, and the backends signal it when they want a CHECKPOINT.)
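[A small sketch of the death-notification scheme Alvaro outlines above. The struct and names are illustrative assumptions; the point is simply that a PID slot array in shared memory plus a SIGUSR1-triggered rescan is enough for the launcher to notice which worker disappeared.]

    #include <signal.h>
    #include <sys/types.h>

    #define MAX_WORKERS 8

    /* Hypothetical shared-memory area: one PID slot per worker, 0 = free. */
    typedef struct
    {
        pid_t   worker_pids[MAX_WORKERS];
    } WorkerShmemStruct;

    static volatile sig_atomic_t worker_change_pending = 0;

    /* SIGUSR1 handler: "something changed in your processes". */
    static void
    sigusr1_handler(int sig)
    {
        worker_change_pending = 1;
    }

    /* The launcher's main loop then compares worker_pids[] against its
     * previous scan to find workers that have terminated since then. */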
Re: [HACKERS] Proposal: Snapshot cloning
On Fri, 2007-01-26 at 16:09 +0200, Hannu Krosing wrote:
> On Fri, 2007-01-26 at 12:25, Simon Riggs wrote:
>> Great idea. It can also be used by pg_dump to publish its snapshot so that we can make VACUUM continue to process effectively while pg_dump is running.
>
> Do you mean that vacuum would clean up tuples still visible to pg_dump?

No, that would break MVCC. But we may have done lots of updates/deletes that are *not* visible to any Snapshot, yet are not yet removable because they are higher than OldestXmin, and we don't know that because previously the Snapshot details were not available.

ISTM that this proposal is a way of making the Snapshot limits publicly available so that they can be used by VACUUM. Sure, it isn't every backend, but the details may be useful. So this is an additional benefit to this proposal.

(There's a hole in the above idea, so don't jump on my back to explain it - I see it and am trying to work out a way around it...)

--
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
Re: [HACKERS] Proposal: Commit timestamp
* Jan Wieck ([EMAIL PROTECTED]) wrote:
> On 1/26/2007 2:37 AM, Naz Gassiep wrote:
>> I would be *very* concerned that system time is not a guaranteed monotonic entity. Surely a counter or other internally managed mechanism would be a better solution.
>
> Such a counter has only local relevance. How do you plan to compare the two separate counters on different machines to tell which transaction happened last?

I'd also suggest you look into Lamport timestamps... Trusting the system clock just isn't practical, even with NTP. I've developed (albeit relatively small) systems using Lamport timestamps and would be happy to talk about it offlist. I've probably got some code I could share as well.

Thanks,

Stephen
Re: [HACKERS] crash on 8.2 and cvshead - failed to add item to the
Heikki Linnakangas [EMAIL PROTECTED] writes:
> Tom Lane wrote:
>> I'm still wondering why the bug isn't seen in 8.1.
>
> The hardcoded fillfactor was 90% when building an index, and that's still the default. However, when inserting to an existing index, the fillfactor on the rightmost page was 2/3. It was changed to use the user-configurable fillfactor, which now defaults to 90%.

Ah. I thought I remembered that those had been two separate changes, but you're right, 8.1 and before always split 1:1 or 2:1. So it'd take a really nasty corner case to expose the bug there.

regards, tom lane
Re: [HACKERS] Proposal: Commit timestamp
Stephen Frost wrote:
> I'd also suggest you look into Lamport timestamps... Trusting the system clock just isn't practical, even with NTP. I've developed (albeit relatively small) systems using Lamport timestamps and would be happy to talk about it offlist. I've probably got some code I could share as well.

That looks like what Oracle RAC uses: http://www.lc.leidenuniv.nl/awcourse/oracle/rac.920/a96597/coord.htm

cheers

andrew
Re: [HACKERS] BUG #2917: spi_prepare doesn't accept typename aliases
I wrote:
> Tom Lane wrote:
>> I think parseTypeString() may be the thing to use. It's what plpgsql uses...
> OK, I'll see what I can do.

see attached patch. If this is OK I will apply it and also fix pltcl and plpython similarly, mutatis mutandis.

cheers

andrew

Index: src/pl/plperl/plperl.c
===================================================================
RCS file: /cvsroot/pgsql/src/pl/plperl/plperl.c,v
retrieving revision 1.123
diff -c -r1.123 plperl.c
*** src/pl/plperl/plperl.c	21 Nov 2006 16:59:02 -0000	1.123
--- src/pl/plperl/plperl.c	26 Jan 2007 15:13:05 -0000
***************
*** 2128,2145 ****
  	PG_TRY();
  	{
  		/************************************************************
! 		 * Lookup the argument types by name in the system cache
! 		 * and remember the required information for input conversion
  		 ************************************************************/
  		for (i = 0; i < argc; i++)
  		{
! 			List	   *names;
  			HeapTuple	typeTup;
  
! 			/* Parse possibly-qualified type name and look it up in pg_type */
! 			names = stringToQualifiedNameList(SvPV(argv[i], PL_na),
! 											  "plperl_spi_prepare");
! 			typeTup = typenameType(NULL, makeTypeNameFromNameList(names));
  			qdesc->argtypes[i] = HeapTupleGetOid(typeTup);
  			perm_fmgr_info(((Form_pg_type) GETSTRUCT(typeTup))->typinput,
  						   &(qdesc->arginfuncs[i]));
--- 2128,2152 ----
  	PG_TRY();
  	{
  		/************************************************************
! 		 * Resolve argument type names and then look them up by oid
! 		 * in the system cache, and remember the required information
! 		 * for input conversion.
  		 ************************************************************/
  		for (i = 0; i < argc; i++)
  		{
! 			Oid			typeId;
! 			int32		typmod;
  			HeapTuple	typeTup;
  
! 			parseTypeString(SvPV(argv[i], PL_na), &typeId, &typmod);
! 
! 			typeTup = SearchSysCache(TYPEOID,
! 									 ObjectIdGetDatum(typeId),
! 									 0, 0, 0);
! 			if (!HeapTupleIsValid(typeTup))
! 				elog(ERROR, "cache lookup failed for type %u", typeId);
! 
  			qdesc->argtypes[i] = HeapTupleGetOid(typeTup);
  			perm_fmgr_info(((Form_pg_type) GETSTRUCT(typeTup))->typinput,
  						   &(qdesc->arginfuncs[i]));

Index: src/pl/plperl/expected/plperl.out
===================================================================
RCS file: /cvsroot/pgsql/src/pl/plperl/expected/plperl.out,v
retrieving revision 1.9
diff -c -r1.9 plperl.out
*** src/pl/plperl/expected/plperl.out	13 Aug 2006 17:31:10 -0000	1.9
--- src/pl/plperl/expected/plperl.out	26 Jan 2007 15:13:05 -0000
***************
*** 438,444 ****
  -- Test spi_prepare/spi_exec_prepared/spi_freeplan
  --
  CREATE OR REPLACE FUNCTION perl_spi_prepared(INTEGER) RETURNS INTEGER AS $$
!    my $x = spi_prepare('select $1 AS a', 'INT4');
     my $q = spi_exec_prepared( $x, $_[0] + 1);
     spi_freeplan($x);
     return $q->{rows}->[0]->{a};
--- 438,444 ----
  -- Test spi_prepare/spi_exec_prepared/spi_freeplan
  --
  CREATE OR REPLACE FUNCTION perl_spi_prepared(INTEGER) RETURNS INTEGER AS $$
!    my $x = spi_prepare('select $1 AS a', 'INTEGER');
     my $q = spi_exec_prepared( $x, $_[0] + 1);
     spi_freeplan($x);
     return $q->{rows}->[0]->{a};
***************
*** 468,470 ****
--- 468,504 ----
    4
  (2 rows)
  
+ --
+ -- Test prepare with a type with spaces
+ --
+ CREATE OR REPLACE FUNCTION perl_spi_prepared_double(double precision) RETURNS double precision AS $$
+   my $x = spi_prepare('SELECT 10.0 * $1 AS a', 'DOUBLE PRECISION');
+   my $q = spi_query_prepared($x,$_[0]);
+   my $result;
+   while (defined (my $y = spi_fetchrow($q))) {
+       $result = $y->{a};
+   }
+   spi_freeplan($x);
+   return $result;
+ $$ LANGUAGE plperl;
+ SELECT perl_spi_prepared_double(4.35) as "double precision";
+  double precision
+ ------------------
+              43.5
+ (1 row)
+ 
+ --
+ -- Test with a bad type
+ --
+ CREATE OR REPLACE FUNCTION perl_spi_prepared_bad(double precision) RETURNS double precision AS $$
+   my $x = spi_prepare('SELECT 10.0 * $1 AS a', 'does_not_exist');
+   my $q = spi_query_prepared($x,$_[0]);
+   my $result;
+   while (defined (my $y = spi_fetchrow($q))) {
+       $result = $y->{a};
+   }
+   spi_freeplan($x);
+   return $result;
+ $$ LANGUAGE plperl;
+ SELECT perl_spi_prepared_bad(4.35) as "double precision";
+ ERROR:  error from Perl function: type "does_not_exist" does not exist at line 2.

Index: src/pl/plperl/sql/plperl.sql
===================================================================
RCS file: /cvsroot/pgsql/src/pl/plperl/sql/plperl.sql,v
retrieving revision 1.11
diff -c -r1.11 plperl.sql
*** src/pl/plperl/sql/plperl.sql	13 Aug 2006 17:31:10 -0000	1.11
--- src/pl/plperl/sql/plperl.sql	26 Jan 2007 15:13:05 -0000
***************
*** 316,322 ****
  -- Test spi_prepare/spi_exec_prepared/spi_freeplan
  --
  CREATE OR REPLACE FUNCTION perl_spi_prepared(INTEGER) RETURNS INTEGER AS $$
!    my $x = spi_prepare('select $1 AS a', 'INT4');
Re: [HACKERS] HAVING push-down
Simon Riggs [EMAIL PROTECTED] writes:
> I've just read a paper that says PostgreSQL doesn't do this. My reading of the code is that we *do* evaluate the HAVING clause prior to calculating the aggregates for it. I thought I'd check to resolve the confusion.
>
> If not, it seems fairly straightforward to push down some or all of a HAVING clause so that the qual clause is tested prior to aggregation, not after aggregation. This could, for certain queries, significantly reduce the amount of effort that the final Agg node performs.

You mean in cases like this?

postgres=# explain select count(*) from customer group by c_w_id,c_d_id,c_id having c_w_id = 1 and c_d_id=1 and c_id=1;
                                      QUERY PLAN
 GroupAggregate  (cost=0.00..13.61 rows=1 width=12)
   ->  Index Scan using pk_customer on customer  (cost=0.00..13.56 rows=4 width=12)
         Index Cond: ((c_w_id = 1) AND (c_d_id = 1) AND (c_id = 1))
(3 rows)

I think we push having clauses into WHERE clauses whenever there are no aggregates in them.

--
Gregory Stark
EnterpriseDB   http://www.enterprisedb.com
Re: [HACKERS] BUG #2917: spi_prepare doesn't accept typename aliases
Andrew Dunstan [EMAIL PROTECTED] writes:
> see attached patch. If this is OK I will apply it and also fix pltcl and plpython similarly, mutatis mutandis.

Looks alright as far as it goes, but I'd suggest making one additional cleanup while you're in there: get rid of the direct syscache access altogether, instead using getTypeInputInfo(). The loop body should just consist of three function calls: parseTypeString, getTypeInputInfo, perm_fmgr_info.

If you wanted to be a bit more ambitious, maybe you could change the fact that this code is throwing away typmod, which means that declarations like varchar(32) would fail to work as expected. Perhaps it should be fixed to save the typmods alongside the typioparams and then pass them to InputFunctionCall instead of passing -1. On the other hand, we don't currently enforce typmod for any function input or result arguments, so maybe it's consistent that spi_prepare arguments ignore typmods too. Thoughts?

regards, tom lane
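[A sketch of what the three-call loop body Tom suggests might look like inside plperl_spi_prepare, assuming the qdesc fields from the patch above (argtypes, arginfuncs, argtypioparams). This illustrates the suggestion, not the committed code.]

    for (i = 0; i < argc; i++)
    {
        Oid     typId;
        Oid     typInput;
        Oid     typIOParam;
        int32   typmod;

        /* 1. resolve the (possibly aliased) type name, e.g. "double precision" */
        parseTypeString(SvPV(argv[i], PL_na), &typId, &typmod);

        /* 2. fetch the type's input function and IO parameter in one call */
        getTypeInputInfo(typId, &typInput, &typIOParam);

        /* 3. cache the input function info for later InputFunctionCall use */
        qdesc->argtypes[i] = typId;
        perm_fmgr_info(typInput, &(qdesc->arginfuncs[i]));
        qdesc->argtypioparams[i] = typIOParam;
    }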
Re: [HACKERS] HAVING push-down
On Fri, 2007-01-26 at 15:22 +0000, Gregory Stark wrote:
> Simon Riggs [EMAIL PROTECTED] writes:
>> I've just read a paper that says PostgreSQL doesn't do this. My reading of the code is that we *do* evaluate the HAVING clause prior to calculating the aggregates for it. I thought I'd check to resolve the confusion.
>
> You mean in cases like this?
>
> postgres=# explain select count(*) from customer group by c_w_id,c_d_id,c_id having c_w_id = 1 and c_d_id=1 and c_id=1;
>  GroupAggregate  (cost=0.00..13.61 rows=1 width=12)
>    ->  Index Scan using pk_customer on customer  (cost=0.00..13.56 rows=4 width=12)
>          Index Cond: ((c_w_id = 1) AND (c_d_id = 1) AND (c_id = 1))
> (3 rows)

OK, thanks. I'll feedback to the author of the paper I was reviewing.

--
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
Re: [HACKERS] crash on 8.2 and cvshead - failed to add item to the
Tom Lane wrote:
> Heikki Linnakangas [EMAIL PROTECTED] writes:
>> To see what's going on, I added some logs to the split code to print out the free space on both halves as calculated by findsplitloc, and the actual free space on the pages after split. I'm seeing a discrepancy of 4 bytes on the right half; actual space free on right page after split is 4 bytes less than anticipated.
>
> Hm, mis-counting the positions of itempointers maybe?

Found it:

	/* Count up total space in data items without actually scanning 'em */
	dataitemtotal = rightspace - (int) PageGetFreeSpace(page);

This is 4 bytes off, because PageGetFreeSpace subtracts sizeof(ItemIdData) from the actual free space on the page. We could do

	dataitemtotal = rightspace - ((int) PageGetFreeSpace(page) + sizeof(ItemIdData));

but that again would be 4 bytes off in the other direction if there's 0 bytes left on the page :(.

IMHO the right fix is to modify PageGetFreeSpace not to do the subtraction, it's a hack anyway, but that means we have to go through and fix every caller of it. Or we can add a new PageGetReallyFreeSpace function and keep the old one for compatibility. What do we want?

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
Re: [HACKERS] Implied Functional index use (redux)
On Thu, 2007-01-25 at 16:20 -0500, Tom Lane wrote:
> Simon Riggs [EMAIL PROTECTED] writes:
>> A simpler, alternate proposal is to allow the user to specify whether a functional index is transformable or not using CREATE or ALTER INDEX, with a default of not transformable. That then leaves the responsibility for specifying this with the user, who as we have seen is really the only person capable of judging the whole case on its merits. e.g.
>>
>>     CREATE INDEX fooidx ON foo (foofunc(foocol1)) [TABLESPACE ...] [ENABLE|DISABLE TRANSFORM] [WHERE ...];
>
> This is a foot-gun and nothing else. I hardly think the average DBA will realize such subtleties as "numeric equality doesn't guarantee that such-and-such works". If it's not specified by the datatype author it's not going to be safe.

OK, no problem.

The most beneficial use case is for string handling: name lookups, case-insensitive indexing, and index size reduction generally. If, for some reason, bpchar were to be excluded, then it would take away a great chunk of the benefit.

Two questions:
- Will bpchar be transformable?
- Do you see a clear way forward for specifying the information required to allow the transform? We need to specify the operator, which might be taken to include the datatype. (Peter suggested placing this on the function itself, though I think current precedent is to place it on the operator.) If you can say where you want the info to live, I can work out the details and repost.

If there's clear benefit and a clear way forward, then we might just be OK for 8.3. If not, I'll put this back on the shelf again in favour of other ideas.

--
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
Re: [HACKERS] HAVING push-down
Simon Riggs wrote:
> On Fri, 2007-01-26 at 15:22 +0000, Gregory Stark wrote:
>> Simon Riggs [EMAIL PROTECTED] writes:
>>> I've just read a paper that says PostgreSQL doesn't do this. My reading of the code is that we *do* evaluate the HAVING clause prior to calculating the aggregates for it. I thought I'd check to resolve the confusion.
>>
>> You mean in cases like this?
>>
>> postgres=# explain select count(*) from customer group by c_w_id,c_d_id,c_id having c_w_id = 1 and c_d_id=1 and c_id=1;
>>  GroupAggregate  (cost=0.00..13.61 rows=1 width=12)
>>    ->  Index Scan using pk_customer on customer  (cost=0.00..13.56 rows=4 width=12)
>>          Index Cond: ((c_w_id = 1) AND (c_d_id = 1) AND (c_id = 1))
>> (3 rows)
>
> OK, thanks. I'll feedback to the author of the paper I was reviewing.

Care to share the paper in general? It might be beneficial for all of us.

Joshua D. Drake

--
=== The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive PostgreSQL solutions since 1997
http://www.commandprompt.com/

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
PostgreSQL Replication: http://www.commandprompt.com/products/
Re: [HACKERS] crash on 8.2 and cvshead - failed to add item to the
Heikki Linnakangas [EMAIL PROTECTED] writes:
> IMHO the right fix is to modify PageGetFreeSpace not to do the subtraction, it's a hack anyway, but that means we have to go through and fix every caller of it. Or we can add a new PageGetReallyFreeSpace function and keep the old one for compatibility. What do we want?

It'd probably be a good idea to take a look at each caller and see whether it has a problem with that. I believe PageGetFreeSpace's behavior is actually the right thing for many of 'em. The idea is that subtracting the 4 bytes is often necessary and always safe/conservative (but is that true in this case? We're overestimating dataitemtotal, can that hurt us?). Is it worth changing each caller to try to account exactly for those 4 bytes?

In short, I'm inclined to leave the function alone unless changing it can be shown to be a win for most callers. Add a new function (perhaps PageGetExactFreeSpace would be a better name).

Keep in mind also that we need a minimal-change version for back-patching. If this is cleanup rather than bug fix, please submit it separately.

regards, tom lane
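[For context, a minimal sketch of what the proposed PageGetExactFreeSpace() would amount to: the raw gap between the line pointer array and the tuple data, without the sizeof(ItemIdData) that PageGetFreeSpace subtracts. A sketch of the idea under discussion, not committed code.]

    #include "postgres.h"
    #include "storage/bufpage.h"

    Size
    PageGetExactFreeSpace(Page page)
    {
        int     space;

        /* pd_upper is where tuple data starts; pd_lower ends the line pointers */
        space = (int) ((PageHeader) page)->pd_upper -
                (int) ((PageHeader) page)->pd_lower;

        /* defend against corrupt pd_lower/pd_upper, as PageGetFreeSpace does */
        if (space < 0)
            return 0;

        return (Size) space;
    }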
Re: [HACKERS] Proposal: Commit timestamp
On 1/26/2007 9:38 AM, Stephen Frost wrote:
> * Jan Wieck ([EMAIL PROTECTED]) wrote:
>> On 1/26/2007 2:37 AM, Naz Gassiep wrote:
>>> I would be *very* concerned that system time is not a guaranteed monotonic entity. Surely a counter or other internally managed mechanism would be a better solution.
>> Such a counter has only local relevance. How do you plan to compare the two separate counters on different machines to tell which transaction happened last?
>
> I'd also suggest you look into Lamport timestamps... Trusting the system clock just isn't practical, even with NTP. I've developed (albeit relatively small) systems using Lamport timestamps and would be happy to talk about it offlist. I've probably got some code I could share as well.

I think the system I described is a slightly modified Lamport generator. The maximum timestamp of any row updated in this transaction can be considered the counter value received from other nodes. Then I make sure that the next counter (timestamp) is higher than anything I know so far, and I add a cluster-wide unique tie breaker to that.

Looking closer, I don't even have to check the timestamps of the rows updated. Since a remote transaction being replicated will bump the local Lamport clock on commit, a local transaction modifying such a row will have a timestamp in the future of that remote transaction, even if my local clock is limping behind.

Jan

--
# It's easier to get forgiveness for being wrong than for being right.
# Let's break this rule - forgive me.
# [EMAIL PROTECTED]
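[In Lamport-clock terms, the rule Jan states reduces to a merge step at replicated-commit time, sketched here against the hypothetical last_commit_ts from the generator sketch earlier in this thread.]

    #include <stdint.h>

    extern int64_t last_commit_ts;  /* hypothetical shared clock from the earlier sketch */

    /*
     * Applying a replicated commit advances the local clock past the remote
     * timestamp. The next local commit then takes at least last_commit_ts + 1,
     * so it sorts after the remote transaction even if the wall clock lags.
     */
    static void
    bump_lamport_clock(int64_t remote_commit_ts)
    {
        if (remote_commit_ts > last_commit_ts)
            last_commit_ts = remote_commit_ts;
    }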
Re: [HACKERS] Implied Functional index use (redux)
Simon Riggs [EMAIL PROTECTED] writes: If there's clear benefit and a clear way forward, then we might just be OK for 8.3. If not, I'll put this back on the shelf again in favour of other ideas. I think this is still a long way off, and there are probably more useful things to work on for 8.3. Part of my antagonism stems from the fact that by means of the operator family rewrite I've been getting rid of some longstanding but really quite unacceptable assumptions about this operator does that. I don't want to see us start putting unsupported semantic assumptions back into the optimizer; rather its assumptions about operator behavior need to be clearly specified. As an example, without some careful preliminary thinking I'd have probably folded all the numeric types into one big opfamily and thereby broken transitivity :-(, leading to bugs that would be devilish to figure out. regards, tom lane ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] HAVING push-down
Simon Riggs [EMAIL PROTECTED] writes: I've just read a paper that says PostgreSQL doesn't do this. What does he mean by that exactly, and which PG version is he looking at? As Greg notes, we do know how to push down non-aggregated conditions, but I'm not sure that's what he's thinking of. There have been some relevant bug fixes, eg 2004-07-10 14:39 tgl * src/backend/executor/: nodeAgg.c (REL7_4_STABLE), nodeAgg.c: Test HAVING condition before computing targetlist of an Aggregate node. This is required by SQL spec to avoid failures in cases like SELECT sum(win)/sum(lose) FROM ... GROUP BY ... HAVING sum(lose) > 0; AFAICT we have gotten this wrong since day one. Kudos to Holger Jakobs for being the first to notice. Also, it's still true that we run all the aggregate transition functions in parallel, so if you were hoping to use HAVING on an aggregate condition to prevent an overflow or something in the state accumulation function for a targetlist aggregate, you'd lose. But I don't see any way to avoid that without scanning the data twice, which we're surely not gonna do. regards, tom lane ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] HAVING push-down
On Fri, 2007-01-26 at 11:16 -0500, Tom Lane wrote: Simon Riggs [EMAIL PROTECTED] writes: I've just read a paper that says PostgreSQL doesn't do this. What does he mean by that exactly, and which PG version is he looking at? As Greg notes, we do know how to push down non-aggregated conditions, but I'm not sure that's what he's thinking of. Yes, it was specifically non-aggregated conditions. There have been some relevant bug fixes, eg 2004-07-10 14:39 tgl * src/backend/executor/: nodeAgg.c (REL7_4_STABLE), nodeAgg.c: Test HAVING condition before computing targetlist of an Aggregate node. This is required by SQL spec to avoid failures in cases like SELECT sum(win)/sum(lose) FROM ... GROUP BY ... HAVING sum(lose) > 0; AFAICT we have gotten this wrong since day one. Kudos to Holger Jakobs for being the first to notice. Also, it's still true that we run all the aggregate transition functions in parallel, so if you were hoping to use HAVING on an aggregate condition to prevent an overflow or something in the state accumulation function for a targetlist aggregate, you'd lose. But I don't see any way to avoid that without scanning the data twice, which we're surely not gonna do. I'll send you the paper off-line, there's some more interesting stuff also. p.12 -- Simon Riggs EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [HACKERS] Piggybacking vacuum I/O
On 1/26/07, Alvaro Herrera [EMAIL PROTECTED] wrote: Heikki Linnakangas wrote: I'd like to see still more evidence that it's a problem before we start changing that piece of code. It has served us well for years. So the TODO could be investigate whether caching pg_clog and/or pg_subtrans in local memory can be useful for vacuum performance. As Heikki suggested, we should also investigate the same for normal backends as well. It would also be interesting to investigate whether early setting of hint bits can reduce subsequent writes of blocks. A typical case would be a large table being updated heavily for a while, followed by SELECT queries. The SELECT queries would set hint bits for the previously UPDATEd tuples (old and new versions) and thus cause subsequent writes of those blocks for what could have been read-only queries. Thanks, Pavan EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Proposal: Change of pg_trigger.tg_enabled and adding
[EMAIL PROTECTED] (Markus Schiltknecht) writes: Nice proposal. I'd support that enhancement and could make use of such triggers in Postgres-R as well, at least to provide these triggers to the user. Jan Wieck wrote: Good question. I don't know. I'd rather err on the safe side and make it multiple states; for now I only have Normal and Replica mode. Are these triggers intended to help implement async replication, or are these for users to be able to take action on remote replay of a transaction (i.e. on the replica)? Does that give a further distinction? Well, there's specific intent, and then there's general intent... If I understand correctly (and I think I do), the various threads that Jan has been starting do have *specific* intent in that he's got an implementation in mind that would specifically use the features he's asking about. But there is also the general intent that the features be usable more widely than that. If some generalization makes this particular feature useful for Postgres-R as well as Jan's work, that's better still. In Postgres-R, I mostly use the terms 'local' and 'remote'. Also, normal mode can easily be confused with non-replicated mode, thus I'd not mix that with replicated, local transaction mode (even if it's mostly equal, as in this case). My naming proposal would thus be:

A  fires always (i.e. fires N times, where N = nr of nodes)
L  fires on the transaction local node (i.e. only exactly once)
R  fires on the remote nodes only (i.e. (N - 1) times)
0  fires never

'1' for fires on both nodes seems confusing as well, because it's not like in single node DB operation, in that one event can fire the trigger multiple times (on different nodes). The current, single node PostgreSQL should thus use '0' or 'L'. I rather like your L for local and R for remote. An alternative to A for always would be B, standing for runs [B]oth on local and remote nodes. Of course, this is picking at nits; the important question is not what to call the names of the states, but rather whether the set of states is both desirable and complete... -- select 'cbbrowne' || '@' || 'cbbrowne.com'; http://cbbrowne.com/info/x.html Rules of the Evil Overlord #97. My dungeon cells will not be furnished with objects that contain reflective surfaces or anything that can be unravelled. http://www.eviloverlord.com/ ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
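To make the proposed states concrete, a hypothetical check in the trigger-firing code could look like the sketch below; the function, the enum, and the letter assignments are all illustrative, taken from Markus' naming proposal above rather than from any committed design.

#include <stdbool.h>

/* Hypothetical session state: is this session replaying remote changes? */
typedef enum { SESSION_ORIGIN, SESSION_REPLICA } SessionMode;

/* Sketch only: decide whether a trigger fires, given its configured
 * state ('A', 'L', 'R' or '0') and the current session mode. */
static bool
trigger_fires(char tgstate, SessionMode mode)
{
	switch (tgstate)
	{
		case 'A':
			return true;					/* always, i.e. on every node */
		case 'L':
			return mode == SESSION_ORIGIN;	/* only where the xact is local */
		case 'R':
			return mode == SESSION_REPLICA;	/* only on remote replay */
		case '0':
			return false;					/* never */
	}
	return false;			/* unknown state: safest not to fire */
}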
Re: [HACKERS] Proposal: Snapshot cloning
[EMAIL PROTECTED] (Gregory Stark) writes: Jan Wieck [EMAIL PROTECTED] writes: backend1: select publish_snapshot(); -- will block backend2: start transaction; backend2: set transaction isolation level serializable; backend2: select clone_snapshot(pid); -- will unblock backend1 It seems simpler to have a current_snapshot() function that returns an bytea or a new snapshot data type which set_current_snapshot(bytea) took to change your snapshot. Then you could use tables or out-of-band communication to pass around your snapshots however you please. set_current_snapshot() would have to sanity check that the xmin of the new snapshot isn't older than the current globaloldestxmin. That could be handy for debugging purposes too. Here's a wild thought... Would there be any sense in setting up the ability to declare expressly a transaction's visibility parameters? Consider that the Slony-I sl_event table records: ev_minxid, ev_maxxid, ev_xip Grabbing a sample from an instance... [ ev_minxid| ev_maxxid| ev_xip ] = [1377591608 | 1377591612 | '1377591608','1377591610'] Would it be plausible to, in effect, assert these things? To say: start transaction; set transaction isolation level serializable; select set_transaction_visibility(1377591608, 1377591612, [1377591608, 1377591610]); And thus assert the visibility that was recorded at that point in time? I may very well have the parameters characterized in a wrong way; please assume an appropriate way instead as needed :-). This would permit, if I am seeing this right, a way that you could, in effect, get a form of time travel via this where you'd be able to arbitrarily point at different forms of data visibility. The wild part being that you could assert data visibility declarations that a normal connection couldn't naturally obtain... -- let name=cbbrowne and tld=linuxdatabases.info in name ^ @ ^ tld;; http://linuxfinances.info/info/multiplexor.html Sturgeon's Law: 90% of *EVERYTHING* is crud. ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] Piggybacking vacuum I/O
Pavan Deolasee wrote: On 1/26/07, Alvaro Herrera [EMAIL PROTECTED] wrote: Heikki Linnakangas wrote: I'd like to see still more evidence that it's a problem before we start changing that piece of code. It has served us well for years. So the TODO could be investigate whether caching pg_clog and/or pg_subtrans in local memory can be useful for vacuum performance. As Heikki suggested, we should also investigate the same for normal backends as well. Maybe. An idea that comes to mind is to never cache the latest page, since it'll most likely result in extra reads anyway because there'll be a lot of IN_PROGRESS transactions. Problem to solve: how much memory to dedicate to this? Could we mmap() portions of the pg_clog segment, so that the page could be shared across backends instead of allocating them for each? It would also be interesting to investigate whether early setting of hint bits can reduce subsequent writes of blocks. A typical case would be a large table being updated heavily for a while, followed by SELECT queries. The SELECT queries would set hint bits for the previously UPDATEd tuples (old and new versions) and thus cause subsequent writes of those blocks for what could have been read-only queries. This has been suggested before, but I don't see how this could work. How does the UPDATE transaction go back to the pages it wrote to update the hint bits, _after_ it committed? Maybe have the bgwriter update hint bits as it evicts pages out of the cache? It could result in pg_clog read traffic for each page that needs eviction; not such a hot idea. I don't see how this is related to the above proposal though. -- Alvaro Herrerahttp://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
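As a rough illustration of the mmap() idea (not a worked-out design: the real clog lives behind the SLRU layer, and error handling is omitted), mapping a clog page shared and read-only would let every backend use the kernel's single copy instead of a private buffer:

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#define CLOG_PAGE_SIZE 8192		/* stand-in for the real BLCKSZ */

/* Sketch only: map one clog page shared and read-only; page_offset
 * must be a multiple of the OS page size, which 8 kB multiples are
 * on common platforms. */
static const char *
map_clog_page(const char *segment_path, off_t page_offset)
{
	int			fd = open(segment_path, O_RDONLY);
	const char *page;

	if (fd < 0)
		return NULL;
	page = mmap(NULL, CLOG_PAGE_SIZE, PROT_READ, MAP_SHARED, fd, page_offset);
	close(fd);					/* the mapping survives the close */
	return (page == MAP_FAILED) ? NULL : page;
}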
Re: [HACKERS] Proposal: Snapshot cloning
Simon Riggs [EMAIL PROTECTED] writes: No, that would break MVCC. But we may have done lots of updates/deletes that are *not* visible to any Snapshot, yet are not yet removable because they are higher than OldestXmin but we don't know that because previously the Snapshot details were not available. ISTM that this proposal is a way of making the Snapshot limits publicly available so that they can be used by VACUUM. Certainly not, unless you intend that *every* snapshot *must* be published, which is an overhead up with which we will not put. One pretty serious problem with the proposal as written is the part about the sender blocking until the receiver takes the snap; that means it's not really a publish in the sense that you can make it available without worrying about exactly how many readers there might or might not be. That alone is sufficient to kill any thought of VACUUM making use of the info. I'd feel happier with an implementation more like prepared transactions: you stuff the information into shared memory and it sits there, readable by anyone, until such time as you take it down again. Like prepared xacts, GlobalXmin calculations would need to include these snapshots (and hence they'd limit vacuums). A shared-memory area would have to be fixed size, but perhaps backing files, like those used by prepared xacts, could handle the overflow for very large xip lists. Presumably crash safety is not an issue so this wouldn't require any complicated mechanism. regards, tom lane ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
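To make the prepared-transaction analogy concrete, a published snapshot slot in shared memory might carry roughly the following; every name here is invented for illustration:

/* Stand-ins for the real typedefs in postgres_ext.h / c.h */
typedef unsigned int Oid;
typedef unsigned int TransactionId;

/* Sketch only: a snapshot published in shared memory until its owner
 * takes it down.  GlobalXmin computations would have to scan these
 * slots, exactly as they already include prepared transactions. */
typedef struct PublishedSnapshot
{
	Oid			owner;		/* role permitted to remove the slot */
	TransactionId xmin;		/* holds back GlobalXmin while published */
	TransactionId xmax;
	int			xcnt;		/* number of in-progress xids */
	TransactionId xip[1];	/* variable length; very large xip lists
							 * could spill to a backing file */
} PublishedSnapshot;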
Re: [HACKERS] Implied Functional index use (redux)
On Fri, 2007-01-26 at 10:58 -0500, Tom Lane wrote: Simon Riggs [EMAIL PROTECTED] writes: If there's clear benefit and a clear way forward, then we might just be OK for 8.3. If not, I'll put this back on the shelf again in favour of other ideas. I think this is still a long way off, and there are probably more useful things to work on for 8.3. Part of my antagonism stems from the fact that by means of the operator family rewrite I've been getting rid of some longstanding but really quite unacceptable assumptions about this operator does that. I don't want to see us start putting unsupported semantic assumptions back into the optimizer; rather its assumptions about operator behavior need to be clearly specified. As an example, without some careful preliminary thinking I'd have probably folded all the numeric types into one big opfamily and thereby broken transitivity :-(, leading to bugs that would be devilish to figure out. OK, no problems. All of the above says time, which is becoming rare as we approach 8.3 anyways. I'll pick it up again in 8.4. Some notes-to-self for the future:

- ideally want to be able to decide transformability at CREATE INDEX time; this will reduce planning time for functional index usage when there are no possible transforms.
- may want to do this by having a special catalog table that holds the cases that *will* work, to make it both safer and faster to look up. Sort of like pg_autovacuum - absence means No.

-- Simon Riggs EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] Proposal: Snapshot cloning
Hannu Krosing [EMAIL PROTECTED] writes: On Thu, 2007-01-25 at 22:19, Jan Wieck wrote: The cloning process needs to make sure that the clone_snapshot() call is made from the same DB user in the same database as the corresponding publish_snapshot() call was made. Why? Snapshot is universal and same for whole db instance, so why limit it to same user/database? Yeah. Use-case: pg_dumpall could guarantee that it produces consistent snapshots across multiple databases. (Not sure I actually want that, but it's at least arguably useful to someone.) I think you would want to mark a snapshot with an owner, but that would be for the purpose of restricting who could take it down, not who could copy it. regards, tom lane ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] Piggybacking vacuum I/O
On 1/26/07, Alvaro Herrera [EMAIL PROTECTED] wrote: Maybe have the bgwriter update hint bits as it evicts pages out of the cache? It could result in pg_clog read traffic for each page that needs eviction; not such a hot idea. I thought that once we enhance clog so that there are no clog reads, bgwriter would be able to update hint bits without getting into any deadlock on pg_clog reads. Maybe we can have this as a separate TODO. Thanks, Pavan -- EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] HAVING push-down
On Fri, 2007-01-26 at 07:46 -0800, Joshua D. Drake wrote: Care to share the paper in general? It might be beneficial for all of us. I'll ask the author, but don't expect an immediate response. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
[HACKERS] Recursive query syntax ambiguity
Hm, I had hoped that the DB2/ANSI syntax would only require making WITH a fully reserved word, and not the other tokens it uses. Certainly for non-recursive queries that's the case as the only other token it uses is AS which is already a fully reserved word. However to fully support the DB2/ANSI syntax we would definitely have an ambiguity and I think we would have to make CYCLE a fully reserved word which seems like a much bigger concession than WITH. Observe the following case: WITH RECURSIVE foo (x,y) AS (select 1,2) SEARCH DEPTH FIRST BY x CYCLE x,y SET ... The parser can't search arbitrarily far checking for a SET to see if the CYCLE is a keyword or a binary operator. Even if it could things like this would be entirely ambiguous: WITH RECURSIVE foo (x,y) AS (select 1,2) SEARCH DEPTH FIRST BY x CYCLE x, y CYCLE y SET ... I'm nowhere near actually implementing this functionality yet so there's no pressing need for action. In fact I think the search clause is actually an ANSIism that isn't supported by DB2 itself yet either. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [HACKERS] Proposal: Snapshot cloning
On Fri, 2007-01-26 at 11:36 -0500, Tom Lane wrote: Simon Riggs [EMAIL PROTECTED] writes: No, that would break MVCC. But we may have done lots of updates/deletes that are *not* visible to any Snapshot, yet are not yet removable because they are higher than OldestXmin but we don't know that because previously the Snapshot details were not available. ISTM that this proposal is a way of making the Snapshot limits publicly available so that they can be used by VACUUM. Certainly not, unless you intend that *every* snapshot *must* be published, which is an overhead up with which we will not put. Agreed, but that's the general case problem. What I was hoping was that this would provide a mechanism for long running transactions (LRTs) to publish their min/max Xids. Then if all backends publish the minimum Xid of any Snapshot they have generated in the proc array, we'd be able to decide if there are any large holes in the global set of Snapshots. As a general case that's hard to evaluate, but in the common case of a lone LRT and all the rest short duration transactions you can end up with a gap of 250,000+ transactions opening up between the two. It would be fairly easy to have VACUUM check for large visibility gaps between groups of transactions and then use that to improve its effectiveness in the presence of LRTs. Theoretically we have to keep the chain of intermediate updates around so it can be traversed by the old transaction, but in practical terms traversing a long chain of updates isn't sensible. Serializable LRTs will never traverse the chain anyway (that's a serializability error), but there are some special cases to consider, hence my mentioning an unresolved problem previously. We'd need to be much more careful about the way Snapshots are managed, so we can be certain that we take them all into account. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] Proposal: Snapshot cloning
Tom Lane [EMAIL PROTECTED] writes: set_current_snapshot() would have to sanity check that the xmin of the new snapshot isn't older than the current globaloldestxmin. That would solve the backend to backend IPC problem nicely. But it fails on the count of making sure that globaloldestxmin doesn't advance past the snap you want to use. And exactly how will you pass a snap through a table? It won't become visible until you commit ... whereupon your own xmin isn't blocking the advance of globaloldestxmin. Hm, good point. You could always do it in a separate connection, but that starts to get annoying. I was more envisioning passing it around out-of-band though, something like:

$db->execute("SET TRANSACTION ISOLATION LEVEL SERIALIZABLE");
$snap = $db->execute("select current_snapshot()");
for each db {
    if (fork())
        $slave[i] = $db->connect();
        $slave[i]->execute("select set_snapshot($snap)");
        $slave[i]->execute("copy table[i] to file[i]");
}

I'm also wondering about something like:

$db->execute("SET TRANSACTION ISOLATION LEVEL SERIALIZABLE");
$snap = $db->execute("select current_snapshot()");
if (fork()) {
    $slave = $db->connect();
    $slave->execute("select set_snapshot($snap)");
    $slave->execute("copy tab from hugefile");
    signal parent
} else {
    while (no signal yet) {
        $rows_loaded_so_far = $db->execute("select count(*) from tab");
        display_progress($rows_loaded_so_far);
        sleep(60);
    }
}

Sorry for the vaguely perlish pseudocode but it's the clearest way I can think to write it. I don't think it would make much sense to try to do anything like this in plpgsql; I think you really do want to be doing it in a language outside the database where it's easier to open multiple connections and handle IPC. I realize the second idea might take more hackery than just setting the snapshot... In particular as written above it wouldn't work because the slave would be writing with a new xid that isn't actually in the snapshot. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] Piggybacking vacuum I/O
Alvaro Herrera [EMAIL PROTECTED] writes: Pavan Deolasee wrote: It would also be interesting to investigate whether early setting of hint bits can reduce subsequent writes of blocks. A typical case would be a large table being updated heavily for a while, followed by SELECT queries. The SELECT queries would set hint bits for the previously UPDATEd tuples (old and new versions) and thus cause subsequent writes of those blocks for what could have been read-only queries. This has been suggested before, but I don't see how this could work. How does the UPDATE transaction go back to the pages it wrote to update the hint bits, _after_ it committed? I think what he's suggesting is deliberately not updating the hint bits during a SELECT ... but that's surely misguided. If you don't set the hint bit after discovering the transaction commit state, then the next visitor of the tuple will have to repeat the clog lookup, meaning that any such policy greatly increases clog read traffic and contention. regards, tom lane ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
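In code terms, the pattern Tom is defending looks roughly like the following; the types and the flag value are simplified stand-ins for the real definitions in htup.h and transam.c:

#include <stdbool.h>

#define HEAP_XMIN_COMMITTED 0x0100	/* simplified stand-in */

typedef unsigned int TransactionId;

typedef struct HeapTupleHeaderData
{
	TransactionId t_xmin;
	unsigned short t_infomask;
} HeapTupleHeaderData;

extern bool TransactionIdDidCommit(TransactionId xid);	/* clog lookup */

/*
 * Sketch only: on first visit, resolve the inserting transaction's fate
 * via clog and cache the answer in the tuple header.  Skipping this
 * update is what would force every later visitor of the tuple to repeat
 * the clog lookup.
 */
static void
set_xmin_hint_if_committed(HeapTupleHeaderData *tuple)
{
	if (!(tuple->t_infomask & HEAP_XMIN_COMMITTED) &&
		TransactionIdDidCommit(tuple->t_xmin))
		tuple->t_infomask |= HEAP_XMIN_COMMITTED;
}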
Re: [HACKERS] Proposal: Snapshot cloning
On 1/26/2007 12:22 PM, Simon Riggs wrote: On Fri, 2007-01-26 at 11:36 -0500, Tom Lane wrote: Simon Riggs [EMAIL PROTECTED] writes: No, that would break MVCC. But we may have done lots of updates/deletes that are *not* visible to any Snapshot, yet are not yet removable because they are higher than OldestXmin but we don't know that because previously the Snapshot details were not available. ISTM that this proposal is a way of making the Snapshot limits publicly available so that they can be used by VACUUM. Certainly not, unless you intend that *every* snapshot *must* be published, which is an overhead up with which we will not put. Agreed, but that's the general case problem. What I was hoping was that this would provide a mechanism for long running transactions (LRTs) to publish their min/max Xids. Then if all backends publish the minimum Xid of any Snapshot they have generated in the proc array, we'd be able to decide if there are any large holes in the global set of Snapshots. As a general case that's hard to evaluate, but in the common case of a lone LRT and all the rest short duration transactions you can end up with a gap of 250,000+ transactions opening up between the two. It would be fairly easy to have VACUUM check for large visibility gaps between groups of transactions and then use that to improve its effectiveness in the presence of LRTs. There is a flaw in that theory. If you have a single LRT, then each subsequent transaction's xmin will be exactly that one, no? Jan Theoretically we have to keep the chain of intermediate updates around so it can be traversed by the old transaction, but in practical terms traversing a long chain of updates isn't sensible. Serializable LRTs will never traverse the chain anyway (that's a serializability error), but there are some special cases to consider, hence my mentioning an unresolved problem previously. We'd need to be much more careful about the way Snapshots are managed, so we can be certain that we take them all into account. -- #==# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #== [EMAIL PROTECTED] # ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
[HACKERS] Recursive query syntax ambiguity
Woah, I just realized it's much worse than that. I think the syntax in the ANSI is not parsable in LALR(1) at all. Note the following: WITH RECURSIVE foo (a,b) AS (subq) SEARCH BREADTH FIRST BY a,b,c(x,z),d(y,z) AS (subq) SELECT ... To determine whether c is the name of a new with list element it has to scan as far ahead as the , before the d. Note that d here is in fact not part of the search clause at all, it's the name of a second with list element. bleagh. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] Proposal: Snapshot cloning
On 1/26/2007 11:58 AM, Tom Lane wrote: Jan Wieck [EMAIL PROTECTED] writes: On 1/26/2007 8:06 AM, Gregory Stark wrote: It seems simpler to have a current_snapshot() function that returns an bytea or a new snapshot data type which set_current_snapshot(bytea) took to change your snapshot. Then you could use tables or out-of-band communication to pass around your snapshots however you please. set_current_snapshot() would have to sanity check that the xmin of the new snapshot isn't older than the current globaloldestxmin. That would solve the backend to backend IPC problem nicely. But it fails on the count of making sure that globaloldestxmin doesn't advance past the snap you want to use. And exactly how will you pass a snap through a table? It won't become visible until you commit ... whereupon your own xmin isn't blocking the advance of globaloldestxmin. The client receives the snapshot information as a result from the function call to current_snapshot(). The call to set_current_snapshot(snap) errors out if snap's xmin is older than globaloldestxmin. It is the client app that has to make sure that the transaction that created snap is still in progress. I didn't say passing anything through a table. Take a modified pg_dump as an example. It could write multiple files. A pre-load sql with the first part of the schema. Then a post-load sql with the finalization of same (creating indexes, adding constraints). It then builds a list of all relations to COPY, starts n threads each writing a different file. Each thread connects to the DB and adjusts the snapshot to the one of the main transaction (which is still open). Then each thread grabs the next table to dump from the list and writes the COPY data to its output file. The threads exit when the list of tables is empty. The main thread waits until the last thread has joined and commits the main transaction. Wouldn't be too hard to write a script that restores that split dump in parallel as well. Jan -- #==# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #== [EMAIL PROTECTED] # ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] [pgsql-patches] pg_dump pretty_print
Greg Sabino Mullane [EMAIL PROTECTED] writes: Peter Eisentraut replied: The harm here is that under undefined circumstances a dump file will not be a proper and robust representation of the original database, which would add significant confusion and potential for error. What undefined circumstances are we talking here? If there is a chance that pg_get_viewdef and company do not output a version that can be read again by the database because we simply changed the whitespace, that sounds like a serious bug to be fixed, not a reason to reject this optional flag. The original definition of the prettyprint flag was that it'd produce a version that was nice to look at but not guaranteed to parse back exactly the same; in particular it might omit parentheses that perhaps were really needed to ensure the same parsing. (I think there might be some other issues too ... but whitespace is NOT one of them.) It's possible that the current prettyprint code is smart enough to never make such an error --- and then again it's possible that it isn't. Like Peter, I've not got much confidence in that code, and don't want to trust pg_dump's correctness to it. regards, tom lane ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] Recursive query syntax ambiguity
Gregory Stark wrote: Woah, I just realized it's much worse than that. I think the syntax in the ANSI is not parsable in LALR(1) at all. Note the following: WITH RECURSIVE foo (a,b) AS (subq) SEARCH BREADTH FIRST BY a,b,c(x,z),d(y,z) AS (subq) SELECT ... To determine whether c is the name of a new with list element it has to scan as far ahead as the , before the d. Note that d here is in fact not part of the search clause at all, it's the name of a second with list element. bleagh. Can you post the rules you have so far that you're playing around with? (Also maybe the rules from the standard - I don't have a copy handy). cheers andrew ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [HACKERS] Piggybacking vacuum I/O
On 1/26/07, Tom Lane [EMAIL PROTECTED] wrote: I think what he's suggesting is deliberately not updating the hint bits during a SELECT ... No, I was suggesting doing it in bgwriter so that we may not need to do that during a SELECT. Of course, we need to investigate more and have numbers to prove the need. Also you have already expressed concerns that doing so in bgwriter is deadlock prone. So there is certainly more work needed for any such scheme to work. Thanks, Pavan -- EnterpriseDB http://www.enterprisedb.com
[HACKERS] NULL value in subselect in UNION causes error
Checked it against HEAD and 8.2: postgres=# select 1, 1, 1 union select * from (select 2, null, 2) two; ERROR: failed to find conversion function from unknown to integer Jan -- #==# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #== [EMAIL PROTECTED] # ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] [GENERAL] Autovacuum Improvements
On Wed, Jan 24, 2007 at 07:30:05PM -0500, Tom Lane wrote: Kenneth Marshall [EMAIL PROTECTED] writes: Not that I am aware of. Even extending the relation by one additional block can make a big difference in performance Do you have any evidence to back up that assertion? It seems a bit nontrivial to me --- not the extension part exactly, but making sure that the space will get used promptly. With the current code the backend extending a relation will do subsequent inserts into the block it just got, which is fine, but there's no mechanism for remembering that any other newly-added blocks are available --- unless you wanted to push them into the FSM, which could work but the current FSM code doesn't support piecemeal addition of space, and in any case there's some question in my mind about the concurrency cost of increasing FSM traffic even more. In short, it's hardly an unquestionable improvement, so we need some evidence. regards, tom lane My comment was purely based on the reduction in fragmentation of the file behind the relation. A result that I have seen repeatedly in file related data processing. It does sound much more complicated to make the additional space available to other backends. If one backend was doing many inserts, it might still be of value even for just that backend. As you mention, testing is needed to see if there is enough value in this process. Ken ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] New feature proposal
Dear Developers, Thanks for your answers. I didn't know about generate_series, but it looks to be exactly what I was suggesting. Regards, Sorin Schwimmer ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] Recursive query syntax ambiguity
Andrew Dunstan [EMAIL PROTECTED] writes: Can you post the rules you have so far that you're playing around with? (Also maybe the rules from the standard - I don't have a copy handy). This is the best compromise I've come up with so far. It makes CYCLE a reserved word and requires a CYCLE clause if there's a SEARCH clause.

--- gram.y	09 Jan 2007 02:14:14 +0000	2.573
+++ gram.y	26 Jan 2007 20:02:21 +0000
@@ -350,6 +350,10 @@
 %type <node>	xml_root_version opt_xml_root_standalone
 %type <boolean>	document_or_content xml_whitespace_option
 
+%type <node>	common_table_expression
+%type <list>	with_cte_list cte_list
+%type <boolean>	recursive_is_depth_first
+
 /*
  * If you make any token changes, update the keyword table in
@@ -364,7 +368,7 @@
 	ASSERTION ASSIGNMENT ASYMMETRIC AT AUTHORIZATION
 
 	BACKWARD BEFORE BEGIN_P BETWEEN BIGINT BINARY BIT
-	BOOLEAN_P BOTH BY
+	BOOLEAN_P BOTH BREADTH BY
 
 	CACHE CALLED CASCADE CASCADED CASE CAST CHAIN CHAR_P
 	CHARACTER CHARACTERISTICS CHECK CHECKPOINT CLASS CLOSE
@@ -376,7 +380,7 @@
 	DATABASE DAY_P DEALLOCATE DEC DECIMAL_P DECLARE DEFAULT DEFAULTS
 	DEFERRABLE DEFERRED DEFINER DELETE_P DELIMITER DELIMITERS
-	DESC DISABLE_P DISTINCT DO DOCUMENT_P DOMAIN_P DOUBLE_P DROP
+	DEPTH DESC DISABLE_P DISTINCT DO DOCUMENT_P DOMAIN_P DOUBLE_P DROP
 
 	EACH ELSE ENABLE_P ENCODING ENCRYPTED END_P ESCAPE EXCEPT EXCLUDING
 	EXCLUSIVE EXECUTE EXISTS EXPLAIN EXTERNAL EXTRACT
@@ -416,11 +420,11 @@
 	QUOTE
 
-	READ REAL REASSIGN RECHECK REFERENCES REINDEX RELATIVE_P RELEASE RENAME
+	READ REAL REASSIGN RECHECK RECURSIVE REFERENCES REINDEX RELATIVE_P RELEASE RENAME
 	REPEATABLE REPLACE RESET RESTART RESTRICT RETURNING RETURNS REVOKE
 	RIGHT ROLE ROLLBACK ROW ROWS RULE
 
-	SAVEPOINT SCHEMA SCROLL SECOND_P SECURITY SELECT SEQUENCE
+	SAVEPOINT SCHEMA SCROLL SEARCH SECOND_P SECURITY SELECT SEQUENCE
 	SERIALIZABLE SESSION SESSION_USER SET SETOF SHARE SHOW SIMILAR SIMPLE
 	SMALLINT SOME STABLE STANDALONE_P START STATEMENT STATISTICS STDIN
 	STDOUT STORAGE STRICT_P STRIP_P SUBSTRING SUPERUSER_P
@@ -5681,6 +5685,25 @@
 					list_nth($3, 0), list_nth($3, 1));
 				$$ = $1;
 			}
+		| with_cte_list simple_select			{ $$ = $2; }
+		| with_cte_list select_clause sort_clause
+			{
+				insertSelectOptions((SelectStmt *) $2, $3, NIL,
+									NULL, NULL);
+				$$ = $2;
+			}
+		| with_cte_list select_clause opt_sort_clause for_locking_clause opt_select_limit
+			{
+				insertSelectOptions((SelectStmt *) $2, $3, $4,
+									list_nth($5, 0), list_nth($5, 1));
+				$$ = $2;
+			}
+		| with_cte_list select_clause opt_sort_clause select_limit opt_for_locking_clause
+			{
+				insertSelectOptions((SelectStmt *) $2, $3, $5,
+									list_nth($4, 0), list_nth($4, 1));
+				$$ = $2;
+			}
 		;
 
 select_clause:
@@ -5742,6 +5765,72 @@
 			}
 		;
 
+/*
+ * ANSI standard WITH clause looks like:
+ * WITH [ RECURSIVE ] <query name> [ (<column>,...) ] AS (query) [SEARCH or CYCLE clause]
+ *
+ * It seems with_cte_list has to be a separate token or else there's a s/r
+ * conflict between RECURSIVE and the cte name. This means we'll need a silly
+ * node just to hold the list and the recursive flag :( it doesn't seem like
+ * making RECURSIVE a fully reserved word would be very nice as it's probably a
+ * common column name.
+ *
+ * For now we don't support recursive so just ignore it and pass up the list
+ *
+ */
+with_cte_list:
+		WITH cte_list
+			{
+				$$ = $2;
+			}
+		| WITH RECURSIVE cte_list
+			{
+				elog(WARNING, "WITH RECURSIVE not supported yet");
+				$$ = $3;
+			}
+		;
+
+cte_list:
+		common_table_expression			{ $$ = list_make1($1); }
+		|
[HACKERS] VC2005 build and pthreads
Folks, I would like to build pg on VC2005. How do I use pthreads that is mentioned in the README file. Do I need the DLL? Sources? LIB? Where do I install or copy them.. Regards, Gevik ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] NULL value in subselect in UNION causes error
Jan Wieck [EMAIL PROTECTED] writes: Checked it against HEAD and 8.2: postgres=# select 1, 1, 1 union select * from (select 2, null, 2) two; ERROR: failed to find conversion function from unknown to integer It's always done that. The SQL spec would tell you that you have to cast the null to some specific type. We allow untyped nulls in contexts where we can derive a type from reasonably nearby context, but an integer two levels up and over in another union arm isn't very nearby IMHO. So I'm not particularly concerned about making this work ... regards, tom lane ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Recursive query syntax ambiguity
On Fri, Jan 26, 2007 at 05:10:01PM +, Gregory Stark wrote: However to fully support the DB2/ANSI syntax we would definitely have an ambiguity and I think we would have to make CYCLE a fully reserved word which seems like a much bigger concession than WITH. Observe the following case: WITH RECURSIVE foo (x,y) AS (select 1,2) SEARCH DEPTH FIRST BY x CYCLE x,y SET ... The parser can't search arbitrarily far checking for a SET to see if the CYCLE is a keyword or a binary operator. Er, CYCLE isn't a binary operator, and users can't make binary operators that are words, so I'm not sure of the problem here. I think the parser can tell that the expression ends at the word cycle. Or am I missing obvious? Have a nice day, -- Martijn van Oosterhout kleptog@svana.org http://svana.org/kleptog/ From each according to his ability. To each according to his ability to litigate. signature.asc Description: Digital signature
Re: [HACKERS] NULL value in subselect in UNION causes error
On 1/26/2007 3:41 PM, Tom Lane wrote: Jan Wieck [EMAIL PROTECTED] writes: Checked it against HEAD and 8.2: postgres=# select 1, 1, 1 union select * from (select 2, null, 2) two; ERROR: failed to find conversion function from unknown to integer It's always done that. The SQL spec would tell you that you have to cast the null to some specific type. We allow untyped nulls in contexts where we can derive a type from reasonably nearby context, but an integer two levels up and over in another union arm isn't very nearby IMHO. So I'm not particularly concerned about making this work ... That explains. Thanks. Jan -- #==# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #== [EMAIL PROTECTED] # ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [HACKERS] VC2005 build and pthreads
On Fri, Jan 26, 2007 at 09:34:10PM +0100, Gevik Babakhani wrote: Folks, I would like to build pg on VC2005. How do I use pthreads that is mentioned in the README file. Do I need the DLL? Sources? LIB? Where do I install or copy them.. Err, pthreads is a threads library for Unix, I don't think Windows has that, nor can I think of a situation where you'd need to worry about threads anyway? Have a nice day. -- Martijn van Oosterhout kleptog@svana.org http://svana.org/kleptog/ From each according to his ability. To each according to his ability to litigate. signature.asc Description: Digital signature
[HACKERS] pg_restore exclude schema from being droped option
Hi list, I am using pg_dump and pg_restore to back up and restore a database, but there is something that I believe is missing from the restore process: an option in pg_restore to exclude a schema from being dropped when the -c option is given. And here is why: Suppose that I have a database with about 12 tables of customer data (address, notes, configuration ... no more than 10.000 rows each) and 50 tables of history data (history files with about 1.000.000 rows each). Now, to back up the database I chose to create 2 scripts, BackupData.sh to back up all the small tables and BackupHist.sh to back up the history tables. When I call pg_restore -c to restore the data tables, pg_restore reports a failure because it is trying to drop a schema that is not empty. So it would be very helpful to have an option to exclude the schema (for example: public) from being dropped. I know that there are some alternatives, for example: pg_restore -l, comment out the drop schema line, and pg_restore -L; or put data tables and hist tables into separate schemas. But an option to pg_restore would be a lot easier, I guess. What do you think? regards, Kostis Mentzelos ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
[HACKERS] PostgreSQL Data Loss
Hi everyone, I have a problem with one of my customers. I made a program that uses a PostgreSQL (win32) database to save its data. My customer claims that he lost lots of data regarding his own clients and that those data had surely been saved on the database. My first guess is that he is the one who deleted the data but wants to blame someone else; obviously I can't prove it. Could it be possible for PostgreSQL to lose its data? Maybe with a file corruption? Could it be possible to restore these data? My program does not modify or delete data since it's more like a log that only adds information. It is obviously possible to delete these logs but it requires answering yes to 2 different warnings, so the data can't be deleted accidentally. I have other customers with even 10 times the amount of data of the one who claimed the loss but no problems with them. He obviously made no backups (and claims we never told him to do them so we are responsible even for this) though the program has a dedicated Backup-section. Any suggestion? Daniele ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] VC2005 build and pthreads
pthreads is needed to build PG in VC++ 2005; please read pgsql/src/tools/msvc/README. Have a nice day. On Fri, 2007-01-26 at 21:47 +0100, Martijn van Oosterhout wrote: On Fri, Jan 26, 2007 at 09:34:10PM +0100, Gevik Babakhani wrote: Folks, I would like to build pg on VC2005. How do I use pthreads that is mentioned in the README file. Do I need the DLL? Sources? LIB? Where do I install or copy them.. Err, pthreads is a threads library for Unix, I don't think Windows has that, nor can I think of a situation where you'd need to worry about threads anyway? Have a nice day. ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [HACKERS] Recursive query syntax ambiguity
Ok, looking at your example: WITH RECURSIVE foo (a,b) AS (subq) SEARCH BREADTH FIRST BY a,b, c(x,z),d(y,z) AS (subq) SELECT ... What you're trying to say is that the c is a with list element, not a cycle column. But the parser will see that as soon as it hits the open parenthesis, since a cycle column is always just a column name. Also, the AS in the with list element doesn't appear to be optional; I assume you left that out after the c(x,z) for clarity. I think bison should be able to handle this as long as the name in common_table_expression matches exactly the same things as whatever columnList uses. It can then merge the two parse paths, allowing it to see further. Have a nice day, -- Martijn van Oosterhout kleptog@svana.org http://svana.org/kleptog/ From each according to his ability. To each according to his ability to litigate. signature.asc Description: Digital signature
Re: [HACKERS] [PERFORM] how to plan for vacuum?
On Jan 25, 2007, at 10:33 AM, Ray Stell wrote: On Thu, Jan 25, 2007 at 08:04:49AM -0800, Joshua D. Drake wrote: It really depends on the system. Most of our systems run anywhere from 10-25ms. I find that any more than that, Vacuum takes too long. How do you measure the impact of setting it to 12 as opposed to 15? If you've got a tool that will report disk utilization as a percentage it's very easy; I'll decrease the setting until I'm at about 90% utilization with the system's normal workload (leaving some room for spikes, etc). Sometimes I'll also tune the costs if reads vs. writes are a concern. -- Jim Nasby[EMAIL PROTECTED] EnterpriseDB http://enterprisedb.com 512.569.9461 (cell) ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] VC2005 build and pthreads
Martijn van Oosterhout wrote: On Fri, Jan 26, 2007 at 09:34:10PM +0100, Gevik Babakhani wrote: Folks, I would like to build pg on VC2005. How do I use pthreads that is mentioned in the README file. Do I need the DLL? Sources? LIB? Where do I install or copy them.. Err, pthreads is a threads library for Unix, I don't think Windows has that, nor can I think of a situation where you'd need to worry about threads anyway? There is a pthreads for win32 as well. However, you don't need it to build, unless you build ecpg. I forgot to update the README when I put that patch in .-) If you want it, it's on ftp://sources.redhat.com/pub/pthreads-win32. IIRC, that's actually mentioned in the README file. You need the lib and headers. //Magnus ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] No ~ operator for box, point
On Jan 25, 2007, at 6:26 PM, Tom Lane wrote: Martijn van Oosterhout kleptog@svana.org writes: On Thu, Jan 25, 2007 at 01:59:33PM -0500, Merlin Moncure wrote: On 1/25/07, Jim C. Nasby [EMAIL PROTECTED] wrote: decibel=# select box '((0,0),(2,2))' ~ point '(1,1)'; ERROR: operator does not exist: box ~ point I don't see a reason, although you can do it with polygon and not box. Seems like an old oversight. Ok. If I ever get some time I'll submit a patch to bring everything in-line (there's other missing operators as well). Also, I can't find the ~ operator defined for polygon in the documentation, am I missing something? ~ is deprecated, contains is preferentially spelled @ now. Ok, I'll keep that in mind. -- Jim Nasby[EMAIL PROTECTED] EnterpriseDB http://enterprisedb.com 512.569.9461 (cell) ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [HACKERS] Recursive query syntax ambiguity
Martijn van Oosterhout kleptog@svana.org writes: Er, CYCLE isn't a binary operator, and users can't make binary operators that are words, so I'm not sure of the problem here. Well, the problem typically is not being able to tell whether an operator is supposed to be infix or postfix; hence keywords that can terminate arbitrary expressions usually have to be reserved words. However, now that I look at the syntax I think Greg may be misreading it. I see

<search or cycle clause> ::=
      <search clause>
    | <cycle clause>
    | <search clause> <cycle clause>

<search clause> ::=
    SEARCH <recursive search order> SET <sequence column>

<recursive search order> ::=
      DEPTH FIRST BY <sort specification list>
    | BREADTH FIRST BY <sort specification list>

<sequence column> ::= <column name>

<cycle clause> ::=
    CYCLE <cycle column list>
      SET <cycle mark column> TO <cycle mark value>
      DEFAULT <non-cycle mark value>
      USING <path column>

<cycle column list> ::= <cycle column> [ { <comma> <cycle column> }... ]

<cycle column> ::= <column name>
<cycle mark column> ::= <column name>
<path column> ::= <column name>
<cycle mark value> ::= <value expression>
<non-cycle mark value> ::= <value expression>

and so CYCLE would come *after* SET <sequence column>, not before it. It looks to me like we'd have to promote SET to fully reserved status, but that probably isn't going to surprise anyone. DEFAULT and USING already are fully reserved. I don't see anything else here that looks like it should need to be reserved. regards, tom lane ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] pg_restore exclude schema from being droped option
Kostis Mentzelos [EMAIL PROTECTED] writes: Now, to backup the database I choose to create 2 scripts, BackupData.sh to backup all small tables and BackupHist.sh to backup history tables. When I call pg_restore -c to restore data tables, pg_restore report a failure because it is trying to drop a schema that it is not empty. Why (or how) is the schema part of the backup at all? If you used pg_dump's -t switch to select the tables to back up, there should not be a schema entry in the dump. So there's something you're not telling us about how you are using the tools. regards, tom lane ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [HACKERS] Proposal: Snapshot cloning
On Fri, 2007-01-26 at 12:43 -0500, Jan Wieck wrote: There is a flaw in that theory. If you have a single LTR, then each subsequent transactions xmin will be exactly that one, no? You got me. My description was too loose, but you also got the rough picture. We'll save the detail for another day, but we all know its a bridge we will have to cross one day, soon. I wasn't meaning to raise this specific discussion now, just to say that publishing snapshots for known LRTs is one way by which we can solve the LRT/VACUUMing issue. -- Simon Riggs EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate
Re: [HACKERS] BUG #2917: spi_prepare doesn't accept typename aliases
On Jan 26, 2007, at 9:31 AM, Tom Lane wrote: If you wanted to be a bit more ambitious maybe you could change the fact that this code is throwing away typmod, which means that declarations like varchar(32) would fail to work as expected. Perhaps it should be fixed to save the typmods alongside the typioparams and then pass them to InputFunctionCall instead of passing -1. On the other hand, we don't currently enforce typmod for any function input or result arguments, so maybe it's consistent that spi_prepare arguments ignore typmods too. Thoughts? I'd like to see us move towards supporting that; both for function parameters/results as well as inside functions. It'd be nice if both cases got fixed at once, but IMHO fixing only one now would be better than fixing none. -- Jim Nasby[EMAIL PROTECTED] EnterpriseDB http://enterprisedb.com 512.569.9461 (cell) ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] Proposal: Change of pg_trigger.tg_enabled and adding
On Jan 26, 2007, at 5:13 AM, Markus Schiltknecht wrote: In Postgres-R, I mostly use the terms 'local' and 'remote'. Note that those terms only make sense if you limit yourself to thinking the master is pushing data out to the slave... I think it'd make the most sense if the name reflected whether the trigger should be fired by a replication process or not; that way it doesn't really matter if it's a master or a slave... if the data in the table is being modified by a replication process then you don't fire the trigger/rule, according to the setting. But maybe there is some need to discern between origin and target... Also, if enums will be in 8.3, perhaps they can be used instead of char? -- Jim Nasby[EMAIL PROTECTED] EnterpriseDB http://enterprisedb.com 512.569.9461 (cell) ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] Proposal: Change of pg_trigger.tg_enabled and adding pg_rewrite.ev_enabled
On Jan 25, 2007, at 5:33 PM, Jan Wieck wrote: A new per session GUC variable, restricted to superusers, will define if the session is in origin or replica mode. It would be nice if we had a separate role for replication services so that we weren't exposing superuser so much. IIRC Oracle even uses 2 roles; one for administration of replication and one that the replication code actually runs under. -- Jim Nasby[EMAIL PROTECTED] EnterpriseDB http://enterprisedb.com 512.569.9461 (cell) ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] PostgreSQL Data Loss
BluDes wrote: I made a program that uses a PostgreSQL (win32) database to save its data. What version of PostgreSQL is this? My customer claims that he lost lots of data regarding his own clients and that those data had surely been saved on the database. My first guess is that he is the one who deleted the data but wants to blame someone else; obviously I can't prove it. Did he lose all data in one table, or just some rows? Or is there some other pattern? Could it be possible for PostgreSQL to lose its data? Not when properly installed. Maybe with a file corruption? I doubt it. You'd almost certainly get warnings or errors if there's corruption. Could it be possible to restore these data? The first thing to do is to take a filesystem-level physical copy of the data directory to prevent further damage. Copy the data directory to another system for forensics. You might be able to get a picture of what happened by looking at the WAL logs using the xlogviewer tool in pgfoundry. You can also modify the PostgreSQL source code so that it shows also row versions marked as deleted, and recover the deleted data. I can't remember exactly how to do it, maybe others who have done it can fill in. A row stays physically in the file until the table is vacuumed; hopefully it hasn't been. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] Proposal: Change of pg_trigger.tg_enabled and adding
On 1/26/2007 4:39 PM, Jim Nasby wrote: On Jan 26, 2007, at 5:13 AM, Markus Schiltknecht wrote: In Postgres-R, I mostly use the terms 'local' and 'remote'. Note that those terms only make sense if you limit yourself to thinking the master is pushing data out to the slave... I think it'd make the most sense if the name reflected whether the trigger should be fired by a replication process or not; that way it doesn't really matter if it's a master or a slave... if the data in the table is being modified by a replication process then you don't fire the trigger/rule, according to the setting. But maybe there is some need to discern between origin and target... That's why I prefer origin and replica. I want to use the same terms in the sessions mode GUC, and there local could be misinterpreted as doesn't replicate at all. Also, if enums will be in 8.3, perhaps they can be used instead of char? I don't like this one. It makes it impossible to provide patches, enabling this replication system on older Postgres releases. And you know that your customers will want them. Jan -- #==# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #== [EMAIL PROTECTED] # ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] Proposal: Snapshot cloning
Simon Riggs [EMAIL PROTECTED] writes: You got me. My description was too loose, but you also got the rough picture. We'll save the detail for another day, but we all know its a bridge we will have to cross one day, soon. I wasn't meaning to raise this specific discussion now, just to say that publishing snapshots for known LRTs is one way by which we can solve the LRT/VACUUMing issue. I don't actually see that it buys you a darn thing ... you still won't be able to delete dead updated tuples because of the possibility of the LRT deciding to chase ctid chains up from the tuples it can see. You also seem to be assuming that a transaction can have only one snapshot, which is not something we can enforce in enough cases to make it a very useful restriction. regards, tom lane ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Proposal: Change of pg_trigger.tg_enabled and adding
On 1/26/2007 4:40 PM, Jim Nasby wrote: On Jan 25, 2007, at 5:33 PM, Jan Wieck wrote: A new per session GUC variable, restricted to superusers, will define if the session is in origin or replica mode. It would be nice if we had a separate role for replication services so that we weren't exposing superuser so much. IIRC Oracle even uses 2 roles; one for administration of replication and one that the replication code actually runs under. So you think about another flag in pg_shadow? Would work for me. Jan -- #==# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #== [EMAIL PROTECTED] # ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Proposal: Change of pg_trigger.tg_enabled and adding
On 1/26/2007 4:47 PM, Jan Wieck wrote: On 1/26/2007 4:39 PM, Jim Nasby wrote: On Jan 26, 2007, at 5:13 AM, Markus Schiltknecht wrote: In Postgres-R, I mostly use the terms 'local' and 'remote'. Note that those terms only make sense if you limit yourself to thinking the master is pushing data out to the slave... I think it'd make the most sense if the name reflected whether the trigger should be fired by a replication process or not; that way it doesn't really matter if it's a master or a slave... if the data in the table is being modified by a replication process then you don't fire the trigger/rule, according to the setting. But maybe there is some need to discern between origin and target... That's why I prefer 'origin' and 'replica'. I want to use the same terms in the session mode GUC, and there 'local' could be misinterpreted as 'doesn't replicate at all'. I will need that local mode anyway for some conflict resolutions. Think of a duplicate key (yeah, yeah, what comes now sounds bad ...) conflict, where you need to delete one of the entries without causing that delete to replicate. Before people panic, the final system is supposed to have something smarter than deleting a dupkey in its repertoire. But I'd rather go with this cheap shot first and add a group communication based advisory locking system later, you know? Jan -- #==# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #== [EMAIL PROTECTED] # ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
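A minimal sketch of the conflict fix-up Jan describes, assuming a session mode GUC along the lines proposed in this thread (the GUC name, its values, and the table are illustrative, not committed syntax):

    -- Resolve a duplicate-key conflict on this node only; in 'local'
    -- mode the fix-up itself is not captured for replication.
    SET session_replication_role = 'local';
    DELETE FROM customers
     WHERE cust_key = 42
       AND node_id <> 1;   -- drop the losing duplicate row
    SET session_replication_role = 'origin';  -- back to normal replicating mode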
Re: [HACKERS] PostgreSQL Data Loss
If data were deleted, the rows are still physically stored in the database files until VACUUM cleans them up. You can look through the files with a hex viewer to see whether any known text data is still there. I think there is also a tool that dumps the tuple list from pages. You could also see deleted data by changing the current transaction ID, but I am not sure that is simply possible. Before experimenting, do not forget to back up the database files. Zdenek BluDes wrote: Hi everyone, I have a problem with one of my customers. I made a program that uses a PostgreSQL (win32) database to save its data. My customer claims that he lost lots of data regarding his own clients and that those data had surely been saved on the database. My first guess is that he is the one who deleted the data but wants to blame someone else; obviously I can't prove it. Could it be possible for PostgreSQL to lose its data? Maybe with a file corruption? Could it be possible to restore these data? My program does not modify or delete data since it's more like a log that only adds information. It is obviously possible to delete these logs but it requires answering yes to 2 different warnings, so the data can't be deleted accidentally. I have other customers with even 10 times the amount of data of the one who claimed the loss but no problems with them. He obviously made no backups (and claims we never told him to do them, so we are responsible even for this) though the program has a dedicated Backup section. Any suggestion? Daniele ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] Proposal: Change of pg_trigger.tg_enabled and adding
Jan Wieck [EMAIL PROTECTED] writes: On 1/26/2007 4:39 PM, Jim Nasby wrote: Also, if enums will be in 8.3, perhaps they can be used instead of char? I don't like this one. It makes it impossible to provide patches enabling this replication system on older Postgres releases. And you know that your customers will want them. Also, at the level of C code enums will not be terribly easy to work with. We use the char-as-poor-man's-enum trick in all the other system catalogs, so I feel no desire to do it differently here. regards, tom lane ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] Proposal: Change of pg_trigger.tg_enabled and adding
Jan Wieck [EMAIL PROTECTED] writes: On 1/26/2007 4:40 PM, Jim Nasby wrote: It would be nice if we had a separate role for replication services so that we weren't exposing superuser so much. So you think about another flag in pg_shadow? Would work for me. How exactly would such a role differ from a regular superuser? It would still need an awful lot of privilege bypassing ability. I'm pretty dubious that you could lock it down enough to make it worth the trouble of supporting an additional concept. regards, tom lane ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] 10 weeks to feature freeze (Pending Work)
I thought that the following todo item just barely missed 8.2: Allow a warm standby system to also allow read-only statements [pitr] This is useful for checking PITR recovery. I assume it's not on this list either because it is already complete and slated for 8.3, because it is going to take too long to make it into 8.3, because it has been rejected as a good idea entirely, or because it's just not a big enough priority for anyone to push it into 8.3. It is the one feature that would make the most difference to me, as it would allow me to very easily set up a server for reporting purposes that could always be within minutes of the live data. I know there are other solutions for this, but if this feature is just around the corner it would be my first choice. Does anyone know the status of this feature? Thanks, Rick Gigger Joshua D. Drake wrote: Or so... :) Thought I would do a poll of what is happening in the world for 8.3. I have: Alvaro Herrera: Autovacuum improvements (maintenance window etc..) Gavin Sherry: Bitmap Indexes (on disk), possible basic Window functions Jonah Harris: WITH/Recursive Queries? Andrei Kovalevski: Some Win32 work with Magnus Magnus Hagander: VC++ support (thank goodness) Heikki Linnakangas: Working on Vacuum for Bitmap Indexes? Oleg Bartunov: Tsearch2 in core Neil Conway: Patch Review (including enums), pg_fcache Vertical projects: Pavel Stehule: PLpsm Alexey Klyukin: PLphp Andrei Kovalevski: ODBCng I am sure there are more, the ones with question marks are unknowns but heard of in the ether somewhere. Any additions or confirmations? Sincerely, Joshua D. Drake ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] 10 weeks to feature freeze (Pending Work)
Rick Gigger [EMAIL PROTECTED] writes: I thought that the following todo item just barely missed 8.2: Allow a warm standby system to also allow read-only statements [pitr] No, it's a someday-wishlist item; the work involved is not small. regards, tom lane ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] Proposal: Change of pg_trigger.tg_enabled and adding
On 1/26/2007 5:09 PM, Tom Lane wrote: Jan Wieck [EMAIL PROTECTED] writes: On 1/26/2007 4:40 PM, Jim Nasby wrote: It would be nice if we had a separate role for replication services so that we weren't exposing superuser so much. So you're thinking of another flag in pg_shadow? Would work for me. How exactly would such a role differ from a regular superuser? It would still need an awful lot of privilege bypassing ability. I'm pretty dubious that you could lock it down enough to make it worth the trouble of supporting an additional concept. As already said in the other mail, conflict resolution means that at some point you will be in a situation where you need a third role: that of the replication admin, who can do things that don't replicate. Polluting the system catalogs with flags for one specific external system isn't my thing. The different trigger modes as well as the snapshot cloning and the commit timestamp are all features not exclusively useful for the one replication system I have in mind. They would have made my life developing Slony-I a lot easier to begin with. I would never have needed the stupid xxid or the poking around in the system catalog. Jan -- #==# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #== [EMAIL PROTECTED] # ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] PostgreSQL Data Loss
On Jan 26, 2007, at 2:22 AM, BluDes wrote: I have a problem with one of my customers. I made a program that uses a PostgreSQL (win32) database to save its data. My customer claims that he lost lots of data regarding his own clients and that those data had surely been saved on the database. My first guess is that he is the one who deleted the data but wants to blame someone else; obviously I can't prove it. Could it be possible for PostgreSQL to lose its data? Maybe with a file corruption? Could it be possible to restore these data? My program does not modify or delete data since it's more like a log that only adds information. It is obviously possible to delete these logs but it requires answering yes to 2 different warnings, so the data can't be deleted accidentally. I have other customers with even 10 times the amount of data of the one who claimed the loss but no problems with them. He obviously made no backups (and claims we never told him to do them, so we are responsible even for this) though the program has a dedicated Backup section.

I have seen this data loss pattern many, many times, and on Oracle too. The most frequent culprits in my experience:

1.) The customer screwed up big time and does not want to admit that they made a mistake, hoping you can somehow pull their butt out of the fire for free.
2.) Someone else sabotaged or messed up the customer's database, and the customer is not aware of it.
3.) The customer deleted their own data and is oblivious to the fact that they are responsible.
4.) There is some rare edge case in your application that generates SQL that deletes all the data.

There is always the possibility that there is in fact some data loss due to a failure of the database, but it is a rare kind of corruption that deletes a person's data but leaves everything else intact with no error messages, warnings, or other indications that something is not right. Given the description of the problem, I find an internal failure of the database to be a low-probability reason for the data loss. Having run many database systems that had various levels of pervasive internal change auditing/versioning, often unbeknownst to the casual user, virtually all of the several data loss cases I've seen with a description like the above clearly fell into cases #1-3 above when we went into the audit logs, i.e. someone explicitly did the deleting. I cannot tell you how many times people have tried to pretend that the database lost or messed up their data and then been embarrassed when they discover that I can step through every single action they took to destroy their own data. I've never seen a single case like the one described above that was due to an internal database failure; when there is an internal database failure, it is usually ugly and obvious. Cheers, J. Andrew Rogers [EMAIL PROTECTED] ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] 10 weeks to feature freeze (Pending Work)
Henry B. Hotz: GSSAPI authentication method for C (FE/BE) and Java (FE). Magnus Hagander: SSPI (GSSAPI compatible) authentication method for C (FE) on Windows. (Is that fair, Magnus? Or do you want to volunteer for BE support as well?) GSSAPI isn't much more than a functional replacement for Kerberos 5, but it's supported on lots more platforms. In particular, Java and Windows have native support (as well as Solaris 9). If anyone is interested, I currently have working-but-incomplete patches to support SASL in C. I've decided not to finish and submit them because the glue code to make configuration reasonable, and to allow use of existing Postgres password databases with the password-based mechanisms, is still significant. On Jan 22, 2007, at 2:16 PM, Joshua D. Drake wrote: Or so... :) Thought I would do a poll of what is happening in the world for 8.3. I have: Alvaro Herrera: Autovacuum improvements (maintenance window etc..) Gavin Sherry: Bitmap Indexes (on disk), possible basic Window functions Jonah Harris: WITH/Recursive Queries? Andrei Kovalevski: Some Win32 work with Magnus Magnus Hagander: VC++ support (thank goodness) Heikki Linnakangas: Working on Vacuum for Bitmap Indexes? Oleg Bartunov: Tsearch2 in core Neil Conway: Patch Review (including enums), pg_fcache Vertical projects: Pavel Stehule: PLpsm Alexey Klyukin: PLphp Andrei Kovalevski: ODBCng I am sure there are more, the ones with question marks are unknowns but heard of in the ether somewhere. Any additions or confirmations? Sincerely, Joshua D. Drake -- === The PostgreSQL Company: Command Prompt, Inc. === Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240 Providing the most comprehensive PostgreSQL solutions since 1997 http://www.commandprompt.com/ Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate PostgreSQL Replication: http://www.commandprompt.com/products/ ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] BUG #2917: spi_prepare doesn't accept typename aliases
Jim Nasby wrote: On Jan 26, 2007, at 9:31 AM, Tom Lane wrote: If you wanted to be a bit more ambitious maybe you could change the fact that this code is throwing away typmod, which means that declarations like varchar(32) would fail to work as expected. Perhaps it should be fixed to save the typmods alongside the typioparams and then pass them to InputFunctionCall instead of passing -1. On the other hand, we don't currently enforce typmod for any function input or result arguments, so maybe it's consistent that spi_prepare arguments ignore typmods too. Thoughts? I'd like to see us move towards supporting that; both for function parameters/results as well as inside functions. It'd be nice if both cases got fixed at once, but IMHO fixing only one now would be better than fixing none. I'm not going to do either in fixing this bug - I think they should be fixed but are a separate issue. These probably belong on the TODO list. cheers andrew ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
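For reference, this is roughly the usage the fix targets, shown as a PL/Perl function (a sketch; the table and function names are made up). The point is that spi_prepare should accept typename aliases such as varchar(32), even though the (32) typmod itself is currently discarded:

    CREATE FUNCTION log_msg(varchar) RETURNS void AS $$
        # 'varchar(32)' is a typename alias; before the fix spi_prepare
        # rejected such aliases, and the typmod is still ignored afterwards.
        my $plan = spi_prepare('INSERT INTO msgs(t) VALUES ($1)', 'varchar(32)');
        spi_exec_prepared($plan, $_[0]);
        spi_freeplan($plan);
        return;
    $$ LANGUAGE plperl;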
[HACKERS] How does EXEC_BACKEND process signals?
In testing the new autovac facility, I noticed this log in the EXEC_BACKEND (on Linux) scenario (I pressed Ctrl-C only once):

DEBUG: postmaster received signal 2
LOG: received fast shutdown request
LOG: aborting any active transactions
FATAL: terminating connection due to administrator command
FATAL: terminating connection due to administrator command
FATAL: terminating connection due to administrator command
FATAL: terminating connection due to administrator command
FATAL: terminating connection due to administrator command
FATAL: terminating connection due to administrator command
FATAL: terminating connection due to administrator command
DEBUG: drop cascades to table fktable
DEBUG: drop auto-cascades to type fktable
DEBUG: drop cascades to table pktable
DEBUG: drop auto-cascades to type pktable
DEBUG: drop auto-cascades to constraint pktable_pkey on table pktable
DEBUG: drop auto-cascades to index pktable_pkey
DEBUG: server process (PID 21893) exited with exit code 1
DEBUG: server process (PID 21895) exited with exit code 1
DEBUG: server process (PID 21899) exited with exit code 1
DEBUG: server process (PID 21900) exited with exit code 1
DEBUG: server process (PID 21902) exited with exit code 1
DEBUG: server process (PID 21904) exited with exit code 1
DEBUG: server process (PID 21906) exited with exit code 1
LOG: shutting down
LOG: autovacuum launcher shutting down
DEBUG: forked new backend, pid=21907 socket=6
LOG: database system is shut down
LOG: background writer process (PID 21220) exited with exit code 0
LOG: terminating any other active server processes
DEBUG: sending SIGQUIT to process 21907
DEBUG: server process (PID 21907) exited with exit code 1
LOG: all server processes terminated; reinitializing
LOG: database system was shut down at 2007-01-26 20:21:10 CLST
LOG: checkpoint record is at 0/4293338
LOG: redo record is at 0/4293338; undo record is at 0/0; shutdown TRUE
LOG: next transaction ID: 0/17395; next OID: 60723
LOG: next MultiXactId: 1; next MultiXactOffset: 0
LOG: database system is ready
DEBUG: transaction ID wrap limit is 2147484176, limited by database postgres
LOG: shutting down
DEBUG: recycled transaction log file 00010002
DEBUG: recycled transaction log file 00010003
LOG: database system is shut down

The strange thing is that we're seeing a "forked new backend" line _after_ the shutdown signal was received. I don't think this is related to my local changes, because I've been careful with that, but one never knows. I wonder if this could cause more than just a curiosity. The backend was evidently shut down promptly. I'll post the autovac patch right away. -- Alvaro Herrera   http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc. ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] PostgreSQL Data Loss
BluDes wrote: Hi everyone, I have a problem with one of my customers. I made a program that uses a PostgreSQL (win32) database to save its data. My customer claims that he lost lots of data regarding his own clients and that those data had surely been saved on the database. My first guess is that he is the one who deleted the data but wants to blame someone else; obviously I can't prove it. Could it be possible for PostgreSQL to lose its data? Maybe with a file corruption? Could it be possible to restore these data? My program does not modify or delete data since it's more like a log that only adds information. It is obviously possible to delete these logs but it requires answering yes to 2 different warnings, so the data can't be deleted accidentally. I have other customers with even 10 times the amount of data of the one who claimed the loss but no problems with them. He obviously made no backups (and claims we never told him to do them, so we are responsible even for this) though the program has a dedicated Backup section. Any suggestion? This isn't any sort of report that can be responded to. We need to know what has happened to the machine, what is in the server logs, and what the symptoms of the data loss are. The most likely explanations are pilot error and hardware error. cheers andrew ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] 10 weeks to feature freeze (Pending Work)
Rick Gigger [EMAIL PROTECTED] writes: I thought that the following todo item just barely missed 8.2: Allow a warm standby system to also allow read-only statements [pitr] This is useful for checking PITR recovery. No, nobody worked on it prior to 8.2. Afaik there's still nobody working on it. It's not trivial. Consider, for example, that your read-only query would still need to come up with a snapshot, and there's currently nowhere to find out what transactions were in progress at that point in the log replay. There's also the problem that WAL replay currently doesn't do any locking, so there's no way for read-only queries to protect themselves against WAL replay trashing the buffer pages they're looking at. It does seem to be doable and I agree it would be a great feature, but as far as I know nobody's working on it for 8.3. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] PostgreSQL Data Loss
BluDes [EMAIL PROTECTED] writes: My customer claims that he lost lots of data regarding his own clients and that those data had surely been saved on the database. Has this Postgres database been running for a long time? There is a regular job called VACUUM that has to be run on every table periodically to recover free space. If this isn't run for a very long time (how long depends on how busy the database is, but even on extremely large databases it's usually a matter of months; on more normal databases it would be years) then very old records seem to suddenly disappear (this is transaction ID wraparound). There is a way to recover data that this has happened to, though, as long as you don't run vacuum after the data has disappeared. To repeat: if you think this may have happened, DO NOT run vacuum now. Do you think this may have happened? How long ago was this database created? Does your system periodically run VACUUM? Is the missing data in every table or just a particular table? Incidentally, recent versions of Postgres don't allow this to occur and stop running with a message insisting you run vacuum before continuing. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
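A quick, read-only way to check for the wraparound scenario described above (this does not run VACUUM, so it is safe on a system you are trying to preserve; the per-table relfrozenxid column requires 8.2 or later):

    -- Age of the oldest transaction ID per database; values approaching
    -- 2^31 (about two billion) mean wraparound is near or has occurred.
    SELECT datname, age(datfrozenxid) FROM pg_database ORDER BY 2 DESC;

    -- The same check per table, within the current database (8.2+):
    SELECT relname, age(relfrozenxid)
      FROM pg_class
     WHERE relkind = 'r'
     ORDER BY 2 DESC;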
Re: [HACKERS] Recursive query syntax ambiguity
Tom Lane [EMAIL PROTECTED] writes: <search clause> ::= SEARCH <recursive search order> SET <sequence column> and so CYCLE would come *after* SET <sequence column>, not before it. Ah, thanks, I had glossed right over the SET <sequence column> bit. The SET that I had was the SET <cycle column> which remains after the CYCLE keyword. It looks to me like we'd have to promote SET to fully reserved status, but that probably isn't going to surprise anyone. DEFAULT and USING already are fully reserved. I don't see anything else here that looks like it should need to be reserved. Having fixed that, everything works fine with SET and WITH being reserved keywords. You didn't mean to say I should be able to leave WITH unreserved, did you? Of course that was the easy part... Implementing non-recursive common table expressions should be fairly mechanical, though I think I'll have lots of questions about how to get all the variable references fixed up. Non-recursive common table expressions are always non-correlated. They can refer to previous common table expressions, but only to select from them, either in the FROM clause or in subqueries. So as far as I can see they can just go in an InitPlan (or One-Time-Plan? I'm not sure what the distinction is) and be referred to in the same way. Recursive queries are of course a whole lot trickier. I've been slowly wrapping my head around them. So far I have a pretty good idea how to churn out a typical recursive query analogous to a CONNECT BY query. But the spec is a lot more ambitious than that. I haven't quite wrapped my head around the idea of mutually recursive or non-linearly-recursive queries yet. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
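For context, this is the shape of query being discussed: a linearly recursive common table expression in the SQL-spec syntax (standard syntax, not implemented in PostgreSQL at the time of this thread; the table and column names are illustrative only):

    -- Walk an employee hierarchy starting from employee 1.
    WITH RECURSIVE subordinates(emp_id, boss_id, depth) AS (
        SELECT emp_id, boss_id, 0                  -- non-recursive seed term
          FROM emp WHERE emp_id = 1
        UNION ALL
        SELECT e.emp_id, e.boss_id, s.depth + 1    -- recursive term
          FROM emp e
          JOIN subordinates s ON e.boss_id = s.emp_id
    )
    SELECT * FROM subordinates;

The SEARCH and CYCLE clauses Tom quotes would attach to such a WITH list element to control traversal order and to detect cycles in the recursion.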
Re: [HACKERS] Piggybacking vacuum I/O
Tom Lane wrote: Heikki Linnakangas [EMAIL PROTECTED] writes: I'd like to see still more evidence that it's a problem before we start changing that piece of code. It has served us well for years. What I see here is mostly evidence suggesting that we should consider raising NUM_CLOG_BUFFERS, rather than anything more invasive. Added to TODO: * Consider increasing NUM_CLOG_BUFFERS -- Bruce Momjian [EMAIL PROTECTED] EnterpriseDBhttp://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. + ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster