Re: [HACKERS] Synchronous replication, reading WAL for sending
Hi,

On Wed, Dec 24, 2008 at 2:34 PM, Pavan Deolasee pavan.deola...@gmail.com wrote:
> On Tue, Dec 23, 2008 at 9:12 PM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote:
>> As the patch stands, whenever an XLOG segment is switched in XLogInsert, we wait for the segment to be sent to the standby server. That's not good. Particularly in asynchronous mode, you'd expect the standby not to have any significant ill effect on the master. But in the case of a flaky network connection, or a busy or dead standby, it can take a long time for the standby to respond, or for the primary to give up. During that time, all WAL insertions on the primary are blocked. (How long is the default TCP timeout again?)
>>
>> Another point is that in the future, we really shouldn't require setting up archiving and file-based log shipping using external scripts, when all you want is replication. It should be enough to restore a base backup on the standby, point it to the IP address of the primary, and have it catch up.
>
> This is very important, IMHO. It's quite a lot of work to set up archiving and log-file shipping, for no obvious reason. It's really only needed at the moment because we're building this feature from spare parts. I had similar suggestions when I first wrote the high level design doc. From the wiki page:
>
> - WALSender reads from WAL buffers and/or WAL files and sends the buffers to WALReceiver. In phase one, we may assume that WALSender can only read from WAL buffers and WAL files in the pg_xlog directory. Later on, this can be improved so that WALSender can temporarily restore archived files and read from those too.

You mean that walsender alone performs xlog streaming and copying from pg_xlog, serially? I think this would degrade performance. And I'm worried about the situation where the speed at which xlog is generated on the primary is higher than the speed at which it is copied to the standby: we might never be able to start xlog streaming.
Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Synchronous replication, reading WAL for sending
On Wed, Dec 24, 2008 at 1:50 PM, Fujii Masao masao.fu...@gmail.com wrote:
> And I'm worried about the situation where the speed at which xlog is generated on the primary is higher than the speed at which it is copied to the standby: we might never be able to start xlog streaming.

If that's the case, how do you expect the standby to keep pace with the primary after the initial sync-up? Frankly, I doubt that on a relatively high-load setup the standby will be able to keep pace with the primary, for two reasons:

- Lack of read-ahead of data blocks (Suzuki-san's work may help with this)
- Single-threaded recovery

But these are general problems which may impact any log-based replication.

Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Synchronous replication, reading WAL for sending
Hi,

On Wed, Dec 24, 2008 at 5:48 PM, Pavan Deolasee pavan.deola...@gmail.com wrote:
> If that's the case, how do you expect the standby to keep pace with the primary after the initial sync-up?

Good question. If streaming and copying are performed in parallel, such a situation doesn't happen, because the speed at which xlog is generated also depends on the streaming. This is a price to pay. I think that serial operations would need a pacemaker, and I don't know a better pacemaker than concurrent streaming.

> Frankly, I doubt that on a relatively high-load setup the standby will be able to keep pace with the primary, for two reasons: lack of read-ahead of data blocks (Suzuki-san's work may help with this), and single-threaded recovery. But these are general problems which may impact any log-based replication.

Right. Keeping up with a completely high-load setup is probably impossible; there is certainly a price to pay. But in order to reduce that price as much as possible, I think we should not concentrate two or more operations in a single process (walsender), just as with single-threaded recovery.

Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Re: [HACKERS] Sync Rep: First Thoughts on Code
On Wed, 2008-12-24 at 11:39 +0900, Fujii Masao wrote:
>> We might ask why pg_start_backup() needs to perform a checkpoint, though, since you have remarked that is a problem also. The answer is that it doesn't really need to; we just need to be certain that archiving has been running since whenever we choose as the start time. So we could easily just use the last normal checkpoint time, as long as we had some way of tracking the archiving. ISTM we can solve the checkpoint problem more easily, and it would potentially save much more time than tuning rsync for Postgres, which is what the other idea amounted to. So I do see a solution that is both better and more quickly achievable for 8.4.
>
> Sounds good. I agree that pg_start_backup basically doesn't need a checkpoint. But for full_page_writes == off, we probably cannot get rid of it. Even if full_page_writes == on, since we cannot tell whether all indispensable full pages were written after the last checkpoint, pg_start_backup must do a checkpoint with forcePageWrites = on.

Yes, OK. So I think it would only work when full_page_writes = on, and has been on since the last checkpoint. So, two changes:

* We just need a boolean that starts at true at every checkpoint and gets set to false any time someone resets full_page_writes or archive_command. If the flag is set and full_page_writes = on, then we skip the checkpoint entirely and use the value from the last checkpoint.

* My infra patch also had a modified version of pg_start_backup() that allowed you to specify IMMEDIATE checkpoint or not. Reworking that seems a waste of time; I want to listen to everybody else now, change pg_start_backup() so it throws an IMMEDIATE CHECKPOINT, and leave it there.

Can you work on those also?

-- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Re: [HACKERS] Archiving control (a part of synch rep patches)
Hi,

On Wed, Dec 24, 2008 at 4:39 PM, Fujii Masao masao.fu...@gmail.com wrote:
> (3) Revert the archive status file of the specified file from .done to .ready. This would try to archive the corresponding xlog file again. Only the archive status of a skipped xlog file can be reverted. We resume xlog archiving by using (2) and (3) when replication ends.

Umm... reverting is very tricky, and might become a hotbed of bugs. So I will reconsider the method of resuming archiving.

http://archives.postgresql.org/pgsql-hackers/2008-12/msg00723.php

> That conceivably might mean we need to write a .ready message after a WAL file is filled, which might mean we have problems if the replication timeout is longer than the checkpoint timeout, but that seems an unlikely configuration. And if anybody has a problem with that, we just recommend they use the "always" mode.

It's also a problem that reverting competes with xlog deletion by checkpoints. Perhaps, during replication, we should create .ready after an xlog file fills and has been replicated. I will consider this course deeply.

Happy Holidays! -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Re: [HACKERS] Synchronous replication, reading WAL for sending
On Wed, 2008-12-24 at 18:31 +0900, Fujii Masao wrote:
>> Frankly, I doubt that on a relatively high-load setup the standby will be able to keep pace with the primary, for two reasons: lack of read-ahead of data blocks (Suzuki-san's work may help with this), and single-threaded recovery. But these are general problems which may impact any log-based replication.
>
> Right. Keeping up with a completely high-load setup is probably impossible; there is certainly a price to pay. But in order to reduce that price as much as possible, I think we should not concentrate two or more operations in a single process (walsender), just as with single-threaded recovery.

I think we may be pleasantly surprised. In 8.3 there were two main sources of wait:

* restartpoints
* waiting for archive files

Restartpoints will now be handled by bgwriter, giving probably a 20% gain, plus the WAL data is streamed directly into memory by walreceiver. So I think the startup process may achieve a better steady state and perform very quickly. Suzuki-san's numbers show that full_page_writes = on does not benefit significantly from read-ahead, and we already know it is effective in reducing the I/O bottleneck during recovery. If we want to speed up recovery more, I think we'll see the need for an additional process to do WAL CRC checks.

-- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Re: [HACKERS] Synchronous replication, reading WAL for sending
On Wed, Dec 24, 2008 at 3:01 PM, Fujii Masao masao.fu...@gmail.com wrote:
> Good question. If streaming and copying are performed in parallel, such a situation doesn't happen, because the speed at which xlog is generated also depends on the streaming. This is a price to pay. I think that serial operations would need a pacemaker, and I don't know a better pacemaker than concurrent streaming.

These operations need not even be parallel. My apologies if this has been discussed before, but what we are talking about is just a stream of WAL starting at some LSN. The only difference is that the LSN itself may be in the buffers or in the files. So walsender would send as much as it can from the files and then switch to reading from the buffers. Also, I think you are underestimating the power of the network for most practical purposes. Networks are usually not bottlenecks, unless we are talking about slow WAN setups, which I am not sure are common among PG users.

> Right. Keeping up with a completely high-load setup is probably impossible.

If that's the case, I don't think you need to worry too much about the network or the walsender being a bottleneck for the initial sync-up (and note that we are only talking about WAL sync-up, not the base backup).

Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Synchronous replication, reading WAL for sending
On Wed, Dec 24, 2008 at 3:40 PM, Simon Riggs si...@2ndquadrant.com wrote:
> If we want to speed up recovery more, I think we'll see the need for an additional process to do WAL CRC checks.

Yeah, any such helper process, along with other optimizations, would certainly help. But I still can't believe that on a high-load, high-end setup, a single recovery process without any read-ahead for data blocks can keep pace with the WAL generated by hundreds of processes at the primary and shipped over a high-speed link to the standby.

BTW, on a completely different note: given that the entire recovery is based on physical redo, are there any inherent limitations that prevent parallel recovery, where different recovery processes apply redo logs to completely independent sets of data blocks? I also sometimes wonder why we don't have block-level recovery when a single block in the database is corrupted. Can't this be done by just selectively applying WAL records to that particular block? If it's just because nobody had the time/interest to do this, then that's OK, but I wonder if there are any design issues.

Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Archiving control (a part of synch rep patches)
On Wed, 2008-12-24 at 19:04 +0900, Fujii Masao wrote:
>> (3) Revert the archive status file of the specified file from .done to .ready. This would try to archive the corresponding xlog file again. Only the archive status of a skipped xlog file can be reverted. We resume xlog archiving by using (2) and (3) when replication ends.
>
> Umm... reverting is very tricky, and might become a hotbed of bugs. So I will reconsider the method of resuming archiving.

Yeh, don't do that. No worries if we send a few files too many; tears if we miss just one.

* Turn it off for the *next* file to be archived. Keep track of the last file sent.
* Turn it on again: mark *all* WAL files as .ready in sequence, low to high, as long as they are higher than the last file sent.

-- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Re: [HACKERS] generic reloptions improvement
On Monday 22 December 2008 18:24:53 Alvaro Herrera wrote:
> Alvaro Herrera alvhe...@commandprompt.com wrote:
>> Here's a patch for improving the general reloptions mechanism. What this patch does is add a table-based option parser. This allows adding new options very easily, and stops the business of having to pass the minimum and default fillfactor each time you want the reloptions processed.
>
> Here's the next version, which also fixes some particularly embarrassing bugs.

I'm not sure how important this is, but if you are enumerating the access methods (RELOPT_KIND_BTREE, etc.), how will this work with user-defined access methods?
Re: [HACKERS] Synchronous replication, reading WAL for sending
On Wed, 2008-12-24 at 15:51 +0530, Pavan Deolasee wrote:
>> If we want to speed up recovery more, I think we'll see the need for an additional process to do WAL CRC checks.
>
> Yeah, any such helper process, along with other optimizations, would certainly help. But I still can't believe that on a high-load, high-end setup, a single recovery process without any read-ahead for data blocks can keep pace with the WAL generated by hundreds of processes at the primary and shipped over a high-speed link to the standby.

Suzuki-san has provided measurements. I think we need more. With bgwriter performing restartpoints, we'll find that more RAM helps much more than it did previously.

> BTW, on a completely different note: given that the entire recovery is based on physical redo, are there any inherent limitations that prevent parallel recovery, where different recovery processes apply redo logs to completely independent sets of data blocks?

That's possible, but it would significantly complicate the recovery code. Retaining the ability to do standby queries would be almost impossible in that case, since you would need to parallelise the WAL stream without changing the commit order of transactions. The main CPU bottleneck is CRC, by a long way. Moving that effort away from the startup process is the best next action, AFAICS.

> I also sometimes wonder why we don't have block-level recovery when a single block in the database is corrupted. Can't this be done by just selectively applying WAL records to that particular block?

You'll be able to do this with my rmgr patch. Selective application of WAL records is one of the primary use cases, but there are many others.
-- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Re: [HACKERS] generic reloptions improvement
Peter Eisentraut wrote:
> I'm not sure how important this is, but if you are enumerating the access methods (RELOPT_KIND_BTREE, etc.), how will this work with user-defined access methods?

It is important. I'm intending to have a new routine which would reserve a value at runtime. This value would later be passed by the AM to create new options on the table.

-- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Re: [HACKERS] Hot standby and b-tree killed items
On Tue, 2008-12-23 at 23:59 -0500, Robert Treat wrote:
> On Friday 19 December 2008 19:36:42 Simon Riggs wrote:
>> Perhaps we should listen to the people that have said they don't want queries cancelled, even if the alternative is inconsistent answers. That is easily possible, yet is not currently an option. Plus we have the option I referred to upthread, which is to defer query cancel until the query reads a modified data block. I'm OK with implementing either of those, as non-default options. Do we need those options, or are we OK?
>
> Haven't seen any feedback on this, but I think the two options of cancel query for replay, and pause replay for queries, are probably enough for a first go-around (especially if you can get the query cancelling to work only when changes are made to the specific database in question).

Thanks for picking up on this. This question has been the #1 usability issue for Hot Standby since at least May 2008. There are many potential additions, and we need to track this carefully over the next few months to see if we have it just right. I'll take viewpoints at any time on that; this door is never closed, though tempus fugit.

Greg and Heikki have highlighted in this thread some aspects of btree garbage collection that will increase the chance of queries being cancelled in various circumstances. If this is important enough to trigger additional actions, then we need to highlight that now so we have time to take those corrective actions.

I've listened to many different viewpoints on and off list. Everybody takes a slightly different angle on it, and I'm in favour of giving everybody what they want with the right set of options.

-- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Re: [HACKERS] Hot standby and b-tree killed items
On Wed, Dec 24, 2008 at 4:41 PM, Simon Riggs si...@2ndquadrant.com wrote:
> Greg and Heikki have highlighted in this thread some aspects of btree garbage collection that will increase the chance of queries being cancelled in various circumstances.

Even HOT-prune may lead to frequent query cancellations, and unlike VACUUM there is no way the user can control the frequency of prune operations.

Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Archiving control (a part of synch rep patches)
Hi,

On Wed, Dec 24, 2008 at 7:26 PM, Simon Riggs si...@2ndquadrant.com wrote:
> Yeh, don't do that. No worries if we send a few files too many; tears if we miss just one.
>
> * Turn it off for the *next* file to be archived. Keep track of the last file sent.

OK.

> * Turn it on again: mark *all* WAL files as .ready in sequence, low to high, as long as they are higher than the last file sent.

You mean as follows?

- We don't create .ready during replication.
- When replication ends, we create .ready for the xlog files which are higher than the last one sent, and forbid subsequent checkpoints from creating .ready for lower xlog files.

Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Re: [HACKERS] Sync Rep: First Thoughts on Code
Hi,

On Wed, Dec 24, 2008 at 6:57 PM, Simon Riggs si...@2ndquadrant.com wrote:
> Yes, OK. So I think it would only work when full_page_writes = on, and has been on since the last checkpoint. So, two changes:
>
> * We just need a boolean that starts at true at every checkpoint and gets set to false any time someone resets full_page_writes or archive_command. If the flag is set and full_page_writes = on, then we skip the checkpoint entirely and use the value from the last checkpoint.

Sounds good. Does pg_start_backup on the standby (which you are probably planning?) also need this logic? If so, resetting full_page_writes or archive_command should generate its own xlog record.

I have another thought: should we forbid resetting archive_command during an online backup? Currently we can do it. If we don't need to allow that, we also don't need to track the reset of archiving for a fast pg_start_backup.

> * My infra patch also had a modified version of pg_start_backup() that allowed you to specify IMMEDIATE checkpoint or not. Reworking that seems a waste of time; I want to listen to everybody else now, change pg_start_backup() so it throws an IMMEDIATE CHECKPOINT, and leave it there. Can you work on those also?

Umm... I'm busy. Of course, I will try it if no one else raises his or her hand. But I'd like to put coding the core of synch rep ahead of this.

Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
[HACKERS] [idea] a copied relkind in pg_attribute
This is an idea to improve the implementation of SE-PostgreSQL: I need a copied relkind in pg_attribute, to help its decision making.

When we access pg_attribute via ALTER TABLE or a DML statement directly, SE-PostgreSQL checks privileges on the fetched tuples. If the relation pointed to by its attrelid has RELKIND_RELATION, the fetched tuple is a column, so the db_column object class should be applied. Otherwise, the db_tuple object class should be applied, because it is not a column but, for example, an attribute of a composite type.

The current implementation needs to look up the RELOID system cache to identify the relkind of the relation, because pg_attribute does not have any information about relkind. I don't think this is an ideal implementation, even if its frequency is small enough. So, I have a plan to put a new member named attkind into pg_attribute, to hold a copied value of its relkind. This makes it possible to identify the attribute without looking up the system cache, so it is better than the current approach. pg_class.relkind is always unchanged, so this does not have any problematic points.

Please comment if you have alternative ideas or opinions. If reviewers intend to comment on this point, it can be fixed in the above way.

Thanks, -- KaiGai Kohei kai...@kaigai.gr.jp
Re: [HACKERS] Hot standby and b-tree killed items
On Wed, 2008-12-24 at 16:48 +0530, Pavan Deolasee wrote:
> Even HOT-prune may lead to frequent query cancellations, and unlike VACUUM there is no way the user can control the frequency of prune operations.

The patch does go to some trouble to handle that case, as I'm sure you've seen. Are you saying that part of the patch is ineffective and should be removed, or?

Should/could there be a way to control the frequency of prune operations? We could maintain cleanupxmin at a constant minimum distance from xmax, for example.

Are we saying we should take further measures, as I asked upthread? If there is a consensus that I take some action, then I will.

-- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Re: [HACKERS] Hot standby and b-tree killed items
On Wed, Dec 24, 2008 at 5:26 PM, Simon Riggs si...@2ndquadrant.com wrote:
> The patch does go to some trouble to handle that case, as I'm sure you've seen. Are you saying that part of the patch is ineffective and should be removed, or?

Umm... are you talking about the wait mechanism? That's the only thing I remember. Otherwise, a prune record is pretty much the same as any vacuum cleanup record.

> Should/could there be a way to control the frequency of prune operations? We could maintain cleanupxmin at a constant minimum distance from xmax, for example.

Well, there can be. But tuning any such thing might be difficult and would have implications on the primary. I am not saying we can do that, but we will need additional tests to see its impact.

> Are we saying we should take further measures, as I asked upthread? If there is a consensus that I take some action, then I will.

Again, I haven't seen how frequently queries may get cancelled, or, if the delay is set to a large value, how far behind the standby may get during replication, so I can't really comment. Have you done any tests on reasonable hardware to check whether moderately long read queries can be run on the standby without it lagging behind a lot?

I would prefer a solution smarter than cancelling all queries as soon as a cleanup record comes and the timeout occurs. For example, if the queries are being run on a completely different set of tables from the one receiving the updates/deletes, there is no reason why those queries should be cancelled. I think it would be very common to have large history tables which receive long read-only queries but no updates/deletes, alongside other frequently updated tables which receive very few queries on the standby.

Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Hot standby and b-tree killed items
On Wed, 2008-12-24 at 17:56 +0530, Pavan Deolasee wrote:
> Umm... are you talking about the wait mechanism? That's the only thing I remember. Otherwise, a prune record is pretty much the same as any vacuum cleanup record.

With respect, I was hoping you might look in the patch and see if you agree with the way it is handled; no need to remember. The whole latestRemovedXid concept is designed to help with this.

> Well, there can be. But tuning any such thing might be difficult and would have implications on the primary. I am not saying we can do that, but we will need additional tests to see its impact.

> Again, I haven't seen how frequently queries may get cancelled, or, if the delay is set to a large value, how far behind the standby may get during replication, so I can't really comment. Have you done any tests on reasonable hardware to check whether moderately long read queries can be run on the standby without it lagging behind a lot?

Queries get cancelled if data they need to see is removed and max_standby_delay expires, so the lag will be max_standby_delay, by definition. Not sure what further tests would show. Queries that run for longer than max_standby_delay plus the mean time between cleanup records will currently end up being cancelled.

> I would prefer a solution smarter than cancelling all queries as soon as a cleanup record comes and the timeout occurs.

Currently, the consensus view is that queries should be cancelled, though there are other options still on the table; they are discussed in the Design Notes on the wiki.

Simply ignoring WAL removal has been discussed and rejected (so far):
http://archives.postgresql.org/pgsql-hackers/2008-05/msg00753.php

Explicitly defining the tables a transaction wishes to see has also been discussed and rejected (so far):
http://archives.postgresql.org/pgsql-hackers/2008-08/msg00268.php

> For example, if the queries are being run on a completely different set of tables from the one receiving the updates/deletes, there is no reason why those queries should be cancelled. I think it would be very common to have large history tables which receive long read-only queries but no updates/deletes, alongside other frequently updated tables which receive very few queries on the standby.

There is currently no way to tell which tables a query will touch during the course of its execution, nor is there likely to be, because of user-defined volatile functions. I attempted to find ways to explicitly limit the set of tables over which a query might venture, but that came to nothing also.

We've also discussed storing a lastCleanedLSN for each buffer, so queries can cancel themselves if they need to read a buffer that has had data removed from it that they would have needed to see. I'll write that up also.

-- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Re: [HACKERS] Sync Rep: First Thoughts on Code
Hi, On Wed, Dec 24, 2008 at 7:58 PM, Fujii Masao masao.fu...@gmail.com wrote: Hi, On Wed, Dec 24, 2008 at 6:57 PM, Simon Riggs si...@2ndquadrant.com wrote: Yes, OK. So I think it would only work when full_page_writes = on, and has been on since last checkpoint. So two changes: * We just need a boolean that starts at true every checkpoint and gets set to false anytime someone resets full_page_writes or archive_command. If the flag is set full_page_writes = on then we skip the checkpoint entirely and use the value from the last checkpoint. Sounds good. I attached the self-contained patch to skip checkpoint at pg_start_backup. pg_start_backup on the standby (probably you are planning?) also needs this logic? If so, resetting full_page_writes or archive_command should generate its xlog. Now, the patch doesn't care about this. I have another thought: should we forbid the reset of archive_command during online backup? Currently we can do. If we don't need to do so, we also don't need to track the reset of archiving for fast pg_start_backup. Now, doesn't care too. Happy Holidays! -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center ? GNUmakefile ? config.log ? config.status ? contrib/pgbench/pgbench ? src/Makefile.global ? src/backend/postgres ? src/backend/catalog/postgres.bki ? src/backend/catalog/postgres.description ? src/backend/catalog/postgres.shdescription ? src/backend/postmaster/walreceiver.c ? src/backend/postmaster/walsender.c ? src/backend/snowball/snowball_create.sql ? src/backend/utils/probes.h ? src/backend/utils/mb/conversion_procs/conversion_create.sql ? src/bin/initdb/initdb ? src/bin/pg_config/pg_config ? src/bin/pg_controldata/pg_controldata ? src/bin/pg_ctl/pg_ctl ? src/bin/pg_dump/pg_dump ? src/bin/pg_dump/pg_dumpall ? src/bin/pg_dump/pg_restore ? src/bin/pg_resetxlog/pg_resetxlog ? src/bin/psql/psql ? src/bin/scripts/clusterdb ? src/bin/scripts/createdb ? src/bin/scripts/createlang ? src/bin/scripts/createuser ? 
Index: src/backend/access/transam/xlog.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/access/transam/xlog.c,v
retrieving revision 1.324
diff -c -r1.324 xlog.c
*** src/backend/access/transam/xlog.c	17 Dec 2008 01:39:03 -	1.324
--- src/backend/access/transam/xlog.c	24 Dec 2008 14:57:27 -
***************
*** 295,300 ****
--- 295,302 ----
  	/* Protected by info_lck: */
  	XLogwrtRqst LogwrtRqst;
  	XLogwrtResult LogwrtResult;
+ 	uint32		bkpCount;		/* ID of bkp using the same ckpt */
+ 	bool		bkpForceCkpt;	/* reset full_page_writes since last ckpt? */
  	uint32		ckptXidEpoch;	/* nextXID epoch of latest checkpoint */
  	TransactionId ckptXid;
  	XLogRecPtr	asyncCommitLSN; /* LSN of newest async commit */
***************
*** 6025,6036 ****
  	UpdateControlFile();
  	LWLockRelease(ControlFileLock);
! 	/* Update shared-memory copy of checkpoint XID/epoch */
  	{
  		/* use volatile pointer to prevent code rearrangement */
  		volatile XLogCtlData *xlogctl = XLogCtl;
  		SpinLockAcquire(&xlogctl->info_lck);
  		xlogctl->ckptXidEpoch = checkPoint.nextXidEpoch;
  		xlogctl->ckptXid = checkPoint.nextXid;
  		SpinLockRelease(&xlogctl->info_lck);
--- 6027,6043 ----
  	UpdateControlFile();
  	LWLockRelease(ControlFileLock);
! 	/*
! 	 * Update shared-memory copy of checkpoint XID/epoch and reset the
! 	 * variables of backup ID/flag.
! 	 */
  	{
  		/* use volatile pointer to prevent code rearrangement */
  		volatile XLogCtlData *xlogctl = XLogCtl;
  		SpinLockAcquire(&xlogctl->info_lck);
+ 		xlogctl->bkpCount = 0;
+ 		xlogctl->bkpForceCkpt = true;
  		xlogctl->ckptXidEpoch = checkPoint.nextXidEpoch;
  		xlogctl->ckptXid = checkPoint.nextXid;
  		SpinLockRelease(&xlogctl->info_lck);
***************
*** 6502,6507 ****
--- 6509,6535 ----
  }
  }
+ bool
+ assign_full_page_writes(bool newval, bool doit, GucSource source)
+ {
+ 	/*
+ 	 * If full_page_writes is reset, since all indispensable full pages
+ 	 * might not be written since last checkpoint, we force a checkpoint
+ 	 * at pg_start_backup.
+ 	 */
+ 	if (doit && fullPageWrites != newval)
+ 	{
+ 		/* use volatile
Re: [HACKERS] Synchronous replication, reading WAL for sending
Fujii Masao wrote: - WALSender reads from WAL buffers and/or WAL files and sends the buffers to WALReceiver. In phase one, we may assume that WALSender can only read from WAL buffers and WAL files in the pg_xlog directory. Later on, this can be improved so that WALSender can temporarily restore archived files and read from those too. You mean that only walsender performs xlog streaming and copying from pg_xlog serially? I think that this would degrade the performance. And I'm worried about the situation where the speed of generating xlog on the primary is higher than the speed of copying it to the standby. We might never be able to start xlog streaming. I've seen a few references to this. Somebody else mentioned how a single TCP/IP stream might not have the bandwidth to match changes to the database. TCP/IP streams do have a window size that adjusts with the load, and unless one gets into aggressive networking such as bittorrent, which arguably reduces performance of the entire network, why shouldn't one TCP/IP stream be enough? And if one TCP/IP stream isn't enough, doesn't this point to much larger problems that won't be solved by streaming it some other way over the network? As in, it doesn't matter what you do - your network pipe isn't big enough? Over the Internet from my house to a co-located box, I can reliably get 1.1+ Mbyte/s using a single TCP/IP connection. The network connection at the co-lo is 10 Mbit/s and my Internet connection at my house is also 10 Mbit/s. One TCP/IP connection seems quite capable of streaming data to the full potential of the network... Also, I assume that most database loads have peaks and lows. Especially for very large updates, perhaps end-of-day processing, I see it as a guarantee that all of the standbys will fall further behind for a period (a few seconds to a minute?), but they will catch up shortly after the peak is over.
Cheers, mark -- Mark Mielke m...@mielke.cc -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Sync Rep: First Thoughts on Code
On Thu, 2008-12-25 at 00:10 +0900, Fujii Masao wrote: Hi, On Wed, Dec 24, 2008 at 7:58 PM, Fujii Masao masao.fu...@gmail.com wrote: Hi, On Wed, Dec 24, 2008 at 6:57 PM, Simon Riggs si...@2ndquadrant.com wrote: Yes, OK. So I think it would only work when full_page_writes = on, and has been on since the last checkpoint. So two changes: * We just need a boolean that starts at true every checkpoint and gets set to false anytime someone resets full_page_writes or archive_command. If the flag is set and full_page_writes = on then we skip the checkpoint entirely and use the value from the last checkpoint. Sounds good. I attached the self-contained patch to skip the checkpoint at pg_start_backup. Good. Can we change to IMMEDIATE when we need the checkpoint? What is bkpCount for? I think we should discuss whatever that is for separately. It isn't used in any if test, AFAICS. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Re: [HACKERS] Sync Rep: Second thoughts
Hi Markus, I haven't had many reliability issues with ensemble, appia or spread so far, although I admit I have never run any of these in production. Performance is certainly an issue, yes. May I suggest another reading; even though it is a bit dated, most of the results still apply: http://jmob.objectweb.org/jgroups/JGroups-middleware-2004.pdf The baseline is that if you use UDP multicast, you need a dedicated switch, and the tuning is a nightmare. I discussed these issues with the developers of Spread and they have no real magic. TCP seems a more reliable alternative (especially for predictable performance) but the TCP timeouts are also tricky to tune depending on the platform. We worked quite a bit with Nuno around Appia in the context of Sequoia, and performance can be outstanding when properly tuned or absolutely awful if some default values are wrong. The chaotic behavior of a GCS under stress quickly compromises the reliability of the replication system, and admission control on UDP multicast has no good solution so far. It's just a heads up on what is awaiting you in production when the system is stressed. There is no good solution so far besides good admission control on top of the GCS (in the application). I am now off for the holidays. Cheers, Emmanuel -- Emmanuel Cecchet Aster Data Systems Web: http://www.asterdata.com
Re: [Fwd: Re: [HACKERS] Transactions and temp tables]
Hi Heikki, The point of using temp tables was performance. Using regular tables in our case would hurt performance too much. Well, if we cannot get a temporary fix in 8.4, we will maintain a separate patch to get that functionality just for temp tables that are created and dropped in the same transaction. Hopefully we will be able to come up with a working solution in 8.5. Thanks for your help, Emmanuel Emmanuel Cecchet wrote: I just saw that this new patch was not considered because the previous version ended up being rejected. Note that this version of the patch aims at supporting ONLY temp tables that are created AND dropped in the same transaction. We need to be able to use temp tables in transactions that are doing 2PC, but the temp table lifespan does not need to cross transaction boundaries. Please let me know if this patch could be integrated into 8.4. IMHO, this is just getting too kludgey. We came up with pretty good ideas on how to handle temp tables properly, by treating them the same as non-temp tables. That should eliminate all the problems the latest patch had, and also the issues with sequences, and allow all access to temp tables, not just a limited subset. I don't think it's worthwhile to apply the kludge as a stopgap measure; let's do it properly in 8.5. As a workaround, you can use a regular table instead of a temporary one. If you create and drop the regular table in the same transaction (that's the same limitation that the latest patch has), you won't end up with a bogus table in your database if the connection is dropped unexpectedly. If your application uses multiple connections simultaneously, you'll need a little bit of code in the application so that you don't try to create a table with the same name in all backends. You could also create a different schema for each connection, and do set search_path='semitempschemaX, public', so that you can use the same table name and still have separate tables for each connection.
(sorry for the late reply) -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Emmanuel Cecchet Aster Data Systems Web: http://www.asterdata.com
Re: [HACKERS] Hot standby and b-tree killed items
On Wed, 2008-12-24 at 09:59 -0500, Robert Treat wrote: I think the uncertainty comes from people's experience with typical replication use cases vs a lack of experience with this current implementation. Quite possibly. Publishing user feedback on this will be very important in making this a usable feature. I'd be very happy if you were to direct the search for optimal usability. One such example is that it is pretty common to use read-only slaves to do horizontal scaling of read queries across a bunch of slaves. This is not the scenario of running reporting queries on a second machine to lower load; you would be running a large number of read-only, relatively short, oltp-ish queries (think pg_bench's select-only test, I suppose), but you also have a fairly regular stream of inserts/updates going on with these same tables; it's just that you have a 95/5 split of read/write (or similar). One thing to consider also is latency of information. Sending queries to master or slave may return different answers if querying very recent data. This is standard practice in things like mysql or using slony or what have you. I suspect it's one of the first things people are going to want to do with hot standby. But it's unclear how well this will work because we don't have any experience with it yet, coupled with the two downsides being mentioned as cancelled queries and replay lag, which happen to be probably the two worst downsides you would have in the above scenario. :-) Hmm, I'm not sure why I didn't think of running this test before, but a read/write pg_bench on the master with the pg_bench select test on the slave isn't that bad of a scenario to match the above; it might be a little too much activity on the master, but has anyone else run such a test? -- Simon Riggs www.2ndQuadrant.com PostgreSQL Training, Services and Support
Re: [HACKERS] [idea] a copied relkind in pg_attribute
On Wed, Dec 24, 2008 at 6:50 AM, KaiGai Kohei kai...@kaigai.gr.jp wrote: The current implementation needs to look up the RELOID system cache to identify the relkind of the relation, because pg_attribute does not have any information about relkind. However, I also think it is not an ideal implementation, even if its frequency is small enough. But you can still do it, right? Are there any other advantages, like something you can't do now? If not, I think you need to prove that there will be a benefit in speed and that it is significant; otherwise this seems like premature optimization, and that, AFAIR, is the root of all evil :) -- Atentamente, Jaime Casanova Soporte y capacitación de PostgreSQL Asesoría y desarrollo de sistemas Guayaquil - Ecuador Cel. +59387171157
Re: [HACKERS] generic reloptions improvement
Alvaro Herrera alvhe...@commandprompt.com writes: Peter Eisentraut wrote: I'm not sure how important this is, but if you are enumerating the access methods (RELOPT_KIND_BTREE, etc.), how will this work with user-defined access methods? It is important. I'm intending to have a new routine which would reserve a value at runtime. This value would later be passed by the AM to create new options on the table. What do you mean by at runtime? Surely the value would have to remain stable across database restarts, since it's going to relate to stuff that is in catalog entries. I'd feel more comfortable with a scheme that used the AM's pg_am OID. regards, tom lane
Re: [HACKERS] Window-functions patch handling of aggregates
Gregory Stark st...@enterprisedb.com writes: Tom Lane t...@sss.pgh.pa.us writes: Unless we want to move the goalposts on what an aggregate is allowed to do internally, we're going to have to change this to re-aggregate repeatedly. Neither prospect is appetizing in the least. Does it currently copy the state datum before calling the final function? Would that help? No. The entire point of what we have now formalized as aggregates with internal-type transition values is that the transition value isn't necessarily a single palloc chunk. For stuff like array_agg, the performance costs of requiring that are enormous. On looking at what array_agg does, it seems the issue there is that the final-function actually deletes the working state when it thinks it's done with it (see construct_md_array). It would probably be possible to fix things so that it doesn't do that when it's called by a WindowAgg instead of a regular Agg. What I'm more concerned about is what third-party code will break. contrib/intagg has done this type of thing since forever, and I'm sure people have copied that... regards, tom lane
Re: [HACKERS] uuids on freebsd
On Dec 17, 2:30 pm, Andrew Gierth and...@tao11.riddles.org.uk wrote: Has anyone ever managed to get uuid generation working on FreeBSD? [...] ([...] The only solution I could come up with was to knock off a quick uuid-freebsd module that uses the base system uuid functions rather than the ossp ones. I could put this on pgfoundry if there isn't likely to be a real fix in the near future.) +1 for putting it on pgFoundry. I managed to hack the old contrib/uuid so it would use the functions from Linux e2fs uuid on 8.0/8.1/8.2. I might have time soon to see if I can get that to work on 8.3. -- DLL
Re: [HACKERS] Sync Rep: First Thoughts on Code
Hi, I fixed some bugs. On Thu, Dec 25, 2008 at 12:31 AM, Simon Riggs si...@2ndquadrant.com wrote: Can we change to IMMEDIATE when we need the checkpoint? Perhaps yes, though the current patch doesn't handle it. I'm not sure we really need the feature; as you say, I'd like to hear from everybody else too. What is bkpCount for? So far, the name of a backup history file consists only of the checkpoint redo location. But with this patch, since several backups can use the same checkpoint, a backup history file could unfortunately be overwritten. So I introduced bkpCount as an ID for the backups which use the same checkpoint. I think we should discuss whatever that is for separately. It isn't used in any if test, AFAICS. Yes, this patch is a testbed. We need to discuss it more. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Index: src/backend/access/transam/xlog.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/access/transam/xlog.c,v
retrieving revision 1.324
diff -c -r1.324 xlog.c
*** src/backend/access/transam/xlog.c	17 Dec 2008 01:39:03 -	1.324
--- src/backend/access/transam/xlog.c	24 Dec 2008 18:13:45 -
***************
*** 295,300 ****
--- 295,302 ----
  	/* Protected by info_lck: */
  	XLogwrtRqst LogwrtRqst;
  	XLogwrtResult LogwrtResult;
+ 	uint32		bkpCount;		/* ID of bkp using the same ckpt */
+ 	bool		bkpForceCkpt;	/* reset full_page_writes since last ckpt? */
  	uint32		ckptXidEpoch;	/* nextXID epoch of latest checkpoint */
  	TransactionId ckptXid;
  	XLogRecPtr	asyncCommitLSN; /* LSN of newest async commit */
***************
*** 318,323 ****
--- 320,332 ----
  static XLogCtlData *XLogCtl = NULL;
  /*
+  * We don't allow more than MAX_BKP_COUNT backups to use the same checkpoint.
+  * If XLogCtl->bkpCount > MAX_BKP_COUNT, we force new checkpoint at pg_standby
+  * even if there are all indispensable full pages since last checkpoint.
+  */
+ #define MAX_BKP_COUNT 256
+
  /*
   * We maintain an image of pg_control in shared memory.
   */
  static ControlFileData *ControlFile = NULL;
***************
*** 6025,6036 ****
  	UpdateControlFile();
  	LWLockRelease(ControlFileLock);
! 	/* Update shared-memory copy of checkpoint XID/epoch */
  	{
  		/* use volatile pointer to prevent code rearrangement */
  		volatile XLogCtlData *xlogctl = XLogCtl;
  		SpinLockAcquire(&xlogctl->info_lck);
  		xlogctl->ckptXidEpoch = checkPoint.nextXidEpoch;
  		xlogctl->ckptXid = checkPoint.nextXid;
  		SpinLockRelease(&xlogctl->info_lck);
--- 6034,6050 ----
  	UpdateControlFile();
  	LWLockRelease(ControlFileLock);
! 	/*
! 	 * Update shared-memory copy of checkpoint XID/epoch and reset the
! 	 * variables of backup ID/flag.
! 	 */
  	{
  		/* use
Re: [HACKERS] Hot standby and b-tree killed items
On Wed, Dec 24, 2008 at 7:18 PM, Simon Riggs si...@2ndquadrant.com wrote: With respect, I was hoping you might look in the patch and see if you agree with the way it is handled. No need to remember. The whole latestRemovedXid concept is designed to help. Well, that's common for all cleanup records including vacuum. But reading your comment, it seemed as if there is something special to handle the HOT prune case which I did not see. Anyway, the trouble with HOT prune is that tuples may be cleaned up very frequently, and that can lead to query cancellation at the standby. That's what I wanted to emphasize. Queries get cancelled if data they need to see is removed and max_standby_delay expires. So lag will be max_standby_delay, by definition. That's per cleanup record, isn't it? We've also discussed storing lastCleanedLSN for each buffer, so queries can cancel themselves if they need to read a buffer that has had data removed from it that they would have needed to see. I'll write that up also. What if we do that at the table level? So if a query touches a table which has had cleanup activity since the query started, it cancels itself automatically. Happy X'mas to all of you! Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] incoherent view of serializable transactions
Emmanuel Cecchet m...@frogthinker.org 12/22/08 11:42 PM If you want to know how to build SERIALIZABLE with a database that provides SI (Snapshot Isolation), read http://portal.acm.org/citation.cfm?doid=1376616.137669 The link didn't seem to work for me, but I think I found the article you meant: Serializable Isolation for Snapshot Databases by Michael J. Cahill, et al. An interesting work. If nothing else, it will help flesh out the documentation of anomalies. If the PostgreSQL community ever does want to better approach true serializable behavior, this should provide a good theoretical basis. Thanks very much for the cite. I still don't get why people would use SERIALIZABLE since there is no efficient implementation of it. Read the last paragraph of section 1 (Introduction) of the article you just cited. -Kevin
[HACKERS] merry christmas folks
WITH RECURSIVE tree(b, l, lv) AS (
  (
    WITH RECURSIVE t(b, l) AS (
      select b/11.6, 0 AS l
      from (select generate_series(0,30)::float8 b
            union all
            select generate_series(30,0,-1)::float8 b) ziew
      UNION ALL
      select (b*1.06), l+1 FROM t WHERE l < 3
    )
    select b, l, 1 AS lv from t order by l asc
  )
  UNION ALL
  select b*1.1, l+(lv*62) AS l, lv+1 FROM tree WHERE lv < 4
)
select array_to_string(array_agg(SUBSTRING(' ', LEAST(GREATEST(b::integer,1),6), 1)),'')
from tree group by l order by l asc;
;]
Re: [HACKERS] DTrace probes patch
Thanks, applied.

---------------------------------------------------------------------------

Robert Lor wrote: Tom Lane wrote: Robert Lor robert@sun.com writes: Tom Lane wrote: I agree. If the probe is meant to track only *some* WAL writes then it needs to be named something less generic than TRACE_POSTGRESQL_WAL_BUFFER_WRITE. How about changing it to TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY, similar to TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY for shared buffers? Works for me... Attached is the patch for the above name change. -Robert

Index: src/backend/access/transam/xlog.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/access/transam/xlog.c,v
retrieving revision 1.324
diff -u -3 -p -r1.324 xlog.c
--- src/backend/access/transam/xlog.c	17 Dec 2008 01:39:03 -	1.324
+++ src/backend/access/transam/xlog.c	22 Dec 2008 16:28:00 -
@@ -1318,14 +1318,14 @@ AdvanceXLInsertBuffer(bool new_segment)
 				 * Have to write buffers while holding insert lock. This is
 				 * not good, so only write as much as we absolutely must.
 				 */
-				TRACE_POSTGRESQL_WAL_BUFFER_WRITE_START();
+				TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_START();
 				WriteRqst.Write = OldPageRqstPtr;
 				WriteRqst.Flush.xlogid = 0;
 				WriteRqst.Flush.xrecoff = 0;
 				XLogWrite(&WriteRqst, false, false);
 				LWLockRelease(WALWriteLock);
 				Insert->LogwrtResult = LogwrtResult;
-				TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DONE();
+				TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_DONE();
 			}
 		}
 	}
Index: src/backend/utils/probes.d
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/utils/probes.d,v
retrieving revision 1.4
diff -u -3 -p -r1.4 probes.d
--- src/backend/utils/probes.d	17 Dec 2008 01:39:04 -	1.4
+++ src/backend/utils/probes.d	22 Dec 2008 16:28:01 -
@@ -89,6 +89,6 @@ provider postgresql {
 	probe xlog__insert(unsigned char, unsigned char);
 	probe xlog__switch();
-	probe wal__buffer__write__start();
-	probe wal__buffer__write__done();
+	probe wal__buffer__write__dirty__start();
+	probe wal__buffer__write__dirty__done();
 };

-- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can
be your backup. +
Re: [HACKERS] incoherent view of serializable transactions
Hi Kevin, The link didn't seem to work for me, but I think I found the article you meant: Serializable Isolation for Snapshot Databases by Michael J. Cahill, et al An interesting work. If nothing else, it will help flesh out the documentation of anomalies. If the PostgreSQL community ever does want to better approach true serializable behavior, this should provide a good theoretical basis. Sorry for the broken link. Yes, this is the paper. Note that the paper was not necessarily enthusiastically received by the community when presented at the conference. While this is an interesting academic paper, its practicality left a majority of the audience perplexed. There was an interesting comment by the Oracle folks: Oracle does not provide serializability but only snapshot isolation, and still users prefer to 'downgrade' to read committed for better performance. The Oracle guys' experience seemed to indicate that there was no need for serializability (well, that's also less work for them ;-)) in their customer base. Having a foot both in academia and in industry, I understand the intellectual interest in serializability on the academic side, but I still have not seen a single use case in industry where this was a requirement (but my db experience is probably narrow). Have nice serializable holidays ;-) manu -- Emmanuel Cecchet FTO @ Frog Thinker Open Source Development Consulting -- Web: http://www.frogthinker.org email: m...@frogthinker.org Skype: emmanuel_cecchet
Re: [HACKERS] parallel restore vs. windows
On Sun, Dec 14, 2008 at 12:13 PM, Andrew Dunstan and...@dunslane.net wrote: This version completes properly on Windows with the regression database. Actually, this one doesn't apply cleanly on head. -- Atentamente, Jaime Casanova Soporte y capacitación de PostgreSQL Asesoría y desarrollo de sistemas Guayaquil - Ecuador Cel. +59387171157
Re: [HACKERS] parallel restore vs. windows
Jaime Casanova wrote: On Sun, Dec 14, 2008 at 12:13 PM, Andrew Dunstan and...@dunslane.net wrote: This version completes properly on Windows with the regression database. actually, this one doesn't apply cleanly on head I will have a new patch a day or two after Christmas, which I hope will be very close to being fully done. cheers andrew
Re: [HACKERS] incoherent view of serializable transactions
Robert Haas wrote: ... serializable transaction ... If we were to construct a database that had one giant lock for the entire database so that only a single query could execute at one time, transactions would be serializable (because they'd in fact be serialized). However, performance would suck. I wonder if this giant-lock-for-isolation-level-serializable is a mode postgres should support. ISTM it would meet the letter of the spec, and at least some of the people using transaction isolation level serializable are doing so precisely because they *want* the database to deal with all possible serialization issues, and accepting performance penalties.
Re: [HACKERS] incoherent view of serializable transactions
On Dec 24, 2008, at 6:46 PM, Ron Mayer rm...@cheapcomplexdevices.com wrote: Robert Haas wrote: ... serializable transaction ... If we were to construct a database that had one giant lock for the entire database so that only a single query could execute at one time, transactions would be serializable (because they'd in fact be serialized). However, performance would suck. I wonder if this giant-lock-for-isolation-level-serializable is a mode postgres should support. ISTM it would meet the letter of the spec, and at least some of the people using transaction isolation level serializable are doing so precisely because they *want* the database to deal with all possible serialization issues, and accepting performance penalties. No. :-) Cheers, ...Robert
Re: [HACKERS] [idea] a copied relkind in pg_attribute
Jaime Casanova wrote: On Wed, Dec 24, 2008 at 6:50 AM, KaiGai Kohei kai...@kaigai.gr.jp wrote: The current implementation needs to look up the RELOID system cache to identify the relkind of the relation, because pg_attribute does not have any information about relkind. However, I also think it is not an ideal implementation, even if its frequency is small enough. but you still can do it, right? there are any other advantages, like something you can't do now? Yes, we can still see the relkind via the system cache. When we fetch a tuple from pg_attribute, we need to look up the RELOID system cache to get the relkind of its relation. The reason I consider the change preferable is its advantages in performance and in the robustness of the security design. The first one is obvious: if we have an attkind member, it is not necessary to look up the RELOID system cache for each tuple. The other needs an explanation. A database superuser is allowed to update a system catalog by hand, if that is allowed by the security policy. For example, he will be able to update the relkind in some pg_class rows, even though that never happens via general DDLs. If a relkind is changed from 'r' to 'c', we have to treat the pg_attribute entries pointing to that tuple as db_tuple class, not db_column class, because they are no longer columns. It means we fundamentally have to check permissions on pg_attribute when pg_class is updated, or pg_attribute should carry its own identifying information. I think the latter approach is simpler. Consider another example: in a filesystem, the 'x' permission bit has a different meaning for files and directories. If a directory with no child files is suddenly handled as a regular file, it can cause confusion. It is a similar situation. So, I want to give pg_attribute the attribute it needs to identify itself.
Thanks,

> if not, i think you need to prove that there will be any benefit in
> speed and that it is significant; otherwise this seems like premature
> optimization, and that, AFAIR, is the root of all evil :)

--
OSS Platform Development Division, NEC
KaiGai Kohei kai...@ak.jp.nec.com
Re: [HACKERS] Window-functions patch handling of aggregates
2008/12/25 Tom Lane t...@sss.pgh.pa.us:
> Gregory Stark st...@enterprisedb.com writes:
>> Tom Lane t...@sss.pgh.pa.us writes:
>>> Unless we want to move the goalposts on what an aggregate is allowed
>>> to do internally, we're going to have to change this to re-aggregate
>>> repeatedly. Neither prospect is appetizing in the least.
>>
>> Does it currently copy the state datum before calling the final
>> function? Would that help?
>
> No. The entire point of what we have now formalized as aggregates with
> internal-type transition values is that the transition value isn't
> necessarily a single palloc chunk. For stuff like array_agg, the
> performance costs of requiring that are enormous.
>
> On looking at what array_agg does, it seems the issue there is that the
> final-function actually deletes the working state when it thinks it's
> done with it (see construct_md_array). It would probably be possible to
> fix things so that it doesn't do that when it's called by a WindowAgg
> instead of a regular Agg. What I'm more concerned about is what
> third-party code will break. contrib/intagg has done this type of thing
> since forever, and I'm sure people have copied that...

I raised this concern once before, during the first design of the window
functions. I don't know all the aggregate functions in detail, but at
least count(*) makes some assumptions about AggState in its context. So
I concluded that when window functions are introduced, it must be
announced that if you use an aggregate in a window context, you must be
sure it supports windows as well as plain aggregation. Currently (up to
8.3) aggregates assume they are called only within an AggState, and that
assumption would now be broken. This is by design, and announcing it
helps third-party code decide whether or not to support window functions
(i.e. a function can stop with an error if windows are not supported).
Regards,

--
Hitoshi Harada
[HACKERS] hot standby on mingw
Hi,

i'm getting this error when trying to build hot standby (v5 from
20081218, obtained from the wiki) on mingw:

make[4]: Entering directory `/home/Administrador/pgsql/src/backend/access/heap'
gcc -O2 -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -fno-strict-aliasing -fwrapv -g -I../../../../src/include -I./src/include/port/win32 -DEXEC_BACKEND -I../../../../src/include/port/win32 -DBUILDING_DLL -c -o heapam.o heapam.c -MMD -MP -MF .deps/heapam.Po
In file included from heapam.c:51:
../../../../src/include/access/xact.h:109: error: syntax error before uint
../../../../src/include/access/xact.h:109: warning: no semicolon at end of struct or union
../../../../src/include/access/xact.h:117: error: syntax error before '}' token
../../../../src/include/access/xact.h:117: warning: type defaults to `int' in declaration of `xl_xact_commit'
../../../../src/include/access/xact.h:117: warning: data definition has no type or storage class
../../../../src/include/access/xact.h:152: error: syntax error before uint
../../../../src/include/access/xact.h:152: warning: no semicolon at end of struct or union
../../../../src/include/access/xact.h:158: error: syntax error before '}' token
../../../../src/include/access/xact.h:158: warning: type defaults to `int' in declaration of `xl_xact_abort'
../../../../src/include/access/xact.h:158: warning: data definition has no type or storage class
../../../../src/include/access/xact.h:173: error: syntax error before xl_xact_commit
../../../../src/include/access/xact.h:173: warning: no semicolon at end of struct or union
../../../../src/include/access/xact.h:175: warning: type defaults to `int' in declaration of `xl_xact_commit_prepared'
../../../../src/include/access/xact.h:175: warning: data definition has no type or storage class
../../../../src/include/access/xact.h:182: error: syntax error before xl_xact_abort
../../../../src/include/access/xact.h:182: warning: no semicolon at end of struct or union
../../../../src/include/access/xact.h:184: warning: type defaults to `int' in declaration of `xl_xact_abort_prepared'
../../../../src/include/access/xact.h:184: warning: data definition has no type or storage class
make[4]: *** [heapam.o] Error 1

i guess the problem is the uint type; when i change that to int, the
error vanishes.

--
Atentamente,
Jaime Casanova
Soporte y capacitación de PostgreSQL
Asesoría y desarrollo de sistemas
Guayaquil - Ecuador
Cel. +59387171157