Re: [HACKERS] bgworker sigusr1 handler
On Wed, Apr 10, 2013 at 12:15 PM, Robert Haas robertmh...@gmail.com wrote:

Just for fun, I implemented a toy background worker tonight using the new bgworker framework. Generally, it went well, and I'm pleased with the design of the new facility. However, I did notice one oddity. I initialized the worker flags like this:

worker.bgw_flags = BGWORKER_SHMEM_ACCESS;

And... latches didn't work. It turns out that if you request database access, then the SIGUSR1 handler is set to procsignal_sigusr1_handler, which is fine. But if you don't, it gets set to SIG_IGN. And the result of *that* is that if someone sets a latch for which the background process is waiting, the background process fails to notice. Now, once you understand what's going on here, it's not hard to work around. But it seems to me that it would be a saner default to set the signal handler to something like the bgwriter handler, which just calls latch_sigusr1_handler.

There is also currently a bug with bgworkers and SIGHUP: if the postmaster receives a SIGHUP, it does not notify its registered bgworkers: http://www.postgresql.org/message-id/CAB7nPqQ-ccL9Q7wxpWNaG5Zs-hMLh_ayQb=rM2=+pxtwd+8...@mail.gmail.com

You can try this with the example I provided: reload parameters with pg_ctl reload and you will notice that bgworkers do not process SIGHUP as a normal backend would.

-- Michael
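The failure mode described above can be reproduced outside PostgreSQL: a disposition of SIG_IGN silently discards the wakeup signal, while an installed handler (the role latch_sigusr1_handler plays for latches) lets the waiter notice it. A minimal POSIX sketch of the difference — an analogy only, not PostgreSQL code:

```python
import os
import signal

woken = {"count": 0}

# With SIG_IGN (the bgworker default when no database access is requested),
# a SIGUSR1 sent to the process is simply discarded.
signal.signal(signal.SIGUSR1, signal.SIG_IGN)
os.kill(os.getpid(), signal.SIGUSR1)
assert woken["count"] == 0  # nothing observed the signal

# With a real handler installed (analogous to latch_sigusr1_handler),
# the same signal is delivered and the waiting process can react.
def on_sigusr1(signum, frame):
    woken["count"] += 1

signal.signal(signal.SIGUSR1, on_sigusr1)
os.kill(os.getpid(), signal.SIGUSR1)
assert woken["count"] == 1  # the wakeup is now visible
```

This is exactly why a latch wait never returns under SIG_IGN: the wakeup is dropped before any code can run.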
Re: [HACKERS] page 1 of relation global/11787 was uninitialized
Tom Lane wrote:

Andres Freund and...@2ndquadrant.com writes: Afaik we don't have any debugging utility to dump the pg_filenode.map contents?

Hardly need one ... od -t d4 $PGDATA/global/pg_filenode.map is readable enough, though it does leave you still having to map the numeric OIDs back to names. The OIDs will be stable though.

Shouldn't that be od -t u4 $PGDATA/global/pg_filenode.map?

Yours, Laurenz Albe

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
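The difference between -t d4 and -t u4 matters once a 32-bit value exceeds 2^31: od -t d4 prints each word as signed decimal, so large OIDs would come out negative. A quick illustration of the reinterpretation (the value is arbitrary, not a real catalog OID):

```python
import struct

oid = 3000000000                       # a 32-bit value above 2**31
raw = struct.pack("<I", oid)           # the 4 bytes as stored on disk

(as_unsigned,) = struct.unpack("<I", raw)   # what `od -t u4` would show
(as_signed,)   = struct.unpack("<i", raw)   # what `od -t d4` would show

assert as_unsigned == 3000000000
assert as_signed == 3000000000 - 2**32      # i.e. -1294967296
```

Since OIDs are unsigned, -t u4 is the faithful rendering, which is Laurenz's point.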
Re: [HACKERS] lwlock contention with SSI
On Tue, Apr 09, 2013 at 07:49:51PM -0400, Robert Haas wrote: These locks are all SSI-related and they're all really hot. Lock 28 is SerializableXactHashLock and lock 29 is SerializableFinishedListLock; both are acquired an order of magnitude more often than any non-SSI lock, and cause two orders of magnitude more blocking than any other lock whatsoever. Lock 30 is SerializablePredicateLockListLock, which has no exclusive lock acquisitions at all on this test, but the shared acquisitions result in significant spinlock contention.

This matches what I saw when I looked into this a while ago. I even started sketching out some plans of how we might deal with it, but unfortunately I never had much time to work on it, and that seems unlikely to change any time soon. :-\

As it is, pretty much any operation involving SSI requires acquiring SerializableXactHashLock (usually exclusive), except for checking whether a read or write indicates a conflict. That includes starting and ending a transaction. Two things make this hard to fix:

- SSI is about checking for rw-conflicts, which are inherently about *pairs* of transactions. This makes it hard to do fine-grained locking, because a lot of operations involve looking at or modifying the conflict list of more than one transaction.

- SerializableXactHashLock protects many things. Besides the SERIALIZABLEXACT structures themselves, there's also the free lists for SERIALIZABLEXACTs and RWConflicts, the SerializableXidHash table, the latest SxactCommitSeqno and SxactGlobalXmin, etc.

I'm trying to swap back in my notes about how to address this. It is bound to be a substantial project, however.

Dan

-- Dan R. K. Ports, UW CSE, http://drkp.net/
Re: [HACKERS] Enabling Checksums
On Wed, Apr 10, 2013 at 4:36 AM, Jeff Davis pg...@j-davis.com wrote: On Tue, 2013-04-09 at 05:35 +0300, Ants Aasma wrote: And here you go. I decided to be verbose with the comments as it's easier to delete a comment than to write one. I also left in a huge jumble of macros to calculate the contents of a helper var during compile time. This can easily be replaced with the calculated values once we settle on specific parameters.

Great, thank you. Is it possible to put an interface over it that somewhat resembles the CRC checksum (INIT/COMP/FIN)? It looks a little challenging because of the nature of the algorithm, but it would make it easier to extend to other places (e.g. WAL). It doesn't have to match the INIT/COMP/FIN pattern exactly.

The algorithm has 128 bytes of state. Storing it on every step would negate any performance gains, and C doesn't have a way to keep it in registers. If we can trust that the compiler doesn't clobber xmm registers then it could be split up into the following pieces:

1. init
2. process 128 bytes
3. aggregate state
4. mix in block number

Even if we don't split it up, factoring out steps 1..3 would make sense, as there is no point in making step 4 platform specific, so it would otherwise just be duplicated.

Regardless, we should have some kind of fairly generic interface and move the code to its own file (e.g. checksum.c). To make the interface more generic, would it make sense to require the caller to save the page's stored checksum and zero it before calculating? That would avoid the awkwardness of avoiding the pd_checksum field. For example (code for illustration only):

Yes, that would help make it reusable.

That would make it possible to use a different word size -- is uint16 optimal or would a larger word be more efficient?

Larger words would have better mixing, as multiplies mix 4 bytes at a time instead of 2. Performance of the vectorized version will be the same, as it is tied to the vector length, but the unvectorized version would get a speed-up. The reason I picked 16 bits is not actually related to the checksum hole, but because the pmullw instruction is guaranteed to be available on all 64-bit CPUs, whereas pmulld is only available on the latest CPUs.

It looks like the block size needs to be an even multiple of sizeof(uint16)*NSUMS. And it also looks like it's hard to combine different regions of memory into the same calculation (unless we want to just calculate them separately and XOR them or something). Does that mean that this is not suitable for WAL at all?

I think it would be possible to define a padding scheme for irregularly sized memory segments where we would only need a lead-out command for blocks that are not a multiple of 128 bytes. The performance of it would need to be measured. All in all, it's not really a great match for WAL. While all of the fast checksums process many bytes in a single iteration, they still process an order of magnitude fewer bytes per iteration, and so have an easier time with irregularly shaped blocks.

Using SIMD for WAL is not a requirement at all; I just thought it might be a nice benefit for non-checksum-enabled users in some later release. I think we should first deal with using it for page checksums, and if future versions want to reuse some of the code for WAL checksums then we can rearrange the code.

Regards, Ants Aasma

-- Cybertec Schönig & Schönig GmbH, Gröhrmühlgasse 26, A-2700 Wiener Neustadt, Web: http://www.postgresql-support.de
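The structure under discussion — parallel 16-bit accumulators updated with a multiply-mix, an aggregation step, then mixing in the block number — can be sketched in a few lines. This toy version only illustrates why the lanes vectorize (each iteration updates NLANES independent accumulators, which is what pmullw operates on); the lane count and constants are made up and this is not Ants' actual algorithm:

```python
import struct

NLANES = 4       # stand-in for the real vector width
PRIME = 0xF4F5   # arbitrary odd constant; multiplication by an odd number
                 # is a bijection mod 2**16, so a lane update never loses
                 # information about the word that was mixed in

def toy_checksum(data: bytes, block_no: int) -> int:
    """Steps 1-4 from the thread: init, process, aggregate, mix in block."""
    assert len(data) % (2 * NLANES) == 0
    words = struct.unpack("<%dH" % (len(data) // 2), data)
    lanes = [0xFFFF] * NLANES                          # 1. init
    for i, w in enumerate(words):                      # 2. process block
        j = i % NLANES                                 #    (lanes are independent,
        lanes[j] = ((lanes[j] ^ w) * PRIME) & 0xFFFF   #     hence vectorizable)
    agg = 0
    for v in lanes:                                    # 3. aggregate lane state
        agg = ((agg ^ v) * PRIME) & 0xFFFF
    return (agg ^ block_no) & 0xFFFF                   # 4. mix in block number

page = bytes(range(256)) * 32                 # 8192-byte dummy "page"
c1 = toy_checksum(page, block_no=7)
corrupt = bytearray(page)
corrupt[100] ^= 0x01
assert toy_checksum(bytes(corrupt), 7) != c1  # a flipped bit changes the sum
assert toy_checksum(page, 8) != c1            # wrong block number changes it
```

The separation of step 4 also shows why, as said above, only steps 1..3 are worth making platform specific.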
Re: [HACKERS] Enabling Checksums
On 10 April 2013 09:01, Ants Aasma a...@cybertec.at wrote: Using SIMD for WAL is not a requirement at all; I just thought it might be a nice benefit for non-checksum-enabled users in some later release. I think we should first deal with using it for page checksums and if future versions want to reuse some of the code for WAL checksums then we can rearrange the code.

We have essentially the same problem in both cases: we want to calculate a checksum of BLCKSZ chunks of data, plus some smaller header data. We currently use the same code for both cases and it makes sense to do the same thing with any new code. This is also the *same* issue: when we make a new hint we need to issue a full page write in WAL, so we are calculating checksums in two new places: XLOG_HINT records and data blocks.

A few technical points:

* We're taking a copy of the buffer, so Jeff's zero trick works safely, I think.
* We can use a different algorithm for big and small blocks; we just need a way to show we've done that, for example setting the high-order bit of the checksum.
* We might even be able to calculate a CRC32 checksum for normal WAL records, and use Ants' checksum for full page writes (only). Checking a WAL checksum would then be to confirm the header passes CRC32 and then re-check the Ants checksum of each backup block.

This work needs to happen now, since once the checksum algorithm is set we won't easily be able to change it.

-- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
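Jeff's "zero trick" referred to above — compute over the page with the stored checksum field zeroed, then compare — can be sketched as follows. The field offset is an assumption based on the 9.3-era page header layout (pd_checksum following the 8-byte pd_lsn), and the compute function is a placeholder for whichever algorithm is chosen:

```python
PD_CHECKSUM_OFF = 8   # assumed byte offset of pd_checksum (after pd_lsn)

def set_page_checksum(page: bytearray, block_no: int, compute) -> None:
    # Zero the field first so the stored value never feeds the calculation.
    page[PD_CHECKSUM_OFF:PD_CHECKSUM_OFF + 2] = b"\x00\x00"
    c = compute(bytes(page), block_no)
    page[PD_CHECKSUM_OFF:PD_CHECKSUM_OFF + 2] = c.to_bytes(2, "little")

def verify_page_checksum(page: bytes, block_no: int, compute) -> bool:
    stored = int.from_bytes(page[PD_CHECKSUM_OFF:PD_CHECKSUM_OFF + 2], "little")
    zeroed = bytearray(page)
    zeroed[PD_CHECKSUM_OFF:PD_CHECKSUM_OFF + 2] = b"\x00\x00"
    return compute(bytes(zeroed), block_no) == stored

# Placeholder algorithm, just to exercise the two functions.
def dummy_sum(data: bytes, block_no: int) -> int:
    return (sum(data) + block_no) & 0xFFFF

page = bytearray(8192)
set_page_checksum(page, 3, dummy_sum)
assert verify_page_checksum(bytes(page), 3, dummy_sum)
assert not verify_page_checksum(bytes(page), 4, dummy_sum)  # wrong block
```

Simon's algorithm-flagging idea would then amount to something like `c |= 0x8000` before storing, at the cost of one checksum bit.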
Re: [HACKERS] [BUGS] replication_timeout not effective
On Wednesday, April 10, 2013 1:49 PM Dang Minh Huong wrote:

Hi, thank you for your quick reply. I'm trying to set the network timeout related parameters to terminate it. (I've tried to set postgresql.conf's tcp_keepalives_* but without success.)

I have also tried those, but they didn't work; that's why I have proposed this feature in 9.3. Please send mail on the community list; others may also be able to help if they have any idea for avoiding such problems.

Message from Amit Kapila amit.kap...@huawei.com, 2013/04/10 14:05:

On Wednesday, April 10, 2013 9:35 AM Dang Minh Huong wrote: Hi, I'm wondering if this is a bug of PostgreSQL. The documentation says that the replication_timeout parameter can "Terminate replication connections that are inactive longer than the specified number of milliseconds." But in my environment the sender process hangs (for several tens of minutes) if I turn off (by powering off) the standby PC while pg_basebackup is executing. Is this correct? From my debugging, the sender process is terminated when it receives SIGPIPE, but that comes too slowly (about 30 minutes after the standby PC went down).

For such scenarios, a new parameter wal_sender_timeout has been introduced in 9.3. Refer below: http://www.postgresql.org/docs/devel/static/runtime-config-replication.html#RUNTIME-CONFIG-REPLICATION-SENDER I am not sure how to get rid of this problem in 9.1.9.

With Regards, Amit Kapila.
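For reference, the settings discussed above live in postgresql.conf; the values below are purely illustrative (and, as both posters note, the TCP keepalives did not help in this particular case):

```
# postgresql.conf (illustrative values only)
tcp_keepalives_idle = 60        # seconds of inactivity before keepalives start
tcp_keepalives_interval = 10    # seconds between keepalive probes
tcp_keepalives_count = 3        # lost probes before the connection is dropped
replication_timeout = 60s       # 9.1/9.2 name; renamed wal_sender_timeout in 9.3
```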
Re: [HACKERS] Inconsistent DB data in Streaming Replication
It's one of the reasons why a fresh base backup is required when starting the old master as a new standby? If yes, I agree with you. I've often heard complaints about needing a backup when restarting a new standby. That's a really big problem. I think Fujii Masao is on the same page.

In case of sync rep, the master just waits for confirmation from the standby before returning to the client on commit. Not just commit: you must stop any *writing* of WAL records, effectively killing any parallelism. The main issue is that it will make *all* backends dependent on each sync commit, essentially serialising all backends' commits, with the serialisation *including* the latency of the roundtrip to the client. With current sync streaming the other backends can continue to write WAL; with the proposed approach you cannot write any records after the one waiting for an ACK from the standby.

Let me rephrase the proposal in a more accurate manner. Consider the following scenario:

(1) A client sends the COMMIT command to the master server.
(2) The master writes the WAL record to disk.
(3) The master writes the data page related to this transaction, i.e. via checkpoint or bgwriter.
(4) The master sends WAL records continuously to the standby, up to the commit WAL record.
(5) The standby receives WAL records, writes them to disk, and then replies with an ACK.
(6) The master returns a success indication to the client after it receives the ACK.

If failover happens between (3) and (4), WAL and DB data in the old master are ahead of those in the new master. After failover, the new master continues running new transactions independently of the old master. Then WAL records and DB data would become inconsistent between those two servers. To resolve these inconsistencies, a backup of the new master needs to be taken onto the new standby. But taking a backup is not feasible in the case of a larger database, several TB in size, over a slow WAN.
So to avoid this type of inconsistency without taking a fresh backup we are thinking of doing the following (I think you can introduce a GUC specifying whether this extra check is required, to avoid a backup when failing back):

Approach: Introduce a new GUC option specifying whether to prevent PostgreSQL from writing DB data before the corresponding WAL records have been replicated to the standby. That is, if this GUC option is enabled, PostgreSQL waits for the corresponding WAL records to be not only written to disk but also replicated to the standby before writing DB data. So the process becomes as follows:

(1) A client sends the COMMIT command to the master server.
(2) The master writes the commit WAL record to the disk.
(3) The master sends WAL records continuously to the standby, up to the commit WAL record.
(4) The standby receives WAL records, writes them to disk, and then replies with an ACK.
(5) The master then forces a write of the data page related to this transaction.
(6) The master returns a success indication to the client after it receives the ACK.

While the master is waiting to force a write (point 5) for this data page, streaming replication continues. Also, other data page writes are not dependent on this particular page write, so data page writes are not serialized.

Regards, Samrat
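The ordering change in this proposal reduces to a single gate in the page-write path: a dirty page may go to disk only once the standby has ACKed WAL up to that page's LSN. A schematic sketch — names are invented for illustration, this is not PostgreSQL's buffer-manager code:

```python
def can_flush_page(page_lsn: int, standby_ack_lsn: int) -> bool:
    """Proposed rule: flush a data page only if the standby already holds
    the WAL that the page's contents depend on (page LSN <= ACKed LSN)."""
    return page_lsn <= standby_ack_lsn

# The commit path itself is unchanged: WAL is still streamed and ACKed as
# in plain synchronous replication. Only checkpointer/bgwriter page flushes
# consult the gate, which is why commits are claimed not to serialize.
assert can_flush_page(page_lsn=0x50, standby_ack_lsn=0x80)       # safe to flush
assert not can_flush_page(page_lsn=0x90, standby_ack_lsn=0x80)   # must wait
```

This resembles the existing WAL-before-data rule on a single node, extended so that "WAL is durable" means "WAL is durable on the standby too".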
Re: [HACKERS] [BUGS] replication_timeout not effective
Hi Amit, thank you for your consideration. My project does not allow using 9.2 or 9.3. In 9.3, it sounds like replication_timeout is replaced by wal_sender_timeout, so if it is solved in 9.3 I think there is a way to terminate it. I hope it is fixed in 9.1 soon.

Regards,

Message from Amit Kapila amit.kap...@huawei.com, 2013/04/10 18:33: [...] For such scenarios, a new parameter wal_sender_timeout has been introduced in 9.3. [...] I am not sure how to get rid of this problem in 9.1.9.
Re: [HACKERS] Inconsistent DB data in Streaming Replication
(5) The master then forces a write of the data page related to this transaction.

Sorry, this is incorrect. Whenever the master writes the data page, it checks that the WAL record up to that LSN has been written on the standby.

While the master is waiting to force a write (point 5) for this data page, streaming replication continues. Also, other data page writes are not dependent on this particular page write, so data page writes are not serialized.

Sorry, this is incorrect. Streaming replication continues and the master is not waiting: whenever the master writes the data page, it checks that the WAL record up to that LSN has been written on the standby.

Regards, Samrat
Re: [HACKERS] Enabling Checksums
On Wed, Apr 10, 2013 at 12:25 PM, Simon Riggs si...@2ndquadrant.com wrote: [...] * We might even be able to calculate CRC32 checksum for normal WAL records, and use Ants' checksum for full page writes (only). So checking WAL checksum would then be to confirm header passes CRC32 and then re-check the Ants checksum of each backup block.

If we ensure that the checksum on the page is correct when we do a full page write, then we could include only the checksum field in the WAL CRC. When reading WAL we would first check that the CRC is correct and then verify the page checksum.

This work needs to happen now, since once the checksum algorithm is set we won't easily be able to change it.

The page checksum algorithm needs to be decided now, but WAL CRCs and full page writes can be changed in 9.4 and don't need to be perfect on the first try.
Regards, Ants Aasma -- Cybertec Schönig & Schönig GmbH, Gröhrmühlgasse 26, A-2700 Wiener Neustadt, Web: http://www.postgresql-support.de
Re: [HACKERS] Inconsistent DB data in Streaming Replication
On Wednesday, April 10, 2013 3:42 PM Samrat Revagade wrote: (5) The master then forces a write of the data page related to this transaction. Sorry, this is incorrect. Whenever the master writes the data page it checks that the WAL record up to that LSN has been written on the standby. While the master is waiting to force a write (point 5) for this data page, streaming replication continues. Also other data page writes are not dependent on this particular page write, so data page writes are not serialized. Sorry, this is incorrect. Streaming replication continues, the master is not waiting; whenever the master writes the data page it checks that the WAL record up to that LSN has been written on the standby.

I am not sure it will resolve the problem completely, as your old master can have some extra WAL beyond the new master for the same timeline. I don't remember exactly whether the timeline switch feature will take care of this extra WAL; Heikki can confirm this point. Also, I think this can serialize the flush of data pages in checkpoint/bgwriter, which is currently not the case.

With Regards, Amit Kapila.
Re: [HACKERS] [BUGS] replication_timeout not effective
Hello,

On Wed, Apr 10, 2013 at 6:57 PM, Dang Minh Huong kakalo...@gmail.com wrote: In 9.3, it sounds like replication_timeout is replaced by wal_sender_timeout. So if it is solved in 9.3 I think there is a way to terminate it. I hope it is fixed in 9.1 soon.

Hmm. He said that in his environment the sender process hangs (for several tens of minutes) if he turns off (by powering off) the standby PC while *pg_basebackup* is executing.

Does basebackup run only on a 'replication connection'? As far as I saw, base backup uses a 'base backup' connection in addition to the 'streaming' connection. The former seems not to be under the control of wal_sender_timeout or replication_timeout, and is easily blocked at send(2) after a sudden cut of the network connection underneath, although the latter indeed is terminated by them.

Blocking in send(2) could occur for an async-rep connection, but not likely for sync-rep, since it does not fill the buffers of libpq and the socket easily. I suppose he is talking about this case. This seems to occur as of the latest 9.3dev.

regards, -- Kyotaro Horiguchi
Re: [HACKERS] [BUGS] replication_timeout not effective
On 2013-04-10 22:38:07 +0900, Kyotaro HORIGUCHI wrote: Does basebackup run only on a 'replication connection'? As far as I saw, base backup uses a 'base backup' connection in addition to the 'streaming' connection. The former seems not to be under the control of wal_sender_timeout or replication_timeout, and is easily blocked at send(2) after a sudden cut of the network connection underneath, although the latter indeed is terminated by them.

Yes, it's run via a walsender connection. The only problem is that it doesn't check for those timeouts. I am not sure it would be a good thing to do so, to be honest. At least not using the same timeout as actual WAL sending; that just has different characteristics. On the other hand, hanging around that long isn't nice either...

Blocking in send(2) could occur for an async-rep connection, but not likely for sync-rep, since it does not fill the buffers of libpq and the socket easily.

You just need larger transactions for it. A COPY or so ought to do it.

Greetings, Andres Freund

-- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Inconsistent DB data in Streaming Replication
Amit Kapila amit.kap...@huawei.com writes: I am not sure it will resolve the problem completely, as your old master can have some extra WAL beyond the new master for the same timeline. I don't remember exactly whether the timeline switch feature will take care of this extra WAL; Heikki can confirm this point. Also I think this can serialize the flush of data pages in checkpoint/bgwriter, which is currently not the case.

Yeah. TBH this entire discussion seems to be "let's cripple performance in the normal case so that we can skip doing an rsync when resurrecting a crashed, failed-over master". This is not merely optimizing for the wrong thing, it's positively hazardous. After a fail-over, you should be wondering whether it's safe to resurrect the old master at all, not about how fast you can bring it back up without validating its data. IOW, I wouldn't consider skipping the rsync even if I had a feature like this.

regards, tom lane
Re: [HACKERS] Inconsistent DB data in Streaming Replication
On 2013-04-10 10:10:31 -0400, Tom Lane wrote: After a fail-over, you should be wondering whether it's safe to resurrect the old master at all, not about how fast you can bring it back up without validating its data. IOW, I wouldn't consider skipping the rsync even if I had a feature like this.

Agreed. Especially as in situations where you fail over in a planned way, e.g. for a hardware upgrade, you can avoid the need to resync with a little bit of care. So it's mostly in catastrophic situations that this becomes a problem, and in those you really should resync - and it's a good idea not to use a normal rsync but an rsync --checksum or similar.

Greetings, Andres Freund

-- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Enabling Checksums
On Wed, Apr 10, 2013 at 01:15:12PM +0300, Ants Aasma wrote: This work needs to happen now, since once the checksum algorithm is set we won't easily be able to change it. The page checksum algorithm needs to be decided now, but WAL CRCs and full page writes can be changed in 9.4 and don't need to be perfect on the first try.

I can confirm that --- pg_upgrade doesn't copy any of the WAL stream from the old to the new cluster.

-- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Re: [HACKERS] Inconsistent DB data in Streaming Replication
On 04/10/2013 09:10 AM, Tom Lane wrote: IOW, I wouldn't consider skipping the rsync even if I had a feature like this.

Totally. Out in the field, we consider the old database corrupt the moment we fail over. There is literally no way to verify the safety of any data along the broken chain, given race conditions and multiple potential failure points. The only potential use case for this that I can see would be for system maintenance and a controlled failover. I agree: that's a major PITA when doing DR testing, but I personally don't think this is the way to fix that particular edge case. Maybe checksums will fix this in the long run... I don't know. DRBD has a handy block-level verify function for things like this, and it can re-sync master/slave data by comparing the commit log across the servers if you tell it one node should be considered incorrect.

The thing is... we have clogs, and we have WAL. If we can assume bidirectional communication and verification (checksum comparison?) of both of those components, the database *should* be able to re-sync itself. Even if that were possible given the internals, I can't see anyone jumping on this before 9.4 or 9.5 unless someone sponsors the feature. Automatic re-sync would (within available WALs) be an awesome feature, though...

-- Shaun Thomas OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604 | 312-676-8870 | stho...@optionshouse.com

__ See http://www.peak6.com/email_disclaimer/ for terms and conditions related to this email
Re: [HACKERS] [BUGS] replication_timeout not effective
Thanks all,

(2013/04/10 22:55), Andres Freund wrote: Yes, it's run via a walsender connection. The only problem is that it doesn't check for those timeouts. I am not sure it would be a good thing to do so, to be honest. At least not using the same timeout as actual WAL sending; that just has different characteristics. On the other hand, hanging around that long isn't nice either...

I tried max_wal_senders = 1, so while the walsender is hanging I cannot run pg_basebackup again (or start the standby DB). After increasing it to 2, the second attempt succeeds. But I'm afraid that when a third hang occurs, the hanging walsender from the first is still not terminated... I think not, but is there a way to terminate the hung walsender without restarting the PostgreSQL server or killing the walsender process? (Killing the walsender process could cause a crash of the DB server, so I don't want to do that.) I've also tried pg_cancel_backend(), but it did not work either.

Blocking in send(2) could occur for an async-rep connection, but not likely for sync-rep, since it does not fill the buffers of libpq and the socket easily.

You just need larger transactions for it. A COPY or so ought to do it.

Greetings, Andres Freund

Regards, Huong DM
Re: [HACKERS] [BUGS] replication_timeout not effective
On 2013-04-10 23:37:44 +0900, Dang Minh Huong wrote: Thanks all, (2013/04/10 22:55), Andres Freund wrote: On 2013-04-10 22:38:07 +0900, Kyotaro HORIGUCHI wrote: Hello, On Wed, Apr 10, 2013 at 6:57 PM, Dang Minh Huong kakalo...@gmail.com wrote: In 9.3, it sounds replication_timeout is replaced by wal_sender_timeout. So if it is solved in 9.3 i think there is a way to terminate it. I hope it is fixed in 9.1 soon Hmm. He said that, But in my environment the sender process is hang up (in several tens of minunites) if i turn off (by power off) Standby PC while *pg_basebackup* is excuting. Does basebackup run only on 'replication connection' ? As far as I saw base backup uses 'base backup' connection in addition to 'streaming' connection. The former seems not under the control of wal_sender_timeout or replication_timeout and easily blocked at send(2) after sudden cut out of the network connection underneath. Although the latter indeed is terminated by them. Yes, it's run via a walsender connection. The only problem is that it doesn't check for those timeouts. I am not sure it would be a good thing to do so to be honest. At least not using the same timeout as actual WAL sending, thats just has different characteristics. On the other hand, hanging around that long isn't nice either... I tried max_wal_sender with 1, so when the walsender is hanging. I can not run again pg_basebackup (or start the standby DB). I'm increasing it to 2, so the seconds successfully. But i'm afraid that when the third occures the hanging walsender in the first is not yet terminated... I think not, but is there a way to terminate hanging up but not restart PostgreSQL server or kill walsender process? (kill walsender process can caused a crash to DB server, so i don't want to do it). Depending on where its hanging a normal SELECT pg_terminate_backend(pid); might do it. Otherwise you will have to wait for the operating system's tcp timeout. 
Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
[HACKERS] [GSOC] questions about idea rewrite pg_dump as library
Hi all, I'd like to introduce myself to the dev community. I am Shuai Fan, a student from Dalian University of Technology (DLUT for short), China, and I am interested in working on the PostgreSQL project in GSoC 2013. I'm interested in the idea Rewrite (add) pg_dump and pg_restore utilities as libraries (.so, .dll, .dylib). These days, I had a talk with Mr. Pavel Golub by email, the author of this post, and asked some questions about this idea. He advised me to post the question to this mailing list.

My question is: there are lots of functions in pg_dump.c. If I rewrite this file as a library, I should split pg_dump.c into two or more files (Mr. Pavel Golub's advice). However, some functions do have return values, e.g. static DumpableObject *createBoundaryObjects(void); I thought these functions would have to return values through function arguments, by passing pointers, when used as a library. But the question is: if I did so, the function prototypes would change, and Mr. Pavel Golub said it's better to keep all functions with the same declarations, so that we have only one code base for both the console application and the library. I think he is right. But then, how can I deal with this situation? From my point of view, I can't malloc a variable (e.g. a DumpableObject) in one library (e.g. pg_dumplib.so) and then return its pointer to another library (or an executable program). Maybe it's not safe(?), or could cause a memory leak(?). I'm not sure. Do you have any suggestions? Best wishes, Shuai
Re: [HACKERS] [BUGS] replication_timeout not effective
On 2013/04/10 23:44, Andres Freund and...@2ndquadrant.com wrote: [...] Depending on where it's hanging, a normal SELECT pg_terminate_backend(pid); might do it.

Great! It worked. Thank you very much.
Otherwise you will have to wait for the operating system's TCP timeout. Regards, Huong DM
Re: [HACKERS] Problem with background worker
Marc Cousin wrote: On 20/03/2013 16:33, Alvaro Herrera wrote: Ah. The reason for this problem is that the statement start time (which also sets the transaction start time, when it's the first statement) is set by postgres.c, not the transaction-control functions in xact.c. So you'd need to add a SetCurrentStatementStartTimestamp() call somewhere in your loop. Yes, that works. Thanks a lot! Maybe this should be added to the worker_spi example? Done in http://git.postgresql.org/pg/commitdiff/e543631f3c162ab5f6020b1d0209e0353ca2229a -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Behaviour of bgworker with SIGHUP
Guillaume Lelarge wrote: worker_spi.naptime is the naptime between two checks. worker_spi.total_workers is the number of workers to launch at postmaster start time. The first one can change with a SIGHUP; the last one obviously needs a restart. Many thanks. Pushed as http://git.postgresql.org/pg/commitdiff/e543631f3c162ab5f6020b1d0209e0353ca2229a along with a few other tweaks. I hope the code is more useful as a sample now. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Inconsistent DB data in Streaming Replication
On Wed, Apr 10, 2013 at 11:26 PM, Shaun Thomas stho...@optionshouse.com wrote: On 04/10/2013 09:10 AM, Tom Lane wrote: IOW, I wouldn't consider skipping the rsync even if I had a feature like this. Totally. Out in the field, we consider the old database corrupt the moment we fail over. Strange. If this is really true, shared-disk failover is fundamentally broken, because the standby needs to start up with the shared "corrupted" database at failover. Also, we could not trust crash recovery at all if we adopted the same logic. I think there are cases where we can replay and reuse the old database even after PostgreSQL crashes. Regards, -- Fujii Masao
Re: [HACKERS] Inconsistent DB data in Streaming Replication
On 04/10/2013 11:40 AM, Fujii Masao wrote: Strange. If this is really true, shared disk failover solution is fundamentally broken because the standby needs to start up with the shared corrupted database at the failover. How so? Shared disk doesn't use replication. The point I was trying to make is that replication requires synchronization between two disparate servers, and verifying they have exactly the same data is a non-trivial exercise. Even a single transaction after a failover (effectively) negates the old server because there's no easy catch-up mechanism yet. Even if this isn't necessarily true, it's the safest approach IMO. -- Shaun Thomas OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604 312-676-8870 stho...@optionshouse.com
Re: [HACKERS] Inconsistent DB data in Streaming Replication
On Wed, Apr 10, 2013 at 11:16 PM, Andres Freund and...@2ndquadrant.com wrote: On 2013-04-10 10:10:31 -0400, Tom Lane wrote: Amit Kapila amit.kap...@huawei.com writes: On Wednesday, April 10, 2013 3:42 PM Samrat Revagade wrote: Sorry, this is incorrect. Streaming replication is continuous; the master is not waiting: whenever the master writes a data page it checks that the WAL record up to that LSN has been written on the standby.

I am not sure it will resolve the problem completely, as your old master can have some extra WAL beyond the new master for the same timeline. I don't remember exactly whether the timeline switch feature takes care of this extra WAL; Heikki can confirm this point. Also, I think this can serialize the flush of data pages in checkpoint/bgwriter, which is currently not the case.

Yeah. TBH this entire discussion seems to be "let's cripple performance in the normal case so that we can skip doing an rsync when resurrecting a crashed, failed-over master". This is not merely optimizing for the wrong thing, it's positively hazardous. After a fail-over, you should be wondering whether it's safe to resurrect the old master at all, not about how fast you can bring it back up without validating its data. IOW, I wouldn't consider skipping the rsync even if I had a feature like this.

Agreed. Especially as in situations where you fail over in a planned way, e.g. for a hardware upgrade, you can avoid the need to resync with a little bit of care. It's really worth documenting that. So it's mostly in catastrophic situations that this becomes a problem, and in those you really should resync -- and it's a good idea not to use a plain rsync but rsync --checksum or similar.

If the database is very large, rsync --checksum takes very long. And I'm concerned that most data pages on the master have checksums different from those on the standby because of commit hint bits. I'm not sure how rsync --checksum can speed up the backup after failover.
Regards, -- Fujii Masao
Re: [HACKERS] Inconsistent DB data in Streaming Replication
On Thu, Apr 11, 2013 at 1:44 AM, Shaun Thomas stho...@optionshouse.com wrote: On 04/10/2013 11:40 AM, Fujii Masao wrote: Strange. If this is really true, shared disk failover solution is fundamentally broken because the standby needs to start up with the shared corrupted database at the failover. How so? Shared disk doesn't use replication. The point I was trying to make is that replication requires synchronization between two disparate servers, and verifying they have exactly the same data is a non-trivial exercise. Even a single transaction after a failover (effectively) negates the old server because there's no easy catch-up mechanism yet.

Hmm... ISTM what Samrat is proposing can resolve the problem. That is, if we can assume that any data page which has not been replicated to the standby has not been written on the master, the new standby (i.e., the old master) can safely catch up with the new master (i.e., the old standby). In this approach, of course, the new standby might have some WAL records which the new master doesn't have, so before starting up the new standby we need to remove all the WAL files on the new standby and retrieve the WAL files from the new master. But what's the problem in his approach? Regards, -- Fujii Masao
Re: [HACKERS] Inconsistent DB data in Streaming Replication
On Wed, Apr 10, 2013 at 7:44 PM, Shaun Thomas stho...@optionshouse.com wrote: On 04/10/2013 11:40 AM, Fujii Masao wrote: Strange. If this is really true, shared disk failover solution is fundamentally broken because the standby needs to start up with the shared corrupted database at the failover. How so? Shared disk doesn't use replication. The point I was trying to make is that replication requires synchronization between two disparate servers, and verifying they have exactly the same data is a non-trivial exercise. Even a single transaction after a failover (effectively) negates the old server because there's no easy catch-up mechanism yet. Even if this isn't necessarily true, it's the safest approach IMO.

We already rely on WAL-before-data to ensure correct recovery. What is proposed here is to slightly redefine it to require WAL to be replicated before it is considered to be flushed. This ensures that no data page on disk differs from the WAL that the slave has. The machinery to do this is already mostly there: we already wait for WAL flushes and we know the write location on the slave. The second requirement is that we never start up as master and we don't trust any local WAL. This is actually how pacemaker clusters work; you would only need to amend the RA to wipe the WAL and configure postgresql with restart_after_crash = false. It would be very helpful in restoring HA capability after failover if we didn't have to read through the whole database after a VM goes down and is migrated with the shared disk onto a new host. Regards, Ants Aasma -- Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de
Re: [HACKERS] Inconsistent DB data in Streaming Replication
Ants Aasma a...@cybertec.at writes: We already rely on WAL-before-data to ensure correct recovery. What is proposed here is to slightly redefine it to require WAL to be replicated before it is considered to be flushed. This ensures that no data page on disk differs from the WAL that the slave has. [...] The problem with this is it's making an idealistic assumption that a crashed master didn't do anything wrong or lose/corrupt any data during its crash. As soon as you realize that's an unsafe assumption, the whole thing becomes worthless to you. If the idea had zero implementation cost, I would say sure, let people play with it until they find out (probably the hard way) that it's a bad idea. But it's going to introduce, at the very least, additional complexity into a portion of the system that is critical and plenty complicated enough already. That being the case, I don't want it there at all, not even as an option. regards, tom lane
Re: [HACKERS] Enabling Checksums
On Wed, 2013-04-10 at 11:01 +0300, Ants Aasma wrote: I think we should first deal with using it for page checksums, and if future versions want to reuse some of the code for WAL checksums then we can rearrange the code. Sounds good to me, although I expect we at least want any assembly to be in a separate file (if the specialization makes it into 9.3). Regards, Jeff Davis
Re: [HACKERS] Inconsistent DB data in Streaming Replication
On 2013-04-10 18:46, Fujii Masao wrote: [...] If the database is very large, rsync --checksum takes very long. And I'm concerned that most data pages on the master have checksums different from those on the standby because of commit hint bits. I'm not sure how rsync --checksum can speed up the backup after failover.
rsync --checksum alone may not, but rsync --inplace may speed up the backup a lot. Regards, -- Zoltán Böszörményi Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt, Austria Web: http://www.postgresql-support.de http://www.postgresql.at/
Re: [HACKERS] Inconsistent DB data in Streaming Replication
On 2013-04-10 20:39:25 +0200, Boszormenyi Zoltan wrote: [...]
I'm not sure how rsync --checksum can speed up the backup after failover. It's not about speed, it's about correctness. Greetings, Andres Freund
[HACKERS] Re: [COMMITTERS] pgsql: Get rid of USE_WIDE_UPPER_LOWER dependency in trigram constructi
On 04/08/2013 10:11 AM, Dimitri Fontaine wrote: Tom Lane t...@sss.pgh.pa.us writes: If there is anybody still using Postgres on machines without wcstombs() or towlower(), and they have non-ASCII data indexed by pg_trgm, they'll need to REINDEX those indexes after pg_upgrade to 9.3, else searches may fail incorrectly. It seems likely that there are no such installations, though.

Those conditions seem just complex enough to require a test script that will check them for you. What if we created a new binary responsible for auto-checking all those release-note items that are possible to machine-check, then issued a WARNING containing the URL of the release notes you should be reading, plus a SQL script (ala pg_upgrade) to run after upgrade?

$ pg_checkupgrade -d connection=string upgrade.sql
NOTICE: checking 9.3 upgrade release notes
WARNING: RN-93-0001 index idx_trgm_abc is not on-disk compatible with 9.3
WARNING: TN-93-0012 …
WARNING: This script is NOT comprehensive, read release notes at …

The target version would be hard-coded in the binary itself for easier maintenance, and the proposal includes a unique identifier for any release-note-worthy warning that we know about, which would be included in the output of the program. I think most of the checks would only have to be SQL code, and some of them might include running some binary code on the server side. When that's possible, we could maybe expose that binary code in a server-side extension so as to make the client-side binary's life easier. That would also be an excuse for the project to install some upgrade material on the old server, which has been discussed in the past for preparing pg_upgrade when we have a page format change.

Given that something like this will also have to be dealt with by pg_upgrade, why not fold it into that (like into -c) completely and recommend running that? On the flip side, if people don't read the release notes they will also not run any kind of binary/script mentioned there...
Stefan
Re: [HACKERS] bgworker sigusr1 handler
Robert Haas wrote: Just for fun, I implemented a toy background worker tonight using the new bgworker framework. Generally, it went well, and I'm pleased with the design of the new facility. Thanks. However, I did notice one oddity. I initialized the worker flags like this: worker.bgw_flags = BGWORKER_SHMEM_ACCESS; And... latches didn't work. It turns out that if you request database access, then the SIGUSR1 handler is set to procsignal_sigusr1_handler, which is fine. But if you don't, it gets set to SIG_IGN. And the result of *that* is that if someone sets a latch for which the background process is waiting, the background process fails to notice. Now, once you understand what's going on here, it's not hard to work around. But it seems to me that it would be a saner default to set the signal handler to something like the bgwriter handler, which just calls latch_sigusr1_handler. Sounds sensible -- done that way. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] SIGHUP not received by custom bgworkers if postmaster is notified
Michael Paquier wrote: Hi all, While playing with custom background workers, I noticed that postmaster does not notify its registered bgworkers if it receives SIGHUP, so you have to send a SIGHUP directly to the bgworker process to notify it. Signal handling is correctly done for SIGQUIT and SIGTERM for shutdown only. Attached is a patch fixing that; I simply added a call to SignalUnconnectedWorkers in SIGHUP_handler:postmaster.c. Thanks, applied. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] SIGHUP not received by custom bgworkers if postmaster is notified
Michael Paquier wrote: Hi all, Please find attached a simple example of a bgworker that logs a message each time a SIGTERM or SIGHUP signal is received by it: - hello signal: processed SIGHUP when SIGHUP is handled by my example - hello signal: processed SIGTERM when SIGTERM is handled by my example I committed some improvements to worker_spi this morning that I think demonstrate signal handling capabilities well enough, which I think is what your submitted code would do. If you see more use for a separate body of sample worker code, by all means do submit that. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Enabling Checksums
On 10 April 2013 11:15, Ants Aasma a...@cybertec.at wrote: * We might even be able to calculate a CRC32 checksum for normal WAL records, and use Ants' checksum for full page writes (only). So checking the WAL checksum would then be to confirm the header passes CRC32 and then re-check the Ants checksum of each backup block. If we ensure that the checksum on the page is correct when we do a full page write, then we could include only the checksum field in the WAL CRC. When reading WAL we would first check that the CRC is correct and then verify the page checksum.

OK, so we have a single combined "calculate a checksum for a block" function. That uses Jeff's zeroing trick and Ants' bulk-oriented performance optimization. For buffer checksums we simply calculate for the block. For WAL full page writes, we first set the checksums for all defined buffers, then calculate the checksum of the remaining data plus the pd_checksum field from each block using the normal WAL CRC32. Seems good to me. One set of fast code. And it avoids the weirdness that the checksum stored on the full page would otherwise actually be wrong. It also means that the WAL checksum calculation includes the hole, yet we do not include the data for the hole. So we have to do an extra copy when restoring the backup block. Comments? -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] [GSOC] questions about idea rewrite pg_dump as library
On 4/10/13 10:54 AM, Shuai Fan wrote: I'm interested in the idea Rewrite (add) pg_dump and pg_restore utilities as libraries (.so, .dll, .dylib). The pg_dump code is a giant mess, and refactoring it as a library is perhaps not a project for a new hacker. Independent of that, I think the first consideration in such a project would be: who else would be using that library? What are the use cases? And then come up with an interface around that, and then see about refactoring pg_dump. I think the main use cases mentioned in connection with this idea are usually in the direction of finer-grained control over what gets dumped and how. But making pg_dump into a library would not necessarily address that.
Re: [HACKERS] [GSOC] questions about idea rewrite pg_dump as library
Peter Eisentraut wrote: I think the main use cases mentioned in connection with this idea are usually in the direction of finer-grained control over what gets dumped and how. But making pg_dump into a library would not necessarily address that. There's also the matter of embedding pg_dump into other programs. For example, calling the pg_dump executable from inside pgAdmin is a rather messy solution to the problem. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] [GSOC] questions about idea rewrite pg_dump as library
On 04/10/2013 11:02 PM, Alvaro Herrera wrote: Peter Eisentraut wrote: I think the main use cases mentioned in connection with this idea are usually in the direction of finer-grained control over what gets dumped and how. But making pg_dump into a library would not necessarily address that. There's also the matter of embedding pg_dump into other programs. For example, calling the pg_dump executable from inside pgAdmin is a rather messy solution to the problem. A natural solution to this seems to be to move most of pg_dump's functionality into the backend as functions, so we have pg_dump_xxx() for everything we want to dump, plus a topological sort function for getting the objects in the right order. The main thing left in the pg_dump library would be support for various ways to format the dump results into text, tar and dump files. Hannu
Re: [HACKERS] [GSOC] questions about idea rewrite pg_dump as library
Hannu Krosing wrote: Natural solution to this seems to be to move most of pg_dump's functionality into the backend as functions, so we have pg_dump_xxx() for everything we want to dump, plus a topological sort function for getting the objects in the right order. This idea doesn't work because of back-patch considerations (i.e., we would not be able to create the functions in back branches, and so this new style of pg_dump would only work with future server versions). So pg_dump itself would have to retain the capability to dump stuff from old servers. This seems unlikely to fly at all, because we'd then effectively be maintaining pg_dump in two places, both the backend and the pg_dump source code. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] [GSOC] questions about idea rewrite pg_dump as library
Alvaro Herrera alvhe...@2ndquadrant.com writes: Hannu Krosing wrote: Natural solution to this seems to be to move most of pg_dump's functionality into the backend as functions, so we have pg_dump_xxx() for everything we want to dump, plus a topological sort function for getting the objects in the right order. This idea doesn't work because of back-patch considerations (i.e., we would not be able to create the functions in back branches, and so this new style of pg_dump would only work with future server versions). So pg_dump itself would have to retain the capability to dump stuff from old servers. This seems unlikely to fly at all, because we'd then effectively be maintaining pg_dump in two places, both the backend and the pg_dump source code.

There are other issues too, in particular that most of the backend's code tends to work on SnapshotNow time whereas pg_dump would really prefer it was all done according to the transaction snapshot. We have got bugs of that ilk already in pg_dump, but we shouldn't introduce a bunch more. Doing this right would therefore mean that we'd have to write a lot of duplicative code in the backend, i.e., it's not clear that we gain any synergy by pushing the functionality over. It might simplify cross-backend-version issues (at least for backend versions released after we'd rewritten all that code) but otherwise I'm afraid it'd just be pushing the problems somewhere else. In any case, "push it to the backend" offers no detectable help with the core design issue here, which is figuring out what functionality needs to be exposed with what API. regards, tom lane
Re: [HACKERS] corrupt pages detected by enabling checksums
On Sat, Apr 6, 2013 at 10:44 AM, Andres Freund and...@2ndquadrant.com wrote:
> I feel pretty strongly that we shouldn't add any such complications to XLogInsert() itself; it's complicated enough already and it should be made simpler, not more complicated.

+1, emphatically. XLogInsert is a really nasty scalability bottleneck. We need to move as much logic out of that function as possible, and particularly out from under WALInsertLock.

...Robert
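The principle Robert is endorsing — do the CPU-heavy work before taking the lock, and hold the lock only for the short copy into shared state — can be sketched with a toy model. This is Python for brevity, with an invented "WAL buffer" and invented function names, not PostgreSQL's actual code:

```python
import threading

wal_buffer = []                    # stands in for the shared WAL buffer
insert_lock = threading.Lock()     # stands in for WALInsertLock

def xlog_insert_slow(record_parts):
    # Anti-pattern: assemble the record while holding the lock,
    # serializing all the CPU work behind it.
    with insert_lock:
        record = b"".join(record_parts)
        wal_buffer.append(record)

def xlog_insert_fast(record_parts):
    # Better: assemble the record outside the critical section;
    # the lock covers only the insertion into shared state.
    record = b"".join(record_parts)
    with insert_lock:
        wal_buffer.append(record)
```

Both produce the same result, but the second holds the contended lock for a fraction of the time, which is what matters under concurrency.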
Re: [HACKERS] [sepgsql 3/3] Add db_procedure:execute permission checks
Kohei KaiGai wrote:
> This patch adds sepgsql support for permission checks almost equivalent to the existing FUNCTION EXECUTE privilege.

While skimming this patch I noticed that you're using getObjectDescription() as the audit_name of objects. This may be a bit unstable; for example, consider that it's translated if lc_messages is set to something other than English. I would suggest that the object identity as introduced by commit f8348ea32ec8 is a better choice for this.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] SIGHUP not received by custom bgworkers if postmaster is notified
Thanks for committing the fix!

On Thu, Apr 11, 2013 at 4:11 AM, Alvaro Herrera alvhe...@2ndquadrant.com wrote:
> Michael Paquier wrote:
>> Hi all,
>> Please find attached a simple example of a bgworker that logs a message each time a SIGTERM or SIGHUP signal is received by it:
>> - "hello signal: processed SIGHUP" when SIGHUP is handled by my example
>> - "hello signal: processed SIGTERM" when SIGTERM is handled by my example
> I committed some improvements to worker_spi this morning that I think demonstrate signal-handling capabilities well enough, which I think is what your submitted code would do. If you see more use for a separate body of sample worker code, by all means do submit that.

Sure.
-- 
Michael
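For readers following along, the signal-handling pattern a background worker uses — the handler only records that the signal arrived (and, in a real worker, sets the process latch), while the main loop does the actual work — can be sketched generically. This is an illustrative Python model, not the actual C worker_spi code; all names are invented:

```python
import os
import signal

got_sighup = False  # stands in for the worker's "reload config" flag

def sighup_handler(signum, frame):
    # Keep the handler minimal: just set a flag. In a real bgworker the
    # handler would also set the process latch to wake the main loop.
    global got_sighup
    got_sighup = True

signal.signal(signal.SIGHUP, sighup_handler)

def worker_loop_iteration():
    # One iteration of the worker's main loop: notice the flag, "reload".
    global got_sighup
    if got_sighup:
        got_sighup = False
        return "processed SIGHUP"
    return "idle"
```

The essential property is that all nontrivial work happens in the main loop, where it is safe, not inside the signal handler.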
[HACKERS] synchronize_seqscans' description is a bit misleading
If I'm reading the code right [1], this GUC does not actually *synchronize* the scans, but instead just makes sure that a new scan starts from a block that was reported by some other backend performing a scan on the same relation. Since the backends scanning the relation may be processing the relation at different speeds, even though each one took the hint when starting the scan, they may end up being out of sync with each other. Even in a single query, there may be different scan nodes scanning different parts of the same relation, and even they don't synchronize with each other (and for good reason).

Imagining that all scans on a table are always synchronized may make some wrongly believe that adding more backends scanning the same table will not incur any extra I/O; that is, only one stream of blocks will be read from disk no matter how many backends you add to the mix. I noticed this when I was creating partition tables, and each of those was a CREATE TABLE AS SELECT FROM original_table (to avoid WAL generation), and running more than 3 such transactions caused the disk read throughput to behave unpredictably, sometimes even dipping below 1 MB/s for a few seconds at a stretch.

Please note that I am not complaining about the implementation, which I think is the best we can do without making backends wait for each other. It's just that the documentation [2] implies that the scans are synchronized through the entire run, which is clearly not the case. So I'd like the docs to be improved to reflect that. How about something like:

    synchronize_seqscans (boolean)

        This allows sequential scans of large tables to start from a point
        in the table that is already being read by another backend. This
        increases the probability that concurrent scans read the same block
        at about the same time and hence share the I/O workload. Note that,
        due to the difference in speeds of processing the table, the
        backends may eventually get out of sync, and hence stop sharing the
        I/O workload. When this is enabled, ... The default is on.

Best regards,

[1] src/backend/access/heap/heapam.c
[2] http://www.postgresql.org/docs/9.2/static/runtime-config-compatible.html#GUC-SYNCHRONIZE-SEQSCANS
-- 
Gurjeet Singh
http://gurjeet.singh.im/
EnterpriseDB Inc.
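The drift described above is easy to model: every scan takes the same starting hint, but each then advances at its own speed, so the positions separate over time. A toy calculation (not PostgreSQL code; all names and numbers are invented for illustration):

```python
def scan_positions(start_block, speeds, ticks, nblocks):
    """Each scan starts at the reported block (the only thing the hint
    guarantees) and then advances at its own pages-per-tick speed, so
    the scans spread apart, i.e. fall out of sync."""
    return [(start_block + s * ticks) % nblocks for s in speeds]
```

With three scans processing 10, 11, and 13 pages per tick, after 50 ticks they are spread across 150 blocks even though they all started at the same place, which is the behavior the proposed wording is trying to convey.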
Re: [HACKERS] [DOCS] synchronize_seqscans' description is a bit misleading
Gurjeet Singh gurj...@singh.im writes:
> If I'm reading the code right [1], this GUC does not actually *synchronize* the scans, but instead just makes sure that a new scan starts from a block that was reported by some other backend performing a scan on the same relation.

Well, that's the only *direct* effect, but ...

> Since the backends scanning the relation may be processing the relation at different speeds, even though each one took the hint when starting the scan, they may end up being out of sync with each other.

The point you're missing is that the synchronization is self-enforcing: whichever backend gets ahead of the others will be the one forced to request (and wait for) the next physical I/O. This will naturally slow down the lower-CPU-cost-per-page scans. The other ones tend to catch up during the I/O operation.

The feature is not terribly useful unless I/O costs are high compared to the CPU cost-per-page. But when that is true, it's actually rather robust. Backends don't have to have exactly the same per-page processing cost, because pages stay in shared buffers for a while after the current scan leader reads them.

> Imagining that all scans on a table are always synchronized may make some wrongly believe that adding more backends scanning the same table will not incur any extra I/O; that is, only one stream of blocks will be read from disk no matter how many backends you add to the mix. I noticed this when I was creating partition tables, and each of those was a CREATE TABLE AS SELECT FROM original_table (to avoid WAL generation), and running more than 3 such transactions caused the disk read throughput to behave unpredictably, sometimes even dipping below 1 MB/s for a few seconds at a stretch.

It's not really the scans that's causing that to be unpredictable, it's the write I/O from the output side, which is forcing highly nonsequential behavior (or at least I suspect so ... how many disk units were involved in this test?)

			regards, tom lane
Re: [HACKERS] Enabling Checksums
On Wed, 2013-04-10 at 20:17 +0100, Simon Riggs wrote:
> OK, so we have a single combined "calculate a checksum for a block" function. That uses Jeff's zeroing trick and Ants' bulk-oriented performance optimization.
> For buffer checksums we simply calculate for the block.

Sounds good.

> For WAL full page writes, we first set the checksums for all defined buffers, then calculate the checksum of remaining data plus the pd_checksum field from each block using the normal WAL CRC32.

Seems good to me. One set of fast code.

> And it avoids the weirdness that the checksum stored on the full page is actually wrong.

Oh, that's a nice benefit.

> It also means that the WAL checksum calculation includes the hole, yet we do not include the data for the hole. So we have to do an extra copy when restoring the backup block.

I like this, but it sounds like there is some room for discussion on some of these points. I assume changes to the WAL checksums are 9.4 material? I'm satisfied with SIMD data checksums in 9.3 and that we have a plan for using SIMD for WAL checksums later.

Regards,
	Jeff Davis
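The "zeroing trick" mentioned above — computing a block's checksum as though its own checksum field were zero, so the result can be stored in place and later verified without copying the page first — can be sketched as follows. This uses plain 32-bit FNV-1a as a stand-in hash, and assumes a 2-byte checksum field at offset 8 (where PageHeaderData keeps pd_checksum); it is an illustration, not PostgreSQL's actual SIMD checksum algorithm:

```python
def page_checksum(page, cksum_off=8, cksum_len=2):
    """Checksum the page, treating the checksum field's own bytes as
    zero, so the stored checksum never affects the computed value."""
    h = 0x811C9DC5  # FNV-1a offset basis
    for i, b in enumerate(page):
        if cksum_off <= i < cksum_off + cksum_len:
            b = 0  # the zeroing trick: skip the on-page checksum field
        h = ((h ^ b) * 0x01000193) & 0xFFFFFFFF  # 32-bit FNV-1a step
    return h
```

Because the field is zeroed during the computation, writing the checksum into the page does not change what the verifier computes, so no scratch copy of the page is needed.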
[HACKERS] ObjectClass/ObjectType mixup
src/backend/catalog/dependency.c:213:
    EventTriggerSupportsObjectType(getObjectClass(thisobj)))
src/backend/commands/event_trigger.c:1014:
    Assert(EventTriggerSupportsObjectType(getObjectClass(object)));

getObjectClass() returns type ObjectClass, but EventTriggerSupportsObjectType() takes type ObjectType, which are not the same thing. I think this code might be wrong.
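The hazard here is that C enums are just integers, so passing an ObjectClass where an ObjectType is expected compiles cleanly but silently checks the wrong member. A cut-down model of the mixup (Python IntEnums with invented members and values; the real enums differ):

```python
from enum import IntEnum

class ObjectClass(IntEnum):
    # stand-in for the catalog-oriented ObjectClass
    OCLASS_CLASS = 0
    OCLASS_PROC = 1

class ObjectType(IntEnum):
    # stand-in for the parser-oriented ObjectType: the same integers
    # carry entirely different meanings
    OBJECT_AGGREGATE = 0
    OBJECT_CAST = 1

def event_trigger_supports_object_type(objtype):
    # Pretend event triggers support everything except casts.
    return objtype != ObjectType.OBJECT_CAST
```

Since OCLASS_PROC and OBJECT_CAST happen to share the integer 1, the mixed-up call quietly answers a question about casts when the caller meant procedures — exactly the kind of silent misbehavior the report is warning about.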
Re: [HACKERS] [DOCS] synchronize_seqscans' description is a bit misleading
On Wed, Apr 10, 2013 at 11:10 PM, Tom Lane t...@sss.pgh.pa.us wrote:
> Gurjeet Singh gurj...@singh.im writes:
>> If I'm reading the code right [1], this GUC does not actually *synchronize* the scans, but instead just makes sure that a new scan starts from a block that was reported by some other backend performing a scan on the same relation.
> Well, that's the only *direct* effect, but ...
> The point you're missing is that the synchronization is self-enforcing: whichever backend gets ahead of the others will be the one forced to request (and wait for) the next physical I/O. This will naturally slow down the lower-CPU-cost-per-page scans. The other ones tend to catch up during the I/O operation.

Got it. So far, so good.

Let's consider a pathological case where a scan is performed by a user-controlled cursor, whose scan speed depends on how fast the user presses the Next button; such a scan is quickly going to fall out of sync with other scans. Moreover, if a new scan happens to pick up the block reported by this slow scan, then that new scan may have to read blocks off the disk afresh. So, again, it is not guaranteed that all the scans on a relation will synchronize with each other. Hence my proposal to include the term 'probability' in the definition.

> The feature is not terribly useful unless I/O costs are high compared to the CPU cost-per-page. But when that is true, it's actually rather robust. Backends don't have to have exactly the same per-page processing cost, because pages stay in shared buffers for a while after the current scan leader reads them.

Agreed. Even if the buffer has been evicted from shared_buffers, there's a high likelihood that the scan that's close on the heels of the others will fetch it from the FS cache.

>> Imagining that all scans on a table are always synchronized may make some wrongly believe that adding more backends scanning the same table will not incur any extra I/O; that is, only one stream of blocks will be read from disk no matter how many backends you add to the mix. I noticed this when I was creating partition tables, and each of those was a CREATE TABLE AS SELECT FROM original_table (to avoid WAL generation), and running more than 3 such transactions caused the disk read throughput to behave unpredictably, sometimes even dipping below 1 MB/s for a few seconds at a stretch.
> It's not really the scans that's causing that to be unpredictable, it's the write I/O from the output side, which is forcing highly nonsequential behavior (or at least I suspect so ... how many disk units were involved in this test?)

You may be right. I don't have access to the system anymore, and I don't remember the disk layout, but it's quite possible that write operations were causing the read throughput to drop. I did try to reproduce the behaviour on my laptop with up to 6 backends doing pure reads on a table that was multiple times the system RAM, but I could not get them to get out of sync.

-- 
Gurjeet Singh
http://gurjeet.singh.im/
EnterpriseDB Inc.
Re: [HACKERS] [DOCS] synchronize_seqscans' description is a bit misleading
Gurjeet Singh gurj...@singh.im writes:
> On Wed, Apr 10, 2013 at 11:10 PM, Tom Lane t...@sss.pgh.pa.us wrote:
>> The point you're missing is that the synchronization is self-enforcing:
> Let's consider a pathological case where a scan is performed by a user-controlled cursor, whose scan speed depends on how fast the user presses the Next button; such a scan is quickly going to fall out of sync with other scans. Moreover, if a new scan happens to pick up the block reported by this slow scan, then that new scan may have to read blocks off the disk afresh.

Sure --- if a backend stalls completely, it will fall out of the synchronized group. And that's a good thing; we'd surely not want to block the other queries while waiting for a user who just went to lunch.

> So, again, it is not guaranteed that all the scans on a relation will synchronize with each other. Hence my proposal to include the term 'probability' in the definition.

Yeah, it's definitely not guaranteed in any sense. But I don't really think your proposed wording is an improvement. The existing wording isn't promising guaranteed sync either, to my eyes. Perhaps we could compromise on, say, changing "so that concurrent scans read the same block at about the same time" to "so that concurrent scans tend to read the same block at about the same time", or something like that. I don't mind making it sound a bit more uncertain, but I don't think that we need to emphasize the probability of failure.

			regards, tom lane
Re: [HACKERS] [GSOC] questions about idea rewrite pg_dump as library
Hello, 帅.

You wrote:

帅 Hi all,
帅 I'd like to introduce myself to the dev community. I am Shuai Fan, a student from Dalian University of Technology (DLUT for short), China. And I am interested in working with the PostgreSQL project in GSoC 2013.
帅 I'm interested in the idea "Rewrite (add) pg_dump and pg_restore utilities as libraries (.so, .dll, .dylib)".
帅 These days, I had a talk with Mr. Pavel Golub by email, the author of this post, and asked some questions about this idea. He advised me to post the question to this mailing list.
帅 My question is:
帅 There are lots of functions in pg_dump.c. If I rewrite this file as a library, I should split pg_dump.c into two or more files (Mr. Pavel Golub's advice). However, some functions do have a return value, e.g.
帅 static DumpableObject *createBoundaryObjects(void);
帅 I thought these functions must return their value through a function argument by passing a pointer to it, when used as a library. But the question is: if I did so, the function prototypes would be changed. And Mr. Pavel Golub said it's better to keep all functions with the same declaration, so that we will have only one code base for both the console application and the library. I think he is right.
帅 But then, how can I deal with this situation? From my point of view, I can't malloc a variable (e.g. DumpableObject) in one library (e.g. pg_dumplib.so) and then return its pointer to another library (or executable program). Maybe it's not safe(?). Or it has something to do with memory leaks(?). I'm not sure.
帅 Do you have any suggestions?

From my point of view the new library should export only two functions:

1. The execution function:

   ExecStatusType PGdumpdbParams(const char * const *keywords,
                                 const char * const *values);

   The return type may be something else, but ExecStatusType seems to be OK for this purpose: PGRES_TUPLES_OK for success, PGRES_FATAL_ERROR for failure. The parameters will remain the same as the usual command-line options for pg_dump. Thus we will have less work for existing applications, e.g. pgAdmin.

2. The logging or notice-processing function:

   typedef void (*PGdumplogProcessor) (char *relname, int done, char *message);
   PGdumplogProcessor PQsetNoticeProcessor(PGdumplogProcessor proc, void *arg);

   The purpose of this function is to process the output of the dump. The first argument is a callback function which accepts information about the current relname (or operation, or stage); done indicates how much work is done (for progress bars etc.); message contains some extra information.

That's all! Only two functions. Indeed, we don't need all those low-level dump functions like createBoundaryObjects etc. There will be only one entry point to the whole logic. And if one wants only one table, function or schema, a combination of the correct parameters should be passed to PGdumpdbParams.

帅 Best wishes,
帅 Shuai

-- 
With best wishes,
 Pavel                          mailto:pa...@gf.microolap.com
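To make the proposed two-function surface concrete, here is a sketch of how a caller might drive such a dump library. This is a Python stub whose names (pgdump_db_params, pgdump_set_log_processor) merely mirror Pavel's C proposal; the functions, relations, and behavior are entirely hypothetical:

```python
_progress_cb = None  # the registered notice/progress callback, if any

def pgdump_set_log_processor(callback):
    """Register a callback(relname, done, message) for dump progress,
    analogous to the proposed notice-processor registration."""
    global _progress_cb
    _progress_cb = callback

def pgdump_db_params(keywords, values):
    """Stub 'dump': pairs keyword/value options the way pg_dump's
    command line would, walks two fake relations, reports progress,
    and returns a libpq-style status string."""
    opts = dict(zip(keywords, values))
    for done, relname in ((50, "public.t1"), (100, "public.t2")):
        if _progress_cb is not None:
            _progress_cb(relname, done, "format=%s" % opts.get("format"))
    return "PGRES_TUPLES_OK"
```

The appeal of this shape is that a host application like pgAdmin only wires up one entry point and one callback, instead of linking against dozens of internal dump functions.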