Re: [HACKERS] bgworker sigusr1 handler

2013-04-10 Thread Michael Paquier
On Wed, Apr 10, 2013 at 12:15 PM, Robert Haas robertmh...@gmail.com wrote:

 Just for fun, I implemented a toy background worker tonight using the
 new bgworker framework.  Generally, it went well, and I'm pleased with
 the design of the new facility. However, I did notice one oddity.  I
 initialized the worker flags like this:

 worker.bgw_flags = BGWORKER_SHMEM_ACCESS;

 And... latches didn't work.  It turns out that if you request database
 access, then the SIGUSR1 handler is set to procsignal_sigusr1_handler,
 which is fine.  But if you don't, it gets set to SIG_IGN.  And the
 result of *that* is that if someone sets a latch for which the
 background process is waiting, the background process fails to notice.

 Now, once you understand what's going on here, it's not hard to work
 around.  But it seems to me that it would be a saner default to set
 the signal handler to something like the bgwriter handler, which just
 calls latch_sigusr1_handler.


There is currently a bug with bgworkers and SIGHUP: if the postmaster receives
a SIGHUP, it does not notify its registered bgworkers:
http://www.postgresql.org/message-id/CAB7nPqQ-ccL9Q7wxpWNaG5Zs-hMLh_ayQb=rM2=+pxtwd+8...@mail.gmail.com

You can try the example I provided, then reload parameters with pg_ctl
reload, and you will notice that bgworkers do not process SIGHUP as a
normal backend would.
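
For reference, here is a minimal sketch of the workaround Robert describes
above, for a worker registered with only BGWORKER_SHMEM_ACCESS: install a
SIGUSR1 handler that forwards to latch_sigusr1_handler() so that WaitLatch()
wakes up. This assumes the 9.3 bgworker and latch APIs; the function names
are illustrative, not from any posted patch.

#include "postgres.h"
#include "postmaster/bgworker.h"
#include "storage/latch.h"
#include "storage/proc.h"

static void
worker_sigusr1_handler(SIGNAL_ARGS)
{
	int			save_errno = errno;

	latch_sigusr1_handler();
	errno = save_errno;
}

static void
worker_main(Datum main_arg)
{
	/* Override the SIG_IGN default so latch wakeups are not lost. */
	pqsignal(SIGUSR1, worker_sigusr1_handler);
	BackgroundWorkerUnblockSignals();

	for (;;)
	{
		WaitLatch(&MyProc->procLatch, WL_LATCH_SET | WL_TIMEOUT, 10000L);
		ResetLatch(&MyProc->procLatch);

		/* ... do the worker's actual work here ... */
	}
}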
-- 
Michael


Re: [HACKERS] page 1 of relation global/11787 was uninitialized

2013-04-10 Thread Albe Laurenz
Tom Lane wrote:
 Andres Freund and...@2ndquadrant.com writes:
 Afaik we don't have any debugging utility to dump the pg_filenode.map
 contents?
 
 Hardly need one ... od -t d4 $PGDATA/global/pg_filenode.map
 is readable enough, though it does leave you still having to
 map the numeric OIDs back to names.  The OIDs will be stable though.

Shouldn't that be od -t u4 $PGDATA/global/pg_filenode.map?

Yours,
Laurenz Albe




Re: [HACKERS] lwlock contention with SSI

2013-04-10 Thread Dan Ports
On Tue, Apr 09, 2013 at 07:49:51PM -0400, Robert Haas wrote:
 These locks are all SSI-related and they're all really hot.  Lock 28
 is SerializableXactHashLock and lock 29 is
 SerializableFinishedListLock; both are acquired an order of magnitude
 more often than any non-SSI lock, and cause two orders of magnitude
 more blocking than any other lock whatsoever.  Lock 30 is
 SerializablePredicateLockListLock, which has no exclusive lock
 acquisitions at all on this test, but the shared acquisitions result
 in significant spinlock contention.

This matches what I saw when I looked into this a while ago. I even
started sketching out some plans of how we might deal with it, but
unfortunately I never had much time to work on it, and that seems
unlikely to change any time soon. :-\

As it is, pretty much any operation involving SSI requires acquiring
SerializableXactHashLock (usually exclusive), except for checking
whether a read or write indicates a conflict. That includes starting
and ending a transaction.

Two things make this hard to fix:
 - SSI is about checking for rw-conflicts, which are inherently about
   *pairs* of transactions. This makes it hard to do fine-grained
   locking, because a lot of operations involve looking at or modifying
   the conflict list of more than one transaction.
 - SerializableXactHashLock protects many things. Besides the 
   SERIALIZABLEXACT structures themselves, there's also the free lists
   for SERIALIZABLEXACTs and RWConflicts, the SerializableXidHash
   table, the latest SxactCommitSeqno and SxactGlobalXmin, etc.

I'm trying to swap back in my notes about how to address this. It is
bound to be a substantial project, however.

Dan

-- 
Dan R. K. Ports        UW CSE        http://drkp.net/




Re: [HACKERS] Enabling Checksums

2013-04-10 Thread Ants Aasma
On Wed, Apr 10, 2013 at 4:36 AM, Jeff Davis pg...@j-davis.com wrote:
 On Tue, 2013-04-09 at 05:35 +0300, Ants Aasma wrote:
 And here you go. I decided to be verbose with the comments as it's
 easier to delete a comment than to write one. I also left in a huge jumble
 of macros to calculate the contents of a helper var during compile
 time. This can easily be replaced with the calculated values once we
 settle on specific parameters.

 Great, thank you.

 Is it possible to put an interface over it that somewhat resembles the
 CRC checksum (INIT/COMP/FIN)? It looks a little challenging because of
 the nature of the algorithm, but it would make it easier to extend to
 other places (e.g. WAL). It doesn't have to match the INIT/COMP/FIN
 pattern exactly.

The algorithm has 128 bytes of state. Storing it on every step would
negate any performance gains and C doesn't have a way to keep it in
registers. If we can trust that the compiler doesn't clobber xmm
registers then it could be split up into the following pieces:
1. init
2. process 128 bytes
3. aggregate state
4. mix in block number

Even if we don't split it up, factoring out steps 1..3 would make
sense, as there is no point in making step 4 platform-specific and it
would otherwise just be duplicated.
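
To make that concrete, the split could expose something along these lines (a
sketch only; all names here are made up and this is not the interface from
the patch):

#include <stdint.h>

/* The 128 bytes of running state mentioned above, kept in one struct. */
typedef struct SimdChecksumState
{
	uint16_t	sums[64];
} SimdChecksumState;

extern void simd_checksum_init(SimdChecksumState *state);                       /* step 1 */
extern void simd_checksum_block(SimdChecksumState *state, const char *chunk);   /* step 2: one 128-byte chunk */
extern uint32_t simd_checksum_aggregate(const SimdChecksumState *state);        /* step 3 */
extern uint16_t simd_checksum_finish(uint32_t aggregated, uint32_t blkno);      /* step 4: mix in the block number */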

 Regardless, we should have some kind of fairly generic interface and
 move the code to its own file (e.g. checksum.c).

 To make the interface more generic, would it make sense to require the
 caller to save the page's stored checksum and zero it before
 calculating? That would avoid the awkwardness of avoiding the
 pd_checksum field. For example (code for illustration only):

Yes, that would help make it reusable.
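
The caller-side pattern being discussed would look roughly like the following
sketch. The pd_checksum name mirrors the real page-header field, but the
struct and compute_checksum() here are illustrative stand-ins, not PostgreSQL
code:

#include <stdint.h>

typedef struct PageHeaderSketch
{
	uint16_t	pd_checksum;	/* stand-in for the stored checksum field */
	/* ... rest of the header and the page contents follow ... */
} PageHeaderSketch;

extern uint16_t compute_checksum(const char *data, uint32_t len, uint32_t blkno);

static uint16_t
checksum_page(char *page, uint32_t pagesize, uint32_t blkno)
{
	PageHeaderSketch *hdr = (PageHeaderSketch *) page;
	uint16_t	saved = hdr->pd_checksum;
	uint16_t	sum;

	hdr->pd_checksum = 0;							/* caller zeroes the stored checksum ... */
	sum = compute_checksum(page, pagesize, blkno);	/* ... checksums the whole page ... */
	hdr->pd_checksum = saved;						/* ... and restores the stored value */

	return sum;
}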

 That would make it possible to use a different word size -- is uint16
 optimal or would a larger word be more efficient?

Larger words would have better mixing, as multiplies mix 4 bytes at a
time instead of 2. Performance of the vectorized version will be the
same, as it is tied to the vector length, but the unvectorized version
will get a speedup. The reason I picked 16 bits is not actually related
to the checksum hole but because the pmullw instruction is guaranteed to
be available on all 64-bit CPUs, whereas pmulld is only available on the
latest CPUs.

 It looks like the block size needs to be an even multiple of
 sizeof(uint16)*NSUMS. And it also looks like it's hard to combine
 different regions of memory into the same calculation (unless we want to
 just calculate them separately and XOR them or something). Does that
 mean that this is not suitable for WAL at all?

I think it would be possible to define a padding scheme for
irregularly sized memory segments where we would only need a lead-out
command for blocks that are not a multiple of 128 bytes. The
performance of it would need to be measured. All in all, it's not
really a great match for WAL. While all of the fast checksums process
many bytes in a single iteration, they still process an order of
magnitude fewer bytes and so have an easier time with irregularly
shaped blocks.

 Using SIMD for WAL is not a requirement at all; I just thought it might
 be a nice benefit for non-checksum-enabled users in some later release.

I think we should first deal with using it for page checksums and if
future versions want to reuse some of the code for WAL checksums then
we can rearrange the code.

Regards,
Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de




Re: [HACKERS] Enabling Checksums

2013-04-10 Thread Simon Riggs
On 10 April 2013 09:01, Ants Aasma a...@cybertec.at wrote:


  Using SIMD for WAL is not a requirement at all; I just thought it might
  be a nice benefit for non-checksum-enabled users in some later release.

 I think we should first deal with using it for page checksums and if
 future versions want to reuse some of the code for WAL checksums then
 we can rearrange the code.


We have essentially the same problem in both cases: we want to calculate a
checksum of BLCKSZ chunks of data, plus some smaller header data. We
currently use the same code for both cases and it makes sense to do the
same thing with any new code. This is also the *same* issue: when we set a
new hint bit we need to issue a full-page write in WAL, so we are calculating
checksums in two new places: XLOG_HINT records and data blocks.

A few technical points:

* We're taking a copy of the buffer, so Jeff's zero trick works safely, I
think.
* We can use a different algorithm for big and small blocks; we just need a
way to show we've done that, for example by setting the high-order bit of the
checksum (a sketch of this follows below).
* We might even be able to calculate a CRC32 checksum for normal WAL records,
and use Ants' checksum for full-page writes (only). Checking the WAL checksum
would then mean confirming that the header passes CRC32 and then re-checking
Ants' checksum of each backup block.

This work needs to happen now, since once the checksum algorithm is set we
won't easily be able to change it.
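
For illustration, the "flag it in the high-order bit" idea mentioned above
could look something like this (a sketch with made-up helper names, not
proposed code):

#include <stddef.h>
#include <stdint.h>

#define CHECKSUM_ALGO_BIG	0x8000	/* high-order bit marks the big-block algorithm */

/* Illustrative stand-ins for the two algorithms being discussed. */
extern uint16_t big_block_checksum(const char *data, size_t len, uint32_t blkno);
extern uint16_t small_block_checksum(const char *data, size_t len, uint32_t blkno);

static uint16_t
checksum_any(const char *data, size_t len, uint32_t blkno)
{
	if (len >= 128)		/* big (BLCKSZ-style) chunk: SIMD-friendly algorithm, flagged */
		return (uint16_t) (big_block_checksum(data, len, blkno) | CHECKSUM_ALGO_BIG);
	else				/* small chunk: the other algorithm, high bit kept clear */
		return (uint16_t) (small_block_checksum(data, len, blkno) & 0x7FFF);
}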

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] [BUGS] replication_timeout not effective

2013-04-10 Thread Amit Kapila
On Wednesday, April 10, 2013 1:49 PM, Dang Minh Huong wrote:
 Hi,

 Thank you for your quick reply.

 I'm trying to set the network-timeout-related parameters to terminate
 it.

 # I've tried setting postgresql.conf's tcp_keepalives_* but without success.

I have also tried those, but they didn't work; that's why I proposed
this feature for 9.3.

Please send mail to the community list; others can also help you if they
have ideas for avoiding such problems.

 On 2013/04/10 14:05, Amit Kapila amit.kap...@huawei.com wrote:

  On Wednesday, April 10, 2013 9:35 AM Dang Minh Huong wrote:
  Hi,
 
  I'm wondering if this is a bug in PostgreSQL.
 
  The PostgreSQL docs say that the replication_timeout parameter can "Terminate
 replication connections that are inactive longer than the specified
 number of milliseconds". But in my environment the sender process
 hangs (for several tens of minutes) if I turn off (by powering off) the
 standby PC while pg_basebackup is executing.
 
  Is this correct?
 
  From my debugging, the sender process is terminated when it receives
 SIGPIPE, but that comes too slowly (about 30 minutes after the standby PC
 went down).
 
  For such scenario's, new parameter wal_sender_timeout has been
 introduced in 9.3. Refer below:
  http://www.postgresql.org/docs/devel/static/runtime-config-
 replication.html#RUNTIME-CONFIG-REPLICATION-SENDER
 
  I am not sure how to get rid of this problem in 9.1.9
 
  With Regards,
  Amit Kapila.
 





Re: [HACKERS] Inconsistent DB data in Streaming Replication

2013-04-10 Thread Samrat Revagade
Is it one of the reasons why a fresh base backup is required when starting
the old master as a new standby? If yes, I agree with you. I've often heard
complaints about needing a backup when restarting the new standby. That's a
really big problem.

I think Fujii Masao is on the same page.



In the case of syncrep, the master just waits for confirmation from the
standby before returning to the client on commit.

Not just commit: you must stop any *writing* of the WAL records,
effectively killing any parallelism.
The main issue is that it will make *all* backends dependent on each sync
commit, essentially serialising all backends' commits, with the
serialisation *including* the latency of the round trip to the client. With
current sync streaming the other backends can continue to write WAL; with the
proposed approach you cannot write any records after the one waiting for an
ACK from the standby.



Let me rephrase the proposal in a more accurate manner:

Consider the following scenario:



(1) A client sends the COMMIT command to the master server.

(2) The master writes WAL record to disk

(3) The master writes the data page related to this transaction, i.e. via
checkpoint or bgwriter.

(4) The master sends WAL records continuously to the standby, up to the
commit WAL record.

(5) The standby receives the WAL records, writes them to disk, and then
replies with the ACK.

(6) The master returns a success indication to the client after it receives
the ACK.



If failover happens between (3) and (4), the WAL and DB data in the old
master are ahead of those in the new master. After failover, the new master
continues running new transactions independently of the old master. Then the
WAL records and DB data would become inconsistent between those two servers.
To resolve these inconsistencies, a backup of the new master needs to be
taken onto the new standby.


But taking a backup is not feasible for a large database, e.g. several TB
over a slow WAN.

So, to avoid this type of inconsistency without taking a fresh backup, we are
thinking of doing the following:



 I think that you can introduce a GUC specifying whether this extra check
is required to avoid a backup at failback.

Approach:

Introduce a new GUC option specifying whether to prevent PostgreSQL from
writing DB data before the corresponding WAL records have been replicated to
the standby. That is, if this GUC option is enabled, PostgreSQL waits for the
corresponding WAL records to be not only written to disk but also
replicated to the standby before writing DB data.


So the process becomes as follows:

(1) A client sends the COMMIT command to the master server.

(2) The master writes the commit WAL record to the disk.

(3) The master sends WAL records continuously to the standby, up to the
commit WAL record.

(4) The standby receives the WAL records, writes them to disk, and then
replies with the ACK.

(5) The master then forces a write of the data page related to this
transaction.

(6) The master returns a success indication to the client after it receives
the ACK.



While the master is waiting to force a write (point 5) for this data page,
streaming replication continues. Also, other data page writes are not
dependent on this particular page write, so data page writes are not
serialized.


Regards,

Samrat


Re: [HACKERS] [BUGS] replication_timeout not effective

2013-04-10 Thread Dang Minh Huong
Hi Amit,

Thank you for your consideration.

My project does not allow using 9.2 or 9.3.

In 9.3, it sounds like replication_timeout is replaced by wal_sender_timeout.
So if this is solved in 9.3, I think there is a way to terminate it.
I hope it is fixed in 9.1 soon.

Regards,

On 2013/04/10 18:33, Amit Kapila amit.kap...@huawei.com wrote:

 Sent: Wednesday, April 10, 2013 1:49 PM Dang Minh Huong wrote:
 To: Amit Kapila
 Subject: Re: [BUGS] replication_timeout not effective
 On Wednesday, April 10, 2013 1:49 PM
 Hi,
 
 Thank you for your soon reply.
 
 I'm trying to set the network timeout related parameters to terminate
 it.
 
 # i've tried to set postgresql.conf's tcp_keepalives_* but not success.
 
 I have also tried those, but they didn't work that's why I have proposed
 this feature in 9.3
 
 Please send mail on community list, others can also help you if they have
 any idea for avoiding such problems.
 
 On 2013/04/10 14:05, Amit Kapila amit.kap...@huawei.com wrote:
 
 On Wednesday, April 10, 2013 9:35 AM Dang Minh Huong wrote:
 Hi,
 
 I'm wondering  if this is a bug of PostgreSQL.
 
 PostgreSQL's show that replication_timeout parameter can Terminate
 replication connections that are inactive longer than the specified
 number of milliseconds. But in my environment the sender process  is
 hang up (in several tens of minunites) if i turn off  (by power off)
 Standby PC while pg_basebackup is excuting.
 
 Is this correct?
 
 As my debug, sender process is terminated when recieve SIGPIPE
 process but it come too slow (about 30minutes after standby PC was
 down).
 
 For such scenario's, new parameter wal_sender_timeout has been
 introduced in 9.3. Refer below:
 http://www.postgresql.org/docs/devel/static/runtime-config-
 replication.html#RUNTIME-CONFIG-REPLICATION-SENDER
 
 I am not sure how to get rid of this problem in 9.1.9
 
 With Regards,
 Amit Kapila.
 




Re: [HACKERS] Inconsistent DB data in Streaming Replication

2013-04-10 Thread Samrat Revagade
(5) The master then forces a write of the data page related to this
transaction.

Sorry, this is incorrect. Whenever the master writes the data page, it
checks that the WAL record has been written on the standby up to that LSN.

While the master is waiting to force a write (point 5) for this data page,
streaming replication continues.

Also, other data page writes are not dependent on this particular page
write, so data page writes are not serialized.

Sorry, this is incorrect. Streaming replication continues and the master is
not waiting; whenever the master writes the data page, it checks that the
WAL record has been written on the standby up to that LSN.

Regards,

Samrat


Re: [HACKERS] Enabling Checksums

2013-04-10 Thread Ants Aasma
On Wed, Apr 10, 2013 at 12:25 PM, Simon Riggs si...@2ndquadrant.com wrote:
 On 10 April 2013 09:01, Ants Aasma a...@cybertec.at wrote:


  Using SIMD for WAL is not a requirement at all; I just thought it might
  be a nice benefit for non-checksum-enabled users in some later release.

 I think we should first deal with using it for page checksums and if
 future versions want to reuse some of the code for WAL checksums then
 we can rearrange the code.


 We have essentially the same problem in both cases: we want to calculate a
 checksum of BLCKSZ chunks of data, plus some smaller header data. We
 currently use the same code for both cases and it makes sense to do the same
 thing with any new code. This is also the *same* issue: when we make a new
 hint we need to issue a full page write in WAL, so we are calculating
 checksums in two new places: XLOG_HINT records and data blocks.

 Few technical points:

 * We're taking a copy of the buffer, so Jeff's zero trick works safely, I
 think.
 * We can use a different algorithm for big and small blocks, we just need a
 way to show we've done that, for example setting the high order bit of the
 checksum.
 * We might even be able to calculate CRC32 checksum for normal WAL records,
 and use Ants' checksum for full page writes (only). So checking WAL checksum
 would then be to confirm header passes CRC32 and then re-check the Ants
 checksum of each backup block.

If we ensure that the checksum on the page is correct when we do a
full-page write, then we could include only the checksum field in the
WAL CRC. When reading WAL we would first check that the CRC is correct
and then verify the page checksum.

 This work needs to happen now, since once the checksum algorithm is set we
 won't easily be able to change it.

The page checksum algorithm needs to be decided now, but WAL CRCs and
full page writes can be changed in 9.4 and don't need to be perfect on
the first try.

Regards,
Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de




Re: [HACKERS] Inconsistent DB data in Streaming Replication

2013-04-10 Thread Amit Kapila
On Wednesday, April 10, 2013 3:42 PM Samrat Revagade wrote:

(5) The master then forces a write of the data page related to this
transaction.
 Sorry, this is incorrect. Whenever the master writes the data page it
checks that the WAL record is written in standby till that LSN. 

 While master is waiting to force a write (point 5) for this data page,
streaming replication continuous. 
 Also other data page writes are not dependent on this particular page
write. So the commit of data pages are not serialized.

Sorry, this is incorrect. Streaming replication continuous, master is not
waiting, whenever the master writes the data page it checks that the WAL
record is written in standby till that LSN.

I am not sure it will resolve the problem completely, as your old master can
have some extra WAL beyond the new master for the same timeline. I don't
remember exactly whether the timeline switch feature takes care of this
extra WAL; Heikki can confirm this point.
Also, I think this can serialize the flushing of data pages in
checkpoint/bgwriter, which is currently not the case.

With Regards,
Amit Kapila.








Re: [HACKERS] [BUGS] replication_timeout not effective

2013-04-10 Thread Kyotaro HORIGUCHI
Hello,

On Wed, Apr 10, 2013 at 6:57 PM, Dang Minh Huong kakalo...@gmail.com wrote:
 In 9.3, it sounds replication_timeout is replaced by wal_sender_timeout.
 So if it is solved in 9.3 i think there is a way to terminate it.
 I hope it is fixed in 9.1 soon

Hmm. He said that,

 But in my environment the sender process is hang up (in several tens of 
 minunites) if i turn off  (by power off) Standby PC while *pg_basebackup* is 
 excuting.

Does a base backup run only on a 'replication connection'?
As far as I saw, a base backup uses a 'base backup' connection in addition
to the 'streaming' connection. The former does not seem to be under the
control of wal_sender_timeout or replication_timeout, and is easily blocked
in send(2) after a sudden loss of the underlying network connection,
although the latter is indeed terminated by those timeouts.

Blocking in send(2) might also occur for an async-rep connection, but it is
not likely for sync-rep, since that does not easily fill the libpq and
socket buffers.

I suppose that is what he is referring to.

This seems to occur as of the latest 9.3dev.

regards,
--
Kyotaro Horiguchi




Re: [HACKERS] [BUGS] replication_timeout not effective

2013-04-10 Thread Andres Freund
On 2013-04-10 22:38:07 +0900, Kyotaro HORIGUCHI wrote:
 Hello,
 
 On Wed, Apr 10, 2013 at 6:57 PM, Dang Minh Huong kakalo...@gmail.com wrote:
  In 9.3, it sounds replication_timeout is replaced by wal_sender_timeout.
  So if it is solved in 9.3 i think there is a way to terminate it.
  I hope it is fixed in 9.1 soon
 
 Hmm. He said that,
 
  But in my environment the sender process is hang up (in several tens of 
  minunites) if i turn off  (by power off) Standby PC while *pg_basebackup* 
  is excuting.
 
 Does basebackup run only on 'replication connection' ?
 As far as I saw base backup uses 'base backup' connection in addition
 to 'streaming' connection. The former seems not under the control of
 wal_sender_timeout or replication_timeout and easily blocked at
 send(2) after sudden cut out of the network connection underneath.
 Although the latter indeed is terminated by them.

Yes, it's run via a walsender connection. The only problem is that it
doesn't check for those timeouts. I am not sure it would be a good thing
to do so, to be honest. At least not using the same timeout as actual WAL
sending, which just has different characteristics.
On the other hand, hanging around that long isn't nice either...

 Blocking in send(2) might could occur for async-rep connection but not
 likely for sync-rep since it does not fill the buffers of libpq and
 socket easilly.

You just need larger transactions for it. A COPY or so ought to do it.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Inconsistent DB data in Streaming Replication

2013-04-10 Thread Tom Lane
Amit Kapila amit.kap...@huawei.com writes:
 On Wednesday, April 10, 2013 3:42 PM Samrat Revagade wrote:
 Sorry, this is incorrect. Streaming replication continuous, master is not
 waiting, whenever the master writes the data page it checks that the WAL
 record is written in standby till that LSN.

 I am not sure it will resolve the problem completely as your old-master can
 have some WAL extra then new-master for same timeline. I don't remember
 exactly will timeline switch feature
 take care of this extra WAL, Heikki can confirm this point?
 Also I think this can serialize flush of data pages in checkpoint/bgwriter
 which is currently not the case.

Yeah.  TBH this entire discussion seems to be "let's cripple performance
in the normal case so that we can skip doing an rsync when resurrecting
a crashed, failed-over master."  This is not merely optimizing for the
wrong thing, it's positively hazardous.  After a fail-over, you should
be wondering whether it's safe to resurrect the old master at all, not
about how fast you can bring it back up without validating its data.
IOW, I wouldn't consider skipping the rsync even if I had a feature
like this.

regards, tom lane




Re: [HACKERS] Inconsistent DB data in Streaming Replication

2013-04-10 Thread Andres Freund
On 2013-04-10 10:10:31 -0400, Tom Lane wrote:
 Amit Kapila amit.kap...@huawei.com writes:
  On Wednesday, April 10, 2013 3:42 PM Samrat Revagade wrote:
  Sorry, this is incorrect. Streaming replication continuous, master is not
  waiting, whenever the master writes the data page it checks that the WAL
  record is written in standby till that LSN.
 
  I am not sure it will resolve the problem completely as your old-master can
  have some WAL extra then new-master for same timeline. I don't remember
  exactly will timeline switch feature
  take care of this extra WAL, Heikki can confirm this point?
  Also I think this can serialize flush of data pages in checkpoint/bgwriter
  which is currently not the case.
 
 Yeah.  TBH this entire discussion seems to be let's cripple performance
 in the normal case so that we can skip doing an rsync when resurrecting
 a crashed, failed-over master.  This is not merely optimizing for the
 wrong thing, it's positively hazardous.  After a fail-over, you should
 be wondering whether it's safe to resurrect the old master at all, not
 about how fast you can bring it back up without validating its data.
 IOW, I wouldn't consider skipping the rsync even if I had a feature
 like this.

Agreed. Especially as in situations where you fail over in a planned
way, e.g. for a hardware upgrade, you can avoid the need to resync with
a little bit of care. So it's mostly in catastrophic situations that this
becomes a problem, and in those you really should resync - and it's a good
idea not to use a normal rsync but rsync --checksum or similar.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Enabling Checksums

2013-04-10 Thread Bruce Momjian
On Wed, Apr 10, 2013 at 01:15:12PM +0300, Ants Aasma wrote:
  This work needs to happen now, since once the checksum algorithm is set we
  won't easily be able to change it.
 
 The page checksum algorithm needs to be decided now, but WAL CRCs and
 full page writes can be changed in 9.4 and don't need to be perfect on
 the first try.

I can confirm that --- pg_upgrade doesn't copy any of the WAL stream
from old to new cluster.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +




Re: [HACKERS] Inconsistent DB data in Streaming Replication

2013-04-10 Thread Shaun Thomas

On 04/10/2013 09:10 AM, Tom Lane wrote:


IOW, I wouldn't consider skipping the rsync even if I had a feature
like this.


Totally. Out in the field, we consider the old database corrupt the 
moment we fail over. There is literally no way to verify the safety of 
any data along the broken chain, given race conditions and multiple 
potential failure points.


The only potential use case for this that I can see would be for system 
maintenance and a controlled failover. I agree: that's a major PITA when 
doing DR testing, but I personally don't think this is the way to fix 
that particular edge case.


Maybe checksums will fix this in the long run... I don't know. DRBD has 
a handy block-level verify function for things like this, and it can 
re-sync master/slave data by comparing the commit log across the servers 
if you tell it one node should be considered incorrect.


The thing is... we have clogs, and we have WAL. If we can assume 
bidirectional communication and verification (checksum comparison?) of 
both of those components, the database *should* be able to re-sync itself.


Even if that were possible given the internals, I can't see anyone 
jumping on this before 9.4 or 9.5 unless someone sponsors the feature.


Automatic re-sync would (within available WALs) be an awesome feature, 
though...


--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-676-8870
stho...@optionshouse.com





Re: [HACKERS] [BUGS] replication_timeout not effective

2013-04-10 Thread Dang Minh Huong

Thanks all,

(2013/04/10 22:55), Andres Freund wrote:

On 2013-04-10 22:38:07 +0900, Kyotaro HORIGUCHI wrote:

Hello,

On Wed, Apr 10, 2013 at 6:57 PM, Dang Minh Huong kakalo...@gmail.com wrote:

In 9.3, it sounds replication_timeout is replaced by wal_sender_timeout.
So if it is solved in 9.3 i think there is a way to terminate it.
I hope it is fixed in 9.1 soon

Hmm. He said that,


But in my environment the sender process is hang up (in several tens of 
minunites) if i turn off  (by power off) Standby PC while *pg_basebackup* is 
excuting.

Does basebackup run only on 'replication connection' ?
As far as I saw base backup uses 'base backup' connection in addition
to 'streaming' connection. The former seems not under the control of
wal_sender_timeout or replication_timeout and easily blocked at
send(2) after sudden cut out of the network connection underneath.
Although the latter indeed is terminated by them.

Yes, it's run via a walsender connection. The only problem is that it
doesn't check for those timeouts. I am not sure it would be a good thing
to do so to be honest. At least not using the same timeout as actual WAL
sending, thats just has different characteristics.
On the other hand, hanging around that long isn't nice either...

I tried max_wal_senders set to 1, so while that walsender is hanging
I cannot run pg_basebackup again (or start the standby DB).
I increased it to 2, and the second attempt succeeded. But I'm afraid
that by the time a third attempt happens, the hanging walsender from the
first may still not be terminated...

I think not, but is there a way to terminate the hung walsender without
restarting the PostgreSQL server or killing the walsender process?
(Killing the walsender process could cause a crash of the DB server,
so I don't want to do it.)

# I've also tried pg_cancel_backend(), but it did not work either.

Blocking in send(2) might could occur for async-rep connection but not
likely for sync-rep since it does not fill the buffers of libpq and
socket easilly.

You just need larger transactions for it. A COPY or so ought to do it.

Greetings,

Andres Freund


Regard,
Huong DM




Re: [HACKERS] [BUGS] replication_timeout not effective

2013-04-10 Thread Andres Freund
On 2013-04-10 23:37:44 +0900, Dang Minh Huong wrote:
 Thanks all,
 
 (2013/04/10 22:55), Andres Freund wrote:
 On 2013-04-10 22:38:07 +0900, Kyotaro HORIGUCHI wrote:
 Hello,
 
 On Wed, Apr 10, 2013 at 6:57 PM, Dang Minh Huong kakalo...@gmail.com 
 wrote:
 In 9.3, it sounds replication_timeout is replaced by wal_sender_timeout.
 So if it is solved in 9.3 i think there is a way to terminate it.
 I hope it is fixed in 9.1 soon
 Hmm. He said that,
 
 But in my environment the sender process is hang up (in several tens of 
 minunites) if i turn off  (by power off) Standby PC while *pg_basebackup* 
 is excuting.
 Does basebackup run only on 'replication connection' ?
 As far as I saw base backup uses 'base backup' connection in addition
 to 'streaming' connection. The former seems not under the control of
 wal_sender_timeout or replication_timeout and easily blocked at
 send(2) after sudden cut out of the network connection underneath.
 Although the latter indeed is terminated by them.
 Yes, it's run via a walsender connection. The only problem is that it
 doesn't check for those timeouts. I am not sure it would be a good thing
 to do so to be honest. At least not using the same timeout as actual WAL
 sending, thats just has different characteristics.
 On the other hand, hanging around that long isn't nice either...
 I tried max_wal_sender with 1, so when the walsender is hanging.
 I can not run again pg_basebackup (or start the standby DB).
 I'm increasing it to 2, so the seconds successfully. But i'm afraid
  that when the third occures the hanging walsender in the first
  is not yet terminated...
 
  I think not, but is there a way to terminate hanging up but not
  restart PostgreSQL server or kill walsender process?
  (kill walsender process can caused a crash to DB server,
  so i don't want to do it).

Depending on where it's hanging, a normal SELECT
pg_terminate_backend(pid); might do it.

Otherwise you will have to wait for the operating system's TCP timeout.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




[HACKERS] [GSOC] questions about idea rewrite pg_dump as library

2013-04-10 Thread Shuai Fan
Hi all,

I'd like to introduce myself to the dev community. I am Shuai Fan, a 
student at Dalian University of Technology (DLUT for short), China, and I am 
interested in working on a PostgreSQL project in GSOC 2013.
I'm interested in the idea Rewrite (add) pg_dump and pg_restore utilities 
as libraries (.so, .dll & .dylib).

These days I had a talk by email with Mr. Pavel Golub, the author of this 
post, and asked some questions about this idea. He advised me to post the 
question to this mailing list.

My question is:
There are lots of functions in pg_dump.c. If I rewrite this file as a 
library, I should split pg_dump.c into two or more files (Mr. Pavel Golub's 
advice). However, some functions do have return values, e.g.

static DumpableObject *createBoundaryObjects(void);

I thought these functions must return their values through a function 
argument, by passing a pointer to it, when used as a library. But the 
question is: if I did so, the function prototypes would change. And Mr. 
Pavel Golub said it's better to keep all functions with the same 
declarations, so that we will only have one code base for both the console 
application and the library. I think he is right.
But then, how can I deal with this situation? From my point of view, I 
can't malloc a variable (e.g. a DumpableObject) in one library (e.g. 
pg_dumplib.so) and then return its pointer to another library (or executable 
program). Maybe it's not safe(?), or it might lead to a memory leak(?). 
I'm not sure.

Do you have any suggestions?
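
For illustration, the memory-ownership pattern under discussion (the library
both allocates and frees the object, so returning a pointer across the
boundary stays safe) can be sketched like this; all names here are made up
and this is not pg_dump code:

#include <stdlib.h>

typedef struct DumpHandle DumpHandle;	/* opaque to callers */

struct DumpHandle
{
	int			nobjects;
	/* ... whatever the library tracks internally ... */
};

/* The library allocates and returns the object, keeping the prototype shape. */
DumpHandle *
dump_create(void)
{
	DumpHandle *h = malloc(sizeof(DumpHandle));

	if (h != NULL)
		h->nobjects = 0;
	return h;
}

/* The same library frees it, so allocator mismatches across modules cannot occur. */
void
dump_destroy(DumpHandle *h)
{
	free(h);
}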


Best wishes,
Shuai


Re: [HACKERS] [BUGS] replication_timeout not effective

2013-04-10 Thread Dang Minh Huong
On 2013/04/10 23:44, Andres Freund and...@2ndquadrant.com wrote:

 On 2013-04-10 23:37:44 +0900, Dang Minh Huong wrote:
 Thanks all,
 
 (2013/04/10 22:55), Andres Freund wrote:
 On 2013-04-10 22:38:07 +0900, Kyotaro HORIGUCHI wrote:
 Hello,
 
 On Wed, Apr 10, 2013 at 6:57 PM, Dang Minh Huong kakalo...@gmail.com 
 wrote:
 In 9.3, it sounds replication_timeout is replaced by wal_sender_timeout.
 So if it is solved in 9.3 i think there is a way to terminate it.
 I hope it is fixed in 9.1 soon
 Hmm. He said that,
 
 But in my environment the sender process is hang up (in several tens of 
 minunites) if i turn off  (by power off) Standby PC while *pg_basebackup* 
 is excuting.
 Does basebackup run only on 'replication connection' ?
 As far as I saw base backup uses 'base backup' connection in addition
 to 'streaming' connection. The former seems not under the control of
 wal_sender_timeout or replication_timeout and easily blocked at
 send(2) after sudden cut out of the network connection underneath.
 Although the latter indeed is terminated by them.
 Yes, it's run via a walsender connection. The only problem is that it
 doesn't check for those timeouts. I am not sure it would be a good thing
 to do so to be honest. At least not using the same timeout as actual WAL
 sending, thats just has different characteristics.
 On the other hand, hanging around that long isn't nice either...
 I tried max_wal_sender with 1, so when the walsender is hanging.
 I can not run again pg_basebackup (or start the standby DB).
 I'm increasing it to 2, so the seconds successfully. But i'm afraid
 that when the third occures the hanging walsender in the first
 is not yet terminated...
 
 I think not, but is there a way to terminate hanging up but not
 restart PostgreSQL server or kill walsender process?
 (kill walsender process can caused a crash to DB server,
 so i don't want to do it).
 
 Depending on where its hanging a normal SELECT
 pg_terminate_backend(pid); might do it.
 
Great! It worked. Thank you very much.

 Otherwise you will have to wait for the operating system's tcp timeout.
 
 Greetings,
 
 Andres Freund
 
 -- 
 Andres Freund   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Regards,
Huong DM



Re: [HACKERS] Problem with background worker

2013-04-10 Thread Alvaro Herrera
Marc Cousin escribió:
 On 20/03/2013 16:33, Alvaro Herrera wrote:

 Ah.  The reason for this problem is that the statement start time (which
 also sets the transaction start time, when it's the first statement) is
 set by postgres.c, not the transaction-control functions in xact.c.  So
 you'd need to add a SetCurrentStatementStartTimestamp() call somewhere
 in your loop.
 
 
 Yes, that works. Thanks a lot!
 
 Maybe this should be added to the worker_spi example?

Done in
http://git.postgresql.org/pg/commitdiff/e543631f3c162ab5f6020b1d0209e0353ca2229a
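
For reference, the per-iteration setup under discussion looks roughly like
the following sketch (modelled on the worker_spi example; details may differ
from the committed version):

#include "postgres.h"
#include "access/xact.h"
#include "executor/spi.h"
#include "utils/snapmgr.h"

static void
worker_iteration(const char *query)
{
	/* A bgworker must set this itself; postgres.c never does it for us. */
	SetCurrentStatementStartTimestamp();

	StartTransactionCommand();
	SPI_connect();
	PushActiveSnapshot(GetTransactionSnapshot());

	SPI_execute(query, false, 0);	/* run the worker's SQL */

	SPI_finish();
	PopActiveSnapshot();
	CommitTransactionCommand();
}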

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Behaviour of bgworker with SIGHUP

2013-04-10 Thread Alvaro Herrera
Guillaume Lelarge wrote:

 worker_spi.naptime is the naptime between two checks.
 worker_spi.total_workers is the number of workers to launch at
 postmaster start time. The first one can change with a sighup, the last
 one obviously needs a restart.

Many thanks.  Pushed as
http://git.postgresql.org/pg/commitdiff/e543631f3c162ab5f6020b1d0209e0353ca2229a
along with a few other tweaks.  I hope the code is more useful as a sample now.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Inconsistent DB data in Streaming Replication

2013-04-10 Thread Fujii Masao
On Wed, Apr 10, 2013 at 11:26 PM, Shaun Thomas stho...@optionshouse.com wrote:
 On 04/10/2013 09:10 AM, Tom Lane wrote:

 IOW, I wouldn't consider skipping the rsync even if I had a feature
 like this.


 Totally. Out in the field, we consider the old database corrupt the moment
 we fail over.

Strange. If this is really true, a shared-disk failover solution is
fundamentally broken, because the standby needs to start up with the shared,
corrupted database at failover. Also, we could not trust crash recovery at
all if we adopted the same logic you describe. I think that there are cases
where we can replay and reuse the old database even after PostgreSQL crashes.

Regards,

--
Fujii Masao




Re: [HACKERS] Inconsistent DB data in Streaming Replication

2013-04-10 Thread Shaun Thomas

On 04/10/2013 11:40 AM, Fujii Masao wrote:


Strange. If this is really true, shared disk failover solution is
fundamentally broken because the standby needs to start up with the
shared corrupted database at the failover.


How so? Shared disk doesn't use replication. The point I was trying to 
make is that replication requires synchronization between two disparate 
servers, and verifying they have exactly the same data is a non-trivial 
exercise. Even a single transaction after a failover (effectively) 
negates the old server because there's no easy catch up mechanism yet.


Even if this isn't necessarily true, it's the safest approach IMO.

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-676-8870
stho...@optionshouse.com





Re: [HACKERS] Inconsistent DB data in Streaming Replication

2013-04-10 Thread Fujii Masao
On Wed, Apr 10, 2013 at 11:16 PM, Andres Freund and...@2ndquadrant.com wrote:
 On 2013-04-10 10:10:31 -0400, Tom Lane wrote:
 Amit Kapila amit.kap...@huawei.com writes:
  On Wednesday, April 10, 2013 3:42 PM Samrat Revagade wrote:
  Sorry, this is incorrect. Streaming replication continuous, master is not
  waiting, whenever the master writes the data page it checks that the WAL
  record is written in standby till that LSN.

  I am not sure it will resolve the problem completely as your old-master can
  have some WAL extra then new-master for same timeline. I don't remember
  exactly will timeline switch feature
  take care of this extra WAL, Heikki can confirm this point?
  Also I think this can serialize flush of data pages in checkpoint/bgwriter
  which is currently not the case.

 Yeah.  TBH this entire discussion seems to be let's cripple performance
 in the normal case so that we can skip doing an rsync when resurrecting
 a crashed, failed-over master.  This is not merely optimizing for the
 wrong thing, it's positively hazardous.  After a fail-over, you should
 be wondering whether it's safe to resurrect the old master at all, not
 about how fast you can bring it back up without validating its data.
 IOW, I wouldn't consider skipping the rsync even if I had a feature
 like this.

 Agreed. Especially as in situations where you fall over in a planned
 way, e.g. for a hardware upgrade, you can avoid the need to resync with
 a littlebit of care.

It's really worth documenting that way.

 So its mostly in catastrophic situations this
 becomes a problem and in those you really should resync - and its a good
 idea not to use a normal rsync but a rsync --checksum or similar.

If the database is very large, rsync --checksum takes very long. And I'm
concerned that most of the data pages in the master have different checksums
from those in the standby because of commit hint bits. I'm not sure how
rsync --checksum can speed up the backup after failover.

Regards,

-- 
Fujii Masao




Re: [HACKERS] Inconsistent DB data in Streaming Replication

2013-04-10 Thread Fujii Masao
On Thu, Apr 11, 2013 at 1:44 AM, Shaun Thomas stho...@optionshouse.com wrote:
 On 04/10/2013 11:40 AM, Fujii Masao wrote:

 Strange. If this is really true, shared disk failover solution is
 fundamentally broken because the standby needs to start up with the
 shared corrupted database at the failover.


 How so? Shared disk doesn't use replication. The point I was trying to make
 is that replication requires synchronization between two disparate servers,
 and verifying they have exactly the same data is a non-trivial exercise.
 Even a single transaction after a failover (effectively) negates the old
 server because there's no easy catch up mechanism yet.

Hmm... ISTM what Samrat is proposing can resolve the problem. That is,
if we can assume that any data page whose WAL has not been replicated to the
standby has not been written on the master, the new standby (i.e., the old
master) can safely catch up with the new master (i.e., the old standby). In
this approach, of course, the new standby might have some WAL records which
the new master doesn't have, so before starting up the new standby we need
to remove all the WAL files on the new standby and retrieve the WAL files
from the new master. But what's the problem with his approach?

Regards,

-- 
Fujii Masao




Re: [HACKERS] Inconsistent DB data in Streaming Replication

2013-04-10 Thread Ants Aasma
On Wed, Apr 10, 2013 at 7:44 PM, Shaun Thomas stho...@optionshouse.com wrote:
 On 04/10/2013 11:40 AM, Fujii Masao wrote:

 Strange. If this is really true, shared disk failover solution is
 fundamentally broken because the standby needs to start up with the
 shared corrupted database at the failover.


 How so? Shared disk doesn't use replication. The point I was trying to make
 is that replication requires synchronization between two disparate servers,
 and verifying they have exactly the same data is a non-trivial exercise.
 Even a single transaction after a failover (effectively) negates the old
 server because there's no easy catch up mechanism yet.

 Even if this isn't necessarily true, it's the safest approach IMO.

We already rely on WAL-before-data to ensure correct recovery. What is
proposed here is to slightly redefine it to require WAL to be
replicated before it is considered to be flushed. This ensures that no
data page on disk differs from the WAL that the slave has. The
machinery to do this is already mostly there, we already wait for WAL
flushes and we know the write location on the slave. The second
requirement is that we never start up as master and we don't trust any
local WAL. This is actually how pacemaker clusters work, you would
only need to amend the RA to wipe the WAL and configure postgresql
with restart_after_crash = false.

It would be very helpful in restoring HA capability after failover if
we didn't have to read through the whole database after a VM goes
down and is migrated with the shared disk onto a new host.

Regards,
Ants Aasma
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de




Re: [HACKERS] Inconsistent DB data in Streaming Replication

2013-04-10 Thread Tom Lane
Ants Aasma a...@cybertec.at writes:
 We already rely on WAL-before-data to ensure correct recovery. What is
 proposed here is to slightly redefine it to require WAL to be
 replicated before it is considered to be flushed. This ensures that no
 data page on disk differs from the WAL that the slave has. The
 machinery to do this is already mostly there, we already wait for WAL
 flushes and we know the write location on the slave. The second
 requirement is that we never start up as master and we don't trust any
 local WAL. This is actually how pacemaker clusters work, you would
 only need to amend the RA to wipe the WAL and configure postgresql
 with restart_after_crash = false.

 It would be very helpful in restoring HA capability after failover if
 we wouldn't have to read through the whole database after a VM goes
 down and is migrated with the shared disk onto a new host.

The problem with this is it's making an idealistic assumption that a
crashed master didn't do anything wrong or lose/corrupt any data during
its crash.  As soon as you realize that's an unsafe assumption, the
whole thing becomes worthless to you.

If the idea had zero implementation cost, I would say sure, let people
play with it until they find out (probably the hard way) that it's a bad
idea.  But it's going to introduce, at the very least, additional
complexity into a portion of the system that is critical and plenty
complicated enough already.  That being the case, I don't want it there
at all, not even as an option.

regards, tom lane




Re: [HACKERS] Enabling Checksums

2013-04-10 Thread Jeff Davis
On Wed, 2013-04-10 at 11:01 +0300, Ants Aasma wrote:
 I think we should first deal with using it for page checksums and if
 future versions want to reuse some of the code for WAL checksums then
 we can rearrange the code.

Sounds good to me, although I expect we at least want any assembly to be
in a separate file (if the specialization makes it in 9.3).

Regards,
Jeff Davis






Re: [HACKERS] Inconsistent DB data in Streaming Replication

2013-04-10 Thread Boszormenyi Zoltan

On 2013-04-10 18:46, Fujii Masao wrote:

On Wed, Apr 10, 2013 at 11:16 PM, Andres Freund and...@2ndquadrant.com wrote:

On 2013-04-10 10:10:31 -0400, Tom Lane wrote:

Amit Kapila amit.kap...@huawei.com writes:

On Wednesday, April 10, 2013 3:42 PM Samrat Revagade wrote:

Sorry, this is incorrect. Streaming replication continuous, master is not
waiting, whenever the master writes the data page it checks that the WAL
record is written in standby till that LSN.

I am not sure it will resolve the problem completely as your old-master can
have some WAL extra then new-master for same timeline. I don't remember
exactly will timeline switch feature
take care of this extra WAL, Heikki can confirm this point?
Also I think this can serialize flush of data pages in checkpoint/bgwriter
which is currently not the case.

Yeah.  TBH this entire discussion seems to be let's cripple performance
in the normal case so that we can skip doing an rsync when resurrecting
a crashed, failed-over master.  This is not merely optimizing for the
wrong thing, it's positively hazardous.  After a fail-over, you should
be wondering whether it's safe to resurrect the old master at all, not
about how fast you can bring it back up without validating its data.
IOW, I wouldn't consider skipping the rsync even if I had a feature
like this.

Agreed. Especially as in situations where you fall over in a planned
way, e.g. for a hardware upgrade, you can avoid the need to resync with
a littlebit of care.

It's really worth documenting that way.


So its mostly in catastrophic situations this
becomes a problem and in those you really should resync - and its a good
idea not to use a normal rsync but a rsync --checksum or similar.

If database is very large, rsync --checksum takes very long. And I'm concerned
that most of data pages in master has the different checksum from those in the
standby because of commit hint bit. I'm not sure how rsync --checksum can
speed up the backup after failover.


rsync --checksum alone may not, but rsync --inplace may speed up the backup
a lot.



Regards,




--
--
Zoltán Böszörményi
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de
 http://www.postgresql.at/





Re: [HACKERS] Inconsistent DB data in Streaming Replication

2013-04-10 Thread Andres Freund
On 2013-04-10 20:39:25 +0200, Boszormenyi Zoltan wrote:
 On 2013-04-10 18:46, Fujii Masao wrote:
 On Wed, Apr 10, 2013 at 11:16 PM, Andres Freund and...@2ndquadrant.com 
 wrote:
 On 2013-04-10 10:10:31 -0400, Tom Lane wrote:
 Amit Kapila amit.kap...@huawei.com writes:
 On Wednesday, April 10, 2013 3:42 PM Samrat Revagade wrote:
 Sorry, this is incorrect. Streaming replication continuous, master is not
 waiting, whenever the master writes the data page it checks that the WAL
 record is written in standby till that LSN.
 I am not sure it will resolve the problem completely as your old-master 
 can
 have some WAL extra then new-master for same timeline. I don't remember
 exactly will timeline switch feature
 take care of this extra WAL, Heikki can confirm this point?
 Also I think this can serialize flush of data pages in checkpoint/bgwriter
 which is currently not the case.
 Yeah.  TBH this entire discussion seems to be let's cripple performance
 in the normal case so that we can skip doing an rsync when resurrecting
 a crashed, failed-over master.  This is not merely optimizing for the
 wrong thing, it's positively hazardous.  After a fail-over, you should
 be wondering whether it's safe to resurrect the old master at all, not
 about how fast you can bring it back up without validating its data.
 IOW, I wouldn't consider skipping the rsync even if I had a feature
 like this.
 Agreed. Especially as in situations where you fall over in a planned
 way, e.g. for a hardware upgrade, you can avoid the need to resync with
 a littlebit of care.
 It's really worth documenting that way.
 
 So its mostly in catastrophic situations this
 becomes a problem and in those you really should resync - and its a good
 idea not to use a normal rsync but a rsync --checksum or similar.
 If database is very large, rsync --checksum takes very long. And I'm 
 concerned
 that most of data pages in master has the different checksum from those in 
 the
 standby because of commit hint bit. I'm not sure how rsync --checksum can
 speed up the backup after failover.

It's not about speed, it's about correctness.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




[HACKERS] Re: [COMMITTERS] pgsql: Get rid of USE_WIDE_UPPER_LOWER dependency in trigram constructi

2013-04-10 Thread Stefan Kaltenbrunner
On 04/08/2013 10:11 AM, Dimitri Fontaine wrote:
 Tom Lane t...@sss.pgh.pa.us writes:
 If there is anybody still using Postgres on machines without wcstombs() or
 towlower(), and they have non-ASCII data indexed by pg_trgm, they'll need
 to REINDEX those indexes after pg_upgrade to 9.3, else searches may fail
 incorrectly. It seems likely that there are no such installations, though.
 
 Those conditions seem just complex enough to require a test script that
 will check that for you. What if we created a new binary responsible for
 auto checking all those release-note items that are possible to machine
 check, then issue a WARNING containing the URL to the release notes you
 should be reading, and a SQL script (ala pg_upgrade) to run after
 upgrade?
 
   $ pg_checkupgrade -d connection=string  upgrade.sql
   NOTICE: checking 9.3 upgrade release notes
   WARNING: RN-93-0001 index idx_trgm_abc is not on-disk compatible with 9.3
   WARNING: TN-93-0012 …
   WARNING: This script is NOT comprehensive, read release notes at …
 
 The target version would be hard coded on the binary itself for easier
 maintaining of it, and that proposal includes a unique identifier for
 any release note worthy warning that we know about, that would be
 included in the output of the program.
 
 I think most of the checks would only have to be SQL code, and some of
 them should include running some binary code the server side. When
 that's possible, we could maybe expose that binary code in a server side
 extension so as to make the client side binary life's easier. That would
 also be an excuse for the project to install some upgrade material on
 the old server, which has been discussed in the past for preparing
 pg_upgrade when we have a page format change.

Given that something like this will also have to be dealt with by pg_upgrade,
why not fold it into that (like into -c) completely and recommend running
that? On the flip side, if people don't read the release notes they will
also not run any kind of binary/script mentioned there...



Stefan




Re: [HACKERS] bgworker sigusr1 handler

2013-04-10 Thread Alvaro Herrera
Robert Haas escribió:
 Just for fun, I implemented a toy background worker tonight using the
 new bgworker framework.  Generally, it went well, and I'm pleased with
 the design of the new facility.

Thanks.

 However, I did notice one oddity.  I initialized the worker flags like
 this:
 
 worker.bgw_flags = BGWORKER_SHMEM_ACCESS;
 
 And... latches didn't work.  It turns out that if you request database
 access, then the SIGUSR1 handler is set to procsignal_sigusr1_handler,
 which is fine.  But if you don't, it gets set to SIG_IGN.  And the
 result of *that* is that if someone sets a latch for which the
 background process is waiting, the background process fails to notice.
 
 Now, once you understand what's going on here, it's not hard to work
 around.  But it seems to me that it would be a saner default to set
 the signal handler to something like the bgwriter handler, which just
 calls latch_sigusr1_handler.

Sounds sensible -- done that way.
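
For anyone following along, here is a minimal sketch of the pattern in question:
a shmem-only worker that installs a bgwriter-style SIGUSR1 handler itself and
waits on a latch it keeps in shared memory. The shared struct, the names, and
the 9.3-era bgw_main signature are assumptions of this sketch, not code from
the thread or the commit.

/* minimal_latch_worker.c -- illustrative only, not the committed fix */
#include "postgres.h"

#include <signal.h>

#include "miscadmin.h"
#include "postmaster/bgworker.h"
#include "storage/ipc.h"
#include "storage/latch.h"
#include "storage/lwlock.h"
#include "storage/shmem.h"

/* Hypothetical shared state: one latch other backends can set to wake us. */
typedef struct MyWorkerState
{
	Latch		latch;
} MyWorkerState;

static MyWorkerState *my_state;

/* SIGUSR1 handler in the style of the bgwriter's: just service the latch. */
static void
my_sigusr1_handler(SIGNAL_ARGS)
{
	int			save_errno = errno;

	latch_sigusr1_handler();
	errno = save_errno;
}

/* bgw_main entry point, using the 9.3-era signature. */
void
my_worker_main(void *main_arg)
{
	bool		found;

	/* Install the handler before unblocking signals, or wakeups can be lost. */
	pqsignal(SIGUSR1, my_sigusr1_handler);
	BackgroundWorkerUnblockSignals();

	/* Attach to (or create) our shared state and take ownership of its latch. */
	LWLockAcquire(AddinShmemInitLock, LW_EXCLUSIVE);
	my_state = ShmemInitStruct("my_latch_worker", sizeof(MyWorkerState), &found);
	if (!found)
		InitSharedLatch(&my_state->latch);
	LWLockRelease(AddinShmemInitLock);
	OwnLatch(&my_state->latch);

	for (;;)
	{
		int			rc;

		/* Sleep until somebody SetLatch()es us, or 10s pass, or postmaster dies. */
		rc = WaitLatch(&my_state->latch,
					   WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
					   10000L);
		ResetLatch(&my_state->latch);

		if (rc & WL_POSTMASTER_DEATH)
			proc_exit(1);

		/* ... do the actual work here ... */
	}
}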

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] SIGHUP not received by custom bgworkers if postmaster is notified

2013-04-10 Thread Alvaro Herrera
Michael Paquier escribió:
 Hi all,
 
 While playing with custom background workers, I noticed that postmaster
 does not notify its registered bgworkers if it receives SIGHUP,
 so you have to send a SIGHUP directly to the bgworker process to notify it.
 Signal handling is correctly done for SIGQUIT and SIGTERM for shutdown only.
 Attached is a patch fixing that, I simply added a call to
 SignalUnconnectedWorkers in SIGHUP_handler:postmaster.c.

Thanks, applied.
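
As a companion note for worker authors: the worker-side half of SIGHUP handling
usually follows the worker_spi pattern of setting a flag and poking the latch,
with the main loop doing the actual reload. A hedged sketch; the latch pointer
and all names here are placeholders, not code from the commit.

/* Sketch only: SIGHUP handling inside a custom worker (worker_spi-style). */
#include "postgres.h"

#include <signal.h>

#include "storage/latch.h"
#include "utils/guc.h"

/* Flag set by the signal handler, consumed by the main loop. */
static volatile sig_atomic_t got_sighup = false;

/* "worker_latch" stands in for whatever latch your main loop waits on. */
static Latch *worker_latch = NULL;

static void
my_sighup_handler(SIGNAL_ARGS)
{
	int			save_errno = errno;

	got_sighup = true;
	if (worker_latch != NULL)
		SetLatch(worker_latch);		/* wake the main loop promptly */
	errno = save_errno;
}

/* Call this in the main loop, right after WaitLatch()/ResetLatch(). */
static void
maybe_reload_config(void)
{
	if (got_sighup)
	{
		got_sighup = false;
		ProcessConfigFile(PGC_SIGHUP);
	}
}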

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] SIGHUP not received by custom bgworkers if postmaster is notified

2013-04-10 Thread Alvaro Herrera
Michael Paquier escribió:
 Hi all,
 
 Please find attached a simple example of bgworker that logs a message each
 time a SIGTERM or SIGHUP signal is received by it:
 - hello signal: processed SIGHUP when SIGHUP is handled by my example
 - hello signal: processed SIGTERM when SIGTERM is handled by my example

I committed some improvements to worker_spi this morning that I think
sufficiently demonstrate signal handling capabilities, which is what
your submitted code would do.  If you see more use for a separate body
of sample worker code, by all means do submit that.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Enabling Checksums

2013-04-10 Thread Simon Riggs
On 10 April 2013 11:15, Ants Aasma a...@cybertec.at wrote:

  * We might even be able to calculate CRC32 checksum for normal WAL
 records,
  and use Ants' checksum for full page writes (only). So checking WAL
 checksum
  would then be to confirm header passes CRC32 and then re-check the Ants
  checksum of each backup block.

 If we ensure that the checksum on the page is correct when we do a
 full page write then we could only include the checksum field in the
 WAL CRC. When reading WAL we would first check that the CRC is correct
 and then verify the page checksum.


OK, so we have a single combined "calculate a checksum for a block"
function. That uses Jeff's zeroing trick and Ants' bulk-oriented
performance optimization.

For buffer checksums we simply calculate for the block.

For WAL full page writes, we first set the checksums for all defined
buffers, then calculate the checksum of remaining data plus the pd_checksum
field from each block using the normal WAL CRC32.

Seems good to me. One set of fast code. And it avoids the weirdness that
the checksum stored on the full page is actually wrong.

It also means that the WAL checksum calculation includes the hole, yet we
do not include the data for the hole. So we have to do an extra copy when
restoring the backup block.
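
To make that concrete, here is a rough sketch of the flow being proposed. Every
name is a placeholder, the hash is a toy stand-in rather than the real block
checksum or WAL CRC code, and the page header is heavily simplified.

/* Sketch of the proposed flow -- toy code, not the patch. */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_BLCKSZ 8192

typedef struct ToyPageHeader	/* heavily simplified page header */
{
	uint64_t	pd_lsn;
	uint16_t	pd_checksum;
	uint16_t	pd_flags;
} ToyPageHeader;

static uint32_t
toy_hash(uint32_t h, const void *data, size_t len)	/* FNV-1a step, illustrative */
{
	const unsigned char *p = data;

	while (len-- > 0)
		h = (h ^ *p++) * 16777619u;
	return h;
}

/* Block checksum with the zeroing trick: pd_checksum is summed as zero. */
static uint16_t
toy_page_checksum(const char *page, uint32_t blkno)
{
	const uint16_t zero = 0;
	size_t		off = offsetof(ToyPageHeader, pd_checksum);
	uint32_t	h = 2166136261u ^ blkno;

	h = toy_hash(h, page, off);
	h = toy_hash(h, &zero, sizeof(zero));
	h = toy_hash(h, page + off + sizeof(zero), TOY_BLCKSZ - off - sizeof(zero));
	return (uint16_t) (h ^ (h >> 16));
}

/* WAL side: stamp the page, then fold only pd_checksum into the record CRC. */
static uint32_t
toy_fold_backup_block(uint32_t wal_crc, char *page, uint32_t blkno)
{
	ToyPageHeader *hdr = (ToyPageHeader *) page;

	hdr->pd_checksum = toy_page_checksum(page, blkno);	/* covers the hole too */
	return toy_hash(wal_crc, &hdr->pd_checksum, sizeof(hdr->pd_checksum));
}

/* Redo side: rebuild the page with a zero-filled hole, then re-verify it. */
static int
toy_verify_backup_block(const char *image, size_t hole_off, size_t hole_len,
						uint32_t blkno, char *page_out)
{
	const ToyPageHeader *hdr = (const ToyPageHeader *) page_out;

	memcpy(page_out, image, hole_off);
	memset(page_out + hole_off, 0, hole_len);
	memcpy(page_out + hole_off + hole_len, image + hole_off,
		   TOY_BLCKSZ - hole_off - hole_len);

	return toy_page_checksum(page_out, blkno) == hdr->pd_checksum;
}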

Comments?

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] [GSOC] questions about idea rewrite pg_dump as library

2013-04-10 Thread Peter Eisentraut
On 4/10/13 10:54 AM, 帅 wrote:
 I'm interested in the idea "Rewrite (add) pg_dump and pg_restore
 utilities as libraries (.so, .dll & .dylib)".

The pg_dump code is a giant mess, and refactoring it as a library is
perhaps not a project for a new hacker.

Independent of that, I think the first consideration in such a project
would be, who else would be using that library?  What are the use cases?
 And then come up with an interface around that, and then see about
refactoring pg_dump.

I think the main use cases mentioned in connection with this idea are
usually in the direction of finer-grained control over what gets dumped
and how.  But making pg_dump into a library would not necessarily
address that.





Re: [HACKERS] [GSOC] questions about idea rewrite pg_dump as library

2013-04-10 Thread Alvaro Herrera
Peter Eisentraut wrote:

 I think the main use cases mentioned in connection with this idea are
 usually in the direction of finer-grained control over what gets dumped
 and how.  But making pg_dump into a library would not necessarily
 address that.

There's also the matter of embedding pg_dump into other programs.  For
example, calling the pg_dump executable from inside pgAdmin is a rather
messy solution to the problem.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] [GSOC] questions about idea rewrite pg_dump as library

2013-04-10 Thread Hannu Krosing

On 04/10/2013 11:02 PM, Alvaro Herrera wrote:

Peter Eisentraut wrote:


I think the main use cases mentioned in connection with this idea are
usually in the direction of finer-grained control over what gets dumped
and how.  But making pg_dump into a library would not necessarily
address that.

There's also the matter of embedding pg_dump into other programs.  For
example, calling the pg_dump executable from inside pgAdmin is a rather
messy solution to the problem.


A natural solution to this seems to be to move most of pg_dump's functionality
into the backend as functions, so that we have pg_dump_xxx() for everything
we want to dump, plus a topological sort function for getting the
objects in the right order.

The main thing left in the pg_dump library would be support for the various
ways to format the dump results into text, tar and dump files.


Hannu




Re: [HACKERS] [GSOC] questions about idea rewrite pg_dump as library

2013-04-10 Thread Alvaro Herrera
Hannu Krosing wrote:
 On 04/10/2013 11:02 PM, Alvaro Herrera wrote:
 Peter Eisentraut wrote:
 
 I think the main use cases mentioned in connection with this idea are
 usually in the direction of finer-grained control over what gets dumped
 and how.  But making pg_dump into a library would not necessarily
 address that.
 There's also the matter of embedding pg_dump into other programs.  For
 example, calling the pg_dump executable from inside pgAdmin is a rather
 messy solution to the problem.

 Natural solution to this seems to move most of pg_dump functionality
 into backend as functions, so we have pg_dump_xxx() for everything
 we want to dump plus a topological sort function for getting the
 objects in right order.

This idea doesn't work because of back-patch considerations (i.e. we
would not be able to create the functions in back branches, and so this
new style of pg_dump would only work with future server versions).  So
pg_dump itself would have to retain capability to dump stuff from old
servers.  This seems unlikely to fly at all, because we'd be then
effectively maintaining pg_dump in two places, both backend and the
pg_dump source code.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] [GSOC] questions about idea rewrite pg_dump as library

2013-04-10 Thread Tom Lane
Alvaro Herrera alvhe...@2ndquadrant.com writes:
 Hannu Krosing wrote:
 Natural solution to this seems to move most of pg_dump functionality
 into backend as functions, so we have pg_dump_xxx() for everything
 we want to dump plus a topological sort function for getting the
 objects in right order.

 This idea doesn't work because of back-patch considerations (i.e. we
 would not be able to create the functions in back branches, and so this
 new style of pg_dump would only work with future server versions).  So
 pg_dump itself would have to retain capability to dump stuff from old
 servers.  This seems unlikely to fly at all, because we'd be then
 effectively maintaining pg_dump in two places, both backend and the
 pg_dump source code.

There are other issues too, in particular that most of the backend's
code tends to work on SnapshotNow time whereas pg_dump would really
prefer it was all done according to the transaction snapshot.  We have
got bugs of that ilk already in pg_dump, but we shouldn't introduce a
bunch more.  Doing this right would therefore mean that we'd have to
write a lot of duplicative code in the backend, ie, it's not clear that
we gain any synergy by pushing the functionality over.  It might
simplify cross-backend-version issues (at least for backend versions
released after we'd rewritten all that code) but otherwise I'm afraid
it'd just be pushing the problems somewhere else.

In any case, "push it to the backend" offers no detectable help with the
core design issue here, which is figuring out what functionality needs
to be exposed with what API.

regards, tom lane




Re: [HACKERS] corrupt pages detected by enabling checksums

2013-04-10 Thread Robert Haas
On Sat, Apr 6, 2013 at 10:44 AM, Andres Freund and...@2ndquadrant.com wrote:
 I feel pretty strongly that we shouldn't add any such complications to
 XLogInsert() itself, its complicated enough already and it should be
 made simpler, not more complicated.

+1, emphatically.  XLogInsert is a really nasty scalability
bottleneck.  We need to move as much logic out of that function as
possible, and particularly out from under WALInsertLock.

...Robert




Re: [HACKERS] [sepgsql 3/3] Add db_procedure:execute permission checks

2013-04-10 Thread Alvaro Herrera
Kohei KaiGai wrote:
 This patch adds sepgsql support for permission checks almost
 equivalent to the existing FUNCTION EXECUTE privilege.

While skimming this patch I noticed that you're using
getObjectDescription() as the audit_name of objects.  This may be a
bit unstable; consider, for example, that it's translated if lc_messages
is set to something other than English.  I would suggest that the object
identity as introduced by commit f8348ea32ec8 is a better choice for
this.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] SIGHUP not received by custom bgworkers if postmaster is notified

2013-04-10 Thread Michael Paquier
Thanks for committing the fix!

On Thu, Apr 11, 2013 at 4:11 AM, Alvaro Herrera alvhe...@2ndquadrant.comwrote:

 Michael Paquier escribió:
  Hi all,
 
  Please find attached a simple example of bgworker that logs a message
 each
  time a SIGTERM or SIGHUP signal is received by it:
  - hello signal: processed SIGHUP when SIGHUP is handled by my example
  - hello signal: processed SIGTERM when SIGTERM is handled by my example

 I committed some improvements to worker_spi this morning that I think
 sufficiently demonstrate signal handling capabilities, which is what
 your submitted code would do.  If you see more use for a separate body
 of sample worker code, by all means do submit that.

Sure.
-- 
Michael


[HACKERS] synchronize_seqscans' description is a bit misleading

2013-04-10 Thread Gurjeet Singh
If I'm reading the code right [1], this GUC does not actually *synchronize*
the scans, but instead just makes sure that a new scan starts from a block
that was reported by some other backend performing a scan on the same
relation.

Since the backends scanning the relation may be processing the relation at
different speeds, even though each one took the hint when starting the
scan, they may end up being out of sync with each other. Even in a single
query, there may be different scan nodes scanning different parts of the
same relation, and even they don't synchronize with each other (and for
good reason).

Imagining that all scans on a table are always synchronized may make some
wrongly believe that adding more backends scanning the same table will not
incur any extra I/O; that is, only one stream of blocks will be read from
disk no matter how many backends you add to the mix. I noticed this when I
was creating partition tables, and each of those was a CREATE TABLE AS
SELECT FROM original_table (to avoid WAL generation), and running more than
3 such transactions caused the disk read throughput to behave unpredictably,
sometimes even dipping below 1 MB/s for a few seconds at a stretch.

Please note that I am not complaining about the implementation, which I
think is the best we can do without making backends wait for each other.
It's just that the documentation [2] implies that the scans are
synchronized through the entire run, which is clearly not the case. So I'd
like the docs to be improved to reflect that.

How about something like:

<doc>
synchronize_seqscans (boolean)
This allows sequential scans of large tables to start from a point in
the table that is already being read by another backend. This increases the
probability that concurrent scans read the same block at about the same
time and hence share the I/O workload. Note that, due to the difference in
speeds of processing the table, the backends may eventually get out of
sync, and hence stop sharing the I/O workload.

When this is enabled, ... The default is on.
</doc>

Best regards,

[1] src/backend/access/heap/heapam.c
[2]
http://www.postgresql.org/docs/9.2/static/runtime-config-compatible.html#GUC-SYNCHRONIZE-SEQSCANS

-- 
Gurjeet Singh

http://gurjeet.singh.im/

EnterpriseDB Inc.


Re: [HACKERS] [DOCS] synchronize_seqscans' description is a bit misleading

2013-04-10 Thread Tom Lane
Gurjeet Singh gurj...@singh.im writes:
 If I'm reading the code right [1], this GUC does not actually *synchronize*
 the scans, but instead just makes sure that a new scan starts from a block
 that was reported by some other backend performing a scan on the same
 relation.

Well, that's the only *direct* effect, but ...

 Since the backends scanning the relation may be processing the relation at
 different speeds, even though each one took the hint when starting the
 scan, they may end up being out of sync with each other.

The point you're missing is that the synchronization is self-enforcing:
whichever backend gets ahead of the others will be the one forced to
request (and wait for) the next physical I/O.  This will naturally slow
down the lower-CPU-cost-per-page scans.  The other ones tend to catch up
during the I/O operation.

The feature is not terribly useful unless I/O costs are high compared to
the CPU cost-per-page.  But when that is true, it's actually rather
robust.  Backends don't have to have exactly the same per-page
processing cost, because pages stay in shared buffers for a while after
the current scan leader reads them.
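
To see why the gap stays bounded, here is a toy model of the effect, entirely
separate from the real heapam code and with invented numbers: the leader pays
the I/O cost for each new block, followers pay only CPU cost while they stay
within a cached window, so whoever races ahead is throttled and whoever lags
catches up.

#include <stdio.h>

#define NBLOCKS   2000
#define CACHE_WIN 64            /* how many recently read blocks stay "cached" */
#define IO_COST   100.0         /* time units to fetch a block from disk */

int
main(void)
{
    int     pos[2] = {0, 0};            /* next block each scanner will read */
    double  cpu_cost[2] = {1.0, 5.0};   /* per-page processing cost */
    double  elapsed[2] = {0.0, 0.0};    /* each scanner's simulated clock */
    int     highest_read = -1;          /* highest block fetched from disk so far */

    while (pos[0] < NBLOCKS || pos[1] < NBLOCKS)
    {
        /* advance whichever scanner is behind in simulated time */
        int s = (elapsed[0] <= elapsed[1]) ? 0 : 1;

        if (pos[s] >= NBLOCKS)
            s = 1 - s;

        if (pos[s] > highest_read)
        {
            elapsed[s] += IO_COST;      /* leader: waits for the physical read */
            highest_read = pos[s];
        }
        else if (pos[s] <= highest_read - CACHE_WIN)
            elapsed[s] += IO_COST;      /* fell out of the window: rereads from disk */

        elapsed[s] += cpu_cost[s];
        pos[s]++;

        if (pos[s] % 500 == 0)
            printf("scanner %d reached block %4d at t=%7.0f, gap to other: %d\n",
                   s, pos[s], elapsed[s], pos[s] - pos[1 - s]);
    }
    return 0;
}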

 Imagining that all scans on a table are always synchronized, may make some
 wrongly believe that adding more backends scanning the same table will not
 incur any extra I/O; that is, only one stream of blocks will be read from
 disk no matter how many backends you add to the mix. I noticed this when I
 was creating partition tables, and each of those was a CREATE TABLE AS
 SELECT FROM original_table (to avoid WAL generation), and running more than
 3 such transactions caused the disk read throughput to behave unpredictably,
 sometimes even dipping below 1 MB/s for a few seconds at a stretch.

It's not really the scans that's causing that to be unpredictable, it's
the write I/O from the output side, which is forcing highly
nonsequential behavior (or at least I suspect so ... how many disk units
were involved in this test?)

regards, tom lane




Re: [HACKERS] Enabling Checksums

2013-04-10 Thread Jeff Davis
On Wed, 2013-04-10 at 20:17 +0100, Simon Riggs wrote:

 OK, so we have a single combined calculate a checksum for a block
 function. That uses Jeff's zeroing trick and Ants' bulk-oriented
 performance optimization.
 
 
 For buffer checksums we simply calculate for the block.

Sounds good.

 For WAL full page writes, we first set the checksums for all defined
 buffers, then calculate the checksum of remaining data plus the
 pd_checksum field from each block using the normal WAL CRC32.
 
 Seems good to me. One set of fast code. And it avoids the weirdness
 that the checksum stored on the full page is actually wrong.

Oh, that's a nice benefit.

 It also means that the WAL checksum calculation includes the hole, yet
 we do not include the data for the hole. So we have to do an extra
 copy when restoring the backup block.

I like this, but it sounds like there is some room for discussion on
some of these points. I assume changes to the WAL checksums are 9.4
material?

I'm satisfied with SIMD data checksums in 9.3 and that we have a plan
for using SIMD for WAL checksums later.

Regards,
Jeff Davis






[HACKERS] ObjectClass/ObjectType mixup

2013-04-10 Thread Peter Eisentraut
src/backend/catalog/dependency.c:213:   
EventTriggerSupportsObjectType(getObjectClass(thisobj)))
src/backend/commands/event_trigger.c:1014:  
Assert(EventTriggerSupportsObjectType(getObjectClass(object)));

getObjectClass() returns type ObjectClass, but
EventTriggerSupportsObjectType() takes type ObjectType, which are not
the same thing.  I think this code might be wrong.
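
The reason the compiler lets this through is that C converts enum values to int
(and thereby to other enum types) implicitly, so an ObjectClass slips silently
into an ObjectType parameter. A stripped-down illustration follows; the enum
values are made up, not the real lists from dependency.h and parsenodes.h.

#include <stdio.h>
#include <stdbool.h>

/* Made-up, abbreviated stand-ins for the real enums. */
typedef enum ToyObjectClass { TOY_OCLASS_CLASS, TOY_OCLASS_PROC, TOY_OCLASS_OPERATOR } ToyObjectClass;
typedef enum ToyObjectType  { TOY_OBJECT_AGGREGATE, TOY_OBJECT_TABLE, TOY_OBJECT_FUNCTION } ToyObjectType;

/* Stands in for EventTriggerSupportsObjectType(): it interprets whatever
 * integer it is handed as an ObjectType. */
static bool
toy_supports_object_type(ToyObjectType obtype)
{
	return obtype == TOY_OBJECT_TABLE || obtype == TOY_OBJECT_FUNCTION;
}

int
main(void)
{
	ToyObjectClass cls = TOY_OCLASS_PROC;	/* numeric value 1: "a pg_proc object" */

	/*
	 * Most C compilers accept this silently (some newer ones warn), but
	 * value 1 means "pg_proc object" in one enum and "table" in the other
	 * -- the kind of mixup the message above points out.
	 */
	printf("%d\n", toy_supports_object_type(cls));
	return 0;
}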






Re: [HACKERS] [DOCS] synchronize_seqscans' description is a bit misleading

2013-04-10 Thread Gurjeet Singh
On Wed, Apr 10, 2013 at 11:10 PM, Tom Lane t...@sss.pgh.pa.us wrote:

 Gurjeet Singh gurj...@singh.im writes:
  If I'm reading the code right [1], this GUC does not actually
 *synchronize*
  the scans, but instead just makes sure that a new scan starts from a
 block
  that was reported by some other backend performing a scan on the same
  relation.

 Well, that's the only *direct* effect, but ...

  Since the backends scanning the relation may be processing the relation
 at
  different speeds, even though each one took the hint when starting the
  scan, they may end up being out of sync with each other.

 The point you're missing is that the synchronization is self-enforcing:
 whichever backend gets ahead of the others will be the one forced to
 request (and wait for) the next physical I/O.  This will naturally slow
 down the lower-CPU-cost-per-page scans.  The other ones tend to catch up
 during the I/O operation.


Got it. So far, so good.

Let's consider a pathological case where a scan is performed by a
user-controlled cursor, whose scan speed depends on how fast the user presses
the "Next" button; such a scan is quickly going to fall out of sync with
other scans. Moreover, if a new scan happens to pick up the block reported
by this slow scan, then that new scan may have to read blocks off the disk
afresh.

So, again, it is not guaranteed that all the scans on a relation will
synchronize with each other. Hence my proposal to include the term
'probability' in the definition.


 The feature is not terribly useful unless I/O costs are high compared to
 the CPU cost-per-page.  But when that is true, it's actually rather
 robust.  Backends don't have to have exactly the same per-page
 processing cost, because pages stay in shared buffers for a while after
 the current scan leader reads them.


Agreed. Even if the buffer has been evicted from shared_buffers, there's a
high likelihood that the scan that's close on the heels of others will
fetch it from FS cache.



  Imagining that all scans on a table are always synchronized, may make
 some
  wrongly believe that adding more backends scanning the same table will
 not
  incur any extra I/O; that is, only one stream of blocks will be read from
  disk no matter how many backends you add to the mix. I noticed this when
 I
  was creating partition tables, and each of those was a CREATE TABLE AS
  SELECT FROM original_table (to avoid WAL generation), and running more
 than
  3 such transactions caused the disk read throughput to behave
 unpredictably,
  sometimes even dipping below 1 MB/s for a few seconds at a stretch.

 It's not really the scans that's causing that to be unpredictable, it's
 the write I/O from the output side, which is forcing highly
 nonsequential behavior (or at least I suspect so ... how many disk units
 were involved in this test?)


You may be right. I don't have access to the system anymore, and I don't
remember the disk layout, but it's quite possible that write operations
were causing the  read throughput to drop. I did try to reproduce the
behaviour on my laptop with up to 6 backends doing pure reads on a table
that was multiple times the system RAM, but I could not get them to get out
of sync.

-- 
Gurjeet Singh

http://gurjeet.singh.im/

EnterpriseDB Inc.


Re: [HACKERS] [DOCS] synchronize_seqscans' description is a bit misleading

2013-04-10 Thread Tom Lane
Gurjeet Singh gurj...@singh.im writes:
 On Wed, Apr 10, 2013 at 11:10 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 The point you're missing is that the synchronization is self-enforcing:

 Let's consider a pathological case where a scan is performed by a user
 controlled cursor, whose scan speed depends on how fast the user presses
 the Next button, then this scan is quickly going to fall out of sync with
 other scans. Moreover, if a new scan happens to pick up the block reported
 by this slow scan, then that new scan may have to read blocks off the disk
 afresh.

Sure --- if a backend stalls completely, it will fall out of the
synchronized group.  And that's a good thing; we'd surely not want to
block the other queries while waiting for a user who just went to lunch.

 So, again, it is not guaranteed that all the scans on a relation will
 synchronize with each other. Hence my proposal to include the term
 'probability' in the definition.

Yeah, it's definitely not guaranteed in any sense.  But I don't really
think your proposed wording is an improvement.  The existing wording
isn't promising guaranteed sync either, to my eyes.

Perhaps we could compromise on, say, changing "so that concurrent scans
read the same block at about the same time" to "so that concurrent scans
tend to read the same block at about the same time", or something like
that.  I don't mind making it sound a bit more uncertain, but I don't
think that we need to emphasize the probability of failure.

regards, tom lane




Re: [HACKERS] [GSOC] questions about idea rewrite pg_dump as library

2013-04-10 Thread Pavel Golub
Hello, 帅.

You wrote:

帅 Hi all, 
帅 I'd like to introduce myself to the dev community. I am Shuai
帅 Fan, a student from Dalian University of Technology (DLUT for short),
帅 China. And I am interested in working on a PostgreSQL project in
GSOC2013.
帅 I'm interested in the idea "Rewrite (add) pg_dump and
帅 pg_restore utilities as libraries (.so, .dll & .dylib)".

帅 These days, I had a talk with Mr. Pavel Golub by email, the
帅 author of this post. And asked some questions about this idea. He
帅 adviced me to post the question to this mail list. 

帅 My question is:
帅 There are lots of functions in pg_dump.c. If I rewrite this
帅 file as a library, I should split pg_dump.c into two or more
帅 files (Mr. Pavel Golub's advice). However, some functions do have return
values, e.g.

帅 static DumpableObject *createBoundaryObjects(void);

帅 I thought these functions would have to return values through
帅 function arguments by passing pointers to them, when used as a library.
帅 But the question is: if I did so, the function prototypes may be
帅 changed. And Mr. Pavel Golub said it's better to keep all
帅 functions with the same declaration, so that we only have one
帅 code base for both the console application and the library. I think he is right.
帅 But then, how can I deal with this situation? From my point of
帅 view, I can't malloc a variable (e.g. DumpableObject) in one
帅 library (e.g. pg_dumplib.so) and then return its pointer to
帅 another library (or executable program). Maybe it's not safe(?). Or
帅 it has something to do with memory leaks(?). I'm not sure.
帅 
帅 Do you have any suggestions?

From my point of view the new library should export only two
functions:

1. The execution function:

ExecStatusType PGdumpdbParams(const char * const *keywords,
 const char * const *values);

The return type could be something else, but ExecStatusType seems OK for this
purpose: PGRES_TUPLES_OK for success, PGRES_FATAL_ERROR for failure.

Parameters will remain the same as the usual command-line options for
pg_dump. Thus there will be less work for existing applications, e.g.
pgAdmin.

2. Logging or notice processing function:

typedef void (*PGdumplogProcessor) (char *relname, int done, char *message);

PGdumplogProcessor PQsetNoticeProcessor(PGdumplogProcessor proc,
 void *arg);

The purpose of this function is to process the output of the dump. The first
argument is a callback function which accepts information about the current
relname (or operation, or stage); done indicates how much work has been done
(for progress bars etc.); message contains some extra information.

That's all! Only two functions. Indeed, we don't need all those
low-level dump functions like createBoundaryObjects etc. There will be
only one entry point to the whole logic. And if one wants only one
table, function or schema, the correct combination of parameters should
be passed to PGdumpdbParams.
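
For concreteness, the whole public surface described above could fit in a
header like the sketch below. Nothing here exists in PostgreSQL today; the
setter is named PGdumpsetLogProcessor here only to avoid the collision with
libpq's PQsetNoticeProcessor, which the proposal reuses.

/*
 * libpgdump.h -- hypothetical public interface for a pg_dump library,
 * following the two-function proposal above.  Everything here is a sketch.
 */
#ifndef LIBPGDUMP_H
#define LIBPGDUMP_H

#include "libpq-fe.h"			/* for ExecStatusType */

/*
 * Progress/log callback: "relname" is the object (or stage) currently being
 * dumped, "done" a rough completion figure for progress bars, "message" any
 * extra detail.  Matches the typedef in the proposal.
 */
typedef void (*PGdumplogProcessor) (char *relname, int done, char *message);

/*
 * Run a dump.  keywords/values mirror pg_dump's command-line options, so
 * existing callers such as pgAdmin can map their options straight across.
 * Returns PGRES_TUPLES_OK on success, PGRES_FATAL_ERROR on failure.
 */
extern ExecStatusType PGdumpdbParams(const char *const *keywords,
									 const char *const *values);

/*
 * Install a log/progress callback; returns the previously installed one.
 */
extern PGdumplogProcessor PGdumpsetLogProcessor(PGdumplogProcessor proc,
												void *arg);

#endif							/* LIBPGDUMP_H */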

帅 Best wishes,
帅 Shuai




-- 
With best wishes,
 Pavel  mailto:pa...@gf.microolap.com


