Re: [HACKERS] Streaming Replication: Checkpoint_segment and wal_keep_segments on standby

2010-06-10 Thread Fujii Masao
On Thu, Jun 10, 2010 at 12:09 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 Ok, committed with some cosmetic changes.

Thanks!

 BTW, should there be doc changes for this? I didn't find anything explaining
 how restartpoints are triggered, we should add a paragraph somewhere.

+1

What about the attached patch?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


trigger_restartpoint_doc_v1.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fix use of posix_fadvise in xlog.c

2010-06-10 Thread Heikki Linnakangas

On 10/06/10 06:47, Mark Wong wrote:

I wanted to propose a fix to xlog.c regarding the use of
posix_fadvise() for 9.1 (unless someone feels it's ok for 9.0).
Currently posix_fadvise() is called right before a log file is closed,
so by the time it is called it's effectively not doing anything
anymore.  This patch moves the posix_fadvise() call into 3 other
locations within XLogFileInit() where a file handle is returned.  The
first case is where an existing open file handle is returned.  The
next case is when a file is to be zeroed out.  The third case is
returning a file handle, which may be the file that was just zeroed
out.


I don't think POSIX_FADV_DONTNEED does what you think it does. It tells 
the kernel "I don't need to keep these pages in the cache anymore; I 
won't be accessing them". If you call it when you open the file, before 
reading/writing, there is nothing in the cache and the call will do 
nothing.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] InvalidXLogRecPtr in docs

2010-06-10 Thread Fujii Masao
On Thu, Jun 10, 2010 at 11:56 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 On Wed, Jun 9, 2010 at 9:46 PM, Takahiro Itagaki
 itagaki.takah...@oss.ntt.co.jp wrote:
I found the term "InvalidXLogRecPtr" in the 9.0 docs.
 http://developer.postgresql.org/pgdocs/postgres/functions-admin.html#FUNCTIONS-RECOVERY-INFO-TABLE
 | ... then the return value will be InvalidXLogRecPtr (0/0).

 Maybe we should be returning NULL instead of 0/0.

 +1 for using NULL instead of an artificially chosen value, for both of
 those functions.

Okay, the attached patch makes those functions return NULL in that case.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


recovery_funcs_return_null_v1.patch
Description: Binary data



Re: [HACKERS] InvalidXLogRecPtr in docs

2010-06-10 Thread Heikki Linnakangas

On 10/06/10 05:56, Tom Lane wrote:

Robert Haasrobertmh...@gmail.com  writes:

On Wed, Jun 9, 2010 at 9:46 PM, Takahiro Itagaki
itagaki.takah...@oss.ntt.co.jp  wrote:

I found the term "InvalidXLogRecPtr" in the 9.0 docs.
http://developer.postgresql.org/pgdocs/postgres/functions-admin.html#FUNCTIONS-RECOVERY-INFO-TABLE
| ... then the return value will be InvalidXLogRecPtr (0/0).



Maybe we should be returning NULL instead of 0/0.


+1 for using NULL instead of an artificially chosen value, for both of
those functions.


Agreed, committed a patch to do that.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] InvalidXLogRecPtr in docs

2010-06-10 Thread Heikki Linnakangas

On 10/06/10 09:42, Fujii Masao wrote:

On Thu, Jun 10, 2010 at 11:56 AM, Tom Lanet...@sss.pgh.pa.us  wrote:

Robert Haasrobertmh...@gmail.com  writes:

On Wed, Jun 9, 2010 at 9:46 PM, Takahiro Itagaki
itagaki.takah...@oss.ntt.co.jp  wrote:

I found the term "InvalidXLogRecPtr" in the 9.0 docs.
http://developer.postgresql.org/pgdocs/postgres/functions-admin.html#FUNCTIONS-RECOVERY-INFO-TABLE
| ... then the return value will be InvalidXLogRecPtr (0/0).



Maybe we should be returning NULL instead of 0/0.


+1 for using NULL instead of an artificially chosen value, for both of
those functions.


Okay, the attached patch makes those functions return NULL in that case.


Ah, I just committed a patch to do the same, before seeing your email. 
Thanks anyway.


BTW, the docs claim about pg_last_xlog_location() that "While streaming 
replication is in progress this will increase monotonically". That's a 
bit misleading: when the replication connection is broken for some 
reason and we restart it, we begin streaming from the beginning of the 
last WAL segment. So at that moment, pg_last_xlog_location() moves 
backwards to the beginning of the WAL segment.


Should we:
1. Just document that,
2. Change pg_last_xlog_location() to not move backwards in that case, or
3. Change the behavior so that we start streaming at the exact byte 
location where we left off?


I believe that starting from the beginning of the WAL segment is just 
paranoia, to avoid creating a WAL file that's missing some data from the 
beginning. Right?


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Command to prune archive at restartpoints

2010-06-10 Thread Heikki Linnakangas

On 09/06/10 10:21, Simon Riggs wrote:

On Tue, 2010-06-08 at 18:30 -0400, Andrew Dunstan wrote:


I prefer archive_cleanup_command. We should name things after their
principal function, not an implementation detail, IMNSHO.

More importantly, we should include an example in the docs. I created
one the other day  when this was actually bothering me a bit (see
http://people.planetpostgresql.org/andrew/index.php?/archives/85-Keeping-a-hot-standby-log-archive-clean.html).
That seemed to work ok, but maybe it's too long, and maybe people would
prefer a shell script to perl.


I submitted a patch to make the command pg_standby -a %r

That's a more portable solution, ISTM.

I'll commit that and fix the docs.


Huh, wait. There's no -a option in pg_standby, so I presume you're 
planning to add that too. I don't like confusing pg_standby into this, 
the docs are currently quite clear that if you want to use the built-in 
standby mode, you can't use pg_standby, and this would muddy the waters.


Maybe we could add a new pg_cleanuparchive binary, but we'll need some 
discussion...


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



[HACKERS] variable TriggerFile can be declared as static

2010-06-10 Thread Fujii Masao
Hi,

Currently the variable TriggerFile is declared as extern, but
it's not used in other source file than xlog.c. How about
declaring it as static? Here is the patch.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


definition_triggerfile_v1.patch
Description: Binary data



Re: [HACKERS] Command to prune archive at restartpoints

2010-06-10 Thread Simon Riggs
On Thu, 2010-06-10 at 10:18 +0300, Heikki Linnakangas wrote:
 On 09/06/10 10:21, Simon Riggs wrote:
  On Tue, 2010-06-08 at 18:30 -0400, Andrew Dunstan wrote:
 
  I prefer archive_cleanup_command. We should name things after their
  principal function, not an implementation detail, IMNSHO.
 
  More importantly, we should include an example in the docs. I created
  one the other day  when this was actually bothering me a bit (see
  http://people.planetpostgresql.org/andrew/index.php?/archives/85-Keeping-a-hot-standby-log-archive-clean.html).
  That seemed to work ok, but maybe it's too long, and maybe people would
  prefer a shell script to perl.
 
  I submitted a patch to make the command pg_standby -a %r
 
  That's a more portable solution, ISTM.
 
  I'll commit that and fix the docs.
 
 Huh, wait. There's no -a option in pg_standby, so I presume you're 
 planning to add that too. I don't like confusing pg_standby into this, 
 the docs are currently quite clear that if you want to use the built-in 
 standby mode, you can't use pg_standby, and this would muddy the waters.

It won't kill us to change that sentence. pg_standby is only used now
within the cleanup command, etc.

pg_standby already contains the exact logic we need here. Having two
sets of code for the same thing isn't how we do things.

 Maybe we could add a new pg_cleanuparchive binary, but we'll need some 
 discussion...

Which will go nowhere, as we both already know.

-- 
 Simon Riggs   www.2ndQuadrant.com




Re: [HACKERS] InvalidXLogRecPtr in docs

2010-06-10 Thread Fujii Masao
On Thu, Jun 10, 2010 at 4:07 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 Ah, I just committed a patch to do the same, before seeing your email.
 Thanks anyway.

Yeah, thanks a lot!

 BTW, the docs claim about pg_last_xlog_location() that While streaming
 replication is in progress this will increase monotonically. That's a bit
 misleading: when the replication connection is broken for some reason and we
 restart it, we begin streaming from the beginning of the last WAL segment.
 So at that moment, pg_last_xlog_location() moves backwards to the beginning
 of the WAL segment.

 Should we:
 1. Just document that,
 2. Change pg_last_xlog_location() to not move backwards in that case, or
 3. Change the behavior so that we start streaming at the exact byte location
 where we left off?

I'm for 2 as follows.

diff --git a/src/backend/replication/walreceiver.c
b/src/backend/replication/walreceiver.c
index 26aeca6..f0fd813 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -524,7 +524,8 @@ XLogWalRcvFlush(void)

	/* Update shared-memory status */
	SpinLockAcquire(&walrcv->mutex);
-	walrcv->receivedUpto = LogstreamResult.Flush;
+	if (XLByteLT(walrcv->receivedUpto, LogstreamResult.Flush))
+		walrcv->receivedUpto = LogstreamResult.Flush;
	SpinLockRelease(&walrcv->mutex);


 I believe that starting from the beginning of the WAL segment is just
 paranoia, to avoid creating a WAL file that's missing some data from the
 beginning. Right?

Only when the recovery starting record (i.e., the record at the checkpoint
redo location) is not found, we need to start replication from the beginning
of the segment, I think. That is, fetching_ckpt = true case in the following
code.

 if (PrimaryConnInfo)
 {
   RequestXLogStreaming(
   fetching_ckpt ? RedoStartLSN : *RecPtr,
   PrimaryConnInfo);
   continue;
 }

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] variable TriggerFile can be declared as static

2010-06-10 Thread Heikki Linnakangas

On 10/06/10 10:25, Fujii Masao wrote:

Currently the variable TriggerFile is declared as extern, but
it's not used in other source file than xlog.c. How about
declaring it as static? Here is the patch.


Thanks, applied.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] InvalidXLogRecPtr in docs

2010-06-10 Thread Heikki Linnakangas

On 10/06/10 10:43, Fujii Masao wrote:

On Thu, Jun 10, 2010 at 4:07 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com  wrote:

BTW, the docs claim about pg_last_xlog_location() that While streaming
replication is in progress this will increase monotonically. That's a bit
misleading: when the replication connection is broken for some reason and we
restart it, we begin streaming from the beginning of the last WAL segment.
So at that moment, pg_last_xlog_location() moves backwards to the beginning
of the WAL segment.

Should we:
1. Just document that,
2. Change pg_last_xlog_location() to not move backwards in that case, or
3. Change the behavior so that we start streaming at the exact byte location
where we left off?


I'm for 2 as follows.

diff --git a/src/backend/replication/walreceiver.c
b/src/backend/replication/walreceiver.c
index 26aeca6..f0fd813 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -524,7 +524,8 @@ XLogWalRcvFlush(void)

 	/* Update shared-memory status */
 	SpinLockAcquire(&walrcv->mutex);
-	walrcv->receivedUpto = LogstreamResult.Flush;
+	if (XLByteLT(walrcv->receivedUpto, LogstreamResult.Flush))
+		walrcv->receivedUpto = LogstreamResult.Flush;
 	SpinLockRelease(&walrcv->mutex);


That's not enough, because we set receivedUpto in RequestXlogStreaming() 
already.



I believe that starting from the beginning of the WAL segment is just
paranoia, to avoid creating a WAL file that's missing some data from the
beginning. Right?


Only when the recovery starting record (i.e., the record at the checkpoint
redo location) is not found, we need to start replication from the beginning
of the segment, I think. That is, fetching_ckpt = true case in the following
code.


if (PrimaryConnInfo)
{
RequestXLogStreaming(
fetching_ckpt ? RedoStartLSN : *RecPtr,
PrimaryConnInfo);
continue;
}


Even then, we wouldn't need to start from the beginning of the WAL 
segment AFAICS. The point is to start from the Redo pointer, not from 
the checkpoint record, because as soon as we read the checkpoint record 
we'll need to start applying WAL from the Redo pointer, which is 
earlier. The WAL file boundaries don't come into play there.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] InvalidXLogRecPtr in docs

2010-06-10 Thread Fujii Masao
On Thu, Jun 10, 2010 at 5:04 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 Should we:
 1. Just document that,
 2. Change pg_last_xlog_location() to not move backwards in that case, or
 3. Change the behavior so that we start streaming at the exact byte
 location
 where we left off?

 I'm for 2 as follows.

 diff --git a/src/backend/replication/walreceiver.c
 b/src/backend/replication/walreceiver.c
 index 26aeca6..f0fd813 100644
 --- a/src/backend/replication/walreceiver.c
 +++ b/src/backend/replication/walreceiver.c
 @@ -524,7 +524,8 @@ XLogWalRcvFlush(void)

                 /* Update shared-memory status */
                 SpinLockAcquire(&walrcv->mutex);
 -               walrcv->receivedUpto = LogstreamResult.Flush;
 +               if (XLByteLT(walrcv->receivedUpto, LogstreamResult.Flush))
 +                       walrcv->receivedUpto = LogstreamResult.Flush;
                 SpinLockRelease(&walrcv->mutex);

 That's not enough, because we set receivedUpto in RequestXlogStreaming()
 already.

Ah, you are right.

 I believe that starting from the beginning of the WAL segment is just
 paranoia, to avoid creating a WAL file that's missing some data from the
 beginning. Right?

 Only when the recovery starting record (i.e., the record at the checkpoint
 redo location) is not found, we need to start replication from the
 beginning
 of the segment, I think. That is, fetching_ckpt = true case in the
 following
 code.

 if (PrimaryConnInfo)
 {
        RequestXLogStreaming(
                fetching_ckpt ? RedoStartLSN : *RecPtr,
                PrimaryConnInfo);
        continue;
 }

 Even then, we wouldn't need to start from the beginning of the WAL segment
 AFAICS. The point is to start from the Redo pointer, not from the checkpoint
 record, because as soon as we read the checkpoint record we'll need to start
 applying WAL from the Redo pointer, which is earlier. The WAL file
 boundaries don't come into play there.

You mean that the WAL file containing the Redo pointer is guaranteed to exist
if we could read the checkpoint record, so we don't need to start from the
beginning of the segment? This is probably true. But what if we could not read
the checkpoint record? In this case, the WAL file containing the Redo pointer
also might not exist.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] LLVM / clang

2010-06-10 Thread Peter Eisentraut
On ons, 2010-06-09 at 09:59 +0200, Florian Pflug wrote:
 The most heavily platform dependent part of the code is the spinlock
 implementation. You might want to check that it actually uses the
 version optimized for your platform, not the (much slower) generic
 implementation based on semaphores.

You only get the slow implementation if you configure explicitly with
--disable-spinlocks.  A toolchain that didn't support spinlocks would
fail the build and then the user could use that option to get past that
problem.




Re: [HACKERS] LLVM / clang

2010-06-10 Thread Peter Eisentraut
On tis, 2010-06-08 at 12:12 +0200, P. Caillaud wrote:
 I'd like to experiment with compiling postgres with LLVM (either llvm-gcc or
 clang) on Linux. Is it supported? Where should I start?

The way to choose a compiler is

./configure CC=your-cc ...other...options...

We support a fair amount of non-GCC compilers, so supporting one or two
more should be possible.

Quick testing shows that clang doesn't get through the configure stage
on this Debian system -- it looks like some amount of better integration
with glibc might be needed.  Building with llvm-gcc works fine, but I
understand that using llvm-gcc with native code generation isn't all
that different from using gcc itself, so that's not a surprising result.
The only issue is that the float8 regression test fails, so it is
apparently not *exactly* the same.





Re: [HACKERS] InvalidXLogRecPtr in docs

2010-06-10 Thread Heikki Linnakangas

On 10/06/10 11:37, Fujii Masao wrote:

On Thu, Jun 10, 2010 at 5:04 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com  wrote:

I believe that starting from the beginning of the WAL segment is just
paranoia, to avoid creating a WAL file that's missing some data from the
beginning. Right?


Only when the recovery starting record (i.e., the record at the checkpoint
redo location) is not found, we need to start replication from the
beginning
of the segment, I think. That is, fetching_ckpt = true case in the
following
code.


if (PrimaryConnInfo)
{
RequestXLogStreaming(
fetching_ckpt ? RedoStartLSN : *RecPtr,
PrimaryConnInfo);
continue;
}


Even then, we wouldn't need to start from the beginning of the WAL segment
AFAICS. The point is to start from the Redo pointer, not from the checkpoint
record, because as soon as we read the checkpoint record we'll need to start
applying WAL from the Redo pointer, which is earlier. The WAL file
boundaries don't come into play there.


You mean that the WAL file containing the Redo pointer is guaranteed to exist
if we could read the checkpoint record, so we don't need to start from the
beginning of the segment? This is probably true. But what if we could not read
the checkpoint record? In this case, the WAL file containing the Redo pointer
also might not exist.


Oh, I think I understand the issue now: we need the header in the 
beginning of the WAL segment to be valid, even if the first record we're 
interested in is in the middle of the file. I missed that.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Streaming Replication: Checkpoint_segment and wal_keep_segments on standby

2010-06-10 Thread Heikki Linnakangas

On 10/06/10 09:14, Fujii Masao wrote:

On Thu, Jun 10, 2010 at 12:09 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com  wrote:

BTW, should there be doc changes for this? I didn't find anything explaining
how restartpoints are triggered, we should add a paragraph somewhere.


+1

What about the attached patch?


 (description of wal_keep_segments)

*** 1902,1907  SET ENABLE_SEQSCAN TO OFF;
--- 1902,1908 
  for standby purposes, and the number of old WAL segments available
  for standbys is determined based only on the location of the previous
  checkpoint and status of WAL archiving.
+ This parameter has no effect on a restartpoint.
 This parameter can only be set in the <filename>postgresql.conf</filename>
 file or on the server command line.
</para>


Hmm, I wonder if wal_keep_segments should take effect during recovery 
too? We don't support cascading slaves, but if you have two slaves 
connected to one master (without an archive), and you perform failover 
to one of them, without wal_keep_segments the 2nd slave might not find 
all the files it needs in the new master. Then again, that won't work 
without an archive anyway, because we error out at a TLI mismatch in 
replication. Seems like this is 9.1 material..



*** a/doc/src/sgml/wal.sgml
--- b/doc/src/sgml/wal.sgml
***
*** 424,429 
--- 424,430 
<para>
 There will always be at least one WAL segment file, and will normally
 not be more than (2 + <varname>checkpoint_completion_target</varname>) *
<varname>checkpoint_segments</varname> + 1
+or <varname>checkpoint_segments</varname> + <xref
linkend="guc-wal-keep-segments"> + 1
 files.  Each segment file is normally 16 MB (though this size can be
 altered when building the server).  You can use this to estimate space
 requirements for <acronym>WAL</acronym>.


That's not true, wal_keep_segments is the minimum number of files 
retained, independently of checkpoint_segments. The correct formula is 
max((2 + checkpoint_completion_target) * checkpoint_segments, wal_keep_segments).



<para>
+In archive recovery or standby mode, the server periodically performs
+<firstterm>restartpoints</firstterm><indexterm><primary>restartpoint</primary></indexterm>
+which are similar to checkpoints in normal operation: the server forces
+all its state to disk, updates the <filename>pg_control</filename> file to
+indicate that the already-processed WAL data need not be scanned again,
+and then recycles old log segment files if they are in the
+<filename>pg_xlog</filename> directory. Note that this recycling is not affected
+by <varname>wal_keep_segments</varname> at all. A restartpoint is triggered,
+if at least one checkpoint record has been replayed since the last
+restartpoint, every <varname>checkpoint_timeout</varname> seconds, or every
+<varname>checkpoint_segments</varname> log segments only in standby mode,
+whichever comes first


That last sentence is a bit unclear. How about:

A restartpoint is triggered if at least one checkpoint record has been 
replayed and <varname>checkpoint_timeout</varname> seconds have passed 
since the last restartpoint. In standby mode, a restartpoint is also 
triggered if <varname>checkpoint_segments</varname> log segments have 
been replayed since the last restartpoint and at least one checkpoint 
record has been replayed since then.



... In the log shipping case, the checkpoint interval
+on the standby is normally smaller than that on the master.
+   </para>


What does that mean? Restartpoints can't be performed more frequently 
than checkpoints in the master because restartpoints can only be 
performed at checkpoint records.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] failover vs. read only queries

2010-06-10 Thread Fujii Masao
On Thu, Jun 10, 2010 at 5:06 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Josh Berkus j...@agliodbs.com writes:
 The fact that failover current does *not* terminate existing queries and
 transactions was regarded as a feature by the audience, rather than a
 bug, when I did demos of HS/SR.  Of course, they might not have been
 thinking of the delay for writes.

 If there were an easy way to make the trigger file cancel all running
 queries, apply remaining logs and come up, then I'd vote for that for
 9.0.  I think it's the more desired behavior by most users.  However,
 I'm opposed to any complex solutions which might delay 9.0 release.

 My feeling about it is that if you want fast failover you should not
 have your failover target server configured as hot standby at all, let
 alone hot standby with a long max_standby_delay.  Such a slave could be
 very far behind on applying WAL when the crunch comes, and no amount of
 query killing will save you from that.  Put your long-running standby
 queries on a different slave instead.

 We should consider whether we can improve the situation in 9.1, but it
 is not a must-fix for 9.0; especially when the correct behavior isn't
 immediately obvious.

OK. Let's revisit in 9.1.

I attached the proposed patch for 9.1. The patch treats max_standby_delay
as zero (i.e., cancels all the conflicting queries immediately) once
the trigger file is created. So we can cause recovery to end without
waiting for any lock held by queries, and minimize the failover time.
OTOH, queries which don't conflict with recovery survive the failover.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


prevent_lock_conflict_from_slowing_failover_v1.patch
Description: Binary data



Re: [HACKERS] failover vs. read only queries

2010-06-10 Thread Fujii Masao
On Thu, Jun 10, 2010 at 9:58 AM, Takahiro Itagaki
itagaki.takah...@oss.ntt.co.jp wrote:

 Fujii Masao masao.fu...@gmail.com wrote:

  1. Reset max_standby_delay = 0 in postgresql.conf
  2. pg_ctl reload
  3. Create a trigger file

 As far as I read the HS code, SIGHUP is not checked while a recovery
 is waiting for queries :(  So pg_ctl reload would have no effect on
 the conflicting queries.

 Independently from the problem I raised, I think that we should call
 HandleStartupProcInterrupts() in that sleep loop.

 Hmmm, if reload doesn't work, can we write a query like below?

  SELECT pg_terminate_backend(pid)
    FROM pg_locks
   WHERE conflicted-with-recovery-process;

I'm not sure about that, but as you suggested, we can minimize the failover
time by using the following operation even in 9.0.

1. Reset max_standby_delay = 0 in postgresql.conf
2. pg_ctl reload
3. Cancel all the queries or all the conflicting ones
4. Create a trigger file

For now, I'll use the above when building the HA system using 9.0
and a clusterware.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Streaming Replication: Checkpoint_segment and wal_keep_segments on standby

2010-06-10 Thread Fujii Masao
On Thu, Jun 10, 2010 at 7:19 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 --- 1902,1908 
          for standby purposes, and the number of old WAL segments
 available
          for standbys is determined based only on the location of the
 previous
          checkpoint and status of WAL archiving.
 +         This parameter has no effect on a restartpoint.
          This parameter can only be set in the
 <filename>postgresql.conf</filename>
          file or on the server command line.
         </para>

 Hmm, I wonder if wal_keep_segments should take effect during recovery too?
 We don't support cascading slaves, but if you have two slaves connected to
 one master (without an archive), and you perform failover to one of them,
 without wal_keep_segments the 2nd slave might not find all the files it
 needs in the new master. Then again, that won't work without an archive
 anyway, because we error out at a TLI mismatch in replication. Seems like
 this is 9.1 material..

Yep, since SR currently cannot cross a TLI change, there is no point in
having wal_keep_segments take effect during recovery.

 *** a/doc/src/sgml/wal.sgml
 --- b/doc/src/sgml/wal.sgml
 ***
 *** 424,429 
 --- 424,430 
   <para>
    There will always be at least one WAL segment file, and will normally
    not be more than (2 + <varname>checkpoint_completion_target</varname>)
 * <varname>checkpoint_segments</varname> + 1
 +    or <varname>checkpoint_segments</varname> + <xref
 linkend="guc-wal-keep-segments"> + 1
    files.  Each segment file is normally 16 MB (though this size can be
    altered when building the server).  You can use this to estimate space
    requirements for <acronym>WAL</acronym>.

 That's not true, wal_keep_segments is the minimum number of files retained,
 independently of checkpoint_segments. The correct formula is
 max((2 + checkpoint_completion_target) * checkpoint_segments, wal_keep_segments).

You mean that the maximum number of WAL files is the following?

max {
  (2 + checkpoint_completion_target) * checkpoint_segments,
  wal_keep_segments
}

Just after a checkpoint removes old WAL files, there might be wal_keep_segments
WAL files. Additionally, checkpoint_segments WAL files might be generated before
the subsequent checkpoint removes old WAL files. So I think that the maximum
number is

max {
  (2 + checkpoint_completion_target) * checkpoint_segments,
  wal_keep_segments + checkpoint_segments
}

Am I missing something?

    <para>
 +    In archive recovery or standby mode, the server periodically performs
 +    <firstterm>restartpoints</firstterm><indexterm><primary>restartpoint</primary></indexterm>
 +    which are similar to checkpoints in normal operation: the server
 +    forces all its state to disk, updates the <filename>pg_control</filename>
 +    file to indicate that the already-processed WAL data need not be
 +    scanned again, and then recycles old log segment files if they are
 +    in the <filename>pg_xlog</filename> directory. Note that this recycling
 +    is not affected by <varname>wal_keep_segments</varname> at all. A
 +    restartpoint is triggered, if at least one checkpoint record has been
 +    replayed since the last restartpoint, every
 +    <varname>checkpoint_timeout</varname> seconds, or every
 +    <varname>checkpoint_segments</varname> log segments only in standby
 +    mode, whichever comes first

 That last sentence is a bit unclear. How about:

 A restartpoint is triggered if at least one checkpoint record has been
 replayed and <varname>checkpoint_timeout</varname> seconds have passed
 since the last restartpoint. In standby mode, a restartpoint is also
 triggered if <varname>checkpoint_segments</varname> log segments have
 been replayed since the last restartpoint and at least one checkpoint
 record has been replayed since then.

Thanks! Seems good.

 ... In log shipping case, the checkpoint interval
 +    on the standby is normally smaller than that on the master.
 +   /para

 What does that mean? Restartpoints can't be performed more frequently than
 checkpoints in the master because restartpoints can only be performed at
 checkpoint records.

Yes, that's what I meant.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] parser handling of large object OIDs

2010-06-10 Thread Robert Haas
On Wed, Jun 9, 2010 at 10:26 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 On Wed, Jun 9, 2010 at 5:02 PM, Robert Haas robertmh...@gmail.com wrote:
 I believe that the comment code is probably right, because I think
 IConst can only handle values < 2^31, whereas OIDs can be as large as
 2^32-1.

 I investigated this a little more and the above analysis turns out to
 be correct.  ALTER LARGE OBJECT OWNER and GRANT ... ON LARGE OBJECT
 don't work for large objects outside the range of a signed integer.

 Yup.

 Session demonstrating the problem and proposed patch attached.

 This patch seems extremely grotty, though.  Surely that's not the way we
 were doing it in the comment code?

I pretty much just moved the existing code from CommentLargeObject()
into a new function oidparse().  I couldn't really figure out where to
put the oidparse() function so I eventually decided on oid.c, and
therefore also ripped out the trip through the fmgr layer in favor of
calling the appropriate code directly.  Other than that it's the same
code.  I'm open to suggestions, but this is basically just a small bit
of code rearrangement.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PERFORM] No hash join across partitioned tables?

2010-06-10 Thread Robert Haas
(moving to -hackers)

On Wed, Jun 9, 2010 at 4:11 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 In going back through emails I had marked as possibly needing another
 look before 9.0 is released, I came across this issue again.  As I
 understand it, analyze (or analyse) now collects statistics for both
 the parent individually, and for the parent and its children together.
  However, as I further understand it, autovacuum won't actually fire
 off an analyze unless there's enough activity on the parent table
 considered individually to warrant it.  So if you have an empty parent
 and a bunch of children with data in it, your stats will still stink,
 unless you analyze by hand.

 Check.

 Assuming my understanding of the problem is correct, we could:

 (a) fix it,
 (b) document that you should consider periodic manual analyze commands
 in this situation, or
 (c) do nothing.

 Thoughts?

 The objections to (a) are that it might result in excessive ANALYZE work
 if not done intelligently, and that we haven't got a patch ready anyway.
 I would have liked to get to this for 9.0 but I feel it's a bit late
 now.

I guess I can't really disagree with that.  Should we try to document
this in some way?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] LLVM / clang

2010-06-10 Thread Tom Lane
Peter Eisentraut pete...@gmx.net writes:
 Quick testing shows that clang doesn't get through the configure stage
 on this Debian system -- it looks like some amount of better integration
 with glibc might be needed.  Building with llvm-gcc works fine, but I
 understand that using llvm-gcc with native code generation isn't all
 that different from using gcc itself, so that's not a surprising result.
 The only issue is that the float8 regression test fails, so it is
 apparently not *exactly* the same.

There's a buildfarm animal using llvm-gcc, and it passes just fine ...
so the float8 failure sounds to me like another integration problem.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] warning message in standby

2010-06-10 Thread Robert Haas
On Mon, Jun 7, 2010 at 9:21 AM, Fujii Masao masao.fu...@gmail.com wrote:
 When an error is found in the WAL streamed from the master, a warning
 message is repeated without interval forever in the standby. This
 consumes CPU load very much, and would interfere with read-only queries.
 To fix this problem, we should add a sleep into emode_for_corrupt_record()
 or somewhere? Or we should stop walreceiver and retry to read WAL from
 pg_xlog or the archive?

I ran into this problem at one point, too, but was in the middle of
trying to investigate a different bug and didn't have time to track
down what was causing it.

I think the basic question here is - if there's an error in the WAL,
how do we expect to EVER recover?  Even if we can read from the
archive or pg_xlog, presumably it's the same WAL - why should we be
any more successful the second time?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] InvalidXLogRecPtr in docs

2010-06-10 Thread Tom Lane
Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
 Even then, we wouldn't need to start from the beginning of the WAL 
 segment AFAICS. The point is to start from the Redo pointer, not from 
 the checkpoint record, because as soon as we read the checkpoint record 
 we'll need to start applying WAL from the Redo pointer, which is 
 earlier. The WAL file boundaries don't come into play there.

I don't believe it's a good idea to have SR not write full xlog segment
files.  Consider for example the following scenario:

1. SR writes some xlog file from the middle.
2. Filesystem says ah-hah, I know about sparse storage and doesn't
   allocate the first half of the file.
3. Failover: slave goes live.
4. xlog file gets recycled for re-use.
5. While reusing the file, we write into the first half ... or try to,
   but there's no disk space.
6. PANIC.

There are probably some other good reasons not to allow incomplete
copies of WAL files to exist on the slave system, anyway.

I'm not sure if it's worth the trouble, or even a particularly smart
idea, to force the output of the status function to be monotonic
regardless of what happens underneath.  I think removing that claim
from the docs altogether is the easiest answer.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Command to prune archive at restartpoints

2010-06-10 Thread Robert Haas
On Thu, Jun 10, 2010 at 3:28 AM, Simon Riggs si...@2ndquadrant.com wrote:
 On Thu, 2010-06-10 at 10:18 +0300, Heikki Linnakangas wrote:
 On 09/06/10 10:21, Simon Riggs wrote:
  On Tue, 2010-06-08 at 18:30 -0400, Andrew Dunstan wrote:
 
  I prefer archive_cleanup_command. We should name things after their
  principal function, not an implementation detail, IMNSHO.
 
  More importantly, we should include an example in the docs. I created
  one the other day  when this was actually bothering me a bit (see
  http://people.planetpostgresql.org/andrew/index.php?/archives/85-Keeping-a-hot-standby-log-archive-clean.html).
  That seemed to work ok, but maybe it's too long, and maybe people would
  prefer a shell script to perl.
 
  I submitted a patch to make the command pg_standby -a %r
 
  That's a more portable solution, ISTM.
 
  I'll commit that and fix the docs.

 Huh, wait. There's no -a option in pg_standby, so I presume you're
 planning to add that too. I don't like confusing pg_standby into this,
 the docs are currently quite clear that if you want to use the built-in
 standby mode, you can't use pg_standby, and this would muddy the waters.

 It won't kill us to change that sentence. pg_standby is only used now
 within the cleanup command etc

 pg_standby already contains the exact logic we need here. Having two
 sets of code for the same thing isn't how we do things.

 Maybe we could add a new pg_cleanuparchive binary, but we'll need some
 discussion...

 Which will go nowhere, as we both already know.

I have a feeling that I may be poking my nose into an incipient
shouting match, but FWIW I agree with Heikki that it would be
preferable to keep this separate from pg_standby.  Considering that
Andrew wrote this in 24 lines of Perl code (one-third of which are
basically just there for logging purposes), I'm not that worried about
code duplication, unless what we actually need is significantly more
complicated.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] warning message in standby

2010-06-10 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Mon, Jun 7, 2010 at 9:21 AM, Fujii Masao masao.fu...@gmail.com wrote:
 When an error is found in the WAL streamed from the master, a warning
 message is repeated without interval forever in the standby. This
 consumes CPU load very much, and would interfere with read-only queries.
 To fix this problem, we should add a sleep into emode_for_corrupt_record()
 or somewhere? Or we should stop walreceiver and retry to read WAL from
 pg_xlog or the archive?

 I ran into this problem at one point, too, but was in the middle of
 trying to investigate a different bug and didn't have time to track
 down what was causing it.

 I think the basic question here is - if there's an error in the WAL,
 how do we expect to EVER recover?  Even if we can read from the
 archive or pg_xlog, presumably it's the same WAL - why should we be
 any more successful the second time?

What warning message are we talking about?  All the error cases I can
think of in WAL-application are ERROR, or likely even PANIC.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Command to prune archive at restartpoints

2010-06-10 Thread Andrew Dunstan



Robert Haas wrote:

It won't kill us to change that sentence. pg_standby is only used now
within the cleanup command etc

pg_standby already contains the exact logic we need here. Having two
sets of code for the same thing isn't how we do things.



Well, we could factor out that part of the code so it could be used in 
two binaries. But ...



Maybe we could add a new pg_cleanuparchive binary, but we'll need some
discussion...
  

Which will go nowhere, as we both already know.



I have a feeling that I may be poking my nose into an incipient
shouting match, but FWIW I agree with Heikki that it would be
preferable to keep this separate from pg_standby.  Considering that
Andrew wrote this in 24 lines of Perl code (one-third of which are
basically just there for logging purposes), I'm not that worried about
code duplication, unless what we actually need is significantly more
complicated.

  


I think my logic needs a tiny piece of adjustment, to ignore the 
timeline segment of the file name. But that will hardly involve a great 
deal of extra code - just chop off the first 8 chars. It's not like the 
code for this in pg_standby.c is terribly complex.


The virtue of a perl script is that it's very easily customizable, e.g. 
you might only delete files if they are older than a certain age.


cheers

andrew

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Error with GIT Repository

2010-06-10 Thread Luxenberg, Scott I.
Greetings all,

I have been trying to create/run a build farm as part of a project I am
working on. However, I have noticed the primary git repostitory,
git.postgresql.org/git, does not seem to be working. Namely, whenever I
try to clone the directory, I receive this error:

Error: Unable to find 5e4933c31d3cd2750ee1793efe6eca43055fb273e under
http://git.postgresql.org/git/postgresql.git
Cannot obtain needed blob 5e4933c31d3cd2750ee1793efe6eca4305fb273e while
processing commit c5609c66ce2ee4fdb180be95721252b47f90499
Error: fetch failed.

I thought it would be prudent to notify the list so someone could
possibly check into this.

Thanks!

Scott Luxenberg

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fix use of posix_fadvise in xlog.c

2010-06-10 Thread Mark Wong



On Jun 9, 2010, at 11:25 PM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com 
 wrote:



On 10/06/10 06:47, Mark Wong wrote:

I wanted to propose a fix for to xlog.c regarding the use of
posix_fadvise() for 9.1 (unless someone feels it's ok for 9.0).
Currently posix_fadvise() is used right before a log file is closed  
so

it's effectively not doing anything, when posix_fadvise is to be
called.  This patch moves the posix_fadvise() call into 3 other
locations within XLogFileInit() where a file handle is returned.  The
first case is where an existing open file handle is returned.  The
next case is when a file is to be zeroed out.  The third case is
returning a file handle, which may be the file that was just zeroed
out.


I don't think POSIX_FADV_DONTNEED does what you think it does. It  
tells the kernel that you don't need to keep these pages in the  
cache anymore, I won't be accessing them anymore. If you call it  
when you open the file, before reading/writing, there is nothing in  
the cache and the call will do nothing.


Oops, my bad.  I think I was confused by the short description in the
man page.  I didn't read the longer description. :( Then would it be
worth making this call after the file is zeroed out?


Regards,
Mark

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Command to prune archive at restartpoints

2010-06-10 Thread Heikki Linnakangas

On 10/06/10 17:38, Andrew Dunstan wrote:

I think my logic needs a tiny piece of adjustment, to ignore the
timeline segment of the file name.


I'm not sure you should ignore it. Presumably anything in an older 
timeline is indeed not required anymore and can be removed, and anything 
in a newer timeline... how did it get there? Seems safer not remove it.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Error with GIT Repository

2010-06-10 Thread Andrew Dunstan



Luxenberg, Scott I. wrote:

Greetings all,

I have been trying to create/run a build farm as part of a project I am
working on. 


That seems an odd thing to do since we have one ...


However, I have noticed the primary git repostitory,
git.postgresql.org/git, does not seem to be working. Namely, whenever I
try to clone the directory, I receive this error:

Error: Unable to find 5e4933c31d3cd2750ee1793efe6eca43055fb273e under
http://git.postgresql.org/git/postgresql.git
Cannot obtain needed blob 5e4933c31d3cd2750ee1793efe6eca4305fb273e while
processing commit c5609c66ce2ee4fdb180be95721252b47f90499
Error: fetch failed.

I thought it would be prudent to notify the list so someone could
possibly check into this.


  



Why are you cloning over http? Here is the best way to clone, which 
seems to be working:


   [and...@sophia ]$ git clone --mirror
   git://git.postgresql.org/git/postgresql.git
   Initialized empty Git repository in /home/andrew/postgresql.git/
   remote: Counting objects: 376865, done.
   remote: Compressing objects: 100% (87569/87569), done.
   remote: Total 376865 (delta 310187), reused 352950 (delta 287485)
   Receiving objects: 100% (376865/376865), 178.73 MiB | 251 KiB/s, done.
   Resolving deltas: 100% (310187/310187), done.
   [and...@sophia ]$

cheers

andrew

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Command to prune archive at restartpoints

2010-06-10 Thread Andrew Dunstan



Heikki Linnakangas wrote:

On 10/06/10 17:38, Andrew Dunstan wrote:

I think my logic needs a tiny piece of adjustment, to ignore the
timeline segment of the file name.


I'm not sure you should ignore it. Presumably anything in an older 
timeline is indeed not required anymore and can be removed, and 
anything in a newer timeline... how did it get there? Seems safer not 
remove it.




Well, I was just following the logic in pg-standby.c:

   /*
    * We ignore the timeline part of the XLOG segment identifiers
    * in deciding whether a segment is still needed.  This
    * ensures that we won't prematurely remove a segment from a
    * parent timeline. We could probably be a little more
    * proactive about removing segments of non-parent timelines,
    * but that would be a whole lot more complicated.
    *
    * We use the alphanumeric sorting property of the filenames
    * to decide which ones are earlier than the
    * exclusiveCleanupFileName file. Note that this means files
    * are not removed in the order they were originally written,
    * in case this worries you.
    */
   if (strlen(xlde->d_name) == XLOG_DATA_FNAME_LEN &&
       strspn(xlde->d_name, "0123456789ABCDEF") == XLOG_DATA_FNAME_LEN &&
       strcmp(xlde->d_name + 8, exclusiveCleanupFileName + 8) < 0)



cheers

andrew


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fix use of posix_fadvise in xlog.c

2010-06-10 Thread Heikki Linnakangas

On 10/06/10 18:17, Mark Wong wrote:

On Jun 9, 2010, at 11:25 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:

I don't think POSIX_FADV_DONTNEED does what you think it does. It
tells the kernel that you don't need to keep these pages in the cache
anymore, I won't be accessing them anymore. If you call it when you
open the file, before reading/writing, there is nothing in the cache
and the call will do nothing.


Oops, my bad. I think I was confused by the short description in the man
page. I didn't read the longer description. :( Then would it be worth
making this call after the file is zeroed out?


Not sure. If you're churning through WAL files at a reasonable speed, 
the zeroed-out file will soon be written to again. OTOH, we always write 
whole pages, so maybe the OS is smart enough to not read the page back 
to memory just to overwrite it.


In a steady-state situation new WAL files are not created very often 
because we recycle old ones, so it probably doesn't make much difference.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] fix use of posix_fadvise in xlog.c

2010-06-10 Thread Tom Lane
Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
 In a steady-state situation new WAL files are not created very often 
 because we recycle old ones, so it probably doesn't make much difference.

Yeah.  We really don't worry too much about the performance of the
new-WAL-file-creation code path because of this.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] How about closing some Open Items?

2010-06-10 Thread Greg Stark
On Wed, Jun 9, 2010 at 8:01 PM, Alvaro Herrera
alvhe...@commandprompt.com wrote:
 Thanks for clearing the list.  There are only 5 remaining items, which
 is kinda exciting, though Tom's assertion that HS is still bug-ridden is
 a bit off-putting.

It's a big piece of subtle code and it's quite possible it contains
bugs. But people know that and as a result it's received a lot of
testing and careful thought already. The chances it has bugs of its
own are probably lower than for other major projects in the past. On
the other hand it's recovery-related and it shakes me that we have no
regression tests for recovery let alone standby databases.

What's more scary are either of two cases:

1) There are use cases of varying degrees of obscurity which haven't
been explicitly covered where the behaviour is not what people would
expect. We've already fixed a few such cases such as shutdown
semantics and setting up a standby based on an backup of initdb
results before starting up the database. This is the kind of thing we
need lots of users testing their real workloads with and doing test
failovers and so on.

2) There are unrelated areas of the database which have collateral
damage that nobody expected and thought to test for. Hopefully we have
enough regression tests to detect this kind of thing but again as
there are no regression tests for recovery we could have bugs in other
systems that don't turn up until you use those systems on a standby
database or after running the system as a standby database and then
bringing it up.




-- 
greg

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] warning message in standby

2010-06-10 Thread Heikki Linnakangas

On 10/06/10 17:38, Tom Lane wrote:

Robert Haas robertmh...@gmail.com  writes:

On Mon, Jun 7, 2010 at 9:21 AM, Fujii Masao masao.fu...@gmail.com  wrote:

When an error is found in the WAL streamed from the master, a warning
message is repeated without interval forever in the standby. This
consumes CPU load very much, and would interfere with read-only queries.
To fix this problem, we should add a sleep into emode_for_corrupt_record()
or somewhere? Or we should stop walreceiver and retry to read WAL from
pg_xlog or the archive?



I ran into this problem at one point, too, but was in the middle of
trying to investigate a different bug and didn't have time to track
down what was causing it.



I think the basic question here is - if there's an error in the WAL,
how do we expect to EVER recover?  Even if we can read from the
archive or pg_xlog, presumably it's the same WAL - why should we be
any more successful the second time?


What warning message are we talking about?  All the error cases I can
think of in WAL-application are ERROR, or likely even PANIC.


We're talking about a corrupt record (incorrect CRC, incorrect backlink 
etc.), not errors within redo functions. During crash recovery, a 
corrupt record means you've reached end of WAL. In standby mode, when 
streaming WAL from master, that shouldn't happen, and it's not clear 
what to do if it does. PANIC is not a good idea, at least if the server 
uses hot standby, because that only makes the situation worse from 
availability point of view. So we log the error as a WARNING, and keep 
retrying. It's unlikely that the problem will just go away, but we keep 
retrying anyway in the hope that it does. However, it seems that we're 
too aggressive with the retries.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] warning message in standby

2010-06-10 Thread Robert Haas
On Thu, Jun 10, 2010 at 12:01 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 We're talking about a corrupt record (incorrect CRC, incorrect backlink
 etc.), not errors within redo functions. During crash recovery, a corrupt
 record means you've reached end of WAL. In standby mode, when streaming WAL
 from master, that shouldn't happen, and it's not clear what to do if it
 does. PANIC is not a good idea, at least if the server uses hot standby,
 because that only makes the situation worse from availability point of view.
 So we log the error as a WARNING, and keep retrying. It's unlikely that the
 problem will just go away, but we keep retrying anyway in the hope that it
 does. However, it seems that we're too aggressive with the retries.

You can reproduce this problem by doing the following.

1. initdb
2. edit postgresql.conf, set wal_level=hot_standby, max_wal_senders=1;
edit pg_hba.conf, trust local replication connections
3. pg_ctl start
4. make some changes to the database
5. take a hot backup to another directory (call it pgslave)
6. create pgslave/recovery.conf with standby_mode='on',
primary_conninfo=whatever, edit pgslave/postgresql.conf change the
port number, set hot_standby=on
7. pg_ctl start -D pgslave

At this point you should have a working HS/SR setup.  Now:

8. shut the slave down
9. move recovery.conf out of the way
10. restart the slave - it will do recovery and enter normal running
11. make some database changes
12. stop the slave
13. put recovery.conf back
14. restart the slave
15. make a bunch of changes on the master

When the slave then tries to replay, you then get something like:

WARNING:  invalid record length at 0/4005330
WARNING:  invalid record length at 0/4005330
WARNING:  invalid record length at 0/4005330

...ad infinitum.

Obviously there are other ways this could occur - the WAL could really
be corrupted, for example - but the current handling is not too
graceful.  I'm actually thinking it might be better to trigger a
shutdown if this happens.  Probably something has gone haywire and
manual intervention is required.  Retrying when there's no hope of
success isn't really that helpful.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Error with GIT Repository

2010-06-10 Thread Andrew Dunstan



Stephen Frost wrote:

* Andrew Dunstan (and...@dunslane.net) wrote:
  

Luxenberg, Scott I. wrote:


I have been trying to create/run a build farm as part of a project I am
working on.
  

That seems an odd thing to do since we have one ...



To clarify, he's setting up a build farm *member*. :)
  


Aha. Amazing the difference one little word can make ...



As a side-note, it works just fine from git-hub's http mirror and that's
what we've been playing with, but I don't know if we want to recommend
that for build-farm members..


  


I don't see why not. Buildfarm members are going to have to reset their 
repos when we finally cut over in a few months. Luckily, this is a 
fairly painless operation - blow away the repo and change the config 
file and the script will resync as if nothing had happened.


cheers

andrew

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Error with GIT Repository

2010-06-10 Thread Stephen Frost
* Andrew Dunstan (and...@dunslane.net) wrote:
 I don't see why not. Buildfarm members are going to have to reset their  
 repos when we finally cut over in a few months. Luckily, this is a  
 fairly painless operation - blow away the repo and change the config  
 file and the script will resync as if nothing had happened.

Should we stop bothering to offer http://git.postgresql.org then..?  Or
do we expect it to get fixed and work correctly once we cut over and
rebuild?  Also, perhaps we could list the git-hub option on the wiki
(http://wiki.postgresql.org/wiki/Other_Git_Repositories)?

(and, yea, it's the same me)

Thanks,

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] failover vs. read only queries

2010-06-10 Thread Josh Berkus

On 06/09/2010 07:36 PM, Mark Kirkwood wrote:

On 10/06/10 14:07, Tatsuo Ishii wrote:


The one of top 3 questions I got
when we propose them our HA solution is, how long will it take to
do failover when the master DB crashes?



Same here +1


In that case, wouldn't they set max_standby_delay to 0?  In which case 
the failover problem goes away, no?


--
  -- Josh Berkus
 PostgreSQL Experts Inc.
 http://www.pgexperts.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] warning message in standby

2010-06-10 Thread Greg Stark
On Thu, Jun 10, 2010 at 5:13 PM, Robert Haas robertmh...@gmail.com wrote:
 At this point you should have a working HS/SR setup.  Now:

 8. shut the slave down
 9. move recovery.conf out of the way
 10. restart the slave - it will do recovery and enter normal running
 11. make some database changes
 12. stop the slave
 13. put recovery.conf back
 14. restart the slave
 15. make a bunch of changes on the master

 When the slave then tries to replay, you then get something like:

 WARNING:  invalid record length at 0/4005330
 WARNING:  invalid record length at 0/4005330
 WARNING:  invalid record length at 0/4005330


Woah, why does this procedure lead to this situation? I would hope
there's nothing a user could do which would cause it short of invoking
dd to corrupt the WAL files.

At precisely which step of the procedure did the user do something
wrong? Is there any reason we can't detect that they've done it and
throw a specific error message saying the configuration is invalid?

-- 
greg

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] warning message in standby

2010-06-10 Thread Robert Haas
On Thu, Jun 10, 2010 at 12:49 PM, Greg Stark gsst...@mit.edu wrote:
 On Thu, Jun 10, 2010 at 5:13 PM, Robert Haas robertmh...@gmail.com wrote:
 At this point you should have a working HS/SR setup.  Now:

 8. shut the slave down
 9. move recovery.conf out of the way
 10. restart the slave - it will do recovery and enter normal running
 11. make some database changes
 12. stop the slave
 13. put recovery.conf back
 14. restart the slave
 15. make a bunch of changes on the master

 When the slave then tries to replay, you then get something like:

 WARNING:  invalid record length at 0/4005330
 WARNING:  invalid record length at 0/4005330
 WARNING:  invalid record length at 0/4005330


 Woah, why does this procedure lead to this situation? I would hope
 there's nothing a user could do which would cause it short of invoking
 dd to corrupt the WAL files.

 At precisely which step of the procedure did the user do something
 wrong?

13.

 Is there any reason we can't detect that they've done it and
 throw a specific error message saying the configuration is invalid?

I'm not sure how we'd go about doing that, but I agree it would be nice.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Error with GIT Repository

2010-06-10 Thread Magnus Hagander
On Thu, Jun 10, 2010 at 18:20, Stephen Frost sfr...@snowman.net wrote:
 * Andrew Dunstan (and...@dunslane.net) wrote:
 I don't see why not. Buildfarm members are going to have to reset their
 repos when we finally cut over in a few months. Luckily, this is a
 fairly painless operation - blow away the repo and change the config
 file and the script will resync as if nothing had happened.

 Should we stop bothering to offer http://git.postgresql.org then..?  Or

No, we should not.

Especially if someone has a clue how to fix it. Last time I fixed it by
running a repack, but that didn't work this time. I have no clue
why it's asking for a file that doesn't exist.


-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/



Re: [HACKERS] Idea for getting rid of VACUUM FREEZE on cold pages

2010-06-10 Thread Jan Wieck

Seems I underestimated the importance of forensic breadcrumbs.


On 6/9/2010 12:09 PM, Tom Lane wrote:

I do like the idea of using a status bit rather than FrozenXid to mark a
frozen tuple, because that eliminates the conflict between wanting to
freeze aggressively for performance reasons and wanting to preserve Xids
for forensic reasons.  But it doesn't seem to do much for Josh's
original problem.


Depends, specifically on transaction profiles and on how long the blocks 
linger around before being written. If you can set the all-visible bit 
by the time the page is written the first time, why couldn't other bits, 
including the is-frozen one, be set at that time too?


Maybe some analysis of the typical behavior of such a system is in order. 
Especially the case Josh was mentioning seems to be a typical single-insert 
logging-style application, with little else going on on that
particular database. I can't reveal specifics about that particular 
case, but think of something like taking frequent sensor readings, that 
need to be kept for years for forensics in case there is a product 
recall some day.


And even if some cases still required another page write because those 
frozen bits cannot be set on first write, this seems to be a win-win. We 
would get rid of the FrozenXid completely and shift to a bit, so we can 
effectively have a min_freeze_age of zero while keeping the xids forever.
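To make the trade concrete, freezing as a status bit rather than an xid overwrite could look roughly like this (a sketch with made-up flag values and a stripped-down struct, not PostgreSQL's actual tuple header layout):

```c
#include <stdbool.h>
#include <stdint.h>

/* illustrative flag value; not the real infomask bit assignments */
#define HEAP_XMIN_FROZEN 0x0200

/* stripped-down stand-in for a heap tuple header */
typedef struct TupleHeaderSketch
{
    uint32_t xmin;      /* inserting transaction id */
    uint16_t infomask;  /* status bits */
} TupleHeaderSketch;

/* freezing becomes a bit flip: the original xmin survives for forensics */
static void freeze_tuple(TupleHeaderSketch *tup)
{
    tup->infomask |= HEAP_XMIN_FROZEN;
}

static bool xmin_frozen(const TupleHeaderSketch *tup)
{
    return (tup->infomask & HEAP_XMIN_FROZEN) != 0;
}
```

Unlike overwriting xmin with FrozenXid, the inserting xid remains readable after the tuple is frozen.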



Jan

--
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin



Re: [HACKERS] Command to prune archive at restartpoints

2010-06-10 Thread Heikki Linnakangas

On 10/06/10 22:24, Dimitri Fontaine wrote:

Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:

Maybe we could add a new pg_cleanuparchive binary, but we'll need some
discussion...


Would this binary ever be used manually, not invoked by PostgreSQL? As
it depends on the %r option to be given and to be right, I don't think
so.


Hmm, actually it would be pretty handy. To make use of a base backup, 
you need all the WAL files following the one where pg_start_backup() was 
called. We create a .backup file in the archive to indicate that 
location, like:


0001002F.0020.backup

So to clean up all WAL files older than those needed by that base 
backup, you would simply copy-paste that location and call 
pg_cleanuparchive:


pg_cleanuparchive /walarchive/ 0001002F

Of course, if there's a perl one-liner to do that, we can just put that 
in the docs and don't really need pg_cleanuparchive at all.
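For what it's worth, the core of such a tool is just a lexical comparison of fixed-width hex segment names against the cutoff; a minimal sketch of that decision (helper names are mine, not from any patch):

```c
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

#define XLOG_FNAME_LEN 24   /* timeline (8) + log (8) + seg (8) hex digits */

/* true if fname looks like a WAL segment file name: exactly 24 hex digits */
static bool is_wal_segment_name(const char *fname)
{
    if (strlen(fname) != XLOG_FNAME_LEN)
        return false;
    for (int i = 0; i < XLOG_FNAME_LEN; i++)
        if (!isxdigit((unsigned char) fname[i]))
            return false;
    return true;
}

/*
 * True if the segment is strictly older than the cutoff.  Because the
 * names are fixed-width hex, plain strncmp() gives the right ordering.
 */
static bool is_removable_segment(const char *fname, const char *cutoff)
{
    return is_wal_segment_name(fname) &&
           strncmp(fname, cutoff, XLOG_FNAME_LEN) < 0;
}
```

Note that .backup and .history files fail the name check and are left alone.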



Therefore my take on this problem is to provide internal commands here,
that maybe wouldn't need to be explicitly passed any argument. If
they're internal they certainly can access the information they need?


You want more flexibility in more advanced cases. Like if you have 
multiple standbys sharing the archive, you only want to remove old WAL 
files after they're not needed by *any* of the standbys anymore. Doing 
the cleanup directly in the archive_cleanup_command would cause the old 
WAL files to be removed prematurely, but you could put a shell script 
there to store the location to a file, and call pg_cleanuparchive with 
the max() of the locations reported by all standby servers.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Idea for getting rid of VACUUM FREEZE on cold pages

2010-06-10 Thread Greg Stark
On Thu, Jun 3, 2010 at 11:41 AM, Greg Stark gsst...@mit.edu wrote:
 I think to make it work you need to store a whole 64-bit reference
 transaction id consisting of both a cycle counter and a transaction
 id. The invariant for the page is that every xid on the page can be
 compared to that reference transaction id using normal transactionid
 semantics. Actually I think the easiest way to do that is to set it to
 the oldest xid on the page. The first thing to do before comparing any
 transaction id on the page with a real transaction id would be to
 figure out whether the reference xid is comparable to the live xid,
 which if it's the oldest xid on the page implies they'll all be
 comparable.

 The way to maintain that invariant would be that any xid insertion on
 the page must advance the reference xid if it's not comparable to the
 newly inserted xid. It has to be advanced to the oldest xid that's
 still comparable with the newly inserted xid. Any xids on the page
 that are older than the new reference xid have to be frozen or removed.
 I'm not sure how to do that without keeping clog forever though.
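A rough sketch of that invariant, using a 64-bit xid whose high half counts wraparound cycles (names and layout here are illustrative only, not from any patch):

```c
#include <stdbool.h>
#include <stdint.h>

/* hypothetical 64-bit xid: high 32 bits count wraparound cycles ("epoch") */
static inline uint64_t make_full_xid(uint32_t epoch, uint32_t xid)
{
    return ((uint64_t) epoch << 32) | xid;
}

/*
 * Two xids are comparable under normal transaction-id semantics only if
 * they lie within half an xid cycle (2^31) of each other.  If the page's
 * reference xid is the oldest xid on the page, checking the reference
 * against a live xid covers every xid on the page.
 */
static bool full_xid_comparable(uint64_t ref, uint64_t live)
{
    uint64_t diff = (live > ref) ? live - ref : ref - live;

    return diff < ((uint64_t) 1 << 31);
}
```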

So the more I think about this the more I think it's unavoidable that
we would need to retain clog forever.

I think the goal here is to be able to load data into the database and
then never write the data ever again. Even if you visit the page years
later after the transaction ids have wrapped around several times. In
that case there's no avoiding that you'll need to know whether that
transaction committed or aborted.

Now we could make a bet that most transactions commit and therefore we
could keep a list of aborted transactions only which we might be able
to keep forever in very little space if very few transactions abort.
Presumably we would only use this form once the transaction was about
to be truncated out of clog. I'm not too happy with the assumption
that there aren't many aborts though. Someone could come along with a
use case where they have lots of aborts and run into strange
limitations and performance characteristics.

Alternatively we could do something like keeping a list of tables
touched by any transaction. Then vacuum could look for any
non-committed transactions old enough to be in danger of aging out of
clog and ensure those tables are frozen. But any tables which have
never been touched by any such old transaction could be left alone.
when we read in the page we'll be able to recognize the old
transactions as committed if they're beyond the end of the clog
horizon.

I don't really like that idea either because it leaves performance
really quite unpredictable. I could have a large table that goes
unvacuumed for a long time -- then when I come along with some tiny
query where I hit C-c and cause an abort I suddenly set a trap which
causes a huge vacuum freeze to fire off.


-- 
greg



Re: [HACKERS] Command to prune archive at restartpoints

2010-06-10 Thread Dimitri Fontaine
Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
 Maybe we could add a new pg_cleanuparchive binary, but we'll need some
 discussion...

Would this binary ever be used manually, not invoked by PostgreSQL? As
it depends on the %r option to be given and to be right, I don't think
so.

Therefore my take on this problem is to provide internal commands here,
that maybe wouldn't need to be explicitly passed any argument. If
they're internal they certainly can access the information they need?

As a user, I'd find it so much better to trust PostgreSQL for proposing
sane defaults. As a developer, you will certainly find it easier to
maintain, document and distribute.

While at it, the other internal command we need is pg_archive_bypass for
the archive_command so that windows users have the /usr/bin/true option
too.

Regards,
-- 
dim



Re: [HACKERS] Command to prune archive at restartpoints

2010-06-10 Thread Simon Riggs
On Thu, 2010-06-10 at 22:49 +0300, Heikki Linnakangas wrote:
 On 10/06/10 22:24, Dimitri Fontaine wrote:
  Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
  Maybe we could add a new pg_cleanuparchive binary, but we'll need some
  discussion...
 
  Would this binary ever be used manually, not invoked by PostgreSQL? As
  it depends on the %r option to be given and to be right, I don't think
  so.
 
 Hmm, actually it would be pretty handy. To make use of a base backup, 
 you need all the WAL files following the one where pg_start_backup() was 
 called. We create a .backup file in the archive to indicate that 
 location, like:
 
 0001002F.0020.backup
 
 So to clean up all WAL files older than those needed by that base 
 backup, you would simply copy-paste that location and call 
 pg_cleanuparchive:
 
 pg_cleanuparchive /walarchive/ 0001002F

OK, sounds like we're on the same thought train.

Here's the code.

-- 
 Simon Riggs   www.2ndQuadrant.com
 PostgreSQL Development, 24x7 Support, Training and Services


pg_archivecleanup.tar
Description: Unix tar archive



Re: [HACKERS] Error with GIT Repository

2010-06-10 Thread Alvaro Herrera
Excerpts from Andrew Dunstan's message of Thu Jun 10 11:26:59 -0400 2010:

 Why are you cloning over http? Here is the best way to clone, which 
 seems to be working:
 
 [and...@sophia ]$ git clone --mirror
 git://git.postgresql.org/git/postgresql.git
 Initialized empty Git repository in /home/andrew/postgresql.git/

In case you're a git-ignorant like me and are wondering why the above
does not produce a usable checkout, the complete recipe is here:

http://archives.postgresql.org/message-id/20090602162347.gf23...@yugib.highrise.ca
(in short, you need a git clone --reference)

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support



Re: [HACKERS] fix use of posix_fadvise in xlog.c

2010-06-10 Thread Greg Smith

Tom Lane wrote:

Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes:
  
In a steady-state situation new WAL files are not created very often 
because we recycle old ones, so it probably doesn't make much difference.



Yeah.  We really don't worry too much about the performance of the
new-WAL-file-creation code path because of this.
  


The only situation where the WAL zeroing path turns ugly is if you 
launch a bunch of activity against a fresh server that doesn't have any 
segments to recycle yet.  The last time we talked about improving that, 
the best idea I thought came out was to be better about preallocating 
segments than the code already is, rather than trying to speed up how 
the kernel deals with the situation.  See the links for Be more 
aggressive about creating WAL files at http://wiki.postgresql.org/wiki/Todo
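For illustration, "preallocation" here just means zero-filling a segment-sized file before it is needed, much as XLogFileInit does today; a hedged sketch using plain POSIX calls (the helper name is mine):

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

#define WAL_SEGMENT_SIZE (16 * 1024 * 1024)

/*
 * Create path as a zero-filled, fsync'ed segment, so that later WAL
 * writes only overwrite existing blocks instead of extending the file.
 * Returns 0 on success, -1 on failure.
 */
static int preallocate_segment(const char *path)
{
    char zbuf[8192] = {0};
    int  fd = open(path, O_CREAT | O_EXCL | O_WRONLY, 0600);

    if (fd < 0)
        return -1;
    for (size_t written = 0; written < WAL_SEGMENT_SIZE; written += sizeof(zbuf))
    {
        if (write(fd, zbuf, sizeof(zbuf)) != (ssize_t) sizeof(zbuf))
        {
            close(fd);
            unlink(path);
            return -1;
        }
    }
    /* make sure the zeroed file survives a crash before its first use */
    if (fsync(fd) != 0 || close(fd) != 0)
        return -1;
    return 0;
}
```

A background process calling something like this ahead of demand would avoid paying the zeroing cost on the foreground WAL-write path.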


I'm also not very optimistic about adding more posix_fadvise calls 
really helping just because the implementations of those are so 
unpredictable across operating systems.  I'm sure that Mark could figure 
out the right magic to speed up this specific case on Linux, but have my 
doubts that work would translate very well to many other operating 
systems.  Whereas a more generic preallocation improvement would help 
everywhere.


--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
g...@2ndquadrant.com   www.2ndQuadrant.us




Re: [HACKERS] Bug / shortcoming in has_*_privilege

2010-06-10 Thread Robert Haas
On Thu, Jun 10, 2010 at 5:54 PM, Jim Nasby j...@nasby.net wrote:
 test...@workbook=# select has_table_privilege( 'public', 'test', 'SELECT' );
 ERROR:  role "public" does not exist
 test...@workbook=#

 So there's no way to see if a particular privilege has been granted to 
 public. ISTM 'public' should be accepted, since you can't use it as a role 
 name anyway...

 test...@workbook=# create role public;
 ERROR:  role name "public" is reserved
 test...@workbook=# create role public;
 ERROR:  role name "public" is reserved

It's a bit sticky - you could make that work for
has_table_privilege(name, oid, text) or has_table_privilege(name,
text, text), but what would you do about the versions whose first
argument is an oid?  It would seem a bit awkward to have the behavior
be asymmetrical, although I guess we could...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company



[HACKERS] Bug / shortcoming in has_*_privilege

2010-06-10 Thread Jim Nasby
test...@workbook=# select has_table_privilege( 'public', 'test', 'SELECT' );
ERROR:  role "public" does not exist
test...@workbook=# 

So there's no way to see if a particular privilege has been granted to public. 
ISTM 'public' should be accepted, since you can't use it as a role name 
anyway...

test...@workbook=# create role public;
ERROR:  role name "public" is reserved
test...@workbook=# create role public;
ERROR:  role name "public" is reserved
--
Jim C. Nasby, Database Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net





Re: [HACKERS] Exposing the Xact commit order to the user

2010-06-10 Thread Marko Kreen
On 6/4/10, Robert Haas robertmh...@gmail.com wrote:
 On Fri, Jun 4, 2010 at 10:44 AM, Greg Stark gsst...@mit.edu wrote:
   A function which takes a starting xid and a number of transactions to
   return seems very tied to one particular application. I could easily
   see other systems such as a multi-master system instead only wanting
   to compare two transactions to find out which committed first. Or
   non-replication applications where you have an LSN and want to know
   whether a given transaction had committed by that time.
  
   So one possible interface would be to do something like
   xids_committed_between(lsn_start, lsn_end) -- and yes, possibly with
   an optional argument to limit the number or records returned.


 I'm imagining that the backend data storage for this would be a file
  containing, essentially, a struct for each commit repeated over and
  over again, packed tightly.  It's easy to index into such a file using
  a sequence number (give me the 1000'th commit) but searching by LSN
  would require (a) storing the LSNs and (b) binary search.  Maybe it's
  worth adding that complexity, but I'm not sure that it is.  Keeping
  the size of this file small is important for ensuring that it has
  minimal performance impact (which is also why I'm not sold on trying
  to include the tuple counters that Jan proposed - I think we can solve
  the problem he's worried about there more cleanly in other ways).

AIUI, you index the file by offset.
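Robert's layout reduces indexing to simple arithmetic; a sketch with an illustrative record (the real field set was still under discussion at this point):

```c
#include <stdint.h>
#include <sys/types.h>

/* illustrative fixed-width commit record; the actual fields were undecided */
typedef struct CommitRecord
{
    uint32_t xid;          /* committing transaction */
    uint32_t padding;      /* keep the struct a fixed, aligned width */
    uint64_t commit_lsn;   /* WAL position of the commit record */
    int64_t  commit_time;  /* microseconds since some epoch */
} CommitRecord;

/*
 * "Give me the 1000'th commit": with fixed-width records the n'th commit
 * lives at byte offset n * sizeof(CommitRecord) -- no index needed.
 */
static inline off_t commit_record_offset(uint64_t seqno)
{
    return (off_t) (seqno * sizeof(CommitRecord));
}
```

Looking up by LSN instead of by sequence number is what would force the binary search Robert mentions, since offsets no longer follow from the key directly.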

I think
   we should be very careful about assuming that we understand
   replication and its needs better than someone who has spent many years
   developing one of the major PostgreSQL replication solutions.
  
   Well the flip side of that is that we want an interface that's useful
   for more than just one replication system. This is something basic
   enough that I think it will be useful for more than just replication
   if we design it generally enough. It should be useful for
   backup/restore processes and monitoring as well as various forms of
   replication including master-slave trigger based systems but also
   including PITR-based replication, log-parsing systems, multi-master
   trigger based systems, 2PC-based systems, etc.


 Making it general enough to serve multiple needs is good, but we've
  got to make sure that the extra complexity is buying us something.
  Jan seems pretty confident that this could be used by Londiste also,
  though it would be nice to have some confirmation from the Londiste
  developer(s) on that.  I think it may also have applications for
  distributed transactions and multi-master replication, but I am not
  too sure it helps much for PITR-based replication or log-parsing
  systems.  We want to design something that is good, but trying to
  solve too many problems may end up solving none of them well.

The potential for a single shared queue implementation, with
the additional potential for merging async replication
implementations, sounds attractive.  (Merging ~ having a
single one that satisfies a broad range of needs.)

Unless the functionality accepted into core is limited
to replication only and/or performs worse than the current
snapshot-based grouping.  Then it is uninteresting, of course.

Jan's proposal of storing small struct into segmented files
sounds like it could work.  Can't say anything more because
I can't imagine it as well as Jan.  Would need to play with
working implementation to say more...

-- 
marko



Re: [HACKERS] ps display waiting for max_standby_delay

2010-06-10 Thread Bruce Momjian
Takahiro Itagaki wrote:
 Hi,
 
 We have codes to change ps display for recovery process during hot standby.
 The current code always shows max_standby_delay for the message, but how
 about showing the actual waiting time instead? DBAs can always get the
 parameter from the postgresql.conf they wrote, so the parameter value itself
 is not very useful. The actual waiting time might be more useful for
 determining which value to set max_standby_delay to, no?
 
 [backend/storage/ipc/standby.c]
 snprintf(new_status + len, 50,
          " waiting for max_standby_delay (%d ms)",
          MaxStandbyDelay);   <== GetCurrentTimestamp() - waitStart
 set_ps_display(new_status, false);
 
 I think SQL-based activity view will be more useful than ps display,
 but it's an item for 9.1.

Sounds interesting, but how often would the ps status display be
updated?  I hope not too often.

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + None of us is going to be here forever. +



Re: [HACKERS] Idea for getting rid of VACUUM FREEZE on cold pages

2010-06-10 Thread Tom Lane
Jan Wieck janwi...@yahoo.com writes:
 Depends, specifically on transaction profiles and on how long the blocks 
 linger around before being written. If you can set the all-visible bit 
 by the time the page is written the first time, why couldn't other bits, 
 including the is-frozen one, be set at that time too?

All-visible and is-frozen would be the same bit ...

 And even if some cases still required another page write because those 
 frozen bits cannot be set on first write, this seems to be a win-win. We 
 would get rid of the FrozenXid completely and shift to a bit, so we can 
 effectively have a min_freeze_age of zero while keeping the xids forever.

Right.  I don't see any downside, other than eating another status bit
per tuple, which we can afford.

regards, tom lane



Re: [HACKERS] ps display waiting for max_standby_delay

2010-06-10 Thread Takahiro Itagaki

Bruce Momjian br...@momjian.us wrote:

  how about showing actual waiting time instead? 
"waiting for max_standby_delay (%d ms)",
   MaxStandbyDelay)
 
 Sounds interesting, but how often would the ps status display be
 updated?  I hope not too often.

We can change the interval of updates to 500ms or so if we do it,
but on second thought the ps display is not the best place for the information.

I'd like to modify the additional message "waiting for max_standby_delay"
to just "waiting", because we don't use "waiting for statement_timeout"
for normal queries.

If we need additional information about conflicts in recovery,
we would supply them with SQL views instead of ps display in 9.1.

Regards,
---
Takahiro Itagaki
NTT Open Source Software Center





Re: [HACKERS] failover vs. read only queries

2010-06-10 Thread Fujii Masao
On Fri, Jun 11, 2010 at 1:48 AM, Josh Berkus j...@agliodbs.com wrote:
 On 06/09/2010 07:36 PM, Mark Kirkwood wrote:

 On 10/06/10 14:07, Tatsuo Ishii wrote:

 One of the top 3 questions I get when we propose our HA solution to
 customers is: how long will it take to fail over when the master DB
 crashes?


 Same here +1

 In that case, wouldn't they set max_standby_delay to 0?  In which case the
 failover problem goes away, no?

Yes, but I guess they'd also like to run read-only queries on the standby.
Setting max_standby_delay to 0 would prevent them from doing that, because
conflicts with the replay of VACUUM or HOT records would often occur.
vacuum_defer_cleanup_age would be helpful for that case, but it seems
hard to tune.

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



[HACKERS] vacuum_defer_cleanup_age

2010-06-10 Thread Fujii Masao
Hi,

vacuum_defer_cleanup_age is categorized as a "Statement Behavior"
parameter in the documentation, but as a "Hot Standby" one in
postgresql.conf. Why the difference?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] ps display waiting for max_standby_delay

2010-06-10 Thread Fujii Masao
On Fri, Jun 11, 2010 at 11:20 AM, Takahiro Itagaki
itagaki.takah...@oss.ntt.co.jp wrote:

 Bruce Momjian br...@momjian.us wrote:

  how about showing actual waiting time instead?
                waiting for max_standby_delay (%d ms),
               MaxStandbyDelay)

 Sounds interesting, but how often would the ps statust display be
 updated?  I hope not too often.

 We can change the interval of updates to 500ms or so if do it,
 but I rethink ps display is not the best place for the information.

 I'd like to modify the additional message "waiting for max_standby_delay"
 to just "waiting", because we don't use "waiting for statement_timeout"
 for normal queries.

+1

I don't think that it's useful to display the value of max_standby_delay.

 If we need additional information about conflictions in recovery,
 we would supply them with SQL views instead of ps display in 9.1.

+1

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Bug / shortcoming in has_*_privilege

2010-06-10 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Thu, Jun 10, 2010 at 5:54 PM, Jim Nasby j...@nasby.net wrote:
 So there's no way to see if a particular privilege has been granted to 
 public. ISTM 'public' should be accepted, since you can't use it as a role 
 name anyway...

 It's a bit sticky - you could make that work for
 has_table_privilege(name, oid, text) or has_table_privilege(name,
 text, text), but what would you do about the versions whose first
 argument is an oid?

Nothing.  The only reason to use those forms is in a join against
pg_authid, and the public group doesn't have an entry there.

regards, tom lane



Re: [HACKERS] LLVM / clang

2010-06-10 Thread Peter Eisentraut
On tor, 2010-06-10 at 11:55 +0300, Peter Eisentraut wrote:
 Quick testing shows that clang doesn't get through the configure stage
 on this Debian system -- it looks like some amount of better
 integration with glibc might be needed.

Some details on this ...

configure has two problems.  The first is a "present but cannot be
compiled" warning about wctype.h.  This is described here:
http://llvm.org/bugs/show_bug.cgi?id=6691.  It looks like glibc 2.11
or some later version will fix this.  (eglibc 2.11 doesn't have the fix
yet.)  But this doesn't cause a problem during the compile.

The second problem is that the prototype check for accept() fails.  This
is because glibc defines the second argument to be a "transparent
union", apparently to make it look like a lot of things at once.  clang
apparently doesn't understand that.  One could address this by checking
for the typedef that glibc uses explicitly in the configure check, but
that would appear to defeat the point of the *transparent* union.  A
workaround is to remove -D_GNU_SOURCE from src/template/linux.

Predictably, this will make PL/Perl fail to build.

Also, it will make src/backend/libpq/auth.c fail to build, because
struct ucred is only defined when _GNU_SOURCE is used.  This would
actually fail to work on GCC as well, so I think we should add an
explicit configure check for struct ucred.

The rest of the build goes through and the regression tests pass.

Some new warnings, however:

xlog.c:7759:22: warning: self-comparison always results in a constant
value
max_locks_per_xact != max_locks_per_xact)
   ^

Looks like a bug.

postmaster.c:3386:18: warning: more data arguments than '%' conversions
[-Wformat-extra-args]
 remote_host, remote_port);
  ^

dt_common.c:818:75: warning: more data arguments than '%' conversions
[-Wformat-extra-args]
sprintf(str + strlen(str), (min != 0) ?
        "%+03d:%02d" : "%+03d", hour, min);
                           ~~~^

[and a few more like that]

These are instances where a format string is an expression that results
in a variable number of format arguments.  Not sure if that is actually
legal in C.

print.c:778:22: warning: field width should have type 'int', but
argument has type 'unsigned int' [-Wformat]
fprintf(fout, "%-*s%s\n", (width_total - width) / 2, "",

[and a few more like that]

Not sure about that.
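For the record, the conventional way to silence that warning is an explicit cast of the width expression to int; a tiny self-contained sketch (the helper name is mine):

```c
#include <stdio.h>
#include <string.h>

/*
 * Left-pad s so it is roughly centered in a field of 'total' columns.
 * The '*' field width argument must be an int, hence the explicit cast
 * of the unsigned expression.
 */
static void center_pad(char *out, size_t outlen,
                       unsigned total, unsigned width, const char *s)
{
    snprintf(out, outlen, "%-*s%s", (int) ((total - width) / 2), "", s);
}
```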

Also there are boatloads of warnings in the regex stuff about unused
things, that we probably don't have to worry about.





Re: [HACKERS] warning message in standby

2010-06-10 Thread Fujii Masao
On Fri, Jun 11, 2010 at 1:01 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 We're talking about a corrupt record (incorrect CRC, incorrect backlink
 etc.), not errors within redo functions. During crash recovery, a corrupt
 record means you've reached end of WAL. In standby mode, when streaming WAL
 from master, that shouldn't happen, and it's not clear what to do if it
 does. PANIC is not a good idea, at least if the server uses hot standby,
 because that only makes the situation worse from availability point of view.
 So we log the error as a WARNING, and keep retrying. It's unlikely that the
 problem will just go away, but we keep retrying anyway in the hope that it
 does. However, it seems that we're too aggressive with the retries.

Right. The attached patch calms down the retries: if we find an invalid
record while streaming WAL from the master, we sleep for 5 seconds (does
that need to be reduced?) before retrying to replay the record at the same
location where the invalid one was found. Comments?
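The pacing logic itself is tiny: on a bad record, pause and re-read the same location instead of spinning. A stub-based sketch of that control flow (read_record() here is a stand-in for the real WAL reader, and the sleep is only tallied, not performed):

```c
#include <stdbool.h>

#define RETRY_SLEEP_SECONDS 5

/* stand-in for the WAL reader: fails until the master has streamed a
 * complete, valid record at the target location */
static int bad_reads_remaining = 2;

static bool read_record(void)
{
    if (bad_reads_remaining > 0)
    {
        bad_reads_remaining--;
        return false;           /* invalid record: CRC/backlink failure */
    }
    return true;
}

/*
 * Retry the same WAL location with a fixed pause between attempts,
 * instead of busy-looping and flooding the log with warnings.
 * Returns the number of attempts taken; *slept_seconds accumulates
 * the total pause (pg_usleep() would do the real sleeping).
 */
static int replay_with_pacing(int *slept_seconds)
{
    int attempts = 0;

    *slept_seconds = 0;
    for (;;)
    {
        attempts++;
        if (read_record())
            return attempts;
        *slept_seconds += RETRY_SLEEP_SECONDS;
    }
}
```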

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


calm_down_retries_v1.patch
Description: Binary data



Re: [HACKERS] LLVM / clang

2010-06-10 Thread Tom Lane
Peter Eisentraut pete...@gmx.net writes:
 [ assorted LLVM warnings ]

 dt_common.c:818:75: warning: more data arguments than '%' conversions
 [-Wformat-extra-args]
  sprintf(str + strlen(str), (min != 0) ?
          "%+03d:%02d" : "%+03d", hour, min);
  ~~~^

 [and a few more like that]

 These are instances where a format string is an expression that results
 in a variable number of format arguments.  Not sure if that is actually
 legal in C.

I believe it's legal, but I'd be in favor of making a project policy
against it, simply because you aren't going to get any static checking
from gcc about whether the arguments match the format string.  There
isn't any good excuse not to code the above like

if (min != 0)
    sprintf(str + strlen(str), "%+03d:%02d", hour, min);
else
    sprintf(str + strlen(str), "%+03d", hour);

which would produce warnings if you managed to mess up the format match.

 print.c:778:22: warning: field width should have type 'int', but
 argument has type 'unsigned int' [-Wformat]
  fprintf(fout, "%-*s%s\n", (width_total - width) / 2, "",

 Not sure about that.

That one, on the other hand, is pretty silly ...

regards, tom lane



Re: [HACKERS] LLVM / clang

2010-06-10 Thread Takahiro Itagaki

Peter Eisentraut pete...@gmx.net wrote:

 Some new warnings, however:
 
 xlog.c:7759:22: warning: self-comparison always results in a constant
 value
 max_locks_per_xact != max_locks_per_xact)
^
 
 Looks like a bug.

Ah, it should be compared with the same-named field in ControlFile.

Index: src/backend/access/transam/xlog.c
===================================================================
--- src/backend/access/transam/xlog.c   (HEAD)
+++ src/backend/access/transam/xlog.c   (fixed)
@@ -7756,7 +7756,7 @@
 	if (wal_level != ControlFile->wal_level ||
 		MaxConnections != ControlFile->MaxConnections ||
 		max_prepared_xacts != ControlFile->max_prepared_xacts ||
-		max_locks_per_xact != max_locks_per_xact)
+		max_locks_per_xact != ControlFile->max_locks_per_xact)
 	{
 		/*
 		 * The change in number of backend slots doesn't need to be



Regards,
---
Takahiro Itagaki
NTT Open Source Software Center


