Re: [HACKERS] moving from contrib to bin

2015-04-15 Thread Peter Eisentraut
On 4/12/15 8:00 PM, Alvaro Herrera wrote:
 Peter Eisentraut wrote:
 On 3/11/15 8:21 PM, Michael Paquier wrote:
 Attached is a series of patches rebased on current HEAD; there were some
 conflicts after perl-tidying the refactoring patch for MSVC. Note that
 this series still uses PGXS in the Makefiles, I am fine to update them
 if necessary once this matter is set (already did this stuff upthread
 with a previous version).

 OK, here we go.  I have committed the pg_archivecleanup move, with a
 complete makefile and your Windows help.  Let's see how it builds.
 
 Hmm, should it appear automatically in babel.pg.org?  I don't see it
 there yet.

It needs an nls.mk file first.  Feel free to add them.
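
For reference, a minimal nls.mk follows the pattern of the ones already in
src/bin (a sketch only; the GETTEXT_FILES list for pg_archivecleanup is an
assumption):

	# src/bin/pg_archivecleanup/nls.mk
	CATALOG_NAME     = pg_archivecleanup
	AVAIL_LANGUAGES  =
	GETTEXT_FILES    = pg_archivecleanup.c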






[HACKERS] TAP tests of pg_rewind not stopping servers used for the tests

2015-04-15 Thread Michael Paquier
Hi all,

In the refactoring of pg_rewind tests committed as 53ba107, it happens
that some of the servers used for the tests are not stopped at the end
of the test. The issue is that RewindTest.pm uses END to stop the
servers, but now that the local and remote tests are grouped, half of
the servers are not stopped.

The reason why I did not notice that before is that, as pg_rewind
uses local Unix sockets to work on the nodes, the tests are not
failing.
Sorry about that, that's my fault. And attached is a patch fixing the
issue: it replaces END by a cleanup function called at the end of each
local/remote test to be sure that the servers are shut down.
Regards,
-- 
Michael
diff --git a/src/bin/pg_rewind/RewindTest.pm b/src/bin/pg_rewind/RewindTest.pm
index 50cae2c..1339871 100644
--- a/src/bin/pg_rewind/RewindTest.pm
+++ b/src/bin/pg_rewind/RewindTest.pm
@@ -22,6 +22,8 @@ package RewindTest;
 # 5. run_pg_rewind - stops the old master (if it's still running) and runs
 # pg_rewind to synchronize it with the now-promoted standby server.
 #
+# 6. clean_rewind_test - stops the servers used in the test.
+#
 # The test script can use the helper functions master_psql and standby_psql
 # to run psql against the master and standby servers, respectively. The
 # test script can also use the $connstr_master and $connstr_standby global
@@ -56,6 +58,7 @@ our @EXPORT = qw(
   create_standby
   promote_standby
   run_pg_rewind
+  clean_rewind_test
 );
 
 
@@ -262,9 +265,8 @@ recovery_target_timeline='latest'
 }
 
 # Clean up after the test. Stop both servers, if they're still running.
-END
+sub clean_rewind_test
 {
-	my $save_rc = $?;
 	if ($test_master_datadir)
 	{
 		system "pg_ctl -D $test_master_datadir -s -m immediate stop 2> /dev/null";
@@ -273,5 +275,4 @@ END
 	{
 		system "pg_ctl -D $test_standby_datadir -s -m immediate stop 2> /dev/null";
 	}
-	$? = $save_rc;
 }
diff --git a/src/bin/pg_rewind/t/001_basic.pl b/src/bin/pg_rewind/t/001_basic.pl
index ae26d01..a1d679f 100644
--- a/src/bin/pg_rewind/t/001_basic.pl
+++ b/src/bin/pg_rewind/t/001_basic.pl
@@ -78,6 +78,7 @@ in master, before promotion
 ),
 		'tail-copy');
 
+	RewindTest::clean_rewind_test();
 }
 
 # Run the test in both modes
diff --git a/src/bin/pg_rewind/t/002_databases.pl b/src/bin/pg_rewind/t/002_databases.pl
index 1cf9a3a..be1e194 100644
--- a/src/bin/pg_rewind/t/002_databases.pl
+++ b/src/bin/pg_rewind/t/002_databases.pl
@@ -40,6 +40,7 @@ standby_afterpromotion
 ),
 			   'database names');
 
+	RewindTest::clean_rewind_test();
 }
 
 # Run the test in both modes.
diff --git a/src/bin/pg_rewind/t/003_extrafiles.pl b/src/bin/pg_rewind/t/003_extrafiles.pl
index 218b865..ed50659 100644
--- a/src/bin/pg_rewind/t/003_extrafiles.pl
+++ b/src/bin/pg_rewind/t/003_extrafiles.pl
@@ -62,6 +62,8 @@ sub run_test
 			   "$test_master_datadir/tst_standby_dir/standby_subdir",
 			   "$test_master_datadir/tst_standby_dir/standby_subdir/standby_file3"],
 			  "file lists match");
+
+	RewindTest::clean_rewind_test();
 }
 
 # Run the test in both modes.



Re: [HACKERS] [COMMITTERS] pgsql: Move pg_upgrade from contrib/ to src/bin/

2015-04-15 Thread Michael Paquier
On Wed, Apr 15, 2015 at 7:54 PM, Peter Eisentraut pete...@gmx.net wrote:
 On 4/14/15 8:32 PM, Peter Eisentraut wrote:
 Move pg_upgrade from contrib/ to src/bin/

 Oh dear.  It appears the buildfarm client code needs to be updated to
 support this.  How do we do this?   (I guess the critters that are still
 green are not running this test.)

Argh..

Looking at the buildfarm client code
(https://github.com/PGBuildFarm/client-code) we are going to need to
make the logic of TestUpgrade.pm aware that pg_upgrade is now in
src/bin and not in contrib. A simple idea would be to use a Find to
guess where the pg_upgrade path is located and then use the path found
instead of hardcoding contrib/.

 Should I revert this patch while we sort this out?

I think so; keeping the buildfarm red for a long time is no good...
This is going to require an update of all the buildfarm machines, so,
taking the problem at the root, we had better improve the buildfarm
code, get it deployed on as many machines as possible, and then have
this patch pushed again.
At least it doesn't prevent the other src/bin move patches from being
pushed first; only pg_upgrade has a specific test case.
Regards,
-- 
Michael




Re: [HACKERS] FPW compression leaks information

2015-04-15 Thread Michael Paquier
On Wed, Apr 15, 2015 at 9:20 PM, Michael Paquier
michael.paqu...@gmail.com wrote:
 On Wed, Apr 15, 2015 at 2:22 PM, Fujii Masao wrote:
 On Wed, Apr 15, 2015 at 11:55 AM, Michael Paquier wrote:
 1) Doc patch to mention that it is possible that compression can give
  hints to attackers when working on sensitive fields that have a
 non-fixed size.

 I think that this patch is enough as the first step.

 I'll get something done for that at least, a big warning below the
 description of wal_compression would do it.

 2) Switch at relation level to control wal_compression.

 ALTER TABLE SET is not allowed on system catalogs like pg_authid. So should we
 change it so that a user can change the flag even on system catalogs? I'm afraid
 that the change might cause another problem, though. Probably we can disable
 the compression on every system catalog by default. But I can imagine that
 someone wants to enable the compression even on a system catalog. For example,
 pg_largeobject may cause lots of FPWs.

 We could enforce a value directly in pg_class.h for only pg_authid if
 we think that it is a problem that bad, and rely on the default system
 value for the rest. That's a hacky-ugly approach though...

Something else that I recalled has not yet been mentioned on this
thread: even if the server-wide wal_compression is off, any user can
change its value because it is PGC_USERSET, hence I think that we had
better make it at least PGC_SUSET.
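
For reference, that would be a one-line change to the wal_compression entry
in guc.c's bool option table; roughly (a sketch against the 9.5-era entry,
where only the PGC_ constant changes):

	{
		{"wal_compression", PGC_SUSET, WAL_SETTINGS,	/* was PGC_USERSET */
			gettext_noop("Compresses full-page writes written in WAL file."),
			NULL
		},
		&wal_compression,
		false,
		NULL, NULL, NULL
	},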
-- 
Michael




Re: [HACKERS] Turning off HOT/Cleanup sometimes

2015-04-15 Thread Simon Riggs
On 15 April 2015 at 08:04, Robert Haas robertmh...@gmail.com wrote:
 On Wed, Apr 15, 2015 at 3:37 AM, Simon Riggs si...@2ndquadrant.com wrote:
 On 14 April 2015 at 21:53, Robert Haas robertmh...@gmail.com wrote:
 Peter commented previously that README.HOT should get an update.  The
 relevant section seems to be "When can/should we prune or
 defragment?".

 That's easy enough to change once we agree to commit.

 I wonder if it would be a useful heuristic to still prune pages if
 those pages are already dirty.

 Useful for whom? This is about responsibility. Why should someone
 performing a large SELECT take the responsibility for cleaning pages?

 Because it makes subsequent accesses to the page cheaper.

Cheaper for whom?

 Of
 course, that applies in all cases, but when the page is already dirty,
 the cost of pruning it is probably quite small - we're going to have
 to write the page anyway, and pruning it before it gets evicted
 (perhaps even by our scan) will be cheaper than writing it now and
 writing it again after it's pruned.  When the page is clean, the cost
 of pruning is significantly higher.

We aren't going to have to write the page, but someone will.

In a single workload, the mix of actions can be useful. In separate
workloads, where some guy just wants to run a report or a backup, it's
not right that we slow them down because of someone else's actions.

 I won't take responsibility for paying my neighbor's tax bill, but I
 might take responsibility for picking up his mail while he's on
 holiday.

That makes it sound like this is an occasional, non-annoying thing.

It's more like, whoever fetches the mail needs to fetch it for
everybody. So we are slowing down one person disproportionately, while
others fly through without penalty. There is no argument that one
workload necessarily needs to perform that on behalf of the other
workload.

The actions you suggest are reasonable and should ideally be the role
of a background process. But that doesn't mean in the absence of that
we should pay the cost in the foreground.

Let me apply this patch.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, RemoteDBA, Training & Services




Re: Custom/Foreign-Join-APIs (Re: [HACKERS] [v9.5] Custom Plan API)

2015-04-15 Thread Kouhei Kaigai
  Attached explain_forein_join.patch adds capability to show the join
  combination of a ForeignScan in EXPLAIN output as an additional item,
  “Relations”.  I thought that using an array to list relations would be a
  good way too, but I chose one string value because users would like to
  know the order and type of joins too.

  A bit different from my expectation... I expected to display the names
  of the local foreign tables (and their aliases), not the remote ones,
  because all the local join logic displays local foreign table names.
  It is easy to adjust, isn't it? Probably, all you need to do is put a
  local relation name into the text buffer (at deparseSelectSql) instead
  of the deparsed remote relation.

 Oops, that’s right.  Attached is the revised version.  I chose the fully
 qualified name, schema.relname [alias], for the output.  It would waste
 some cycles during planning if that is not for EXPLAIN, but it seems
 difficult to get a list of relation names in the ExplainForeignScan()
 phase, because the planning information has gone away at that time.

I understand. The private data structure of postgres_fdw is not designed
to keep tree-structured data (like a relations join tree), so it seems to
me a straightforward way to implement the feature.

I have a small suggestion. This patch makes deparseSelectSql initialize
the StringInfoData if supplied; however, that shall usually be a task of
the function caller, not the callee.
In this case, I would like to call initStringInfo(&relations) next to the
line initStringInfo(&sql) in postgresGetForeignPlan. In my sense, it is
a bit strange to pass an uninitialized StringInfoData in order to get a
text form back.

@@ -803,7 +806,7 @@ postgresGetForeignPlan(PlannerInfo *root,
 	 */
 	initStringInfo(&sql);
 	deparseSelectSql(&sql, root, baserel, fpinfo->attrs_used, remote_conds,
-					 params_list, fdw_ps_tlist, retrieved_attrs);
+					 params_list, fdw_ps_tlist, retrieved_attrs, &relations);

/*
 * Build the fdw_private list that will be available in the executor.
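
In other words, the caller side would look roughly like this (a sketch of
the suggestion, reusing the names from the hunk above):

	/* sketch: postgresGetForeignPlan initializes both output buffers */
	initStringInfo(&sql);
	initStringInfo(&relations);
	deparseSelectSql(&sql, root, baserel, fpinfo->attrs_used, remote_conds,
					 params_list, fdw_ps_tlist, retrieved_attrs, &relations);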

Also, could you merge the EXPLAIN output feature into the main patch?
I think there is no reason to keep this feature split out.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei kai...@ak.jp.nec.com


 -Original Message-
 From: pgsql-hackers-ow...@postgresql.org
 [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Shigeru HANADA
 Sent: Tuesday, April 14, 2015 7:49 PM
 To: Kaigai Kouhei(海外 浩平)
 Cc: Ashutosh Bapat; Robert Haas; Tom Lane; Thom Brown;
 pgsql-hackers@postgreSQL.org
 Subject: Re: Custom/Foreign-Join-APIs (Re: [HACKERS] [v9.5] Custom Plan API)
 
 KaiGai-san,
 
 On 2015/04/14 14:04, Kouhei Kaigai kai...@ak.jp.nec.com wrote:
 
  * Fix typos
 
  Please review the v11 patch, and mark it as “ready for committer” if it’s ok.
 
  It's OK for me, and it should be reviewed by other people to get it
  committed.
 
 
 Thanks!
 
  In addition to essential features, I tried to implement relation listing
  in EXPLAIN output.
 
  Attached explain_forein_join.patch adds capability to show the join
  combination of a ForeignScan in EXPLAIN output as an additional item,
  “Relations”.  I thought that using an array to list relations would be a
  good way too, but I chose one string value because users would like to
  know the order and type of joins too.
 
  A bit different from my expectation... I expected to display the names
  of the local foreign tables (and their aliases), not the remote ones,
  because all the local join logic displays local foreign table names.
  It is easy to adjust, isn't it? Probably, all you need to do is put a
  local relation name into the text buffer (at deparseSelectSql) instead
  of the deparsed remote relation.
 
 Oops, that’s right.  Attached is the revised version.  I chose the fully
 qualified name, schema.relname [alias], for the output.  It would waste
 some cycles during planning if that is not for EXPLAIN, but it seems
 difficult to get a list of relation names in the ExplainForeignScan()
 phase, because the planning information has gone away at that time.
 
 
 
 --
 Shigeru HANADA
 shigeru.han...@gmail.com
 
 




Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2015-04-15 Thread Amit Kapila
On Wed, Apr 15, 2015 at 10:07 AM, Robert Haas robertmh...@gmail.com wrote:

 On Wed, Apr 15, 2015 at 12:15 AM, Amit Kapila amit.kapil...@gmail.com
wrote:
   IIUC, this will allow us to increase the usage count only when the buffer
   is touched by the clock sweep to decrement the usage count.

 Yes.

   I think such a solution will be good for the cases when many evictions
   need to be performed to satisfy the workload; OTOH when not too many
   evictions need to be done, some of the buffers that are accessed much
   more will have an equal probability of getting evicted as compared to
   buffers which are accessed less.

 Possibly, but I think it's even worse under the current algorithm.
 Under this proposal, if we go for a long time without any buffer
  evictions, every buffer's usage count will top out at 2 more than
  wherever it was after the last clock sweep.  In the worst case, every
 buffer (or most of them) could end up with the same usage count.  But
 under the status quo, they'll all go to 5, which is an even bigger
 loss of information, and which will make the first eviction much more
 expensive than if they are all pegged at 2 or 3.


Okay, got your point.  On further thought about this idea, it occurred to
me that you are trying to reduce the clock-sweep time, which will in turn
improve the chances of finding a usable buffer quickly under high
pressure.  If that is true, then I think we should have seen a similar
gain with the bgreclaimer idea which I tried some time back, because that
also reduced the time for clock-sweep.

 There could be ways to improve things further, such as by slowly
 advancing the clock sweep even when no eviction is required.  But
 that's a bit tricky to do, too.  You'd like (perhaps) to advance it
 one step for every buffer allocation, but you don't want a centralized
 counter, or unnecessary contention on the clock sweep when no eviction
 is necessary in the first place.  There's probably some way to make it
 work, though, if we put our mind to it.  I'm inclined to try the
 simpler approach first and see what it buys.


Sure, I think it is worth a try.
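
For concreteness, the capping behavior Robert describes could be sketched
like this (a sketch only, not PostgreSQL code; the cap value and all names
here are assumptions):

	#include <stdbool.h>

	/* A buffer's usage count can rise by at most TOUCH_CAP between two
	 * clock-sweep visits, instead of on every access. */
	#define TOUCH_CAP 2			/* "2 more than ... after the last clock sweep" */

	typedef struct
	{
		int		usage_count;	/* decayed by the sweep; 0 = evictable */
		int		touches;		/* accesses since the sweep last visited us */
	} BufSketch;

	static void
	touch_buffer(BufSketch *buf)
	{
		if (buf->touches < TOUCH_CAP)
			buf->touches++;
	}

	static bool
	sweep_visit(BufSketch *buf)
	{
		/* fold the bounded touches into the usage count, then decay it */
		buf->usage_count += buf->touches;
		buf->touches = 0;
		if (buf->usage_count > 0)
			buf->usage_count--;
		return (buf->usage_count == 0);		/* true => eviction candidate */
	}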


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Turning off HOT/Cleanup sometimes

2015-04-15 Thread Simon Riggs
On 15 April 2015 at 09:10, Andres Freund and...@anarazel.de wrote:
 On 2015-04-15 08:42:33 -0400, Simon Riggs wrote:
   Because it makes subsequent accesses to the page cheaper.

 Cheaper for whom?

 Everyone.

I think what you mean is "Everyone else". It is demonstrably quicker
and more consistent for a process to limit the amount of pruning it
does; pruning also causes additional WAL traffic, causing replication
lag.

I love it when someone cleans up for me. I just don't think they'll
accept the argument that they should clean up for me because it makes
their life easier.   Certainly doesn't work with my kids.


 I don't really see the downside to this suggestion.

The suggestion makes things better than they are now but is still less
than I have proposed.

If what you both mean is "IMHO this is an acceptable compromise", I
can accept it also, at this point in the CF.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, RemoteDBA, Training & Services




Re: [HACKERS] Turning off HOT/Cleanup sometimes

2015-04-15 Thread Robert Haas
On Wed, Apr 15, 2015 at 3:37 AM, Simon Riggs si...@2ndquadrant.com wrote:
 On 14 April 2015 at 21:53, Robert Haas robertmh...@gmail.com wrote:
 Peter commented previously that README.HOT should get an update.  The
relevant section seems to be "When can/should we prune or
defragment?".

 That's easy enough to change once we agree to commit.

 I wonder if it would be a useful heuristic to still prune pages if
 those pages are already dirty.

 Useful for whom? This is about responsibility. Why should someone
 performing a large SELECT take the responsibility for cleaning pages?

Because it makes subsequent accesses to the page cheaper.  Of
course, that applies in all cases, but when the page is already dirty,
the cost of pruning it is probably quite small - we're going to have
to write the page anyway, and pruning it before it gets evicted
(perhaps even by our scan) will be cheaper than writing it now and
writing it again after it's pruned.  When the page is clean, the cost
of pruning is significantly higher.

I won't take responsibility for paying my neighbor's tax bill, but I
might take responsibility for picking up his mail while he's on
holiday.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] FPW compression leaks information

2015-04-15 Thread Michael Paquier
On Wed, Apr 15, 2015 at 2:22 PM, Fujii Masao wrote:
 On Wed, Apr 15, 2015 at 11:55 AM, Michael Paquier wrote:
 1) Doc patch to mention that it is possible that compression can give
 hints to attackers when working on sensitive fields that have a
 non-fixed size.

 I think that this patch is enough as the first step.

I'll get something done for that at least, a big warning below the
description of wal_compression would do it.

 2) Switch at relation level to control wal_compression.

 ALTER TABLE SET is not allowed on system catalogs like pg_authid. So should we
 change it so that a user can change the flag even on system catalogs? I'm afraid
 that the change might cause another problem, though. Probably we can disable
 the compression on every system catalog by default. But I can imagine that
 someone wants to enable the compression even on a system catalog. For example,
 pg_largeobject may cause lots of FPWs.

We could enforce a value directly in pg_class.h for only pg_authid if
we think that it is a problem that bad, and rely on the default system
value for the rest. That's a hacky-ugly approach though...
-- 
Michael




Re: [HACKERS] Turning off HOT/Cleanup sometimes

2015-04-15 Thread Andres Freund
On 2015-04-15 08:42:33 -0400, Simon Riggs wrote:
  Because it makes subsequent accesses to the page cheaper.
 
 Cheaper for whom?

Everyone. Including further readers. Following HOT chains in read-mostly
workloads can be really expensive. If you have workloads with a 'hot'
value range that's frequently updated, but that range moves, you can
easily end up with heavily chained tuples which won't soon be touched by
a writer again.

And writers will often not yet be able to prune the page because there
are still live readers of the older versions (like other updaters).

  Of
  course, that applies in all cases, but when the page is already dirty,
  the cost of pruning it is probably quite small - we're going to have
  to write the page anyway, and pruning it before it gets evicted
  (perhaps even by our scan) will be cheaper than writing it now and
  writing it again after it's pruned.  When the page is clean, the cost
  of pruning is significantly higher.
 
 We aren't going to have to write the page, but someone will.

If it's already dirty, that doesn't change at all. *Not* pruning at that
moment will actually often *increase* the total amount of writes to the
OS, because the pruning will then happen on the next write access or
vacuum - when the page might already have been undirtied.

I don't really see the downside to this suggestion.

 The actions you suggest are reasonable and should ideally be the role
 of a background process. But that doesn't mean in the absence of that
 we should pay the cost in the foreground.

I'm not sure that's true. A background process will either cause
additional read IO to find worthwhile pages, or it'll not find
worthwhile pages because they're already paged out.

Greetings,

Andres Freund




Re: [HACKERS] INSERT ... ON CONFLICT IGNORE (and UPDATE) 3.0

2015-04-15 Thread Heikki Linnakangas

On 04/15/2015 07:51 AM, Peter Geoghegan wrote:

+heap_finish_speculative(Relation relation, HeapTuple tuple, bool conflict)
+{
+	if (!conflict)
+	{
+		/*
+		 * Update the tuple in-place, in the common case where no conflict was
+		 * detected during speculative insertion.
+		 *
+		 * When heap_insert is called in respect of a speculative tuple, the
+		 * page will actually have a tuple inserted.  However, during recovery
+		 * replay will add an all-zero tuple to the page instead, which is the
+		 * same length as the original (but the tuple header is still WAL
+		 * logged and will still be restored at that point).  If and when the
+		 * in-place update record corresponding to releasing a value lock is
+		 * replayed, crash recovery takes the final tuple value from there.
+		 * Thus, speculative heap records require two WAL records.
+		 *
+		 * Logical decoding interprets an in-place update associated with a
+		 * speculative insertion as a regular insert change.  In other words,
+		 * the in-place record generated affirms that a speculative insertion
+		 * completed successfully.
+		 */
+		heap_inplace_update(relation, tuple);
+	}
+	else
+	{


That's a bizarre solution. May I suggest a much simpler one:

Make the heap-insert record the same for normal and speculative 
insertions, except for a flag that's set if it's a speculative one. 
Replay as usual.


When the speculative insertion is finished, write a new kind of a WAL 
record for that. The record only needs to contain the ctid of the tuple. 
Replaying that record will clear the flag on the heap tuple that said 
that it was a speculative insertion.
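
For illustration, such a record could be as small as this (a sketch; the
struct name and layout are assumptions, not taken from the patch):

	/* Sketch: WAL record finishing a speculative insertion.  Replay only
	 * needs to locate the tuple (block reference + offset number) and
	 * clear its "speculative" flag. */
	typedef struct xl_heap_confirm
	{
		OffsetNumber offnum;	/* confirmed tuple's offset on page */
	} xl_heap_confirm;

	#define SizeOfHeapConfirm	(offsetof(xl_heap_confirm, offnum) + sizeof(OffsetNumber))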


In logical decoding, decode speculative insertions like any other 
insertion. To decode a super-deletion record, scan the reorder buffer 
for the transaction to find the corresponding speculative insertion 
record for the tuple, and remove it.


BTW, that'd work just as well without the new WAL record to finish a 
speculative insertion. Am I missing something?



--- a/src/include/storage/off.h
+++ b/src/include/storage/off.h
@@ -26,6 +26,7 @@ typedef uint16 OffsetNumber;
 #define InvalidOffsetNumber	((OffsetNumber) 0)
 #define FirstOffsetNumber	((OffsetNumber) 1)
 #define MaxOffsetNumber		((OffsetNumber) (BLCKSZ / sizeof(ItemIdData)))
+#define MagicOffsetNumber	(MaxOffsetNumber + 1)
 #define OffsetNumberMask	(0xffff)		/* valid uint16 bits */


IMHO it would be nicer if the magic value were more constant, e.g. 0xffff 
or 0xfffe, instead of basing it on MaxOffsetNumber, which depends on 
block size. I would rather not include MaxOffsetNumber or anything 
derived from it in the on-disk format.
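
I.e. something like this (a sketch):

	#define MagicOffsetNumber	((OffsetNumber) 0xfffe)		/* independent of BLCKSZ */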


- Heikki





Re: [HACKERS] WAL format changes break the suppression of do-nothing checkpoints.

2015-04-15 Thread Heikki Linnakangas

On 03/31/2015 09:09 PM, Jeff Janes wrote:

On Tue, Mar 31, 2015 at 6:19 AM, Heikki Linnakangas hlinn...@iki.fi wrote:


On 03/30/2015 09:01 PM, Jeff Janes wrote:


commit 2c03216d831160bedd72d45f7 has invalidated the part of the docs
saying "If no WAL has been written since the previous checkpoint, new
checkpoints will be skipped even if checkpoint_timeout has passed",
presumably by accident.

It seems that this part is no longer true when it should be true:

  if (curInsert == ControlFile->checkPoint +
      MAXALIGN(SizeOfXLogRecord + sizeof(CheckPoint)))

MAXALIGN(SizeOfXLogRecord + sizeof(CheckPoint)) is now 96, but the amount by
which curInsert gets advanced is still 104, like it was before the commit.


Hmm. Wasn't this a bit broken before too, when the checkpoint record
crosses a page boundary?


But once the next one was written, it wouldn't cross a boundary
anymore.  So you would get one extra checkpoint, rather than an infinite
stream of them.  So it was a bit broken, but just a bit.


Yeah. It's not serious.

It occurs to me that that's actually nice from a robustness point of 
view: if something goes badly wrong and the WAL gets corrupted, you're more likely to 
be able to recover the last checkpoint record if it doesn't cross a page 
boundary. And more importantly, a segment boundary. OTOH, when you 
create a spurious checkpoint, you lose the previous checkpoint 
location. A good compromise is probably to not skip the checkpoint if 
the last one crossed a segment boundary. (Committed that way)



Instead of trying to calculate where the checkpoint record ends, I think
we could check that the prev-pointer points to the last checkpoint record.


Wouldn't doing that involve reading WAL while not in recovery?


No. We keep track of the beginning position of the last inserted record 
at all times, because when the next record is inserted, we have to put 
that in the xl_prev field.
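
Per that description, the skip test amounts to something like this in
CreateCheckPoint (a sketch, not the committed code; GetPrevRecordPtr() is a
hypothetical accessor for that tracked position):

	/* Sketch: skip the checkpoint if the last record inserted was the
	 * previous checkpoint record itself, i.e. nothing has happened since. */
	if (GetPrevRecordPtr() == ControlFile->checkPoint &&
		ControlFile->checkPoint == ControlFile->checkPointCopy.redo)
	{
		WALInsertLockRelease();
		LWLockRelease(CheckpointLock);
		END_CRIT_SECTION();
		return;
	}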


Committed a patch to do that.

- Heikki





Re: [HACKERS] Turning off HOT/Cleanup sometimes

2015-04-15 Thread Alvaro Herrera
Simon Riggs wrote:
 On 15 April 2015 at 09:10, Andres Freund and...@anarazel.de wrote:

  I don't really see the downside to this suggestion.
 
 The suggestion makes things better than they are now but is still less
 than I have proposed.
 
  If what you both mean is "IMHO this is an acceptable compromise", I
  can accept it also, at this point in the CF.

Let me see if I understand things.

What we have now is: when reading a page, we also HOT-clean it.  This
runs HOT-cleanup a large number of times, and causes many pages to
become dirty.

Your patch is "when reading a page, HOT-clean it, but only 5 times in
each scan".  This runs HOT-cleanup at most 5 times, and causes at most 5
pages to become dirty.

Robert's proposal is "when reading a page, if dirty, HOT-clean it; if
not dirty, also HOT-clean it, but only 5 times in each scan".  This runs
HOT-cleanup some number of times (as many times as there are dirty
pages), and causes at most 5 pages to become dirty.


Am I right in thinking that HOT-cleaning a dirty page is something that
runs completely within the CPU cache?  If so, it would be damn fast and
would have benefits for future readers, for very little cost.

Dirtying a page is very different; if buffer reads are common, the
system is later bogged down trying to find clean pages into which to read
uncached buffers (including by the read-only scan itself, so it becomes slower).


If I have understood things correctly, then I stand behind Robert's
suggestion.
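
For concreteness, the heuristic would look something like this at the top
of the page-prune path (a sketch only; BufferIsProbablyDirty() and the
per-scan prune_budget are assumptions, not existing code):

	/* Sketch of Robert's heuristic: always prune a page that is already
	 * dirty; dirty at most prune_budget clean pages per scan. */
	static void
	maybe_prune_page(Relation relation, Buffer buffer, int *prune_budget)
	{
		Page	page = BufferGetPage(buffer);

		if (!PageIsPrunable(page, RecentGlobalXmin))
			return;

		if (BufferIsProbablyDirty(buffer))	/* hypothetical check */
		{
			/* will be written anyway; pruning adds no data-file write */
			heap_page_prune(relation, buffer, RecentGlobalXmin, false, NULL);
		}
		else if (*prune_budget > 0)
		{
			/* pruning dirties a clean page: cap how often we do that */
			(*prune_budget)--;
			heap_page_prune(relation, buffer, RecentGlobalXmin, false, NULL);
		}
	}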

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] [COMMITTERS] pgsql: Move pg_upgrade from contrib/ to src/bin/

2015-04-15 Thread Andrew Dunstan


On 04/15/2015 10:51 AM, Andrew Dunstan wrote:


On 04/15/2015 08:31 AM, Michael Paquier wrote:
On Wed, Apr 15, 2015 at 7:54 PM, Peter Eisentraut pete...@gmx.net 
wrote:

On 4/14/15 8:32 PM, Peter Eisentraut wrote:

Move pg_upgrade from contrib/ to src/bin/

Oh dear.  It appears the buildfarm client code needs to be updated to
support this.  How do we do this?   (I guess the critters that are 
still

green are not running this test.)

Argh..

Looking at the buildfarm client code
(https://github.com/PGBuildFarm/client-code) we are going to need to
make the logic of TestUpgrade.pm aware that pg_upgrade is now in
src/bin and not in contrib. A simple idea would be to use a Find to
guess where the pg_upgrade path is located and then use the path found
instead of hardcoding contrib/.





I'm testing a fix. I just happened to check the status this morning. 
It would have been nice to have had a heads up. The Find idea is cute, 
I might have used it if I hadn't already coded a fix.



OK, the fix is here: 
https://github.com/PGBuildFarm/client-code/commit/47b24efa758360699413640dd14c4096e1643fb2 
and the new TestUpgrade.pm can be grabbed from here: 
https://raw.githubusercontent.com/PGBuildFarm/client-code/47b24efa758360699413640dd14c4096e1643fb2/PGBuild/Modules/TestUpgrade.pm


It can just be dropped into place as a hot fix.

We're overdue for a buildfarm client release, but this hot fix should 
get things green again.


cheers

andrew





Re: [HACKERS] Turning off HOT/Cleanup sometimes

2015-04-15 Thread Heikki Linnakangas

On 04/15/2015 05:44 PM, Alvaro Herrera wrote:

Simon Riggs wrote:

On 15 April 2015 at 09:10, Andres Freund and...@anarazel.de wrote:



I don't really see the downside to this suggestion.


The suggestion makes things better than they are now but is still less
than I have proposed.

If what you both mean is "IMHO this is an acceptable compromise", I
can accept it also, at this point in the CF.


Let me see if I understand things.

What we have now is: when reading a page, we also HOT-clean it.  This
runs HOT-cleanup a large number of times, and causes many pages to
become dirty.

Your patch is "when reading a page, HOT-clean it, but only 5 times in
each scan".  This runs HOT-cleanup at most 5 times, and causes at most 5
pages to become dirty.

Robert's proposal is "when reading a page, if dirty, HOT-clean it; if
not dirty, also HOT-clean it, but only 5 times in each scan".  This runs
HOT-cleanup some number of times (as many times as there are dirty
pages), and causes at most 5 pages to become dirty.


Am I right in thinking that HOT-cleaning a dirty page is something that
runs completely within the CPU cache?  If so, it would be damn fast and
would have benefits for future readers, for very little cost.


If there are many tuples on the page, it takes some CPU effort to scan 
all the HOT chains and move tuples around. Also, it creates a WAL 
record, which isn't free.


Another question is whether the patch can reliably detect whether it's 
doing a read-only scan or not. I haven't tested, but I suspect it would 
not do pruning when you do something like "INSERT INTO foo SELECT * FROM 
foo WHERE blah", i.e. when the target relation is referenced twice in 
the same statement: once as the target, and a second time as a source. 
Maybe that's OK, though.


- Heikki





Re: [HACKERS] INSERT ... ON CONFLICT IGNORE (and UPDATE) 3.0

2015-04-15 Thread Andres Freund
On 2015-04-15 17:58:54 +0300, Heikki Linnakangas wrote:
 On 04/15/2015 07:51 AM, Peter Geoghegan wrote:
 +heap_finish_speculative(Relation relation, HeapTuple tuple, bool conflict)
 +{
 +	if (!conflict)
 +	{
 +		/*
 +		 * Update the tuple in-place, in the common case where no conflict was
 +		 * detected during speculative insertion.
 +		 *
 +		 * When heap_insert is called in respect of a speculative tuple, the
 +		 * page will actually have a tuple inserted.  However, during recovery
 +		 * replay will add an all-zero tuple to the page instead, which is the
 +		 * same length as the original (but the tuple header is still WAL
 +		 * logged and will still be restored at that point).  If and when the
 +		 * in-place update record corresponding to releasing a value lock is
 +		 * replayed, crash recovery takes the final tuple value from there.
 +		 * Thus, speculative heap records require two WAL records.
 +		 *
 +		 * Logical decoding interprets an in-place update associated with a
 +		 * speculative insertion as a regular insert change.  In other words,
 +		 * the in-place record generated affirms that a speculative insertion
 +		 * completed successfully.
 +		 */
 +		heap_inplace_update(relation, tuple);
 +	}
 +	else
 +	{
 
 That's a bizarre solution.

I tend to agree, but for different reasons.

 In logical decoding, decode speculative insertions like any other insertion.
 To decode a super-deletion record, scan the reorder buffer for the
 transaction to find the corresponding speculative insertion record for the
 tuple, and remove it.

Not that easy. That buffer is spilled to disk and such. As discussed.

Greetings,

Andres Freund




Re: [HACKERS] Auditing extension for PostgreSQL (Take 2)

2015-04-15 Thread Sawada Masahiko
On Wed, Apr 15, 2015 at 10:52 AM, Sawada Masahiko sawada.m...@gmail.com wrote:
 On Wed, Apr 15, 2015 at 8:57 AM, David Steele da...@pgmasters.net wrote:
 On 4/14/15 7:13 PM, Tatsuo Ishii wrote:
 This patch does not apply cleanly due to the moving of pgbench (patch
 to filelist.sgml failed).

 Thank you for pointing that out!

 Ironic that it was the commit directly after the one I was testing with
 that broke the patch.  It appears the end of the last CF is a very bad
 time to be behind HEAD.

 Fixed in attached v8 patch.

 Thank you for updating the patch!

 I applied the patch and complied them successfully without WARNING.

 I tested the v8 patch with the CURSOR case I mentioned before, and got
 a segmentation fault again.
 Here are the log messages from my environment:

 =# select test();
  LOG:  server process (PID 29730) was terminated by signal 11:
 Segmentation fault
 DETAIL:  Failed process was running: select test();
 LOG:  terminating any other active server processes
 WARNING:  terminating connection because of crash of another server process
 DETAIL:  The postmaster has commanded this server process to roll back
 the current transaction and exit, because another server process
 exited abnormally and possibly corrupted shared memory.
 HINT:  In a moment you should be able to reconnect to the database and
 repeat your command.
 FATAL:  the database system is in recovery mode

 I hope that these messages helps you to address this problem.
 I will also try to address this.

 Regards,

 ---
 Sawada Masahiko



 I will also try to address this.

I investigated this problem and report my findings here.

When we execute the test() function I mentioned, three stack items in
total are stored into auditEventStack.
Two of them are freed by stack_pop() -> stack_free() (i.e.,
stack_free() is called by stack_pop()).
One of them is freed by PortalDrop() -> ... ->
MemoryContextDeleteChildren() -> ... -> stack_free().
It is freed at the same time as the pg_audit memory context is deleted,
and the stack then becomes completely empty.

But after all items have been freed, finish_xact_command() can call
PortalDrop(), so ExecutorEnd() can be called again.
pg_audit has an ExecutorEnd hook, so postgres tries to free that item... SEGV.

In my environment, the following change fixes it.

--- pg_audit.c.org	2015-04-15 14:21:07.541866525 +0900
+++ pg_audit.c	2015-04-15 11:36:53.758877339 +0900
@@ -1291,7 +1291,7 @@
 	standard_ExecutorEnd(queryDesc);
 
 	/* Pop the audit event off the stack */
-	if (!internalStatement)
+	if (!internalStatement && auditEventStack != NULL)
 	{
 		stack_pop(auditEventStack->stackId);
 	}

It might be a good idea to add an Assert() before calling stack_pop().

I'm not sure this change is exactly correct, but I hope this
information helps you.

Regards,

---
Sawada Masahiko




Re: [HACKERS] [COMMITTERS] pgsql: Move pg_upgrade from contrib/ to src/bin/

2015-04-15 Thread Andrew Dunstan


On 04/15/2015 08:31 AM, Michael Paquier wrote:

On Wed, Apr 15, 2015 at 7:54 PM, Peter Eisentraut pete...@gmx.net wrote:

On 4/14/15 8:32 PM, Peter Eisentraut wrote:

Move pg_upgrade from contrib/ to src/bin/

Oh dear.  It appears the buildfarm client code needs to be updated to
support this.  How do we do this?   (I guess the critters that are still
green are not running this test.)

Argh..

Looking at the buildfarm client code
(https://github.com/PGBuildFarm/client-code) we are going to need to
make the logic of TestUpgrade.pm aware that pg_upgrade is now in
src/bin and not in contrib. A simple idea would be to use a Find to
guess where the pg_upgrade path is located and then use the path found
instead of hardcoding contrib/.





I'm testing a fix. I just happened to check the status this morning. It 
would have been nice to have had a heads up. The Find idea is cute, I 
might have used it if I hadn't already coded a fix.


cheers

andrew




Re: [HACKERS] INSERT ... ON CONFLICT IGNORE (and UPDATE) 3.0

2015-04-15 Thread Heikki Linnakangas

On 04/15/2015 06:01 PM, Andres Freund wrote:

On 2015-04-15 17:58:54 +0300, Heikki Linnakangas wrote:

On 04/15/2015 07:51 AM, Peter Geoghegan wrote:

+heap_finish_speculative(Relation relation, HeapTuple tuple, bool conflict)
+{
+	if (!conflict)
+	{
+		/*
+		 * Update the tuple in-place, in the common case where no conflict was
+		 * detected during speculative insertion.
+		 *
+		 * When heap_insert is called in respect of a speculative tuple, the
+		 * page will actually have a tuple inserted.  However, during recovery
+		 * replay will add an all-zero tuple to the page instead, which is the
+		 * same length as the original (but the tuple header is still WAL
+		 * logged and will still be restored at that point).  If and when the
+		 * in-place update record corresponding to releasing a value lock is
+		 * replayed, crash recovery takes the final tuple value from there.
+		 * Thus, speculative heap records require two WAL records.
+		 *
+		 * Logical decoding interprets an in-place update associated with a
+		 * speculative insertion as a regular insert change.  In other words,
+		 * the in-place record generated affirms that a speculative insertion
+		 * completed successfully.
+		 */
+		heap_inplace_update(relation, tuple);
+	}
+	else
+	{


That's a bizarre solution.


I tend to agree, but for different reasons.


In logical decoding, decode speculative insertions like any other insertion.
To decode a super-deletion record, scan the reorder buffer for the
transaction to find the corresponding speculative insertion record for the
tuple, and remove it.


Not that easy. That buffer is spilled to disk and such. As discussed.


Hmm, ok, I've read the "INSERT ... ON CONFLICT UPDATE and logical 
decoding" thread now, and I have to say that IMHO it's a lot more sane 
to handle this in ReorderBufferCommit() like Peter first did, than to 
make the main insertion path more complex like this.


Another idea is to never spill the latest record to disk, at least if it 
was a speculative insertion. Then you would be sure that when you see 
the super-deletion record, the speculative insertion it refers to is 
still in memory. That seems simple.


- Heikki





Re: [HACKERS] TAP tests of pg_rewind not stopping servers used for the tests

2015-04-15 Thread Heikki Linnakangas

On 04/15/2015 03:07 PM, Michael Paquier wrote:

Hi all,

In the refactoring of pg_rewind tests committed as 53ba107, it happens
that some of the servers used for the tests are not stopped at the end
of the test. The issue is that RewindTest.pm uses END to stop the
servers, but now that the local and remote tests are grouped, half of
the servers are not stopped.

The reason why I did not notice that before is that, as pg_rewind
uses local Unix sockets to work on the nodes, the tests are not
failing.
Sorry about that, that's my fault. And attached is a patch fixing the
issue: it replaces END by a cleanup function called at the end of each
local/remote test to be sure that the servers are shut down.


Thanks, committed. I kept the END block, though, so that we still clean 
up if the test dies with an exception.


- Heikki





[HACKERS] Bugs in CreateCheckPoint

2015-04-15 Thread Zhang Zq
hi,
   To avoid inserting duplicate checkpoints when the system is idle, the code 
of CreateCheckPoint makes two tests to determine that nothing has happened since 
the start of the last checkpoint.
But because the layout of the XLOG record was changed, the condition would never 
be true. The attachment is the patch to fix it, thanks.




  Zhang Zq



repeatly-checkpoint.patch
Description: Binary data



Re: [HACKERS] Turning off HOT/Cleanup sometimes

2015-04-15 Thread Alvaro Herrera
Heikki Linnakangas wrote:
 On 04/15/2015 05:44 PM, Alvaro Herrera wrote:

  Robert's proposal is "when reading a page, if dirty, HOT-clean it; if
  not dirty, also HOT-clean it, but only 5 times in each scan".  This runs
  HOT-cleanup some number of times (as many times as there are dirty
  pages), and causes at most 5 pages to become dirty.
  
  
  Am I right in thinking that HOT-cleaning a dirty page is something that
  runs completely within the CPU cache?  If so, it would be damn fast and
  would have benefits for future readers, for very little cost.
 
 If there are many tuples on the page, it takes some CPU effort to scan all
 the HOT chains and move tuples around. Also, it creates a WAL record, which
 isn't free.

But if the page is in CPU cache, the CPU effort shouldn't be all that
noticeable, should it?  That's my point, but then maybe I'm wrong.  Now,
the WAL logging is annoying, so let's limit that too -- do it at most
for, say, 20 dirty pages and at most 5 clean pages.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Turning off HOT/Cleanup sometimes

2015-04-15 Thread Robert Haas
On Wed, Apr 15, 2015 at 8:42 AM, Simon Riggs si...@2ndquadrant.com wrote:
 I won't take responsibility for paying my neighbor's tax bill, but I
 might take responsibility for picking up his mail while he's on
 holiday.

 That makes it sound like this is an occasional, non-annoying thing.

 It's more like, whoever fetches the mail needs to fetch it for
 everybody. So we are slowing down one person disproportionately, while
 others fly through without penalty. There is no argument that one
 workload necessarily needs to perform that on behalf of the other
 workload.

Sure there is.  It's called a tragedy of the commons - everybody acts
in their own selfish interest (it's not *my* responsibility to limit
grazing on public land, or prune this page that I'm not modifying) and
as a result some resource that everybody cares about (grass,
system-wide I/O) gets trashed to everyone's detriment.  Purely selfish
behavior can only be justified here if we assume that the selfish
actor intends to participate in the system only once: I'm going to run
one big reporting query which must run as fast as possible, and then
I'm getting on a space ship to Mars.  So if my refusal to do any
pruning during that reporting query causes lots of extra I/O on the
system ten minutes from now, I don't care, because I'll have left the
playing field forever at that point.

As Heikki points out, any HOT pruning operation generates WAL and has
a CPU cost.  However, pruning a page that is currently dirty
*decreases* the total volume of writes to the data files, whereas
pruning a page that is currently clean *increases* the total volume of
writes to the data files.  In the first case, if we prune the page
right now while it's still dirty, we can't possibly cause any
additional data-file writes, and we may save one, because it's
possible that someone else would later have pruned it when it was
clean and there was no other reason to dirty it.  In the second case,
if we prune the page that is currently clean, it will become dirty.
That will cost us no additional I/O if the page is again modified
before it's written out, but otherwise it costs an additional data
file write.  I think there's a big difference between those two cases.
Sure, from the narrow point of view of how much work it takes this
scan to process this page, it's always better not to prune.  But if
you make the more realistic assumption that you will keep on issuing
queries on the system, then what you're doing to the overall system
I/O load is pretty important.

By the way, was anything ever done about this:

http://www.postgresql.org/message-id/20140929091343.ga4...@alap3.anarazel.de

That's just a workload that is 5/6th pgbench -S and 1/6th pgbench,
which is in no way an unrealistic workload, and showed a significant
regression with an earlier version of the patch.  You seem very eager
to commit this patch after four months of inactivity, but I think this
is a pretty massive behavior change that deserves careful scrutiny
before it goes in.  If we push something that changes longstanding
behavior and can't even be turned off, and it regresses behavior for a
use case that common, our users are going to come after us with
pitchforks.  That's not to say some people won't be happy, but in my
experience it takes a lot of happy users to make up for getting
stabbed with even one pitchfork.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




[HACKERS] Assert there is no duplicated exit callbacks

2015-04-15 Thread Qingqing Zhou
Attached is a patch to assert that there are no duplicated exit callbacks.

Along the way, I downgraded the runtime "enough room" check to an
assertion: the callbacks are registered on a pretty fixed initialization
code path, so an assertion is good enough to prevent an oversight there. If
we don't agree with this, then the duplication check should also be bumped
to a runtime check.

Regards,
Qingqing
diff --git a/src/backend/storage/ipc/ipc.c b/src/backend/storage/ipc/ipc.c
index 6bc0b06..bfc7764 100644
--- a/src/backend/storage/ipc/ipc.c
+++ b/src/backend/storage/ipc/ipc.c
@@ -291,10 +291,12 @@ atexit_callback(void)
 void
 on_proc_exit(pg_on_exit_callback function, Datum arg)
 {
-	if (on_proc_exit_index >= MAX_ON_EXITS)
-		ereport(FATAL,
-				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg_internal("out of on_proc_exit slots")));
+	int		i;
+
+	/* check for duplicates and that there is room */
+	for (i = 0; i < on_proc_exit_index; i++)
+		Assert(on_proc_exit_list[i].function != function);
+	Assert(on_proc_exit_index < MAX_ON_EXITS);
 
 	on_proc_exit_list[on_proc_exit_index].function = function;
 	on_proc_exit_list[on_proc_exit_index].arg = arg;
@@ -319,10 +321,12 @@ on_proc_exit(pg_on_exit_callback function, Datum arg)
 void
 before_shmem_exit(pg_on_exit_callback function, Datum arg)
 {
-	if (before_shmem_exit_index >= MAX_ON_EXITS)
-		ereport(FATAL,
-				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg_internal("out of before_shmem_exit slots")));
+	int		i;
+
+	/* check for duplicated callbacks and that there is room */
+	for (i = 0; i < before_shmem_exit_index; i++)
+		Assert(before_shmem_exit_list[i].function != function);
+	Assert(before_shmem_exit_index < MAX_ON_EXITS);
 
 	before_shmem_exit_list[before_shmem_exit_index].function = function;
 	before_shmem_exit_list[before_shmem_exit_index].arg = arg;
@@ -347,10 +351,12 @@ before_shmem_exit(pg_on_exit_callback function, Datum arg)
 void
 on_shmem_exit(pg_on_exit_callback function, Datum arg)
 {
-	if (on_shmem_exit_index >= MAX_ON_EXITS)
-		ereport(FATAL,
-				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
-				 errmsg_internal("out of on_shmem_exit slots")));
+	int		i;
+
+	/* check for duplicates and that there is room */
+	for (i = 0; i < on_shmem_exit_index; i++)
+		Assert(on_shmem_exit_list[i].function != function);
+	Assert(on_shmem_exit_index < MAX_ON_EXITS);
 
 	on_shmem_exit_list[on_shmem_exit_index].function = function;
 	on_shmem_exit_list[on_shmem_exit_index].arg = arg;



Re: [HACKERS] Turning off HOT/Cleanup sometimes

2015-04-15 Thread Heikki Linnakangas

On 04/15/2015 07:11 PM, Alvaro Herrera wrote:

Heikki Linnakangas wrote:

On 04/15/2015 05:44 PM, Alvaro Herrera wrote:



Robert's proposal is "when reading a page, if dirty, HOT-clean it; if
not dirty, also HOT-clean it, but only 5 times in each scan".  This runs
HOT-cleanup some number of times (as many times as there are dirty
pages), and causes at most 5 pages to become dirty.


Am I right in thinking that HOT-cleaning a dirty page is something that
runs completely within the CPU cache?  If so, it would be damn fast and
would have benefits for future readers, for very little cost.


If there are many tuples on the page, it takes some CPU effort to scan all
the HOT chains and move tuples around. Also, it creates a WAL record, which
isn't free.


But if the page is in CPU cache, the CPU effort shouldn't be all that
noticeable, should it?  That's my point, but then maybe I'm wrong.  Now,
the WAL logging is annoying, so let's limit that too -- do it at most
for, say, 20 dirty pages and at most 5 clean pages.


There isn't much difference between that and just doing it on the first 5 
pages. Both of those numbers were pulled out of thin air, anyway. I'd 
rather just keep it simple.


- Heikki




Re: [HACKERS] Assert there is no duplicated exit callbacks

2015-04-15 Thread Heikki Linnakangas

On 04/15/2015 07:52 PM, Qingqing Zhou wrote:
 Attached is a patch to assert that there are no duplicated exit callbacks.

I don't remember that ever being a problem.

 Along the way, I downgraded the runtime "enough room" check to an
 assertion: the callbacks are registered on a pretty fixed initialization
 code path, so an assertion is good enough to prevent an oversight there. If
 we don't agree with this, then the duplication check should also be bumped
 to a runtime check.

I don't think any of this is needed.

- Heikki

Re: [HACKERS] Bugs in CreateCheckPoint

2015-04-15 Thread Heikki Linnakangas

On 04/15/2015 07:02 PM, Zhang Zq wrote:

hi,
To avoid inserting duplicate checkpoints when the system is idle, the code 
of CreateCheckPoint makes two tests to determine that nothing has happened since 
the start of the last checkpoint.
But because the layout of the XLOG record was changed, the condition would never 
be true. The attachment is the patch to fix it, thanks.


Jeff Janes reported this a while ago, and as it happens, I committed a 
fix earlier today. (See 
http://www.postgresql.org/message-id/552e7611.1000...@iki.fi)


- Heikki





Re: [HACKERS] [COMMITTERS] pgsql: Move pg_upgrade from contrib/ to src/bin/

2015-04-15 Thread Alvaro Herrera
Andrew Dunstan wrote:

 OK, the fix is here: 
 https://github.com/PGBuildFarm/client-code/commit/47b24efa758360699413640dd14c4096e1643fb2
 and the new TestUpgrade.pm can be grabbed from here: 
 https://raw.githubusercontent.com/PGBuildFarm/client-code/47b24efa758360699413640dd14c4096e1643fb2/PGBuild/Modules/TestUpgrade.pm
 
 It can just be dropped into place as a hot fix.
 
 We're overdue for a buildfarm client release, but this hot fix should get
 things green again.

There's no way we're going to have *everyone* install the hot fix right
away, though, no?  Or are people really that quick with updating BF
scripts?

If not, I think it'd be better to revert, get a buildfarm release out,
then push the patch again in a week or so.  In the meantime surely we
can push other contrib move patches ...

Regarding the buildfarm update, don't we need a patch for MSVC and
src/test/modules?  I remember you (Andrew) and Tom hot-patched
run_build.pm to run those tests, but of course it doesn't work as a
complete release because MSVC doesn't support it.  It would be awesome
to get BF patched for both those issues before the next release.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Assert there is no duplicated exit callbacks

2015-04-15 Thread David Fetter
On Wed, Apr 15, 2015 at 11:14:43AM -0700, Qingqing Zhou wrote:
 Hmm, the email text seems corrupted?

I think it was an encoding issue.  Here's what I got when I piped it
through less.

Cheers,
David.

On 04/15/2015 07:52 PM, Qingqing Zhou wrote:
 Attached is a patch to assert that there are no duplicated exit callbacks.

I don't remember that ever being a problem.

 Along the way, I downgraded the runtime "enough room" check to an
 assertion: the callbacks are registered on a pretty fixed initialization
 code path, so an assertion is good enough to prevent an oversight there. If
 we don't agree with this, then the duplication check should also be bumped
 to a runtime check.

I don't think any of this is needed.

- Heikki
-- 
David Fetter da...@fetter.org http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david.fet...@gmail.com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate




Re: [HACKERS] Auditing extension for PostgreSQL (Take 2)

2015-04-15 Thread David Steele
On 4/14/15 8:37 PM, Tatsuo Ishii wrote:
 BTW, in my understanding pg_audit allows to track a table access even
 if it's used in a view. I think this is a nice feature and it would be
 better explicitly stated in the document and the test case is better
 included in the regression test.
 
 Here is a sample session:
 
 CREATE TABLE test2 (id INT);
 CREATE VIEW vtest2 AS SELECT * FROM test2;
 GRANT SELECT ON TABLE public.test2 TO auditor;
 GRANT SELECT ON TABLE public.vtest2 TO auditor;
 SELECT * FROM vtest2;
 NOTICE:  AUDIT: SESSION,1,1,READ,SELECT,,,SELECT * FROM vtest2;
 NOTICE:  AUDIT: OBJECT,1,1,READ,SELECT,VIEW,public.vtest2,SELECT * FROM 
 vtest2;
 NOTICE:  AUDIT: OBJECT,1,1,READ,SELECT,TABLE,public.test2,SELECT * FROM 
 vtest2;

That's the idea!  In the documentation I throw around the word
relation pretty liberally, but you are right that some clarification
would be helpful.

I have added a few parenthetical statements to the docs that should make
them clearer.  I also took your suggestion and added a view regression test.

Both are in patch v9 which I attached to my previous email on this thread.

Thank you for taking the time to have a look.

-- 
- David Steele
da...@pgmasters.net



signature.asc
Description: OpenPGP digital signature


Re: [HACKERS] Assert there is no duplicated exit callbacks

2015-04-15 Thread Qingqing Zhou
Hmm, the email text seems corrupted?

Regards,
Qingqing

On Wed, Apr 15, 2015 at 10:03 AM, Heikki Linnakangas hlinn...@iki.fi wrote:
 湏〠⼴㔱㈯㄰‵㜰㔺′䵐‬楑杮楱杮娠潨⁵牷瑯㩥਍‾瑁慴档摥椠⁳⁡慰捴⁨潴愠獳牥⁴

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] show xl_prev in xlog.c errcontext

2015-04-15 Thread Alvaro Herrera
I found this patch in my local repo that I wrote some weeks or months
ago while debugging some XLog corruption problem: it was difficult to
pinpoint what XLog record in a long sequence of WAL files was causing a
problem, and displaying the prev pointer in errcontext made finding
it much easier -- I could correlate it with pg_xlogdump output, I think.

Anyone sees a reason not to apply this or something like it?

(As I recall, we eventually found out that the underlying issue was that
two postmasters shared one data area over NFS and were overwriting one
another's WAL files, or something like that.  But this was some time ago
so I could be mistaken that it's the same case.)

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
commit 835e7b3c62590574e529558493c50fbfb5a85d6c
Author: Alvaro Herrera alvhe...@alvh.no-ip.org
Date:   Tue Apr 14 12:41:11 2015 -0300

show prev in errcontext

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 2580996..36dc422 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -10389,7 +10389,10 @@ rm_redo_error_callback(void *arg)
 	initStringInfo(buf);
 	xlog_outdesc(buf, record);
 
-	errcontext("xlog redo %s", buf.data);
+	errcontext("xlog redo (prev %X/%X) %s",
+			   (uint32) (XLogRecGetPrev(record) >> 32),
+			   (uint32) (XLogRecGetPrev(record)),
+			   buf.data);
 
 	pfree(buf.data);
 }

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] reparsing query

2015-04-15 Thread Andrzej Barszcz
Hi
I knock once again with this: reparsing a query to XML (my last knock was
5-6 years ago).
Motivation: more info for front-end developers about a query. Users of the
jdbc driver are not able to gather information from metadata about even a
simple query:
  select x.a,y.b from xxx x, yyy y
Solution: a modification of backend/utils/adt/ruleutils.c

I did this a few years ago, but with every new version of postgresql there
is a need to modify the library.
I prepared a version for 9.4.1, standalone, which could be used as a
library loaded by the backend. I wish to have such functionality in the
main distro.
Is anyone interested?

Andrzej Barszcz


Re: [HACKERS] reparsing query

2015-04-15 Thread Qingqing Zhou
On Wed, Apr 15, 2015 at 1:40 PM, Andrzej Barszcz abus...@gmail.com wrote:
 I knock once again with this : reparse query to XML ( last knock 5-6 years
 before) .


What exactly does "reparse query to XML" do?

Regards,
Qingqing


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [COMMITTERS] pgsql: Move pg_upgrade from contrib/ to src/bin/

2015-04-15 Thread Andrew Dunstan


On 04/15/2015 12:23 PM, Alvaro Herrera wrote:

Andrew Dunstan wrote:


OK, the fix is here: 
https://github.com/PGBuildFarm/client-code/commit/47b24efa758360699413640dd14c4096e1643fb2
and the new TestUpgrade.pm can be grabbed from here: 
https://raw.githubusercontent.com/PGBuildFarm/client-code/47b24efa758360699413640dd14c4096e1643fb2/PGBuild/Modules/TestUpgrade.pm

It can just be dropped into place as a hot fix.

We're overdue for a buildfarm client release, but this hot fix should get
things green again.

There's no way we're going to have *everyone* install the hot fix right
away, though, no?  Or are people really that quick with updating BF
scripts?

If not, I think it'd be better to revert, get a buildfarm release out,
then push the patch again in a week or so.  In the meantime surely we
can push other contrib move patches ...

Regarding the buildfarm update, don't we need a patch for MSVC and
src/test/modules?  I remember you (Andrew) and Tom hot-patched
run_build.pm to run those tests, but of course it doesn't work as a
complete release because MSVC doesn't support it.  It would be awesome
to get BF patched for both those issues before the next release.




Yes, we do want support for testmodules. I think that needs to be built 
into src/tools/msvc, probably in vcregress.pl. Once we have that the 
buildfarm changes will be trivial. But if that's not forthcoming quickly 
I can just avoid the step on msvc for now.


We've handled the buildfarm being red for a few days before. People are 
usually good about applying fixes fairly quickly.



cheers

andrew


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Turning off HOT/Cleanup sometimes

2015-04-15 Thread Simon Riggs
On 15 April 2015 at 12:39, Robert Haas robertmh...@gmail.com wrote:
 On Wed, Apr 15, 2015 at 8:42 AM, Simon Riggs si...@2ndquadrant.com wrote:
 I won't take responsibility for paying my neighbor's tax bill, but I
 might take responsibility for picking up his mail while he's on
 holiday.

 That makes it sound like this is an occasional, non-annoying thing.

 It's more like, whoever fetches the mail needs to fetch it for
 everybody. So we are slowing down one person disproportionately, while
 others fly through without penalty. There is no argument that one
 workload necessarily needs to perform that on behalf of the other
 workload.

 Sure there is.  It's called a tragedy of the commons - everybody acts
 in their own selfish interest (it's not *my* responsibility to limit
 grazing on public land, or prune this page that I'm not modifying) and
 as a result some resource that everybody cares about (grass,
 system-wide I/O) gets trashed to everyone's detriment.  Purely selfish
 behavior can only be justified here if we assume that the selfish
 actor intends to participate in the system only once: I'm going to run
 one big reporting query which must run as fast as possible, and then
 I'm getting on a space ship to Mars.  So if my refusal to do any
 pruning during that reporting query causes lots of extra I/O on the
 system ten minutes from now, I don't care, because I'll have left the
 playing field forever at that point.

It all depends upon who is being selfish. Why is a user selfish for
not wanting to clean every single block they scan, when the people
that made the mess do nothing and go faster 10 minutes from now?
Randomly and massively penalising large SELECTs makes no sense. Some
cleanup is OK, with reasonable limits, which is why that is proposed.

On 04/15/2015 05:44 PM, Alvaro Herrera wrote:
 Robert's proposal is when reading a page, if dirty HOT-clean it; if not
 dirty, also HOT-clean it but only 5 times in each scan.  This runs
 HOT-cleanup some number of times (as many as there are dirty), and
 causes at most 5 pages to become dirty.

My understanding of Robert's proposal was: when reading a page,
HOT-clean it, but only do this up to 5 times on clean pages, and
continue to do this indefinitely when the page is already dirty.
Andres said that was the only way and I have agreed to it.
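
In rough C, the heuristic as I understand it would look like this
(illustrative names only, not the actual heap pruning code):

    #define PRUNE_DIRTY_LIMIT 5		/* illustrative constant */

    static void
    maybe_prune_page(Buffer buf, int *pages_dirtied)
    {
    	/* buffer_is_dirty_sketch() and heap_prune_sketch() are
    	 * stand-ins, not real PostgreSQL functions. */
    	if (buffer_is_dirty_sketch(buf))
    		heap_prune_sketch(buf);		/* page will be written anyway */
    	else if (*pages_dirtied < PRUNE_DIRTY_LIMIT)
    	{
    		heap_prune_sketch(buf);		/* dirties a clean page */
    		(*pages_dirtied)++;
    	}
    	/* else skip: a large SELECT dirties at most 5 clean pages */
    }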

Are you now saying not to commit your proposal at all?

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Turning off HOT/Cleanup sometimes

2015-04-15 Thread Alvaro Herrera
Simon Riggs wrote:
 On 15 April 2015 at 12:39, Robert Haas robertmh...@gmail.com wrote:

 On 04/15/2015 05:44 PM, Alvaro Herrera wrote:
  Robert's proposal is when reading a page, if dirty HOT-clean it; if not
  dirty, also HOT-clean it but only 5 times in each scan.  This runs
  HOT-cleanup some number of times (as many as there are dirty), and
  causes at most 5 pages to become dirty.
 
 My understanding of Robert's proposal was: when reading a page,
 HOT-clean it, but only do this up to 5 times on clean pages, and
 continue to do this indefinitely when the page is already dirty.

To me, both statements look identical.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] raw output from copy

2015-04-15 Thread Peter Eisentraut
On 4/10/15 5:26 PM, Pavel Stehule wrote:
 Hi
 
 I wrote a prototype of this patch, and it works well
 
 postgres=# set client_encoding to 'latin2';
 SET
 Time: 1.488 ms
 postgres=# \copy (select xmlelement(name xx, d) from d) to ~/d.xml
 (format 'raw')
 COPY 1
 Time: 1.108 ms
 postgres=# copy (select xmlelement(name xx, d) from d) to stdout (format
 'raw') ;
<?xml version="1.0" encoding="LATIN2"?><xx>příliš žluťoučký
kůň</xx>Time: 1.000 ms

I think you can get the same thing using regular psql output and just
turning off all field and record separators and tuple headers and so on.



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Turning off HOT/Cleanup sometimes

2015-04-15 Thread Simon Riggs
On 15 April 2015 at 16:01, Alvaro Herrera alvhe...@2ndquadrant.com wrote:
 Simon Riggs wrote:
 On 15 April 2015 at 12:39, Robert Haas robertmh...@gmail.com wrote:

 On 04/15/2015 05:44 PM, Alvaro Herrera wrote:
  Robert's proposal is when reading a page, if dirty HOT-clean it; if not
  dirty, also HOT-clean it but only 5 times in each scan.  This runs
  HOT-cleanup some number of times (as many as there are dirty), and
  causes at most 5 pages to become dirty.

 My understanding of Robert's proposal was: when reading a page,
 HOT-clean it, but only do this up to 5 times on clean pages, and
 continue to do this indefinitely when the page is already dirty.

 To me, both statements look identical.

I didn't read it that way, apologies for any confusion. But that is good news.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] raw output from copy

2015-04-15 Thread Pavel Stehule
It would be nice, but it is not true. You can get correct non-UTF8 XML with
an encoding specification only when binary mode is used, and psql doesn't
support binary mode.

Regards

Pavel
Dne 15. 4. 2015 22:06 napsal uživatel Peter Eisentraut pete...@gmx.net:

 On 4/10/15 5:26 PM, Pavel Stehule wrote:
  Hi
 
  I wrote a prototype of this patch, and it works well
 
  postgres=# set client_encoding to 'latin2';
  SET
  Time: 1.488 ms
  postgres=# \copy (select xmlelement(name xx, d) from d) to ~/d.xml
  (format 'raw')
  COPY 1
  Time: 1.108 ms
  postgres=# copy (select xmlelement(name xx, d) from d) to stdout (format
  'raw') ;
  <?xml version="1.0" encoding="LATIN2"?><xx>příliš žluťoučký
  kůň</xx>Time: 1.000 ms

 I think you can get the same thing using regular psql output and just
 turning off all field and record separators and tuple headers and so on.




Re: [HACKERS] [COMMITTERS] pgsql: Move pg_upgrade from contrib/ to src/bin/

2015-04-15 Thread Alvaro Herrera
Andrew Dunstan wrote:

 Yes, we do want support for testmodules. I think that needs to be built into
 src/tools/msvc, probably in vcregress.pl. Once we have that the buildfarm
 changes will be trivial. But if that's not forthcoming quickly I can just
 avoid the step on msvc for now.

Well, *I* can't write the MSVC patch, but I can't find anyone to
volunteer some time either :-(  I think it's best to have MSVC skip the
step and have the Makefile-based animals run it.  That would give us a
lot more coverage than currently, which is just two animals AFAIU.

 We've handled the buildfarm being red for a few days before. People are
 usually good about applying fixes fairly quickly.

Okay.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Use outerPlanState() consistently in executor code

2015-04-15 Thread Qingqing Zhou
In executor context, outerPlanState(node) is the same as
node->ss.ps.lefttree. We follow this in most places except a few. This
patch cleans up the outliers and might save us a few instructions by
removing the indirection.

Most of the changes are trivial, except that I take out an outerPlan NULL
check in the grouping iterator, as it surely has a left child.

I noticed that we mix the use of "node" for plan nodes and plan states.
While changing that could make the code clearer, back-patching could be
terrible.
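
For reference, the macros in question live in src/include/nodes/execnodes.h
and are simply:

    /* helpers for accessing a PlanState's child subtrees */
    #define innerPlanState(node)		(((PlanState *)(node))->righttree)
    #define outerPlanState(node)		(((PlanState *)(node))->lefttree)

so the two spellings are interchangeable; using the macro just reads better.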


Regards,
Qingqing
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 9ff0eff..672825a 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -2054,10 +2054,11 @@ ExecReScanAgg(AggState *node)
 {
 	ExprContext *econtext = node->ss.ps.ps_ExprContext;
 	int			aggno;
+	PlanState  *outerPlan;
 
 	node->agg_done = false;
-
 	node->ss.ps.ps_TupFromTlist = false;
+	outerPlan = outerPlanState(node);
 
 	if (((Agg *) node->ss.ps.plan)->aggstrategy == AGG_HASHED)
 	{
@@ -2075,7 +2076,7 @@ ExecReScanAgg(AggState *node)
 		 * parameter changes, then we can just rescan the existing hash table;
 		 * no need to build it again.
 		 */
-		if (node->ss.ps.lefttree->chgParam == NULL)
+		if (outerPlan->chgParam == NULL)
 		{
 			ResetTupleHashIterator(node->hashtable, &node->hashiter);
 			return;
 		}
@@ -2133,8 +2134,8 @@ ExecReScanAgg(AggState *node)
 	 * if chgParam of subnode is not null then plan will be re-scanned by
 	 * first ExecProcNode.
 	 */
-	if (node->ss.ps.lefttree->chgParam == NULL)
-		ExecReScan(node->ss.ps.lefttree);
+	if (outerPlan->chgParam == NULL)
+		ExecReScan(outerPlan);
 
 
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index 8ea8b9f..6502841 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -449,6 +449,8 @@ ExecBitmapHeapScan(BitmapHeapScanState *node)
 void
 ExecReScanBitmapHeapScan(BitmapHeapScanState *node)
 {
+	PlanState  *outerPlan;
+
 	/* rescan to release any page pin */
 	heap_rescan(node->ss.ss_currentScanDesc, NULL);
 
@@ -469,8 +471,9 @@ ExecReScanBitmapHeapScan(BitmapHeapScanState *node)
 	 * if chgParam of subnode is not null then plan will be re-scanned by
 	 * first ExecProcNode.
 	 */
-	if (node->ss.ps.lefttree->chgParam == NULL)
-		ExecReScan(node->ss.ps.lefttree);
+	outerPlan = outerPlanState(node);
+	if (outerPlan->chgParam == NULL)
+		ExecReScan(outerPlan);
 }
 
 /* ----------------------------------------------------------------
diff --git a/src/backend/executor/nodeGroup.c b/src/backend/executor/nodeGroup.c
index 83d562e..b0d5442 100644
--- a/src/backend/executor/nodeGroup.c
+++ b/src/backend/executor/nodeGroup.c
@@ -280,6 +280,8 @@ ExecEndGroup(GroupState *node)
 void
 ExecReScanGroup(GroupState *node)
 {
+	PlanState  *outerPlan;
+
 	node->grp_done = FALSE;
 	node->ss.ps.ps_TupFromTlist = false;
 	/* must clear first tuple */
@@ -289,7 +291,7 @@ ExecReScanGroup(GroupState *node)
 	 * if chgParam of subnode is not null then plan will be re-scanned by
 	 * first ExecProcNode.
 	 */
-	if (node->ss.ps.lefttree &&
-		node->ss.ps.lefttree->chgParam == NULL)
-		ExecReScan(node->ss.ps.lefttree);
+	outerPlan = outerPlanState(node);
+	if (outerPlan->chgParam == NULL)
+		ExecReScan(outerPlan);
 }
diff --git a/src/backend/executor/nodeMaterial.c b/src/backend/executor/nodeMaterial.c
index 1158825..41859e0 100644
--- a/src/backend/executor/nodeMaterial.c
+++ b/src/backend/executor/nodeMaterial.c
@@ -317,6 +317,9 @@ ExecMaterialRestrPos(MaterialState *node)
 void
 ExecReScanMaterial(MaterialState *node)
 {
+	PlanState  *outerPlan;
+
+	outerPlan = outerPlanState(node);
 	ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
 
 	if (node->eflags != 0)
@@ -339,13 +342,13 @@ ExecReScanMaterial(MaterialState *node)
 		 * Otherwise we can just rewind and rescan the stored output. The
 		 * state of the subnode does not change.
 		 */
-		if (node->ss.ps.lefttree->chgParam != NULL ||
+		if (outerPlan->chgParam != NULL ||
 			(node->eflags & EXEC_FLAG_REWIND) == 0)
 		{
 			tuplestore_end(node->tuplestorestate);
 			node->tuplestorestate = NULL;
-			if (node->ss.ps.lefttree->chgParam == NULL)
-				ExecReScan(node->ss.ps.lefttree);
+			if (outerPlan->chgParam == NULL)
+				ExecReScan(outerPlan);

Re: [HACKERS] reparsing query

2015-04-15 Thread Qingqing Zhou
 On Wed, Apr 15, 2015 at 2:04 PM, Alvaro Herrera
alvhe...@2ndquadrant.com wrote:
 His old posting:
 https://www.postgresql.org/message-id/1247323023.16438.35.camel%40ab-desktop


Is this a proposal to have a better formatted (JSON etc)
debug_print_parse results?

Thanks,
Qingqing


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] PATCH: default_index_tablespace

2015-04-15 Thread jltallon

Hi,

This small patch implements a new GUC (default_index_tablespace) plus 
supporting code.
Originated from a customer request, the feature intends to make 
creation of indexes on SSD-backed tablespaces easy and convenient 
(almost transparent) for users: the DBA can just set it and indexes will 
be placed in the specified tablespace --as opposed to the same 
tablespace where the referenced table is-- without having to specify it 
every time.


Feedback appreciated.



Thanks,

  / J.L.
*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
***************
*** 5622,5627 **** COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
--- 5622,5664 ----
      </listitem>
     </varlistentry>
  
+  <varlistentry id="guc-default-index-tablespace" xreflabel="default_index_tablespace">
+   <term><varname>default_index_tablespace</varname> (<type>string</type>)
+   <indexterm>
+    <primary><varname>default_index_tablespace</varname> configuration parameter</primary>
+   </indexterm>
+   <indexterm><primary>tablespace</><secondary>default</></>
+   </term>
+   <listitem>
+    <para>
+     This variable specifies the default tablespace in which to create
+     indexes when a <command>CREATE INDEX</> command does
+     not explicitly specify a tablespace.
+    </para>
+ 
+    <para>
+     The value is either the name of a tablespace, or an empty string
+     to specify using the default tablespace of the current database
+     unless the <xref linkend="guc-default-tablespace"> is defined,
+     in which case the rules for that parameter apply.
+     If the value does not match the name of any existing tablespace,
+     <productname>PostgreSQL</> will automatically use the default
+     tablespace of the current database; the user must have
+     <literal>CREATE</> privilege for it, or creation attempts will fail.
+    </para>
+ 
+    <para>
+     This variable is not used for indexes on temporary tables; for them,
+     <xref linkend="guc-temp-tablespaces"> is consulted instead.
+    </para>
+ 
+    <para>
+     For more information on tablespaces,
+     see <xref linkend="manage-ag-tablespaces">.
+    </para>
+   </listitem>
+  </varlistentry>
+ 
   <varlistentry id="guc-temp-tablespaces" xreflabel="temp_tablespaces">
    <term><varname>temp_tablespaces</varname> (<type>string</type>)
    <indexterm>
*** a/src/backend/commands/indexcmds.c
--- b/src/backend/commands/indexcmds.c
***************
*** 417,423 **** DefineIndex(Oid relationId,
  	}
  	else
  	{
! 		tablespaceId = GetDefaultTablespace(rel->rd_rel->relpersistence);
  		/* note InvalidOid is OK in this case */
  	}
  
--- 417,423 ----
  	}
  	else
  	{
! 		tablespaceId = GetIndexTablespace(rel->rd_rel->relpersistence);
  		/* note InvalidOid is OK in this case */
  	}
  
*** a/src/backend/commands/tablespace.c
--- b/src/backend/commands/tablespace.c
***************
*** 86,91 ****
--- 86,92 ----
  
  /* GUC variables */
  char *default_tablespace = NULL;
+ char *default_index_tblspc = NULL;
  char *temp_tablespaces = NULL;
  
  
***************
*** 1085,1090 **** GetDefaultTablespace(char relpersistence)
--- 1086,1149 ----
  
  
  /*
+  * GetIndexTablespace -- get the OID of the tablespace for an index
+  *
+  * Temporary objects have different default tablespaces, hence the
+  * relpersistence parameter must be specified.
+  *
+  * May return InvalidOid to indicate "use the database's default tablespace".
+  *
+  * Note that caller is expected to check appropriate permissions for any
+  * result other than InvalidOid.
+  *
+  * This exists to hide (and possibly optimize the use of) the
+  * default_index_tablespace GUC variable.
+  */
+ Oid
+ GetIndexTablespace(char relpersistence)
+ {
+ 	Oid			result;
+ 	const char *tablespace;
+ 
+ 	/* The temp-table case is handled elsewhere */
+ 	if (relpersistence == RELPERSISTENCE_TEMP)
+ 	{
+ 		PrepareTempTablespaces();
+ 		return GetNextTempTableSpace();
+ 	}
+ 
+ 	/* Fast path for empty GUC: check defaults */
+ 	if (default_index_tblspc == NULL || default_index_tblspc[0] == '\0')
+ 	{
+ 		/* if default_tablespace is also empty, return immediately */
+ 		if (default_tablespace == NULL || default_tablespace[0] == '\0')
+ 			return InvalidOid;
+ 		else
+ 			tablespace = default_tablespace;
+ 	}
+ 	else
+ 		tablespace = default_index_tblspc;
+ 
+ 	/*
+ 	 * It is tempting to cache this lookup for more speed, but then we would
+ 	 * fail to detect the case where the tablespace was dropped since the GUC
+ 	 * variable was set.  Note also that we don't complain if the value fails
+ 	 * to refer to an existing tablespace; we just silently return InvalidOid,
+ 

Re: [HACKERS] PATCH: default_index_tablespace

2015-04-15 Thread Tom Lane
jltal...@adv-solutions.net writes:
 This small patch implements a new GUC (default_index_tablespace) plus 
 supporting code.
 Originated from a customer request, the feature intends to make 
 creation of indexes on SSD-backed tablespaces easy and convenient 
 (almost transparent) for users: the DBA can just set it and indexes will 
 be placed in the specified tablespace --as opposed to the same 
 tablespace where the referenced table is-- without having to specify it 
 every time.

I'm afraid this idea is a nonstarter, because it will break existing
applications, and in particular existing pg_dump output files, which
expect to be able to determine an index's tablespace by setting
default_tablespace.  (It is *not* adequate that the code falls back
to default_tablespace if the new GUC is unset; if it is set, you've
still broken pg_dump.)  The incremental value, if indeed there is any,
of being able to control index positioning this way seems unlikely to
justify a backwards-compatibility break of such magnitude.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] inherit support for foreign tables

2015-04-15 Thread Etsuro Fujita

On 2015/04/15 3:52, Alvaro Herrera wrote:

On 4/14/15 5:49 AM, Etsuro Fujita wrote:

postgres=# create foreign table ft1 (c1 int) server myserver options
(table_name 't1');
CREATE FOREIGN TABLE
postgres=# create foreign table ft2 (c1 int) server myserver options
(table_name 't2');
CREATE FOREIGN TABLE
postgres=# alter foreign table ft2 inherit ft1;
ALTER FOREIGN TABLE
postgres=# select * from ft1 for update;
ERROR:  could not find junk tableoid1 column

I think this is a bug.  Attached is a patch fixing this issue.



I think the real question is whether we're now (I mean after this patch)
emitting useless tableoid columns that we didn't previously have.  I
think the answer is yes, and if so I think we need a smaller comb to fix
the problem.  This one seems rather large.


My answer for that would be *no*, because I think tableoid would be 
needed for EvalPlanQual checking in more complex SELECT FOR UPDATE on 
the inheritance tree, or UPDATE/DELETE involving the inheritance tree as 
a source table.  Also, if we allow the FDW to change the behavior of 
SELECT FOR UPDATE so as to match the local semantics exactly, which I'm 
working on in another thread, I think tableoid would also be needed for 
the actual row locking.


Best regards,
Etsuro Fujita


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] reparsing query

2015-04-15 Thread Qingqing Zhou
On Wed, Apr 15, 2015 at 5:19 PM, Lukas Fittl lu...@fittl.com wrote:
 It'd be interesting to explore if there is some way to make this less
 hack-ish, and enable tools to parse queries in a better way. Essentially
 what is needed is some way to reliably translate SQL into an AST-like
 output, from an outside tool, whilst reusing the current PostgreSQL parser.

It is not difficult to output the parsed query in some tool-readable
format, but it comes with a maintenance overhead: once tools rely on it,
we have to conform to some schema continuously, like the xml/xmlns. Do
we want to take this on? It depends on how far the tools can go with the
exposed information.

We actually already have explain command in json format and tools
drawing/analyzing query plan rely on it. Will that work for your
scenario?

Regards,
Qingqing


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] TAP tests of pg_rewind not stopping servers used for the tests

2015-04-15 Thread Michael Paquier
On Thu, Apr 16, 2015 at 1:57 AM, Heikki Linnakangas wrote:
 Thanks, committed. I kept the END block, though, so that we still clean up
 if the test dies with an exception.

That makes sense. Thanks.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [COMMITTERS] pgsql: Move pg_upgrade from contrib/ to src/bin/

2015-04-15 Thread Alvaro Herrera
Michael Paquier wrote:
 On Thu, Apr 16, 2015 at 4:15 AM, Alvaro Herrera
 alvhe...@2ndquadrant.com wrote:

  Well, *I* can't write the MSVC patch, but I can't find anyone to
  volunteer some time either :-(  I think it's best to have MSVC skip the
  step and have the Makefile-based animals run it.  That would give us a
  lot more coverage than currently, which is just two animals AFAIU.
 
 Meh. I'll just do it then. We've been slacking for some time regarding
 that, and it is not cool to have the modules untested on Windows.

No kidding.

 So let's do the following:
 - Rework the patch doing refactoring of contrib/ and src/test/modules in MSVC.
 - Include the modules in install when install target is 'all' or
 default, like we did in previous releases.
 - Add a new option in vcregress, modulecheck to kick the tests of
 those modules. vcregress will need to refactoring as well. This makes
 3 patches visibly, one for the build part, one for the install part
 and one for vcregress.
 Sounds fine?

Sounds like a plan.

  We've handled the buildfarm being red for a few days before. People are
  usually good about applying fixes fairly quickly.
 
  Okay.
 
 I did it on my stuff already...

Great.  It's a bit less red already ...

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PATCH: default_index_tablespace

2015-04-15 Thread Stephen Frost
* Bruce Momjian (br...@momjian.us) wrote:
 On Wed, Apr 15, 2015 at 07:12:11PM -0400, Tom Lane wrote:
  jltal...@adv-solutions.net writes:
   This small patch implements a new GUC (default_index_tablespace) plus 
   supporting code.
   Originated from a customer request, the feature intends to make 
   creation of indexes on SSD-backed tablespaces easy and convenient 
   (almost transparent) for users: the DBA can just set it and indexes will 
   be placed in the specified tablespace --as opposed to the same 
   tablespace where the referenced table is-- without having to specify it 
   every time.
  
  I'm afraid this idea is a nonstarter, because it will break existing
  applications, and in particular existing pg_dump output files, which
  expect to be able to determine an index's tablespace by setting
  default_tablespace.  (It is *not* adequate that the code falls back
  to default_tablespace if the new GUC is unset; if it is set, you've
  still broken pg_dump.)  The incremental value, if indeed there is any,
  of being able to control index positioning this way seems unlikely to
  justify a backwards-compatibility break of such magnitude.
 
 I can see why someone would want this because random I/O, which is
 frequent for indexes, is much faster on SSD than magnetic disks. 
 (Sequential I/O is more similar for the two.)

The general idea is something I've brought up previously also (I believe
it was even discussed at the Dev meeting in, uh, 2013?) so I'm not
anxious to simply dismiss it, but it'd certainly have to be done
correctly..

Thanks,

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] Improve sleep processing of pg_rewind TAP tests

2015-04-15 Thread Alvaro Herrera
Michael Paquier wrote:

 However after discussion with a colleague we have noticed that those
 values may not be enough in slow environments, a value of up to 10s
 being sometimes needed after promotion to make tests pass.

Yeah, hardcoded sleep times are not reliable.  (/me would love to get
rid of hardcoded times in isolationtester timeout.spec test ...)

 Attached is a patch improving this sleep logic and doing the following things:
 1) To ensure that standby has caught up, check replay position on the
 standby and compare it with the current WAL position of master.

 2) To ensure that promotion is effective, use pg_is_in_recovery() and
 continue processing until we are sure that the standby is out of
 recovery.

Seems reasonable, but why are you sleeping 1s if pg_ctl -w is in use?  I
thought the -w would wait until promotion has taken effect, so there's
no need to sleep additional time.

 Note that this patch adds a small routine called command_result in
 TestLib.pm, able to return stdout, stderr and the exit code. That's
 really handy, and I am planning to use something like that as well in
 the replication test suite.

Makes sense to me.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Turning off HOT/Cleanup sometimes

2015-04-15 Thread Peter Geoghegan
On Wed, Apr 15, 2015 at 6:10 AM, Andres Freund and...@anarazel.de wrote:
  Of
  course, that applies in all cases, but when the page is already dirty,
  the cost of pruning it is probably quite small - we're going to have
  to write the page anyway, and pruning it before it gets evicted
  (perhaps even by our scan) will be cheaper than writing it now and
  writing it again after it's pruned.  When the page is clean, the cost
  of pruning is significantly higher.

 We aren't going to have to write the page, but someone will.

 If it's already dirty that doesn't change at all. *Not* pruning in that
 moment actually will often *increase* the total amount of writes to the
 OS. Because now the pruning will happen on the next write access or
 vacuum - when the page already might have been undirtied.

 I don't really see the downside to this suggestion.

+1. I think, in general, the opportunity cost of not pruning when a
page is already dirty is likely to be rather high. In general, it's
likely to be worth it.


-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PATCH: default_index_tablespace

2015-04-15 Thread Bruce Momjian
On Wed, Apr 15, 2015 at 07:12:11PM -0400, Tom Lane wrote:
 jltal...@adv-solutions.net writes:
  This small patch implements a new GUC (default_index_tablespace) plus 
  supporting code.
  Originated from a customer request, the feature intends to make 
  creation of indexes on SSD-backed tablespaces easy and convenient 
  (almost transparent) for users: the DBA can just set it and indexes will 
  be placed in the specified tablespace --as opposed to the same 
  tablespace where the referenced table is-- without having to specify it 
  every time.
 
 I'm afraid this idea is a nonstarter, because it will break existing
 applications, and in particular existing pg_dump output files, which
 expect to be able to determine an index's tablespace by setting
 default_tablespace.  (It is *not* adequate that the code falls back
 to default_tablespace if the new GUC is unset; if it is set, you've
 still broken pg_dump.)  The incremental value, if indeed there is any,
 of being able to control index positioning this way seems unlikely to
 justify a backwards-compatibility break of such magnitude.

I can see why someone would want this because random I/O, which is
frequent for indexes, is much faster on SSD than magnetic disks. 
(Sequential I/O is more similar for the two.)

-- 
  Bruce Momjian  br...@momjian.us        http://momjian.us
  EnterpriseDB                           http://enterprisedb.com

  + Everyone has their own god. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PATCH: default_index_tablespace

2015-04-15 Thread Amit Kapila
On Thu, Apr 16, 2015 at 8:01 AM, Bruce Momjian br...@momjian.us wrote:

 On Wed, Apr 15, 2015 at 07:12:11PM -0400, Tom Lane wrote:
  jltal...@adv-solutions.net writes:
   This small patch implements a new GUC (default_index_tablespace) plus
   supporting code.
   Originated from a customer request, the feature intends to make
   creation of indexes on SSD-backed tablespaces easy and convenient
   (almost transparent) for users: the DBA can just set it and indexes
will
   be placed in the specified tablespace --as opposed to the same
   tablespace where the referenced table is-- without having to specify
it
   every time.
 
  I'm afraid this idea is a nonstarter, because it will break existing
  applications, and in particular existing pg_dump output files, which
  expect to be able to determine an index's tablespace by setting
  default_tablespace.  (It is *not* adequate that the code falls back
  to default_tablespace if the new GUC is unset; if it is set, you've
  still broken pg_dump.)  The incremental value, if indeed there is any,
  of being able to control index positioning this way seems unlikely to
  justify a backwards-compatibility break of such magnitude.

 I can see why someone would want this because random I/O, which is
 frequent for indexes, is much faster on SSD than magnetic disks.
 (Sequential I/O is more similar for the two.)


Another way to provide a different default tablespace for indexes could be
to provide it at the database level:  have a new option INDEX_TABLESPACE
in the Create Database command which can be used to create indexes
when not specified during the Create Index command.  This would also need
changes in pg_dump (like while dumping info about the database) but on
initial look, it seems it can be done without many changes.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] reparsing query

2015-04-15 Thread Lukas Fittl
On Wed, Apr 15, 2015 at 8:39 PM, Qingqing Zhou zhouqq.postg...@gmail.com
wrote:

 It is not difficult to output parsed query in some tool readable
 format but it comes with a maintain overhead: once tools rely on it,
 we have to conform to some schema continuously, like the xml/xmlns. Do
 we want to take this? Depends on how far the tools can go with this
 exposed information.

 We actually already have explain command in json format and tools
 drawing/analyzing query plan rely on it. Will that work for your
 scenario?


Sure, that would work - but as I understand adding an explicit SQL command
(or function) is not something that would ever be merged into Postgres
core. Especially since that command's output could easily change across
versions.

My thought was more along the lines of making something like raw_parser +
nodeToString available through an extension, but with a JSON output format.

Note that an important detail in the monitoring case is that you don't
necessarily have the statement's underlying relations available (since you
might work with statistics data on a different machine).

Best,
Lukas

-- 
Lukas Fittl

Skype: lfittl
Phone: +1 415 321 0630


Re: [HACKERS] reparsing query

2015-04-15 Thread Lukas Fittl
On Wed, Apr 15, 2015 at 2:43 PM, Qingqing Zhou zhouqq.postg...@gmail.com
wrote:

 Is this a proposal to have a better formatted (JSON etc)
 debug_print_parse results?


I've run into the need for this as well for monitoring purposes, my
solution was to compile the needed object files into a library (see [0]).

It'd be interesting to explore if there is some way to make this less
hack-ish, and enable tools to parse queries in a better way. Essentially
what is needed is some way to reliably translate SQL into an AST-like
output, from an outside tool, whilst reusing the current PostgreSQL parser.

I believe the recent work on deparsing DDL statements relates to some of
that (i.e. might be re-usable for this purpose, through an extension), if
I'm understanding correctly.

[0]: https://github.com/pganalyze/pg_query

Best,
Lukas

-- 
Lukas Fittl

Skype: lfittl
Phone: +1 415 321 0630


[HACKERS] Improve sleep processing of pg_rewind TAP tests

2015-04-15 Thread Michael Paquier
Hi all,

TAP tests of pg_rewind are using in 2 places hardcoded values to wait
for a given amount of time for some events. In HEAD, those things are:
1) Wait for 1s for standby to catch up.
2) Wait for 2s for promotion of standby.
However after discussion with a colleague we have noticed that those
values may not be enough in slow environments, a value of up to 10s
being sometimes needed after promotion to make tests pass. And
actually the current way of doing this is not reliable because it depends
on how quickly the environment is able to handle standby replay and
promotion (I am expecting issues regarding that on Windows btw, and on
small spec machines).

Attached is a patch improving this sleep logic and doing the following things:
1) To ensure that standby has caught up, check replay position on the
standby and compare it with the current WAL position of master.
2) To ensure that promotion is effective, use pg_is_in_recovery() and
continue processing until we are sure that the standby is out of
recovery.
In each case the patch attached makes a maximum of 30 attempts, each
attempt waiting 1s before processing, so we let a margin of up to 30s
for environments to handle replay and promotion properly.

Note that this patch adds a small routine called command_result in
TestLib.pm, able to return stdout, stderr and the exit code. That's
really handy, and I am planning to use something like that as well in
the replication test suite.
Regards,
-- 
Michael
diff --git a/src/bin/pg_rewind/RewindTest.pm b/src/bin/pg_rewind/RewindTest.pm
index e6a5b9b..86e9def 100644
--- a/src/bin/pg_rewind/RewindTest.pm
+++ b/src/bin/pg_rewind/RewindTest.pm
@@ -183,7 +183,7 @@ sub create_standby
 	# Base backup is taken with xlog files included
	system_or_bail("pg_basebackup -D $test_standby_datadir -p $port_master -x >>$log_path 2>&1");
 	append_to_file($test_standby_datadir/recovery.conf, qq(
-primary_conninfo='$connstr_master'
+primary_conninfo='$connstr_master application_name=rewind_standby'
 standby_mode=on
 recovery_target_timeline='latest'
 ));
@@ -191,8 +191,29 @@ recovery_target_timeline='latest'
 	# Start standby
	system_or_bail("pg_ctl -w -D $test_standby_datadir -o \"-k $tempdir_short --listen-addresses='' -p $port_standby\" start >>$log_path 2>&1");
 
-	# sleep a bit to make sure the standby has caught up.
-	sleep 1;
+	# Wait until the standby has caught up with the primary by comparing
+	# WAL positions on both nodes. Note that this is fine
+	my $max_attempts = 30;
+	my $attempts = 0;
+	while ($attempts < $max_attempts)
+	{
+		# Wait a bit before proceeding.
+		sleep 1;
+		$attempts++;
+
+		my $query = "SELECT pg_current_xlog_location() = replay_location FROM pg_stat_replication WHERE application_name = 'rewind_standby';";
+		my $cmd = ['psql', '-At', '-c', $query, '-d', $connstr_master ];
+		my ($res, $stdout, $stderr) = command_result($cmd);
+		chomp($stdout);
+		if ($stdout eq "t")
+		{
+			last;
+		}
+	}
+	if ($attempts == $max_attempts)
+	{
+		die "Maximum number of attempts reached when waiting for standby to catch up";
+	}
 }
 
 sub promote_standby
@@ -201,9 +222,31 @@ sub promote_standby
 	# up standby
 
 	# Now promote slave and insert some new data on master, this will put
-	# the master out-of-sync with the standby.
+	# the master out-of-sync with the standby. Be sure that we leave here
+	# with a standby actually ready for the next operations.
	system_or_bail("pg_ctl -w -D $test_standby_datadir promote >>$log_path 2>&1");
-	sleep 2;
+	my $max_attempts = 30;
+	my $attempts = 0;
+	while ($attempts < $max_attempts)
+	{
+		# Wait a bit before proceeding, promotion may have not taken effect
+		# in such a short time.
+		sleep 1;
+		$attempts++;
+
+		my $query = "SELECT pg_is_in_recovery()";
+		my $cmd = ['psql', '-At', '-c', $query, '-d', $connstr_standby ];
+		my ($res, $stdout, $stderr) = command_result($cmd);
+		chomp($stdout);
+		if ($stdout eq "f")
+		{
+			last;
+		}
+	}
+	if ($attempts == $max_attempts)
+	{
+		die "Maximum number of attempts reached when waiting for promotion of standby";
+	}
 }
 
 sub run_pg_rewind
diff --git a/src/test/perl/TestLib.pm b/src/test/perl/TestLib.pm
index 003cd9a..3c6a49e 100644
--- a/src/test/perl/TestLib.pm
+++ b/src/test/perl/TestLib.pm
@@ -16,6 +16,7 @@ our @EXPORT = qw(
   command_ok
   command_fails
   command_exit_is
+  command_result
   program_help_ok
   program_version_ok
   program_options_handling_ok
@@ -161,6 +162,14 @@ sub command_exit_is
 	is($h-result(0), $expected, $test_name);
 }
 
+sub command_result
+{
+	my ($cmd) = @_;
+	my ($stdout, $stderr);
+	my $result = run $cmd, '>', \$stdout, '2>', \$stderr;
+	return ($result, $stdout, $stderr);
+}
+
 sub program_help_ok
 {
 	my ($cmd) = @_;

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Improve sleep processing of pg_rewind TAP tests

2015-04-15 Thread Michael Paquier
On Thu, Apr 16, 2015 at 12:51 PM, Alvaro Herrera wrote:
 Michael Paquier wrote:

 However after discussion with a colleague we have noticed that those
 values may not be enough in slow environments, a value of up to 10s
 being sometimes needed after promotion to make tests pass.

 Yeah, hardcoded sleep times are not reliable.  (/me would love to get
 rid of hardcoded times in isolationtester timeout.spec test ...)

 Attached is a patch improving this sleep logic and doing the following 
 things:
 1) To ensure that standby has caught up, check replay position on the
 standby and compare it with the current WAL position of master.

 2) To ensure that promotion is effective, use pg_is_in_recovery() and
 continue processing until we are sure that the standby is out of
 recovery.

 Seems reasonable, but why are you sleeping 1s if pg_ctl -w is in use?  I
 thought the -w would wait until promotion has taken effect, so there's
 no need to sleep additional time.

Visibly that's not the case for this test: the timing issues that
we saw happened not because of the standby not catching up, but
because of the promotion not taking effect in a timely fashion. And
that's something I saw on my OSX box some days ago as well,
explaining why the sleep time was increased from 1 to 2 in
53ba1077. This patch just takes the careful way. In perl, system
waits for the pg_ctl process to exit before moving on; perhaps the
problem is that pg_ctl thinks that promotion is done, but the node is
not quite ready, explaining why the test fails.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [COMMITTERS] pgsql: Move pg_upgrade from contrib/ to src/bin/

2015-04-15 Thread Michael Paquier
On Thu, Apr 16, 2015 at 4:15 AM, Alvaro Herrera
alvhe...@2ndquadrant.com wrote:
 Andrew Dunstan wrote:

 Yes, we do want support for testmodules. I think that needs to be built into
 src/tools/msvc, probably in vcregress.pl. Once we have that the buildfarm
 changes will be trivial. But if that's not forthcoming quickly I can just
 avoid the step on msvc for now.

 Well, *I* can't write the MSVC patch, but I can't find anyone to
 volunteer some time either :-(  I think it's best to have MSVC skip the
 step and have the Makefile-based animals run it.  That would give us a
 lot more coverage than currently, which is just two animals AFAIU.

Meh. I'll just do it then. We've been slacking for some time regarding
that, and it is not cool to have the modules untested on Windows. So
let's do the following:
- Rework the patch doing refactoring of contrib/ and src/test/modules in MSVC.
- Include the modules in install when install target is 'all' or
default, like we did in previous releases.
- Add a new option in vcregress, modulecheck to kick the tests of
those modules. vcregress will need refactoring as well. This makes
3 patches visibly, one for the build part, one for the install part
and one for vcregress.
Sounds fine?

 We've handled the buildfarm being red for a few days before. People are
 usually good about applying fixes fairly quickly.

 Okay.

I did it on my stuff already...
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2015-04-15 Thread Martijn van Oosterhout
On Wed, Apr 15, 2015 at 12:37:44AM -0400, Robert Haas wrote:
  I think such a solution will be good for the cases when many evictions
  needs to be performed to satisfy the workload,  OTOH when there are
  not too many evictions that needs to be done, in such a case some of
  the buffers that are accessed much more will have equal probability to
  get evicted as compare to buffers which are less accessed.
 
 Possibly, but I think it's even worse under the current algorithm.
 Under this proposal, if we go for a long time without any buffer
 evictions, every buffer usage's count will top out at 2 more than
 wherever it was after the last clock sweep.   In the worst case, every
 buffer (or most of them) could end up with the same usage count.  But
 under the status quo, they'll all go to 5, which is an even bigger
 loss of information, and which will make the first eviction much more
 expensive than if they are all pegged at 2 or 3.

I've been following this thread from the side with interest and got
twigged by the point about loss of information.  If you'd like better
information about relative ages, you can achieve this by raising the
cap on the usage count and dividing (or right-shifting) each sweep.

This would allow you to remember much more about the relative
worth of often used pages.  With a cap of 32 you'd have the same effect
as now where after 5 sweeps the buffer is evicted.  Mathematically the
count would converge to the number of times the block is used per
sweep.

If you wanted to be really clever, you could at the beginning of each
sweep take an estimate of the number of buffers used since the last
sweep (from the stats collector perhaps) and use that to drive your
divisor, so if you have a lots of allocations you become more
aggressive about reducing the counts.  Or if the load is light fall
back to just subtracting one.  Then you don't need a cap at all.
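
A tiny sketch of the halving idea in C (illustrative only, not the
actual bufmgr.c clock sweep):

    #define USAGE_CAP 32			/* higher cap keeps more history */

    /* On access: bump the usage count, saturating at the cap. */
    static void
    bump_usage(int *usage_count)
    {
    	if (*usage_count < USAGE_CAP)
    		(*usage_count)++;
    }

    /*
     * On each full sweep: halve every count instead of subtracting one.
     * A buffer at the cap decays to zero in about log2(32) = 5 idle
     * sweeps, and a hot buffer's count converges to its uses per sweep.
     */
    static void
    decay_usage(int *usage_count)
    {
    	*usage_count >>= 1;
    }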

(Apologies if this has been suggested before, Google didn't find
anything for me).

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer


signature.asc
Description: Digital signature


Re: [HACKERS] reparsing query

2015-04-15 Thread Alvaro Herrera
Qingqing Zhou wrote:
 On Wed, Apr 15, 2015 at 1:40 PM, Andrzej Barszcz abus...@gmail.com wrote:
  I knock once again with this : reparse query to XML ( last knock 5-6 years
  before) .
 
 
 What exactly reparse query to XML does?

His old posting:
https://www.postgresql.org/message-id/1247323023.16438.35.camel%40ab-desktop

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Clock sweep not caching enough B-Tree leaf pages?

2015-04-15 Thread Greg Stark
On Wed, Apr 15, 2015 at 5:26 AM, Robert Haas robertmh...@gmail.com wrote:
 The way our cache works we promote when a buffer is accessed but we only
 demote when a buffer is flushed. We flush a lot less often than we touch
 buffers so it's not surprising that the cache ends up full of buffers that
 are all in the most recently used section.

 This isn't really correct.  We promote when it's accessed, but we
 demote it when the clock sweep hand passes over it, which happens each
 time we consider it for eviction.  It does not have to do with
 flushing dirty date to disk, and it does not happen only when the
 buffer is actually evicted.


This is my point though (you're right that flushed isn't always the
same as eviction but that's not the important point here). Right now
we only demote when we consider buffers for eviction. But we promote
when we pin buffers. Those two things aren't necessarily happening at
the same rate and in fact are often orders of magnitude different. So
it makes sense that we end up with a lot of buffers promoted all the
way to the most recently used ntile and then have to do n passes
around the clock and have no good information about which buffer to
evict.

What I'm saying is that we should demote a buffer every time we
promote a buffer. So every time we pin a buffer we should advance the
clock a corresponding amount. (I know I'm being intentionally vague
about what the corresponding amount is.) The important thing is that
the two should be tied together.
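
A rough sketch of what coupling the two rates could look like (purely
hypothetical names, nothing like the real bufmgr code):

    /* Advance the sweep hand a fraction of a step on every pin, so
     * demotion pressure scales with access traffic instead of with
     * the eviction rate.  PINS_PER_TICK is the vague "corresponding
     * amount" above. */
    #define PINS_PER_TICK 8

    static int	pin_counter = 0;

    static void
    on_buffer_pin(void)
    {
    	if (++pin_counter >= PINS_PER_TICK)
    	{
    		pin_counter = 0;
    		advance_clock_hand_sketch();	/* stand-in: decrements one
    										 * buffer's usage count */
    	}
    }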

-- 
greg


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] initdb -S and tablespaces

2015-04-15 Thread Abhijit Menon-Sen
At 2015-04-06 10:16:36 -0300, alvhe...@2ndquadrant.com wrote:

 Well, we have many things that can be set as symlinks; some you can
 have initdb create the links for you (initdb --xlogdir), others you
 can move manually.

Looking at sendDir() in backend/replication/basebackup.c, I notice that
the only places where it expects/allows symlinks are for pg_xlog and in
pg_tblspc.

Still, thanks to the example in basebackup.c, I've got most of a patch
that should follow any symlinks in PGDATA. I just need to test a little
more before I post it.

-- Abhijit


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] FPW compression leaks information

2015-04-15 Thread Heikki Linnakangas

On 04/10/2015 05:17 AM, Robert Haas wrote:

I bet that there are at least 1000 covert channel attacks that are
more practically exploitable than this. When this is anywhere near
the top of the list of things to worry about, I recommend we throw a
huge party and then fly home in our faster-than-light starships
which, by then, will probably be available at Walmart.


Let's try to put some perspective on how serious (or not) this leak is. 
I'm going to use finding a password hash in pg_authid as the target.


After each checkpoint, we can make one guess, and see how that 
compresses. If we assume ideal conditions, and guess one hex digit at a 
time (the md5 hashes are stored as hex strings), you can find a 32 digit 
hash in 16*32 = 512 checkpoints. You could possibly improve on that, by 
doing a binary search of the digits: first use a guess that contains 
digits 0-8, then 9-f, and find which range the victim was in. Then split 
that range in two, repeat, until you find the digit. With that method, 
the theoretical minimum is log2(16)*32 = 128 checkpoints.


Obviously, the real world is more complicated and you won't get very 
close to that ideal case.


For my entertainment, I wrote a little exploit script (attached). I 
didn't use the binary-search approach, it guesses one digit at a time. 
Testing on my laptop, in a newly initdb'd database with no other 
activity, it found a full MD5 hash in 3594 checkpoints. I cheated to 
make it faster, and used the CHECKPOINT command instead of waiting or 
writing a lot of WAL, to trigger the checkpoints, but that shouldn't 
affect the number of checkpoints required. Your mileage may vary, of 
course, and I'm sure this script will get confused and not work in many 
cases, but OTOH I'm sure you could also optimize it further.


If we take that ~4000 checkpoints figure, and checkpoint_timeout=5 
minutes, it would take about two weeks. But if you're happy with getting 
just the first 4 or 8 bytes or so, and launch a dictionary attack on 
that, you can stop much earlier than that.


- Heikki
#!/usr/bin/python3
#
# Prerequisites:
# Create a user called eve. The script logs in as that user, and changes
# its password repeatedly. You also need to create a dummy table called
# garbage, which is used by the script to create some dummy WAL activity.
#
# CREATE USER eve;
# CREATE TABLE garbage (t text);
# ALTER TABLE garbage OWNER To eve;
# 
#
# Usage:
# Edit the connection string below. Launch the script. It will run
# for a long time, printing out a line after each guess (after each
# checkpoint).
#
# If you want to make the script run faster, you can cheat and grant eve
# superuser privileges, and modify the perform_checkpoint function below
# to use a direct CHECKPOINT command, instead of forcing a checkpoint with
# write activity.
#
# Caveats:
# There are a lot of things that could confuse the script and stall it:
# * Concurrent update of pg_authid table
# * Other concurrent WAL activity
# * More than one other ueser on the same pg_authid page (or even a dead row)
#
import psycopg2;
import statistics;

connectstr = "host=localhost dbname=postgres user=eve";

# characters to use for padding our guesses. This needs to be non-repeating
# and not contain hex characters (0-9, a-f), to avoid compression.
paddingstr = "ABCDEFGHIJKLMNOPQRSTUVXYZghijklmnopqrstuvxyz";

# If you have an initial guess of the first N bytes, you can type it here
initial_guess = "md5";

# length of the string we're guessing.
victim_length = 3 + 32;

# How many times to repeat each guess? Increasing this makes the attack take
# longer, but makes it less likely to get confused by small random differences
# in WAL record sizes.
num_repeats = 5;



checkpoints = 0;  # how many checkpoints have we executed so far?

# Perform a checkpoint. We can't just call CHECKPOINT, because that's
# superuser-only. We resort to doing dummy activity until a checkpoint
# is triggered.
#
# There is also no direct way of detecting when a checkpoint has happened, so
# we take advantage of the WAL compression to detect that. We insert a 1k row
# that compresses well. Normally, the WAL record will take somewhat over 1k
# bytes. But just after a checkpoint has happened, a full-page image is taken,
# and with full-page compression enabled, the WAL record compresses to much
# less than 1k bytes.
#
def perform_checkpoint():
  global checkpoints;
  checkpoints += 1;

  # If you're testing this as superuser, you can cheat and perform a direct
  # CHECKPOINT command by uncommenting these two lines. It makes the script
  # run *much* faster.
  #c.execute("CHECKPOINT");
  #return;

  conn.rollback();
  conn.autocommit = True;
  c.execute("vacuum garbage");
  conn.autocommit = False;
  c.execute("SELECT pg_current_xlog_insert_location();");
  beforexlog = c.fetchone()[0];
  for i in range(0, 10):
    c.execute("INSERT INTO garbage VALUES (repeat('x', 1000))");
    c.execute("SELECT pg_current_xlog_insert_location(), pg_current_xlog_insert_location() - %s;",

Re: [HACKERS] Turning off HOT/Cleanup sometimes

2015-04-15 Thread Simon Riggs
On 14 April 2015 at 21:53, Robert Haas robertmh...@gmail.com wrote:

 Peter commented previously that README.HOT should get an update.  The
 relevant section seems to be "When can/should we prune or
 defragment?".

That's easy enough to change once we agree to commit.

 I wonder if it would be a useful heuristic to still prune pages if
 those pages are already dirty.

Useful for whom? This is about responsibility. Why should someone
performing a large SELECT take the responsibility for cleaning pages?

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, RemoteDBA, Training  Services




Re: [HACKERS] Make more portable TAP tests of initdb

2015-04-15 Thread Michael Paquier
On Wed, Apr 15, 2015 at 2:38 PM, Noah Misch wrote:
 Solaris 10 ships Perl 5.8.4, and RHEL 5.11 ships Perl 5.8.8.  Therefore, Perl
 installations lacking this File::Path feature will receive vendor support for
 years to come.  Replacing the use of keep_root with rmtree+mkdir will add 2-10
 lines of code, right?  Let's do that and not raise the system requirements.

Good point. Let's use rmtree and mkdir in combination, then. Attached is an updated patch.
-- 
Michael
diff --git a/src/bin/initdb/t/001_initdb.pl b/src/bin/initdb/t/001_initdb.pl
index d12be84..483ebb5 100644
--- a/src/bin/initdb/t/001_initdb.pl
+++ b/src/bin/initdb/t/001_initdb.pl
@@ -1,6 +1,7 @@
 use strict;
 use warnings;
 use TestLib;
+use File::Path qw(rmtree);
 use Test::More tests => 19;
 
 my $tempdir = TestLib::tempdir;
@@ -18,27 +19,30 @@ command_fails([ 'initdb', '-S', "$tempdir/data3" ],
 mkdir "$tempdir/data4" or BAIL_OUT($!);
 command_ok([ 'initdb', "$tempdir/data4" ], 'existing empty data directory');
 
-system_or_bail "rm -rf '$tempdir'/*";
-
+rmtree($tempdir);
+mkdir $tempdir;
 command_ok([ 'initdb', '-X', "$tempdir/pgxlog", "$tempdir/data" ],
 	'separate xlog directory');
 
-system_or_bail "rm -rf '$tempdir'/*";
+rmtree($tempdir);
+mkdir $tempdir;
 command_fails(
 	[ 'initdb', "$tempdir/data", '-X', 'pgxlog' ],
 	'relative xlog directory not allowed');
 
-system_or_bail "rm -rf '$tempdir'/*";
+rmtree($tempdir);
+mkdir $tempdir;
 mkdir "$tempdir/pgxlog";
 command_ok([ 'initdb', '-X', "$tempdir/pgxlog", "$tempdir/data" ],
 	'existing empty xlog directory');
 
-system_or_bail "rm -rf '$tempdir'/*";
+rmtree($tempdir);
+mkdir $tempdir;
 mkdir "$tempdir/pgxlog";
 mkdir "$tempdir/pgxlog/lost+found";
 command_fails([ 'initdb', '-X', "$tempdir/pgxlog", "$tempdir/data" ],
 	'existing nonempty xlog directory');
 
-system_or_bail "rm -rf '$tempdir'/*";
+rmtree($tempdir);
 command_ok([ 'initdb', '-T', 'german', "$tempdir/data" ],
 	'select default dictionary');



Re: [HACKERS] PATCH: adaptive ndistinct estimator v4

2015-04-15 Thread Jeff Janes
On Tue, Mar 31, 2015 at 12:02 PM, Tomas Vondra tomas.von...@2ndquadrant.com
 wrote:

 Hi all,

 attached is v4 of the patch implementing adaptive ndistinct estimator.


Hi Tomas,

I have a case here where the adaptive algorithm underestimates ndistinct by
a factor of 7 while the default estimator is pretty close.

5MB file:

https://drive.google.com/file/d/0Bzqrh1SO9FcETU1VYnQxU2RZSWM/view?usp=sharing

# create table foo2 (x text);
# \copy foo2 from program 'bzcat ~/temp/foo1.txt.bz2'
# analyze verbose foo2;
INFO:  analyzing "public.foo2"
INFO:  "foo2": scanned 6021 of 6021 pages, containing 1113772 live rows and
0 dead rows; 30000 rows in sample, 1113772 estimated total rows
WARNING: ndistinct estimate current=998951.78 adaptive=135819.00

Cheers,

Jeff


Re: [HACKERS] initdb -S and tablespaces

2015-04-15 Thread Abhijit Menon-Sen
At 2015-04-15 11:40:34 +0530, a...@2ndquadrant.com wrote:

 Still, thanks to the example in basebackup.c, I've got most of a patch
 that should follow any symlinks in PGDATA.

I notice that src/bin/pg_rewind/copy_fetch.c has a traverse_datadir()
function that does what we want (but it recurses into symlinks only
inside pg_tblspc).
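
In rough Python terms, it does something like this (a sketch of the
idea only, mine, not a translation of the C code):

import os

def walk_datadir(path):
    # Yield every file under the data directory, descending through a
    # symlink only when it sits directly under pg_tblspc (the way
    # tablespace links do); any other symlink is reported, not followed.
    for name in sorted(os.listdir(path)):
        full = os.path.join(path, name)
        if os.path.islink(full):
            if os.path.basename(path) == "pg_tblspc":
                yield from walk_datadir(os.path.realpath(full))
            else:
                yield full
        elif os.path.isdir(full):
            yield from walk_datadir(full)
        else:
            yield full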

-- Abhijit




Re: [HACKERS] [COMMITTERS] pgsql: Move pg_upgrade from contrib/ to src/bin/

2015-04-15 Thread Peter Eisentraut
On 4/14/15 8:32 PM, Peter Eisentraut wrote:
 Move pg_upgrade from contrib/ to src/bin/

Oh dear.  It appears the buildfarm client code needs to be updated to
support this.  How do we do this?  (I guess the critters that are still
green are not running this test.)

Should I revert this patch while we sort this out?


