Re: [HACKERS] Reducing ClogControlLock contention

2015-06-30 Thread Simon Riggs
On 30 June 2015 at 08:22, Simon Riggs si...@2ndquadrant.com wrote:


 This contention is masked by contention elsewhere, e.g. ProcArrayLock, so
 the need for testing here should come once other patches ahead of this are
 in.


Let me explain more clearly.

Andres' patch to cache snapshots and reduce ProcArrayLock was interesting,
but not initially compelling. We now have a solution that commits in
batches, which will reduce the number of times the ProcArray changes - this
will heighten the benefit from Andres' snapshot cache patch.

So the order of testing/commit should be

Queued commit patch
ProcArray cache patch
Clog shared commit patch (this one)

I didn't hear recent mention of Robert's chash patch, but IIRC it was
effective and so we hope to see it again soon also.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] [PROPOSAL] VACUUM Progress Checker.

2015-06-30 Thread Thomas Munro
On Tue, Jun 30, 2015 at 7:37 PM, Rahila Syed rahilasye...@gmail.com wrote:
 Hello Hackers,

 Following is a proposal for a feature to calculate VACUUM progress.

 Use Case: Measuring the progress of long-running VACUUMs to help DBAs
 make an informed decision about whether to continue running VACUUM or
 abort it.

+1

I was thinking recently that it would be very cool to see some
estimation of the progress of VACUUM and CLUSTER in a view similar to
pg_stat_activity, or the ps title.

-- 
Thomas Munro
http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PROPOSAL] VACUUM Progress Checker.

2015-06-30 Thread dinesh kumar
On Tue, Jun 30, 2015 at 1:07 PM, Rahila Syed rahilasye...@gmail.com wrote:

 Hello Hackers,

 Following is a proposal for a feature to calculate VACUUM progress.

 Use Case: Measuring the progress of long-running VACUUMs to help DBAs
 make an informed decision about whether to continue running VACUUM or
 abort it.

 +1

I am curious to know how the progress reporting works when a statement
gets blocked waiting on locks. Rather than displaying the stats in the
log, shall we have this in a pg_stat_vacuum_activity view (a new catalog
for all autovacuum stats)?

Best Regards,
Dinesh
manojadinesh.blogspot.com

Design:

 A shared preload library to store progress information from different
 backends running VACUUM, calculate the remaining time for each, and
 display progress in the form of a view.


 VACUUM  needs to be instrumented with a hook to collect progress
 information (pages vacuumed/scanned) periodically.
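
For concreteness, one possible shape for such a hook (a sketch only; the
hook name and signature here are assumptions, not necessarily what the
attached patch implements):

    /* Sketch only: hook name and signature are assumptions. */
    typedef void (*vacuum_progress_hook_type) (Oid tableoid,
                                               BlockNumber scanned_pages,
                                               BlockNumber vacuumed_pages,
                                               BlockNumber total_pages);

    extern PGDLLIMPORT vacuum_progress_hook_type vacuum_progress_hook;

    /* in lazy_scan_heap(), after each heap page is processed: */
    if (vacuum_progress_hook)
        (*vacuum_progress_hook) (RelationGetRelid(onerel),
                                 scanned_pages, vacuumed_pages, nblocks);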

 The patch attached implements a new hook to store vacuumed_pages and
 scanned_pages count at the end of each page scanned by VACUUM.

 This information is stored in a shared memory structure.

 In addition to measuring progress, the hook function also calculates the
 remaining time for VACUUM.



 The frequency of collecting progress information can be reduced by
 adding delays between hook function calls.

 Also, a GUC parameter log_vacuum_min_duration can be used.

 This will cause VACUUM progress to be calculated only if VACUUM runs
 longer than the specified number of milliseconds.

 A value of zero calculates VACUUM progress for each page processed; -1
 disables logging.


 Progress calculation :


 percent_complete = scanned_pages * 100 / total_pages_to_be_scanned;

 remaining_time = elapsed_time * (total_pages_to_be_scanned -
 scanned_pages) / scanned_pages;
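
As a quick sanity check of these formulas, a hypothetical worked example
(all numbers invented for illustration):

    /* scanned_pages = 2500, total_pages_to_be_scanned = 10000,
     * elapsed_time = 60.0 seconds */
    percent_complete = 2500 * 100 / 10000;          /* = 25 */
    remaining_time = 60.0 * (10000 - 2500) / 2500;  /* = 180.0 seconds */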


 Shared memory struct:

 typedef struct PgStat_VacuumStats
 {
   Oid     databaseoid;
   Oid     tableoid;
   int32   vacuumed_pages;
   int32   total_pages;
   int32   scanned_pages;
   double  elapsed_time;
   double  remaining_time;
 } PgStat_VacuumStats;   /* one entry per backend, i.e. an array of
                          * max_connections elements in shared memory */



 Reporting :

  A view named 'pg_maintenance_progress' can be created using the values in
 the struct above.

 pg_stat_maintenance can be queried from any other backend and will display
 the progress of each running VACUUM.


 Other uses of hook in VACUUM:


 Cost of VACUUM in terms of pages hit, missed, and dirtied (the same as
 autovacuum reports) can be collected using this hook.

 Autovacuum does this at the end of VACUUM for each table; it can instead
 be done while VACUUM on a table is in progress.
 This can be helpful to track manual VACUUMs too, not just autovacuum.

 Read/write (I/O) rates can be computed along the lines of autovacuum.
 Read-rate patterns can be used to help tune future vacuums of the
 table (like shared_buffers tuning).
 Other resource usage can also be collected using the progress checker hook.


 The attached patch is a POC patch of progress calculation for a single
 backend.

 Also attached is a brief snapshot of the output log.




 --
 Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-hackers




Re: [HACKERS] Reducing ClogControlLock contention

2015-06-30 Thread Amit Kapila
On Tue, Jun 30, 2015 at 12:52 PM, Simon Riggs si...@2ndquadrant.com wrote:

 On 30 June 2015 at 08:13, Michael Paquier michael.paqu...@gmail.com
wrote:


 Could it be possible to see some performance numbers? For example with a
simple pgbench script doing a bunch of tiny transactions, with many
concurrent sessions (perhaps hundreds).


 I'm more interested to see if people think it is safe.

 This contention is masked by contention elsewhere, e.g. ProcArrayLock, so
the need for testing here should come once other patches ahead of this are
in.


Exactly, and another lock that can mask this improvement is WALWriteLock,
but for that we can take the performance data with synchronous_commit
set to off.



With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] [PROPOSAL] VACUUM Progress Checker.

2015-06-30 Thread Pavel Stehule
Hi

2015-06-30 9:37 GMT+02:00 Rahila Syed rahilasye...@gmail.com:

 Hello Hackers,

 Following is a proposal for a feature to calculate VACUUM progress.


Interesting idea - I would like to see it integrated into core.



 Use Case: Measuring the progress of long-running VACUUMs to help DBAs
 make an informed decision about whether to continue running VACUUM or
 abort it.

 Design:

 A shared preload library to store progress information from different
 backends running VACUUM, calculate the remaining time for each, and
 display progress in the form of a view.


A similar idea can probably be used for REINDEX, CREATE INDEX, and COPY TO
statements.

I thought about the possibilities of progress visualization - one
possibility is one or two special columns in the pg_stat_activity table -
this info can be interesting for VACUUM started by autovacuum too.

Regards

Pavel



 VACUUM  needs to be instrumented with a hook to collect progress
 information (pages vacuumed/scanned) periodically.

 The patch attached implements a new hook to store vacuumed_pages and
 scanned_pages count at the end of each page scanned by VACUUM.

 This information is stored in a shared memory structure.

 In addition to measuring progress, the hook function also calculates the
 remaining time for VACUUM.



 The frequency of collecting progress information can be reduced by
 adding delays between hook function calls.

 Also, a GUC parameter log_vacuum_min_duration can be used.

 This will cause VACUUM progress to be calculated only if VACUUM runs
 longer than the specified number of milliseconds.

 A value of zero calculates VACUUM progress for each page processed; -1
 disables logging.


 Progress calculation :


 percent_complete = scanned_pages * 100 / total_pages_to_be_scanned;

 remaining_time = elapsed_time * (total_pages_to_be_scanned -
 scanned_pages) / scanned_pages;


 Shared memory struct:

 typedef struct PgStat_VacuumStats
 {
   Oid     databaseoid;
   Oid     tableoid;
   int32   vacuumed_pages;
   int32   total_pages;
   int32   scanned_pages;
   double  elapsed_time;
   double  remaining_time;
 } PgStat_VacuumStats;   /* one entry per backend, i.e. an array of
                          * max_connections elements in shared memory */



 Reporting :

  A view named 'pg_maintenance_progress' can be created using the values in
 the struct above.

 pg_stat_maintenance can be queried from any other backend and will display
 the progress of each running VACUUM.


 Other uses of hook in VACUUM:


 Cost of VACUUM in terms of pages hit, missed, and dirtied (the same as
 autovacuum reports) can be collected using this hook.

 Autovacuum does this at the end of VACUUM for each table; it can instead
 be done while VACUUM on a table is in progress.
 This can be helpful to track manual VACUUMs too, not just autovacuum.

 Read/write (I/O) rates can be computed along the lines of autovacuum.
 Read-rate patterns can be used to help tune future vacuums of the
 table (like shared_buffers tuning).
 Other resource usage can also be collected using the progress checker hook.


 The attached patch is a POC patch of progress calculation for a single
 backend.

 Also attached is a brief snapshot of the output log.




 --
 Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-hackers




Re: [HACKERS] [PROPOSAL] VACUUM Progress Checker.

2015-06-30 Thread Simon Riggs
On 30 June 2015 at 08:52, Pavel Stehule pavel.steh...@gmail.com wrote:


 I thought about the possibilities of progress visualization - one
 possibility is one or two special columns in the pg_stat_activity table -
 this info can be interesting for VACUUM started by autovacuum too.


Yes, I suggest just a single column on pg_stat_activity called pct_complete

trace_completion_interval = 5s (default)

Every interval, we report the current % complete for any operation that
supports it. We just show NULL if the current operation has not reported
anything or never will.

We do this for VACUUM first, then we can begin adding other operations as
we work out how (for that operation).
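
A minimal sketch of what the reporting side might look like under this
proposal (hypothetical: pgstat_report_pct_complete() and
trace_completion_interval are assumed names, not existing code):

    /* Sketch only; the helper and the GUC are assumptions. */
    static TimestampTz last_report = 0;
    TimestampTz now = GetCurrentTimestamp();

    if (TimestampDifferenceExceeds(last_report, now,
                                   trace_completion_interval * 1000))
    {
        pgstat_report_pct_complete(scanned_pages * 100 / total_pages);
        last_report = now;
    }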

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Reducing ClogControlLock contention

2015-06-30 Thread Simon Riggs
On 30 June 2015 at 08:13, Michael Paquier michael.paqu...@gmail.com wrote:



 On Tue, Jun 30, 2015 at 4:02 PM, Simon Riggs si...@2ndquadrant.com
 wrote:

 ClogControlLock contention is high at commit time. This appears to be due
 to the fact that ClogControlLock is acquired in Exclusive mode prior to
 marking commit, which then gets starved by backends running
 TransactionIdGetStatus().

 Proposal for improving this is to acquire the ClogControlLock in Shared
 mode, if possible.

 This is safe because people checking visibility of an xid must always run
 TransactionIdIsInProgress() first to avoid race conditions, which will
 always return true for the transaction we are currently committing. As a
 result, we never get concurrent access to the same bits in clog, which
 would require a barrier.

 Two concurrent writers might access the same word concurrently, so we
 protect against that with a new CommitLock. We could partition that by
 pageno also, if needed.


 Could it be possible to see some performance numbers? For example with a
 simple pgbench script doing a bunch of tiny transactions, with many
 concurrent sessions (perhaps hundreds).


I'm more interested to see if people think it is safe.

This contention is masked by contention elsewhere, e.g. ProcArrayLock, so
the need for testing here should come once other patches ahead of this are
in.
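
To make the proposed ordering concrete, a rough sketch of the commit path
described above (a sketch only, not the actual patch; details such as
partitioning CommitLock by pageno are omitted):

    LWLockAcquire(CommitLock, LW_EXCLUSIVE);    /* serialize clog writers */
    LWLockAcquire(CLogControlLock, LW_SHARED);  /* readers stay concurrent */

    /* set the commit status bits for our xid on its clog page */
    TransactionIdSetPageStatus(xid, nsubxids, subxids, status, lsn, pageno);

    LWLockRelease(CLogControlLock);
    LWLockRelease(CommitLock);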

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


[HACKERS] Missing return value checks in jsonfuncs.c

2015-06-30 Thread Michael Paquier
Hi all,

Coverity is pointing out that a couple of return-value checks are
missing for JsonbIteratorNext in jsonfuncs.c. A patch is attached to
prepend (void) to those calls, as I am guessing we want to skip the value
being iterated over.
Regards,
-- 
Michael
diff --git a/src/backend/utils/adt/jsonfuncs.c b/src/backend/utils/adt/jsonfuncs.c
index 13d5b7a..b4258d8 100644
--- a/src/backend/utils/adt/jsonfuncs.c
+++ b/src/backend/utils/adt/jsonfuncs.c
@@ -3362,8 +3362,7 @@ jsonb_delete(PG_FUNCTION_ARGS)
 		{
 			/* skip corresponding value as well */
 			if (r == WJB_KEY)
-JsonbIteratorNext(it, v, true);
-
+(void) JsonbIteratorNext(it, v, true);
 			continue;
 		}
 
@@ -3436,7 +3435,7 @@ jsonb_delete_idx(PG_FUNCTION_ARGS)
 			if (i++ == idx)
 			{
 if (r == WJB_KEY)
-	JsonbIteratorNext(it, v, true);	/* skip value */
+	(void) JsonbIteratorNext(it, v, true);	/* skip value */
 continue;
 			}
 		}

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Memory leak with XLogFileCopy since de768844 (WAL file with .partial)

2015-06-30 Thread Fujii Masao
On Tue, Jun 9, 2015 at 10:09 AM, Michael Paquier
michael.paqu...@gmail.com wrote:
 On Tue, Jun 9, 2015 at 6:25 AM, Heikki Linnakangas wrote:
 I'm still not sure if I should've just reverted that refactoring, to make
 XLogFileCopy() look the same in master and back-branches, which makes
 back-patching easier, or keep the refactoring, because it makes the code
 slightly nicer. But the current situation is the worst of both worlds: the
 interface of XLogFileCopy() is no better than it used to be, but it's
 different enough to cause merge conflicts. At this point, it's probably best
 to revert the code to look the same as in 9.4.

 That's a valid concern. What about the attached then? I think that it
 is still good to keep the upto argument so as to copy only data up to the
 switch point at recovery exit. InstallXLogFileSegment() changes a bit as
 well because of the modifications to its arguments.

Applied. Thanks!

Regards,

-- 
Fujii Masao


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bug in bttext_abbrev_convert()

2015-06-30 Thread Peter Geoghegan
On Tue, Jun 30, 2015 at 5:25 PM, Jim Nasby jim.na...@bluetreble.com wrote:
 Isn't this the kind of thing Coverity's supposed to find?

I don't know, but in general I'm not very excited about static
analysis tools. The best things that they have going for them is that
they're available, and don't require test coverage in the same way as
running the regression tests with Valgrind enabled.

There is no real testing of sorting in the regression tests. It would
be nice to have a way of generating a large and varied selection of
sort operations programmatically, to catch this kind of thing.
pg_regress is not really up to it. The same applies to various other
cases where having a lot of expected output makes using pg_regress
infeasible.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] WAL-related tools and .partial WAL file

2015-06-30 Thread Fujii Masao
Hi,

WAL-related tools, i.e., pg_archivecleanup, pg_resetxlog and
pg_xlogdump, don't seem to properly handle the .partial WAL file.
I think that we should fix at least pg_archivecleanup; otherwise,
in a system using pg_archivecleanup to clean up old archived
files, the archived .partial WAL file will never be removed.
Thoughts?

To make pg_archivecleanup detect the .partial WAL file, I'd like to
use the IsPartialXLogFileName() macro that commit 179cdd0 added.
Fortunately Michael proposed a patch which makes the macros
in xlog_internal.h usable in WAL-related tools, and it's better
to apply the fix based on his patch. So my plan is to apply his
patch first and then apply the fix to pg_archivecleanup.
http://www.postgresql.org/message-id/CAB7nPqSiKU+8HjVe_J05btonC5E7kvmRaLpGS6EaEQe=dy3...@mail.gmail.com
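
For reference, a sketch of how pg_archivecleanup's cleanup loop might use
that macro (an assumption about the eventual fix, not committed code):

    while ((xlde = readdir(xldir)) != NULL)
    {
        /*
         * Treat an archived .partial segment like a completed segment
         * when comparing against the cutoff file name.
         */
        if ((IsXLogFileName(xlde->d_name) ||
             IsPartialXLogFileName(xlde->d_name)) &&
            strncmp(xlde->d_name, exclusiveCleanupFileName,
                    XLOG_FNAME_LEN) < 0)
            unlink(xlde->d_name);   /* error handling omitted in sketch */
    }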

Regarding pg_resetxlog and pg_xlogdump, do we need to change
them as well so that they can handle the .partial file properly?

pg_resetxlog cannot clean up the .partial file even when it should be
removed. But the remaining .partial file is harmless and will be
recycled later by checkpoint, so extra treatment of the .partial
file in pg_resetxlog may not be necessary. IOW, maybe we can
document that as a limitation of pg_resetxlog.

Also regarding pg_xlogdump, we can just document, for example:
please remove the .partial suffix from the WAL file name if
you want to dump it with pg_xlogdump.

Thoughts?

Regards,

-- 
Fujii Masao


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_basebackup and replication slots

2015-06-30 Thread Peter Eisentraut
On 5/21/15 8:42 AM, Peter Eisentraut wrote:
 I wonder why pg_basebackup doesn't have any support for replication slots.
 
 When relying on replication slots to hang on to WAL data, there is a gap
 between when pg_basebackup finishes and streaming replication is started
 where WAL data could be thrown away by the primary.
 
 Looking at the code, the -X stream method could easily specify a
 replication slot.  (Might be nice if it could also create it in the same
 run.)

Here is a patch set for this.

(If you're looking at the patch and wondering why there is no code to
actually do anything with the replication slot, that's because the code
that does the WAL streaming is already aware of replication slots
because of the pg_receivexlog functionality that it also serves.  So all
that needs to be done is set replication_slot.)

See also the concurrent discussion on (optionally?) initializing
restart_lsn when the replication slot is created
(http://www.postgresql.org/message-id/flat/b8d538ac5587c84898b261fcb8c7d8a41fde1...@ex10-mbx-36009.ant.amazon.com#b8d538ac5587c84898b261fcb8c7d8a41fde1...@ex10-mbx-36009.ant.amazon.com),
which might have an impact on the details of this change.


From a718282cbc18566732a0d9c3e82c8fca752f4812 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut pete...@gmx.net
Date: Tue, 30 Jun 2015 21:15:05 -0400
Subject: [PATCH 1/3] pg_basebackup: Add tests for -R option

---
 src/bin/pg_basebackup/t/010_pg_basebackup.pl | 9 ++++++++-
 src/test/perl/TestLib.pm                     | 8 ++++++++
 2 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index c8c9250..0ddfffe 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -2,7 +2,7 @@
 use warnings;
 use Cwd;
 use TestLib;
-use Test::More tests => 35;
+use Test::More tests => 39;
 
 program_help_ok('pg_basebackup');
 program_version_ok('pg_basebackup');
@@ -138,3 +138,10 @@
 command_ok([ 'pg_basebackup', '-D', "$tempdir/tarbackup_l3", '-Ft' ],
 	'pg_basebackup tar with long symlink target');
 psql 'postgres', "DROP TABLESPACE tblspc3;";
+
+command_ok([ 'pg_basebackup', '-D', "$tempdir/backupR", '-R' ],
+	'pg_basebackup -R runs');
+ok(-f "$tempdir/backupR/recovery.conf", 'recovery.conf was created');
+my $recovery_conf = slurp_file "$tempdir/backupR/recovery.conf";
+like($recovery_conf, qr/^standby_mode = 'on'$/m, 'recovery.conf sets standby_mode');
+like($recovery_conf, qr/^primary_conninfo = '.*port=$ENV{PGPORT}.*'$/m, 'recovery.conf sets primary_conninfo');
diff --git a/src/test/perl/TestLib.pm b/src/test/perl/TestLib.pm
index ef42366..60bd7d5 100644
--- a/src/test/perl/TestLib.pm
+++ b/src/test/perl/TestLib.pm
@@ -11,6 +11,7 @@ our @EXPORT = qw(
   start_test_server
   restart_test_server
   psql
+  slurp_file
   system_or_bail
 
   command_ok
@@ -129,6 +130,13 @@ sub psql
 	run [ 'psql', '-X', '-q', '-d', $dbname, '-f', '-' ], '<', \$sql or die;
 }
 
+sub slurp_file
+{
+	local $/;
+	local @ARGV = @_;
+	<>;
+}
+
 sub system_or_bail
 {
 	system(@_) == 0 or BAIL_OUT("system @_ failed: $?");
-- 
2.4.5

From 934309249f1fed9fd008eccd32906fbfd763811a Mon Sep 17 00:00:00 2001
From: Peter Eisentraut pete...@gmx.net
Date: Tue, 30 Jun 2015 21:15:29 -0400
Subject: [PATCH 2/3] pg_basebackup: Add tests for -X option

---
 src/bin/pg_basebackup/t/010_pg_basebackup.pl | 13 ++++++++++++-
 src/test/perl/TestLib.pm                     | 10 ++++++++++
 2 files changed, 22 insertions(+), 1 deletion(-)

diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index 0ddfffe..52f1575 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -2,7 +2,7 @@
 use warnings;
 use Cwd;
 use TestLib;
-use Test::More tests => 39;
+use Test::More tests => 44;
 
 program_help_ok('pg_basebackup');
 program_version_ok('pg_basebackup');
@@ -46,6 +46,10 @@
 	'pg_basebackup runs');
 ok(-f "$tempdir/backup/PG_VERSION", 'backup was created');
 
+is_deeply([ sort(slurp_dir("$tempdir/backup/pg_xlog/")) ],
+		  [ sort qw(. .. archive_status) ],
+		  'no WAL files copied');
+
 command_ok(
 	[   'pg_basebackup', '-D', "$tempdir/backup2", '--xlogdir',
 		"$tempdir/xlog2" ],
@@ -145,3 +149,10 @@
 my $recovery_conf = slurp_file "$tempdir/backupR/recovery.conf";
 like($recovery_conf, qr/^standby_mode = 'on'$/m, 'recovery.conf sets standby_mode');
 like($recovery_conf, qr/^primary_conninfo = '.*port=$ENV{PGPORT}.*'$/m, 'recovery.conf sets primary_conninfo');
+
+command_ok([ 'pg_basebackup', '-D', "$tempdir/backupxf", '-X', 'fetch' ],
+	'pg_basebackup -X fetch runs');
+ok(grep(/^[0-9A-F]{24}$/, slurp_dir("$tempdir/backupxf/pg_xlog")), 'WAL files copied');
+command_ok([ 'pg_basebackup', '-D', "$tempdir/backupxs", '-X', 'stream' ],
+	'pg_basebackup -X stream runs');
+ok(grep(/^[0-9A-F]{24}$/, slurp_dir("$tempdir/backupxs/pg_xlog")), 'WAL files copied');
diff --git 

Re: [HACKERS] Memory leak with XLogFileCopy since de768844 (WAL file with .partial)

2015-06-30 Thread Michael Paquier
On Wed, Jul 1, 2015 at 10:58 AM, Fujii Masao wrote:
 On Tue, Jun 9, 2015 at 10:09 AM, Michael Paquier wrote:
 That's a valid concern. What about the attached then? I think that it
 is still good to keep upto to copy only data up to the switch point at
 recovery exit. InstallXLogFileSegment() changes a bit as well because
 of its modifications of arguments.

 Applied. Thanks!

Thanks for showing up.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bug in bttext_abbrev_convert()

2015-06-30 Thread Jim Nasby

On 6/29/15 6:47 PM, Peter Geoghegan wrote:

As we all know, the state of automated testing is pretty lamentable.
This is the kind of thing that we could catch more easily in the
future if better infrastructure were in place. I caught this by
eyeballing bttext_abbrev_convert() with slightly fresher eyes than the
last time I looked.


Isn't this the kind of thing Coverity's supposed to find?
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Dereferenced pointer in tablesample.c

2015-06-30 Thread Petr Jelinek

On 2015-06-30 09:10, Michael Paquier wrote:

Hi all,
(Petr in CC)

Coverity is complaining about the following pointer dereference in
tablesample_init@tablesample.c:
+   ExprState  *argstate = ExecInitExpr(argexpr, (PlanState *) scanstate);
+
+   if (argstate == NULL)
+   {
+       fcinfo.argnull[i] = true;
+       fcinfo.arg[i] = (Datum) 0;;
+   }
+
+   fcinfo.arg[i] = ExecEvalExpr(argstate, econtext,
+                                &fcinfo.argnull[i], NULL);

If argexpr is NULL when ExecInitExpr() is called, argstate is going to be
NULL and dereferenced afterwards; see execQual.c for more details. Hence I
think that the patch attached should be applied. Thoughts?



Well, yes, the ExecEvalExpr call should be in the else block if we kept
the NULL logic there.

However, after rereading the code, ISTM that ExecInitExpr will only return
NULL if argexpr is NULL, and argexpr is added by ParseTableSample using
transformExpr on every argument coming from the grammar; those are
a_exprs, which AFAIK will never be NULL. So I actually think that argstate
can never be NULL in practice.

Given the above I would just remove the if statement here - it's not
present in any other code that does ExecInitExpr/ExecEvalExpr either.
It's most likely a relic of the code that didn't treat REPEATABLE
separately and just put it into the args List.

Patch attached.

--
 Petr Jelinek  http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
diff --git a/src/backend/access/tablesample/tablesample.c b/src/backend/access/tablesample/tablesample.c
index 44a2434..d263fe2 100644
--- a/src/backend/access/tablesample/tablesample.c
+++ b/src/backend/access/tablesample/tablesample.c
@@ -110,12 +110,6 @@ tablesample_init(SampleScanState *scanstate, TableSampleClause *tablesample)
 		Expr	   *argexpr = (Expr *) lfirst(arg);
 		ExprState  *argstate = ExecInitExpr(argexpr, (PlanState *) scanstate);
 
-		if (argstate == NULL)
-		{
-			fcinfo.argnull[i] = true;
-			fcinfo.arg[i] = (Datum) 0;;
-		}
-
 		fcinfo.arg[i] = ExecEvalExpr(argstate, econtext,
 									 &fcinfo.argnull[i], NULL);
 		i++;

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Missing checks on return value of timestamp2tm in datetime.c

2015-06-30 Thread Michael Paquier
Hi all,

timestamp2tm is called close to 40 times in the backend source code,
returning -1 in case of failure. However, there are two places in
datetime.c where we do not check its return value: GetCurrentDateTime
and GetCurrentTimeUsec.
This does not really matter much in practice as the timestamp used is
GetCurrentTransactionStartTimestamp(), but for correctness shouldn't we
check the return code and report ERRCODE_DATETIME_VALUE_OUT_OF_RANGE on
error?

Hence the patch attached.
Regards,
-- 
Michael


[HACKERS] Unneeded NULL-pointer check in FreeSpaceMapTruncateRel

2015-06-30 Thread Michael Paquier
Hi all,

In the category of nitpicky code-style issues, FreeSpaceMapTruncateRel
is doing a NULL-pointer check on something that has already been
dereferenced on all the code paths leading to this check.
(Yes, that's not interesting for most humans; Coverity flags such
things on correctness grounds.)

Regards,
-- 
Michael
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index fddb47c..c948cb6 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -308,8 +308,7 @@ FreeSpaceMapTruncateRel(Relation rel, BlockNumber nblocks)
 	 * at the next command boundary.  But this ensures it isn't outright wrong
 	 * until then.
 	 */
-	if (rel->rd_smgr)
-		rel->rd_smgr->smgr_fsm_nblocks = new_nfsmblocks;
+	rel->rd_smgr->smgr_fsm_nblocks = new_nfsmblocks;
 }
 
 /*

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Refactor to split nodeAgg.c?

2015-06-30 Thread David Rowley
On 30 June 2015 at 14:33, Jeff Davis pg...@j-davis.com wrote:

 I was going to rebase my HashAgg patch, and got some conflicts related
 to the grouping sets patch. I could probably sort them out, but I think
 that may be the tipping point where we want to break up nodeAgg.c into
 nodeSortedAgg.c and nodeHashAgg.c, and probably a common file as well.

 This would also (I hope) be convenient for Simon and David Rowley, who
 have been hacking on aggregates in general.


That would be more inconvenient right now as I have a pending patch which
makes quite a number of changes which are all over nodeAgg.c.
https://commitfest.postgresql.org/5/271/

Regards

David Rowley

--
 David Rowley                http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


[HACKERS] 9.5 branch splitoff

2015-06-30 Thread Tom Lane
Barring objections, I'll create the REL9_5_STABLE branch and stamp HEAD
as 9.6devel sometime this afternoon, maybe around 1800 UTC.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] RFC: replace pg_stat_activity.waiting with something more descriptive

2015-06-30 Thread Amit Kapila
On Fri, Jun 26, 2015 at 6:26 PM, Robert Haas robertmh...@gmail.com wrote:

 On Thu, Jun 25, 2015 at 11:57 PM, Amit Kapila amit.kapil...@gmail.com
wrote:
   3. Add new view 'pg_stat_wait_event' with following info:
   pid   - process id of this backend
   waiting - true for any form of wait, false otherwise
   wait_event_type - Heavy Weight Lock, Light Weight Lock, I/O wait,
etc
   wait_event - Lock (Relation), Lock (Relation Extension), etc
  I am pretty unconvinced that it's a good idea to try to split up the
  wait event into two columns.  I'm only proposing ~20 wait states, so
  there's something like 5 bits of information here.  Spreading that
  over two text columns is a lot, and note that Amit's text would
  basically recapitulate the contents of the first column in the second
  one, which I cannot get excited about.
  There is an advantage in splitting the columns which is if
wait_event_type
  column indicates Heavy Weight Lock, then user can go and check further
  details in pg_locks, I think he can do that even by seeing wait_event
  column, but that might not be as straightforward as with wait_event_type
  column.

 It's just a matter of writing event_type LIKE 'Lock %' instead of
 event_type = 'Lock'.


Yes, it can be done that way, and maybe that is not inconvenient for the
user, but there is another kind of information the user might need, such
as the distinct resources on which waits are possible, which is easier to
find with a separate wait_event_type column. I think some more discussion
is required before we can freeze the user interface for this feature, but
in the meantime I have prepared an initial patch that adds a new column
wait_event to pg_stat_activity.
For now, I have added support for heavyweight locks, lightweight locks [1]
and the buffer cleanup lock.  I could add the other types (spinlock delay
sleep, IO, network IO, etc.) if there is no objection to the approach used
in the patch to implement this feature.

[1] For LWLocks, I have currently used OtherLock as the wait_event for
locks other than the NUM_FIXED_LWLOCKS ones (refer to the function
NumLWLocks to see all the types of LWLocks).  The reason is that there is
no straightforward way to get the id (lockid) of such locks, since for
some of them (like shared_buffers, MaxBackends) the number of locks
depends on run-time configuration parameters.  I think if we want to
handle those, we could either do some math to find the lockid based on the
runtime values of these parameters, or we could add a tag to the LWLock
structure (indicating the lock type) and fill it during lock
initialization, or maybe there is some better way to do it.

I have still not added documentation and have not changed anything for
the waiting column in pg_stat_activity, as I think we first need to
finalize the user interface.  Apart from that, as mentioned above, support
for some event types (like IO, network IO, etc.) is not yet added, and I
also think separate function(s) (like the existing
pg_stat_get_backend_waiting) will be required, which again depends upon
the user interface.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


extend_pg_stat_activity_wait_event_v1.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Refactor to split nodeAgg.c?

2015-06-30 Thread Andres Freund
Hi,

On 2015-06-29 19:33:58 -0700, Jeff Davis wrote:
 I was going to rebase my HashAgg patch, and got some conflicts related
 to the grouping sets patch. I could probably sort them out, but I think
 that may be the tipping point where we want to break up nodeAgg.c into
 nodeSortedAgg.c and nodeHashAgg.c, and probably a common file as well.

I'm not sure that's going to be helpful and clean without a significant
amount of duplication. Grouping sets right now use sorting, but Andrew
Gierth is already working on a patch that employs hashing for individual
groupings that support it, where the aggregated state is deemed
small enough.  That implies a fair amount of coupling between the sorted
and hashed aggregation modes.

I'm not sure that conflicts due to GS can be taken as an argument to
split the file - I doubt there'd be significantly fewer with a split-up,
since common data structures have been changed.

That said, I think e.g. splitting out the lowest level of interaction
with aggregation functions and transition layers could be split off
without too much pain.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Solaris testers wanted for strxfrm() behavior

2015-06-30 Thread Tom Lane
Noah Misch n...@leadboat.com writes:
 On Sun, Jun 28, 2015 at 07:00:14PM -0400, Tom Lane wrote:
 Another idea would be to make a test during postmaster start to see
 if this bug exists, and fail if so.  I'm generally on board with the
 thought that we don't need to work on systems with such a bad bug,
 but it would be a good thing if the failure was clean and produced
 a helpful error message, rather than looking like a Postgres bug.

 Failing cleanly on unpatched Solaris is adequate, agreed.  A check at
 postmaster start isn't enough, because the postmaster might run in the C
 locale while individual databases or collations use problem locales.  The
 safest thing is to test after every setlocale(LC_COLLATE) and
 newlocale(LC_COLLATE).  That's once at backend start and once per backend per
 collation used, more frequent than I would like.  Hmm.

I was thinking more along the lines of making a single test by momentarily
switching into a known-buggy locale.

However, your results here imply that there is or was more than one bug
with this symptom, which may make a catch-all test impossible.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bug in bttext_abbrev_convert()

2015-06-30 Thread Michael Paquier
On Wed, Jul 1, 2015 at 9:36 AM, Peter Geoghegan p...@heroku.com wrote:
 On Tue, Jun 30, 2015 at 5:25 PM, Jim Nasby jim.na...@bluetreble.com wrote:
 Isn't this the kind of thing Coverity's supposed to find?

 I don't know, but in general I'm not very excited about static
 analysis tools. The best things that they have going for them is that
 they're available, and don't require test coverage in the same way as
 running the regression tests with Valgrind enabled.

Well, yes. It would have complained about that if it were a perfect
tool able to emulate all code paths with all possible variable values.
But Coverity is not actually that perfect, and sometimes it misses the
shot, like here. In my experience, when you look at the reports of a
static tool, you first need to look at other code paths that share
similarities with the problem reported, and you will often see that what
the static tool saw is just the tip of the iceberg. The human factor is
determinant in this case.

 There is no real testing of sorting in the regression tests. It would
 be nice to have a way of generating a large and varied selection of
 sort operations programmatically, to catch this kind of thing.
 pg_regress is not really up to it. The same applies to various other
 cases where having a lot of expected output makes using pg_regress
 infeasible.

Well, the issue is twofold here:
1) Having a buildfarm member running Valgrind would greatly help.
2) This code path is not used at all AFAIK in the test suite, so we
definitely lack regression tests here. In your opinion, what would be a
sort data set large enough to trigger this code path? The idea is not to
make the regression test suite take too much time because of it, and not
to have the server created by pg_regress end up with a too-large PGDATA
folder. For example, could a btree index build with a suitable data set
trigger it on at least one platform?
Regards,
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bug in bttext_abbrev_convert()

2015-06-30 Thread Peter Geoghegan
On Tue, Jun 30, 2015 at 9:39 PM, Michael Paquier
michael.paqu...@gmail.com wrote:
 There is no real testing of sorting in the regression tests. It would
 be nice to have a way of generating a large and varied selection of
 sort operations programmatically, to catch this kind of thing.
 pg_regress is not really up to it. The same applies to various other
 cases where having a lot of expected output makes using pg_regress
 infeasible.

 Well, the issue is twofold here:
 1) Having a buildfarm member running Valgrind would greatly help.

Not sure that I agree with that. It wouldn't hurt, but people are
running Valgrind often enough these days that I think it's unlikely
that having it run consistently actually adds all that much. Maybe it
would make a difference to do it on a 32-bit platform, since most of
us have not used a 32-bit machine as our primary development
environment in about a decade, and we probably miss some things
because of that.

If you run Valgrind on Postgres master these days, you actually have a
pretty good chance of not finding any issues, which is great.
shared_buffers support for Valgrind would also help a lot (e.g. making
it so that referencing a shared buffer that isn't pinned causes a
Valgrind error -- Noah and I discussed this a while back; should be
doable). If you're looking for a new project, may I highly recommend
working on that.  :-)

 2) This code path is not used at all AFAIK in the test suite, so we
 definitely lack regression tests here. In your opinion, what would be a
 sort data set large enough to trigger this code path? The idea is not to
 make the regression test suite take too much time because of it, and not
 to have the server created by pg_regress end up with a too-large PGDATA
 folder. For example, could a btree index build with a suitable data set
 trigger it on at least one platform?

Maybe. The trick would be constructing a case where many different
code paths are covered, including the many different permutations and
combinations of how an external sort can go (number of runs, merge
steps, datatypes, etc).

In general, I think that there is a lot of value to be added by
someone making it their explicit goal to increase test coverage, as
measured by a tool like gcov (plus subjective expert analysis, of
course), particularly when it comes to things like memory corruption
bugs (hopefully including shared memory corruption bugs, and hopefully
including recovery). If someone was doing that in a focused way, then
the codepath where we must explicitly pfree() (in the case of this
bug) would probably have coverage, and then Valgrind probably would
catch this.

As long as adding coverage means adding things to the standard
regression test suite (particularly with sorts), a suite which is
expected to run quickly, and as long as we're entirely relying on
pg_regress, we will not make progress here. Maybe if there was an
extended regression test suite that was explicitly about meeting a
code coverage goal we'd do better. Yes, I think mechanically
increasing code coverage is useful as an end in itself (although I do
accept that meeting a particular coverage percentage is often not
especially useful). Increasing coverage has led me to the right
question before, just as static analysis has done that for you. For
example, the effort of increasing coverage can find dead code, and you
can intuitively get a sense of where to look for bugs manually by
determining mechanically what code paths are hard to hit, or are
naturally seldom hit.

It would be nice to always have an HTML report from gcov available on
the internet. That would be something useful to automate, IMV. Watching
that regress over time might provide useful insight, but I only use gcov
a couple of times a year, so that's not going to happen on its own.
-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bug in bttext_abbrev_convert()

2015-06-30 Thread Peter Geoghegan
On Tue, Jun 30, 2015 at 10:25 PM, Peter Geoghegan p...@heroku.com wrote:
 It would be nice to always have a html report from gcov always
 available on the internet. That would be something useful to automate,
 IMV. Watching that regress over time might provide useful insight, but
 I only use gcov a couple of times a year, so that's not going to
 happen on its own.

I generated such a report just now, and noticed this:

 213    1137 :     tuplesort_performsort(btspool->sortstate);
 214    1131 :     if (btspool2)
 215       0 :         tuplesort_performsort(btspool2->sortstate);
 216         :
 217    1131 :     wstate.heap = btspool->heap;
 218    1131 :     wstate.index = btspool->index;

The regression tests have zero coverage for this
tuplesort_performsort() btspool2 case. That's a fairly common case
to have no coverage for, and that took me all of 5 minutes to find.
-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bug in bttext_abbrev_convert()

2015-06-30 Thread Peter Geoghegan
On Tue, Jun 30, 2015 at 10:35 PM, Peter Geoghegan p...@heroku.com wrote:
 The regression tests have zero coverage for this
 tuplesort_performsort() btspool2 case. That's a fairly common case
 to have no coverage for, and that took me all of 5 minutes to find.

BTW, I looked here because I added a bunch of sortsupport stuff to
_bt_load() in 9.5. It appears that that new code is entirely without
coverage.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel Seq Scan

2015-06-30 Thread Amit Kapila
On Tue, Jun 30, 2015 at 4:00 AM, Jeff Davis pg...@j-davis.com wrote:

 [Jumping in without catching up on entire thread.

No problem.

 Please let me know
 if these questions have already been covered.]

 1. Can you change the name to something like ParallelHeapScan?
 Parallel Sequential is a contradiction. (I know this is bikeshedding
 and I won't protest further if you keep the name.)


Which name are you asking to change?
We have two nodes in the patch (Funnel and PartialSeqScan). Funnel is
the name given to that node because it is quite generic and can be
used in multiple ways (other than a plain parallel sequential scan),
and the other node is named PartialSeqScan because it performs part
of a sequential scan.

 2. Where is the speedup coming from? How much of it is CPU and IO
 overlapping (i.e. not leaving disk or CPU idle while the other is
 working), and how much from the CPU parallelism? I know this is
 difficult to answer rigorously, but it would be nice to have some
 breakdown even if for a specific machine.


Yes, you are right, and we have done quite some testing (on the available
hardware) with this patch (with different approaches) to see how much
difference it makes for IO and CPU. With respect to IO we have found that
it doesn't help much [1], though it helps when the data is cached, and
there are really good benefits in terms of CPU [2].

In terms of completeness, I think we should add some documentation
for this patch. One way is to describe the execution mechanism in
src/backend/access/transam/README.parallel and then explain the
new configuration knobs in the documentation (.sgml files).  We could
also have a separate page in the documentation under the Server
Programming section (Parallel Query - Parallel Scan;
Parallel Scan Examples; ...).

Another thing to think about at this stage is whether we need to
break this patch up and, if so, how to split it into multiple patches
so that the review is easier to complete. I can see that it could be
split into 2 or 3 patches:
a. Infrastructure for parallel execution, like some of the stuff in
execparallel.c, heapam.c, tqueue.c, etc. and all other generic
(non-node-specific) code.
b. Node-specific (Funnel and PartialSeqScan) code for the optimiser
and executor.
c. Documentation

Suggestions?

[1] -
http://www.postgresql.org/message-id/caa4ek1jhcmn2x1ljq4bomlapt+btouid5vqqk5g6ddfv69i...@mail.gmail.com
[2] - Refer to slides 14-15 of the PGCon presentation; I can repost the
data here if required.
https://www.pgcon.org/2015/schedule/events/785.en.html


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] WAL-related tools and .partial WAL file

2015-06-30 Thread Amit Kapila
On Wed, Jul 1, 2015 at 9:09 AM, Fujii Masao masao.fu...@gmail.com wrote:

 Also regarding pg_xlogdump, we can just document, for example:
 please remove the .partial suffix from the WAL file name if
 you want to dump it with pg_xlogdump.


Can't we skip such files in pg_xlogdump?


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Streaming replication for psycopg2

2015-06-30 Thread Shulgin, Oleksandr
On Thu, Jun 4, 2015 at 5:49 PM, Shulgin, Oleksandr 
oleksandr.shul...@zalando.de wrote:

 On Tue, Jun 2, 2015 at 2:23 PM, Shulgin, Oleksandr 
 oleksandr.shul...@zalando.de wrote:
 
  Hello,
 
  I've submitted a patch to psycopg2 to support streaming replication
 protocol (COPY_BOTH): https://github.com/psycopg/psycopg2/pull/322
 
  It would be great if more people had a chance to take a look and provide
 feedback about it.  In particular, please see example usage at this github
 comment[1]. Some bikeshedding is really needed here. :-)


Hello again,

I have updated the pull request above to address the feedback I've gathered
from using this construct internally and from other sources.  Now included
is the lower-level asynchronous interface, which gives the user more
control at the price of doing select() and keeping track of keepalive
timeouts manually.

It would be nice if someone could review this.  The final result can be
found at this link (docs included):
https://github.com/psycopg/psycopg2/pull/322/files

Thanks.
--
Alex


Re: [HACKERS] thread_test's sched_yield requires -lrt on solaris

2015-06-30 Thread Alvaro Herrera
Oskari Saarenmaa wrote:
 I configured the dingo and binturong Solaris 10 animals to build 9.3
 some time ago but apparently they always failed the configure phase.
 Turns out this is caused by thread_test's usage of sched_yield which is
 in librt on Solaris but which is not pulled in by anything on 9.3 and
 earlier on my box.
 
 Apparently the other Solaris animal (castoroides) requires librt for
 fdatasync, but that's not required on my system.  On 9.4 and master
 librt is required for shm_open so the check doesn't fail there.
 
 Attached a patch to check for sched_yield in configure, the patch only
 applies against 9.0 - 9.3 which are using autoconf 2.63.  We should
 probably check for sched_yield anyway on all branches even if it's not
 strictly required on 9.4+ at the moment.

I'm going to apply this patch to all branches.  (I don't think it's so
cool to rely on librt being pulled in by unrelated functions.)

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] LWLock deadlock and gdb advice

2015-06-30 Thread Jeff Janes
On Mon, Jun 29, 2015 at 11:28 PM, Jeff Janes jeff.ja...@gmail.com wrote:

 On Mon, Jun 29, 2015 at 5:55 PM, Peter Geoghegan p...@heroku.com wrote:

 On Mon, Jun 29, 2015 at 5:37 PM, Jeff Janes jeff.ja...@gmail.com wrote:
  Is there a way to use gdb to figure out who holds the lock they are
 waiting
  for?

 Have you considered building with LWLOCK_STATS defined, and LOCK_DEBUG
 defined? That might do it.


 I hadn't planned on running into this problem, so had not compiled
 accordingly.

 I thought LOCK_DEBUG would produce too much output, but now I see it
 doesn't print anything unless Trace_lwlocks is on (but it still makes more
 info available to gdb, as Amit mentioned), so I think that should be OK.

 Since I messed up the gdb session causing the postmaster to SIGKILL all
 the children and destroy the evidence, I'll repeat the run compiled with
 LOCK_DEBUG and see what it looks like in the morning.


I've gotten the LWLock deadlock again.  User backend 24841 holds
WALInsertLock 7 and is blocked attempting to acquire 6.  So it seems to
be violating the lock ordering rules (although I don't see that rule
spelled out in xlog.c).

The Checkpoint process, meanwhile, has acquired WALInsertLocks 0 through 6
and is blocked on 7.

I'm not sure where to go from here.


24841:

(gdb) f 2

(gdb) p held_lwlocks
$1 = {{lock = 0x7f70e11fae40, mode = LW_EXCLUSIVE}, {lock = 0x7f70e13df080,
mode = LW_EXCLUSIVE}, {lock = 0x7f70e11d4280, mode = LW_EXCLUSIVE}, {lock =
0x7f70e11d4280,
mode = LW_EXCLUSIVE}, {lock = 0x0, mode = LW_EXCLUSIVE} <repeats 196
times>}
(gdb) p T_NAME((LWLock *) held_lwlocks[1])
$2 = 0x84cee1 "WALInsertLocks"
(gdb) p T_ID((LWLock *) held_lwlocks[1])

(gdb) p lock
$3 = (LWLock *) 0x7f70e13df000
(gdb) p T_NAME((LWLock *) lock)
$7 = 0x84cee1 "WALInsertLocks"
(gdb) p T_ID((LWLock *) lock)
$8 = 6

(gdb) p *(lock)
$1 = {mutex = 0 '\000', tranche = 1, state = {value = 1627389952}, nwaiters
= {value = 2}, waiters = {head = {prev = 0x7f70e9e29e58, next =
0x7f70e9e2a140}},
  owner = 0x7f70e9e2d260}

(gdb) p *(lock->owner)
$5 = {links = {prev = 0x0, next = 0x0}, sem = {semId = 539754537, semNum =
0}, waitStatus = 0, procLatch = {is_set = 0, is_shared = 1 '\001',
owner_pid = 24820}, lxid = 0,
  pid = 24820, pgprocno = 112, backendId = -1, databaseId = 0, roleId = 0,
recoveryConflictPending = 0 '\000', lwWaiting = 1 '\001', lwWaitMode = 0
'\000', lwWaitLink = {
prev = 0x7f70e13df090, next = 0x7f70e13df090}, waitLock = 0x0,
waitProcLock = 0x0, waitLockMode = 0, heldLocks = 0, waitLSN = 0,
syncRepState = 0, syncRepLinks = {prev = 0x0,
next = 0x0}, myProcLocks = {{prev = 0x7f70e9e2d2f0, next =
0x7f70e9e2d2f0}, {prev = 0x7f70e9e2d300, next = 0x7f70e9e2d300}, {prev =
0x7f70e9e2d310, next = 0x7f70e9e2d310}, {
  prev = 0x7f70e9e2d320, next = 0x7f70e9e2d320}, {prev =
0x7f70e9e2d330, next = 0x7f70e9e2d330}, {prev = 0x7f70e9e2d340, next =
0x7f70e9e2d340}, {prev = 0x7f70e9e2d350,
  next = 0x7f70e9e2d350}, {prev = 0x7f70e9e2d360, next =
0x7f70e9e2d360}, {prev = 0x7f70e9e2d370, next = 0x7f70e9e2d370}, {prev =
0x7f70e9e2d380, next = 0x7f70e9e2d380}, {
  prev = 0x7f70e9e2d390, next = 0x7f70e9e2d390}, {prev =
0x7f70e9e2d3a0, next = 0x7f70e9e2d3a0}, {prev = 0x7f70e9e2d3b0, next =
0x7f70e9e2d3b0}, {prev = 0x7f70e9e2d3c0,
  next = 0x7f70e9e2d3c0}, {prev = 0x7f70e9e2d3d0, next =
0x7f70e9e2d3d0}, {prev = 0x7f70e9e2d3e0, next = 0x7f70e9e2d3e0}}, subxids =
{xids = {0 <repeats 64 times>}},
  backendLock = 0x7f70e13dad00, fpLockBits = 0, fpRelId = {0 <repeats 16
times>}, fpVXIDLock = 0 '\000', fpLocalTransactionId = 0}


#0  0x003dcb6eaf27 in semop () from /lib64/libc.so.6
#1  0x00671aef in PGSemaphoreLock (sema=0x7f70e9e2a108) at
pg_sema.c:387
#2  0x006d79ac in LWLockWaitForVar (lock=0x7f70e13df000,
valptr=0x7f70e13df028, oldval=0, newval=0x7fffb33cbd48) at lwlock.c:1406
#3  0x004ee006 in WaitXLogInsertionsToFinish (upto=41590734848) at
xlog.c:1576
#4  0x004ee94b in AdvanceXLInsertBuffer (upto=41594920960,
opportunistic=<value optimized out>) at xlog.c:1888
#5  0x004f3c42 in GetXLogBuffer (ptr=41594920960) at xlog.c:1669
#6  0x004f40e7 in CopyXLogRecordToWAL (rdata=0x19bc920,
fpw_lsn=<value optimized out>) at xlog.c:1313
#7  XLogInsertRecord (rdata=0x19bc920, fpw_lsn=<value optimized out>) at
xlog.c:1008
#8  0x004f7c6c in XLogInsert (rmid=13 '\r', info=32 ' ') at
xloginsert.c:453
#9  0x0047e210 in ginPlaceToPage (btree=0x7fffb33cc070,
stack=0x1a5bfe0, insertdata=<value optimized out>, updateblkno=<value
optimized out>, childbuf=0, buildStats=0x0)
at ginbtree.c:418
#10 0x0047f50d in ginInsertValue (btree=0x7fffb33cc070,
stack=0x1a5bfe0, insertdata=0x7fffb33cc0f0, buildStats=0x0) at
ginbtree.c:748
#11 0x00475deb in ginEntryInsert (ginstate=0x7fffb33cc470,
attnum=5544, key=1, category=<value optimized out>, items=0x7f70e10d9140,
nitem=51, buildStats=0x0)
at gininsert.c:234
#12 0x0048602c in ginInsertCleanup 

Re: [HACKERS] pg_restore -t should match views, matviews, and foreign tables

2015-06-30 Thread Pavel Stehule
Hi

I am sending a review of this trivial patch.

1. This patch enables the possibility to restore only a selected view,
materialized view, foreign table, or sequence. Currently the option -t
works with tables only; all other relation-like objects are quietly
ignored. With this patch, the check on type is extended to allow the other
types described by the pg_class system catalog. The implementation is
trivial:

 +				strcmp(te->desc, "TABLE DATA") == 0 ||
+				strcmp(te->desc, "VIEW") == 0 ||
+				strcmp(te->desc, "FOREIGN TABLE") == 0 ||
+				strcmp(te->desc, "MATERIALIZED VIEW") == 0 ||
+				strcmp(te->desc, "MATERIALIZED VIEW DATA") == 0 ||
+				strcmp(te->desc, "SEQUENCE") == 0)

2. There were no objections against this patch.
3. There were no problems with patching and compilation.
4. This feature is well documented.

There is an open question whether the related help line should be changed.
The current text is not 100% accurate, but it is short, readable, and
understandable.

  -S, --superuser=NAME superuser user name to use for disabling
triggers
  -t, --table=NAME restore named table
  -T, --trigger=NAME   restore named trigger

5. All tests passed

6. There are no tests, but pg_dump-related software has no tests yet.

I don't see any issues - this patch is really trivial and without risks.
It already works on the pg_dump side, so the fix on the pg_restore side is
natural.

Regards

Pavel



2015-04-01 5:01 GMT+02:00 Craig Ringer cr...@2ndquadrant.com:

 Following on from this -bugs post:


 http://www.postgresql.org/message-id/camsr+ygj50tvtvk4dbp66gajeoc0kap6kxfehaom+neqmhv...@mail.gmail.com

 this patch adds support for views, foreign tables, and materialised views
 to the pg_restore -t flag.

 --
  Craig Ringer   http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Refactoring speculative insertion with unique indexes a little

2015-06-30 Thread Heikki Linnakangas

On 06/11/2015 02:19 AM, Peter Geoghegan wrote:

Currently, speculative insertion (the INSERT ... ON CONFLICT DO UPDATE
executor/storage infrastructure) uses checkUnique ==
UNIQUE_CHECK_PARTIAL for unique indexes, which is a constant
originally only used by deferred unique constraints. It occurred to me
that this has a number of disadvantages:

* It confuses separation of concerns. Pushing down this information to
the nbtree AM makes it clear why it's slightly special from a
speculative insertion point of view. For example, the nbtree AM does
not actually require livelock insurance (for unique indexes,
although in principle not for nbtree-based exclusion constraints,
which are possible).

* UNIQUE_CHECK_PARTIAL is not only not the same thing as
UNIQUE_CHECK_SPECULATIVE (a new constant for the enum). It's also
naturally mutually exclusive with it (since we do not and cannot
support deferred unique constraints as arbiters). Let's represent this
directly.

* It makes a conflict not detected by the pre-check always insert an
index tuple, even though that occurs after a point where it's already
been established that the pointed-to TID is doomed -- it must go on to
be super deleted. Why bother bloating the index?


I'm actually not really motivated by wanting to reduce bloat here
(that was always something that I thought was a non-issue with *any*
implemented speculative insertion prototype [1]). Rather, by actually
physically inserting an index tuple unnecessarily, the implication is
that it makes sense to do so (perhaps for roughly the same reason it
makes sense with deferred unique constraints, or some other
complicated and subtle reason). AFAICT that implication is incorrect,
though; I see no reason why inserting that index tuple serves any
purpose, and it clearly *can* be avoided with little effort.


I agree it would be cleaner to have a separate CHECK_UNIQUE_XXX code for 
speculative insertions. You've defined CHECK_UNIQUE_SPECULATIVE as like 
CHECK_UNIQUE_PARTIAL, but you don't have to insert the index tuple if 
there's a conflict. I think it'd be better to define it as like 
CHECK_UNIQUE_YES, but return FALSE instead of throwing an error on 
conflict. The difference is that the aminsert would not be allowed to 
return FALSE when there is no conflict.
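
(For illustration only: a minimal sketch of how the proposed constant could
slot into the existing IndexUniqueCheck enum from src/include/access/genam.h.
The comment wording for the new value is mine, not the patch's.)

    typedef enum IndexUniqueCheck
    {
        UNIQUE_CHECK_NO,           /* Don't do any uniqueness checking */
        UNIQUE_CHECK_YES,          /* Enforce uniqueness at insertion time */
        UNIQUE_CHECK_PARTIAL,      /* Test uniqueness, but no error */
        UNIQUE_CHECK_EXISTING,     /* Check if existing tuple is unique */
        UNIQUE_CHECK_SPECULATIVE   /* Proposed: as UNIQUE_CHECK_YES, but
                                    * aminsert returns false on conflict
                                    * instead of throwing an error */
    } IndexUniqueCheck;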


Actually, even without this patch, a dummy implementation of aminsert 
that always returns FALSE in CHECK_UNIQUE_PARTIAL mode would be legal 
per the docs, but would loop forever. So that would be nice to fix or 
document away, even though it's not a problem for B-tree currently.



Attached patch updates speculative insertion along these lines.

In passing, I have made ExecInsertIndexTuples() give up when a
speculative insertion conflict is detected. Again, this is not about
bloat prevention; it's about making the code easier to understand by
not doing something that is unnecessary, and in avoiding that also
avoiding the implication that it is necessary. There are already
enough complicated interactions that *are* necessary (e.g. livelock
avoidance for exclusion constraints). Let us make our intent clearer.


Hmm, not sure I like those changes. ExecInsertIndexTuples's behaviour 
now depends on whether specConflict is passed, but that seems weird as 
specConflict is an output parameter.



The patch also updates the AM interface documentation (the part
covering unique indexes). It seems quite natural to me to document the
theory of operation for speculative insertion there.


Yeah, although I find the explanation pretty long-winded and difficult 
to understand ;-).


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] LWLock deadlock and gdb advice

2015-06-30 Thread Heikki Linnakangas

On 06/30/2015 07:37 PM, Alvaro Herrera wrote:

Jeff Janes wrote:


I've gotten the LWLock deadlock again.  User backend 24841 holds the
WALInsertLock 7 and is blocked attempting to acquire 6.  So it seems to
be violating the lock ordering rules (although I don't see that rule
spelled out in xlog.c)


Hmm, interesting -- pg_stat_statement is trying to re-do an operation
that involves updating a GIN index, while WAL-logging for the original
update is still ongoing, it seems.


I don't think pg_stat_statement has anything to do with this. You can 
see from the backtrace that pg_stat_statement is enabled, as the call 
went through the pgss_ExecutorRun executor hook, but that's all.


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] LWLock deadlock and gdb advice

2015-06-30 Thread Alvaro Herrera
Jeff Janes wrote:

 I've gotten the LWLock deadlock again.  User backend 24841 holds the
 WALInsertLock 7 and is blocked attempting to acquire 6.  So it seems to
 be violating the lock ordering rules (although I don't see that rule
 spelled out in xlog.c)

Hmm, interesting -- pg_stat_statement is trying to re-do an operation
that involves updating a GIN index, while WAL-logging for the original
update is still ongoing, it seems.


 #0  0x003dcb6eaf27 in semop () from /lib64/libc.so.6
 #1  0x00671aef in PGSemaphoreLock (sema=0x7f70e9e2a108) at
 pg_sema.c:387
 #2  0x006d79ac in LWLockWaitForVar (lock=0x7f70e13df000,
 valptr=0x7f70e13df028, oldval=0, newval=0x7fffb33cbd48) at lwlock.c:1406
 #3  0x004ee006 in WaitXLogInsertionsToFinish (upto=41590734848) at
 xlog.c:1576
 #4  0x004ee94b in AdvanceXLInsertBuffer (upto=41594920960,
 opportunistic=<value optimized out>) at xlog.c:1888
 #5  0x004f3c42 in GetXLogBuffer (ptr=41594920960) at xlog.c:1669
 #6  0x004f40e7 in CopyXLogRecordToWAL (rdata=0x19bc920,
 fpw_lsn=<value optimized out>) at xlog.c:1313
 #7  XLogInsertRecord (rdata=0x19bc920, fpw_lsn=<value optimized out>) at
 xlog.c:1008
 #8  0x004f7c6c in XLogInsert (rmid=13 '\r', info=32 ' ') at
 xloginsert.c:453
 #9  0x0047e210 in ginPlaceToPage (btree=0x7fffb33cc070,
 stack=0x1a5bfe0, insertdata=<value optimized out>, updateblkno=<value
 optimized out>, childbuf=0, buildStats=0x0)
 at ginbtree.c:418
 #10 0x0047f50d in ginInsertValue (btree=0x7fffb33cc070,
 stack=0x1a5bfe0, insertdata=0x7fffb33cc0f0, buildStats=0x0) at
 ginbtree.c:748
 #11 0x00475deb in ginEntryInsert (ginstate=0x7fffb33cc470,
 attnum=5544, key=1, category=<value optimized out>, items=0x7f70e10d9140,
 nitem=51, buildStats=0x0)
 at gininsert.c:234
 #12 0x0048602c in ginInsertCleanup (ginstate=0x7fffb33cc470,
 vac_delay=<value optimized out>, stats=0x0) at ginfast.c:843
 #13 0x004871b9 in ginHeapTupleFastInsert (ginstate=0x7fffb33cc470,
 collector=<value optimized out>) at ginfast.c:436
 #14 0x0047625a in gininsert (fcinfo=<value optimized out>) at
 gininsert.c:531
 #15 0x007dc483 in FunctionCall6Coll (flinfo=<value optimized out>,
 collation=<value optimized out>, arg1=<value optimized out>, arg2=<value
 optimized out>,
 arg3=<value optimized out>, arg4=<value optimized out>,
 arg5=140122789534360, arg6=0) at fmgr.c:1436
 #16 0x004b2247 in index_insert (indexRelation=0x7f70e1190e60,
 values=0x7fffb33cef50, isnull=0x7fffb33cf050 "", heap_t_ctid=0x1a4a794,
 heapRelation=0x7f70e1185a98,
 checkUnique=UNIQUE_CHECK_NO) at indexam.c:226
 #17 0x005d2e67 in ExecInsertIndexTuples (slot=0x1a497f0,
 tupleid=0x1a4a794, estate=0x1a40768, noDupErr=0 '\000', specConflict=0x0,
 arbiterIndexes=0x0)
 at execIndexing.c:384
 #18 0x005f0161 in ExecUpdate (tupleid=0x7fffb33cf250, oldtuple=0x0,
 slot=0x1a497f0, planSlot=0x1a42498, epqstate=0x1a40a70, estate=0x1a40768,
 canSetTag=1 '\001')
 at nodeModifyTable.c:978
 #19 0x005f0b2a in ExecModifyTable (node=0x1a409d0) at
 nodeModifyTable.c:1436
 #20 0x005d6cc8 in ExecProcNode (node=0x1a409d0) at
 execProcnode.c:389
 #21 0x005d5402 in ExecutePlan (queryDesc=0x1a2b220,
 direction=<value optimized out>, count=0) at execMain.c:1549
 #22 standard_ExecutorRun (queryDesc=0x1a2b220, direction=<value optimized
 out>, count=0) at execMain.c:337
 #23 0x7f70ea39af3b in pgss_ExecutorRun (queryDesc=0x1a2b220,
 direction=ForwardScanDirection, count=0) at pg_stat_statements.c:881
 #24 0x006ea38f in ProcessQuery (plan=0x1a2b3d8,
 sourceText=0x1a2af68 "update foo set count=count+1 where text_array @> $1",
 params=0x1a2afe8, dest=<value optimized out>,
 completionTag=0x7fffb33cf660 "") at pquery.c:185
 #25 0x006ea5ec in PortalRunMulti (portal=0x19a6128, isTopLevel=1
 '\001', dest=0xc55020, altdest=0xc55020, completionTag=0x7fffb33cf660 "")
 at pquery.c:1279
 #26 0x006eadb3 in PortalRun (portal=0x19a6128,
 count=9223372036854775807, isTopLevel=1 '\001', dest=0x1990630,
 altdest=0x1990630, completionTag=0x7fffb33cf660 "")
 at pquery.c:816
 #27 0x006e6ffb in exec_execute_message (portal_name=0x1990218 "",
 max_rows=9223372036854775807) at postgres.c:1988
 #28 0x006e8c15 in PostgresMain (argc=<value optimized out>,
 argv=<value optimized out>, dbname=0x19a3210 "jjanes", username=<value
 optimized out>) at postgres.c:4088
 #29 0x006855dd in BackendRun (argc=<value optimized out>,
 argv=<value optimized out>) at postmaster.c:4159
 #30 BackendStartup (argc=<value optimized out>, argv=<value optimized out>)
 at postmaster.c:3835
 #31 ServerLoop (argc=<value optimized out>, argv=<value optimized out>) at
 postmaster.c:1609
 #32 PostmasterMain (argc=<value optimized out>, argv=<value optimized out>)
 at postmaster.c:1254
 #33 0x00610ab0 in main (argc=4, argv=0x1976cf0) at main.c:221


-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 

Re: [HACKERS] Dereferenced pointer in tablesample.c

2015-06-30 Thread Tom Lane
Petr Jelinek p...@2ndquadrant.com writes:
 On 2015-06-30 09:10, Michael Paquier wrote:
 If the expression argexpr passed to ExecInitExpr() is NULL, argstate
 is going to be NULL and dereferenced afterwards; see execQual.c for more
 details. Hence I think that the patch attached should be applied. Thoughts?

 Well, yes the ExecEvalExpr should be in the else block if we'd keep the 
 NULL logic there.

 However after rereading the code, ISTM the ExecInitExpr will only return 
 NULL if the argexpr is NULL and argexpr is added by ParseTableSample 
 using the transformExpr on every argument which comes from grammar and 
 those are a_exprs which AFAIK will never be NULL. So I actually think 
 that the argstate can never be NULL in practice.

Indeed.  ParseTableSample() is badly in need of a rewrite, but I agree
that it's not going to produce null expression trees.

 Patch attached.

Will push this shortly.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_rewind failure by file deletion in source server

2015-06-30 Thread Heikki Linnakangas

On 06/29/2015 09:44 AM, Michael Paquier wrote:

On Mon, Jun 29, 2015 at 4:55 AM, Heikki Linnakangas wrote:

But we'll still need to handle the pg_xlog symlink case somehow. Perhaps it
would be enough to special-case pg_xlog for now.


Well, sure, pg_rewind does not copy the soft links either way. Now it
would be nice to have an option to be able to recreate the soft link
of at least pg_xlog even if it can be scripted as well after a run.


Hmm. I'm starting to think that pg_rewind should ignore pg_xlog 
entirely. In any non-trivial scenarios, just copying all the files from 
pg_xlog isn't enough anyway, and you need to set up a recovery.conf 
after running pg_rewind that contains a restore_command or 
primary_conninfo, to fetch the WAL. So you can argue that by not copying 
pg_xlog automatically, we're actually doing a favour to the DBA, by 
forcing him to set up the recovery.conf file correctly. Because if you 
just test simple scenarios where not much time has passed between the 
failover and running pg_rewind, it might be enough to just copy all the 
WAL currently in pg_xlog, but it would not be enough if more time had 
passed and not all the required WAL is present in pg_xlog anymore.  And 
by not copying the WAL, we can avoid some copying, as restore_command or 
streaming replication will only copy what's needed, while pg_rewind 
would copy all the WAL it can find in the target's data directory.


pg_basebackup also doesn't include any WAL, unless you pass the --xlog 
option. It would be nice to also add an optional --xlog option to 
pg_rewind, but with pg_rewind it's possible that all the required WAL 
isn't present in the pg_xlog directory anymore, so you wouldn't always 
achieve the same effect of making the backup self-contained.


So, I propose the attached. It makes pg_rewind ignore the pg_xlog 
directory in both the source and the target.
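
(A hedged sketch of the idea, not the attached patch: in pg_rewind's file
map traversal one might filter out the WAL directory on both sides with a
helper along these lines. The function name is hypothetical.)

    #include <stdbool.h>
    #include <string.h>

    /* Return true if the path is pg_xlog itself or anything under it. */
    static bool
    path_is_xlog(const char *path)
    {
        return strcmp(path, "pg_xlog") == 0 ||
               strncmp(path, "pg_xlog/", 8) == 0;
    }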


- Heikki

diff --git a/doc/src/sgml/ref/pg_rewind.sgml b/doc/src/sgml/ref/pg_rewind.sgml
index 32dc83f..bb1d640 100644
--- a/doc/src/sgml/ref/pg_rewind.sgml
+++ b/doc/src/sgml/ref/pg_rewind.sgml
@@ -49,12 +49,13 @@ PostgreSQL documentation
 
   para
The result is equivalent to replacing the target data directory with the
-   source one. All files are copied, including configuration files. The
-   advantage of applicationpg_rewind/ over taking a new base backup, or
-   tools like applicationrsync/, is that applicationpg_rewind/ does
-   not require reading through all unchanged files in the cluster. That makes
-   it a lot faster when the database is large and only a small portion of it
-   differs between the clusters.
+   source one. All files in the data directory are copied, including
+   configuration files, but excluding the WAL log directory,
+   filenamepg_xlog/. The advantage of applicationpg_rewind/ over
+   taking a new base backup, or tools like applicationrsync/, is that
+   applicationpg_rewind/ does not require reading through all unchanged
+   files in the cluster. That makes it a lot faster when the database is
+   large and only a small portion of it differs between the clusters.
   /para
 
   para
@@ -74,12 +75,12 @@ PostgreSQL documentation
When the target server is started up for the first time after running
applicationpg_rewind/, it will go into recovery mode and replay all
WAL generated in the source server after the point of divergence.
-   If some of the WAL was no longer available in the source server when
-   applicationpg_rewind/ was run, and therefore could not be copied by
-   applicationpg_rewind/ session, it needs to be made available when the
-   target server is started up. That can be done by creating a
-   filenamerecovery.conf/ file in the target data directory with a
-   suitable varnamerestore_command/.
+   Like after restoring from a base backup using continuous archiving,
+   the source server's WAL must be made available to the rewound server.
+   This can be done by creating a filenamerecovery.conf/ file in the
+   target data directory with a suitable varnamerestore_command/ or
+   varnameprimary_conninfo/ line, or by copying the required WAL files
+   manually into filenamepg_xlog/ in the target data directory.
   /para
  /refsect1
 
diff --git a/src/bin/pg_rewind/RewindTest.pm b/src/bin/pg_rewind/RewindTest.pm
index 5219ec9..82e3f4c 100644
--- a/src/bin/pg_rewind/RewindTest.pm
+++ b/src/bin/pg_rewind/RewindTest.pm
@@ -284,14 +284,12 @@ sub run_pg_rewind
 
 	# Keep a temporary postgresql.conf for master node or it would be
 	# overwritten during the rewind.
-	copy(
-		$test_master_datadir/postgresql.conf,
-		$testroot/master-postgresql.conf.tmp);
+	copy($test_master_datadir/postgresql.conf,
+		 $testroot/master-postgresql.conf.tmp);
 
 	# Now run pg_rewind
 	if ($test_mode eq local)
 	{
-
 		# Do rewind using a local pgdata as source
 		# Stop the master and be ready to perform the rewind
 		system_or_bail(
@@ -306,10 +304,22 @@ sub run_pg_rewind
 			$log_path,
 			'21');
 		ok($result, 'pg_rewind 

Re: [HACKERS] LWLock deadlock and gdb advice

2015-06-30 Thread Heikki Linnakangas

On 06/30/2015 07:05 PM, Jeff Janes wrote:

On Mon, Jun 29, 2015 at 11:28 PM, Jeff Janes jeff.ja...@gmail.com wrote:


On Mon, Jun 29, 2015 at 5:55 PM, Peter Geoghegan p...@heroku.com wrote:


On Mon, Jun 29, 2015 at 5:37 PM, Jeff Janes jeff.ja...@gmail.com wrote:

Is there a way to use gdb to figure out who holds the lock they are
waiting for?


Have you considered building with LWLOCK_STATS defined, and LOCK_DEBUG
defined? That might do it.



I hadn't planned on running into this problem, so had not compiled
accordingly.

I thought LOCK_DEBUG would produce too much output, but now I see it
doesn't print anything unless Trace_lwlocks is on (but it still makes more
info available to gdb, as Amit mentioned), so I think that should be OK.

Since I messed up the gdb session causing the postmaster to SIGKILL all
the children and destroy the evidence, I'll repeat the run compiled with
LOCK_DEBUG and see what it looks like in the morning.



I've gotten the LWLock deadlock again.  User backend 24841 holds the
WALInsertLock 7 and is blocked attempting to acquire 6.  So it seems to
be violating the lock ordering rules (although I don't see that rule
spelled out in xlog.c)

The Checkpoint process, meanwhile, has acquired WALInsertLocks 0 through 6
and is blocked on 7.

I'm not sure where to go from here.


The user backend 24841 is waiting in a LWLockWaitForVar() call, on 
WALInsertLock 6, and oldval==0. The checkpointer is holding 
WALInsertLock 6, but it has set the insertingat value to PG_UINT64_MAX. 
The LWLockWaitForVar() call should not be blocked on that, because 0 != 
PG_UINT64_MAX.


Looks like the LWLock scalability patch that made LWLock acquisition
use atomic ops instead of a spinlock broke this. LWLockWaitForVar() is 
supposed to guarantee that it always wakes up from sleep, when the 
variable's value changes. But there is now a race condition in 
LWLockWaitForVar that wasn't there in 9.4:



		if (mustwait)
		{
			/*
			 * Perform comparison using spinlock as we can't rely on atomic 64
			 * bit reads/stores.
			 */
#ifdef LWLOCK_STATS
			lwstats->spin_delay_count += SpinLockAcquire(&lock->mutex);
#else
			SpinLockAcquire(&lock->mutex);
#endif

			/*
			 * XXX: We can significantly optimize this on platforms with 64bit
			 * atomics.
			 */
			value = *valptr;
			if (value != oldval)
			{
				result = false;
				mustwait = false;
				*newval = value;
			}
			else
				mustwait = true;
			SpinLockRelease(&lock->mutex);
		}
		else
			mustwait = false;

		if (!mustwait)
			break;				/* the lock was free or value didn't match */

		/*
		 * Add myself to wait queue. Note that this is racy, somebody else
		 * could wakeup before we're finished queuing. NB: We're using nearly
		 * the same twice-in-a-row lock acquisition protocol as
		 * LWLockAcquire(). Check its comments for details.
		 */
		LWLockQueueSelf(lock, LW_WAIT_UNTIL_FREE, false);


After the spinlock is released above, but before the LWLockQueueSelf() 
call, it's possible that another backend comes in, acquires the lock, 
changes the variable's value, and releases the lock again. In 9.4, the 
spinlock was not released until the process was queued.


Looking at LWLockAcquireWithVar(), it's also no longer holding the 
spinlock while it updates the Var. That's not cool either :-(.

- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.5 release notes

2015-06-30 Thread Andres Freund
I've gone through the release notes and added comments referencing
commits as discussed earlier. Additionally I've improved and added a
bunch of items.

Further stuff that came up while looking:
* 2014-09-25 [b0d81ad] Heikki..: Add -D option to specify data directory to 
pg_c..
  new options, should possibly be documented?

* 2014-10-02 [3acc10c9] Robert..: Increase the number of buffer mapping 
partition..
  should we mention this? This has been patched by a number of
  downstream vendors and users, so it's probably worth calling out?

* 2014-11-18 [606c012] Simon ..: Reduce btree scan overhead for  and  
strategies
  Worthy of a note in the performance section.

* 2014-11-22 [eca2b9b] Andrew..: Rework echo_hidden for \sf and \ef from commit 
..
  Seems far too minor in comparison to other stuff left out.

* 2014-11-07 [7516f52] Alvaro..: BRIN: Block Range Indexes
  Looks to me like that should just be Alvaro.

* 2014-11-22 [b62f94c] Tom Lane: Allow simplification of EXISTS() subqueries 
con..
  Planner change, I think it's good to mention those.

* 2014-11-28 [e384ed6] Tom Lane: Improve typcache: cache negative lookup 
results..
  Should perhaps, together with other cache enhancing entries, be mentioned?

* 2014-12-08 [519b075] Simon ..: Use GetSystemTimeAsFileTime directly in win32
  2014-12-08 [8001fe6] Simon ..: Windows: use GetSystemTimePreciseAsFileTime if 
..
  Timer resolution isn't an unimportant thing for people using EXPLAIN?

* 2014-12-12 [7e354ab] Andrew..: Add several generator functions for jsonb that 
..
  2015-05-12 [c694701] Andrew..: Additional functions and operators for jsonb
  2015-05-31 [37def42] Andrew..: Rename jsonb_replace to jsonb_set and allow it 
..
  2014-12-12 [237a882] Andrew..: Add json_strip_nulls and jsonb_strip_nulls 
fun..
  should probably be merged? Similar set of authors and too many similar 
release note
  entries.

* 2014-12-23 [d7ee82e] Alvaro..: Add SQL-callable pg_get_object_address
* 2014-12-30 [a676201] Alvaro..: Add pg_identify_object_as_address
  should be merged.

* 2015-01-13 [4bad60e] Andres..: Allow latches to wait for socket writability 
wi..
  2015-01-14 [59f71a0] Andres..: Add a default local latch for use in signal 
han..
  2015-01-17 [ff44fba] Andres..: Replace walsender's latch with the general 
shar..
  2015-02-03 [387da18] Andres..: Use a nonblocking socket for FE/BE 
communicatio..
  2015-02-03 [4f85fde] Andres..: Introduce and use infrastructure for interrupt 
..
  2015-02-03 [4fe384b] Andres..: Process 'die' interrupts while reading/writing 
..
  2015-02-03 [6647248] Andres..: Don't allow immediate interrupts during 
authent..
  2015-02-03 [675] Andres..: Move deadlock and other interrupt handling in 
p..
  2015-02-13 [80788a4] Heikki..: Simplify waiting logic in reading from 
/writin..

* 2015-01-17 [9402869] Heikki..: Advance backend's advertised xmin more 
aggressi..
  This is a pretty big efficiency boon for systems with longer
  nontrivial transactions.

* 2015-01-29 [ed12700] Andres..: Align buffer descriptors to cache line 
boundari..
  Maybe worthwhile to mention?

* 2015-02-16 [9e3ad1a] Tom Lane: Use fast path in plpgsql's RETURN/RETURN NEXT 
i..
  2015-02-28 [e524cbd] Tom Lane: Track typmods in plpgsql expression evaluation 
..
  2015-03-04 [1345cc6] Tom Lane: Use standard casting mechanism to convert 
types..
  2015-03-11 [21dcda2] Tom Lane: Allocate ParamListInfo once per plpgsql 
functio..

  Maybe reformulate to generalize the array performance into plpgsql and
  mention arrays and RETURN? And combine with the casting change entry,
  because that's also about performance?

* 2015-02-20 [09d8d11] Tom Lane: Use FLEXIBLE_ARRAY_MEMBER in a bunch more 
places.
  2015-02-20 [5740be6] Tom Lane: Some more FLEXIBLE_ARRAY_MEMBER hacking.
  2015-02-20 [e38b1eb] Tom Lane: Use FLEXIBLE_ARRAY_MEMBER in struct varlena.
  2015-02-20 [c110eff] Tom Lane: Use FLEXIBLE_ARRAY_MEMBER in struct 
RecordIOData.
  2015-02-20 [33a3b03] Tom Lane: Use FLEXIBLE_ARRAY_MEMBER in some more places.
  2015-02-20 [33b2a2c] Tom Lane: Fix statically allocated struct with 
FLEXIBLE_A..
  2015-02-21 [f2874fe] Tom Lane: Some more FLEXIBLE_ARRAY_MEMBER fixes.
  2015-02-21 [e1a11d9] Tom Lane: Use FLEXIBLE_ARRAY_MEMBER for 
HeapTupleHeaderDa..
  2015-02-21 [2e21121] Tom Lane: Use FLEXIBLE_ARRAY_MEMBER in a number of other 
..
  Maybe add a note in the source code section? Not sure if relevant enough.

* 2015-02-21 [b419865] Jeff D..: In array_agg(), don't create a new context for 
..
  Performance improvement?

* 2015-02-27 [f65e827] Tom Lane: Invent a memory context reset/delete callback 
m..
  Interesting for devs, maybe add?

* 2015-03-11 [b557226] Tom Lane: Improve planner's cost estimation in the 
presen..
  2015-03-03 [b989619] Tom Lane: Fix cost estimation for indexscans on 
expensive..
  2015-03-11 [f4abd02] Tom Lane: Support flattening of empty-FROM subqueries 
and..
  2015-05-04 [2503982] Tom Lane: Improve procost estimates for some text search 
..
  

Re: [HACKERS] LWLock deadlock and gdb advice

2015-06-30 Thread Andres Freund
On 2015-06-30 21:08:53 +0300, Heikki Linnakangas wrote:
  /*
   * XXX: We can significantly optimize this on platforms with 64bit
   * atomics.
   */
  value = *valptr;
  if (value != oldval)
  {
  	result = false;
  	mustwait = false;
  	*newval = value;
  }
  else
  	mustwait = true;
  SpinLockRelease(&lock->mutex);
  }
  else
  	mustwait = false;

  if (!mustwait)
  	break;				/* the lock was free or value didn't match */

  /*
   * Add myself to wait queue. Note that this is racy, somebody else
   * could wakeup before we're finished queuing. NB: We're using nearly
   * the same twice-in-a-row lock acquisition protocol as
   * LWLockAcquire(). Check its comments for details.
   */
  LWLockQueueSelf(lock, LW_WAIT_UNTIL_FREE, false);
 
 After the spinlock is released above, but before the LWLockQueueSelf() call,
 it's possible that another backend comes in, acquires the lock, changes the
 variable's value, and releases the lock again. In 9.4, the spinlock was not
 released until the process was queued.

Hm. Right. A recheck of the value after the queuing should be sufficient
to fix? That's how we deal with the exact same scenarios for the normal
lock state, so that shouldn't be very hard to add.
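
(A hypothetical sketch of that recheck-after-queue pattern, reusing the
variables from the excerpt above; LWLockDequeueSelf() is assumed to exist as
the usual undo for LWLockQueueSelf(), as with the normal lock state. This is
an illustration of the idea, not the eventual fix.)

    LWLockQueueSelf(lock, LW_WAIT_UNTIL_FREE, false);

    /* Recheck under the spinlock now that we are on the wait queue. */
    SpinLockAcquire(&lock->mutex);
    value = *valptr;
    mustwait = (value == oldval);
    SpinLockRelease(&lock->mutex);

    if (!mustwait)
    {
        /* The variable changed while we queued; undo and don't sleep. */
        LWLockDequeueSelf(lock);
        *newval = value;
        result = false;
        break;
    }
    /* otherwise fall through and sleep on the semaphore, as before */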

 Looking at LWLockAcquireWithVar(), it's also no longer holding the spinlock
 while it updates the Var. That's not cool either :-(.

Hm. I'd somehow assumed holding the lwlock ought to be sufficient, but
it really isn't. This var stuff isn't fitting all that well into the
lwlock code.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Mention column name in error messages

2015-06-30 Thread Franck Verrot
Hi all,

As far as I know, there is currently no way to find which column is triggering
an error on an INSERT or ALTER COLUMN statement. Example:

# create table foo(bar varchar(4), baz varchar(2));
CREATE TABLE
# insert into foo values ('foo!', 'ok');
INSERT 0 1
# insert into foo values ('foo2!', 'ok');
ERROR:  value too long for type character varying(4)
# insert into foo values ('foo!', 'ok2');
ERROR:  value too long for type character varying(2)


I browsed the list and found a conversation from last year
(http://www.postgresql.org/message-id/3214.1390155...@sss.pgh.pa.us) that
discussed adding the actual value in the output.

The behavior I am proposing differs in the sense we will be able to see in
the ERROR:  what column triggered the error:

# create table foo(bar varchar(4), baz varchar(2));
CREATE TABLE
# insert into foo values ('foo!', 'ok');
INSERT 0 1
# insert into foo values ('foo2!', 'ok');
ERROR:  value too long for bar of type character varying(4)
# insert into foo values ('foo!', 'ok2');
ERROR:  value too long for baz of type character varying(2)


In that same conversation, Tom Lane said:

  [...] People have speculated about ways to
  name the target column in the error report, which would probably be
  far more useful; but it's not real clear how to do that in our system
  structure.

Given my very restricted knowledge of PG's codebase I am not sure whether my
modifications are legitimate or not, so please don't hesitate to comment on
it and pointing where things are subpar to PG's codebase. In any case, it's
to be considered as WIP for the moment.

Thanks in advance,
Franck



-- 
Franck Verrot
https://github.com/franckverrot
https://twitter.com/franckverrot


verbose-validation.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] LWLock deadlock and gdb advice

2015-06-30 Thread Heikki Linnakangas

On 06/30/2015 10:09 PM, Andres Freund wrote:

On 2015-06-30 21:08:53 +0300, Heikki Linnakangas wrote:

/*
 * XXX: We can significantly optimize this on platforms with 64bit
 * atomics.
 */
value = *valptr;
if (value != oldval)
{
	result = false;
	mustwait = false;
	*newval = value;
}
else
	mustwait = true;
SpinLockRelease(&lock->mutex);
}
else
	mustwait = false;

if (!mustwait)
	break;				/* the lock was free or value didn't match */

/*
 * Add myself to wait queue. Note that this is racy, somebody else
 * could wakeup before we're finished queuing. NB: We're using nearly
 * the same twice-in-a-row lock acquisition protocol as
 * LWLockAcquire(). Check its comments for details.
 */
LWLockQueueSelf(lock, LW_WAIT_UNTIL_FREE, false);


After the spinlock is released above, but before the LWLockQueueSelf() call,
it's possible that another backend comes in, acquires the lock, changes the
variable's value, and releases the lock again. In 9.4, the spinlock was not
released until the process was queued.


Hm. Right. A recheck of the value after the queuing should be sufficient
to fix? That's how we deal with the exact same scenarios for the normal
lock state, so that shouldn't be very hard to add.


Yeah. It's probably more efficient to not release the spinlock between 
checking the value and LWLockQueueSelf() though.



Looking at LWLockAcquireWithVar(), it's also no longer holding the spinlock
while it updates the Var. That's not cool either :-(.


Hm. I'd somehow assumed holding the lwlock ought to be sufficient, but
it really isn't. This var stuff isn't fitting all that well into the
lwlock code.


I'll take a stab at fixing this tomorrow. I take the blame for not 
documenting the semantics of LWLockAcquireWithVar() and friends well 
enough...


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] LWLock deadlock and gdb advice

2015-06-30 Thread Andres Freund
On 2015-06-30 22:19:02 +0300, Heikki Linnakangas wrote:
 Hm. Right. A recheck of the value after the queuing should be sufficient
 to fix? That's how we deal with the exact same scenarios for the normal
 lock state, so that shouldn't be very hard to add.
 
 Yeah. It's probably more efficient to not release the spinlock between
 checking the value and LWLockQueueSelf() though.

Right now LWLockQueueSelf() takes spinlocks etc itself, and is used that
way in a bunch of callsites... So that'd be harder.  Additionally I'm
planning to get rid of the spinlocks around queuing (they show up as
bottlenecks in contended workloads), so it seems more future proof not
to assume that either way.  On top of that I think we should, when
available (or using the same type of fallback as for 32bit?), use 64 bit
atomics for the var anyway.
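
(For example, the spinlocked read could presumably become something like the
following, assuming valptr points at storage declared as a pg_atomic_uint64;
just a sketch, not tested.)

    value = pg_atomic_read_u64((pg_atomic_uint64 *) valptr);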

 I'll take a stab at fixing this tomorrow.

Thanks! Do you mean both or just the second issue?

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_trgm version 1.2

2015-06-30 Thread Jeff Janes
On Tue, Jun 30, 2015 at 2:46 AM, Alexander Korotkov 
a.korot...@postgrespro.ru wrote:

 On Sun, Jun 28, 2015 at 1:17 AM, Jeff Janes jeff.ja...@gmail.com wrote:

 This patch implements version 1.2 of contrib module pg_trgm.

 This supports the triconsistent function, introduced in version 9.4 of
 the server, to make it faster to implement indexed queries where some keys
 are common and some are rare.


 Thank you for the patch! Lack of pg_trgm triconsistent support was a
 significant miss after the fast scan implementation.


 I've included the paths to both upgrade and downgrade between 1.1 and
 1.2, although after doing so you must close and restart the session before
 you can be sure the change has taken effect. There is no change to the
 on-disk index structure.


 pg_trgm--1.1.sql and pg_trgm--1.1--1.2.sql are useful for debugging, but do
 you expect them in the final commit? As I can see, in other contribs we have
 only the last version and upgrade scripts.



I had thought that pg_trgm--1.1.sql was needed for pg_upgrade to work, but
I see that that is not the case.

I did see another downgrade path for a different module, but on closer
inspection it was one that I wrote while trying to analyze that module, and
then never removed.  I have no objection to removing pg_trgm--1.2--1.1.sql
before the commit, but I don't see what we gain by doing so.  If a
downgrade is feasible and has been tested, why not include it?

...



 You could get the same benefit just by increasing MAX_MAYBE_ENTRIES (in
 core) from 4 to some higher value (which it probably should be anyway, but
 there will always be a case where it needs to be higher than you can afford
 it to be, so a real solution is needed).


 Actually, it depends on how long it takes to calculate the consistent
 function. The cheaper the consistent function is, the higher MAX_MAYBE_ENTRIES
 could be. Since all functions in PostgreSQL may define their cost, GIN could
 calculate MAX_MAYBE_ENTRIES from the cost of the consistent function.


The consistent function gets called on every candidate tuple, so if it is
expensive then it is also all the more worthwhile to reduce the set of
candidate tuples.  Perhaps MAX_MAYBE_ENTRIES could be calculated from the
log of the maximum of the predictNumberResult entries? Anyway, a subject
for a different day.



 There may also be some gains in the similarity and regex cases, but I
 didn't really analyze those for performance.



 I've thought about how to document this change.  Looking to other example
 of other contrib modules with multiple versions, I decided that we don't
 document them, other than in the release notes.


 The same patch applies to 9.4 code with a minor conflict in the Makefile,
 and gives benefits there as well.


 Some other notes about the patch:
 * You allocate boolcheck array and don't use it.


That was a bug (at least notionally).  trigramsMatchGraph was supposed to
be getting boolcheck, not check, passed in to it.

It may not have been a bug in practice, because GIN_MAYBE and GIN_TRUE both
test as true when cast to booleans. But it seems wrong to rely on that.  Or
was it intended to work this way?

I'm surprised the compiler suite doesn't emit some kind of warning on that.



 * Check coding style and formatting, in particular check[i]==GIN_TRUE
 should be check[i] == GIN_TRUE.


Sorry about that, fixed.  I also changed it in other places to check[i] !=
GIN_FALSE, rather than checking against both GIN_TRUE and GIN_MAYBE.  At
first I was concerned we might decide to add a 4th option to the type which
would render  != GIN_FALSE the wrong way to test for it.  But since it is
called GinTernaryValue, I think we wouldn't add a fourth thing to it.  But
perhaps the more verbose form is clearer to some people.



 * I think some comments needed in gin_trgm_triconsistent() about
 trigramsMatchGraph(). gin_trgm_triconsistent() may use trigramsMatchGraph()
 that way because trigramsMatchGraph() implements monotonous boolean
 function.


I have a function-level comment that in all cases, GIN_TRUE is at least as
favorable to inclusion of a tuple as GIN_MAYBE.  Should I reiterate that
comment before trigramsMatchGraph() as well?  Calling it a monotonic
boolean function is precise and concise, but probably less understandable
to people reading the code.  At least, if I saw that, I would need to go do
some reading before I knew what it meant.
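
(Something along these lines, purely as a wording suggestion for such a
comment; this is not taken from the patch itself.)

    /*
     * trigramsMatchGraph() is monotonic: flipping any input from GIN_FALSE
     * to GIN_MAYBE, or from GIN_MAYBE to GIN_TRUE, can never flip the
     * result from true to false.  That is what makes it safe for the
     * ternary consistent check to pass GIN_MAYBE entries through and
     * report GIN_MAYBE for the whole qual.
     */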

Update attached, with the downgrade path still included for now.

Thanks,

Jeff


pg_trgm_1_2_v002.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Dereferenced pointer in tablesample.c

2015-06-30 Thread Michael Paquier
Hi all,
(Petr in CC)

Coverity is complaining about the following pointer dereference in
tablesample_init@tablesample.c:
+   ExprState  *argstate = ExecInitExpr(argexpr, (PlanState *) scanstate);
+
+   if (argstate == NULL)
+   {
+       fcinfo.argnull[i] = true;
+       fcinfo.arg[i] = (Datum) 0;;
+   }
+
+   fcinfo.arg[i] = ExecEvalExpr(argstate, econtext,
+                                &fcinfo.argnull[i], NULL);

If the expression argexpr passed to ExecInitExpr() is NULL, argstate is
going to be NULL and dereferenced afterwards; see execQual.c for more
details. Hence I think that the patch attached should be applied. Thoughts?

At the same time I noted a double semicolon, fixed as well in the attached.
Regards,
-- 
Michael
diff --git a/src/backend/access/tablesample/tablesample.c b/src/backend/access/tablesample/tablesample.c
index 44a2434..9d443b1 100644
--- a/src/backend/access/tablesample/tablesample.c
+++ b/src/backend/access/tablesample/tablesample.c
@@ -113,11 +113,13 @@ tablesample_init(SampleScanState *scanstate, TableSampleClause *tablesample)
 		if (argstate == NULL)
 		{
 			fcinfo.argnull[i] = true;
-			fcinfo.arg[i] = (Datum) 0;;
+			fcinfo.arg[i] = (Datum) 0;
+		}
+		else
+		{
+			fcinfo.arg[i] = ExecEvalExpr(argstate, econtext,
+										 &fcinfo.argnull[i], NULL);
 		}
-
-		fcinfo.arg[i] = ExecEvalExpr(argstate, econtext,
-									 &fcinfo.argnull[i], NULL);
 		i++;
 	}
 	Assert(i == fcinfo.nargs);

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Reduce ProcArrayLock contention

2015-06-30 Thread Amit Kapila
On Tue, Jun 30, 2015 at 11:53 AM, Simon Riggs si...@2ndquadrant.com wrote:

 On 30 June 2015 at 04:21, Amit Kapila amit.kapil...@gmail.com wrote:


 Now, I would like to briefly explain how the allow-one-waker idea has
 helped to improve the patch, as not everybody here was present
 at that un-conference.


 The same idea applies for marking commits in clog, for which I have been
sitting on a patch for a month or so and will post now I'm done travelling.


Sure, and I think we might want to try something similar even
for XLogFlush, where we use LWLockAcquireOrWait for
WALWriteLock. Not sure how it will work out in that case as
I/O is involved, but I think it is worth trying.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] drop/truncate table sucks for large values of shared buffers

2015-06-30 Thread Amit Kapila
On Tue, Jun 30, 2015 at 11:00 AM, Simon Riggs si...@2ndquadrant.com wrote:

 On 30 June 2015 at 05:02, Amit Kapila amit.kapil...@gmail.com wrote:

 On Mon, Jun 29, 2015 at 7:18 PM, Simon Riggs si...@2ndquadrant.com
wrote:
 
  On 28 June 2015 at 17:17, Tom Lane t...@sss.pgh.pa.us wrote:
 
  If lseek fails badly then SeqScans would give *silent* data loss,
which in my view is worse. Just added pages aren't the only thing we might
miss if lseek is badly wrong.
 

 So for the purpose of this patch, do we need to assume that
 lseek can give us the wrong size of a file, and should we add preventive
 checks and other handling for the same?
 I am okay to change it that way if we are going to have that as an assumption
 in our code wherever we are using it or will use it in the future; otherwise
 we will end up with some preventive checks which are actually not required.


 They're preventative checks. You always hope it is wasted effort.


I am not sure if preventative checks (without a real need) are okay if
they are not cheap, which could happen in this case.  I think validating
the buffer tag would require a rel or sys cache lookup.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


[HACKERS] Reducing ClogControlLock contention

2015-06-30 Thread Simon Riggs
ClogControlLock contention is high at commit time. This appears to be due
to the fact that ClogControlLock is acquired in Exclusive mode prior to
marking commit, which then gets starved by backends running
TransactionIdGetStatus().

Proposal for improving this is to acquire the ClogControlLock in Shared
mode, if possible.

This is safe because people checking visibility of an xid must always run
TransactionIdIsInProgress() first to avoid race conditions, which will
always return true for the transaction we are currently committing. As a
result, we never get concurrent access to the same bits in clog, which
would require a barrier.

Two concurrent writers might access the same word concurrently, so we
protect against that with a new CommitLock. We could partition that by
pageno also, if needed.
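
(A rough sketch of the proposed commit-time protocol, for illustration only;
the real change is in the attached patch. CommitLock is the new lock described
above, and SimpleLruReadPage_ReadOnly() is assumed to return with
CLogControlLock held in shared mode in the common resident-page case.)

    /* Serialize writers that might touch the same clog word. */
    LWLockAcquire(CommitLock, LW_EXCLUSIVE);

    /* Takes CLogControlLock in shared mode in the common case. */
    slotno = SimpleLruReadPage_ReadOnly(ClogCtl, pageno, xid);

    TransactionIdSetStatusBit(xid, TRANSACTION_STATUS_COMMITTED,
                              lsn, slotno);

    LWLockRelease(CLogControlLock);
    LWLockRelease(CommitLock);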

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


clog_optimize.v3.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Reducing ClogControlLock contention

2015-06-30 Thread Michael Paquier
On Tue, Jun 30, 2015 at 4:02 PM, Simon Riggs si...@2ndquadrant.com wrote:

 ClogControlLock contention is high at commit time. This appears to be due
 to the fact that ClogControlLock is acquired in Exclusive mode prior to
 marking commit, which then gets starved by backends running
 TransactionIdGetStatus().

 Proposal for improving this is to acquire the ClogControlLock in Shared
 mode, if possible.

 This is safe because people checking visibility of an xid must always run
 TransactionIdIsInProgress() first to avoid race conditions, which will
 always return true for the transaction we are currently committing. As a
 result, we never get concurrent access to the same bits in clog, which
 would require a barrier.

 Two concurrent writers might access the same word concurrently, so we
 protect against that with a new CommitLock. We could partition that by
 pageno also, if needed.


Could it be possible to see some performance numbers? For example with a
simple pgbench script doing a bunch of tiny transactions, with many
concurrent sessions (perhaps hundreds).
-- 
Michael


Re: [HACKERS] Reduce ProcArrayLock contention

2015-06-30 Thread Simon Riggs
On 30 June 2015 at 03:43, Robert Haas robertmh...@gmail.com wrote:

 On Mon, Jun 29, 2015 at 1:22 PM, Simon Riggs si...@2ndquadrant.com
 wrote:
  Yes, I know. And we all had a long conversation about how to do it
 without
  waking up the other procs.
 
  Forming a list, like we use for sync rep and having just a single process
  walk the queue was the way I suggested then and previously.
 
  Weird.

 I am not sure what your point is.  Are you complaining that you didn't
 get a design credit for this patch?  If so, I think that's a bit
 petty.  I agree that you mentioned something along these lines at
 PGCon, but Amit and I have been discussing this every week for over a
 month, so it's not as if the conversations at PGCon were the only
 ones, or the first.  Nor is there a conspiracy to deprive Simon Riggs
 of credit for his ideas.  I believe that you should assume good faith
 and take it for granted that Amit credited who he believed that he got
 his ideas from.  The fact that you may have had similar ideas does not
 mean that he got his from you.  It probably does mean that they are
 good ideas, since we are apparently all thinking in the same way.


Oh, I accept multiple people can have the same ideas. That happens to me a
lot around here.

What I find weird is that the discussion was so intense about
LWLockAcquireOrWait that when someone presented a solution there were
people that didn't notice. It makes me wonder whether large group
discussions are worth it.

What I find even weirder is the thought that there were another 100 people
in the room and it makes me wonder whether others present had even better
ideas but they don't speak up for some other reason. I guess that happens
on-list all the time, its just we seldom experience the number of people
watching our discussions. I wonder how many times people give good ideas
and we ignore them, myself included.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] LWLock deadlock and gdb advice

2015-06-30 Thread Jeff Janes
On Mon, Jun 29, 2015 at 5:55 PM, Peter Geoghegan p...@heroku.com wrote:

 On Mon, Jun 29, 2015 at 5:37 PM, Jeff Janes jeff.ja...@gmail.com wrote:
  Is there a way to use gdb to figure out who holds the lock they are
  waiting for?

 Have you considered building with LWLOCK_STATS defined, and LOCK_DEBUG
 defined? That might do it.


I hadn't planned on running into this problem, so had not compiled
accordingly.

I thought LOCK_DEBUG would produce too much output, but now I see it
doesn't print anything unless Trace_lwlocks is on (but it still makes more
info available to gdb, as Amit mentioned), so I think that should be OK.

Since I messed up the gdb session causing the postmaster to SIGKILL all the
children and destroy the evidence, I'll repeat the run compiled with
LOCK_DEBUG and see what it looks like in the morning.

Thanks,

Jeff


Re: [HACKERS] drop/truncate table sucks for large values of shared buffers

2015-06-30 Thread Simon Riggs
On 30 June 2015 at 07:34, Amit Kapila amit.kapil...@gmail.com wrote:

 On Tue, Jun 30, 2015 at 11:00 AM, Simon Riggs si...@2ndquadrant.com
 wrote:
 
  On 30 June 2015 at 05:02, Amit Kapila amit.kapil...@gmail.com wrote:
 
  On Mon, Jun 29, 2015 at 7:18 PM, Simon Riggs si...@2ndquadrant.com
 wrote:
  
   On 28 June 2015 at 17:17, Tom Lane t...@sss.pgh.pa.us wrote:
  
   If lseek fails badly then SeqScans would give *silent* data loss,
 which in my view is worse. Just added pages aren't the only thing we might
 miss if lseek is badly wrong.
  
 
  So for the purpose of this patch, do we need to assume that
  lseek can give us the wrong size of a file, and should we add preventive
  checks and other handling for the same?
  I am okay to change it that way if we are going to have that as an assumption
  in our code wherever we are using it or will use it in the future; otherwise
  we will end up with some preventive checks which are actually not required.
 
 
  They're preventative checks. You always hope it is wasted effort.
 

 I am not sure if preventative checks (without a real need) are okay if
 they are not cheap, which could happen in this case.  I think validating
 the buffer tag would require a rel or sys cache lookup.


True, so don't do that.

Keep a list of dropped relations and have the checkpoint process scan the
buffer pool every 64 tables, kinda like AbsorbFsync

All the heavy lifting gets done in a background process and we know we're
safe.
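
(A hypothetical sketch of that scheme; all names here are made up, but
DropRelFileNodesAllBuffers() is the existing batched scan over shared
buffers.)

    #define PENDING_DROPS_MAX 64

    static RelFileNodeBackend pendingDrops[PENDING_DROPS_MAX];
    static int  nPendingDrops = 0;

    /* Called by the background process to absorb the queued drops. */
    static void
    AbsorbDroppedRelations(void)
    {
        if (nPendingDrops == 0)
            return;
        /* One pass over shared_buffers invalidates the whole batch. */
        DropRelFileNodesAllBuffers(pendingDrops, nPendingDrops);
        nPendingDrops = 0;
    }

    /* Called when a relation is dropped; batches the buffer scans. */
    static void
    RememberDroppedRelation(RelFileNodeBackend rnode)
    {
        pendingDrops[nPendingDrops++] = rnode;
        if (nPendingDrops == PENDING_DROPS_MAX)
            AbsorbDroppedRelations();
    }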

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Reduce ProcArrayLock contention

2015-06-30 Thread Simon Riggs
On 30 June 2015 at 04:21, Amit Kapila amit.kapil...@gmail.com wrote:


 Now, I would like to briefly explain how the allow-one-waker idea has
 helped to improve the patch, as not everybody here was present
 at that un-conference.


The same idea applies for marking commits in clog, for which I have been
sitting on a patch for a month or so and will post now I'm done travelling.

These ideas have been around some time and are even listed on the
PostgreSQL TODO:
http://archives.postgresql.org/pgsql-hackers/2007-09/msg00206.php

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Reduce ProcArrayLock contention

2015-06-30 Thread Simon Riggs
On 30 June 2015 at 07:30, Amit Kapila amit.kapil...@gmail.com wrote:


 Sure, and I think we might want to try something similar even
 for XLogFlush, where we use LWLockAcquireOrWait for
 WALWriteLock. Not sure how it will work out in that case as
 I/O is involved, but I think it is worth trying.


+1

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Reduce ProcArrayLock contention

2015-06-30 Thread Peter Geoghegan
On Mon, Jun 29, 2015 at 11:14 PM, Simon Riggs si...@2ndquadrant.com wrote:
 What I find weird is that the discussion was so intense about
 LWLockAcquireOrWait that when someone presented a solution there were people
 that didn't notice. It makes me wonder whether large group discussions are
 worth it.

I didn't think of this myself, but it feels like something I could
have thought of easily.


-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PANIC in GIN code

2015-06-30 Thread Heikki Linnakangas

On 06/30/2015 02:18 AM, Jeff Janes wrote:

On Mon, Jun 29, 2015 at 2:08 PM, Heikki Linnakangas hlinn...@iki.fi wrote:


I just pushed a fix for this, but unfortunately it didn't make it
9.5alpha1.


Thanks.  I think that that fixed it.  It survived for over an hour this
time.


Thanks. I grepped through all the other call sites of 
XLogInitBufferForRedo() to look for similar bugs, and sure enough, there 
was one more bug of the same sort, in B-tree page deletion. That one was 
in 9.4 as well. Fixed that too, and added an assertion into 
PageGetSpecialPointer() that would help to catch this sort of bug 
earlier in the future.


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Solaris testers wanted for strxfrm() behavior

2015-06-30 Thread Josh Berkus
On 06/29/2015 07:53 PM, Robert Haas wrote:
 On Mon, Jun 29, 2015 at 6:07 PM, Josh Berkus j...@agliodbs.com wrote:
 On 06/29/2015 02:08 PM, Tom Lane wrote:
 Josh Berkus j...@agliodbs.com writes:
 Joyent confirms that the bug is fixed on SmartOS:

 The more interesting bit of information would be *when* it was fixed.

 Answer: not certain, but fixed at least 2 years ago.
 
 2 years is definitely not long enough to assume that no unpatched
 machines exist in the wild.  :-(

Well, AFAIK, 90% of SmartOS users are on the Joyent Cloud and get
upgraded automatically.  It's not much used as a portable OS.  So for
them, it is enough.

Given that Theo's patch against Illumos went in in 2012, I think we can
take that as our cutoff date for Illumos.

The question is: how many folks out there are running PostgreSQL on
Solaris 10?  And are they at all likely to upgrade to PostgreSQL 9.5?

Unfortunately, these questions are coming up right when Bjorn is on
vacation ...

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Solaris testers wanted for strxfrm() behavior

2015-06-30 Thread Tom Lane
Josh Berkus j...@agliodbs.com writes:
 The question is: how many folks out there are running PostgreSQL on
 Solaris 10?  And are they at all likely to upgrade to PostgreSQL 9.5?

That's only the pertinent question if the bug exists on Solaris 10,
which I don't think we know, do we?  Oskari Saarenmaa's results upthread
seemed to indicate not.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Mention column name in error messages

2015-06-30 Thread Tom Lane
Franck Verrot fra...@verrot.fr writes:
 As far as I know, there is currently no way to find which column is triggering
 an error on an INSERT or ALTER COLUMN statement. Example:

Indeed ...

 Given my very restricted knowledge of PG's codebase I am not sure whether my
 modifications are legitimate or not, so please don't hesitate to comment on
 it and pointing where things are subpar to PG's codebase. In any case, it's
 to be considered as WIP for the moment.

I think the way you're going at this is fundamentally wrong.  It amounts
to an (undocumented) change to the API for datatype input functions, and
the only way it can work is to change *every datatype's* input function
to know about it.  That's not practical.  More, it doesn't cover errors
that are thrown from someplace other than a datatype input function.
Evaluation errors in domain check constraint expressions are an example
that would be very hard to manage this way.  And the approach definitely
doesn't scale nicely when you start thinking about other contexts wherein
a datatype input operation or coercion might fail.

What seems more likely to lead to a usable patch is to arrange for the
extra information you want to be emitted as error context, via an error
context callback that gets installed at the right times.  We already have
that set up for datatype-specific errors detected by COPY IN, for example.
If you feed garbage data to COPY you get something like

ERROR:  invalid input syntax for integer: foo
CONTEXT:  COPY int8_tbl, line 2, column q2: foo

with no need for int8in to be directly aware of the context.  You should
try adapting that methodology for the cases you're worried about.
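
(A minimal sketch of that methodology with hypothetical names: the
ErrorContextCallback machinery, error_context_stack, errcontext() and
InputFunctionCall() are the real APIs; everything else is illustrative.)

    #include "postgres.h"
    #include "fmgr.h"

    /* Hypothetical context describing the column being coerced. */
    typedef struct ColumnErrCtx
    {
        const char *relname;
        const char *colname;
    } ColumnErrCtx;

    /* Invoked by elog.c; appends a CONTEXT: line to the report. */
    static void
    column_error_callback(void *arg)
    {
        ColumnErrCtx *ctx = (ColumnErrCtx *) arg;

        errcontext("column \"%s\" of table \"%s\"",
                   ctx->colname, ctx->relname);
    }

    static Datum
    input_with_column_context(FmgrInfo *flinfo, char *str, Oid typioparam,
                              int32 typmod, ColumnErrCtx *ctx)
    {
        ErrorContextCallback errcallback;
        Datum       result;

        /* Push our callback onto the error context stack... */
        errcallback.callback = column_error_callback;
        errcallback.arg = ctx;
        errcallback.previous = error_context_stack;
        error_context_stack = &errcallback;

        result = InputFunctionCall(flinfo, str, typioparam, typmod);

        /* ...and pop it again once the risky call has succeeded. */
        error_context_stack = errcallback.previous;

        return result;
    }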

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_trgm version 1.2

2015-06-30 Thread Tom Lane
Jeff Janes jeff.ja...@gmail.com writes:
 On Tue, Jun 30, 2015 at 2:46 AM, Alexander Korotkov 
 a.korot...@postgrespro.ru wrote:
 pg_trgm--1.1.sql and pg_trgm--1.1--1.2.sql are useful for debugging, but do
 you expect them in the final commit? As I can see, in other contribs we have
 only the last version and upgrade scripts.

 I did see another downgrade path for different module, but on closer
 inspection it was one that I wrote while trying to analyze that module, and
 then never removed.  I have no objection to removing pg_trgm--1.2--1.1.sql
 before the commit, but I don't see what we gain by doing so.  If a
 downgrade is feasible and has been tested, why not include it?

Because we don't want to support 1.1 anymore once 1.2 exists.  You're
supposing that just because you wrote the downgrade script and think
it works, there's no further costs associated with having that.
Personally, I don't even want to review such a script, let alone document
its existence and why someone might want to use it, let alone support 1.1
into the far future.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers