Re: [HACKERS] COPY (INSERT/UPDATE/DELETE .. RETURNING ..)

2015-11-28 Thread Michael Paquier
On Sat, Nov 28, 2015 at 1:15 AM, Teodor Sigaev  wrote:
>> Patch is switched to "ready for committer".
>
> Committed, thank you

Thanks.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

2015-11-28 Thread Michael Paquier
On Sat, Nov 28, 2015 at 7:53 AM, Alvaro Herrera wrote:
> I moved all the initialization code (deleting stuff from environment,
> detecting Windows, opening SimpleTie filedescs etc) into BEGIN blocks,
> which run earlier than any other code.

Ah, OK. Thanks. That visibly makes the whole set of modules more consistent.

> Hmm. I just noticed RewindTest sets $ENV{PGDATABASE} outside BEGIN.  Not
> sure what to think of that.  Could instead pass the database name in
> $node->getConnStr() calls, like run_pg_rewind() is already doing.

Yes, let's remove that and pass the database name to getConnStr().

> I tried all the t/ tests we have and all of them pass for me.  If I'm
> able, I will push this on my Sunday late evening, so that I can fix
> whatever gets red on Monday first thing ...

I have also run additional tests on Windows, and this patch is
showing a green status.

A separate issue, but as long as we are working on this set of tests:
I have noticed that config_default.pl is missing the flag tap_tests in
its list. See the patch attached. Could you apply that as well and
backpatch?

I have also noticed that RewindTest.pm is missing "1;" on its last
line. When the module is loaded, this leads to a compilation error.
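(For context: a Perl module loaded via use/require must end by evaluating to a true value, conventionally a bare "1;" as the file's last statement; without it, loading fails with a "did not return a true value" error. A minimal module skeleton, illustrative only and not the real RewindTest.pm, looks like this:)

```perl
# Skeletal Perl module layout (names illustrative only)
package RewindTest;

use strict;
use warnings;
use Exporter 'import';

our @EXPORT = qw(run_pg_rewind);

sub run_pg_rewind
{
	# ... placeholder body; the real routine sets up nodes and
	# invokes pg_rewind ...
	return;
}

1;    # required: "use RewindTest;" fails without a true return value
```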
-- 
Michael
diff --git a/src/tools/msvc/config_default.pl b/src/tools/msvc/config_default.pl
index b9f2ff4..e50be7e 100644
--- a/src/tools/msvc/config_default.pl
+++ b/src/tools/msvc/config_default.pl
@@ -3,7 +3,7 @@ use strict;
 use warnings;
 
 our $config = {
-	asserts => 0,# --enable-cassert
+	asserts  => 0,# --enable-cassert
 	  # integer_datetimes=>1,   # --enable-integer-datetimes - on is now default
 	  # float4byval=>1, # --disable-float4-byval, on by default
 
@@ -13,18 +13,19 @@ our $config = {
 	# blocksize => 8, # --with-blocksize, 8kB by default
 	# wal_blocksize => 8, # --with-wal-blocksize, 8kB by default
 	# wal_segsize => 16,  # --with-wal-segsize, 16MB by default
-	ldap => 1,# --with-ldap
-	extraver => undef,# --with-extra-version=
-	nls  => undef,# --enable-nls=
-	tcl  => undef,# --with-tls=
-	perl => undef,# --with-perl
-	python   => undef,# --with-python=
-	openssl  => undef,# --with-openssl=
-	uuid => undef,# --with-ossp-uuid
-	xml  => undef,# --with-libxml=
-	xslt => undef,# --with-libxslt=
-	iconv=> undef,# (not in configure, path to iconv)
-	zlib => undef # --with-zlib=
+	ldap  => 1,# --with-ldap
+	extraver  => undef,# --with-extra-version=
+	nls   => undef,# --enable-nls=
+	tap_tests => undef,# --enable-tap-tests
+	tcl   => undef,# --with-tls=
+	perl  => undef,# --with-perl
+	python=> undef,# --with-python=
+	openssl   => undef,# --with-openssl=
+	uuid  => undef,# --with-ossp-uuid
+	xml   => undef,# --with-libxml=
+	xslt  => undef,# --with-libxslt=
+	iconv => undef,# (not in configure, path to iconv)
+	zlib  => undef # --with-zlib=
 };
 
 1;



Re: [HACKERS] proposal: PL/Pythonu - function ereport

2015-11-28 Thread Michael Paquier
On Sat, Nov 28, 2015 at 1:57 AM, Pavel Stehule  wrote:
>
>
> 2015-11-27 17:54 GMT+01:00 Teodor Sigaev :
>>
>> Is this patch in 'Waiting on Author' state actually?

I am marking it as returned with feedback for this CF then.
-- 
Michael




Re: [HACKERS] Use pg_rewind when target timeline was switched

2015-11-28 Thread Michael Paquier
On Fri, Nov 27, 2015 at 11:42 PM, Teodor Sigaev  wrote:
> Seems, patch is ready to commit. But it needs some documentation.

Of what kind? The documentation of pg_rewind is rather explicit on the
subject and looks fine as-is, and that's what Alexander and I agreed
on upthread. If there is something forgotten though, this may be the
fact that we do not mention in 9.5's documentation that pg_rewind can
*not* handle timeline switches.
-- 
Michael




Re: [HACKERS] Re: In-core regression tests for replication, cascading, archiving, PITR, etc.

2015-11-28 Thread Alvaro Herrera
Michael Paquier wrote:
> On Sat, Nov 28, 2015 at 7:53 AM, Alvaro Herrera wrote:

> > Hmm. I just noticed RewindTest sets $ENV{PGDATABASE} outside BEGIN.  Not
> > sure what to think of that.  Could instead pass the database name in
> > $node->getConnStr() calls, like run_pg_rewind() is already doing.
> 
> Yes, let's remove that and pass the database name to getConnStr().

Ok.


> A separate issue, but as long as we are working on this set of tests:
> I have noticed that config_default.pl is missing the flag tap_tests in
> its list. See the patch attached. Could you apply that as well and
> backpatch?

> I have also noticed that RewindTest.pm is missing "1;" on its last
> line. When the module is loaded, this leads to a compilation error.

Sure.


> > I tried all the t/ tests we have and all of them pass for me.  If I'm
> > able, I will push this on my Sunday late evening, so that I can fix
> > whatever gets red on Monday first thing ...
> 
> I have also run additional tests on Windows, and this patch is
> showing a green status.

Great.

Here's your recovery test patch rebased, for your (and others'!)
perusal.  It passes for me.  (Test 003 is unchanged.)

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
diff --git a/src/test/Makefile b/src/test/Makefile
index b713c2c..7f7754f 100644
--- a/src/test/Makefile
+++ b/src/test/Makefile
@@ -12,7 +12,7 @@ subdir = src/test
 top_builddir = ../..
 include $(top_builddir)/src/Makefile.global
 
-SUBDIRS = regress isolation modules
+SUBDIRS = regress isolation modules recovery
 
 # We don't build or execute examples/, locale/, or thread/ by default,
 # but we do want "make clean" etc to recurse into them.  Likewise for ssl/,
diff --git a/src/test/recovery/.gitignore b/src/test/recovery/.gitignore
new file mode 100644
index 000..499fa7d
--- /dev/null
+++ b/src/test/recovery/.gitignore
@@ -0,0 +1,3 @@
+# Generated by test suite
+/regress_log/
+/tmp_check/
diff --git a/src/test/recovery/Makefile b/src/test/recovery/Makefile
new file mode 100644
index 000..16c063a
--- /dev/null
+++ b/src/test/recovery/Makefile
@@ -0,0 +1,17 @@
+#-------------------------------------------------------------------------
+#
+# Makefile for src/test/recovery
+#
+# Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+# Portions Copyright (c) 1994, Regents of the University of California
+#
+# src/test/recovery/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/test/recovery
+top_builddir = ../../..
+include $(top_builddir)/src/Makefile.global
+
+check:
+	$(prove_check)
diff --git a/src/test/recovery/README b/src/test/recovery/README
new file mode 100644
index 000..20b98e0
--- /dev/null
+++ b/src/test/recovery/README
@@ -0,0 +1,19 @@
+src/test/recovery/README
+
+Regression tests for recovery and replication
+=============================================
+
+This directory contains a test suite for recovery and replication,
+testing mainly the interactions of recovery.conf with cluster
+instances by providing a simple set of routines that can be used
+to define a custom cluster for a test, including backup, archiving,
+and streaming configuration.
+
+Running the tests
+=================
+
+make check
+
+NOTE: This creates a temporary installation, and some tests may
+create one or multiple nodes, be they master or standby(s) for the
+purpose of the tests.
diff --git a/src/test/recovery/RecoveryTest.pm b/src/test/recovery/RecoveryTest.pm
new file mode 100644
index 000..3706847
--- /dev/null
+++ b/src/test/recovery/RecoveryTest.pm
@@ -0,0 +1,177 @@
+# Set of common routines for recovery regression tests for a PostgreSQL
+# cluster. This includes methods that can be used by the various set of
+# tests present to set up cluster nodes and configure them according to
+# the test scenario wanted.
+#
+# This module makes use of PostgresNode for node manipulation, performing
+# higher-level operations to create standby nodes or setting them up
+# for archiving and replication.
+#
+# Nodes are identified by their port number and have one allocated when
+# created, hence it is unique for each node of the cluster as it is run
+# locally. PGHOST is equally set to a unique value for the duration of
+# each test.
+
+package RecoveryTest;
+
+use strict;
+use warnings;
+
+use Cwd;
+use Exporter 'import';
+use IPC::Run qw(run start);
+use PostgresNode;
+use RecursiveCopy;
+use TestBase;
+use TestLib;
+use Test::More;
+
+our @EXPORT = qw(
+	enable_archiving
+	enable_restoring
+	enable_streaming
+	make_master
+	make_archive_standby
+	make_stream_standby
+);
+
+# Set of handy routines able to set up a node with different characteristics
+# Enable streaming replication
+sub enable_streaming
+{
+	my $node_root = shift; # Instance to link to
+	my $node_standby = shift;
+	my $root_connstr = $node_root->getConnStr();
+	my $applname = 

Re: [HACKERS] Using quicksort for every external sort run

2015-11-28 Thread Jeff Janes
On Thu, Nov 19, 2015 at 12:35 PM, Greg Stark  wrote:
> On Thu, Nov 19, 2015 at 6:56 PM, Peter Geoghegan  wrote:
>> Yes, I really do mean it when I say that the DBA is not supposed to
>> see this message, no matter how much or how little memory or data is
>> involved. There is no nuance intended here; it isn't sensible to allow
>> a multi-pass sort, just as it isn't sensible to allow checkpoints
>> every 5 seconds. Both of those things can be thought of as thrashing.
>
> Hm. So a bit of back-of-envelope calculation. If we want to
> buffer at least 1MB for each run -- I think we currently do more
> actually -- and say that a 1GB work_mem ought to be enough to run
> reasonably (that's per sort after all and there might be multiple
> sorts to say nothing of other users on the system). That means we can
> merge about 1,000 runs in the final merge. Each run will be about 2GB
> currently but 1GB if we quicksort the runs. So the largest table we
> can sort in a single pass is 1-2 TB.
>
> If we go above those limits we have the choice of buffering less per
> run or doing a whole second pass through the data.

If we only go slightly above the limits, it is much more graceful.  It
will happily do a 3 way merge followed by a 1023 way final merge (or
something like that) so only 0.3 percent of the data needs a second
pass, not all of it.  Of course, by the time you get a factor of 2 over
the limit, you are making an entire second pass one way or another.
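(The back-of-envelope math above can be sketched as a tiny model. This is a simplified, hypothetical calculation -- whole merge passes only, with a fixed 1 MB per-run buffer -- not tuplesort's actual logic, which, as noted above, can merge a small subset of runs first so that only a fraction of the data is read twice:)

```python
# Simplified model of external-sort merge passes (illustrative only):
# assumes work_mem-sized sorted runs and a 1 MB buffer per input run
# during each merge.  Real tuplesort can merge a small subset of runs
# first, so only part of the data takes an extra pass.
def merge_passes(input_bytes, work_mem, run_buffer=1 << 20):
    runs = -(-input_bytes // work_mem)        # ceil division: initial runs
    fan_in = max(2, work_mem // run_buffer)   # runs mergeable at once
    passes = 0
    while runs > 1:
        passes += 1
        runs = -(-runs // fan_in)             # runs left after this pass
    return passes
```

With 1 GB of work_mem the fan-in is about 1,000, so roughly 1 TB of 1 GB quicksorted runs fits in a single merge pass, matching the estimate above; doubling the input forces a second pass in this model.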

Cheers,

Jeff




Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers

2015-11-28 Thread Jeff Janes
On Thu, Nov 26, 2015 at 11:32 PM, Amit Kapila  wrote:
> On Tue, Nov 17, 2015 at 6:30 PM, Simon Riggs  wrote:
>>
>> On 17 November 2015 at 11:48, Amit Kapila  wrote:
>>>
>>>
>>> I think in that case what we can do is: if the total number of
>>> subtransactions is less than or equal to 64 (we can find that via the
>>> overflowed flag in PGXact), then apply this optimisation, else use
>>> the existing flow to update the transaction status.  I think for that
>>> we don't even need to reserve any additional memory.  Does that sound
>>> sensible to you?
>>
>>
>> I understand you to mean that the leader should look backwards through the
>> queue collecting xids while !(PGXACT->overflowed)
>>
>> No additional shmem is required
>>
>
> Okay, as discussed I have handled the case of sub-transactions without
> additional shmem in the attached patch.  Apart from that, I have tried
> to apply this optimization to prepared transactions as well, but as
> the dummy proc used for such transactions doesn't have a semaphore like
> backend procs do, it is not possible to use such a proc in a group
> status update, since each group member needs to wait on a semaphore.
> It is not that difficult to add support for that case if we are okay
> with creating an additional semaphore for each such dummy proc, which I
> was not sure about, so I have left it for now.

Is this proposal instead of, or in addition to, the original thread
topic of increasing clog buffers to 64?

Thanks,

Jeff




Re: [HACKERS] Freeze avoidance of very large table.

2015-11-28 Thread Jeff Janes
On Tue, Nov 24, 2015 at 3:13 PM, Masahiko Sawada  wrote:
>
> Yeah, we need to consider to compute checksum if enabled.
> I've changed the patch, and attached.
> Please review it.

Thanks for the update.  This now conflicts with the updates done to
fix the pg_upgrade out-of-space issue on Windows.  I've fixed (I think) the
conflict in order to do some testing, but I'd like to get an updated
patch from the author in case I did it wrong.  I don't want to find
bugs that I just introduced myself.

Thanks,

Jeff




Re: [HACKERS] Errors in our encoding conversion tables

2015-11-28 Thread Tom Lane
I wrote:
> There's a discussion over at
> http://www.postgresql.org/message-id/flat/2sa.dhu5.1hk1yrptnfy.1ml...@seznam.cz
> of an apparent error in our WIN1250 -> LATIN2 conversion.

Attached is an updated patch (against today's HEAD) showing proposed
changes to bring cyrillic_and_mic.c and latin2_and_win1250.c into sync
with the Unicode Consortium's conversion data.

In addition, I've attached the C program I used to generate the proposed
new conversion tables from the Unicode/*.map files, a simple SQL script
to print out the conversion behavior for the affected conversions, and
a diff of the script's output between 9.5 and the proposed patch.

While the changes in the WIN1250 <-> LATIN2 conversions just amount to
removal of some translations that seem to have no basis in reality, the
changes in the Cyrillic mappings are quite a bit more extensive.  It would
be good if we could get those checked by some native Russian speakers.

regards, tom lane
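(For reference, the approach of the attached generator -- pivoting each byte through Unicode using the consortium's mapping data -- can be sketched with Python's built-in codecs. This is an illustration of the idea, not the actual C program from the thread:)

```python
# Build a 128-entry conversion table for bytes 0x80..0xFF by pivoting
# through Unicode, mirroring the idea behind the C table generator.
# A 0x00 entry means "no counterpart", as in the C arrays above.
def build_table(src_enc, dst_enc):
    table = [0] * 128
    for b in range(0x80, 0x100):
        try:
            ch = bytes([b]).decode(src_enc)          # byte -> Unicode
            table[b - 0x80] = ch.encode(dst_enc)[0]  # Unicode -> byte
        except UnicodeError:
            pass                                     # leave 0x00
    return table

koi2iso = build_table("koi8_r", "iso8859_5")
# e.g. KOI8-R 0xB3 (U+0401, CYRILLIC CAPITAL LETTER IO) maps to 0xA1
# in ISO 8859-5, while KOI8-R box-drawing bytes have no counterpart.
```

Discrepancies between such a Unicode-derived table and the hand-maintained C arrays are exactly the kind of thing the patch corrects.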

diff --git a/src/backend/utils/mb/conversion_procs/cyrillic_and_mic/cyrillic_and_mic.c b/src/backend/utils/mb/conversion_procs/cyrillic_and_mic/cyrillic_and_mic.c
index 1847287..539c3d3 100644
*** a/src/backend/utils/mb/conversion_procs/cyrillic_and_mic/cyrillic_and_mic.c
--- b/src/backend/utils/mb/conversion_procs/cyrillic_and_mic/cyrillic_and_mic.c
*** static const unsigned char iso2koi[] = {
*** 65,71 
  	0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
  	0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
  	0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
! 	0x00, 0xB3, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
  	0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
  	0xE1, 0xE2, 0xF7, 0xE7, 0xE4, 0xE5, 0xF6, 0xFA,
  	0xE9, 0xEA, 0xEB, 0xEC, 0xED, 0xEE, 0xEF, 0xF0,
--- 65,71 
  	0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
  	0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
  	0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
! 	0x9A, 0xB3, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
  	0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
  	0xE1, 0xE2, 0xF7, 0xE7, 0xE4, 0xE5, 0xF6, 0xFA,
  	0xE9, 0xEA, 0xEB, 0xEC, 0xED, 0xEE, 0xEF, 0xF0,
*** static const unsigned char koi2iso[] = {
*** 84,90 
  	0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
  	0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
  	0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
! 	0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
  	0x00, 0x00, 0x00, 0xF1, 0x00, 0x00, 0x00, 0x00,
  	0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
  	0x00, 0x00, 0x00, 0xA1, 0x00, 0x00, 0x00, 0x00,
--- 84,90 
  	0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
  	0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
  	0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
! 	0x00, 0x00, 0xA0, 0x00, 0x00, 0x00, 0x00, 0x00,
  	0x00, 0x00, 0x00, 0xF1, 0x00, 0x00, 0x00, 0x00,
  	0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
  	0x00, 0x00, 0x00, 0xA1, 0x00, 0x00, 0x00, 0x00,
*** static const unsigned char win12512koi[]
*** 105,114 
  	0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
  	0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
  	0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
! 	0x00, 0x00, 0x00, 0x00, 0x00, 0xBD, 0x00, 0x00,
! 	0xB3, 0x00, 0xB4, 0x00, 0x00, 0x00, 0x00, 0xB7,
! 	0x00, 0x00, 0xB6, 0xA6, 0xAD, 0x00, 0x00, 0x00,
! 	0xA3, 0x00, 0xA4, 0x00, 0x00, 0x00, 0x00, 0xA7,
  	0xE1, 0xE2, 0xF7, 0xE7, 0xE4, 0xE5, 0xF6, 0xFA,
  	0xE9, 0xEA, 0xEB, 0xEC, 0xED, 0xEE, 0xEF, 0xF0,
  	0xF2, 0xF3, 0xF4, 0xF5, 0xE6, 0xE8, 0xE3, 0xFE,
--- 105,114 
  	0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
  	0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
  	0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
! 	0x9A, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
! 	0xB3, 0xBF, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
! 	0x9C, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x9E,
! 	0xA3, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
  	0xE1, 0xE2, 0xF7, 0xE7, 0xE4, 0xE5, 0xF6, 0xFA,
  	0xE9, 0xEA, 0xEB, 0xEC, 0xED, 0xEE, 0xEF, 0xF0,
  	0xF2, 0xF3, 0xF4, 0xF5, 0xE6, 0xE8, 0xE3, 0xFE,
*** static const unsigned char koi2win1251[]
*** 124,134 
  	0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
  	0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
  	0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
  	0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
! 	0x00, 0x00, 0x00, 0xB8, 0xBA, 0x00, 0xB3, 0xBF,
! 	0x00, 0x00, 0x00, 0x00, 0x00, 0xB4, 0x00, 0x00,
! 	0x00, 0x00, 0x00, 0xA8, 0xAA, 0x00, 0xB2, 0xAF,
! 	0x00, 0x00, 0x00, 0x00, 0x00, 0xA5, 0x00, 0x00,
  	0xFE, 0xE0, 0xE1, 0xF6, 0xE4, 0xE5, 0xF4, 0xE3,
  	0xF5, 0xE8, 0xE9, 0xEA, 0xEB, 0xEC, 0xED, 0xEE,
  	0xEF, 0xFF, 0xF0, 0xF1, 0xF2, 0xF3, 0xE6, 0xE2,
--- 124,134 
  	0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
  	0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
  	0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
+ 	0x00, 0x00, 0xA0, 0x00, 0xB0, 0x00, 0xB7, 0x00,
+ 	0x00, 0x00, 0x00, 0xB8, 0x00, 0x00, 0x00, 0x00,
  	0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 

Re: [HACKERS] Using quicksort for every external sort run

2015-11-28 Thread Jeff Janes
On Wed, Nov 18, 2015 at 3:29 PM, Peter Geoghegan  wrote:
> On Wed, Nov 18, 2015 at 10:31 AM, Jeff Janes  wrote:
>
>> I agree we don't want to optimize for low memory, but I don't think we
>> should throw it under the bus, either.  Right now we are effectively
>> saying the CPU-cache problems with the heap start exceeding the larger
>> run size benefits at 64kb (the smallest allowed setting for work_mem).
>> While any number we pick is going to be a guess that won't apply to
>> all hardware, surely we can come up with a guess better than 64kb.
>> Like, 8 MB, say.  If available memory for the sort is 8MB or smaller
>> and the predicted size anticipates a multipass merge, then we can use
>> the heap method rather than the quicksort method.  Would a rule like
>> that complicate things much?
>
> I'm already using replacement selection for the first run when it is
> predicted by my new ad-hoc cost model that we can get away with a
> "quicksort with spillover", avoiding almost all I/O. We only
> incrementally spill as many tuples as needed right now, but it would
> be pretty easy to not quicksort the remaining tuples, but continue to
> incrementally spill everything. So no, it wouldn't be too hard to hang
> on to the old behavior sometimes, if it looked worthwhile.
>
> In principle, I have no problem with doing that. Through testing, I
> cannot see any actual upside, though. Perhaps I just missed something.
> Even 8MB is enough to avoid the multipass merge in the event of a
> surprisingly high volume of data (my work laptop is elsewhere, so I
> don't have my notes on this in front of me, but I figured out the
> crossover point for a couple of cases).

For me very large sorts (100,000,000 ints) with work_mem below 4MB do
better with unpatched than with your patch series, by about 5%.  Not a
big deal, but also if it is easy to keep the old behavior then I think
we should.  Yes, it is dumb to do large sorts with work_mem below 4MB,
but if you have canned apps which do a mixture of workloads it is not
so easy to micromanage their work_mem.  Especially as there are no
easy tools that let me as the DBA say "if you connect from this IP
address, you get this work_mem".

I didn't collect trace_sort on those ones because of the high volume
it would generate.


>
>>> In theory, the answer could be "yes", but it seems highly unlikely.
>>> Not only is very little memory required to avoid a multi-pass merge
>>> step, but as described above the amount required grows very slowly
>>> relative to linear growth in input. I propose to add a
>>> checkpoint_warning style warning (with a checkpoint_warning style GUC
>>> to control it).
>>
>> I'm skeptical about a warning for this.
>
> Other systems expose this explicitly, and, as I said, say in an
> unqualified way that a multi-pass merge should be avoided. Maybe the
> warning isn't the right way of communicating that message to the DBA
> in detail, but I am confident that it ought to be communicated to the
> DBA fairly clearly.

I've been thinking about how many other places in the code could justify
a similar type of warning, "If you just gave me 15% more memory, this
hash join would be much faster", and what that would make the logs
look like if future work followed this precedent.  If there
were some mechanism to put the warning in a system view counter
instead of the log file, that would be much cleaner.  Or a way to
separate the server log file into streams.  But since we don't have
those, I guess I can't really object much to the proposed behavior.

>
>> One idea would be to stop and write out a just-sorted partition
>> whenever that partition is contiguous to the already-written portion.
>> If the qsort is tweaked to recurse preferentially into the left
>> partition first, this would result in tuples being written out at a
>> pretty study pace.  If the qsort was unbalanced and the left partition
>> was always the larger of the two, then that approach would have to be
>> abandoned at some point.  But I think there are already defenses
>> against that, and at worst you would give up and revert to the
>> sort-them-all then write-them-all behavior.
>
> Seems kind of invasive.

I agree, but I wonder if it won't become much more important at 30GB
of work_mem.  Of course if there is no reason to ever set work_mem
that high, then it wouldn't matter--but there is always a reason to do
so, if you have so much memory to spare.  So better than that invasive
work, I guess would be to make sort use less than work_mem if it gets
no benefit from using all of it.  Anyway, ideas for future work,
either way.

>
>> Overall this is very nice.  Doing some real world index builds of
>> short text (~20 bytes ascii) identifiers, I could easily get speed ups
>> of 40% with your patch if I followed the philosophy of "give it as
>> much maintenance_work_mem as I can afford".  If I fine-tuned the
>> maintenance_work_mem so that it was optimal for each sort method, 

Re: [HACKERS] Using quicksort for every external sort run

2015-11-28 Thread Peter Geoghegan
On Sat, Nov 28, 2015 at 2:04 PM, Jeff Janes  wrote:
> For me very large sorts (100,000,000 ints) with work_mem below 4MB do
> better with unpatched than with your patch series, by about 5%.  Not a
> big deal, but also if it is easy to keep the old behavior then I think
> we should.  Yes, it is dumb to do large sorts with work_mem below 4MB,
> but if you have canned apps which do a mixture of workloads it is not
> so easy to micromanage their work_mem.  Especially as there are no
> easy tools that let me as the DBA say "if you connect from this IP
> address, you get this work_mem".

I'm not very concerned about a regression that is only seen when
work_mem is set below the (very conservative) postgresql.conf default
value of 4MB when sorting 100 million integers. Thank you for
characterizing the regression, though -- it's good to have a better
idea of how much of a problem that is in practice.

I can still preserve the old behavior with a GUC, but it isn't
completely trivial, and I don't want to complicate things any further
without a real benefit, which I still don't see. I'm still using a
replacement selection style heap, and I think that there will be
future uses for the heap (e.g. dynamic duplicate removal within
tuplesort), though.

>> Other systems expose this explicitly, and, as I said, say in an
>> unqualified way that a multi-pass merge should be avoided. Maybe the
>> warning isn't the right way of communicating that message to the DBA
>> in detail, but I am confident that it ought to be communicated to the
>> DBA fairly clearly.
>
> I've been thinking about how many other places in the code could justify
> a similar type of warning, "If you just gave me 15% more memory, this
> hash join would be much faster", and what that would make the logs
> look like if future work followed this precedent.  If there
> were some mechanism to put the warning in a system view counter
> instead of the log file, that would be much cleaner.  Or a way to
> separate the server log file into streams.  But since we don't have
> those, I guess I can't really object much to the proposed behavior.

I'm going to let this go, actually. Not because I don't think that
avoiding a multi-pass sort is a good goal for DBAs to have, but
because a multi-pass sort doesn't appear to be a point at which
performance tanks these days, with modern block devices. Also, I just
don't have time to push something non-essential that there is
resistance to.

>>> One idea would be to stop and write out a just-sorted partition
>>> whenever that partition is contiguous to the already-written portion.
>>> If the qsort is tweaked to recurse preferentially into the left
>>> partition first, this would result in tuples being written out at a
>>> pretty steady pace.  If the qsort was unbalanced and the left partition
>>> was always the larger of the two, then that approach would have to be
>>> abandoned at some point.  But I think there are already defenses
>>> against that, and at worst you would give up and revert to the
>>> sort-them-all then write-them-all behavior.
>>
>> Seems kind of invasive.
>
> I agree, but I wonder if it won't become much more important at 30GB
> of work_mem.  Of course if there is no reason to ever set work_mem
> that high, then it wouldn't matter--but there is always a reason to do
> so, if you have so much memory to spare.  So better than that invasive
> work, I guess would be to make sort use less than work_mem if it gets
> no benefit from using all of it.  Anyway, ideas for future work,
> either way.

I hope to come up with a fairly robust model for automatically sizing
an "effective work_mem" in the context of external sorts. There should
be a heuristic that balances fan-in against other considerations. I
think that doing this with the existing external sort code would be
completely hopeless. This is a problem that is well understood by the
research community, although balancing things well in the context of
PostgreSQL is a little trickier.

I also think it's a little arbitrary that the final on-the-fly merge
step uses a work_mem-ish sized buffer, much like the sorting of runs,
as if there is a good reason to be consistent. Maybe that's fine,
though.

There are advantages to returning tuples earlier in the context of
parallelism, which recommends smaller effective work_mem sizes
(provided they're above a certain threshold). For this reason, having
larger runs may not be a useful goal in general, even without
considering the cost in cache misses paid in pursuit of that goal.

>> Thanks, but I expected better than that. Was it a collated text
>> column? The C collation will put the patch in a much better light
>> (more strcoll() calls are needed with this new approach -- it's still
>> well worth it, but it is a downside that makes collated text not
>> especially sympathetic). Just sorting on an integer attribute is also
>> a good sympathetic case, FWIW.
>
> It was UTF8 encoded (although all characters 

Re: [HACKERS] WIP: Make timestamptz_out less slow.

2015-11-28 Thread Peter Geoghegan
On Sun, Nov 22, 2015 at 2:20 AM, David Rowley
 wrote:
> Just to confirm, you mean this comment?
>
> int tm_year; /* relative to 1900 */
>
> Please let me know if you disagree, but I'm not sure it's the business of
> this patch to fix that. If it's wrong now, then it was wrong before my
> patch, so it should be a separate patch which fixes it.
>
> At this stage I don't quite know what the fix should be, whether it's doing
> tm->tm_year -= 1900; in timestamp2tm() after the j2date() call, or if it's
> just removing the misleading comment.
>
> I also don't quite understand why we bother having it relative to 1900 and
> not just base it on 0.

That's fair. I defer to the judgement of the committer here.

> Is there anything else you see that's pending before it can be marked as
> ready for committer?

Can't think of any reason not to. It's been marked "ready for committer".

Thanks

-- 
Peter Geoghegan




[HACKERS] Re: Multixact slru doesn't don't force WAL flushes in SlruPhysicalWritePage()

2015-11-28 Thread Noah Misch
On Tue, Nov 10, 2015 at 11:22:47PM -0500, Noah Misch wrote:
> On Mon, Nov 09, 2015 at 10:40:07PM +0100, Andres Freund wrote:
> > /*
> >  * Optional array of WAL flush LSNs associated with entries in the SLRU
> >  * pages.  If not zero/NULL, we must flush WAL before writing pages (true
> >  * for pg_clog, false for multixact, pg_subtrans, pg_notify).  group_lsn[]
> >  * has lsn_groups_per_page entries per buffer slot, each containing the
> >  * highest LSN known for a contiguous group of SLRU entries on that slot's
> >  * page.
> >  */
> > XLogRecPtr *group_lsn;
> > int lsn_groups_per_page;
> >

> Here's the multixact.c comment justifying it:
> 
>  * XLOG interactions: this module generates an XLOG record whenever a new
>  * OFFSETs or MEMBERs page is initialized to zeroes, as well as an XLOG record
>  * whenever a new MultiXactId is defined.  This allows us to completely
>  * rebuild the data entered since the last checkpoint during XLOG replay.
>  * Because this is possible, we need not follow the normal rule of
>  * "write WAL before data"; the only correctness guarantee needed is that
>  * we flush and sync all dirty OFFSETs and MEMBERs pages to disk before a
>  * checkpoint is considered complete.  If a page does make it to disk ahead
>  * of corresponding WAL records, it will be forcibly zeroed before use anyway.
>  * Therefore, we don't need to mark our pages with LSN information; we have
>  * enough synchronization already.
> 
> The comment's justification is incomplete, though.  What of pages filled over
> the course of multiple checkpoint cycles?

Having studied this further, I think it is safe for a reason given elsewhere:

 * Note: we need not flush this XLOG entry to disk before proceeding.  The
 * only way for the MXID to be referenced from any data page is for
 * heap_lock_tuple() to have put it there, and heap_lock_tuple() generates
 * an XLOG record that must follow ours.  The normal LSN interlock between
 * the data page and that XLOG record will ensure that our XLOG record
 * reaches disk first.  If the SLRU members/offsets data reaches disk
 * sooner than the XLOG record, we do not care because we'll overwrite it
 * with zeroes unless the XLOG record is there too; see notes at top of
 * this file.

I find no flaw in the first three sentences.  In most cases, one of
multixact_redo(), TrimMultiXact() or ExtendMultiXactOffset() will indeed
overwrite the widowed mxid data with zeroes before the system again writes
data in that vicinity.  We can fail to do that if a backend stalls just after
calling GetNewMultiXactId(), then recovers and updates SLRU pages long after
other backends filled the balance of those pages.  That's still okay; if we
never flush the XLOG_MULTIXACT_CREATE_ID record, no xmax referring to the mxid
will survive recovery.  I drafted the attached comment update to consolidate
and slightly correct this justification.  (If we were designing the multixact
subsystem afresh today, I'd vote for following the write-WAL-before-data rule
despite believing it is not needed.  The simplicity would be worth it.)

While contemplating the "stalls ... just after calling GetNewMultiXactId()"
scenario, I noticed a race condition not involving any write-WAL-before-data
violation.  MultiXactIdCreateFromMembers() and callees have this skeleton:

GetNewMultiXactId
  LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE)
  ExtendMultiXactOffset()
  ExtendMultiXactMember()
  START_CRIT_SECTION()
  (MultiXactState->nextMXact)++
  MultiXactState->nextOffset += nmembers
  LWLockRelease(MultiXactGenLock)
XLogInsert(RM_MULTIXACT_ID, XLOG_MULTIXACT_CREATE_ID)
RecordNewMultiXact
  LWLockAcquire(MultiXactOffsetControlLock, LW_EXCLUSIVE)
  *offptr = offset;  /* update pg_multixact/offsets SLRU page */
  LWLockRelease(MultiXactOffsetControlLock)
  LWLockAcquire(MultiXactMemberControlLock, LW_EXCLUSIVE)
  *memberptr = members[i].xid;  /* update pg_multixact/members SLRU page */
  *flagsptr = flagsval;  /* update pg_multixact/members SLRU page */
  LWLockRelease(MultiXactMemberControlLock)
END_CRIT_SECTION

Between GetNewMultiXactId() and XLogInsert(), other backends will be free to
create higher-numbered mxids.  They will skip over SLRU space the current
backend reserved.  Future GetMultiXactIdMembers() calls rely critically on the
current backend eventually finishing:

 * 2. The next multixact may still be in process of being filled in: that
 * is, another process may have done GetNewMultiXactId but not yet written
 * the offset entry for that ID.  In that scenario, it is guaranteed that
 * the offset entry for that multixact exists (because GetNewMultiXactId
 * won't release MultiXactGenLock until it does) but contains zero
 * (because we are careful to pre-zero offset pages). Because
 * GetNewMultiXactId 

Re: [HACKERS] proposal: PL/Pythonu - function ereport

2015-11-28 Thread Pavel Stehule
2015-11-28 13:11 GMT+01:00 Michael Paquier :

> On Sat, Nov 28, 2015 at 1:57 AM, Pavel Stehule 
> wrote:
> >
> >
> > 2015-11-27 17:54 GMT+01:00 Teodor Sigaev :
> >>
> >> Is this patch in 'Waiting on Author' state actually?
>
> I am marking it as returned with feedback for this CF then.
>

Please, can you revert it?

I thought that Peter's request would require some massive change, but I
only had to change a few lines.

So this patch is still ready for review - at least for Peter's review.

Regards

Pavel


> --
> Michael
>