[HACKERS] git down

2017-10-27 Thread Erik Rijkers

git.postgresql.org is down/unreachable

( git://git.postgresql.org/git/postgresql.git )





[HACKERS] v10 bottom-listed

2017-10-05 Thread Erik Rijkers

In the 'ftp' listing, v10 appears at the bottom:
  https://www.postgresql.org/ftp/source/

With all the other v10* directories at the top, we could get a lot of 
people installing the wrong binaries...


Maybe it can be fixed so that it appears at the top.


Thanks,

Erik Rijkers






[HACKERS] comments improvements

2017-09-24 Thread Erik Rijkers

comments improvements

--- src/backend/optimizer/prep/prepunion.c.orig	2017-09-24 17:40:34.888790877 +0200
+++ src/backend/optimizer/prep/prepunion.c	2017-09-24 17:41:39.796748743 +0200
@@ -2413,7 +2413,7 @@
  * 		Find AppendRelInfo structures for all relations specified by relids.
  *
  * The AppendRelInfos are returned in an array, which can be pfree'd by the
- * caller. *nappinfos is set to the the number of entries in the array.
+ * caller. *nappinfos is set to the number of entries in the array.
  */
 AppendRelInfo **
 find_appinfos_by_relids(PlannerInfo *root, Relids relids, int *nappinfos)
--- src/test/regress/sql/triggers.sql.orig	2017-09-24 17:40:45.760783805 +0200
+++ src/test/regress/sql/triggers.sql	2017-09-24 17:41:33.448752854 +0200
@@ -1409,7 +1409,7 @@
 --
 -- Verify behavior of statement triggers on partition hierarchy with
 -- transition tables.  Tuples should appear to each trigger in the
--- format of the the relation the trigger is attached to.
+-- format of the relation the trigger is attached to.
 --
 
 -- set up a partition hierarchy with some different TupleDescriptors



Re: [HACKERS] Automatic testing of patches in commit fest

2017-09-11 Thread Erik Rijkers

On 2017-09-11 02:12, Thomas Munro wrote:

On Mon, Sep 11, 2017 at 11:40 AM, Michael Paquier
<michael.paqu...@gmail.com> wrote:

Thomas Munro has hacked up a prototype of application testing
automatically if patches submitted apply and build:
http://commitfest.cputube.org/


I should add: this is a spare-time effort, a work-in-progress and
building on top of a bunch of hairy web scraping, so it may take some
time to perfect.


It would be great if one of the intermediary products of this effort 
could be made available too, namely, a list of latest patches.


Or perhaps such a list should come out of the commitfest app.

For me, such a list would be even more useful than any subsequently 
processed results.


thanks,

Erik Rijkers





Re: [HACKERS] psql: new help related to variables are not too readable

2017-09-07 Thread Erik Rijkers

On 2017-09-08 06:09, Pavel Stehule wrote:

Hi

Now the output looks like:

  AUTOCOMMIT
if set, successful SQL commands are automatically committed
  COMP_KEYWORD_CASE
determines the case used to complete SQL key words
[lower, upper, preserve-lower, preserve-upper]
  DBNAME
the currently connected database name

[...]

What do you think about using new line between entries in this format?

  AUTOCOMMIT
if set, successful SQL commands are automatically committed

  COMP_KEYWORD_CASE
determines the case used to complete SQL key words
[lower, upper, preserve-lower, preserve-upper]

  DBNAME
the currently connected database name



I dislike it; it takes more screen space and leads to unnecessary 
scrolling.


The 9.6.5 formatting is/was:

  AUTOCOMMIT if set, successful SQL commands are automatically 
committed

  COMP_KEYWORD_CASE  determines the case used to complete SQL key words
 [lower, upper, preserve-lower, preserve-upper]
  DBNAME the currently connected database name
[...]
  PGPASSWORD connection password (not recommended)
  PGPASSFILE password file name
  PSQL_EDITOR, EDITOR, VISUAL
 editor used by the \e, \ef, and \ev commands
  PSQL_EDITOR_LINENUMBER_ARG
 how to specify a line number when invoking the 
editor

  PSQL_HISTORY   alternative location for the command history file

I would prefer to revert to that more compact 9.6-formatting.


Erik Rijkers




[HACKERS] adding the commit to a patch's thread

2017-08-31 Thread Erik Rijkers
At the moment it's not easy to find the commit that terminates a 
commitfest thread about a patch.  One has to manually compare dates and 
guess what belongs to what.  The commit message nowadays often has a 
link to the thread ("Discussion"), but the link in the other direction is 
often not so easy to find.


For example: looking at

  https://commitfest.postgresql.org/14/1020/

One cannot directly find the actual commit that finished it.

Would it be possible to change the commitfest app a bit so that the 
commit (or commit message, or hash) can be added to the thread?  I would 
think it best that when the thread is set to state 'committed', the 
actual commit hash is added somewhere at the same time.



thanks,

Erik Rijkers





[HACKERS] changed column-count breaks pdf build

2017-08-17 Thread Erik Rijkers


The feature matrix table in high-availability.sgml had a column added, so 
the column count should also be increased (patch attached).


thanks,

Erik Rijkers

--- doc/src/sgml/high-availability.sgml.orig	2017-08-17 15:04:32.535819637 +0200
+++ doc/src/sgml/high-availability.sgml	2017-08-17 15:04:46.528122345 +0200
@@ -301,7 +301,7 @@
 
  
   High Availability, Load Balancing, and Replication Feature Matrix
-  
+  

 
  Feature



Re: [HACKERS] parallel documentation improvements

2017-08-01 Thread Erik Rijkers

On 2017-08-01 20:43, Robert Haas wrote:


In commit 054637d2e08cda6a096f48cc99696136a06f4ef5, I updated the
parallel query documentation to reflect recently-committed parallel

Barring objections, I'd like to commit this in the next couple of days


I think that in this bit:

 occurrence is frequent, considering increasing
 max_worker_processes and 
max_parallel_workers
 so that more workers can be run simultaneously or alternatively 
reducing
- so that the 
planner
+max_parallel_workers_per_gather so that the 
planner

 requests fewer workers.


'considering increasing'  should be
'consider increasing'






Re: [HACKERS] GSoC 2017: Foreign Key Arrays

2017-07-28 Thread Erik Rijkers

On 2017-07-27 21:08, Mark Rofail wrote:

On Thu, Jul 27, 2017 at 7:15 PM, Erik Rijkers <e...@xs4all.nl> wrote:


It would help (me at least) if you could be more explicit about what
exactly each instance is.



I apologize, I thought it was clear through the context.


Thanks a lot.  It just makes things really easy for testers like me who 
aren't following a thread too closely and just snatch a half hour here 
and there to look into a feature/patch.



One small thing while building docs:

$  cd doc/src/sgml && make html
osx -wall -wno-unused-param -wno-empty -wfully-tagged -D . -D . -x lower 
postgres.sgml >postgres.xml.tmp
osx:ref/create_table.sgml:960:100:E: document type does not allow 
element "VARLISTENTRY" here

Makefile:147: recipe for target 'postgres.xml' failed
make: *** [postgres.xml] Error 1

(Debian 8/jessie)


thanks,


Erik Rijkers





Re: [HACKERS] GSoC 2017: Foreign Key Arrays

2017-07-27 Thread Erik Rijkers

On 2017-07-27 02:31, Mark Rofail wrote:

I have written some benchmark test.



It would help (me at least) if you could be more explicit about what 
exactly each instance is.


Apparently there is an 'original patch': is this the original patch by 
Marco Nenciarini?

Or is it something you posted earlier?

I guess it could be distilled from the earlier posts but when I looked 
those over yesterday evening I still didn't get it.


A link to the post where the 'original patch' is would be ideal...

thanks!

Erik Rijkers



With two tables a PK table with 5 rows and an FK table with growing row
count.






Once triggering an RI check
at 10 rows,
100 rows,
1,000 rows,
10,000 rows,
100,000 rows and
1,000,000 rows

Please find the graph with the findings attached below




Re: [HACKERS] GSoC 2017: Foreign Key Arrays

2017-07-24 Thread Erik Rijkers

On 2017-07-24 23:31, Mark Rofail wrote:

On Mon, Jul 24, 2017 at 11:25 PM, Erik Rijkers <e...@xs4all.nl> wrote:


This patch doesn't apply to HEAD at the moment ( e2c8100e6072936 ).



My bad, I should have mentioned that the patch is dependent on the 
original patch.
Here is a *unified* patch that I just tested.


Thanks.  Apply is now good, but I get this error when compiling:

ELEMENT' not present in UNRESERVED_KEYWORD section of gram.y
make[4]: *** [gram.c] Error 1
make[3]: *** [parser/gram.h] Error 2
make[2]: *** [../../src/include/parser/gram.h] Error 2
make[1]: *** [all-common-recurse] Error 2
make: *** [all-src-recurse] Error 2







Re: [HACKERS] GSoC 2017: Foreign Key Arrays

2017-07-24 Thread Erik Rijkers

On 2017-07-24 23:08, Mark Rofail wrote:
Here is the new Patch with the bug fixes and the New Patch with the Index
in place performance results.

I just want to point this out because I still can't believe the numbers.
In reference to the old patch:
The new patch without the index suffers a 41.68% slow down, while the new
patch with the index has a 95.18% speed up!



[elemOperatorV4.patch]


This patch doesn't apply to HEAD at the moment ( e2c8100e6072936 ).

Can you have a look?

thanks,

Erik Rijkers




patching file doc/src/sgml/ref/create_table.sgml
Hunk #1 succeeded at 816 with fuzz 3.
patching file src/backend/access/gin/ginarrayproc.c
patching file src/backend/utils/adt/arrayfuncs.c
patching file src/backend/utils/adt/ri_triggers.c
Hunk #1 FAILED at 2650.
Hunk #2 FAILED at 2694.
2 out of 2 hunks FAILED -- saving rejects to file 
src/backend/utils/adt/ri_triggers.c.rej

patching file src/include/catalog/pg_amop.h
patching file src/include/catalog/pg_operator.h
patching file src/include/catalog/pg_proc.h
patching file src/test/regress/expected/arrays.out
patching file src/test/regress/expected/opr_sanity.out
patching file src/test/regress/sql/arrays.sql





[HACKERS] PDF content lemma subdivision

2017-07-08 Thread Erik Rijkers
The PDF version of the documentation has a contents 'frame' displayed on 
the left-hand side (I'm viewing with okular; I assume it will be 
similar in most viewers).


That contents pane displays a treeview down to the main entries/lemmata, 
like 'CREATE TABLE'.  It doesn't go any deeper anymore.


There used to be a further subdivision in that left-hand subtree: Name, 
Synopsis, Description, Parameters, Notes, Examples, See Also (and an even 
further, finer subdivision).  These would all be clickable links 
straight into the lemma itself.


(Especially 'Examples' was a handy jump to have, IMHO - sometimes it 
saved many pages of scrolling)


I noticed today that all these lower level subdivisions are gone.  Was 
that deliberate or an accident?


If it's at all possible I would like to see these subdivisions 
reinstated, so that navigating via the content-tree becomes that much 
easier again.


(By the way (unrelated), I also noticed only today that the new process 
now wraps many of the too-long-lines; lines that were previously 
unceremoniously cut off in 'mid-sentence'.  That wrapping, although not 
always pretty, is a really useful improvement.)



thanks,

Erik Rijkers





Re: [HACKERS] logical replication - still unstable after all these months

2017-06-17 Thread Erik Rijkers

On 2017-06-18 00:27, Peter Eisentraut wrote:

On 6/17/17 06:48, Erik Rijkers wrote:

On 2017-05-28 12:44, Erik Rijkers wrote:

re: srsubstate in pg_subscription_rel:


No idea what it means.  At the very least this value 'w' is missing
from the documentation, which only mentions:
  i = initialize
  d = data copy
  s = synchronized
  r = (normal replication)


Shouldn't we add this to that table (51.53) in the documentation?

After all, the value 'w' does show up when you monitor
pg_subscription_rel.


It's not supposed to.  Have you seen it after
e3a815d2faa5be28551e71d5db44fb2c78133433?


Ah no, I haven't seen that 'w'-value after that (and 1000s of tests ran 
without error since then).


I just hadn't realized that the w-value I had reported was indeed an 
erroneous state.


thanks, this is OK then.

Erik Rijkers





Re: [HACKERS] logical replication - still unstable after all these months

2017-06-17 Thread Erik Rijkers

On 2017-05-28 12:44, Erik Rijkers wrote:

re: srsubstate in pg_subscription_rel:


No idea what it means.  At the very least this value 'w' is missing
from the documentation, which only mentions:
  i = initialize
  d = data copy
  s = synchronized
  r = (normal replication)


Shouldn't we add this to that table (51.53) in the documentation?

After all, the value 'w' does show up when you monitor 
pg_subscription_rel.











[HACKERS] tablesync.c - comment improvements

2017-06-10 Thread Erik Rijkers

tablesync.c - comment improvements

--- src/backend/replication/logical/tablesync.c.orig	2017-06-10 10:20:07.617662465 +0200
+++ src/backend/replication/logical/tablesync.c	2017-06-10 10:45:52.620514397 +0200
@@ -12,18 +12,18 @@
  *	  logical replication.
  *
  *	  The initial data synchronization is done separately for each table,
- *	  in separate apply worker that only fetches the initial snapshot data
- *	  from the publisher and then synchronizes the position in stream with
+ *	  in a separate apply worker that only fetches the initial snapshot data
+ *	  from the publisher and then synchronizes the position in the stream with
  *	  the main apply worker.
  *
- *	  The are several reasons for doing the synchronization this way:
+ *	  There are several reasons for doing the synchronization this way:
  *	   - It allows us to parallelize the initial data synchronization
  *		 which lowers the time needed for it to happen.
  *	   - The initial synchronization does not have to hold the xid and LSN
  *		 for the time it takes to copy data of all tables, causing less
  *		 bloat and lower disk consumption compared to doing the
- *		 synchronization in single process for whole database.
- *	   - It allows us to synchronize the tables added after the initial
+ *		 synchronization in a single process for the whole database.
+ *	   - It allows us to synchronize any tables added after the initial
  *		 synchronization has finished.
  *
  *	  The stream position synchronization works in multiple steps.
@@ -37,7 +37,7 @@
  *		 read the stream and apply changes (acting like an apply worker) until
  *		 it catches up to the specified stream position.  Then it sets the
  *		 state to SYNCDONE.  There might be zero changes applied between
- *		 CATCHUP and SYNCDONE, because the sync worker might be ahead of the
+ *		 CATCHUP and SYNCDONE because the sync worker might be ahead of the
  *		 apply worker.
  *	   - Once the state was set to SYNCDONE, the apply will continue tracking
  *		 the table until it reaches the SYNCDONE stream position, at which
@@ -147,7 +147,7 @@
 }
 
 /*
- * Wait until the relation synchronization state is set in catalog to the
+ * Wait until the relation synchronization state is set in the catalog to the
  * expected one.
  *
  * Used when transitioning from CATCHUP state to SYNCDONE.
@@ -206,12 +206,12 @@
 }
 
 /*
- * Wait until the the apply worker changes the state of our synchronization
+ * Wait until the apply worker changes the state of our synchronization
  * worker to the expected one.
  *
  * Used when transitioning from SYNCWAIT state to CATCHUP.
  *
- * Returns false if the apply worker has disappeared or table state has been
+ * Returns false if the apply worker has disappeared or the table state has been
  * reset.
  */
 static bool
@@ -225,7 +225,7 @@
 
 		CHECK_FOR_INTERRUPTS();
 
-		/* Bail if he apply has died. */
+		/* Bail if the apply has died. */
 		LWLockAcquire(LogicalRepWorkerLock, LW_SHARED);
 		worker = logicalrep_worker_find(MyLogicalRepWorker->subid,
 		InvalidOid, false);
@@ -333,7 +333,7 @@
 
 	Assert(!IsTransactionState());
 
-	/* We need up to date sync state info for subscription tables here. */
+	/* We need up-to-date sync state info for subscription tables here. */
 	if (!table_states_valid)
 	{
 		MemoryContext oldctx;
@@ -365,7 +365,7 @@
 	}
 
 	/*
-	 * Prepare hash table for tracking last start times of workers, to avoid
+	 * Prepare a hash table for tracking last start times of workers, to avoid
 	 * immediate restarts.  We don't need it if there are no tables that need
 	 * syncing.
 	 */
@@ -401,7 +401,7 @@
 		{
 			/*
 			 * Apply has caught up to the position where the table sync has
-			 * finished.  Time to mark the table as ready so that apply will
+			 * finished.  Mark the table as ready so that apply will
 			 * just continue to replicate it normally.
 			 */
 			if (current_lsn >= rstate->lsn)
@@ -436,7 +436,7 @@
 			else
 
 /*
- * If no sync worker for this table yet, count running sync
+ * If there is no sync worker for this table yet, count running sync
  * workers for this subscription, while we have the lock, for
  * later.
  */
@@ -477,7 +477,7 @@
 
 			/*
 			 * If there is no sync worker registered for the table and there
-			 * is some free sync worker slot, start new sync worker for the
+			 * is some free sync worker slot, start a new sync worker for the
 			 * table.
 			 */
 			else if (!syncworker && nsyncworkers < max_sync_workers_per_subscription)
@@ -551,7 +551,7 @@
 	int			bytesread = 0;
 	int			avail;
 
-	/* If there are some leftover data from previous read, use them. */
+	/* If there are some leftover data from previous read, use it. */
 	avail = copybuf->len - copybuf->cursor;
 	if (avail)
 	{
@@ -694,7 +694,7 @@
 (errmsg("could not fetch table info for table \"%s.%s\": %s",
 		nspname, relname, res->err)));
 
-	/* We don't know number of rows coming, so allocate enough space. */
+	/* 

Re: [HACKERS] logical replication - possible remaining problem

2017-06-07 Thread Erik Rijkers

On 2017-06-07 23:18, Alvaro Herrera wrote:

Erik Rijkers wrote:

Now, looking at the script again I am thinking that it would be 
reasonable

to expect that after issuing
   delete from pg_subscription;

the other 2 tables are /also/ cleaned, automatically, as a 
consequence.  (Is

this reasonable? this is really the main question of this email).


I don't think it's reasonable to expect that the system recovers
automatically from what amounts to catalog corruption.  You should be
using the DDL that removes subscriptions instead.


You're right, that makes sense.
Thanks.
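
For the record, a DDL-based cleanup would look roughly like this (a 
sketch only; the subscription name 'sub1' and the $port2 variable are the 
ones used in my test scripts):

  # sketch: disable, detach the slot, then drop the subscription on the replica
  echo "alter subscription sub1 disable;
  alter subscription sub1 set (slot_name = none);
  drop subscription if exists sub1;" | psql -qXp $port2

Setting slot_name to none first means DROP SUBSCRIPTION will not try to 
drop the slot on the publisher if that side happens to be unreachable.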




Re: [HACKERS] Race conditions with WAL sender PID lookups

2017-06-07 Thread Erik Rijkers

On 2017-06-07 20:31, Robert Haas wrote:


[...]

[ Side note: Erik's report on this thread initially seemed to suggest
that we needed this patch to make logical decoding stable.  But my
impression is that this is belied by subsequent developments on other
threads, so my theory is that this patch was never really related to
the problem, but rather than by the time Erik got around to testing
this patch, other fixes had made the problems relatively rare, and the
apparently-improved results with this patch were just chance.  If that
theory is wrong, it would be good to hear about it. ]


Yes, agreed; I was probably mistaken.




[HACKERS] logical replication - possible remaining problem

2017-06-07 Thread Erik Rijkers
I am not sure whether what I found here amounts to a bug; I might be 
doing something dumb.


During the last few months I did tests by running pgbench over logical 
replication.  Earlier emails have details.


The basic form of that now works well (and the fix has been committed), 
but as I looked over my testing program I noticed one change I made to 
it, already many weeks ago:


In the cleanup during startup (pre-flight check you might say) and also 
before the end, instead of


  echo "delete from pg_subscription;" | psql -qXp $port2 -- (1)

I changed that (as I say, many weeks ago) to:

  echo "delete from pg_subscription;
delete from pg_subscription_rel;
delete from pg_replication_origin; " | psql -qXp $port2   -- (2)

This occurs (2x) inside the bash function clean_pubsub(), in the main 
test script pgbench_derail2.sh


This change was an effort to ensure arriving at a 'clean' start (and 
end) state which would always be the same.

All my more recent testing (and that of Mark, I have to assume) was thus 
done with (2).


Now, looking at the script again I am thinking that it would be 
reasonable to expect that after issuing

   delete from pg_subscription;

the other 2 tables are /also/ cleaned, automatically, as a consequence.  
(Is this reasonable? this is really the main question of this email).


So I removed the latter two delete statements again, and ran the tests 
again with the form in  (1)


I have established that (after a number of successful cycles) the test 
stops succeeding, with the replica log repeating:


2017-06-07 22:10:29.057 CEST [2421] LOG:  logical replication apply 
worker for subscription "sub1" has started
2017-06-07 22:10:29.057 CEST [2421] ERROR:  could not find free 
replication state slot for replication origin with OID 11
2017-06-07 22:10:29.057 CEST [2421] HINT:  Increase 
max_replication_slots and try again.
2017-06-07 22:10:29.058 CEST [2061] LOG:  worker process: logical 
replication worker for subscription 29235 (PID 2421) exited with exit 
code 1


when I manually 'clean up' by doing:
   delete from pg_replication_origin;

then, and only then, does the session finish and succeed ('replica ok').
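
(For reference, the origin slots on the replica can be inspected like 
this; a sketch only, using the core view pg_replication_origin_status and 
the replica port variable from my scripts:)

  echo "select local_id, external_id, remote_lsn, local_lsn
        from pg_replication_origin_status;" | psql -qtAXp $port2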

So to me it looks as if there is an omission of pg_replication_origin 
cleanup when pg_subscription is deleted.


Does that make sense?  All this is probably vague and I am only posting 
in the hope that Petr (or someone else) perhaps immediately understands 
what goes wrong, even with this limited amount of info.


In the meantime I will try to dig up more detailed info...


thanks,


Erik Rijkers




Re: [HACKERS] logical replication - still unstable after all these months

2017-06-06 Thread Erik Rijkers

On 2017-06-06 20:53, Peter Eisentraut wrote:

On 6/4/17 22:38, Petr Jelinek wrote:

Committed that, with some further updates of comments to reflect the


Belated apologies all round for the somewhat provocative $subject; but I 
felt at that moment that this item needed some extra attention.


I don't know if it worked but I'm glad that it is solved ;)

Thanks,

Erik Rijkers







Re: [HACKERS] logical replication - still unstable after all these months

2017-06-04 Thread Erik Rijkers

On 2017-05-31 16:20, Erik Rijkers wrote:

On 2017-05-31 11:16, Petr Jelinek wrote:
[...]
Thanks to Mark's offer I was able to study the issue as it happened 
and

found the cause of this.

[0001-Improve-handover-logic-between-sync-and-apply-worker.patch]


This looks good:

-- out_20170531_1141.txt
100 -- pgbench -c 90 -j 8 -T 60 -P 12 -n   --  scale 25
100 -- All is well.

So this is 100x a 1-minute test with 100x success. (This on the most
fastidious machine (slow disks, meagre specs) that used to give 15%
failures)


[Improve-handover-logic-between-sync-and-apply-worker-v2.patch]

No errors after (several days of) running variants of this. (2500x 1 
minute runs; 12x 1-hour runs)




Thanks!

Erik Rijkers





Re: [HACKERS] logical replication - still unstable after all these months

2017-06-01 Thread Erik Rijkers

On 2017-06-02 00:46, Mark Kirkwood wrote:

On 31/05/17 21:16, Petr Jelinek wrote:

I'm seeing a new failure with the patch applied - this time the
history table has missing rows. Petr, I'll put back your access :-)


Is this error during 1-minute runs?

I'm asking because I've moved back to longer (1-hour) runs (no errors so 
far), and I'd like to keep track of what the most 'vulnerable' 
parameters are.


thanks,

Erik Rijkers




Re: [HACKERS] logical replication - still unstable after all these months

2017-05-31 Thread Erik Rijkers

On 2017-05-31 11:16, Petr Jelinek wrote:
[...]

Thanks to Mark's offer I was able to study the issue as it happened and
found the cause of this.

[0001-Improve-handover-logic-between-sync-and-apply-worker.patch]


This looks good:

-- out_20170531_1141.txt
100 -- pgbench -c 90 -j 8 -T 60 -P 12 -n   --  scale 25
100 -- All is well.

So this is 100x a 1-minute test with 100x success. (This on the most 
fastidious machine (slow disks, meagre specs) that used to give 15% 
failures)


I'll let it run for a couple of days with varying params (and on varying 
hardware) but it definitely does look as if you fixed it.


Thanks!

Erik Rijkers




Re: [HACKERS] logical replication - still unstable after all these months

2017-05-31 Thread Erik Rijkers

On 2017-05-26 08:10, Erik Rijkers wrote:


If you run a pgbench session of 1 minute over a logical replication
connection and repeat that 100x this is what you get:

At clients 90, 64, 8, scale 25:
-- out_20170525_0944.txt
100 -- pgbench -c 90 -j 8 -T 60 -P 12 -n   --  scale 25
  7 -- Not good.
-- out_20170525_1426.txt
100 -- pgbench -c 64 -j 8 -T 60 -P 12 -n   --  scale 25
 18 -- Not good.
-- out_20170525_2049.txt
100 -- pgbench -c 8 -j 8 -T 60 -P 12 -n   --  scale 25
 10 -- Not good.
At clients 90, 64, 8, scale 5:
-- out_20170526_0126.txt
100 -- pgbench -c 90 -j 8 -T 60 -P 12 -n   --  scale 5
  2 -- Not good.
-- out_20170526_0352.txt
100 -- pgbench -c 64 -j 8 -T 60 -P 12 -n   --  scale 5
  3 -- Not good.
-- out_20170526_0621.txt
100 -- pgbench -c 8 -j 8 -T 60 -P 12 -n   --  scale 5
  4 -- Not good.


It seems this problem is a bit less serious than it looked to me (as 
others find lower failure rates).


Still, how is its seriousness graded by now?  Is it a show-stopper?  
Should it go onto the Open Items page?


Is anyone still looking into it?


thanks,

Erik Rijkers





The above installations (master+replica) are with Petr Jelinek's (and
Michael Paquier's) last patches
 0001-Fix-signal-handling-in-logical-workers.patch
 0002-Make-tablesync-worker-exit-when-apply-dies-while-it-.patch
 0003-Receive-invalidation-messages-correctly-in-tablesync.patch
 Remove-the-SKIP-REFRESH-syntax-suggar-in-ALTER-SUBSC-v2.patch






Re: [HACKERS] logical replication - still unstable after all these months

2017-05-28 Thread Erik Rijkers

On 2017-05-29 03:33, Jeff Janes wrote:

On Sun, May 28, 2017 at 3:17 PM, Mark Kirkwood <
mark.kirkw...@catalyst.net.nz> wrote:

I also got a failure, after 87 iterations of a similar test case.  It

[...]
repeated the runs, but so far it hasn't failed again in over 800 
iterations


Could you give the params for the successful runs?
(ideally, a  grep | sort | uniq -c  of the pgbench lines that were run)
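
Something like this is what I mean (a sketch; the out_*.txt glob is just 
the naming of my own output files, so adjust for yours):

  # count how often each pgbench parameter set was run
  grep -h 'pgbench -c' out_*.txt | sort | uniq -c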


Can you say anything about hardware?


Thanks for repeating my lengthy tests.


Erik Rijkers




Re: [HACKERS] logical replication - still unstable after all these months

2017-05-28 Thread Erik Rijkers

On 2017-05-29 00:17, Mark Kirkwood wrote:

On 28/05/17 19:01, Mark Kirkwood wrote:


So running in cloud land now... so far no errors - will update.


The framework ran 600 tests last night, and I see 3 'NOK' results, i.e
3 failed test runs (all scale 25 and 8 pgbench clients). Given the way


Could you also give the params for the successful runs?

Can you say anything about hardware?  (My experience is that older, 
slower, 'worse' hardware makes for more failures.)



Many thanks, by the way.  I'm glad that it turns out I'm probably not 
doing something uniquely stupid (although I'm not glad that there seems 
to be a bug, and an elusive one at that)



Erik Rijkers




Re: [HACKERS] logical replication - still unstable after all these months

2017-05-28 Thread Erik Rijkers

On 2017-05-26 15:59, Petr Jelinek wrote:

Hmm, I was under the impression that the changes we proposed in the
snapbuild thread fixed your issues, does this mean they didn't? Or the
modified versions of those that were eventually committed didn't? Or 
did

issues reappear at some point?


Here is a bit of info:

Just now (using Mark Kirkwood's version of my test) I had a session 
logging this:


  unknown relation state "w"

which I had never seen before.

This is column srsubstate in pg_subscription_rel.
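
(I watch it with something like the following; just a sketch, with the 
replica port variable from my test scripts:)

  # show the current sync state per subscribed table on the replica
  echo "select srrelid::regclass, srsubstate
        from pg_subscription_rel order by 1;" | psql -qtAXp $port2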

That session completed successfully ('replica ok'), so it's not 
necessarily a problem.



grepping through my earlier logs (of weeks of intermittent test-runs), I 
found only one more (timestamp 20170525_0125).  Here it occurred in a 
failed session.


No idea what it means.  At the very least this value 'w' is missing from 
the documentation, which only mentions:

  i = initialize
  d = data copy
  s = synchronized
  r = (normal replication)


Erik Rijkers










Re: [HACKERS] logical replication - still unstable after all these months

2017-05-27 Thread Erik Rijkers

On 2017-05-28 01:15, Mark Kirkwood wrote:


Also, any idea which rows are different? If you want something out of
the box that will do that for you see DBIx::Compare.


I used to save the content-diffs too but in the end decided they were 
useless (to me, anyway).





Re: [HACKERS] logical replication - still unstable after all these months

2017-05-27 Thread Erik Rijkers

On 2017-05-28 01:21, Mark Kirkwood wrote:

Sorry - I see you have done this already.

On 28/05/17 11:15, Mark Kirkwood wrote:
Interesting - might be good to see your test script too (so we can 
better understand how you are deciding if the runs are successful or 
not).



Yes, in pgbench_derail2.sh in the cb function it says:

  if [[ "${md5_total[$port1]}" == "${md5_total[$port2]}" ]]
  then
echo " ok"
  else
echo " NOK"
  fi

This is the final decision about success ('ok') or failure ('NOK'). (NOK 
stands for 'Not OK')


The two compared md5's (on the two ports: primary and replica) are each 
taken over a concatenation of the 4 separate md5's of the table-content 
(taken earlier in cb()).  If one or more of the 4 md5's differs, then 
that concatenation-md5 will differ too.


Sorry, there is not a lot of commenting in the script.








Re: [HACKERS] logical replication - still unstable after all these months

2017-05-27 Thread Erik Rijkers

On 2017-05-27 17:11, Andres Freund wrote:
On May 27, 2017 6:13:19 AM EDT, Simon Riggs <si...@2ndquadrant.com> 
wrote:

On 27 May 2017 at 09:44, Erik Rijkers <e...@xs4all.nl> wrote:


I am very curious at your results.


We take your bug report on good faith, but we still haven't seen
details of the problem or how to recreate it.

Please post some details. Thanks.


?


ok, ok...

( The thing is, I am trying to pre-digest the output but it takes time )

I can do this now: attached some output that belongs with this group of 
100  1-minute runs:


-- out_20170525_1426.txt
100 -- pgbench -c 64 -j 8 -T 60 -P 12 -n   --  scale 25
 82 -- All is well.
 18 -- Not good.

That is the worst set of runs of what I showed earlier.

that is:  out_20170525_1426.txt  and
2x18 logfiles that the 18 failed runs produced.
Those logfiles have names like:
logrep.20170525_1426.1436.1.scale_25.clients_64.NOK.log
logrep.20170525_1426.1436.2.scale_25.clients_64.NOK.log

.1.=primary
.2.=replica

Please disregard the errors around pg_current_wal_location().  (It was 
caused by some code to dump some wal into zipfiles, which obviously 
stopped working after the function was removed/renamed.)  There are also 
some unimportant errors from the test-harness where I call with the 
wrong port.  Not interesting, I don't think.


sent_20170527_1745.tar.bz2
Description: BZip2 compressed data



Re: [HACKERS] logical replication - still unstable after all these months

2017-05-27 Thread Erik Rijkers

On 2017-05-27 10:30, Erik Rijkers wrote:

On 2017-05-27 01:35, Mark Kirkwood wrote:



Here is what I have:

instances.sh:
testset.sh
pgbench_derail2.sh
pubsub.sh



To be clear:

( Apart from that standalone call like
./pgbench_derail2.sh $scale $clients $duration $date_str
)

I normally run by editing the parameters in testset.sh, then run:

./testset.sh

that then shows a tail -F of the output-logfile (to paste into another 
screen).


in yet another screen the 'watch -n20 results.sh' line

The output files are the .txt files.
The logfiles of the instances are (at the end of each test) copied to 
directory  logfiles/
under a meaningful name that shows the parameters, and with an extension 
like '.ok.log' or '.NOK.log'.


I am very curious at your results.





Re: [HACKERS] logical replication - still unstable after all these months

2017-05-27 Thread Erik Rijkers

On 2017-05-27 01:35, Mark Kirkwood wrote:

On 26/05/17 20:09, Erik Rijkers wrote:



The idea is simple enough:

startup instance1
startup instance2 (on same machine)
primary: init pgbench tables
primary: add primary key to pgbench_history
copy empty tables to replica by dump/restore
primary: start publication
replica: start subscription
primary: run 1-minute pgbench
wait till the 4 md5's of primary pgbench tables
  are the same as the 4 md5's of replica pgbench
  tables (this will need a time-out).
log 'ok' or 'not ok'
primary: clean up publication
replica: clean up subscription
shutdown primary
shutdown replica

this whole thing 100x



Here is what I have:

instances.sh:
  starts up 2 assert enabled sessions

instances_fast.sh:
  alternative to instances.sh
  starts up 2 assert disabled 'fast' sessions

testset.sh
  loop to call pgbench_derail2.sh with varying params

pgbench_derail2.sh
  main test program
  can be called 'standalone'
./pgbench_derail2.sh $scale $clients $duration $date_str
  so for instance this should work:
./pgbench_derail2.sh 25 64 60 20170527_1019
  to remove publication and subscription from sessions, add a 5th 
parameter 'clean'

./pgbench_derail2.sh 1 1 1 1 'clean'

pubsub.sh
  displays replication state. also called by pgbench_derail2.sh
  must be in path

result.sh
  display results
  I keep this in a screen-session as:
  watch -n 20 './result.sh 201705'


Peculiar to my setup also:
  server version at compile time stamped with date + commit hash
  I misuse information_schema.sql_packages  at compile time to store 
patch information

  instances are in  $pg_stuff_dir/pg_installations/pgsql.

So you'll have to outcomment a line here and there, and adapt paths, 
ports, and things like that.


It's a bit messy, I should have used perl from the beginning...

Good luck :)

Erik Rijkers







#!/bin/sh

#assertions on  in $pg_stuff_dir/pg_installations/pgsql./bin
#assertions off in $pg_stuff_dir/pg_installations/pgsql./bin.fast

port1=6972 project1=logical_replication
port2=6973 project2=logical_replication2
pg_stuff_dir=$HOME/pg_stuff
PATH1=$pg_stuff_dir/pg_installations/pgsql.$project1/bin:$PATH
PATH2=$pg_stuff_dir/pg_installations/pgsql.$project2/bin:$PATH
server_dir1=$pg_stuff_dir/pg_installations/pgsql.$project1
server_dir2=$pg_stuff_dir/pg_installations/pgsql.$project2
data_dir1=$server_dir1/data
data_dir2=$server_dir2/data

options1="
-c wal_level=logical
-c max_replication_slots=10
-c max_worker_processes=12
-c max_logical_replication_workers=10
-c max_wal_senders=10
-c logging_collector=on
-c log_directory=$server_dir1
-c log_filename=logfile.${project1}
-c log_replication_commands=on "

# -c wal_sender_timeout=18
# -c client_min_messages=DEBUG1 "
# -c log_connections=on
# -c max_sync_workers_per_subscription=6

options2="
-c wal_level=replica
-c max_replication_slots=10
-c max_worker_processes=12
-c max_logical_replication_workers=10
-c max_wal_senders=10
-c logging_collector=on
-c log_directory=$server_dir2
-c log_filename=logfile.${project2}
-c log_replication_commands=on "

# -c wal_sender_timeout=18
# -c client_min_messages=DEBUG1 "
# -c log_connections=on
# -c max_sync_workers_per_subscription=6

export PATH=$PATH1; export PG=$( which postgres ); $PG -D $data_dir1 -p $port1 ${options1} &
sleep 1
export PATH=$PATH2; export PG=$( which postgres ); $PG -D $data_dir2 -p $port2 ${options2} &
sleep 1
#!/bin/sh

#assertions on  in $pg_stuff_dir/pg_installations/pgsql./bin
#assertions off in $pg_stuff_dir/pg_installations/pgsql./bin.fast

port1=6972 project1=logical_replication
port2=6973 project2=logical_replication2
pg_stuff_dir=$HOME/pg_stuff
PATH1=$pg_stuff_dir/pg_installations/pgsql.$project1/bin.fast:$PATH
PATH2=$pg_stuff_dir/pg_installations/pgsql.$project2/bin.fast:$PATH
server_dir1=$pg_stuff_dir/pg_installations/pgsql.$project1
server_dir2=$pg_stuff_dir/pg_installations/pgsql.$project2
data_dir1=$server_dir1/data
data_dir2=$server_dir2/data
options1="
-c wal_level=logical
-c max_replication_slots=10
-c max_worker_processes=12
-c max_logical_replication_workers=10
-c max_wal_senders=14
-c wal_sender_timeout=18
-c logging_collector=on
-c log_directory=$server_dir1
-c log_filename=logfile.${project1} 
-c log_replication_commands=on "

options2="
-c wal_level=replica
-c max_replication_slots=10
-c max_worker_processes=12
-c max_logical_replication_workers=10
-c max_wal_senders=14
-c wal_sender_timeout=18
-c logging_collector=on
-c log_directory=$server_dir2
-c log_filename=logfile.${project2} 
-c log_replication_commands=on "

export PATH=$PATH1; PG=$(which postgres); $PG -D $data_dir1 -p $port1 ${options1} &
export PATH=$PATH2; PG=$(which postgres); $PG -D $data_dir2 -p $port2 ${options2} &

#!/bin/bash
pg_stuff_dir=$HOME/pg_stuff
port1=6972 project1=logical_replication
port2=6973 project2=logical_replication2
db=testdb
rc=0
duration=60
while [[ $rc -eq 0 ]]

Re: [HACKERS] logical replication - still unstable after all these months

2017-05-26 Thread Erik Rijkers

On 2017-05-27 01:35, Mark Kirkwood wrote:

On 26/05/17 20:09, Erik Rijkers wrote:


this whole thing 100x


Some questions that might help me get it right:
- do you think we need to stop and start the instances every time?
- do we need to init pgbench each time?
- could we just drop the subscription and publication and truncate the 
replica tables instead?


I have done all that in earlier versions.

I deliberately added these 'complications' in view of the intractability 
of the problem: my fear is that an earlier failure leaves some 
half-failed state behind in an instance, which then might cause more 
failure.  This would undermine the intent of the whole exercise (which 
is to count success/failure rate).  So it is important to be as sure as 
possible that each cycle starts out as cleanly as possible.



- what scale pgbench are you running?


I use a small script to call the main script; at the moment it does 
something like:

---
duration=60
from=1
to=100
for scale in 25 5
do
  for clients in 90 64 8
  do
date_str=$(date +"%Y%m%d_%H%M")
outfile=out_${date_str}.txt
time for x in `seq $from $to`
do
./pgbench_derail2.sh $scale $clients $duration $date_str
[...]
---


- how many clients for the 1 min pgbench run?


see above

- are you starting the pgbench run while the copy_data jobs for the 
subscription are still running?


I assume with copy_data you mean the data sync of the original table 
before pgbench starts.

And yes, I think this might be the origin of the problem.
( I think the problem I get is actually easily avoided by putting wait 
states here and there in between separate steps.  But the testing idea 
here is to force the system into error, not to avoid any errors)
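
(Such a wait state would look roughly as follows -- a sketch only, 
polling srsubstate on the replica port variable from my scripts:)

  # block until every subscribed table reports srsubstate 'r' (ready)
  until [ -z "$(echo "select 1 from pg_subscription_rel
                where srsubstate <> 'r' limit 1;" | psql -qtAXp $port2)" ]
  do
    sleep 2
  done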



- how exactly are you calculating those md5's?


Here is the bash function: cb (I forget what that stands for, I guess 
'content bench').  $outf is a log file to which the program writes 
output:


---
function cb()
{
  #  display the 4 pgbench tables' accumulated content as md5s
  #  a,b,t,h stand for:  pgbench_accounts, -branches, -tellers, -history
  num_tables=$( echo "select count(*) from pg_class where relkind = 'r' and relname ~ '^pgbench_'" | psql -qtAX )

  if [[ $num_tables -ne 4 ]]
  then
 echo "pgbench tables not 4 - exit" >> $outf
 exit
  fi
  for port in $port1 $port2
  do
md5_a=$(echo "select * from pgbench_accounts order by aid"|psql -qtAXp $port|md5sum|cut -b 1-9)
md5_b=$(echo "select * from pgbench_branches order by bid"|psql -qtAXp $port|md5sum|cut -b 1-9)
md5_t=$(echo "select * from pgbench_tellers  order by tid"|psql -qtAXp $port|md5sum|cut -b 1-9)
md5_h=$(echo "select * from pgbench_history  order by hid"|psql -qtAXp $port|md5sum|cut -b 1-9)
cnt_a=$(echo "select count(*) from pgbench_accounts"  |psql -qtAXp $port)
cnt_b=$(echo "select count(*) from pgbench_branches"  |psql -qtAXp $port)
cnt_t=$(echo "select count(*) from pgbench_tellers"   |psql -qtAXp $port)
cnt_h=$(echo "select count(*) from pgbench_history"   |psql -qtAXp $port)
md5_total[$port]=$( echo "${md5_a} ${md5_b} ${md5_t} ${md5_h}" | md5sum )

printf "$port a,b,t,h: %8d %6d %6d %6d" $cnt_a $cnt_b $cnt_t $cnt_h
echo -n "  $md5_a $md5_b $md5_t $md5_h"
if   [[ $port -eq $port1 ]]; then echo " master"
elif [[ $port -eq $port2 ]]; then echo -n " replica"
else  echo "   ERROR  "
fi
  done
  if [[ "${md5_total[$port1]}" == "${md5_total[$port2]}" ]]
  then
echo " ok"
  else
echo " NOK"
  fi
}
---

this enables:

echo "-- getting md5 (cb)"
cb_text1=$(cb)

and testing that string like:

if echo "$cb_text1" | grep -qw 'replica ok';
then
   echo "-- All is well."

[...]


Later today I'll try to clean up the whole thing and post it.
















Re: [HACKERS] logical replication - still unstable after all these months

2017-05-26 Thread Erik Rijkers

On 2017-05-26 15:59, Petr Jelinek wrote:

Hi,

Hmm, I was under the impression that the changes we proposed in the
snapbuild thread fixed your issues, does this mean they didn't? Or the
modified versions of those that were eventually committed didn't? Or 
did

issues reappear at some point?


I do think the snapbuild fixes solved certain problems.  I can't say 
what causes the present problems (as I have said, I suspect logical 
replication, but also my own test-harness: perhaps it leaves some 
error-state lying around, although I do try hard to prevent that) -- so 
I just don't know.


I wouldn't say that problems (re)appeared at a certain point; my 
impression is rather that logical replication has become better and 
better.  But I kept getting the odd failure, without a clear cause, but 
always (eventually) repeatable on other machines.  I did the 1-minute 
pgbench-derail version exactly because of the earlier problems with 
snapbuild: I wanted a test that does a lot of starting and stopping of 
publication and subscription.



Erik Rijkers








Re: [HACKERS] logical replication - still unstable after all these months

2017-05-26 Thread Erik Rijkers

On 2017-05-26 10:29, Mark Kirkwood wrote:

On 26/05/17 20:09, Erik Rijkers wrote:


On 2017-05-26 09:40, Simon Riggs wrote:


If we can find out what the bug is with a repeatable test case we can 
fix it.


Could you provide more details? Thanks


I will, just need some time to clean things up a bit.


But what I would like is for someone else to repeat my 100x1-minute 
tests, taking as core that snippet I posted in my previous email.  I 
built bash-stuff around that core (to take md5's, shut-down/start-up 
the two instances between runs, write info to log-files, etc).  But it 
would be good if someone else made that separately because if that 
then does not fail, it would prove that my test-harness is at fault 
(and not logical replication).




Will do - what I had been doing was running pgbench, waiting until the


Great!

You'll have to think about whether to go with instances of either 
master, or master+those 4 patches.  I guess either choice makes sense.



row counts on the replica pgbench_history were the same as the
primary, then summing the %balance and delta fields from the primary
and replica dbs and comparing. So far - all match up ok. However I'd


I did number-summing for a while as well (because it's a lot faster than 
taking md5's over the full content).
But the problem with summing is that (I think) in the end you cannot be 
really sure that the result is correct (false positives, although I 
don't understand the odds).



been running a longer time frames (5 minutes), so not the same number
of repetitions as yet.


I've run 3600-, 30- and 15-minute runs too, but in this case (these 100x 
tests) I wanted to especially test the area around startup/initialise of 
logical replication.  Also the increasing quality of logical replication 
(once it runs with the correct


thanks,

Erik Rijkers




Re: [HACKERS] logical replication - still unstable after all these months

2017-05-26 Thread Erik Rijkers

On 2017-05-26 09:40, Simon Riggs wrote:


If we can find out what the bug is with a repeatable test case we can 
fix it.


Could you provide more details? Thanks


I will, just need some time to clean things up a bit.


But what I would like is for someone else to repeat my 100x1-minute 
tests, taking as core that snippet I posted in my previous email.  I 
built bash-stuff around that core (to take md5's, shut-down/start-up the 
two instances between runs, write info to log-files, etc).  But it would 
be good if someone else made that separately because if that then does 
not fail, it would prove that my test-harness is at fault (and not 
logical replication).


The idea is simple enough:

startup instance1
startup instance2 (on same machine)
primary: init pgbench tables
primary: add primary key to pgbench_history
copy empty tables to replica by dump/restore
primary: start publication
replica: start subscription
primary: run 1-minute pgbench
wait till the 4 md5's of primary pgbench tables
  are the same as the 4 md5's of replica pgbench
  tables (this will need a time-out).
log 'ok' or 'not ok'
primary: clean up publication
replica: clean up subscription
shutdown primary
shutdown replica

this whole thing 100x







Re: [HACKERS] logical replication - still unstable after all these months

2017-05-26 Thread Erik Rijkers

On 2017-05-26 08:58, Simon Riggs wrote:

On 26 May 2017 at 07:10, Erik Rijkers <e...@xs4all.nl> wrote:


- Do you agree this number of failures is far too high?
- Am I the only one finding so many failures?


What type of failure are you getting?


The failure is that in the result state the replicated tables differ 
from the original tables.


For instance,

-- out_20170525_0944.txt
100 -- pgbench -c 90 -j 8 -T 60 -P 12 -n   --  scale 25
 93 -- All is well.
  7 -- Not good.

These numbers mean: the result state of primary and replica is not the 
same, in 7 out of 100 runs.


'not the same state' means:  at least one of the 4 md5's of the sorted 
content of the 4 pgbench tables on the primary is different from those 
taken from the replica.


So, 'failure' means: the 4 pgbench tables on primary and replica are not 
exactly the same after the (one-minute) pgbench-run has finished, and 
logical replication has 'finished'.  (Plenty of time is given for the 
replica to catch up: the test only calls 'failure' after 20x waiting (for 
15 seconds) and 20x finding the same erroneous state (erroneous because 
it is not the same as on the primary).)



I would really like to know if you think that that doesn't amount to 
'failure'.






[HACKERS] logical replication - still unstable after all these months

2017-05-26 Thread Erik Rijkers
If you run a pgbench session of 1 minute over a logical replication 
connection and repeat that 100x this is what you get:


At clients 90, 64, 8, scale 25:

-- out_20170525_0944.txt
100 -- pgbench -c 90 -j 8 -T 60 -P 12 -n   --  scale 25
 93 -- All is well.
  7 -- Not good.
-- out_20170525_1426.txt
100 -- pgbench -c 64 -j 8 -T 60 -P 12 -n   --  scale 25
 82 -- All is well.
 18 -- Not good.
-- out_20170525_2049.txt
100 -- pgbench -c 8 -j 8 -T 60 -P 12 -n   --  scale 25
 90 -- All is well.
 10 -- Not good


At clients 90, 64, 8, scale 5:

-- out_20170526_0126.txt
100 -- pgbench -c 90 -j 8 -T 60 -P 12 -n   --  scale 5
 98 -- All is well.
  2 -- Not good.
-- out_20170526_0352.txt
100 -- pgbench -c 64 -j 8 -T 60 -P 12 -n   --  scale 5
 97 -- All is well.
  3 -- Not good.
-- out_20170526_0621.txt
 45 -- pgbench -c 8 -j 8 -T 60 -P 12 -n   --  scale 5
 41 -- All is well.
  3 -- Not good.

(That last run obviously had not finished yet.)


I think this is pretty awful, really, for a beta level.

The above installations (master+replica) are with Petr Jelinek's (and 
Michael Paquier's) last patches

 0001-Fix-signal-handling-in-logical-workers.patch
 0002-Make-tablesync-worker-exit-when-apply-dies-while-it-.patch
 0003-Receive-invalidation-messages-correctly-in-tablesync.patch
 Remove-the-SKIP-REFRESH-syntax-suggar-in-ALTER-SUBSC-v2.patch

Now, it could be that there is somehow something wrong with my 
test-setup (as opposed to some bug in log-repl).  I can post my test 
program, but I'll do that separately (but below is the core of all my tests 
-- it's basically still that very first test that I started out with, 
many months ago...)



I'd like to find out/know more about:
- Do you agree this number of failures is far too high?
- Am I the only one finding so many failures?
- Is anyone else testing the same way (more or less continually, finding 
only success)?
- Which of the Open Items could be responsible for this failure rate?  (I 
don't see a match.)
- What tests do others do?  Could we somehow concentrate results and 
method somewhere?



Thanks,


Erik Rijkers




PS

The core of the 'pgbench_derail' test (bash) is simply:

echo "drop table if exists pgbench_accounts;
drop table if exists pgbench_branches;
drop table if exists pgbench_tellers;
drop table if exists pgbench_history;" | psql -qXp $port1 \
&& echo "drop table if exists pgbench_accounts;
drop table if exists pgbench_branches;
drop table if exists pgbench_tellers;
drop table if exists pgbench_history;" | psql -qXp $port2 \
&& pgbench -p $port1 -qis $scale \
&& echo "alter table pgbench_history add column hid serial primary key;" 
\

 | psql -q1Xp $port1 && pg_dump -F c -p $port1 \
--exclude-table-data=pgbench_history  \
--exclude-table-data=pgbench_accounts \
--exclude-table-data=pgbench_branches \
--exclude-table-data=pgbench_tellers  \
  -t pgbench_history -t pgbench_accounts \
  -t pgbench_branches -t pgbench_tellers \
 | pg_restore -1 -p $port2 -d testdb
appname=derail2
echo "create publication pub1 for all tables;" | psql -p $port1 -aqtAX
echo "create subscription sub1 connection 'port=${port1}
  application_name=$appname' publication pub1 with(enabled=false);
alter subscription sub1 enable;" | psql -p $port2 -aqtAX

pgbench -c $clients -j $threads -T $duration -P $pseconds -n   #  scale $scale


Now compare md5's of the sorted content of each of the 4 pgbench tables 
on primary and replica.  They should be the same.
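
(For a single table that check is essentially this; a sketch using the 
same ports and psql flags as above:)

  # compare one table's sorted content between primary and replica
  md5_1=$(echo "select * from pgbench_accounts order by aid;" | psql -qtAXp $port1 | md5sum)
  md5_2=$(echo "select * from pgbench_accounts order by aid;" | psql -qtAXp $port2 | md5sum)
  if [ "$md5_1" = "$md5_2" ]; then echo "replica ok"; else echo "NOK"; fi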






Re: [HACKERS] Race conditions with WAL sender PID lookups

2017-05-21 Thread Erik Rijkers

On 2017-05-21 06:37, Erik Rijkers wrote:

On 2017-05-20 14:40, Michael Paquier wrote:
On Fri, May 19, 2017 at 3:01 PM, Masahiko Sawada 
<sawada.m...@gmail.com> wrote:
Also, as Horiguchi-san pointed out earlier, walreceiver seems need 
the

similar fix.


Actually, now that I look at it, ready_to_display should as well be
protected by the lock of the WAL receiver, so it is incorrectly placed
in walreceiver.h. As you are pointing out, pg_stat_get_wal_receiver()
is lazy as well, and that's new in 10, so we have an open item here
for both of them. And I am the author for both things. No issues
spotted in walreceiverfuncs.c after review.

I am adding an open item so as both issues are fixed in PG10. With the
WAL sender part, I think that this should be a group shot.

So what do you think about the attached?



[walsnd-pid-races-v3.patch]



With this patch on current master my logical replication tests
(pgbench-over-logical-replication) run without errors for the first
time in many days (even weeks).


Unfortunately, just now another logical-replication failure occurred.  
The same as I have seen all along:


The symptom: after starting logical replication, there are no rows in 
pg_stat_replication, and in the replica log, logical replication complains 
about max_replication_slots being too low.  (From previous experience I 
know that raising max_replication_slots does indeed 'help', but only until 
the next (same) error occurs, with the (same) complaint renewed.)
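
(A quick way to see what is actually using up those slots on the 
replica: on the subscriber side, max_replication_slots also caps the 
number of replication origins, which is what that complaint is about.  
The port is of course specific to my setup:)

echo "show max_replication_slots;
select count(*) from pg_replication_origin;" | psql -qXp $port2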


Also from previous experience of this failed state I know that it can be 
'cleaned up' by manually emptying these tables:
  delete from pg_subscription_rel;
  delete from pg_subscription;
  delete from pg_replication_origin;
Then it becomes possible to start a new subscription without the above 
symptoms.


I'll do some more testing and hopefully get some information that's less 
vague...



Erik Rijkers



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Race conditions with WAL sender PID lookups

2017-05-20 Thread Erik Rijkers

On 2017-05-20 14:40, Michael Paquier wrote:
On Fri, May 19, 2017 at 3:01 PM, Masahiko Sawada 
<sawada.m...@gmail.com> wrote:

Also, as Horiguchi-san pointed out earlier, walreceiver seems need the
similar fix.


Actually, now that I look at it, ready_to_display should as well be
protected by the lock of the WAL receiver, so it is incorrectly placed
in walreceiver.h. As you are pointing out, pg_stat_get_wal_receiver()
is lazy as well, and that's new in 10, so we have an open item here
for both of them. And I am the author for both things. No issues
spotted in walreceiverfuncs.c after review.

I am adding an open item so as both issues are fixed in PG10. With the
WAL sender part, I think that this should be a group shot.

So what do you think about the attached?



[walsnd-pid-races-v3.patch]



With this patch on current master my logical replication tests 
(pgbench-over-logical-replication) run without errors for the first time 
in many days (even weeks).


I'll do still more and longer tests, but I have already gathered a long 
streak of successful runs since you posted the patch, so I am getting 
convinced this patch has solved the problem that I was experiencing.


Pity it didn't make the beta.


thanks,

Erik Rijkers



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] snapbuild woes

2017-05-09 Thread Erik Rijkers

On 2017-05-09 21:00, Petr Jelinek wrote:

On 09/05/17 19:54, Erik Rijkers wrote:

On 2017-05-09 11:50, Petr Jelinek wrote:



Ah okay, so this is same issue that's reported by both Masahiko Sawada
[1] and Jeff Janes [2].

[1]
https://www.postgresql.org/message-id/CAD21AoBYpyqTSw%2B%3DES%2BxXtRGMPKh%3DpKiqjNxZKnNUae0pSt9bg%40mail.gmail.com
[2]
https://www.postgresql.org/message-id/flat/CAMkU%3D1xUJKs%3D2etq2K7bmbY51Q7g853HLxJ7qEB2Snog9oRvDw%40mail.gmail.com


I don't understand why you come to that conclusion: both Masahiko Sawada 
and Jeff Janes have a DROP SUBSCRIPTION in the mix; my cases don't.  
Isn't that a real difference?


( I do sometimes get that DROP-SUBSCRIPTION too, but much less often 
than the sync-failure. )






--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] snapbuild woes

2017-05-09 Thread Erik Rijkers

On 2017-05-09 11:50, Petr Jelinek wrote:

I rebased the above mentioned patch to apply to the patches Andres 
sent,
if you could try to add it on top of what you have and check if it 
still

fails, that would be helpful.


It still fails.

With these patches

- 0001-WIP-Fix-off-by-one-around-GetLastImportantRecPtr.patch+
- 2-WIP-Possibly-more-robust-snapbuild-approach.patch +
- fix-statistics-reporting-in-logical-replication-work.patch +
- Skip-unnecessary-snapshot-builds.patch

built again on top of 44c528810a1 ( so I had to add the 
'fix-statistics-rep*' patch because without it I immediately got that 
Assertion failure again ).


As always most runs succeed (especially on this large 192GB 16-core 
server).


But attached is an output file of a number of runs of my 
pgbench_derail2.sh test.


Overal result:

-- out_20170509_1635.txt
  3 -- pgbench -c 64 -j 8 -T 900 -P 180 -n   --  scale 25
  2 -- All is well.
  1 -- Not good, but breaking out of wait (21 times no change)

I broke it off after iteration 4, so 5 never ran, and
iteration 1 failed due to a mistake in the harness (something stupid I 
did) - not interesting.


iteration 2 succeeds. (eventually has 'replica ok')

iteration 3 succeeds. (eventually has 'replica ok')

iteration 4 fails.
  Just after 'alter subscription sub1 enable' I caught (as is usual) 
pg_stat_replication.state as 'catchup'. So far so good.
  After the 15-minute pgbench run pg_stat_replication has only 2 
'startup' lines (and none 'catchup' or 'streaming'):


 port | pg_stat_replication |  pid   |     wal     | replay_loc | diff | ?column? |  state  |   app   | sync_state
 6972 | pg_stat_replication | 108349 | 19/8FBCC248 |            |      |          | startup | derail2 | async
 6972 | pg_stat_replication | 108351 | 19/8FBCC248 |            |      |          | startup | derail2 | async


(that's from:
   select $port1 as port, 'pg_stat_replication' as pg_stat_replication, pid
        , pg_current_wal_location() wal, replay_location replay_loc
        , pg_current_wal_location() - replay_location as diff
        , pg_current_wal_location() <= replay_location
        , state, application_name as app, sync_state
     from pg_stat_replication
)

This remains in this state for as long as my test program lets it 
(i.e., 20 x 30s, or something like that, and then the loop is exited); 
in the output file it says: 'Not good, but breaking out of wait'.


Below is the accompanying ps (with the 2 'deranged senders' as Jeff 
Janes would surely call them):



UID PID   PPID  C STIME TTY  STAT   TIME CMD
rijkers  107147  1  0 17:11 pts/35   S+ 0:00 
/var/data1/pg_stuff/pg_installations/pgsql.logical_replication2/bin/postgres 
-D /var/data1/pg_stuff/pg_installations
rijkers  107149 107147  0 17:11 ?Ss 0:00  \_ postgres: 
logger process
rijkers  107299 107147  0 17:11 ?Ss 0:01  \_ postgres: 
checkpointer process
rijkers  107300 107147  0 17:11 ?Ss 0:00  \_ postgres: 
writer process
rijkers  107301 107147  0 17:11 ?Ss 0:00  \_ postgres: wal 
writer process
rijkers  107302 107147  0 17:11 ?Ss 0:00  \_ postgres: 
autovacuum launcher process
rijkers  107303 107147  0 17:11 ?Ss 0:00  \_ postgres: stats 
collector process
rijkers  107304 107147  0 17:11 ?Ss 0:00  \_ postgres: 
bgworker: logical replication launcher
rijkers  108348 107147  0 17:12 ?Ss 0:01  \_ postgres: 
bgworker: logical replication worker for subscription 70310 sync 70293
rijkers  108350 107147  0 17:12 ?Ss 0:00  \_ postgres: 
bgworker: logical replication worker for subscription 70310 sync 70298
rijkers  107145  1  0 17:11 pts/35   S+ 0:02 
/var/data1/pg_stuff/pg_installations/pgsql.logical_replication/bin/postgres 
-D /var/data1/pg_stuff/pg_installations
rijkers  107151 107145  0 17:11 ?Ss 0:00  \_ postgres: 
logger process
rijkers  107160 107145  0 17:11 ?Ss 0:08  \_ postgres: 
checkpointer process
rijkers  107161 107145  0 17:11 ?Ss 0:07  \_ postgres: 
writer process
rijkers  107162 107145  0 17:11 ?Ss 0:02  \_ postgres: wal 
writer process
rijkers  107163 107145  0 17:11 ?Ss 0:00  \_ postgres: 
autovacuum launcher process
rijkers  107164 107145  0 17:11 ?Ss 0:02  \_ postgres: stats 
collector process
rijkers  107165 107145  0 17:11 ?Ss 0:00  \_ postgres: 
bgworker: logical replication launcher
rijkers  108349 107145  0 17:12 ?Ss 0:27  \_ postgres: wal 
sender process rijkers [local] idle
rijkers  108351 107145  0 17:12 ?Ss 0:26  \_ postgres: wal 
sender process rijkers [local] idle


I have had no time to add (or view) any CPUinfo.


Erik Rijkers





out_20170509_1635.txt
Description: application/elc

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] snapbuild woes

2017-05-09 Thread Erik Rijkers

On 2017-05-09 11:50, Petr Jelinek wrote:

On 09/05/17 10:59, Erik Rijkers wrote:

On 2017-05-09 10:50, Petr Jelinek wrote:

On 09/05/17 00:03, Erik Rijkers wrote:

On 2017-05-05 02:00, Andres Freund wrote:


Could you have a look?


[...]

I rebased the above mentioned patch to apply to the patches Andres 
sent,
if you could try to add it on top of what you have and check if it 
still

fails, that would be helpful.


I suppose you mean these; but they do not apply anymore:

20170505/0001-WIP-Fix-off-by-one-around-GetLastImportantRecPtr.patch
20170505/0002-WIP-Possibly-more-robust-snapbuild-approach.patch

Andres, any change you could update them?

alternatively I could use the older version again..

thanks,

Erik Rijkers


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] snapbuild woes

2017-05-09 Thread Erik Rijkers

On 2017-05-09 10:50, Petr Jelinek wrote:

On 09/05/17 00:03, Erik Rijkers wrote:

On 2017-05-05 02:00, Andres Freund wrote:


Could you have a look?


Running tests with these three patches:


0001-WIP-Fix-off-by-one-around-GetLastImportantRecPtr.patch+
0002-WIP-Possibly-more-robust-snapbuild-approach.patch +
fix-statistics-reporting-in-logical-replication-work.patch

(on top of 44c528810)

I test by 15-minute pgbench runs while there is a logical replication
connection. Primary and replica are on the same machine.

I have seen errors on 3 different machines (where error means: at 
least

1 of the 4 pgbench tables is not md5-equal). It seems better, faster
machines yield less errors.

Normally I see in pg_stat_replication (on master) one process in state
'streaming'.

 pid  | wal | replay_loc  |   diff   |   state   |   app   |
sync_state
16495 | 11/EDBC | 11/EA3FEEE8 | 58462488 | streaming | derail2 | 
async


Often there are another two processes in pg_stat_replication that 
remain

in state 'startup'.

In the failing sessions the 'streaming'-state process is missing; in
failing sessions there are only the two processes that are and remain 
in

'startup'.


Hmm, startup is the state where slot creation is happening. I wonder if
it's just taking long time to create snapshot because of the 5th issue
which is not yet fixed (and the original patch will not apply on top of
this change). Alternatively there is a bug in this patch.

Did you see high CPU usage during the test when there were those
"startup" state walsenders?



I haven't noticed, but I wasn't paying particular attention to that.

I'll try to get some CPU-info logged...



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] snapbuild woes

2017-05-08 Thread Erik Rijkers
 is going to fail.  I 
believe this has been true for all failure cases that I've seen (except 
the much more rare stuck-DROP-SUBSCRIPTION which is mentioned in another 
thread).


Sorry, I have not been able to get any thing more clear or definitive...


thanks,


Erik Rijkers










--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table

2017-05-08 Thread Erik Rijkers

On 2017-05-08 13:13, Masahiko Sawada wrote:

On Mon, May 8, 2017 at 7:14 PM, Erik Rijkers <e...@xs4all.nl> wrote:

On 2017-05-08 11:27, Masahiko Sawada wrote:






FWIW, running


0001-WIP-Fix-off-by-one-around-GetLastImportantRecPtr.patch+
0002-WIP-Possibly-more-robust-snapbuild-approach.patch +
fix-statistics-reporting-in-logical-replication-work.patch


(on top of 44c528810)


Thanks, which thread are these patches attached on?



The first two patches are here:
https://www.postgresql.org/message-id/20170505004237.edtahvrwb3uwd5rs%40alap3.anarazel.de

and last one:
https://www.postgresql.org/message-id/22cc402c-88eb-fa35-217f-0060db2c72f0%402ndquadrant.com

( I have to include that last one or my tests fail within minutes. )


Erik Rijkers


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table

2017-05-08 Thread Erik Rijkers

On 2017-05-08 11:27, Masahiko Sawada wrote:

Hi,

I encountered a situation where DROP SUBSCRIPTION got stuck when
initial table sync is in progress. In my environment, I created
several tables with some data on publisher. I created subscription on
subscriber and drop subscription immediately after that. It doesn't
always happen but I often encountered it on my environment.

ps -x command shows the following.

 96796 ?Ss 0:00 postgres: masahiko postgres [local] DROP
SUBSCRIPTION
 96801 ?Ts 0:00 postgres: bgworker: logical replication
worker for subscription 40993waiting
 96805 ?Ss 0:07 postgres: bgworker: logical replication
worker for subscription 40993 sync 16418
 96806 ?Ss 0:01 postgres: wal sender process masahiko 
[local] idle

 96807 ?Ss 0:00 postgres: bgworker: logical replication
worker for subscription 40993 sync 16421
 96808 ?Ss 0:00 postgres: wal sender process masahiko 
[local] idle




FWIW, running


0001-WIP-Fix-off-by-one-around-GetLastImportantRecPtr.patch+
0002-WIP-Possibly-more-robust-snapbuild-approach.patch +
fix-statistics-reporting-in-logical-replication-work.patch

(on top of 44c528810)

I have encountered the same condition as well in the last few days, a 
few times (I think 2 or 3 times).


Erik Rijkers


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Logical replication - TRAP: FailedAssertion in pgstat.c

2017-05-03 Thread Erik Rijkers

On 2017-05-03 08:17, Petr Jelinek wrote:

On 02/05/17 20:43, Robert Haas wrote:

On Thu, Apr 20, 2017 at 2:58 PM, Peter Eisentraut



code path that calls CommitTransactionCommand() should have one, no?


Is there anything left to be committed here?



Afaics the fix was not committed. Peter wanted more comprehensive fix
which didn't happen. I think something like attached should do the job.


I'm running my pgbench-over-logical-replication test in chunks of 15 
minutes, with different pgbench -c (num clients) and -s (scale) values.


With this patch (and nothing else)  on top of master (8f8b9be51fd7 to be 
precise):



fix-statistics-reporting-in-logical-replication-work.patch


logical replication is still often failing (as expected, I suppose; it 
seems to be because of "initial snapshot too large"), but indeed I do not 
see the 'TRAP: FailedAssertion in pgstat.c' anymore.


(If there is any other configuration of patches worth testing please let 
me know)


thanks

Erik Rijkers



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Logical replication - TRAP: FailedAssertion in pgstat.c

2017-04-17 Thread Erik Rijkers

On 2017-04-17 15:59, Stas Kelvich wrote:

On 17 Apr 2017, at 10:30, Erik Rijkers <e...@xs4all.nl> wrote:

On 2017-04-16 20:41, Andres Freund wrote:

On 2017-04-16 10:46:21 +0200, Erik Rijkers wrote:

On 2017-04-15 04:47, Erik Rijkers wrote:
>
> 0001-Reserve-global-xmin-for-create-slot-snasphot-export.patch +
> 0002-Don-t-use-on-disk-snapshots-for-snapshot-export-in-l.patch+
> 0003-Prevent-snapshot-builder-xmin-from-going-backwards.patch  +
> 0004-Fix-xl_running_xacts-usage-in-snapshot-builder.patch  +
> 0005-Skip-unnecessary-snapshot-builds.patch
I am now using these newer patches:
https://www.postgresql.org/message-id/30242bc6-eca4-b7bb-670e-8d0458753a8c%402ndquadrant.com
> It builds fine, but when I run the old pbench-over-logical-replication
> test I get:
>
> TRAP: FailedAssertion("!(entry->trans == ((void *)0))", File:
> "pgstat.c", Line: 828)
To get that error:

I presume this is the fault of
http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=139eb9673cb84c76f493af7e68301ae204199746
if you git revert that individual commit, do things work again?


Yes, compiled from 67c2def11d4 with the above 4 patches, it runs 
flawlessly again. (flawlessly= a few hours without any error)




I’ve reproduced failure, this happens under tablesync worker and 
putting

pgstat_report_stat() under the previous condition block should help.

However for me it took about an hour of running this script to catch
original assert.

Can you check with that patch applied?



Your patch on top of the 5 patches above seems to solve the matter too: 
no problems after running for 2 hours (previously it failed within half 
a minute).




Erik Rijkers




--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Logical replication - TRAP: FailedAssertion in pgstat.c

2017-04-17 Thread Erik Rijkers

On 2017-04-16 20:41, Andres Freund wrote:

On 2017-04-16 10:46:21 +0200, Erik Rijkers wrote:

On 2017-04-15 04:47, Erik Rijkers wrote:
>
> 0001-Reserve-global-xmin-for-create-slot-snasphot-export.patch +
> 0002-Don-t-use-on-disk-snapshots-for-snapshot-export-in-l.patch+
> 0003-Prevent-snapshot-builder-xmin-from-going-backwards.patch  +
> 0004-Fix-xl_running_xacts-usage-in-snapshot-builder.patch  +
> 0005-Skip-unnecessary-snapshot-builds.patch

I am now using these newer patches:
https://www.postgresql.org/message-id/30242bc6-eca4-b7bb-670e-8d0458753a8c%402ndquadrant.com

> It builds fine, but when I run the old pbench-over-logical-replication
> test I get:
>
> TRAP: FailedAssertion("!(entry->trans == ((void *)0))", File:
> "pgstat.c", Line: 828)


To get that error:


I presume this is the fault of
http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=139eb9673cb84c76f493af7e68301ae204199746
if you git revert that individual commit, do things work again?



Yes, compiled from 67c2def11d4 with the above 4 patches, it runs 
flawlessly again. (flawlessly= a few hours without any error)




--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Logical replication - TRAP: FailedAssertion in pgstat.c

2017-04-16 Thread Erik Rijkers

On 2017-04-15 04:47, Erik Rijkers wrote:


0001-Reserve-global-xmin-for-create-slot-snasphot-export.patch +
0002-Don-t-use-on-disk-snapshots-for-snapshot-export-in-l.patch+
0003-Prevent-snapshot-builder-xmin-from-going-backwards.patch  +
0004-Fix-xl_running_xacts-usage-in-snapshot-builder.patch  +
0005-Skip-unnecessary-snapshot-builds.patch


I am now using these newer patches:
https://www.postgresql.org/message-id/30242bc6-eca4-b7bb-670e-8d0458753a8c%402ndquadrant.com


It builds fine, but when I run the old pgbench-over-logical-replication
test I get:

TRAP: FailedAssertion("!(entry->trans == ((void *)0))", File: 
"pgstat.c", Line: 828)



To get that error:

--
#!/bin/sh

port1=6972 port2=6973 scale=25 clients=16 duration=60

   echo "drop table if exists pgbench_accounts;
 drop table if exists pgbench_branches;
 drop table if exists pgbench_tellers;
 drop table if exists pgbench_history;" | psql -qXp $port1 \
&& echo "drop table if exists pgbench_accounts;
 drop table if exists pgbench_branches;
 drop table if exists pgbench_tellers;
 drop table if exists pgbench_history;" | psql -qXp $port2 \
&& pgbench -p $port1 -qis ${scale//_/} && echo "
alter table pgbench_history add column hid serial primary key;
" | psql -q1Xp $port1  \
  && pg_dump -F c -p $port1   \
   --exclude-table-data=pgbench_history  \
   --exclude-table-data=pgbench_accounts \
   --exclude-table-data=pgbench_branches \
   --exclude-table-data=pgbench_tellers  \
   -t pgbench_history  \
   -t pgbench_accounts \
   -t pgbench_branches \
   -t pgbench_tellers  \
  | pg_restore -1 -p $port2 -d testdb

appname=pgbench_derail
echo "create publication pub1 for all tables;" | psql -p $port1 -aqtAX
echo "create subscription sub1 connection 'port=${port1} 
application_name=${appname}' publication pub1 with (disabled);

alter subscription sub1 enable;
" | psql -p $port2 -aqtAX

echo "-- pgbench -p $port1 -c $clients -T $duration -n   -- scale $scale 
"

 pgbench -p $port1 -c $clients -T $duration -n

--


Erik Rijkers


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Logical replication - TRAP: FailedAssertion in pgstat.c

2017-04-14 Thread Erik Rijkers


Testing logical replication, with the following patches on top of 
yesterday's master:


0001-Reserve-global-xmin-for-create-slot-snasphot-export.patch +
0002-Don-t-use-on-disk-snapshots-for-snapshot-export-in-l.patch+
0003-Prevent-snapshot-builder-xmin-from-going-backwards.patch  +
0004-Fix-xl_running_xacts-usage-in-snapshot-builder.patch  +
0005-Skip-unnecessary-snapshot-builds.patch

Is applying that patch set still correct?

It builds fine, but when I run the old pgbench-over-logical-replication 
test I get:


TRAP: FailedAssertion("!(entry->trans == ((void *)0))", File: 
"pgstat.c", Line: 828)


reliably (often within a minute).


The test itself does not fail, at least not that I saw (but I only ran a 
few).



thanks,


Erik Rijkers




--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] the need to finish

2017-04-12 Thread Erik Rijkers

Logical replication emits logmessages like these:

DETAIL:  90 transactions need to finish.
DETAIL:  87 transactions need to finish.
DETAIL:  70 transactions need to finish.

Could we get rid of that 'need'?   It strikes me as a bit off; something 
a person would say, but not a mechanical message from a computer.  I 
dislike it strongly.


I would prefer the line to be more terse:

DETAIL:  90 transactions to finish.

Am I the only one who is annoyed by this phrase?


Thanks,


Erik Rijkers















--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] snapbuild woes

2017-04-08 Thread Erik Rijkers

On 2017-04-08 15:56, Andres Freund wrote:

On 2017-04-08 09:51:39 -0400, David Steele wrote:

On 3/2/17 7:54 PM, Petr Jelinek wrote:
>
> Yes the copy patch needs rebase as well. But these ones are fine.

This bug has been moved to CF 2017-07.


FWIW, as these are bug-fixes that need to be backpatched, I do plan to
work on them soon.



CF 2017-07 pertains to postgres 11, is that right?

But I hope you mean to commit these snapbuild patches before the 
postgres 10 release?  As far as I know, logical replication is still 
very broken without them (or at least some of that set of 5 patches - I 
don't know which ones are essential and which may not be).


If it's at all useful I can repeat tests to show how often current 
master still fails (easily 50% or so failure-rate).


This would be the pgbench-over-logical-replication test that I did so 
often earlier on.


thanks,

Erik Rijkers


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] monitoring.sgml missing tag

2017-04-07 Thread Erik Rijkers

On 2017-04-07 22:50, Andres Freund wrote:

On 2017-04-07 22:47:55 +0200, Erik Rijkers wrote:

monitoring.sgml has one </row> tag missing


Is that actually an issue? SGML allows skipping certain close tags, and
IIRC row is one them.  We'll probably move to xml at some point not too
far away, but I don't think it makes much sense to fix these 
one-by-one.



Well, I have only used  make oldhtml  before now so maybe I am doing 
something wrong.


I try to run  make html.

First, I got this (just showing first few of a 75x repeat):

$ time ( cd /home/aardvark/pg_stuff/pg_sandbox/pgsql.HEAD/doc/src/sgml;  
make html; )

osx -D . -D . -x lower postgres.sgml >postgres.xml.tmp
osx:monitoring.sgml:1278:12:E: document type does not allow element 
"ROW" here
osx:monitoring.sgml:1282:12:E: document type does not allow element 
"ROW" here
osx:monitoring.sgml:1286:12:E: document type does not allow element 
"ROW" here

...
osx:monitoring.sgml:1560:12:E: document type does not allow element 
"ROW" here
osx:monitoring.sgml:1564:13:E: end tag for "ROW" omitted, but OMITTAG NO 
was specified

osx:monitoring.sgml:1275:8: start tag was here
make: *** [postgres.xml] Error 1


After closing that tag with </row>,  make html  still fails:



$ time ( cd /home/aardvark/pg_stuff/pg_sandbox/pgsql.HEAD/doc/src/sgml;  
make html; )

osx -D . -D . -x lower postgres.sgml >postgres.xml.tmp
'/opt/perl-5.24/bin/perl' -p -e 
's/\[(aacute|acirc|aelig|agrave|amp|aring|atilde|auml|bull|copy|eacute|egrave|gt|iacute|lt|mdash|nbsp|ntilde|oacute|ocirc|oslash|ouml|pi|quot|scaron|uuml) 
*\]/\&\1;/gi;' -e '$_ .= qq{XML V4.2//EN" 
"http://www.oasis-open.org/docbook/xml/4.2/docbookx.dtd;>\n} if $. == 
1;'  postgres.xml

rm postgres.xml.tmp
xmllint --noout --valid postgres.xml
xsltproc --stringparam pg.version '10devel'  stylesheet.xsl postgres.xml
runtime error: file stylesheet-html-common.xsl line 41 element 
call-template

The called template 'id.attribute' was not found.
runtime error: file stylesheet-html-common.xsl line 41 element 
call-template

The called template 'id.attribute' was not found.
runtime error: file stylesheet-html-common.xsl line 41 element 
call-template

The called template 'id.attribute' was not found.
runtime error: file stylesheet-html-common.xsl line 41 element 
call-template

The called template 'id.attribute' was not found.
runtime error: file stylesheet-html-common.xsl line 41 element 
call-template

The called template 'id.attribute' was not found.
runtime error: file stylesheet-html-common.xsl line 30 element 
call-template

The called template 'id.attribute' was not found.
runtime error: file stylesheet-html-common.xsl line 30 element 
call-template

The called template 'id.attribute' was not found.
runtime error: file stylesheet-html-common.xsl line 30 element 
call-template

The called template 'id.attribute' was not found.
runtime error: file stylesheet-html-common.xsl line 30 element 
call-template

The called template 'id.attribute' was not found.
runtime error: file stylesheet-html-common.xsl line 30 element 
call-template

The called template 'id.attribute' was not found.
runtime error: file stylesheet-html-common.xsl line 30 element 
call-template

The called template 'id.attribute' was not found.
no result for postgres.xml
make: *** [html-stamp] Error 9

real4m23.641s
user4m22.304s
sys 0m0.914s


Any hints welcome...

thanks


$ cat /etc/redhat-release
CentOS release 6.6 (Final)



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] monitoring.sgml missing tag

2017-04-07 Thread Erik Rijkers

monitoring.sgml has one </row> tag missing

--- doc/src/sgml/monitoring.sgml.orig	2017-04-07 22:37:55.388708334 +0200
+++ doc/src/sgml/monitoring.sgml	2017-04-07 22:38:16.582047695 +0200
@@ -1275,6 +1275,7 @@
 
  ProcArrayGroupUpdate
  Waiting for group leader to clear transaction id at transaction end.
+
 
  SafeSnapshot
  Waiting for a snapshot for a READ ONLY DEFERRABLE transaction.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Logical replication existing data copy

2017-03-30 Thread Erik Rijkers


(At the moment using these patches for tests:)
 0001-Reserve-global-xmin-for-create-slot-snasphot-export.patch +
 0002-Don-t-use-on-disk-snapshots-for-snapshot-export-in-l.patch+
 0003-Prevent-snapshot-builder-xmin-from-going-backwards.patch  +
 0004-Fix-xl_running_xacts-usage-in-snapshot-builder.patch  +
 0005-Skip-unnecessary-snapshot-builds.patch+
 and now (Tuesday 30) added :
 0001-Fix-remote-position-tracking-in-logical-replication.patch



I think what you have seen is because of this:
https://www.postgresql.org/message-id/flat/b235fa69-147a-5e09-f8f3-3f780a1ab...@2ndquadrant.com#b235fa69-147a-5e09-f8f3-3f780a1ab...@2ndquadrant.com



You were right: with that 6th patch (and wal_sender_timeout back at its 
default 60s) there are no errors either (I tested on all 3 
test-machines).


I must have missed that last patch when you posted it.  Anyway all seems 
fine now; I hope the above patches can all be committed soon.


thanks,

Erik Rijkers



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Logical replication existing data copy

2017-03-29 Thread Erik Rijkers

On 2017-03-09 11:06, Erik Rijkers wrote:


I use three different machines (2 desktop, 1 server) to test logical
replication, and all three have now at least once failed to correctly
synchronise a pgbench session (amidst many succesful runs, of course)





(At the moment using tese patches for tests:)


0001-Reserve-global-xmin-for-create-slot-snasphot-export.patch +
0002-Don-t-use-on-disk-snapshots-for-snapshot-export-in-l.patch+
0003-Prevent-snapshot-builder-xmin-from-going-backwards.patch  +
0004-Fix-xl_running_xacts-usage-in-snapshot-builder.patch  +
0005-Skip-unnecessary-snapshot-builds.patch+



The failed tests that I kept seeing (see the 
pgbench-over-logical-replication tests upthread) were never really 
'solved'.



But I have now finally figured out what caused these unexpected failed 
tests: it was  wal_sender_timeout  or rather, its default of 60 s.


This caused 'terminating walsender process due to replication timeout' 
on the primary (not strictly an error), and the concomitant ERROR on 
the replica: 'could not receive data from WAL stream: server closed the 
connection unexpectedly'.


here is a typical example (primary and replica logs intertwined in time, 
marked with 'primary' / 'replica'):


[...]
2017-03-24 16:21:38.129 CET [15002]  primaryLOG:  using stale 
statistics instead of current ones because stats collector is not 
responding
2017-03-24 16:21:42.690 CET [27515]  primaryLOG:  using stale 
statistics instead of current ones because stats collector is not 
responding
2017-03-24 16:21:42.965 CET [14999]replica  LOG:  using stale 
statistics instead of current ones because stats collector is not 
responding
2017-03-24 16:21:49.816 CET [14930]  primaryLOG:  terminating 
walsender process due to
2017-03-24 16:21:49.817 CET [14926]replica  ERROR:  could not 
receive data from WAL stream: server closed the connection unexpectedly
2017-03-24 16:21:49.824 CET [27502]replica  LOG:  worker process: 
logical replication worker for subscription 24864 (PID 14926) exited 
with exit code 1
2017-03-24 16:21:49.824 CET [27521]replica  LOG:  starting logical 
replication worker for subscription "sub1"
2017-03-24 16:21:49.828 CET [15008]replica  LOG:  logical 
replication apply for subscription sub1 started
2017-03-24 16:21:49.832 CET [15009]  primaryLOG:  received 
replication command: IDENTIFY_SYSTEM
2017-03-24 16:21:49.832 CET [15009]  primaryLOG:  received 
replication command: START_REPLICATION SLOT "sub1" LOGICAL 3/FC976440 
(proto_version '1', publication_names '"pub1"')
2017-03-24 16:21:49.833 CET [15009]  primaryDETAIL:  streaming 
transactions committing after 3/FC889810, reading WAL from 3/FC820FC0
2017-03-24 16:21:49.833 CET [15009]  primaryLOG:  starting logical 
decoding for slot "sub1"
2017-03-24 16:21:50.471 CET [15009]  primaryDETAIL:  Logical 
decoding will begin using saved snapshot.
2017-03-24 16:21:50.471 CET [15009]  primaryLOG:  logical decoding 
found consistent point at 3/FC820FC0
2017-03-24 16:21:51.169 CET [15008]replica  DETAIL:  Key 
(hid)=(9014) already exists.
2017-03-24 16:21:51.169 CET [15008]replica  ERROR:  duplicate key 
value violates unique constraint "pgbench_history_pkey"
2017-03-24 16:21:51.170 CET [27502]replica  LOG:  worker process: 
logical replication worker for subscription 24864 (PID 15008) exited 
with exit code 1
2017-03-24 16:21:51.170 CET [27521]replica  LOG:  starting logical 
replication worker for subscription "sub1"

[...]

My primary and replica were always on a single machine (making it more 
likely that that timeout is reached?).  In my testing it seems that 
reaching the timeout on the primary (and 'closing the connection 
unexpectedly' on the replica) does not necessarily break the logical 
replication.  But almost all log-rep failures that I have seen were 
started by this sequence of events.


After setting  wal_sender_timeout  to 3 minutes there were no more 
failed tests.
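
(For reference, a minimal way to apply that: wal_sender_timeout only 
needs a reload on the primary, not a restart.  The port is specific to 
my setup:)

echo "alter system set wal_sender_timeout = '3min';
select pg_reload_conf();" | psql -qXp $port1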


Perhaps it warrants setting  wal_sender_timeout  a bit higher than the 
current default of 60 seconds?  After all, I also saw the 'replication 
timeout' / 'closed the connection' pair rather often during non-failing 
tests.  (These also disappeared, almost completely, with a higher 
setting of wal_sender_timeout.)


In any case it would be good to mention the setting (and its potentially 
detrimental effect) somewhere nearer the logical replication documentation.


( I read about wal_sender_timeout and keepalive ping, perhaps there's 
(still) something amiss there? Just a guess, I don't know )


As I said, I saw no more failures with the higher 3-minute setting, with 
one exception: the one test that straddled the DST change (Saturday 24 
March, 02:00 h).  I am happy to discount that one failure, but strictly 
speaking I suppose it should be able to take DST in its stride.



Thanks,

Erik Rijkers











--
Sent via pgsql-hackers

[HACKERS] walsender.c comments

2017-03-28 Thread Erik Rijkers

Small fry gathered wile reading walsender.c ...

(to be applied to master)


Thanks,

Erik Rijkers
--- src/backend/replication/walsender.c.orig	2017-03-28 08:34:56.787217522 +0200
+++ src/backend/replication/walsender.c	2017-03-28 08:44:56.486327700 +0200
@@ -14,11 +14,11 @@
  * replication-mode commands. The START_REPLICATION command begins streaming
  * WAL to the client. While streaming, the walsender keeps reading XLOG
  * records from the disk and sends them to the standby server over the
- * COPY protocol, until the either side ends the replication by exiting COPY
+ * COPY protocol, until either side ends the replication by exiting COPY
  * mode (or until the connection is closed).
  *
  * Normal termination is by SIGTERM, which instructs the walsender to
- * close the connection and exit(0) at next convenient moment. Emergency
+ * close the connection and exit(0) at the next convenient moment. Emergency
  * termination is by SIGQUIT; like any backend, the walsender will simply
  * abort and exit on SIGQUIT. A close of the connection and a FATAL error
  * are treated as not a crash but approximately normal termination;
@@ -277,7 +277,7 @@
  * Clean up after an error.
  *
  * WAL sender processes don't use transactions like regular backends do.
- * This function does any cleanup requited after an error in a WAL sender
+ * This function does any cleanup required after an error in a WAL sender
  * process, similar to what transaction abort does in a regular backend.
  */
 void
@@ -570,7 +570,7 @@
 			sendTimeLineIsHistoric = true;
 
 			/*
-			 * Check that the timeline the client requested for exists, and
+			 * Check that the timeline the client requested exists, and
 			 * the requested start location is on that timeline.
 			 */
 			timeLineHistory = readTimeLineHistory(ThisTimeLineID);
@@ -588,8 +588,8 @@
 			 * starting point. This is because the client can legitimately
 			 * request to start replication from the beginning of the WAL
 			 * segment that contains switchpoint, but on the new timeline, so
-			 * that it doesn't end up with a partial segment. If you ask for a
-			 * too old starting point, you'll get an error later when we fail
+			 * that it doesn't end up with a partial segment. If you ask for
+			 * too old a starting point, you'll get an error later when we fail
 			 * to find the requested WAL segment in pg_wal.
 			 *
 			 * XXX: we could be more strict here and only allow a startpoint
@@ -626,7 +626,7 @@
 	{
 		/*
 		 * When we first start replication the standby will be behind the
-		 * primary. For some applications, for example, synchronous
+		 * primary. For some applications, for example synchronous
 		 * replication, it is important to have a clear state for this initial
 		 * catchup mode, so we can trigger actions when we change streaming
 		 * state later. We may stay in this state for a long time, which is
@@ -954,7 +954,7 @@
 
 		ReplicationSlotMarkDirty();
 
-		/* Write this slot to disk if it's permanent one. */
+		/* Write this slot to disk if it's a permanent one. */
 		if (!cmd->temporary)
 			ReplicationSlotSave();
 	}
@@ -,7 +,7 @@
  *
  * Prepare a write into a StringInfo.
  *
- * Don't do anything lasting in here, it's quite possible that nothing will done
+ * Don't do anything lasting in here, it's quite possible that nothing will be done
  * with the data.
  */
 static void
@@ -1150,7 +1150,7 @@
 
 	/*
 	 * Fill the send timestamp last, so that it is taken as late as possible.
-	 * This is somewhat ugly, but the protocol's set as it's already used for
+	 * This is somewhat ugly, but the protocol is set as it's already used for
 	 * several releases by streaming physical replication.
 	 */
 	resetStringInfo();
@@ -1237,7 +1237,7 @@
 
 
 	/*
-	 * Fast path to avoid acquiring the spinlock in the we already know we
+	 * Fast path to avoid acquiring the spinlock in case we already know we
 	 * have enough WAL available. This is particularly interesting if we're
 	 * far behind.
 	 */
@@ -2498,7 +2498,7 @@
 		 * given the current implementation of XLogRead().  And in any case
 		 * it's unsafe to send WAL that is not securely down to disk on the
 		 * master: if the master subsequently crashes and restarts, slaves
-		 * must not have applied any WAL that gets lost on the master.
+		 * must not have applied any WAL that got lost on the master.
 		 */
 		SendRqstPtr = GetFlushRecPtr();
 	}
@@ -2522,7 +2522,7 @@
 	 * LSN.
 	 *
 	 * Note that the LSN is not necessarily the LSN for the data contained in
-	 * the present message; it's the end of the the WAL, which might be
+	 * the present message; it's the end of the WAL, which might be
 	 * further ahead.  All the lag tracking machinery cares about is finding
 	 * out when that arbitrary LSN is eventually reported as written, flushed
 	 * and applied, so that it can measure the elapsed time.
@@ -2922,7 +2922,7 @@
  * Wake up all walsenders
  *
  * This will be called inside critical sections, so throw

Re: [HACKERS] Logical replication existing data copy

2017-03-24 Thread Erik Rijkers

On 2017-03-24 10:45, Mark Kirkwood wrote:


However one minor observation - as Michael Banck noted - the elapsed
time for slave to catch up after running:

$ pgbench -c8 -T600 bench

on the master was (subjectively) much longer than for physical
streaming replication. Is this expected?



I think you probably want to do (on the slave) :

  alter role  set synchronous_commit = off;

otherwise it's indeed extremely slow.
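
Something like this on the subscriber shows which role that is -- the 
role to alter is the subscription's owner, i.e. the role the apply 
worker runs as ('sub1' and the port are just from my own setup):

echo "select r.rolname
  from pg_subscription s
  join pg_roles r on r.oid = s.subowner
  where s.subname = 'sub1';" | psql -p $port2 -qtAX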







--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] bug/oversight in TestLib.pm and PostgresNode.pm

2017-03-23 Thread Erik Rijkers

On 2017-03-23 03:28, Michael Paquier wrote:

On Thu, Mar 23, 2017 at 12:51 AM, Erik Rijkers <e...@xs4all.nl> wrote:
While trying to test pgbench's stderr (looking for 'creating tables' 
in
output of the initialisation step)  I ran into these two bugs (or 
perhaps

better 'oversights').


+   if (defined $expected_stderr) {
+   like($stderr, $expected_stderr, "$test_name: stderr matches");
+   }
+   else {
is($stderr, '', "$test_name: no stderr");
-   like($stdout, $expected_stdout, "$test_name: matches");
+   }
To simplify that you could as well set expected_output to be an empty
string, and just use like() instead of is(), saving this if/else.


(I'll assume you meant '$expected_stderr' (not 'expected_output'))

That would be nice but with that, other tests start complaining: 
"doesn't look like a regex to me"


To avoid that, I uglified your version back to:

+   like($stderr, (defined $expected_stderr ? $expected_stderr : 
qr{}),

+   "$test_name: stderr matches");

I did it like that in the attached patch 
(0001-testlib-like-stderr.diff).



The other (PostgresNode.pm.diff) is unchanged.

make check-world without error.


Thanks,

Erik Rijkers

--- src/test/perl/TestLib.pm.orig	2017-03-23 08:11:16.034410936 +0100
+++ src/test/perl/TestLib.pm	2017-03-23 08:12:33.154132124 +0100
@@ -289,13 +289,14 @@
 
 sub command_like
 {
-	my ($cmd, $expected_stdout, $test_name) = @_;
+	my ($cmd, $expected_stdout, $test_name, $expected_stderr) = @_;
 	my ($stdout, $stderr);
 	print("# Running: " . join(" ", @{$cmd}) . "\n");
 	my $result = IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
 	ok($result, "$test_name: exit code 0");
-	is($stderr, '', "$test_name: no stderr");
-	like($stdout, $expected_stdout, "$test_name: matches");
+	like($stderr, (defined $expected_stderr ? $expected_stderr : qr{}),
+	"$test_name: stderr matches");
+	like($stdout, $expected_stdout, "$test_name: stdout matches");
 }
 
 sub command_fails_like
--- src/test/perl/PostgresNode.pm.orig	2017-03-22 15:58:58.690052999 +0100
+++ src/test/perl/PostgresNode.pm	2017-03-22 15:49:38.422777312 +0100
@@ -1283,6 +1283,23 @@
 
 =pod
 
+=item $node->command_fails_like(...) - TestLib::command_fails_like with our PGPORT
+
+See command_ok(...)
+
+=cut
+
+sub command_fails_like
+{
+	my $self = shift;
+
+	local $ENV{PGPORT} = $self->port;
+
+	TestLib::command_fails_like(@_);
+}
+
+=pod
+
 =item $node->issues_sql_like(cmd, expected_sql, test_name)
 
 Run a command on the node, then verify that $expected_sql appears in the

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] bug/oversight in TestLib.pm and PostgresNode.pm

2017-03-22 Thread Erik Rijkers
I am trying to re-create pgbench-over-logical-replication as a TAP-test. 
(the wisdom of that might be doubted, and I appreciate comments on it 
too, but it's really another subject).


While trying to test pgbench's stderr (looking for 'creating tables' in 
output of the initialisation step)  I ran into these two bugs (or 
perhaps better 'oversights').


But especially the omission of command_fails_like() in PostgresNode.pm 
feels like a bug.


In the end it was necessary to change TestLib.pm's command_like() 
because command_fails_like() also checks for a non-zero return value 
(which seems to make sense, but is not possible in this case: pgbench 
returns 0 on init while writing output to stderr).



make check-world passes without error


Thanks,

Erik Rijkers

--- src/test/perl/TestLib.pm.orig	2017-03-22 11:34:36.948857255 +0100
+++ src/test/perl/TestLib.pm	2017-03-22 14:36:56.793267113 +0100
@@ -289,13 +290,18 @@
 
 sub command_like
 {
-	my ($cmd, $expected_stdout, $test_name) = @_;
+	my ($cmd, $expected_stdout, $test_name, $expected_stderr) = @_;
 	my ($stdout, $stderr);
 	print("# Running: " . join(" ", @{$cmd}) . "\n");
 	my $result = IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
 	ok($result, "$test_name: exit code 0");
+	if (defined $expected_stderr) {
+		like($stderr, $expected_stderr, "$test_name: stderr matches");
+	}
+	else {
 	is($stderr, '', "$test_name: no stderr");
-	like($stdout, $expected_stdout, "$test_name: matches");
+	}
+	like($stdout, $expected_stdout, "$test_name: stdout matches");
 }
 
 sub command_fails_like
--- src/test/perl/PostgresNode.pm.orig	2017-03-22 15:58:58.690052999 +0100
+++ src/test/perl/PostgresNode.pm	2017-03-22 15:49:38.422777312 +0100
@@ -1283,6 +1283,23 @@
 
 =pod
 
+=item $node->command_fails_like(...) - TestLib::command_fails_like with our PGPORT
+
+See command_ok(...)
+
+=cut
+
+sub command_fails_like
+{
+	my $self = shift;
+
+	local $ENV{PGPORT} = $self->port;
+
+	TestLib::command_fails_like(@_);
+}
+
+=pod
+
 =item $node->issues_sql_like(cmd, expected_sql, test_name)
 
 Run a command on the node, then verify that $expected_sql appears in the

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] more on comments of snapbuild.c

2017-03-17 Thread Erik Rijkers

On 2017-03-18 06:37, Erik Rijkers wrote:
Studying logrep yielded some more improvements to the comments in 
snapbuild.c


(to be applied to master)


Attached the actual file


thanks,

Erik Rijkers

--- src/backend/replication/logical/snapbuild.c.orig	2017-03-18 05:02:28.627077888 +0100
+++ src/backend/replication/logical/snapbuild.c	2017-03-18 06:04:48.091686815 +0100
@@ -27,7 +27,7 @@
  * removed. This is achieved by using the replication slot mechanism.
  *
  * As the percentage of transactions modifying the catalog normally is fairly
- * small in comparisons to ones only manipulating user data, we keep track of
+ * small in comparison to ones only manipulating user data, we keep track of
  * the committed catalog modifying ones inside [xmin, xmax) instead of keeping
  * track of all running transactions like it's done in a normal snapshot. Note
  * that we're generally only looking at transactions that have acquired an
@@ -42,7 +42,7 @@
  * catalog in a transaction. During normal operation this is achieved by using
  * CommandIds/cmin/cmax. The problem with that however is that for space
  * efficiency reasons only one value of that is stored
- * (c.f. combocid.c). Since ComboCids are only available in memory we log
+ * (cf. combocid.c). Since ComboCids are only available in memory we log
  * additional information which allows us to get the original (cmin, cmax)
  * pair during visibility checks. Check the reorderbuffer.c's comment above
  * ResolveCminCmaxDuringDecoding() for details.
@@ -92,7 +92,7 @@
  * Only transactions that commit after CONSISTENT state has been reached will
  * be replayed, even though they might have started while still in
  * FULL_SNAPSHOT. That ensures that we'll reach a point where no previous
- * changes has been exported, but all the following ones will be. That point
+ * changes have been exported, but all the following ones will be. That point
  * is a convenient point to initialize replication from, which is why we
  * export a snapshot at that point, which *can* be used to read normal data.
  *
@@ -134,7 +134,7 @@
 
 /*
  * This struct contains the current state of the snapshot building
- * machinery. Besides a forward declaration in the header, it is not exposed
+ * machinery. Except for a forward declaration in the header, it is not exposed
  * to the public, so we can easily change its contents.
  */
 struct SnapBuild
@@ -442,7 +442,7 @@
 
 	/*
 	 * We misuse the original meaning of SnapshotData's xip and subxip fields
-	 * to make the more fitting for our needs.
+	 * to make them more fitting for our needs.
 	 *
 	 * In the 'xip' array we store transactions that have to be treated as
 	 * committed. Since we will only ever look at tuples from transactions
@@ -645,7 +645,7 @@
 
 /*
  * Handle the effects of a single heap change, appropriate to the current state
- * of the snapshot builder and returns whether changes made at (xid, lsn) can
+ * of the snapshot builder and return whether changes made at (xid, lsn) can
  * be decoded.
  */
 bool
@@ -1143,7 +1143,7 @@
 	 */
 	builder->xmin = running->oldestRunningXid;
 
-	/* Remove transactions we don't need to keep track off anymore */
+	/* Remove transactions we don't need to keep track of anymore */
 	SnapBuildPurgeCommittedTxn(builder);
 
 	elog(DEBUG3, "xmin: %u, xmax: %u, oldestrunning: %u",
@@ -1250,7 +1250,7 @@
 	}
 
 	/*
-	 * a) No transaction were running, we can jump to consistent.
+	 * a) No transactions were running, we can jump to consistent.
 	 *
 	 * NB: We might have already started to incrementally assemble a snapshot,
 	 * so we need to be careful to deal with that.
@@ -1521,8 +1521,8 @@
 			(uint32) (lsn >> 32), (uint32) lsn, MyProcPid);
 
 	/*
-	 * Unlink temporary file if it already exists, needs to have been before a
-	 * crash/error since we won't enter this function twice from within a
+	 * Unlink temporary file if it already exists, must have been from before
+	 * a crash/error since we won't enter this function twice from within a
 	 * single decoding slot/backend and the temporary file contains the pid of
 	 * the current process.
 	 */
@@ -1624,8 +1624,8 @@
 	fsync_fname("pg_logical/snapshots", true);
 
 	/*
-	 * Now there's no way we can loose the dumped state anymore, remember this
-	 * as a serialization point.
+	 * Now that there's no way we can lose the dumped state anymore, remember
+	 * this as a serialization point.
 	 */
 	builder->last_serialized_snapshot = lsn;
 

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] more on comments of snapbuild.c

2017-03-17 Thread Erik Rijkers
Studying logrep yielded some more improvements to the comments in 
snapbuild.c


(to be applied to master)

thanks,

Erik Rijkers


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] \if, \elseif, \else, \endif (was Re: PSQL commands: \quit_if, \quit_unless)

2017-03-17 Thread Erik Rijkers

On 2017-03-17 02:28, Corey Huinker wrote:
Attached is the latest work. Not everything is done yet. I post it 
because



0001.if_endif.v23.diff



This patch does not compile for me (gcc 6.3.0):

command.c:38:25: fatal error: conditional.h: No such file or directory
 #include "conditional.h"
 ^
compilation terminated.
make[3]: *** [command.o] Error 1
make[2]: *** [all-psql-recurse] Error 2
make[2]: *** Waiting for unfinished jobs
make[1]: *** [all-bin-recurse] Error 2
make: *** [all-src-recurse] Error 2

Perhaps that is expected, as "Not everything is done yet", but I can't 
tell from your email, so I thought I'd report it anyway. Ignore as 
appropriate...



Thanks,

Erik Rijkers


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] improve comments of snapbuild.c

2017-03-14 Thread Erik Rijkers

Improvements (grammar/typos) in the comments in snapbuild.c

To be applied to master.

thanks,

Erik Rijkers

--- src/backend/replication/logical/snapbuild.c.orig	2017-03-14 21:53:42.590196415 +0100
+++ src/backend/replication/logical/snapbuild.c	2017-03-14 21:57:57.906539208 +0100
@@ -34,7 +34,7 @@
  * xid. That is we keep a list of transactions between snapshot->(xmin, xmax)
  * that we consider committed, everything else is considered aborted/in
  * progress. That also allows us not to care about subtransactions before they
- * have committed which means this modules, in contrast to HS, doesn't have to
+ * have committed which means this module, in contrast to HS, doesn't have to
  * care about suboverflowed subtransactions and similar.
  *
  * One complexity of doing this is that to e.g. handle mixed DDL/DML
@@ -82,7 +82,7 @@
  * Initially the machinery is in the START stage. When an xl_running_xacts
  * record is read that is sufficiently new (above the safe xmin horizon),
  * there's a state transition. If there were no running xacts when the
- * runnign_xacts record was generated, we'll directly go into CONSISTENT
+ * running_xacts record was generated, we'll directly go into CONSISTENT
  * state, otherwise we'll switch to the FULL_SNAPSHOT state. Having a full
  * snapshot means that all transactions that start henceforth can be decoded
  * in their entirety, but transactions that started previously can't. In
@@ -273,7 +273,7 @@
 /*
  * Allocate a new snapshot builder.
  *
- * xmin_horizon is the xid >=which we can be sure no catalog rows have been
+ * xmin_horizon is the xid >= which we can be sure no catalog rows have been
  * removed, start_lsn is the LSN >= we want to replay commits.
  */
 SnapBuild *
@@ -1840,7 +1840,7 @@
 	char		path[MAXPGPATH];
 
 	/*
-	 * We start of with a minimum of the last redo pointer. No new replication
+	 * We start off with a minimum of the last redo pointer. No new replication
 	 * slot will start before that, so that's a safe upper bound for removal.
 	 */
 	redo = GetRedoRecPtr();
@@ -1898,7 +1898,7 @@
 			/*
 			 * It's not particularly harmful, though strange, if we can't
 			 * remove the file here. Don't prevent the checkpoint from
-			 * completing, that'd be cure worse than the disease.
+			 * completing, that'd be a cure worse than the disease.
 			 */
 			if (unlink(path) < 0)
 			{

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Logical replication existing data copy

2017-03-09 Thread Erik Rijkers

On 2017-03-09 11:06, Erik Rijkers wrote:

On 2017-03-08 10:36, Petr Jelinek wrote:

On 07/03/17 23:30, Erik Rijkers wrote:

On 2017-03-06 11:27, Petr Jelinek wrote:


0001-Reserve-global-xmin-for-create-slot-snasphot-export.patch +
0002-Don-t-use-on-disk-snapshots-for-snapshot-export-in-l.patch+
0003-Prevent-snapshot-builder-xmin-from-going-backwards.patch  +
0004-Fix-xl_running_xacts-usage-in-snapshot-builder.patch  +
0005-Skip-unnecessary-snapshot-builds.patch+
0001-Logical-replication-support-for-initial-data-copy-v6.patch




The attached bz2 contains
- an output file from pgbench_derail2.sh (also attached, as it changes
somewhat all the time);

- the pg_waldump output from both master (file with .1. in it) and
replica (.2.).

- the 2 logfiles.




I forgot to include the bash-output file.  Now attached.  This file 
should have been in the bz2 I sent a few minutes ago.



= iteration 1 --   1 of 10 =

-- scale 25   clients 64   duration 300   CLEAN_ONLY=

-- hostname: barzoi
-- timestamp: 20170309_1021
-- master_start_time 2017-03-08 12:04:02.127127+01 replica_start_time 
2017-03-08 12:04:02.12713+01
-- master  patch-md5 [59c92165d4a328d68450ef0e922c0a42]
-- replica patch-md5 [59c92165d4a328d68450ef0e922c0a42] (ok)
-- synchronous_commit, master  [on]  replica [off]
-- master_assert  [on]  replica_assert [on]
-- self md5 87554cfed7cda67ad292b6481e1b8b41 ./pgbench_derail2.sh
clean-at-start-call
creating tables...
1699900 of 250 tuples (67%) done (elapsed 5.19 s, remaining 2.44 s)
250 of 250 tuples (100%) done (elapsed 7.51 s, remaining 0.00 s)
vacuum...
set primary keys...
done.
create publication pub1 for all tables;
create subscription sub1 connection 'port=6972' publication pub1 with 
(disabled);
alter subscription sub1 enable;
-- pgbench -c 64 -j 8 -T 300 -P 60 -n   --  scale 25
progress: 60.0 s, 134.4 tps, lat 472.280 ms stddev 622.992
progress: 120.0 s, 26.4 tps, lat 2083.748 ms stddev 4356.546
progress: 180.0 s, 21.2 tps, lat 2977.751 ms stddev 4767.332
progress: 240.0 s, 13.5 tps, lat 5230.657 ms stddev 7029.718
progress: 300.0 s, 42.4 tps, lat 1555.645 ms stddev 1733.152
transaction type: 
scaling factor: 25
query mode: simple
number of clients: 64
number of threads: 8
duration: 300 s
number of transactions actually processed: 14336
latency average = 1342.222 ms
latency stddev = 3043.759 ms
tps = 47.383887 (including connections establishing)
tps = 47.385513 (excluding connections establishing)
-- waiting 0s... (always)
2017.03.09 10:27:56
-- getting md5 (cb)
6972 a,b,t,h:  250 25250  14336  ee0f7bfd9 960d7d79c 3e8af1e9e 
cd2bd0395 master
6973 a,b,t,h:  250 25250  14336  ee0f7bfd9 960d7d79c 3e8af1e9e 
cd2bd0395 replica ok  578113f12
2017.03.09 10:29:18
-- All is well.
-- 0 seconds total.  scale 25  clients 64  -T 300
-- waiting 20s, then end-cleaning
clean-at-end-call
sub_count -ne 0 : deleting sub1 (plain)
ERROR:  could not drop the replication slot "sub1" on publisher
DETAIL:  The error was: ERROR:  replication slot "sub1" is active for PID 10569
sub_count -ne 0 : deleting sub1 (nodrop)
pub_count -ne 0 - deleting pub1
pub_repl_slot_count -ne 0 - deleting (sub1)
ERROR:  replication slot "sub1" is active for PID 10569

 pub_count  0
   pub_repl_slot_count  1
 sub_count  0
   sub_repl_slot_count  0

-- imperfect cleanup, pg_waldump to unclean.20170309_1021.txt.bz2, waiting 60 
s, then exit
-- testset.sh: waiting 10s...

= iteration 2 --   2 of 10 =

-- scale 25   clients 64   duration 300   CLEAN_ONLY=

-- hostname: barzoi
-- timestamp: 20170309_1021
-- master_start_time 2017-03-08 12:04:02.127127+01 replica_start_time 
2017-03-08 12:04:02.12713+01
-- master  patch-md5 [59c92165d4a328d68450ef0e922c0a42]
-- replica patch-md5 [59c92165d4a328d68450ef0e922c0a42] (ok)
-- synchronous_commit, master  [on]  replica [off]
-- master_assert  [on]  replica_assert [on]
-- self md5 87554cfed7cda67ad292b6481e1b8b41 ./pgbench_derail2.sh
clean-at-start-call
pub_repl_slot_count -ne 0 - deleting (sub1)
 pg_drop_replication_slot 
--
 
(1 row)

creating tables...
1596800 of 250 tuples (63%) done (elapsed 5.09 s, remaining 2.88 s)
250 of 250 tuples (100%) done (elapsed 7.88 s, remaining 0.00 s)
vacuum...
set primary keys...
done.
create publication pub1 for all tables;
create subscription sub1 connection 'port=6972' publication pub1 with 
(disabled);
alter subscription sub1 enable;
-- pgbench -c 64 -j 8 -T 300 -P 60 -n   --  scale 25
progress: 60.0 s, 129.0 tps, lat 493.130 ms stddev 635.654
progress: 120.0 s, 34.0 tps, l

Re: [HACKERS] Logical replication existing data copy

2017-03-09 Thread Erik Rijkers

On 2017-03-09 11:06, Erik Rijkers wrote:



file Name:
logrep.20170309_1021.1.1043.scale_25.clients_64.NOK.log

20170309_1021 is the start-time of the script
1  is master (2 is replica)
1043 is the time, 10:43, just before the pg_waldump call


Sorry, that might be confusing.  That 10:43 is the time when the script 
renames and copies the logfiles (not the waldump).



I meant to show the name of the waldump file:

waldump.20170309_1021_1039.1.5.000100270069.txt.bz2
where:

20170309_1021 is the start-time of the script
1  is master (2 is replica)
5 is wait-state cycles during which all 8 md5s remained the same
1039 is the time, 10:39, just before the pg_waldump call



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Logical replication existing data copy

2017-03-07 Thread Erik Rijkers

On 2017-03-06 11:27, Petr Jelinek wrote:


0001-Reserve-global-xmin-for-create-slot-snasphot-export.patch +
0002-Don-t-use-on-disk-snapshots-for-snapshot-export-in-l.patch+
0003-Prevent-snapshot-builder-xmin-from-going-backwards.patch  +
0004-Fix-xl_running_xacts-usage-in-snapshot-builder.patch  +
0005-Skip-unnecessary-snapshot-builds.patch+
0001-Logical-replication-support-for-initial-data-copy-v6.patch


I use three different machines (2 desktop, 1 server) to test logical 
replication, and all three have now at least once failed to correctly 
synchronise a pgbench session (amidst many successful runs, of course)


I attach an output-file from the test-program, with the 2 logfiles 
(master+replica) of the failed run.  The outputfile 
(out_20170307_1613.txt) contains the output of 5 runs of 
pgbench_derail2.sh.  The first run failed, the next 4 were ok.


But that's probably not very useful; perhaps pg_waldump is more useful?  
From what moment, or leading up to what moment, or period, is a 
pg_waldump(s) useful?  I can run it from the script, repeatedly, and 
only keep the dumped files when things go awry. Would that make sense?
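
For the record, what I have in mind is roughly this -- only a sketch; the
segment selection and the $ts timestamp variable are examples, the paths are
the data_dir1/data_dir2 from instances.sh:

  # after a failed iteration, dump the newest WAL segment of both instances
  seg=$(ls -t $data_dir1/pg_wal | grep -E '^[0-9A-F]{24}$' | head -1)
  pg_waldump -p $data_dir1/pg_wal $seg > waldump.$ts.1.txt   # master
  seg=$(ls -t $data_dir2/pg_wal | grep -E '^[0-9A-F]{24}$' | head -1)
  pg_waldump -p $data_dir2/pg_wal $seg > waldump.$ts.2.txt   # replica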


Any other ideas welcome.


thanks,

Erik Rijkers





20170307_1613.tar.bz2
Description: BZip2 compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Logical replication existing data copy

2017-03-06 Thread Erik Rijkers

On 2017-03-06 16:10, Erik Rijkers wrote:

On 2017-03-06 11:27, Petr Jelinek wrote:

Hi,

updated and rebased version of the patch attached.



I compiled with /only/ this one latest patch:
   0001-Logical-replication-support-for-initial-data-copy-v6.patch

Is that correct, or are other patches still needed on top, or 
underneath?




TWIMC, I'll answer my own question: the correct patchset seems to be 
these six:


0001-Reserve-global-xmin-for-create-slot-snasphot-export.patch
0002-Don-t-use-on-disk-snapshots-for-snapshot-export-in-l.patch
0003-Prevent-snapshot-builder-xmin-from-going-backwards.patch
0004-Fix-xl_running_xacts-usage-in-snapshot-builder.patch
0005-Skip-unnecessary-snapshot-builds.patch
0001-Logical-replication-support-for-initial-data-copy-v6.patch

These compile, make check, and install fine.  make check-world is also 
without errors.


Logical replication tests are now running again (no errors yet); they'll 
have to run for a few hours with varying parameters to gain some 
confidence but it's looking good for the moment.



Erik Rijkers







--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Logical replication existing data copy

2017-03-06 Thread Erik Rijkers

On 2017-03-06 11:27, Petr Jelinek wrote:

Hi,

updated and rebased version of the patch attached.



I compiled with /only/ this one latest patch:
   0001-Logical-replication-support-for-initial-data-copy-v6.patch

Is that correct, or are other patches still needed on top, or 
underneath?


Anyway, with that one patch, and even after
  alter role ... set synchronous_commit = off;
the process is very slow. (sufficiently slow that I haven't
had the patience to see it to completion yet)

What am I doing wrong?

thanks,

Erik Rijkers


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] snapbuild woes

2017-03-02 Thread Erik Rijkers

On 2017-03-03 01:30, Petr Jelinek wrote:

With these patches:

0001-Use-asynchronous-connect-API-in-libpqwalreceiver.patch
0002-Fix-after-trigger-execution-in-logical-replication.patch
0003-Add-RENAME-support-for-PUBLICATIONs-and-SUBSCRIPTION.patch
snapbuild-v5-0001-Reserve-global-xmin-for-create-slot-snasphot-export.patch
snapbuild-v5-0002-Don-t-use-on-disk-snapshots-for-snapshot-export-in-l.patch
snapbuild-v5-0003-Prevent-snapshot-builder-xmin-from-going-backwards.patch
snapbuild-v5-0004-Fix-xl_running_xacts-usage-in-snapshot-builder.patch
snapbuild-v5-0005-Skip-unnecessary-snapshot-builds.patch
0001-Logical-replication-support-for-initial-data-copy-v6.patch

I get:

subscriptioncmds.c:47:12: error: static declaration of ‘oid_cmp’ follows 
non-static declaration

 static int oid_cmp(const void *p1, const void *p2);
^~~
In file included from subscriptioncmds.c:42:0:
../../../src/include/utils/builtins.h:70:12: note: previous declaration 
of ‘oid_cmp’ was here

 extern int oid_cmp(const void *p1, const void *p2);
^~~
make[3]: *** [subscriptioncmds.o] Error 1
make[3]: *** Waiting for unfinished jobs
make[2]: *** [commands-recursive] Error 2
make[2]: *** Waiting for unfinished jobs
make[1]: *** [all-backend-recurse] Error 2
make: *** [all-src-recurse] Error 2




--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Logical replication existing data copy

2017-02-28 Thread Erik Rijkers

On 2017-02-28 07:38, Erik Rijkers wrote:

On 2017-02-27 15:08, Petr Jelinek wrote:


0001-Use-asynchronous-connect-API-in-libpqwalreceiver.patch 
+
0002-Fix-after-trigger-execution-in-logical-replication.patch   
+
0003-Add-RENAME-support-for-PUBLICATIONs-and-SUBSCRIPTION.patch 
+
snapbuild-v4-0001-Reserve-global-xmin-for-create-slot-snasphot-export.patch 
+

snapbuild-v4-0002-Don-t-use-on-disk-snapshots-for-snapshot-export-in-l.patch+
snapbuild-v4-0003-Prevent-snapshot-builder-xmin-from-going-backwards.patch 
 +
snapbuild-v4-0004-Fix-xl_running_xacts-usage-in-snapshot-builder.patch  
+
snapbuild-v4-0005-Skip-unnecessary-snapshot-builds.patch
+

0001-Logical-replication-support-for-initial-data-copy-v6.patch


This is the most frequent error that happens while doing pgbench-runs 
over logical replication: I run it continuously all day, and every few 
hours an error occurs of the kind seen below: a table (pgbench_history, 
mostly) ends up 1 row short (673466 instead of 673467).  I have the 
script wait a long time before calling it an error (because in theory it 
could still 'finish', and end successfully (although that has not 
happened yet, once the system got into this state).


-- pgbench -c 16 -j 8 -T 120 -P 24 -n -M simple  --  scale 25

[...]
6972 a,b,t,h:  250 25250 673467   e53236c09 643235708 
f952814c3 559d618cd   master
6973 a,b,t,h:  250 25250 673466   e53236c09 643235708 
f952814c3 4b09337e3   replica NOK   a22fb00a6

-- wait another 5 s   (total 20 s) (unchanged 1)
-- getting md5 (cb)
6972 a,b,t,h:  250 25250 673467   e53236c09 643235708 
f952814c3 559d618cd   master
6973 a,b,t,h:  250 25250 673466   e53236c09 643235708 
f952814c3 4b09337e3   replica NOK   a22fb00a6

-- wait another 5 s   (total 25 s) (unchanged 2)
-- getting md5 (cb)
6972 a,b,t,h:  250 25250 673467   e53236c09 643235708 
f952814c3 559d618cd   master
6973 a,b,t,h:  250 25250 673466   e53236c09 643235708 
f952814c3 4b09337e3   replica NOK   a22fb00a6

-- wait another 5 s   (total 30 s) (unchanged 3)
-- getting md5 (cb)
6972 a,b,t,h:  250 25250 673467   e53236c09 643235708 
f952814c3 559d618cd   master
6973 a,b,t,h:  250 25250 673466   e53236c09 643235708 
f952814c3 4b09337e3   replica NOK   a22fb00a6

-- wait another 5 s   (total 35 s) (unchanged 4)


I gathered some info in this (probably deadlocked) state in the hope 
there is something suspicious in there:



UID PID   PPID  C STIME TTY  STAT   TIME CMD
rijkers   71203  1  0 20:06 pts/57   S  0:00 postgres -D 
/var/data1/pg_stuff/pg_installations/pgsql.logical_replication2/data -p 
6973 -c wal_level=replica [...]
rijkers   71214  71203  0 20:06 ?Ss 0:00  \_ postgres: 
logger process
rijkers   71216  71203  0 20:06 ?Ss 0:00  \_ postgres: 
checkpointer process
rijkers   71217  71203  0 20:06 ?Ss 0:00  \_ postgres: 
writer process
rijkers   71218  71203  0 20:06 ?Ss 0:00  \_ postgres: wal 
writer process
rijkers   71219  71203  0 20:06 ?Ss 0:00  \_ postgres: 
autovacuum launcher process
rijkers   71220  71203  0 20:06 ?Ss 0:00  \_ postgres: stats 
collector process
rijkers   71221  71203  0 20:06 ?Ss 0:00  \_ postgres: 
bgworker: logical replication launcher
rijkers   71222  71203  0 20:06 ?Ss 0:00  \_ postgres: 
bgworker: logical replication worker 30042


rijkers   71201  1  0 20:06 pts/57   S  0:00 postgres -D 
/var/data1/pg_stuff/pg_installations/pgsql.logical_replication/data -p 
6972 -c wal_level=logical [...]
rijkers   71206  71201  0 20:06 ?Ss 0:00  \_ postgres: 
logger process
rijkers   71208  71201  0 20:06 ?Ss 0:00  \_ postgres: 
checkpointer process
rijkers   71209  71201  0 20:06 ?Ss 0:00  \_ postgres: 
writer process
rijkers   71210  71201  0 20:06 ?Ss 0:00  \_ postgres: wal 
writer process
rijkers   71211  71201  0 20:06 ?Ss 0:00  \_ postgres: 
autovacuum launcher process
rijkers   71212  71201  0 20:06 ?Ss 0:00  \_ postgres: stats 
collector process
rijkers   71213  71201  0 20:06 ?Ss 0:00  \_ postgres: 
bgworker: logical replication launcher
rijkers   71223  71201  0 20:06 ?Ss 0:00  \_ postgres: wal 
sender process rijkers [local] idle





-- replica:
 port | shared_buffers | work_mem | m_w_m | e_c_s
--++--+---+---
 6973 | 100MB  | 50MB | 2GB   | 64GB
(1 row)

select  current_setting('port') as port
, datname   as db
,  to_char(pg_database_size(datname), '9G999G999G999G999')
 || ' (' ||  pg_size_pretty(pg_database_size(datname)) || ')' as 
dbsize

, pid
, application_name  as app
, xact_start
, query_start
, regexp_replace( cast(now() - query_start as text), 
E'\.[[:digit

Re: [HACKERS] Logical replication existing data copy

2017-02-27 Thread Erik Rijkers

On 2017-02-27 15:08, Petr Jelinek wrote:


The performance was why in original patch I wanted the apply process to
default to synchronous_commit = off as without it the apply performance
(due to applying transactions individually and in sequences) is quite
lackluster.

It can be worked around using user that has synchronous_commit = off 
set

via ALTER ROLE as owner of the subscription.



Wow, that's a huge difference in speed.

I set
   ALTER ROLE aardvark SET synchronous_commit = off;

during the first iteration of a 10x pgbench-test (so the first was still 
done with it 'on'):

here the pertinent grep | uniq -c lines:

-- out_20170228_0004.txt
 10 -- pgbench -c 16 -j 8 -T 900 -P 180 -n   --  scale 25
 10 -- All is well.
  1 -- 1325 seconds total.
  9 -- 5 seconds total.

And that 5 seconds is a hardcoded wait; so it's probably even quicker.

This is a slowish machine but that's a really spectacular difference.  
It's the difference between keeping up and getting lost.


Would you remind me why synchronous_commit = on was deemed a better 
default?  This thread isn't very clear about it (nor the 'logical 
replication WIP' thread).



thanks,

Erik Rijkers


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Logical replication existing data copy

2017-02-27 Thread Erik Rijkers

With these patches:

-- 0416d87c-09a5-182e-4901-236aec103...@2ndquadrant.com
   Subject: Re: Logical Replication WIP
  48. 
https://www.postgresql.org/message-id/attachment/49886/0001-Use-asynchronous-connect-API-in-libpqwalreceiver.patch
  49. 
https://www.postgresql.org/message-id/attachment/49887/0002-Fix-after-trigger-execution-in-logical-replication.patch
  50. 
https://www.postgresql.org/message-id/attachment/49888/0003-Add-RENAME-support-for-PUBLICATIONs-and-SUBSCRIPTION.patch


-- 51f65289-54f8-2256-d107-937d662d6...@2ndquadrant.com
   Subject: Re: snapbuild woes
  48. 
https://www.postgresql.org/message-id/attachment/49995/snapbuild-v4-0001-Reserve-global-xmin-for-create-slot-snasphot-export.patch
  49. 
https://www.postgresql.org/message-id/attachment/49996/snapbuild-v4-0002-Don-t-use-on-disk-snapshots-for-snapshot-export-in-l.patch
  50. 
https://www.postgresql.org/message-id/attachment/49997/snapbuild-v4-0003-Prevent-snapshot-builder-xmin-from-going-backwards.patch
  51. 
https://www.postgresql.org/message-id/attachment/49998/snapbuild-v4-0004-Fix-xl_running_xacts-usage-in-snapshot-builder.patch
  52. 
https://www.postgresql.org/message-id/attachment/4/snapbuild-v4-0005-Skip-unnecessary-snapshot-builds.patch


-- c0f90176-efff-0770-1e79-0249fb4b9...@2ndquadrant.com
   Subject: Re: Logical replication existing data copy
  48. 
https://www.postgresql.org/message-id/attachment/49977/0001-Logical-replication-support-for-initial-data-copy-v6.patch



logical replication now seems pretty stable, at least for the limited 
testcase that I am using.  I've done dozens of pgbench_derail2.sh runs 
without failure.  I am now changing the pgbench-test to larger scale 
(pgbench -i -s) and longer periods (-T), which makes running the test slow 
(both instances are running on a modest desktop with a single 7200 rpm 
disk).  It is quite a bit slower than I expected (a 5-minute pgbench 
scale 5, with 8 clients, takes, after it has finished on master, another 
2-3 minutes to get synced on the replica).  I suppose it's just a 
hardware limitation.  I set max_sync_workers_per_subscription to 6 (from 
default 2) but it doesn't help much (at all).
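
For reference, I set it roughly like this, as an extra -c line appended to
the subscriber's options in instances.sh (a sketch):

  options2="$options2 -c max_sync_workers_per_subscription=6"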


To be continued...


Thanks,

Erik Rijkers


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Logical replication existing data copy

2017-02-26 Thread Erik Rijkers

On 2017-02-26 10:53, Erik Rijkers wrote:


Not yet perfect, but we're getting there...


Sorry, I made a mistake: I was running the newest patches on master but 
the older versions on the replica (or more precisely: I didn't properly 
shut down the replica, so the older version remained up and running during 
subsequent testing).


So my last email mentioning the 'DROP SUBSCRIPTION' hang error is 
hopefully wrong.


I'll get back when I've repeated these tests. This will take some hours 
(at least).


Sorry to cause you these palpitations, perhaps unnecessarily...


Erik Rijkers





--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Logical replication existing data copy

2017-02-26 Thread Erik Rijkers

On 2017-02-26 01:45, Petr Jelinek wrote:

Again, much better... :

-- out_20170226_0724.txt
 25 -- pgbench -c 1 -j 8 -T 10 -P 5 -n
 25 -- All is well.
-- out_20170226_0751.txt
 25 -- pgbench -c 4 -j 8 -T 10 -P 5 -n
 25 -- All is well.
-- out_20170226_0819.txt
 25 -- pgbench -c 8 -j 8 -T 10 -P 5 -n
 25 -- All is well.
-- out_20170226_0844.txt
 25 -- pgbench -c 16 -j 8 -T 10 -P 5 -n
 25 -- All is well.
-- out_20170226_0912.txt
 25 -- pgbench -c 32 -j 8 -T 10 -P 5 -n
 25 -- All is well.
-- out_20170226_0944.txt
 25 -- scale  5 clients  1   INIT_WAIT  0CLEAN_ONLY=
 25 -- pgbench -c 1 -j 8 -T 10 -P 5 -n
 25 -- All is well.

 but not perfect: with the next scale up (pgbench scale 25) I got:

-- out_20170226_1001.txt
  3 -- scale  25 clients  1   INIT_WAIT  0CLEAN_ONLY=
  3 -- pgbench -c 1 -j 8 -T 10 -P 5 -n
  2 -- All is well.
  1 -- Not good, but breaking out of wait (waited more than 60s)

It looks like something got stuck at DROP SUBSCRIPTION again which, I 
think, derives from this line:

echo "drop subscription if exists sub1"  | psql -qXp $port2

I don't know exactly what is useful/useless to report; below is the 
state of some tables/views (note that this is from 31 minutes after the 
fact (see 'duration' in the first query)), and a backtrace:



$ ./view.sh
select current_setting('port') as port;
 port
--
 6973
(1 row)

select
  rpad(now()::text,19) as now
, pid   as pid
, application_name  as app
, state as state
, wait_eventas wt_evt
, wait_event_type   as wt_evt_type
, date_trunc('second', query_start::timestamp)  as query_start
, substring((now() - query_start)::text, 1, position('.' in (now() - 
query_start)::text)-1) as duration

, query
from pg_stat_activity
where query !~ 'pg_stat_activity'
;
 now |  pid  | app   
  | state  | wt_evt | wt_evt_type | query_start 
| duration |  query

-+---+-+++-+-+--+--
 2017-02-26 10:42:43 | 28232 | logical replication worker 31929  
  | active | relation   | Lock| 
|  |
 2017-02-26 10:42:43 | 28237 | logical replication worker 31929 sync 
31906 || LogicalSyncStateChange | IPC |  
   |  |
 2017-02-26 10:42:43 | 28242 | logical replication worker 31929 sync 
31909 || transactionid  | Lock|  
   |  |
 2017-02-26 10:42:43 | 32023 | psql  
  | active | BgWorkerShutdown   | IPC | 2017-02-26 10:10:52 
| 00:31:51 | drop subscription if exists sub1

(4 rows)

select * from pg_stat_replication;
 pid | usesysid | usename | application_name | client_addr | 
client_hostname | client_port | backend_start | backend_xmin | state | 
sent_location | write_location | flush_location | replay_location | 
sync_priority | sync_state

-+--+-+--+-+-+-+---+--+---+---+++-+---+
(0 rows)

select * from pg_stat_subscription;
 subid | subname |  pid  | relid | received_lsn |  
last_msg_send_time   | last_msg_receipt_time | 
latest_end_lsn |latest_end_time

---+-+---+---+--+---+---++---
 31929 | sub1| 28242 | 31909 |  | 2017-02-26 
10:07:05.723093+01 | 2017-02-26 10:07:05.723093+01 || 
2017-02-26 10:07:05.723093+01
 31929 | sub1| 28237 | 31906 |  | 2017-02-26 
10:07:04.721229+01 | 2017-02-26 10:07:04.721229+01 || 
2017-02-26 10:07:04.721229+01
 31929 | sub1| 28232 |   | 1/73497468   |
   | 2017-02-26 10:07:47.781883+01 | 1/59A73EF8 | 2017-02-26 
10:07:04.720595+01

(3 rows)

select * from pg_subscription;
 subdbid | subname | subowner | subenabled | subconninfo | subslotname | 
subpublications

-+-+--++-+-+-
   16384 | sub1|   10 | t  | port=6972   | sub1| 
{pub1}

(1 row)

select * from pg_subscription_rel;
 srsubid | srrelid | srsubstate |  srsublsn
-+-++
   31929 |   31912 | i  |
   31929 |   31917 | i  |
   31929 |   31909 | d  |
   31929 |   31906 | w  | 1/73498F90
(4 rows)

Dunno if a backtrace is useful

$ gdb -pid 32023  (from the DROP SUBSCRIPTION 

Re: [HACKERS] Logical replication existing data copy

2017-02-25 Thread Erik Rijkers

On 2017-02-25 00:40, Petr Jelinek wrote:


0001-Use-asynchronous-connect-API-in-libpqwalreceiver.patch
0002-Fix-after-trigger-execution-in-logical-replication.patch
0003-Add-RENAME-support-for-PUBLICATIONs-and-SUBSCRIPTION.patch
snapbuild-v3-0001-Reserve-global-xmin-for-create-slot-snasphot-export.patch
snapbuild-v3-0002-Don-t-use-on-disk-snapshots-for-snapshot-export-in-l.patch
snapbuild-v3-0003-Fix-xl_running_xacts-usage-in-snapshot-builder.patch
snapbuild-v3-0004-Skip-unnecessary-snapshot-builds.patch
0001-Logical-replication-support-for-initial-data-copy-v6.patch


Here are some results. There is improvement although it's not an 
unqualified success.


Several repeat-runs of pgbench_derail2.sh, with different parameters for 
number-of-client yielded an output file each.


Those show that logrep is now pretty stable when there is only 1 client 
(pgbench -c 1).  But it starts making mistakes with 4, 8, 16 clients.  
I'll just show a grep of the output files; I think it is 
self-explanatory:


Output-files (lines counted with  grep | sort | uniq -c):

-- out_20170225_0129.txt
250 -- pgbench -c 1 -j 8 -T 10 -P 5 -n
250 -- All is well.

-- out_20170225_0654.txt
 25 -- pgbench -c 4 -j 8 -T 10 -P 5 -n
 24 -- All is well.
  1 -- Not good, but breaking out of wait (waited more than 60s)

-- out_20170225_0711.txt
 25 -- pgbench -c 8 -j 8 -T 10 -P 5 -n
 23 -- All is well.
  2 -- Not good, but breaking out of wait (waited more than 60s)

-- out_20170225_0803.txt
 25 -- pgbench -c 16 -j 8 -T 10 -P 5 -n
 11 -- All is well.
 14 -- Not good, but breaking out of wait (waited more than 60s)

So, that says:
1 client: 250x success, zero fail (250 not a typo, ran this overnight)
4 clients: 24x success, 1 fail
8 clients: 23x success, 2 fail
16 clients: 11x success, 14 fail

I want to repeat what I said a few emails back: problems seem to 
disappear when a short wait state is introduced (directly after the 
'alter subscription sub1 enable' line) to give the logrep machinery time 
to 'settle'. It makes one think of a timing error somewhere (now don't 
ask me where..).
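
The relevant bit of pgbench_derail2.sh is roughly this (a sketch; the real
script differs a bit, and $clients/$INIT_WAIT are its variables):

  echo "alter subscription sub1 enable;" | psql -qXp $port2
  sleep $INIT_WAIT    # 0 = start pgbench immediately, 10 = 'settle' period
  pgbench -c $clients -j 8 -T 10 -P 5 -n -p $port1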


To show that, here is pgbench_derail2.sh output with a 10-second wait 
(INIT_WAIT in the script); with such a 'settle' period it works faultlessly 
(with 16 clients):


-- out_20170225_0852.txt
 25 -- pgbench -c 16 -j 8 -T 10 -P 5 -n
 25 -- All is well.

QED.

(By the way, no hung sessions so far, so that's good)


thanks

Erik Rijkers


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Logical replication existing data copy

2017-02-24 Thread Erik Rijkers

On 2017-02-25 00:08, Petr Jelinek wrote:


There is now a lot of fixes for existing code that this patch depends
on. Hopefully some of the fixes get committed soonish.


Indeed - could you look over the below list of 8 patches; is it correct 
and in the right (apply) order?


0001-Use-asynchronous-connect-API-in-libpqwalreceiver.patch
0002-Fix-after-trigger-execution-in-logical-replication.patch
0003-Add-RENAME-support-for-PUBLICATIONs-and-SUBSCRIPTION.patch
snapbuild-v3-0001-Reserve-global-xmin-for-create-slot-snasphot-export.patch
snapbuild-v3-0002-Don-t-use-on-disk-snapshots-for-snapshot-export-in-l.patch
snapbuild-v3-0003-Fix-xl_running_xacts-usage-in-snapshot-builder.patch
snapbuild-v3-0004-Skip-unnecessary-snapshot-builds.patch
0001-Logical-replication-support-for-initial-data-copy-v6.patch

(they do apply & compile like this...)







--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Logical replication existing data copy

2017-02-24 Thread Erik Rijkers

On 2017-02-24 22:58, Petr Jelinek wrote:

On 23/02/17 01:41, Petr Jelinek wrote:

On 23/02/17 01:02, Erik Rijkers wrote:

On 2017-02-22 18:13, Erik Rijkers wrote:

On 2017-02-22 14:48, Erik Rijkers wrote:

On 2017-02-22 13:03, Petr Jelinek wrote:


0001-Skip-unnecessary-snapshot-builds.patch
0002-Don-t-use-on-disk-snapshots-for-snapshot-export-in-l.patch
0003-Fix-xl_running_xacts-usage-in-snapshot-builder.patch
0001-Use-asynchronous-connect-API-in-libpqwalreceiver.patch
0002-Fix-after-trigger-execution-in-logical-replication.patch
0003-Add-RENAME-support-for-PUBLICATIONs-and-SUBSCRIPTION.patch
0001-Logical-replication-support-for-initial-data-copy-v5.patch


It works well now, or at least my particular test case seems now 
solved.


Cried victory too early, I'm afraid.



I got into a 'hung' state while repeating  pgbench_derail2.sh.

Below is some state.  I notice that master pg_stat_replication.state is
'startup'.
Maybe I should only start the test after that state has changed. Any of the
other possible values (catchup, streaming) would be OK, I would think.



I think that's known issue (see comment in tablesync.c about hanging
forever). I think I may have fixed it locally.

I will submit patch once I fixed the other snapshot issue (I managed 
to
reproduce it as well, although very rarely so it's rather hard to 
test).




Hi,

Here it is. But check also the snapbuild related thread for updated
patches related to that (the issue you had with this not copying all
rows is yet another pre-existing Postgres bug).




The four earlier snapbuild patches apply cleanly, but
then I get errors while  applying
0001-Logical-replication-support-for-initial-data-copy-v6.patch:


patching file src/test/regress/expected/sanity_check.out
(Stripping trailing CRs from patch.)
patching file src/test/regress/expected/subscription.out
Hunk #2 FAILED at 25.
1 out of 2 hunks FAILED -- saving rejects to file 
src/test/regress/expected/subscription.out.rej

(Stripping trailing CRs from patch.)
patching file src/test/regress/sql/object_address.sql
(Stripping trailing CRs from patch.)
patching file src/test/regress/sql/subscription.sql
(Stripping trailing CRs from patch.)
patching file src/test/subscription/t/001_rep_changes.pl
Hunk #9 succeeded at 175 with fuzz 2.
Hunk #10 succeeded at 193 (offset -9 lines).
(Stripping trailing CRs from patch.)
patching file src/test/subscription/t/002_types.pl
(Stripping trailing CRs from patch.)
can't find file to patch at input line 4296
Perhaps you used the wrong -p or --strip option?
The text leading up to this was:
--
|diff --git a/src/test/subscription/t/003_constraints.pl 
b/src/test/subscription/t/003_constraints.pl

|index 17d4565..9543b91 100644
|--- a/src/test/subscription/t/003_constraints.pl
|+++ b/src/test/subscription/t/003_constraints.pl
--
File to patch:


Can you have a look?

thanks,

Erik Rijkers



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Logical replication existing data copy

2017-02-22 Thread Erik Rijkers

On 2017-02-22 18:13, Erik Rijkers wrote:

On 2017-02-22 14:48, Erik Rijkers wrote:

On 2017-02-22 13:03, Petr Jelinek wrote:


0001-Skip-unnecessary-snapshot-builds.patch
0002-Don-t-use-on-disk-snapshots-for-snapshot-export-in-l.patch
0003-Fix-xl_running_xacts-usage-in-snapshot-builder.patch
0001-Use-asynchronous-connect-API-in-libpqwalreceiver.patch
0002-Fix-after-trigger-execution-in-logical-replication.patch
0003-Add-RENAME-support-for-PUBLICATIONs-and-SUBSCRIPTION.patch
0001-Logical-replication-support-for-initial-data-copy-v5.patch


It works well now, or at least my particular test case seems now 
solved.


Cried victory too early, I'm afraid.



I got into a 'hung' state while repeating  pgbench_derail2.sh.

Below is some state.  I notice that master pg_stat_replication.state is 
'startup'.
Maybe I should only start the test after that state has changed. Any of the 
other possible values (catchup, streaming) would be OK, I would think.
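
Something like this could do that -- a sketch, polling the walsender state on
the master before kicking off pgbench (it assumes the walsenders already
exist; pg_stat_replication is empty before the subscription connects):

  while [ "$(echo "select count(*) from pg_stat_replication
                    where state = 'startup'" | psql -qtAXp 6972)" != "0" ]
  do
    sleep 1
  done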


$  ( dbactivity.sh ; echo "; table pg_subscription; table 
pg_subscription_rel;" ) | psql -qXp 6973
 now |  pid  | app   
  | state  | wt_evt | wt_evt_type | query_start 
| duration |  query

-+---+-+++-+-+--+--
 2017-02-23 00:37:57 | 31352 | logical replication worker 47435  
  | active | relation   | Lock| 
|  |
 2017-02-23 00:37:57 |   397 | psql  
  | active | BgWorkerShutdown   | IPC | 2017-02-23 00:22:14 
| 00:15:42 | drop subscription if exists sub1
 2017-02-23 00:37:57 | 31369 | logical replication worker 47435 sync 
47423 || LogicalSyncStateChange | IPC |  
   |  |
 2017-02-23 00:37:57 |   398 | logical replication worker 47435 sync 
47418 || transactionid  | Lock|  
   |  |

(4 rows)

 subdbid | subname | subowner | subenabled | subconninfo | subslotname | 
subpublications

-+-+--++-+-+-
   16384 | sub1|   10 | t  | port=6972   | sub1| 
{pub1}

(1 row)

 srsubid | srrelid | srsubstate |  srsublsn
-+-++
   47435 |   47423 | w  | 2/CB078260
   47435 |   47412 | r  |
   47435 |   47415 | r  |
   47435 |   47418 | c  | 2/CB06E158
(4 rows)


Replica (port 6973):

[bulldog aardvark] [local]:6973 (Thu) 00:52:47 [pid:5401] [testdb] # 
table pg_stat_subscription ;
 subid | subname |  pid  | relid | received_lsn |  
last_msg_send_time   | last_msg_receipt_time | 
latest_end_lsn |latest_end_time

---+-+---+---+--+---+---++---
 47435 | sub1| 31369 | 47423 |  | 2017-02-23 
00:20:45.758072+01 | 2017-02-23 00:20:45.758072+01 || 
2017-02-23 00:20:45.758072+01
 47435 | sub1|   398 | 47418 |  | 2017-02-23 
00:22:14.896471+01 | 2017-02-23 00:22:14.896471+01 || 
2017-02-23 00:22:14.896471+01
 47435 | sub1| 31352 |   | 2/CB06E158   |
   | 2017-02-23 00:20:47.034664+01 || 2017-02-23 
00:20:45.679245+01

(3 rows)


Master  (port 6972):

[bulldog aardvark] [local]:6972 (Thu) 00:48:27 [pid:5307] [testdb] # \x 
on \\ table pg_stat_replication ;

Expanded display is on.
-[ RECORD 1 ]+--
pid  | 399
usesysid | 10
usename  | aardvark
application_name | sub1_47435_sync_47418
client_addr  |
client_hostname  |
client_port  | -1
backend_start| 2017-02-23 00:22:14.902701+01
backend_xmin |
state| startup
sent_location|
write_location   |
flush_location   |
replay_location  |
sync_priority| 0
sync_state   | async
-[ RECORD 2 ]+--
pid  | 31371
usesysid | 10
usename  | aardvark
application_name | sub1_47435_sync_47423
client_addr  |
client_hostname  |
client_port  | -1
backend_start| 2017-02-23 00:20:45.762852+01
backend_xmin |
state| startup
sent_location|
write_location   |
flush_location   |
replay_location  |
sync_priority| 0
sync_state   | async



( above 'dbactivity.sh' is:

select
  rpad(now()::text,19) as now
, pid   as pid
, application_name  as app
, state as state
, wait_eventas wt_evt
, wait_event_type   as wt_evt_type
, date_trunc('second', query_start::timestamp)  as query_start
, substring((now() - query_start)::text, 1

Re: [HACKERS] Logical replication existing data copy

2017-02-22 Thread Erik Rijkers

On 2017-02-22 13:03, Petr Jelinek wrote:


0001-Skip-unnecessary-snapshot-builds.patch
0002-Don-t-use-on-disk-snapshots-for-snapshot-export-in-l.patch
0003-Fix-xl_running_xacts-usage-in-snapshot-builder.patch
0001-Use-asynchronous-connect-API-in-libpqwalreceiver.patch
0002-Fix-after-trigger-execution-in-logical-replication.patch
0003-Add-RENAME-support-for-PUBLICATIONs-and-SUBSCRIPTION.patch
0001-Logical-replication-support-for-initial-data-copy-v5.patch


It works well now, or at least my particular test case seems now solved.

thanks,

Erik Rijkers


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] snapbuild woes

2017-02-22 Thread Erik Rijkers

On 2017-02-22 03:05, Petr Jelinek wrote:


So to summarize attached patches:
0001 - Fixes performance issue where we build tons of snapshots that we
don't need which kills CPU.

0002 - Disables the use of ondisk historical snapshots for initial
consistent snapshot export as it may result in corrupt data. This
definitely needs backport.

0003 - Fixes bug where we might never reach snapshot on busy server due
to race condition in xl_running_xacts logging. The original use of 
extra

locking does not seem to be enough in practice. Once we have agreed fix
for this it's probably worth backpatching. There are still some 
comments

that need updating, this is more of a PoC.



I am not entirely sure what to expect.  Should a server with these 3 
patches do initial data copy or not?  The sgml seems to imply there is 
no initial data copy.  But my test does copy something.


Anyway, I have repeated the same old pgbench-test, assuming initial data 
copy should be working.


With

0001-Skip-unnecessary-snapshot-builds.patch
0002-Don-t-use-on-disk-snapshots-for-snapshot-export-in-l.patch
0003-Fix-xl_running_xacts-usage-in-snapshot-builder.patch


the consistent (but wrong) end state is always that only one of the four 
pgbench tables, pgbench_history, is replicated (always correctly).


Below is the output from the test (I've edited the lines for email)
(below, a,b,t,h stand for: pgbench_accounts, pgbench_branches, 
pgbench_tellers, pgbench_history)

(master on port 6972, replica on port 6973.)

port
6972 a,b,t,h: 10  1 10347
6973 a,b,t,h:  0  0  0347

a,b,t,h: a68efc81a  2c27f7ba5  128590a57  1e4070879   master
a,b,t,h: d41d8cd98  d41d8cd98  d41d8cd98  1e4070879   replica NOK

The md5 strings shown are the initial characters of an md5 of the whole 
content of each table (an ordered select *).
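
Per table that boils down to something like this (a sketch; the real script
loops over the four tables and keeps only the first characters):

  echo "select md5(string_agg(a::text, ',' order by aid))
          from pgbench_accounts a" | psql -qtAXp 6972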


I repeated this a few times: of course, the number of rows in 
pgbench_history varies a bit but otherwise it is always the same: 3 
empty replica tables, pgbench_history replicated correctly.


Something is not right.


thanks,


Erik Rijkers


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Logical replication existing data copy - comments snapbuild.c

2017-02-19 Thread Erik Rijkers

On 2017-02-19 23:24, Erik Rijkers wrote:

0001-Use-asynchronous-connect-API-in-libpqwalreceiver-v2.patch
0002-Always-initialize-stringinfo-buffers-in-walsender-v2.patch
0003-Fix-after-trigger-execution-in-logical-replication-v2.patch
0004-Add-RENAME-support-for-PUBLICATIONs-and-SUBSCRIPTION-v2.patch
0001-Logical-replication-support-for-initial-data-copy-v4.patch


Improve comment blocks in
  src/backend/replication/logical/snapbuild.c



[deep sigh...]  attached...--- src/backend/replication/logical/snapbuild.c.orig2	2017-02-19 17:25:57.237527107 +0100
+++ src/backend/replication/logical/snapbuild.c	2017-02-19 23:19:57.654946968 +0100
@@ -34,7 +34,7 @@
  * xid. That is we keep a list of transactions between snapshot->(xmin, xmax)
  * that we consider committed, everything else is considered aborted/in
  * progress. That also allows us not to care about subtransactions before they
- * have committed which means this modules, in contrast to HS, doesn't have to
+ * have committed which means this module, in contrast to HS, doesn't have to
  * care about suboverflowed subtransactions and similar.
  *
  * One complexity of doing this is that to e.g. handle mixed DDL/DML
@@ -82,7 +82,7 @@
  * Initially the machinery is in the START stage. When an xl_running_xacts
  * record is read that is sufficiently new (above the safe xmin horizon),
  * there's a state transition. If there were no running xacts when the
- * runnign_xacts record was generated, we'll directly go into CONSISTENT
+ * running_xacts record was generated, we'll directly go into CONSISTENT
  * state, otherwise we'll switch to the FULL_SNAPSHOT state. Having a full
  * snapshot means that all transactions that start henceforth can be decoded
  * in their entirety, but transactions that started previously can't. In
@@ -274,7 +274,7 @@
 /*
  * Allocate a new snapshot builder.
  *
- * xmin_horizon is the xid >=which we can be sure no catalog rows have been
+ * xmin_horizon is the xid >= which we can be sure no catalog rows have been
  * removed, start_lsn is the LSN >= we want to replay commits.
  */
 SnapBuild *
@@ -1642,7 +1642,7 @@
 	fsync_fname("pg_logical/snapshots", true);
 
 	/*
-	 * Now there's no way we can loose the dumped state anymore, remember this
+	 * Now there's no way we can lose the dumped state anymore, remember this
 	 * as a serialization point.
 	 */
 	builder->last_serialized_snapshot = lsn;
@@ -1858,7 +1858,7 @@
 	char		path[MAXPGPATH];
 
 	/*
-	 * We start of with a minimum of the last redo pointer. No new replication
+	 * We start off with a minimum of the last redo pointer. No new replication
 	 * slot will start before that, so that's a safe upper bound for removal.
 	 */
 	redo = GetRedoRecPtr();
@@ -1916,7 +1916,7 @@
 			/*
 			 * It's not particularly harmful, though strange, if we can't
 			 * remove the file here. Don't prevent the checkpoint from
-			 * completing, that'd be cure worse than the disease.
+			 * completing, that'd be a cure worse than the disease.
 			 */
 			if (unlink(path) < 0)
 			{

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Logical replication existing data copy - comments snapbuild.c

2017-02-19 Thread Erik Rijkers

0001-Use-asynchronous-connect-API-in-libpqwalreceiver-v2.patch
0002-Always-initialize-stringinfo-buffers-in-walsender-v2.patch
0003-Fix-after-trigger-execution-in-logical-replication-v2.patch
0004-Add-RENAME-support-for-PUBLICATIONs-and-SUBSCRIPTION-v2.patch
0001-Logical-replication-support-for-initial-data-copy-v4.patch


Improve comment blocks in
  src/backend/replication/logical/snapbuild.c



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Logical replication existing data copy - comments origin.c

2017-02-19 Thread Erik Rijkers

On 2017-02-19 17:21, Erik Rijkers wrote:

0001-Use-asynchronous-connect-API-in-libpqwalreceiver-v2.patch
0002-Always-initialize-stringinfo-buffers-in-walsender-v2.patch
0003-Fix-after-trigger-execution-in-logical-replication-v2.patch
0004-Add-RENAME-support-for-PUBLICATIONs-and-SUBSCRIPTION-v2.patch
0001-Logical-replication-support-for-initial-data-copy-v4.patch


Improve readability of comment blocks
in  src/backend/replication/logical/origin.c



now attached



thanks,

Erik Rijkers
--- src/backend/replication/logical/origin.c.orig	2017-02-19 16:45:28.558865304 +0100
+++ src/backend/replication/logical/origin.c	2017-02-19 17:11:09.034023021 +0100
@@ -11,31 +11,29 @@
  * NOTES
  *
  * This file provides the following:
- * * An infrastructure to name nodes in a replication setup
- * * A facility to efficiently store and persist replication progress in an
- *	 efficient and durable manner.
- *
- * Replication origin consist out of a descriptive, user defined, external
- * name and a short, thus space efficient, internal 2 byte one. This split
- * exists because replication origin have to be stored in WAL and shared
+ * * Infrastructure to name nodes in a replication setup
+ * * A facility to efficiently store and persist replication progress
+ *
+ * A replication origin has a descriptive, user defined, external
+ * name and a short, internal 2 byte one. This split
+ * exists because a replication origin has to be stored in WAL and shared
  * memory and long descriptors would be inefficient.  For now only use 2 bytes
  * for the internal id of a replication origin as it seems unlikely that there
- * soon will be more than 65k nodes in one replication setup; and using only
- * two bytes allow us to be more space efficient.
+ * soon will be more than 65k nodes in one replication setup.
  *
  * Replication progress is tracked in a shared memory table
- * (ReplicationStates) that's dumped to disk every checkpoint. Entries
+ * (ReplicationStates) that is dumped to disk every checkpoint. Entries
  * ('slots') in this table are identified by the internal id. That's the case
  * because it allows to increase replication progress during crash
  * recovery. To allow doing so we store the original LSN (from the originating
  * system) of a transaction in the commit record. That allows to recover the
- * precise replayed state after crash recovery; without requiring synchronous
+ * precise replayed state after crash recovery without requiring synchronous
  * commits. Allowing logical replication to use asynchronous commit is
  * generally good for performance, but especially important as it allows a
  * single threaded replay process to keep up with a source that has multiple
  * backends generating changes concurrently.  For efficiency and simplicity
- * reasons a backend can setup one replication origin that's from then used as
- * the source of changes produced by the backend, until reset again.
+ * reasons a backend can setup one replication origin that is used as
+ * the source of changes produced by the backend, until it is reset again.
  *
  * This infrastructure is intended to be used in cooperation with logical
  * decoding. When replaying from a remote system the configured origin is
@@ -45,11 +43,11 @@
  * There are several levels of locking at work:
  *
  * * To create and drop replication origins an exclusive lock on
- *	 pg_replication_slot is required for the duration. That allows us to
- *	 safely and conflict free assign new origins using a dirty snapshot.
+ *	 pg_replication_slot is required. That allows us to
+ *	 safely and conflict-free assign new origins using a dirty snapshot.
  *
- * * When creating an in-memory replication progress slot the ReplicationOirgin
- *	 LWLock has to be held exclusively; when iterating over the replication
+ * * When creating an in-memory replication progress slot the ReplicationOrigin
+ *	 LWLock has to be held exclusively. When iterating over the replication
  *	 progress a shared lock has to be held, the same when advancing the
  *	 replication progress of an individual backend that has not setup as the
  *	 session's replication origin.
@@ -57,7 +55,7 @@
  * * When manipulating or looking at the remote_lsn and local_lsn fields of a
  *	 replication progress slot that slot's lwlock has to be held. That's
  *	 primarily because we do not assume 8 byte writes (the LSN) is atomic on
- *	 all our platforms, but it also simplifies memory ordering concerns
+ *	 all our platforms, but it also simplifies memory ordering
  *	 between the remote and local lsn. We use a lwlock instead of a spinlock
  *	 so it's less harmful to hold the lock over a WAL write
  *	 (c.f. AdvanceReplicationProgress).
@@ -305,7 +303,7 @@
 		}
 	}
 
-	/* now release lock again,	*/
+	/* now release lock again. */
 	heap_close(rel, ExclusiveLock);
 
 	if (tuple == NULL)
@@ -382,7 +380,7 @@
 
 	CommandCounterIncrement();
 
-	/* now release lock again,	*/
+	/* now release lock again. */
 	heap_close

Re: [HACKERS] Logical replication existing data copy - comments origin.c

2017-02-19 Thread Erik Rijkers

0001-Use-asynchronous-connect-API-in-libpqwalreceiver-v2.patch
0002-Always-initialize-stringinfo-buffers-in-walsender-v2.patch
0003-Fix-after-trigger-execution-in-logical-replication-v2.patch
0004-Add-RENAME-support-for-PUBLICATIONs-and-SUBSCRIPTION-v2.patch
0001-Logical-replication-support-for-initial-data-copy-v4.patch


Improve readability of comment blocks
in  src/backend/replication/logical/origin.c


thanks,

Erik Rijkers




--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Logical replication existing data copy

2017-02-18 Thread Erik Rijkers

On 2017-02-11 11:16, Erik Rijkers wrote:

On 2017-02-08 23:25, Petr Jelinek wrote:


0001-Use-asynchronous-connect-API-in-libpqwalreceiver-v2.patch
0002-Always-initialize-stringinfo-buffers-in-walsender-v2.patch
0003-Fix-after-trigger-execution-in-logical-replication-v2.patch
0004-Add-RENAME-support-for-PUBLICATIONs-and-SUBSCRIPTION-v2.patch
0001-Logical-replication-support-for-initial-data-copy-v4.patch




Let me add the script ('instances.sh') that I use to start up the two 
logical replication instances for testing.


Together with the earlier posted 'pgbench_derail2.sh' it makes up the 
failing test.


pg_config of the master is:

$ pg_config
BINDIR = 
/home/aardvark/pg_stuff/pg_installations/pgsql.logical_replication/bin
DOCDIR = 
/home/aardvark/pg_stuff/pg_installations/pgsql.logical_replication/share/doc
HTMLDIR = 
/home/aardvark/pg_stuff/pg_installations/pgsql.logical_replication/share/doc
INCLUDEDIR = 
/home/aardvark/pg_stuff/pg_installations/pgsql.logical_replication/include
PKGINCLUDEDIR = 
/home/aardvark/pg_stuff/pg_installations/pgsql.logical_replication/include
INCLUDEDIR-SERVER = 
/home/aardvark/pg_stuff/pg_installations/pgsql.logical_replication/include/server
LIBDIR = 
/home/aardvark/pg_stuff/pg_installations/pgsql.logical_replication/lib
PKGLIBDIR = 
/home/aardvark/pg_stuff/pg_installations/pgsql.logical_replication/lib
LOCALEDIR = 
/home/aardvark/pg_stuff/pg_installations/pgsql.logical_replication/share/locale
MANDIR = 
/home/aardvark/pg_stuff/pg_installations/pgsql.logical_replication/share/man
SHAREDIR = 
/home/aardvark/pg_stuff/pg_installations/pgsql.logical_replication/share
SYSCONFDIR = 
/home/aardvark/pg_stuff/pg_installations/pgsql.logical_replication/etc
PGXS = 
/home/aardvark/pg_stuff/pg_installations/pgsql.logical_replication/lib/pgxs/src/makefiles/pgxs.mk
CONFIGURE = 
'--prefix=/home/aardvark/pg_stuff/pg_installations/pgsql.logical_replication' 
'--bindir=/home/aardvark/pg_stuff/pg_installations/pgsql.logical_replication/bin' 
'--libdir=/home/aardvark/pg_stuff/pg_installations/pgsql.logical_replication/lib' 
'--with-pgport=6972' '--enable-depend' '--enable-cassert' 
'--enable-debug' '--with-openssl' '--with-perl' '--with-libxml' 
'--with-libxslt' '--with-zlib' '--enable-tap-tests' 
'--with-extra-version=_logical_replication_20170218_1221_e3a58c8835a2'

CC = gcc
CPPFLAGS = -DFRONTEND -D_GNU_SOURCE -I/usr/include/libxml2
CFLAGS = -Wall -Wmissing-prototypes -Wpointer-arith 
-Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute 
-Wformat-security -fno-strict-aliasing -fwrapv 
-fexcess-precision=standard -g -O2

CFLAGS_SL = -fpic
LDFLAGS = -L../../src/common -Wl,--as-needed 
-Wl,-rpath,'/home/aardvark/pg_stuff/pg_installations/pgsql.logical_replication/lib',--enable-new-dtags

LDFLAGS_EX =
LDFLAGS_SL =
LIBS = -lpgcommon -lpgport -lxslt -lxml2 -lssl -lcrypto -lz -lreadline 
-lrt -lcrypt -ldl -lm
VERSION = PostgreSQL 
10devel_logical_replication_20170218_1221_e3a58c8835a2



I hope it helps someone to reproduce the errors I get.  (If you don't, 
I'd like to hear that too)



thanks,

Erik Rijkers
#!/bin/sh
port1=6972
port2=6973
project1=logical_replication
project2=logical_replication2
pg_stuff_dir=$HOME/pg_stuff
PATH1=$pg_stuff_dir/pg_installations/pgsql.$project1/bin:$PATH
PATH2=$pg_stuff_dir/pg_installations/pgsql.$project2/bin:$PATH
server_dir1=$pg_stuff_dir/pg_installations/pgsql.$project1
server_dir2=$pg_stuff_dir/pg_installations/pgsql.$project2
data_dir1=$server_dir1/data
data_dir2=$server_dir2/data
options1="
-c wal_level=logical
-c max_replication_slots=10
-c max_worker_processes=12
-c max_logical_replication_workers=10
-c max_wal_senders=14
-c logging_collector=on
-c log_directory=$server_dir1
-c log_filename=logfile.${project1} "

options2="
-c wal_level=replica
-c max_replication_slots=10
-c max_worker_processes=12
-c max_logical_replication_workers=10
-c max_wal_senders=14
-c logging_collector=on
-c log_directory=$server_dir2
-c log_filename=logfile.${project2} "

export PATH=$PATH1; which postgres; postgres -D $data_dir1 -p $port1 ${options1} &
export PATH=$PATH2; which postgres; postgres -D $data_dir2 -p $port2 ${options2} &


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Logical replication existing data copy

2017-02-18 Thread Erik Rijkers


Maybe add this to the 10 Open Items list?
  https://wiki.postgresql.org/wiki/PostgreSQL_10_Open_Items

It may garner a bit more attention.



Ah sorry, it's there already.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Logical replication existing data copy

2017-02-18 Thread Erik Rijkers

On 2017-02-16 00:43, Petr Jelinek wrote:

On 13/02/17 14:51, Erik Rijkers wrote:

On 2017-02-11 11:16, Erik Rijkers wrote:

On 2017-02-08 23:25, Petr Jelinek wrote:


0001-Use-asynchronous-connect-API-in-libpqwalreceiver-v2.patch
0002-Always-initialize-stringinfo-buffers-in-walsender-v2.patch
0003-Fix-after-trigger-execution-in-logical-replication-v2.patch
0004-Add-RENAME-support-for-PUBLICATIONs-and-SUBSCRIPTION-v2.patch
0001-Logical-replication-support-for-initial-data-copy-v4.patch


This often works but it also fails far too often (in my hands).  I


That being said, I am so far having problems reproducing this on my 
test

machine(s) so no idea what causes it yet.



A few extra bits:

- I have repeated this now on three different machines (debian 7, 8, 
centos6; one a pretty big server); there is always failure within a few 
tries of that test program (i.e. pgbench_derail2.sh, with the above 5 
patches).


- I have also tried to go back to an older version of logrep: running 
with 2 instances with only the first four patches (i.e., leaving out the 
support-for-existing-data patch).  With only those 4, the logical 
replication is solid. (a quick 25x repetition of a (very similar) test 
program is 100% successful). So the problem is likely somehow in that 
last 5th patch.


- A 25x repetition of a test on a master + replica 5-patch server yields 
13 ok, 12 NOK.


- Is the 'make check' FAILED test 'object_address' unrelated?  (Can you 
at least reproduce that failed test?)


Maybe add this to the 10 Open Items list?
  https://wiki.postgresql.org/wiki/PostgreSQL_10_Open_Items

It may garner a bit more attention.


thanks,

Erik Rijkers




--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Logical replication existing data copy

2017-02-16 Thread Erik Rijkers

On 2017-02-16 00:43, Petr Jelinek wrote:

On 13/02/17 14:51, Erik Rijkers wrote:

On 2017-02-11 11:16, Erik Rijkers wrote:

On 2017-02-08 23:25, Petr Jelinek wrote:


0001-Use-asynchronous-connect-API-in-libpqwalreceiver-v2.patch
0002-Always-initialize-stringinfo-buffers-in-walsender-v2.patch
0003-Fix-after-trigger-execution-in-logical-replication-v2.patch
0004-Add-RENAME-support-for-PUBLICATIONs-and-SUBSCRIPTION-v2.patch
0001-Logical-replication-support-for-initial-data-copy-v4.patch


This often works but it also fails far too often (in my hands).  I


Could you periodically dump contents of the pg_subscription_rel on
subscriber (ideally when dumping the md5 of the data) and attach that 
as

well?


I attach a bash script (and its output) that polls the 4 pgbench tables' 
md5s and the pg_subscription_rel table, each second, while I run 
pgbench_derail2.sh (for that see my earlier mail).
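
The polling loop is essentially this (a sketch; the md5 part is the same
check as in pgbench_derail2.sh and is left out here):

  while true
  do
    date +%Y%m%d_%H_%M_%S
    # ... md5s of the 4 pgbench tables on 6972 (master) and 6973 (replica) ...
    echo "select now(), srsubid, srrelid, srsubstate, srsublsn
            from pg_subscription_rel" | psql -qXp 6973
    sleep 1
  done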


pgbench_derail2.sh writes a 'header' into the same output stream (search 
for '^===' ).


The .out file reflects a session where I started pgbench_derail2.sh 
twice (it removes the publication and subscription at startup).  So 
there are 2 headers in the attached  cb_20170216_10_04_47.out. The first 
run ended in a successful replication (=all 4 pgbench tables 
md5-identical).  The second run does not end correctly: it has (one of) 
the typical faulty end-states: pgbench_accounts, the copy, has a few 
fewer rows than the master table.


Other typical end-states are:
same number of rows but content not identical (for some, typically < 20 
rows).

Mostly pgbench_accounts and pgbench_history are affected.

(I see now that I made some mistakes in generating the timestamps in the 
.out file but I suppose it doesn't matter too much)


I hope it helps; let me know if I can do any other test(s).

20170216_10_04_49_1487  6972 a,b,t,h: 10  1 10776   24be8c7be  
cf860f1f2  aed87334f  f2bfaa587   master
20170216_10_04_50_1487  6973 a,b,t,h:  6  1 10776   74cd7528c  
cf860f1f2  aed87334f  f2bfaa587   replica NOK
  now  | srsubid | srrelid | srsubstate | srsublsn 
---+-+-++--
 2017-02-16 10:04:50.242616+01 |   25398 |   25375 | r  | 
 2017-02-16 10:04:50.242616+01 |   25398 |   25378 | r  | 
 2017-02-16 10:04:50.242616+01 |   25398 |   25381 | r  | 
 2017-02-16 10:04:50.242616+01 |   25398 |   25386 | r  | 
(4 rows)

20170216_10_04_51_1487  6972 a,b,t,h: 10  1 10776   24be8c7be  
cf860f1f2  aed87334f  f2bfaa587   master
20170216_10_04_51_1487  6973 a,b,t,h:  6  1 10776   74cd7528c  
cf860f1f2  aed87334f  f2bfaa587   replica NOK
  now  | srsubid | srrelid | srsubstate | srsublsn 
---+-+-++--
 2017-02-16 10:04:51.945931+01 |   25398 |   25375 | r  | 
 2017-02-16 10:04:51.945931+01 |   25398 |   25378 | r  | 
 2017-02-16 10:04:51.945931+01 |   25398 |   25381 | r  | 
 2017-02-16 10:04:51.945931+01 |   25398 |   25386 | r  | 
(4 rows)


-- 20170216 10:04:S
-- scale  1 clients  1   INIT_WAIT  0
-- 
/home/aardvark/pg_stuff/pg_installations/pgsql.logical_replication/bin.fast/postgres
20170216_10_04_53_1487  6972 a,b,t,h: 10  1 10776   24be8c7be  
cf860f1f2  aed87334f  f2bfaa587   master
20170216_10_04_53_1487  6973 a,b,t,h:  6  1 10776   74cd7528c  
cf860f1f2  aed87334f  f2bfaa587   replica NOK
  now  | srsubid | srrelid | srsubstate | srsublsn 
---+-+-++--
 2017-02-16 10:04:53.635163+01 |   25398 |   25375 | r  | 
 2017-02-16 10:04:53.635163+01 |   25398 |   25378 | r  | 
 2017-02-16 10:04:53.635163+01 |   25398 |   25381 | r  | 
 2017-02-16 10:04:53.635163+01 |   25398 |   25386 | r  | 
(4 rows)

20170216_10_04_54_1487  6972 a,b,t,h:  0  0  0  0   24be8c7be  
d41d8cd98  d41d8cd98  d41d8cd98   master
20170216_10_04_55_1487  6973 a,b,t,h:  0  0  0  0   d41d8cd98  
d41d8cd98  d41d8cd98  d41d8cd98   replica NOK
 now | srsubid | srrelid | srsubstate | srsublsn 
-+-+-++--
(0 rows)

20170216_10_04_56_1487  6972 a,b,t,h: 10  1 10  0   68d91d95b  
6c4f8b9aa  92162c9b8  d41d8cd98   master
20170216_10_04_56_1487  6973 a,b,t,h:  0  0  0  0   d41d8cd98  
d41d8cd98  d41d8cd98  d41d8cd98   replica NOK
 now | srsubid | srrelid | srsubstate | srsublsn 
-+-+-++--
(0 rows)

20170216_10_04_57_1487  6972 a,b,t,h: 10  1 10  1   68d91d95b  
6c4f8b9aa  92162c9b8  d41d8cd98   master
20170216_10_04_58_1487  6973 a,b,t,h:  0  0  0  0   d41d8c

Re: [HACKERS] Logical replication existing data copy

2017-02-13 Thread Erik Rijkers

On 2017-02-11 11:16, Erik Rijkers wrote:

On 2017-02-08 23:25, Petr Jelinek wrote:


0001-Use-asynchronous-connect-API-in-libpqwalreceiver-v2.patch
0002-Always-initialize-stringinfo-buffers-in-walsender-v2.patch
0003-Fix-after-trigger-execution-in-logical-replication-v2.patch
0004-Add-RENAME-support-for-PUBLICATIONs-and-SUBSCRIPTION-v2.patch
0001-Logical-replication-support-for-initial-data-copy-v4.patch


This often works but it also fails far too often (in my hands).  I
test whether the tables are identical by comparing an md5 from an
ordered resultset, from both replica and master.  I estimate that 1 in
5 tries fails; 'fail' being a somewhat different table on replica
(compared to master), most often pgbench_accounts (typically there are
10-30 differing rows).  No errors or warnings in either logfile.  I'm
not sure but I think testing on faster machines seems to be doing
somewhat better ('better' being less replication error).



I have noticed that when I insert a wait state of a few seconds after the 
create subscription (or actually: the 'enable'ing of the subscription) 
the problem does not occur.  Apparently, (I assume) the initial snapshot 
occurs somewhere when the subsequent pgbench-run has already started, so 
that the logical replication also starts somewhere 'into' that 
pgbench-run. Does that make sense?


I don't know what to make of it.  Now that I think I understand what 
happens, I hesitate to call it a bug.  But I'd say it's still a 
usability problem that the subscription only becomes 'valid' after some 
time, even if it's only a few seconds.


(the other problem I mentioned (drop subscription hangs) still happens 
every now and then)



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Logical replication existing data copy - sgml fixes

2017-02-13 Thread Erik Rijkers

On 2017-02-09 02:25, Erik Rijkers wrote:

On 2017-02-08 23:25, Petr Jelinek wrote:


0001-Use-asynchronous-connect-API-in-libpqwalreceiver-v2.patch
0002-Always-initialize-stringinfo-buffers-in-walsender-v2.patch
0003-Fix-after-trigger-execution-in-logical-replication-v2.patch
0004-Add-RENAME-support-for-PUBLICATIONs-and-SUBSCRIPTION-v2.patch
0001-Logical-replication-support-for-initial-data-copy-v4.patch




fixes in create_subscription.sgml

--- doc/src/sgml/ref/create_subscription.sgml.orig2	2017-02-11 11:58:10.788502999 +0100
+++ doc/src/sgml/ref/create_subscription.sgml	2017-02-11 12:17:50.069635493 +0100
@@ -55,7 +55,7 @@
 
   
Additional info about subscriptions and logical replication as a whole
-   can is available at  and
+   is available at  and
.
   
 
@@ -122,14 +122,14 @@
 
  
   Name of the replication slot to use. The default behavior is to use
-  subscription_name for slot name.
+  subscription_name as the slot name.
  
 

 

-COPY DATA
-NOCOPY DATA
+COPY DATA
+NOCOPY DATA
 
  
   Specifies if the existing data in the publication that are being
@@ -140,11 +140,11 @@

 

-SKIP CONNECT
+SKIP CONNECT
 
  
-  Instructs the CREATE SUBSCRIPTION to skip initial
-  connection to the provider. This will change default values of other
+  Instructs CREATE SUBSCRIPTION to skip initial
+  connection to the provider. This will change the default values of other
   options to DISABLED,
   NOCREATE SLOT and NOCOPY DATA.
  
@@ -181,8 +181,8 @@
 
   
Create a subscription to a remote server that replicates tables in
-   the publications mypubclication and
-   insert_only and starts replicating immediately on
+   the publications mypublication and
+   insert_only and start replicating immediately on
commit:
 
 CREATE SUBSCRIPTION mysub
@@ -193,7 +193,7 @@
 
   
Create a subscription to a remote server that replicates tables in
-   the insert_only publication and does not start replicating
+   the insert_only publication and do not start replicating
until enabled at a later time.
 
 CREATE SUBSCRIPTION mysub

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Logical replication existing data copy

2017-02-11 Thread Erik Rijkers

On 2017-02-08 23:25, Petr Jelinek wrote:


0001-Use-asynchronous-connect-API-in-libpqwalreceiver-v2.patch
0002-Always-initialize-stringinfo-buffers-in-walsender-v2.patch
0003-Fix-after-trigger-execution-in-logical-replication-v2.patch
0004-Add-RENAME-support-for-PUBLICATIONs-and-SUBSCRIPTION-v2.patch
0001-Logical-replication-support-for-initial-data-copy-v4.patch


Apart from the one failing make check test (test 'object_address'), which 
I reported earlier, I find it is easy to 'confuse' the replication.


I attach a script intended to test the default COPY DATA.  There are two 
instances, initially without any replication.  The script initializes 
pgbench on the master, adds a serial column to pgbench_history, and 
dump-restores the 4 pgbench tables to the future replica.  It then 
empties the 4 pgbench tables on the 'replica'.  The idea is that when 
logical replication is initiated, data will be replicated from the 
master, with the end result being 4 identical tables on master and 
replica.
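
The replication setup itself boils down to something like this (a 
sketch; the connection string is specific to my local setup, and the 
COPY DATA default of this patch set is relied on):

-- on the master (port 6972):
CREATE PUBLICATION pub1 FOR TABLE pgbench_accounts, pgbench_branches,
                                  pgbench_tellers, pgbench_history;

-- on the replica (port 6973):
CREATE SUBSCRIPTION sub1
  CONNECTION 'port=6972 dbname=testdb'
  PUBLICATION pub1;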


This often works but it also fails far too often (in my hands).  I test 
whether the tables are identical by comparing an md5 from an ordered 
resultset, from both replica and master.  I estimate that 1 in 5 tries 
fails; 'fail' being a somewhat different table on the replica (compared 
to master), most often pgbench_accounts (typically there are 10-30 
differing rows).  No errors or warnings in either logfile.  I'm not 
sure, but I think testing on faster machines seems to do somewhat better 
('better' being less replication error).


Another, probably unrelated, problem occurs (but much more rarely) when 
executing 'DROP SUBSCRIPTION sub1' on the replica (see the beginning of 
the script).  Sometimes that command hangs, and the server then no 
longer accepts a normal shutdown.  I don't know how to recover from 
this -- I just have to kill the replica server (the master server still 
obeys a normal shutdown) and restart the instances.


The script accepts 2 parameters, scale and clients (used for pgbench -s 
and -c, respectively).


I don't think I've managed to successfully run the script with more than 
1 client yet.


Can you have a look whether this is reproducible elsewhere?

thanks,

Erik Rijkers






#!/bin/sh

#  assumes both instances are running, on port 6972 and 6973

logfile1=$HOME/pg_stuff/pg_installations/pgsql.logical_replication/logfile.logical_replication
logfile2=$HOME/pg_stuff/pg_installations/pgsql.logical_replication2/logfile.logical_replication2

scale=1
if [[ ! "$1" == "" ]]
then
   scale=$1
fi

clients=1
if [[ ! "$2" == "" ]]
then
   clients=$2
fi

unset PGSERVICEFILE PGSERVICE PGPORT PGDATA PGHOST
PGDATABASE=testdb

# (this script also uses a custom pgpassfile)

## just for info:
# env | grep PG
# psql -qtAXc "select current_setting('server_version')"

port1=6972
port2=6973

function cb()
{
  #  display the 4 pgbench tables' accumulated content as md5s
  #  a,b,t,h stand for:  pgbench_accounts, -branches, -tellers, -history
  md5_total[6972]='-1'
  md5_total[6973]='-2'
  for port in $port1 $port2
  do
md5_a=$(echo "select * from pgbench_accounts order by aid"|psql -qtAXp$port|md5sum|cut -b 1-9)
md5_b=$(echo "select * from pgbench_branches order by bid"|psql -qtAXp$port|md5sum|cut -b 1-9)
md5_t=$(echo "select * from pgbench_tellers  order by tid"|psql -qtAXp$port|md5sum|cut -b 1-9)
md5_h=$(echo "select * from pgbench_history  order by hid"|psql -qtAXp$port|md5sum|cut -b 1-9)
cnt_a=$(echo "select count(*) from pgbench_accounts"|psql -qtAXp $port)
cnt_b=$(echo "select count(*) from pgbench_branches"|psql -qtAXp $port)
cnt_t=$(echo "select count(*) from pgbench_tellers" |psql -qtAXp $port)
cnt_h=$(echo "select count(*) from pgbench_history" |psql -qtAXp $port)
md5_total[$port]=$( echo "${md5_a} ${md5_b} ${md5_t} ${md5_h}" | md5sum )
printf "$port a,b,t,h: %6d %6d %6d %6d" $cnt_a  $cnt_b  $cnt_t  $cnt_h
echo -n "   $md5_a  $md5_b  $md5_t  $md5_h"
if   [[ $port -eq $port1 ]]; then echo "   master"
elif [[ $port -eq $port2 ]]; then echo -n "   replica"
else  echo " ERROR  "
fi
  done
  if [[ "${md5_total[6972]}" == "${md5_total[6973]}" ]]
  then
echo " ok"
  else
echo " NOK"
  fi
}

bail=0

pub_count=$( echo "select count(*) from pg_publication" | psql -qtAXp 6972 )
if  [[ $pub_count -ne 0 ]]
then
  echo "pub_count -ne 0 - deleting pub1 & bailing out"
  echo "drop publication if exists pub1" | psql -Xp 6972
  bail=1
fi
sub_count=$( echo "select count(*) from pg_subscription" | psql -qtAXp 6973 )
if  [[ $sub_count -ne 0 ]]
then
  echo "sub_count -ne 0 - deleting sub1 & bailing out"
  echo "drop subscription if exist

Re: \if, \elseif, \else, \endif (was Re: [HACKERS] PSQL commands: \quit_if, \quit_unless)

2017-02-09 Thread Erik Rijkers

On 2017-02-09 22:15, Tom Lane wrote:

Corey Huinker <corey.huin...@gmail.com> writes:


The feature now (at patch v10) lets you break off with Ctrl-C 
anywhere.  I like it much more now.


The main thing I still dislike somewhat about the patch is the verbose 
output. To be honest I would prefer to just remove /all/ the interactive 
output.


I would vote to just make it remain silent if there is no error.   (and 
if there is an error, issue a message and exit)


thanks,

Erik Rijkers


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Logical replication existing data copy

2017-02-08 Thread Erik Rijkers

On 2017-02-08 23:25, Petr Jelinek wrote:


0001-Use-asynchronous-connect-API-in-libpqwalreceiver-v2.patch
0002-Always-initialize-stringinfo-buffers-in-walsender-v2.patch
0003-Fix-after-trigger-execution-in-logical-replication-v2.patch
0004-Add-RENAME-support-for-PUBLICATIONs-and-SUBSCRIPTION-v2.patch
0001-Logical-replication-support-for-initial-data-copy-v4.patch


test 'object_address' fails, see attachment.

That's all I found in a quick first trial.

thanks,

Erik Rijkers




*** /home/aardvark/pg_stuff/pg_sandbox/pgsql.logical_replication/src/test/regress/expected/object_address.out	2017-02-09 00:51:30.345519608 +0100
--- /home/aardvark/pg_stuff/pg_sandbox/pgsql.logical_replication/src/test/regress/results/object_address.out	2017-02-09 00:54:11.884715532 +0100
***
*** 38,43 
--- 38,45 
  	TO SQL WITH FUNCTION int4recv(internal));
  CREATE PUBLICATION addr_pub FOR TABLE addr_nsp.gentable;
  CREATE SUBSCRIPTION addr_sub CONNECTION '' PUBLICATION bar WITH (DISABLED, NOCREATE SLOT);
+ ERROR:  could not connect to the publisher: FATAL:  no pg_hba.conf entry for replication connection from host "[local]", user "aardvark", SSL off
+ 
  -- test some error cases
  SELECT pg_get_object_address('stone', '{}', '{}');
  ERROR:  unrecognized object type "stone"
***
*** 409,463 
  			pg_identify_object_as_address(classid, objid, subobjid) ioa(typ,nms,args),
  			pg_get_object_address(typ, nms, ioa.args) as addr2
  	ORDER BY addr1.classid, addr1.objid, addr1.subobjid;
!type|   schema   |   name|   identity   | ?column? 
! ---++---+--+--
!  default acl   ||   | for role regress_addr_user in schema public on tables| t
!  default acl   ||   | for role regress_addr_user on tables | t
!  type  | pg_catalog | _int4 | integer[]| t
!  type  | addr_nsp   | gencomptype   | addr_nsp.gencomptype | t
!  type  | addr_nsp   | genenum   | addr_nsp.genenum | t
!  type  | addr_nsp   | gendomain | addr_nsp.gendomain   | t
!  function  | pg_catalog |   | pg_catalog.pg_identify_object(pg_catalog.oid,pg_catalog.oid,integer) | t
!  aggregate | addr_nsp   |   | addr_nsp.genaggr(integer)| t
!  sequence  | addr_nsp   | gentable_a_seq| addr_nsp.gentable_a_seq  | t
!  table | addr_nsp   | gentable  | addr_nsp.gentable| t
!  table column  | addr_nsp   | gentable  | addr_nsp.gentable.b  | t
!  index | addr_nsp   | gentable_pkey | addr_nsp.gentable_pkey   | t
!  view  | addr_nsp   | genview   | addr_nsp.genview | t
!  materialized view | addr_nsp   | genmatview| addr_nsp.genmatview  | t
!  foreign table | addr_nsp   | genftable | addr_nsp.genftable   | t
!  foreign table column  | addr_nsp   | genftable | addr_nsp.genftable.a | t
!  role  || regress_addr_user | regress_addr_user| t
!  server|| addr_fserv| addr_fserv   | t
!  user mapping  ||   | regress_addr_user on server integer  | t
!  foreign-data wrapper  || addr_fdw  | addr_fdw | t
!  access method || btree | btree| t
!  operator of access method ||   | operator 1 (integer, integer) of pg_catalog.integer_ops USING btree  | t
!  function of access method ||   | function 2 (integer, integer) of pg_catalog.integer_ops USI

Re: [HACKERS] Cache Hash Index meta page.

2017-02-07 Thread Erik Rijkers

On 2017-02-07 18:41, Robert Haas wrote:

Committed with some changes (which I noted in the commit message).


This has caused a warning with gcc 6.2.0:

hashpage.c: In function ‘_hash_getcachedmetap’:
hashpage.c:1245:20: warning: ‘cache’ may be used uninitialized in this 
function [-Wmaybe-uninitialized]

rel->rd_amcache = cache;
^~~

which hopefully can be prevented...

thanks,

Erik Rijkers








--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: \if, \elseif, \else, \endif (was Re: [HACKERS] PSQL commands: \quit_if, \quit_unless)

2017-02-03 Thread Erik Rijkers

On 2017-02-03 08:16, Corey Huinker wrote:


0001.if_endif.v5.diff


1. Well, with this amount of interactive output it is impossible to get 
stuck without knowing it :)
This is good.  Still, it would be an improvement to be able to break out 
of an inactive \if-branch with Ctrl-C.  (I noticed that inside an active 
branch it is already possible.)

'\endif' is too long to type, /and/ you have to know it.

2. Inside an \if block, \q should be given precedence and cause a direct 
exit of psql (or at the very least exit the if block(s)), as in regular 
SQL statements (compare: 'select * from t  \q', which immediately exits 
psql -- this is good).
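
A small psql sketch of the case I mean:

\if 0
  -- in the current patch this \q is simply ignored, and only \endif
  -- gets you out; I would expect it to exit psql right away
  \q
\endif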


3. I think the 'barking' is OK because interactive use is certainly not 
the first use-case.
But nonetheless it could be made a bit more terse without losing its 
function.

The interactive behavior is now:
# \if 1
entered if: active, executing commands
# \elif 0
entered elif: inactive, ignoring commands
# \else
entered else: inactive, ignoring commands
# \endif
exited if: active, executing commands

It really is a bit too wordy, IMHO; I would say, drop all the 'entered', 
'active', and 'inactive' words.

That leaves it plenty clear what's going on.
That would make those lines:
if: executing commands
elif: ignoring commands
else: ignoring commands
exited if
   (or alternatively, just mention 'if: active' or 'elif: inactive', 
etc., which has the advantage of being shorter)


5. A real bug, I think:
#\if asdasd
unrecognized value "asdasd" for "\if ": boolean expected
# \q;
inside inactive branch, command ignored.
#

That 'unrecognized value' message is fair enough, but it is 
counterintuitive that after an erroneous opening \if expression, the 
if mode is still entered.  (And now I have to type \endif again...)


6. About the help screen:
There should be an empty line above 'Conditionals' to visually divide it 
from other help items.


The indenting of the new block is incorrect: the lines that start with
 fprintf(output, _("  \\
are indented to the correct level; the other lines are indented 1 place 
too much.


The help text has a few typos (some multiple times):
queires -> queries
exectue -> execute
subsequennt -> subsequent

Thanks,

Erik Rijkers


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] TRAP: FailedAssertion("!(hassrf)", File: "nodeProjectSet.c", Line: 180)

2017-02-02 Thread Erik Rijkers

On 2017-02-02 22:44, Tom Lane wrote:

Erik Rijkers <e...@xs4all.nl> writes:

Something is broken in HEAD:


Fixed, thanks for the report!



Indeed, the complicated version of the script runs again as before.

Thank you very much,

Erik Rijkers


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] TRAP: FailedAssertion("!(hassrf)", File: "nodeProjectSet.c", Line: 180)

2017-02-01 Thread Erik Rijkers

Something is broken in HEAD:


drop table if exists t;
create table t(c text);
insert into t (c) values ( 'abc' ) ;

select
  regexp_split_to_array(
  regexp_split_to_table(
c
  , chr(13) || chr(10)  )
  , '","' )
  as a
 ,
  regexp_split_to_table(
  c
, chr(13) || chr(10)
 )
  as rw
from t
;

TRAP: FailedAssertion("!(hassrf)", File: "nodeProjectSet.c", Line: 180)


I realise the regexp* functions aren't doing anything particularly 
useful anymore here; they did in the more complicated original (which I 
had used for years).
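
For what it's worth, the same result can be written without nesting one 
set-returning function inside another in the select list, by moving the 
row split into FROM (a sketch, not what my original query did):

select regexp_split_to_array(rw, '","') as a,
       rw
from t,
     regexp_split_to_table(t.c, chr(13) || chr(10)) as rw;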


thanks,

Erik Rijkers


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

