[HACKERS] git down
git.postgresql.org is down/unreachable ( git://git.postgresql.org/git/postgresql.git ) -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] v10 bottom-listed
In the 'ftp' listing, v10 appears at the bottom: https://www.postgresql.org/ftp/source/

With all the other v10* directories at the top, we could get a lot of people installing the wrong binaries... Maybe it can be fixed so that it appears at the top.

Thanks,
Erik Rijkers
[HACKERS] comments improvements
--- src/backend/optimizer/prep/prepunion.c.orig	2017-09-24 17:40:34.888790877 +0200
+++ src/backend/optimizer/prep/prepunion.c	2017-09-24 17:41:39.796748743 +0200
@@ -2413,7 +2413,7 @@
  * Find AppendRelInfo structures for all relations specified by relids.
  *
  * The AppendRelInfos are returned in an array, which can be pfree'd by the
- * caller. *nappinfos is set to the the number of entries in the array.
+ * caller. *nappinfos is set to the number of entries in the array.
  */
 AppendRelInfo **
 find_appinfos_by_relids(PlannerInfo *root, Relids relids, int *nappinfos)
--- src/test/regress/sql/triggers.sql.orig	2017-09-24 17:40:45.760783805 +0200
+++ src/test/regress/sql/triggers.sql	2017-09-24 17:41:33.448752854 +0200
@@ -1409,7 +1409,7 @@
 --
 -- Verify behavior of statement triggers on partition hierarchy with
 -- transition tables.  Tuples should appear to each trigger in the
--- format of the the relation the trigger is attached to.
+-- format of the relation the trigger is attached to.
 --
 -- set up a partition hierarchy with some different TupleDescriptors
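Doubled words like the ones this patch removes can be hunted mechanically. A minimal sketch, assuming GNU grep (its \b, \w and back-reference extensions); the sample file below is made up for illustration and is not part of the patch:

```shell
# Look for doubled words ("the the", "is is", ...) in source comments,
# like the ones fixed in the patch above. GNU grep assumed; the sample
# file is illustrative only.
cat > sample.c <<'EOF'
/* caller. *nappinfos is set to the the number of entries. */
/* caller. *nappinfos is set to the number of entries. */
EOF
grep -n '\b\(\w\+\) \1\b' sample.c
# reports only the first line (the one containing "the the")
```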
Re: [HACKERS] Automatic testing of patches in commit fest
On 2017-09-11 02:12, Thomas Munro wrote:
> On Mon, Sep 11, 2017 at 11:40 AM, Michael Paquier <michael.paqu...@gmail.com> wrote:
>> Thomas Munro has hacked up a prototype of application testing automatically if patches submitted apply and build: http://commitfest.cputube.org/
>
> I should add: this is a spare-time effort, a work-in-progress and building on top of a bunch of hairy web scraping, so it may take some time to perfect.

It would be great if one of the intermediary products of this effort could be made available too, namely, a list of latest patches. Or perhaps such a list should come out of the commitfest app. For me, such a list would be even more useful than any subsequently processed results.

thanks,
Erik Rijkers
Re: [HACKERS] psql: new help related to variables are not too readable
On 2017-09-08 06:09, Pavel Stehule wrote:
> Hi
>
> Now the output looks like:
>
>   AUTOCOMMIT
>     if set, successful SQL commands are automatically committed
>   COMP_KEYWORD_CASE
>     determines the case used to complete SQL key words [lower, upper, preserve-lower, preserve-upper]
>   DBNAME
>     the currently connected database name
>   [...]
>
> What do you think about using a new line between entries in this format?
>
>   AUTOCOMMIT
>     if set, successful SQL commands are automatically committed
>
>   COMP_KEYWORD_CASE
>     determines the case used to complete SQL key words [lower, upper, preserve-lower, preserve-upper]
>
>   DBNAME
>     the currently connected database name

I dislike it; it takes more screen space and leads to unnecessary scrolling. The 9.6.5 formatting is/was:

  AUTOCOMMIT         if set, successful SQL commands are automatically committed
  COMP_KEYWORD_CASE  determines the case used to complete SQL key words [lower, upper, preserve-lower, preserve-upper]
  DBNAME             the currently connected database name
  [...]
  PGPASSWORD         connection password (not recommended)
  PGPASSFILE         password file name
  PSQL_EDITOR, EDITOR, VISUAL
                     editor used by the \e, \ef, and \ev commands
  PSQL_EDITOR_LINENUMBER_ARG
                     how to specify a line number when invoking the editor
  PSQL_HISTORY       alternative location for the command history file

I would prefer to revert to that more compact 9.6 formatting.

Erik Rijkers
[HACKERS] adding the commit to a patch's thread
At the moment it's not easy to find the commit that terminates a commitfest thread about a patch. One has to manually compare dates and guess what belongs to what. The commit message nowadays often has the link to the thread ("Discussion"), but the other way around is often not so easily found.

For example: looking at https://commitfest.postgresql.org/14/1020/ one cannot directly find the actual commit that finished it.

Would it be possible to change the commitfest app a bit and make it possible to add the commit (or commit message, or hash) to the thread in the commitfest app? I would think it would be best to make it so that when the thread gets set to state 'committed', the actual commit/hash is added somewhere at the same time.

thanks,
Erik Rijkers
[HACKERS] changed column-count breaks pdf build
The feature matrix table in high-availability.sgml had a column added, so also increase the column count (patch attached).

thanks,
Erik Rijkers

--- doc/src/sgml/high-availability.sgml.orig	2017-08-17 15:04:32.535819637 +0200
+++ doc/src/sgml/high-availability.sgml	2017-08-17 15:04:46.528122345 +0200
@@ -301,7 +301,7 @@ High Availability, Load Balancing, and Replication
     Feature Matrix
-
+
      Feature
Re: [HACKERS] parallel documentation improvements
On 2017-08-01 20:43, Robert Haas wrote:
> In commit 054637d2e08cda6a096f48cc99696136a06f4ef5, I updated the parallel query documentation to reflect recently-committed parallel [...]
> Barring objections, I'd like to commit this in the next couple of days

I think that in this bit:

   occurrence is frequent, considering increasing max_worker_processes and
   max_parallel_workers so that more workers can be run simultaneously or
   alternatively reducing
-   so that the planner
+  max_parallel_workers_per_gather so that the planner
   requests fewer workers.

'considering increasing' should be 'consider increasing'
Re: [HACKERS] GSoC 2017: Foreign Key Arrays
On 2017-07-27 21:08, Mark Rofail wrote:
> On Thu, Jul 27, 2017 at 7:15 PM, Erik Rijkers <e...@xs4all.nl> wrote:
>> It would help (me at least) if you could be more explicit about what exactly each instance is.
> I apologize, I thought it was clear through the context.

Thanks a lot. It just makes things really easy for testers like me who aren't following a thread too closely and just snatch a half hour here and there to look into a feature/patch.

One small thing while building docs:

  $ cd doc/src/sgml && make html
  osx -wall -wno-unused-param -wno-empty -wfully-tagged -D . -D . -x lower postgres.sgml >postgres.xml.tmp
  osx:ref/create_table.sgml:960:100:E: document type does not allow element "VARLISTENTRY" here
  Makefile:147: recipe for target 'postgres.xml' failed
  make: *** [postgres.xml] Error 1

(Debian 8/jessie)

thanks,
Erik Rijkers
Re: [HACKERS] GSoC 2017: Foreign Key Arrays
On 2017-07-27 02:31, Mark Rofail wrote:
> I have written some benchmark test.
> [...]
> With two tables, a PK table with 5 rows and an FK table with growing row count, triggering an RI check at 10 rows, 100 rows, 1,000 rows, 10,000 rows, 100,000 rows and 1,000,000 rows. Please find the graph with the findings attached below.

It would help (me at least) if you could be more explicit about what exactly each instance is. Apparently there is an 'original patch': is this the original patch by Marco Nenciarini? Or is it something you posted earlier? I guess it could be distilled from the earlier posts, but when I looked those over yesterday evening I still didn't get it. A link to the post where the 'original patch' is would be ideal...

thanks!
Erik Rijkers
Re: [HACKERS] GSoC 2017: Foreign Key Arrays
On 2017-07-24 23:31, Mark Rofail wrote:
> On Mon, Jul 24, 2017 at 11:25 PM, Erik Rijkers <e...@xs4all.nl> wrote:
>> This patch doesn't apply to HEAD at the moment ( e2c8100e6072936 ).
> My bad, I should have mentioned that the patch is dependant on the original patch. Here is a *unified* patch that I just tested.

Thanks. Apply is now good, but I get this error when compiling:

  'ELEMENT' not present in UNRESERVED_KEYWORD section of gram.y
  make[4]: *** [gram.c] Error 1
  make[3]: *** [parser/gram.h] Error 2
  make[2]: *** [../../src/include/parser/gram.h] Error 2
  make[1]: *** [all-common-recurse] Error 2
  make: *** [all-src-recurse] Error 2
Re: [HACKERS] GSoC 2017: Foreign Key Arrays
On 2017-07-24 23:08, Mark Rofail wrote:
> Here is the new Patch with the bug fixes and the New Patch with the Index in place performance results. I just want to point this out because I still can't believe the numbers. In reference to the old patch: the new patch without the index suffers a 41.68% slow down, while the new patch with the index has a 95.18% speed up!
> [elemOperatorV4.patch]

This patch doesn't apply to HEAD at the moment ( e2c8100e6072936 ). Can you have a look?

thanks,
Erik Rijkers

  patching file doc/src/sgml/ref/create_table.sgml
  Hunk #1 succeeded at 816 with fuzz 3.
  patching file src/backend/access/gin/ginarrayproc.c
  patching file src/backend/utils/adt/arrayfuncs.c
  patching file src/backend/utils/adt/ri_triggers.c
  Hunk #1 FAILED at 2650.
  Hunk #2 FAILED at 2694.
  2 out of 2 hunks FAILED -- saving rejects to file src/backend/utils/adt/ri_triggers.c.rej
  patching file src/include/catalog/pg_amop.h
  patching file src/include/catalog/pg_operator.h
  patching file src/include/catalog/pg_proc.h
  patching file src/test/regress/expected/arrays.out
  patching file src/test/regress/expected/opr_sanity.out
  patching file src/test/regress/sql/arrays.sql
[HACKERS] PDF content lemma subdivision
The PDF version of the documentation has a content 'frame' displayed on the left-hand side (I'm viewing with okular; I assume it will be similar in most viewers). That content displays a treeview down to the main entries/lemmata, like 'CREATE TABLE'. It doesn't go any deeper any more.

There used to be a further subdivision in that left-hand subtree: Name, Synopsis, Description, Parameters, Notes, Examples, See Also (and an even further, finer subdivision). These would all be clickable links straight into the lemma itself. (Especially 'Examples' was a handy jump to have, IMHO - sometimes it saved many pages of scrolling.)

I noticed today that all these lower-level subdivisions are gone. Was that deliberate or an accident? If it's at all possible I would like to see these subdivisions reinstated, so that navigating via the content tree becomes that much easier again.

(By the way (unrelated), I also noticed only today that the new process now wraps many of the too-long lines; lines that were previously unceremoniously cut off in 'mid-sentence'. That wrapping, although not always pretty, is a really useful improvement.)

thanks,
Erik Rijkers
Re: [HACKERS] logical replication - still unstable after all these months
On 2017-06-18 00:27, Peter Eisentraut wrote:
> On 6/17/17 06:48, Erik Rijkers wrote:
>> On 2017-05-28 12:44, Erik Rijkers wrote:
>>> re: srsubstate in pg_subscription_rel: No idea what it means. At the very least this value 'w' is missing from the documentation, which only mentions:
>>>   i = initialize
>>>   d = data copy
>>>   s = synchronized
>>>   r = (normal replication)
>> Shouldn't we add this to that table (51.53) in the documentation? After all, the value 'w' does show up when you monitor pg_subscription_rel.
> It's not supposed to. Have you seen it after e3a815d2faa5be28551e71d5db44fb2c78133433?

Ah no, I haven't seen that 'w' value after that (and 1000s of tests have run without error since then). I just hadn't realized that the 'w' value I had reported was indeed an erroneous state.

Thanks, this is OK then.

Erik Rijkers
Re: [HACKERS] logical replication - still unstable after all these months
On 2017-05-28 12:44, Erik Rijkers wrote:
> re: srsubstate in pg_subscription_rel: No idea what it means. At the very least this value 'w' is missing from the documentation, which only mentions:
>   i = initialize
>   d = data copy
>   s = synchronized
>   r = (normal replication)

Shouldn't we add this to that table (51.53) in the documentation? After all, the value 'w' does show up when you monitor pg_subscription_rel.
[HACKERS] tablesync.c - comment improvements
--- src/backend/replication/logical/tablesync.c.orig	2017-06-10 10:20:07.617662465 +0200
+++ src/backend/replication/logical/tablesync.c	2017-06-10 10:45:52.620514397 +0200
@@ -12,18 +12,18 @@
  *	  logical replication.
  *
  * The initial data synchronization is done separately for each table,
- * in separate apply worker that only fetches the initial snapshot data
- * from the publisher and then synchronizes the position in stream with
+ * in a separate apply worker that only fetches the initial snapshot data
+ * from the publisher and then synchronizes the position in the stream with
  * the main apply worker.
  *
- * The are several reasons for doing the synchronization this way:
+ * There are several reasons for doing the synchronization this way:
  *  - It allows us to parallelize the initial data synchronization
  *    which lowers the time needed for it to happen.
  *  - The initial synchronization does not have to hold the xid and LSN
  *    for the time it takes to copy data of all tables, causing less
  *    bloat and lower disk consumption compared to doing the
- *    synchronization in single process for whole database.
- *  - It allows us to synchronize the tables added after the initial
+ *    synchronization in a single process for the whole database.
+ *  - It allows us to synchronize any tables added after the initial
  *    synchronization has finished.
  *
  * The stream position synchronization works in multiple steps.
@@ -37,7 +37,7 @@
  *    read the stream and apply changes (acting like an apply worker) until
  *    it catches up to the specified stream position.  Then it sets the
  *    state to SYNCDONE.  There might be zero changes applied between
- *    CATCHUP and SYNCDONE, because the sync worker might be ahead of the
+ *    CATCHUP and SYNCDONE because the sync worker might be ahead of the
  *    apply worker.
  *  - Once the state was set to SYNCDONE, the apply will continue tracking
  *    the table until it reaches the SYNCDONE stream position, at which
@@ -147,7 +147,7 @@
 }

 /*
- * Wait until the relation synchronization state is set in catalog to the
+ * Wait until the relation synchronization state is set in the catalog to the
  * expected one.
  *
  * Used when transitioning from CATCHUP state to SYNCDONE.
@@ -206,12 +206,12 @@
 }

 /*
- * Wait until the the apply worker changes the state of our synchronization
+ * Wait until the apply worker changes the state of our synchronization
  * worker to the expected one.
  *
  * Used when transitioning from SYNCWAIT state to CATCHUP.
  *
- * Returns false if the apply worker has disappeared or table state has been
+ * Returns false if the apply worker has disappeared or the table state has been
  * reset.
  */
 static bool
@@ -225,7 +225,7 @@
 		CHECK_FOR_INTERRUPTS();

-		/* Bail if he apply has died. */
+		/* Bail if the apply has died. */
 		LWLockAcquire(LogicalRepWorkerLock, LW_SHARED);
 		worker = logicalrep_worker_find(MyLogicalRepWorker->subid,
 										InvalidOid, false);
@@ -333,7 +333,7 @@
 	Assert(!IsTransactionState());

-	/* We need up to date sync state info for subscription tables here. */
+	/* We need up-to-date sync state info for subscription tables here. */
 	if (!table_states_valid)
 	{
 		MemoryContext oldctx;
@@ -365,7 +365,7 @@
 	}

 	/*
-	 * Prepare hash table for tracking last start times of workers, to avoid
+	 * Prepare a hash table for tracking last start times of workers, to avoid
 	 * immediate restarts.  We don't need it if there are no tables that need
 	 * syncing.
 	 */
@@ -401,7 +401,7 @@
 		{
 			/*
 			 * Apply has caught up to the position where the table sync has
-			 * finished.  Time to mark the table as ready so that apply will
+			 * finished.  Mark the table as ready so that apply will
 			 * just continue to replicate it normally.
 			 */
 			if (current_lsn >= rstate->lsn)
@@ -436,7 +436,7 @@
 			else

 				/*
-				 * If no sync worker for this table yet, count running sync
+				 * If there is no sync worker for this table yet, count running sync
 				 * workers for this subscription, while we have the lock, for
 				 * later.
 				 */
@@ -477,7 +477,7 @@
 			/*
 			 * If there is no sync worker registered for the table and there
-			 * is some free sync worker slot, start new sync worker for the
+			 * is some free sync worker slot, start a new sync worker for the
 			 * table.
 			 */
 			else if (!syncworker && nsyncworkers < max_sync_workers_per_subscription)
@@ -551,7 +551,7 @@
 	int			bytesread = 0;
 	int			avail;

-	/* If there are some leftover data from previous read, use them. */
+	/* If there are some leftover data from previous read, use it. */
 	avail = copybuf->len - copybuf->cursor;
 	if (avail)
 	{
@@ -694,7 +694,7 @@
 				 (errmsg("could not fetch table info for table \"%s.%s\": %s",
 						 nspname, relname, res->err)));

-	/* We don't know number of rows coming, so allocate enough space. */
+	/*
Re: [HACKERS] logical replication - possible remaining problem
On 2017-06-07 23:18, Alvaro Herrera wrote:
> Erik Rijkers wrote:
>> Now, looking at the script again I am thinking that it would be reasonable to expect that after issuing
>>   delete from pg_subscription;
>> the other 2 tables are /also/ cleaned, automatically, as a consequence. (Is this reasonable? this is really the main question of this email).
> I don't think it's reasonable to expect that the system recovers automatically from what amounts to catalog corruption. You should be using the DDL that removes subscriptions instead.

You're right, that makes sense. Thanks.
Re: [HACKERS] Race conditions with WAL sender PID lookups
On 2017-06-07 20:31, Robert Haas wrote:
> [...]
> [ Side note: Erik's report on this thread initially seemed to suggest that we needed this patch to make logical decoding stable. But my impression is that this is belied by subsequent developments on other threads, so my theory is that this patch was never really related to the problem, but rather that by the time Erik got around to testing this patch, other fixes had made the problems relatively rare, and the apparently-improved results with this patch were just chance. If that theory is wrong, it would be good to hear about it. ]

Yes, agreed; I was probably mistaken.
[HACKERS] logical replication - possible remaining problem
I am not sure whether what I found here amounts to a bug; I might be doing something dumb.

During the last few months I did tests by running pgbench over logical replication. Earlier emails have details. The basic form of that now works well (and the fix has been committed), but as I looked over my testing program I noticed one change I made to it, already many weeks ago: in the cleanup during startup (pre-flight check, you might say) and also before the end, instead of

  echo "delete from pg_subscription;" | psql -qXp $port2    -- (1)

I changed that (as I say, many weeks ago) to:

  echo "delete from pg_subscription;
        delete from pg_subscription_rel;
        delete from pg_replication_origin;
  " | psql -qXp $port2    -- (2)

This occurs (2x) inside the bash function clean_pubsub(), in the main test script pgbench_derail2.sh. This change was an effort to ensure arriving at a 'clean' start (and end) state which would always be the same. All my more recent testing (and that of Mark, I have to assume) was thus done with (2).

Now, looking at the script again I am thinking that it would be reasonable to expect that after issuing

  delete from pg_subscription;

the other 2 tables are /also/ cleaned, automatically, as a consequence. (Is this reasonable? This is really the main question of this email.)

So I removed the latter two delete statements again, and ran the tests again with the form in (1). I have established that (after a number of successful cycles) the test stops succeeding, with in the replica log repetitions of:

  2017-06-07 22:10:29.057 CEST [2421] LOG:  logical replication apply worker for subscription "sub1" has started
  2017-06-07 22:10:29.057 CEST [2421] ERROR:  could not find free replication state slot for replication origin with OID 11
  2017-06-07 22:10:29.057 CEST [2421] HINT:  Increase max_replication_slots and try again.
  2017-06-07 22:10:29.058 CEST [2061] LOG:  worker process: logical replication worker for subscription 29235 (PID 2421) exited with exit code 1

When I manually 'clean up' by doing:

  delete from pg_replication_origin;

then, and only then, does the session finish and succeed ('replica ok').

So to me it looks as if there is an omission of pg_replication_origin cleanup when pg_subscription is deleted. Does that make sense?

All this is probably vague and I am only posting in the hope that Petr (or someone else) perhaps immediately understands what goes wrong, even with this limited amount of info. In the meantime I will try to dig up more detailed info...

thanks,
Erik Rijkers
Re: [HACKERS] logical replication - still unstable after all these months
On 2017-06-06 20:53, Peter Eisentraut wrote:
> On 6/4/17 22:38, Petr Jelinek wrote: [...]
> Committed that, with some further updates of comments to reflect the [...]

Belated apologies all round for the somewhat provocative $subject; but I felt at that moment that this item needed some extra attention. I don't know if it worked, but I'm glad that it is solved ;)

Thanks,
Erik Rijkers
Re: [HACKERS] logical replication - still unstable after all these months
On 2017-05-31 16:20, Erik Rijkers wrote:
> On 2017-05-31 11:16, Petr Jelinek wrote:
>> [...] Thanks to Mark's offer I was able to study the issue as it happened and found the cause of this.
>> [0001-Improve-handover-logic-between-sync-and-apply-worker.patch]
>
> This looks good:
>
> -- out_20170531_1141.txt
>    100 -- pgbench -c 90 -j 8 -T 60 -P 12 -n   --  scale 25
>    100 -- All is well.
>
> So this is 100x a 1-minute test with 100x success. (This on the most fastidious machine (slow disks, meagre specs) that used to give 15% failures.)

[Improve-handover-logic-between-sync-and-apply-worker-v2.patch]

No errors after (several days of) running variants of this. (2500x 1-minute runs; 12x 1-hour runs)

Thanks!
Erik Rijkers
Re: [HACKERS] logical replication - still unstable after all these months
On 2017-06-02 00:46, Mark Kirkwood wrote:
> On 31/05/17 21:16, Petr Jelinek wrote:
>> [...]
> I'm seeing a new failure with the patch applied - this time the history table has missing rows. Petr, I'll put back your access :-)

Is this error during 1-minute runs? I'm asking because I've moved back to longer (1-hour) runs (no errors so far), and I'd like to keep track of what the most 'vulnerable' parameters are.

thanks,
Erik Rijkers
Re: [HACKERS] logical replication - still unstable after all these months
On 2017-05-31 11:16, Petr Jelinek wrote:
> [...] Thanks to Mark's offer I was able to study the issue as it happened and found the cause of this.
> [0001-Improve-handover-logic-between-sync-and-apply-worker.patch]

This looks good:

-- out_20170531_1141.txt
   100 -- pgbench -c 90 -j 8 -T 60 -P 12 -n   --  scale 25
   100 -- All is well.

So this is 100x a 1-minute test with 100x success. (This on the most fastidious machine (slow disks, meagre specs) that used to give 15% failures.)

I'll let it run for a couple of days with varying params (and on varying hardware), but it definitely does look as if you fixed it.

Thanks!
Erik Rijkers
Re: [HACKERS] logical replication - still unstable after all these months
On 2017-05-26 08:10, Erik Rijkers wrote:
> If you run a pgbench session of 1 minute over a logical replication connection and repeat that 100x, this is what you get:
>
> At clients 90, 64, 8, scale 25:
> -- out_20170525_0944.txt
>    100 -- pgbench -c 90 -j 8 -T 60 -P 12 -n   --  scale 25
>      7 -- Not good.
> -- out_20170525_1426.txt
>    100 -- pgbench -c 64 -j 8 -T 60 -P 12 -n   --  scale 25
>     18 -- Not good.
> -- out_20170525_2049.txt
>    100 -- pgbench -c 8 -j 8 -T 60 -P 12 -n   --  scale 25
>     10 -- Not good.
>
> At clients 90, 64, 8, scale 5:
> -- out_20170526_0126.txt
>    100 -- pgbench -c 90 -j 8 -T 60 -P 12 -n   --  scale 5
>      2 -- Not good.
> -- out_20170526_0352.txt
>    100 -- pgbench -c 64 -j 8 -T 60 -P 12 -n   --  scale 5
>      3 -- Not good.
> -- out_20170526_0621.txt
>    100 -- pgbench -c 8 -j 8 -T 60 -P 12 -n   --  scale 5
>      4 -- Not good.

It seems this problem is a bit less serious than it looked to me (as others find lower numbers of failures). Still, how is its seriousness graded by now? Is it a show-stopper? Should it go onto the Open Items page? Is anyone still looking into it?

thanks,
Erik Rijkers

The above installations (master+replica) are with Petr Jelinek's (and Michael Paquier's) last patches:

  0001-Fix-signal-handling-in-logical-workers.patch
  0002-Make-tablesync-worker-exit-when-apply-dies-while-it-.patch
  0003-Receive-invalidation-messages-correctly-in-tablesync.patch
  Remove-the-SKIP-REFRESH-syntax-suggar-in-ALTER-SUBSC-v2.patch
Re: [HACKERS] logical replication - still unstable after all these months
On 2017-05-29 03:33, Jeff Janes wrote:
> On Sun, May 28, 2017 at 3:17 PM, Mark Kirkwood <mark.kirkw...@catalyst.net.nz> wrote:
>> [...]
> I also got a failure, after 87 iterations of a similar test case. It [...] repeated the runs, but so far it hasn't failed again in over 800 iterations

Could you give the params for the successful runs? (Ideally, a grep | sort | uniq -c of the ran pgbench lines.)

Can you say anything about hardware?

Thanks for repeating my lengthy tests.

Erik Rijkers
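The tally Erik asks for can be produced with exactly the pipeline he names. A minimal sketch; the log file name and line format below are made-up stand-ins, not the real harness output:

```shell
# Count how many times each distinct pgbench invocation appears in a
# test log. 'testrun.log' and its line format are illustrative only.
cat > testrun.log <<'EOF'
-- pgbench -c 90 -j 8 -T 60 -P 12 -n -- scale 25
-- pgbench -c 64 -j 8 -T 60 -P 12 -n -- scale 25
-- pgbench -c 90 -j 8 -T 60 -P 12 -n -- scale 25
EOF
grep -- 'pgbench -c' testrun.log | sort | uniq -c
# one line per distinct invocation, prefixed with its count:
# the 64-client line once, the 90-client line twice
```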
Re: [HACKERS] logical replication - still unstable after all these months
On 2017-05-29 00:17, Mark Kirkwood wrote:
> On 28/05/17 19:01, Mark Kirkwood wrote:
>> So running in cloud land now... so far no errors - will update.
> The framework ran 600 tests last night, and I see 3 'NOK' results, i.e. 3 failed test runs (all scale 25 and 8 pgbench clients). Given the way [...]

Could you also give the params for the successful runs? Can you say anything about hardware? (My experience is that older, slower, 'worse' hardware makes for more fails.)

Many thanks, by the way. I'm glad that it turns out I'm probably not doing something uniquely stupid (although I'm not glad that there seems to be a bug, and an elusive one at that).

Erik Rijkers
Re: [HACKERS] logical replication - still unstable after all these months
On 2017-05-26 15:59, Petr Jelinek wrote:
> Hmm, I was under the impression that the changes we proposed in the snapbuild thread fixed your issues, does this mean they didn't? Or the modified versions of those that were eventually committed didn't? Or did issues reappear at some point?

Here is a bit of info: just now (using Mark Kirkwood's version of my test) I had a session logging this:

  unknown relation state "w"

which I had never seen before. This is column srsubstate in pg_subscription_rel. That session completed successfully ('replica ok'), so it's not necessarily a problem.

Grepping through my earlier logs (of weeks of intermittent test runs), I found only one more (timestamp 20170525_0125). Here it occurred in a failed session.

No idea what it means. At the very least this value 'w' is missing from the documentation, which only mentions:

  i = initialize
  d = data copy
  s = synchronized
  r = (normal replication)

Erik Rijkers
Re: [HACKERS] logical replication - still unstable after all these months
On 2017-05-28 01:15, Mark Kirkwood wrote:
> Also, any idea which rows are different? If you want something out of the box that will do that for you, see DBIx::Compare.

I used to save the content-diffs too, but in the end decided they were useless (to me, anyway).
Re: [HACKERS] logical replication - still unstable after all these months
On 2017-05-28 01:21, Mark Kirkwood wrote:
> Sorry - I see you have done this already.
>
> On 28/05/17 11:15, Mark Kirkwood wrote:
>> Interesting - might be good to see your test script too (so we can better understand how you are deciding if the runs are successful or not).

Yes, in pgbench_derail2.sh in the cb function it says:

  if [[ "${md5_total[$port1]}" == "${md5_total[$port2]}" ]]
  then
     echo " ok"
  else
     echo " NOK"
  fi

This is the final decision about success ('ok') or failure ('NOK'). (NOK stands for 'Not OK')

The two compared md5's (on the two ports: primary and replica) are each taken over a concatenation of the 4 separate md5's of the table contents (taken earlier in cb()). If one or more of the 4 md5's differs, then that concatenation-md5 will differ too.

Sorry, there is not a lot of comment
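The decision logic described here can be sketched standalone, without a database. The helper name and the "table contents" strings below are made up for illustration; the real script takes the md5's of pgbench's four tables on each port:

```shell
# Sketch of the ok/NOK decision: per-"table" md5s are concatenated and
# the md5 of that concatenation is compared between primary and replica.
# total_md5 and the table-content strings are hypothetical stand-ins.
total_md5() {
    concat=""
    for t in "$@"; do
        concat="$concat$(printf '%s' "$t" | md5sum | cut -d' ' -f1)"
    done
    printf '%s' "$concat" | md5sum | cut -d' ' -f1
}

primary=$(total_md5 "accounts" "branches" "tellers" "history")
replica=$(total_md5 "accounts" "branches" "tellers" "history")

# any single differing table-md5 changes the concatenation-md5 too
if [ "$primary" = "$replica" ]; then echo " ok"; else echo " NOK"; fi
```

The point of hashing the concatenation is that one comparison covers all four tables: a mismatch in any per-table md5 necessarily changes the total md5.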
Re: [HACKERS] logical replication - still unstable after all these months
On 2017-05-27 17:11, Andres Freund wrote: On May 27, 2017 6:13:19 AM EDT, Simon Riggs <si...@2ndquadrant.com> wrote: On 27 May 2017 at 09:44, Erik Rijkers <e...@xs4all.nl> wrote: I am very curious about your results. We take your bug report on good faith, but we still haven't seen details of the problem or how to recreate it. Please post some details. Thanks. ?

ok, ok... (The thing is, I am trying to pre-digest the output, but it takes time.) I can do this now: attached is some output that belongs with this group of 100 1-minute runs:

-- out_20170525_1426.txt
100 -- pgbench -c 64 -j 8 -T 60 -P 12 -n -- scale 25
 82 -- All is well.
 18 -- Not good.

That is the worst set of runs of what I showed earlier; that is: out_20170525_1426.txt and 2x18 logfiles that the 18 failed runs produced. Those logfiles have names like:

logrep.20170525_1426.1436.1.scale_25.clients_64.NOK.log
logrep.20170525_1426.1436.2.scale_25.clients_64.NOK.log

.1. = primary
.2. = replica

Please disregard the errors around pg_current_wal_location() (caused by some code to dump some wal into zipfiles, which obviously stopped working after the function was removed/renamed). There are also some unimportant errors from the test harness where I call with the wrong port. Not interesting, I don't think.

sent_20170527_1745.tar.bz2
Description: BZip2 compressed data
Re: [HACKERS] logical replication - still unstable after all these months
On 2017-05-27 10:30, Erik Rijkers wrote: On 2017-05-27 01:35, Mark Kirkwood wrote: Here is what I have: instances.sh: testset.sh pgbench_derail2.sh pubsub.sh

To be clear (apart from that standalone call like ./pgbench_derail2.sh $scale $clients $duration $date_str): I normally run by editing the parameters in testset.sh, then run:

./testset.sh

That then shows a tail -F of the output logfile (to paste into another screen); in yet another screen, the 'watch -n20 results.sh' line. The output files are the .txt files. The logfiles of the instances are (at the end of each test) copied to directory logfiles/ under a meaningful name that shows the parameters, and with an extension like '.ok.log' or '.NOK.log'.

I am very curious about your results.
Re: [HACKERS] logical replication - still unstable after all these months
On 2017-05-27 01:35, Mark Kirkwood wrote: On 26/05/17 20:09, Erik Rijkers wrote: The idea is simple enough:

startup instance1
startup instance2 (on same machine)
primary: init pgbench tables
primary: add primary key to pgbench_history
copy empty tables to replica by dump/restore
primary: start publication
replica: start subscription
primary: run 1-minute pgbench
wait till the 4 md5's of primary pgbench tables are the same as
  the 4 md5's of replica pgbench tables (this will need a time-out)
log 'ok' or 'not ok'
primary: clean up publication
replica: clean up subscription
shutdown primary
shutdown replica

this whole thing 100x

Here is what I have:

instances.sh:       starts up 2 assert enabled sessions
instances_fast.sh:  alternative to instances.sh; starts up 2 assert disabled 'fast' sessions
testset.sh:         loop to call pgbench_derail2.sh with varying params
pgbench_derail2.sh: main test program; can be called 'standalone':
                      ./pgbench_derail2.sh $scale $clients $duration $date_str
                    so for instance this should work:
                      ./pgbench_derail2.sh 25 64 60 20170527_1019
                    to remove publication and subscription from sessions,
                    add a 5th parameter 'clean':
                      ./pgbench_derail2.sh 1 1 1 1 'clean'
pubsub.sh:          displays replication state; also called by pgbench_derail2.sh;
                    must be in path
result.sh:          display results; I keep this in a screen-session as:
                      watch -n 20 './result.sh 201705'

Peculiar to my setup also: server version at compile time stamped with date + commit hash (I misuse information_schema.sql_packages at compile time to store patch information). Instances are in $pg_stuff_dir/pg_installations/pgsql. So you'll have to outcomment a line here and there, and adapt paths, ports, and things like that. It's a bit messy, I should have used perl from the beginning...
Good luck :)

Erik Rijkers


instances.sh:
---
#!/bin/sh
# assertions on  in $pg_stuff_dir/pg_installations/pgsql./bin
# assertions off in $pg_stuff_dir/pg_installations/pgsql./bin.fast
port1=6972
project1=logical_replication
port2=6973
project2=logical_replication2
pg_stuff_dir=$HOME/pg_stuff
PATH1=$pg_stuff_dir/pg_installations/pgsql.$project1/bin:$PATH
PATH2=$pg_stuff_dir/pg_installations/pgsql.$project2/bin:$PATH
server_dir1=$pg_stuff_dir/pg_installations/pgsql.$project1
server_dir2=$pg_stuff_dir/pg_installations/pgsql.$project2
data_dir1=$server_dir1/data
data_dir2=$server_dir2/data
options1=" -c wal_level=logical -c max_replication_slots=10
  -c max_worker_processes=12 -c max_logical_replication_workers=10
  -c max_wal_senders=10 -c logging_collector=on
  -c log_directory=$server_dir1 -c log_filename=logfile.${project1}
  -c log_replication_commands=on "
# -c wal_sender_timeout=18
# -c client_min_messages=DEBUG1 "
# -c log_connections=on
# -c max_sync_workers_per_subscription=6
options2=" -c wal_level=replica -c max_replication_slots=10
  -c max_worker_processes=12 -c max_logical_replication_workers=10
  -c max_wal_senders=10 -c logging_collector=on
  -c log_directory=$server_dir2 -c log_filename=logfile.${project2}
  -c log_replication_commands=on "
# -c wal_sender_timeout=18
# -c client_min_messages=DEBUG1 "
# -c log_connections=on
# -c max_sync_workers_per_subscription=6
export PATH=$PATH1; export PG=$( which postgres ); $PG -D $data_dir1 -p $port1 ${options1} &
sleep 1
export PATH=$PATH2; export PG=$( which postgres ); $PG -D $data_dir2 -p $port2 ${options2} &
sleep 1
---

instances_fast.sh:
---
#!/bin/sh
# assertions on  in $pg_stuff_dir/pg_installations/pgsql./bin
# assertions off in $pg_stuff_dir/pg_installations/pgsql./bin.fast
port1=6972
project1=logical_replication
port2=6973
project2=logical_replication2
pg_stuff_dir=$HOME/pg_stuff
PATH1=$pg_stuff_dir/pg_installations/pgsql.$project1/bin.fast:$PATH
PATH2=$pg_stuff_dir/pg_installations/pgsql.$project2/bin.fast:$PATH
server_dir1=$pg_stuff_dir/pg_installations/pgsql.$project1
server_dir2=$pg_stuff_dir/pg_installations/pgsql.$project2
data_dir1=$server_dir1/data
data_dir2=$server_dir2/data
options1=" -c wal_level=logical -c max_replication_slots=10
  -c max_worker_processes=12 -c max_logical_replication_workers=10
  -c max_wal_senders=14 -c wal_sender_timeout=18
  -c logging_collector=on
  -c log_directory=$server_dir1 -c log_filename=logfile.${project1}
  -c log_replication_commands=on "
options2=" -c wal_level=replica -c max_replication_slots=10
  -c max_worker_processes=12 -c max_logical_replication_workers=10
  -c max_wal_senders=14 -c wal_sender_timeout=18
  -c logging_collector=on
  -c log_directory=$server_dir2 -c log_filename=logfile.${project2}
  -c log_replication_commands=on "
export PATH=$PATH1; PG=$(which postgres); $PG -D $data_dir1 -p $port1 ${options1} &
export PATH=$PATH2; PG=$(which postgres); $PG -D $data_dir2 -p $port2 ${options2} &
---

(a third script follows, truncated here)
---
#!/bin/bash
pg_stuff_dir=$HOME/pg_stuff
port1=6972
project1=logical_replication
port2=6973
project2=logical_replication2
db=testdb
rc=0
duration=60
while [[ $rc -eq 0 ]]
Re: [HACKERS] logical replication - still unstable after all these months
On 2017-05-27 01:35, Mark Kirkwood wrote: On 26/05/17 20:09, Erik Rijkers wrote: this whole thing 100x

Some questions that might help me get it right:
- do you think we need to stop and start the instances every time?
- do we need to init pgbench each time?
- could we just drop the subscription and publication and truncate the replica tables instead?

I have done all that in earlier versions. I deliberately added these 'complications' in view of the intractability of the problem: my fear is that an earlier failure leaves some half-failed state behind in an instance, which then might cause more failure. This would undermine the intent of the whole exercise (which is to count the success/failure rate). So it is important to be as sure as possible that each cycle starts out as cleanly as possible.

- what scale pgbench are you running?

I use a small script to call the main script; at the moment it does something like:

---
duration=60
from=1
to=100
for scale in 25 5
do
  for clients in 90 64 8
  do
    date_str=$(date +"%Y%m%d_%H%M")
    outfile=out_${date_str}.txt
    time for x in `seq $from $to`
    do
      ./pgbench_derail2.sh $scale $clients $duration $date_str
      [...]
---

- how many clients for the 1 min pgbench run?

see above

- are you starting the pgbench run while the copy_data jobs for the subscription are still running?

I assume with copy_data you mean the data sync of the original table before pgbench starts. And yes, I think here might be the origin of the problem. (I think the problem I get is actually easily avoided by putting wait states here and there in between separate steps. But the testing idea here is to force the system into error, not to avoid any errors.)

- how exactly are you calculating those md5's?

Here is the bash function: cb (I forget what that stands for, I guess 'content bench').
$outf is a log file to which the program writes output:

---
function cb()
{
  # display the 4 pgbench tables' accumulated content as md5s
  # a,b,t,h stand for: pgbench_accounts, -branches, -tellers, -history
  num_tables=$( echo "select count(*) from pg_class
                      where relkind = 'r'
                        and relname ~ '^pgbench_'" | psql -qtAX )
  if [[ $num_tables -ne 4 ]]
  then
    echo "pgbench tables not 4 - exit" >> $outf
    exit
  fi
  for port in $port1 $port2
  do
    md5_a=$(echo "select * from pgbench_accounts order by aid" | psql -qtAXp $port | md5sum | cut -b 1-9)
    md5_b=$(echo "select * from pgbench_branches order by bid" | psql -qtAXp $port | md5sum | cut -b 1-9)
    md5_t=$(echo "select * from pgbench_tellers  order by tid" | psql -qtAXp $port | md5sum | cut -b 1-9)
    md5_h=$(echo "select * from pgbench_history  order by hid" | psql -qtAXp $port | md5sum | cut -b 1-9)
    cnt_a=$(echo "select count(*) from pgbench_accounts" | psql -qtAXp $port)
    cnt_b=$(echo "select count(*) from pgbench_branches" | psql -qtAXp $port)
    cnt_t=$(echo "select count(*) from pgbench_tellers"  | psql -qtAXp $port)
    cnt_h=$(echo "select count(*) from pgbench_history"  | psql -qtAXp $port)
    md5_total[$port]=$( echo "${md5_a} ${md5_b} ${md5_t} ${md5_h}" | md5sum )
    printf "$port a,b,t,h: %8d %6d %6d %6d" $cnt_a $cnt_b $cnt_t $cnt_h
    echo -n " $md5_a $md5_b $md5_t $md5_h"
    if   [[ $port -eq $port1 ]]; then echo    " master"
    elif [[ $port -eq $port2 ]]; then echo -n " replica"
    else                              echo    " ERROR "
    fi
  done
  if [[ "${md5_total[$port1]}" == "${md5_total[$port2]}" ]]
  then
    echo " ok"
  else
    echo " NOK"
  fi
}
---

this enables:

  echo "-- getting md5 (cb)"
  cb_text1=$(cb)

and testing that string like:

  if echo "$cb_text1" | grep -qw 'replica ok'; then
    echo "-- All is well."
    [...]

Later today I'll try to clean up the whole thing and post it.
Re: [HACKERS] logical replication - still unstable after all these months
On 2017-05-26 15:59, Petr Jelinek wrote: Hi, Hmm, I was under the impression that the changes we proposed in the snapbuild thread fixed your issues, does this mean they didn't? Or the modified versions of those that were eventually committed didn't? Or did issues reappear at some point?

I do think the snapbuild fixes solved certain problems. I can't say what causes the present problems (as I have said, I suspect logical replication, but also my own test harness: perhaps it leaves some error state lying around, although I do try hard to prevent that) -- so I just don't know. I wouldn't say that problems (re)appeared at a certain point; my impression is rather that logical replication has become better and better. But I kept getting the odd failure, without a clear cause, but always (eventually) repeatable on other machines.

I did the 1-minute pgbench-derail version exactly because of the earlier problems with snapbuild: I wanted a test that does a lot of starting and stopping of publication and subscription.

Erik Rijkers
Re: [HACKERS] logical replication - still unstable after all these months
On 2017-05-26 10:29, Mark Kirkwood wrote: On 26/05/17 20:09, Erik Rijkers wrote: On 2017-05-26 09:40, Simon Riggs wrote: If we can find out what the bug is with a repeatable test case we can fix it. Could you provide more details? Thanks

I will, just need some time to clean things up a bit. But what I would like is for someone else to repeat my 100x1-minute tests, taking as core that snippet I posted in my previous email. I built bash-stuff around that core (to take md5's, shut down/start up the two instances between runs, write info to log files, etc). But it would be good if someone else made that separately, because if that then does not fail, it would prove that my test harness is at fault (and not logical replication).

Will do - what I had been doing was running pgbench, waiting until the row counts on the replica pgbench_history were the same as on the primary, then summing the %balance and delta fields from the primary and replica dbs and comparing. So far - all match up ok.

Great! You'll have to think about whether to go with instances of either master, or master+those 4 patches. I guess either choice makes sense. I did number-summing for a while as well (because it's a lot faster than taking md5's over the full content). But the problem with summing is that (I think) in the end you cannot be really sure that the result is correct (false positives, although I don't understand the odds).

However I'd been running longer time frames (5 minutes), so not the same number of repetitions as yet.

I've run 3600-, 30- and 15-minute runs too, but in this case (these 100x tests) I wanted to especially test the area around startup/initialisation of logical replication. Also the increasing quality of logical replication (once it runs with the correct

thanks, Erik Rijkers
Re: [HACKERS] logical replication - still unstable after all these months
On 2017-05-26 09:40, Simon Riggs wrote: If we can find out what the bug is with a repeatable test case we can fix it. Could you provide more details? Thanks

I will, just need some time to clean things up a bit. But what I would like is for someone else to repeat my 100x1-minute tests, taking as core that snippet I posted in my previous email. I built bash-stuff around that core (to take md5's, shut down/start up the two instances between runs, write info to log-files, etc). But it would be good if someone else made that separately because if that then does not fail, it would prove that my test-harness is at fault (and not logical replication).

The idea is simple enough:

startup instance1
startup instance2 (on same machine)
primary: init pgbench tables
primary: add primary key to pgbench_history
copy empty tables to replica by dump/restore
primary: start publication
replica: start subscription
primary: run 1-minute pgbench
wait till the 4 md5's of primary pgbench tables are the same as
  the 4 md5's of replica pgbench tables (this will need a time-out)
log 'ok' or 'not ok'
primary: clean up publication
replica: clean up subscription
shutdown primary
shutdown replica

this whole thing 100x
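For readers wanting the shape of that cycle without setting up two live instances first, here is a dry-run sketch: each step only prints the command it would run. The concrete ports, scale, and command details are assumptions filled in from elsewhere in this thread, not the real harness.

```shell
#!/bin/sh
# Dry-run outline of one test cycle; every step is echoed, not executed.
port1=6972   # primary (assumed, matching the instance scripts in this thread)
port2=6973   # replica

run() { echo "WOULD RUN: $*"; }

run "pg_ctl start (primary, port $port1)"
run "pg_ctl start (replica, port $port2)"
run "pgbench -p $port1 -qis 25                      # init pgbench tables"
run "psql -p $port1 -c 'alter table pgbench_history add column hid serial primary key'"
run "pg_dump -p $port1 ... | pg_restore -p $port2 ...   # copy empty tables"
run "psql -p $port1 -c 'create publication pub1 for all tables'"
run "psql -p $port2 -c 'create subscription sub1 ... publication pub1'"
run "pgbench -p $port1 -c 64 -j 8 -T 60 -P 12 -n"
run "compare md5s of the 4 tables on both ports (with time-out); log ok / not ok"
run "drop subscription / publication; stop both instances"
```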
Re: [HACKERS] logical replication - still unstable after all these months
On 2017-05-26 08:58, Simon Riggs wrote: On 26 May 2017 at 07:10, Erik Rijkers <e...@xs4all.nl> wrote: - Do you agree this number of failures is far too high? - Am I the only one finding so many failures? What type of failure are you getting?

The failure is that in the result state the replicated tables differ from the original tables. For instance:

-- out_20170525_0944.txt
100 -- pgbench -c 90 -j 8 -T 60 -P 12 -n -- scale 25
 93 -- All is well.
  7 -- Not good.

These numbers mean: the result state of primary and replica is not the same, in 7 out of 100 runs. 'Not the same state' means: at least one of the 4 md5's of the sorted content of the 4 pgbench tables on the primary is different from those taken from the replica.

So, 'failure' means: the 4 pgbench tables on primary and replica are not exactly the same after the (one-minute) pgbench run has finished, and logical replication has 'finished'. (Plenty of time is given for the replica to catch up: the test only declares 'failure' after 20x waiting (for 15 seconds) and 20x finding the same erroneous state (erroneous because not the same as on the primary).)

I would really like to know if you think that that doesn't amount to 'failure'.
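The catch-up time-out just described can be sketched like this. This is a simulation: `check_equal` is a stand-in that starts succeeding on the third poll, where the real harness compares the table md5s on both ports and sleeps 15 seconds between polls.

```shell
#!/bin/sh
# Simulated catch-up wait: poll up to 20 times; declare failure only if
# every poll still sees the non-matching state.
attempts=0
max_attempts=20
check_equal() { [ "$attempts" -ge 3 ]; }  # stand-in: "replica caught up" on poll 3

status="NOK"
while [ "$attempts" -lt "$max_attempts" ]
do
  if check_equal
  then
    status="replica ok"
    break
  fi
  attempts=$((attempts + 1))
  # the real harness sleeps 15 seconds here before re-checking
done
echo "$status"
```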
[HACKERS] logical replication - still unstable after all these months
If you run a pgbench session of 1 minute over a logical replication connection and repeat that 100x, this is what you get:

At clients 90, 64, 8, scale 25:

-- out_20170525_0944.txt
100 -- pgbench -c 90 -j 8 -T 60 -P 12 -n -- scale 25
 93 -- All is well.
  7 -- Not good.

-- out_20170525_1426.txt
100 -- pgbench -c 64 -j 8 -T 60 -P 12 -n -- scale 25
 82 -- All is well.
 18 -- Not good.

-- out_20170525_2049.txt
100 -- pgbench -c 8 -j 8 -T 60 -P 12 -n -- scale 25
 90 -- All is well.
 10 -- Not good

At clients 90, 64, 8, scale 5:

-- out_20170526_0126.txt
100 -- pgbench -c 90 -j 8 -T 60 -P 12 -n -- scale 5
 98 -- All is well.
  2 -- Not good.

-- out_20170526_0352.txt
100 -- pgbench -c 64 -j 8 -T 60 -P 12 -n -- scale 5
 97 -- All is well.
  3 -- Not good.

-- out_20170526_0621.txt
 45 -- pgbench -c 8 -j 8 -T 60 -P 12 -n -- scale 5
 41 -- All is well.
  3 -- Not good.

(That last one obviously not finished.)

I think this is pretty awful, really, for a beta level.

The above installations (master+replica) are with Petr Jelinek's (and Michael Paquier's) last patches:

0001-Fix-signal-handling-in-logical-workers.patch
0002-Make-tablesync-worker-exit-when-apply-dies-while-it-.patch
0003-Receive-invalidation-messages-correctly-in-tablesync.patch
Remove-the-SKIP-REFRESH-syntax-suggar-in-ALTER-SUBSC-v2.patch

Now, it could be that there is somehow something wrong with my test setup (as opposed to some bug in logical replication). I can post my test program, but I'll do that separately (but below is the core of all my tests -- it's basically still that very first test that I started out with, many months ago...)

I'd like to find out/know more about:
- Do you agree this number of failures is far too high?
- Am I the only one finding so many failures?
- Is anyone else testing the same way (more or less continually, finding only success)?
- Which of the Open Items could be responsible for this failure rate? (I don't see a match.)
- What tests do others do? Could we somehow concentrate results and method somewhere?
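Per-file summaries like the ones above are easy to regenerate from the run logs. A sketch with stand-in data; the real input would be an out_YYYYMMDD_HHMM.txt file containing one 'All is well.' or 'Not good.' line per run (the exact marker strings are assumptions based on the output shown).

```shell
#!/bin/sh
# Tally success/failure markers, as in the per-file summaries above.
# Stand-in for the contents of one out_*.txt file:
results="All is well.
Not good.
All is well.
All is well."

ok=$(echo "$results" | grep -c 'All is well')
bad=$(echo "$results" | grep -c 'Not good')
echo "$ok -- All is well."
echo "$bad -- Not good."
```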
Thanks,

Erik Rijkers

PS
The core of the 'pgbench_derail' test (bash) is simply:

echo "drop table if exists pgbench_accounts;
      drop table if exists pgbench_branches;
      drop table if exists pgbench_tellers;
      drop table if exists pgbench_history;" | psql -qXp $port1 \
&& echo "drop table if exists pgbench_accounts;
      drop table if exists pgbench_branches;
      drop table if exists pgbench_tellers;
      drop table if exists pgbench_history;" | psql -qXp $port2 \
&& pgbench -p $port1 -qis $scale \
&& echo "alter table pgbench_history add column hid serial primary key;" \
   | psql -q1Xp $port1 \
&& pg_dump -F c -p $port1 \
     --exclude-table-data=pgbench_history \
     --exclude-table-data=pgbench_accounts \
     --exclude-table-data=pgbench_branches \
     --exclude-table-data=pgbench_tellers \
     -t pgbench_history -t pgbench_accounts \
     -t pgbench_branches -t pgbench_tellers \
   | pg_restore -1 -p $port2 -d testdb

appname=derail2
echo "create publication pub1 for all tables;" | psql -p $port1 -aqtAX
echo "create subscription sub1
        connection 'port=${port1} application_name=$appname'
        publication pub1 with(enabled=false);
      alter subscription sub1 enable;" | psql -p $port2 -aqtAX

pgbench -c $clients -j $threads -T $duration -P $pseconds -n   # scale $scale

Now compare md5's of the sorted content of each of the 4 pgbench tables on primary and replica. They should be the same.
Re: [HACKERS] Race conditions with WAL sender PID lookups
On 2017-05-21 06:37, Erik Rijkers wrote: On 2017-05-20 14:40, Michael Paquier wrote: On Fri, May 19, 2017 at 3:01 PM, Masahiko Sawada <sawada.m...@gmail.com> wrote: Also, as Horiguchi-san pointed out earlier, walreceiver seems to need a similar fix. Actually, now that I look at it, ready_to_display should as well be protected by the lock of the WAL receiver, so it is incorrectly placed in walreceiver.h. As you are pointing out, pg_stat_get_wal_receiver() is lazy as well, and that's new in 10, so we have an open item here for both of them. And I am the author of both things. No issues spotted in walreceiverfuncs.c after review. I am adding an open item so that both issues are fixed in PG10. With the WAL sender part, I think that this should be a group shot. So what do you think about the attached? [walsnd-pid-races-v3.patch]

With this patch on current master my logical replication tests (pgbench-over-logical-replication) run without errors for the first time in many days (even weeks).

Unfortunately, just now another logical-replication failure occurred, the same as I have seen all along. The symptom: after starting logical replication, there are no rows in pg_stat_replication, and in the replica log logical replication complains about max_replication_slots being too low. (From previous experience I know that raising max_replication_slots does indeed 'help', but only until the next (same) error occurs, with a renewed (same) complaint.)

Also from previous experience of this failed state I know that it can be 'cleaned up' by manually emptying these tables:

delete from pg_subscription_rel;
delete from pg_subscription;
delete from pg_replication_origin;

Then it becomes possible to start a new subscription without the above symptoms. I'll do some more testing and hopefully get some information that's less vague...
Erik Rijkers
Re: [HACKERS] Race conditions with WAL sender PID lookups
On 2017-05-20 14:40, Michael Paquier wrote: On Fri, May 19, 2017 at 3:01 PM, Masahiko Sawada <sawada.m...@gmail.com> wrote: Also, as Horiguchi-san pointed out earlier, walreceiver seems to need a similar fix. Actually, now that I look at it, ready_to_display should as well be protected by the lock of the WAL receiver, so it is incorrectly placed in walreceiver.h. As you are pointing out, pg_stat_get_wal_receiver() is lazy as well, and that's new in 10, so we have an open item here for both of them. And I am the author of both things. No issues spotted in walreceiverfuncs.c after review. I am adding an open item so that both issues are fixed in PG10. With the WAL sender part, I think that this should be a group shot. So what do you think about the attached? [walsnd-pid-races-v3.patch]

With this patch on current master my logical replication tests (pgbench-over-logical-replication) run without errors for the first time in many days (even weeks). I'll do still more and longer tests, but I have already gathered a long streak of successful runs since you posted the patch, so I am getting convinced this patch has solved the problem I was experiencing. Pity it didn't make the beta.

thanks,

Erik Rijkers
Re: [HACKERS] snapbuild woes
On 2017-05-09 21:00, Petr Jelinek wrote: On 09/05/17 19:54, Erik Rijkers wrote: On 2017-05-09 11:50, Petr Jelinek wrote: Ah okay, so this is same issue that's reported by both Masahiko Sawada [1] and Jeff Janes [2]. [1] https://www.postgresql.org/message-id/CAD21AoBYpyqTSw%2B%3DES%2BxXtRGMPKh%3DpKiqjNxZKnNUae0pSt9bg%40mail.gmail.com [2] https://www.postgresql.org/message-id/flat/CAMkU%3D1xUJKs%3D2etq2K7bmbY51Q7g853HLxJ7qEB2Snog9oRvDw%40mail.gmail.com I don't understand why you come to that conclusion: both Masahiko Sawada and Jeff Janes have a DROP SUBSCRIPTION in the mix; my cases haven't. Isn't that a real difference? ( I do sometimes get that DROP-SUBSCRIPTION too, but much less often than the sync-failure. ) -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] snapbuild woes
On 2017-05-09 11:50, Petr Jelinek wrote: I rebased the above mentioned patch to apply to the patches Andres sent, if you could try to add it on top of what you have and check if it still fails, that would be helpful.

It still fails. With these patches:

- 0001-WIP-Fix-off-by-one-around-GetLastImportantRecPtr.patch
- 2-WIP-Possibly-more-robust-snapbuild-approach.patch
- fix-statistics-reporting-in-logical-replication-work.patch
- Skip-unnecessary-snapshot-builds.patch

built again on top of 44c528810a1 (so I had to add the 'fix-statistics-rep*' patch because without it I immediately got that Assertion failure again).

As always most runs succeed (especially on this large 192GB 16-core server). But attached is an output file of a number of runs of my pgbench_derail2.sh test. Overall result:

-- out_20170509_1635.txt
 3 -- pgbench -c 64 -j 8 -T 900 -P 180 -n -- scale 25
 2 -- All is well.
 1 -- Not good, but breaking out of wait (21 times no change)

I broke it off after iteration 4, so 5 never ran, and iteration 1 failed due to a mistake in the harness (something stupid I did) - not interesting.

iteration 2 succeeds (eventually has 'replica ok').
iteration 3 succeeds (eventually has 'replica ok').
iteration 4 fails.

Just after 'alter subscription sub1 enable' I caught (as is usual) pg_stat_replication.state as 'catchup'. So far so good. After the 15-minute pgbench run pg_stat_replication has only 2 'startup' lines (and none 'catchup' or 'streaming'):

 port | pg_stat_replication |  pid   |     wal     | replay_loc | diff | ?column? |  state  |   app   | sync_state
 6972 | pg_stat_replication | 108349 | 19/8FBCC248 |            |      |          | startup | derail2 | async
 6972 | pg_stat_replication | 108351 | 19/8FBCC248 |            |      |          | startup | derail2 | async

(that's from:
  select $port1 as port, 'pg_stat_replication' as pg_stat_replication, pid
       , pg_current_wal_location() wal, replay_location replay_loc
       , pg_current_wal_location() - replay_location as diff
       , pg_current_wal_location() <= replay_location
       , state, application_name as app, sync_state
    from pg_stat_replication
)

This remains in this state for as long as my test-program lets it (i.e., 20 x 30s, or something like that, and then the loop is exited); in the output file it says: 'Not good, but breaking out of wait'.

Below is the accompanying ps (with the 2 'deranged senders' as Jeff Janes would surely call them):

UID        PID    PPID C STIME TTY     STAT TIME CMD
rijkers  107147       1 0 17:11 pts/35 S+   0:00 /var/data1/pg_stuff/pg_installations/pgsql.logical_replication2/bin/postgres -D /var/data1/pg_stuff/pg_installations
rijkers  107149  107147 0 17:11 ?      Ss   0:00  \_ postgres: logger process
rijkers  107299  107147 0 17:11 ?      Ss   0:01  \_ postgres: checkpointer process
rijkers  107300  107147 0 17:11 ?      Ss   0:00  \_ postgres: writer process
rijkers  107301  107147 0 17:11 ?      Ss   0:00  \_ postgres: wal writer process
rijkers  107302  107147 0 17:11 ?      Ss   0:00  \_ postgres: autovacuum launcher process
rijkers  107303  107147 0 17:11 ?      Ss   0:00  \_ postgres: stats collector process
rijkers  107304  107147 0 17:11 ?      Ss   0:00  \_ postgres: bgworker: logical replication launcher
rijkers  108348  107147 0 17:12 ?      Ss   0:01  \_ postgres: bgworker: logical replication worker for subscription 70310 sync 70293
rijkers  108350  107147 0 17:12 ?      Ss   0:00  \_ postgres: bgworker: logical replication worker for subscription 70310 sync 70298
rijkers  107145       1 0 17:11 pts/35 S+   0:02 /var/data1/pg_stuff/pg_installations/pgsql.logical_replication/bin/postgres -D /var/data1/pg_stuff/pg_installations
rijkers  107151  107145 0 17:11 ?      Ss   0:00  \_ postgres: logger process
rijkers  107160  107145 0 17:11 ?      Ss   0:08  \_ postgres: checkpointer process
rijkers  107161  107145 0 17:11 ?      Ss   0:07  \_ postgres: writer process
rijkers  107162  107145 0 17:11 ?      Ss   0:02  \_ postgres: wal writer process
rijkers  107163  107145 0 17:11 ?      Ss   0:00  \_ postgres: autovacuum launcher process
rijkers  107164  107145 0 17:11 ?      Ss   0:02  \_ postgres: stats collector process
rijkers  107165  107145 0 17:11 ?      Ss   0:00  \_ postgres: bgworker: logical replication launcher
rijkers  108349  107145 0 17:12 ?      Ss   0:27  \_ postgres: wal sender process rijkers [local] idle
rijkers  108351  107145 0 17:12 ?      Ss   0:26  \_ postgres: wal sender process rijkers [local] idle

I have had no time to add (or view) any CPU-info.

Erik Rijkers

out_20170509_1635.txt
Description: application/elc
Re: [HACKERS] snapbuild woes
On 2017-05-09 11:50, Petr Jelinek wrote: On 09/05/17 10:59, Erik Rijkers wrote: On 2017-05-09 10:50, Petr Jelinek wrote: On 09/05/17 00:03, Erik Rijkers wrote: On 2017-05-05 02:00, Andres Freund wrote: Could you have a look? [...] I rebased the above mentioned patch to apply to the patches Andres sent, if you could try to add it on top of what you have and check if it still fails, that would be helpful.

I suppose you mean these, but they do not apply anymore:

20170505/0001-WIP-Fix-off-by-one-around-GetLastImportantRecPtr.patch
20170505/0002-WIP-Possibly-more-robust-snapbuild-approach.patch

Andres, any chance you could update them? Alternatively I could use the older version again..

thanks, Erik Rijkers
Re: [HACKERS] snapbuild woes
On 2017-05-09 10:50, Petr Jelinek wrote: On 09/05/17 00:03, Erik Rijkers wrote: On 2017-05-05 02:00, Andres Freund wrote: Could you have a look?

Running tests with these three patches:

0001-WIP-Fix-off-by-one-around-GetLastImportantRecPtr.patch
0002-WIP-Possibly-more-robust-snapbuild-approach.patch
fix-statistics-reporting-in-logical-replication-work.patch

(on top of 44c528810)

I test by 15-minute pgbench runs while there is a logical replication connection. Primary and replica are on the same machine. I have seen errors on 3 different machines (where error means: at least 1 of the 4 pgbench tables is not md5-equal). It seems that better, faster machines yield fewer errors. Normally I see in pg_stat_replication (on the master) one process in state 'streaming':

  pid  |   wal   | replay_loc  |   diff   |   state   |   app   | sync_state
 16495 | 11/EDBC | 11/EA3FEEE8 | 58462488 | streaming | derail2 | async

Often there are another two processes in pg_stat_replication that remain in state 'startup'. In the failing sessions the 'streaming'-state process is missing; failing sessions have only the two processes that are and remain in 'startup'.

Hmm, startup is the state where slot creation is happening. I wonder if it's just taking a long time to create the snapshot because of the 5th issue which is not yet fixed (and the original patch will not apply on top of this change). Alternatively there is a bug in this patch. Did you see high CPU usage during the test when there were those "startup" state walsenders?

I haven't noticed, but I didn't pay attention to that particularly. I'll try to get some CPU-info logged...
Re: [HACKERS] snapbuild woes
is going to fail. I believe this has been true for all failure cases that I've seen (except the much rarer stuck-DROP-SUBSCRIPTION which is mentioned in another thread). Sorry, I have not been able to get anything more clear or definitive... thanks, Erik Rijkers
Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table
On 2017-05-08 13:13, Masahiko Sawada wrote: On Mon, May 8, 2017 at 7:14 PM, Erik Rijkers <e...@xs4all.nl> wrote: On 2017-05-08 11:27, Masahiko Sawada wrote: FWIW, running
0001-WIP-Fix-off-by-one-around-GetLastImportantRecPtr.patch +
0002-WIP-Possibly-more-robust-snapbuild-approach.patch +
fix-statistics-reporting-in-logical-replication-work.patch
(on top of 44c528810) Thanks, which thread are these patches attached on? The first two patches are here: https://www.postgresql.org/message-id/20170505004237.edtahvrwb3uwd5rs%40alap3.anarazel.de and the last one: https://www.postgresql.org/message-id/22cc402c-88eb-fa35-217f-0060db2c72f0%402ndquadrant.com (I have to include that last one or my tests fail within minutes.) Erik Rijkers
Re: [HACKERS] Get stuck when dropping a subscription during synchronizing table
On 2017-05-08 11:27, Masahiko Sawada wrote: Hi, I encountered a situation where DROP SUBSCRIPTION got stuck when initial table sync is in progress. In my environment, I created several tables with some data on the publisher. I created a subscription on the subscriber and dropped the subscription immediately after that. It doesn't always happen but I often encountered it in my environment. The ps -x command shows the following.
96796 ? Ss 0:00 postgres: masahiko postgres [local] DROP SUBSCRIPTION
96801 ? Ts 0:00 postgres: bgworker: logical replication worker for subscription 40993 waiting
96805 ? Ss 0:07 postgres: bgworker: logical replication worker for subscription 40993 sync 16418
96806 ? Ss 0:01 postgres: wal sender process masahiko [local] idle
96807 ? Ss 0:00 postgres: bgworker: logical replication worker for subscription 40993 sync 16421
96808 ? Ss 0:00 postgres: wal sender process masahiko [local] idle
FWIW, running
0001-WIP-Fix-off-by-one-around-GetLastImportantRecPtr.patch +
0002-WIP-Possibly-more-robust-snapbuild-approach.patch +
fix-statistics-reporting-in-logical-replication-work.patch
(on top of 44c528810) I have encountered the same condition as well in the last few days, a few times (I think 2 or 3 times). Erik Rijkers
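The reproduction described above boils down to this sequence on the subscriber (a sketch; the connection string and object names are assumptions, not taken from the report):

```sql
-- Publisher side already has several populated tables in publication pub1.
CREATE SUBSCRIPTION sub1
    CONNECTION 'host=publisher dbname=postgres'
    PUBLICATION pub1;

-- Issued immediately, while the initial table-sync workers are still
-- running; this is where the hang was observed.
DROP SUBSCRIPTION sub1;
```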
Re: [HACKERS] Logical replication - TRAP: FailedAssertion in pgstat.c
On 2017-05-03 08:17, Petr Jelinek wrote: On 02/05/17 20:43, Robert Haas wrote: On Thu, Apr 20, 2017 at 2:58 PM, Peter Eisentraut code path that calls CommitTransactionCommand() should have one, no? Is there anything left to be committed here? Afaics the fix was not committed. Peter wanted a more comprehensive fix, which didn't happen. I think something like attached should do the job. I'm running my pgbench-over-logical-replication test in chunks of 15 minutes, with different pgbench -c (num clients) and -s (scale) values. With this patch (and nothing else) on top of master (8f8b9be51fd7 to be precise): fix-statistics-reporting-in-logical-replication-work.patch logical replication is still often failing (as expected, I suppose; it seems because of "initial snapshot too large") but indeed I do not see the 'TRAP: FailedAssertion in pgstat.c' anymore. (If there is any other configuration of patches worth testing please let me know) thanks Erik Rijkers
Re: [HACKERS] Logical replication - TRAP: FailedAssertion in pgstat.c
On 2017-04-17 15:59, Stas Kelvich wrote: On 17 Apr 2017, at 10:30, Erik Rijkers <e...@xs4all.nl> wrote: On 2017-04-16 20:41, Andres Freund wrote: On 2017-04-16 10:46:21 +0200, Erik Rijkers wrote: On 2017-04-15 04:47, Erik Rijkers wrote: > > 0001-Reserve-global-xmin-for-create-slot-snasphot-export.patch + > 0002-Don-t-use-on-disk-snapshots-for-snapshot-export-in-l.patch + > 0003-Prevent-snapshot-builder-xmin-from-going-backwards.patch + > 0004-Fix-xl_running_xacts-usage-in-snapshot-builder.patch + > 0005-Skip-unnecessary-snapshot-builds.patch I am now using these newer patches: https://www.postgresql.org/message-id/30242bc6-eca4-b7bb-670e-8d0458753a8c%402ndquadrant.com > It builds fine, but when I run the old pgbench-over-logical-replication > test I get: > > TRAP: FailedAssertion("!(entry->trans == ((void *)0))", File: > "pgstat.c", Line: 828) To get that error: I presume this is the fault of http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=139eb9673cb84c76f493af7e68301ae204199746 if you git revert that individual commit, do things work again? Yes, compiled from 67c2def11d4 with the above 4 patches, it runs flawlessly again. (flawlessly = a few hours without any error) I've reproduced the failure; this happens under the tablesync worker, and putting pgstat_report_stat() under the previous condition block should help. However for me it took about an hour of running this script to catch the original assert. Can you check with that patch applied? Your patch on top of the 5 patches above seems to solve the matter too: no problems after running for 2 hours (previously it failed within half a minute). Erik Rijkers
Re: [HACKERS] Logical replication - TRAP: FailedAssertion in pgstat.c
On 2017-04-16 20:41, Andres Freund wrote: On 2017-04-16 10:46:21 +0200, Erik Rijkers wrote: On 2017-04-15 04:47, Erik Rijkers wrote: > > 0001-Reserve-global-xmin-for-create-slot-snasphot-export.patch + > 0002-Don-t-use-on-disk-snapshots-for-snapshot-export-in-l.patch + > 0003-Prevent-snapshot-builder-xmin-from-going-backwards.patch + > 0004-Fix-xl_running_xacts-usage-in-snapshot-builder.patch + > 0005-Skip-unnecessary-snapshot-builds.patch I am now using these newer patches: https://www.postgresql.org/message-id/30242bc6-eca4-b7bb-670e-8d0458753a8c%402ndquadrant.com > It builds fine, but when I run the old pgbench-over-logical-replication > test I get: > > TRAP: FailedAssertion("!(entry->trans == ((void *)0))", File: > "pgstat.c", Line: 828) To get that error: I presume this is the fault of http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=139eb9673cb84c76f493af7e68301ae204199746 if you git revert that individual commit, do things work again? Yes, compiled from 67c2def11d4 with the above 4 patches, it runs flawlessly again. (flawlessly = a few hours without any error)
Re: [HACKERS] Logical replication - TRAP: FailedAssertion in pgstat.c
On 2017-04-15 04:47, Erik Rijkers wrote: 0001-Reserve-global-xmin-for-create-slot-snasphot-export.patch + 0002-Don-t-use-on-disk-snapshots-for-snapshot-export-in-l.patch+ 0003-Prevent-snapshot-builder-xmin-from-going-backwards.patch + 0004-Fix-xl_running_xacts-usage-in-snapshot-builder.patch + 0005-Skip-unnecessary-snapshot-builds.patch I am now using these newer patches: https://www.postgresql.org/message-id/30242bc6-eca4-b7bb-670e-8d0458753a8c%402ndquadrant.com It builds fine, but when I run the old pbench-over-logical-replication test I get: TRAP: FailedAssertion("!(entry->trans == ((void *)0))", File: "pgstat.c", Line: 828) To get that error: -- #!/bin/sh port1=6972 port2=6973 scale=25 clients=16 duration=60 echo "drop table if exists pgbench_accounts; drop table if exists pgbench_branches; drop table if exists pgbench_tellers; drop table if exists pgbench_history;" | psql -qXp $port1 \ && echo "drop table if exists pgbench_accounts; drop table if exists pgbench_branches; drop table if exists pgbench_tellers; drop table if exists pgbench_history;" | psql -qXp $port2 \ && pgbench -p $port1 -qis ${scale//_/} && echo " alter table pgbench_history add column hid serial primary key; " | psql -q1Xp $port1 \ && pg_dump -F c -p $port1 \ --exclude-table-data=pgbench_history \ --exclude-table-data=pgbench_accounts \ --exclude-table-data=pgbench_branches \ --exclude-table-data=pgbench_tellers \ -t pgbench_history \ -t pgbench_accounts \ -t pgbench_branches \ -t pgbench_tellers \ | pg_restore -1 -p $port2 -d testdb appname=pgbench_derail echo "create publication pub1 for all tables;" | psql -p $port1 -aqtAX echo "create subscription sub1 connection 'port=${port1} application_name=${appname}' publication pub1 with (disabled); alter subscription sub1 enable; " | psql -p $port2 -aqtAX echo "-- pgbench -p $port1 -c $clients -T $duration -n -- scale $scale " pgbench -p $port1 -c $clients -T $duration -n -- Erik Rijkers -- Sent via pgsql-hackers mailing list 
[HACKERS] Logical replication - TRAP: FailedAssertion in pgstat.c
Testing logical replication, with the following patches on top of yesterday's master:
0001-Reserve-global-xmin-for-create-slot-snasphot-export.patch +
0002-Don-t-use-on-disk-snapshots-for-snapshot-export-in-l.patch +
0003-Prevent-snapshot-builder-xmin-from-going-backwards.patch +
0004-Fix-xl_running_xacts-usage-in-snapshot-builder.patch +
0005-Skip-unnecessary-snapshot-builds.patch
Is applying that patch set still correct? It builds fine, but when I run the old pgbench-over-logical-replication test I get: TRAP: FailedAssertion("!(entry->trans == ((void *)0))", File: "pgstat.c", Line: 828) reliably (often within a minute). The test itself does not fail, at least not that I saw (but I only ran a few). thanks, Erik Rijkers
[HACKERS] the need to finish
Logical replication emits log messages like these:
DETAIL: 90 transactions need to finish.
DETAIL: 87 transactions need to finish.
DETAIL: 70 transactions need to finish.
Could we get rid of that 'need'? It strikes me as a bit off; something that people would say but not a mechanical message by a computer. I dislike it strongly. I would prefer the line to be more terse:
DETAIL: 90 transactions to finish.
Am I the only one who is annoyed by this phrase? Thanks, Erik Rijkers
Re: [HACKERS] snapbuild woes
On 2017-04-08 15:56, Andres Freund wrote: On 2017-04-08 09:51:39 -0400, David Steele wrote: On 3/2/17 7:54 PM, Petr Jelinek wrote: > > Yes the copy patch needs a rebase as well. But these ones are fine. This bug has been moved to CF 2017-07. FWIW, as these are bug fixes that need to be backpatched, I do plan to work on them soon. CF 2017-07 pertains to postgres 11, is that right? But I hope you mean to commit these snapbuild patches before the postgres 10 release? As far as I know, logical replication is still very broken without them (or at least some of that set of 5 patches - I don't know which ones are essential and which may not be). If it's at all useful I can repeat tests to show how often current master still fails (easily a 50% or so failure rate). This would be the pgbench-over-logical-replication test that I did so often earlier on. thanks, Erik Rijkers
Re: [HACKERS] monitoring.sgml missing tag
On 2017-04-07 22:50, Andres Freund wrote: On 2017-04-07 22:47:55 +0200, Erik Rijkers wrote: monitoring.sgml has one tag missing Is that actually an issue? SGML allows skipping certain close tags, and IIRC row is one of them. We'll probably move to xml at some point not too far away, but I don't think it makes much sense to fix these one-by-one. Well, I have only used make oldhtml before now so maybe I am doing something wrong. I try to run make html. First, I got this (just showing the first few of a 75x repeat):
$ time ( cd /home/aardvark/pg_stuff/pg_sandbox/pgsql.HEAD/doc/src/sgml; make html; )
osx -D . -D . -x lower postgres.sgml >postgres.xml.tmp
osx:monitoring.sgml:1278:12:E: document type does not allow element "ROW" here
osx:monitoring.sgml:1282:12:E: document type does not allow element "ROW" here
osx:monitoring.sgml:1286:12:E: document type does not allow element "ROW" here
...
osx:monitoring.sgml:1560:12:E: document type does not allow element "ROW" here
osx:monitoring.sgml:1564:13:E: end tag for "ROW" omitted, but OMITTAG NO was specified
osx:monitoring.sgml:1275:8: start tag was here
make: *** [postgres.xml] Error 1
After closing that tag with </row>, make html still fails:
$ time ( cd /home/aardvark/pg_stuff/pg_sandbox/pgsql.HEAD/doc/src/sgml; make html; )
osx -D . -D . -x lower postgres.sgml >postgres.xml.tmp
'/opt/perl-5.24/bin/perl' -p -e 's/\[(aacute|acirc|aelig|agrave|amp|aring|atilde|auml|bull|copy|eacute|egrave|gt|iacute|lt|mdash|nbsp|ntilde|oacute|ocirc|oslash|ouml|pi|quot|scaron|uuml) *\]/\&\1;/gi;' -e '$_ .= qq{XML V4.2//EN" "http://www.oasis-open.org/docbook/xml/4.2/docbookx.dtd;>\n} if $. == 1;' postgres.xml
rm postgres.xml.tmp
xmllint --noout --valid postgres.xml
xsltproc --stringparam pg.version '10devel' stylesheet.xsl postgres.xml
runtime error: file stylesheet-html-common.xsl line 41 element call-template The called template 'id.attribute' was not found.
runtime error: file stylesheet-html-common.xsl line 41 element call-template The called template 'id.attribute' was not found.
runtime error: file stylesheet-html-common.xsl line 41 element call-template The called template 'id.attribute' was not found.
runtime error: file stylesheet-html-common.xsl line 41 element call-template The called template 'id.attribute' was not found.
runtime error: file stylesheet-html-common.xsl line 41 element call-template The called template 'id.attribute' was not found.
runtime error: file stylesheet-html-common.xsl line 30 element call-template The called template 'id.attribute' was not found.
runtime error: file stylesheet-html-common.xsl line 30 element call-template The called template 'id.attribute' was not found.
runtime error: file stylesheet-html-common.xsl line 30 element call-template The called template 'id.attribute' was not found.
runtime error: file stylesheet-html-common.xsl line 30 element call-template The called template 'id.attribute' was not found.
runtime error: file stylesheet-html-common.xsl line 30 element call-template The called template 'id.attribute' was not found.
runtime error: file stylesheet-html-common.xsl line 30 element call-template The called template 'id.attribute' was not found.
no result for postgres.xml
make: *** [html-stamp] Error 9
real 4m23.641s
user 4m22.304s
sys 0m0.914s
Any hints welcome... thanks
$ cat /etc/redhat-release
CentOS release 6.6 (Final)
[HACKERS] monitoring.sgml missing tag
monitoring.sgml has one tag missing
--- doc/src/sgml/monitoring.sgml.orig 2017-04-07 22:37:55.388708334 +0200
+++ doc/src/sgml/monitoring.sgml 2017-04-07 22:38:16.582047695 +0200
@@ -1275,6 +1275,7 @@
 ProcArrayGroupUpdate Waiting for group leader to clear transaction id at transaction end.
+ </row>
 SafeSnapshot Waiting for a snapshot for a READ ONLY DEFERRABLE transaction.
Re: [HACKERS] Logical replication existing data copy
(At the moment using these patches for tests:)
0001-Reserve-global-xmin-for-create-slot-snasphot-export.patch +
0002-Don-t-use-on-disk-snapshots-for-snapshot-export-in-l.patch +
0003-Prevent-snapshot-builder-xmin-from-going-backwards.patch +
0004-Fix-xl_running_xacts-usage-in-snapshot-builder.patch +
0005-Skip-unnecessary-snapshot-builds.patch +
and now (Tuesday 30) added:
0001-Fix-remote-position-tracking-in-logical-replication.patch
I think what you have seen is because of this: https://www.postgresql.org/message-id/flat/b235fa69-147a-5e09-f8f3-3f780a1ab...@2ndquadrant.com#b235fa69-147a-5e09-f8f3-3f780a1ab...@2ndquadrant.com You were right: with that 6th patch (and wal_sender_timeout back at its default 60s) there are no errors either (I tested on all 3 test machines). I must have missed that last patch when you posted it. Anyway all seems fine now; I hope the above patches can all be committed soon. thanks, Erik Rijkers
Re: [HACKERS] Logical replication existing data copy
On 2017-03-09 11:06, Erik Rijkers wrote: I use three different machines (2 desktop, 1 server) to test logical replication, and all three have now at least once failed to correctly synchronise a pgbench session (amidst many successful runs, of course) (At the moment using these patches for tests:)
0001-Reserve-global-xmin-for-create-slot-snasphot-export.patch +
0002-Don-t-use-on-disk-snapshots-for-snapshot-export-in-l.patch +
0003-Prevent-snapshot-builder-xmin-from-going-backwards.patch +
0004-Fix-xl_running_xacts-usage-in-snapshot-builder.patch +
0005-Skip-unnecessary-snapshot-builds.patch +
The failed tests that I kept seeing (see the pgbench-over-logical-replication tests upthread) were never really 'solved'. But I have now finally figured out what caused these unexpected failed tests: it was wal_sender_timeout, or rather its default of 60 s. This caused 'terminating walsender process due to replication timeout' on the primary (not strictly an error), and the concomitant ERROR on the replica: 'could not receive data from WAL stream: server closed the connection unexpectedly'. Here is a typical example (primary/replica logs time-intertwined, each line marked with 'primary' or 'replica'): [...]
2017-03-24 16:21:38.129 CET [15002] primary LOG: using stale statistics instead of current ones because stats collector is not responding
2017-03-24 16:21:42.690 CET [27515] primary LOG: using stale statistics instead of current ones because stats collector is not responding
2017-03-24 16:21:42.965 CET [14999] replica LOG: using stale statistics instead of current ones because stats collector is not responding
2017-03-24 16:21:49.816 CET [14930] primary LOG: terminating walsender process due to replication timeout
2017-03-24 16:21:49.817 CET [14926] replica ERROR: could not receive data from WAL stream: server closed the connection unexpectedly
2017-03-24 16:21:49.824 CET [27502] replica LOG: worker process: logical replication worker for subscription 24864 (PID 14926) exited with exit code 1
2017-03-24 16:21:49.824 CET [27521] replica LOG: starting logical replication worker for subscription "sub1"
2017-03-24 16:21:49.828 CET [15008] replica LOG: logical replication apply for subscription sub1 started
2017-03-24 16:21:49.832 CET [15009] primary LOG: received replication command: IDENTIFY_SYSTEM
2017-03-24 16:21:49.832 CET [15009] primary LOG: received replication command: START_REPLICATION SLOT "sub1" LOGICAL 3/FC976440 (proto_version '1', publication_names '"pub1"')
2017-03-24 16:21:49.833 CET [15009] primary DETAIL: streaming transactions committing after 3/FC889810, reading WAL from 3/FC820FC0
2017-03-24 16:21:49.833 CET [15009] primary LOG: starting logical decoding for slot "sub1"
2017-03-24 16:21:50.471 CET [15009] primary DETAIL: Logical decoding will begin using saved snapshot.
2017-03-24 16:21:50.471 CET [15009] primary LOG: logical decoding found consistent point at 3/FC820FC0
2017-03-24 16:21:51.169 CET [15008] replica DETAIL: Key (hid)=(9014) already exists.
2017-03-24 16:21:51.169 CET [15008] replica ERROR: duplicate key value violates unique constraint "pgbench_history_pkey"
2017-03-24 16:21:51.170 CET [27502] replica LOG: worker process: logical replication worker for subscription 24864 (PID 15008) exited with exit code 1
2017-03-24 16:21:51.170 CET [27521] replica LOG: starting logical replication worker for subscription "sub1"
[...]
My primary and replica were always on a single machine (making it more likely that that timeout is reached?). In my testing it seems that reaching the timeout on the primary (and 'closing the connection unexpectedly' on the replica) does not necessarily break the logical replication. But almost all log-rep failures that I have seen were started by this sequence of events. After setting wal_sender_timeout to 3 minutes there were no more failed tests. Perhaps it warrants setting wal_sender_timeout a bit higher than the current default of 60 seconds? After all I also saw the 'replication timeout' / 'closed the connection' couple rather often during not-failing tests. (These also disappeared, almost completely, with a higher setting of wal_sender_timeout.) In any case it would be good to mention the setting (and its potentially deteriorating effect) somewhere nearer the logical replication treatment. (I read about wal_sender_timeout and keepalive ping, perhaps there's (still) something amiss there? Just a guess, I don't know.) As I said, I saw no more failures with the higher 3-minute setting, with one exception: the one test that straddled the DST change (Saturday 24 March 02:00 h). I am happy to discount that one failure but strictly speaking I suppose it should be able to take DST into its stride. Thanks, Erik Rijkers
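For reference, the workaround described above is a one-line configuration change on the primary (a sketch; 3min is simply the value used in these tests, not a tuned recommendation):

```
# postgresql.conf on the primary (sketch)
wal_sender_timeout = 3min    # default 60s; 0 disables the timeout entirely
```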
[HACKERS] walsender.c comments
Small fry gathered while reading walsender.c ... (to be applied to master) Thanks, Erik Rijkers
--- src/backend/replication/walsender.c.orig 2017-03-28 08:34:56.787217522 +0200
+++ src/backend/replication/walsender.c 2017-03-28 08:44:56.486327700 +0200
@@ -14,11 +14,11 @@
  * replication-mode commands. The START_REPLICATION command begins streaming
  * WAL to the client. While streaming, the walsender keeps reading XLOG
  * records from the disk and sends them to the standby server over the
- * COPY protocol, until the either side ends the replication by exiting COPY
+ * COPY protocol, until either side ends the replication by exiting COPY
  * mode (or until the connection is closed).
  *
  * Normal termination is by SIGTERM, which instructs the walsender to
- * close the connection and exit(0) at next convenient moment. Emergency
+ * close the connection and exit(0) at the next convenient moment. Emergency
  * termination is by SIGQUIT; like any backend, the walsender will simply
  * abort and exit on SIGQUIT. A close of the connection and a FATAL error
  * are treated as not a crash but approximately normal termination;
@@ -277,7 +277,7 @@
  * Clean up after an error.
  *
  * WAL sender processes don't use transactions like regular backends do.
- * This function does any cleanup requited after an error in a WAL sender
+ * This function does any cleanup required after an error in a WAL sender
  * process, similar to what transaction abort does in a regular backend.
  */
 void
@@ -570,7 +570,7 @@
 	sendTimeLineIsHistoric = true;
 
 	/*
-	 * Check that the timeline the client requested for exists, and
+	 * Check that the timeline the client requested exists, and
 	 * the requested start location is on that timeline.
 	 */
 	timeLineHistory = readTimeLineHistory(ThisTimeLineID);
@@ -588,8 +588,8 @@
 	 * starting point. This is because the client can legitimately
 	 * request to start replication from the beginning of the WAL
 	 * segment that contains switchpoint, but on the new timeline, so
-	 * that it doesn't end up with a partial segment. If you ask for a
-	 * too old starting point, you'll get an error later when we fail
+	 * that it doesn't end up with a partial segment. If you ask for
+	 * too old a starting point, you'll get an error later when we fail
 	 * to find the requested WAL segment in pg_wal.
 	 *
 	 * XXX: we could be more strict here and only allow a startpoint
@@ -626,7 +626,7 @@
 	{
 		/*
 		 * When we first start replication the standby will be behind the
-		 * primary. For some applications, for example, synchronous
+		 * primary. For some applications, for example synchronous
 		 * replication, it is important to have a clear state for this initial
 		 * catchup mode, so we can trigger actions when we change streaming
 		 * state later. We may stay in this state for a long time, which is
@@ -954,7 +954,7 @@
 	ReplicationSlotMarkDirty();
 
-	/* Write this slot to disk if it's permanent one. */
+	/* Write this slot to disk if it's a permanent one. */
 	if (!cmd->temporary)
 		ReplicationSlotSave();
 }
@@ -,7 +,7 @@
  *
  * Prepare a write into a StringInfo.
  *
- * Don't do anything lasting in here, it's quite possible that nothing will done
+ * Don't do anything lasting in here, it's quite possible that nothing will be done
  * with the data.
  */
 static void
@@ -1150,7 +1150,7 @@
 	/*
 	 * Fill the send timestamp last, so that it is taken as late as possible.
-	 * This is somewhat ugly, but the protocol's set as it's already used for
+	 * This is somewhat ugly, but the protocol is set as it's already used for
 	 * several releases by streaming physical replication.
 	 */
 	resetStringInfo();
@@ -1237,7 +1237,7 @@
 	/*
-	 * Fast path to avoid acquiring the spinlock in the we already know we
+	 * Fast path to avoid acquiring the spinlock in case we already know we
 	 * have enough WAL available. This is particularly interesting if we're
 	 * far behind.
 	 */
@@ -2498,7 +2498,7 @@
 	 * given the current implementation of XLogRead(). And in any case
 	 * it's unsafe to send WAL that is not securely down to disk on the
 	 * master: if the master subsequently crashes and restarts, slaves
-	 * must not have applied any WAL that gets lost on the master.
+	 * must not have applied any WAL that got lost on the master.
 	 */
 	SendRqstPtr = GetFlushRecPtr();
 }
@@ -2522,7 +2522,7 @@
 	 * LSN.
 	 *
 	 * Note that the LSN is not necessarily the LSN for the data contained in
-	 * the present message; it's the end of the the WAL, which might be
+	 * the present message; it's the end of the WAL, which might be
 	 * further ahead. All the lag tracking machinery cares about is finding
 	 * out when that arbitrary LSN is eventually reported as written, flushed
 	 * and applied, so that it can measure the elapsed time.
@@ -2922,7 +2922,7 @@
  * Wake up all walsenders
  *
  * This will be called inside critical sections, so throw
Re: [HACKERS] Logical replication existing data copy
On 2017-03-24 10:45, Mark Kirkwood wrote: However one minor observation - as Michael Banck noted - the elapsed time for the slave to catch up after running: $ pgbench -c8 -T600 bench on the master was (subjectively) much longer than for physical streaming replication. Is this expected? I think you probably want to do (on the slave): alter role set synchronous_commit = off; otherwise it's indeed extremely slow.
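Spelled out, the suggestion above looks like this on the subscriber (a sketch; the role name is an assumption — it should be the role the subscription's apply workers run as):

```sql
-- Per-role setting (role name hypothetical):
ALTER ROLE replication_user SET synchronous_commit = off;

-- Or instance-wide:
ALTER SYSTEM SET synchronous_commit = off;
SELECT pg_reload_conf();
```

This trades durability of the subscriber's most recent commits for apply speed.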
Re: [HACKERS] bug/oversight in TestLib.pm and PostgresNode.pm
On 2017-03-23 03:28, Michael Paquier wrote: On Thu, Mar 23, 2017 at 12:51 AM, Erik Rijkers <e...@xs4all.nl> wrote: While trying to test pgbench's stderr (looking for 'creating tables' in output of the initialisation step) I ran into these two bugs (or perhaps better 'oversights').
+ if (defined $expected_stderr) {
+   like($stderr, $expected_stderr, "$test_name: stderr matches");
+ }
+ else {
    is($stderr, '', "$test_name: no stderr");
- like($stdout, $expected_stdout, "$test_name: matches");
+ }
To simplify that you could as well set expected_output to be an empty string, and just use like() instead of is(), saving this if/else. (I'll assume you meant '$expected_stderr' (not 'expected_output')) That would be nice but with that, other tests start complaining: "doesn't look like a regex to me" To avoid that, I uglified your version back to:
+ like($stderr, (defined $expected_stderr ? $expected_stderr : qr{}),
+   "$test_name: stderr matches");
I did it like that in the attached patch (0001-testlib-like-stderr.diff). The other (PostgresNode.pm.diff) is unchanged. make check-world passes without error. Thanks, Erik Rijkers
--- src/test/perl/TestLib.pm.orig 2017-03-23 08:11:16.034410936 +0100
+++ src/test/perl/TestLib.pm 2017-03-23 08:12:33.154132124 +0100
@@ -289,13 +289,14 @@
 sub command_like
 {
- my ($cmd, $expected_stdout, $test_name) = @_;
+ my ($cmd, $expected_stdout, $test_name, $expected_stderr) = @_;
  my ($stdout, $stderr);
  print("# Running: " . join(" ", @{$cmd}) . "\n");
  my $result = IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
  ok($result, "$test_name: exit code 0");
- is($stderr, '', "$test_name: no stderr");
- like($stdout, $expected_stdout, "$test_name: matches");
+ like($stderr, (defined $expected_stderr ? $expected_stderr : qr{}),
+   "$test_name: stderr matches");
+ like($stdout, $expected_stdout, "$test_name: stdout matches");
 }
 sub command_fails_like
--- src/test/perl/PostgresNode.pm.orig 2017-03-22 15:58:58.690052999 +0100
+++ src/test/perl/PostgresNode.pm 2017-03-22 15:49:38.422777312 +0100
@@ -1283,6 +1283,23 @@
 =pod
+=item $node->command_fails_like(...) - TestLib::command_fails_like with our PGPORT
+
+See command_ok(...)
+
+=cut
+
+sub command_fails_like
+{
+ my $self = shift;
+
+ local $ENV{PGPORT} = $self->port;
+
+ TestLib::command_fails_like(@_);
+}
+
+=pod
+
 =item $node->issues_sql_like(cmd, expected_sql, test_name)
 Run a command on the node, then verify that $expected_sql appears in the
[HACKERS] bug/oversight in TestLib.pm and PostgresNode.pm
I am trying to re-create pgbench-over-logical-replication as a TAP test. (the wisdom of that might be doubted, and I appreciate comments on it too, but it's really another subject). While trying to test pgbench's stderr (looking for 'creating tables' in output of the initialisation step) I ran into these two bugs (or perhaps better 'oversights'). But especially the omission of command_fails_like() in PostgresNode.pm feels like a bug. In the end it was necessary to change TestLib.pm's command_like() because command_fails_like() also checks for a non-zero return value (which seems to make sense, but is not possible in this case: pgbench returns 0 on init with output on stderr). make check-world passes without error Thanks, Erik Rijkers
--- src/test/perl/TestLib.pm.orig 2017-03-22 11:34:36.948857255 +0100
+++ src/test/perl/TestLib.pm 2017-03-22 14:36:56.793267113 +0100
@@ -289,13 +290,18 @@
 sub command_like
 {
- my ($cmd, $expected_stdout, $test_name) = @_;
+ my ($cmd, $expected_stdout, $test_name, $expected_stderr) = @_;
  my ($stdout, $stderr);
  print("# Running: " . join(" ", @{$cmd}) . "\n");
  my $result = IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
  ok($result, "$test_name: exit code 0");
+ if (defined $expected_stderr) {
+   like($stderr, $expected_stderr, "$test_name: stderr matches");
+ }
+ else {
    is($stderr, '', "$test_name: no stderr");
- like($stdout, $expected_stdout, "$test_name: matches");
+ }
+ like($stdout, $expected_stdout, "$test_name: stdout matches");
 }
 sub command_fails_like
--- src/test/perl/PostgresNode.pm.orig 2017-03-22 15:58:58.690052999 +0100
+++ src/test/perl/PostgresNode.pm 2017-03-22 15:49:38.422777312 +0100
@@ -1283,6 +1283,23 @@
 =pod
+=item $node->command_fails_like(...) - TestLib::command_fails_like with our PGPORT
+
+See command_ok(...)
+ +=cut + +sub command_fails_like +{ + my $self = shift; + + local $ENV{PGPORT} = $self->port; + + TestLib::command_fails_like(@_); +} + +=pod + =item $node->issues_sql_like(cmd, expected_sql, test_name) Run a command on the node, then verify that $expected_sql appears in the -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
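The failure mode described above — a program that exits 0 while still writing its progress to stderr — is easy to reproduce outside the TAP framework. A minimal shell sketch (the `fake_init` function is a hypothetical stand-in for `pgbench -i`) of why a test helper must be able to match stderr separately instead of treating any stderr output as a failure:

```shell
# A command can exit 0 and still write progress to stderr, so a test helper
# must match stderr without treating it as a failure.
# 'fake_init' is a hypothetical stand-in for 'pgbench -i'.
fake_init() {
    echo "done."                     # result on stdout
    echo "creating tables..." >&2    # progress on stderr; exit status stays 0
}

stdout=$(fake_init 2>/tmp/fake_init.err)
status=$?
stderr=$(cat /tmp/fake_init.err)

[ "$status" -eq 0 ] && echo "exit code 0"
case "$stderr" in
    *"creating tables"*) echo "stderr matches" ;;
    *)                   echo "stderr does not match" ;;
esac
case "$stdout" in
    *"done."*) echo "stdout matches" ;;
esac
```

This mirrors the three checks the patched command_like() performs: exit code, an optional stderr pattern, and the stdout pattern.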
Re: [HACKERS] more on comments of snapbuild.c
On 2017-03-18 06:37, Erik Rijkers wrote: Studying logrep yielded some more improvements to the comments in snapbuild.c (to be applied to master)

Attached is the actual file.

thanks,
Erik Rijkers

--- src/backend/replication/logical/snapbuild.c.orig	2017-03-18 05:02:28.627077888 +0100
+++ src/backend/replication/logical/snapbuild.c	2017-03-18 06:04:48.091686815 +0100
@@ -27,7 +27,7 @@
  * removed. This is achieved by using the replication slot mechanism.
  *
  * As the percentage of transactions modifying the catalog normally is fairly
- * small in comparisons to ones only manipulating user data, we keep track of
+ * small in comparison to ones only manipulating user data, we keep track of
  * the committed catalog modifying ones inside [xmin, xmax) instead of keeping
  * track of all running transactions like it's done in a normal snapshot. Note
  * that we're generally only looking at transactions that have acquired an
@@ -42,7 +42,7 @@
  * catalog in a transaction. During normal operation this is achieved by using
  * CommandIds/cmin/cmax. The problem with that however is that for space
  * efficiency reasons only one value of that is stored
- * (c.f. combocid.c). Since ComboCids are only available in memory we log
+ * (cf. combocid.c). Since ComboCids are only available in memory we log
  * additional information which allows us to get the original (cmin, cmax)
  * pair during visibility checks. Check the reorderbuffer.c's comment above
  * ResolveCminCmaxDuringDecoding() for details.
@@ -92,7 +92,7 @@
  * Only transactions that commit after CONSISTENT state has been reached will
  * be replayed, even though they might have started while still in
  * FULL_SNAPSHOT. That ensures that we'll reach a point where no previous
- * changes has been exported, but all the following ones will be. That point
+ * changes have been exported, but all the following ones will be. That point
  * is a convenient point to initialize replication from, which is why we
  * export a snapshot at that point, which *can* be used to read normal data.
  *
@@ -134,7 +134,7 @@

 /*
  * This struct contains the current state of the snapshot building
- * machinery. Besides a forward declaration in the header, it is not exposed
+ * machinery. Except for a forward declaration in the header, it is not exposed
  * to the public, so we can easily change its contents.
  */
 struct SnapBuild
@@ -442,7 +442,7 @@
 	/*
 	 * We misuse the original meaning of SnapshotData's xip and subxip fields
-	 * to make the more fitting for our needs.
+	 * to make them more fitting for our needs.
 	 *
 	 * In the 'xip' array we store transactions that have to be treated as
 	 * committed. Since we will only ever look at tuples from transactions
@@ -645,7 +645,7 @@
 /*
  * Handle the effects of a single heap change, appropriate to the current state
- * of the snapshot builder and returns whether changes made at (xid, lsn) can
+ * of the snapshot builder and return whether changes made at (xid, lsn) can
  * be decoded.
  */
 bool
@@ -1143,7 +1143,7 @@
 	 */
 	builder->xmin = running->oldestRunningXid;

-	/* Remove transactions we don't need to keep track off anymore */
+	/* Remove transactions we don't need to keep track of anymore */
 	SnapBuildPurgeCommittedTxn(builder);

 	elog(DEBUG3, "xmin: %u, xmax: %u, oldestrunning: %u",
@@ -1250,7 +1250,7 @@
 	}

 	/*
-	 * a) No transaction were running, we can jump to consistent.
+	 * a) No transactions were running, we can jump to consistent.
 	 *
 	 * NB: We might have already started to incrementally assemble a snapshot,
 	 * so we need to be careful to deal with that.
@@ -1521,8 +1521,8 @@
 		(uint32) (lsn >> 32), (uint32) lsn, MyProcPid);

 	/*
-	 * Unlink temporary file if it already exists, needs to have been before a
-	 * crash/error since we won't enter this function twice from within a
+	 * Unlink temporary file if it already exists, must have been from before
+	 * a crash/error since we won't enter this function twice from within a
 	 * single decoding slot/backend and the temporary file contains the pid of
 	 * the current process.
 	 */
@@ -1624,8 +1624,8 @@
 	fsync_fname("pg_logical/snapshots", true);

 	/*
-	 * Now there's no way we can loose the dumped state anymore, remember this
-	 * as a serialization point.
+	 * Now that there's no way we can lose the dumped state anymore, remember
+	 * this as a serialization point.
 	 */
 	builder->last_serialized_snapshot = lsn;
[HACKERS] more on comments of snapbuild.c
Studying logrep yielded some more improvements to the comments in snapbuild.c (to be applied to master).

thanks,
Erik Rijkers
Re: [HACKERS] \if, \elseif, \else, \endif (was Re: PSQL commands: \quit_if, \quit_unless)
On 2017-03-17 02:28, Corey Huinker wrote: Attached is the latest work. Not everything is done yet. I post it because

0001.if_endif.v23.diff

This patch does not compile for me (gcc 6.3.0):

command.c:38:25: fatal error: conditional.h: No such file or directory
 #include "conditional.h"
          ^
compilation terminated.
make[3]: *** [command.o] Error 1
make[2]: *** [all-psql-recurse] Error 2
make[2]: *** Waiting for unfinished jobs
make[1]: *** [all-bin-recurse] Error 2
make: *** [all-src-recurse] Error 2

Perhaps that is expected, as "Not everything is done yet", but I can't tell from your email so I thought I'd report it anyway. Ignore as appropriate...

Thanks,
Erik Rijkers
[HACKERS] improve comments of snapbuild.c
Improvements (grammar/typos) in the comments in snapbuild.c. To be applied to master.

thanks,
Erik Rijkers

--- src/backend/replication/logical/snapbuild.c.orig	2017-03-14 21:53:42.590196415 +0100
+++ src/backend/replication/logical/snapbuild.c	2017-03-14 21:57:57.906539208 +0100
@@ -34,7 +34,7 @@
  * xid. That is we keep a list of transactions between snapshot->(xmin, xmax)
  * that we consider committed, everything else is considered aborted/in
  * progress. That also allows us not to care about subtransactions before they
- * have committed which means this modules, in contrast to HS, doesn't have to
+ * have committed which means this module, in contrast to HS, doesn't have to
 * care about suboverflowed subtransactions and similar.
  *
  * One complexity of doing this is that to e.g. handle mixed DDL/DML
@@ -82,7 +82,7 @@
  * Initially the machinery is in the START stage. When an xl_running_xacts
  * record is read that is sufficiently new (above the safe xmin horizon),
  * there's a state transition. If there were no running xacts when the
- * runnign_xacts record was generated, we'll directly go into CONSISTENT
+ * running_xacts record was generated, we'll directly go into CONSISTENT
  * state, otherwise we'll switch to the FULL_SNAPSHOT state. Having a full
  * snapshot means that all transactions that start henceforth can be decoded
  * in their entirety, but transactions that started previously can't. In
@@ -273,7 +273,7 @@
 /*
  * Allocate a new snapshot builder.
  *
- * xmin_horizon is the xid >=which we can be sure no catalog rows have been
+ * xmin_horizon is the xid >= which we can be sure no catalog rows have been
  * removed, start_lsn is the LSN >= we want to replay commits.
  */
 SnapBuild *
@@ -1840,7 +1840,7 @@
 	char		path[MAXPGPATH];

 	/*
-	 * We start of with a minimum of the last redo pointer. No new replication
+	 * We start off with a minimum of the last redo pointer. No new replication
 	 * slot will start before that, so that's a safe upper bound for removal.
 	 */
 	redo = GetRedoRecPtr();
@@ -1898,7 +1898,7 @@
 			/*
 			 * It's not particularly harmful, though strange, if we can't
 			 * remove the file here. Don't prevent the checkpoint from
-			 * completing, that'd be cure worse than the disease.
+			 * completing, that'd be a cure worse than the disease.
 			 */
 			if (unlink(path) < 0)
 			{
Re: [HACKERS] Logical replication existing data copy
On 2017-03-09 11:06, Erik Rijkers wrote: On 2017-03-08 10:36, Petr Jelinek wrote: On 07/03/17 23:30, Erik Rijkers wrote: On 2017-03-06 11:27, Petr Jelinek wrote:

0001-Reserve-global-xmin-for-create-slot-snasphot-export.patch +
0002-Don-t-use-on-disk-snapshots-for-snapshot-export-in-l.patch +
0003-Prevent-snapshot-builder-xmin-from-going-backwards.patch +
0004-Fix-xl_running_xacts-usage-in-snapshot-builder.patch +
0005-Skip-unnecessary-snapshot-builds.patch +
0001-Logical-replication-support-for-initial-data-copy-v6.patch

The attached bz2 contains:
- an output file from pgbench_derail2.sh (also attached, as it changes somewhat all the time);
- the pg_waldump output from both master (file with .1. in it) and replica (.2.);
- the 2 logfiles.

I forgot to include the bash-output file. Now attached. This file should have been in the bz2 I sent a few minutes ago.

= iteration 1 -- 1 of 10 =
-- scale 25 clients 64 duration 300 CLEAN_ONLY=
-- hostname: barzoi
-- timestamp: 20170309_1021
-- master_start_time 2017-03-08 12:04:02.127127+01 replica_start_time 2017-03-08 12:04:02.12713+01
-- master patch-md5 [59c92165d4a328d68450ef0e922c0a42]
-- replica patch-md5 [59c92165d4a328d68450ef0e922c0a42] (ok)
-- synchronous_commit, master [on] replica [off]
-- master_assert [on] replica_assert [on]
-- self md5 87554cfed7cda67ad292b6481e1b8b41 ./pgbench_derail2.sh
clean-at-start-call
creating tables...
1699900 of 250 tuples (67%) done (elapsed 5.19 s, remaining 2.44 s)
250 of 250 tuples (100%) done (elapsed 7.51 s, remaining 0.00 s)
vacuum...
set primary keys...
done.
create publication pub1 for all tables; create subscription sub1 connection 'port=6972' publication pub1 with (disabled); alter subscription sub1 enable; -- pgbench -c 64 -j 8 -T 300 -P 60 -n -- scale 25 progress: 60.0 s, 134.4 tps, lat 472.280 ms stddev 622.992 progress: 120.0 s, 26.4 tps, lat 2083.748 ms stddev 4356.546 progress: 180.0 s, 21.2 tps, lat 2977.751 ms stddev 4767.332 progress: 240.0 s, 13.5 tps, lat 5230.657 ms stddev 7029.718 progress: 300.0 s, 42.4 tps, lat 1555.645 ms stddev 1733.152 transaction type: scaling factor: 25 query mode: simple number of clients: 64 number of threads: 8 duration: 300 s number of transactions actually processed: 14336 latency average = 1342.222 ms latency stddev = 3043.759 ms tps = 47.383887 (including connections establishing) tps = 47.385513 (excluding connections establishing) -- waiting 0s... (always) 2017.03.09 10:27:56 -- getting md5 (cb) 6972 a,b,t,h: 250 25250 14336 ee0f7bfd9 960d7d79c 3e8af1e9e cd2bd0395 master 6973 a,b,t,h: 250 25250 14336 ee0f7bfd9 960d7d79c 3e8af1e9e cd2bd0395 replica ok 578113f12 2017.03.09 10:29:18 -- All is well. -- 0 seconds total. scale 25 clients 64 -T 300 -- waiting 20s, then end-cleaning clean-at-end-call sub_count -ne 0 : deleting sub1 (plain) ERROR: could not drop the replication slot "sub1" on publisher DETAIL: The error was: ERROR: replication slot "sub1" is active for PID 10569 sub_count -ne 0 : deleting sub1 (nodrop) pub_count -ne 0 - deleting pub1 pub_repl_slot_count -ne 0 - deleting (sub1) ERROR: replication slot "sub1" is active for PID 10569 pub_count 0 pub_repl_slot_count 1 sub_count 0 sub_repl_slot_count 0 -- imperfect cleanup, pg_waldump to unclean.20170309_1021.txt.bz2, waiting 60 s, then exit -- testset.sh: waiting 10s... 
= iteration 2 -- 2 of 10 = -- scale 25 clients 64 duration 300 CLEAN_ONLY= -- hostname: barzoi -- timestamp: 20170309_1021 -- master_start_time 2017-03-08 12:04:02.127127+01 replica_start_time 2017-03-08 12:04:02.12713+01 -- master patch-md5 [59c92165d4a328d68450ef0e922c0a42] -- replica patch-md5 [59c92165d4a328d68450ef0e922c0a42] (ok) -- synchronous_commit, master [on] replica [off] -- master_assert [on] replica_assert [on] -- self md5 87554cfed7cda67ad292b6481e1b8b41 ./pgbench_derail2.sh clean-at-start-call pub_repl_slot_count -ne 0 - deleting (sub1) pg_drop_replication_slot -- (1 row) creating tables... 1596800 of 250 tuples (63%) done (elapsed 5.09 s, remaining 2.88 s) 250 of 250 tuples (100%) done (elapsed 7.88 s, remaining 0.00 s) vacuum... set primary keys... done. create publication pub1 for all tables; create subscription sub1 connection 'port=6972' publication pub1 with (disabled); alter subscription sub1 enable; -- pgbench -c 64 -j 8 -T 300 -P 60 -n -- scale 25 progress: 60.0 s, 129.0 tps, lat 493.130 ms stddev 635.654 progress: 120.0 s, 34.0 tps, l
Re: [HACKERS] Logical replication existing data copy
On 2017-03-09 11:06, Erik Rijkers wrote:

file Name: logrep.20170309_1021.1.1043.scale_25.clients_64.NOK.log
20170309_1021 is the start-time of the script
1 is master (2 is replica)
1043 is the time, 10:43, just before the pg_waldump call

Sorry, that might be confusing. That 10:43 is the time when the script renames and copies the logfiles (not the waldump). I meant to show the name of the waldump file:

waldump.20170309_1021_1039.1.5.000100270069.txt.bz2

where:
20170309_1021 is the start-time of the script
1 is master (2 is replica)
5 is wait-state cycles during which all 8 md5s remained the same
1039 is the time, 10:39, just before the pg_waldump call
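The naming scheme just described is purely dot-separated, so the fields can be split back out mechanically. A small sketch (assuming the field layout above; the file name is the one from the email):

```shell
# Split a waldump file name into the fields described above:
# time stamps, node (1=master, 2=replica), stable-md5 wait cycles, segment.
name="waldump.20170309_1021_1039.1.5.000100270069.txt.bz2"
IFS=. read -r prefix stamps node cycles segment rest <<EOF
$name
EOF
echo "stamps=$stamps node=$node cycles=$cycles segment=$segment"
```

The second field carries both the script start time and the dump time, joined with an underscore, so it survives the dot-splitting intact.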
Re: [HACKERS] Logical replication existing data copy
On 2017-03-06 11:27, Petr Jelinek wrote:

0001-Reserve-global-xmin-for-create-slot-snasphot-export.patch +
0002-Don-t-use-on-disk-snapshots-for-snapshot-export-in-l.patch +
0003-Prevent-snapshot-builder-xmin-from-going-backwards.patch +
0004-Fix-xl_running_xacts-usage-in-snapshot-builder.patch +
0005-Skip-unnecessary-snapshot-builds.patch +
0001-Logical-replication-support-for-initial-data-copy-v6.patch

I use three different machines (2 desktop, 1 server) to test logical replication, and all three have now at least once failed to correctly synchronise a pgbench session (amidst many successful runs, of course).

I attach an output file from the test program, with the 2 logfiles (master+replica) of the failed run. The output file (out_20170307_1613.txt) contains the output of 5 runs of pgbench_derail2.sh. The first run failed, the next 4 were ok.

But that's probably not very useful; perhaps pg_waldump is more useful? From what moment, or leading up to what moment, or over what period, would pg_waldump output be useful? I can run it from the script, repeatedly, and only keep the dumped files when things go awry. Would that make sense? Any other ideas welcome.

thanks,
Erik Rijkers

20170307_1613.tar.bz2
Description: BZip2 compressed data
Re: [HACKERS] Logical replication existing data copy
On 2017-03-06 16:10, Erik Rijkers wrote: On 2017-03-06 11:27, Petr Jelinek wrote: Hi, updated and rebased version of the patch attached. I compiled with /only/ this one latest patch: 0001-Logical-replication-support-for-initial-data-copy-v6.patch Is that correct, or are other patches still needed on top, or underneath?

TWIMC, I'll answer my own question: the correct patchset seems to be these six:

0001-Reserve-global-xmin-for-create-slot-snasphot-export.patch
0002-Don-t-use-on-disk-snapshots-for-snapshot-export-in-l.patch
0003-Prevent-snapshot-builder-xmin-from-going-backwards.patch
0004-Fix-xl_running_xacts-usage-in-snapshot-builder.patch
0005-Skip-unnecessary-snapshot-builds.patch
0001-Logical-replication-support-for-initial-data-copy-v6.patch

These compile, make check, and install fine. make check-world is also without errors. Logical replication tests are now running again (no errors yet); they'll have to run for a few hours with varying parameters to gain some confidence, but it's looking good for the moment.

Erik Rijkers
Re: [HACKERS] Logical replication existing data copy
On 2017-03-06 11:27, Petr Jelinek wrote: Hi, updated and rebased version of the patch attached.

I compiled with /only/ this one latest patch:

0001-Logical-replication-support-for-initial-data-copy-v6.patch

Is that correct, or are other patches still needed on top, or underneath?

Anyway, with that one patch, and even after alter role ... set synchronous_commit = off; the process is very slow (sufficiently slow that I haven't yet had the patience to see it through to completion). What am I doing wrong?

thanks,
Erik Rijkers
Re: [HACKERS] snapbuild woes
On 2017-03-03 01:30, Petr Jelinek wrote:

With these patches:

0001-Use-asynchronous-connect-API-in-libpqwalreceiver.patch
0002-Fix-after-trigger-execution-in-logical-replication.patch
0003-Add-RENAME-support-for-PUBLICATIONs-and-SUBSCRIPTION.patch
snapbuild-v5-0001-Reserve-global-xmin-for-create-slot-snasphot-export.patch
snapbuild-v5-0002-Don-t-use-on-disk-snapshots-for-snapshot-export-in-l.patch
snapbuild-v5-0003-Prevent-snapshot-builder-xmin-from-going-backwards.patch
snapbuild-v5-0004-Fix-xl_running_xacts-usage-in-snapshot-builder.patch
snapbuild-v5-0005-Skip-unnecessary-snapshot-builds.patch
0001-Logical-replication-support-for-initial-data-copy-v6.patch

I get:

subscriptioncmds.c:47:12: error: static declaration of ‘oid_cmp’ follows non-static declaration
 static int oid_cmp(const void *p1, const void *p2);
            ^~~
In file included from subscriptioncmds.c:42:0:
../../../src/include/utils/builtins.h:70:12: note: previous declaration of ‘oid_cmp’ was here
 extern int oid_cmp(const void *p1, const void *p2);
            ^~~
make[3]: *** [subscriptioncmds.o] Error 1
make[3]: *** Waiting for unfinished jobs
make[2]: *** [commands-recursive] Error 2
make[2]: *** Waiting for unfinished jobs
make[1]: *** [all-backend-recurse] Error 2
make: *** [all-src-recurse] Error 2
Re: [HACKERS] Logical replication existing data copy
On 2017-02-28 07:38, Erik Rijkers wrote: On 2017-02-27 15:08, Petr Jelinek wrote:

0001-Use-asynchronous-connect-API-in-libpqwalreceiver.patch +
0002-Fix-after-trigger-execution-in-logical-replication.patch +
0003-Add-RENAME-support-for-PUBLICATIONs-and-SUBSCRIPTION.patch +
snapbuild-v4-0001-Reserve-global-xmin-for-create-slot-snasphot-export.patch +
snapbuild-v4-0002-Don-t-use-on-disk-snapshots-for-snapshot-export-in-l.patch +
snapbuild-v4-0003-Prevent-snapshot-builder-xmin-from-going-backwards.patch +
snapbuild-v4-0004-Fix-xl_running_xacts-usage-in-snapshot-builder.patch +
snapbuild-v4-0005-Skip-unnecessary-snapshot-builds.patch +
0001-Logical-replication-support-for-initial-data-copy-v6.patch

This is the most frequent error that happens while doing pgbench runs over logical replication: I run it continuously all day, and every few hours an error occurs of the kind seen below: a table (pgbench_history, mostly) ends up 1 row short (673466 instead of 673467). I have the script wait a long time before calling it an error (because in theory it could still 'finish' and end successfully, although that has never happened once the system got into this state).

-- pgbench -c 16 -j 8 -T 120 -P 24 -n -- scale 25
[...]
6972 a,b,t,h: 250 25250 673467 e53236c09 643235708 f952814c3 559d618cd master
6973 a,b,t,h: 250 25250 673466 e53236c09 643235708 f952814c3 4b09337e3 replica NOK a22fb00a6
-- wait another 5 s (total 20 s) (unchanged 1)
-- getting md5 (cb)
6972 a,b,t,h: 250 25250 673467 e53236c09 643235708 f952814c3 559d618cd master
6973 a,b,t,h: 250 25250 673466 e53236c09 643235708 f952814c3 4b09337e3 replica NOK a22fb00a6
-- wait another 5 s (total 25 s) (unchanged 2)
-- getting md5 (cb)
6972 a,b,t,h: 250 25250 673467 e53236c09 643235708 f952814c3 559d618cd master
6973 a,b,t,h: 250 25250 673466 e53236c09 643235708 f952814c3 4b09337e3 replica NOK a22fb00a6
-- wait another 5 s (total 30 s) (unchanged 3)
-- getting md5 (cb)
6972 a,b,t,h: 250 25250 673467 e53236c09 643235708 f952814c3 559d618cd master
6973 a,b,t,h: 250 25250 673466 e53236c09 643235708 f952814c3 4b09337e3 replica NOK a22fb00a6
-- wait another 5 s (total 35 s) (unchanged 4)

I gathered some info in this (probably deadlocked) state in the hope there is something suspicious in there:

UID PID PPID C STIME TTY STAT TIME CMD
rijkers 71203 1 0 20:06 pts/57 S 0:00 postgres -D /var/data1/pg_stuff/pg_installations/pgsql.logical_replication2/data -p 6973 -c wal_level=replica [...]
rijkers 71214 71203 0 20:06 ? Ss 0:00 \_ postgres: logger process
rijkers 71216 71203 0 20:06 ? Ss 0:00 \_ postgres: checkpointer process
rijkers 71217 71203 0 20:06 ? Ss 0:00 \_ postgres: writer process
rijkers 71218 71203 0 20:06 ? Ss 0:00 \_ postgres: wal writer process
rijkers 71219 71203 0 20:06 ? Ss 0:00 \_ postgres: autovacuum launcher process
rijkers 71220 71203 0 20:06 ? Ss 0:00 \_ postgres: stats collector process
rijkers 71221 71203 0 20:06 ? Ss 0:00 \_ postgres: bgworker: logical replication launcher
rijkers 71222 71203 0 20:06 ? Ss 0:00 \_ postgres: bgworker: logical replication worker 30042
rijkers 71201 1 0 20:06 pts/57 S 0:00 postgres -D /var/data1/pg_stuff/pg_installations/pgsql.logical_replication/data -p 6972 -c wal_level=logical [...]
rijkers 71206 71201 0 20:06 ? Ss 0:00 \_ postgres: logger process
rijkers 71208 71201 0 20:06 ? Ss 0:00 \_ postgres: checkpointer process
rijkers 71209 71201 0 20:06 ? Ss 0:00 \_ postgres: writer process
rijkers 71210 71201 0 20:06 ? Ss 0:00 \_ postgres: wal writer process
rijkers 71211 71201 0 20:06 ? Ss 0:00 \_ postgres: autovacuum launcher process
rijkers 71212 71201 0 20:06 ? Ss 0:00 \_ postgres: stats collector process
rijkers 71213 71201 0 20:06 ? Ss 0:00 \_ postgres: bgworker: logical replication launcher
rijkers 71223 71201 0 20:06 ? Ss 0:00 \_ postgres: wal sender process rijkers [local] idle

-- replica:
 port | shared_buffers | work_mem | m_w_m | e_c_s
------+----------------+----------+-------+-------
 6973 | 100MB          | 50MB     | 2GB   | 64GB
(1 row)

select current_setting('port') as port
     , datname as db
     , to_char(pg_database_size(datname), '9G999G999G999G999') || ' (' || pg_size_pretty(pg_database_size(datname)) || ')' as dbsize
     , pid
     , application_name as app
     , xact_start
     , query_start
     , regexp_replace( cast(now() - query_start as text), E'\.[[:digit
Re: [HACKERS] Logical replication existing data copy
On 2017-02-27 15:08, Petr Jelinek wrote: The performance was why in original patch I wanted the apply process to default to synchronous_commit = off as without it the apply performance (due to applying transactions individually and in sequences) is quite lackluster. It can be worked around using user that has synchronous_commit = off set via ALTER ROLE as owner of the subscription.

Wow, that's a huge difference in speed. I set

ALTER ROLE aardvark SET synchronous_commit = off;

during the first iteration of a 10x pgbench test (so the first was still done with it 'on'). Here are the pertinent grep | uniq -c lines:

-- out_20170228_0004.txt
     10 -- pgbench -c 16 -j 8 -T 900 -P 180 -n -- scale 25
     10 -- All is well.
      1 -- 1325 seconds total.
      9 -- 5 seconds total.

And that 5 seconds is a hardcoded wait, so it's probably even quicker. This is a slowish machine, but that's a really spectacular difference. It's the difference between keeping up or getting lost.

Would you remind me why synchronous_commit = on was deemed a better default? This thread isn't very clear about it (nor is the 'logical replication WIP' thread).

thanks,
Erik Rijkers
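The per-run summaries in this thread come from counting identical marker lines in a run log. A sketch of that counting step, run here against a small inlined sample (the sample file name and contents are made up for illustration):

```shell
# Count identical result-marker lines in a test-run log, as in the
# "grep | uniq -c" summaries shown above. The sample log is fabricated.
cat > /tmp/out_sample.txt <<'EOF'
-- pgbench -c 16 -j 8 -T 900 -P 180 -n -- scale 25
-- All is well.
-- 1325 seconds total.
-- pgbench -c 16 -j 8 -T 900 -P 180 -n -- scale 25
-- All is well.
-- 5 seconds total.
EOF
grep -e '^-- ' /tmp/out_sample.txt | sort | uniq -c
```

The `sort` before `uniq -c` is what makes repeated markers from different iterations collapse into one counted line.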
Re: [HACKERS] Logical replication existing data copy
With these patches:

-- 0416d87c-09a5-182e-4901-236aec103...@2ndquadrant.com Subject: Re: Logical Replication WIP
48. https://www.postgresql.org/message-id/attachment/49886/0001-Use-asynchronous-connect-API-in-libpqwalreceiver.patch
49. https://www.postgresql.org/message-id/attachment/49887/0002-Fix-after-trigger-execution-in-logical-replication.patch
50. https://www.postgresql.org/message-id/attachment/49888/0003-Add-RENAME-support-for-PUBLICATIONs-and-SUBSCRIPTION.patch

-- 51f65289-54f8-2256-d107-937d662d6...@2ndquadrant.com Subject: Re: snapbuild woes
48. https://www.postgresql.org/message-id/attachment/49995/snapbuild-v4-0001-Reserve-global-xmin-for-create-slot-snasphot-export.patch
49. https://www.postgresql.org/message-id/attachment/49996/snapbuild-v4-0002-Don-t-use-on-disk-snapshots-for-snapshot-export-in-l.patch
50. https://www.postgresql.org/message-id/attachment/49997/snapbuild-v4-0003-Prevent-snapshot-builder-xmin-from-going-backwards.patch
51. https://www.postgresql.org/message-id/attachment/49998/snapbuild-v4-0004-Fix-xl_running_xacts-usage-in-snapshot-builder.patch
52. https://www.postgresql.org/message-id/attachment/4/snapbuild-v4-0005-Skip-unnecessary-snapshot-builds.patch

-- c0f90176-efff-0770-1e79-0249fb4b9...@2ndquadrant.com Subject: Re: Logical replication existing data copy
48. https://www.postgresql.org/message-id/attachment/49977/0001-Logical-replication-support-for-initial-data-copy-v6.patch

logical replication now seems pretty stable, at least for the limited testcase that I am using. I've done dozens of pgbench_derail2.sh runs without failure.

I am now changing the pgbench test to larger scale (pgbench -is) and longer periods (-T), which makes running the test slow (both instances are running on a modest desktop with a single 7200 disk). It is quite a bit slower than I expected (a 5-minute pgbench scale 5, with 8 clients, takes, after it has finished on master, another 2-3 minutes to get synced on the replica). I suppose it's just a hardware limitation. I set max_sync_workers_per_subscription to 6 (from default 2) but it doesn't help much (at all).

To be continued...

Thanks,
Erik Rijkers
Re: [HACKERS] Logical replication existing data copy
On 2017-02-26 10:53, Erik Rijkers wrote: Not yet perfect, but we're getting there...

Sorry, I made a mistake: I was running the newest patches on master but the older versions on the replica (or more precisely: I didn't properly shut down the replica, so the older version remained up and running during subsequent testing).

So my last email mentioning the 'DROP SUBSCRIPTION' hang error is hopefully wrong. I'll get back when I've repeated these tests. This will take some hours (at least).

Sorry to cause you these palpitations, perhaps unnecessarily...

Erik Rijkers
Re: [HACKERS] Logical replication existing data copy
On 2017-02-26 01:45, Petr Jelinek wrote: Again, much better...:

-- out_20170226_0724.txt
     25 -- pgbench -c 1 -j 8 -T 10 -P 5 -n
     25 -- All is well.
-- out_20170226_0751.txt
     25 -- pgbench -c 4 -j 8 -T 10 -P 5 -n
     25 -- All is well.
-- out_20170226_0819.txt
     25 -- pgbench -c 8 -j 8 -T 10 -P 5 -n
     25 -- All is well.
-- out_20170226_0844.txt
     25 -- pgbench -c 16 -j 8 -T 10 -P 5 -n
     25 -- All is well.
-- out_20170226_0912.txt
     25 -- pgbench -c 32 -j 8 -T 10 -P 5 -n
     25 -- All is well.
-- out_20170226_0944.txt
     25 -- scale 5 clients 1 INIT_WAIT 0 CLEAN_ONLY=
     25 -- pgbench -c 1 -j 8 -T 10 -P 5 -n
     25 -- All is well.

but not perfect: with the next scale up (pgbench scale 25) I got:

-- out_20170226_1001.txt
      3 -- scale 25 clients 1 INIT_WAIT 0 CLEAN_ONLY=
      3 -- pgbench -c 1 -j 8 -T 10 -P 5 -n
      2 -- All is well.
      1 -- Not good, but breaking out of wait (waited more than 60s)

It looks like something got stuck at DROP SUBSCRIPTION again, which, I think, derives from this line:

echo "drop subscription if exists sub1" | psql -qXp $port2

I don't know exactly what is useful/useless to report; below is the state of some tables/views (note that this is from 31 minutes after the fact (see 'duration' in the first query)), and a backtrace:

$ ./view.sh
select current_setting('port') as port;
 port
------
 6973
(1 row)

select rpad(now()::text,19) as now
     , pid as pid
     , application_name as app
     , state as state
     , wait_event as wt_evt
     , wait_event_type as wt_evt_type
     , date_trunc('second', query_start::timestamp) as query_start
     , substring((now() - query_start)::text, 1, position('.'
in (now() - query_start)::text)-1) as duration , query from pg_stat_activity where query !~ 'pg_stat_activity' ; now | pid | app | state | wt_evt | wt_evt_type | query_start | duration | query -+---+-+++-+-+--+-- 2017-02-26 10:42:43 | 28232 | logical replication worker 31929 | active | relation | Lock| | | 2017-02-26 10:42:43 | 28237 | logical replication worker 31929 sync 31906 || LogicalSyncStateChange | IPC | | | 2017-02-26 10:42:43 | 28242 | logical replication worker 31929 sync 31909 || transactionid | Lock| | | 2017-02-26 10:42:43 | 32023 | psql | active | BgWorkerShutdown | IPC | 2017-02-26 10:10:52 | 00:31:51 | drop subscription if exists sub1 (4 rows) select * from pg_stat_replication; pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | backend_xmin | state | sent_location | write_location | flush_location | replay_location | sync_priority | sync_state -+--+-+--+-+-+-+---+--+---+---+++-+---+ (0 rows) select * from pg_stat_subscription; subid | subname | pid | relid | received_lsn | last_msg_send_time | last_msg_receipt_time | latest_end_lsn |latest_end_time ---+-+---+---+--+---+---++--- 31929 | sub1| 28242 | 31909 | | 2017-02-26 10:07:05.723093+01 | 2017-02-26 10:07:05.723093+01 || 2017-02-26 10:07:05.723093+01 31929 | sub1| 28237 | 31906 | | 2017-02-26 10:07:04.721229+01 | 2017-02-26 10:07:04.721229+01 || 2017-02-26 10:07:04.721229+01 31929 | sub1| 28232 | | 1/73497468 | | 2017-02-26 10:07:47.781883+01 | 1/59A73EF8 | 2017-02-26 10:07:04.720595+01 (3 rows) select * from pg_subscription; subdbid | subname | subowner | subenabled | subconninfo | subslotname | subpublications -+-+--++-+-+- 16384 | sub1| 10 | t | port=6972 | sub1| {pub1} (1 row) select * from pg_subscription_rel; srsubid | srrelid | srsubstate | srsublsn -+-++ 31929 | 31912 | i | 31929 | 31917 | i | 31929 | 31909 | d | 31929 | 31906 | w | 1/73498F90 (4 rows) Dunno if a backtrace is is useful $ gdb -pid 32023 (from the DROP SUBSCRIPTION
Re: [HACKERS] Logical replication existing data copy
On 2017-02-25 00:40, Petr Jelinek wrote: 0001-Use-asynchronous-connect-API-in-libpqwalreceiver.patch 0002-Fix-after-trigger-execution-in-logical-replication.patch 0003-Add-RENAME-support-for-PUBLICATIONs-and-SUBSCRIPTION.patch snapbuild-v3-0001-Reserve-global-xmin-for-create-slot-snasphot-export.patch snapbuild-v3-0002-Don-t-use-on-disk-snapshots-for-snapshot-export-in-l.patch snapbuild-v3-0003-Fix-xl_running_xacts-usage-in-snapshot-builder.patch snapbuild-v3-0004-Skip-unnecessary-snapshot-builds.patch 0001-Logical-replication-support-for-initial-data-copy-v6.patch Here are some results. There is improvement although it's not an unqualified success. Several repeat-runs of pgbench_derail2.sh, with different parameters for number-of-client yielded an output file each. Those show that logrep is now pretty stable when there is only 1 client (pgbench -c 1). But it starts making mistakes with 4, 8, 16 clients. I'll just show a grep of the output files; I think it is self-explicatory: Output-files (lines counted with grep | sort | uniq -c): -- out_20170225_0129.txt 250 -- pgbench -c 1 -j 8 -T 10 -P 5 -n 250 -- All is well. -- out_20170225_0654.txt 25 -- pgbench -c 4 -j 8 -T 10 -P 5 -n 24 -- All is well. 1 -- Not good, but breaking out of wait (waited more than 60s) -- out_20170225_0711.txt 25 -- pgbench -c 8 -j 8 -T 10 -P 5 -n 23 -- All is well. 2 -- Not good, but breaking out of wait (waited more than 60s) -- out_20170225_0803.txt 25 -- pgbench -c 16 -j 8 -T 10 -P 5 -n 11 -- All is well. 14 -- Not good, but breaking out of wait (waited more than 60s) So, that says: 1 clients: 250x success, zero fail (250 not a typo, ran this overnight) 4 clients: 24x success, 1 fail 8 clients: 23x success, 2 fail 16 clients: 11x success, 14 fail I want to repeat what I said a few emails back: problems seem to disappear when a short wait state is introduced (directly after the 'alter subscription sub1 enable' line) to give the logrep machinery time to 'settle'. 
It makes one think of a timing error somewhere (now don't ask me where...). To show that, here is pgbench_derail2.sh output where such a 'settle' period of 10 seconds (INIT_WAIT in the script) makes the run work faultlessly, even with 16 clients:

-- out_20170225_0852.txt
  25 -- pgbench -c 16 -j 8 -T 10 -P 5 -n
  25 -- All is well.

QED.

(By the way, no hung sessions so far, so that's good.)

thanks,
Erik Rijkers

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
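The per-file tallies quoted above come from a plain grep | sort | uniq -c pipeline over each out_*.txt. A minimal self-contained sketch of that counting step (the sample file and its contents below are invented stand-ins, not the real test output):

```shell
# Illustrative only: tally outcome lines per run the way the report does.
# /tmp/out_sample.txt stands in for one of the real out_*.txt files.
cat > /tmp/out_sample.txt <<'EOF'
-- All is well.
-- All is well.
-- Not good, but breaking out of wait (waited more than 60s)
EOF
# '--' ends option parsing so the leading-dash pattern is safe for grep.
grep -- '-- ' /tmp/out_sample.txt | sort | uniq -c
```

With the sample above this prints a count of 2 for "All is well." and 1 for the failure line, matching the shape of the tallies in the report.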
Re: [HACKERS] Logical replication existing data copy
On 2017-02-25 00:08, Petr Jelinek wrote:
There are now a lot of fixes for existing code that this patch depends on. Hopefully some of the fixes get committed soonish.

Indeed - could you look over the list of 8 patches below; is it correct and in the right (apply) order?

0001-Use-asynchronous-connect-API-in-libpqwalreceiver.patch
0002-Fix-after-trigger-execution-in-logical-replication.patch
0003-Add-RENAME-support-for-PUBLICATIONs-and-SUBSCRIPTION.patch
snapbuild-v3-0001-Reserve-global-xmin-for-create-slot-snasphot-export.patch
snapbuild-v3-0002-Don-t-use-on-disk-snapshots-for-snapshot-export-in-l.patch
snapbuild-v3-0003-Fix-xl_running_xacts-usage-in-snapshot-builder.patch
snapbuild-v3-0004-Skip-unnecessary-snapshot-builds.patch
0001-Logical-replication-support-for-initial-data-copy-v6.patch

(they do apply & compile like this...)
Re: [HACKERS] Logical replication existing data copy
On 2017-02-24 22:58, Petr Jelinek wrote:
On 23/02/17 01:41, Petr Jelinek wrote:
On 23/02/17 01:02, Erik Rijkers wrote:
On 2017-02-22 18:13, Erik Rijkers wrote:
On 2017-02-22 14:48, Erik Rijkers wrote:
On 2017-02-22 13:03, Petr Jelinek wrote:
0001-Skip-unnecessary-snapshot-builds.patch
0002-Don-t-use-on-disk-snapshots-for-snapshot-export-in-l.patch
0003-Fix-xl_running_xacts-usage-in-snapshot-builder.patch
0001-Use-asynchronous-connect-API-in-libpqwalreceiver.patch
0002-Fix-after-trigger-execution-in-logical-replication.patch
0003-Add-RENAME-support-for-PUBLICATIONs-and-SUBSCRIPTION.patch
0001-Logical-replication-support-for-initial-data-copy-v5.patch

It works well now, or at least my particular test case seems now solved.

Cried victory too early, I'm afraid. I got into a 'hung' state while repeating pgbench_derail2.sh. Below is some state. I notice that master pg_stat_replication.state is 'startup'. Maybe I should only start the test after that state has changed. Any of the other possible values (catchup, streaming) would be OK, I would think.

I think that's a known issue (see the comment in tablesync.c about hanging forever). I think I may have fixed it locally. I will submit a patch once I have fixed the other snapshot issue (I managed to reproduce it as well, although very rarely, so it's rather hard to test).

Hi, here it is. But check also the snapbuild-related thread for updated patches related to that (the issue you had with this not copying all rows is yet another pre-existing Postgres bug).

The four earlier snapbuild patches apply cleanly, but then I get errors while applying 0001-Logical-replication-support-for-initial-data-copy-v6.patch:

patching file src/test/regress/expected/sanity_check.out
(Stripping trailing CRs from patch.)
patching file src/test/regress/expected/subscription.out
Hunk #2 FAILED at 25.
1 out of 2 hunks FAILED -- saving rejects to file src/test/regress/expected/subscription.out.rej
(Stripping trailing CRs from patch.)
patching file src/test/regress/sql/object_address.sql
(Stripping trailing CRs from patch.)
patching file src/test/regress/sql/subscription.sql
(Stripping trailing CRs from patch.)
patching file src/test/subscription/t/001_rep_changes.pl
Hunk #9 succeeded at 175 with fuzz 2.
Hunk #10 succeeded at 193 (offset -9 lines).
(Stripping trailing CRs from patch.)
patching file src/test/subscription/t/002_types.pl
(Stripping trailing CRs from patch.)
can't find file to patch at input line 4296
Perhaps you used the wrong -p or --strip option?
The text leading up to this was:
--
|diff --git a/src/test/subscription/t/003_constraints.pl b/src/test/subscription/t/003_constraints.pl
|index 17d4565..9543b91 100644
|--- a/src/test/subscription/t/003_constraints.pl
|+++ b/src/test/subscription/t/003_constraints.pl
--
File to patch:

Can you have a look?

thanks,
Erik Rijkers
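The "wrong -p or --strip option" prompt at the end of a run like the one above usually means the strip level doesn't match the git-style a/ and b/ prefixes in the diff. A self-contained illustration (directory and file names are invented for the demo):

```shell
# Illustrative only: a git-style diff (a/ and b/ path prefixes) applies
# with -p1; with -p0, patch cannot find the file and prompts as above.
mkdir -p /tmp/pdemo/src && cd /tmp/pdemo
printf 'old line\n' > src/file.txt
cat > fix.patch <<'EOF'
--- a/src/file.txt
+++ b/src/file.txt
@@ -1 +1 @@
-old line
+new line
EOF
patch -p1 < fix.patch   # strips the a/ b/ prefix, finds src/file.txt
cat src/file.txt
```

That said, a hunk that FAILED at a given line (as with subscription.out above) is a different problem: the patch simply no longer matches the current tree and needs a rebase, not a different -p level.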
Re: [HACKERS] Logical replication existing data copy
On 2017-02-22 18:13, Erik Rijkers wrote:
On 2017-02-22 14:48, Erik Rijkers wrote:
On 2017-02-22 13:03, Petr Jelinek wrote:
0001-Skip-unnecessary-snapshot-builds.patch
0002-Don-t-use-on-disk-snapshots-for-snapshot-export-in-l.patch
0003-Fix-xl_running_xacts-usage-in-snapshot-builder.patch
0001-Use-asynchronous-connect-API-in-libpqwalreceiver.patch
0002-Fix-after-trigger-execution-in-logical-replication.patch
0003-Add-RENAME-support-for-PUBLICATIONs-and-SUBSCRIPTION.patch
0001-Logical-replication-support-for-initial-data-copy-v5.patch

It works well now, or at least my particular test case seems now solved.

Cried victory too early, I'm afraid. I got into a 'hung' state while repeating pgbench_derail2.sh. Below is some state. I notice that master pg_stat_replication.state is 'startup'. Maybe I should only start the test after that state has changed. Any of the other possible values (catchup, streaming) would be OK, I would think.

$ ( dbactivity.sh ; echo "; table pg_subscription; table pg_subscription_rel;" ) | psql -qXp 6973

 now                 | pid   | app                                         | state  | wt_evt                 | wt_evt_type | query_start         | duration | query
 2017-02-23 00:37:57 | 31352 | logical replication worker 47435            | active | relation               | Lock        |                     |          |
 2017-02-23 00:37:57 | 397   | psql                                        | active | BgWorkerShutdown       | IPC         | 2017-02-23 00:22:14 | 00:15:42 | drop subscription if exists sub1
 2017-02-23 00:37:57 | 31369 | logical replication worker 47435 sync 47423 |        | LogicalSyncStateChange | IPC         |                     |          |
 2017-02-23 00:37:57 | 398   | logical replication worker 47435 sync 47418 |        | transactionid          | Lock        |                     |          |
(4 rows)

 subdbid | subname | subowner | subenabled | subconninfo | subslotname | subpublications
 16384   | sub1    | 10       | t          | port=6972   | sub1        | {pub1}
(1 row)

 srsubid | srrelid | srsubstate | srsublsn
 47435   | 47423   | w          | 2/CB078260
 47435   | 47412   | r          |
 47435   | 47415   | r          |
 47435   | 47418   | c          | 2/CB06E158
(4 rows)

Replica (port 6973):
[bulldog aardvark] [local]:6973 (Thu) 00:52:47 [pid:5401] [testdb] #
table pg_stat_subscription ;

 subid | subname | pid   | relid | received_lsn | last_msg_send_time            | last_msg_receipt_time         | latest_end_lsn | latest_end_time
 47435 | sub1    | 31369 | 47423 |              | 2017-02-23 00:20:45.758072+01 | 2017-02-23 00:20:45.758072+01 |                | 2017-02-23 00:20:45.758072+01
 47435 | sub1    | 398   | 47418 |              | 2017-02-23 00:22:14.896471+01 | 2017-02-23 00:22:14.896471+01 |                | 2017-02-23 00:22:14.896471+01
 47435 | sub1    | 31352 |       | 2/CB06E158   |                               | 2017-02-23 00:20:47.034664+01 |                | 2017-02-23 00:20:45.679245+01
(3 rows)

Master (port 6972):
[bulldog aardvark] [local]:6972 (Thu) 00:48:27 [pid:5307] [testdb] # \x on \\ table pg_stat_replication ;
Expanded display is on.
-[ RECORD 1 ]----+------------------------------
pid              | 399
usesysid         | 10
usename          | aardvark
application_name | sub1_47435_sync_47418
client_addr      |
client_hostname  |
client_port      | -1
backend_start    | 2017-02-23 00:22:14.902701+01
backend_xmin     |
state            | startup
sent_location    |
write_location   |
flush_location   |
replay_location  |
sync_priority    | 0
sync_state       | async
-[ RECORD 2 ]----+------------------------------
pid              | 31371
usesysid         | 10
usename          | aardvark
application_name | sub1_47435_sync_47423
client_addr      |
client_hostname  |
client_port      | -1
backend_start    | 2017-02-23 00:20:45.762852+01
backend_xmin     |
state            | startup
sent_location    |
write_location   |
flush_location   |
replay_location  |
sync_priority    | 0
sync_state       | async

( above 'dbactivity.sh' is:
select rpad(now()::text,19) as now
     , pid as pid
     , application_name as app
     , state as state
     , wait_event as wt_evt
     , wait_event_type as wt_evt_type
     , date_trunc('second', query_start::timestamp) as query_start
     , substring((now() - query_start)::text, 1
Re: [HACKERS] Logical replication existing data copy
On 2017-02-22 13:03, Petr Jelinek wrote: 0001-Skip-unnecessary-snapshot-builds.patch 0002-Don-t-use-on-disk-snapshots-for-snapshot-export-in-l.patch 0003-Fix-xl_running_xacts-usage-in-snapshot-builder.patch 0001-Use-asynchronous-connect-API-in-libpqwalreceiver.patch 0002-Fix-after-trigger-execution-in-logical-replication.patch 0003-Add-RENAME-support-for-PUBLICATIONs-and-SUBSCRIPTION.patch 0001-Logical-replication-support-for-initial-data-copy-v5.patch It works well now, or at least my particular test case seems now solved. thanks, Erik Rijkers -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] snapbuild woes
On 2017-02-22 03:05, Petr Jelinek wrote:
So to summarize the attached patches:
0001 - Fixes a performance issue where we build tons of snapshots that we don't need, which kills CPU.
0002 - Disables the use of ondisk historical snapshots for initial consistent snapshot export as it may result in corrupt data. This definitely needs backport.
0003 - Fixes a bug where we might never reach a consistent snapshot on a busy server due to a race condition in xl_running_xacts logging. The original use of extra locking does not seem to be enough in practice. Once we have an agreed fix for this it's probably worth backpatching. There are still some comments that need updating; this is more of a PoC.

I am not entirely sure what to expect. Should a server with these 3 patches do the initial data copy or not? The sgml seems to imply there is no initial data copy, but my test does copy something.

Anyway, I have repeated the same old pgbench test, assuming the initial data copy should be working. With

0001-Skip-unnecessary-snapshot-builds.patch
0002-Don-t-use-on-disk-snapshots-for-snapshot-export-in-l.patch
0003-Fix-xl_running_xacts-usage-in-snapshot-builder.patch

the consistent (but wrong) end state is always that only one of the four pgbench tables, pgbench_history, is replicated (always correctly). Below is the output from the test (I've edited the lines for email).

(Below, a,b,t,h stand for pgbench_accounts, pgbench_branches, pgbench_tellers, pgbench_history; master on port 6972, replica on port 6973.)

port 6972 a,b,t,h: 10 1 10 347
     6973 a,b,t,h:  0 0  0 347
a,b,t,h: a68efc81a 2c27f7ba5 128590a57 1e4070879 master
a,b,t,h: d41d8cd98 d41d8cd98 d41d8cd98 1e4070879 replica NOK

The md5-initstrings are from an md5 of the whole content of each table (an ordered select *).

I repeated this a few times: of course, the number of rows in pgbench_history varies a bit, but otherwise it is always the same: 3 empty replica tables, with pgbench_history replicated correctly. Something is not right.
thanks, Erik Rijkers -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
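The md5-of-an-ordered-select comparison used throughout these reports boils down to: dump each table in a deterministic order on both servers and compare checksums. A self-contained sketch of just the comparison step (the two files below stand in for real psql dumps; in the actual test the data would come from something like psql -qtAX -c 'select * from pgbench_accounts order by aid'):

```shell
# Hedged sketch: two tables count as identical iff the md5 sums of their
# ordered dumps match. The files are invented stand-ins for psql output.
printf '1|alice\n2|bob\n' > /tmp/master_accounts.txt
printf '1|alice\n2|bob\n' > /tmp/replica_accounts.txt
m=$(md5sum < /tmp/master_accounts.txt | cut -d' ' -f1)
r=$(md5sum < /tmp/replica_accounts.txt | cut -d' ' -f1)
# Note: d41d8cd98... seen in the reports is the md5 of empty input,
# i.e. an empty (unreplicated) table.
if [ "$m" = "$r" ]; then echo "All is well."; else echo "NOK"; fi
```

The ordering clause matters: without it, two physically identical tables can dump rows in different order and produce spurious NOKs.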
Re: [HACKERS] Logical replication existing data copy - comments snapbuild.c
On 2017-02-19 23:24, Erik Rijkers wrote: 0001-Use-asynchronous-connect-API-in-libpqwalreceiver-v2.patch 0002-Always-initialize-stringinfo-buffers-in-walsender-v2.patch 0003-Fix-after-trigger-execution-in-logical-replication-v2.patch 0004-Add-RENAME-support-for-PUBLICATIONs-and-SUBSCRIPTION-v2.patch 0001-Logical-replication-support-for-initial-data-copy-v4.patch Improve comment blocks in src/backend/replication/logical/snapbuild.c [deep sigh...] attached...--- src/backend/replication/logical/snapbuild.c.orig2 2017-02-19 17:25:57.237527107 +0100 +++ src/backend/replication/logical/snapbuild.c 2017-02-19 23:19:57.654946968 +0100 @@ -34,7 +34,7 @@ * xid. That is we keep a list of transactions between snapshot->(xmin, xmax) * that we consider committed, everything else is considered aborted/in * progress. That also allows us not to care about subtransactions before they - * have committed which means this modules, in contrast to HS, doesn't have to + * have committed which means this module, in contrast to HS, doesn't have to * care about suboverflowed subtransactions and similar. * * One complexity of doing this is that to e.g. handle mixed DDL/DML @@ -82,7 +82,7 @@ * Initially the machinery is in the START stage. When an xl_running_xacts * record is read that is sufficiently new (above the safe xmin horizon), * there's a state transition. If there were no running xacts when the - * runnign_xacts record was generated, we'll directly go into CONSISTENT + * running_xacts record was generated, we'll directly go into CONSISTENT * state, otherwise we'll switch to the FULL_SNAPSHOT state. Having a full * snapshot means that all transactions that start henceforth can be decoded * in their entirety, but transactions that started previously can't. In @@ -274,7 +274,7 @@ /* * Allocate a new snapshot builder. 
* - * xmin_horizon is the xid >=which we can be sure no catalog rows have been + * xmin_horizon is the xid >= which we can be sure no catalog rows have been * removed, start_lsn is the LSN >= we want to replay commits. */ SnapBuild * @@ -1642,7 +1642,7 @@ fsync_fname("pg_logical/snapshots", true); /* - * Now there's no way we can loose the dumped state anymore, remember this + * Now there's no way we can lose the dumped state anymore, remember this * as a serialization point. */ builder->last_serialized_snapshot = lsn; @@ -1858,7 +1858,7 @@ char path[MAXPGPATH]; /* - * We start of with a minimum of the last redo pointer. No new replication + * We start off with a minimum of the last redo pointer. No new replication * slot will start before that, so that's a safe upper bound for removal. */ redo = GetRedoRecPtr(); @@ -1916,7 +1916,7 @@ /* * It's not particularly harmful, though strange, if we can't * remove the file here. Don't prevent the checkpoint from - * completing, that'd be cure worse than the disease. + * completing, that'd be a cure worse than the disease. */ if (unlink(path) < 0) { -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Logical replication existing data copy - comments snapbuild.c
0001-Use-asynchronous-connect-API-in-libpqwalreceiver-v2.patch 0002-Always-initialize-stringinfo-buffers-in-walsender-v2.patch 0003-Fix-after-trigger-execution-in-logical-replication-v2.patch 0004-Add-RENAME-support-for-PUBLICATIONs-and-SUBSCRIPTION-v2.patch 0001-Logical-replication-support-for-initial-data-copy-v4.patch Improve comment blocks in src/backend/replication/logical/snapbuild.c -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Logical replication existing data copy - comments origin.c
On 2017-02-19 17:21, Erik Rijkers wrote: 0001-Use-asynchronous-connect-API-in-libpqwalreceiver-v2.patch 0002-Always-initialize-stringinfo-buffers-in-walsender-v2.patch 0003-Fix-after-trigger-execution-in-logical-replication-v2.patch 0004-Add-RENAME-support-for-PUBLICATIONs-and-SUBSCRIPTION-v2.patch 0001-Logical-replication-support-for-initial-data-copy-v4.patch Improve readability of comment blocks in src/backend/replication/logical/origin.c now attached thanks, Erik Rijkers --- src/backend/replication/logical/origin.c.orig 2017-02-19 16:45:28.558865304 +0100 +++ src/backend/replication/logical/origin.c 2017-02-19 17:11:09.034023021 +0100 @@ -11,31 +11,29 @@ * NOTES * * This file provides the following: - * * An infrastructure to name nodes in a replication setup - * * A facility to efficiently store and persist replication progress in an - * efficient and durable manner. - * - * Replication origin consist out of a descriptive, user defined, external - * name and a short, thus space efficient, internal 2 byte one. This split - * exists because replication origin have to be stored in WAL and shared + * * Infrastructure to name nodes in a replication setup + * * A facility to efficiently store and persist replication progress + * + * A replication origin has a descriptive, user defined, external + * name and a short, internal 2 byte one. This split + * exists because a replication origin has to be stored in WAL and shared * memory and long descriptors would be inefficient. For now only use 2 bytes * for the internal id of a replication origin as it seems unlikely that there - * soon will be more than 65k nodes in one replication setup; and using only - * two bytes allow us to be more space efficient. + * soon will be more than 65k nodes in one replication setup. * * Replication progress is tracked in a shared memory table - * (ReplicationStates) that's dumped to disk every checkpoint. Entries + * (ReplicationStates) that is dumped to disk every checkpoint. 
Entries * ('slots') in this table are identified by the internal id. That's the case * because it allows to increase replication progress during crash * recovery. To allow doing so we store the original LSN (from the originating * system) of a transaction in the commit record. That allows to recover the - * precise replayed state after crash recovery; without requiring synchronous + * precise replayed state after crash recovery without requiring synchronous * commits. Allowing logical replication to use asynchronous commit is * generally good for performance, but especially important as it allows a * single threaded replay process to keep up with a source that has multiple * backends generating changes concurrently. For efficiency and simplicity - * reasons a backend can setup one replication origin that's from then used as - * the source of changes produced by the backend, until reset again. + * reasons a backend can setup one replication origin that is used as + * the source of changes produced by the backend, until it is reset again. * * This infrastructure is intended to be used in cooperation with logical * decoding. When replaying from a remote system the configured origin is @@ -45,11 +43,11 @@ * There are several levels of locking at work: * * * To create and drop replication origins an exclusive lock on - * pg_replication_slot is required for the duration. That allows us to - * safely and conflict free assign new origins using a dirty snapshot. + * pg_replication_slot is required. That allows us to + * safely and conflict-free assign new origins using a dirty snapshot. * - * * When creating an in-memory replication progress slot the ReplicationOirgin - * LWLock has to be held exclusively; when iterating over the replication + * * When creating an in-memory replication progress slot the ReplicationOrigin + * LWLock has to be held exclusively. 
When iterating over the replication * progress a shared lock has to be held, the same when advancing the * replication progress of an individual backend that has not setup as the * session's replication origin. @@ -57,7 +55,7 @@ * * When manipulating or looking at the remote_lsn and local_lsn fields of a * replication progress slot that slot's lwlock has to be held. That's * primarily because we do not assume 8 byte writes (the LSN) is atomic on - * all our platforms, but it also simplifies memory ordering concerns + * all our platforms, but it also simplifies memory ordering * between the remote and local lsn. We use a lwlock instead of a spinlock * so it's less harmful to hold the lock over a WAL write * (c.f. AdvanceReplicationProgress). @@ -305,7 +303,7 @@ } } - /* now release lock again, */ + /* now release lock again. */ heap_close(rel, ExclusiveLock); if (tuple == NULL) @@ -382,7 +380,7 @@ CommandCounterIncrement(); - /* now release lock again, */ + /* now release lock again. */ heap_close
Re: [HACKERS] Logical replication existing data copy - comments origin.c
0001-Use-asynchronous-connect-API-in-libpqwalreceiver-v2.patch 0002-Always-initialize-stringinfo-buffers-in-walsender-v2.patch 0003-Fix-after-trigger-execution-in-logical-replication-v2.patch 0004-Add-RENAME-support-for-PUBLICATIONs-and-SUBSCRIPTION-v2.patch 0001-Logical-replication-support-for-initial-data-copy-v4.patch Improve readability of comment blocks in src/backend/replication/logical/origin.c thanks, Erik Rijkers -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Logical replication existing data copy
On 2017-02-11 11:16, Erik Rijkers wrote: On 2017-02-08 23:25, Petr Jelinek wrote: 0001-Use-asynchronous-connect-API-in-libpqwalreceiver-v2.patch 0002-Always-initialize-stringinfo-buffers-in-walsender-v2.patch 0003-Fix-after-trigger-execution-in-logical-replication-v2.patch 0004-Add-RENAME-support-for-PUBLICATIONs-and-SUBSCRIPTION-v2.patch 0001-Logical-replication-support-for-initial-data-copy-v4.patch Let me add the script ('instances.sh') that I use to startup the two logical replication instances for testing. Together with the earlier posted 'pgbench_derail2.sh' it makes out the fails test. pg_config of the master is: $ pg_config BINDIR = /home/aardvark/pg_stuff/pg_installations/pgsql.logical_replication/bin DOCDIR = /home/aardvark/pg_stuff/pg_installations/pgsql.logical_replication/share/doc HTMLDIR = /home/aardvark/pg_stuff/pg_installations/pgsql.logical_replication/share/doc INCLUDEDIR = /home/aardvark/pg_stuff/pg_installations/pgsql.logical_replication/include PKGINCLUDEDIR = /home/aardvark/pg_stuff/pg_installations/pgsql.logical_replication/include INCLUDEDIR-SERVER = /home/aardvark/pg_stuff/pg_installations/pgsql.logical_replication/include/server LIBDIR = /home/aardvark/pg_stuff/pg_installations/pgsql.logical_replication/lib PKGLIBDIR = /home/aardvark/pg_stuff/pg_installations/pgsql.logical_replication/lib LOCALEDIR = /home/aardvark/pg_stuff/pg_installations/pgsql.logical_replication/share/locale MANDIR = /home/aardvark/pg_stuff/pg_installations/pgsql.logical_replication/share/man SHAREDIR = /home/aardvark/pg_stuff/pg_installations/pgsql.logical_replication/share SYSCONFDIR = /home/aardvark/pg_stuff/pg_installations/pgsql.logical_replication/etc PGXS = /home/aardvark/pg_stuff/pg_installations/pgsql.logical_replication/lib/pgxs/src/makefiles/pgxs.mk CONFIGURE = '--prefix=/home/aardvark/pg_stuff/pg_installations/pgsql.logical_replication' '--bindir=/home/aardvark/pg_stuff/pg_installations/pgsql.logical_replication/bin' 
'--libdir=/home/aardvark/pg_stuff/pg_installations/pgsql.logical_replication/lib' '--with-pgport=6972' '--enable-depend' '--enable-cassert' '--enable-debug' '--with-openssl' '--with-perl' '--with-libxml' '--with-libxslt' '--with-zlib' '--enable-tap-tests' '--with-extra-version=_logical_replication_20170218_1221_e3a58c8835a2' CC = gcc CPPFLAGS = -DFRONTEND -D_GNU_SOURCE -I/usr/include/libxml2 CFLAGS = -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -g -O2 CFLAGS_SL = -fpic LDFLAGS = -L../../src/common -Wl,--as-needed -Wl,-rpath,'/home/aardvark/pg_stuff/pg_installations/pgsql.logical_replication/lib',--enable-new-dtags LDFLAGS_EX = LDFLAGS_SL = LIBS = -lpgcommon -lpgport -lxslt -lxml2 -lssl -lcrypto -lz -lreadline -lrt -lcrypt -ldl -lm VERSION = PostgreSQL 10devel_logical_replication_20170218_1221_e3a58c8835a2 I hope it helps someone to reproduce the errors I get. 
(If you don't, I'd like to hear that too)

thanks,
Erik Rijkers

#!/bin/sh
port1=6972
port2=6973
project1=logical_replication
project2=logical_replication2
pg_stuff_dir=$HOME/pg_stuff
PATH1=$pg_stuff_dir/pg_installations/pgsql.$project1/bin:$PATH
PATH2=$pg_stuff_dir/pg_installations/pgsql.$project2/bin:$PATH
server_dir1=$pg_stuff_dir/pg_installations/pgsql.$project1
server_dir2=$pg_stuff_dir/pg_installations/pgsql.$project2
data_dir1=$server_dir1/data
data_dir2=$server_dir2/data
options1=" -c wal_level=logical -c max_replication_slots=10 -c max_worker_processes=12 -c max_logical_replication_workers=10 -c max_wal_senders=14 -c logging_collector=on -c log_directory=$server_dir1 -c log_filename=logfile.${project1} "
options2=" -c wal_level=replica -c max_replication_slots=10 -c max_worker_processes=12 -c max_logical_replication_workers=10 -c max_wal_senders=14 -c logging_collector=on -c log_directory=$server_dir2 -c log_filename=logfile.${project2} "
export PATH=$PATH1; which postgres; postgres -D $data_dir1 -p $port1 ${options1} &
export PATH=$PATH2; which postgres; postgres -D $data_dir2 -p $port2 ${options2} &
Re: [HACKERS] Logical replication existing data copy
Maybe add this to the 10 Open Items list? https://wiki.postgresql.org/wiki/PostgreSQL_10_Open_Items It may garner a bit more attention. Ah sorry, it's there already. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Logical replication existing data copy
On 2017-02-16 00:43, Petr Jelinek wrote:
On 13/02/17 14:51, Erik Rijkers wrote:
On 2017-02-11 11:16, Erik Rijkers wrote:
On 2017-02-08 23:25, Petr Jelinek wrote:
0001-Use-asynchronous-connect-API-in-libpqwalreceiver-v2.patch
0002-Always-initialize-stringinfo-buffers-in-walsender-v2.patch
0003-Fix-after-trigger-execution-in-logical-replication-v2.patch
0004-Add-RENAME-support-for-PUBLICATIONs-and-SUBSCRIPTION-v2.patch
0001-Logical-replication-support-for-initial-data-copy-v4.patch

This often works but it also fails far too often (in my hands).

That being said, I am so far having problems reproducing this on my test machine(s), so no idea what causes it yet.

A few extra bits:
- I have repeated this now on three different machines (debian 7, 8, centos6; one a pretty big server); there is always failure within a few tries of that test program (i.e. pgbench_derail2.sh, with the above 5 patches).
- I have also tried to go back to an older version of logrep: running with 2 instances with only the first four patches (i.e., leaving out the support-for-existing-data patch). With only those 4, the logical replication is solid (a quick 25x repetition of a (very similar) test program is 100% successful). So the problem is likely somehow in that last, 5th patch.
- A 25x repetition of a test on a master + replica 5-patch server yields 13 ok, 12 NOK.
- Is the 'make check' FAILED test 'object_address' unrelated? (Can you at least reproduce that failed test?)

Maybe add this to the 10 Open Items list? https://wiki.postgresql.org/wiki/PostgreSQL_10_Open_Items It may garner a bit more attention.

thanks,
Erik Rijkers
Re: [HACKERS] Logical replication existing data copy
On 2017-02-16 00:43, Petr Jelinek wrote:
On 13/02/17 14:51, Erik Rijkers wrote:
On 2017-02-11 11:16, Erik Rijkers wrote:
On 2017-02-08 23:25, Petr Jelinek wrote:
0001-Use-asynchronous-connect-API-in-libpqwalreceiver-v2.patch
0002-Always-initialize-stringinfo-buffers-in-walsender-v2.patch
0003-Fix-after-trigger-execution-in-logical-replication-v2.patch
0004-Add-RENAME-support-for-PUBLICATIONs-and-SUBSCRIPTION-v2.patch
0001-Logical-replication-support-for-initial-data-copy-v4.patch

This often works but it also fails far too often (in my hands).

Could you periodically dump the contents of pg_subscription_rel on the subscriber (ideally when dumping the md5 of the data) and attach that as well?

I attach a bash script (and its output) that polls the 4 pgbench tables' md5s and the pg_subscription_rel table each second, while I run pgbench_derail2.sh (for that, see my earlier mail). pgbench_derail2.sh writes a 'header' into the same output stream (search for '^===').

The .out file reflects a session where I started pgbench_derail2.sh twice (it removes the publication and subscription at startup), so there are 2 headers in the attached cb_20170216_10_04_47.out. The first run ended in a successful replication (= all 4 pgbench tables md5-identical). The second run does not end correctly: it shows (one of) the typical faulty end states: pgbench_accounts, the copy, has a few less rows than the master table. Other typical end states are: same number of rows but content not identical (for some, typically < 20, rows). Mostly pgbench_accounts and pgbench_history are affected.

(I see now that I made some mistakes in generating the timestamps in the .out file, but I suppose it doesn't matter too much.)

I hope it helps; let me know if I can do any other test(s).
20170216_10_04_49_1487 6972 a,b,t,h: 10 1 10776 24be8c7be cf860f1f2 aed87334f f2bfaa587 master 20170216_10_04_50_1487 6973 a,b,t,h: 6 1 10776 74cd7528c cf860f1f2 aed87334f f2bfaa587 replica NOK now | srsubid | srrelid | srsubstate | srsublsn ---+-+-++-- 2017-02-16 10:04:50.242616+01 | 25398 | 25375 | r | 2017-02-16 10:04:50.242616+01 | 25398 | 25378 | r | 2017-02-16 10:04:50.242616+01 | 25398 | 25381 | r | 2017-02-16 10:04:50.242616+01 | 25398 | 25386 | r | (4 rows) 20170216_10_04_51_1487 6972 a,b,t,h: 10 1 10776 24be8c7be cf860f1f2 aed87334f f2bfaa587 master 20170216_10_04_51_1487 6973 a,b,t,h: 6 1 10776 74cd7528c cf860f1f2 aed87334f f2bfaa587 replica NOK now | srsubid | srrelid | srsubstate | srsublsn ---+-+-++-- 2017-02-16 10:04:51.945931+01 | 25398 | 25375 | r | 2017-02-16 10:04:51.945931+01 | 25398 | 25378 | r | 2017-02-16 10:04:51.945931+01 | 25398 | 25381 | r | 2017-02-16 10:04:51.945931+01 | 25398 | 25386 | r | (4 rows) -- 20170216 10:04:S -- scale 1 clients 1 INIT_WAIT 0 -- /home/aardvark/pg_stuff/pg_installations/pgsql.logical_replication/bin.fast/postgres 20170216_10_04_53_1487 6972 a,b,t,h: 10 1 10776 24be8c7be cf860f1f2 aed87334f f2bfaa587 master 20170216_10_04_53_1487 6973 a,b,t,h: 6 1 10776 74cd7528c cf860f1f2 aed87334f f2bfaa587 replica NOK now | srsubid | srrelid | srsubstate | srsublsn ---+-+-++-- 2017-02-16 10:04:53.635163+01 | 25398 | 25375 | r | 2017-02-16 10:04:53.635163+01 | 25398 | 25378 | r | 2017-02-16 10:04:53.635163+01 | 25398 | 25381 | r | 2017-02-16 10:04:53.635163+01 | 25398 | 25386 | r | (4 rows) 20170216_10_04_54_1487 6972 a,b,t,h: 0 0 0 0 24be8c7be d41d8cd98 d41d8cd98 d41d8cd98 master 20170216_10_04_55_1487 6973 a,b,t,h: 0 0 0 0 d41d8cd98 d41d8cd98 d41d8cd98 d41d8cd98 replica NOK now | srsubid | srrelid | srsubstate | srsublsn -+-+-++-- (0 rows) 20170216_10_04_56_1487 6972 a,b,t,h: 10 1 10 0 68d91d95b 6c4f8b9aa 92162c9b8 d41d8cd98 master 20170216_10_04_56_1487 6973 a,b,t,h: 0 0 0 0 d41d8cd98 d41d8cd98 d41d8cd98 d41d8cd98 replica 
NOK now | srsubid | srrelid | srsubstate | srsublsn -+-+-++-- (0 rows) 20170216_10_04_57_1487 6972 a,b,t,h: 10 1 10 1 68d91d95b 6c4f8b9aa 92162c9b8 d41d8cd98 master 20170216_10_04_58_1487 6973 a,b,t,h: 0 0 0 0 d41d8c
Re: [HACKERS] Logical replication existing data copy
On 2017-02-11 11:16, Erik Rijkers wrote:
On 2017-02-08 23:25, Petr Jelinek wrote:
0001-Use-asynchronous-connect-API-in-libpqwalreceiver-v2.patch
0002-Always-initialize-stringinfo-buffers-in-walsender-v2.patch
0003-Fix-after-trigger-execution-in-logical-replication-v2.patch
0004-Add-RENAME-support-for-PUBLICATIONs-and-SUBSCRIPTION-v2.patch
0001-Logical-replication-support-for-initial-data-copy-v4.patch

This often works but it also fails far too often (in my hands). I test whether the tables are identical by comparing an md5 from an ordered resultset, from both replica and master. I estimate that 1 in 5 tries fail; 'fail' being a somewhat different table on the replica (compared to the master), most often pgbench_accounts (typically there are 10-30 differing rows). No errors or warnings in either logfile. I'm not sure, but testing on faster machines seems to do somewhat better ('better' being fewer replication errors).

I have noticed that when I insert a few seconds of wait state after the create subscription (or actually: the 'enable'ing of the subscription), the problem does not occur. Apparently (I assume) the initial snapshot occurs when the subsequent pgbench run has already started, so that the logical replication also starts somewhere 'into' that pgbench run. Does that make sense? I don't know what to make of it.

Now that I think I understand what happens, I hesitate to call it a bug. But I'd say it's still a usability problem that the subscription is only 'valid' after some time, even if it's only a few seconds.

(The other problem I mentioned (drop subscription hangs) still happens every now and then.)
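The workaround described in this thread - a short wait after 'alter subscription sub1 enable' - amounts to polling until the walsender leaves the 'startup' state. A hedged, self-contained sketch of such an INIT_WAIT-style settle loop; query_state is a stub standing in for the real query (something like psql -qtAX -p 6972 -c "select state from pg_stat_replication limit 1"), so the sketch runs without a server:

```shell
# Hedged sketch of a "settle" wait: poll until pg_stat_replication.state
# is no longer 'startup' (catchup or streaming would both be fine).
# query_state is a stub for the real psql call; here it flips to
# 'streaming' after three polls so the sketch is self-contained.
tries=0
query_state() { if [ "$tries" -ge 3 ]; then echo streaming; else echo startup; fi; }
while [ "$(query_state)" = "startup" ]; do
  tries=$((tries + 1))
  # the real script would 'sleep 1' here, and give up after some limit
done
echo "settled after $tries polls"
```

Polling for an actual state change, rather than sleeping a fixed number of seconds, would make the settle period as short as possible while still avoiding the race the report describes.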
Re: [HACKERS] Logical replication existing data copy - sgml fixes
On 2017-02-09 02:25, Erik Rijkers wrote:
On 2017-02-08 23:25, Petr Jelinek wrote:
0001-Use-asynchronous-connect-API-in-libpqwalreceiver-v2.patch
0002-Always-initialize-stringinfo-buffers-in-walsender-v2.patch
0003-Fix-after-trigger-execution-in-logical-replication-v2.patch
0004-Add-RENAME-support-for-PUBLICATIONs-and-SUBSCRIPTION-v2.patch
0001-Logical-replication-support-for-initial-data-copy-v4.patch

fixes in create_subscription.sgml:

--- doc/src/sgml/ref/create_subscription.sgml.orig2	2017-02-11 11:58:10.788502999 +0100
+++ doc/src/sgml/ref/create_subscription.sgml	2017-02-11 12:17:50.069635493 +0100
@@ -55,7 +55,7 @@
    Additional info about subscriptions and logical replication as a whole
-   can is available at and
+   is available at and
    .
@@ -122,14 +122,14 @@
    Name of the replication slot to use. The default behavior is to use
-   subscription_name for slot name.
+   subscription_name as the slot name.

-COPY DATA
-NOCOPY DATA
+COPY DATA
+NOCOPY DATA

    Specifies if the existing data in the publication that are being
@@ -140,11 +140,11 @@
-SKIP CONNECT
+SKIP CONNECT

-   Instructs the CREATE SUBSCRIPTION to skip initial
-   connection to the provider. This will change default values of other
+   Instructs CREATE SUBSCRIPTION to skip initial
+   connection to the provider. This will change the default values of other
    options to DISABLED, NOCREATE SLOT and NOCOPY DATA.
@@ -181,8 +181,8 @@
    Create a subscription to a remote server that replicates tables in
-   the publications mypubclication and
-   insert_only and starts replicating immediately on
+   the publications mypublication and
+   insert_only and start replicating immediately on
    commit:
 CREATE SUBSCRIPTION mysub
@@ -193,7 +193,7 @@
    Create a subscription to a remote server that replicates tables in
-   the insert_only publication and does not start replicating
+   the insert_only publication and do not start replicating
    until enabled at a later time.
CREATE SUBSCRIPTION mysub
Re: [HACKERS] Logical replication existing data copy
On 2017-02-08 23:25, Petr Jelinek wrote:
0001-Use-asynchronous-connect-API-in-libpqwalreceiver-v2.patch
0002-Always-initialize-stringinfo-buffers-in-walsender-v2.patch
0003-Fix-after-trigger-execution-in-logical-replication-v2.patch
0004-Add-RENAME-support-for-PUBLICATIONs-and-SUBSCRIPTION-v2.patch
0001-Logical-replication-support-for-initial-data-copy-v4.patch

Apart from the one failing 'make check' test (test 'object_address'), which I reported earlier, I find it is easy to 'confuse' the replication. I attach a script that intends to test the default COPY DATA.

There are two instances, initially without any replication. The script inits pgbench on the master, adds a serial column to pgbench_history, and dump-restores the 4 pgbench tables to the future replica. It then empties the 4 pgbench tables on the 'replica'. The idea is that when logrep is initiated, data will be replicated from the master, with the end result being 4 identical tables on master and replica.

This often works, but it also fails far too often (in my hands). I test whether the tables are identical by comparing an md5 of an ordered resultset from both replica and master. I estimate that 1 in 5 tries fails; 'fail' being a somewhat different table on the replica (compared to the master), most often pgbench_accounts (typically there are 10-30 differing rows). No errors or warnings in either logfile. I'm not sure, but testing on faster machines seems to do somewhat better ('better' being fewer replication errors).

Another, probably unrelated, problem occurs (but much more rarely) when executing 'DROP SUBSCRIPTION sub1' on the replica (see the beginning of the script). Sometimes that command hangs and refuses to accept shutdown of the server. I don't know how to recover from this -- I just have to kill the replica server (the master server still obeys normal shutdown) and restart the instances.

The script accepts 2 parameters, scale and clients (used in pgbench -s resp.
-c). I don't think I've managed to successfully run the script with more than 1 client yet.

Can you have a look whether this is reproducible elsewhere?

thanks,

Erik Rijkers

#!/bin/sh
# assumes both instances are running, on port 6972 and 6973
logfile1=$HOME/pg_stuff/pg_installations/pgsql.logical_replication/logfile.logical_replication
logfile2=$HOME/pg_stuff/pg_installations/pgsql.logical_replication2/logfile.logical_replication2

scale=1
if [[ ! "$1" == "" ]]
then
  scale=$1
fi
clients=1
if [[ ! "$2" == "" ]]
then
  clients=$2
fi

unset PGSERVICEFILE PGSERVICE PGPORT PGDATA PGHOST
PGDATABASE=testdb
# (this script also uses a custom pgpassfile)
## just for info:
# env | grep PG
# psql -qtAXc "select current_setting('server_version')"

port1=6972
port2=6973

function cb()
{
  # display the 4 pgbench tables' accumulated content as md5s
  # a,b,t,h stand for: pgbench_accounts, -branches, -tellers, -history
  md5_total_6972='-1'
  md5_total_6973='-2'
  for port in $port1 $port2
  do
    md5_a=$(echo "select * from pgbench_accounts order by aid" | psql -qtAXp$port | md5sum | cut -b 1-9)
    md5_b=$(echo "select * from pgbench_branches order by bid" | psql -qtAXp$port | md5sum | cut -b 1-9)
    md5_t=$(echo "select * from pgbench_tellers  order by tid" | psql -qtAXp$port | md5sum | cut -b 1-9)
    md5_h=$(echo "select * from pgbench_history  order by hid" | psql -qtAXp$port | md5sum | cut -b 1-9)
    cnt_a=$(echo "select count(*) from pgbench_accounts" | psql -qtAXp $port)
    cnt_b=$(echo "select count(*) from pgbench_branches" | psql -qtAXp $port)
    cnt_t=$(echo "select count(*) from pgbench_tellers"  | psql -qtAXp $port)
    cnt_h=$(echo "select count(*) from pgbench_history"  | psql -qtAXp $port)
    md5_total[$port]=$( echo "${md5_a} ${md5_b} ${md5_t} ${md5_h}" | md5sum )
    printf "$port a,b,t,h: %6d %6d %6d %6d" $cnt_a $cnt_b $cnt_t $cnt_h
    echo -n " $md5_a $md5_b $md5_t $md5_h"
    if   [[ $port -eq $port1 ]]; then echo "  master"
    elif [[ $port -eq $port2 ]]; then echo -n " replica"
    else                              echo "  ERROR "
    fi
  done
  if [[ "${md5_total[6972]}" == "${md5_total[6973]}" ]]
  then
    echo " ok"
  else
    echo " NOK"
  fi
}

bail=0

pub_count=$( echo "select count(*) from pg_publication" | psql -qtAXp 6972 )
if [[ $pub_count -ne 0 ]]
then
  echo "pub_count -ne 0 - deleting pub1 & bailing out"
  echo "drop publication if exists pub1" | psql -Xp 6972
  bail=1
fi

sub_count=$( echo "select count(*) from pg_subscription" | psql -qtAXp 6973 )
if [[ $sub_count -ne 0 ]]
then
  echo "sub_count -ne 0 - deleting sub1 & bailing out"
  echo "drop subscription if exist
Re: \if, \elseif, \else, \endif (was Re: [HACKERS] PSQL commands: \quit_if, \quit_unless)
On 2017-02-09 22:15, Tom Lane wrote:
Corey Huinker <corey.huin...@gmail.com> writes:

The feature now (at patch v10) lets you break off with Ctrl-C anywhere. I like it much more now.

The main thing I still dislike somewhat about the patch is the verbose output. To be honest I would prefer to just remove /all/ the interactive output. I would vote to just make it remain silent if there is no error (and if there is an error, issue a message and exit).

thanks,
Erik Rijkers
Re: [HACKERS] Logical replication existing data copy
On 2017-02-08 23:25, Petr Jelinek wrote:
0001-Use-asynchronous-connect-API-in-libpqwalreceiver-v2.patch
0002-Always-initialize-stringinfo-buffers-in-walsender-v2.patch
0003-Fix-after-trigger-execution-in-logical-replication-v2.patch
0004-Add-RENAME-support-for-PUBLICATIONs-and-SUBSCRIPTION-v2.patch
0001-Logical-replication-support-for-initial-data-copy-v4.patch

test 'object_address' fails, see attachment. That's all I found in a quick first trial.

thanks,

Erik Rijkers

*** /home/aardvark/pg_stuff/pg_sandbox/pgsql.logical_replication/src/test/regress/expected/object_address.out	2017-02-09 00:51:30.345519608 +0100
--- /home/aardvark/pg_stuff/pg_sandbox/pgsql.logical_replication/src/test/regress/results/object_address.out	2017-02-09 00:54:11.884715532 +0100
***************
*** 38,43 ****
--- 38,45 ----
  TO SQL WITH FUNCTION int4recv(internal));
  CREATE PUBLICATION addr_pub FOR TABLE addr_nsp.gentable;
  CREATE SUBSCRIPTION addr_sub CONNECTION '' PUBLICATION bar WITH (DISABLED, NOCREATE SLOT);
+ ERROR: could not connect to the publisher: FATAL: no pg_hba.conf entry for replication connection from host "[local]", user "aardvark", SSL off
+
  -- test some error cases
  SELECT pg_get_object_address('stone', '{}', '{}');
  ERROR: unrecognized object type "stone"
***************
*** 409,463 ****
  pg_identify_object_as_address(classid, objid, subobjid) ioa(typ,nms,args),
  pg_get_object_address(typ, nms, ioa.args) as addr2
  ORDER BY addr1.classid, addr1.objid, addr1.subobjid;
! type                      | schema     | name              | identity                                                             | ?column?
! --------------------------+------------+-------------------+----------------------------------------------------------------------+----------
! default acl               |            |                   | for role regress_addr_user in schema public on tables                | t
! default acl               |            |                   | for role regress_addr_user on tables                                 | t
! type                      | pg_catalog | _int4             | integer[]                                                            | t
! type                      | addr_nsp   | gencomptype       | addr_nsp.gencomptype                                                 | t
! type                      | addr_nsp   | genenum           | addr_nsp.genenum                                                     | t
! type                      | addr_nsp   | gendomain         | addr_nsp.gendomain                                                   | t
! function                  | pg_catalog |                   | pg_catalog.pg_identify_object(pg_catalog.oid,pg_catalog.oid,integer) | t
! aggregate                 | addr_nsp   |                   | addr_nsp.genaggr(integer)                                            | t
! sequence                  | addr_nsp   | gentable_a_seq    | addr_nsp.gentable_a_seq                                              | t
! table                     | addr_nsp   | gentable          | addr_nsp.gentable                                                    | t
! table column              | addr_nsp   | gentable          | addr_nsp.gentable.b                                                  | t
! index                     | addr_nsp   | gentable_pkey     | addr_nsp.gentable_pkey                                               | t
! view                      | addr_nsp   | genview           | addr_nsp.genview                                                     | t
! materialized view         | addr_nsp   | genmatview        | addr_nsp.genmatview                                                  | t
! foreign table             | addr_nsp   | genftable         | addr_nsp.genftable                                                   | t
! foreign table column      | addr_nsp   | genftable         | addr_nsp.genftable.a                                                 | t
! role                      |            | regress_addr_user | regress_addr_user                                                    | t
! server                    |            | addr_fserv        | addr_fserv                                                           | t
! user mapping              |            |                   | regress_addr_user on server integer                                  | t
! foreign-data wrapper      |            | addr_fdw          | addr_fdw                                                             | t
! access method             |            | btree             | btree                                                                | t
! operator of access method |            |                   | operator 1 (integer, integer) of pg_catalog.integer_ops USING btree  | t
! function of access method |            |                   | function 2 (integer, integer) of pg_catalog.integer_ops USI
Re: [HACKERS] Cache Hash Index meta page.
On 2017-02-07 18:41, Robert Haas wrote:
Committed with some changes (which I noted in the commit message).

This has caused a warning with gcc 6.2.0:

hashpage.c: In function ‘_hash_getcachedmetap’:
hashpage.c:1245:20: warning: ‘cache’ may be used uninitialized in this function [-Wmaybe-uninitialized]
   rel->rd_amcache = cache;
                   ^~~

which hopefully can be prevented...

thanks,
Erik Rijkers
Re: \if, \elseif, \else, \endif (was Re: [HACKERS] PSQL commands: \quit_if, \quit_unless)
On 2017-02-03 08:16, Corey Huinker wrote:
0001.if_endif.v5.diff

1. Well, with this amount of interactive output it is impossible to get stuck without knowing :) This is good. Still, it would be an improvement to be able to break out of an inactive \if-branch with Ctrl-C. (I noticed that inside an active branch it is already possible.) '\endif' is too long to type, /and/ you have to know it.

2. Inside an \if block, \q should be given precedence and cause a direct exit of psql (or at the very least exit the if block(s)), as in regular SQL statements (compare: 'select * from t \q', which will immediately exit psql -- this is good).

3. I think the 'barking' is OK because interactive use is certainly not the first use-case. But nonetheless it could be made a bit more terse without losing its function. The interactive behavior is now:

# \if 1
entered if: active, executing commands
# \elif 0
entered elif: inactive, ignoring commands
# \else
entered else: inactive, ignoring commands
# \endif
exited if: active, executing commands

It really is a bit too wordy, IMHO; I would say, drop all 'entered', 'active', and 'inactive' words. That leaves it plenty clear what's going on. That would make those lines:

if: executing commands
elif: ignoring commands
else: ignoring commands
exited if

(or alternatively, just mention 'if: active' or 'elif: inactive', etc., which has the advantage of being shorter)

5. A real bug, I think:

# \if asdasd
unrecognized value "asdasd" for "\if ": boolean expected
# \q
inside inactive branch, command ignored.
#

That 'unrecognized value' message is fair enough, but it is counterintuitive that after an erroneous opening \if-expression, the if-modus should be entered into. (and now I have to type \endif again...)

6. About the help screen: there should be an empty line above 'Conditionals' to visually divide it from other help items.
The indenting of the new block is incorrect: the lines that start with fprintf(output, _(" \\ are indented to the correct level; the other lines are indented 1 place too much.

The help text has a few typos (some multiple times):
  queires -> queries
  exectue -> execute
  subsequennt -> subsequent

Thanks,
Erik Rijkers
Re: [HACKERS] TRAP: FailedAssertion("!(hassrf)", File: "nodeProjectSet.c", Line: 180)
On 2017-02-02 22:44, Tom Lane wrote:
Erik Rijkers <e...@xs4all.nl> writes:
Something is broken in HEAD:
Fixed, thanks for the report!

Indeed, the complicated version of the script runs again as before.

Thank you very much,
Erik Rijkers
[HACKERS] TRAP: FailedAssertion("!(hassrf)", File: "nodeProjectSet.c", Line: 180)
Something is broken in HEAD:

drop table if exists t;
create table t(c text);
insert into t (c) values ('abc');
select
  regexp_split_to_array(regexp_split_to_table(c, chr(13) || chr(10)), '","') as a,
  regexp_split_to_table(c, chr(13) || chr(10)) as rw
from t;

TRAP: FailedAssertion("!(hassrf)", File: "nodeProjectSet.c", Line: 180)

I realise the regexp* functions aren't doing anything particularly useful anymore here; they did in the more complicated original (which I had used for years).

thanks,
Erik Rijkers