On 2017-05-09 11:50, Petr Jelinek wrote:

> I rebased the above mentioned patch to apply to the patches Andres sent,
> if you could try to add it on top of what you have and check if it still
> fails, that would be helpful.

It still fails.

With these patches

- 0001-WIP-Fix-off-by-one-around-GetLastImportantRecPtr.patch
- 00002-WIP-Possibly-more-robust-snapbuild-approach.patch
- fix-statistics-reporting-in-logical-replication-work.patch
- Skip-unnecessary-snapshot-builds.patch

built again on top of 44c528810a1 (I had to add the 'fix-statistics-rep*' patch because without it I immediately got that assertion failure again).

As always, most runs succeed (especially on this large 192-GB, 16-core server).

But attached is an output file of a number of runs of my pgbench_derail2.sh test.

Overall result:

-- out_20170509_1635.txt
      3 -- pgbench -c 64 -j 8 -T 900 -P 180 -n   --  scale 25
      2 -- All is well.
      1 -- Not good, but breaking out of wait (21 times no change)

I broke it off after iteration 4, so iteration 5 never ran, and
iteration 1 failed due to a mistake in the harness (something stupid I did) - not interesting.

iteration 2 succeeds. (eventually has 'replica ok')

iteration 3 succeeds. (eventually has 'replica ok')

iteration 4 fails.
Just after 'alter subscription sub1 enable' I caught (as usual) pg_stat_replication.state as 'catchup'. So far so good. After the 15-minute pgbench run, pg_stat_replication has only 2 'startup' lines (and no 'catchup' or 'streaming'):

 port | pg_stat_replication |  pid   |     wal     | replay_loc | diff | ?column? |  state  |   app   | sync_state
------+---------------------+--------+-------------+------------+------+----------+---------+---------+-----------
 6972 | pg_stat_replication | 108349 | 19/8FBCC248 |            |      |          | startup | derail2 | async
 6972 | pg_stat_replication | 108351 | 19/8FBCC248 |            |      |          | startup | derail2 | async

(that's from:
select $port1 as port, 'pg_stat_replication' as pg_stat_replication, pid
     , pg_current_wal_location() as wal
     , replay_location as replay_loc
     , pg_current_wal_location() - replay_location as diff
     , pg_current_wal_location() <= replay_location
     , state, application_name as app, sync_state
  from pg_stat_replication
)

This remains in this state for as long as my test program lets it (i.e., 20 x 30 s, or something like that, and then the loop is exited); in the output file it says: 'Not good, but breaking out of wait'.
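For reference, the harness's bounded wait amounts to something like the following sketch. The function name and the get_state stub are hypothetical; the real pgbench_derail2.sh does this in shell by polling pg_stat_replication via psql:

```python
import time

def wait_for_streaming(get_state, max_tries=20, delay=30.0):
    """Poll the replication state up to max_tries times,
    returning early once it reports 'streaming'."""
    for _ in range(max_tries):
        if get_state() == 'streaming':
            return 'All is well.'
        time.sleep(delay)
    return 'Not good, but breaking out of wait'

# A sender stuck in 'startup' exhausts the loop:
print(wait_for_streaming(lambda: 'startup', delay=0))
# prints: Not good, but breaking out of wait
```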

Below is the accompanying ps output (with the 2 'deranged senders', as Jeff Janes would surely call them):


UID         PID   PPID  C STIME TTY      STAT   TIME CMD
rijkers 107147      1  0 17:11 pts/35 S+ 0:00 /var/data1/pg_stuff/pg_installations/pgsql.logical_replication2/bin/postgres -D /var/data1/pg_stuff/pg_installations
rijkers 107149 107147  0 17:11 ?      Ss 0:00  \_ postgres: logger process
rijkers 107299 107147  0 17:11 ?      Ss 0:01  \_ postgres: checkpointer process
rijkers 107300 107147  0 17:11 ?      Ss 0:00  \_ postgres: writer process
rijkers 107301 107147  0 17:11 ?      Ss 0:00  \_ postgres: wal writer process
rijkers 107302 107147  0 17:11 ?      Ss 0:00  \_ postgres: autovacuum launcher process
rijkers 107303 107147  0 17:11 ?      Ss 0:00  \_ postgres: stats collector process
rijkers 107304 107147  0 17:11 ?      Ss 0:00  \_ postgres: bgworker: logical replication launcher
rijkers 108348 107147  0 17:12 ?      Ss 0:01  \_ postgres: bgworker: logical replication worker for subscription 70310 sync 70293
rijkers 108350 107147  0 17:12 ?      Ss 0:00  \_ postgres: bgworker: logical replication worker for subscription 70310 sync 70298
rijkers 107145      1  0 17:11 pts/35 S+ 0:02 /var/data1/pg_stuff/pg_installations/pgsql.logical_replication/bin/postgres -D /var/data1/pg_stuff/pg_installations
rijkers 107151 107145  0 17:11 ?      Ss 0:00  \_ postgres: logger process
rijkers 107160 107145  0 17:11 ?      Ss 0:08  \_ postgres: checkpointer process
rijkers 107161 107145  0 17:11 ?      Ss 0:07  \_ postgres: writer process
rijkers 107162 107145  0 17:11 ?      Ss 0:02  \_ postgres: wal writer process
rijkers 107163 107145  0 17:11 ?      Ss 0:00  \_ postgres: autovacuum launcher process
rijkers 107164 107145  0 17:11 ?      Ss 0:02  \_ postgres: stats collector process
rijkers 107165 107145  0 17:11 ?      Ss 0:00  \_ postgres: bgworker: logical replication launcher
rijkers 108349 107145  0 17:12 ?      Ss 0:27  \_ postgres: wal sender process rijkers [local] idle
rijkers 108351 107145  0 17:12 ?      Ss 0:26  \_ postgres: wal sender process rijkers [local] idle

I have had no time to add (or view) any CPUinfo.


Erik Rijkers



Attachment: out_20170509_1635.txt
Description: application/elc
