Re: Replication failure, slave requesting old segments

2018-08-11 Thread Stephen Frost
Greetings,

* Phil Endecott (spam_from_pgsql_li...@chezphil.org) wrote:
> Dear Experts,

Since you're asking ...

> I recently set up replication for the first time.  It seemed to be 
> working OK in my initial tests, but didn't cope when the slave was 
> down for a longer period.  This is all with the Debian stable 
> packages of PostgreSQL 9.6.  My replication setup involves a third 
> server, "backup", to and from which WAL files are copied using scp 
> (and essentially never deleted), plus streaming replication in 
> "hot standby" mode.
> 
> On the master, I have:
> 
> wal_level = replica
> archive_mode = on
> archive_command = 'ssh backup test ! -f backup/postgresql/archivedir/%f &&
>    scp %p backup:backup/postgresql/archivedir/%f'

This is really not a sufficient or particularly intelligent
archive_command.  In general, I'd strongly recommend against trying to
roll your own archiving/backup solution for PostgreSQL.  In particular,
the above is quite expensive, doesn't address the case where a file is
only partially copied to the backup server, and doesn't ensure that the
file is completely written and fsync'd on the backup server, meaning that
a failure will likely result in WAL going missing.  There are much better
solutions out there which you really should be using instead of
trying to build your own.  My preference is pgbackrest
(though I'm one of the folks involved in its development, to be clear), but
there are other options such as barman or WAL-E which I believe also
address the concerns raised above.
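
As a rough illustration of the partial-copy point only (not a substitute for
those tools), an archive_command can at least copy to a temporary name and
only rename into place once the transfer has finished, so a half-copied
segment never shows up under its final name.  This is just a sketch using the
paths from your config, wrapped here for readability, and it still does
nothing about fsync on the backup host:

# Sketch only: the rename happens only after scp has exited successfully,
# so the archive never contains a partially-copied file under its final name.
archive_command = 'ssh backup "test ! -f backup/postgresql/archivedir/%f" &&
   scp %p backup:backup/postgresql/archivedir/%f.tmp &&
   ssh backup "mv backup/postgresql/archivedir/%f.tmp backup/postgresql/archivedir/%f"'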

> On the slave I have:
> 
> standby_mode = 'on'
> primary_conninfo = 'user=postgres host=master port=5432'
> restore_command = 'scp backup:backup/postgresql/archivedir/%f %p'
> 
> hot_standby = on
> 
> 
> This all seemed to work OK until the slave was shut down for a few days.  
> (During this time the master was also down for a shorter period.) 
> When the slave came back up it successfully copied several WAL files 
> from the backup server until it reached one that hasn't been created yet:

That all sounds fine.

> 2018-08-10 22:37:38.322 UTC [615] LOG:  restored log file "0000000100000007000000CC" from archive
> 2018-08-10 22:38:02.011 UTC [615] LOG:  restored log file "0000000100000007000000CD" from archive
> 2018-08-10 22:38:29.802 UTC [615] LOG:  restored log file "0000000100000007000000CE" from archive
> 2018-08-10 22:38:55.973 UTC [615] LOG:  restored log file "0000000100000007000000CF" from archive
> 2018-08-10 22:39:24.964 UTC [615] LOG:  restored log file "0000000100000007000000D0" from archive
> scp: backup/postgresql/archivedir/0000000100000007000000D1: No such file or directory
> 
> At this point there was a temporary problem due to a firewall rule that 
> blocked the replication connection to the master:
> 
> 2018-08-10 22:41:35.865 UTC [1289] FATAL:  could not connect to the primary 
> server: SSL connection has been closed unexpectedly

Well, that's certainly not good.

> The slave then entered a loop, retrying to scp the next file from the backup 
> archive (and failing) and connecting to the master for streaming replication 
> (and also failing).  That is as expected, except for the "invalid magic 
> number":
> 
> scp: backup/postgresql/archivedir/0000000100000007000000D1: No such file or directory
> 2018-08-10 22:41:37.018 UTC [615] LOG:  invalid magic number 0000 in log segment 0000000100000007000000D1, offset 0
> 2018-08-10 22:43:46.936 UTC [1445] FATAL:  could not connect to the primary 
> server: SSL connection has been closed unexpectedly

Note that the invalid magic number error is just LOG-level, and that
PostgreSQL will certainly ask for WAL files which don't exist in the
archive yet.
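
If you want to see what actually ended up in a segment file (the copy in the
archive, or a partial one the standby wrote locally), pg_xlogdump (the 9.6
name; it's pg_waldump from v10 on) fails immediately on a bad header.  For
example, run on whichever host has the file and the server tools installed:

# Dump the first records of an archived segment; a truncated or zero-filled
# file produces a header/magic-number error straight away.
pg_xlogdump backup/postgresql/archivedir/0000000100000007000000D0 | head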

> Note that it's complaining about an invalid magic number in log segment D1, 
> which is the one that has just failed to scp.  That looked suspicious to me 
> and I remembered notes in the docs about ensuring that the archive and/or 
> restore commands return an error exit status when they fail; I checked that 
> and yes scp does exit(1) when the requested file doesn't exist:

Sure, but scp doesn't produce any output when it's successful, so are
you sure an scp wasn't performed after the "no such file or directory"
log message, one which perhaps made a partial copy?  (See above, where I
raised *exactly* that concern about these simple unix commands, even
before getting to this part.)

> $ scp backup:/egheriugherg /tmp/a
> scp: /egheriugherg: No such file or directory
> $ echo $?
> 1

scp has no idea if it made a copy of a partial file though.
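
If you do keep using plain scp for now, at a minimum have the restore_command
fetch to a temporary name and only rename on success, mirroring the
archive-side suggestion above, so a half-copied file is never handed to the
server.  A sketch, untested:

# Sketch: scp's nonzero exit on any failure then simply means "not restored",
# and only a fully-copied file is ever moved to the path PostgreSQL asked for.
restore_command = 'scp backup:backup/postgresql/archivedir/%f %p.tmp && mv %p.tmp %p'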

> Anyway, the slave continued in its retrying loop as expected, except that 
> each time it re-fetched the previous segment, D0; this seems a bit peculiar:
> 
> 2018-08-10 22:44:17.796 UTC [615] LOG:  restored log file "0000000100000007000000D0" from archive
> scp: backup/postgresql/archivedir/0000000100000007000000D1: No such file or directory

Re: Replication failure, slave requesting old segments

2018-08-11 Thread Adrian Klaver

On 08/11/2018 12:42 PM, Phil Endecott wrote:
> Hi Adrian,
> 
> Adrian Klaver wrote:
>> Looks like the master recycled the WAL's while the slave could not 
>> connect.
> 
> Yes but... why is that a problem?  The master is copying the WALs to
> the backup server using scp, where they remain forever.  The slave gets


To me it looks like that did not happen:

2018-08-11 00:05:50.364 UTC [615] LOG:  restored log file "0000000100000007000000D0" from archive
scp: backup/postgresql/archivedir/0000000100000007000000D1: No such file or directory
2018-08-11 00:05:51.325 UTC [7208] LOG:  started streaming WAL from primary at 7/D0000000 on timeline 1
2018-08-11 00:05:51.325 UTC [7208] FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 0000000100000007000000D0 has already been removed


Above, 0000000100000007000000D0 is gone/recycled on the master, and the 
archived version does not seem to be complete either, given that streaming 
replication is still being asked for it.
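
One quick way to check that (assuming the default 16MB segment size) is to
look at the archived file's size on the backup host; anything short of
16777216 bytes was cut off mid-copy:

# A complete WAL segment is exactly 16MB (16777216 bytes) with the default
# build options, so a shorter file in the archive is necessarily incomplete.
ls -l backup/postgresql/archivedir/0000000100000007000000D0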



Below, you kick the master and it coughs up the files to the archive, 
including *D0 and *D1 on up through *D4, and then streaming picks up at *D5.


2018-08-11 00:55:49.741 UTC [7954] LOG:  restored log file "0000000100000007000000D0" from archive
2018-08-11 00:56:12.304 UTC [7954] LOG:  restored log file "0000000100000007000000D1" from archive
2018-08-11 00:56:35.481 UTC [7954] LOG:  restored log file "0000000100000007000000D2" from archive
2018-08-11 00:56:57.443 UTC [7954] LOG:  restored log file "0000000100000007000000D3" from archive
2018-08-11 00:57:21.723 UTC [7954] LOG:  restored log file "0000000100000007000000D4" from archive
scp: backup/postgresql/archivedir/0000000100000007000000D5: No such file or directory
2018-08-11 00:57:22.915 UTC [7954] LOG:  unexpected pageaddr 7/C7000000 in log segment 0000000100000007000000D5, offset 0
2018-08-11 00:57:23.114 UTC [12348] LOG:  started streaming WAL from primary at 7/D5000000 on timeline 1
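
(One way of "kicking" the master so that the segment it is currently writing
gets handed to the archive_command is to force a segment switch; whether that
is what actually happened here is a guess on my part, using the host and user
from your primary_conninfo:)

# Force the primary to close out and archive its current WAL segment.
# pg_switch_xlog() is the 9.6 name; from v10 onward it is pg_switch_wal().
psql -h master -U postgres -c "SELECT pg_switch_xlog();"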



Best guess is the archiving did not work as expected during:

"(During this time the master was also down for a shorter period.)"

> them from there before it starts streaming.  So it shouldn't matter
> if the master recycles them, as the slave should be able to get everything
> using the combination of scp and then streaming.
> 
> Am I missing something about how this sort of replication is supposed to
> work?
> 
> Thanks, Phil.








--
Adrian Klaver
adrian.kla...@aklaver.com



Re: Replication failure, slave requesting old segments

2018-08-11 Thread Phil Endecott

Hi Adrian,

Adrian Klaver wrote:
> Looks like the master recycled the WAL's while the slave could not connect.


Yes but... why is that a problem?  The master is copying the WALs to
the backup server using scp, where they remain forever.  The slave gets
them from there before it starts streaming.  So it shouldn't matter
if the master recycles them, as the slave should be able to get everything
using the combination of scp and then streaming.

Am I missing something about how this sort of replication is supposed to
work?


Thanks, Phil.








Replication failure, slave requesting old segments

2018-08-11 Thread Phil Endecott
Dear Experts,

I recently set up replication for the first time.  It seemed to be 
working OK in my initial tests, but didn't cope when the slave was 
down for a longer period.  This is all with the Debian stable 
packages of PostgreSQL 9.6.  My replication setup involves a third 
server, "backup", to and from which WAL files are copied using scp 
(and essentially never deleted), plus streaming replication in 
"hot standby" mode.

On the master, I have:

wal_level = replica
archive_mode = on
archive_command = 'ssh backup test ! -f backup/postgresql/archivedir/%f &&
   scp %p backup:backup/postgresql/archivedir/%f'

On the slave I have:

standby_mode = 'on'
primary_conninfo = 'user=postgres host=master port=5432'
restore_command = 'scp backup:backup/postgresql/archivedir/%f %p'

hot_standby = on


This all seemed to work OK until the slave was shut down for a few days.  
(During this time the master was also down for a shorter period.) 
When the slave came back up it successfully copied several WAL files 
from the backup server until it reached one that hasn't been created yet:

2018-08-10 22:37:38.322 UTC [615] LOG:  restored log file "0000000100000007000000CC" from archive
2018-08-10 22:38:02.011 UTC [615] LOG:  restored log file "0000000100000007000000CD" from archive
2018-08-10 22:38:29.802 UTC [615] LOG:  restored log file "0000000100000007000000CE" from archive
2018-08-10 22:38:55.973 UTC [615] LOG:  restored log file "0000000100000007000000CF" from archive
2018-08-10 22:39:24.964 UTC [615] LOG:  restored log file "0000000100000007000000D0" from archive
scp: backup/postgresql/archivedir/0000000100000007000000D1: No such file or directory

At this point there was a temporary problem due to a firewall rule that 
blocked the replication connection to the master:

2018-08-10 22:41:35.865 UTC [1289] FATAL:  could not connect to the primary 
server: SSL connection has been closed unexpectedly

The slave then entered a loop, retrying to scp the next file from the backup 
archive (and failing) and connecting to the master for streaming replication 
(and also failing).  That is as expected, except for the "invalid magic number":

scp: backup/postgresql/archivedir/0000000100000007000000D1: No such file or directory
2018-08-10 22:41:37.018 UTC [615] LOG:  invalid magic number 0000 in log segment 0000000100000007000000D1, offset 0
2018-08-10 22:43:46.936 UTC [1445] FATAL:  could not connect to the primary 
server: SSL connection has been closed unexpectedly

Note that it's complaining about an invalid magic number in log segment D1, 
which is the one that has just failed to scp.  That looked suspicious to me 
and I remembered notes in the docs about ensuring that the archive and/or 
restore commands return an error exit status when they fail; I checked that 
and yes scp does exit(1) when the requested file doesn't exist:

$ scp backup:/egheriugherg /tmp/a
scp: /egheriugherg: No such file or directory
$ echo $?
1

Anyway, the slave continued in its retrying loop as expected, except that 
each time it re-fetched the previous segment, D0; this seems a bit peculiar:

2018-08-10 22:44:17.796 UTC [615] LOG:  restored log file "0000000100000007000000D0" from archive
scp: backup/postgresql/archivedir/0000000100000007000000D1: No such file or directory
2018-08-10 22:46:28.727 UTC [1628] FATAL:  could not connect to the primary 
server: SSL connection has been closed unexpectedly

(Note the "invalid magic number" was only reported the first time.)

Eventually I fixed the networking issue that had prevented the replication 
connection to the master, and it could start streaming:

2018-08-11 00:05:50.364 UTC [615] LOG:  restored log file "0000000100000007000000D0" from archive
scp: backup/postgresql/archivedir/0000000100000007000000D1: No such file or directory
2018-08-11 00:05:51.325 UTC [7208] LOG:  started streaming WAL from primary at 7/D0000000 on timeline 1
2018-08-11 00:05:51.325 UTC [7208] FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 0000000100000007000000D0 has already been removed

So here is the real problem which worries me: it seems to ask the master for 
segment D0, which is one that it did successfully restore from the backup 
earlier on, and the master - not unexpectedly - reports that it has already 
been removed.

I tried restarting the slave:

2018-08-11 00:12:15.536 UTC [7954] LOG:  restored log file "0000000100000007000000D0" from archive
2018-08-11 00:12:15.797 UTC [7954] LOG:  redo starts at 7/D0F956C0
2018-08-11 00:12:16.068 UTC [7954] LOG:  consistent recovery state reached at 7/D0FFF088
2018-08-11 00:12:16.068 UTC [7953] LOG:  database system is ready to accept read only connections
scp: backup/postgresql/archivedir/0000000100000007000000D1: No such file or directory
2018-08-11 00:12:17.057 UTC [7954] LOG:  invalid magic number 0000 in log segment 0000000100000007000000D1, offset 0
2018-08-11 00:12:17.242 UTC [8034] LOG:  started

Re: Who and How is responsible for released installations packages and 3rd party packs? (e.g. on https://yum.postgresql.org/9.6/redhat/rhel-7.3-x86_64/)

2018-08-11 Thread Devrim Gündüz

Hi,

On Fri, 2018-08-10 at 21:49 +0200, Christoph Berg wrote:
> plv8 will work for anyone willing to go through that. It's just not
> feasible to support it from a packager perspective.

+1 from me.

Regards,

-- 
Devrim Gündüz
EnterpriseDB: https://www.enterprisedb.com
PostgreSQL Consultant, Red Hat Certified Engineer
Twitter: @DevrimGunduz , @DevrimGunduzTR




Streaming Logical Decoding Events

2018-08-11 Thread Matt Dee
Hi there,

I am trying to build an application which reads messages off of a
replication slot and does some processing with them.  I want to guarantee
that every event gets processed exactly once.  Unfortunately, I've been
unable to find a guide to the correct way to read this data.  I've been
experimenting and it seems to work, but I wanted to confirm a few things.
Apologies if this is the wrong place for this kind of question, or if
reading off a slot is described elsewhere (if so, I would appreciate being
pointed to the right place).

I'm trying to use terminology from
https://www.postgresql.org/docs/9.5/static/protocol-replication.html

My basic idea is:
* On startup, create a replication connection and do START_REPLICATION with
LSN 0/0.  This seems to start off from the last "committed" LSN (the last LSN
sent back in a standby status update).

* Keep track of a "last committed" LSN, and periodically send back a
Standby Status Update with the "written", "flushed", and "applied" LSNs all
set to that value (I could not find documentation on the difference between
the three, but setting all of them equal seems to work).

* When I see an XLogData message, process the data and then set the "last
committed" LSN to the "starting point of the WAL data" LSN.

* When I see a Primary Keepalive Message, set the "last committed" LSN to the
"current end WAL" field, assuming all prior XLogDatas have been processed.
If the "should reply" byte is set, send back a Standby Status Update
immediately.

Is this the way that this feature was intended to be used?
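
For comparison, pg_recvlogical speaks this same protocol and sends the
periodic Standby Status Updates itself, so I have been using something
roughly like the following to cross-check what my reader sees (the slot name,
database name, and output file are just placeholders; test_decoding is the
sample output plugin):

# Create a logical slot with the test_decoding sample plugin.
pg_recvlogical -d mydb --slot=my_slot --create-slot --plugin=test_decoding

# Stream changes from the slot; a status update goes back every 10 seconds,
# and the reported flush position advances as the output file is fsync'd.
pg_recvlogical -d mydb --slot=my_slot --start \
    --status-interval=10 --fsync-interval=10 -f decoded.out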

In particular, I have two concerns with this approach:

1) On the XLogData, should I be using "starting point of WAL data" or
"current end of WAL"?  In my testing, they seem to always be the same, so
I'm worried there's some case I haven't handled.

2) I'm not sure if committing the end-of-WAL LSNs from the Keepalive messages
is safe.  Is it possible to receive a Keepalive message, followed by an
XLogData message, such that the Keepalive's LSN is higher than the
XLogData's?  That could cause problems: if my application committed the
Keepalive's LSN and then had to restart, it might miss the XLogData when it
comes back up, since a higher LSN was already committed.

Thanks, and I appreciate any pointers you can provide!
-Matt


Re: Ltree - how to sort nodes on parent node

2018-08-11 Thread Saccilotto Ramon
Hi everyone,

I don't know if this is still of interest to anyone, but here is a query that I 
came up with to do the sorting.  It will currently probably not make use of the 
ltree indexing, so it might be worth adapting the query further.

The table (example_table) would be something like

 path            | ordinal
-----------------+---------
 Top             |       1
 Top.Science     |       1
 Top.Hobbies     |       2
 Top.Collections |       3
 …

The selection would work as follows:

/* create a intermediate table with an id column */
WITH ltreeTable AS (
  SELECT
  -- select the last part of the path as id
  subpath(path, -1) as "id",
  "path",
  "ordinal"
  FROM example_table
),

/* split the ltree path into separate parts */
treeParts AS (
  SELECT
  "id",
  -- split the path into separate parts
  unnest(regexp_split_to_array(path::text, '\.'))::ltree as "part",
  -- generate an ordinal for each array to preserve the order of the path
  generate_subscripts(regexp_split_to_array(path::text, '\.'), 1) as "idx"
  FROM ltreeTable
),

/* prefix each part with its respective zero-padded ordinal for sorting */
treePartsSorted AS (
  SELECT
  a.*,
  -- prefix each part with the ordinal
  lpad(b.ordinal::text, 4, '0') || '.' || a.part::text as "prefixed"
  FROM treeParts as a

  LEFT JOIN ltreeTable as b
  ON a.part = b.id
),

/* combine the paths back again */
treeSorting AS (
  SELECT
  "id",
  -- aggregate all parts and combine it back to an ltree path
  array_to_string(array_agg(prefixed ORDER BY idx),'.') AS "sorting"
  FROM treePartsSorted
  GROUP BY "id"
),

/* add the sorting column to the tree */
tree AS (
  SELECT
  a.*, text2ltree(b.sorting) as "sorting"
  FROM ltreeTable as a
  LEFT JOIN treeSorting as b
  ON a.id = b.id
)

SELECT * FROM tree
ORDER BY sorting asc;