Re: [HACKERS] Timeline following for logical slots

2016-04-05 Thread Oleksii Kliukin

> On 05 Apr 2016, at 09:51, Craig Ringer  wrote:
> 
> On 5 April 2016 at 04:00, Robert Haas wrote:
> 
> In general, I think we'd be a lot better off if we got some kind of
> logical replication into core first and then worked on lifting these
> types of limitations afterwards.
> 
> First, I'd like to remind everyone that logical decoding is useful for more 
> than replication. You can consume the change stream for audit 
> logging/archival, to feed into a different DBMS, etc etc. This is not just 
> about replicating from one PostgreSQL to another, though to be sure that'll 
> be the main use in the near term.
> 
> The Zalando guys at least are already using it for other things, and interest 
> in the json support suggests they're not alone.

We are actually interested in both the streaming part and the logical 
replication currently provided by BDR. The reason we cannot use BDR 
efficiently is that there is no way to provide HA for an individual BDR node 
using physical replication, meaning we would have to run twice the number of 
nodes, with every node communicating with every other. Since our only use case 
is running across multiple data centers with latencies of 100ms and above, 
running without physical HA limits us to only a couple of nodes, with a manual 
repair mechanism.

The other use case we have is streaming data from many, sometimes rather big 
(TBs) databases to consumers interested in storing subsets of that data in 
order to run analytical queries on it. It’s hard to imagine a robust system 
like this built around a feature that doesn't support a failover between the 
master and a physical replica, forcing us to re-stream a set of selected 
tables at best, or the complete database at worst (did I mention the terabytes 
already?), and then potentially doing some very funny tricks to merge the 
newly streamed data with what is already there. 


> 
> Right now if you're doing any kind of logical decoding from a master server 
> that fails over to a standby the client just dies. The slot vanishes. You're 
> stuffed. Gee, I hope you didn't need all that nice consistent ordering, 
> because you're going to have to start from scratch and somehow reconcile the 
> data in the new master with what you've already received ... and haven’t.

+1

> 
> We could certainly require clients to jump through all sorts of extra hoops 
> to make sure they can follow replay over physical failover. Or we could say 
> that you shouldn't actually expect to use logical decoding in real world 
> environments where HA is a necessity until we get around to supporting 
> realistic, usable logical-rep based failover in a few years.

And faster than in a few years, your typical big organization might decide to 
move on to some other data storage solution that promises HA right now, at the 
expense of SQL, strong consistency, and transactions. It would be a bad 
choice, but one that developers (especially those looking to build 
“scalable micro services” with only a couple of CRUD queries) will be willing 
to make.


> Or we could make it "just work" for the physical failover systems everyone 
> already uses and relies on, just like sequences, indexes, and everything else 
> in PostgreSQL that's expected to survive failover.
> 

--
Oleksii



Re: [HACKERS] WIP: Failover Slots

2016-04-04 Thread Oleksii Kliukin
Hi,

> On 17 Mar 2016, at 09:34, Craig Ringer  wrote:
> 
> OK, here's the latest failover slots patch, rebased on top of today's master 
> plus, in order:
> 
> - Dirty replication slots when confirm_lsn is changed
>   (http://www.postgresql.org/message-id/camsr+yhj0oycug2zbyqprhxmcjnkt9d57msxdzgwbkcvx3+...@mail.gmail.com)
> 
> - logical decoding timeline following
>   (http://www.postgresql.org/message-id/CAMsr+YH-C1-X_+s=2nzapnr0wwqja-rumvhsyyzansn93mu...@mail.gmail.com)
> 
> The full tree is at https://github.com/2ndQuadrant/postgres/tree/dev/failover-slots
> if you want to avoid the fiddling around required to apply the patch series.
> 
> 
> <0001-Allow-replication-slots-to-follow-failover.patch><0002-Update-decoding_failover-tests-for-failover-slots.patch><0003-Retain-extra-WAL-for-failover-slots-in-base-backups.patch><0004-Add-the-UI-and-for-failover-slots.patch><0005-Document-failover-slots.patch><0006-Add-failover-to-pg_replication_slots.patch><0007-Introduce-TAP-recovery-tests-for-failover-slots.patch>



Thank you for the update. I’ve got some rejects when applying 
0001-Allow-replication-slots-to-follow-failover.patch on top of the "Dirty 
replication slots when confirm_lsn is changed” changes. I think it should be 
rebased against master (this might be a consequence of the "logical slots 
follow the timeline” patch having been committed).

patch -p1 
<~/git/pg/patches/failover-slots/v6/0001-Allow-replication-slots-to-follow-failover.patch
patching file src/backend/access/rmgrdesc/Makefile
Hunk #1 FAILED at 10.
1 out of 1 hunk FAILED -- saving rejects to file 
src/backend/access/rmgrdesc/Makefile.rej
patching file src/backend/access/rmgrdesc/replslotdesc.c
patching file src/backend/access/transam/rmgr.c
Hunk #1 succeeded at 25 (offset 1 line).
patching file src/backend/access/transam/xlog.c
Hunk #1 succeeded at 6351 (offset 3 lines).
Hunk #2 succeeded at 8199 (offset 14 lines).
Hunk #3 succeeded at 8645 (offset 14 lines).
Hunk #4 succeeded at 8718 (offset 14 lines).
patching file src/backend/commands/dbcommands.c
patching file src/backend/replication/basebackup.c
patching file src/backend/replication/logical/decode.c
Hunk #1 FAILED at 143.
1 out of 1 hunk FAILED -- saving rejects to file 
src/backend/replication/logical/decode.c.rej
patching file src/backend/replication/logical/logical.c
patching file src/backend/replication/slot.c
patching file src/backend/replication/slotfuncs.c
patching file src/backend/replication/walsender.c
patching file src/bin/pg_xlogdump/replslotdesc.c
patching file src/bin/pg_xlogdump/rmgrdesc.c
Hunk #1 succeeded at 27 (offset 1 line).
patching file src/include/access/rmgrlist.h
Hunk #1 FAILED at 45.
1 out of 1 hunk FAILED -- saving rejects to file 
src/include/access/rmgrlist.h.rej
patching file src/include/replication/slot.h
patching file src/include/replication/slot_xlog.h
can't find file to patch at input line 1469
Perhaps you used the wrong -p or --strip option?



--
Oleksii



Re: [HACKERS] empty array case in plperl_ref_from_pg_array not handled correctly

2016-03-08 Thread Oleksii Kliukin

> On 08 Mar 2016, at 10:11, Alex Hunsaker  wrote:
> 
> 
> 
> On Mon, Mar 7, 2016 at 11:32 PM, Andres Freund  > wrote:
> Hi,
> 
> Per the new valgrind animal we get:
> 
> http://pgbuildfarm.org/cgi-bin/show_log.pl?nm=skink=2016-03-08%2004%3A22%3A00
> 2016-03-08 05:56:05.566 UTC [56de6971.723:5] LOG:  statement: select plperl_sum_array('{}');
> ==1827== Invalid write of size 4
> ==1827==at 0x14E35DD1: plperl_ref_from_pg_array (plperl.c:1459)
> 
> 
> [ I think you may have meant to CC me not Alexey K. I'm probably the person 
> responsible :D. ]
> 
> Indeed, I think the simplest fix is to make plperl_ref_from_pg_array() return 
> an "empty" array in that case. The attached fixes the valgrind warning for 
> me. (cassert enabled build on master).

Looks good to me, thank you. Judging from the size reported in the error 
message, it’s likely the

info->nelems[0] = items

line that caused this issue. The patch fixes it at first glance, although I 
have yet to get my valgrind setup on OS X working to check this for real :-)

Kind regards,
--
Oleksii



Re: [HACKERS] WIP: Failover Slots

2016-02-23 Thread Oleksii Kliukin

> On 23 Feb 2016, at 11:30, Craig Ringer  wrote:
> 
>  
> Updated patch 
> 
> ... attached
> 
> I've split it up a bit more too, so it's easier to tell what change is for 
> what and fixed the issues mentioned by Oleksii. I've also removed some 
> unrelated documentation changes.
> 
> Patch 0001, timeline switches for logical decoding, is unchanged since the 
> last post. 

Thank you. I’ve read the user-interface part now; looks good to me.

I found the following issue when shutting down a master with a connected 
replica that uses a physical failover slot:

2016-02-23 20:33:42.546 CET,,,54998,,56ccb3f3.d6d6,3,,2016-02-23 20:33:07 
CET,,0,DEBUG,0,"performing replication slot checkpoint",""
2016-02-23 20:33:42.594 CET,,,55002,,56ccb3f3.d6da,4,,2016-02-23 20:33:07 
CET,,0,DEBUG,0,"archived transaction log file 
""00010003""",""
2016-02-23 20:33:42.601 CET,,,54998,,56ccb3f3.d6d6,4,,2016-02-23 20:33:07 
CET,,0,PANIC,XX000,"concurrent transaction log activity while database system 
is shutting down",""
2016-02-23 20:33:43.537 CET,,,54995,,56ccb3f3.d6d3,5,,2016-02-23 20:33:07 
CET,,0,LOG,0,"checkpointer process (PID 54998) was terminated by signal 6: 
Abort trap",""
2016-02-23 20:33:43.537 CET,,,54995,,56ccb3f3.d6d3,6,,2016-02-23 20:33:07 
CET,,0,LOG,0,"terminating any other active server processes",

Basically, the issue is that CreateCheckPoint calls CheckpointReplicationSlots, 
which currently produces WAL, and this violates the assumption at line 
xlog.c:8492

	if (shutdown && checkPoint.redo != ProcLastRecPtr)
		ereport(PANIC,
				(errmsg("concurrent transaction log activity while database system is shutting down")));
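The failure mode can be illustrated with a small toy model (Python pseudocode of the control flow, not PostgreSQL source; all names are illustrative): during a shutdown checkpoint, the checkpoint record must be the only record inserted after the redo pointer was computed, so any WAL written in between by the slot checkpoint trips the PANIC.

```python
# Toy model of the shutdown-checkpoint invariant quoted above
# (illustrative only, not PostgreSQL's actual code or variable names).
class WalToy:
    def __init__(self, insert_pos=0x6000028):
        self.insert_pos = insert_pos  # current WAL insert position
        self.last_rec = None          # start LSN of the last inserted record

    def insert(self, length):
        self.last_rec = self.insert_pos
        self.insert_pos += length
        return self.last_rec

def shutdown_checkpoint(wal, slot_checkpoint_writes_wal):
    redo = wal.insert_pos             # redo pointer is fixed here
    if slot_checkpoint_writes_wal:
        wal.insert(48)                # CheckpointReplicationSlots() logs a record
    wal.insert(104)                   # the checkpoint record itself
    if redo != wal.last_rec:          # the check that raises the PANIC
        raise RuntimeError("concurrent transaction log activity "
                           "while database system is shutting down")

shutdown_checkpoint(WalToy(), slot_checkpoint_writes_wal=False)  # passes
try:
    shutdown_checkpoint(WalToy(), slot_checkpoint_writes_wal=True)
except RuntimeError as e:
    print("PANIC:", e)
```

In other words, as long as CheckpointReplicationSlots inserts WAL of its own between computing the redo pointer and writing the checkpoint record, a clean shutdown with an active failover slot cannot pass this check.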


There are a couple of incorrect comments

logical.c:90
"There's some things missing to allow this:" should be “There are some 
things missing to allow this:”

logical.c:93
"we need we would need”

slot.c:889
"and there's no latch to set, so poll” - clearly there is a latch used in the 
code below.

Also,  slot.c:301 emits an error message for an attempt to create a failover 
slot on the replica after acquiring and releasing the locks and getting the 
shared memory slot, even though all the data to check for this condition is 
available right at the beginning of the function. Shouldn’t it avoid the extra 
work if it’s not needed? 

> 
> 
> -- 
>  Craig Ringer   http://www.2ndQuadrant.com/ 
> 
>  PostgreSQL Development, 24x7 Support, Training & Services
> <0001-Allow-logical-slots-to-follow-timeline-switches.patch><0002-Allow-replication-slots-to-follow-failover.patch><0003-Retain-extra-WAL-for-failover-slots-in-base-backups.patch><0004-Add-the-UI-and-for-failover-slots.patch><0005-Document-failover-slots.patch><0006-Add-failover-to-pg_replication_slots.patch><0007-not-for-inclusion-Test-script-for-failover-slots.patch>


Kind regards,
--
Oleksii



Re: [HACKERS] WIP: Failover Slots

2016-02-22 Thread Oleksii Kliukin
Hi,

> On 16 Feb 2016, at 09:11, Craig Ringer  wrote:
> 
> 
> 
> Revision attached. There was a file missing from the patch too.
> 

All attached patches apply normally. I only took a look at first 2, but also 
tried to run the Patroni with the modified version to check whether the basic 
replication works.

What it’s doing is calling pg_basebackup first to initialize the replica, and 
that actually failed with:

pg_basebackup: unexpected termination of replication stream: ERROR:  requested 
WAL segment 0001 has already been removed

The segment name definitely looks bogus to me.

The actual command causing the failure was an attempt to clone the replica 
using pg_basebackup, turning on xlog streaming:

pg_basebackup --pgdata data/postgres1 --xlog-method=stream 
--dbname="host=localhost port=5432 user=replicator"

I checked the same command against git master without the patches applied 
and could not reproduce this problem there.

On the code level, I have no comments on 0001; it’s well documented and I have 
no questions about the approach, although I may not be knowledgeable enough to 
judge the specifics of the implementation.

On the 0002, there are a few rough edges:


slot.c:294
elog(LOG, "persistency is %i", (int)slot->data.persistency);

Should this be changed to DEBUG?

slot.c:468
Why did you drop “void" as a parameter type of ReplicationSlotDropAcquired?

walsender.c:1509 at PhysicalConfirmReceivedLocation

I’ve noticed a comment stating that we don’t need to call 
ReplicationSlotSave(), but that pre-dates the WAL-logging of replication slot 
changes. Don’t we need to call it now, the same way it’s done for the logical 
slots in LogicalConfirmReceivedLocation in logical.c?

Kind regards,
--
Oleksii



Re: [HACKERS] rows estimate in explain analyze for the BRIN index

2015-12-30 Thread Oleksii Kliukin

> On 30 Dec 2015, at 18:38, Emre Hasegeli  wrote:
> 
>> which is much closer to the actual number of rows removed by the index
>> recheck + the one left.
> 
> Is it better to be closer?  We are saying those are the "actual"
> values, not the estimates.  If we cannot provide the actual rows, I
> think it is better to provide nothing.  Something closer to the
> reality would create more confusion.  Maybe we should just return the
> number of blocks, and put a note about it somewhere.  The actual row
> count is already available on the upper part of the plan.

I don’t see how to solve this problem without changing the EXPLAIN ANALYZE 
output to accommodate an “unknown” value. I don’t think “0” is a non-confusing 
representation of “unknown” for most people, and from a practical standpoint, 
a “best effort” estimate is better than 0 (i.e. I will be able to estimate how 
efficient a BRIN index is for my tables in terms of the number of tuples 
retrieved/thrown away).

We might still reflect in the documentation that the BRIN index cannot produce 
the exact number of rows during the bitmap scan and point people asking similar 
questions there.






Re: [HACKERS] rows estimate in explain analyze for the BRIN index

2015-12-30 Thread Oleksii Kliukin

> On 30 Dec 2015, at 17:02, Tom Lane <t...@sss.pgh.pa.us> wrote:
> 
> Oleksii Kliukin <al...@hintbits.com> writes:
>> Bitmap Heap Scan on example  (cost=744.44..757.64 rows=6 width=0) (actual 
>> time=73.895..73.895 rows=0 loops=1)
>>   Output: 1
>>   Recheck Cond: (example.event_time = (now() - '5 mons'::interval))
>>   Rows Removed by Index Recheck: 4030
>>   Heap Blocks: lossy=128
>>   Buffers: shared hit=629
>>   ->  Bitmap Index Scan on example_event_time_idx1  (cost=0.00..744.41 
>> rows=6 width=0) (actual time=70.335..70.335 rows=1280 loops=1)
>> Index Cond: (example.event_time = (now() - '5 mons'::interval))
>> Buffers: shared hit=501
> 
>> - how does it get 1280 rows from the BRIN index scan, given that BRIN only 
>> stores pointers to the heap blocks, not individual rows. Does it calculate 
>> the number of rows in the blocks returned?
> 
> It evidently returned 128 block IDs to the heapscan logic.  I have not
> looked at the code, but a reasonable bet is that it's just guessing that
> there are 10 rows per block.

You are right, this is at the end of bringetbitmap in brin.c:

	/*
	 * XXX We have an approximation of the number of *pages* that our scan
	 * returns, but we don't have a precise idea of the number of heap tuples
	 * involved.
	 */
	PG_RETURN_INT64(totalpages * 10);


> 
> That seems like an awfully low number, though; it equates to assuming
> that rows are 800 bytes wide on average.  If we're going to use a fixed
> number, 100 rows per block would probably be more nearly the correct
> order of magnitude.
> 
> Another idea would be to use the heap's row density as calculated
> by the last ANALYZE (ie, reltuples/relpages), with a fallback to 100
> if relpages=0.  This'd only be convenient if the bitmap scan node has
> the parent heap rel open, which it might not.

+1
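The suggestion above can be sketched numerically (a toy in Python with illustrative names, not the actual brin.c variables): scale the matched page count by the heap's row density from the last ANALYZE, falling back to a constant when no statistics exist yet.

```python
# Sketch of the proposed estimate (illustrative, not PostgreSQL code):
# use reltuples/relpages as the per-page row density, with a fallback
# constant when relpages == 0 (no ANALYZE yet).
def brin_tuple_estimate(totalpages, reltuples, relpages, fallback_density=100):
    """Approximate tuple count for a lossy BRIN bitmap scan."""
    density = reltuples / relpages if relpages > 0 else fallback_density
    return int(totalpages * density)

# Without statistics, 128 lossy pages are reported as 128 * 100 tuples:
print(brin_tuple_estimate(128, 0, 0))  # 12800
# With stats of roughly 226 rows/page (e.g. a 100K-row table on ~443
# pages), the estimate for 128 pages lands near 28900:
print(brin_tuple_estimate(128, 100000, 443))
```

With the density-based estimate, the reported row count tracks the actual number of tuples the heap scan has to recheck much more closely than the fixed 10-rows-per-page guess.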

Kind regards,
--
Oleksii



Re: [HACKERS] rows estimate in explain analyze for the BRIN index

2015-12-30 Thread Oleksii Kliukin

> On 30 Dec 2015, at 21:12, Tom Lane  wrote:
> 
> Emre Hasegeli  writes:
>>> I don’t see how to solve this problem without changing explain analyze 
>>> output to accommodate for “unknown” value. I don’t think “0” is a 
>>> non-confusing representation of “unknown” for most people, and from the 
>>> practical standpoint, a “best effort” estimate is better than 0 (i.e. I 
>>> will be able to estimate how efficient BRIN index is for my tables in terms 
>>> of the number of tuples retrieved/thrown away)
> 
> We do already have a nearby precedent for returning zero when we don't
> have an accurate answer: that's what BitmapAnd and BitmapOr plan nodes
> do.  (This is documented btw, at the bottom of section 14.1.)

+1 for following a precedent.

> 
>> The number of retrieved and thrown away rows are already available on
>> the upper part of the plan.  Bitmap Index Scan should provide the rows
>> that matched the index.
> 
> It doesn't have that information.
> 
>> Another alternative would be just returning
>> the number of matching pages (by not multiplying with 10).  It might
>> be better understood.
> 
> No, it would not, at least not unless we found a way to explicitly mark
> the output as being blocks not rows (which would doubtless break a lot of
> existing client-side code).  Zero is fairly clearly an impossible value,
> whereas anything that's not zero is going to be taken at face value by
> many users.

But is it? Is it impossible for the BRIN bitmap index scan to return 0 rows 
(say, if the value being matched is outside the min/max boundary for every 
block range)? Granted, if we document that it always returns 0 and should be 
ignored, then confusing an actual 0 with the 0 as a representation of 
“unknown” would be less of a problem. 

--
Oleksii



Re: [HACKERS] rows estimate in explain analyze for the BRIN index

2015-12-30 Thread Oleksii Kliukin

> On 30 Dec 2015, at 17:44, Tom Lane <t...@sss.pgh.pa.us> wrote:
> 
> Oleksii Kliukin <al...@hintbits.com> writes:
>>> On 30 Dec 2015, at 17:02, Tom Lane <t...@sss.pgh.pa.us> wrote:
>>> Another idea would be to use the heap's row density as calculated
>>> by the last ANALYZE (ie, reltuples/relpages), with a fallback to 100
>>> if relpages=0.  This'd only be convenient if the bitmap scan node has
>>> the parent heap rel open, which it might not.
> 
>> +1
> 
> Any objections to the attached?

Looks good to me. On my sample system with 100K rows, the new version gives me:

-- CREATE TABLE test AS SELECT id FROM generate_series(1,100000) id;
-- CREATE INDEX ON test USING brin(id);

postgres=# explain analyze select 1 from test where id = 500;
                                  QUERY PLAN
------------------------------------------------------------------------------
 Bitmap Heap Scan on test  (cost=12.01..16.02 rows=1 width=0) (actual 
time=0.199..4.220 rows=1 loops=1)
   Recheck Cond: (id = 500)
   Rows Removed by Index Recheck: 28927
   Heap Blocks: lossy=128
   ->  Bitmap Index Scan on test_id_idx  (cost=0.00..12.01 rows=1 width=0) 
(actual time=0.072..0.072 rows=28800 loops=1)
 Index Cond: (id = 500)
 Planning time: 0.433 ms
 Execution time: 4.323 ms
(8 rows)

which is much closer to the actual number of rows removed by the index recheck 
+ the one left.

--
Oleksii



[HACKERS] rows estimate in explain analyze for the BRIN index

2015-12-30 Thread Oleksii Kliukin
Hi,

While experimenting with BRIN on PostgreSQL 9.5RC1 I came across the following 
plan (which is, btw a very good example of how BRIN rocks for the clustered 
data, the size of this table is around 90GB, the size of the index is around 
3MB):

explain (analyze, buffers, verbose) select 1 from example where event_time = 
now() - interval '5 months';
                                  QUERY PLAN
------------------------------------------------------------------------------
 Bitmap Heap Scan on example  (cost=744.44..757.64 rows=6 width=0) (actual 
time=73.895..73.895 rows=0 loops=1)
   Output: 1
   Recheck Cond: (example.event_time = (now() - '5 mons'::interval))
   Rows Removed by Index Recheck: 4030
   Heap Blocks: lossy=128
   Buffers: shared hit=629
   ->  Bitmap Index Scan on example_event_time_idx1  (cost=0.00..744.41 rows=6 
width=0) (actual time=70.335..70.335 rows=1280 loops=1)
 Index Cond: (example.event_time = (now() - '5 mons'::interval))
 Buffers: shared hit=501
 Planning time: 0.125 ms
 Execution time: 73.943 ms
(11 rows)

Time: 74.642 ms


Here EXPLAIN ANALYZE reports 1280 rows from the Bitmap Index Scan, but on the 
higher level, 4030 rows were removed by Index Recheck. 

The questions are:

- how does it get 1280 rows from the BRIN index scan, given that BRIN only 
stores pointers to the heap blocks, not individual rows? Does it calculate the 
number of rows in the blocks returned?

- how come the subsequent recheck filters out 4030 rows, out of the 
supposedly 1280 returned?

Kind regards,
--
Oleksii



Re: [HACKERS] pg_rewind test race condition..?

2015-12-08 Thread Oleksii Kliukin
Hi,

On Wed, Apr 29, 2015, at 01:36 AM, Heikki Linnakangas wrote:
> The problem seems to be that when the standby is promoted, it's a 
> so-called "fast promotion", where it writes an end-of-recovery record 
> and starts accepting queries before creating a real checkpoint. 
> pg_rewind looks at the TLI in the latest checkpoint, as it's in the 
> control file, but that isn't updated until the checkpoint completes. I 
> don't see it on my laptop normally, but I can reproduce it if I insert a 
> "sleep(5)" in StartupXLog, just before it requests the checkpoint:
> 
> --- a/src/backend/access/transam/xlog.c
> +++ b/src/backend/access/transam/xlog.c
> @@ -7173,7 +7173,10 @@ StartupXLOG(void)
>* than is appropriate now that we're not in standby mode anymore.
>*/
>   if (fast_promoted)
> +   {
> +   sleep(5);
>   RequestCheckpoint(CHECKPOINT_FORCE);
> +   }
>   }
> 
> The simplest fix would be to force a checkpoint in the regression test, 
> before running pg_rewind. It's a bit of a cop out, since you'd still get 
> the same issue when you tried to do the same thing in the real world. It 
> should be rare in practice - you'd not normally run pg_rewind 
> immediately after promoting the standby - but a better error message at 
> least would be nice..

I think we came across this issue in production. We run a Python daemon
called 'Patroni' (https://github.com/zalando/patroni) in order to
automate failovers. It stops the current master before promotion of one
of the replicas, and runs pg_rewind on the former master to make it
"compatible" with the new one. Naturally, since pg_rewind is launched
automatically, it is called right after detecting that the new master is
running.

What we sometimes get is "source and target cluster are on the same
timeline". Per the explanation from Heikki, it happens when pg_rewind
connects to the new master after promotion but before the first
checkpoint on the new timeline. Since "fast promotion" is the only
option for pg_ctl promote, I think it might be beneficial to add an
option for pg_rewind to issue a checkpoint (I worked around this
problem by adding a checkpoint call to Patroni; another way would be to
create the 'fallback_promote' file and do the 'slow' promote without
relying on pg_ctl promote functionality).
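Why this race exists can be sketched with a toy model (Python, illustrative only, not PostgreSQL code): pg_rewind compares the timeline of the latest checkpoint as recorded in each control file, and fast promotion bumps the node's current timeline without immediately writing a checkpoint.

```python
# Toy model of the fast-promotion race: the control file's
# latest-checkpoint TLI is only advanced by the first checkpoint on the
# new timeline, so until then both nodes still report the old TLI.
class NodeToy:
    def __init__(self):
        self.current_tli = 1
        self.controlfile_tli = 1   # TLI of the latest checkpoint

    def fast_promote(self):
        self.current_tli += 1      # end-of-recovery record, no checkpoint yet

    def checkpoint(self):
        self.controlfile_tli = self.current_tli

old_master, new_master = NodeToy(), NodeToy()
new_master.fast_promote()
# pg_rewind run here sees equal control-file TLIs and bails out:
print(new_master.controlfile_tli == old_master.controlfile_tli)  # True
new_master.checkpoint()            # the manual CHECKPOINT workaround
print(new_master.controlfile_tli == old_master.controlfile_tli)  # False
```

This is exactly why forcing a checkpoint on the freshly promoted master (as Patroni now does) makes pg_rewind see the diverged timelines.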

There is another weird case I cannot explain easily. If I do a single
promotion/rewind while executing an insert that takes quite some time
(insert into test select id from generate_series(1, 1000) id), actually
interrupting it by the Patroni promotion (which does a checkpoint and
pg_ctl -mf stop on the former master, and pg_ctl promote on the replica
candidate), I get the following (postgresql0 is the rewound node; note
that the latest checkpoint location there is lower than the prior
checkpoint one):

$ pg_controldata data/postgresql0
pg_control version number:942
Catalog version number:   201511071
Database system identifier:   6225948429980199864
Database cluster state:   in archive recovery
pg_control last modified: Tue Dec  8 17:20:16 2015
Latest checkpoint location:   0/628
Prior checkpoint location:0/6000138
Latest checkpoint's REDO location:0/628
Latest checkpoint's REDO WAL file:00010006
Latest checkpoint's TimeLineID:   1
Latest checkpoint's PrevTimeLineID:   1
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID:  0/944
Latest checkpoint's NextOID:  16390
Latest checkpoint's NextMultiXactId:  1
Latest checkpoint's NextMultiOffset:  0
Latest checkpoint's oldestXID:931
Latest checkpoint's oldestXID's DB:   1
Latest checkpoint's oldestActiveXID:  0
Latest checkpoint's oldestMultiXid:   1
Latest checkpoint's oldestMulti's DB: 1
Latest checkpoint's oldestCommitTs:   0
Latest checkpoint's newestCommitTs:   0
Time of latest checkpoint:Tue Dec  8 17:20:04 2015
Fake LSN counter for unlogged rels:   0/1
Minimum recovery ending location: 0/6020580
Min recovery ending loc's timeline:   2
Backup start location:0/0
Backup end location:  0/0
End-of-backup record required:no
wal_level setting:hot_standby
wal_log_hints setting:on
max_connections setting:  100
max_worker_processes setting: 8
max_prepared_xacts setting:   0
max_locks_per_xact setting:   64
track_commit_timestamp setting:   off
Maximum data alignment:   8
Database block size:  8192
Blocks per segment of large relation: 131072
WAL block size:   8192
Bytes per WAL segment:16777216
Maximum length of identifiers:64
Maximum columns in an index:  32
Maximum size of a TOAST chunk:1996
Size of a large-object chunk: 2048
Date/time type storage:   64-bit integers

Re: [HACKERS] [ADMIN] Replication slots and isolation levels

2015-10-29 Thread Oleksii Kliukin

> On 29 Oct 2015, at 14:39, Vladimir Borodin  wrote:
> 
> If I understand right, with hot_standby_feedback = on the standby tells the master 
> xmin of the earliest transaction on standby. And autovacuum worker on master 
> takes it into account when doing vacuum cleanup (because it can see it from 
> pg_replications_slots view), AFAIK with GetOldestXmin function. But I don’t 
> understand why with READ COMMITED transactions xmin in pg_replication_slots 
> view on master continues to increase while with REPEATABLE READ xmin freezes 
> until this transaction finishes.

Could it be a consequence of how REPEATABLE READ transactions handle snapshots? 
With REPEATABLE READ the snapshot is acquired only once at the beginning of a 
transaction; a READ COMMITTED transaction re-evaluates its snapshot with each 
new command.

http://www.postgresql.org/docs/current/static/transaction-iso.html
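The difference can be sketched with a toy model (Python, illustrative only, not PostgreSQL code): the xmin a transaction exports is that of its current snapshot, which REPEATABLE READ acquires once and READ COMMITTED re-acquires per statement.

```python
# Toy model of snapshot acquisition per isolation level: REPEATABLE
# READ pins the xmin of its first snapshot for the whole transaction,
# while READ COMMITTED takes a fresh snapshot before every statement,
# letting the xmin reported via hot_standby_feedback advance.
class TxnToy:
    def __init__(self, isolation):
        self.isolation = isolation
        self.snapshot_xmin = None

    def run_statement(self, current_xid):
        if self.isolation == "read committed" or self.snapshot_xmin is None:
            self.snapshot_xmin = current_xid  # fresh snapshot
        return self.snapshot_xmin

rr = TxnToy("repeatable read")
rc = TxnToy("read committed")
for xid in (100, 200, 300):          # xids assigned while the txns run
    rr_xmin = rr.run_statement(xid)
    rc_xmin = rc.run_statement(xid)
print(rr_xmin, rc_xmin)              # 100 300
```

So the REPEATABLE READ transaction keeps the slot's xmin frozen at the value seen at its first statement, while the READ COMMITTED one lets it climb.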

Kind regards,
--
Oleksii



Re: [HACKERS] run pg_rewind on an uncleanly shut down cluster.

2015-10-06 Thread Oleksii Kliukin

> On 06 Oct 2015, at 08:58, Michael Paquier
> <michael.paqu...@gmail.com> wrote:
>
> On Tue, Oct 6, 2015 at 12:41 AM, Oleksii Kliukin
> <al...@hintbits.com> wrote:
>> pg_rewind -D postgresql0 --source-server="host=127.0.0.1 port=5433
>> dbname=postgres" The servers diverged at WAL position 0/360 on
>> timeline 1. could not open file
>> "data/postgresql0/pg_xlog/00010002": No such file or
>> directory
>>
>> Note that this problem happens not 100% of time during the tests,
>> sometimes pg_rewind can actually rewind the former master.
>
> I don't think that there is any actual reason preventing us from
> rewinding a node that has its state in pg_control set as something
> else than DB_SHUTDOWNED, the important point here is to be sure that
> the target node is *not* running while pg_rewind is running (perhaps
> pg_rewind should perform an action in the target node to not have it
> run, let's say that it creates a fake postgresql.conf with invalid
> data and renames the existing one). Checking pg_control makes things
> easier though, there is no need to rely on external binaries like
> "pg_ctl status" or some parsing of postmaster.pid with kill(pid, 0)
> for example.

Does pg_rewind actually rely on the cluster being rewound having finished
recovery? If not, then it would be a good idea to add a --force flag to
make pg_rewind ignore the state check, as you suggested in
this thread:

http://www.postgresql.org/message-id/flat/CAF8Q-Gw1HBKzpSEVtotLg=dr+ee-6q59qqfhy5tor3fyaen...@mail.gmail.com#CAF8Q-Gw1HBKzpSEVtotLg=dr+ee-6q59qqfhy5tor3fyaen...@mail.gmail.com

>
>> I know I can copy the segment back from the archive, but I'd like to
>> avoid putting this logic into the failover tool if possible. Is there
>> anything we can do to avoid the problem described above, or is there
>> a better way to bring up the former master after the crash with
>> pg_rewind?
>
> Well, for 9.5 (and actually the same applies to the 9.3 and 9.4
> version on github because I am keeping the previous versions
> consistent with what is in 9.5), I guess no.
>
> This missing segment is going to be needed in any case because the
> list of blocks modified needs to be found, hence the question is "how
> can pg_rewind guess where a WAL segment missing from the target's
> pg_xlog is located?". And there are multiple answers:
> - An archive path, then let's add an option to pg_rewind to add a
>   path, though this needs the archive path to be mounted locally, and
>   usually that's not the case.
> - An existing node of the cluster, perhaps the segment is still
>   present on another standby node that already replayed it, though
>   this would need an extra node.
> - The source node itself, if we are lucky the missing segment created
>   before WAL forked is still there. It may not be there though if it
>   has already been recycled. At the end it seems to me that this is
>   going to need some extra operation by the failover tool or the
>   system administrator either way, and that any additional logic to
>   check where this segment is located is never going to satisfy
>   completely the failover use cases. Hence I would keep just pg_rewind
>   out of that.

Well, checking the source node looks like an option that does not
require the DBA to provide any additional information, as the connection
string or the path to the data dir is already there. It would be nice if
pg_rewind could fetch WAL via the given restore_command though, or even
use the command already present in recovery.conf (if the node being
recovered is a replica, which I guess is a pretty common case).

Anyway, thank you for describing the issue. In my case, it seems I
solved it by removing the files from the archive_status directory of the
former master (the node being rewound). This makes PostgreSQL forget
that it has to remove an already archived (but still required for
pg_rewind) segment (I guess it does this during stop, when the checkpoint
is issued). Afterwards, postgres is started in single-user mode with
archive_command=false and archive_mode=on, to make sure no segments are
archived/removed, and stopped right afterwards with:

postgres --single -D . -c "max_replication_slots=5" -c
"wal_level=hot_standby" -c "wal_log_hints=on" -c "archive_mode=on" -c
"archive_command=false" postgres 

Re: [HACKERS] run pg_rewind on an uncleanly shut down cluster.

2015-10-05 Thread Oleksii Kliukin

> On 05 Oct 2015, at 18:04, Bruce Momjian <br...@momjian.us> wrote:
> 
> On Mon, Oct  5, 2015 at 05:41:07PM +0200, Oleksii Kliukin wrote:
>> Hello,
>> 
>> I'm trying to find out how to rewind a cluster that was not shut down
>> cleanly, in order to implement pg_rewind support in patroni (an
>> automated failover system, https://github.com/zalando/patroni).
>> 
>> At the moment, pg_rewind limits itself to only cleanly shut down
>> clusters. This works nicely in the case of a split brain caused by the
>> network partitioning. However, it doesn't cover the important case of a
>> suddenly crashed master: the crashed cluster cannot be rewound to the
>> new master. 
> 
> Did you read this thread convering the same topic from a few weeks ago?
> 
>   
> http://www.postgresql.org/message-id/flat/55fa2537.4070...@gmx.net#55fa2537.4070...@gmx.net
>  
> <http://www.postgresql.org/message-id/flat/55fa2537.4070...@gmx.net#55fa2537.4070...@gmx.net>

Thanks, I saw it. The problem being discussed there is different from mine: I 
need to rewind a crashed master, not a replica being shut down in recovery. And 
I’m looking for something that can be achieved with 9.3 or 9.4, and there is 
evidence (at least suggestions in the thread linked from my previous post) 
that it should work.

Kind regards,
--
Oleksii



[HACKERS] run pg_rewind on an uncleanly shut down cluster.

2015-10-05 Thread Oleksii Kliukin
Hello,

I'm trying to find out how to rewind a cluster that was not shut down
cleanly, in order to implement pg_rewind support in patroni (an
automated failover system, https://github.com/zalando/patroni).

At the moment, pg_rewind limits itself to only cleanly shut down
clusters. This works nicely in the case of a split brain caused by the
network partitioning. However, it doesn't cover the important case of a
suddenly crashed master: the crashed cluster cannot be rewound to the
new master. 

One idea to overcome this limitation is to start the former master for a
short time, just to let automatic recovery do its job, and stop it
cleanly afterwards. There are some indications on the list that it
works:
http://www.postgresql.org/message-id/79f6ceb4-f519-40fa-9c72-167def1eb...@simply.name

However, in our testing we had an issue with a missing WAL segment on a
former master, which prevented pg_rewind from bringing it up to date
with the current master:

Suppose, the current XLOG segment right before we crash the master is:

postgres=# select * from pg_xlogfile_name(pg_current_xlog_location());
 pg_xlogfile_name
--
 00010003
(1 row)

(the master is configured to archive all segments into the external
directory).

The latest checkpoint location right before the crash is:

Latest checkpoint's REDO location:0/228
Latest checkpoint's REDO WAL file:00010002

and pg_xlog contains the following data
$ ls -R  postgresql0/pg_xlog/
00010001
00010002.0028.backup archive_status
00010002 00010003

postgresql0/pg_xlog//archive_status:
00010001.done
00010002.done
00010002.0028.backup.done

Now, if we crash the master by sending it SIGKILL, and then start it
again with:

$ postgres  -D data/postgresql0 -c "max_replication_slots=5" -c
"wal_level=hot_standby" -c "wal_log_hints=on"
LOG:  database system was interrupted; last known up at 2015-10-05
17:28:04 CEST
LOG:  database system was not properly shut down; automatic recovery in
progress
LOG:  redo starts at 0/228
LOG:  invalid record length at 0/360
LOG:  redo done at 0/328

we'll get the following contents of postgresql0/pg_xlog:

$ ls -R  postgresql0/pg_xlog/
00010002.0028.backup 00010004 archive_status
00010003 00010005

postgresql0/pg_xlog//archive_status:
00010002.0028.backup.done

Note, that at some moment the master removed the segment
00010002 from its pg_xlog.

In the pg_controldata, I get:

Latest checkpoint's REDO location:0/3000108
Latest checkpoint's REDO WAL file:00010003
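The removal itself is expected: at each checkpoint postgres recycles or deletes segments older than the one holding the checkpoint's REDO location, so once the REDO file advanced past segment 2, that segment became eligible. A simplified sketch of that rule (Python; full 24-hex-digit segment names on a single timeline are assumed, and the helper name is made up):

```python
def removable_segments(pg_xlog_names, redo_wal_file):
    """Segments a checkpoint may recycle or remove: everything strictly
    older than the segment holding the checkpoint's REDO location.

    Simplified sketch: plain WAL segment names are 24 hex digits, so a
    lexicographic comparison works within a single timeline; .backup
    and status files are ignored here.
    """
    return sorted(
        name for name in pg_xlog_names
        if len(name) == 24
        and all(c in "0123456789ABCDEF" for c in name)
        and name < redo_wal_file
    )
```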

When I try to run pg_rewind, I'm getting:

pg_rewind -D postgresql0 --source-server="host=127.0.0.1 port=5433
dbname=postgres"
The servers diverged at WAL position 0/360 on timeline 1.
could not open file "data/postgresql0/pg_xlog/00010002":
No such file or directory

Note that this problem happens not 100% of time during the tests,
sometimes pg_rewind can actually rewind the former master.

I know I can copy the segment back from the archive, but I'd like to
avoid putting this logic into the failover tool if possible. Is there
anything we can do to avoid the problem described above, or is there a
better way to bring up the former master after the crash with pg_rewind?
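If the tool did have to carry that logic itself, the copy-back step is small. A hedged sketch (Python, stdlib only; the flat layout of the archive directory and the helper name are assumptions):

```python
import shutil
from pathlib import Path

def restore_segment_from_archive(datadir, archive_dir, segment):
    """Copy a WAL segment back from the archive into pg_xlog if postgres
    has already removed it, so pg_rewind can read it. Returns True if a
    copy was made, False if the segment was already in place.
    """
    target = Path(datadir) / "pg_xlog" / segment
    if target.exists():
        return False  # nothing to do
    source = Path(archive_dir) / segment
    if not source.exists():
        raise FileNotFoundError(f"{segment} missing from archive too")
    shutil.copy2(source, target)
    return True
```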

Kind regards,
--
Oleksii Kliukin


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] REVOKE [ADMIN OPTION FOR] ROLE

2015-07-29 Thread Oleksii Kliukin
On Tue, Jul 28, 2015 at 10:51 AM, Egor Rogov e.ro...@postgrespro.ru wrote:

 Well, I looked into a draft of SQL:2003. It basically says that CASCADE
 for the REVOKE role statement must behave the same way as for the REVOKE
 privilege statement. That is, from the standard's point of view we have a
 code issue.

 Still, I doubt the usefulness of this behavior. Do we really need it in
 PostgreSQL?

I think it's useful, since there are use cases where, instead of
granting privileges on specific classes of database objects (e.g.
SELECT on all tables in a given schema), a role is granted which
'accumulates' those privileges. This is the case in our environment,
and, while we are not using ADMIN OPTION, I can imagine it might be
used to 'delegate' the assignment of privileges from a central
authority to responsible sub-authorities in different departments.
Then, if you need to revoke those privileges (e.g. because the
structure of the departments has changed), it's enough to REVOKE ..
FROM .. CASCADE instead of going through each individual grant.

Kind regards,
--
Oleksii

