Re: [GENERAL] Multimaster

2016-04-19 Thread Konstantin Knizhnik



On 19.04.2016 15:56, Craig Ringer wrote:
On 18 April 2016 at 16:28, Konstantin Knizhnik
<k.knizh...@postgrespro.ru> wrote:



I intend to make the same split in pglogical itself - a
receiver and apply worker split. Though my intent is to have them
communicate via a shared memory segment until/unless the apply
worker gets too far behind and spills to disk.



In the case of multimaster, the "too far behind" scenario can never happen.


I disagree. In the case of tightly coupled synchronous multi-master it 
can't happen, sure. But that's hardly the only case of multi-master 
out there.


Sorry, it is just a matter of terminology. By multimaster I really mean
"synchronous multimaster", because from my point of view the main
characteristic of multimaster is symmetric access to all nodes. If there
is no guarantee that all cluster nodes have the same state, then, from my
point of view, it is not a multimaster at all. But I have not registered a
"multimaster" trademark, so I can not insist on this interpretation of the
term :)






2. Logical backup: transferring data to a different database
(including a new version of Postgres)


I think that's more HA than logical backup. It needs to be able to work
either synchronously or asynchronously, much like our current physical
replication.


Closely related, but not quite the same, are logical read replicas/standbys.


This is a use case from a real production system (a Galera use case). If
a customer wants to migrate data to a new data center, then multimaster
is one of the possible (and safest) ways to do it. You can redirect the
workload to the new data center step by step, transparently for users.



3. Change notification: there are many different subscribers that
may be interested in receiving notifications about database changes.


Yep. I suspect we'll want a json output plugin for this, separate to 
pglogical etc, but we'll need to move a bunch of functionality from 
pglogical into core so it can be shared rather than duplicated.


JSON is not an efficient format for this, and here performance may be
critical.



4. Synchronous replication: multimaster


"Synchronous multimaster". Not all multimastrer is synchronous, not 
all synchronous replication is multimaster.


We are not enforcing the order of commits as Galera does. Consistency
is enforced by the DTM, which ensures that transactions at all nodes
are given consistent snapshots and are assigned the same CSNs. We
also have a global deadlock detection algorithm which builds a
global lock graph (but false positives are still possible, because
this graph is built incrementally and so does not correspond to a
single global snapshot).


OK, so you're relying on a GTM to determine safe, conflict-free apply 
orderings.


I'm ... curious ... about how you do that. Do you have a global lock 
manager too? How do you determine ordering for things that in a 
single-master case are addressed via unique b-tree indexes, not (just) 
heavyweight locking?





We have tried both a DTM with a global arbiter (an analogue of the
Postgres-XL GTM) and a DTM based on timestamps. In the latter case there
is no centralized arbiter, but we are using "raftable" - yet another of
our plugins, which provides consistent distributed storage based on the
Raft protocol.

Using raftable we build the global deadlock graph from local subgraphs.
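
To illustrate the idea, here is a minimal sketch of the merge-and-detect
step. The data structures, limits, and function names are invented for
the example (the actual multimaster code exchanges subgraphs through
raftable and differs in detail): each node publishes its local wait-for
edges, the edges are merged, and a cycle is searched with a depth-first
search.

/* Sketch: merge local wait-for subgraphs and look for a cycle.
 * Invented structures and names, not the actual multimaster code. */
#include <stdbool.h>
#include <string.h>

#define MAX_XACTS 1024          /* arbitrary limit for the sketch */

typedef struct
{
    int nedges;
    int from[MAX_XACTS];        /* transaction from[i] waits for to[i] */
    int to[MAX_XACTS];
} WaitGraph;

static bool adj[MAX_XACTS][MAX_XACTS];
static int  color[MAX_XACTS];   /* 0 = unvisited, 1 = on stack, 2 = done */

/* Merge the edges published by one node into the global graph. */
static void
merge_local_graph(const WaitGraph *g)
{
    for (int i = 0; i < g->nedges; i++)
        adj[g->from[i]][g->to[i]] = true;
}

/* Reaching a vertex that is still on the DFS stack closes a cycle. */
static bool
dfs(int v)
{
    color[v] = 1;
    for (int w = 0; w < MAX_XACTS; w++)
    {
        if (!adj[v][w])
            continue;
        if (color[w] == 1 || (color[w] == 0 && dfs(w)))
            return true;
    }
    color[v] = 2;
    return false;
}

/* True if the merged graph contains a cycle, i.e. a suspected deadlock. */
static bool
has_deadlock(void)
{
    memset(color, 0, sizeof(color));
    for (int v = 0; v < MAX_XACTS; v++)
        if (color[v] == 0 && dfs(v))
            return true;
    return false;
}

Because each local subgraph is captured at a different moment, the merged
graph is not a consistent global snapshot: a detected cycle may contain
edges that never existed simultaneously, which is exactly the source of
the false positives mentioned above.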




Or one needs to somehow add the original DDL statements to the log.


Actually you need to be able to add normalized statements to the xlog. 
The original DDL text isn't quite good enough due to issues with 
search_path among other things. Hence DDL deparse.


Yes, for the general case we need some DBMS-independent representation of
DDL.

But for multimaster's needs the original SQL statement will be enough.





--
 Craig Ringer http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: [GENERAL] Multimaster

2016-04-18 Thread Konstantin Knizhnik
... we need to apply changes in exactly the same order as at the source
node. In this case the content of the target (recovered) node should be
the same as that of the source node.



We also need 2PC support, but this code was sent to you by
Stas, so I hope that sometime it will be included in
PostgreSQL core and the pglogical plugin.


I never got a response to my suggestion that testing of upstream
DDL is needed for that. I want to see more on how you plan to
handle DDL on the upstream side that changes the table structure
and acquires strong locks. Especially when it's combined with row
changes in the same prepared xacts.


We are now replicating DDL in a way similar to the one used in
BDR: DDL statements are inserted into a special table and are
replayed at the destination node as part of the transaction.

We also have an alternative implementation, done by Artur Zakirov
<a.zaki...@postgrespro.ru>, which is using custom WAL records:
https://gitlab.postgrespro.ru/pgpro-dev/postgrespro/tree/logical_deparse
The patch for custom WAL records was committed in 9.6, so we are
going to switch to this approach.


How does that really improve anything over using a table?


It is a more straightforward approach, isn't it? You can either try to
restore DDL from the low-level sequence of system catalog updates - but
that is difficult and not always possible - or you need to somehow add
the original DDL statements to the log.
That can be done using some special table, or by storing this information
directly in the log (if custom WAL records are supported).
Certainly, in the latter case the logical protocol should be extended to
support playback of user-defined WAL records.

But it seems to be a universal mechanism which can be used not only for DDL.

I agree that custom WAL records add no performance or functionality
advantages over using a table.
This is why we still haven't switched to them. But IMHO the approach of
inserting DDL (or any other user-defined information) into a special
table looks like a hack.
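
To make the WAL-based alternative concrete: one in-core facility along
these lines that was committed for 9.6 is logical decoding messages
(LogLogicalMessage in replication/message.h). Whether our logical_deparse
patch uses this exact mechanism is an assumption here; a minimal sketch
of routing DDL through it, with an invented prefix and wrapper, might
look like this:

/* Sketch: write a DDL statement into WAL as a transactional logical
 * decoding message instead of inserting it into a table.  The prefix
 * and the wrapper function are hypothetical; LogLogicalMessage() is
 * the real 9.6 API from replication/message.h. */
#include "postgres.h"
#include "replication/message.h"

#define MM_DDL_PREFIX "mm_ddl"      /* hypothetical message prefix */

void
mm_log_ddl(const char *ddl)
{
    /* transactional = true: the message is decoded only when the
     * surrounding transaction commits, so the DDL is replayed as part
     * of the transaction, just like the special-table approach. */
    LogLogicalMessage(MM_DDL_PREFIX, ddl, strlen(ddl) + 1, true);
}

On the subscriber side, the output plugin receives such messages through
its message callback and can execute the statement before applying the
subsequent row changes of the same transaction.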




This doesn't address what I asked above though, which is whether you 
have tried doing ALTER TABLE in a 2PC xact with your 2PC replication 
patch, especially one that also makes row changes.


Well, recently I made an attempt to merge our code with the
latest version of the pglogical plugin (because our original
implementation of multimaster was based on code partly taken
from BDR), but finally had to postpone most of the changes. My
primary intention was to support metadata caching, but the
presence of multiple apply workers makes it impossible to
implement it in the same way as it is done in the pglogical plugin.


Not with a simplistic implementation of multiple workers that just
process transactions round-robin, no. Your receiver will have to be
smart enough to read the protocol stream and write the metadata
changes to a separate stream all the workers read. Which is awkward.


I think you'll probably need your receiver to act as a metadata broker 
for the apply workers in the end.


Also, the pglogical plugin now contains a lot of code which
performs mapping between the source and target database schemas,
so it is assumed that they may differ.
But that is not true in the case of multimaster, and I do not want
to pay extra cost for functionality we do not need.


All it's really doing is mapping upstream to downstream tables by 
name, since the oids will be different.


Really?
Why then do you send all table metadata (information about attributes)
and handle invalidation messages?
What is the purpose of the "mapping to local relation, filled as needed"
fields in PGLogicalRelation if you are not going to perform such mapping?


Multimaster really needs to map remote OIDs to local ones. But we do not
need to provide any attribute mapping or handle catalog invalidations.
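
As a sketch of how little state this actually requires (the names below
are invented, not our code): a backend-local dynahash table mapping the
remote relation OID to the local one is enough when all nodes have
identical schemas, since an entry can be resolved once by relation name
and then reused.

/* Sketch of the OID mapping multimaster needs: a hash table from the
 * remote relation OID to the local one, built on PostgreSQL's dynahash.
 * Illustrative names, not the actual multimaster code. */
#include "postgres.h"
#include "utils/hsearch.h"

typedef struct RelOidMapEntry
{
    Oid remote_relid;   /* hash key: OID on the origin node */
    Oid local_relid;    /* corresponding OID on this node */
} RelOidMapEntry;

static HTAB *rel_oid_map = NULL;

static void
rel_oid_map_init(void)
{
    HASHCTL ctl;

    memset(&ctl, 0, sizeof(ctl));
    ctl.keysize = sizeof(Oid);
    ctl.entrysize = sizeof(RelOidMapEntry);
    rel_oid_map = hash_create("multimaster relid map", 128, &ctl,
                              HASH_ELEM | HASH_BLOBS);
}

/* Remember a remote->local pair, e.g. resolved once by relation name. */
static void
rel_oid_map_put(Oid remote_relid, Oid local_relid)
{
    bool            found;
    RelOidMapEntry *entry;

    entry = hash_search(rel_oid_map, &remote_relid, HASH_ENTER, &found);
    entry->local_relid = local_relid;
}

/* Look up the local OID for a remote one; InvalidOid if unknown. */
static Oid
rel_oid_map_get(Oid remote_relid)
{
    RelOidMapEntry *entry;

    entry = hash_search(rel_oid_map, &remote_relid, HASH_FIND, NULL);
    return entry ? entry->local_relid : InvalidOid;
}

In this sketch the map is filled once and never invalidated, which
matches the symmetric-schema assumption above.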


--

 Craig Ringer http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: [GENERAL] Multimaster

2016-04-14 Thread konstantin knizhnik

On Apr 14, 2016, at 8:41 AM, Craig Ringer wrote:

> On 1 April 2016 at 19:50, Konstantin Knizhnik  
> wrote:
> 
> Right now the main problem is parallel apply: we need to apply changes 
> concurrently to avoid unintended dependencies causing deadlocks and provide 
> reasonable performance.
> 
> How do you intend to approach that?

Actually we already have a working implementation of multimaster...
There is a pool of pglogical executors. pglogical_receiver just reads a
transaction body from the connection and appends it to a ready-for-execution
queue. Any vacant worker from this pool can dequeue this work and process it.
To provide the correct replication session context, I have to set up a
replication session for each transaction. It looks clumsy but fortunately has
no effect on performance.

With this approach it is certainly possible that the order of applying
transactions will not be the same on different nodes.
But that is not a problem if we have a DTM. The only exception is recovery of
a multimaster node: in this case we have to apply transactions in exactly the
same order as they were applied on the node we are recovering from. This is
done by applying changes in recovery mode in pglogical_receiver itself.
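
The queueing logic itself is simple. Below is a sketch of the
receiver/worker handshake as a bounded producer-consumer queue; the real
implementation lives in shared memory and uses PostgreSQL background
workers rather than pthreads, and all names here are invented.

/* Simplified model of the receiver/worker-pool split: one producer (the
 * receiver) appends transaction bodies to a queue, N consumers (apply
 * workers) dequeue and apply them.  Sketch only. */
#include <pthread.h>

#define QUEUE_SIZE 64

typedef struct
{
    char   *items[QUEUE_SIZE];   /* serialized transaction bodies */
    int     head, tail, count;
    pthread_mutex_t lock;
    pthread_cond_t  not_empty;
    pthread_cond_t  not_full;
} TxQueue;

static TxQueue queue = {
    .lock = PTHREAD_MUTEX_INITIALIZER,
    .not_empty = PTHREAD_COND_INITIALIZER,
    .not_full = PTHREAD_COND_INITIALIZER
};

/* Receiver side: block if the pool is too far behind. */
void
enqueue_transaction(char *body)
{
    pthread_mutex_lock(&queue.lock);
    while (queue.count == QUEUE_SIZE)
        pthread_cond_wait(&queue.not_full, &queue.lock);
    queue.items[queue.tail] = body;
    queue.tail = (queue.tail + 1) % QUEUE_SIZE;
    queue.count++;
    pthread_cond_signal(&queue.not_empty);
    pthread_mutex_unlock(&queue.lock);
}

/* Worker side: any vacant worker picks up the next transaction. */
char *
dequeue_transaction(void)
{
    char *body;

    pthread_mutex_lock(&queue.lock);
    while (queue.count == 0)
        pthread_cond_wait(&queue.not_empty, &queue.lock);
    body = queue.items[queue.head];
    queue.head = (queue.head + 1) % QUEUE_SIZE;
    queue.count--;
    pthread_cond_signal(&queue.not_full);
    pthread_mutex_unlock(&queue.lock);
    return body;
}

One property worth noting: when the queue is full the receiver blocks,
which naturally throttles the upstream connection instead of letting the
apply pool fall arbitrarily far behind.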

> 
> You should be able to do parallel apply between nodes trivially, i.e. node A 
> applies changes in parallel from nodes B C and D. 
> 
> Doing parallel apply of multiple changes from node A to node B is much 
> harder. I wrote about parallel logical apply separately so I won't repeat it 
> here; search the archives for the notes if interested.
>  
> We also need 2PC support but this code was sent to you by Stas, so I hope 
> that sometime it will be included in PostgreSQL core and pglogical plugin.
> 
> I never got a response to my suggestion that testing of upstream DDL is 
> needed for that. I want to see more on how you plan to handle DDL on the 
> upstream side that changes the table structure and acquires strong locks. 
> Especially when it's combined with row changes in the same prepared xacts. 

We are now replicating DDL in a way similar to the one used in BDR: DDL
statements are inserted into a special table and are replayed at the destination
node as part of the transaction.
We also have an alternative implementation, done by Artur Zakirov, which is
using custom WAL records:
https://gitlab.postgrespro.ru/pgpro-dev/postgrespro/tree/logical_deparse
The patch for custom WAL records was committed in 9.6, so we are going to switch
to this approach.

Right now we are trying to run all the Postgres regression tests against
multimaster. Currently about 50 tests fail. There are a lot of different
issues: with temporary tables, GUC variables, ...
Some we have fixed; investigation of the others is in progress...


> 
> I'd really like separate patches for the core changes and the pglogical 
> support for them, too.
>  
> There are also some minor technical issues which lead us to making few 
> changes in pglogical code but we tried to do our best to keep original 
> versions unchanged, so we can switch to public version in future.
> 
> Details?
>  
> Now is exactly the time to address those points.
> 

Well, recently I made an attempt to merge our code with the latest version of
the pglogical plugin (because our original implementation of multimaster was
based on code partly taken from BDR), but finally had to postpone most of the
changes. My primary intention was to support metadata caching, but the presence
of multiple apply workers makes it impossible to implement it in the same way
as it is done in the pglogical plugin.

Also, the pglogical plugin now contains a lot of code which performs mapping
between the source and target database schemas, so it is assumed that they may
differ.
But that is not true in the case of multimaster, and I do not want to pay extra
cost for functionality we do not need.

Frankly speaking, I do not want to spend much time on integration of multimaster
with the pglogical plugin right now, because both of them are being actively
developed. We should definitely do it, but only when we have a stable version,
so there is no need to repeat this work over and over again.

We can try to prepare our "wish list" for the pglogical plugin.

> 
> -- 
>  Craig Ringer   http://www.2ndQuadrant.com/
>  PostgreSQL Development, 24x7 Support, Training & Services



Re: [GENERAL] Multimaster

2016-04-01 Thread Konstantin Knizhnik

Hi Simon,

Yes, we will distribute all our code under the PostgreSQL license.
Right now we are using a copy of the pglogical_output plugin plus a
receiver part written based on BDR code (just because when we started
work on multimaster the pglogical plugin was not yet released).
We plan to eliminate the copies of the pglogical plugin files from our
code and reference the pglogical plugin instead.


Right now the main problem is parallel apply: we need to apply changes
concurrently to avoid unintended dependencies causing deadlocks and to
provide reasonable performance.


We also need 2PC support, but this code was sent to you by Stas, so I
hope that sometime it will be included in PostgreSQL core and the
pglogical plugin.


There are also some minor technical issues which led us to make a few
changes in the pglogical code, but we tried our best to keep the original
versions unchanged, so that we can switch to the public version in the
future.





On 01.04.2016 14:38, Simon Riggs wrote:
On 1 April 2016 at 11:33, Konstantin Knizhnik
<k.knizh...@postgrespro.ru> wrote:


Our company PostgresPRO is now developing multimaster for
Postgres, an analogue of MySQL Galera.
It is based on logical replication and the pglogical plugin from
2ndQuadrant, and uses a distributed transaction manager.


Hi Konstantin,

Is this open source with The PostgreSQL Licence?

Will you be contributing those changes to the BDR project, or is this 
a permanent fork of that?


Thanks

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



[GENERAL] Multimaster

2016-04-01 Thread Konstantin Knizhnik

Hi,

Our company PostgresPRO is now developing multimaster for Postgres, an
analogue of MySQL Galera.
It is based on logical replication and the pglogical plugin from
2ndQuadrant, and uses a distributed transaction manager.
The main differences of multimaster from standard PostgreSQL streaming 
replication with hot standby are:


1. There is no time gap between master and slave (which is currently
present even for synchronous streaming replication).
2. It is possible to execute any transaction on any node. With hot
standby you can only execute read-only queries on replicas and,
moreover, you can not create temporary tables to store the results of
complex queries. With hot standby you have to change your application by
separating read-only and read-write transactions, or use special proxies
like PgBouncer.
3. Multimaster provides high availability out of the box: it is able to
detect node failure, perform automatic recovery, and return nodes to the
cluster. Unlike HA stacks based on streaming replication, multimaster
doesn't require any external tools like Corosync and Pacemaker.


Unlike bidirectional replication, multimaster provides distributed
consistency: all transactions will always see consistent snapshots.
If transactions on different nodes try to update the same data, the
conflicts will be detected in the same way as in standalone Postgres.
No custom conflict resolution is required.
Multimaster uses two-phase commit, so a transaction is either applied
on all nodes or on none. Thus the data on all online nodes is identical.
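
For illustration, the commit protocol follows the classic two-phase
pattern, which can be sketched with plain libpq and the standard PREPARE
TRANSACTION / COMMIT PREPARED commands. The host names, the 'accounts'
table, and the GID 'mm-tx-1' are placeholders, error handling is minimal,
and multimaster drives 2PC internally rather than from a client like
this; it also requires max_prepared_transactions > 0 on every node.

/* Two-phase commit across three nodes: the transaction becomes durable
 * everywhere via PREPARE TRANSACTION, and only then is committed
 * everywhere (or rolled back everywhere). */
#include <stdio.h>
#include <libpq-fe.h>

#define NNODES 3

static int
exec_ok(PGconn *conn, const char *sql)
{
    PGresult *res = PQexec(conn, sql);
    int ok = (PQresultStatus(res) == PGRES_COMMAND_OK);

    PQclear(res);
    return ok;
}

int
main(void)
{
    const char *conninfo[NNODES] = { "host=node1", "host=node2", "host=node3" };
    PGconn     *conn[NNODES];
    int         prepared_ok = 1;

    for (int i = 0; i < NNODES; i++)
        conn[i] = PQconnectdb(conninfo[i]);

    /* Run the transaction on every node. */
    for (int i = 0; i < NNODES; i++)
    {
        exec_ok(conn[i], "BEGIN");
        exec_ok(conn[i],
                "UPDATE accounts SET balance = balance - 1 WHERE id = 42");
    }

    /* Phase 1: make the transaction durable everywhere.  A failed
     * PREPARE also aborts that node's open transaction. */
    for (int i = 0; i < NNODES; i++)
        if (!exec_ok(conn[i], "PREPARE TRANSACTION 'mm-tx-1'"))
            prepared_ok = 0;

    /* Phase 2: commit everywhere, or roll back everywhere.  On a node
     * where PREPARE failed the ROLLBACK PREPARED simply errors out and
     * is ignored in this sketch. */
    for (int i = 0; i < NNODES; i++)
        exec_ok(conn[i], prepared_ok ? "COMMIT PREPARED 'mm-tx-1'"
                                     : "ROLLBACK PREPARED 'mm-tx-1'");

    for (int i = 0; i < NNODES; i++)
        PQfinish(conn[i]);
    return prepared_ok ? 0 : 1;
}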


Multimaster doesn't provide:
1. Write scalability: the changes have to be applied to all nodes, so
there can not be any write performance gain over a standalone server. If
you need write scalability, use sharding; in that case you have to use
Postgres-XL, Greenplum, pg_shard, or some application-level solution.
2. Optimization of complex analytic queries using distributed query
execution. Multimaster doesn't build distributed execution plans and can
not speed up the execution of a single heavy query.
3. Geo-replication. Although it is possible to have multimaster nodes in
different data centers, in practice this will lead to very low write
performance. For such scenarios asynchronous BDR is a much better solution.



So multimaster may be efficient for providing high availability and load 
balancing when most of your queries are read-only.


It will be interesting for us to get users' feedback and to collect
possible use cases and workloads for multimaster.
The performance of multimaster greatly depends on the relative share of
update queries and on transaction size (most of the multimaster overhead
is related to transaction processing). For simple pgbench-like queries,
the performance of a three-node multimaster is about two times lower
than that of standalone Postgres on update queries, and about two times
higher on read-only queries.
For complex analytic queries (like TPC-H) the results are much better,
and multimaster provides almost linear scalability.
But multimaster is not intended for read-only OLAP: there are more
efficient solutions for that.
We expect the most performance benefit for OLTP with a high ratio of
read-only queries.
This is why we try to find workloads for multimaster that are as close
to real life as possible. Any references to benchmarks, examples of
queries, scenarios, etc. are welcome.


Also we would like to get feedback for the following questions:

1. Currently logical replication requires a primary key on each
replicated table (otherwise updates can not be propagated).

How critical is this limitation for potential multimaster use cases?

2. Is it necessary to support local (not replicated) tables?

3. Right now multimaster replicates the whole instance. Alternatively,
we could implement database-level replication, making it possible to
provide more flexible configurations where different databases are
connected into different clusters. It is hard to estimate now how
difficult it would be to support such a mode, but first of all we want
to know whether this feature is really needed.


4. We plan to implement sharding support. But first we have to
understand what people want from a cluster in the first place: HA, load
balancing, distributed queries, etc. In particular, what use cases do
you see for multimaster without sharding?


5. What is the preferred way of cluster configuration: through
postgresql.conf, or using some special API allowing nodes to be
dynamically added to and removed from the cluster?


Thanks in advance,

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] [HACKERS] Threads in PostgreSQL

2015-12-20 Thread Konstantin Knizhnik

Hi,

PostgreSQL does not use threads, but it is possible to spawn threads in
your PostgreSQL extensions.

For example, I have used a pool of threads in my IMCS extension.
But you need to build your extension with -pthread:

CUSTOM_COPT = -pthread

Also, please take into account that many PostgreSQL functions (even
in/out or comparison functions) are not reentrant: they store their
state in global variables.
So you will get race conditions if you call such functions from
multiple threads.
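
A minimal sketch of the safe pattern (function and variable names are
invented; this is not the IMCS code): the threads perform pure
computation on plain C data and are joined before the function returns,
so no PostgreSQL function is ever called outside the main thread.

/* Extension function that fans work out to a small thread pool.  The
 * worker threads touch only plain C data; all interaction with the
 * backend (arguments, return value) happens in the main thread. */
#include "postgres.h"
#include "fmgr.h"
#include <pthread.h>

PG_MODULE_MAGIC;

#define NWORKERS 4
#define NELEMS   1000000

typedef struct
{
    double *data;
    int     from, to;
    double  partial_sum;
} WorkerArg;

/* Thread body: plain C arithmetic only, no backend calls. */
static void *
worker_main(void *arg)
{
    WorkerArg *w = (WorkerArg *) arg;
    double     sum = 0.0;

    for (int i = w->from; i < w->to; i++)
        sum += w->data[i];
    w->partial_sum = sum;
    return NULL;
}

PG_FUNCTION_INFO_V1(parallel_sum);

Datum
parallel_sum(PG_FUNCTION_ARGS)
{
    static double data[NELEMS];     /* stand-in for real extension data */
    pthread_t  threads[NWORKERS];
    WorkerArg  args[NWORKERS];
    double     total = 0.0;

    for (int i = 0; i < NELEMS; i++)
        data[i] = i;

    for (int i = 0; i < NWORKERS; i++)
    {
        args[i].data = data;
        args[i].from = i * NELEMS / NWORKERS;
        args[i].to = (i + 1) * NELEMS / NWORKERS;
        pthread_create(&threads[i], NULL, worker_main, &args[i]);
    }

    /* Join all threads before returning control to the backend. */
    for (int i = 0; i < NWORKERS; i++)
    {
        pthread_join(threads[i], NULL);
        total += args[i].partial_sum;
    }

    PG_RETURN_FLOAT8(total);
}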


Concerning the stack overflow, I think the most probable cause is
trivial infinite recursion.

Did you inspect the stack trace in a debugger?






On 21.12.2015 09:21, sri harsha wrote:

Hi,

   Is it possible to use threads in PostgreSQL? I am using threads
in my foreign data wrapper, and I get the following error when I use
them:


ERROR:  stack depth limit exceeded
HINT:  Increase the configuration parameter "max_stack_depth"
(currently 2048kB), after ensuring the platform's stack depth limit is
adequate.


No matter how much I increase the stack size, the error keeps
occurring. How do I solve this problem?



Thanks,
Harsha