[PATCHES] Replication Documentation

Chris Browne Tue, 01 Aug 2006 13:05:46 -0700

Here's a patch to add in the material on replication recently
discussed on pgsql.docs.  I'm not thrilled that there were only a few
comments made; I'd be happy to see "slicing and dicing" to see this
made more useful.


Index: filelist.sgml
===================================================================
RCS file: /projects/cvsroot/pgsql/doc/src/sgml/filelist.sgml,v
retrieving revision 1.44
diff -c -u -r1.44 filelist.sgml
--- filelist.sgml       12 Sep 2005 22:11:38 -0000      1.44
+++ filelist.sgml       1 Aug 2006 20:00:00 -0000
@@ -44,6 +44,7 @@
 <!entity config        SYSTEM "config.sgml">
 <!entity user-manag    SYSTEM "user-manag.sgml">
 <!entity wal           SYSTEM "wal.sgml">
+<!entity replication   SYSTEM "replication.sgml">
 
 <!-- programmer's guide -->
 <!entity dfunc      SYSTEM "dfunc.sgml">
Index: postgres.sgml
===================================================================
RCS file: /projects/cvsroot/pgsql/doc/src/sgml/postgres.sgml,v
retrieving revision 1.77
diff -c -u -r1.77 postgres.sgml
--- postgres.sgml       10 Mar 2006 19:10:48 -0000      1.77
+++ postgres.sgml       1 Aug 2006 20:00:00 -0000
@@ -155,6 +155,7 @@
   &diskusage;
   &wal;
   &regress;
+  &replication;
 
  </part>
 
----   Then add the following as .../doc/src/sgml/replication.sgml

<!-- $PostgreSQL$ -->

<chapter id="replication"> <title> Replication </title>
  
  <indexterm><primary>replication</primary></indexterm>

  <para> People frequently ask about what replication options are
  available for <productname>PostgreSQL</productname>.  Unfortunately,
  there are so many approaches and models to this that are useful for
  different purposes that things tend to get confusing.
  </para>

  <para> At perhaps the most primitive level, one might use <xref
  linkend="backup"> tools, whether <xref linkend="app-pgdump"> or
  <xref linkend="continuous-archiving"> to create additional copies of
  databases.  This <emphasis>doesn't</emphasis> provide any way to
  keep the replicas up to date; to bring the state of things to a
  different point in time requires bringing up another copy.  There is
  no way, with these tools, for updates on a <quote>master</quote>
  system to automatically propagate to the replicas.</para>

  <sect1> <title> Categorization of Replication Systems </title>

   <para> Looking at replication systems, there are a number of ways in
    which they may be viewed:

    <itemizedlist>

     <listitem><para> Single master versus multimaster.</para>

      <para> That is, whether there is a single database considered
      <quote>master</quote>, where all update operations are required
      to be submitted, or the alternative, multimaster, where updates
      may be submitted to any of several databases.</para>

      <para> Multimaster replication is vastly more complex and
      expensive, because of the need to deal with the possibility of
      conflicting updates.  The simplest example of this is where a
      replicated database manages inventory; the question is, what
      happens when requests go to different database nodes requesting
      a particular piece of inventory?</para>

      <para> Synchronous multimaster replication introduces the need
      to distribute locks across the systems, which, in research work
      done with Postgres-R and Slony-II, has proven to be very
      expensive. </para></listitem>

     <listitem><para> Synchronous versus asynchronous</para>

      <para>Synchronous systems are ones where updates must be
      accepted on all the databases before they are permitted to
      <command>COMMIT</command>. </para>

      <para> Asynchronous systems propagate updates to the other
      databases later.  This permits the possibility that one database
      may have data significantly behind others.  Whether or not being
      behind is acceptable or not will depend on the nature of the
      application.</para>

      <para> Asynchronous multimaster replication introduces the
      possibility that conflicting updates will be accepted by
      multiple nodes, as they don't know, at <command>COMMIT</command>
      time, that the updates conflict.  It is then necessary to have
      some sort of conflict resolution system, which can't really be
      generalized as a generic database facility.  An instance of this
      that is commonly seen is in the <productname>PalmOS
      HotSync</productname> system; the <quote>general policy</quote>
      when conflicts are noticed is to allow both conflicting records
      to persist until a human can intervene.  That may be quite
      acceptable for an address book; it's <emphasis>not</emphasis>
      fine for OLTP systems. </para>

     </listitem>

     <listitem><para> Update capture methods  </para>

      <para> Common methods include having triggers on tables,
      capturing SQL statements, and capturing transaction log (WAL)
      updates </para>

      <itemizedlist>

       <listitem><para> Triggers, as used in eRServer and Slony-I,
       have the advantage of capturing updates at the end of
       processing when all column values have been finalized.  The use
       of transaction visibility (MVCC) and ordering can provide
       strong guarantees on consistency. </para>
    
        <para> Of course, firing a trigger for each tuple update comes
        at a not inconsiderable cost: a statement that touches 10,000
        tuples will fire the trigger 10,000 times, and transform, on
        the subscriber, into 10,000 SQL statements.</para></listitem>
      
       <listitem><para> Statement capture almost exactly reverses the
       issues, as compared to triggers.</para>

        <para> There are no strong guarantees on consistency: any sort
        of nondeterministic query can <quote>corrupt</quote> things by
        introducing differences between nodes.  Here are four examples
        of cases where naive statement capture is sure to get things
        wrong:</para>

        <itemizedlist>
         <listitem><para><command>INSERT INTO mytable (txntime,
         product, quantity, taxes, total) values (now(), 'AB-275', 10,
         45, 250.00);</command></para> <para> Some replication systems
         parse the queries, replacing date requests with
         timestamps. </para>
         </listitem>
         <listitem><para><command>INSERT INTO table2 (random() *
            50);</command></para> <para> In this case, nondeterminism is
           fairly much the point!</para>
         </listitem>
         <listitem><para>Any use of sequnce values as defaults,
         particularly with per-connection value cacheing, will open up
         occasions for values to diverge between
         nodes.</para></listitem> 

         <listitem><para><command>INSERT INTO tab1 (txn_type, tdate,
         quantity, units, price) SELECT * FROM tab2 ORDER BY txn_type
         limit 50;</command></para>

          <para> There are many variations on this which will turn out
          badly: </para>
          <itemizedlist>
           <listitem><para>If there are default fields in tab1
             that are set using sequences, the only way to even
             hope for the same ordering is to have
             an <command>ORDER BY</command> clause that ensures
             identical ordering on both hosts.</para></listitem>
           <listitem><para> If the ordering isn't a suitable
             total ordering, the requests for data from tab2 may
             find different data on different
             hosts.</para></listitem>
           <listitem><para>Columns with a default
             of <function>now()</function> will be troublesome as
             mentioned earlier, and this makes the problem harder
             because unlike in the earlier query, where one might
             substitute '2006-09-02 04:42:23-00'
             for <function>now()</function>, this requires a
             substantial rewriting of the query.</para></listitem>
          </itemizedlist>
         </listitem>
        </itemizedlist>
        
       </listitem>
      </itemizedlist>
      
     </listitem>
     
    </itemizedlist>
    
   </para>
   
  </sect1>
  
  <sect1 id="replicationsystems"> <title> PostgreSQL Replication Systems and 
Their Uses </title>

   <para> Based on the preceding taxonomy, we may categorize various
   replication systems, which should be helpful in determining what
   they may be best used for, and whether they are compatible with
   your <quote>use case.</quote></para>

   <sect2><title> Slony-I</title>

    <para> Slony-I is a single-master to multiple subscriber
    asynchronous replication system that captures updates using
    triggers. </para>

    <para> For many systems, it is not clear how to initialize
    replication on a new node some time after a system has been set up
    in production.  Slony-I was specifically designed to provide the
    ability to introduce new nodes without the need to interrupt
    activity on the master node.  </para>

    <para> It has, a particular merit, that, by only using components
    internal to PostgreSQL, it is compatible with multiple versions of
    PostgreSQL. This lends it especially to assisting at upgrading
    systems from one version of PostgreSQL to another without
    requiring a long outage. </para>

    <para> It suffers from three particular problems:</para>

    <itemizedlist>
     <listitem><para> Despite improvements from earlier versions, it
     is fairly complex to configure and administer.</para></listitem>
     <listitem><para> It can only replicate changes that can be
     captured using triggers. </para> 

      <para> There is a handling for sequences, which comes via
      polling, but Slony-I <emphasis>does not</emphasis> provide an
      automatic way to replicate other sorts of objects. </listitem>

     <listitem><para> The handling of DDL changes is somewhat fragile,
     and exists as something of a bag on the side. </para>

      <para> There has been loose discussion as to how to address
      that; useful comprehensive answers have not emerged.
     </listitem>
    </itemizedlist>

    <sect3> <title> Use Cases </title>

     <para> Slony-I has proven useful for the following sorts of usages: </para>
     <itemizedlist>
      
      <listitem><para> Upgrading from one PostgreSQL release to
      another with only brief downtime. </para></listitem>

      <listitem><para> Providing extra database copies that are nearly
      up to date that may be used to offload read activity from the
      <quote>master</quote> database system. </para></listitem>

      <listitem><para> Providing extra database copies that are nearly
      up to date that may be used as failover targets. </para>
      </listitem>

     </itemizedlist>

   </sect2>
   
   <sect2><title> pgpool </title>
    
    <para> <application>pgpool</application> was initially created by
    Tatsuo Isshii as a portable alternative to Java connection pool
    modules.  He subsequently observed that it wouldn't take very much
    effort to extend it to create a simple replication system: if it
    is forwarding SQL queries to a PostgreSQL instance, extending that
    to two databases is very straightforward. </para>

    <para> It suffers, by nature, from the problems associated with
    replicating using capture of SQL statements; any sort of
    nondeterminism in the replicated statements will cause the
    databases to diverge. </para>

    <para> On the other hand, it is very easy to install and
    configure; for users with simple requirements, that can
    suffice. </para>

    <para> A <application>pgpool-2</application> is under way which
    introduces a more sophisticated query parser to try to address the
    nondeterminism issues; that may limit ongoing support for the
    legacy version.</para>

    <sect3> <title> Use Cases </title>

     <para> pgpool has proven useful for the following sorts of usages: </para>
     <itemizedlist>
      
      <listitem><para> Dividing read-only database activity between
      two database instances. </para></listitem>

      <listitem><para> Providing a simple replication system for
      systems that do not make use of nondeterministic update
      queries. </para></listitem>

     </itemizedlist>

    </sect3>

   </sect2>
   
   <sect2> <title> PITR - Point In Time Recovery </title>

    <para> If you have a database cluster that supports a large number
    of database instances (<emphasis>e.g.</emphasis> - varying values
    for PGDATABASE), connection-managing systems like pgpool and
    systems like Slony-I which require a manager process for each
    database for each node that is replicated will turn out quite
    badly.</para>

    <para> For instance, if you have a database cluster that hosts 300
    databases, as would be the case in a "web hosting" situation, for
    Slony-I to replicate all of this data, it would have to have 300
    slon processes for each node.  </para>

    <para> PITR is likely to be more suitable in this case; that
    doesn't provide you with a usable replica running, but it can
    recover <emphasis>all</emphasis> of the tables in
    <emphasis>all</emphasis> of the databases on the backend.</para>

   </sect2>

   <sect2> <title> Postgres-R </title>

    <para> This has been a research project at McGill University,
    building a multimaster synchronous replication system which uses a
    group communications system (<emphasis>e.g.</emphasis> - <ulink
    url="http://www.spread.org/";> Spread</ulink>) to control
    propagation of update requests, which it captures via adding
    <quote>hooks</quote> to the database engine to detect
    changes. </para>

    <para> Being a research project, the key has been to learn about
    replication as opposed to provide a <quote> production grade
    </quote> replication system.  For a considerable period of time it
    was only at all usable on rather old releases of PostgreSQL; it is
    now available for recent releases. </para>

    <para> The handling of DDL changes has long been somewhat
    controversial; several attempts to implement DDL handlers have
    been made, none of which has yet <quote>stuck.</quote> </para>

    <para> The Slony-II project inherited directly from Postgres-R,
    with an intent to create a multimaster synchronous replication
    system atop a group communications system, but then to proceed to
    something more of <quote>production grade</quote>. </para>

    <para> The notable distinction from Postgres-R was that, in order
    to find conflicts earlier, and to diminish the amount of work
    needing to be done at the synchronization point, Slony-II would
    try to publish and promote lock requests as soon as possible.  (It
    is possible for this to worsen behaviour in some cases.)</para>

    <para> Unfortunately several problems emerged: </para>

    <itemizedlist>

     <listitem><para> The available open source group communications
     systems turn out to neither be fast enough nor reliable enough
     for the purpose. </para></listitem>

     <listitem><para> One of the goals was for there to be as little
     need as possible to modify applications to deal with
     replication. </para>

      <para> Unfortunately, there turn out to be some cases where
      competing updates (e.g. - for updates to account balances) would
      cause multimaster replication to reject transactions due to
      concurrency problems with high frequency. </para>
     </listitem>

    </itemizedlist>

    <para> As a result of those problems, Slony-II efforts have fallen
    off somewhat. </para>

    <para> The remaining developers plan to join together efforts for
    these two projects.  There are working prototypes, but it is not
    clear when <quote>production grade</quote> versions will
    emerge. </para>

   </sect2>

  </sect1>

</chapter>

<!-- Keep this comment at the end of the file Local variables:
mode:sgml 
sgml-omittag:nil 
sgml-shorttag:t
sgml-minimize-attributes:nil 
sgml-always-quote-attributes:t
sgml-indent-step:1 
sgml-indent-data:t
sgml-parent-document:postgres.sgml
sgml-default-dtd-file:"./reference.ced" 
sgml-exposed-tags:nil
sgml-local-catalogs:("/usr/lib/sgml/catalog")
sgml-local-ecat-files:nil 
End: -->

-- 
output = ("cbbrowne" "@" "cbbrowne.com")
http://cbbrowne.com/info/
What's another word for synonym? 

---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend

[PATCHES] Replication Documentation

Reply via email to