Jan Wieck wrote:
> On 6/6/2006 12:17 PM, Jim C. Nasby wrote:
>   
>> On Tue, Jun 06, 2006 at 11:58:48AM -0400, Rod Taylor wrote:
>>     
>>> On Tue, 2006-06-06 at 10:25 -0500, Jim C. Nasby wrote:
>>>       
>>>> On Wed, May 31, 2006 at 10:01:26AM -0400, Rod Taylor wrote:
>>>>         
>>>>> I've been thinking of the initial COPY process.
>>>>>
>>>>> The problem is that with a large amount of data you end up with a very
>>>>> large transaction on the data provider. The transaction on the
>>>>> subscriber isn't as important since it will normally be an otherwise
>>>>> idle database.
>>>>>
>>>>> COPY in is one part, but building indexes on the subscriber is the
>>>>> painful part and during much of this process the data provider has an
>>>>> idle connection.
>>>>>           
>>>>  
>>>> Pardon my ignorance, but is the provider actually sitting in a
>>>> transaction while the subscriber is building indexes, and if so, why?
>>>> ISTM there's no reason you'd need indexes (or RI for that matter) while
>>>> loading data into a subscriber.
>>>>         
>>> Yes it does. Indexes are mostly disabled during the copy itself then a
>>> second pass is made after the COPY to re-enable indexes and rebuild
>>> them. The provider is in a transaction for the same duration as the
>>> subscriber.
>>>       
>> Why does re-enabling the indexes have to happen in the same provider
>> transaction?
>>     
>
> It doesn't. It is implemented that way at the moment.
>
>   
I'm not sure it would be "challenging to the point of being truly
troublesome" to change that, but there's an interesting optimization
that falls out of the way it is implemented now, namely that you can be
sure that a whack of table data will be in cache any time an index is
being regenerated.

In effect, the present scheme of things on the subscriber is thus:

  Deactivate indexes on some_table
  COPY some_table from stdin
[ load a bunch of data ]
  reactivate indexes on some_table
  Reindex table some_table
[ Which may be expected to benefit from locality of reference;
some_table will be what's mostly occupying RAM at the moment... ]

I would suppose that a couple of strategies could fall out as
alternatives to this:

1.  Load all the data, then reindex all tables

   for each table t
      deactivate indexes on t
      COPY data for t
      reactivate indexes on t
  done
  --- At this point, we're done copying data, and can close out the
transaction on the provider
  for each table t
      reindex table t
  done

2.  Load the data, and throw reindexing to a separate thread that might
take place concurrently...

    for each table t
       deactivate indexes on t
       copy data for T
       reactivate indexes on t
       throw table t to reindexing thread
   done

In this case, I expect that we'd have to treat loading each table on the
subscriber as a separate transaction *on the subscriber* in order that
locks on the tables could get relinquished.

This approach is a tad risky, as it leaves the possibility open that
something could grab a lock on the table in between copying data and
starting reindexing.

But perhaps we could do something nastier to prevent this, such as
renaming the tables being worked on into the Slony-I schema, thereby
hiding them somewhat until we put them back...

In either case, we lose at least some of the benefits of the present
assurance of locality of reference.  Closing provider transactions
earlier could be worth it, of course...

Jim, I see some discussion on other PG lists about some testing you're
doing on reindexing; it'll be interesting to see more about that, as
that might help guide this.
_______________________________________________
Slony1-general mailing list
[email protected]
http://gborg.postgresql.org/mailman/listinfo/slony1-general

Reply via email to