On Aug 25, 2010, at 1:17 PM, Jan Wieck wrote:

> On 8/26/2010 9:11 AM, Guy Helmer wrote:
>> On Aug 23, 2010, at 2:42 PM, Steve Singer wrote:
>>> Guy Helmer wrote:
>>>> I'm seeing something odd occasionally on a fairly new slony1 (1.2.20) 
>>>> replication set involving one slave.  At times, the application inserts a 
>>>> record to a particular table, updates the record several times, and then 
>>>> deletes the record, sometimes in a fairly quick succession (but not 
>>>> always).
>>>> When I run the test-slony-state script, sometimes I find that the 
>>>> replication is failing, and when I look deeper, I find that Slony is 
>>>> having trouble replicating the changes to this table because of rows in 
>>>> the slave table that shouldn't be there.  After I manually remove the 
>>>> conflicting rows, Slony is then able to finish the backlogged replication.
>>>> Is there anything in particular I should look for in the log file prior to 
>>>> this problem?
>>> Shortly after the problem happens you're going to want to look at sl_log_1, 
>>> sl_log_2 and sl_event to figure out what was going on.
>>> You want to find what sync the delete should have been part of, and 
>>> what sync the failing insert was part of, and try to figure out why the 
>>> delete wasn't applied to the slave by the time it tried the insert.
>>> You would also want to look at the logs slon generates to see if that sync 
>>> did get applied and look in sl_confirm to verify that.
>>> Honestly, I suspect that something else is going on; I find 
>>> your description somewhat hard to reconcile with how things work.
>> Thanks for the advice.  It has happened again.  Due to the timing of the 
>> issue corresponding somewhat closely with a software update where we took 
>> the database & slony down for the maintenance, I am wondering if we might be 
>> taking things down in incorrect order...
>> I didn't notice the problem until test-slony-state saw the problem during 
>> last night's check, so the data is about 21 hours old.  sl_log_1 contains 
>> this for the stuck table:
>> mydb=# SELECT * FROM _replication.sl_log_1 WHERE log_tableid = 28 ORDER BY log_xid;
>>  log_origin | log_xid | log_tableid | log_actionseq | log_cmdtype | log_cmddata
>> ------------+---------+-------------+---------------+-------------+----------------------------------------
>>           1 | 2062810 |          28 |          6854 | I           | ("user_id","status") values ('1','2')
>>           1 | 2063155 |          28 |          6881 | I           | ("user_id","status") values ('3','2')
>>           1 | 2063342 |          28 |          6908 | I           | ("user_id","status") values ('3','2')
>>           1 | 2072564 |          28 |          6980 | I           | ("user_id","status") values ('34','2')
>>           1 | 2072564 |          28 |          6984 | D           | "user_id"='34'
>>           1 | 2072564 |          28 |          6986 | I           | ("user_id","status") values ('34','2')
>>           1 | 2072564 |          28 |          6990 | D           | "user_id"='34'
>>           1 | 2072564 |          28 |          6992 | I           | ("user_id","status") values ('34','2')
>>           1 | 2072580 |          28 |          7002 | I           | ("user_id","status") values ('34','2')
>>           1 | 2072586 |          28 |          7021 | D           | "user_id"='34'
>>           1 | 2072586 |          28 |          7023 | I           | ("user_id","status") values ('34','2')
>>           1 | 2072586 |          28 |          7027 | D           | "user_id"='34'
>>           1 | 2072586 |          28 |          7029 | I           | ("user_id","status") values ('34','2')
>>           1 | 2072586 |          28 |          7033 | D           | "user_id"='34'
>>           1 | 2072586 |          28 |          7035 | I           | ("user_id","status") values ('34','2')
>> (19 rows)
>> There are two consecutive inserts for user_id 34 (user_id is the primary 
>> key) -- is that a possible problem?
> 
> It looks like there is one delete for user_id=34 missing. This could be 
> caused by a corrupted index on sl_log_1. Can you do a
> 
>    REINDEX TABLE _replication.sl_log_1;
> 
> and then repeat that SELECT?
> 

I had already manually intervened in the slave's table to get the replication 
working again, so the sl_log_1 table was empty.  I have run the REINDEX TABLE 
_replication.sl_log_1 command, and the table is still empty...
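
For the next occurrence, here is a rough sketch of the check discussed above: 
replay the logged actions for the table in log_actionseq order and flag any 
insert that arrives for a key with no delete logged since the previous insert. 
This is not part of Slony; the function name and the tuple layout are made up 
for illustration, the rows are copied by hand from the psql output quoted 
above, and it assumes none of the listed keys existed before the first logged 
row.

```python
def find_duplicate_inserts(rows):
    """Replay (actionseq, cmdtype, key) rows in actionseq order.

    Returns the (actionseq, key) pairs where an insert is logged for a key
    that is already present, i.e. no delete was logged in between.
    """
    present = set()  # assumption: no listed key exists before the first row
    conflicts = []
    for actionseq, cmdtype, key in sorted(rows):
        if cmdtype == "I":
            if key in present:
                conflicts.append((actionseq, key))
            present.add(key)
        elif cmdtype == "D":
            present.discard(key)
        # other command types (e.g. updates) do not change key presence here
    return conflicts


# (log_actionseq, log_cmdtype, user_id) taken from the sl_log_1 rows shown above
ROWS = [
    (6854, "I", "1"),
    (6881, "I", "3"),
    (6908, "I", "3"),
    (6980, "I", "34"),
    (6984, "D", "34"),
    (6986, "I", "34"),
    (6990, "D", "34"),
    (6992, "I", "34"),
    (7002, "I", "34"),
    (7021, "D", "34"),
    (7023, "I", "34"),
    (7027, "D", "34"),
    (7029, "I", "34"),
    (7033, "D", "34"),
    (7035, "I", "34"),
]

if __name__ == "__main__":
    for actionseq, key in find_duplicate_inserts(ROWS):
        print(f"insert at log_actionseq {actionseq} for user_id {key} "
              f"has no preceding delete")
```

On the rows as quoted, this flags log_actionseq 7002 for user_id 34, and also 
6908 for user_id 3, although a delete for the latter may simply not be among 
the rows shown.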

Thanks,
Guy