On 8/26/2010 9:11 AM, Guy Helmer wrote:
> On Aug 23, 2010, at 2:42 PM, Steve Singer wrote:
>
>> Guy Helmer wrote:
>>> I'm seeing something odd occasionally on a fairly new slony1 (1.2.20)
>>> replication set involving one slave. At times, the application inserts a
>>> record to a particular table, updates the record several times, and then
>>> deletes the record, sometimes in a fairly quick succession (but not always).
>>> When I run the test-slony-state script, sometimes I find that the
>>> replication is failing, and when I look deeper, I find that Slony is having
>>> trouble replicating the changes to this table because of rows in the slave
>>> table that shouldn't be there. After I manually remove the conflicting
>>> rows, Slony is then able to finish the backlogged replication.
>>> Is there anything in particular I should look for in the log file prior to
>>> this problem?
>>
>>
>> Shortly after the problem happens, you're going to want to look at sl_log_1,
>> sl_log_2, and sl_event to figure out what was going on.
>>
>> You want to find what sync the delete should have been part of, and what
>> sync the failing insert was part of and try to figure out why the delete
>> wasn't applied to the slave by the time it tried the insert.
>>
>> You would also want to look at the logs slon generates to see if that sync
>> did get applied and look in sl_confirm to verify that.
>>
>>
>> Honestly, I suspect that something else is going on; I find
>> your description somewhat hard to reconcile with how things work.
>>
>
> Thanks for the advice. It has happened again. Due to the timing of the
> issue corresponding somewhat closely with a software update where we took the
> database & slony down for the maintenance, I am wondering if we might be
> taking things down in incorrect order...
>
> I didn't notice the problem until test-slony-state saw the problem during
> last night's check, so the data is about 21 hours old. sl_log_1 contains
> this for the stuck table:
>
> mydb=# SELECT * FROM _replication.sl_log_1 WHERE log_tableid = 28 ORDER BY log_xid;
>  log_origin | log_xid | log_tableid | log_actionseq | log_cmdtype | log_cmddata
> ------------+---------+-------------+---------------+-------------+----------------------------------------
>           1 | 2062810 |          28 |          6854 | I           | ("user_id","status") values ('1','2')
>           1 | 2063155 |          28 |          6881 | I           | ("user_id","status") values ('3','2')
>           1 | 2063342 |          28 |          6908 | I           | ("user_id","status") values ('3','2')
>           1 | 2072564 |          28 |          6980 | I           | ("user_id","status") values ('34','2')
>           1 | 2072564 |          28 |          6984 | D           | "user_id"='34'
>           1 | 2072564 |          28 |          6986 | I           | ("user_id","status") values ('34','2')
>           1 | 2072564 |          28 |          6990 | D           | "user_id"='34'
>           1 | 2072564 |          28 |          6992 | I           | ("user_id","status") values ('34','2')
>           1 | 2072580 |          28 |          7002 | I           | ("user_id","status") values ('34','2')
>           1 | 2072586 |          28 |          7021 | D           | "user_id"='34'
>           1 | 2072586 |          28 |          7023 | I           | ("user_id","status") values ('34','2')
>           1 | 2072586 |          28 |          7027 | D           | "user_id"='34'
>           1 | 2072586 |          28 |          7029 | I           | ("user_id","status") values ('34','2')
>           1 | 2072586 |          28 |          7033 | D           | "user_id"='34'
>           1 | 2072586 |          28 |          7035 | I           | ("user_id","status") values ('34','2')
> (19 rows)
>
> There are two consecutive inserts for user_id 34 (user_id is the primary key)
> -- is that a possible problem?
It looks like there is one delete for user_id=34 missing. This could be
caused by a corrupted index on sl_log_1. Can you do a
    REINDEX TABLE _replication.sl_log_1;
and then repeat that SELECT?
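
If the index really is damaged, a read that bypasses it may already show the
missing row, so it could be worth comparing before the REINDEX changes
anything. A minimal sketch, assuming the same schema name and table id as in
your query. The two planner settings are standard PostgreSQL options that
should push the planner toward a sequential scan; ordering by log_actionseq
rather than log_xid is only a suggestion to make the insert/delete sequence
easier to read:

```sql
-- Sketch only: read sl_log_1 without using its indexes.
-- SET LOCAL limits the planner changes to this transaction.
BEGIN;
SET LOCAL enable_indexscan = off;
SET LOCAL enable_bitmapscan = off;

SELECT log_origin, log_xid, log_tableid,
       log_actionseq, log_cmdtype, log_cmddata
  FROM _replication.sl_log_1
 WHERE log_tableid = 28
 ORDER BY log_actionseq;
COMMIT;
```

If this returns a delete for user_id 34 that your original SELECT did not
show, that would point at the index rather than at the log data itself.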
Jan
--
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin