I think it's caused by hard reboots (maybe the hypervisor itself was rebooted!). Is there any setting that can reduce such problems?
On Tue, Jun 7, 2016 at 5:30 PM, Craig Ringer <cr...@2ndquadrant.com> wrote:

> On 7 June 2016 at 18:24, Nikhil <nikhilsme...@gmail.com> wrote:
>
>> I am getting below error in my 2 node BDR setup. postgres going down. any
>> idea?
>>
>> <35382016-06-07 10:16:59 GMT%LOG: database system was interrupted; last
>> known up at 2016-06-07 09:06:44 GMT
>> <35382016-06-07 10:16:59 GMT%PANIC: replication slot file
>> "pg_replslot/bdr_16389_6293051490331141125_2_16389__/state" has
>> wrong magic 4522536 instead of 17112225
>> <35352016-06-07 10:16:59 GMT%LOG: startup process (PID 3538) was
>> terminated by signal 6: Abort trap
>> <35352016-06-07 10:16:59 GMT%LOG: aborting startup due to startup
>> process failure
>>
>
> That suggests that there was a write failure on the replication slot file.
>
> A simple write error shouldn't be possible because we write the slot file
> to a tempfile, then replace the old slot file with the new one. Filesystem
> issues are possible, or memory corruption in the application that caused a
> bad write. Or a bug, but it's hard to see how we could write the wrong slot
> magic number here.
>
> With the slot corrupted all you can really do is part one of the nodes
> then join a new one.
>
> If you're able to reproduce this I'd really like to see how it came about.
>
> --
> Craig Ringer http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services
>
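
For anyone following the thread: the "write to a tempfile, then replace the old file" pattern mentioned above, together with a magic-number check on read, can be sketched roughly as below. This is only an illustration in Python, not PostgreSQL's actual slot code. The function names and the file layout (a 4-byte little-endian magic followed by an opaque payload) are assumptions made up for the example; the magic value is simply the one the PANIC message says it expected.

```python
import os
import struct
import tempfile

# Value the PANIC message reports as the expected magic.
SLOT_MAGIC = 17112225


def write_state_atomically(path, payload):
    """Hypothetical helper: write magic + payload to a temp file in the
    same directory, fsync it, then rename it over the target path."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".state.", suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(struct.pack("<I", SLOT_MAGIC) + payload)
            tmp.flush()
            os.fsync(tmp.fileno())  # data must be on disk before the rename
        os.replace(tmp_path, path)  # atomic replace on POSIX filesystems
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # fsync the directory so the rename itself survives a crash (POSIX only).
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)


def read_state(path):
    """Hypothetical helper: return the payload, or raise if the magic is wrong."""
    with open(path, "rb") as f:
        data = f.read()
    if len(data) < 4:
        raise ValueError("state file too short to contain a magic number")
    (magic,) = struct.unpack("<I", data[:4])
    if magic != SLOT_MAGIC:
        raise ValueError(f"wrong magic {magic} instead of {SLOT_MAGIC}")
    return data[4:]
```

With this pattern a reader should see either the complete old file or the complete new one, provided the filesystem and the storage underneath honour fsync. That is why a wrong magic points towards filesystem or storage problems, or memory corruption, rather than an ordinary interrupted write.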