On 08/28/2018 07:41 PM, Jeremy Finzel wrote:
>     Jeremy, are you able to reproduce the issue locally, using pgq?
>     That would be very valuable.
>
>
> Tomas et al:
>
> We have hit this error again, and we plan to snapshot the database as to
> be able to do whatever troubleshooting we can. If someone could provide
> me guidance as to what exactly you would like me to do, please let me
> know. I am able to provide an xlog dump and also debugging information
> upon request.
>
> This is actually a different database system that also uses skytools,
> and the exact same table (pgq.event_58_1) is again the cause of the
> relfilenode error. I did a point-in-time recovery to a point after this
> relfilenode appears using pg_xlogdump, and verified this was the table
> that appeared, then disappeared.
>

Great!

Can you attach to the decoding process using gdb, and set a breakpoint
to the elog(ERROR) at reorderbuffer.c:1599, and find out at which LSN /
record it fails?

https://github.com/postgres/postgres/blob/REL9_6_STABLE/src/backend/replication/logical/reorderbuffer.c#L1599

If it fails too fast, making it difficult to attach gdb before the
crash, adding the LSN to the log message might be easier.

Once we have the LSN, it would be useful to see the pg_xlogdump
before/around that position.

Another interesting piece of information would be to know the contents
of the relmapper cache (which essentially means stepping through
RelationMapFilenodeToOid or something like that).

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
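
For the "pg_xlogdump before/around that position" step, a small filter
along the following lines might help once the failing LSN is known. It is
only a sketch: the helper names (parse_lsn, format_lsn, records_near) and
the default margin are hypothetical, and it assumes each pg_xlogdump
record line carries a field of the form "lsn: X/Y" with both halves in
hexadecimal.

```python
import re

# Assumption: each pg_xlogdump record line contains "lsn: HI/LO" in hex.
LSN_FIELD = re.compile(r"\blsn: ([0-9A-Fa-f]+)/([0-9A-Fa-f]+)")


def parse_lsn(text):
    """Convert an LSN written as 'HI/LO' (hex) into a 64-bit integer."""
    hi, lo = text.strip().split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def format_lsn(value):
    """Convert a 64-bit integer back into 'HI/LO' form."""
    return "%X/%08X" % (value >> 32, value & 0xFFFFFFFF)


def records_near(lines, failing_lsn, margin=0x2000):
    """Yield dump lines whose LSN is within `margin` bytes of failing_lsn.

    `margin` is an arbitrary illustrative default; widen it as needed.
    Lines without a recognizable LSN field are skipped.
    """
    target = parse_lsn(failing_lsn)
    for line in lines:
        match = LSN_FIELD.search(line)
        if match is None:
            continue
        lsn = (int(match.group(1), 16) << 32) | int(match.group(2), 16)
        if abs(lsn - target) <= margin:
            yield line.rstrip("\n")
```

The input can be the saved pg_xlogdump output, read line by line.
pg_xlogdump's own --start and --end options can narrow the dump before it
is filtered; format_lsn produces values in the form those options expect.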