Hello, Amid.

> The point which is not completely clear from your description is the
> timing of missing records. In one of your previous emails, you seem to
> have indicated that the data missed from Table B is from the time when
> the initial sync for Table B was in-progress, right? Also, from your
> description, it seems there is no error or restart that happened
> during the time of initial sync for Table B. Is that understanding
> correct?

Yes and yes.

* B sync started - 08:08:34
* lost records were created - 09:49:xx
* B initial sync finished - 10:19:08
* I/O error with WAL - 10:19:22
* SIGTERM - 10:35:20

"Finished" here is
`logical replication table synchronization worker for subscription "cloud_production_main_sub_v4", table "B" has finished`.
As far as I know, it refers to the COPY command.

> I am not able to see how these steps can lead to the problem.

One idea I have here: it may be related to the patch that forbids
canceling queries while they wait for synchronous replication
acknowledgement [1]. It is applied to the Postgres in the cloud we were
using [2]. We started to see such errors at 10:24:18:

`The COMMIT record has already flushed to WAL locally and might not have been replicated to the standby. We must wait here.`

I wonder whether it could be some tricky race caused by the downtime of
the synchronous replica, with queries stuck waiting for an ACK forever?

> If the problem is reproducible at your end, you might want to increase LOG
> verbosity to DEBUG1 and see if there is additional information in the
> LOGs that can help or it would be really good if there is a
> self-sufficient test to reproduce it.

Unfortunately, it looks like it is really hard to reproduce.

Best regards,
Michail.

[1]: https://www.postgresql.org/message-id/flat/CALj2ACU%3DnzEb_dEfoLqez5CLcwvx1GhkdfYRNX%2BA4NDRbjYdBg%40mail.gmail.com#8b7ffc8cdecb89de43c0701b4b6b5142
[2]: https://www.postgresql.org/message-id/flat/CAAhFRxgcBy-UCvyJ1ZZ1UKf4Owrx4J2X1F4tN_FD%3Dfh5wZgdkw%40mail.gmail.com#9c71a85cb6009eb60d0361de82772a50