On Sat, Feb 17, 2018 at 11:26 AM, Arseny Sher <a.s...@postgrespro.ru> wrote: > > Hello, > > I confirm this bug. The idea is that while usually we start decoding > from safe data.restart_lsn point, here we don't care about consistent > snapshots and rush into decoding right away from data.confirmed_flush > (slotfunc.c:475). The latter points to the first page's header instead > of valid record if we log SWITCH record and confirm it without any > additional records (while doing this manually you'd better hurry up to > outrun xl_running_xacts slipping through). This can be reproduced with a > bit simpler sample: > > SELECT * FROM pg_create_logical_replication_slot( > 'regression_slot1', 'test_decoding', true); > select pg_switch_wal(); > -- ensure we are on clean segment and xl_running_xacts didn't slip yet > select pg_current_wal_lsn(); > pg_current_wal_lsn > -------------------- > 0/2000000 > (1 row) > > -- set confirmed_flush_lsn to segment start and verify that > select pg_replication_slot_advance('regression_slot1', '0/6000000'); > select slot_name, xmin, catalog_xmin, restart_lsn, confirmed_flush_lsn > from pg_replication_slots; > slot_name | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn > ------------------+------+--------------+-------------+--------------------- > regression_slot1 | | 557 | 0/15D0EE8 | 0/2000000 > (1 row) > > -- insert some wal to advance GetFlushRecPtr > create table t (i int); > -- now start decoding from start of segment, boom > select pg_replication_slot_advance('regression_slot1', '0/6000000'); > server closed the connection unexpectedly > This probably means the server terminated abnormally > before or while processing the request. > The connection to the server was lost. Attempting reset: Failed. > : @:!> > : @:!> > : @:!> \q > > > Simple way to fix that is to start decoding traditionally from > restart_lsn which certainly points to valid record. Attached patch shows > the idea. However, I am not sure this is a right decision: decoding from > restart_lsn is slower since it lags behind confirmed_lsn; and we really > don't care about consistent snapshots in this function. Well, it would > be good if we assemble one on our way, serialize it and advance > restart_lsn -- and this is AFAIU the main reason we ever decode > something at all instead of just calling LogicalConfirmReceivedLocation > -- but that's not the main goal of the function. > > Another approach is to notice pointer to page start and replace it > with ptr to first record, but that doesn't sound elegantly. >
Or I think we need something like XLogFindNextRecord facility before reading WAL from startlsn to find a valid lsn. Note that it might wait for new record up to bgwriter_delay time. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center