On 2013-11-18 19:52:29 +0900, Michael Paquier wrote:
> On Sat, Nov 16, 2013 at 5:09 AM, Andres Freund <and...@2ndquadrant.com> wrote:
> > On 2013-11-15 11:40:17 +0900, Michael Paquier wrote:
> >> - 20131114_3_reindex_concurrently.patch, providing the core feature.
> >> Patch 3 needs to have patch 2 applied first. Regression tests,
> >> isolation tests and documentation are included with the patch.
> >
> > Have you addressed my concurrency concerns from the last version?
> I have added documentation in the patch with a better explanation
> about why those choices of implementation are made.
The dropping still isn't safe. After phase 4 we are in the state:

old index: valid, live, !isdead
new index: !valid, live, !isdead

Then you do an index_concurrent_set_dead() from that state in phase 5. There
you do WaitForLockers(locktag, AccessExclusiveLock) before
index_set_state_flags(INDEX_DROP_SET_DEAD). That's not sufficient. Consider
what happens with the following sequence:

1) WaitForLockers(locktag, AccessExclusiveLock)
   -> GetLockConflicts() => virtualxact 1
   -> VirtualXactLock(1)
2) virtualxact 2 starts, opens the *old* index since it's currently the
   only valid one.
3) virtualxact 1 finishes
4) index_concurrent_set_dead() does index_set_state_flags(DROP_SET_DEAD)
5) another transaction (vxid 3) starts inserting data into the relation,
   updates only the new index; the old index is dead
6) vxid 2 inserts data, updates only the old index. Since it had the index
   already open, the cache invalidations won't be processed.

Now the indexes are out of sync: there are entries only in the old index,
and there are entries only in the new index. Not good.

I hate to repeat myself, but you really need to follow the current protocol
for concurrently dropping indexes. That involves *first* marking the index
as invalid so it won't be used for querying anymore, then waiting for
everyone possibly still seeing that entry to finish, and only *after* that
marking the index as dead. You cannot argue away correctness concerns with
potential deadlocks.

c.f. http://www.postgresql.org/message-id/20130926103400.ga2471...@alap2.anarazel.de

I am also still unconvinced that the logic in index_concurrent_swap() is
correct. It very much needs to explain why no backend can see values that
are inconsistent. E.g. what prevents a backend from thinking the old and new
indexes have the same relfilenode? MVCC snapshots don't seem to protect you
against that. I am not sure there's a problem, but there certainly need to
be more comments explaining why there are none.
Something like the following might be possible:

Backend 1: start reindex concurrently, till phase 4
Backend 2: ExecOpenIndices()
           -> RelationGetIndexList() (that list is consistent due to MVCC
              snapshots!)
Backend 2: -> index_open(old_index) (old relfilenode)
Backend 1: index_concurrent_swap()
           -> CommitTransaction()
              -> ProcArrayEndTransaction() (changes visible to others
                 henceforth!)
Backend 2: -> index_open(new_index)
           => no cache invalidations yet, gets the old relfilenode
Backend 2: ExecInsertIndexTuples()
           => updates the same relation twice, corrupt
Backend 1: still in CommitTransaction()
           -> AtEOXact_Inval() sends out invalidations

Greetings,

Andres Freund

-- 
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers