On Wed, Jun 22, 2016 at 8:44 PM, Robert Haas <robertmh...@gmail.com> wrote:
> On Wed, Jun 22, 2016 at 5:10 AM, Amit Kapila <amit.kapil...@gmail.com> wrote:
>>> > Insertion will happen by scanning the appropriate bucket and needs to
>>> > retain pin on primary bucket to ensure that concurrent split doesn't
>>> > happen, otherwise split might leave this tuple unaccounted.
>>>
>>> What do you mean by "unaccounted"?
>>
>> It means that split might leave this tuple in old bucket even if it can
>> be moved to new bucket. Consider a case where insertion has to add a
>> tuple on some intermediate overflow bucket in the bucket chain, if we
>> allow split when insertion is in progress, split might not move this
>> newly inserted tuple.
>
>>> I think this is basically correct, although I don't find it to be as
>>> clear as I think it could be. It seems very clear that any operation
>>> which potentially changes the order of tuples in the bucket chain,
>>> such as the squeeze phase as currently implemented, also needs to
>>> exclude all concurrent scans. However, I think that it's OK for
>>> vacuum to remove tuples from a given page with only an exclusive lock
>>> on that particular page.
>>
>> How can we guarantee that it doesn't remove a tuple that is required by
>> scan which is started after split-in-progress flag is set?
>
> If the tuple is being removed by VACUUM, it is dead. We can remove
> dead tuples right away, because no MVCC scan will see them. In fact,
> the only snapshot that will see them is SnapshotAny, and there's no
> problem with removing dead tuples while a SnapshotAny scan is in
> progress. It's no different than heap_page_prune() removing tuples
> that a SnapshotAny sequential scan was about to see.
While thinking about this case again, it seems to me that we need a cleanup
lock even for dead tuple removal. The reason is that scans that return
multiple tuples always restart from the offset number at which they returned
the last tuple. Now, consider the case where the first tuple is returned
from offset number 3 on a page, and after that another backend removes the
corresponding tuple from the heap and vacuum also removes the dead index
tuple at offset 3. When the scan tries to get the next tuple, it will
restart from offset 3, which can lead to incorrect results.

One way to solve the above problem could be to change hash index scans to
work a page at a time, as we do for btree scans (refer to BTScanPosData and
the comments on top of it). This has the additional advantage of reducing
lock/unlock calls for retrieving tuples from a page. However, I think this
solution can work only for MVCC scans. For non-MVCC scans, there is still a
problem: after fetching all the tuples from a page, when the scan tries to
check the validity of those tuples in the heap, we won't be able to detect
that the old tuple was deleted and a new tuple placed at that location in
the heap.

I think what we can do to solve this for non-MVCC scans is to make vacuum
always take a cleanup lock on a bucket, and have MVCC scans release both
the lock and the pin as they proceed. Non-MVCC scans, and scans that are
started during a split-in-progress, will release the lock but not the pin
on the primary bucket. This way, we can allow vacuum to proceed even if
there is an MVCC scan going on in a bucket, as long as the scan was not
started during a bucket split operation. The btree code does something
similar: vacuum always takes a cleanup lock, and a non-MVCC scan retains
a pin.
Insertions should work as they currently do in the patch, that is, they
always need to retain a pin on the primary bucket to avoid the concurrent
split problem mentioned above (refer to the discussion in the first
paragraph of this mail).

Thoughts?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers