Re: [HACKERS] Hash Indexes
On Tue, Dec 20, 2016 at 7:44 PM, Robert Haas wrote:
> On Tue, Dec 20, 2016 at 9:01 AM, Amit Kapila wrote:
>> On Tue, Dec 20, 2016 at 7:11 PM, Robert Haas wrote:
>>> On Tue, Dec 20, 2016 at 4:51 AM, Amit Kapila wrote:
>>>> We have mainly four actions for the squeeze operation: add tuples to
>>>> the write page, empty the overflow page, unlink the overflow page, and
>>>> make it free by setting the corresponding bit in the bitmap page. Now,
>>>> if we don't log the changes to the write page and the freeing of the
>>>> overflow page as one operation, won't a query on the standby either
>>>> see duplicate tuples or miss the tuples that were freed from the
>>>> overflow page?
>>>
>>> No, I think you could have two operations:
>>>
>>> 1. Move tuples from the "read" page to the "write" page.
>>>
>>> 2. Unlink the overflow page from the chain and mark it free.
>>>
>>> If we fail after step 1, the bucket chain might end with an empty
>>> overflow page, but that's OK.
>>
>> If there is an empty page in the bucket chain, access to that page will
>> give an error. (In the WAL patch we are initializing the page instead
>> of making it completely empty, so we might not see an error in such a
>> case.)
>
> It wouldn't be a new, uninitialized page. It would be empty of
> tuples, not all-zeroes.

AFAIU we currently initialize the page as all-zeros, but I think you are
envisioning that we need to change it to an initialized page that is
merely empty of tuples.

>> What advantage do you see by splitting the operation?
>
> It's simpler. The code here is very complicated, and trying to merge
> too many things into a single operation may make it even more
> complicated, increasing the risk of bugs and making the code hard to
> maintain without necessarily buying much performance.

Sure, if you find that way better, then we can change it, but the
current patch treats it as a single operation. If, after looking at the
patch, you find it better to change it into two operations, I will do
so.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hash Indexes
On Tue, Dec 20, 2016 at 9:01 AM, Amit Kapila wrote:
> On Tue, Dec 20, 2016 at 7:11 PM, Robert Haas wrote:
>> On Tue, Dec 20, 2016 at 4:51 AM, Amit Kapila wrote:
>>> We have mainly four actions for the squeeze operation: add tuples to
>>> the write page, empty the overflow page, unlink the overflow page, and
>>> make it free by setting the corresponding bit in the bitmap page. Now,
>>> if we don't log the changes to the write page and the freeing of the
>>> overflow page as one operation, won't a query on the standby either
>>> see duplicate tuples or miss the tuples that were freed from the
>>> overflow page?
>>
>> No, I think you could have two operations:
>>
>> 1. Move tuples from the "read" page to the "write" page.
>>
>> 2. Unlink the overflow page from the chain and mark it free.
>>
>> If we fail after step 1, the bucket chain might end with an empty
>> overflow page, but that's OK.
>
> If there is an empty page in the bucket chain, access to that page will
> give an error. (In the WAL patch we are initializing the page instead
> of making it completely empty, so we might not see an error in such a
> case.)

It wouldn't be a new, uninitialized page. It would be empty of tuples,
not all-zeroes.

> What advantage do you see by splitting the operation?

It's simpler. The code here is very complicated, and trying to merge
too many things into a single operation may make it even more
complicated, increasing the risk of bugs and making the code hard to
maintain without necessarily buying much performance.

> Anyway, I think it is better to discuss this in the WAL patch thread.

OK.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Hash Indexes
On Tue, Dec 20, 2016 at 7:11 PM, Robert Haas wrote:
> On Tue, Dec 20, 2016 at 4:51 AM, Amit Kapila wrote:
>> We have mainly four actions for the squeeze operation: add tuples to
>> the write page, empty the overflow page, unlink the overflow page, and
>> make it free by setting the corresponding bit in the bitmap page. Now,
>> if we don't log the changes to the write page and the freeing of the
>> overflow page as one operation, won't a query on the standby either
>> see duplicate tuples or miss the tuples that were freed from the
>> overflow page?
>
> No, I think you could have two operations:
>
> 1. Move tuples from the "read" page to the "write" page.
>
> 2. Unlink the overflow page from the chain and mark it free.
>
> If we fail after step 1, the bucket chain might end with an empty
> overflow page, but that's OK.

If there is an empty page in the bucket chain, access to that page will
give an error. (In the WAL patch we are initializing the page instead of
making it completely empty, so we might not see an error in such a
case.)

What advantage do you see by splitting the operation?

Anyway, I think it is better to discuss this in the WAL patch thread.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Hash Indexes
On Tue, Dec 20, 2016 at 4:51 AM, Amit Kapila wrote:
> We have mainly four actions for the squeeze operation: add tuples to
> the write page, empty the overflow page, unlink the overflow page, and
> make it free by setting the corresponding bit in the bitmap page. Now,
> if we don't log the changes to the write page and the freeing of the
> overflow page as one operation, won't a query on the standby either
> see duplicate tuples or miss the tuples that were freed from the
> overflow page?

No, I think you could have two operations:

1. Move tuples from the "read" page to the "write" page.

2. Unlink the overflow page from the chain and mark it free.

If we fail after step 1, the bucket chain might end with an empty
overflow page, but that's OK.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Hash Indexes
On Mon, Dec 19, 2016 at 11:05 PM, Robert Haas wrote:
> On Sun, Dec 18, 2016 at 8:54 AM, Amit Kapila wrote:
>>> I committed remove-hash-wrtbuf and fix_dirty_marking_v1 but I've got
>>> some reservations about fix_lock_chaining_v1. ISTM that the natural
>>> fix here would be to change the API contract for _hash_freeovflpage so
>>> that it doesn't release the lock on the write buffer. Why does it
>>> even do that? I think that the only reason why _hash_freeovflpage
>>> should be getting wbuf as an argument is so that it can handle the
>>> case where wbuf happens to be the previous block correctly.
>>
>> Yeah, as of now that is the only case, but for the WAL patch, I think
>> we need to ensure that the action of moving all the tuples to the page
>> being written and the freeing of the overflow page are logged together
>> as an atomic operation.
>
> Not really. We can have one operation that empties the overflow page
> and another that unlinks it and makes it free.

We have mainly four actions for the squeeze operation: add tuples to the
write page, empty the overflow page, unlink the overflow page, and make
it free by setting the corresponding bit in the bitmap page. Now, if we
don't log the changes to the write page and the freeing of the overflow
page as one operation, won't a query on the standby either see duplicate
tuples or miss the tuples that were freed from the overflow page?

>> Now, apart from that, it is theoretically possible that the write page
>> will remain locked while multiple overflow pages are freed (when the
>> page being written has enough space to accommodate tuples from
>> multiple overflow pages). I am not sure it is worth worrying about
>> such a case because in practice it should happen rarely. So, I have
>> prepared a patch to retain the lock on wbuf in _hash_freeovflpage()
>> as suggested by you.
>
> Committed.

Thanks.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Hash Indexes
On Sun, Dec 18, 2016 at 8:54 AM, Amit Kapila wrote:
>> I committed remove-hash-wrtbuf and fix_dirty_marking_v1 but I've got
>> some reservations about fix_lock_chaining_v1. ISTM that the natural
>> fix here would be to change the API contract for _hash_freeovflpage so
>> that it doesn't release the lock on the write buffer. Why does it
>> even do that? I think that the only reason why _hash_freeovflpage
>> should be getting wbuf as an argument is so that it can handle the
>> case where wbuf happens to be the previous block correctly.
>
> Yeah, as of now that is the only case, but for the WAL patch, I think
> we need to ensure that the action of moving all the tuples to the page
> being written and the freeing of the overflow page are logged together
> as an atomic operation.

Not really. We can have one operation that empties the overflow page
and another that unlinks it and makes it free.

> Now, apart from that, it is theoretically possible that the write page
> will remain locked while multiple overflow pages are freed (when the
> page being written has enough space to accommodate tuples from
> multiple overflow pages). I am not sure it is worth worrying about
> such a case because in practice it should happen rarely. So, I have
> prepared a patch to retain the lock on wbuf in _hash_freeovflpage()
> as suggested by you.

Committed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Hash Indexes
On Fri, Dec 16, 2016 at 9:57 PM, Robert Haas wrote:
> On Thu, Dec 15, 2016 at 11:33 AM, Amit Kapila wrote:
>> Attached are the two patches on top of remove-hash-wrtbuf.
>> fix_dirty_marking_v1.patch marks the buffer dirty in one of the
>> corner cases in _hash_freeovflpage() and avoids marking it dirty
>> needlessly in _hash_squeezebucket(). I think this can be combined
>> with the remove-hash-wrtbuf patch. fix_lock_chaining_v1.patch fixes
>> the lock-chaining behavior (lock the next overflow page before
>> releasing the previous one) that was broken in _hash_freeovflpage().
>> These patches can be applied in series: first remove-hash-wrtbuf,
>> then fix_dirty_marking_v1.patch, and then fix_lock_chaining_v1.patch.
>
> I committed remove-hash-wrtbuf and fix_dirty_marking_v1, but I've got
> some reservations about fix_lock_chaining_v1. ISTM that the natural
> fix here would be to change the API contract for _hash_freeovflpage so
> that it doesn't release the lock on the write buffer. Why does it
> even do that? I think that the only reason why _hash_freeovflpage
> should be getting wbuf as an argument is so that it can handle the
> case where wbuf happens to be the previous block correctly.

Yeah, as of now that is the only case, but for the WAL patch, I think we
need to ensure that the action of moving all the tuples to the page
being written and the freeing of the overflow page are logged together
as an atomic operation. Now, apart from that, it is theoretically
possible that the write page will remain locked while multiple overflow
pages are freed (when the page being written has enough space to
accommodate tuples from multiple overflow pages). I am not sure it is
worth worrying about such a case because in practice it should happen
rarely. So, I have prepared a patch to retain the lock on wbuf in
_hash_freeovflpage() as suggested by you.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachment: fix_lock_chaining_v2.patch (binary data)
Re: [HACKERS] Hash Indexes
On Thu, Dec 15, 2016 at 11:33 AM, Amit Kapila wrote:
> Attached are the two patches on top of remove-hash-wrtbuf.
> fix_dirty_marking_v1.patch marks the buffer dirty in one of the corner
> cases in _hash_freeovflpage() and avoids marking it dirty needlessly
> in _hash_squeezebucket(). I think this can be combined with the
> remove-hash-wrtbuf patch. fix_lock_chaining_v1.patch fixes the
> lock-chaining behavior (lock the next overflow page before releasing
> the previous one) that was broken in _hash_freeovflpage(). These
> patches can be applied in series: first remove-hash-wrtbuf, then
> fix_dirty_marking_v1.patch, and then fix_lock_chaining_v1.patch.

I committed remove-hash-wrtbuf and fix_dirty_marking_v1, but I've got
some reservations about fix_lock_chaining_v1. ISTM that the natural
fix here would be to change the API contract for _hash_freeovflpage so
that it doesn't release the lock on the write buffer. Why does it
even do that? I think that the only reason why _hash_freeovflpage
should be getting wbuf as an argument is so that it can handle the
case where wbuf happens to be the previous block correctly. Aside
from that there's no reason for it to touch wbuf. If you fix it like
that then you should be able to avoid this rather ugly wart:

 * XXX Here, we are moving to next overflow page for writing without
 * ensuring if the previous write page is full. This is annoying, but
 * should not hurt much in practice as that space will anyway be
 * consumed by future inserts.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Hash Indexes
On Wed, Dec 14, 2016 at 10:47 PM, Robert Haas wrote:
> On Wed, Dec 14, 2016 at 4:27 AM, Amit Kapila wrote:
>> Yeah, it will fix the problem in hashbucketcleanup, but there are two
>> other problems that need to be fixed, for which I can send a separate
>> patch. By the way, as mentioned to you earlier, the WAL patch already
>> takes care of removing _hash_wrtbuf and simplifies the logic wherever
>> possible without introducing multiple MarkBufferDirty calls under one
>> lock. However, if you want to proceed with the current patch, then I
>> can send you separate patches for the remaining problems as addressed
>> in the bug-fix patch.
>
> That sounds good.

Attached are the two patches on top of remove-hash-wrtbuf.
fix_dirty_marking_v1.patch marks the buffer dirty in one of the corner
cases in _hash_freeovflpage() and avoids marking it dirty needlessly in
_hash_squeezebucket(). I think this can be combined with the
remove-hash-wrtbuf patch. fix_lock_chaining_v1.patch fixes the
lock-chaining behavior (lock the next overflow page before releasing the
previous one) that was broken in _hash_freeovflpage(). These patches can
be applied in series: first remove-hash-wrtbuf, then
fix_dirty_marking_v1.patch, and then fix_lock_chaining_v1.patch.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments: fix_dirty_marking_v1.patch, fix_lock_chaining_v1.patch
(binary data)
Re: [HACKERS] Hash Indexes
On Wed, Dec 14, 2016 at 4:27 AM, Amit Kapila wrote:
>> It's not required for enabling WAL. You don't *have* to release and
>> reacquire the buffer lock; in fact, that just adds overhead.
>
> If we don't release the lock, then it will break the general coding
> pattern of writing WAL (acquire pin and exclusive lock,
> MarkBufferDirty, write WAL, release lock). Basically, I think it is
> to ensure that we don't hold it across multiple atomic operations, or
> in this case to avoid calling MarkBufferDirty multiple times.

I think you're being too pedantic. Yes, the general rule is to release
the lock after each WAL record, but no harm comes if we take the lock,
emit TWO WAL records, and then release it.

> It is possible that we can call MarkBufferDirty multiple times (twice
> in hashbucketcleanup and then in _hash_squeezebucket) while holding
> the lock. I don't think that is a big problem as of now, but I wanted
> to avoid it because I thought we need that for the WAL patch.

I see no harm in calling MarkBufferDirty multiple times, either now or
after the WAL patch. Of course we don't want to end up with tons of
extra calls - it's not totally free - but it's pretty cheap.

>> Aside from hopefully fixing the bug we're talking about here, this
>> makes the logic in several places noticeably less contorted.
>
> Yeah, it will fix the problem in hashbucketcleanup, but there are two
> other problems that need to be fixed, for which I can send a separate
> patch. By the way, as mentioned to you earlier, the WAL patch already
> takes care of removing _hash_wrtbuf and simplifies the logic wherever
> possible without introducing multiple MarkBufferDirty calls under one
> lock. However, if you want to proceed with the current patch, then I
> can send you separate patches for the remaining problems as addressed
> in the bug-fix patch.

That sounds good.
-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Hash Indexes
On Tue, Dec 13, 2016 at 11:30 PM, Robert Haas wrote:
> On Mon, Dec 12, 2016 at 9:21 PM, Amit Kapila wrote:
>> The reason is to make the operations consistent on master and standby.
>> In the WAL patch, for clearing the SPLIT_CLEANUP flag, we write a WAL
>> record, and if we don't release the lock after writing the WAL, the
>> operation can appear on the standby even before it does on the master.
>> Currently, without WAL, there is no benefit in doing so and we can fix
>> it by using MarkBufferDirty; however, I thought it might be simpler to
>> keep it this way, as this is required for enabling WAL. OTOH, we can
>> leave that for the WAL patch. Let me know which way you prefer?
>
> It's not required for enabling WAL. You don't *have* to release and
> reacquire the buffer lock; in fact, that just adds overhead.

If we don't release the lock, then it will break the general coding
pattern of writing WAL (acquire pin and exclusive lock, MarkBufferDirty,
write WAL, release lock). Basically, I think it is to ensure that we
don't hold it across multiple atomic operations, or in this case to
avoid calling MarkBufferDirty multiple times.

> You *do* have to be aware that the standby will perhaps see the
> intermediate state, because it won't hold the lock throughout. But
> that does not mean that the master must release the lock.

Okay, but I think that will be avoided to a great extent because we do
follow the practice of releasing the lock immediately after writing the
WAL.

>>> I don't think we should be afraid to call MarkBufferDirty() instead
>>> of going through these (fairly stupid) hasham-specific APIs.
>>
>> Right, and anyway we need to use it at many more call sites when we
>> enable WAL for hash indexes.
>
> I propose the attached patch, which removes _hash_wrtbuf() and causes
> those functions which previously called it to do MarkBufferDirty()
> directly.

It is possible that we can call MarkBufferDirty multiple times (twice in
hashbucketcleanup and then in _hash_squeezebucket) while holding the
lock. I don't think that is a big problem as of now, but I wanted to
avoid it because I thought we need that for the WAL patch.

> Aside from hopefully fixing the bug we're talking about here, this
> makes the logic in several places noticeably less contorted.

Yeah, it will fix the problem in hashbucketcleanup, but there are two
other problems that need to be fixed, for which I can send a separate
patch. By the way, as mentioned to you earlier, the WAL patch already
takes care of removing _hash_wrtbuf and simplifies the logic wherever
possible without introducing multiple MarkBufferDirty calls under one
lock. However, if you want to proceed with the current patch, then I can
send you separate patches for the remaining problems as addressed in the
bug-fix patch.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Hash Indexes
On Mon, Dec 12, 2016 at 9:21 PM, Amit Kapila wrote:
> The reason is to make the operations consistent on master and standby.
> In the WAL patch, for clearing the SPLIT_CLEANUP flag, we write a WAL
> record, and if we don't release the lock after writing the WAL, the
> operation can appear on the standby even before it does on the master.
> Currently, without WAL, there is no benefit in doing so and we can fix
> it by using MarkBufferDirty; however, I thought it might be simpler to
> keep it this way, as this is required for enabling WAL. OTOH, we can
> leave that for the WAL patch. Let me know which way you prefer?

It's not required for enabling WAL. You don't *have* to release and
reacquire the buffer lock; in fact, that just adds overhead. You *do*
have to be aware that the standby will perhaps see the intermediate
state, because it won't hold the lock throughout. But that does not
mean that the master must release the lock.

>> I don't think we should be afraid to call MarkBufferDirty() instead
>> of going through these (fairly stupid) hasham-specific APIs.
>
> Right, and anyway we need to use it at many more call sites when we
> enable WAL for hash indexes.

I propose the attached patch, which removes _hash_wrtbuf() and causes
those functions which previously called it to do MarkBufferDirty()
directly. Aside from hopefully fixing the bug we're talking about here,
this makes the logic in several places noticeably less contorted.
-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index f1511d0..0eeb37d 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -635,7 +635,8 @@ loop_top:
 		num_index_tuples = metap->hashm_ntuples;
 	}
 
-	_hash_wrtbuf(rel, metabuf);
+	MarkBufferDirty(metabuf);
+	_hash_relbuf(rel, metabuf);
 
 	/* return statistics */
 	if (stats == NULL)
@@ -724,7 +725,6 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		OffsetNumber deletable[MaxOffsetNumber];
 		int			ndeletable = 0;
 		bool		retain_pin = false;
-		bool		curr_page_dirty = false;
 
 		vacuum_delay_point();
@@ -805,7 +805,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		{
 			PageIndexMultiDelete(page, deletable, ndeletable);
 			bucket_dirty = true;
-			curr_page_dirty = true;
+			MarkBufferDirty(buf);
 		}
 
 		/* bail out if there are no more pages to scan. */
@@ -820,15 +820,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		 * release the lock on previous page after acquiring the lock on next
 		 * page
 		 */
-		if (curr_page_dirty)
-		{
-			if (retain_pin)
-				_hash_chgbufaccess(rel, buf, HASH_WRITE, HASH_NOLOCK);
-			else
-				_hash_wrtbuf(rel, buf);
-			curr_page_dirty = false;
-		}
-		else if (retain_pin)
+		if (retain_pin)
 			_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
 		else
 			_hash_relbuf(rel, buf);
@@ -862,6 +854,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		bucket_opaque = (HashPageOpaque) PageGetSpecialPointer(page);
 
 		bucket_opaque->hasho_flag &= ~LH_BUCKET_NEEDS_SPLIT_CLEANUP;
+		MarkBufferDirty(bucket_buf);
 	}
 
 	/*
@@ -873,7 +866,7 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
 		_hash_squeezebucket(rel, cur_bucket, bucket_blkno, bucket_buf,
 							bstrategy);
 	else
-		_hash_chgbufaccess(rel, bucket_buf, HASH_WRITE, HASH_NOLOCK);
+		_hash_chgbufaccess(rel, bucket_buf, HASH_READ, HASH_NOLOCK);
 }
 
 void
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index 572146a..59c4213 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -208,11 +208,12 @@ restart_insert:
 	(void) _hash_pgaddtup(rel, buf, itemsz, itup);
 
 	/*
-	 * write and release the modified page. if the page we modified was an
+	 * dirty and release the modified page. if the page we modified was an
 	 * overflow page, we also need to separately drop the pin we retained on
 	 * the primary bucket page.
 	 */
-	_hash_wrtbuf(rel, buf);
+	MarkBufferDirty(buf);
+	_hash_relbuf(rel, buf);
 
 	if (buf != bucket_buf)
 		_hash_dropbuf(rel, bucket_buf);
diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index e2d208e..8fbf494 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -149,10 +149,11 @@ _hash_addovflpage(Relation rel, Buffer metabuf, Buffer buf, bool retain_pin)
 	/* logically chain overflow page to previous page */
 	pageopaque->hasho_nextblkno = BufferGetBlockNumber(ovflbuf);
+	MarkBufferDirty(buf);
 	if ((pageopaque->hasho_flag & LH_BUCKET_PAGE) && retain_pin)
-		_hash_chgbufaccess(rel, buf, HASH_WRITE, HASH_NOLOCK);
+		_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
 	else
-		_hash_wrtbuf(rel, buf);
+		_hash_relbuf(rel, buf);
 
 	return ovflbuf;
 }
@@ -304,7 +305,8 @@ found:
 	/* mark page "in use" in t
Re: [HACKERS] Hash Indexes
On 12/11/2016 11:37 PM, Amit Kapila wrote:
> On Sun, Dec 11, 2016 at 11:54 AM, Amit Kapila wrote:
>> On Wed, Dec 7, 2016 at 2:02 AM, Jeff Janes wrote:
>>
>> With above fixes, the test ran successfully for more than a day.
>
> There was a small typo in the previous patch which is fixed in the
> attached. I don't think it will impact the test results if you have
> already started the test with the previous patch, but if not, then it
> is better to test with the attached.

A mixed workload (primarily INSERT, DELETE, and VACUUM) is successful
here too using _v2. Thanks!

Best regards,
Jesper
Re: [HACKERS] Hash Indexes
On Tue, Dec 13, 2016 at 2:51 AM, Robert Haas wrote:
> On Sun, Dec 11, 2016 at 1:24 AM, Amit Kapila wrote:
>> With above fixes, the test ran successfully for more than a day.
>
> Instead of doing this:
>
> +_hash_chgbufaccess(rel, bucket_buf, HASH_WRITE, HASH_NOLOCK);
> +_hash_chgbufaccess(rel, bucket_buf, HASH_NOLOCK, HASH_WRITE);
>
> ...wouldn't it be better to just do MarkBufferDirty()? There's no
> real reason to release the lock only to reacquire it again, is there?

The reason is to make the operations consistent on master and standby.
In the WAL patch, for clearing the SPLIT_CLEANUP flag, we write a WAL
record, and if we don't release the lock after writing the WAL, the
operation can appear on the standby even before it does on the master.
Currently, without WAL, there is no benefit in doing so and we can fix
it by using MarkBufferDirty; however, I thought it might be simpler to
keep it this way, as this is required for enabling WAL. OTOH, we can
leave that for the WAL patch. Let me know which way you prefer?

> I don't think we should be afraid to call MarkBufferDirty() instead of
> going through these (fairly stupid) hasham-specific APIs.

Right, and anyway we need to use it at many more call sites when we
enable WAL for hash indexes.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Hash Indexes
On Sun, Dec 11, 2016 at 1:24 AM, Amit Kapila wrote:
> The reason for this and the similar error in vacuum was that in one of
> the corner cases, after freeing the overflow page and updating the
> link for the previous bucket, we were not marking the buffer as dirty.
> So, due to concurrent activity, the buffer containing the updated
> links got evicted, and later, when we tried to access the same buffer,
> it brought back the old copy, which contains a link to the freed
> overflow page.
>
> Apart from the above issue, Kuntal has noticed that there is an
> assertion failure (Assert(bucket == new_bucket);) in hashbucketcleanup
> with the same test as provided by you. The reason for that problem was
> that after deleting tuples in hashbucketcleanup, we were not marking
> the buffer as dirty, due to which the old copy of the overflow page
> was reappearing and causing that failure.
>
> After fixing the above problem, it has been noticed that there is
> another assertion failure (Assert(bucket == obucket);) in
> _hash_splitbucket_guts. The reason for this problem was that after the
> split, vacuum failed to remove tuples from the old bucket that were
> moved due to the split. Now, during the next split from the same old
> bucket, we don't expect the old bucket to contain tuples from the
> previous split. To fix this, whenever vacuum needs to perform split
> cleanup, it should update the metapage values (the masks required to
> calculate the bucket number), so that it doesn't miss cleaning the
> tuples. I believe this is the same assertion that Andreas has reported
> in another thread [1].
>
> The next problem we encountered is that after running the same test
> for somewhat longer, inserts were failing with the error "unexpected
> zero page at block ..". After some analysis, I found that the lock
> chain (lock the next overflow page before releasing the previous one)
> was broken in one corner case in _hash_freeovflpage, due to which an
> insert got ahead of the squeeze operation and accessed the freed
> overflow page before the link for the same had been updated.
>
> With above fixes, the test ran successfully for more than a day.

Instead of doing this:

+_hash_chgbufaccess(rel, bucket_buf, HASH_WRITE, HASH_NOLOCK);
+_hash_chgbufaccess(rel, bucket_buf, HASH_NOLOCK, HASH_WRITE);

...wouldn't it be better to just do MarkBufferDirty()? There's no real
reason to release the lock only to reacquire it again, is there? I
don't think we should be afraid to call MarkBufferDirty() instead of
going through these (fairly stupid) hasham-specific APIs.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Hash Indexes
On Mon, Dec 12, 2016 at 10:25 AM, Jeff Janes wrote:
> On Sun, Dec 11, 2016 at 8:37 PM, Amit Kapila wrote:
>> On Sun, Dec 11, 2016 at 11:54 AM, Amit Kapila wrote:
>>> On Wed, Dec 7, 2016 at 2:02 AM, Jeff Janes wrote:
>>>
>>> With above fixes, the test ran successfully for more than a day.
>>
>> There was a small typo in the previous patch which is fixed in the
>> attached. I don't think it will impact the test results if you have
>> already started the test with the previous patch, but if not, then it
>> is better to test with the attached.
>
> Thanks, I've already been running the previous one for several hours,
> and so far it looks good.

Thanks.

> I've tried forward porting it to the WAL patch to test that as well,
> but didn't have any luck doing so.

I think we can verify the WAL patch separately. I am already working on
it.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Hash Indexes
On Sun, Dec 11, 2016 at 8:37 PM, Amit Kapila wrote:
> On Sun, Dec 11, 2016 at 11:54 AM, Amit Kapila wrote:
>> On Wed, Dec 7, 2016 at 2:02 AM, Jeff Janes wrote:
>>
>> With above fixes, the test ran successfully for more than a day.
>
> There was a small typo in the previous patch which is fixed in the
> attached. I don't think it will impact the test results if you have
> already started the test with the previous patch, but if not, then it
> is better to test with the attached.

Thanks, I've already been running the previous one for several hours,
and so far it looks good.

I've tried forward porting it to the WAL patch to test that as well,
but didn't have any luck doing so.

Cheers,
Jeff
Re: [HACKERS] Hash Indexes
On Tue, Dec 6, 2016 at 1:29 PM, Jeff Janes wrote:
> On Thu, Dec 1, 2016 at 10:54 PM, Amit Kapila wrote:
>
> With the latest HASH WAL patch applied, I get different but apparently
> related errors
>
> 41993 UPDATE XX002 2016-12-05 22:28:45.333 PST:ERROR: index "foo_index_idx" contains corrupted page at block 27602
> 41993 UPDATE XX002 2016-12-05 22:28:45.333 PST:HINT: Please REINDEX it.
> 41993 UPDATE XX002 2016-12-05 22:28:45.333 PST:STATEMENT: update foo set count=count+1 where index=$1

This is not a problem with the WAL patch per se. It should be fixed by the hash index bug fix patch I sent upthread. However, after that bug fix patch, the WAL patch needs to be rebased, which I will do and send after verification. In the meantime, it would be great if you could verify the posted fix.

-- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Hash Indexes
On Sun, Dec 11, 2016 at 11:54 AM, Amit Kapila wrote:
> On Wed, Dec 7, 2016 at 2:02 AM, Jeff Janes wrote:
>
> With the above fixes, the test ran successfully for more than a day.

There was a small typo in the previous patch, which is fixed in the attached. I don't think it will impact the test results if you have already started the test with the previous patch, but if not, then it is better to test with the attached.

-- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com

fix_hashindex_issues_v2.patch
Re: [HACKERS] Hash Indexes
On Wed, Dec 7, 2016 at 2:02 AM, Jeff Janes wrote:
> On Tue, Dec 6, 2016 at 4:00 AM, Amit Kapila wrote:
>> On Tue, Dec 6, 2016 at 1:29 PM, Jeff Janes wrote:
>> > I just occasionally insert a bunch of equal tuples, which have to be in
>> > overflow pages no matter how much splitting happens.
>> >
>> > I am getting vacuum errors against HEAD, after about 20 minutes or so (8
>> > cores).
>> >
>> > 49233 XX002 2016-12-05 23:06:44.087 PST:ERROR: index "foo_index_idx" contains unexpected zero page at block 64941
>> > 49233 XX002 2016-12-05 23:06:44.087 PST:HINT: Please REINDEX it.
>> > 49233 XX002 2016-12-05 23:06:44.087 PST:CONTEXT: automatic vacuum of table "jjanes.public.foo"
>>
>> Thanks for the report. This can happen due to vacuum trying to access
>> a new page. Vacuum can do so if (a) the calculation for maxbuckets
>> (in the metapage) is wrong or (b) it is trying to free the overflow page
>> twice. Offhand, I don't see how that can happen in the code. I will
>> investigate further to see if there is any other reason why vacuum
>> can access the new page. BTW, have you run the test after commit
>> 2f4193c3? That doesn't appear to be the cause of this problem, but
>> still, it is better to test after that fix. I am trying to reproduce
>> the issue, but in the meantime, if by any chance you have a call stack,
>> then please share the same.
>
> It looks like I compiled the code for testing a few minutes before 2f4193c3
> went in.
>
> I've run it again with cb9dcbc1eebd8, after promoting the ERROR in
> _hash_checkpage, hashutil.c:174 to a PANIC so that I can get a coredump to
> feed to gdb.
>
> This time it was an INSERT, not autovac, that got the error:

The reason for this, and for the similar error in vacuum, was that in one of the corner cases, after freeing the overflow page and updating the link in the previous bucket page, we were not marking the buffer as dirty. So, due to concurrent activity, the buffer containing the updated link got evicted, and later, when we tried to access the same buffer, it brought back the old copy, which contained a link to the freed overflow page.

Apart from the above issue, Kuntal noticed an assertion failure (Assert(bucket == new_bucket);) in hashbucketcleanup with the same test you provided. The reason for that problem was that after deleting tuples in hashbucketcleanup, we were not marking the buffer as dirty, due to which the old copy of the overflow page was reappearing and causing the failure.

After fixing that problem, we noticed another assertion failure (Assert(bucket == obucket);) in _hash_splitbucket_guts. The reason for this one was that after a split, vacuum failed to remove tuples from the old bucket that had been moved by the split. During the next split from the same old bucket, we don't expect the old bucket to still contain tuples from the previous split. To fix this, whenever vacuum needs to perform split cleanup, it should update the metapage values (the masks required to calculate the bucket number) so that it doesn't miss cleaning any tuples. I believe this is the same assertion failure that Andreas reported in another thread [1].

The next problem we encountered is that after running the same test somewhat longer, inserts were failing with the error "unexpected zero page at block ..". After some analysis, I found that the lock chain (lock the next overflow bucket page before releasing the previous bucket page) was broken in one corner case in _hash_freeovflpage, due to which an insert got ahead of the squeeze-bucket operation and accessed the freed overflow page before the link to it had been updated.

With the above fixes, the test ran successfully for more than a day. Many thanks to Kuntal and Dilip for helping me analyze and test the fixes for these problems.
[1] - https://www.postgresql.org/message-id/87y3zrzbu5.fsf_-_%40ansel.ydns.eu

-- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com

fix_hashindex_issues_v1.patch
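The forgotten-MarkBufferDirty failure described above can be modeled with a toy single-slot "cache" standing in for shared buffers. All names here (ToyPage, ToyCache, evict_and_reread) are invented for this sketch, nothing from the real bufmgr; "link" stands in for the overflow-page link kept in the page's special space.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy single-slot buffer cache, for illustration only. */
typedef struct
{
    int link;           /* stand-in for the overflow-page link */
} ToyPage;

typedef struct
{
    ToyPage disk;       /* on-disk copy of the page */
    ToyPage mem;        /* in-memory (buffered) copy */
    bool    dirty;      /* was MarkBufferDirty() called? */
} ToyCache;

/* Evict the buffer and read the page back in.  A clean buffer is simply
 * discarded on eviction, so an update that was never marked dirty is
 * lost and the stale on-disk copy reappears. */
static void
evict_and_reread(ToyCache *c)
{
    if (c->dirty)
        c->disk = c->mem;   /* write back the dirty page */
    c->mem = c->disk;       /* re-read from disk */
    c->dirty = false;
}
```

Updating the link without setting the dirty flag reproduces the bug: after eviction the buffer again contains the old link to the freed overflow page; setting the flag first (the fix) makes the update survive.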
Re: [HACKERS] Hash Indexes
On Tue, Dec 6, 2016 at 4:00 AM, Amit Kapila wrote:
> On Tue, Dec 6, 2016 at 1:29 PM, Jeff Janes wrote:
> > I just occasionally insert a bunch of equal tuples, which have to be in
> > overflow pages no matter how much splitting happens.
> >
> > I am getting vacuum errors against HEAD, after about 20 minutes or so (8
> > cores).
> >
> > 49233 XX002 2016-12-05 23:06:44.087 PST:ERROR: index "foo_index_idx" contains unexpected zero page at block 64941
> > 49233 XX002 2016-12-05 23:06:44.087 PST:HINT: Please REINDEX it.
> > 49233 XX002 2016-12-05 23:06:44.087 PST:CONTEXT: automatic vacuum of table "jjanes.public.foo"
>
> Thanks for the report. This can happen due to vacuum trying to access
> a new page. Vacuum can do so if (a) the calculation for maxbuckets
> (in the metapage) is wrong or (b) it is trying to free the overflow page
> twice. Offhand, I don't see how that can happen in the code. I will
> investigate further to see if there is any other reason why vacuum
> can access the new page. BTW, have you run the test after commit
> 2f4193c3? That doesn't appear to be the cause of this problem, but
> still, it is better to test after that fix. I am trying to reproduce
> the issue, but in the meantime, if by any chance you have a call stack,
> then please share the same.

It looks like I compiled the code for testing a few minutes before 2f4193c3 went in.

I've run it again with cb9dcbc1eebd8, after promoting the ERROR in _hash_checkpage, hashutil.c:174 to a PANIC so that I can get a coredump to feed to gdb.

This time it was an INSERT, not autovac, that got the error:

35495 INSERT XX002 2016-12-06 09:25:09.517 PST:PANIC: XX002: index "foo_index_idx" contains unexpected zero page at block 202328
35495 INSERT XX002 2016-12-06 09:25:09.517 PST:HINT: Please REINDEX it.
35495 INSERT XX002 2016-12-06 09:25:09.517 PST:LOCATION: _hash_checkpage, hashutil.c:174
35495 INSERT XX002 2016-12-06 09:25:09.517 PST:STATEMENT: insert into foo (index) select $1 from generate_series(1,1)

#0 0x003838c325e5 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
64 return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
(gdb) bt
#0 0x003838c325e5 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1 0x003838c33dc5 in abort () at abort.c:92
#2 0x007d6adf in errfinish (dummy=) at elog.c:557
#3 0x00498d93 in _hash_checkpage (rel=0x7f4d030906a0, buf=, flags=) at hashutil.c:169
#4 0x004967cf in _hash_getbuf_with_strategy (rel=0x7f4d030906a0, blkno=, access=2, flags=1, bstrategy=) at hashpage.c:234
#5 0x00493dbb in hashbucketcleanup (rel=0x7f4d030906a0, cur_bucket=14544, bucket_buf=7801, bucket_blkno=22864, bstrategy=0x0, maxbucket=276687, highmask=524287, lowmask=262143, tuples_removed=0x0, num_index_tuples=0x0, split_cleanup=1 '\001', callback=0, callback_state=0x0) at hash.c:799
#6 0x00497921 in _hash_expandtable (rel=0x7f4d030906a0, metabuf=7961) at hashpage.c:668
#7 0x00495596 in _hash_doinsert (rel=0x7f4d030906a0, itup=0x1f426b0) at hashinsert.c:236
#8 0x00494830 in hashinsert (rel=0x7f4d030906a0, values=, isnull=, ht_ctid=0x7f4d03076404, heapRel=, checkUnique=) at hash.c:247
#9 0x005c81bc in ExecInsertIndexTuples (slot=0x1f28940, tupleid=0x7f4d03076404, estate=0x1f28280, noDupErr=0 '\000', specConflict=0x0, arbiterIndexes=0x0) at execIndexing.c:389
#10 0x005e74ad in ExecInsert (node=0x1f284d0) at nodeModifyTable.c:496
#11 ExecModifyTable (node=0x1f284d0) at nodeModifyTable.c:1511
#12 0x005cc9d8 in ExecProcNode (node=0x1f284d0) at execProcnode.c:396
#13 0x005ca53a in ExecutePlan (queryDesc=0x1f21a30, direction=, count=0) at execMain.c:1567
#14 standard_ExecutorRun (queryDesc=0x1f21a30, direction=, count=0) at execMain.c:338
#15 0x7f4d0c1a6dfb in pgss_ExecutorRun (queryDesc=0x1f21a30, direction=ForwardScanDirection, count=0) at pg_stat_statements.c:877
#16 0x006dfc8a in ProcessQuery (plan=, sourceText=0x1f21990 "insert into foo (index) select $1 from generate_series(1,1)", params=0x1f219e0, dest=0xc191c0, completionTag=0x7ffe82a959d0 "") at pquery.c:185
#17 0x006dfeda in PortalRunMulti (portal=0x1e86900, isTopLevel=1 '\001', setHoldSnapshot=0 '\000', dest=0xc191c0, altdest=0xc191c0, completionTag=0x7ffe82a959d0 "") at pquery.c:1299
#18 0x006e056c in PortalRun (portal=0x1e86900, count=9223372036854775807, isTopLevel=1 '\001', dest=0x1eec870, altdest=0x1eec870, completionTag=0x7ffe82a959d0 "") at pquery.c:813
#19 0x006de832 in exec_execute_message (argc=, argv=, dbname=0x1e933b8 "jjanes", username=) at postgres.c:1977
#20 PostgresMain (argc=, argv=, dbname=0x1e933b8 "jjanes", username=) at postgres.c:4132
#21 0x0067e8a4 in BackendRun (argc=, argv=) at postmaster.c:4274
#22 BackendStartup (argc=, argv=) at postmaster.c:3946
#23 ServerLoop (argc=, argv=) at postmaster.c:1704
#24 Postmas
Re: [HACKERS] Hash Indexes
On Tue, Dec 6, 2016 at 1:29 PM, Jeff Janes wrote:
> I just occasionally insert a bunch of equal tuples, which have to be in
> overflow pages no matter how much splitting happens.
>
> I am getting vacuum errors against HEAD, after about 20 minutes or so (8
> cores).
>
> 49233 XX002 2016-12-05 23:06:44.087 PST:ERROR: index "foo_index_idx" contains unexpected zero page at block 64941
> 49233 XX002 2016-12-05 23:06:44.087 PST:HINT: Please REINDEX it.
> 49233 XX002 2016-12-05 23:06:44.087 PST:CONTEXT: automatic vacuum of table "jjanes.public.foo"

Thanks for the report. This can happen due to vacuum trying to access a new page. Vacuum can do so if (a) the calculation for maxbuckets (in the metapage) is wrong or (b) it is trying to free the overflow page twice. Offhand, I don't see how that can happen in the code. I will investigate further to see if there is any other reason why vacuum can access the new page. BTW, have you run the test after commit 2f4193c3? That doesn't appear to be the cause of this problem, but still, it is better to test after that fix. I am trying to reproduce the issue, but in the meantime, if by any chance you have a call stack, then please share the same.

-- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
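For illustration, here is a simplified stand-in for the check behind the "unexpected zero page" error. The real _hash_checkpage/PageIsNew test inspects the page header rather than the whole page; this toy version, with invented names and a made-up block size, just treats an all-zeroes buffer as a page that was allocated but never initialized, or was reached via a stale link after being freed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_BLCKSZ 8192     /* stand-in for PostgreSQL's BLCKSZ */

/* Simplified zero-page test: a buffer that is still all zeroes was never
 * initialized, so any index access method that lands on it has followed
 * a bad link or a bad bucket calculation. */
static bool
toy_page_is_new(const unsigned char *page)
{
    for (size_t i = 0; i < TOY_BLCKSZ; i++)
    {
        if (page[i] != 0)
            return false;
    }
    return true;
}
```

A freshly zero-filled buffer trips the check; writing any byte (as page initialization would) clears it.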
Re: [HACKERS] Hash Indexes
On Thu, Dec 1, 2016 at 10:54 PM, Amit Kapila wrote:
> On Thu, Dec 1, 2016 at 8:56 PM, Robert Haas wrote:
> > On Thu, Dec 1, 2016 at 12:43 AM, Amit Kapila wrote:
> >> On Thu, Dec 1, 2016 at 2:18 AM, Robert Haas wrote:
> >>> On Wed, Nov 23, 2016 at 8:50 AM, Amit Kapila wrote: [ new patch ]
> >>>
> >>> Committed with some further cosmetic changes.
> >>
> >> Thank you very much.
> >>
> >>> I think it would be worth testing this code with very long overflow
> >>> chains by hacking the fill factor up to 1000
> >>
> >> 1000 is not a valid value for fill factor. Do you intend to say 100?
> >
> > No. IIUC, 100 would mean split when the average bucket contains 1
> > page worth of tuples.
>
> I also think so.
>
> > I want to split when the average bucket contains 10 pages worth of tuples.
>
> oh, I think what you mean to say is to hack the code to bump the fill factor
> and then test it. I was confused about how a user could do that from
> a SQL command.

I just occasionally insert a bunch of equal tuples, which have to be in overflow pages no matter how much splitting happens.

I am getting vacuum errors against HEAD, after about 20 minutes or so (8 cores).

49233 XX002 2016-12-05 23:06:44.087 PST:ERROR: index "foo_index_idx" contains unexpected zero page at block 64941
49233 XX002 2016-12-05 23:06:44.087 PST:HINT: Please REINDEX it.
49233 XX002 2016-12-05 23:06:44.087 PST:CONTEXT: automatic vacuum of table "jjanes.public.foo"

Testing harness is attached. It includes a lot of code to test crash recovery, but all of that is turned off in this instance. No patches need to be applied to the server to get this one to run.

With the latest HASH WAL patch applied, I get different but apparently related errors:

41993 UPDATE XX002 2016-12-05 22:28:45.333 PST:ERROR: index "foo_index_idx" contains corrupted page at block 27602
41993 UPDATE XX002 2016-12-05 22:28:45.333 PST:HINT: Please REINDEX it.
41993 UPDATE XX002 2016-12-05 22:28:45.333 PST:STATEMENT: update foo set count=count+1 where index=$1

Cheers,

Jeff

count.pl
do_nocrash.sh
Re: [HACKERS] Hash Indexes
On Fri, Dec 2, 2016 at 10:54 PM, Amit Kapila wrote:
> On Sat, Dec 3, 2016 at 12:13 AM, Robert Haas wrote:
>> On Fri, Dec 2, 2016 at 1:54 AM, Amit Kapila wrote:
>>>> I want to split when the average bucket contains 10 pages worth of tuples.
>>>
>>> oh, I think what you mean to say is to hack the code to bump the fill factor
>>> and then test it. I was confused about how a user could do that from
>>> a SQL command.
>>
>> Yes, that's why I said "hacking the fill factor up to 1000" when I
>> originally mentioned it.
>>
>> Actually, for hash indexes, there's no reason why we couldn't allow
>> fillfactor settings greater than 100, and it might be useful.
>
> Yeah, I agree with that, but as of now, it might be tricky to support
> a different fill factor range for just one kind of index. Another
> idea could be to have an additional storage parameter like
> split_bucket_length or something like that for hash indexes, which
> indicates that a split will occur after the average bucket contains
> "split_bucket_length * page" worth of tuples. We do have additional
> storage parameters for other types of indexes, so having one for the
> hash index should not be a problem.

Agreed.

> I think this is important because a split immediately increases the hash
> index space by approximately 2 times. We might want to change that
> algorithm someday, but the above idea will prevent that in many cases.

Also agreed. But the first thing is that you should probably do some testing in that area via a quick hack to see if anything breaks in an obvious way.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Hash Indexes
On Sat, Dec 3, 2016 at 12:13 AM, Robert Haas wrote:
> On Fri, Dec 2, 2016 at 1:54 AM, Amit Kapila wrote:
>>> I want to split when the average bucket contains 10 pages worth of tuples.
>>
>> oh, I think what you mean to say is to hack the code to bump the fill factor
>> and then test it. I was confused about how a user could do that from
>> a SQL command.
>
> Yes, that's why I said "hacking the fill factor up to 1000" when I
> originally mentioned it.
>
> Actually, for hash indexes, there's no reason why we couldn't allow
> fillfactor settings greater than 100, and it might be useful.

Yeah, I agree with that, but as of now, it might be tricky to support a different fill factor range for just one kind of index. Another idea could be to have an additional storage parameter like split_bucket_length or something like that for hash indexes, which indicates that a split will occur after the average bucket contains "split_bucket_length * page" worth of tuples. We do have additional storage parameters for other types of indexes, so having one for the hash index should not be a problem.

I think this is important because a split immediately increases the hash index space by approximately 2 times. We might want to change that algorithm someday, but the above idea will prevent that in many cases.

-- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Hash Indexes
On Fri, Dec 2, 2016 at 1:54 AM, Amit Kapila wrote:
>> I want to split when the average bucket contains 10 pages worth of tuples.
>
> oh, I think what you mean to say is to hack the code to bump the fill factor
> and then test it. I was confused about how a user could do that from
> a SQL command.

Yes, that's why I said "hacking the fill factor up to 1000" when I originally mentioned it.

Actually, for hash indexes, there's no reason why we couldn't allow fillfactor settings greater than 100, and it might be useful. Possibly it should be the default. Not 1000, certainly, but I'm not sure that the current value of 75 is at all optimal. The optimal value might be 100 or 125 or 150 or something like that.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Hash Indexes
On Thu, Dec 1, 2016 at 8:56 PM, Robert Haas wrote:
> On Thu, Dec 1, 2016 at 12:43 AM, Amit Kapila wrote:
>> On Thu, Dec 1, 2016 at 2:18 AM, Robert Haas wrote:
>>> On Wed, Nov 23, 2016 at 8:50 AM, Amit Kapila wrote: [ new patch ]
>>>
>>> Committed with some further cosmetic changes.
>>
>> Thank you very much.
>>
>>> I think it would be worth testing this code with very long overflow
>>> chains by hacking the fill factor up to 1000
>>
>> 1000 is not a valid value for fill factor. Do you intend to say 100?
>
> No. IIUC, 100 would mean split when the average bucket contains 1
> page worth of tuples.

I also think so.

> I want to split when the average bucket contains 10 pages worth of tuples.

oh, I think what you mean to say is to hack the code to bump the fill factor and then test it. I was confused about how a user could do that from a SQL command.

-- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Hash Indexes
On Thu, Dec 1, 2016 at 12:43 AM, Amit Kapila wrote: > On Thu, Dec 1, 2016 at 2:18 AM, Robert Haas wrote: >> On Wed, Nov 23, 2016 at 8:50 AM, Amit Kapila wrote: >>> [ new patch ] >> >> Committed with some further cosmetic changes. > > Thank you very much. > >> I think it would be worth testing this code with very long overflow >> chains by hacking the fill factor up to 1000 > > 1000 is not a valid value for fill factor. Do you intend to say 100? No. IIUC, 100 would mean split when the average bucket contains 1 page worth of tuples. I want to split when the average bucket contains 10 pages worth of tuples. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
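The arithmetic being discussed can be sketched as follows. This is a back-of-the-envelope model that assumes only the rough shape of the real formulas (the target tuples-per-bucket scales linearly with fillfactor, and a split triggers once the total tuple count exceeds target times number of buckets); the page and tuple sizes and both function names below are invented.

```c
#include <assert.h>

/* Rough model of the hash-index split trigger.  data_size is the usable
 * bytes per page, item_size the bytes per index tuple.  The target
 * tuples-per-bucket scales linearly with fillfactor, so fillfactor 1000
 * means "split only once the average bucket holds ~10 pages of tuples". */
static int
target_tuples_per_bucket(int data_size, int item_size, int fillfactor)
{
    return (int) ((long) data_size * fillfactor / 100 / item_size);
}

/* Split when total tuples exceed the per-bucket target times the number
 * of buckets (maxbucket + 1), i.e. when the *average* bucket is over
 * its target. */
static int
need_split(double ntuples, int maxbucket, int ffactor)
{
    return ntuples > (double) ffactor * (maxbucket + 1);
}
```

With these made-up sizes, fillfactor 1000 yields exactly ten times the per-bucket target of fillfactor 100, which is what forces long overflow chains before any split occurs.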
Re: [HACKERS] Hash Indexes
On Thu, Dec 1, 2016 at 2:18 AM, Robert Haas wrote:
> On Wed, Nov 23, 2016 at 8:50 AM, Amit Kapila wrote:
>> [ new patch ]
>
> Committed with some further cosmetic changes.

Thank you very much.

> I think it would be worth testing this code with very long overflow
> chains by hacking the fill factor up to 1000

1000 is not a valid value for fill factor. Do you intend to say 100?

> or something of that sort, so that we get lots and lots of overflow
> pages before we start splitting. I think that might find some bugs
> that aren't obvious right now because most buckets get split before
> they even have a single overflow bucket.
>
> Also, the deadlock hazards that we talked about upthread should
> probably be documented in the README somewhere, along with why we're
> OK with accepting those hazards.

That makes sense. I will send a patch along those lines.

-- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Hash Indexes
On Wed, Nov 23, 2016 at 8:50 AM, Amit Kapila wrote: > [ new patch ] Committed with some further cosmetic changes. I guess I won't be very surprised if this turns out to have a few bugs yet, but I think it's in fairly good shape at this point. I think it would be worth testing this code with very long overflow chains by hacking the fill factor up to 1000 or something of that sort, so that we get lots and lots of overflow pages before we start splitting. I think that might find some bugs that aren't obvious right now because most buckets get split before they even have a single overflow bucket. Also, the deadlock hazards that we talked about upthread should probably be documented in the README somewhere, along with why we're OK with accepting those hazards. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Hash Indexes
On Fri, Nov 18, 2016 at 12:11 PM, Amit Kapila wrote: > On Thu, Nov 17, 2016 at 10:54 PM, Robert Haas wrote: >> On Thu, Nov 17, 2016 at 12:05 PM, Amit Kapila >> wrote: >> I think this comment is saying that we'll release the pin on the primary bucket page for now, and then reacquire it later if the user reverses the scan direction. But that doesn't sound very safe, because the bucket could be split in the meantime and the order in which tuples are returned could change. I think we want that to remain stable within a single query execution. >>> >>> Isn't that possible even without the patch? Basically, after reaching >>> end of forward scan and for doing backward *all* scan, we need to >>> perform portal rewind which will in turn call hashrescan where we will >>> drop the lock on bucket and then again when we try to move cursor >>> forward we acquire lock in _hash_first(), so in between when we don't >>> have the lock, the split could happen and next scan results could >>> differ. >> >> Well, the existing code doesn't drop the heavyweight lock at that >> location, but your patch does drop the pin that serves the same >> function, so I feel like there must be some difference. >> > > Yes, but I am not sure if existing code is right. Consider below scenario, > > Session-1 > > Begin; > Declare c cursor for select * from t4 where c1=1; > Fetch forward all from c; --here shared heavy-weight lock count becomes 1 > Fetch prior from c; --here shared heavy-weight lock count becomes 2 > close c; -- here, lock release will reduce the lock count and shared > heavy-weight lock count becomes 1 > > Now, if we try to insert from another session, such that it leads to > bucket-split of the bucket for which session-1 had used a cursor, it > will wait for session-1. > It will not wait, but will just skip the split because we are using a conditional lock; however, the point remains that a select should not hold bucket-level locks even after the cursor is closed. -- With Regards, Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Hash Indexes
On Thu, Nov 17, 2016 at 10:54 PM, Robert Haas wrote: > On Thu, Nov 17, 2016 at 12:05 PM, Amit Kapila wrote: > >>> I think this comment is saying that we'll release the pin on the >>> primary bucket page for now, and then reacquire it later if the user >>> reverses the scan direction. But that doesn't sound very safe, >>> because the bucket could be split in the meantime and the order in >>> which tuples are returned could change. I think we want that to >>> remain stable within a single query execution. >> >> Isn't that possible even without the patch? Basically, after reaching >> end of forward scan and for doing backward *all* scan, we need to >> perform portal rewind which will in turn call hashrescan where we will >> drop the lock on bucket and then again when we try to move cursor >> forward we acquire lock in _hash_first(), so in between when we don't >> have the lock, the split could happen and next scan results could >> differ. > > Well, the existing code doesn't drop the heavyweight lock at that > location, but your patch does drop the pin that serves the same > function, so I feel like there must be some difference. > Yes, but I am not sure if existing code is right. Consider below scenario, Session-1 Begin; Declare c cursor for select * from t4 where c1=1; Fetch forward all from c; --here shared heavy-weight lock count becomes 1 Fetch prior from c; --here shared heavy-weight lock count becomes 2 close c; -- here, lock release will reduce the lock count and shared heavy-weight lock count becomes 1 Now, if we try to insert from another session, such that it leads to bucket-split of the bucket for which session-1 had used a cursor, it will wait for session-1. The insert can only proceed after session-1 performs the commit. I think after the cursor is closed in session-1, the insert from another session should succeed, don't you think so? >>> Come to think of it, I'm a little worried about the locking in >>> _hash_squeezebucket(). 
It seems like we drop the lock on each "write"
>>> bucket page before taking the lock on the next one. So a concurrent
>>> scan could get ahead of the cleanup process. That would be bad,
>>> wouldn't it?
>>
>> Yeah, that would be bad if it happens, but no concurrent scan can
>> happen during the squeeze phase because we take an exclusive lock on a
>> bucket page and maintain it throughout the operation.
>
> Well, that's completely unacceptable. A major reason the current code
> uses heavyweight locks is because you can't hold lightweight locks
> across arbitrary amounts of work -- because, just to take one example,
> a process holding or waiting for an LWLock isn't interruptible. The
> point of this redesign was to get rid of that, so that LWLocks are
> only held for short periods. I dislike the lock-chaining approach
> (take the next lock before releasing the previous one) quite a bit and
> really would like to find a way to get rid of that, but the idea of
> holding a buffer lock across a complete traversal of an unbounded
> number of overflow buckets is far worse. We've got to come up with a
> design that doesn't require that, or else completely redesign the
> bucket-squeezing stuff.

I think we can use the idea of lock-chaining (take the next lock before releasing the previous one) for the squeeze phase to solve this issue. Basically, for the squeeze operation, what we need to ensure is that no scan is in progress when we start the squeeze, and that any scan that starts afterward always stays behind the squeeze's write end. If we follow this, then there shouldn't be any problem even for backward scans, because a backward scan needs to start with the first bucket page and reach the last bucket page by locking each bucket page in read mode.

> (Would it make any sense to change the order of the hash index patches
> we've got outstanding?
> For instance, if we did the page-at-a-time stuff first, it would make
> life simpler for this patch in several ways, possibly including this
> issue.)

I agree that page-at-a-time can help hash indexes, but I don't think it can help with this particular issue of the squeeze operation. While cleaning dead tuples, it would be okay even if the scan got ahead of the cleanup (given page-at-a-time mode), but for the squeeze we can't afford that, because it can move some tuples to a prior bucket page, and the scan would miss those tuples. Also, page-at-a-time will help with cleaning tuples only for MVCC scans (it might not help for unlogged-table scans or non-MVCC scans). Another point is that we don't have a patch for page-at-a-time scans ready at this stage.

-- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
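The lock-chaining invariant proposed above can be sketched single-threadedly with toy counters (no real LWLocks; both function names and the chain length are invented): acquire the next page's lock before releasing the previous one, so at least one lock is held at every instant of the traversal, versus the broken order, which opens a window with no lock held at all.

```c
#include <assert.h>

#define CHAIN_LEN 5     /* made-up overflow chain length */

/* Lock chaining: take the next page's lock, THEN drop the previous one.
 * Returns the minimum number of locks held at any point mid-traversal. */
static int
chained_min_held(void)
{
    int nheld = 1;            /* holding the first page's lock */
    int min_held = 1;

    for (int next = 1; next < CHAIN_LEN; next++)
    {
        nheld++;              /* acquire next page's lock first... */
        nheld--;              /* ...then release the previous one */
        if (nheld < min_held)
            min_held = nheld;
    }
    return min_held;          /* never fully unlocked: a scan can't slip past */
}

/* Broken order: drop the previous lock BEFORE taking the next one.
 * The count dips to zero, which is the gap a concurrent scan (or an
 * insert, as in the _hash_freeovflpage bug) can slip through. */
static int
broken_min_held(void)
{
    int nheld = 1;
    int min_held = 1;

    for (int next = 1; next < CHAIN_LEN; next++)
    {
        nheld--;              /* release previous lock first... */
        if (nheld < min_held)
            min_held = nheld;
        nheld++;              /* ...then acquire the next page's lock */
    }
    return min_held;
}
```

The chained traversal never drops below one held lock; the broken one does, which is exactly the corner case described upthread.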
Re: [HACKERS] Hash Indexes
On Thu, Nov 17, 2016 at 12:05 PM, Amit Kapila wrote: > Are you expecting a comment change here? > >> +old_blkno = _hash_get_oldblock_from_newbucket(rel, >> opaque->hasho_bucket); >> >> Couldn't you pass "bucket" here instead of "hasho->opaque_bucket"? I >> feel like I'm repeating this ad nauseum, but I really think it's bad >> to rely on the special space instead of our own local variables! >> > > Sure, we can pass bucket as well. However, if you see few lines below > (while (BlockNumberIsValid(opaque->hasho_nextblkno))), we are already > relying on special space to pass variables. In general, we are using > special space to pass variables to functions in many other places in > the code. What exactly are you bothered about in accessing special > space, if it is safe to do? I don't want to rely on the special space to know which buffers we have locked or pinned. We obviously need the special space to find the next and previous buffers in the block chain -- there's no other way to know that. However, we should be more careful about locks and pins. If the special space is corrupted in some way, we still shouldn't get confused about which buffers we have locked or pinned. >> I think this comment is saying that we'll release the pin on the >> primary bucket page for now, and then reacquire it later if the user >> reverses the scan direction. But that doesn't sound very safe, >> because the bucket could be split in the meantime and the order in >> which tuples are returned could change. I think we want that to >> remain stable within a single query execution. > > Isn't that possible even without the patch? 
Basically, after reaching > end of forward scan and for doing backward *all* scan, we need to > perform portal rewind which will in turn call hashrescan where we will > drop the lock on bucket and then again when we try to move cursor > forward we acquire lock in _hash_first(), so in between when we don't > have the lock, the split could happen and next scan results could > differ. Well, the existing code doesn't drop the heavyweight lock at that location, but your patch does drop the pin that serves the same function, so I feel like there must be some difference. >> Also, I think that so->hashso_skip_moved_tuples is badly designed. >> There are two separate facts you need to know: (1) whether you are >> scanning a bucket that was still being populated at the start of your >> scan and (2) if yes, whether you are scanning the bucket being >> populated or whether you are instead scanning the corresponding "old" >> bucket. You're trying to keep track of that using one Boolean, but >> one Boolean only has two states and there are three possible states >> here. > > So do you prefer to have two booleans to track those facts or have an > uint8 with a tri-state value or something else? I don't currently have a preference. >> Come to think of it, I'm a little worried about the locking in >> _hash_squeezebucket(). It seems like we drop the lock on each "write" >> bucket page before taking the lock on the next one. So a concurrent >> scan could get ahead of the cleanup process. That would be bad, >> wouldn't it? > > Yeah, that would be bad if it happens, but no concurrent scan can > happen during squeeze phase because we take an exclusive lock on a > bucket page and maintain it throughout the operation. Well, that's completely unacceptable. A major reason the current code uses heavyweight locks is because you can't hold lightweight locks across arbitrary amounts of work -- because, just to take one example, a process holding or waiting for an LWLock isn't interruptible. 
The point of this redesign was to get rid of that, so that LWLocks are only held for short periods. I dislike the lock-chaining approach (take the next lock before releasing the previous one) quite a bit and really would like to find a way to get rid of that, but the idea of holding a buffer lock across a complete traversal of an unbounded number of overflow buckets is far worse. We've got to come up with a design that doesn't require that, or else completely redesign the bucket-squeezing stuff. (Would it make any sense to change the order of the hash index patches we've got outstanding? For instance, if we did the page-at-a-time stuff first, it would make life simpler for this patch in several ways, possibly including this issue.) -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hash Indexes
On Thu, Nov 17, 2016 at 3:08 AM, Robert Haas wrote: > On Sat, Nov 12, 2016 at 12:49 AM, Amit Kapila wrote: >> You are right and I have changed the code as per your suggestion. > > So... > > +/* > + * We always maintain the pin on bucket page for whole scan > operation, > + * so releasing the additional pin we have acquired here. > + */ > +if ((*opaquep)->hasho_flag & LH_BUCKET_PAGE) > +_hash_dropbuf(rel, *bufp); > > This relies on the page contents to know whether we took a pin; that > seems like a bad plan. We need to know intrinsically whether we took > a pin. > Okay, I think we can do that as we have bucket buffer information (hashso_bucket_buf) in HashScanOpaqueData. We might need to pass this information in _hash_readprev. > + * If the bucket split is in progress, then we need to skip tuples that > + * are moved from old bucket. To ensure that vacuum doesn't clean any > + * tuples from old or new buckets till this scan is in progress, maintain > + * a pin on both of the buckets. Here, we have to be cautious about > > It wouldn't be a problem if VACUUM removed tuples from the new bucket, > because they'd have to be dead anyway. It also wouldn't be a problem > if it removed tuples from the old bucket that were actually dead. The > real issue isn't vacuum anyway, but the process of cleaning up after a > split. We need to hold the pin so that tuples being moved from the > old bucket to the new bucket by the split don't get removed from the > old bucket until our scan is done. > Are you expecting a comment change here? > +old_blkno = _hash_get_oldblock_from_newbucket(rel, > opaque->hasho_bucket); > > Couldn't you pass "bucket" here instead of "hasho->opaque_bucket"? I > feel like I'm repeating this ad nauseum, but I really think it's bad > to rely on the special space instead of our own local variables! > Sure, we can pass bucket as well. 
However, if you see a few lines below (while (BlockNumberIsValid(opaque->hasho_nextblkno))), we are already relying on special space to pass variables. In general, we are using special space to pass variables to functions in many other places in the code. What exactly are you bothered about in accessing the special space, if it is safe to do? > -/* we ran off the end of the bucket without finding a match */ > +/* > + * We ran off the end of the bucket without finding a match. > + * Release the pin on bucket buffers. Normally, such pins are > + * released at end of scan, however scrolling cursors can > + * reacquire the bucket lock and pin in the same scan multiple > + * times. > + */ > *bufP = so->hashso_curbuf = InvalidBuffer; > ItemPointerSetInvalid(current); > +_hash_dropscanbuf(rel, so); > > I think this comment is saying that we'll release the pin on the > primary bucket page for now, and then reacquire it later if the user > reverses the scan direction. But that doesn't sound very safe, > because the bucket could be split in the meantime and the order in > which tuples are returned could change. I think we want that to > remain stable within a single query execution. > Isn't that possible even without the patch? Basically, after reaching the end of a forward scan, to do a backward *all* scan we need to perform a portal rewind, which will in turn call hashrescan, where we drop the lock on the bucket; then, when we try to move the cursor forward again, we acquire the lock in _hash_first(). So in between, while we don't hold the lock, the split could happen and the next scan's results could differ. Also, the documentation mentions that "The SQL standard says that it is implementation-dependent whether cursors are sensitive to concurrent updates of the underlying data by default. In PostgreSQL, cursors are insensitive by default, and can be made sensitive by specifying FOR UPDATE.", which I think indicates that results can't be guaranteed across forward and backward scans.
So, even if we try to come up with some solution for stable results in some scenarios, I am not sure that can be guaranteed for all scenarios. > +/* > + * setting hashso_skip_moved_tuples to false > + * ensures that we don't check for tuples that > are > + * moved by split in old bucket and it also > + * ensures that we won't retry to scan the old > + * bucket once the scan for same is finished. > + */ > +so->hashso_skip_moved_tuples = false; > > I think you've got a big problem here. Suppose the user starts the > scan in the new bucket and runs it forward until they end up in the > old bucket. Then they turn around and run the scan backward. When > they reach the beginning of the old bucket, they're going to stop, not > move back to the new bucket, AFAICS. Oops.
Re: [HACKERS] Hash Indexes
On Sat, Nov 12, 2016 at 12:49 AM, Amit Kapila wrote: > You are right and I have changed the code as per your suggestion. So... +/* + * We always maintain the pin on bucket page for whole scan operation, + * so releasing the additional pin we have acquired here. + */ +if ((*opaquep)->hasho_flag & LH_BUCKET_PAGE) +_hash_dropbuf(rel, *bufp); This relies on the page contents to know whether we took a pin; that seems like a bad plan. We need to know intrinsically whether we took a pin. + * If the bucket split is in progress, then we need to skip tuples that + * are moved from old bucket. To ensure that vacuum doesn't clean any + * tuples from old or new buckets till this scan is in progress, maintain + * a pin on both of the buckets. Here, we have to be cautious about It wouldn't be a problem if VACUUM removed tuples from the new bucket, because they'd have to be dead anyway. It also wouldn't be a problem if it removed tuples from the old bucket that were actually dead. The real issue isn't vacuum anyway, but the process of cleaning up after a split. We need to hold the pin so that tuples being moved from the old bucket to the new bucket by the split don't get removed from the old bucket until our scan is done. +old_blkno = _hash_get_oldblock_from_newbucket(rel, opaque->hasho_bucket); Couldn't you pass "bucket" here instead of "opaque->hasho_bucket"? I feel like I'm repeating this ad nauseam, but I really think it's bad to rely on the special space instead of our own local variables! -/* we ran off the end of the bucket without finding a match */ +/* + * We ran off the end of the bucket without finding a match. + * Release the pin on bucket buffers. Normally, such pins are + * released at end of scan, however scrolling cursors can + * reacquire the bucket lock and pin in the same scan multiple + * times.
+ */ *bufP = so->hashso_curbuf = InvalidBuffer; ItemPointerSetInvalid(current); +_hash_dropscanbuf(rel, so); I think this comment is saying that we'll release the pin on the primary bucket page for now, and then reacquire it later if the user reverses the scan direction. But that doesn't sound very safe, because the bucket could be split in the meantime and the order in which tuples are returned could change. I think we want that to remain stable within a single query execution. +_hash_readnext(rel, &buf, &page, &opaque, + (opaque->hasho_flag & LH_BUCKET_PAGE) ? true : false); Same comment: don't rely on the special space to figure this out. Keep track. Also != 0 would be better than ? true : false. +/* + * setting hashso_skip_moved_tuples to false + * ensures that we don't check for tuples that are + * moved by split in old bucket and it also + * ensures that we won't retry to scan the old + * bucket once the scan for same is finished. + */ +so->hashso_skip_moved_tuples = false; I think you've got a big problem here. Suppose the user starts the scan in the new bucket and runs it forward until they end up in the old bucket. Then they turn around and run the scan backward. When they reach the beginning of the old bucket, they're going to stop, not move back to the new bucket, AFAICS. Oops. _hash_first() has a related problem: a backward scan starts at the end of the new bucket and moves backward, but it should start at the end of the old bucket, and then when it reaches the beginning, flip to the new bucket and move backward through that one. Otherwise, a backward scan and a forward scan don't return tuples in opposite order, which they should. I think what you need to do to fix both of these problems is a more thorough job gluing the two buckets together. 
I'd suggest that the responsibility for switching between the two buckets should probably be given to _hash_readprev() and _hash_readnext(), because every place that needs to advance to the next or previous page cares about this. Right now you are trying to handle it mostly in the functions that call those functions, but that is prone to errors of omission. Also, I think that so->hashso_skip_moved_tuples is badly designed. There are two separate facts you need to know: (1) whether you are scanning a bucket that was still being populated at the start of your scan and (2) if yes, whether you are scanning the bucket being populated or whether you are instead scanning the corresponding "old" bucket. You're trying to keep track of that using one Boolean, but one Boolean only has two states and there are three possible states here. +if (H_BUCKET_BEING_SPLIT(pag
Re: [HACKERS] Hash Indexes
On Thu, Nov 10, 2016 at 2:57 AM, Robert Haas wrote: > > Some more review: > > The API contract of _hash_finish_split seems a bit unfortunate. The > caller is supposed to have obtained a cleanup lock on both the old and > new buffers, but the first thing it does is walk the entire new bucket > chain, completely ignoring the old one. That means holding a cleanup > lock on the old buffer across an unbounded number of I/O operations -- > which also means that you can't interrupt the query by pressing ^C, > because an LWLock (on the old buffer) is held. > I see the problem you are talking about, but it was done to ensure the locking order (old bucket first, then new bucket); otherwise there could be a deadlock risk. However, I think we can avoid holding the cleanup lock on the old bucket till we scan the new bucket to form a hash table of TIDs. > Moreover, the > requirement to hold a lock on the new buffer isn't convenient for > either caller; they both have to go do it, so why not move it into the > function? > Yeah, we can move the locking of the new bucket entirely into the new function. > Perhaps the function should be changed so that it > guarantees that a pin is held on the primary page of the existing > bucket, but no locks are held. > Okay, so we can change the locking order as follows:

a. Ensure a cleanup lock on the old bucket and check whether the (old) bucket has a pending split.
b. If there is a pending split, release the lock on the old bucket, but not the pin. The steps below will be performed by _hash_finish_split():
c. Acquire the read content lock on the new bucket and form the hash table of TIDs; in the process of forming the hash table, we need to traverse the whole bucket chain. While traversing the bucket chain, release the lock on the previous bucket (both lock and pin if it is not a primary bucket page).
d. After the hash table is formed, acquire the cleanup lock on the old and new buckets conditionally; if we are not able to get the cleanup lock on either, just return from there.
e. Perform the split operation.
f. Release the lock and pin on the new bucket.
g. Release the lock on the old bucket. We don't want to release the pin on the old bucket, as the caller has ensured it before passing the bucket to _hash_finish_split(), so releasing the pin should be the responsibility of the caller.

Now, both the callers need to ensure that they restart the operation from the beginning, as after we release the lock on the old bucket, somebody might have split the bucket. Does the above change in locking strategy sound okay? > Where _hash_finish_split does LockBufferForCleanup(bucket_nbuf), > should it instead be trying to get the lock conditionally and > returning immediately if it doesn't get the lock? Seems like a good > idea... > Yeah, we can take a cleanup lock conditionally, but it would waste the effort of forming the hash table if we don't get the cleanup lock immediately. Considering incomplete splits to be a rare operation, maybe this wasted effort is okay, but I am not sure. Don't you think we should avoid that effort? > * We're at the end of the old bucket chain, so we're done partitioning > * the tuples. Mark the old and new buckets to indicate split is > * finished. > * > * To avoid deadlocks due to locking order of buckets, first lock the old > * bucket and then the new bucket. > > These comments have drifted too far from the code to which they refer. > The first part is basically making the same point as the > slightly-later comment /* indicate that split is finished */. > I think we can remove the second comment /* indicate that split is finished */. Apart from that, I think the comment you have quoted seems to be in line with the current code. At that point, we have finished partitioning the tuples, so I don't understand what makes you think it has drifted from the code. Is it because of the second part of the comment (To avoid deadlocks ...)? If so, I think we can move it a few lines down, to where we actually perform the locking on the old and new buckets.
> The use of _hash_relbuf, _hash_wrtbuf, and _hash_chgbufaccess is > coming to seem like a horrible idea to me. That's not your fault - it > was like this before - but maybe in a followup patch we should > consider ripping all of that out and just calling MarkBufferDirty(), > ReleaseBuffer(), LockBuffer(), UnlockBuffer(), and/or > UnlockReleaseBuffer() as appropriate. As far as I can see, the > current style is just obfuscating the code. > Okay, we can study this and try to change it in the way you are suggesting. It seems this was partially derived from the btree code, which has the function _bt_relbuf(). I am sure that we won't need _hash_wrtbuf after the WAL patch. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Hash Indexes
On Wed, Nov 9, 2016 at 12:11 PM, Robert Haas wrote: > On Wed, Nov 9, 2016 at 11:41 AM, Amit Kapila wrote: >> On Wed, Nov 9, 2016 at 9:10 PM, Robert Haas wrote: >>> On Wed, Nov 9, 2016 at 9:04 AM, Amit Kapila wrote: >>> I think we can give a brief explanation right in the code comment. I >>> referred to "decreasing the TIDs"; you are referring to "moving tuples >>> around". But I think that moving the tuples to different locations is >>> not the problem. I think the problem is that a tuple might be >>> assigned a lower spot in the item pointer array >> >> I think we both understand the problem and it is just matter of using >> different words. I will go with your suggestion and will try to >> slightly adjust the README as well so that both places use same >> terminology. > > Yes, I think we're on the same page. Some more review: The API contract of _hash_finish_split seems a bit unfortunate. The caller is supposed to have obtained a cleanup lock on both the old and new buffers, but the first thing it does is walk the entire new bucket chain, completely ignoring the old one. That means holding a cleanup lock on the old buffer across an unbounded number of I/O operations -- which also means that you can't interrupt the query by pressing ^C, because an LWLock (on the old buffer) is held. Moreover, the requirement to hold a lock on the new buffer isn't convenient for either caller; they both have to go do it, so why not move it into the function? Perhaps the function should be changed so that it guarantees that a pin is held on the primary page of the existing bucket, but no locks are held. Where _hash_finish_split does LockBufferForCleanup(bucket_nbuf), should it instead be trying to get the lock conditionally and returning immediately if it doesn't get the lock? Seems like a good idea... * We're at the end of the old bucket chain, so we're done partitioning * the tuples. Mark the old and new buckets to indicate split is * finished. 
* * To avoid deadlocks due to locking order of buckets, first lock the old * bucket and then the new bucket. These comments have drifted too far from the code to which they refer. The first part is basically making the same point as the slightly-later comment /* indicate that split is finished */. The use of _hash_relbuf, _hash_wrtbuf, and _hash_chgbufaccess is coming to seem like a horrible idea to me. That's not your fault - it was like this before - but maybe in a followup patch we should consider ripping all of that out and just calling MarkBufferDirty(), ReleaseBuffer(), LockBuffer(), UnlockBuffer(), and/or UnlockReleaseBuffer() as appropriate. As far as I can see, the current style is just obfuscating the code. itupsize = new_itup->t_info & INDEX_SIZE_MASK; new_itup->t_info &= ~INDEX_SIZE_MASK; new_itup->t_info |= INDEX_MOVED_BY_SPLIT_MASK; new_itup->t_info |= itupsize; If I'm not mistaken, you could omit the first, second, and fourth lines here and keep only the third one, and it would do exactly the same thing. The first line saves the bits in INDEX_SIZE_MASK. The second line clears the bits in INDEX_SIZE_MASK. The fourth line re-sets the bits that were originally saved. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Hash Indexes
On Wed, Nov 9, 2016 at 11:41 AM, Amit Kapila wrote: > On Wed, Nov 9, 2016 at 9:10 PM, Robert Haas wrote: >> On Wed, Nov 9, 2016 at 9:04 AM, Amit Kapila wrote: >> I think we can give a brief explanation right in the code comment. I >> referred to "decreasing the TIDs"; you are referring to "moving tuples >> around". But I think that moving the tuples to different locations is >> not the problem. I think the problem is that a tuple might be >> assigned a lower spot in the item pointer array > > I think we both understand the problem and it is just matter of using > different words. I will go with your suggestion and will try to > slightly adjust the README as well so that both places use same > terminology. Yes, I think we're on the same page. > Right, but we don't need that guarantee (there is no pending scan that > has seen the flag after it is cleared) to clear the flags. It was > written in one of the previous patches where I was exploring the idea > of using cleanup lock to clear the flags and then don't use the same > during vacuum. However, there were some problems in that design and I > have changed the code, but forgot to update the comment. OK, got it, thanks. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Hash Indexes
On Wed, Nov 9, 2016 at 9:10 PM, Robert Haas wrote: > On Wed, Nov 9, 2016 at 9:04 AM, Amit Kapila wrote: > > I think we can give a brief explanation right in the code comment. I > referred to "decreasing the TIDs"; you are referring to "moving tuples > around". But I think that moving the tuples to different locations is > not the problem. I think the problem is that a tuple might be > assigned a lower spot in the item pointer array > I think we both understand the problem and it is just a matter of using different words. I will go with your suggestion and will try to slightly adjust the README as well so that both places use the same terminology. >>> +/* >>> + * Acquiring cleanup lock to clear the split-in-progress flag ensures >>> that >>> + * there is no pending scan that has seen the flag after it is cleared. >>> + */ >>> +_hash_chgbufaccess(rel, bucket_obuf, HASH_NOLOCK, HASH_WRITE); >>> +opage = BufferGetPage(bucket_obuf); >>> +oopaque = (HashPageOpaque) PageGetSpecialPointer(opage); >>> >>> I don't understand the comment, because the code *isn't* acquiring a >>> cleanup lock. >> >> Oops, this is a remnant from one of the design approaches to clear these >> flags which was later discarded due to issues. I will change this to >> indicate Exclusive lock. > > Of course, an exclusive lock doesn't guarantee anything like that... > Right, but we don't need that guarantee (there is no pending scan that has seen the flag after it is cleared) to clear the flags. It was written in one of the previous patches, where I was exploring the idea of using a cleanup lock to clear the flags and then not using it during vacuum. However, there were some problems in that design and I have changed the code, but forgot to update the comment. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Hash Indexes
On Wed, Nov 9, 2016 at 9:04 AM, Amit Kapila wrote: > + * This function expects that the caller has acquired a cleanup lock on the > + * primary bucket page, and will with a write lock again held on the primary > + * bucket page. The lock won't necessarily be held continuously, though, > + * because we'll release it when visiting overflow pages. > > Looks like typo in above comment. /will with a write lock/will > return with a write lock Oh, yes. Thanks. >> + * During scan of overflow pages, first we need to lock the next bucket and >> + * then release the lock on current bucket. This ensures that any >> concurrent >> + * scan started after we start cleaning the bucket will always be behind the >> + * cleanup. Allowing scans to cross vacuum will allow it to remove tuples >> + * required for sanctity of scan. >> >> This comment says that it's bad if other scans can pass our cleanup >> scan, but it doesn't explain why. I think it's because we don't have >> page-at-a-time mode yet, >> > > Right. > >> and cleanup might decrease the TIDs for >> existing index entries. >> > > I think the reason is that cleanup might move tuples around during > which it might move previously returned TID to a position earlier than > its current position. This is a problem because it restarts the scan > from previously returned offset and try to find previously returned > tuples TID. This has been mentioned in README as below: > > + It is must to > +keep scans behind cleanup, else vacuum could remove tuples that are required > +to complete the scan as the scan that returns multiple tuples from the same > +bucket page always restart the scan from the previous offset number from > which > +it has returned last tuple. > > We might want to slightly improve the README so that the reason is > more clear and then mention in comments to refer README, but I am open > either way, let me know which way you prefer? I think we can give a brief explanation right in the code comment. 
I referred to "decreasing the TIDs"; you are referring to "moving tuples around". But I think that moving the tuples to different locations is not the problem. I think the problem is that a tuple might be assigned a lower spot in the item pointer array - i.e. the TID decreases. >> OK, a couple things here. First, it seems like we could also delete >> any tuples where ItemIdIsDead, and that seems worth doing. > > I think we can't do that because here we want to strictly rely on > callback function for vacuum similar to btree. The reason is explained > as below comment in function btvacuumpage(). OK, I see. It would probably be good to comment this, then, so that someone later doesn't get confused as I did. > This looks okay to me. So if you agree with my reasoning for not > including first part, then I can take that out and keep this part in > next patch. Cool. >> I think that might be >> clearer. When LH_BEING_POPULATED is set, the bucket is being filled - >> that is, populated - from the old bucket. > > How about LH_BUCKET_BEING_POPULATED or may LH_BP_BEING_SPLIT where BP > indicates Bucket page? LH_BUCKET_BEING_POPULATED seems good to me. >> And maybe >> LH_BUCKET_PAGE_HAS_GARBAGE -> LH_NEEDS_SPLIT_CLEANUP, too. >> > > How about LH_BUCKET_NEEDS_SPLIT_CLEANUP or LH_BP_NEEDS_SPLIT_CLEANUP? > I am slightly inclined to keep Bucket word, but let me know if you > think it will make the length longer. LH_BUCKET_NEEDS_SPLIT_CLEANUP seems good to me. >> How? Can we just use an >> if-then instead of a for-loop? 
> > I could see below two possibilities: > First way - > > retry: > mask = lowmask + 1; > new_bucket = old_bucket | mask; > if (new_bucket > maxbucket) > { > lowmask = lowmask >> 1; > goto retry; > } > > Second way - > new_bucket = CALC_NEW_BUCKET(old_bucket,lowmask); > if (new_bucket > maxbucket) > { > lowmask = lowmask >> 1; > new_bucket = CALC_NEW_BUCKET(old_bucket, lowmask); > } > > #define CALC_NEW_BUCKET(old_bucket, lowmask) \ > new_bucket = old_bucket | (lowmask + 1) > > Do you have something else in mind? Second one would be my preference. >> I still don't like the names of these functions very much. If you >> said "get X from Y", it would be clear that you put in Y and you get >> out X. If you say "X 2 Y", it would be clear that you put in X and >> you get out Y. As it is, it's not very clear which is the input and >> which is the output. > > Whatever exists earlier is input and the later one is output. For > example in existing function _hash_get_indextuple_hashkey(). However, > feel free to suggest better names here. How about > _hash_get_oldbucket2newblock() or _hash_get_newblock_from_oldbucket() > or simply _hash_get_newblock()? The problem with _hash_get_newblock() is that it sounds like you are getting a new block in the relation, not the new bucket (or corresponding block) for some old bucket. The name isn't specific enough to know what "new" means. In general, I think "new" and "old" are not very good termin
Re: [HACKERS] Hash Indexes
On Wed, Nov 9, 2016 at 1:23 AM, Robert Haas wrote: > On Mon, Nov 7, 2016 at 9:51 PM, Amit Kapila wrote: >> [ new patches ] > > Attached is yet another incremental patch with some suggested changes. > > + * This expects that the caller has acquired a cleanup lock on the target > + * bucket (primary page of a bucket) and it is reponsibility of caller to > + * release that lock. > > This is confusing, because it makes it sound like we retain the lock > through the entire execution of the function, which isn't always true. > I would say that caller must acquire a cleanup lock on the target > primary bucket page before calling this function, and that on return > that page will again be write-locked. However, the lock might be > temporarily released in the meantime, while visiting overflow pages. > (Attached patch has a suggested rewrite.) > + * This function expects that the caller has acquired a cleanup lock on the + * primary bucket page, and will with a write lock again held on the primary + * bucket page. The lock won't necessarily be held continuously, though, + * because we'll release it when visiting overflow pages. Looks like a typo in the above comment. /will with a write lock/will return with a write lock > + * During scan of overflow pages, first we need to lock the next bucket and > + * then release the lock on current bucket. This ensures that any concurrent > + * scan started after we start cleaning the bucket will always be behind the > + * cleanup. Allowing scans to cross vacuum will allow it to remove tuples > + * required for sanctity of scan. > > This comment says that it's bad if other scans can pass our cleanup > scan, but it doesn't explain why. I think it's because we don't have > page-at-a-time mode yet, > Right. > and cleanup might decrease the TIDs for > existing index entries. > I think the reason is that cleanup might move tuples around, and in doing so it might move a previously returned TID to a position earlier than its current position.
This is a problem because the scan restarts from the previously returned offset and tries to find the previously returned tuple's TID. This has been mentioned in the README as below: + It is must to +keep scans behind cleanup, else vacuum could remove tuples that are required +to complete the scan as the scan that returns multiple tuples from the same +bucket page always restart the scan from the previous offset number from which +it has returned last tuple. We might want to slightly improve the README so that the reason is clearer and then mention in the comments to refer to the README, but I am open either way; let me know which way you prefer. > > +if (delay) > +vacuum_delay_point(); > > You don't really need "delay". If we're not in a cost-accounted > VACUUM, vacuum_delay_point() just turns into CHECK_FOR_INTERRUPTS(), > which should be safe (and a good idea) regardless. (Fixed in > attached.) > Okay, that makes sense. > +if (callback && callback(htup, callback_state)) > +{ > +/* mark the item for deletion */ > +deletable[ndeletable++] = offno; > +if (tuples_removed) > +*tuples_removed += 1; > +} > +else if (bucket_has_garbage) > +{ > +/* delete the tuples that are moved by split. */ > +bucket = > _hash_hashkey2bucket(_hash_get_indextuple_hashkey(itup > ), > + maxbucket, > + highmask, > + lowmask); > +/* mark the item for deletion */ > +if (bucket != cur_bucket) > +{ > +/* > + * We expect tuples to either belong to curent bucket or > + * new_bucket. This is ensured because we don't allow > + * further splits from bucket that contains garbage. See > + * comments in _hash_expandtable. > + */ > +Assert(bucket == new_bucket); > +deletable[ndeletable++] = offno; > +} > +else if (num_index_tuples) > +*num_index_tuples += 1; > +} > +else if (num_index_tuples) > +*num_index_tuples += 1; > +} > > OK, a couple things here. First, it seems like we could also delete > any tuples where ItemIdIsDead, and that seems worth doing.
I think we can't do that because here we want to strictly rely on the callback function for vacuum, similar to btree. The reason is explained in the below comment in function btvacuumpage(). /* * During Hot Standby we currently assume that * XLOG_BTREE_VACUUM records do not produce conflicts. That is * only true as long as the callback function depends only * upon whether the index tuple refers to heap tuples removed * in the initial
Re: [HACKERS] Hash Indexes
On Mon, Nov 7, 2016 at 9:51 PM, Amit Kapila wrote: > [ new patches ] Attached is yet another incremental patch with some suggested changes. + * This expects that the caller has acquired a cleanup lock on the target + * bucket (primary page of a bucket) and it is reponsibility of caller to + * release that lock. This is confusing, because it makes it sound like we retain the lock through the entire execution of the function, which isn't always true. I would say that caller must acquire a cleanup lock on the target primary bucket page before calling this function, and that on return that page will again be write-locked. However, the lock might be temporarily released in the meantime, while visiting overflow pages. (Attached patch has a suggested rewrite.) + * During scan of overflow pages, first we need to lock the next bucket and + * then release the lock on current bucket. This ensures that any concurrent + * scan started after we start cleaning the bucket will always be behind the + * cleanup. Allowing scans to cross vacuum will allow it to remove tuples + * required for sanctity of scan. This comment says that it's bad if other scans can pass our cleanup scan, but it doesn't explain why. I think it's because we don't have page-at-a-time mode yet, and cleanup might decrease the TIDs for existing index entries. (Attached patch has a suggested rewrite, but might need further adjustment if my understanding of the reasons is incomplete.) +if (delay) +vacuum_delay_point(); You don't really need "delay". If we're not in a cost-accounted VACUUM, vacuum_delay_point() just turns into CHECK_FOR_INTERRUPTS(), which should be safe (and a good idea) regardless. (Fixed in attached.) +if (callback && callback(htup, callback_state)) +{ +/* mark the item for deletion */ +deletable[ndeletable++] = offno; +if (tuples_removed) +*tuples_removed += 1; +} +else if (bucket_has_garbage) +{ +/* delete the tuples that are moved by split.
*/ +bucket = _hash_hashkey2bucket(_hash_get_indextuple_hashkey(itup ), + maxbucket, + highmask, + lowmask); +/* mark the item for deletion */ +if (bucket != cur_bucket) +{ +/* + * We expect tuples to either belong to curent bucket or + * new_bucket. This is ensured because we don't allow + * further splits from bucket that contains garbage. See + * comments in _hash_expandtable. + */ +Assert(bucket == new_bucket); +deletable[ndeletable++] = offno; +} +else if (num_index_tuples) +*num_index_tuples += 1; +} +else if (num_index_tuples) +*num_index_tuples += 1; +} OK, a couple things here. First, it seems like we could also delete any tuples where ItemIdIsDead, and that seems worth doing. In fact, I think we should check it prior to invoking the callback, because it's probably quite substantially cheaper than the callback. Second, repeating deletable[ndeletable++] = offno and *num_index_tuples += 1 doesn't seem very clean to me; I think we should introduce a new bool to track whether we're keeping the tuple or killing it, and then use that to drive which of those things we do. (Fixed in attached.) +if (H_HAS_GARBAGE(bucket_opaque) && +!H_INCOMPLETE_SPLIT(bucket_opaque)) This is the only place in the entire patch that use H_INCOMPLETE_SPLIT(), and I'm wondering if that's really correct even here. Don't you really want H_OLD_INCOMPLETE_SPLIT() here? (And couldn't we then remove H_INCOMPLETE_SPLIT() itself?) There's no garbage to be removed from the "new" bucket until the next split, when it will take on the role of the "old" bucket. I think it would be a good idea to change this so that LH_BUCKET_PAGE_HAS_GARBAGE doesn't get set until LH_BUCKET_OLD_PAGE_SPLIT is cleared. The current way is confusing, because those tuples are NOT garbage until the split is completed! 
Moreover, both of the places that care about LH_BUCKET_PAGE_HAS_GARBAGE need to make sure that LH_BUCKET_OLD_PAGE_SPLIT is clear before they do anything about LH_BUCKET_PAGE_HAS_GARBAGE, so the change I am proposing would actually simplify the code very slightly. +#define H_OLD_INCOMPLETE_SPLIT(opaque) ((opaque)->hasho_flag & LH_BUCKET_OLD_PAGE_SPLIT) +#define H_NEW_INCOMPLETE_SPLIT(opaque) ((opaque)->hasho_flag & LH_BUCKET_NEW_PAGE_SPLIT) The code isn't consistent about the use of these macros, or at least not in a good way. When you care about LH_BUCKET_OLD_PAGE_SPLIT, you test i
Re: [HACKERS] Hash Indexes
On Mon, Nov 7, 2016 at 9:51 PM, Amit Kapila wrote: >> Both the places _hash_squeezebucket() and _hash_splitbucket can use >> this optimization irrespective of rest of the patch. I will prepare a >> separate patch for these and send along with main patch after some >> testing. > > Done as a separate patch skip_dead_tups_hash_index-v1.patch. Thanks. Committed. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hash Indexes
On Fri, Nov 4, 2016 at 9:37 PM, Robert Haas wrote: > On Tue, Nov 1, 2016 at 9:09 PM, Robert Haas wrote: >> On Mon, Oct 24, 2016 at 10:30 AM, Amit Kapila >> wrote: >>> [ new patches ] >> >> I looked over parts of this today, mostly the hashinsert.c changes. > > Some more review... > > @@ -656,6 +678,10 @@ _hash_squeezebucket(Relation rel, > IndexTuple itup; > Sizeitemsz; > > +/* skip dead tuples */ > +if (ItemIdIsDead(PageGetItemId(rpage, roffnum))) > +continue; > > Is this an optimization independent of the rest of the patch, or is > there something in this patch that necessitates it? > This specific case is independent of rest of patch, but the same optimization is used in function _hash_splitbucket_guts() which is mandatory, because otherwise it will make a copy of that tuple without copying dead flag. > i.e. Could this > small change be committed independently? Both the places _hash_squeezebucket() and _hash_splitbucket can use this optimization irrespective of rest of the patch. I will prepare a separate patch for these and send along with main patch after some testing. > If not, then I think it > needs a better comment explaining why it is now mandatory. > > - * Caller must hold exclusive lock on the target bucket. This allows > + * Caller must hold cleanup lock on the target bucket. This allows > * us to safely lock multiple pages in the bucket. > > The notion of a lock on a bucket no longer really exists; with this > patch, we'll now properly speak of a lock on a primary bucket page. > Also, I think the bit about safely locking multiple pages is bizarre > -- that's not the issue at all: the problem is that reorganizing a > bucket might confuse concurrent scans into returning wrong answers. > > I've included a broader updating of that comment, and some other > comment changes, in the attached incremental patch, which also > refactors your changes to _hash_freeovflpage() a bit to avoid some > code duplication. 
Please consider this for inclusion in your next > version. > Your modifications look good to me, so I will include them in the next version. > In hashutil.c, I think that _hash_msb() is just a reimplementation of > fls(), which you can rely on being present because we have our own > implementation in src/port. It's quite similar to yours but slightly > shorter. :-) Also, some systems have a builtin fls() function which > actually optimizes down to a single machine instruction, and which is > therefore much faster than either version. > Agreed, I will change it as per your suggestion. > I don't like the fact that _hash_get_newblk() and _hash_get_oldblk() > are working out the bucket number by using the HashOpaque structure > within the bucket page they're examining. First, it seems weird to > pass the whole structure when you only need the bucket number out of > it. More importantly, the caller really ought to know what bucket > they care about without having to discover it. > > For example, in _hash_doinsert(), we figure out the bucket into which > we need to insert, and we store that in a variable called "bucket". > Then from there we work out the primary bucket page's block number, > which we store in "blkno". We read the page into "buf" and put a > pointer to that buffer's contents into "page" from which we discover > the HashOpaque, a pointer to which we store into "pageopaque". Then > we pass that to _hash_get_newblk() which will go look into that > structure to find the bucket number ... but couldn't we have just > passed "bucket" instead? > Yes, it can be done. However, note that pageopaque is not only retrieved for passing to _hash_get_newblk(); it is also used in the code below, so we can't remove it. > Similarly, _hash_expandtable() has the value > available in the variable "old_bucket". > > The only caller of _hash_get_oldblk() is _hash_first(), which has the > bucket number available in a variable called "bucket".
> > So it seems to me that these functions could be simplified to take the > bucket number as an argument directly instead of the HashOpaque. > Okay, I agree that it is better to use the bucket number in both APIs, so I will change them accordingly. > Generally, this pattern recurs throughout the patch. Every time you > use the data in the page to figure something out which the caller > already knew, you're introducing a risk of bugs: what if the answers > don't match? I think you should try to root out as much of that from > this code as you can. > Okay, I will review the patch once more from this angle and see if I can improve it. > As you may be able to tell, I'm working my way into this patch > gradually, starting with peripheral parts and working toward the core > of it. Generally, I think it's in pretty good shape, but I still have > quite a bit left to study. > Thanks. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Hash Indexes
On Fri, Nov 4, 2016 at 6:37 PM, Robert Haas wrote: > On Thu, Nov 3, 2016 at 6:25 AM, Amit Kapila wrote: >>> +nblkno = _hash_get_newblk(rel, pageopaque); >>> >>> I think this is not a great name for this function. It's not clear >>> what "new blocks" refers to, exactly. I suggest >>> FIND_SPLIT_BUCKET(metap, bucket) or OLD_BUCKET_TO_NEW_BUCKET(metap, >>> bucket) returning a new bucket number. I think that macro can be >>> defined as something like this: bucket + (1 << >>> (fls(metap->hashm_maxbucket) - 1)). >>> >> >> I think such a macro would not work for the usage of incomplete >> splits. The reason is that by the time we try to complete the split >> of the current old bucket, the table half (lowmask, highmask, >> maxbucket) would have changed and it could give you the bucket in new >> table half. > > Can you provide an example of the scenario you are talking about here? > Consider a case as below:

First half of table: 0 1 2 3
Second half of table: 4 5 6 7

Now when the split of bucket 2 (whose corresponding new bucket will be 6) is in progress, the system crashes, and after restart it splits bucket number 3 (whose corresponding new bucket will be 7). After that, it will try to form a new table half with buckets ranging from 8, 9, ..., 15. Assume it creates bucket 8 by splitting from bucket 0, and that it next tries to split bucket 2: it will find an incomplete split and will attempt to finish it. At that time, if it tries to calculate the new bucket from the old bucket (2), it will calculate it as 10 (the value of metap->hashm_maxbucket will be 8 for the third table half, and if we try it with the above macro, it will calculate it as 10), whereas we need 6. That is why you will see a check (if (new_bucket > metap->hashm_maxbucket)) in _hash_get_newblk(), which ensures that it returns the bucket number from the previous half. The basic idea is that if there is an incomplete split from the current bucket, it can't do a new split from that bucket, so the check in _hash_get_newblk() will give us the correct value.
I can try to explain again if the above is not clear enough. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Hash Indexes
On Tue, Nov 1, 2016 at 9:09 PM, Robert Haas wrote: > On Mon, Oct 24, 2016 at 10:30 AM, Amit Kapila wrote: >> [ new patches ] > > I looked over parts of this today, mostly the hashinsert.c changes. Some more review...

@@ -656,6 +678,10 @@ _hash_squeezebucket(Relation rel,
             IndexTuple  itup;
             Size        itemsz;

+            /* skip dead tuples */
+            if (ItemIdIsDead(PageGetItemId(rpage, roffnum)))
+                continue;

Is this an optimization independent of the rest of the patch, or is there something in this patch that necessitates it? i.e. Could this small change be committed independently? If not, then I think it needs a better comment explaining why it is now mandatory.

- * Caller must hold exclusive lock on the target bucket. This allows
+ * Caller must hold cleanup lock on the target bucket. This allows
  * us to safely lock multiple pages in the bucket.

The notion of a lock on a bucket no longer really exists; with this patch, we'll now properly speak of a lock on a primary bucket page. Also, I think the bit about safely locking multiple pages is bizarre -- that's not the issue at all: the problem is that reorganizing a bucket might confuse concurrent scans into returning wrong answers. I've included a broader updating of that comment, and some other comment changes, in the attached incremental patch, which also refactors your changes to _hash_freeovflpage() a bit to avoid some code duplication. Please consider this for inclusion in your next version. In hashutil.c, I think that _hash_msb() is just a reimplementation of fls(), which you can rely on being present because we have our own implementation in src/port. It's quite similar to yours but slightly shorter. :-) Also, some systems have a builtin fls() function which actually optimizes down to a single machine instruction, and which is therefore much faster than either version.
I don't like the fact that _hash_get_newblk() and _hash_get_oldblk() are working out the bucket number by using the HashOpaque structure within the bucket page they're examining. First, it seems weird to pass the whole structure when you only need the bucket number out of it. More importantly, the caller really ought to know what bucket they care about without having to discover it. For example, in _hash_doinsert(), we figure out the bucket into which we need to insert, and we store that in a variable called "bucket". Then from there we work out the primary bucket page's block number, which we store in "blkno". We read the page into "buf" and put a pointer to that buffer's contents into "page" from which we discover the HashOpaque, a pointer to which we store into "pageopaque". Then we pass that to _hash_get_newblk() which will go look into that structure to find the bucket number ... but couldn't we have just passed "bucket" instead? Similarly, _hash_expandtable() has the value available in the variable "old_bucket". The only caller of _hash_get_oldblk() is _hash_first(), which has the bucket number available in a variable called "bucket". So it seems to me that these functions could be simplified to take the bucket number as an argument directly instead of the HashOpaque. Generally, this pattern recurs throughout the patch. Every time you use the data in the page to figure something out which the caller already knew, you're introducing a risk of bugs: what if the answers don't match? I think you should try to root out as much of that from this code as you can. As you may be able to tell, I'm working my way into this patch gradually, starting with peripheral parts and working toward the core of it. Generally, I think it's in pretty good shape, but I still have quite a bit left to study. 
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company hashovfl-tweaks.patch Description: application/download
Re: [HACKERS] Hash Indexes
On Fri, Oct 28, 2016 at 12:33 AM, Amit Kapila wrote: > On Fri, Oct 28, 2016 at 2:52 AM, Robert Haas wrote: >> On Mon, Oct 24, 2016 at 10:30 AM, Amit Kapila >> wrote: Amit, can you please split the buffer manager changes in this patch into a separate patch? >>> >>> Sure, attached patch extend_bufmgr_api_for_hash_index_v1.patch does that. >> >> The additional argument to ConditionalLockBuffer() doesn't seem to be >> used anywhere in the main patch. Do we actually need it? >> > > No, with the latest patch of the concurrent hash index, we don't need it. I > forgot to remove it. Please find the updated patch attached. The > usage of the second parameter for ConditionalLockBuffer() is removed as we > don't want to allow I/O across content locks, so the patch is changed > to fall back to locking the metapage twice. Committed. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Hash Indexes
On Thu, Nov 3, 2016 at 6:25 AM, Amit Kapila wrote: >> +nblkno = _hash_get_newblk(rel, pageopaque); >> >> I think this is not a great name for this function. It's not clear >> what "new blocks" refers to, exactly. I suggest >> FIND_SPLIT_BUCKET(metap, bucket) or OLD_BUCKET_TO_NEW_BUCKET(metap, >> bucket) returning a new bucket number. I think that macro can be >> defined as something like this: bucket + (1 << >> (fls(metap->hashm_maxbucket) - 1)). >> > > I think such a macro would not work for the usage of incomplete > splits. The reason is that by the time we try to complete the split > of the current old bucket, the table half (lowmask, highmask, > maxbucket) would have changed and it could give you the bucket in new > table half. Can you provide an example of the scenario you are talking about here? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Hash Indexes
On Wed, Nov 2, 2016 at 6:39 AM, Robert Haas wrote: > On Mon, Oct 24, 2016 at 10:30 AM, Amit Kapila wrote: >> [ new patches ] > > I looked over parts of this today, mostly the hashinsert.c changes. > > +/* > + * Copy bucket mapping info now; The comment in _hash_expandtable where > + * we copy this information and calls _hash_splitbucket explains why this > + * is OK. > + */ > > So, I went and tried to find the comments to which this comment is > referring and didn't have much luck. > I guess you have just tried to find it in the patch. However, the comment I am referring to above is an existing comment in _hash_expandtable(). Refer to the below comment:

/*
 * Copy bucket mapping info now; this saves re-accessing the meta page
 * inside _hash_splitbucket's inner loop.
...

> At the point this code is > running, we have a pin but no lock on the metapage, so this is only > safe if changing any of these fields requires a cleanup lock on the > metapage. If that's true, > No, that's not true; we need just an Exclusive content lock to update those fields, and these fields should be copied while we have a Share content lock on the metapage. In version 8 of the patch it was correct, but in the last version it seems I moved it during code rearrangement. I will change it such that these values are copied under the metapage Share content lock. I think moving it just before the preceding for loop should be okay; let me know if you think otherwise. > +nblkno = _hash_get_newblk(rel, pageopaque); > > I think this is not a great name for this function. It's not clear > what "new blocks" refers to, exactly. I suggest > FIND_SPLIT_BUCKET(metap, bucket) or OLD_BUCKET_TO_NEW_BUCKET(metap, > bucket) returning a new bucket number. I think that macro can be > defined as something like this: bucket + (1 << > (fls(metap->hashm_maxbucket) - 1)). > I think such a macro would not work for the usage of incomplete splits.
The reason is that by the time we try to complete the split of the current old bucket, the table half (lowmask, highmask, maxbucket) would have changed and it could give you the bucket in the new table half. > Then do nblkno = > BUCKET_TO_BLKNO(metap, newbucket) to get the block number. That'd all > be considerably simpler than what you have for hash_get_newblk(). > I think to use BUCKET_TO_BLKNO we need either a share or exclusive lock on the metapage, and since we need a lock on the metapage to find the new block from the old block, I thought it is better to do it inside _hash_get_newblk(). > > Moving on ... > > /* > * ovfl page exists; go get it. if it doesn't have room, we'll > - * find out next pass through the loop test above. > + * find out next pass through the loop test above. Retain the > + * pin, if it is a primary bucket page. > */ > -_hash_relbuf(rel, buf); > +if (pageopaque->hasho_flag & LH_BUCKET_PAGE) > +_hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK); > +else > +_hash_relbuf(rel, buf); > > It seems like it would be cheaper, safer, and clearer to test whether > buf != bucket_buf here, rather than examining the page opaque data. > That's what you do down at the bottom of the function when you ensure > that the pin on the primary bucket page gets released, and it seems > like it should work up here, too. > > +bool        retain_pin = false; > + > +/* page flags must be accessed before releasing lock on a page. > */ > +retain_pin = pageopaque->hasho_flag & LH_BUCKET_PAGE; > > Similarly. > Agreed, I will change the usage as per your suggestion. > I have also attached a patch with some suggested comment changes. > I will include it in the next version of the patch. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Hash Indexes
On Mon, Oct 24, 2016 at 10:30 AM, Amit Kapila wrote: > [ new patches ] I looked over parts of this today, mostly the hashinsert.c changes.

+    /*
+     * Copy bucket mapping info now; The comment in _hash_expandtable where
+     * we copy this information and calls _hash_splitbucket explains why this
+     * is OK.
+     */

So, I went and tried to find the comments to which this comment is referring and didn't have much luck. At the point this code is running, we have a pin but no lock on the metapage, so this is only safe if changing any of these fields requires a cleanup lock on the metapage. If that's true, it seems like you could just make the comment say that; if it's false, you've got problems. This code seems rather pointless anyway, the way it's written. All of these local variables are used in exactly one place, which is here:

+        _hash_finish_split(rel, metabuf, buf, nbuf, maxbucket,
+                           highmask, lowmask);

But you hold the same locks at the point where you copy those values into local variables and the point where that code runs. So if the code is safe as written, you could instead just pass metap->hashm_maxbucket, metap->hashm_highmask, and metap->hashm_lowmask to that function instead of having these local variables. Or, for that matter, you could just let that function read the data itself: it's got metabuf, after all.

+ * In future, if we want to finish the splits during insertion in new
+ * bucket, we must ensure the locking order such that old bucket is locked
+ * before new bucket.

Not if the locks are conditional anyway.

+    nblkno = _hash_get_newblk(rel, pageopaque);

I think this is not a great name for this function. It's not clear what "new blocks" refers to, exactly. I suggest FIND_SPLIT_BUCKET(metap, bucket) or OLD_BUCKET_TO_NEW_BUCKET(metap, bucket) returning a new bucket number. I think that macro can be defined as something like this: bucket + (1 << (fls(metap->hashm_maxbucket) - 1)). Then do nblkno = BUCKET_TO_BLKNO(metap, newbucket) to get the block number.
That'd all be considerably simpler than what you have for _hash_get_newblk(). Here's some test code I wrote, which seems to work:

#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <strings.h>

int
newbucket(int bucket, int nbuckets)
{
    assert(bucket < nbuckets);
    return bucket + (1 << (fls(nbuckets) - 1));
}

int
main(int argc, char **argv)
{
    int         nbuckets = 1;
    int         restartat = 1;
    int         splitbucket = 0;

    while (splitbucket < 32)
    {
        printf("old bucket %d splits to new bucket %d\n",
               splitbucket, newbucket(splitbucket, nbuckets));
        if (++splitbucket >= restartat)
        {
            splitbucket = 0;
            restartat *= 2;
        }
        ++nbuckets;
    }
    exit(0);
}

Moving on ...

 /*
  * ovfl page exists; go get it. if it doesn't have room, we'll
- * find out next pass through the loop test above.
+ * find out next pass through the loop test above. Retain the
+ * pin, if it is a primary bucket page.
  */
-        _hash_relbuf(rel, buf);
+        if (pageopaque->hasho_flag & LH_BUCKET_PAGE)
+            _hash_chgbufaccess(rel, buf, HASH_READ, HASH_NOLOCK);
+        else
+            _hash_relbuf(rel, buf);

It seems like it would be cheaper, safer, and clearer to test whether buf != bucket_buf here, rather than examining the page opaque data. That's what you do down at the bottom of the function when you ensure that the pin on the primary bucket page gets released, and it seems like it should work up here, too.

+        bool        retain_pin = false;
+
+        /* page flags must be accessed before releasing lock on a page. */
+        retain_pin = pageopaque->hasho_flag & LH_BUCKET_PAGE;

Similarly. I have also attached a patch with some suggested comment changes. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company hashinsert-comments.patch Description: application/download
Re: [HACKERS] Hash Indexes
On Fri, Oct 28, 2016 at 2:52 AM, Robert Haas wrote: > On Mon, Oct 24, 2016 at 10:30 AM, Amit Kapila wrote: >>> Amit, can you please split the buffer manager changes in this patch >>> into a separate patch? >> >> Sure, attached patch extend_bufmgr_api_for_hash_index_v1.patch does that. > > The additional argument to ConditionalLockBuffer() doesn't seem to be > used anywhere in the main patch. Do we actually need it? > No, with the latest patch of the concurrent hash index, we don't need it. I forgot to remove it. Please find the updated patch attached. The usage of the second parameter for ConditionalLockBuffer() is removed; as we don't want to allow I/O across content locks, the patch is changed to fall back to locking the metapage twice. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com extend_bufmgr_api_for_hash_index_v2.patch Description: Binary data
Re: [HACKERS] Hash Indexes
On Mon, Oct 24, 2016 at 10:30 AM, Amit Kapila wrote: >> Amit, can you please split the buffer manager changes in this patch >> into a separate patch? > > Sure, attached patch extend_bufmgr_api_for_hash_index_v1.patch does that. The additional argument to ConditionalLockBuffer() doesn't seem to be used anywhere in the main patch. Do we actually need it? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Hash Indexes
On Mon, Oct 24, 2016 at 8:00 PM, Amit Kapila wrote: > On Thu, Sep 29, 2016 at 6:04 AM, Robert Haas wrote: >> On Wed, Sep 28, 2016 at 3:04 PM, Robert Haas wrote: > > > Thanks for the valuable feedback. > Forgot to mention that in addition to fixing the review comments, I had made an additional change to skip dead tuples while copying tuples from the old bucket to the new bucket during a split. This was previously not possible because split and scan were blocking operations (a split used to take an Exclusive lock on the bucket, and a scan used to hold a Share lock on the bucket till the operation ended), but now it is possible, and during a scan some of the tuples can be marked as dead. Similarly, during the squeeze operation, dead tuples are now skipped while moving tuples across buckets. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Hash Indexes
On Tue, Oct 18, 2016 at 8:27 PM, Amit Kapila wrote: > By this problem, I mean to say deadlocks for suspended scans, that can > happen in btree for non-Mvcc or other type of scans where we don't > release pin during scan. In my mind, we have below options: > > a. problem of deadlocks for suspended scans should be tackled as a > separate patch as it exists for other indexes (at least for some type > of scans). > b. Implement page-scan mode and then we won't have deadlock problem > for MVCC scans. > c. Let's not care for non-MVCC scans unless we have some way to hit > those for hash indexes and proceed with Dead tuple marking idea. I > think even if we don't care for non-MVCC scans, we might hit this > problem (deadlocks) when the index relation is unlogged. > > Here, even if we want to go with (b), I think we can handle it in a > separate patch, unless you think otherwise. After some off-list discussion with Amit, I think I get his point here: the deadlock hazard which is introduced by this patch already exists for btree and has for a long time, and nobody's gotten around to fixing it (although 2ed5b87f96d473962ec5230fd820abfeaccb2069 improved things). So it's probably OK for hash indexes to have the same issue. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Hash Indexes
On Wed, Oct 19, 2016 at 5:57 AM, Amit Kapila wrote: > On Tue, Oct 18, 2016 at 10:52 PM, Robert Haas wrote: >> On Tue, Oct 18, 2016 at 5:37 AM, Amit Kapila wrote: >>> On Wed, Oct 5, 2016 at 10:22 PM, Robert Haas wrote: On Tue, Oct 4, 2016 at 12:36 AM, Amit Kapila wrote: > I think one way to avoid the risk of deadlock in above scenario is to > take the cleanup lock conditionally, if we get the cleanup lock then > we will delete the items as we are doing in patch now, else it will > just mark the tuples as dead and ensure that it won't try to remove > tuples that are moved-by-split. Now, I think the question is how will > these dead tuples be removed. We anyway need a separate mechanism to > clear dead tuples for hash indexes as during scans we are marking the > tuples as dead if corresponding tuple in heap is dead which are not > removed later. This is already taken care in btree code via > kill_prior_tuple optimization. So I think clearing of dead tuples can > be handled by a separate patch. That seems like it could work. >>> >>> I have implemented this idea and it works for MVCC scans. However, I >>> think this might not work for non-MVCC scans. Consider a case where >>> in Process-1, hash scan has returned one row and before it could check >>> it's validity in heap, vacuum marks that tuple as dead and removed the >>> entry from heap and some new tuple has been placed at that offset in >>> heap. >> >> Oops, that's bad. >> >>> Now when Process-1 checks the validity in heap, it will check >>> for different tuple then what the index tuple was suppose to check. >>> If we want, we can make it work similar to what btree does as being >>> discussed on thread [1], but for that we need to introduce page-scan >>> mode as well in hash indexes. However, do we really want to solve >>> this problem as part of this patch when this exists for other index am >>> as well? >> >> For what other index AM does this problem exist? 
>> > > By this problem, I mean to say deadlocks for suspended scans, that can > happen in btree for non-Mvcc or other type of scans where we don't > release pin during scan. In my mind, we have below options: > > a. problem of deadlocks for suspended scans should be tackled as a > separate patch as it exists for other indexes (at least for some type > of scans). > b. Implement page-scan mode and then we won't have deadlock problem > for MVCC scans. > c. Let's not care for non-MVCC scans unless we have some way to hit > those for hash indexes and proceed with Dead tuple marking idea. I > think even if we don't care for non-MVCC scans, we might hit this > problem (deadlocks) when the index relation is unlogged. > oops, my last sentence is wrong. What I wanted to say is: "I think even if we don't care for non-MVCC scans, we might hit the problem of TIDs reuse when the index relation is unlogged." -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Hash Indexes
On Tue, Oct 18, 2016 at 10:52 PM, Robert Haas wrote: > On Tue, Oct 18, 2016 at 5:37 AM, Amit Kapila wrote: >> On Wed, Oct 5, 2016 at 10:22 PM, Robert Haas wrote: >>> On Tue, Oct 4, 2016 at 12:36 AM, Amit Kapila >>> wrote: I think one way to avoid the risk of deadlock in above scenario is to take the cleanup lock conditionally, if we get the cleanup lock then we will delete the items as we are doing in patch now, else it will just mark the tuples as dead and ensure that it won't try to remove tuples that are moved-by-split. Now, I think the question is how will these dead tuples be removed. We anyway need a separate mechanism to clear dead tuples for hash indexes as during scans we are marking the tuples as dead if corresponding tuple in heap is dead which are not removed later. This is already taken care in btree code via kill_prior_tuple optimization. So I think clearing of dead tuples can be handled by a separate patch. >>> >>> That seems like it could work. >> >> I have implemented this idea and it works for MVCC scans. However, I >> think this might not work for non-MVCC scans. Consider a case where >> in Process-1, hash scan has returned one row and before it could check >> it's validity in heap, vacuum marks that tuple as dead and removed the >> entry from heap and some new tuple has been placed at that offset in >> heap. > > Oops, that's bad. > >> Now when Process-1 checks the validity in heap, it will check >> for different tuple then what the index tuple was suppose to check. >> If we want, we can make it work similar to what btree does as being >> discussed on thread [1], but for that we need to introduce page-scan >> mode as well in hash indexes. However, do we really want to solve >> this problem as part of this patch when this exists for other index am >> as well? > > For what other index AM does this problem exist? 
> By this problem, I mean deadlocks for suspended scans, which can happen in btree for non-MVCC or other types of scans where we don't release the pin during the scan. In my mind, we have the options below: a. The problem of deadlocks for suspended scans should be tackled as a separate patch, as it exists for other indexes (at least for some types of scans). b. Implement page-scan mode, and then we won't have the deadlock problem for MVCC scans. c. Let's not care about non-MVCC scans unless we have some way to hit those for hash indexes, and proceed with the dead-tuple-marking idea. I think even if we don't care about non-MVCC scans, we might hit this problem (deadlocks) when the index relation is unlogged. Here, even if we want to go with (b), I think we can handle it in a separate patch, unless you think otherwise. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Hash Indexes
On 2016-10-18 13:38:14 -0400, Tom Lane wrote: > Robert Haas writes: > > On Tue, Oct 18, 2016 at 5:37 AM, Amit Kapila > > wrote: > >> I have implemented this idea and it works for MVCC scans. However, I > >> think this might not work for non-MVCC scans. Consider a case where > >> in Process-1, a hash scan has returned one row, and before it could check > >> its validity in the heap, vacuum marks that tuple as dead, removes the > >> entry from the heap, and some new tuple is placed at that offset in the > >> heap. > > > Oops, that's bad. > > Do we care? Under what circumstances would a hash index be used for a > non-MVCC scan? Uniqueness checks are the most important ones that come to mind. Andres
Re: [HACKERS] Hash Indexes
Robert Haas writes: > On Tue, Oct 18, 2016 at 5:37 AM, Amit Kapila wrote: >> I have implemented this idea and it works for MVCC scans. However, I >> think this might not work for non-MVCC scans. Consider a case where >> in Process-1, a hash scan has returned one row, and before it could check >> its validity in the heap, vacuum marks that tuple as dead, removes the >> entry from the heap, and some new tuple is placed at that offset in the >> heap. > Oops, that's bad. Do we care? Under what circumstances would a hash index be used for a non-MVCC scan? regards, tom lane
Re: [HACKERS] Hash Indexes
On Tue, Oct 18, 2016 at 5:37 AM, Amit Kapila wrote: > On Wed, Oct 5, 2016 at 10:22 PM, Robert Haas wrote: >> On Tue, Oct 4, 2016 at 12:36 AM, Amit Kapila wrote: >>> I think one way to avoid the risk of deadlock in the above scenario is to >>> take the cleanup lock conditionally: if we get the cleanup lock, then >>> we will delete the items as we are doing in the patch now; otherwise we will >>> just mark the tuples as dead and ensure that we don't try to remove >>> tuples that are moved-by-split. Now, I think the question is how >>> these dead tuples will be removed. We anyway need a separate mechanism to >>> clear dead tuples for hash indexes, since during scans we mark the >>> tuples as dead if the corresponding heap tuple is dead, and they are not >>> removed later. This is already taken care of in the btree code via the >>> kill_prior_tuple optimization. So I think clearing of dead tuples can >>> be handled by a separate patch. >> >> That seems like it could work. > > I have implemented this idea and it works for MVCC scans. However, I > think this might not work for non-MVCC scans. Consider a case where > in Process-1, a hash scan has returned one row, and before it could check > its validity in the heap, vacuum marks that tuple as dead, removes the > entry from the heap, and some new tuple is placed at that offset in the > heap. Oops, that's bad. > Now when Process-1 checks the validity in the heap, it will check > a different tuple than the one the index tuple was supposed to check. > If we want, we can make it work similarly to what btree does, as being > discussed on thread [1], but for that we need to introduce page-scan > mode in hash indexes as well. However, do we really want to solve > this problem as part of this patch when it exists for other index AMs > as well? For what other index AM does this problem exist? Kevin has been careful not to create this problem for btree, or at least I think he has. That's why the pin still has to be held on the index page when it's a non-MVCC scan.
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Hash Indexes
On Wed, Oct 5, 2016 at 10:22 PM, Robert Haas wrote: > On Tue, Oct 4, 2016 at 12:36 AM, Amit Kapila wrote: >> I think one way to avoid the risk of deadlock in the above scenario is to >> take the cleanup lock conditionally: if we get the cleanup lock, then >> we will delete the items as we are doing in the patch now; otherwise we will >> just mark the tuples as dead and ensure that we don't try to remove >> tuples that are moved-by-split. Now, I think the question is how >> these dead tuples will be removed. We anyway need a separate mechanism to >> clear dead tuples for hash indexes, since during scans we mark the >> tuples as dead if the corresponding heap tuple is dead, and they are not >> removed later. This is already taken care of in the btree code via the >> kill_prior_tuple optimization. So I think clearing of dead tuples can >> be handled by a separate patch. > > That seems like it could work. > I have implemented this idea and it works for MVCC scans. However, I think this might not work for non-MVCC scans. Consider a case where in Process-1, a hash scan has returned one row, and before it could check its validity in the heap, vacuum marks that tuple as dead, removes the entry from the heap, and some new tuple is placed at that offset in the heap. Now when Process-1 checks the validity in the heap, it will check a different tuple than the one the index tuple was supposed to check. If we want, we can make it work similarly to what btree does, as being discussed on thread [1], but for that we need to introduce page-scan mode in hash indexes as well. However, do we really want to solve this problem as part of this patch when it exists for other index AMs as well? [1] - https://www.postgresql.org/message-id/CACjxUsNtBXe1OfRp%3DacB%2B8QFAVWJ%3Dnr55_HMmqQYceCzVGF4tQ%40mail.gmail.com -- With Regards, Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
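[Editor's note: the race described above can be sketched with a tiny single-threaded model. All names and the layout below are invented for illustration; this is a simulation of the hazard, not PostgreSQL code. A suspended non-MVCC scan remembers a TID, vacuum removes the heap tuple, the slot is reused, and the resumed scan fetches a different tuple than the one the index entry described.]

```c
#include <assert.h>
#include <stdbool.h>

#define NSLOTS 4

/* Toy heap: a "TID" here is just a slot number.  Invented for this
 * sketch; real TIDs are (block, offset) pairs. */
typedef struct { int value; bool live; } ToySlot;

/* What a suspended non-MVCC scan remembers from the index entry:
 * the TID, plus (for the demonstration) the value it expects. */
typedef struct { int slot; int expected; } ScanPos;

/* Resume the scan: fetch whatever now lives at the remembered TID.
 * Nothing ties this slot to the tuple the index entry pointed at,
 * so after vacuum reuses the slot we read the wrong tuple. */
static int resume_fetch(const ToySlot *heap, const ScanPos *pos)
{
    assert(heap[pos->slot].live);
    return heap[pos->slot].value;
}
```

Note that marking the index tuple dead instead of removing it does not help with this race; what prevents it elsewhere is holding the buffer pin so vacuum cannot recycle the slot while the scan is suspended.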
Re: [HACKERS] Hash Indexes
On Mon, Oct 10, 2016 at 10:07 PM, Jeff Janes wrote: > On Mon, Oct 10, 2016 at 5:55 AM, Amit Kapila > wrote: >> >> On Thu, Sep 29, 2016 at 8:27 PM, Amit Kapila >> wrote: >> > On Thu, Sep 29, 2016 at 6:04 AM, Robert Haas >> > wrote: >> > >> >> Another thing I don't quite understand about this algorithm is that in >> >> order to conditionally lock the target primary bucket page, we'd first >> >> need to read and pin it. And that doesn't seem like a good thing to >> >> do while we're holding a shared content lock on the metapage, because >> >> of the principle that we don't want to hold content locks across I/O. >> >> >> > >> >> Aren't we already doing this during BufferAlloc() when the buffer >> selected by StrategyGetBuffer() is dirty? > > > Right, you probably shouldn't allocate another buffer while holding a > content lock on a different one, if you can help it. > I don't see the problem in that, but I guess the simple rule is that we should not hold content locks for a long duration, which could happen if we do I/O or need to allocate a new buffer. > But, BufferAlloc > doesn't do that internally, does it? > You are right that BufferAlloc() doesn't allocate a new buffer while holding a content lock on another buffer, but it does perform I/O while holding a content lock. > It is only a problem if you make it be > one by the way you use it. Am I missing something? > >> >> >> > I think we can release the metapage content lock before reading the buffer. >> > >> >> On thinking about this again, if we release the metapage content lock >> before reading and pinning the primary bucket page, then we need to >> take it again to verify whether the split has happened during the time we >> didn't have a lock on the metapage. Releasing and again taking the content >> lock on the metapage is not >> good from the performance aspect. Do you have some other idea for this? > > > Doesn't the relcache patch effectively deal with this?
If this is a > sticking point, maybe the relcache patch could be incorporated into this > one. > Yeah, the relcache patch would eliminate the need for metapage locking, but that is not a blocking point. As this patch is mainly to enable WAL logging, there is no urgency to incorporate the relcache patch, even if we have to go with an algorithm where we need to take the metapage lock twice to verify the splits. Having said that, I am okay with it if Robert and/or others are also in favour of combining the two patches (the patch in this thread and the cache-the-metapage patch). If we don't want to hold a content lock across another ReadBuffer call, then another option could be to modify the read algorithm as below:

    read the metapage
    compute bucket number for target hash key based on metapage contents
    read the required block
    loop:
        acquire shared content lock on metapage
        recompute bucket number for target hash key based on metapage contents
        if the recomputed block number is not the same as the block number we read:
            release metapage content lock
            read the recomputed block number
        else:
            break
    if (we can't get a shared content lock on the target bucket without blocking)
        loop:
            release metapage content lock
            take a shared content lock on the target primary bucket page
            take a shared content lock on the metapage
            if (previously-computed target bucket has not been split)
                break

The basic change here is that first we compute the target block number *without* locking the metapage, and then, after locking the metapage, if the two don't match, we need to read the recomputed block number again. Thoughts? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
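[Editor's note: the proposed read protocol can be sketched single-threaded. Field names below mirror the real hash metapage (maxbucket, highmask, lowmask) and the mapping rule is the same as _hash_hashkey2bucket(), but the lock calls are elided as comments and the "concurrent split" is injected by a flag, so this is a simulation of the retry loop, not the patch itself.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model of the metapage fields that determine bucket mapping. */
typedef struct {
    uint32_t maxbucket;  /* highest bucket currently in use */
    uint32_t highmask;   /* mask for wrapping into 2^(n+1) buckets */
    uint32_t lowmask;    /* mask for wrapping into 2^n buckets */
} MetaPage;

/* Same mapping rule as _hash_hashkey2bucket(): try the larger mask,
 * fall back to the smaller one if that bucket doesn't exist yet. */
static uint32_t hashkey2bucket(uint32_t hashkey, const MetaPage *meta)
{
    uint32_t bucket = hashkey & meta->highmask;
    if (bucket > meta->maxbucket)
        bucket = bucket & meta->lowmask;
    return bucket;
}

/* Sketch of the protocol: compute the target bucket *without* the
 * metapage lock, then verify under the lock and retry if a concurrent
 * split moved the key.  `split_happens` injects one split (simplified
 * here to a full doubling of the table); `reads` counts block reads. */
static uint32_t read_target_bucket(uint32_t hashkey, MetaPage *meta,
                                   bool split_happens, int *reads)
{
    uint32_t bucket = hashkey2bucket(hashkey, meta);  /* unlocked */
    (*reads)++;
    for (;;)
    {
        /* ... acquire shared content lock on metapage ... */
        if (split_happens)
        {
            meta->lowmask = meta->highmask;
            meta->highmask = (meta->highmask << 1) | 1;
            meta->maxbucket = 2 * meta->maxbucket + 1;
            split_happens = false;
        }
        uint32_t recomputed = hashkey2bucket(hashkey, meta);
        if (recomputed != bucket)
        {
            /* release lock, read the recomputed block, loop again */
            bucket = recomputed;
            (*reads)++;
            continue;
        }
        break;  /* the block we read is still the right one */
    }
    return bucket;
}
```

With four buckets (maxbucket=3, highmask=3, lowmask=1), key 5 maps to bucket 1; if a split doubles the table before the verification step, the loop detects the mismatch and re-reads bucket 5.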
Re: [HACKERS] Hash Indexes
On Mon, Oct 10, 2016 at 5:55 AM, Amit Kapila wrote: > On Thu, Sep 29, 2016 at 8:27 PM, Amit Kapila > wrote: > > On Thu, Sep 29, 2016 at 6:04 AM, Robert Haas > wrote: > > > >> Another thing I don't quite understand about this algorithm is that in > >> order to conditionally lock the target primary bucket page, we'd first > >> need to read and pin it. And that doesn't seem like a good thing to > >> do while we're holding a shared content lock on the metapage, because > >> of the principle that we don't want to hold content locks across I/O. > >> > > > > Aren't we already doing this during BufferAlloc() when the buffer > selected by StrategyGetBuffer() is dirty? > Right, you probably shouldn't allocate another buffer while holding a content lock on a different one, if you can help it. But BufferAlloc doesn't do that internally, does it? It is only a problem if you make it one by the way you use it. Am I missing something? > > > I think we can release the metapage content lock before reading the buffer. > > > > On thinking about this again, if we release the metapage content lock > before reading and pinning the primary bucket page, then we need to > take it again to verify whether the split has happened during the time we > didn't have a lock on the metapage. Releasing and again taking the content > lock on the metapage is not > good from the performance aspect. Do you have some other idea for this? > Doesn't the relcache patch effectively deal with this? If this is a sticking point, maybe the relcache patch could be incorporated into this one. Cheers, Jeff
Re: [HACKERS] Hash Indexes
On Thu, Sep 29, 2016 at 8:27 PM, Amit Kapila wrote: > On Thu, Sep 29, 2016 at 6:04 AM, Robert Haas wrote: > >> Another thing I don't quite understand about this algorithm is that in >> order to conditionally lock the target primary bucket page, we'd first >> need to read and pin it. And that doesn't seem like a good thing to >> do while we're holding a shared content lock on the metapage, because >> of the principle that we don't want to hold content locks across I/O. >> > Aren't we already doing this during BufferAlloc() when the buffer selected by StrategyGetBuffer() is dirty? > I think we can release the metapage content lock before reading the buffer. > On thinking about this again, if we release the metapage content lock before reading and pinning the primary bucket page, then we need to take it again to verify whether the split has happened during the time we didn't have a lock on the metapage. Releasing and again taking the content lock on the metapage is not good from the performance aspect. Do you have some other idea for this? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Hash Indexes
On Wed, Oct 5, 2016 at 10:22 PM, Robert Haas wrote: > On Tue, Oct 4, 2016 at 12:36 AM, Amit Kapila wrote: >> I think one way to avoid the risk of deadlock in the above scenario is to >> take the cleanup lock conditionally: if we get the cleanup lock, then >> we will delete the items as we are doing in the patch now; otherwise we will >> just mark the tuples as dead and ensure that we don't try to remove >> tuples that are moved-by-split. Now, I think the question is how >> these dead tuples will be removed. We anyway need a separate mechanism to >> clear dead tuples for hash indexes, since during scans we mark the >> tuples as dead if the corresponding heap tuple is dead, and they are not >> removed later. This is already taken care of in the btree code via the >> kill_prior_tuple optimization. So I think clearing of dead tuples can >> be handled by a separate patch. > > That seems like it could work. The hash scan code will need to be > made smart enough to ignore any tuples marked dead, if it isn't > already. > It already takes care of ignoring killed tuples in the code below, though in a much less efficient way than btree. Basically, it fetches the matched tuple and then checks whether it is dead, whereas btree does the same check while matching the key. It might be more efficient to do it before matching the hash key, but I think that is a matter for a separate patch.

    hashgettuple()
    {
        ..
        /*
         * Skip killed tuples if asked to.
         */
        if (scan->ignore_killed_tuples)
    }

-- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
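[Editor's note: the efficiency point above, namely checking the killed flag before comparing the hash key, the way btree folds the check into key matching, can be illustrated with a toy scan loop. The types and the counter are invented for this sketch; this is not the actual hashgettuple() code path.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy index item: a hash key plus the "killed" hint bit.  Invented
 * layout; not the on-disk hash item format. */
typedef struct { uint32_t hashkey; bool killed; } ToyItem;

/* Scan a bucket for a key, skipping killed items *before* the key
 * comparison.  `comparisons` counts key comparisons performed, to
 * show the work saved by testing the cheap flag first. */
static int scan_bucket(const ToyItem *items, int n, uint32_t key,
                       int *comparisons)
{
    int matches = 0;
    for (int i = 0; i < n; i++)
    {
        if (items[i].killed)   /* cheap flag test first */
            continue;
        (*comparisons)++;      /* key comparison only for live items */
        if (items[i].hashkey == key)
            matches++;
    }
    return matches;
}
```

The current code does the equivalent check only after fetching the matched tuple; reordering it as above skips dead items without ever comparing their keys.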
Re: [HACKERS] Hash Indexes
On Tue, Oct 4, 2016 at 12:36 AM, Amit Kapila wrote: > I think one way to avoid the risk of deadlock in the above scenario is to > take the cleanup lock conditionally: if we get the cleanup lock, then > we will delete the items as we are doing in the patch now; otherwise we will > just mark the tuples as dead and ensure that we don't try to remove > tuples that are moved-by-split. Now, I think the question is how > these dead tuples will be removed. We anyway need a separate mechanism to > clear dead tuples for hash indexes, since during scans we mark the > tuples as dead if the corresponding heap tuple is dead, and they are not > removed later. This is already taken care of in the btree code via the > kill_prior_tuple optimization. So I think clearing of dead tuples can > be handled by a separate patch. That seems like it could work. The hash scan code will need to be made smart enough to ignore any tuples marked dead, if it isn't already. More aggressive cleanup can be left for another patch. > I have also thought about using the page-scan-at-a-time idea which has > been discussed upthread [1], but I think we can't completely eliminate > the need to out-wait scans (the cleanup lock requirement) for scans that > are started when a split is in progress, or for non-MVCC scans, as described > in that e-mail [1]. We might be able to find some way to solve the > problem with this approach, but I think it will be somewhat > complicated, and much more work is required compared to the previous > approach. There are several levels of aggressiveness here with different locking requirements:

1. Mark line items dead without reorganizing the page. Needs an exclusive content lock, no more. Even a shared content lock may be OK, as for other opportunistic bit-flipping.

2. Mark line items dead and compact the tuple data. If a pin is sufficient to look at tuple data, as it is for the heap, then a cleanup lock is required here. But if we always hold a shared content lock when looking at the tuple data, it might be possible to do this with just an exclusive content lock.

3. Remove dead line items completely, compacting the tuple data and the item-pointer array. Doing this with only an exclusive content lock certainly needs page-at-a-time mode, because otherwise a searcher that resumes a scan later might resume from the wrong place. It also needs the guarantee mentioned for point #2, namely that nobody will be examining the tuple data without a shared content lock.

4. Squeezing the bucket. This is probably always going to require a cleanup lock, because otherwise it's pretty unclear how a concurrent scan could be made safe. I suppose the scan could remember every TID it has seen, somehow detect that a squeeze had happened, and rescan the whole bucket ignoring TIDs already returned, but that seems to require the client to use potentially unbounded amounts of memory to remember already-returned TIDs, plus an as-yet-uninvented mechanism for detecting that a squeeze has happened. So this seems like a dead end to me.

I think that it is very much worthwhile to reduce the required lock strength from cleanup lock to exclusive lock in as many cases as possible, but I don't think it will be possible to completely eliminate the need to take the cleanup lock in some cases. However, if we can always take the cleanup lock conditionally and never be in a situation where it's absolutely required, we should be OK - and even level (1) gives you that. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Hash Indexes
On Tue, Oct 4, 2016 at 10:06 AM, Amit Kapila wrote: > On Thu, Sep 29, 2016 at 8:27 PM, Amit Kapila wrote: >> On Thu, Sep 29, 2016 at 6:04 AM, Robert Haas wrote: >>> On Wed, Sep 28, 2016 at 3:04 PM, Robert Haas wrote: >>> >>> As I was looking at the old text regarding deadlock risk, I realized >>> what may be a serious problem. Suppose process A is performing a scan >>> of some hash index. While the scan is suspended, it attempts to take >>> a lock and is blocked by process B. Process B, meanwhile, is running >>> VACUUM on that hash index. Eventually, it will do >>> LockBufferForCleanup() on the hash bucket on which process A holds a >>> buffer pin, resulting in an undetected deadlock. In the current >>> coding, A would hold a heavyweight lock and B would attempt to acquire >>> a conflicting heavyweight lock, and the deadlock detector would kill >>> one of them. This patch probably breaks that. I notice that that's >>> the only place where we attempt to acquire a buffer cleanup lock >>> unconditionally; every place else, we acquire the lock conditionally, >>> so there's no deadlock risk. Once we resolve this problem, the >>> paragraph about deadlock risk in this section should be revised to >>> explain whatever solution we come up with. >>> >>> By the way, since VACUUM must run in its own transaction, B can't be >>> holding arbitrary locks, but that doesn't seem quite sufficient to get >>> us out of the woods. It will at least hold ShareUpdateExclusiveLock >>> on the relation being vacuumed, and process A could attempt to acquire >>> that same lock. >>> >> >> Right, I think there is a danger of deadlock in above situation. >> Needs some more thoughts. 
>> > > I think one way to avoid the risk of deadlock in the above scenario is to > take the cleanup lock conditionally: if we get the cleanup lock, then > we will delete the items as we are doing in the patch now; otherwise we will > just mark the tuples as dead and ensure that we don't try to remove > tuples that are moved-by-split. Now, I think the question is how > these dead tuples will be removed. We anyway need a separate mechanism to > clear dead tuples for hash indexes, since during scans we mark the > tuples as dead if the corresponding heap tuple is dead, and they are not > removed later. This is already taken care of in the btree code via the > kill_prior_tuple optimization. So I think clearing of dead tuples can > be handled by a separate patch. > I think we can also remove the dead tuples the next time vacuum visits the bucket and is able to acquire the cleanup lock. Right now we are just checking whether the corresponding heap tuple is dead; we can have an additional check as well, so that if the current item is already marked dead in the index, we consider it in the list of deletable items. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Hash Indexes
On Thu, Sep 29, 2016 at 8:27 PM, Amit Kapila wrote: > On Thu, Sep 29, 2016 at 6:04 AM, Robert Haas wrote: >> On Wed, Sep 28, 2016 at 3:04 PM, Robert Haas wrote: >> >> As I was looking at the old text regarding deadlock risk, I realized >> what may be a serious problem. Suppose process A is performing a scan >> of some hash index. While the scan is suspended, it attempts to take >> a lock and is blocked by process B. Process B, meanwhile, is running >> VACUUM on that hash index. Eventually, it will do >> LockBufferForCleanup() on the hash bucket on which process A holds a >> buffer pin, resulting in an undetected deadlock. In the current >> coding, A would hold a heavyweight lock and B would attempt to acquire >> a conflicting heavyweight lock, and the deadlock detector would kill >> one of them. This patch probably breaks that. I notice that that's >> the only place where we attempt to acquire a buffer cleanup lock >> unconditionally; every place else, we acquire the lock conditionally, >> so there's no deadlock risk. Once we resolve this problem, the >> paragraph about deadlock risk in this section should be revised to >> explain whatever solution we come up with. >> >> By the way, since VACUUM must run in its own transaction, B can't be >> holding arbitrary locks, but that doesn't seem quite sufficient to get >> us out of the woods. It will at least hold ShareUpdateExclusiveLock >> on the relation being vacuumed, and process A could attempt to acquire >> that same lock. >> > > Right, I think there is a danger of deadlock in the above situation. > Needs some more thought. > I think one way to avoid the risk of deadlock in the above scenario is to take the cleanup lock conditionally: if we get the cleanup lock, then we will delete the items as we are doing in the patch now; otherwise we will just mark the tuples as dead and ensure that we don't try to remove tuples that are moved-by-split. Now, I think the question is how these dead tuples will be removed.
We anyway need a separate mechanism to clear dead tuples for hash indexes, since during scans we mark the tuples as dead if the corresponding heap tuple is dead, and they are not removed later. This is already taken care of in the btree code via the kill_prior_tuple optimization. So I think clearing of dead tuples can be handled by a separate patch. I have also thought about using the page-scan-at-a-time idea which has been discussed upthread [1], but I think we can't completely eliminate the need to out-wait scans (the cleanup lock requirement) for scans that are started when a split is in progress, or for non-MVCC scans, as described in that e-mail [1]. We might be able to find some way to solve the problem with this approach, but I think it will be somewhat complicated, and much more work is required compared to the previous approach. What is your preference among the above approaches to resolve this problem, or let me know if you have a better idea to solve it? [1] - https://www.postgresql.org/message-id/CAA4eK1Jj1UqneTXrywr%3DGg87vgmnMma87LuscN_r3hKaHd%3DL2g%40mail.gmail.com -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Hash Indexes
Jeff Janes writes: > I've done a simple comparison using pgbench's default transaction, in which > all the primary keys have been dropped and replaced with indexes of either > hash or btree type, alternating over many rounds. > I run 'pgbench -c16 -j16 -T 900 -M prepared' on an 8 core machine with a > scale of 40. All the data fits in RAM, but not in shared_buffers (128MB). > I find a 4% improvement for hash indexes over btree indexes, 9324.744 > vs 9727.766. The difference is significant at p-value of 1.9e-9. Thanks for doing this work! > The four versions of hash indexes (HEAD, concurrent, wal, cache, applied > cumulatively) have no statistically significant difference in performance > from each other. Interesting. > I think I don't see improvement in hash performance with the concurrent and > cache patches because I don't have enough cores to get to the contention > that those patches are targeted at. Possibly. However, if the cache patch is not a prerequisite to the WAL fixes, IMO somebody would have to demonstrate that it has a measurable performance benefit before it would get in. It certainly doesn't look like it's simplifying the code, so I wouldn't take it otherwise. I think, though, that this is enough to put to bed the argument that we should toss the hash AM entirely. If it's already competitive with btree today, despite the lack of attention that it's gotten, there is reason to hope that it will be a significant win (for some use-cases, obviously) in future. We should now get back to reviewing these patches on their own merits. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hash Indexes
On Thu, Sep 29, 2016 at 5:14 PM, Robert Haas wrote: > On Thu, Sep 29, 2016 at 8:07 PM, Peter Geoghegan wrote: > > On Wed, Sep 28, 2016 at 8:06 PM, Andres Freund > wrote: > >> On 2016-09-28 15:04:30 -0400, Robert Haas wrote: > >>> Andres already > >>> stated that he things working on btree-over-hash would be more > >>> beneficial than fixing hash, but at this point it seems like he's the > >>> only one who takes that position. > >> > >> Note that I did *NOT* take that position. I was saying that I think we > >> should evaluate whether that's not a better approach, doing some simple > >> performance comparisons. > > > > I, for one, agree with this position. > > Well, I, for one, find it frustrating. It seems pretty unhelpful to > bring this up only after the code has already been written. The first > post on this thread was on May 10th. The first version of the patch > was posted on June 16th. This position was first articulated on > September 15th. > > But, by all means, please feel free to do the performance comparison > and post the results. I'd be curious to see them myself. > I've done a simple comparison using pgbench's default transaction, in which all the primary keys have been dropped and replaced with indexes of either hash or btree type, alternating over many rounds. I run 'pgbench -c16 -j16 -T 900 -M prepared' on an 8 core machine with a scale of 40. All the data fits in RAM, but not in shared_buffers (128MB). I find a 4% improvement for hash indexes over btree indexes, 9324.744 vs 9727.766. The difference is significant at p-value of 1.9e-9. The four versions of hash indexes (HEAD, concurrent, wal, cache, applied cumulatively) have no statistically significant difference in performance from each other. I certainly don't see how btree-over-hash-over-integer could be better than direct btree-over-integer. 
I think I don't see improvement in hash performance with the concurrent and cache patches because I don't have enough cores to get to the contention that those patches are targeted at. But since the concurrent patch is a prerequisite to the wal patch, that is enough to justify it even without a demonstrated performance boost. A 4% gain is not astonishing, but is nice to have provided we can get it without giving up crash safety. Cheers, Jeff
Re: [HACKERS] Hash Indexes
On Mon, Oct 3, 2016 at 12:42 AM, Pavel Stehule wrote: > > > 2016-10-02 12:40 GMT+02:00 Michael Paquier : >> >> On Sun, Oct 2, 2016 at 3:31 AM, Greg Stark wrote: >> > On Fri, Sep 30, 2016 at 2:11 AM, Robert Haas >> > wrote: >> >> For one thing, we can stop shipping a totally broken feature in release >> >> after release >> > >> > For what it's worth I'm for any patch that can accomplish that. >> > >> > In retrospect I think we should have done the hash-over-btree thing >> > ten years ago, but we didn't, and if Amit's patch makes hash indexes >> > recoverable today then go for it. >> >> +1. > > +1 And moved to the next CF to let it breathe. -- Michael
Re: [HACKERS] Hash Indexes
2016-10-02 12:40 GMT+02:00 Michael Paquier : > On Sun, Oct 2, 2016 at 3:31 AM, Greg Stark wrote: > > On Fri, Sep 30, 2016 at 2:11 AM, Robert Haas > wrote: > >> For one thing, we can stop shipping a totally broken feature in release > after release > > > > For what it's worth I'm for any patch that can accomplish that. > > > > In retrospect I think we should have done the hash-over-btree thing > > ten years ago but we didn't and if Amit's patch makes hash indexes > > recoverable today then go for it. > > +1. > +1 Pavel > -- > Michael > > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers >
Re: [HACKERS] Hash Indexes
On Sun, Oct 2, 2016 at 3:31 AM, Greg Stark wrote: > On Fri, Sep 30, 2016 at 2:11 AM, Robert Haas wrote: >> For one thing, we can stop shipping a totally broken feature in release >> after release > > For what it's worth I'm for any patch that can accomplish that. > > In retrospect I think we should have done the hash-over-btree thing > ten years ago but we didn't and if Amit's patch makes hash indexes > recoverable today then go for it. +1. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hash Indexes
On Fri, Sep 30, 2016 at 2:11 AM, Robert Haas wrote: > For one thing, we can stop shipping a totally broken feature in release after > release For what it's worth I'm for any patch that can accomplish that. In retrospect I think we should have done the hash-over-btree thing ten years ago but we didn't and if Amit's patch makes hash indexes recoverable today then go for it. -- greg -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hash Indexes
Andres Freund : On 2016-09-30 17:39:04 +0100, Peter Geoghegan wrote: On Fri, Sep 30, 2016 at 4:46 PM, Robert Haas wrote: > I would just be very disappointed if, after the amount of work that > Amit and others have put into this project, the code gets rejected > because somebody thinks a different project would have been more worth > doing. I wouldn't presume to tell anyone else how to spend their time, and am not concerned about this making the hash index code any less useful from the user's perspective. Me neither. I'm concerned that this is a heck of a lot of work, and I don't think we've reached the end of it by a good bit. I think it would have, and probably still is, a more efficient use of time to go for the hash-via-btree method, and rip out the current hash indexes. But that's just me. I find it more than a bit odd to be accused of trying to waste others' time by saying this, and that this is too late because time has already been invested. Especially the latter never has been a standard in the community, and while excruciatingly painful when one is the person(s) having invested the time, it probably shouldn't be. > The fact that we have hash indexes already and cannot remove them > because too much other code depends on hash opclasses is also, in my > opinion, a sufficiently good reason to pursue improving them. I think that Andres was suggesting that hash index opclasses would be usable with hash-over-btree, so you might still not end up with the wart of having hash opclasses without hash indexes (an idea that has been proposed and rejected at least once before now). Andres? Yes, that was what I was pretty much thinking. I was kind of guessing that this might be easiest implemented as a separate AM ("hash2" ;)) that's just a layer on top of nbtree. Greetings, Andres Freund Hi, There have been benchmarks posted over the years where even the non-WAL-logged hash outperformed the btree variant. You cannot argue against O(1) algorithm behavior. 
We need to have a usable hash index so that others can help improve it. My 2 cents. Regards, Ken
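[Editor's note] Ken's O(1) point can be made concrete with a toy comparison: an equality probe against a hash structure takes constant expected time, while a btree-like ordered structure needs a logarithmic number of comparisons. The names below are illustrative stand-ins (a dict for a hash AM, a bisect-searched sorted list for a btree), not anything from the PostgreSQL source:

```python
import bisect

# Toy models of the two access methods under discussion. A dict
# stands in for a hash index (O(1) expected probes); a sorted list
# searched with bisect stands in for a btree descent (O(log n)
# comparisons). Purely illustrative names, not PostgreSQL APIs.

def build_hash_index(keys):
    return {k: i for i, k in enumerate(keys)}

def build_btree_index(keys):
    return sorted((k, i) for i, k in enumerate(keys))

def hash_lookup(index, key):
    # expected O(1), independent of index size
    return index.get(key)

def btree_lookup(index, key):
    # O(log n) comparisons to descend to the match
    pos = bisect.bisect_left(index, (key,))
    if pos < len(index) and index[pos][0] == key:
        return index[pos][1]
    return None
```

The thread's real dispute is not about these asymptotics but about the constant factors of the actual hash AM implementation, which a toy model like this cannot capture.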
Re: [HACKERS] Hash Indexes
On 30-Sep-2016 10:26 PM, "Peter Geoghegan" wrote: > > On Fri, Sep 30, 2016 at 4:46 PM, Robert Haas wrote: > > I would just be very disappointed if, after the amount of work that > > Amit and others have put into this project, the code gets rejected > > because somebody thinks a different project would have been more worth > > doing. > > I wouldn't presume to tell anyone else how to spend their time, and am > not concerned about this patch making the hash index code any less > useful from the user's perspective. If this is how we remove the wart > of hash indexes not being WAL-logged, that's fine by me. I'm trying to > be helpful. > If that is fine, then I think we should do that. I want to bring it to your notice that we have already seen and reported that with the proposed set of patches, hash indexes are a good bit faster than btree, so that adds additional value in making them WAL-logged. > > As Tom said upthread: "But to kick the hash AM as such to the > curb is to say > 'sorry, there will never be O(1) index lookups in Postgres'." I > think that's correct and a sufficiently-good reason to pursue this > work, regardless of the merits (or lack of merits) of hash-over-btree. > > I don't think that "O(1) index lookups" is a useful guarantee with a > very expensive constant factor. The constant factor doesn't play much of a role when the data has no duplicates, or only a few. Amit seemed to agree with this, since > he spoke of the importance of both theoretical performance benefits > and practically realizable performance benefits. > No, I don't agree with that; rather, I think hash indexes are theoretically faster than btree, and we have seen that practically as well for quite a few cases (for read workloads, when used with unique data and also in nested loops). With Regards, Amit Kapila
Re: [HACKERS] Hash Indexes
On 2016-09-30 17:39:04 +0100, Peter Geoghegan wrote: > On Fri, Sep 30, 2016 at 4:46 PM, Robert Haas wrote: > > I would just be very disappointed if, after the amount of work that > > Amit and others have put into this project, the code gets rejected > > because somebody thinks a different project would have been more worth > > doing. > > I wouldn't presume to tell anyone else how to spend their time, and am > not concerned about this making the hash index code any less useful > from the user's perspective. Me neither. I'm concerned that this is a heck of a lot of work, and I don't think we've reached the end of it by a good bit. I think it would have, and probably still is, a more efficient use of time to go for the hash-via-btree method, and rip out the current hash indexes. But that's just me. I find it more than a bit odd to be accused of trying to waste others' time by saying this, and that this is too late because time has already been invested. Especially the latter never has been a standard in the community, and while excruciatingly painful when one is the person(s) having invested the time, it probably shouldn't be. > > The fact that we have hash indexes already and cannot remove them > > because too much other code depends on hash opclasses is also, in my > > opinion, a sufficiently good reason to pursue improving them. > > I think that Andres was suggesting that hash index opclasses would be > usable with hash-over-btree, so you might still not end up with the > wart of having hash opclasses without hash indexes (an idea that has > been proposed and rejected at least once before now). Andres? Yes, that was what I was pretty much thinking. I was kind of guessing that this might be easiest implemented as a separate AM ("hash2" ;)) that's just a layer on top of nbtree. Greetings, Andres Freund
Re: [HACKERS] Hash Indexes
Peter Geoghegan writes: > On Fri, Sep 30, 2016 at 4:46 PM, Robert Haas wrote: >> The fact that we have hash indexes already and cannot remove them >> because too much other code depends on hash opclasses is also, in my >> opinion, a sufficiently good reason to pursue improving them. > I think that Andres was suggesting that hash index opclasses would be > usable with hash-over-btree, so you might still not end up with the > wart of having hash opclasses without hash indexes (an idea that has > been proposed and rejected at least once before now). Andres? That's an interesting point. If we were to flat-out replace the hash AM with a hash-over-btree AM, the existing hash opclasses would just migrate to that unchanged. But if someone wanted to add hash-over-btree alongside the hash AM, it would be necessary to clone all those opclass entries, or else find a way for the two AMs to share pg_opclass etc entries. Either one of those is kind of annoying. (Although if we did do the work of implementing the latter, it might come in handy in future; you could certainly imagine that there will be cases like a next-generation GIST AM wanting to reuse the opclasses of existing GIST, say.) But having said that, I remain opposed to removing the hash AM. If someone wants to implement hash-over-btree, that's cool with me, but I'd much rather put it in beside plain hash and let them duke it out in the field. regards, tom lane
Re: [HACKERS] Hash Indexes
On Fri, Sep 30, 2016 at 4:46 PM, Robert Haas wrote: > I would just be very disappointed if, after the amount of work that > Amit and others have put into this project, the code gets rejected > because somebody thinks a different project would have been more worth > doing. I wouldn't presume to tell anyone else how to spend their time, and am not concerned about this patch making the hash index code any less useful from the user's perspective. If this is how we remove the wart of hash indexes not being WAL-logged, that's fine by me. I'm trying to be helpful. > As Tom said upthread: "But to kick the hash AM as such to the > curb is to say > 'sorry, there will never be O(1) index lookups in Postgres'." I > think that's correct and a sufficiently-good reason to pursue this > work, regardless of the merits (or lack of merits) of hash-over-btree. I don't think that "O(1) index lookups" is a useful guarantee with a very expensive constant factor. Amit seemed to agree with this, since he spoke of the importance of both theoretical performance benefits and practically realizable performance benefits. > The fact that we have hash indexes already and cannot remove them > because too much other code depends on hash opclasses is also, in my > opinion, a sufficiently good reason to pursue improving them. I think that Andres was suggesting that hash index opclasses would be usable with hash-over-btree, so you might still not end up with the wart of having hash opclasses without hash indexes (an idea that has been proposed and rejected at least once before). -- Peter Geoghegan
Re: [HACKERS] Hash Indexes
On Fri, Sep 30, 2016 at 4:46 PM, Robert Haas wrote: > I would just be very disappointed if, after the amount of work that > Amit and others have put into this project, the code gets rejected > because somebody thinks a different project would have been more worth > doing. I wouldn't presume to tell anyone else how to spend their time, and am not concerned about this making the hash index code any less useful from the user's perspective. If this is how we remove the wart of hash indexes not being WAL-logged, that's fine by me. I am trying to be helpful. > As Tom said upthread: "But to kick the hash AM as such to the > curb is to say > 'sorry, there will never be O(1) index lookups in Postgres'." I > think that's correct and a sufficiently-good reason to pursue this > work, regardless of the merits (or lack of merits) of hash-over-btree. I don't think that "O(1) index lookups" is a useful guarantee with a very expensive constant factor. Amit said: "I think any discussion on whether we should consider not to improve current hash indexes is only meaningful if some one has a code which can prove both theoretically and practically that it is better than hash indexes for all usages", so I think that he shares this view. > The fact that we have hash indexes already and cannot remove them > because too much other code depends on hash opclasses is also, in my > opinion, a sufficiently good reason to pursue improving them. I think that Andres was suggesting that hash index opclasses would be usable with hash-over-btree, so you might still not end up with the wart of having hash opclasses without hash indexes (an idea that has been proposed and rejected at least once before now). Andres? To be clear: I haven't expressed any opinion on this patch. -- Peter Geoghegan
Re: [HACKERS] Hash Indexes
On Fri, Sep 30, 2016 at 7:47 AM, Peter Geoghegan wrote: > My only firm position is that it wouldn't be very hard to investigate > hash-over-btree to Andres' satisfaction, say, so, why not? I'm > surprised that this has caused consternation -- ISTM that Andres' > suggestion is *perfectly* reasonable. It doesn't appear to be an > objection to anything in particular. I would just be very disappointed if, after the amount of work that Amit and others have put into this project, the code gets rejected because somebody thinks a different project would have been more worth doing. As Tom said upthread: "But to kick the hash AM as such to the curb is to say 'sorry, there will never be O(1) index lookups in Postgres'." I think that's correct and a sufficiently-good reason to pursue this work, regardless of the merits (or lack of merits) of hash-over-btree. The fact that we have hash indexes already and cannot remove them because too much other code depends on hash opclasses is also, in my opinion, a sufficiently good reason to pursue improving them. I don't think the project needs the additional justification of outperforming a hash-over-btree in order to exist, even if such a comparison could be done fairly, which I suspect is harder than you're crediting. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Hash Indexes
On Fri, Sep 30, 2016 at 9:07 AM, Amit Kapila wrote: > Considering, I have missed the real intention of their suggestions, I think > such a serious objection on any work should be discussed on list. To answer > the actual objection, I have already mentioned upthread that we can deduce > from the current tests done by Jesper and Mithun that there are some cases > where hash index will be better than hash-over-btree (tests done over > integer columns). I think any discussion on whether we should consider not > to improve current hash indexes is only meaningful if some one has a code > which can prove both theoretically and practically that it is better than > hash indexes for all usages. I cannot speak for Andres, but you judged my intent here correctly. I have no firm position on any of this just yet; I haven't even read the patch. I just think that it is worth doing some simple analysis of a hash-over-btree implementation, with simple prototyping and a simple test-case. I would consider that a due-diligence thing, because, honestly, it seems obvious to me that it should be at least checked out informally. I wasn't aware that there was already some analysis of this. Robert did just acknowledge that it is *possible* that "the hash-over-btree idea completely trounces hash indexes", so the general tone of this thread suggested to me that there was little or no analysis of hash-over-btree. I'm willing to believe that I'm wrong to be dismissive of the hash AM in general, and I'm even willing to be flexible on crediting the hash AM with being less optimized overall (assuming we can see a way past that). My only firm position is that it wouldn't be very hard to investigate hash-over-btree to Andres' satisfaction, say, so, why not? I'm surprised that this has caused consternation -- ISTM that Andres' suggestion is *perfectly* reasonable. It doesn't appear to be an objection to anything in particular. 
-- Peter Geoghegan
Re: [HACKERS] Hash Indexes
On 30-Sep-2016 6:24 AM, "Robert Haas" wrote: > > On Thu, Sep 29, 2016 at 8:29 PM, Andres Freund wrote: > > On September 29, 2016 5:28:00 PM PDT, Robert Haas wrote: > >>On Thu, Sep 29, 2016 at 8:16 PM, Andres Freund > >>wrote: > Well, I, for one, find it frustrating. It seems pretty unhelpful to > bring this up only after the code has already been written. > >>> > >>> I brought this up in person at pgcon too. > >> > >>To whom? In what context? > > > > Amit, over dinner. > > OK, well, I can't really comment on that, then, except to say that if > you waited three months to follow up on the mailing list, you probably > can't blame Amit if he thought that it was more of a casual suggestion > than a serious objection. Maybe it was? I don't know. > Both of them have talked about hash indexes with me offline. Peter mentioned that it would be better to improve btree rather than hash indexes. IIRC, Andres asked me mainly about what use cases I have in mind for hash indexes, and then we had some further discussion on the same thing, where he was not convinced that there is any big use case for hash indexes even though there may be some cases. In that discussion, as he is saying and I don't doubt him, he would have told me the alternative, but it was not apparent to me that he was expecting some sort of comparison. What I got from both the discussions was a friendly gesture that it might be a better use of my time if I worked on some other problem. I really respect suggestions from both of them, but it was nowhere clear to me that either of them was expecting any comparison with the other approach. Considering that I have missed the real intention of their suggestions, I think such a serious objection to any work should be discussed on list. To answer the actual objection, I have already mentioned upthread that we can deduce from the current tests done by Jesper and Mithun that there are some cases where hash index will be better than hash-over-btree (tests done over integer columns). 
I think any discussion on whether we should consider not improving the current hash indexes is only meaningful if someone has code which can prove both theoretically and practically that it is better than hash indexes for all usages. Note - excuse me for the formatting of this email as I am traveling and using my phone. With Regards, Amit Kapila.
Re: [HACKERS] Hash Indexes
On Thu, Sep 29, 2016 at 8:53 PM, Peter Geoghegan wrote: > On Fri, Sep 30, 2016 at 1:14 AM, Robert Haas wrote: >>> I, for one, agree with this position. >> >> Well, I, for one, find it frustrating. It seems pretty unhelpful to >> bring this up only after the code has already been written. The first >> post on this thread was on May 10th. The first version of the patch >> was posted on June 16th. This position was first articulated on >> September 15th. > > Really, what do we have to lose at this point? It's not very difficult > to do what Andres proposes. Well, first of all, I can't, because I don't really understand what tests he has in mind. Maybe somebody else does, in which case perhaps they could do the work and post the results. If the tests really are simple, that shouldn't be much of a burden. But, second, suppose we do the tests and find out that the hash-over-btree idea completely trounces hash indexes. What then? I don't think that would really prove anything because, as I said in my email to Andres, the current hash index code is severely under-optimized, so it's not really an apples-to-apples comparison. But even if it did prove something, is the idea then that Amit (with help from Mithun and Ashutosh Sharma) should throw away the ~8 months of development work that's been done on hash indexes in favor of starting all over with a new and probably harder project to build a whole new AM, and just leave hash indexes broken? That doesn't seem like a very reasonable thing to ask. Leaving hash indexes broken fixes no problem that we have. On the other hand, applying those patches (after they've been suitably reviewed and fixed up) does fix several things. For one thing, we can stop shipping a totally broken feature in release after release. For another thing, those hash indexes do in fact outperform btree on some workloads, and with more work they can probably beat btree on more workloads. 
And if somebody later wants to write hash-over-btree and that turns out to be better still, great! I'm not blocking anyone from doing that. The only argument that's been advanced for not fixing hash indexes is that we'd then have to give people accurate guidance on whether to choose hash or btree, but that would also be true of a hypothetical hash-over-btree AM. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Hash Indexes
On Thu, Sep 29, 2016 at 8:29 PM, Andres Freund wrote: > On September 29, 2016 5:28:00 PM PDT, Robert Haas > wrote: >>On Thu, Sep 29, 2016 at 8:16 PM, Andres Freund >>wrote: Well, I, for one, find it frustrating. It seems pretty unhelpful to bring this up only after the code has already been written. >>> >>> I brought this up in person at pgcon too. >> >>To whom? In what context? > > Amit, over dinner. OK, well, I can't really comment on that, then, except to say that if you waited three months to follow up on the mailing list, you probably can't blame Amit if he thought that it was more of a casual suggestion than a serious objection. Maybe it was? I don't know. For my part, I don't really understand how you think that we could find anything out via relatively simple tests. The hash index code is horribly under-maintained, which is why Amit is able to get large performance improvements out of improving it. If you compare it to btree in some way, it's probably going to lose. But I don't think that answers the question of whether a hash AM that somebody's put some work into will win or lose against a hypothetical hash-over-btree AM that nobody's written. Even if it wins, is that really a reason to leave the hash index code itself in a state of disrepair? We probably would have removed it already except that the infrastructure is used for hash joins and hash aggregation, so we really can't. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Hash Indexes
On Fri, Sep 30, 2016 at 1:14 AM, Robert Haas wrote: >> I, for one, agree with this position. > > Well, I, for one, find it frustrating. It seems pretty unhelpful to > bring this up only after the code has already been written. The first > post on this thread was on May 10th. The first version of the patch > was posted on June 16th. This position was first articulated on > September 15th. Really, what do we have to lose at this point? It's not very difficult to do what Andres proposes. -- Peter Geoghegan
Re: [HACKERS] Hash Indexes
On Fri, Sep 30, 2016 at 1:29 AM, Andres Freund wrote: >>To whom? In what context? > > Amit, over dinner. In case it matters, I also talked to Amit about this privately. -- Peter Geoghegan
Re: [HACKERS] Hash Indexes
On September 29, 2016 5:28:00 PM PDT, Robert Haas wrote: >On Thu, Sep 29, 2016 at 8:16 PM, Andres Freund >wrote: >>> Well, I, for one, find it frustrating. It seems pretty unhelpful to >>> bring this up only after the code has already been written. >> >> I brought this up in person at pgcon too. > >To whom? In what context? Amit, over dinner. Andres -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
Re: [HACKERS] Hash Indexes
On Thu, Sep 29, 2016 at 8:16 PM, Andres Freund wrote: >> Well, I, for one, find it frustrating. It seems pretty unhelpful to >> bring this up only after the code has already been written. > > I brought this up in person at pgcon too. To whom? In what context? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Hash Indexes
On 2016-09-29 20:14:40 -0400, Robert Haas wrote: > On Thu, Sep 29, 2016 at 8:07 PM, Peter Geoghegan wrote: > > On Wed, Sep 28, 2016 at 8:06 PM, Andres Freund wrote: > >> On 2016-09-28 15:04:30 -0400, Robert Haas wrote: > >>> Andres already > >>> stated that he thinks working on btree-over-hash would be more > >>> beneficial than fixing hash, but at this point it seems like he's the > >>> only one who takes that position. > >> > >> Note that I did *NOT* take that position. I was saying that I think we > >> should evaluate whether that's not a better approach, doing some simple > >> performance comparisons. > > > > I, for one, agree with this position. > > Well, I, for one, find it frustrating. It seems pretty unhelpful to > bring this up only after the code has already been written. I brought this up in person at pgcon too.
Re: [HACKERS] Hash Indexes
On Thu, Sep 29, 2016 at 8:07 PM, Peter Geoghegan wrote: > On Wed, Sep 28, 2016 at 8:06 PM, Andres Freund wrote: >> On 2016-09-28 15:04:30 -0400, Robert Haas wrote: >>> Andres already >>> stated that he thinks working on btree-over-hash would be more >>> beneficial than fixing hash, but at this point it seems like he's the >>> only one who takes that position. >> >> Note that I did *NOT* take that position. I was saying that I think we >> should evaluate whether that's not a better approach, doing some simple >> performance comparisons. > > I, for one, agree with this position. Well, I, for one, find it frustrating. It seems pretty unhelpful to bring this up only after the code has already been written. The first post on this thread was on May 10th. The first version of the patch was posted on June 16th. This position was first articulated on September 15th. But, by all means, please feel free to do the performance comparison and post the results. I'd be curious to see them myself. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Hash Indexes
On Wed, Sep 28, 2016 at 8:06 PM, Andres Freund wrote: > On 2016-09-28 15:04:30 -0400, Robert Haas wrote: >> Andres already >> stated that he thinks working on btree-over-hash would be more >> beneficial than fixing hash, but at this point it seems like he's the >> only one who takes that position. > > Note that I did *NOT* take that position. I was saying that I think we > should evaluate whether that's not a better approach, doing some simple > performance comparisons. I, for one, agree with this position. -- Peter Geoghegan
Re: [HACKERS] Hash Indexes
On Thu, Sep 29, 2016 at 6:04 AM, Robert Haas wrote: > On Wed, Sep 28, 2016 at 3:04 PM, Robert Haas wrote: >> I'll write another email with my thoughts about the rest of the patch. > > I think that the README changes for this patch need a fairly large > amount of additional work. Here are a few things I notice: > > - The confusion between buckets and pages hasn't been completely > cleared up. If you read the beginning of the README, the terminology > is clearly set forth. It says: > >>> A hash index consists of two or more "buckets", into which tuples are >>> placed whenever their hash key maps to the bucket number. Each bucket in >>> the hash index comprises one or more index pages. The bucket's first page >>> is permanently assigned to it when the bucket is created. Additional pages, >>> called "overflow pages", are added if the bucket receives too many tuples >>> to fit in the primary bucket page." > > But later on, you say: > >>> Scan will take a lock in shared mode on the primary bucket or on one of the >>> overflow page. > > So the correct terminology here would be "primary bucket page" not > "primary bucket". > > - In addition, notice that there are two English errors in this > sentence: the word "the" needs to be added to the beginning of the > sentence, and the last word needs to be "pages" rather than "page". > There are a considerable number of similar minor errors; if you can't > fix them, I'll make a pass over it and clean it up. > > - The whole "lock definitions" section seems to me to be pretty loose > and imprecise about what is happening. For example, it uses the term > "split-in-progress" without first defining it. The sentence quoted > above says that scans take a lock in shared mode either on the primary > page or on one of the overflow pages, but it's not to document code by > saying that it will do either A or B without explaining which one! 
In > fact, I think that a scan will take a content lock first on the > primary bucket page and then on each overflow page in sequence, > retaining a pin on the primary buffer page throughout the scan. So it > is not one or the other but both in a particular sequence, and that > can and should be explained. > > Another problem with this section is that even when it's precise about > what is going on, it's probably duplicating what is or should be in > the following sections where the algorithms for each operation are > explained. In the original wording, this section explains what each > lock protects, and then the following sections explain the algorithms > in the context of those definitions. Now, this section contains a > sketch of the algorithm, and then the following sections lay it out > again in more detail. The question of what each lock protects has > been lost. Here's an attempt at some text to replace what you have > here: > > === > Concurrency control for hash indexes is provided using buffer content > locks, buffer pins, and cleanup locks. Here as elsewhere in > PostgreSQL, cleanup lock means that we hold an exclusive lock on the > buffer and have observed at some point after acquiring the lock that > we hold the only pin on that buffer. For hash indexes, a cleanup lock > on a primary bucket page represents the right to perform an arbitrary > reorganization of the entire bucket, while a cleanup lock on an > overflow page represents the right to perform a reorganization of just > that page. Therefore, scans retain a pin on both the primary bucket > page and the overflow page they are currently scanning, if any. > I don't think we take cleanup lock on overflow page, so I will edit that part. > Splitting a bucket requires a cleanup lock on both the old and new > primary bucket pages. VACUUM therefore takes a cleanup lock on every > bucket page in turn in order to remove tuples. 
It can also remove tuples > copied to a new bucket by any previous split operation, because the > cleanup lock taken on the primary bucket page guarantees that no scans > which started prior to the most recent split can still be in progress. > After cleaning each page individually, it attempts to take a cleanup > lock on the primary bucket page in order to "squeeze" the bucket down > to the minimum possible number of pages. > === > > As I was looking at the old text regarding deadlock risk, I realized > what may be a serious problem. Suppose process A is performing a scan > of some hash index. While the scan is suspended, it attempts to take > a lock and is blocked by process B. Process B, meanwhile, is running > VACUUM on that hash index. Eventually, it will do > LockBufferForCleanup() on the hash bucket on which process A holds a > buffer pin, resulting in an undetected deadlock. In the current > coding, A would hold a heavyweight lock and B would attempt to acquire > a conflicting heavyweight lock, and the deadlock detector would kill > one of them. This patch probably breaks that. I notice that that's > the only place where we att
Re: [HACKERS] Hash Indexes
On Wed, Sep 28, 2016 at 3:04 PM, Robert Haas wrote: > I'll write another email with my thoughts about the rest of the patch. I think that the README changes for this patch need a fairly large amount of additional work. Here are a few things I notice: - The confusion between buckets and pages hasn't been completely cleared up. If you read the beginning of the README, the terminology is clearly set forth. It says: >> A hash index consists of two or more "buckets", into which tuples are placed >> whenever their hash key maps to the bucket number. Each bucket in the hash >> index comprises one or more index pages. The bucket's first page is >> permanently assigned to it when the bucket is created. Additional pages, >> called "overflow pages", are added if the bucket receives too many tuples to >> fit in the primary bucket page." But later on, you say: >> Scan will take a lock in shared mode on the primary bucket or on one of the >> overflow page. So the correct terminology here would be "primary bucket page" not "primary bucket". - In addition, notice that there are two English errors in this sentence: the word "the" needs to be added to the beginning of the sentence, and the last word needs to be "pages" rather than "page". There are a considerable number of similar minor errors; if you can't fix them, I'll make a pass over it and clean it up. - The whole "lock definitions" section seems to me to be pretty loose and imprecise about what is happening. For example, it uses the term "split-in-progress" without first defining it. The sentence quoted above says that scans take a lock in shared mode either on the primary page or on one of the overflow pages, but it's not OK to document code by saying that it will do either A or B without explaining which one! In fact, I think that a scan will take a content lock first on the primary bucket page and then on each overflow page in sequence, retaining a pin on the primary buffer page throughout the scan. 
So it is not one or the other but both in a particular sequence, and that can and should be explained. Another problem with this section is that even when it's precise about what is going on, it's probably duplicating what is or should be in the following sections where the algorithms for each operation are explained. In the original wording, this section explains what each lock protects, and then the following sections explain the algorithms in the context of those definitions. Now, this section contains a sketch of the algorithm, and then the following sections lay it out again in more detail. The question of what each lock protects has been lost. Here's an attempt at some text to replace what you have here: === Concurrency control for hash indexes is provided using buffer content locks, buffer pins, and cleanup locks. Here as elsewhere in PostgreSQL, cleanup lock means that we hold an exclusive lock on the buffer and have observed at some point after acquiring the lock that we hold the only pin on that buffer. For hash indexes, a cleanup lock on a primary bucket page represents the right to perform an arbitrary reorganization of the entire bucket, while a cleanup lock on an overflow page represents the right to perform a reorganization of just that page. Therefore, scans retain a pin on both the primary bucket page and the overflow page they are currently scanning, if any. Splitting a bucket requires a cleanup lock on both the old and new primary bucket pages. VACUUM therefore takes a cleanup lock on every bucket page in turn in order to remove tuples. It can also remove tuples copied to a new bucket by any previous split operation, because the cleanup lock taken on the primary bucket page guarantees that no scans which started prior to the most recent split can still be in progress. After cleaning each page individually, it attempts to take a cleanup lock on the primary bucket page in order to "squeeze" the bucket down to the minimum possible number of pages. 
=== As I was looking at the old text regarding deadlock risk, I realized what may be a serious problem. Suppose process A is performing a scan of some hash index. While the scan is suspended, it attempts to take a lock and is blocked by process B. Process B, meanwhile, is running VACUUM on that hash index. Eventually, it will do LockBufferForCleanup() on the hash bucket on which process A holds a buffer pin, resulting in an undetected deadlock. In the current coding, A would hold a heavyweight lock and B would attempt to acquire a conflicting heavyweight lock, and the deadlock detector would kill one of them. This patch probably breaks that. I notice that that's the only place where we attempt to acquire a buffer cleanup lock unconditionally; every place else, we acquire the lock conditionally, so there's no deadlock risk. Once we resolve this problem, the paragraph about deadlock risk in this section should be revised to explain whatever solution we come up with. By the
Re: [HACKERS] Hash Indexes
On Wed, Sep 28, 2016 at 3:06 PM, Andres Freund wrote: > On 2016-09-28 15:04:30 -0400, Robert Haas wrote: >> Andres already >> stated that he thinks working on btree-over-hash would be more >> beneficial than fixing hash, but at this point it seems like he's the >> only one who takes that position. > > Note that I did *NOT* take that position. I was saying that I think we > should evaluate whether that's not a better approach, doing some simple > performance comparisons. OK, sorry. I evidently misunderstood your position, for which I apologize. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Hash Indexes
On 2016-09-28 15:04:30 -0400, Robert Haas wrote: > Andres already > stated that he thinks working on btree-over-hash would be more > beneficial than fixing hash, but at this point it seems like he's the > only one who takes that position. Note that I did *NOT* take that position. I was saying that I think we should evaluate whether that's not a better approach, doing some simple performance comparisons. Greetings, Andres Freund
Re: [HACKERS] Hash Indexes
On Tue, Sep 27, 2016 at 3:06 PM, Jesper Pedersen wrote: > I have been running various tests, and applications with this patch together > with the WAL v5 patch [1]. > > As I haven't seen any failures and don't currently have additional feedback > I'm moving this patch to "Ready for Committer" for their feedback. Cool! Thanks for reviewing. Amit, can you please split the buffer manager changes in this patch into a separate patch? I think those changes can be committed first and then we can try to deal with the rest of it. Instead of adding ConditionalLockBufferShared, I think we should add an "int mode" argument to the existing ConditionalLockBuffer() function. That way is more consistent with LockBuffer(). It means an API break for any third-party code that's calling this function, but that doesn't seem like a big problem. There are only 10 callers of ConditionalLockBuffer() in our source tree and only one of those is in contrib, so probably there isn't much third-party code that will be affected by this, and I think it's worth it for the long-term cleanliness. As for CheckBufferForCleanup, I think that looks OK, but: (1) please add an Assert() that we hold an exclusive lock on the buffer, using LWLockHeldByMeInMode; and (2) I think we should rename it to something like IsBufferCleanupOK. Then, when it's used, it reads like English: if (IsBufferCleanupOK(buf)) { /* clean up the buffer */ }. I'll write another email with my thoughts about the rest of the patch. For the record, Amit and I have had extensive discussions about this effort off-list, and as Amit noted in his original post, the design is based on suggestions which I previously posted to the list suggesting how the issues with hash indexes might get fixed. Therefore, I don't expect to have too many basic disagreements regarding the design of the patch; if anyone else does, please speak up. 
Andres already stated that he thinks working on btree-over-hash would be more beneficial than fixing hash, but at this point it seems like he's the only one who takes that position. Even if we accept that working on the hash AM is a reasonable thing to do, it doesn't follow that the design Amit has adopted here is ideal. I think it's reasonably good, but that's only to be expected considering that I drafted the original version of it and have been involved in subsequent discussions; someone else might dislike something that I thought was OK, and any such opinions certainly deserve a fair hearing. To be clear, it's been a long time since I've looked at any of the actual code in this patch and I have at no point studied it deeply, so I expect that I may find a fair number of things that I'm not happy with in detail, and I'll write those up along with any design-level concerns that I do have. This should in no way forestall review from anyone else who wants to get involved. Thanks, -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Hash Indexes
On 09/20/2016 09:02 AM, Amit Kapila wrote: On Fri, Sep 16, 2016 at 11:22 AM, Amit Kapila wrote: I do want to work on it, but it is always possible that due to some other work this might get delayed. Also, I think there is always a chance that while doing that work, we face some problem due to which we might not be able to use that optimization. So I will go with your suggestion of removing hashscan.c and its usage for now, and then if required we will pull it back. If nobody else thinks otherwise, I will update this in the next patch version. In the attached patch, I have removed the support for hashscans. I think it might improve performance by a few percentage points (especially for single row fetch transactions), as we no longer have the registration and destruction of hashscans. I have been running various tests, and applications with this patch together with the WAL v5 patch [1]. As I haven't seen any failures and don't currently have additional feedback, I'm moving this patch to "Ready for Committer" for their feedback. If others have comments, move the patch status back in the CommitFest application, please. [1] https://www.postgresql.org/message-id/CAA4eK1KE%3D%2BkkowyYD0vmch%3Dph4ND3H1tViAB%2B0cWTHqjZDDfqg%40mail.gmail.com Best regards, Jesper
Re: [HACKERS] Hash Indexes
On 25/09/16 18:18, Amit Kapila wrote: On Sat, Sep 24, 2016 at 10:49 PM, Greg Stark wrote: On Thu, Sep 22, 2016 at 3:23 AM, Tom Lane wrote: But to kick the hash AM as such to the curb is to say "sorry, there will never be O(1) index lookups in Postgres". Well there's plenty of halfway solutions for that. We could move hash indexes to contrib or even have them in core as experimental_hash or unlogged_hash until the day they achieve their potential. We definitely shouldn't discourage people from working on hash indexes Okay, but to me it appears that naming it as experimental_hash or moving it to contrib could discourage people, or at the very least people will be less motivated. Thinking on those lines a year or so back would have been a wise direction, but now, when there is already a lot of work done (patches to make it wal-enabled, more concurrent and performant, and a page inspect module are available) for hash indexes and still more is in progress, that sounds like a step backward rather than a step forward. +1 I think so too - I've seen many email threads over the years on this list that essentially state "we need hash indexes wal logged to make progress with them"...and Amit et al has/have done this (more than this obviously - made 'em better too) and I'm astonished that folk are suggesting anything other than 'commit this great patch now!'... regards Mark
Re: [HACKERS] Hash Indexes
On Sat, Sep 24, 2016 at 10:49 PM, Greg Stark wrote: > On Thu, Sep 22, 2016 at 3:23 AM, Tom Lane wrote: >> But to kick the hash AM as such to the curb is to say >> "sorry, there will never be O(1) index lookups in Postgres". > > Well there's plenty of halfway solutions for that. We could move hash > indexes to contrib or even have them in core as experimental_hash or > unlogged_hash until the day they achieve their potential. > > We definitely shouldn't discourage people from working on hash indexes > Okay, but to me it appears that naming it as experimental_hash or moving it to contrib could discourage people, or at the very least people will be less motivated. Thinking on those lines a year or so back would have been a wise direction, but now, when there is already a lot of work done (patches to make it wal-enabled, more concurrent and performant, and a page inspect module are available) for hash indexes and still more is in progress, that sounds like a step backward rather than a step forward. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Hash Indexes
Greg Stark writes: > On Thu, Sep 22, 2016 at 3:23 AM, Tom Lane wrote: >> But to kick the hash AM as such to the curb is to say >> "sorry, there will never be O(1) index lookups in Postgres". > Well there's plenty of halfway solutions for that. We could move hash > indexes to contrib or even have them in core as experimental_hash or > unlogged_hash until the day they achieve their potential. > We definitely shouldn't discourage people from working on hash indexes > but we probably shouldn't have released ten years' worth of a feature > marked "please don't use this" that's guaranteed to corrupt your > database and cause weird problems if you use it in any of a number of > supported situations (including non-replicated system recovery that > has been a bedrock feature of Postgres for over a decade). Obviously that has not been a good situation, but we lack a time machine to retroactively make it better, so I don't see much point in fretting over what should have been done in the past. > Arguably adding a hashed btree opclass and relegating the existing > code to an experimental state would actually encourage development > since a) Users would actually be likely to use the hashed btree > opclass so any work on a real hash opclass would have a real userbase > ready and waiting for delivery, b) delivering a real hash opclass > wouldn't involve convincing users to unlearn a million instructions > warning not to use this feature and c) The fear of breaking existing > users' use cases and databases would be less and pg_upgrade would be an > ignorable problem at least until the day comes for the big cutover of > the default to the new opclass. I'm not following your point here. There is no hash-over-btree AM and nobody (including Andres) has volunteered to create one. Meanwhile, we have a patch in hand to WAL-enable the hash AM. Why would we do anything other than apply that patch and stop saying hash is deprecated? 
regards, tom lane