On Thu, Sep 29, 2016 at 6:04 AM, Robert Haas <robertmh...@gmail.com> wrote:
> On Wed, Sep 28, 2016 at 3:04 PM, Robert Haas <robertmh...@gmail.com> wrote:
>> I'll write another email with my thoughts about the rest of the patch.
>
> I think that the README changes for this patch need a fairly large
> amount of additional work.  Here are a few things I notice:
>
> - The confusion between buckets and pages hasn't been completely
> cleared up.  If you read the beginning of the README, the terminology
> is clearly set forth.  It says:
>
>>> A hash index consists of two or more "buckets", into which tuples are
>>> placed whenever their hash key maps to the bucket number.  Each bucket in
>>> the hash index comprises one or more index pages.  The bucket's first page
>>> is permanently assigned to it when the bucket is created.  Additional pages,
>>> called "overflow pages", are added if the bucket receives too many tuples
>>> to fit in the primary bucket page.
>
> But later on, you say:
>
>>> Scan will take a lock in shared mode on the primary bucket or on one of the
>>> overflow page.
>
> So the correct terminology here would be "primary bucket page", not
> "primary bucket".
>
> - In addition, notice that there are two English errors in this
> sentence: the word "the" needs to be added to the beginning of the
> sentence, and the last word needs to be "pages" rather than "page".
> There are a considerable number of similar minor errors; if you can't
> fix them, I'll make a pass over it and clean it up.
>
> - The whole "lock definitions" section seems to me to be pretty loose
> and imprecise about what is happening.  For example, it uses the term
> "split-in-progress" without first defining it.  The sentence quoted
> above says that scans take a lock in shared mode either on the primary
> bucket page or on one of the overflow pages, but it's no good to
> document code by saying that it will do either A or B without
> explaining which one!
> In fact, I think that a scan will take a content lock first on the
> primary bucket page and then on each overflow page in sequence,
> retaining a pin on the primary bucket page throughout the scan.  So it
> is not one or the other but both in a particular sequence, and that
> can and should be explained.
>
> Another problem with this section is that even when it's precise about
> what is going on, it's probably duplicating what is or should be in
> the following sections where the algorithms for each operation are
> explained.  In the original wording, this section explains what each
> lock protects, and then the following sections explain the algorithms
> in the context of those definitions.  Now, this section contains a
> sketch of the algorithm, and then the following sections lay it out
> again in more detail.  The question of what each lock protects has
> been lost.  Here's an attempt at some text to replace what you have
> here:
>
> ===
> Concurrency control for hash indexes is provided using buffer content
> locks, buffer pins, and cleanup locks.  Here as elsewhere in
> PostgreSQL, cleanup lock means that we hold an exclusive lock on the
> buffer and have observed at some point after acquiring the lock that
> we hold the only pin on that buffer.  For hash indexes, a cleanup lock
> on a primary bucket page represents the right to perform an arbitrary
> reorganization of the entire bucket, while a cleanup lock on an
> overflow page represents the right to perform a reorganization of just
> that page.  Therefore, scans retain a pin on both the primary bucket
> page and the overflow page they are currently scanning, if any.
>
I don't think we take a cleanup lock on the overflow page, so I will
edit that part.

> Splitting a bucket requires a cleanup lock on both the old and new
> primary bucket pages.  VACUUM therefore takes a cleanup lock on every
> bucket page in turn in order to remove tuples.  It can also remove
> tuples copied to a new bucket by any previous split operation, because
> the cleanup lock taken on the primary bucket page guarantees that no
> scans which started prior to the most recent split can still be in
> progress.  After cleaning each page individually, it attempts to take
> a cleanup lock on the primary bucket page in order to "squeeze" the
> bucket down to the minimum possible number of pages.
> ===
>
> As I was looking at the old text regarding deadlock risk, I realized
> what may be a serious problem.  Suppose process A is performing a scan
> of some hash index.  While the scan is suspended, it attempts to take
> a lock and is blocked by process B.  Process B, meanwhile, is running
> VACUUM on that hash index.  Eventually, it will do
> LockBufferForCleanup() on the hash bucket on which process A holds a
> buffer pin, resulting in an undetected deadlock.  In the current
> coding, A would hold a heavyweight lock and B would attempt to acquire
> a conflicting heavyweight lock, and the deadlock detector would kill
> one of them.  This patch probably breaks that.  I notice that that's
> the only place where we attempt to acquire a buffer cleanup lock
> unconditionally; everyplace else, we acquire the lock conditionally,
> so there's no deadlock risk.  Once we resolve this problem, the
> paragraph about deadlock risk in this section should be revised to
> explain whatever solution we come up with.
>
> By the way, since VACUUM must run in its own transaction, B can't be
> holding arbitrary locks, but that doesn't seem quite sufficient to get
> us out of the woods.
> It will at least hold ShareUpdateExclusiveLock
> on the relation being vacuumed, and process A could attempt to acquire
> that same lock.
>

Right, I think there is a danger of deadlock in the above situation.
Needs some more thought.

> Also in regards to deadlock, I notice that you added a paragraph
> saying that we lock higher-numbered buckets before lower-numbered
> buckets.  That's fair enough, but what about the metapage?  The reader
> algorithm suggests that the metapage lock must be taken after the
> bucket locks, because it tries to grab the bucket lock conditionally
> after acquiring the metapage lock, but that's not documented here.
>

That is for efficiency.  This patch hasn't changed anything in metapage
locking that can directly impact deadlocks.

> The reader algorithm itself seems to be a bit oddly explained.
>
>     pin meta page and take buffer content lock in shared mode
> +    compute bucket number for target hash key
> +    read and pin the primary bucket page
>
> So far, I'm with you.
>
> +    conditionally get the buffer content lock in shared mode on
>      primary bucket page for search
> +    if we didn't get the lock (need to wait for lock)
>
> "didn't get the lock" and "wait for the lock" are saying the same
> thing, so this is redundant, and the statement that it is "for search"
> on the previous line is redundant with the introductory text
> describing this as the reader algorithm.
>
> +        release the buffer content lock on meta page
> +        acquire buffer content lock on primary bucket page in shared mode
> +        acquire the buffer content lock in shared mode on meta page
>
> OK...
>
> +        to check for possibility of split, we need to recompute the bucket and
> +        verify, if it is a correct bucket; set the retry flag
>
> This makes it sound like we set the retry flag whether it was the
> correct bucket or not, which isn't sensible.
>
> +    else if we get the lock, then we can skip the retry path
>
> This line is totally redundant.
> If we don't set the retry flag, then
> of course we can skip the part guarded by if (retry).
>

Will change as per suggestions.

> +    if (retry)
> +        loop:
> +            compute bucket number for target hash key
> +            release meta page buffer content lock
> +            if (correct bucket page is already locked)
> +                break
> +            release any existing content lock on bucket page (if a
>              concurrent split happened)
> +            pin primary bucket page and take shared buffer content lock
> +            retake meta page buffer content lock in shared mode
>
> This is the part I *really* don't understand.  It makes sense to me
> that we need to loop until we get the correct bucket locked with no
> concurrent splits, but why is this retry loop separate from the
> previous bit of code that set the retry flag?  In other words, why is
> it not something like this?
>
>     pin the meta page and take shared content lock on it
>     compute bucket number for target hash key
>     if (we can't get a shared content lock on the target bucket without blocking)
>         loop:
>             release meta page content lock
>             take a shared content lock on the target primary bucket page
>             take a shared content lock on the metapage
>             if (previously-computed target bucket has not been split)
>                 break;
>

I think we can write it the way you are suggesting, but I don't want to
change much in the existing for loop in the code, which uses
_hash_getbuf() to acquire the pin and lock together.

> Another thing I don't quite understand about this algorithm is that in
> order to conditionally lock the target primary bucket page, we'd first
> need to read and pin it.  And that doesn't seem like a good thing to
> do while we're holding a shared content lock on the metapage, because
> of the principle that we don't want to hold content locks across I/O.
>

I think we can release the metapage content lock before reading the
buffer.
> -- then, per read request:
>     release pin on metapage
> -    read current page of bucket and take shared buffer content lock
> -        step to next page if necessary (no chaining of locks)
> +    if the split is in progress for current bucket and this is a new bucket
> +        release the buffer content lock on current bucket page
> +        pin and acquire the buffer content lock on old bucket in shared mode
> +        release the buffer content lock on old bucket, but not pin
> +        retake the buffer content lock on new bucket
> +        mark the scan such that it skips the tuples that are marked
>          as moved by split
>
> Aren't these steps done just once per scan?  If so, I think they
> should appear before "-- then, per read request" which AIUI is
> intended to imply a loop over tuples.
>

As per the code, there is no such intention (loop over tuples).  It is
about reading the page and getting the tuple.

> +    step to next page if necessary (no chaining of locks)
> +    if the scan indicates moved by split, then move to old bucket
> +        after the scan of current bucket is finished
>     get tuple
>     release buffer content lock and pin on current page
> -- at scan shutdown:
> -    release bucket share-lock
>
> Don't we have a pin to release at scan shutdown in the new system?
>

Yes, it is mentioned in the line below:

+    release any pin we hold on current buffer, old bucket buffer, new
     bucket buffer

> Well, I was hoping to get through the whole patch in one email, but
> I'm not even all the way through the README.  However, it's late, so
> I'm stopping here for now.
>

Thanks for the review!

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers