On Thu, Sep 29, 2016 at 6:04 AM, Robert Haas <robertmh...@gmail.com> wrote:
> On Wed, Sep 28, 2016 at 3:04 PM, Robert Haas <robertmh...@gmail.com> wrote:
>> I'll write another email with my thoughts about the rest of the patch.
>
> I think that the README changes for this patch need a fairly large
> amount of additional work.  Here are a few things I notice:
>
> - The confusion between buckets and pages hasn't been completely
> cleared up.  If you read the beginning of the README, the terminology
> is clearly set forth.  It says:
>
>>> A hash index consists of two or more "buckets", into which tuples are 
>>> placed whenever their hash key maps to the bucket number.  Each bucket in 
>>> the hash index comprises one or more index pages.  The bucket's first page 
>>> is permanently assigned to it when the bucket is created. Additional pages, 
>>> called "overflow pages", are added if the bucket receives too many tuples 
>>> to fit in the primary bucket page."
>
> But later on, you say:
>
>>> Scan will take a lock in shared mode on the primary bucket or on one of the 
>>> overflow page.
>
> So the correct terminology here would be "primary bucket page" not
> "primary bucket".
>
> - In addition, notice that there are two English errors in this
> sentence: the word "the" needs to be added to the beginning of the
> sentence, and the last word needs to be "pages" rather than "page".
> There are a considerable number of similar minor errors; if you can't
> fix them, I'll make a pass over it and clean it up.
>
> - The whole "lock definitions" section seems to me to be pretty loose
> and imprecise about what is happening.  For example, it uses the term
> "split-in-progress" without first defining it.  The sentence quoted
> above says that scans take a lock in shared mode either on the primary
> page or on one of the overflow pages, but it's not to document code by
> saying that it will do either A or B without explaining which one!  In
> fact, I think that a scan will take a content lock first on the
> primary bucket page and then on each overflow page in sequence,
> retaining a pin on the primary buffer page throughout the scan.  So it
> is not one or the other but both in a particular sequence, and that
> can and should be explained.
>
> Another problem with this section is that even when it's precise about
> what is going on, it's probably duplicating what is or should be in
> the following sections where the algorithms for each operation are
> explained.  In the original wording, this section explains what each
> lock protects, and then the following sections explain the algorithms
> in the context of those definitions.  Now, this section contains a
> sketch of the algorithm, and then the following sections lay it out
> again in more detail.  The question of what each lock protects has
> been lost.  Here's an attempt at some text to replace what you have
> here:
>
> ===
> Concurrency control for hash indexes is provided using buffer content
> locks, buffer pins, and cleanup locks.   Here as elsewhere in
> PostgreSQL, cleanup lock means that we hold an exclusive lock on the
> buffer and have observed at some point after acquiring the lock that
> we hold the only pin on that buffer.  For hash indexes, a cleanup lock
> on a primary bucket page represents the right to perform an arbitrary
> reorganization of the entire bucket, while a cleanup lock on an
> overflow page represents the right to perform a reorganization of just
> that page.  Therefore, scans retain a pin on both the primary bucket
> page and the overflow page they are currently scanning, if any.
>

I don't think we take cleanup lock on overflow page, so I will edit that part.

> Splitting a bucket requires a cleanup lock on both the old and new
> primary bucket pages.  VACUUM therefore takes a cleanup lock on every
> bucket page in turn order to remove tuples.  It can also remove tuples
> copied to a new bucket by any previous split operation, because the
> cleanup lock taken on the primary bucket page guarantees that no scans
> which started prior to the most recent split can still be in progress.
> After cleaning each page individually, it attempts to take a cleanup
> lock on the primary bucket page in order to "squeeze" the bucket down
> to the minimum possible number of pages.
> ===
>
> As I was looking at the old text regarding deadlock risk, I realized
> what may be a serious problem.  Suppose process A is performing a scan
> of some hash index.  While the scan is suspended, it attempts to take
> a lock and is blocked by process B.  Process B, meanwhile, is running
> VACUUM on that hash index.  Eventually, it will do
> LockBufferForCleanup() on the hash bucket on which process A holds a
> buffer pin, resulting in an undetected deadlock. In the current
> coding, A would hold a heavyweight lock and B would attempt to acquire
> a conflicting heavyweight lock, and the deadlock detector would kill
> one of them.  This patch probably breaks that.  I notice that that's
> the only place where we attempt to acquire a buffer cleanup lock
> unconditionally; every place else, we acquire the lock conditionally,
> so there's no deadlock risk.  Once we resolve this problem, the
> paragraph about deadlock risk in this section should be revised to
> explain whatever solution we come up with.
>
> By the way, since VACUUM must run in its own transaction, B can't be
> holding arbitrary locks, but that doesn't seem quite sufficient to get
> us out of the woods.  It will at least hold ShareUpdateExclusiveLock
> on the relation being vacuumed, and process A could attempt to acquire
> that same lock.
>

Right, I think there is a danger of deadlock in above situation.
Needs some more thoughts.

> Also in regards to deadlock, I notice that you added a paragraph
> saying that we lock higher-numbered buckets before lower-numbered
> buckets.  That's fair enough, but what about the metapage?  The reader
> algorithm suggests that the metapage must lock must be taken after the
> bucket locks, because it tries to grab the bucket lock conditionally
> after acquiring the metapage lock, but that's not documented here.
>

That is for efficiency.  This patch haven't changed anything in
metapage locking which can directly impact deadlock.

> The reader algorithm itself seems to be a bit oddly explained.
>
>       pin meta page and take buffer content lock in shared mode
> +    compute bucket number for target hash key
> +    read and pin the primary bucket page
>
> So far, I'm with you.
>
> +    conditionally get the buffer content lock in shared mode on
> primary bucket page for search
> +    if we didn't get the lock (need to wait for lock)
>
> "didn't get the lock" and "wait for the lock" are saying the same
> thing, so this is redundant, and the statement that it is "for search"
> on the previous line is redundant with the introductory text
> describing this as the reader algorithm.
>
> +        release the buffer content lock on meta page
> +        acquire buffer content lock on primary bucket page in shared mode
> +        acquire the buffer content lock in shared mode on meta page
>
> OK...
>
> +        to check for possibility of split, we need to recompute the bucket 
> and
> +        verify, if it is a correct bucket; set the retry flag
>
> This makes it sound like we set the retry flag whether it was the
> correct bucket or not, which isn't sensible.
>
> +    else if we get the lock, then we can skip the retry path
>
> This line is totally redundant.  If we don't set the retry flag, then
> of course we can skip the part guarded by if (retry).
>

Will change as per suggestions.

> +    if (retry)
> +        loop:
> +            compute bucket number for target hash key
> +            release meta page buffer content lock
> +            if (correct bucket page is already locked)
> +                break
> +            release any existing content lock on bucket page (if a
> concurrent split happened)
> +            pin primary bucket page and take shared buffer content lock
> +            retake meta page buffer content lock in shared mode
>
> This is the part I *really* don't understand.  It makes sense to me
> that we need to loop until we get the correct bucket locked with no
> concurrent splits, but why is this retry loop separate from the
> previous bit of code that set the retry flag.  In other words, why is
> not something like this?
>
> pin the meta page and take shared content lock on it
> compute bucket number for target hash key
> if (we can't get a shared content lock on the target bucket without blocking)
>     loop:
>         release meta page content lock
>         take a shared content lock on the target primary bucket page
>         take a shared content lock on the metapage
>         if (previously-computed target bucket has not been split)
>             break;
>

I think we can write it the way you are suggesting, but I don't want
to change much in the existing for loop in code, which uses
_hash_getbuf() to acquire the pin and lock together.

> Another thing I don't quite understand about this algorithm is that in
> order to conditionally lock the target primary bucket page, we'd first
> need to read and pin it.  And that doesn't seem like a good thing to
> do while we're holding a shared content lock on the metapage, because
> of the principle that we don't want to hold content locks across I/O.
>

I think we can release metapage content lock before reading the buffer.

>  -- then, per read request:
>     release pin on metapage
> -    read current page of bucket and take shared buffer content lock
> -        step to next page if necessary (no chaining of locks)
> +    if the split is in progress for current bucket and this is a new bucket
> +        release the buffer content lock on current bucket page
> +        pin and acquire the buffer content lock on old bucket in shared mode
> +        release the buffer content lock on old bucket, but not pin
> +        retake the buffer content lock on new bucket
> +        mark the scan such that it skips the tuples that are marked
> as moved by split
>
> Aren't these steps done just once per scan?  If so, I think they
> should appear before "-- then, per read request" which AIUI is
> intended to imply a loop over tuples.
>

As per code, there is no such intention (loop over tuples).  It is
about reading the page and getting the tuple.

> +    step to next page if necessary (no chaining of locks)
> +        if the scan indicates moved by split, then move to old bucket
> after the scan
> +        of current bucket is finished
>      get tuple
>      release buffer content lock and pin on current page
>  -- at scan shutdown:
> -    release bucket share-lock
>
> Don't we have a pin to release at scan shutdown in the new system?
>

Yes, it is mentioned in line below:

+ release any pin we hold on current buffer, old bucket buffer, new
bucket buffer
+


> Well, I was hoping to get through the whole patch in one email, but
> I'm not even all the way through the README.  However, it's late, so
> I'm stopping here for now.
>

Thanks for the review!



-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to