subject:"Re\: \[HACKERS\] crash\-safe visibility map, take five"

Re: [HACKERS] crash-safe visibility map, take five

2011-06-24 Thread Robert Haas

On Fri, Jun 24, 2011 at 1:43 PM, Jeff Davis  wrote:
>> And anything that is WAL-logged must obey the WAL-before-data rule.
>> We have a system already that ensures that when
>> synchronous_commit=off, CLOG pages can't be flushed before the
>> corresponding WAL record makes it to disk.
>
> In this case, how do you prevent the PD_ALL_VISIBLE from making it to
> disk if you never bumped the LSN when it was set? It seems like you just
> don't have the information to do so, and it seems like the information
> required would be variable in size.

Well, I think that would be a problem for the hypothetical implementer
of the persistent snapshot feature.  :-)

More seriously, Heikki and I previously discussed creating some
systematic method for suppressing FPIs when they are not needed,
perhaps by using a bit in the page header to indicate whether an FPI
has been generated since the last checkpoint.  I think it would be
nice to have such a system, but since we don't have a clear agreement
either that it's a good idea or what we'd do after that, I'm not
inclined to invest time in it.  To really get any benefit out of a
change in that area, we'd need probably need to (a) remove the LSN
interlocks that prevent changes from being replayed if the LSN of the
page has already advanced beyond the record LSN and (b) change at
least some of XLOG_HEAP_{INSERT,UPDATE,DELETE} to be idempotent.  But
if we went in that direction then that might help to regularize some
of this and make it a bit less ad-hoc.

> I didn't mean to make this conversation quite so hypothetical. My
> primary points are:
>
> 1. Sometimes it makes sense to break the typical WAL conventions for
> performance reasons. But when we do so, we have to be quite careful,
> because things get complicated quickly.

Yes.

> 2. PD_ALL_VISIBLE is a little bit more complex than other hint bits,
> because the conditions under which it may be set are more complex
> (having to do with both snapshots and cleanup actions). Other hint bits
> are based only on transaction status: either the WAL for that
> transaction completion got flushed (and is therefore permanent), and we
> set the hint bit; or it didn't get flushed and we don't.

I think the term "hint bits" really shouldn't be applied to anything
other than HEAP_{XMIN,XMAX}_{COMMITTED,INVALID}.  Otherwise, we get
into confusing territory pretty quickly.  Our algorithm for
opportunistically killing index entries pointing to dead tuples is not
WAL-logged, but it involves more than a single bit.  OTOH, clearing of
the PD_ALL_VISIBLE bit has always been WAL-logged, so lumping that in
with HEAP_XMIN_COMMITTED is pretty misleading.

> Just having this discussion has been good enough for me to get a better
> idea what's going on, so if you think the comments are sufficient that's
> OK with me.

I'm not 100% certain they are, but let's wait and see if anyone else
wants to weigh in...  please do understand I'm not trying to be a pain
in the neck.  :-)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] crash-safe visibility map, take five

2011-06-24 Thread Jeff Davis

On Thu, 2011-06-23 at 22:02 -0400, Robert Haas wrote:
> > 1. INSERT to a new page, marking it with LSN X
> > 2. WAL flushed to LSN Y (Y > X)
> > 2. Some persistent snapshot (that doesn't see the INSERT) is released,
> > and generates WAL recording that fact with LSN Z (Z > Y)
> > 3. Lazy VACUUM marks the newly all-visible page with PD_ALL_VISIBLE
> > 4. page is written out because LSN is still X
> > 5. crash

> I don't really think that's a separate set of assumptions - if we had
> some system whereby snapshots could survive a crash, then they'd have
> to be WAL-logged (because that's how we make things survive crashes).

In the above example, it is WAL-logged (with LSN Z).

> And anything that is WAL-logged must obey the WAL-before-data rule.
> We have a system already that ensures that when
> synchronous_commit=off, CLOG pages can't be flushed before the
> corresponding WAL record makes it to disk.

In this case, how do you prevent the PD_ALL_VISIBLE from making it to
disk if you never bumped the LSN when it was set? It seems like you just
don't have the information to do so, and it seems like the information
required would be variable in size.

> I guess the point you are driving at here is that a page can only go
> from being all-visible to not-all-visible by virtue of being modified.
>  There's no other piece of state (like a persistent snapshot) that can
> be lost as part of a crash that would make us need change our mind and
> decide that an all-visible XID is really not all-visible after all.
> (The reverse is not true: since snapshots are ephemeral, a crash will
> render every row either all-visible or dead.)  I guess I never thought
> about documenting that particular aspect of it because (to me) it
> seems fairly self-evident.  Maybe I'm wrong...

I didn't mean to make this conversation quite so hypothetical. My
primary points are:

1. Sometimes it makes sense to break the typical WAL conventions for
performance reasons. But when we do so, we have to be quite careful,
because things get complicated quickly.

2. PD_ALL_VISIBLE is a little bit more complex than other hint bits,
because the conditions under which it may be set are more complex
(having to do with both snapshots and cleanup actions). Other hint bits
are based only on transaction status: either the WAL for that
transaction completion got flushed (and is therefore permanent), and we
set the hint bit; or it didn't get flushed and we don't.

Just having this discussion has been good enough for me to get a better
idea what's going on, so if you think the comments are sufficient that's
OK with me.

Regards,
Jeff Davis

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] crash-safe visibility map, take five

2011-06-23 Thread Robert Haas

On Thu, Jun 23, 2011 at 6:40 PM, Jeff Davis  wrote:
> On Thu, 2011-06-23 at 18:18 -0400, Robert Haas wrote:
>> Lazy VACUUM is the only thing that makes a page all visible.  I don't
>> understand the part about snapshots.
>
> Lazy VACUUM is the only thing that _marks_ a page with PD_ALL_VISIBLE.
>
> After an INSERT to a new page, and after all snapshots are released, the
> page becomes all-visible; and thus subject to being marked with
> PD_ALL_VISIBLE by lazy vacuum without bumping the LSN. Note that there
> is no cleanup action that takes place here, so nothing else will bump
> the LSN either.
>
> So, let's say that we hypothetically had persistent snapshots, then
> you'd have the following problem:
>
> 1. INSERT to a new page, marking it with LSN X
> 2. WAL flushed to LSN Y (Y > X)
> 2. Some persistent snapshot (that doesn't see the INSERT) is released,
> and generates WAL recording that fact with LSN Z (Z > Y)
> 3. Lazy VACUUM marks the newly all-visible page with PD_ALL_VISIBLE
> 4. page is written out because LSN is still X
> 5. crash
>
> Now, the persistent snapshot is still present because LSN Z never made
> it to disk; but the page is marked with PD_ALL_VISIBLE.
>
> Sure, if these hypothetical persistent snapshots were transactional, and
> if synchronous_commit is on, then LSN Z would be flushed before step 3;
> but that's another set of assumptions. That's why I left it simple and
> said that the assumption was "snapshots are released if there's a
> crash".

I don't really think that's a separate set of assumptions - if we had
some system whereby snapshots could survive a crash, then they'd have
to be WAL-logged (because that's how we make things survive crashes).
And anything that is WAL-logged must obey the WAL-before-data rule.
We have a system already that ensures that when
synchronous_commit=off, CLOG pages can't be flushed before the
corresponding WAL record makes it to disk.  For a system like what
you're describing, you'd need something similar - these
crash-surviving snapshots would have to make sure that no action which
depended on their state hit the disk before the WAL record marking the
state change hit the disk.

I guess the point you are driving at here is that a page can only go
from being all-visible to not-all-visible by virtue of being modified.
 There's no other piece of state (like a persistent snapshot) that can
be lost as part of a crash that would make us need change our mind and
decide that an all-visible XID is really not all-visible after all.
(The reverse is not true: since snapshots are ephemeral, a crash will
render every row either all-visible or dead.)  I guess I never thought
about documenting that particular aspect of it because (to me) it
seems fairly self-evident.  Maybe I'm wrong...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] crash-safe visibility map, take five

2011-06-23 Thread Jeff Davis

On Thu, 2011-06-23 at 18:18 -0400, Robert Haas wrote:
> Lazy VACUUM is the only thing that makes a page all visible.  I don't
> understand the part about snapshots.

Lazy VACUUM is the only thing that _marks_ a page with PD_ALL_VISIBLE. 

After an INSERT to a new page, and after all snapshots are released, the
page becomes all-visible; and thus subject to being marked with
PD_ALL_VISIBLE by lazy vacuum without bumping the LSN. Note that there
is no cleanup action that takes place here, so nothing else will bump
the LSN either.

So, let's say that we hypothetically had persistent snapshots, then
you'd have the following problem:

1. INSERT to a new page, marking it with LSN X
2. WAL flushed to LSN Y (Y > X)
2. Some persistent snapshot (that doesn't see the INSERT) is released,
and generates WAL recording that fact with LSN Z (Z > Y)
3. Lazy VACUUM marks the newly all-visible page with PD_ALL_VISIBLE
4. page is written out because LSN is still X
5. crash

Now, the persistent snapshot is still present because LSN Z never made
it to disk; but the page is marked with PD_ALL_VISIBLE.

Sure, if these hypothetical persistent snapshots were transactional, and
if synchronous_commit is on, then LSN Z would be flushed before step 3;
but that's another set of assumptions. That's why I left it simple and
said that the assumption was "snapshots are released if there's a
crash".

Regards,
Jeff Davis

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] crash-safe visibility map, take five

2011-06-23 Thread Robert Haas

On Wed, Jun 22, 2011 at 10:40 PM, Jeff Davis  wrote:
> 1. Torn pages -- not a problem because it's a single bit and idempotent.

Right.

> 2. PD_ALL_VISIBLE bit makes it to disk before a WAL record representing
> an action that makes the page all-visible. Depending on what action
> makes a page all-visible:
>  a. An old snapshot is released -- not a problem, because if there is a
> crash all snapshots are released.
>  b. Cleanup action on the page -- not a problem, because that will
> create a WAL record and update the page's LSN before setting the
> PD_ALL_VISIBLE.

Lazy VACUUM is the only thing that makes a page all visible.  I don't
understand the part about snapshots.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] crash-safe visibility map, take five

2011-06-22 Thread Jeff Davis

On Wed, 2011-06-22 at 21:53 -0400, Robert Haas wrote:
> On Wed, Jun 22, 2011 at 8:53 PM, Jeff Davis  wrote:
> > On Thu, 2011-06-16 at 23:17 -0400, Noah Misch wrote:
> >> 2. In the words of a comment added by the patch:
> >>  * The critical integrity requirement here is that we must never end up 
> >> with
> >>  * a situation where the visibility map bit is set, and the page-level
> >>  * PD_ALL_VISIBLE bit is clear.  If that were to occur, then a subsequent
> >>  * page modification would fail to clear the visibility map bit.
> >> It does this by WAL-logging the operation of setting a vm bit.  This also 
> >> has
> >> the benefit of getting vm bits set correctly on standbys.
> >
> > In the same function, there is also the comment:
> >
> > "We don't bump the LSN of the heap page when setting the visibility
> > map bit, because that would generate an unworkable volume of
> > full-page writes.  This exposes us to torn page hazards, but since
> > we're not inspecting the existing page contents in any way, we
> > don't care."
> >
> > It would be nice to have a comment explaining why that is safe with
> > respect to the WAL-before-data rule. Obviously torn pages aren't much of
> > a problem, because it's a single bit and completely idempotent.
> 
> That's exactly what I was trying to explain in the comment you cite.
> Feel free to propose a specific change...

Well, I was a little unsure, but here's my attempt:

The potential problems are:
1. Torn pages -- not a problem because it's a single bit and idempotent.
2. PD_ALL_VISIBLE bit makes it to disk before a WAL record representing
an action that makes the page all-visible. Depending on what action
makes a page all-visible:
  a. An old snapshot is released -- not a problem, because if there is a
crash all snapshots are released.
  b. Cleanup action on the page -- not a problem, because that will
create a WAL record and update the page's LSN before setting the
PD_ALL_VISIBLE.

First of all, is that correct? Second, are there other cases to
consider?

Regards,
Jeff Davis


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] crash-safe visibility map, take five

2011-06-22 Thread Robert Haas

On Wed, Jun 22, 2011 at 8:53 PM, Jeff Davis  wrote:
> On Thu, 2011-06-16 at 23:17 -0400, Noah Misch wrote:
>> 2. In the words of a comment added by the patch:
>>  * The critical integrity requirement here is that we must never end up with
>>  * a situation where the visibility map bit is set, and the page-level
>>  * PD_ALL_VISIBLE bit is clear.  If that were to occur, then a subsequent
>>  * page modification would fail to clear the visibility map bit.
>> It does this by WAL-logging the operation of setting a vm bit.  This also has
>> the benefit of getting vm bits set correctly on standbys.
>
> In the same function, there is also the comment:
>
> "We don't bump the LSN of the heap page when setting the visibility
> map bit, because that would generate an unworkable volume of
> full-page writes.  This exposes us to torn page hazards, but since
> we're not inspecting the existing page contents in any way, we
> don't care."
>
> It would be nice to have a comment explaining why that is safe with
> respect to the WAL-before-data rule. Obviously torn pages aren't much of
> a problem, because it's a single bit and completely idempotent.

That's exactly what I was trying to explain in the comment you cite.
Feel free to propose a specific change...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] crash-safe visibility map, take five

2011-06-22 Thread Jeff Davis

On Thu, 2011-06-16 at 23:17 -0400, Noah Misch wrote:
> 2. In the words of a comment added by the patch:
>  * The critical integrity requirement here is that we must never end up with
>  * a situation where the visibility map bit is set, and the page-level
>  * PD_ALL_VISIBLE bit is clear.  If that were to occur, then a subsequent
>  * page modification would fail to clear the visibility map bit.
> It does this by WAL-logging the operation of setting a vm bit.  This also has
> the benefit of getting vm bits set correctly on standbys.

In the same function, there is also the comment:

"We don't bump the LSN of the heap page when setting the visibility
map bit, because that would generate an unworkable volume of
full-page writes.  This exposes us to torn page hazards, but since
we're not inspecting the existing page contents in any way, we
don't care."

It would be nice to have a comment explaining why that is safe with
respect to the WAL-before-data rule. Obviously torn pages aren't much of
a problem, because it's a single bit and completely idempotent.

Regards,
Jeff Davis

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] crash-safe visibility map, take five

2011-06-18 Thread Robert Haas

On Sat, Jun 18, 2011 at 10:41 PM, Noah Misch  wrote:
> This version of the patch has some bogus hunks in int.c, int8.c, and heap.c
> reverting a few of your other recent commits.

Woops, good catch.  Corrected version attached, in case anyone else
wants to take a look at it.

> Looks good otherwise.  I've
> marked the patch "Ready for Committer".

Thanks.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


visibility-map-v7.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] crash-safe visibility map, take five

2011-06-18 Thread Noah Misch

On Sat, Jun 18, 2011 at 10:04:52PM -0400, Robert Haas wrote:
> On Sat, Jun 18, 2011 at 9:37 AM, Noah Misch  wrote:
> > Indeed. ?If we conflicted on the xmin of the most-recent all-visible tuple
> > when setting the bit, all-visible on the standby would become a superset of
> > all-visible on the master, and we could cease to ignore the bits. ?This is 
> > an
> > excellent trade-off for masters that do UPDATE and DELETE, because they're
> > already conflicting (or using avoidance measures) on similar xids at VACUUM
> > time. ?(Some INSERT-only masters will disfavor the trade-off, but we could
> > placate them with a GUC. ?On the other hand, hot_standby_feedback works and
> > costs an INSERT-only master so little.)
> 
> I hadn't really considered the possibility that making more things
> conflict on the standby server could work out to a win.  I think that
> would need some careful testing, and I don't want to get tied up in
> that for purposes of this patch.  There is no law against a WAL format
> change, so we can always do it later if someone wants to do the
> legwork.

Fair enough.

> > I ran one repetition of my benchmark from last report, and the amount of WAL
> > has not changed. ?Given that, I think the previous runs remain valid.
> 
> As far as I can see, the upshot of those benchmarks was that more WAL
> was generated, but the time to execute vacuum didn't really change.  I
> think that's going to be pretty typical.  In your test, the additional
> WAL amounted to about 0.6% of the amount of data VACUUM is dirtying,
> which I think is pretty much in the noise, assuming wal_buffers and
> checkpoint_segments are tuned to suitable values.

I think that's about right.  The timings were too unstable to conclude much,
so I've focused on the WAL volume, which isn't worrisome.

> The other thing to worry about from a performance view is whether the
> push-ups required to relocate the clear-the-vm-bits logic to within
> the critical sections is going to have a significant impact on
> ordinary insert/update/delete operations.  I was a bit unsure at first
> about Heikki's contention that it wouldn't matter, but after going
> through the exercise of writing the code I think I see now why he
> wasn't concerned.

I agree; that part of the cost looks unmeasurable.

> The only possibly-noticeable overhead comes from
> the possibility that we might decide not to pin the visibility map
> buffer before locking the page(s) we're operating on, and need to
> unlock-pin-and-relock.  But for that to happen, VACUUM's got to buzz
> through and mark the page all-visible just around the time that we
> examine bit.  And I have a tough time imagining that happening very
> often.  You have to have a page that has been modified since the last
> VACUUM, but not too recently (otherwise it's not actually all-visible)
> and then VACUUM has to get there and process it at almost the same
> moment that someone decides to make another modification.  It's hard
> even to think about an artificial test case that would hit that
> situation with any regularity.

I can't see it happening very often, either.


This version of the patch has some bogus hunks in int.c, int8.c, and heap.c
reverting a few of your other recent commits.  Looks good otherwise.  I've
marked the patch "Ready for Committer".

Thanks,
nm

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] crash-safe visibility map, take five

2011-06-18 Thread Robert Haas

On Sat, Jun 18, 2011 at 9:37 AM, Noah Misch  wrote:
> Indeed.  If we conflicted on the xmin of the most-recent all-visible tuple
> when setting the bit, all-visible on the standby would become a superset of
> all-visible on the master, and we could cease to ignore the bits.  This is an
> excellent trade-off for masters that do UPDATE and DELETE, because they're
> already conflicting (or using avoidance measures) on similar xids at VACUUM
> time.  (Some INSERT-only masters will disfavor the trade-off, but we could
> placate them with a GUC.  On the other hand, hot_standby_feedback works and
> costs an INSERT-only master so little.)

I hadn't really considered the possibility that making more things
conflict on the standby server could work out to a win.  I think that
would need some careful testing, and I don't want to get tied up in
that for purposes of this patch.  There is no law against a WAL format
change, so we can always do it later if someone wants to do the
legwork.

> No problem.  I ran these test cases with the new patch:
>
> -Description-                                           -Expected bits-       
>   -Result-
> set bit, VACUUM commit, crash           1,5                                   
>   1,5
> set bit, crash                                          0,1                   
>                   0,1
> set bit, flush heap page, crash         0,5                                   
>   0,5
> set bit, flush vm page, crash           1,5                                   
>   1,5
>
> xlog flush request locations look correct, too.  Overall, looking good.  Do
> you know of other notable cases to check?  The last two were somewhat tricky
> to inject; if anyone wishes to reproduce them, I'm happy to share my recipe.

Those sound like interesting recipes...

> One more thing came to mind: don't we need a critical section inside the "if"
> block in visibilitymap_set()?  I can't see anything in there that could
> elog(ERROR, ...), but my general impression is that we use one anyway.

Seems like a good idea.

> I ran one repetition of my benchmark from last report, and the amount of WAL
> has not changed.  Given that, I think the previous runs remain valid.

As far as I can see, the upshot of those benchmarks was that more WAL
was generated, but the time to execute vacuum didn't really change.  I
think that's going to be pretty typical.  In your test, the additional
WAL amounted to about 0.6% of the amount of data VACUUM is dirtying,
which I think is pretty much in the noise, assuming wal_buffers and
checkpoint_segments are tuned to suitable values.

The other thing to worry about from a performance view is whether the
push-ups required to relocate the clear-the-vm-bits logic to within
the critical sections is going to have a significant impact on
ordinary insert/update/delete operations.  I was a bit unsure at first
about Heikki's contention that it wouldn't matter, but after going
through the exercise of writing the code I think I see now why he
wasn't concerned.  The only possibly-noticeable overhead comes from
the possibility that we might decide not to pin the visibility map
buffer before locking the page(s) we're operating on, and need to
unlock-pin-and-relock.  But for that to happen, VACUUM's got to buzz
through and mark the page all-visible just around the time that we
examine bit.  And I have a tough time imagining that happening very
often.  You have to have a page that has been modified since the last
VACUUM, but not too recently (otherwise it's not actually all-visible)
and then VACUUM has to get there and process it at almost the same
moment that someone decides to make another modification.  It's hard
even to think about an artificial test case that would hit that
situation with any regularity.

>> Hmm, now that I look at it, I think this test is backwards too.  I
>> think you might be right that it wouldn't hurt anything to go ahead
>> and set it, but I'm inclined to leave it in for now.
>
> Okay.  Consider expanding the comment to note that this is out of conservatism
> rather than known necessity.

Done.

> Probably not worth mentioning, but there remains one space instead of one tab
> after "visibilitymap_clear" in this line:
>  *              visibilitymap_clear      - clear a bit in the visibility map

Gee, I'm not sure whether there's a one true way of doing that.  I
know we have a no-spaces-before-tabs rule but I'm not sure what the
guidelines are for indenting elsewhere in the line.

> FWIW, I just do C-s TAB in emacs and then scan for places where the boundaries
> of the highlighted regions fail to line up vertically.

vim.

>> > This fills RelationGetBufferForTuple()'s needs, but the name doesn't seem 
>> > to
>> > fit the task too well. ?This function doesn't check the pin or that the 
>> > buffer
>> > applies to the correct relation. ?Consider naming it something like
>> > `visibilitymap_buffer_covers_block'.
>>
>> I think that this may be related to the fact that visibility

Re: [HACKERS] crash-safe visibility map, take five

2011-06-18 Thread Noah Misch

On Fri, Jun 17, 2011 at 01:21:17PM -0400, Robert Haas wrote:
> On Thu, Jun 16, 2011 at 11:17 PM, Noah Misch  wrote:
> > I suggest revisiting the suggestion in
> > http://archives.postgresql.org/message-id/27743.1291135...@sss.pgh.pa.us and
> > including a latestRemovedXid in xl_heap_visible. ?The range of risky xids is
> > the same for setting a visibility map bit as for deleting a dead tuple, and
> > the same operation (VACUUM) produces both conflicts.
> 
> See Heikki's follow-up email.  The standby has to ignore all-visible
> bits anyway, because the fact that a transaction is all-visible on the
> master does not imply that it is all-visible on the standby.  So I
> don't think there's a problem here.

Indeed.  If we conflicted on the xmin of the most-recent all-visible tuple
when setting the bit, all-visible on the standby would become a superset of
all-visible on the master, and we could cease to ignore the bits.  This is an
excellent trade-off for masters that do UPDATE and DELETE, because they're
already conflicting (or using avoidance measures) on similar xids at VACUUM
time.  (Some INSERT-only masters will disfavor the trade-off, but we could
placate them with a GUC.  On the other hand, hot_standby_feedback works and
costs an INSERT-only master so little.)

> > I proceeded to do some immediate-shutdown tests to see if we get the bits 
> > set
> > as expected. ?I set up a database like this:
> > ? ? ? ?CREATE EXTENSION pageinspect;
> > ? ? ? ?CREATE TABLE t (c int);
> > ? ? ? ?INSERT INTO t VALUES (2);
> > ? ? ? ?CHECKPOINT;
> > I normally cleared bits with "UPDATE t SET c = 1; CHECKPOINT;" and set them
> > with "VACUUM t". ?I checked bits with this query:
> > ? ? ? ?SELECT to_hex(get_byte(get_raw_page('t', 'vm', 0), 24)),
> > ? ? ? ? ? ? ? to_hex(flags::int) FROM page_header(get_raw_page('t', 0));
> > The row from that query should generally be 0,1 (bits unset) or 1,5 (bits
> > set). ?0,5 is fine after a crash. ?1,1 means we've broken our contract: the 
> > VM
> > bit is set while PD_ALL_VISIBLE is not set.
> >
> > First test was to clear bits, checkpoint, then VACUUM and SIGQUIT the
> > postmaster. ?The system came back up with 1/1 bits. ?I poked around enough 
> > to
> > see that XLByteLE(lsn, PageGetLSN(page)) was failing, but I did not dig any
> > deeper toward a root cause. ?If you have trouble reproducing this, let me 
> > know
> > so I can assemble a complete, self-contained test case.
> 
> Thank you for putting these test cases together.  As you can probably
> tell, I was having difficulty figuring out exactly how to test this.

No problem.  I ran these test cases with the new patch:

-Description-   -Expected bits- 
-Result-
set bit, VACUUM commit, crash   1,5 
1,5
set bit, crash  0,1 
0,1
set bit, flush heap page, crash 0,5 
0,5
set bit, flush vm page, crash   1,5 
1,5

xlog flush request locations look correct, too.  Overall, looking good.  Do
you know of other notable cases to check?  The last two were somewhat tricky
to inject; if anyone wishes to reproduce them, I'm happy to share my recipe.

One more thing came to mind: don't we need a critical section inside the "if"
block in visibilitymap_set()?  I can't see anything in there that could
elog(ERROR, ...), but my general impression is that we use one anyway.

I ran one repetition of my benchmark from last report, and the amount of WAL
has not changed.  Given that, I think the previous runs remain valid.

> I think that the problem here is that the sense of that test is
> exactly backwards from what it should be.  IIUC, the word "precedes"
> in the previous comment should in fact say "follows".

Ah; quite so.

> >> + ? ? ? ? ? ? /* Don't set the bit if replay has already passed this 
> >> point. */
> >> + ? ? ? ? ? ? if (XLByteLE(lsn, PageGetLSN(BufferGetPage(vmbuffer
> >> + ? ? ? ? ? ? ? ? ? ? visibilitymap_set(reln, xlrec->block, lsn, 
> >> &vmbuffer);
> >
> > ... wouldn't it be better to do this unconditionally? ?Some later record 
> > will
> > unset it if needed, because all replay functions that clear the bit do so
> > without consulting the vm page LSN. ?On the other hand, the worst 
> > consequence
> > of the way you've done it is VACUUM visiting the page one more time to set 
> > the
> > bit.
> 
> Hmm, now that I look at it, I think this test is backwards too.  I
> think you might be right that it wouldn't hurt anything to go ahead
> and set it, but I'm inclined to leave it in for now.

Okay.  Consider expanding the comment to note that this is out of conservatism
rather than known necessity.

> > I think it's worth noting, perhaps based on your explanation in the
> > second-to-last paragraph of
> > http://archives.postgresql.org/message-id/BANLkTi=b7jvmq6fa_exlcygzuyv1u9a

Re: [HACKERS] crash-safe visibility map, take five

2011-06-17 Thread Robert Haas

On Thu, Jun 16, 2011 at 11:17 PM, Noah Misch  wrote:
> I took a look at this patch.

No kidding!  Thanks for the very detailed review.

> +1 for Buffer over Buffer * when the value isn't mutated for the caller.

I changed this.

> I suggest revisiting the suggestion in
> http://archives.postgresql.org/message-id/27743.1291135...@sss.pgh.pa.us and
> including a latestRemovedXid in xl_heap_visible.  The range of risky xids is
> the same for setting a visibility map bit as for deleting a dead tuple, and
> the same operation (VACUUM) produces both conflicts.

See Heikki's follow-up email.  The standby has to ignore all-visible
bits anyway, because the fact that a transaction is all-visible on the
master does not imply that it is all-visible on the standby.  So I
don't think there's a problem here.

> lazy_scan_heap() has two calls to visibilitymap_set(), but the patch only
> changed the recptr argument for one of them.  This has the effect that we only
> emit WAL for empty pages and pages that happened to have pd_lsn == {0,0}, such
> as those not modified since the transaction creating the table.  I fixed this
> before testing further.

Good catch, thanks.  I also added the Assert() that you recommended
further down.

> This happens due to heap_xlog_redo() calling UnlockReleaseBuffer() despite
> having taken no buffer content lock.  I added
>        LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
> before the "if" to get past this.

Fixed, thanks.

> I proceeded to do some immediate-shutdown tests to see if we get the bits set
> as expected.  I set up a database like this:
>        CREATE EXTENSION pageinspect;
>        CREATE TABLE t (c int);
>        INSERT INTO t VALUES (2);
>        CHECKPOINT;
> I normally cleared bits with "UPDATE t SET c = 1; CHECKPOINT;" and set them
> with "VACUUM t".  I checked bits with this query:
>        SELECT to_hex(get_byte(get_raw_page('t', 'vm', 0), 24)),
>               to_hex(flags::int) FROM page_header(get_raw_page('t', 0));
> The row from that query should generally be 0,1 (bits unset) or 1,5 (bits
> set).  0,5 is fine after a crash.  1,1 means we've broken our contract: the VM
> bit is set while PD_ALL_VISIBLE is not set.
>
> First test was to clear bits, checkpoint, then VACUUM and SIGQUIT the
> postmaster.  The system came back up with 1/1 bits.  I poked around enough to
> see that XLByteLE(lsn, PageGetLSN(page)) was failing, but I did not dig any
> deeper toward a root cause.  If you have trouble reproducing this, let me know
> so I can assemble a complete, self-contained test case.

Thank you for putting these test cases together.  As you can probably
tell, I was having difficulty figuring out exactly how to test this.

I think that the problem here is that the sense of that test is
exactly backwards from what it should be.  IIUC, the word "precedes"
in the previous comment should in fact say "follows".

> I would delete this comment.  We were done earlier, but we needed to finish 
> the
> critical section.

Done.

> Concerning the optimization in xlog_heap_delete() et al. of not changing the
> page when its LSN is newer -- am I correct that it only ever triggers when
> full_page_writes = off?  Assuming yes ...

I believe that's right.

>> +     /*
>> +      * Even we skipped the heap page update due to the LSN interlock, it's
>> +      * still safe to update the visibility map.  Any WAL record that clears
>> +      * the visibility map bit does so before checking the page LSN, so any
>> +      * bits that need to be cleared will still be cleared.
>> +      */
>> +     if (record->xl_info & XLR_BKP_BLOCK_1)
>> +             RestoreBkpBlocks(lsn, record, false);
>> +     else
>> +     {
>> +             Relation        reln;
>> +             Buffer          vmbuffer = InvalidBuffer;
>> +
>> +             reln = CreateFakeRelcacheEntry(xlrec->node);
>> +             visibilitymap_pin(reln, xlrec->block, &vmbuffer);
>> +             /* Don't set the bit if replay has already passed this point. 
>> */
>> +             if (XLByteLE(lsn, PageGetLSN(BufferGetPage(vmbuffer
>> +                     visibilitymap_set(reln, xlrec->block, lsn, &vmbuffer);
>
> ... wouldn't it be better to do this unconditionally?  Some later record will
> unset it if needed, because all replay functions that clear the bit do so
> without consulting the vm page LSN.  On the other hand, the worst consequence
> of the way you've done it is VACUUM visiting the page one more time to set the
> bit.

Hmm, now that I look at it, I think this test is backwards too.  I
think you might be right that it wouldn't hurt anything to go ahead
and set it, but I'm inclined to leave it in for now.

> I think it's worth noting, perhaps based on your explanation in the
> second-to-last paragraph of
> http://archives.postgresql.org/message-id/BANLkTi=b7jvmq6fa_exlcygzuyv1u9a...@mail.gmail.com
> that the answer may be incorrect again after the recheck.  We don't care:
> redundant clearings of the visibility bit are no

Re: [HACKERS] crash-safe visibility map, take five

2011-06-16 Thread Noah Misch

Robert,

I took a look at this patch.  To summarize its purpose as I understand it, it
does two independent but synergistic things:

1. Move the INSERT/UPDATE/DELETE clearing of visibility map bits from after
the main operation to before it.  This has no affect on crash recovery.  It
closes a race condition described in the last paragraphs of visibilitymap.c's
header comment.

2. In the words of a comment added by the patch:
 * The critical integrity requirement here is that we must never end up with
 * a situation where the visibility map bit is set, and the page-level
 * PD_ALL_VISIBLE bit is clear.  If that were to occur, then a subsequent
 * page modification would fail to clear the visibility map bit.
It does this by WAL-logging the operation of setting a vm bit.  This also has
the benefit of getting vm bits set correctly on standbys.

On Mon, May 23, 2011 at 03:54:44PM -0400, Robert Haas wrote:
> On Wed, May 11, 2011 at 8:46 AM, Pavan Deolasee
>  wrote:
> > Why do the set and clear functions need pass-by-reference (Buffer *)
> > argument ? I don't see them modifying the argument at all. This patch adds
> > the clear function, but the existing set function also suffers from that.
> 
> Yep, I just copied the style of the existing function, in the interest
> of changing no more than necessary.  But maybe it'd be better to
> change them both to take just the Buffer, instead of a pointer.
> Anyone else want to weigh in?

+1 for Buffer over Buffer * when the value isn't mutated for the caller.

I suggest revisiting the suggestion in
http://archives.postgresql.org/message-id/27743.1291135...@sss.pgh.pa.us and
including a latestRemovedXid in xl_heap_visible.  The range of risky xids is
the same for setting a visibility map bit as for deleting a dead tuple, and
the same operation (VACUUM) produces both conflicts.  The system that has
fueled my reports related to standby conflict would hit a snapshot conflict
more or less instantly without vacuum_defer_cleanup_age.  (With 9.1, we'll use
hot_standby_feedback and never look back.)  Adding visibility map bits as a
source of conflict costs nothing to a system like that.  You might not like it
on an INSERT-only system whose VACUUMs only update visibility.  Such systems
might like a GUC on the standby that disables both index-only scans and
conflict on xl_heap_visible.lastestRemovedXid.  For normal read/write systems,
there will be no advantage.

The patch follows a design hashed in Nov-Dec 2010.  Currently, it adds
overhead to VACUUM with thin practical benefit.  However, there's broad
community acceptance that this is a step on a deterministic path to success at
implementing index-only scans, for worthy benefit.

lazy_scan_heap() has two calls to visibilitymap_set(), but the patch only
changed the recptr argument for one of them.  This has the effect that we only
emit WAL for empty pages and pages that happened to have pd_lsn == {0,0}, such
as those not modified since the transaction creating the table.  I fixed this
before testing further.

Next, for no especially good reason, I set a breakpoint on the closing brace
of visibilitymap_set, ran a VACUUM that called it, and instigated a PANIC from
gdb.  Recovery failed like this:

3614 2011-06-16 20:05:18.118 EDT LOG:  0: redo starts at 0/172ABA0
3614 2011-06-16 20:05:18.118 EDT LOCATION:  StartupXLOG, xlog.c:6494
3614 2011-06-16 20:05:18.119 EDT FATAL:  XX000: lock 148 is not held
3614 2011-06-16 20:05:18.119 EDT CONTEXT:  xlog redo visible: rel 
1663/16384/24588; blk 0
3614 2011-06-16 20:05:18.119 EDT LOCATION:  LWLockRelease, lwlock.c:587
TRAP: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c", Line: 
1695)
3613 2011-06-16 20:05:18.119 EDT LOG:  0: startup process (PID 3614) was 
terminated by signal 6: Aborted
3613 2011-06-16 20:05:18.119 EDT LOCATION:  LogChildExit, postmaster.c:2881
3613 2011-06-16 20:05:18.119 EDT LOG:  0: aborting startup due to startup 
process failure
3613 2011-06-16 20:05:18.119 EDT LOCATION:  reaper, postmaster.c:2376

This happens due to heap_xlog_redo() calling UnlockReleaseBuffer() despite
having taken no buffer content lock.  I added
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
before the "if" to get past this.

I proceeded to do some immediate-shutdown tests to see if we get the bits set
as expected.  I set up a database like this:
CREATE EXTENSION pageinspect;
CREATE TABLE t (c int);
INSERT INTO t VALUES (2);
CHECKPOINT;
I normally cleared bits with "UPDATE t SET c = 1; CHECKPOINT;" and set them
with "VACUUM t".  I checked bits with this query:
SELECT to_hex(get_byte(get_raw_page('t', 'vm', 0), 24)),
   to_hex(flags::int) FROM page_header(get_raw_page('t', 0));
The row from that query should generally be 0,1 (bits unset) or 1,5 (bits
set).  0,5 is fine after a crash.  1,1 means we've broken our contract: the VM
bit is set while PD_ALL_VISIBLE is not set.

First test was to clear bits, checkpoint, th

Re: [HACKERS] crash-safe visibility map, take five

2011-05-23 Thread Robert Haas

On Wed, May 11, 2011 at 8:46 AM, Pavan Deolasee
 wrote:
> Why do the set and clear functions need pass-by-reference (Buffer *)
> argument ? I don't see them modifying the argument at all. This patch adds
> the clear function, but the existing set function also suffers from that.

Yep, I just copied the style of the existing function, in the interest
of changing no more than necessary.  But maybe it'd be better to
change them both to take just the Buffer, instead of a pointer.
Anyone else want to weigh in?

> There are several invocations of pin/clear/release combos. May be you would
> want a convenience routine for doing that in a single step or just pass
> InvalidBuffer to clear() in which case, it would assume that the vm buffer
> is not pinned and do the needful.

I don't think there are enough copies of this logic to worry about it;
and I think that keeping those routines separate is more clear.

> The comment at the top of visibilitymap_pin_ok  says "On entry, *buf", but
> the function really takes just a buf.

Good catch.Very slightly updated patch attached.

> You can possibly fold
> visibilitymap_pin_ok() into a macro (and also name it slightly differently
> like visibilitymap_is_pinned ?).

Well, that name would be kind of misleading, because it's not so much
checking whether we have a pin, but rather whether any pin we have is
the right one.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

visibility-map-v4.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] crash-safe visibility map, take five

2011-05-11 Thread Pavan Deolasee

On Tue, May 10, 2011 at 7:38 PM, Robert Haas  wrote:

> On Tue, May 10, 2011 at 9:45 AM, Merlin Moncure 
> wrote:
> > I see: here's a comment that was throwing me off:
> > +   /*
> > +* If we didn't get the lock and it turns out we need it, we'll
> have to
> > +* unlock and re-lock, to avoid holding the buffer lock across an
> I/O.
> > +* That's a bit unfortunate, but hopefully shouldn't happen
> often.
> > +*/
> >
> > I think that might be phrased as "didn't get the pin and it turns out
> > we need it because the bit can change after inspection".  The visible
> > bit isn't 'wrong' as suggested in the comments, it just can change so
> > that it becomes wrong.  Maybe a note of why it could change would be
> > helpful.
>
> Oh, I see.  I did write "lock" when I meant "pin", and your other
> point is well-taken as well.  Here's a revised version with some
> additional wordsmithing.
>
>
Some trivial comments.

Why do the set and clear functions need pass-by-reference (Buffer *)
argument ? I don't see them modifying the argument at all. This patch adds
the clear function, but the existing set function also suffers from that.

There are several invocations of pin/clear/release combos. May be you would
want a convenience routine for doing that in a single step or just pass
InvalidBuffer to clear() in which case, it would assume that the vm buffer
is not pinned and do the needful.

The comment at the top of visibilitymap_pin_ok  says "On entry, *buf", but
the function really takes just a buf. You can possibly fold
visibilitymap_pin_ok() into a macro (and also name it slightly differently
like visibilitymap_is_pinned ?).

Thanks,
Pavan

-- 
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com

Re: [HACKERS] crash-safe visibility map, take five

2011-05-10 Thread Robert Haas

On Tue, May 10, 2011 at 1:00 PM, Tom Lane  wrote:
> Robert Haas  writes:
>> On Tue, May 10, 2011 at 11:34 AM, Jesper Krogh  wrote:
>>> Or what is the downside for keeping it across IO? Will it block other
>>> processes trying to read it?
>
>> Heikki might be in a better position to comment on that than I am,
>> since he wrote the existing code.  But I think that's basically the
>> issue.  When one process has an exclusive buffer lock, nobody else can
>> scan the tuples in that block - so a sequential scan, for example,
>> that reached that block, or an index scan that needed to probe into
>> it, would pile up behind the read of the visibility map page.
>
> Right, it's the loss of potential concurrency that's annoying here.
>
> On the other hand, the concurrency loss might be entirely theoretical
> --- in particular, if other potential readers of the heap page would
> probably also need to wait for the visibility page to come in, then
> nothing is gained by letting them acquire the heap page lock sooner.
>
> I've not read this thread in any detail yet, but if we're going to be
> jumping through extremely complex hoops to avoid that scenario, it might
> be better to KISS ... especially in the first iteration.

I wouldn't describe the hoops as extremely complex; I don't feel any
inclination to simplify the patch beyond what it is right now.  Of
course, we'll see what the feedback looks like after more people have
read the patch, but my feeling is that the patch strikes a reasonable
balance between performance and keeping it simple.  There are some
more complicated shenanigans that I started to experiment with and
ripped out as premature optimization, but this part I think is OK.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] crash-safe visibility map, take five

2011-05-10 Thread Tom Lane

Robert Haas  writes:
> On Tue, May 10, 2011 at 11:34 AM, Jesper Krogh  wrote:
>> Or what is the downside for keeping it across IO? Will it block other
>> processes trying to read it?

> Heikki might be in a better position to comment on that than I am,
> since he wrote the existing code.  But I think that's basically the
> issue.  When one process has an exclusive buffer lock, nobody else can
> scan the tuples in that block - so a sequential scan, for example,
> that reached that block, or an index scan that needed to probe into
> it, would pile up behind the read of the visibility map page.

Right, it's the loss of potential concurrency that's annoying here.

On the other hand, the concurrency loss might be entirely theoretical
--- in particular, if other potential readers of the heap page would
probably also need to wait for the visibility page to come in, then
nothing is gained by letting them acquire the heap page lock sooner.

I've not read this thread in any detail yet, but if we're going to be
jumping through extremely complex hoops to avoid that scenario, it might
be better to KISS ... especially in the first iteration.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] crash-safe visibility map, take five

2011-05-10 Thread Robert Haas

On Tue, May 10, 2011 at 11:34 AM, Jesper Krogh  wrote:
> On 2011-05-10 14:48, Robert Haas wrote:
>>
>> We could avoid all of this complexity - and the possibility of pinning
>> the visibility map page needlessly - by locking the heap buffer first
>> and then pinning the visibility map page if the heap page is
>> all-visible.  However, that would involve holding the lock on the heap
>> buffer across a possible disk I/O to bring the visibility map page
>> into memory, which is something the existing code tries pretty hard to
>> avoid.
>
> Assuming that the visibillity map would be used for visibillity testing,
> just picking the lock would effectively mean "we want it in the buffers",
> which would not be that bad?
>
> Or what is the downside for keeping it across IO? Will it block other
> processes trying to read it?

Heikki might be in a better position to comment on that than I am,
since he wrote the existing code.  But I think that's basically the
issue.  When one process has an exclusive buffer lock, nobody else can
scan the tuples in that block - so a sequential scan, for example,
that reached that block, or an index scan that needed to probe into
it, would pile up behind the read of the visibility map page.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] crash-safe visibility map, take five

2011-05-10 Thread Jesper Krogh


On 2011-05-10 14:48, Robert Haas wrote:

We could avoid all of this complexity - and the possibility of pinning
the visibility map page needlessly - by locking the heap buffer first
and then pinning the visibility map page if the heap page is
all-visible.  However, that would involve holding the lock on the heap
buffer across a possible disk I/O to bring the visibility map page
into memory, which is something the existing code tries pretty hard to
avoid.

Assuming that the visibillity map would be used for visibillity testing,
just picking the lock would effectively mean "we want it in the buffers",
which would not be that bad?

Or what is the downside for keeping it across IO? Will it block other
processes trying to read it?

--
Jesper

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] crash-safe visibility map, take five

2011-05-10 Thread Robert Haas

On Tue, May 10, 2011 at 9:45 AM, Merlin Moncure  wrote:
> I see: here's a comment that was throwing me off:
> +       /*
> +        * If we didn't get the lock and it turns out we need it, we'll have 
> to
> +        * unlock and re-lock, to avoid holding the buffer lock across an I/O.
> +        * That's a bit unfortunate, but hopefully shouldn't happen often.
> +        */
>
> I think that might be phrased as "didn't get the pin and it turns out
> we need it because the bit can change after inspection".  The visible
> bit isn't 'wrong' as suggested in the comments, it just can change so
> that it becomes wrong.  Maybe a note of why it could change would be
> helpful.

Oh, I see.  I did write "lock" when I meant "pin", and your other
point is well-taken as well.  Here's a revised version with some
additional wordsmithing.

> Other than that, it looks pretty good...ISTM an awfully small amount
> of code to provide what it's doing (that's a good thing!).

Thanks.  It's definitely not big in terms of code footprint; it's
mostly a matter of making sure we've dotted all the "i"s and crossed
all the "t"s.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


visibility-map-v3.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] crash-safe visibility map, take five

2011-05-10 Thread Merlin Moncure

On Tue, May 10, 2011 at 7:48 AM, Robert Haas  wrote:
> I thought I'd explained it fairly thoroughly in the comments, but
> evidently not.  Suggestions for improvement are welcome.

ok.  that clears it up nicely.

> Here goes in more detail: Every time we insert, update, or delete a
> tuple in a particular heap page, we must check whether the page is
> marked all-visible.  If it is, then we need to clear the page-level
> bit marking it as all-visible, and also the corresponding page in the
> visibility map.  On the other hand, if the page isn't marked
> all-visible, then we needn't touch the visibility map at all.  So,
> there are either one or two buffers involved: the buffer containing
> the heap page ("buffer") and possibly also a buffer containing the
> visibility map page in which the bit for the heap page is to be found
> ("vmbuffer").   Before taking an exclusive content-lock on the heap
> buffer, we check whether the page appears to be all-visible.  If it
> does, then we pin the visibility map page and then lock the buffer.
> If not, we just lock the buffer.

I see: here's a comment that was throwing me off:
+   /*
+* If we didn't get the lock and it turns out we need it, we'll have to
+* unlock and re-lock, to avoid holding the buffer lock across an I/O.
+* That's a bit unfortunate, but hopefully shouldn't happen often.
+*/

I think that might be phrased as "didn't get the pin and it turns out
we need it because the bit can change after inspection".  The visible
bit isn't 'wrong' as suggested in the comments, it just can change so
that it becomes wrong.  Maybe a note of why it could change would be
helpful.

Other than that, it looks pretty good...ISTM an awfully small amount
of code to provide what it's doing (that's a good thing!).

merlin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] crash-safe visibility map, take five

2011-05-10 Thread Robert Haas

On Mon, May 9, 2011 at 10:25 PM, Merlin Moncure  wrote:
> On Fri, May 6, 2011 at 5:47 PM, Robert Haas  wrote:
>> On Wed, Mar 30, 2011 at 8:52 AM, Heikki Linnakangas
>>  wrote:
 Another question:
 To address the problem in
 http://archives.postgresql.org/pgsql-hackers/2010-02/msg02097.php
 , should we just clear the vm before the log of insert/update/delete?
 This may reduce the performance, is there another solution?
>>>
>>> Yeah, that's a straightforward way to fix it. I don't think the performance
>>> hit will be too bad. But we need to be careful not to hold locks while doing
>>> I/O, which might require some rearrangement of the code. We might want to do
>>> a similar dance that we do in vacuum, and call visibilitymap_pin first, then
>>> lock and update the heap page, and then set the VM bit while holding the
>>> lock on the heap page.
>>
>> Here's an attempt at implementing the necessary gymnastics.
>
> Is there a quick synopsis of why you have to do (sometimes) the
> pin->lock->unlock->pin->lock mechanic? How come you only can fail to
> get the pin at most once?

I thought I'd explained it fairly thoroughly in the comments, but
evidently not.  Suggestions for improvement are welcome.

Here goes in more detail: Every time we insert, update, or delete a
tuple in a particular heap page, we must check whether the page is
marked all-visible.  If it is, then we need to clear the page-level
bit marking it as all-visible, and also the corresponding page in the
visibility map.  On the other hand, if the page isn't marked
all-visible, then we needn't touch the visibility map at all.  So,
there are either one or two buffers involved: the buffer containing
the heap page ("buffer") and possibly also a buffer containing the
visibility map page in which the bit for the heap page is to be found
("vmbuffer").   Before taking an exclusive content-lock on the heap
buffer, we check whether the page appears to be all-visible.  If it
does, then we pin the visibility map page and then lock the buffer.
If not, we just lock the buffer.  However, since we weren't holding
any lock, it's possible that between the time when we checked the
visibility map bit and the time when we obtained the exclusive
buffer-lock, the visibility map bit might have changed from clear to
set (because someone is concurrently running VACUUM on the table; or
on platforms with weak memory-ordering, someone was running VACUUM
"almost" concurrently).  If that happens, we give up our buffer lock,
go pin the visibility map page, and reacquire the buffer lock.

At this point in the process, we know that *if* the page is marked
all-visible, *then* we have the appropriate visibility map page
pinned.  There are three possible pathways: (1) If the buffer
initially appeared to be all-visible, we will have pinned the
visibility map page before acquiring the exclusive lock; (2) If the
buffer initially appeared NOT to be all-visible, but by the time we
obtained the exclusive lock it now appeared to be all-visible, then we
will have done the unfortunate unlock-pin-relock dance, and the
visibility map page will now be pinned; (3) if the buffer initially
appeared NOT to be all-visible, and by the time we obtained the
exclusive lock it STILL appeared NOT to be all-visible, then we don't
have the visibility map page pinned - but that's OK, because in this
case no operation on the visibility map needs to be performed.

Now it is very possible that in case (1) or (2) the visibility map
bit, though we saw it set at some point, will actually have been
cleared in the meantime.  In case (1), this could happen before we
obtain the exclusive lock; while in case (2), it could happen after we
give up the lock to go pin the visibility map page and before we
reacquire it.  This will typically happen when a buffer has been
sitting around for a while in an all-visible state and suddenly two
different backends both try to update or delete tuples in that buffer
at almost exactly the same time.  But it causes no great harm - both
backends will pin the visibility map page, whichever one gets the
exclusive lock on the heap page first will clear it, and when the
other backend gets the heap page afterwards, it will see that the bit
has already been cleared and do nothing further.  We've wasted the
effort of pinning and unpinning the visibility map page when it wasn't
really necessary, but that's not the end of the world.

We could avoid all of this complexity - and the possibility of pinning
the visibility map page needlessly - by locking the heap buffer first
and then pinning the visibility map page if the heap page is
all-visible.  However, that would involve holding the lock on the heap
buffer across a possible disk I/O to bring the visibility map page
into memory, which is something the existing code tries pretty hard to
avoid.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-

Re: [HACKERS] crash-safe visibility map, take five

2011-05-09 Thread Merlin Moncure

On Fri, May 6, 2011 at 5:47 PM, Robert Haas  wrote:
> On Wed, Mar 30, 2011 at 8:52 AM, Heikki Linnakangas
>  wrote:
>>> Another question:
>>> To address the problem in
>>> http://archives.postgresql.org/pgsql-hackers/2010-02/msg02097.php
>>> , should we just clear the vm before the log of insert/update/delete?
>>> This may reduce the performance, is there another solution?
>>
>> Yeah, that's a straightforward way to fix it. I don't think the performance
>> hit will be too bad. But we need to be careful not to hold locks while doing
>> I/O, which might require some rearrangement of the code. We might want to do
>> a similar dance that we do in vacuum, and call visibilitymap_pin first, then
>> lock and update the heap page, and then set the VM bit while holding the
>> lock on the heap page.
>
> Here's an attempt at implementing the necessary gymnastics.

Is there a quick synopsis of why you have to do (sometimes) the
pin->lock->unlock->pin->lock mechanic? How come you only can fail to
get the pin at most once?

merlin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] crash-safe visibility map, take five

2011-05-09 Thread Rob Wultsch

On Fri, May 6, 2011 at 2:47 PM, Robert Haas  wrote:
> Comments?

At my day job there is saying: "Silence is consent".

I am surprised there has not been more discussion of this change,
considering the magnitude of the possibilities it unlocks.


-- 
Rob Wultsch
wult...@gmail.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] crash-safe visibility map, take five

Re: [HACKERS] crash-safe visibility map, take five

Re: [HACKERS] crash-safe visibility map, take five

Re: [HACKERS] crash-safe visibility map, take five

Re: [HACKERS] crash-safe visibility map, take five

Re: [HACKERS] crash-safe visibility map, take five

Re: [HACKERS] crash-safe visibility map, take five

Re: [HACKERS] crash-safe visibility map, take five

Re: [HACKERS] crash-safe visibility map, take five

Re: [HACKERS] crash-safe visibility map, take five

Re: [HACKERS] crash-safe visibility map, take five

Re: [HACKERS] crash-safe visibility map, take five

Re: [HACKERS] crash-safe visibility map, take five

Re: [HACKERS] crash-safe visibility map, take five

Re: [HACKERS] crash-safe visibility map, take five

Re: [HACKERS] crash-safe visibility map, take five

Re: [HACKERS] crash-safe visibility map, take five

Re: [HACKERS] crash-safe visibility map, take five

Re: [HACKERS] crash-safe visibility map, take five

Re: [HACKERS] crash-safe visibility map, take five

Re: [HACKERS] crash-safe visibility map, take five

Re: [HACKERS] crash-safe visibility map, take five

Re: [HACKERS] crash-safe visibility map, take five

Re: [HACKERS] crash-safe visibility map, take five

Re: [HACKERS] crash-safe visibility map, take five

25 matches

Site Navigation

Mail list logo

Footer information