Re: [HACKERS] Reviewing freeze map code
On Thu, Aug 4, 2016 at 3:24 AM, Andres Freund wrote:
> Hi,
>
> On 2016-08-02 10:55:18 -0400, Noah Misch wrote:
>> [Action required within 72 hours. This is a generic notification.]
>>
>> The above-described topic is currently a PostgreSQL 9.6 open item. Andres,
>> since you committed the patch believed to have created it, you own this open
>> item.
>
> Well kinda (it was a partial fix for something not originally by me),
> but I'll deal with it. Reading now, will commit tomorrow.

Thanks. I kept meaning to get to this one, and failing to do so.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Reviewing freeze map code
Hi,

On 2016-08-02 10:55:18 -0400, Noah Misch wrote:
> [Action required within 72 hours. This is a generic notification.]
>
> The above-described topic is currently a PostgreSQL 9.6 open item. Andres,
> since you committed the patch believed to have created it, you own this open
> item.

Well kinda (it was a partial fix for something not originally by me),
but I'll deal with it. Reading now, will commit tomorrow.

Regards,

Andres
Re: [HACKERS] Reviewing freeze map code
On Tue, Aug 02, 2016 at 02:10:29PM +0530, Amit Kapila wrote:
> On Tue, Aug 2, 2016 at 11:19 AM, Noah Misch wrote:
> > On Sat, Jul 23, 2016 at 01:25:55PM +0530, Amit Kapila wrote:
> >> On Mon, Jul 18, 2016 at 2:03 PM, Andres Freund wrote:
> >> > On 2016-07-18 10:02:52 +0530, Amit Kapila wrote:
> >> >> Consider the below scenario.
> >> >>
> >> >> Vacuum
> >> >> a. acquires a cleanup lock for page - 10
> >> >> b. busy in checking visibility of tuples
> >> >> --assume, here it takes some time and in the meantime Session-1
> >> >> performs step (a) and (b) and starts waiting in step (c)
> >> >> c. marks the page as all-visible (PageSetAllVisible)
> >> >> d. unlocks and releases the buffer
> >> >>
> >> >> Session-1
> >> >> a. In heap_lock_tuple(), read the buffer for page-10
> >> >> b. check PageIsAllVisible(); the page is not all-visible, so don't
> >> >> acquire the visibilitymap_pin
> >> >> c. LockBuffer in ExclusiveMode - here it will wait for vacuum to
> >> >> release the lock
> >> >> d. Got the lock, but now the page is marked as all-visible, so ideally
> >> >> we need to recheck the page and acquire the visibilitymap_pin
> >> >
> >> > So, I've tried pretty hard to reproduce that. While the theory above is
> >> > sound, I believe the relevant code-path is essentially dead for SQL
> >> > callable code, because we'll always hold a buffer pin before even
> >> > entering heap_update/heap_lock_tuple.
> >> >
> >>
> >> It is possible that we don't hold any buffer pin before entering
> >> heap_update() and/or heap_lock_tuple(). For heap_update(), it is
> >> possible when it enters via the simple_heap_update() path. For
> >> heap_lock_tuple(), it is possible for an ON CONFLICT DO UPDATE statement
> >> and maybe others as well.
> >
> > This is currently listed as a 9.6 open item. Is it indeed a regression in
> > 9.6, or do released versions have the same defect? If it is a 9.6
> > regression, do you happen to know which commit, or at least which feature,
> > caused it?
>
> Commit eca0f1db is the reason for this specific issue.

[Action required within 72 hours. This is a generic notification.]

The above-described topic is currently a PostgreSQL 9.6 open item. Andres,
since you committed the patch believed to have created it, you own this open
item. If some other commit is more relevant or if this does not belong as a
9.6 open item, please let us know. Otherwise, please observe the policy on
open item ownership[1] and send a status update within 72 hours of this
message. Include a date for your subsequent status update. Testers may
discover new open items at any time, and I want to plan to get them all
fixed in advance of shipping 9.6rc1 next week. Consequently, I will
appreciate your efforts toward speedy resolution. Thanks.

[1] http://www.postgresql.org/message-id/20160527025039.ga447...@tornado.leadboat.com
Re: [HACKERS] Reviewing freeze map code
On Tue, Aug 2, 2016 at 11:19 AM, Noah Misch wrote:
> On Sat, Jul 23, 2016 at 01:25:55PM +0530, Amit Kapila wrote:
>> On Mon, Jul 18, 2016 at 2:03 PM, Andres Freund wrote:
>> > On 2016-07-18 10:02:52 +0530, Amit Kapila wrote:
>> >> Consider the below scenario.
>> >>
>> >> Vacuum
>> >> a. acquires a cleanup lock for page - 10
>> >> b. busy in checking visibility of tuples
>> >> --assume, here it takes some time and in the meantime Session-1
>> >> performs step (a) and (b) and starts waiting in step (c)
>> >> c. marks the page as all-visible (PageSetAllVisible)
>> >> d. unlocks and releases the buffer
>> >>
>> >> Session-1
>> >> a. In heap_lock_tuple(), read the buffer for page-10
>> >> b. check PageIsAllVisible(); the page is not all-visible, so don't
>> >> acquire the visibilitymap_pin
>> >> c. LockBuffer in ExclusiveMode - here it will wait for vacuum to
>> >> release the lock
>> >> d. Got the lock, but now the page is marked as all-visible, so ideally
>> >> we need to recheck the page and acquire the visibilitymap_pin
>> >
>> > So, I've tried pretty hard to reproduce that. While the theory above is
>> > sound, I believe the relevant code-path is essentially dead for SQL
>> > callable code, because we'll always hold a buffer pin before even
>> > entering heap_update/heap_lock_tuple.
>> >
>>
>> It is possible that we don't hold any buffer pin before entering
>> heap_update() and/or heap_lock_tuple(). For heap_update(), it is
>> possible when it enters via the simple_heap_update() path. For
>> heap_lock_tuple(), it is possible for an ON CONFLICT DO UPDATE statement
>> and maybe others as well.
>
> This is currently listed as a 9.6 open item. Is it indeed a regression in
> 9.6, or do released versions have the same defect? If it is a 9.6 regression,
> do you happen to know which commit, or at least which feature, caused it?
>

Commit eca0f1db is the reason for this specific issue.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Reviewing freeze map code
On Sat, Jul 23, 2016 at 01:25:55PM +0530, Amit Kapila wrote:
> On Mon, Jul 18, 2016 at 2:03 PM, Andres Freund wrote:
> > On 2016-07-18 10:02:52 +0530, Amit Kapila wrote:
> >> Consider the below scenario.
> >>
> >> Vacuum
> >> a. acquires a cleanup lock for page - 10
> >> b. busy in checking visibility of tuples
> >> --assume, here it takes some time and in the meantime Session-1
> >> performs step (a) and (b) and starts waiting in step (c)
> >> c. marks the page as all-visible (PageSetAllVisible)
> >> d. unlocks and releases the buffer
> >>
> >> Session-1
> >> a. In heap_lock_tuple(), read the buffer for page-10
> >> b. check PageIsAllVisible(); the page is not all-visible, so don't
> >> acquire the visibilitymap_pin
> >> c. LockBuffer in ExclusiveMode - here it will wait for vacuum to
> >> release the lock
> >> d. Got the lock, but now the page is marked as all-visible, so ideally
> >> we need to recheck the page and acquire the visibilitymap_pin
> >
> > So, I've tried pretty hard to reproduce that. While the theory above is
> > sound, I believe the relevant code-path is essentially dead for SQL
> > callable code, because we'll always hold a buffer pin before even
> > entering heap_update/heap_lock_tuple.
> >
>
> It is possible that we don't hold any buffer pin before entering
> heap_update() and/or heap_lock_tuple(). For heap_update(), it is
> possible when it enters via the simple_heap_update() path. For
> heap_lock_tuple(), it is possible for an ON CONFLICT DO UPDATE statement
> and maybe others as well.

This is currently listed as a 9.6 open item. Is it indeed a regression in
9.6, or do released versions have the same defect? If it is a 9.6 regression,
do you happen to know which commit, or at least which feature, caused it?

Thanks,
nm
Re: [HACKERS] Reviewing freeze map code
On Wed, Jul 27, 2016 at 3:24 AM, Robert Haas wrote:
> On Sat, Jul 23, 2016 at 3:55 AM, Amit Kapila wrote:
>> Attached patch fixes the problem for me. Note, I have not tried to
>> reproduce the problem for heap_lock_updated_tuple_rec(), but I think
>> if you are convinced with the above cases, then we should have a similar
>> check in it as well.
>
> I don't think this hunk is correct:
>
> +        /*
> +         * If we didn't pin the visibility map page and the page has become
> +         * all visible, we'll have to unlock and re-lock. See heap_lock_tuple
> +         * for details.
> +         */
> +        if (vmbuffer == InvalidBuffer && PageIsAllVisible(BufferGetPage(buf)))
> +        {
> +            LockBuffer(buf, BUFFER_LOCK_UNLOCK);
> +            visibilitymap_pin(rel, block, &vmbuffer);
> +            LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
> +            goto l4;
> +        }
>
> The code beginning at label l4 assumes that the buffer is unlocked,
> but this code leaves the buffer locked. Also, I don't see the point
> of doing this test so far down in the function. Why not just recheck
> *immediately* after taking the buffer lock?

Right, in this case we can recheck immediately after taking the buffer lock;
updated patch attached.

In passing, I have noticed that heap_delete() doesn't do this unlocking,
pinning of the vm, and locking at the appropriate place. It just checks
immediately after taking the lock, whereas the code further down does unlock
and lock the buffer again. I think we should do it as in the attached patch
(pin_vm_heap_delete-v1.patch).

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

pin_vm_lock_tuple-v2.patch
Description: Binary data

pin_vm_heap_delete-v1.patch
Description: Binary data
Re: [HACKERS] Reviewing freeze map code
On Sat, Jul 23, 2016 at 3:55 AM, Amit Kapila wrote:
> Attached patch fixes the problem for me. Note, I have not tried to
> reproduce the problem for heap_lock_updated_tuple_rec(), but I think
> if you are convinced with the above cases, then we should have a similar
> check in it as well.

I don't think this hunk is correct:

+        /*
+         * If we didn't pin the visibility map page and the page has become
+         * all visible, we'll have to unlock and re-lock. See heap_lock_tuple
+         * for details.
+         */
+        if (vmbuffer == InvalidBuffer && PageIsAllVisible(BufferGetPage(buf)))
+        {
+            LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+            visibilitymap_pin(rel, block, &vmbuffer);
+            LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+            goto l4;
+        }

The code beginning at label l4 assumes that the buffer is unlocked, but this
code leaves the buffer locked. Also, I don't see the point of doing this test
so far down in the function. Why not just recheck *immediately* after taking
the buffer lock? If you find out that you need the pin after all, then

    LockBuffer(buf, BUFFER_LOCK_UNLOCK);
    visibilitymap_pin(rel, block, &vmbuffer);
    LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);

but *do not* go back to l4. Unless I'm missing something, putting this block
further down, as you have it, buys nothing, because none of that intervening
code can release the buffer lock without using goto to jump back to l4.

+    /*
+     * If we didn't pin the visibility map page and the page has become all
+     * visible while we were busy locking the buffer, or during some
+     * subsequent window during which we had it unlocked, we'll have to unlock
+     * and re-lock, to avoid holding the buffer lock across an I/O. That's a
+     * bit unfortunate, especially since we'll now have to recheck whether the
+     * tuple has been locked or updated under us, but hopefully it won't
+     * happen very often.
+     */
+    if (vmbuffer == InvalidBuffer && PageIsAllVisible(page))
+    {
+        LockBuffer(*buffer, BUFFER_LOCK_UNLOCK);
+        visibilitymap_pin(relation, block, &vmbuffer);
+        LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
+        goto l3;
+    }

In contrast, this looks correct: l3 expects the buffer to be locked already,
and the code between l3 and this point can unlock and re-lock the buffer,
potentially multiple times.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Reviewing freeze map code
On Mon, Jul 18, 2016 at 2:03 PM, Andres Freund wrote:
> On 2016-07-18 10:02:52 +0530, Amit Kapila wrote:
>>
>> Consider the below scenario.
>>
>> Vacuum
>> a. acquires a cleanup lock for page - 10
>> b. busy in checking visibility of tuples
>> --assume, here it takes some time and in the meantime Session-1
>> performs step (a) and (b) and starts waiting in step (c)
>> c. marks the page as all-visible (PageSetAllVisible)
>> d. unlocks and releases the buffer
>>
>> Session-1
>> a. In heap_lock_tuple(), read the buffer for page-10
>> b. check PageIsAllVisible(); the page is not all-visible, so don't
>> acquire the visibilitymap_pin
>> c. LockBuffer in ExclusiveMode - here it will wait for vacuum to
>> release the lock
>> d. Got the lock, but now the page is marked as all-visible, so ideally
>> we need to recheck the page and acquire the visibilitymap_pin
>
> So, I've tried pretty hard to reproduce that. While the theory above is
> sound, I believe the relevant code-path is essentially dead for SQL
> callable code, because we'll always hold a buffer pin before even
> entering heap_update/heap_lock_tuple.
>

It is possible that we don't hold any buffer pin before entering
heap_update() and/or heap_lock_tuple(). For heap_update(), it is
possible when it enters via the simple_heap_update() path. For
heap_lock_tuple(), it is possible for an ON CONFLICT DO UPDATE statement
and maybe others as well.

Let me also try to explain with a test for both cases, if the above is not
clear enough.

Case-1 for heap_update()
---
Session-1
Create table t1(c1 int);
Alter table t1 alter column c1 set default 10;
--via debugger, stop at StoreAttrDefault()/heap_update; while you are in
heap_update(), note down the block number

Session-2
vacuum (DISABLE_PAGE_SKIPPING) pg_attribute;
--in lazy_scan_heap(), stop at line
(if (all_visible && !all_visible_according_to_vm)) for the block number
noted in Session-1

Session-1
In the debugger, proceed and let it wait at LockBuffer; note that it will
not pin the visibility map.

Session-2
Set the visibility flag and complete the operation.

Session-1
You will notice that it will attempt to unlock the buffer, pin the
visibility map, and lock the buffer again.

Case-2 for heap_lock_tuple()
---
Session-1
Create table i_conflict(c1 int, c2 char(100));
Create unique index idx_u on i_conflict(c1);
Insert into i_conflict values(1,'aaa');
Insert into i_conflict values(1,'aaa') On Conflict (c1) Do Update Set c2='bbb';
--via debugger, stop at line 385 in nodeModifyTable.c (in ExecInsert(), at
if (onconflict == ONCONFLICT_UPDATE))

Session-2
vacuum (DISABLE_PAGE_SKIPPING) i_conflict
--stop before setting the all-visible flag

Session-1
--in the debugger, proceed and let it wait at LockBuffer; note that it will
not pin the visibility map.

Session-2
--set the visibility flag and complete the operation.

Session-1
--PANIC: wrong buffer passed to visibilitymap_clear
--this is problematic.

Attached patch fixes the problem for me. Note, I have not tried to
reproduce the problem for heap_lock_updated_tuple_rec(), but I think
if you are convinced with the above cases, then we should have a similar
check in it as well.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

pin_vm_lock_tuple-v1.patch
Description: Binary data
Re: [HACKERS] Reviewing freeze map code
On Sat, Jul 16, 2016 at 10:08 AM, Andres Freund wrote:
> On 2016-07-14 20:53:07 -0700, Andres Freund wrote:
>> On 2016-07-13 23:06:07 -0700, Andres Freund wrote:
>> > won't enter the branch, because HEAP_XMAX_LOCK_ONLY won't be set. Which
>> > will leave t_ctid and HEAP_HOT_UPDATED set differently on the master and
>> > standby / after crash recovery. I'm failing to see any harmful
>> > consequences right now, but differences between master and standby are
>> > a bad thing.
>>
>> I think it's actually critical, because HEAP_HOT_UPDATED /
>> HEAP_XMAX_LOCK_ONLY are used to terminate ctid chasing loops (like
>> heap_hot_search_buffer()).
>
> I've pushed a quite heavily revised version of the first patch to
> 9.1-master. I manually verified using pageinspect, gdb breakpoints and a
> standby that xmax, infomask etc are set correctly (leading to finding
> a4d357bf). As there's noticeable differences, especially 9.2->9.3,
> between versions, I'd welcome somebody having a look at the commits.

Waoh, man. Thanks! I was pinged just this week end about a setup that has
likely faced this exact problem, in the shape of "tuple concurrently
updated" with a node getting kill-9-ed by some framework because it did not
finish its shutdown checkpoint after some time in some test, which forced
it to do crash recovery. I have not been able to put my hands on the raw
data to have a look at the flags set within those tuples, but I got the
strong feeling that this is related. After a couple of rounds doing so, it
was possible to see "tuple concurrently updated" errors for a relation that
has few pages and a high update rate using 9.4.

More seriously, I have spent some time looking at what you have pushed on
each branch, and the fixes are looking correct to me.

--
Michael
Re: [HACKERS] Reviewing freeze map code
On 2016-07-18 01:33:10 -0700, Andres Freund wrote:
> On 2016-07-18 10:02:52 +0530, Amit Kapila wrote:
> > On Mon, Jul 18, 2016 at 9:13 AM, Andres Freund wrote:
> > > On 2016-07-18 09:07:19 +0530, Amit Kapila wrote:
> > >> + /*
> > >> + * Before locking the buffer, pin the visibility map page if it may be
> > >> + * necessary.
> > >> + */
> > >>
> > >> + if (PageIsAllVisible(BufferGetPage(*buffer)))
> > >> + visibilitymap_pin(relation, block, &vmbuffer);
> > >> +
> > >> LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
> > >>
> > >> I think we need to check for PageIsAllVisible and try to pin the
> > >> visibility map after taking the lock on the buffer. I think it is quite
> > >> possible that in the time this routine tries to acquire the lock on
> > >> the buffer, the page becomes all visible.
> > >
> > > I don't see how. Without a cleanup lock it's not possible to mark a page
> > > all-visible/frozen.
> > >
> >
> > Consider the below scenario.
> >
> > Vacuum
> > a. acquires a cleanup lock for page - 10
> > b. busy in checking visibility of tuples
> > --assume, here it takes some time and in the meantime Session-1
> > performs step (a) and (b) and starts waiting in step (c)
> > c. marks the page as all-visible (PageSetAllVisible)
> > d. unlocks and releases the buffer
> >
> > Session-1
> > a. In heap_lock_tuple(), read the buffer for page-10
> > b. check PageIsAllVisible(); the page is not all-visible, so don't
> > acquire the visibilitymap_pin
> > c. LockBuffer in ExclusiveMode - here it will wait for vacuum to
> > release the lock
> > d. Got the lock, but now the page is marked as all-visible, so ideally
> > we need to recheck the page and acquire the visibilitymap_pin
>
> So, I've tried pretty hard to reproduce that. While the theory above is
> sound, I believe the relevant code-path is essentially dead for SQL
> callable code, because we'll always hold a buffer pin before even
> entering heap_update/heap_lock_tuple. It's possible that you could
> concoct a dangerous scenario with follow_updates though; but I can't
> immediately see how. Due to that, and based on the closing-in beta
> release, I'm planning to push a version of the patch that has the
> returns fixed; but not this. It seems better to have the majority of
> the fix in.

Pushed that way. Let's try to figure out a good solution to a) test this
case b) how to fix it in a reasonable way.

Note that there's also
http://archives.postgresql.org/message-id/20160718071729.tlj4upxhaylwv75n%40alap3.anarazel.de
which seems related.

Regards,

Andres
Re: [HACKERS] Reviewing freeze map code
On 2016-07-18 10:02:52 +0530, Amit Kapila wrote:
> On Mon, Jul 18, 2016 at 9:13 AM, Andres Freund wrote:
> > On 2016-07-18 09:07:19 +0530, Amit Kapila wrote:
> >> + /*
> >> + * Before locking the buffer, pin the visibility map page if it may be
> >> + * necessary.
> >> + */
> >>
> >> + if (PageIsAllVisible(BufferGetPage(*buffer)))
> >> + visibilitymap_pin(relation, block, &vmbuffer);
> >> +
> >> LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
> >>
> >> I think we need to check for PageIsAllVisible and try to pin the
> >> visibility map after taking the lock on the buffer. I think it is quite
> >> possible that in the time this routine tries to acquire the lock on
> >> the buffer, the page becomes all visible.
> >
> > I don't see how. Without a cleanup lock it's not possible to mark a page
> > all-visible/frozen.
> >
>
> Consider the below scenario.
>
> Vacuum
> a. acquires a cleanup lock for page - 10
> b. busy in checking visibility of tuples
> --assume, here it takes some time and in the meantime Session-1
> performs step (a) and (b) and starts waiting in step (c)
> c. marks the page as all-visible (PageSetAllVisible)
> d. unlocks and releases the buffer
>
> Session-1
> a. In heap_lock_tuple(), read the buffer for page-10
> b. check PageIsAllVisible(); the page is not all-visible, so don't
> acquire the visibilitymap_pin
> c. LockBuffer in ExclusiveMode - here it will wait for vacuum to
> release the lock
> d. Got the lock, but now the page is marked as all-visible, so ideally
> we need to recheck the page and acquire the visibilitymap_pin

So, I've tried pretty hard to reproduce that. While the theory above is
sound, I believe the relevant code-path is essentially dead for SQL
callable code, because we'll always hold a buffer pin before even
entering heap_update/heap_lock_tuple. It's possible that you could
concoct a dangerous scenario with follow_updates though; but I can't
immediately see how. Due to that, and based on the closing-in beta
release, I'm planning to push a version of the patch that has the
returns fixed; but not this. It seems better to have the majority of
the fix in.

Andres
Re: [HACKERS] Reviewing freeze map code
On Mon, Jul 18, 2016 at 9:13 AM, Andres Freund wrote:
> On 2016-07-18 09:07:19 +0530, Amit Kapila wrote:
>> + /*
>> + * Before locking the buffer, pin the visibility map page if it may be
>> + * necessary.
>> + */
>>
>> + if (PageIsAllVisible(BufferGetPage(*buffer)))
>> + visibilitymap_pin(relation, block, &vmbuffer);
>> +
>> LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
>>
>> I think we need to check for PageIsAllVisible and try to pin the
>> visibility map after taking the lock on the buffer. I think it is quite
>> possible that in the time this routine tries to acquire the lock on
>> the buffer, the page becomes all visible.
>
> I don't see how. Without a cleanup lock it's not possible to mark a page
> all-visible/frozen.
>

Consider the below scenario.

Vacuum
a. acquires a cleanup lock for page - 10
b. busy in checking visibility of tuples
--assume, here it takes some time and in the meantime Session-1
performs step (a) and (b) and starts waiting in step (c)
c. marks the page as all-visible (PageSetAllVisible)
d. unlocks and releases the buffer

Session-1
a. In heap_lock_tuple(), read the buffer for page-10
b. check PageIsAllVisible(); the page is not all-visible, so don't
acquire the visibilitymap_pin
c. LockBuffer in ExclusiveMode - here it will wait for vacuum to
release the lock
d. Got the lock, but now the page is marked as all-visible, so ideally
we need to recheck the page and acquire the visibilitymap_pin

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Reviewing freeze map code
On 2016-07-18 09:07:19 +0530, Amit Kapila wrote:
> + /*
> + * Before locking the buffer, pin the visibility map page if it may be
> + * necessary.
> + */
>
> + if (PageIsAllVisible(BufferGetPage(*buffer)))
> + visibilitymap_pin(relation, block, &vmbuffer);
> +
> LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
>
> I think we need to check for PageIsAllVisible and try to pin the
> visibility map after taking the lock on buffer. I think it is quite
> possible that in the time this routine tries to acquire lock on
> buffer, the page becomes all visible.

I don't see how. Without a cleanup lock it's not possible to mark a page
all-visible/frozen. We might miss the bit becoming unset concurrently,
but that's ok.

Andres
Re: [HACKERS] Reviewing freeze map code
On 2016-07-17 23:34:01 -0400, Robert Haas wrote:
> Thanks very much for working on this. Random suggestions after a quick look:
>
> + * Before locking the buffer, pin the visibility map page if it may be
> + * necessary.
>
> s/necessary/needed/
>
> More substantively, what happens if the situation changes before we
> obtain the buffer lock? I think you need to release the page lock,
> pin the page after all, and then relock the page.

It shouldn't be able to. Cleanup locks, which are required for vacuumlazy
to do anything relevant, aren't possible with the buffer pinned. This
pattern is used in heap_delete/heap_update, so I think we're on a
reasonably well trodden path.

> There seem to be several ways to escape from this function without
> releasing the pin on vmbuffer. From the visibilitymap_pin call here,
> search downward for "return".

Hm, that's clearly not good. The best thing to address that seems to be to
create a separate jump label, which checks vmbuffer and releases the page
lock. Unless you have a better idea.

> + * visibilitymap_clear - clear bit(s) for one page in visibility map
>
> I don't really like the parenthesized-s convention as a shorthand for
> "one or more". It tends to confuse non-native English speakers.
>
> + * any I/O. Returns whether any bits have been cleared.
>
> I suggest "Returns true if any bits have been cleared and false otherwise".

Will change.

- Andres
Re: [HACKERS] Reviewing freeze map code
On Mon, Jul 18, 2016 at 8:18 AM, Andres Freund wrote:
> On 2016-07-16 10:45:26 -0700, Andres Freund wrote:
>>
>> On July 16, 2016 8:49:06 AM PDT, Tom Lane wrote:
>> >Amit Kapila writes:
>> >> On Sat, Jul 16, 2016 at 7:02 AM, Andres Freund wrote:
>> >>> I think we have two choices how to deal with that: First, we can add a
>> >>> new flags variable to xl_heap_lock similar to
>> >>> xl_heap_insert/update/... and bump page magic,
>> >
>> >> +1 for going in this way. This will keep us consistent with how we clear
>> >> the visibility info in other places like heap_xlog_update().
>> >
>> >Yeah. We've already forced a catversion bump for beta3, and I'm about
>> >to go fix PG_CONTROL_VERSION as well, so there's basically no downside
>> >to doing an xlog version bump as well. At least, not if you can get it
>> >in before Monday.
>>
>> OK, Cool. Will do it later today.
>
> Took till today. Attached is a rather heavily revised version of
> Sawada-san's patch. Most notably the recovery routines take care to
> reset the vm in all cases, we don't perform visibilitymap_get_status
> from inside a critical section anymore, and
> heap_lock_updated_tuple_rec() also resets the vm (although I'm not
> entirely sure that can practically be hit).
>

@@ -4563,8 +4579,18 @@ heap_lock_tuple(Relation relation, HeapTuple tuple,

+    /*
+     * Before locking the buffer, pin the visibility map page if it may be
+     * necessary.
+     */
+    if (PageIsAllVisible(BufferGetPage(*buffer)))
+        visibilitymap_pin(relation, block, &vmbuffer);
+
     LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);

I think we need to check for PageIsAllVisible and try to pin the visibility
map after taking the lock on the buffer. I think it is quite possible that
in the time this routine tries to acquire the lock on the buffer, the page
becomes all visible. To avoid the similar hazard, we do check the visibility
of the page after acquiring the buffer lock in heap_update(), at the below
place:

    if (vmbuffer == InvalidBuffer && PageIsAllVisible(page))

Similarly, I think heap_lock_updated_tuple_rec() needs to take care of the
same. While I was typing this e-mail, it seems Robert has already pointed
out the same issue.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Reviewing freeze map code
On Sun, Jul 17, 2016 at 10:48 PM, Andres Freund wrote:
> Took till today. Attached is a rather heavily revised version of
> Sawada-san's patch. Most notably the recovery routines take care to
> reset the vm in all cases, we don't perform visibilitymap_get_status
> from inside a critical section anymore, and
> heap_lock_updated_tuple_rec() also resets the vm (although I'm not
> entirely sure that can practically be hit).
>
> I'm doing some more testing, and Robert said he could take a quick look
> at the patch. If somebody else... Will push sometime after dinner.

Thanks very much for working on this. Random suggestions after a quick look:

+ * Before locking the buffer, pin the visibility map page if it may be
+ * necessary.

s/necessary/needed/

More substantively, what happens if the situation changes before we
obtain the buffer lock? I think you need to release the page lock,
pin the page after all, and then relock the page.

There seem to be several ways to escape from this function without
releasing the pin on vmbuffer. From the visibilitymap_pin call here,
search downward for "return".

+ * visibilitymap_clear - clear bit(s) for one page in visibility map

I don't really like the parenthesized-s convention as a shorthand for
"one or more". It tends to confuse non-native English speakers.

+ * any I/O. Returns whether any bits have been cleared.

I suggest "Returns true if any bits have been cleared and false otherwise".

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Reviewing freeze map code
On 2016-07-16 10:45:26 -0700, Andres Freund wrote:
>
> On July 16, 2016 8:49:06 AM PDT, Tom Lane wrote:
> >Amit Kapila writes:
> >> On Sat, Jul 16, 2016 at 7:02 AM, Andres Freund wrote:
> >>> I think we have two choices how to deal with that: First, we can add a
> >>> new flags variable to xl_heap_lock similar to
> >>> xl_heap_insert/update/... and bump page magic,
> >
> >> +1 for going in this way. This will keep us consistent with how we clear
> >> the visibility info in other places like heap_xlog_update().
> >
> >Yeah. We've already forced a catversion bump for beta3, and I'm about
> >to go fix PG_CONTROL_VERSION as well, so there's basically no downside
> >to doing an xlog version bump as well. At least, not if you can get it
> >in before Monday.
>
> OK, Cool. Will do it later today.

Took till today. Attached is a rather heavily revised version of
Sawada-san's patch. Most notably the recovery routines take care to
reset the vm in all cases, we don't perform visibilitymap_get_status
from inside a critical section anymore, and
heap_lock_updated_tuple_rec() also resets the vm (although I'm not
entirely sure that can practically be hit).

I'm doing some more testing, and Robert said he could take a quick look
at the patch. If somebody else... Will push sometime after dinner.

Regards,

Andres

>From 26f6eff8cef9b436e328a7364d6e4954b702208b Mon Sep 17 00:00:00 2001
From: Andres Freund
Date: Sun, 17 Jul 2016 19:30:38 -0700
Subject: [PATCH] Clear all-frozen visibilitymap status when locking tuples.

Since a892234 & fd31cd265 the visibilitymap's freeze bit is used to
avoid vacuuming the whole relation in anti-wraparound vacuums. Doing so
correctly relies on not adding xids to the heap without also unsetting
the visibilitymap flag. Tuple locking related code has not done so.

To allow selectively resetting all-frozen - to avoid pessimizing
heap_lock_tuple - allow to selectively reset the all-frozen with
visibilitymap_clear().
To avoid having to use visibilitymap_get_status (e.g. via VM_ALL_FROZEN) inside a critical section, have visibilitymap_clear() return whether any bits have been reset. The added flags field fields to xl_heap_lock and xl_heap_lock_updated require bumping the WAL magic. Since there's already been a catversion bump since the last beta, that's not an issue. Author: Masahiko Sawada, heavily revised by Andres Freund Discussion: CAEepm=3fwabwryvw9swhylty4sxvf0xblvxqowuodincx9m...@mail.gmail.com Backpatch: - --- src/backend/access/heap/heapam.c| 126 +--- src/backend/access/heap/visibilitymap.c | 18 +++-- src/backend/access/rmgrdesc/heapdesc.c | 6 +- src/backend/commands/vacuumlazy.c | 6 +- src/include/access/heapam_xlog.h| 9 ++- src/include/access/visibilitymap.h | 4 +- src/include/access/xlog_internal.h | 2 +- 7 files changed, 145 insertions(+), 26 deletions(-) diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c index 2815d91..1216f3f 100644 --- a/src/backend/access/heap/heapam.c +++ b/src/backend/access/heap/heapam.c @@ -2423,7 +2423,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid, PageClearAllVisible(BufferGetPage(buffer)); visibilitymap_clear(relation, ItemPointerGetBlockNumber(&(heaptup->t_self)), - vmbuffer); + vmbuffer, VISIBILITYMAP_VALID_BITS); } /* @@ -2737,7 +2737,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples, PageClearAllVisible(page); visibilitymap_clear(relation, BufferGetBlockNumber(buffer), -vmbuffer); +vmbuffer, VISIBILITYMAP_VALID_BITS); } /* @@ -3239,7 +3239,7 @@ l1: all_visible_cleared = true; PageClearAllVisible(page); visibilitymap_clear(relation, BufferGetBlockNumber(buffer), - vmbuffer); + vmbuffer, VISIBILITYMAP_VALID_BITS); } /* store transaction information of xact deleting the tuple */ @@ -3925,6 +3925,7 @@ l2: TransactionId xmax_lock_old_tuple; uint16 infomask_lock_old_tuple, infomask2_lock_old_tuple; + bool cleared_all_frozen = false; /* * To prevent concurrent sessions 
from updating the tuple, we have to @@ -3968,6 +3969,17 @@ l2: /* temporarily make it look not-updated, but locked */ oldtup.t_data->t_ctid = oldtup.t_self; + /* + * Clear all-frozen bit on visibility map if needed. We could + * immediately reset ALL_VISIBLE, but given that the WAL logging + * overhead would be unchanged, that doesn't seem necessarily + * worthwhile. + */ + if (PageIsAllVisible(BufferGetPage(buffer)) && + visibilitymap_clear(relation, block, vmbuffer, +VISIBILITYMAP_ALL_FROZEN)) + cleared_all_frozen = true; + MarkBufferDirty(buffer); if (RelationNeedsWAL(relation)) @@ -3982,6 +3994,8 @@ l2: xlrec.locking_xid = xmax_lock_old_tuple; xlrec.infobits_set = compute_infobits(oldtup.t_data->t_infomask, oldtup.t_data->t_infomask2); + xl
Re: [HACKERS] Reviewing freeze map code
On July 16, 2016 8:49:06 AM PDT, Tom Lane wrote: >Amit Kapila writes: >> On Sat, Jul 16, 2016 at 7:02 AM, Andres Freund >wrote: >>> I think we have two choices how to deal with that: First, we can add >a >>> new flags variable to xl_heap_lock similar to >>> xl_heap_insert/update/... and bump page magic, > >> +1 for going in this way. This will keep us consistent with how >clear >> the visibility info in other places like heap_xlog_update(). > >Yeah. We've already forced a catversion bump for beta3, and I'm about >to go fix PG_CONTROL_VERSION as well, so there's basically no downside >to doing an xlog version bump as well. At least, not if you can get it >in before Monday. OK, Cool. Will do it later today. Andres -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
Re: [HACKERS] Reviewing freeze map code
Amit Kapila writes: > On Sat, Jul 16, 2016 at 7:02 AM, Andres Freund wrote: >> I think we have two choices how to deal with that: First, we can add a >> new flags variable to xl_heap_lock similar to >> xl_heap_insert/update/... and bump page magic, > +1 for going in this way. This will keep us consistent with how clear > the visibility info in other places like heap_xlog_update(). Yeah. We've already forced a catversion bump for beta3, and I'm about to go fix PG_CONTROL_VERSION as well, so there's basically no downside to doing an xlog version bump as well. At least, not if you can get it in before Monday. regards, tom lane
Re: [HACKERS] Reviewing freeze map code
On Sat, Jul 16, 2016 at 7:02 AM, Andres Freund wrote: > On 2016-07-13 23:06:07 -0700, Andres Freund wrote: >> > + /* Clear only the all-frozen bit on visibility map if needed */ >> > + if (PageIsAllVisible(BufferGetPage(buffer)) && >> > + VM_ALL_FROZEN(relation, block, &vmbuffer)) >> > + { >> > + visibilitymap_clear_extended(relation, block, vmbuffer, >> > + >> > VISIBILITYMAP_ALL_FROZEN); >> > + } >> > + >> >> FWIW, I don't think it's worth introducing visibilitymap_clear_extended. >> As this is a 9.6 only patch, i think it's better to change >> visibilitymap_clear's API. > > Besides that easily fixed issue, the code also has the significant issue > that it's only performing the visibilitymap processing in the > BLK_NEEDS_REDO case. But that's not ok, because both in the BLK_RESTORED > and the BLK_DONE cases the visibilitymap isn't guaranteed (or even > likely in the former case) to have been updated. > > I think we have two choices how to deal with that: First, we can add a > new flags variable to xl_heap_lock similar to > xl_heap_insert/update/... and bump page magic, > +1 for going in this way. This will keep us consistent with how we clear the visibility info in other places like heap_xlog_update(). -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Reviewing freeze map code
On 2016-07-13 23:06:07 -0700, Andres Freund wrote: > > + /* Clear only the all-frozen bit on visibility map if needed */ > > + if (PageIsAllVisible(BufferGetPage(buffer)) && > > + VM_ALL_FROZEN(relation, block, &vmbuffer)) > > + { > > + visibilitymap_clear_extended(relation, block, vmbuffer, > > + > > VISIBILITYMAP_ALL_FROZEN); > > + } > > + > > FWIW, I don't think it's worth introducing visibilitymap_clear_extended. > As this is a 9.6 only patch, i think it's better to change > visibilitymap_clear's API. Besides that easily fixed issue, the code also has the significant issue that it's only performing the visibilitymap processing in the BLK_NEEDS_REDO case. But that's not ok, because both in the BLK_RESTORED and the BLK_DONE cases the visibilitymap isn't guaranteed (or even likely in the former case) to have been updated. I think we have two choices how to deal with that: First, we can add a new flags variable to xl_heap_lock similar to xl_heap_insert/update/... and bump page magic, or we can squeeze the information into infobits_set. The latter seems fairly ugly, and fragile to me; so unless somebody protests I'm going with the former. I think due to padding the additional byte doesn't make any size difference anyway. Regards, Andres
Re: [HACKERS] Reviewing freeze map code
On 2016-07-14 20:53:07 -0700, Andres Freund wrote: > On 2016-07-13 23:06:07 -0700, Andres Freund wrote: > > won't enter the branch, because HEAP_XMAX_LOCK_ONLY won't be set. Which > > will leave t_ctid and HEAP_HOT_UPDATED set differently on the master and > > standby / after crash recovery. I'm failing to see any harmful > > consequences right now, but differences between master and standby are a bad > > thing. > > I think it's actually critical, because HEAP_HOT_UPDATED / > HEAP_XMAX_LOCK_ONLY are used to terminate ctid chasing loops (like > heap_hot_search_buffer()). I've pushed a quite heavily revised version of the first patch to 9.1-master. I manually verified using pageinspect, gdb breakpoints and a standby that xmax, infomask etc are set correctly (leading to finding a4d357bf). As there are noticeable differences between versions, especially 9.2->9.3, I'd welcome somebody having a look at the commits. Regards, Andres
Re: [HACKERS] Reviewing freeze map code
On 2016-07-13 23:06:07 -0700, Andres Freund wrote: > won't enter the branch, because HEAP_XMAX_LOCK_ONLY won't be set. Which > will leave t_ctid and HEAP_HOT_UPDATED set differently on the master and > standby / after crash recovery. I'm failing to see any harmful > consequences right now, but differences between master and standby are a bad > thing. I think it's actually critical, because HEAP_HOT_UPDATED / HEAP_XMAX_LOCK_ONLY are used to terminate ctid chasing loops (like heap_hot_search_buffer()). Andres
Re: [HACKERS] Reviewing freeze map code
On 2016-07-14 18:12:42 +0530, Amit Kapila wrote: > Just thinking out loud. If we set HEAP_XMAX_LOCK_ONLY during update, > then won't it impact the return value of > HeapTupleHeaderIsOnlyLocked(). It will start returning true whereas > otherwise I think it would have returned false due to in_progress > transaction. As HeapTupleHeaderIsOnlyLocked() is being used at many > places, it might impact those cases, I have not checked in deep > whether such an impact would cause any real issue, but it seems to me > that some analysis is needed there unless you think we are safe with > respect to that. I don't think that's an issue: Right now the row will be considered deleted in that moment, with the change it's considered locked. the latter is surely more appropriate. > > Any arguments against? > > > >> > >> + /* Clear only the all-frozen bit on visibility map if needed > >> */ > >> + if (PageIsAllVisible(BufferGetPage(buffer)) && > >> + VM_ALL_FROZEN(relation, block, &vmbuffer)) > >> + { > >> + visibilitymap_clear_extended(relation, block, > >> vmbuffer, > >> + > >> VISIBILITYMAP_ALL_FROZEN); > >> + } > >> + > > > > FWIW, I don't think it's worth introducing visibilitymap_clear_extended. > > As this is a 9.6 only patch, i think it's better to change > > visibilitymap_clear's API. > > > > Unless somebody protests I'm planning to commit with those adjustments > > tomorrow. > > > > Do you think performance tests done by Sawada-san are sufficient to > proceed here? I'm doing some more, but generally yes. I also don't think we have much of a choice anyway. Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Reviewing freeze map code
On Thu, Jul 14, 2016 at 11:36 AM, Andres Freund wrote: > Hi, > > Master does > /* temporarily make it look not-updated */ > oldtup.t_data->t_ctid = oldtup.t_self; > here, and as is the wal record won't reflect that, because: > static void > heap_xlog_lock(XLogReaderState *record) > { > ... > /* > * Clear relevant update flags, but only if the modified > infomask says > * there's no update. > */ > if (HEAP_XMAX_IS_LOCKED_ONLY(htup->t_infomask)) > { > HeapTupleHeaderClearHotUpdated(htup); > /* Make sure there is no forward chain link in t_ctid > */ > ItemPointerSet(&htup->t_ctid, > > BufferGetBlockNumber(buffer), >offnum); > } > won't enter the branch, because HEAP_XMAX_LOCK_ONLY won't be set. Which > will leave t_ctid and HEAP_HOT_UPDATED set differently on the master and > standby / after crash recovery. I'm failing to see any harmful > consequences right now, but differences between master and standby are a bad > thing. Pre 9.3 that's not a problem, we reset ctid and HOT_UPDATED > unconditionally there. I think I'm more comfortable with setting > HEAP_XMAX_LOCK_ONLY until the tuple is finally updated - that also > coincides more closely with the actual meaning. > Just thinking out loud. If we set HEAP_XMAX_LOCK_ONLY during update, then won't it impact the return value of HeapTupleHeaderIsOnlyLocked(). It will start returning true whereas otherwise I think it would have returned false due to in_progress transaction. As HeapTupleHeaderIsOnlyLocked() is being used at many places, it might impact those cases, I have not checked in deep whether such an impact would cause any real issue, but it seems to me that some analysis is needed there unless you think we are safe with respect to that. > Any arguments against? 
> >> >> + /* Clear only the all-frozen bit on visibility map if needed */ >> + if (PageIsAllVisible(BufferGetPage(buffer)) && >> + VM_ALL_FROZEN(relation, block, &vmbuffer)) >> + { >> + visibilitymap_clear_extended(relation, block, vmbuffer, >> + >> VISIBILITYMAP_ALL_FROZEN); >> + } >> + > > FWIW, I don't think it's worth introducing visibilitymap_clear_extended. > As this is a 9.6 only patch, i think it's better to change > visibilitymap_clear's API. > > Unless somebody protests I'm planning to commit with those adjustments > tomorrow. > Do you think performance tests done by Sawada-san are sufficient to proceed here? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Reviewing freeze map code
Hi, So I'm generally happy with 0001, baring some relatively minor adjustments. I am however wondering about one thing: On 2016-07-11 23:51:05 +0900, Masahiko Sawada wrote: > diff --git a/src/backend/access/heap/heapam.c > b/src/backend/access/heap/heapam.c > index 57da57a..e7cb8ca 100644 > --- a/src/backend/access/heap/heapam.c > +++ b/src/backend/access/heap/heapam.c > @@ -3923,6 +3923,16 @@ l2: > > if (need_toast || newtupsize > pagefree) > { > + /* > + * For crash safety, we need to emit that xmax of old tuple is > set > + * and clear only the all-frozen bit on visibility map if needed > + * before releasing the buffer. We can reuse xl_heap_lock for > this > + * purpose. It should be fine even if we crash midway from this > + * section and the actual updating one later, since the xmax > will > + * appear to come from an aborted xid. > + */ > + START_CRIT_SECTION(); > + > /* Clear obsolete visibility flags ... */ > oldtup.t_data->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED); > oldtup.t_data->t_infomask2 &= ~HEAP_KEYS_UPDATED; > @@ -3936,6 +3946,28 @@ l2: > /* temporarily make it look not-updated */ > oldtup.t_data->t_ctid = oldtup.t_self; > already_marked = true; > + > + MarkBufferDirty(buffer); > + > + if (RelationNeedsWAL(relation)) > + { > + xl_heap_lock xlrec; > + XLogRecPtr recptr; > + > + XLogBeginInsert(); > + XLogRegisterBuffer(0, buffer, REGBUF_STANDARD); > + > + xlrec.offnum = > ItemPointerGetOffsetNumber(&oldtup.t_self); > + xlrec.locking_xid = xmax_old_tuple; > + xlrec.infobits_set = > compute_infobits(oldtup.t_data->t_infomask, > + > oldtup.t_data->t_infomask2); > + XLogRegisterData((char *) &xlrec, SizeOfHeapLock); > + recptr = XLogInsert(RM_HEAP_ID, XLOG_HEAP_LOCK); > + PageSetLSN(page, recptr); > + } Master does /* temporarily make it look not-updated */ oldtup.t_data->t_ctid = oldtup.t_self; here, and as is the wal record won't reflect that, because: static void heap_xlog_lock(XLogReaderState *record) { ... 
/* * Clear relevant update flags, but only if the modified infomask says * there's no update. */ if (HEAP_XMAX_IS_LOCKED_ONLY(htup->t_infomask)) { HeapTupleHeaderClearHotUpdated(htup); /* Make sure there is no forward chain link in t_ctid */ ItemPointerSet(&htup->t_ctid, BufferGetBlockNumber(buffer), offnum); } won't enter the branch, because HEAP_XMAX_LOCK_ONLY won't be set. Which will leave t_ctid and HEAP_HOT_UPDATED set differently on the master and standby / after crash recovery. I'm failing to see any harmful consequences right now, but differences between master and standby are a bad thing. Pre 9.3 that's not a problem, we reset ctid and HOT_UPDATED unconditionally there. I think I'm more comfortable with setting HEAP_XMAX_LOCK_ONLY until the tuple is finally updated - that also coincides more closely with the actual meaning. Any arguments against? > > + /* Clear only the all-frozen bit on visibility map if needed */ > + if (PageIsAllVisible(BufferGetPage(buffer)) && > + VM_ALL_FROZEN(relation, block, &vmbuffer)) > + { > + visibilitymap_clear_extended(relation, block, vmbuffer, > + > VISIBILITYMAP_ALL_FROZEN); > + } > + FWIW, I don't think it's worth introducing visibilitymap_clear_extended. As this is a 9.6 only patch, i think it's better to change visibilitymap_clear's API. Unless somebody protests I'm planning to commit with those adjustments tomorrow. Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Reviewing freeze map code
On Fri, Jul 8, 2016 at 10:24 PM, Amit Kapila wrote: > On Thu, Jul 7, 2016 at 12:34 PM, Masahiko Sawada > wrote: >> Than you for reviewing! >> >> On Thu, Jul 7, 2016 at 7:06 AM, Andres Freund wrote: >>> On 2016-07-05 23:37:59 +0900, Masahiko Sawada wrote: diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c index 57da57a..fd66527 100644 --- a/src/backend/access/heap/heapam.c +++ b/src/backend/access/heap/heapam.c @@ -3923,6 +3923,17 @@ l2: if (need_toast || newtupsize > pagefree) { + /* + * To prevent data corruption due to updating old tuple by + * other backends after released buffer >>> >>> That's not really the reason, is it? The prime problem is crash safety / >>> replication. The row-lock we're faking (by setting xmax to our xid), >>> prevents concurrent updates until this backend died. >> >> Fixed. >> , we need to emit that + * xmax of old tuple is set and clear visibility map bits if + * needed before releasing buffer. We can reuse xl_heap_lock + * for this purpose. It should be fine even if we crash midway + * from this section and the actual updating one later, since + * the xmax will appear to come from an aborted xid. + */ + START_CRIT_SECTION(); + /* Clear obsolete visibility flags ... */ oldtup.t_data->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED); oldtup.t_data->t_infomask2 &= ~HEAP_KEYS_UPDATED; @@ -3936,6 +3947,46 @@ l2: /* temporarily make it look not-updated */ oldtup.t_data->t_ctid = oldtup.t_self; already_marked = true; + + /* Clear PD_ALL_VISIBLE flags */ + if (PageIsAllVisible(BufferGetPage(buffer))) + { + all_visible_cleared = true; + PageClearAllVisible(BufferGetPage(buffer)); + visibilitymap_clear(relation, BufferGetBlockNumber(buffer), + vmbuffer); + } + + MarkBufferDirty(buffer); + + if (RelationNeedsWAL(relation)) + { + xl_heap_lock xlrec; + XLogRecPtr recptr; + + /* + * For logical decoding we need combocids to properly decode the + * catalog. 
+ */ + if (RelationIsAccessibleInLogicalDecoding(relation)) + log_heap_new_cid(relation, &oldtup); >>> >>> Hm, I don't see that being necessary here. Row locks aren't logically >>> decoded, so there's no need to emit this here. >> >> Fixed. >> >>> + /* Clear PD_ALL_VISIBLE flags */ + if (PageIsAllVisible(page)) + { + Buffer vmbuffer = InvalidBuffer; + BlockNumber block = BufferGetBlockNumber(*buffer); + + all_visible_cleared = true; + PageClearAllVisible(page); + visibilitymap_pin(relation, block, &vmbuffer); + visibilitymap_clear(relation, block, vmbuffer); + } + >>> >>> That clears all-visible unnecessarily, we only need to clear all-frozen. >>> >> >> Fixed. >> >>> @@ -8694,6 +8761,23 @@ heap_xlog_lock(XLogReaderState *record) } HeapTupleHeaderSetXmax(htup, xlrec->locking_xid); HeapTupleHeaderSetCmax(htup, FirstCommandId, false); + + /* The visibility map need to be cleared */ + if ((xlrec->infobits_set & XLHL_ALL_VISIBLE_CLEARED) != 0) + { + RelFileNode rnode; + Buffer vmbuffer = InvalidBuffer; + BlockNumber blkno; + Relationreln; + + XLogRecGetBlockTag(record, 0, &rnode, NULL, &blkno); + reln = CreateFakeRelcacheEntry(rnode); + + visibilitymap_pin(reln, blkno, &vmbuffer); + visibilitymap_clear(reln, blkno, vmbuffer); + PageClearAllVisible(page); + } + >>> >>> PageSetLSN(page, lsn); MarkBufferDirty(buffer); } diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h index a822d0b..41b3c54 100644 --- a/src/include/access/heapam_xlog.h +++ b/src/include/access/heapam_xlog.h @@ -242,6 +242,7 @
Re: [HACKERS] Reviewing freeze map code
On Thu, Jul 7, 2016 at 12:34 PM, Masahiko Sawada wrote: > Than you for reviewing! > > On Thu, Jul 7, 2016 at 7:06 AM, Andres Freund wrote: >> On 2016-07-05 23:37:59 +0900, Masahiko Sawada wrote: >>> diff --git a/src/backend/access/heap/heapam.c >>> b/src/backend/access/heap/heapam.c >>> index 57da57a..fd66527 100644 >>> --- a/src/backend/access/heap/heapam.c >>> +++ b/src/backend/access/heap/heapam.c >>> @@ -3923,6 +3923,17 @@ l2: >>> >>> if (need_toast || newtupsize > pagefree) >>> { >>> + /* >>> + * To prevent data corruption due to updating old tuple by >>> + * other backends after released buffer >> >> That's not really the reason, is it? The prime problem is crash safety / >> replication. The row-lock we're faking (by setting xmax to our xid), >> prevents concurrent updates until this backend died. > > Fixed. > >>> , we need to emit that >>> + * xmax of old tuple is set and clear visibility map bits if >>> + * needed before releasing buffer. We can reuse xl_heap_lock >>> + * for this purpose. It should be fine even if we crash midway >>> + * from this section and the actual updating one later, since >>> + * the xmax will appear to come from an aborted xid. >>> + */ >>> + START_CRIT_SECTION(); >>> + >>> /* Clear obsolete visibility flags ... 
*/ >>> oldtup.t_data->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED); >>> oldtup.t_data->t_infomask2 &= ~HEAP_KEYS_UPDATED; >>> @@ -3936,6 +3947,46 @@ l2: >>> /* temporarily make it look not-updated */ >>> oldtup.t_data->t_ctid = oldtup.t_self; >>> already_marked = true; >>> + >>> + /* Clear PD_ALL_VISIBLE flags */ >>> + if (PageIsAllVisible(BufferGetPage(buffer))) >>> + { >>> + all_visible_cleared = true; >>> + PageClearAllVisible(BufferGetPage(buffer)); >>> + visibilitymap_clear(relation, >>> BufferGetBlockNumber(buffer), >>> + vmbuffer); >>> + } >>> + >>> + MarkBufferDirty(buffer); >>> + >>> + if (RelationNeedsWAL(relation)) >>> + { >>> + xl_heap_lock xlrec; >>> + XLogRecPtr recptr; >>> + >>> + /* >>> + * For logical decoding we need combocids to properly >>> decode the >>> + * catalog. >>> + */ >>> + if (RelationIsAccessibleInLogicalDecoding(relation)) >>> + log_heap_new_cid(relation, &oldtup); >> >> Hm, I don't see that being necessary here. Row locks aren't logically >> decoded, so there's no need to emit this here. > > Fixed. > >> >>> + /* Clear PD_ALL_VISIBLE flags */ >>> + if (PageIsAllVisible(page)) >>> + { >>> + Buffer vmbuffer = InvalidBuffer; >>> + BlockNumber block = BufferGetBlockNumber(*buffer); >>> + >>> + all_visible_cleared = true; >>> + PageClearAllVisible(page); >>> + visibilitymap_pin(relation, block, &vmbuffer); >>> + visibilitymap_clear(relation, block, vmbuffer); >>> + } >>> + >> >> That clears all-visible unnecessarily, we only need to clear all-frozen. >> > > Fixed. 
> >> >>> @@ -8694,6 +8761,23 @@ heap_xlog_lock(XLogReaderState *record) >>> } >>> HeapTupleHeaderSetXmax(htup, xlrec->locking_xid); >>> HeapTupleHeaderSetCmax(htup, FirstCommandId, false); >>> + >>> + /* The visibility map need to be cleared */ >>> + if ((xlrec->infobits_set & XLHL_ALL_VISIBLE_CLEARED) != 0) >>> + { >>> + RelFileNode rnode; >>> + Buffer vmbuffer = InvalidBuffer; >>> + BlockNumber blkno; >>> + Relationreln; >>> + >>> + XLogRecGetBlockTag(record, 0, &rnode, NULL, &blkno); >>> + reln = CreateFakeRelcacheEntry(rnode); >>> + >>> + visibilitymap_pin(reln, blkno, &vmbuffer); >>> + visibilitymap_clear(reln, blkno, vmbuffer); >>> + PageClearAllVisible(page); >>> + } >>> + >> >> >>> PageSetLSN(page, lsn); >>> MarkBufferDirty(buffer); >>> } >>> diff --git a/src/include/access/heapam_xlog.h >>> b/src/include/access/heapam_xlog.h >>> index a822d0b..41b3c54 100644 >>> --- a/src/include/access/heapam_xlog.h >>> +++ b/src/include/access/heapam_xlog.h >>> @@ -242,6 +242,7 @@ typedef struct xl_heap_cleanup_info >>> #define XLHL_XMAX_EXCL_LOCK 0x04 >>> #define XLHL_XMAX_KEYSHR_LOCK0x08 >>> #define XLHL_KEYS_UPDATED0x10 >>> +#d
Re: [HACKERS] Reviewing freeze map code
On Thu, Jul 7, 2016 at 2:04 PM, Andres Freund wrote: >> Yeah, that's true, but I'm having a bit of trouble imagining exactly how >> we end up with corruption that actually matters. I guess a torn page >> could do it. > > I think Noah pointed out a bad scenario: If we crash after putting the > xid in the page header, but before WAL logging, the xid could get reused > after the crash. By a different transaction. And suddenly the row isn't > visible anymore, after the reused xid commits... Oh, wow. Yikes. OK, so I guess we should try to back-patch the fix, then. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Reviewing freeze map code
On 2016-07-07 14:01:05 -0400, Robert Haas wrote: > On Thu, Jul 7, 2016 at 1:58 PM, Andres Freund wrote: > > On 2016-07-07 10:37:15 -0400, Robert Haas wrote: > >> On Wed, Jul 6, 2016 at 6:06 PM, Andres Freund wrote: > >> > Hm. We can't easily do that in the back-patched version; because a > >> > standby won't know to check for the flag. That's kinda ok, since we > >> > don't yet need to clear all-visible yet at that point of > >> > heap_update. But that better means we don't do so on the master either. > >> > >> Is there any reason to back-patch this in the first place? > > > > It seems not unlikely that this has caused corruption in the past; and > > that we chalked it up to hardware corruption or something. Both toasting > > and file extension frequently take extended amounts of time under load, > > the window for crashing in the wrong moment isn't small... > > Yeah, that's true, but I'm having a bit of trouble imagining exactly how we > end up with corruption that actually matters. I guess a torn page > could do it. I think Noah pointed out a bad scenario: If we crash after putting the xid in the page header, but before WAL logging, the xid could get reused after the crash. By a different transaction. And suddenly the row isn't visible anymore, after the reused xid commits...
Re: [HACKERS] Reviewing freeze map code
On Thu, Jul 7, 2016 at 1:58 PM, Andres Freund wrote: > On 2016-07-07 10:37:15 -0400, Robert Haas wrote: >> On Wed, Jul 6, 2016 at 6:06 PM, Andres Freund wrote: >> > Hm. We can't easily do that in the back-patched version; because a >> > standby won't know to check for the flag. That's kinda ok, since we >> > don't yet need to clear all-visible yet at that point of >> > heap_update. But that better means we don't do so on the master either. >> >> Is there any reason to back-patch this in the first place? > > It seems not unlikely that this has caused corruption in the past; and > that we chalked it up to hardware corruption or something. Both toasting > and file extension frequently take extended amounts of time under load, > the window for crashing in the wrong moment isn't small... Yeah, that's true, but I'm having a bit of trouble imagining exactly how we end up with corruption that actually matters. I guess a torn page could do it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Reviewing freeze map code
On 2016-07-07 10:37:15 -0400, Robert Haas wrote: > On Wed, Jul 6, 2016 at 6:06 PM, Andres Freund wrote: > > Hm. We can't easily do that in the back-patched version; because a > > standby won't know to check for the flag. That's kinda ok, since we > > don't yet need to clear all-visible yet at that point of > > heap_update. But that better means we don't do so on the master either. > > Is there any reason to back-patch this in the first place? It seems not unlikely that this has caused corruption in the past; and that we chalked it up to hardware corruption or something. Both toasting and file extension frequently take extended amounts of time under load, the window for crashing in the wrong moment isn't small... Andres
Re: [HACKERS] Reviewing freeze map code
On Thu, Jul 7, 2016 at 10:53 AM, Alvaro Herrera wrote: > Robert Haas wrote: >> On Wed, Jul 6, 2016 at 6:06 PM, Andres Freund wrote: >> > Hm. We can't easily do that in the back-patched version; because a >> > standby won't know to check for the flag . That's kinda ok, since we >> > don't yet need to clear all-visible yet at that point of >> > heap_update. But that better means we don't do so on the master either. >> >> Is there any reason to back-patch this in the first place? > > Wasn't this determined to be a pre-existing bug? I think the > probability of occurrence has increased, but it's still possible in > earlier releases. I wonder if there are unexplained bugs that can be > traced down to this. > > I'm not really following this (sorry about that) but I wonder if (in > back branches) the failure to propagate in case the standby wasn't > updated can cause actual problems. If it does, maybe it'd be a better > idea to have a new WAL record type instead of piggybacking on lock > tuple. Then again, apparently the probability of this bug is low enough > that we shouldn't sweat over it ... Moreso considering Robert's apparent > opinion that perhaps we shouldn't backpatch at all in the first place. > > In any case I would like to see much more commentary in the patch next > to the new XLHL flag. It's sufficiently different than the rest than it > deserves so, IMO. There are two issues being discussed on this thread. One of them is a new issue in 9.6: heap_lock_tuple needs to clear the all-frozen bit in the freeze map even though it does not clear all-visible. The one that's actually a preexisting bug is that we can start to update a tuple without WAL-logging anything and then release the page lock in order to go perform TOAST insertions. At this point, other backends (on the master) will see this tuple as in the process of being updated because xmax has been set and ctid has been made to point back to the same tuple. 
I'm guessing that if the UPDATE goes on to complete, any discrepancy between the master and the standby is erased by the replay of the WAL record covering the update itself. I haven't checked that, but it seems like that WAL record must set both xmax and ctid appropriately or we'd be in big trouble. The infomask bits are in play too, but presumably the update's WAL is going to set those correctly also. So in this case I don't think there's really any issue for the standby. Or for the master, either: it may technically be true the tuple is not all-visible any more, but the only backend that could potentially fail to see it is the one performing the update, and no user code can run in the middle of toast_insert_or_update, so I think we're OK. On the other hand, if the UPDATE aborts, there's now a persistent difference between the master and standby: the infomask, xmax, and ctid of the tuple may differ. I don't know whether that could cause any problem. It's probably a very rare case, because there aren't all that many things that will cause us to abort in the middle of inserting TOAST tuples. Out of disk space comes to mind, or maybe some kind of corruption that throws an elog(). As far as back-patching goes, the question is whether it's worth the risk. Introducing new WAL logging at this point could certainly cause performance problems if nothing else, never mind the risk of garden-variety bugs. I'm not sure it's worth taking that risk in released branches for the sake of a bug which has existed for a decade without anybody finding it until now. I'm not going to argue strongly for that position, but I think it's worth thinking about. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Reviewing freeze map code
Robert Haas wrote: > On Wed, Jul 6, 2016 at 6:06 PM, Andres Freund wrote: > > Hm. We can't easily do that in the back-patched version; because a > > standby won't know to check for the flag . That's kinda ok, since we > > don't yet need to clear all-visible yet at that point of > > heap_update. But that better means we don't do so on the master either. > > Is there any reason to back-patch this in the first place? Wasn't this determined to be a pre-existing bug? I think the probability of occurrence has increased, but it's still possible in earlier releases. I wonder if there are unexplained bugs that can be traced down to this. I'm not really following this (sorry about that) but I wonder if (in back branches) the failure to propagate in case the standby wasn't updated can cause actual problems. If it does, maybe it'd be a better idea to have a new WAL record type instead of piggybacking on lock tuple. Then again, apparently the probability of this bug is low enough that we shouldn't sweat over it ... Moreso considering Robert's apparent opinion that perhaps we shouldn't backpatch at all in the first place. In any case I would like to see much more commentary in the patch next to the new XLHL flag. It's sufficiently different than the rest than it deserves so, IMO. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Reviewing freeze map code
On Wed, Jul 6, 2016 at 6:06 PM, Andres Freund wrote: > Hm. We can't easily do that in the back-patched version; because a > standby won't know to check for the flag . That's kinda ok, since we > don't yet need to clear all-visible yet at that point of > heap_update. But that better means we don't do so on the master either. Is there any reason to back-patch this in the first place? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Reviewing freeze map code
Thank you for reviewing! On Thu, Jul 7, 2016 at 7:06 AM, Andres Freund wrote: > On 2016-07-05 23:37:59 +0900, Masahiko Sawada wrote: >> diff --git a/src/backend/access/heap/heapam.c >> b/src/backend/access/heap/heapam.c >> index 57da57a..fd66527 100644 >> --- a/src/backend/access/heap/heapam.c >> +++ b/src/backend/access/heap/heapam.c >> @@ -3923,6 +3923,17 @@ l2: >> >> if (need_toast || newtupsize > pagefree) >> { >> + /* >> + * To prevent data corruption due to updating old tuple by >> + * other backends after released buffer > > That's not really the reason, is it? The prime problem is crash safety / > replication. The row-lock we're faking (by setting xmax to our xid), > prevents concurrent updates until this backend died. Fixed. >> , we need to emit that >> + * xmax of old tuple is set and clear visibility map bits if >> + * needed before releasing buffer. We can reuse xl_heap_lock >> + * for this purpose. It should be fine even if we crash midway >> + * from this section and the actual updating one later, since >> + * the xmax will appear to come from an aborted xid. >> + */ >> + START_CRIT_SECTION(); >> + >> /* Clear obsolete visibility flags ... */ >> oldtup.t_data->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED); >> oldtup.t_data->t_infomask2 &= ~HEAP_KEYS_UPDATED; >> @@ -3936,6 +3947,46 @@ l2: >> /* temporarily make it look not-updated */ >> oldtup.t_data->t_ctid = oldtup.t_self; >> already_marked = true; >> + >> + /* Clear PD_ALL_VISIBLE flags */ >> + if (PageIsAllVisible(BufferGetPage(buffer))) >> + { >> + all_visible_cleared = true; >> + PageClearAllVisible(BufferGetPage(buffer)); >> + visibilitymap_clear(relation, >> BufferGetBlockNumber(buffer), >> + vmbuffer); >> + } >> + >> + MarkBufferDirty(buffer); >> + >> + if (RelationNeedsWAL(relation)) >> + { >> + xl_heap_lock xlrec; >> + XLogRecPtr recptr; >> + >> + /* >> + * For logical decoding we need combocids to properly >> decode the >> + * catalog. 
>> + */ >> + if (RelationIsAccessibleInLogicalDecoding(relation)) >> + log_heap_new_cid(relation, &oldtup); > > Hm, I don't see that being necessary here. Row locks aren't logically > decoded, so there's no need to emit this here. Fixed. > >> + /* Clear PD_ALL_VISIBLE flags */ >> + if (PageIsAllVisible(page)) >> + { >> + Buffer vmbuffer = InvalidBuffer; >> + BlockNumber block = BufferGetBlockNumber(*buffer); >> + >> + all_visible_cleared = true; >> + PageClearAllVisible(page); >> + visibilitymap_pin(relation, block, &vmbuffer); >> + visibilitymap_clear(relation, block, vmbuffer); >> + } >> + > > That clears all-visible unnecessarily, we only need to clear all-frozen. > Fixed. > >> @@ -8694,6 +8761,23 @@ heap_xlog_lock(XLogReaderState *record) >> } >> HeapTupleHeaderSetXmax(htup, xlrec->locking_xid); >> HeapTupleHeaderSetCmax(htup, FirstCommandId, false); >> + >> + /* The visibility map need to be cleared */ >> + if ((xlrec->infobits_set & XLHL_ALL_VISIBLE_CLEARED) != 0) >> + { >> + RelFileNode rnode; >> + Buffer vmbuffer = InvalidBuffer; >> + BlockNumber blkno; >> + Relationreln; >> + >> + XLogRecGetBlockTag(record, 0, &rnode, NULL, &blkno); >> + reln = CreateFakeRelcacheEntry(rnode); >> + >> + visibilitymap_pin(reln, blkno, &vmbuffer); >> + visibilitymap_clear(reln, blkno, vmbuffer); >> + PageClearAllVisible(page); >> + } >> + > > >> PageSetLSN(page, lsn); >> MarkBufferDirty(buffer); >> } >> diff --git a/src/include/access/heapam_xlog.h >> b/src/include/access/heapam_xlog.h >> index a822d0b..41b3c54 100644 >> --- a/src/include/access/heapam_xlog.h >> +++ b/src/include/access/heapam_xlog.h >> @@ -242,6 +242,7 @@ typedef struct xl_heap_cleanup_info >> #define XLHL_XMAX_EXCL_LOCK 0x04 >> #define XLHL_XMAX_KEYSHR_LOCK0x08 >> #define XLHL_KEYS_UPDATED0x10 >> +#define XLHL_ALL_VISIBLE_CLEARED 0x20 > > Hm. We can't easily do that in the back-patched version; because a > standby won't know to check for the flag . That's kinda ok, since we > don't ye
Re: [HACKERS] Reviewing freeze map code
On Thu, Jul 7, 2016 at 3:36 AM, Andres Freund wrote: > On 2016-07-05 23:37:59 +0900, Masahiko Sawada wrote: > >> @@ -8694,6 +8761,23 @@ heap_xlog_lock(XLogReaderState *record) >> } >> HeapTupleHeaderSetXmax(htup, xlrec->locking_xid); >> HeapTupleHeaderSetCmax(htup, FirstCommandId, false); >> + >> + /* The visibility map need to be cleared */ >> + if ((xlrec->infobits_set & XLHL_ALL_VISIBLE_CLEARED) != 0) >> + { >> + RelFileNode rnode; >> + Buffer vmbuffer = InvalidBuffer; >> + BlockNumber blkno; >> + Relationreln; >> + >> + XLogRecGetBlockTag(record, 0, &rnode, NULL, &blkno); >> + reln = CreateFakeRelcacheEntry(rnode); >> + >> + visibilitymap_pin(reln, blkno, &vmbuffer); >> + visibilitymap_clear(reln, blkno, vmbuffer); >> + PageClearAllVisible(page); >> + } >> + > > >> PageSetLSN(page, lsn); >> MarkBufferDirty(buffer); >> } >> diff --git a/src/include/access/heapam_xlog.h >> b/src/include/access/heapam_xlog.h >> index a822d0b..41b3c54 100644 >> --- a/src/include/access/heapam_xlog.h >> +++ b/src/include/access/heapam_xlog.h >> @@ -242,6 +242,7 @@ typedef struct xl_heap_cleanup_info >> #define XLHL_XMAX_EXCL_LOCK 0x04 >> #define XLHL_XMAX_KEYSHR_LOCK0x08 >> #define XLHL_KEYS_UPDATED0x10 >> +#define XLHL_ALL_VISIBLE_CLEARED 0x20 > > Hm. We can't easily do that in the back-patched version; because a > standby won't know to check for the flag . That's kinda ok, since we > don't yet need to clear all-visible yet at that point of > heap_update. But that better means we don't do so on the master either. > To clarify, do you mean to say that lets have XLHL_ALL_FROZEN_CLEARED and do that just for master. For back-branches no need to clear visibility any flags? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Reviewing freeze map code
On 2016-07-05 23:37:59 +0900, Masahiko Sawada wrote: > diff --git a/src/backend/access/heap/heapam.c > b/src/backend/access/heap/heapam.c > index 57da57a..fd66527 100644 > --- a/src/backend/access/heap/heapam.c > +++ b/src/backend/access/heap/heapam.c > @@ -3923,6 +3923,17 @@ l2: > > if (need_toast || newtupsize > pagefree) > { > + /* > + * To prevent data corruption due to updating old tuple by > + * other backends after released buffer That's not really the reason, is it? The prime problem is crash safety / replication. The row-lock we're faking (by setting xmax to our xid), prevents concurrent updates until this backend died. > , we need to emit that > + * xmax of old tuple is set and clear visibility map bits if > + * needed before releasing buffer. We can reuse xl_heap_lock > + * for this purpose. It should be fine even if we crash midway > + * from this section and the actual updating one later, since > + * the xmax will appear to come from an aborted xid. > + */ > + START_CRIT_SECTION(); > + > /* Clear obsolete visibility flags ... */ > oldtup.t_data->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED); > oldtup.t_data->t_infomask2 &= ~HEAP_KEYS_UPDATED; > @@ -3936,6 +3947,46 @@ l2: > /* temporarily make it look not-updated */ > oldtup.t_data->t_ctid = oldtup.t_self; > already_marked = true; > + > + /* Clear PD_ALL_VISIBLE flags */ > + if (PageIsAllVisible(BufferGetPage(buffer))) > + { > + all_visible_cleared = true; > + PageClearAllVisible(BufferGetPage(buffer)); > + visibilitymap_clear(relation, > BufferGetBlockNumber(buffer), > + vmbuffer); > + } > + > + MarkBufferDirty(buffer); > + > + if (RelationNeedsWAL(relation)) > + { > + xl_heap_lock xlrec; > + XLogRecPtr recptr; > + > + /* > + * For logical decoding we need combocids to properly > decode the > + * catalog. > + */ > + if (RelationIsAccessibleInLogicalDecoding(relation)) > + log_heap_new_cid(relation, &oldtup); Hm, I don't see that being necessary here. 
Row locks aren't logically decoded, so there's no need to emit this here. > + /* Clear PD_ALL_VISIBLE flags */ > + if (PageIsAllVisible(page)) > + { > + Buffer vmbuffer = InvalidBuffer; > + BlockNumber block = BufferGetBlockNumber(*buffer); > + > + all_visible_cleared = true; > + PageClearAllVisible(page); > + visibilitymap_pin(relation, block, &vmbuffer); > + visibilitymap_clear(relation, block, vmbuffer); > + } > + That clears all-visible unnecessarily, we only need to clear all-frozen. > @@ -8694,6 +8761,23 @@ heap_xlog_lock(XLogReaderState *record) > } > HeapTupleHeaderSetXmax(htup, xlrec->locking_xid); > HeapTupleHeaderSetCmax(htup, FirstCommandId, false); > + > + /* The visibility map need to be cleared */ > + if ((xlrec->infobits_set & XLHL_ALL_VISIBLE_CLEARED) != 0) > + { > + RelFileNode rnode; > + Buffer vmbuffer = InvalidBuffer; > + BlockNumber blkno; > + Relationreln; > + > + XLogRecGetBlockTag(record, 0, &rnode, NULL, &blkno); > + reln = CreateFakeRelcacheEntry(rnode); > + > + visibilitymap_pin(reln, blkno, &vmbuffer); > + visibilitymap_clear(reln, blkno, vmbuffer); > + PageClearAllVisible(page); > + } > + > PageSetLSN(page, lsn); > MarkBufferDirty(buffer); > } > diff --git a/src/include/access/heapam_xlog.h > b/src/include/access/heapam_xlog.h > index a822d0b..41b3c54 100644 > --- a/src/include/access/heapam_xlog.h > +++ b/src/include/access/heapam_xlog.h > @@ -242,6 +242,7 @@ typedef struct xl_heap_cleanup_info > #define XLHL_XMAX_EXCL_LOCK 0x04 > #define XLHL_XMAX_KEYSHR_LOCK0x08 > #define XLHL_KEYS_UPDATED0x10 > +#define XLHL_ALL_VISIBLE_CLEARED 0x20 Hm. We can't easily do that in the back-patched version; because a standby won't know to check for the flag . That's kinda ok, since we don't yet need to clear all-visible yet at that point of heap_update. But that better means we don't do so on the master either. Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make
Re: [HACKERS] Reviewing freeze map code
On Mon, Jul 4, 2016 at 5:44 PM, Masahiko Sawada wrote: > On Sat, Jul 2, 2016 at 12:34 PM, Amit Kapila wrote: >> On Fri, Jul 1, 2016 at 7:52 PM, Masahiko Sawada >> wrote: >>> On Fri, Jul 1, 2016 at 11:12 AM, Amit Kapila >>> wrote: Why do you think IndexOnlyScan will return wrong result? If the server crash in the way as you described, the transaction that has made modifications will anyway be considered aborted, so the result of IndexOnlyScan should not be wrong. >>> >>> Ah, you're right, I misunderstood. >>> >>> Attached updated patch incorporating your comments. >>> I've changed it so that heap_xlog_lock clears vm flags if page is >>> marked all frozen. >>> >> >> I think we should make a similar change in heap_lock_tuple API as >> well. >> Also, currently by default heap_xlog_lock tuple tries to clear >> the visibility flags, isn't it better to handle it as we do at all >> other places (ex. see log_heap_update), by logging the information >> about same. > > I will deal with them. > >> Though, it is important to get the patch right, but I feel in the >> meantime, it might be better to start benchmarking. AFAIU, even if >> change some part of information while WAL logging it, the benchmark >> results won't be much different. > > Okay, I will do the benchmark test as well. > I measured the throughput and the output quantity of WAL with HEAD and HEAD+patch(attached) on my virtual environment. I used pgbench with attached custom script file which sets 3200 length string to the filler column in order to make toast data. The scale factor is 1000 and pgbench options are -c 4 -T 600 -f toast_test.sql. After changing the filler column to the text data type I ran it. * Throughput HEAD : 1833.204172 Patched : 1827.399482 * Output quantity of WAL HEAD : 7771 MB Patched : 8082 MB The throughput is almost the same, but the output quantity of WAL is slightly increased. 
(about 4%) Regards, -- Masahiko Sawada toast_test.sql Description: Binary data diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c index 57da57a..fd66527 100644 --- a/src/backend/access/heap/heapam.c +++ b/src/backend/access/heap/heapam.c @@ -3923,6 +3923,17 @@ l2: if (need_toast || newtupsize > pagefree) { + /* + * To prevent data corruption due to updating old tuple by + * other backends after released buffer, we need to emit that + * xmax of old tuple is set and clear visibility map bits if + * needed before releasing buffer. We can reuse xl_heap_lock + * for this purpose. It should be fine even if we crash midway + * from this section and the actual updating one later, since + * the xmax will appear to come from an aborted xid. + */ + START_CRIT_SECTION(); + /* Clear obsolete visibility flags ... */ oldtup.t_data->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED); oldtup.t_data->t_infomask2 &= ~HEAP_KEYS_UPDATED; @@ -3936,6 +3947,46 @@ l2: /* temporarily make it look not-updated */ oldtup.t_data->t_ctid = oldtup.t_self; already_marked = true; + + /* Clear PD_ALL_VISIBLE flags */ + if (PageIsAllVisible(BufferGetPage(buffer))) + { + all_visible_cleared = true; + PageClearAllVisible(BufferGetPage(buffer)); + visibilitymap_clear(relation, BufferGetBlockNumber(buffer), +vmbuffer); + } + + MarkBufferDirty(buffer); + + if (RelationNeedsWAL(relation)) + { + xl_heap_lock xlrec; + XLogRecPtr recptr; + + /* + * For logical decoding we need combocids to properly decode the + * catalog. 
+ */ + if (RelationIsAccessibleInLogicalDecoding(relation)) +log_heap_new_cid(relation, &oldtup); + + XLogBeginInsert(); + XLogRegisterBuffer(0, buffer, REGBUF_STANDARD); + + xlrec.offnum = ItemPointerGetOffsetNumber(&oldtup.t_self); + xlrec.locking_xid = xmax_old_tuple; + xlrec.infobits_set = compute_infobits(oldtup.t_data->t_infomask, + oldtup.t_data->t_infomask2); + if (all_visible_cleared) +xlrec.infobits_set |= XLHL_ALL_VISIBLE_CLEARED; + XLogRegisterData((char *) &xlrec, SizeOfHeapLock); + recptr = XLogInsert(RM_HEAP_ID, XLOG_HEAP_LOCK); + PageSetLSN(page, recptr); + } + + END_CRIT_SECTION(); + LockBuffer(buffer, BUFFER_LOCK_UNLOCK); /* @@ -4140,7 +4191,8 @@ l2: */ if (RelationIsAccessibleInLogicalDecoding(relation)) { - log_heap_new_cid(relation, &oldtup); + if (!already_marked) +log_heap_new_cid(relation, &oldtup); log_heap_new_cid(relation, heaptup); } @@ -4513,6 +4565,7 @@ heap_lock_tuple(Relation relation, HeapTuple tuple, new_infomask2; bool first_time = true; bool have_tuple_lock = false; + bool all_visible_cleared = false; *buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid)); LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE); @@ -5034,6 +5087,18 @@ failed: if (HEAP_XMAX_IS_LOCKED_ONLY(new_infomask)) tuple->t_data->t_ctid = *tid; + /* Clear PD_ALL_VISIBLE flags */ + if (PageIsAllVisible(page)) + { + Buffer vmbuffer = InvalidBuff
Re: [HACKERS] Reviewing freeze map code
On Mon, Jul 4, 2016 at 2:31 PM, Masahiko Sawada wrote: > On Sat, Jul 2, 2016 at 12:17 PM, Amit Kapila wrote: >> On Sat, Jul 2, 2016 at 12:53 AM, Andres Freund wrote: >>> On 2016-07-01 15:18:39 -0400, Robert Haas wrote: >>> Should we just clear all-visible and call it good enough? >>> >>> Given that we need to do that in heap_lock_tuple, which entirely >>> preserves all-visible (but shouldn't preserve all-frozen), ISTM we >>> better find something that doesn't invalidate all-visible. >>> >> >> Sounds logical, considering that we have a way to set all-frozen and >> vacuum does that as well. So probably either we need to have a new >> API or add a new parameter to visibilitymap_clear() to indicate the >> same. If we want to go that route, isn't it better to have >> PD_ALL_FROZEN as well? >> > > Can't we call visibilitymap_set with all-visible but not all-frozen > bits instead of clearing flags? > That doesn't sound to be an impressive way to deal. First, visibilitymap_set logs the action itself which will generate two WAL records (one for visibility map and another for lock tuple). Second, it doesn't look consistent to me. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Reviewing freeze map code
On Sat, Jul 2, 2016 at 12:17 PM, Amit Kapila wrote: > On Sat, Jul 2, 2016 at 12:53 AM, Andres Freund wrote: >> On 2016-07-01 15:18:39 -0400, Robert Haas wrote: >>> On Fri, Jul 1, 2016 at 10:22 AM, Masahiko Sawada >>> wrote: >>> > Ah, you're right, I misunderstood. >>> > >>> > Attached updated patch incorporating your comments. >>> > I've changed it so that heap_xlog_lock clears vm flags if page is >>> > marked all frozen. >>> >>> I believe that this should be separated into two patches, since there >>> are two issues here: >>> >>> 1. Locking a tuple doesn't clear the all-frozen bit, but needs to do so. >>> 2. heap_update releases the buffer content lock without logging the >>> changes it has made. >>> >>> With respect to #1, there is no need to clear the all-visible bit, >>> only the all-frozen bit. However, that's a bit tricky given that we >>> removed PD_ALL_FROZEN. Should we think about putting that back again? >> >> I think it's fine to just do the vm lookup. >> >>> Should we just clear all-visible and call it good enough? >> >> Given that we need to do that in heap_lock_tuple, which entirely >> preserves all-visible (but shouldn't preserve all-frozen), ISTM we >> better find something that doesn't invalidate all-visible. >> > > Sounds logical, considering that we have a way to set all-frozen and > vacuum does that as well. So probably either we need to have a new > API or add a new parameter to visibilitymap_clear() to indicate the > same. If we want to go that route, isn't it better to have > PD_ALL_FROZEN as well? > Can't we call visibilitymap_set with all-visible but not all-frozen bits instead of clearing flags? Regards, -- Masahiko Sawada -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Reviewing freeze map code
On Sat, Jul 2, 2016 at 12:34 PM, Amit Kapila wrote: > On Fri, Jul 1, 2016 at 7:52 PM, Masahiko Sawada wrote: >> On Fri, Jul 1, 2016 at 11:12 AM, Amit Kapila wrote: >>> >>> Why do you think IndexOnlyScan will return wrong result? If the >>> server crash in the way as you described, the transaction that has >>> made modifications will anyway be considered aborted, so the result of >>> IndexOnlyScan should not be wrong. >>> >> >> Ah, you're right, I misunderstood. >> >> Attached updated patch incorporating your comments. >> I've changed it so that heap_xlog_lock clears vm flags if page is >> marked all frozen. >> > > I think we should make a similar change in heap_lock_tuple API as > well. > Also, currently by default heap_xlog_lock tuple tries to clear > the visibility flags, isn't it better to handle it as we do at all > other places (ex. see log_heap_update), by logging the information > about same. I will deal with them. > Though, it is important to get the patch right, but I feel in the > meantime, it might be better to start benchmarking. AFAIU, even if > change some part of information while WAL logging it, the benchmark > results won't be much different. Okay, I will do the benchmark test as well. Regards, -- Masahiko Sawada -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Reviewing freeze map code
On Fri, Jul 1, 2016 at 7:52 PM, Masahiko Sawada wrote: > On Fri, Jul 1, 2016 at 11:12 AM, Amit Kapila wrote: >> >> Why do you think IndexOnlyScan will return wrong result? If the >> server crash in the way as you described, the transaction that has >> made modifications will anyway be considered aborted, so the result of >> IndexOnlyScan should not be wrong. >> > > Ah, you're right, I misunderstood. > > Attached updated patch incorporating your comments. > I've changed it so that heap_xlog_lock clears vm flags if page is > marked all frozen. > I think we should make a similar change in heap_lock_tuple API as well. Also, currently by default heap_xlog_lock tuple tries to clear the visibility flags, isn't it better to handle it as we do at all other places (ex. see log_heap_update), by logging the information about same. I think it is always advisable to log every action we want replay to perform. That way, it is always easy to extend it based on if some change is required only in certain cases, but not in others. Though, it is important to get the patch right, but I feel in the meantime, it might be better to start benchmarking. AFAIU, even if change some part of information while WAL logging it, the benchmark results won't be much different. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Reviewing freeze map code
On Sat, Jul 2, 2016 at 12:53 AM, Andres Freund wrote: > On 2016-07-01 15:18:39 -0400, Robert Haas wrote: >> On Fri, Jul 1, 2016 at 10:22 AM, Masahiko Sawada >> wrote: >> > Ah, you're right, I misunderstood. >> > >> > Attached updated patch incorporating your comments. >> > I've changed it so that heap_xlog_lock clears vm flags if page is >> > marked all frozen. >> >> I believe that this should be separated into two patches, since there >> are two issues here: >> >> 1. Locking a tuple doesn't clear the all-frozen bit, but needs to do so. >> 2. heap_update releases the buffer content lock without logging the >> changes it has made. >> >> With respect to #1, there is no need to clear the all-visible bit, >> only the all-frozen bit. However, that's a bit tricky given that we >> removed PD_ALL_FROZEN. Should we think about putting that back again? > > I think it's fine to just do the vm lookup. > >> Should we just clear all-visible and call it good enough? > > Given that we need to do that in heap_lock_tuple, which entirely > preserves all-visible (but shouldn't preserve all-frozen), ISTM we > better find something that doesn't invalidate all-visible. > Sounds logical, considering that we have a way to set all-frozen and vacuum does that as well. So probably either we need to have a new API or add a new parameter to visibilitymap_clear() to indicate the same. If we want to go that route, isn't it better to have PD_ALL_FROZEN as well? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Reviewing freeze map code
On 7/1/16 3:43 PM, Andres Freund wrote: On 2016-07-01 15:42:22 -0500, Jim Nasby wrote: On 7/1/16 2:23 PM, Andres Freund wrote: The only cost of that is that vacuum will come along and mark the page all-visible again instead of skipping it, but that's probably not an enormous expense in most cases. I think the main cost is not having the page marked as all-visible for index-only purposes. If it's an insert mostly table, it can be a long while till vacuum comes around. ISTM that's something that should be addressed anyway (and separately), no? Huh? That's the current behaviour in heap_lock_tuple. Oh, I was referring to autovac not being aggressive enough on insert-mostly tables. Certainly if there's a reasonable way to avoid invalidating the VM when locking a tuple that'd be good. -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Experts in Analytics, Data Architecture and PostgreSQL Data in Trouble? Get it in Treble! http://BlueTreble.com 855-TREBLE2 (855-873-2532) mobile: 512-569-9461 -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Reviewing freeze map code
On 2016-07-01 15:42:22 -0500, Jim Nasby wrote: > On 7/1/16 2:23 PM, Andres Freund wrote: > > > > The only > > > > cost of that is that vacuum will come along and mark the page > > > > all-visible again instead of skipping it, but that's probably not an > > > > enormous expense in most cases. > > I think the main cost is not having the page marked as all-visible for > > index-only purposes. If it's an insert mostly table, it can be a long > > while till vacuum comes around. > > ISTM that's something that should be addressed anyway (and separately), no? Huh? That's the current behaviour in heap_lock_tuple. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Reviewing freeze map code
On 7/1/16 2:23 PM, Andres Freund wrote: > The only > cost of that is that vacuum will come along and mark the page > all-visible again instead of skipping it, but that's probably not an > enormous expense in most cases. I think the main cost is not having the page marked as all-visible for index-only purposes. If it's an insert mostly table, it can be a long while till vacuum comes around. ISTM that's something that should be addressed anyway (and separately), no? -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Experts in Analytics, Data Architecture and PostgreSQL Data in Trouble? Get it in Treble! http://BlueTreble.com 855-TREBLE2 (855-873-2532) mobile: 512-569-9461 -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Reviewing freeze map code
On 2016-07-01 15:18:39 -0400, Robert Haas wrote: > On Fri, Jul 1, 2016 at 10:22 AM, Masahiko Sawada > wrote: > > Ah, you're right, I misunderstood. > > > > Attached updated patch incorporating your comments. > > I've changed it so that heap_xlog_lock clears vm flags if page is > > marked all frozen. > > I believe that this should be separated into two patches, since there > are two issues here: > > 1. Locking a tuple doesn't clear the all-frozen bit, but needs to do so. > 2. heap_update releases the buffer content lock without logging the > changes it has made. > > With respect to #1, there is no need to clear the all-visible bit, > only the all-frozen bit. However, that's a bit tricky given that we > removed PD_ALL_FROZEN. Should we think about putting that back again? I think it's fine to just do the vm lookup. > Should we just clear all-visible and call it good enough? Given that we need to do that in heap_lock_tuple, which entirely preserves all-visible (but shouldn't preserve all-frozen), ISTM we better find something that doesn't invalidate all-visible. > The only > cost of that is that vacuum will come along and mark the page > all-visible again instead of skipping it, but that's probably not an > enormous expense in most cases. I think the main cost is not having the page marked as all-visible for index-only purposes. If it's an insert mostly table, it can be a long while till vacuum comes around. Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Reviewing freeze map code
On Fri, Jul 1, 2016 at 10:22 AM, Masahiko Sawada wrote: > Ah, you're right, I misunderstood. > > Attached updated patch incorporating your comments. > I've changed it so that heap_xlog_lock clears vm flags if page is > marked all frozen. I believe that this should be separated into two patches, since there are two issues here: 1. Locking a tuple doesn't clear the all-frozen bit, but needs to do so. 2. heap_update releases the buffer content lock without logging the changes it has made. With respect to #1, there is no need to clear the all-visible bit, only the all-frozen bit. However, that's a bit tricky given that we removed PD_ALL_FROZEN. Should we think about putting that back again? Should we just clear all-visible and call it good enough? The only cost of that is that vacuum will come along and mark the page all-visible again instead of skipping it, but that's probably not an enormous expense in most cases. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Reviewing freeze map code
On Fri, Jul 1, 2016 at 11:12 AM, Amit Kapila wrote: > On Thu, Jun 30, 2016 at 8:10 PM, Masahiko Sawada > wrote: >> On Thu, Jun 30, 2016 at 3:12 PM, Amit Kapila wrote: >>> On Thu, Jun 30, 2016 at 9:13 AM, Andres Freund wrote: On 2016-06-30 08:59:16 +0530, Amit Kapila wrote: > On Wed, Jun 29, 2016 at 10:30 PM, Andres Freund > wrote: > > On 2016-06-29 19:04:31 +0530, Amit Kapila wrote: > >> There is nothing in this record which recorded the information about > >> visibility clear flag. > > > > I think we can actually defer the clearing to the lock release? > > How about the case if after we release the lock on page, the heap page > gets flushed, but not vm and then server crashes? In the released branches there's no need to clear all visible at that point. Note how heap_lock_tuple doesn't clear it at all. So we should be fine there, and that's the part where reusing an existing record is important (for compatibility). >>> >>> For back branches, I agree that heap_lock_tuple is sufficient, >> >> Even if we use heap_lock_tuple, If server crashed after flushed heap >> but not vm, after crash recovery the heap is still marked all-visible >> on vm. > > So, in this case both vm and page will be marked as all_visible. > >> This case could be happen even on released branched, and could make >> IndexOnlyScan returns wrong result? >> > > Why do you think IndexOnlyScan will return wrong result? If the > server crash in the way as you described, the transaction that has > made modifications will anyway be considered aborted, so the result of > IndexOnlyScan should not be wrong. > Ah, you're right, I misunderstood. Attached updated patch incorporating your comments. I've changed it so that heap_xlog_lock clears vm flags if page is marked all frozen. 
Regards, -- Masahiko Sawada emit_wal_already_marked_true_case_v2.patch Description: Binary data
Re: [HACKERS] Reviewing freeze map code
On Thu, Jun 30, 2016 at 8:10 PM, Masahiko Sawada wrote: > On Thu, Jun 30, 2016 at 3:12 PM, Amit Kapila wrote: >> On Thu, Jun 30, 2016 at 9:13 AM, Andres Freund wrote: >>> On 2016-06-30 08:59:16 +0530, Amit Kapila wrote: On Wed, Jun 29, 2016 at 10:30 PM, Andres Freund wrote: > On 2016-06-29 19:04:31 +0530, Amit Kapila wrote: >> There is nothing in this record which recorded the information about >> visibility clear flag. > > I think we can actually defer the clearing to the lock release? How about the case if after we release the lock on page, the heap page gets flushed, but not vm and then server crashes? >>> >>> In the released branches there's no need to clear all visible at that >>> point. Note how heap_lock_tuple doesn't clear it at all. So we should be >>> fine there, and that's the part where reusing an existing record is >>> important (for compatibility). >>> >> >> For back branches, I agree that heap_lock_tuple is sufficient, > > Even if we use heap_lock_tuple, If server crashed after flushed heap > but not vm, after crash recovery the heap is still marked all-visible > on vm. So, in this case both vm and page will be marked as all_visible. > This case could be happen even on released branched, and could make > IndexOnlyScan returns wrong result? > Why do you think IndexOnlyScan will return wrong result? If the server crash in the way as you described, the transaction that has made modifications will anyway be considered aborted, so the result of IndexOnlyScan should not be wrong. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Reviewing freeze map code
On Thu, Jun 30, 2016 at 3:12 PM, Amit Kapila wrote: > On Thu, Jun 30, 2016 at 9:13 AM, Andres Freund wrote: >> On 2016-06-30 08:59:16 +0530, Amit Kapila wrote: >>> On Wed, Jun 29, 2016 at 10:30 PM, Andres Freund wrote: >>> > On 2016-06-29 19:04:31 +0530, Amit Kapila wrote: >>> >> There is nothing in this record which recorded the information about >>> >> visibility clear flag. >>> > >>> > I think we can actually defer the clearing to the lock release? >>> >>> How about the case if after we release the lock on page, the heap page >>> gets flushed, but not vm and then server crashes? >> >> In the released branches there's no need to clear all visible at that >> point. Note how heap_lock_tuple doesn't clear it at all. So we should be >> fine there, and that's the part where reusing an existing record is >> important (for compatibility). >> > > For back branches, I agree that heap_lock_tuple is sufficient, Even if we use heap_lock_tuple, If server crashed after flushed heap but not vm, after crash recovery the heap is still marked all-visible on vm. This case could be happen even on released branched, and could make IndexOnlyScan returns wrong result? Regards, -- Masahiko Sawada -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Reviewing freeze map code
On Thu, Jun 30, 2016 at 9:13 AM, Andres Freund wrote: > On 2016-06-30 08:59:16 +0530, Amit Kapila wrote: >> On Wed, Jun 29, 2016 at 10:30 PM, Andres Freund wrote: >> > On 2016-06-29 19:04:31 +0530, Amit Kapila wrote: >> >> There is nothing in this record which recorded the information about >> >> visibility clear flag. >> > >> > I think we can actually defer the clearing to the lock release? >> >> How about the case if after we release the lock on page, the heap page >> gets flushed, but not vm and then server crashes? > > In the released branches there's no need to clear all visible at that > point. Note how heap_lock_tuple doesn't clear it at all. So we should be > fine there, and that's the part where reusing an existing record is > important (for compatibility). > For back branches, I agree that heap_lock_tuple is sufficient, but in that case we should not clear the vm or page bit at all as done in proposed patch. > But your question made me realize that we desperately *do* need to > clear the frozen bit in heap_lock_tuple in 9.6... > Right. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Reviewing freeze map code
On 2016-06-30 08:59:16 +0530, Amit Kapila wrote: > On Wed, Jun 29, 2016 at 10:30 PM, Andres Freund wrote: > > On 2016-06-29 19:04:31 +0530, Amit Kapila wrote: > >> There is nothing in this record which recorded the information about > >> visibility clear flag. > > > > I think we can actually defer the clearing to the lock release? > > How about the case if after we release the lock on page, the heap page > gets flushed, but not vm and then server crashes? In the released branches there's no need to clear all visible at that point. Note how heap_lock_tuple doesn't clear it at all. So we should be fine there, and that's the part where reusing an existing record is important (for compatibility). But your question made me realize that we desperately *do* need to clear the frozen bit in heap_lock_tuple in 9.6... Greetings, Andres Freund
Re: [HACKERS] Reviewing freeze map code
On Wed, Jun 29, 2016 at 10:30 PM, Andres Freund wrote: > On 2016-06-29 19:04:31 +0530, Amit Kapila wrote: >> There is nothing in this record which recorded the information about >> visibility clear flag. > > I think we can actually defer the clearing to the lock release? How about the case if after we release the lock on page, the heap page gets flushed, but not vm and then server crashes? After recovery, vacuum will never consider such a page for freezing as the vm bit still says all_frozen. Another possibility could be that WAL for xl_heap_lock got flushed, but not the heap page before crash, now after recovery, it will set the tuple with appropriate infomask and other flags, but the heap page will still be marked as ALL_VISIBLE. I think that can lead to a situation which Thomas Munro has reported upthread. All other cases in heapam.c, after clearing vm and corresponding flag in heap page, we are recording the same in WAL. Why to make this a different case and how is it safe to do it here and not at other places. > A tuple > being locked doesn't require the vm being cleared. > > >> I think in this approach, it is important to measure the performance >> of update, may be you can use simple-update option of pgbench for >> various workloads. Try it with different fill factors (-F fillfactor >> option in pgbench). > > Probably not sufficient, also needs toast activity, to show the really > bad case of many lock releases. Okay, makes sense. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
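The two crash scenarios Amit describes reduce to one rule: the redo routine for the tuple-lock record must also clear the page and visibility-map bits, because either the heap page or the VM page can reach disk without the other before a crash. The following deliberately simplified model (not real PostgreSQL code; the struct and fields are invented for illustration) shows why a WAL record whose redo omits the visibility clearing leaves stale bits behind.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy crash-recovery model: a tuple-lock change, the PD_ALL_VISIBLE-style
 * page flag, and a VM bit. Purely illustrative. */
struct pagestate
{
    bool tuple_locked;      /* xmax/infomask change on the heap page */
    bool page_all_visible;  /* all-visible flag on the heap page */
    bool vm_all_frozen;     /* bit in the visibility map page */
};

/* After a crash, only what was flushed survives; then WAL redo runs.
 * If the lock record's redo also clears visibility state, the outcome
 * is consistent for every flush ordering; otherwise stale bits remain. */
static struct pagestate recover(struct pagestate on_disk, bool wal_flushed,
                                bool redo_clears_visibility)
{
    struct pagestate s = on_disk;
    if (wal_flushed)
    {
        s.tuple_locked = true;              /* redo the tuple-lock change */
        if (redo_clears_visibility)
        {
            s.page_all_visible = false;     /* redo the page-bit clear */
            s.vm_all_frozen = false;        /* redo the VM-bit clear */
        }
    }
    return s;
}
```

With `redo_clears_visibility` false, the xl_heap_lock record is replayed but the page is still marked all-visible/all-frozen, which is exactly the inconsistency described above.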
Re: [HACKERS] Reviewing freeze map code
On 2016-06-29 19:04:31 +0530, Amit Kapila wrote: > There is nothing in this record which recorded the information about > visibility clear flag. I think we can actually defer the clearing to the lock release? A tuple being locked doesn't require the vm being cleared. > I think in this approach, it is important to measure the performance > of update, may be you can use simple-update option of pgbench for > various workloads. Try it with different fill factors (-F fillfactor > option in pgbench). Probably not sufficient, also needs toast activity, to show the really bad case of many lock releases.
Re: [HACKERS] Reviewing freeze map code
On Wed, Jun 29, 2016 at 11:14 AM, Masahiko Sawada wrote: > On Fri, Jun 24, 2016 at 11:04 AM, Amit Kapila wrote: >> On Fri, Jun 24, 2016 at 4:33 AM, Andres Freund wrote: >>> On 2016-06-23 18:59:57 -0400, Alvaro Herrera wrote: Andres Freund wrote: > I'm looking into three approaches right now: > > 3) Use WAL logging for the already_marked = true case. > 3) This approach so far seems the best. It's possible to reuse the > xl_heap_lock record (in an afaics backwards compatible manner), and in > most cases the overhead isn't that large. It's of course annoying to > emit more WAL, but it's not that big an overhead compared to extending a > file, or to toasting. It's also by far the simplest fix. >> >>> >> >> You are right, I think we can try such an optimization in Head and >> that too if we see a performance hit with adding this new WAL in >> heap_update. >> >> > > +1 for #3 approach, and attached draft patch for that. > I think attached patch would fix this problem but please let me know > if this patch is not what you're thinking. > Review comments: + if (RelationNeedsWAL(relation)) + { + xl_heap_lock xlrec; + XLogRecPtr recptr; + .. + xlrec.offnum = ItemPointerGetOffsetNumber(&oldtup.t_self); + xlrec.locking_xid = xid; + xlrec.infobits_set = compute_infobits(oldtup.t_data->t_infomask, + oldtup.t_data->t_infomask2); + XLogRegisterData((char *) &xlrec, SizeOfHeapLock); + recptr = XLogInsert(RM_HEAP_ID, XLOG_HEAP_LOCK); + PageSetLSN(page, recptr); + } There is nothing in this record which recorded the information about visibility clear flag. How will you ensure to clear the flag after crash? Have you considered to log cid using log_heap_new_cid() for logical decoding? It seems to me that the value of locking_xid should be xmax_old_tuple, why you have chosen xid? 
+ /* Celar PD_ALL_VISIBLE flags */
+ if (PageIsAllVisible(BufferGetPage(buffer)))
+ {
+ all_visible_cleared = true;
+ PageClearAllVisible(BufferGetPage(buffer));
+ visibilitymap_clear(relation, BufferGetBlockNumber(buffer),
+ vmbuffer);
+ }
+
+ MarkBufferDirty(buffer);
+
  /* Clear obsolete visibility flags ... */
  oldtup.t_data->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED);

I think it is better to first update the tuple-related info and then clear the PD_ALL_VISIBLE flags (for the order, refer to how we have done it in heap_update in the code below where you are trying to add new code).

A couple of typos:
/relasing/releasing
/Celar/Clear

I think in this approach, it is important to measure the performance of update; maybe you can use the simple-update option of pgbench for various workloads. Try it with different fill factors (-F fillfactor option in pgbench).

-- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Reviewing freeze map code
On Fri, Jun 24, 2016 at 11:04 AM, Amit Kapila wrote: > On Fri, Jun 24, 2016 at 4:33 AM, Andres Freund wrote: >> On 2016-06-23 18:59:57 -0400, Alvaro Herrera wrote: >>> Andres Freund wrote: >>> >>> > I'm looking into three approaches right now: >>> > >>> > 3) Use WAL logging for the already_marked = true case. >>> >>> >>> > 3) This approach so far seems the best. It's possible to reuse the >>> > xl_heap_lock record (in an afaics backwards compatible manner), and in >>> > most cases the overhead isn't that large. It's of course annoying to >>> > emit more WAL, but it's not that big an overhead compared to extending a >>> > file, or to toasting. It's also by far the simplest fix. >>> > > +1 for proceeding with Approach-3. > >>> I suppose it's fine if we crash midway from emitting this wal record and >>> the actual heap_update one, since the xmax will appear to come from an >>> aborted xid, right? >> >> Yea, that should be fine. >> >> >>> I agree that the overhead is probably negligible, considering that this >>> only happens when toast is invoked. It's probably not as great when the >>> new tuple goes to another page, though. >> >> I think it has to happen in both cases unfortunately. We could try to >> add some optimizations (e.g. only release lock & WAL log if the target >> page, via fsm, is before the current one), but I don't really want to go >> there in the back branches. >> > > You are right, I think we can try such an optimization in Head and > that too if we see a performance hit with adding this new WAL in > heap_update. > > +1 for #3 approach, and attached draft patch for that. I think attached patch would fix this problem but please let me know if this patch is not what you're thinking. 
Regards, -- Masahiko Sawada

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 57da57a..2f3fd83 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -3923,6 +3923,28 @@ l2:
 
 	if (need_toast || newtupsize > pagefree)
 	{
+		/*
+		 * To prevent data corruption due to updating old tuple by
+		 * other backends after released buffer, we need to emit that
+		 * xmax of old tuple is set and clear visibility map bits if
+		 * needed before relasing buffer. We can reuse xl_heap_lock
+		 * for this pupose. It should be fine even if we crash midway
+		 * from this section and the actual updating one later, since
+		 * the xmax will appear to come from an aborted xid.
+		 */
+		START_CRIT_SECTION();
+
+		/* Celar PD_ALL_VISIBLE flags */
+		if (PageIsAllVisible(BufferGetPage(buffer)))
+		{
+			all_visible_cleared = true;
+			PageClearAllVisible(BufferGetPage(buffer));
+			visibilitymap_clear(relation, BufferGetBlockNumber(buffer),
+								vmbuffer);
+		}
+
+		MarkBufferDirty(buffer);
+
 		/* Clear obsolete visibility flags ... */
 		oldtup.t_data->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED);
 		oldtup.t_data->t_infomask2 &= ~HEAP_KEYS_UPDATED;
@@ -3936,6 +3958,26 @@ l2:
 		/* temporarily make it look not-updated */
 		oldtup.t_data->t_ctid = oldtup.t_self;
 		already_marked = true;
+
+		if (RelationNeedsWAL(relation))
+		{
+			xl_heap_lock xlrec;
+			XLogRecPtr	recptr;
+
+			XLogBeginInsert();
+			XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
+
+			xlrec.offnum = ItemPointerGetOffsetNumber(&oldtup.t_self);
+			xlrec.locking_xid = xid;
+			xlrec.infobits_set = compute_infobits(oldtup.t_data->t_infomask,
+												  oldtup.t_data->t_infomask2);
+			XLogRegisterData((char *) &xlrec, SizeOfHeapLock);
+			recptr = XLogInsert(RM_HEAP_ID, XLOG_HEAP_LOCK);
+			PageSetLSN(page, recptr);
+		}
+
+		END_CRIT_SECTION();
+
 		LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
 
 /*
Re: [HACKERS] Reviewing freeze map code
On Tue, Jun 28, 2016 at 8:06 PM, Masahiko Sawada wrote: > On Tue, Jun 21, 2016 at 6:59 AM, Andres Freund wrote: >> On 2016-06-20 17:55:19 -0400, Robert Haas wrote: >>> On Mon, Jun 20, 2016 at 4:24 PM, Andres Freund wrote: >>> > On 2016-06-20 16:10:23 -0400, Robert Haas wrote: >>> >> What exactly is the point of all of that already_marked stuff? >>> > >>> > Preventing the old tuple from being locked/updated by another backend, >>> > while unlocking the buffer. >>> > >>> >> I >>> >> mean, suppose we just don't do any of that before we go off to do >>> >> toast_insert_or_update and RelationGetBufferForTuple. Eventually, >>> >> when we reacquire the page lock, we might find that somebody else has >>> >> already updated the tuple, but couldn't that be handled by >>> >> (approximately) looping back up to l2 just as we do in several other >>> >> cases? >>> > >>> > We'd potentially have to undo a fair amount more work: the toasted data >>> > would have to be deleted and such, just to retry. Which isn't going to >>> > super easy, because all of it will be happening with the current cid (we >>> > can't just increase CommandCounterIncrement() for correctness reasons). >>> >>> Why would we have to delete the TOAST data? AFAIUI, the tuple points >>> to the TOAST data, but not the other way around. So if we change our >>> mind about where to put the tuple, I don't think that requires >>> re-TOASTing. >> >> Consider what happens if we, after restarting at l2, notice that we >> can't actually insert, but return in the !HeapTupleMayBeUpdated >> branch. If the caller doesn't error out - and there's certainly callers >> doing that - we'd "leak" a toasted datum. > > Sorry for interrupt you, but I have a question about this case. > Is there case where we back to l2 after we created toasted > datum(called toast_insert_or_update)? > IIUC, after we stored toast datum we just insert heap tuple and log > WAL (or error out for some reasons). > I understood now, sorry for the noise. 
Regards, -- Masahiko Sawada
Re: [HACKERS] Reviewing freeze map code
On Tue, Jun 21, 2016 at 6:59 AM, Andres Freund wrote: > On 2016-06-20 17:55:19 -0400, Robert Haas wrote: >> On Mon, Jun 20, 2016 at 4:24 PM, Andres Freund wrote: >> > On 2016-06-20 16:10:23 -0400, Robert Haas wrote: >> >> What exactly is the point of all of that already_marked stuff? >> > >> > Preventing the old tuple from being locked/updated by another backend, >> > while unlocking the buffer. >> > >> >> I >> >> mean, suppose we just don't do any of that before we go off to do >> >> toast_insert_or_update and RelationGetBufferForTuple. Eventually, >> >> when we reacquire the page lock, we might find that somebody else has >> >> already updated the tuple, but couldn't that be handled by >> >> (approximately) looping back up to l2 just as we do in several other >> >> cases? >> > >> > We'd potentially have to undo a fair amount more work: the toasted data >> > would have to be deleted and such, just to retry. Which isn't going to >> > super easy, because all of it will be happening with the current cid (we >> > can't just increase CommandCounterIncrement() for correctness reasons). >> >> Why would we have to delete the TOAST data? AFAIUI, the tuple points >> to the TOAST data, but not the other way around. So if we change our >> mind about where to put the tuple, I don't think that requires >> re-TOASTing. > > Consider what happens if we, after restarting at l2, notice that we > can't actually insert, but return in the !HeapTupleMayBeUpdated > branch. If the caller doesn't error out - and there's certainly callers > doing that - we'd "leak" a toasted datum. Sorry for interrupt you, but I have a question about this case. Is there case where we back to l2 after we created toasted datum(called toast_insert_or_update)? IIUC, after we stored toast datum we just insert heap tuple and log WAL (or error out for some reasons). 
Regards, -- Masahiko Sawada
Re: [HACKERS] Reviewing freeze map code
On Tue, Jun 21, 2016 at 10:59:25AM +1200, Thomas Munro wrote: > On Fri, Jun 17, 2016 at 3:36 PM, Noah Misch wrote: > > I agree the non-atomic, unlogged change is a problem. A related threat > > doesn't require a torn page: > > > > AssignTransactionId() xid=123 > > heapam.c:3931 HeapTupleHeaderSetXmax(oldtup.t_data, 123); > > some ERROR before heap_update() finishes > > rollback; -- xid=123 > > some backend flushes the modified page > > immediate shutdown > > AssignTransactionId() xid=123 > > commit; -- xid=123 > > > > If nothing wrote an xlog record that witnesses xid 123, the cluster can > > reassign it after recovery. The failed update is now considered a > > successful > > update, and the row improperly becomes dead. That's important. > > I wonder if that was originally supposed to be handled with the > HEAP_XMAX_UNLOGGED flag which was removed in 11919160. A comment in > the heap WAL logging commit f2bfe8a2 said that tqual routines would > see the HEAP_XMAX_UNLOGGED flag in the event of a crash before logging > (though I'm not sure if the tqual routines ever actually did that). HEAP_XMAX_UNLOGGED does appear to have originated in contemplation of this same hazard. Looking at the three commits in "git log -S HEAP_XMAX_UNLOGGED" (f2bfe8a b58c041 1191916), nothing ever completed the implementation by testing for that flag. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Reviewing freeze map code
On Fri, Jun 24, 2016 at 4:33 AM, Andres Freund wrote: > On 2016-06-23 18:59:57 -0400, Alvaro Herrera wrote: >> Andres Freund wrote: >> >> > I'm looking into three approaches right now: >> > >> > 3) Use WAL logging for the already_marked = true case. >> >> >> > 3) This approach so far seems the best. It's possible to reuse the >> > xl_heap_lock record (in an afaics backwards compatible manner), and in >> > most cases the overhead isn't that large. It's of course annoying to >> > emit more WAL, but it's not that big an overhead compared to extending a >> > file, or to toasting. It's also by far the simplest fix. >> +1 for proceeding with Approach-3. >> I suppose it's fine if we crash midway from emitting this wal record and >> the actual heap_update one, since the xmax will appear to come from an >> aborted xid, right? > > Yea, that should be fine. > > >> I agree that the overhead is probably negligible, considering that this >> only happens when toast is invoked. It's probably not as great when the >> new tuple goes to another page, though. > > I think it has to happen in both cases unfortunately. We could try to > add some optimizations (e.g. only release lock & WAL log if the target > page, via fsm, is before the current one), but I don't really want to go > there in the back branches. > You are right, I think we can try such an optimization in Head and that too if we see a performance hit with adding this new WAL in heap_update. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Reviewing freeze map code
On 2016-06-23 18:59:57 -0400, Alvaro Herrera wrote: > Andres Freund wrote: > > > I'm looking into three approaches right now: > > > > 3) Use WAL logging for the already_marked = true case. > > > > 3) This approach so far seems the best. It's possible to reuse the > > xl_heap_lock record (in an afaics backwards compatible manner), and in > > most cases the overhead isn't that large. It's of course annoying to > > emit more WAL, but it's not that big an overhead compared to extending a > > file, or to toasting. It's also by far the simplest fix. > > I suppose it's fine if we crash midway from emitting this wal record and > the actual heap_update one, since the xmax will appear to come from an > aborted xid, right? Yea, that should be fine. > I agree that the overhead is probably negligible, considering that this > only happens when toast is invoked. It's probably not as great when the > new tuple goes to another page, though. I think it has to happen in both cases unfortunately. We could try to add some optimizations (e.g. only release lock & WAL log if the target page, via fsm, is before the current one), but I don't really want to go there in the back branches. Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Reviewing freeze map code
Andres Freund wrote: > I'm looking into three approaches right now: > > 3) Use WAL logging for the already_marked = true case. > 3) This approach so far seems the best. It's possible to reuse the > xl_heap_lock record (in an afaics backwards compatible manner), and in > most cases the overhead isn't that large. It's of course annoying to > emit more WAL, but it's not that big an overhead compared to extending a > file, or to toasting. It's also by far the simplest fix. I suppose it's fine if we crash midway from emitting this wal record and the actual heap_update one, since the xmax will appear to come from an aborted xid, right? I agree that the overhead is probably negligible, considering that this only happens when toast is invoked. It's probably not as great when the new tuple goes to another page, though. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
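Álvaro's "aborted xid" point is just ordinary tuple visibility: an xmax only makes a tuple dead if the transaction that set it went on to commit. A crash between the xl_heap_lock-style record and the heap_update record means the updating xid never commits, so recovery leaves behind a harmless xmax. A one-function toy of that rule (not the real HeapTupleSatisfies* logic, which also has to consider hint bits, multixacts, and lock-only xmax values):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified visibility rule: an xmax kills a tuple only if it is set
 * AND the setting transaction committed. Illustrative only. */
static bool tuple_is_dead(uint32_t xmax, bool xmax_committed)
{
    return xmax != 0 && xmax_committed;
}
```

Crashing between the two WAL records corresponds to `tuple_is_dead(some_xid, false)`: the xmax is present on the page after replay, but since the xid is treated as aborted, every later visibility check still sees the old tuple as live.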
Re: [HACKERS] Reviewing freeze map code
On 2016-06-21 16:32:03 -0400, Robert Haas wrote: > On Tue, Jun 21, 2016 at 3:46 PM, Andres Freund wrote: > > On 2016-06-21 15:38:25 -0400, Robert Haas wrote: > >> On Tue, Jun 21, 2016 at 1:49 PM, Andres Freund wrote: > >> >> I'm also a bit dubious that LockAcquire is safe to call in general > >> >> with interrupts held. > >> > > >> > Looks like we could just acquire the tuple-lock *before* doing the > >> > toast_insert_or_update/RelationGetBufferForTuple, but after releasing > >> > the buffer lock. That'd allow us to do avoid doing the nested locking, > >> > should make the recovery just a goto l2;, ... > >> > >> Why isn't that racey? Somebody else can grab the tuple lock after we > >> release the buffer content lock and before we acquire the tuple lock. > > > > Sure, but by the time the tuple lock is released, they'd have updated > > xmax. So once we acquired that we can just do > > if (xmax_infomask_changed(oldtup.t_data->t_infomask, > > infomask) > > || > > > > !TransactionIdEquals(HeapTupleHeaderGetRawXmax(oldtup.t_data), > > xwait)) > > goto l2; > > which is fine, because we've not yet done the toasting. > > I see. > > > I'm not sure wether this approach is better than deleting potentially > > toasted data though. It's probably faster, but will likely touch more > > places in the code, and eat up a infomask bit (infomask & HEAP_MOVED > > == HEAP_MOVED in my prototype). > > Ugh. That's not very desirable at all. I'm looking into three approaches right now: 1) Flag approach from above 2) Undo toasting on concurrent activity, retry 3) Use WAL logging for the already_marked = true case. 1) primarily suffers from a significant amount of complexity. I still have a bug in there that sometimes triggers "attempted to update invisible tuple" ERRORs. Otherwise it seems to perform decently performancewise - even on workloads with many backends hitting the same tuple, the retry-rate is low. 
2) Seems to work too, but due to the amount of time the tuple is not locked, the retry rate can be really high. As we perform a significant amount of work (toast insertion & index manipulation or extending a file) while the tuple is not locked, it's quite likely that another session tries to modify the tuple in between. I think it's possible to essentially livelock.

3) This approach so far seems the best. It's possible to reuse the xl_heap_lock record (in an afaics backwards-compatible manner), and in most cases the overhead isn't that large. It's of course annoying to emit more WAL, but it's not that big an overhead compared to extending a file, or to toasting. It's also by far the simplest fix.

Comments?
Re: [HACKERS] Reviewing freeze map code
On Tue, Jun 21, 2016 at 3:46 PM, Andres Freund wrote: > On 2016-06-21 15:38:25 -0400, Robert Haas wrote: >> On Tue, Jun 21, 2016 at 1:49 PM, Andres Freund wrote: >> >> I'm also a bit dubious that LockAcquire is safe to call in general >> >> with interrupts held. >> > >> > Looks like we could just acquire the tuple-lock *before* doing the >> > toast_insert_or_update/RelationGetBufferForTuple, but after releasing >> > the buffer lock. That'd allow us to do avoid doing the nested locking, >> > should make the recovery just a goto l2;, ... >> >> Why isn't that racey? Somebody else can grab the tuple lock after we >> release the buffer content lock and before we acquire the tuple lock. > > Sure, but by the time the tuple lock is released, they'd have updated > xmax. So once we acquired that we can just do > if (xmax_infomask_changed(oldtup.t_data->t_infomask, > infomask) || > > !TransactionIdEquals(HeapTupleHeaderGetRawXmax(oldtup.t_data), > xwait)) > goto l2; > which is fine, because we've not yet done the toasting. I see. > I'm not sure wether this approach is better than deleting potentially > toasted data though. It's probably faster, but will likely touch more > places in the code, and eat up a infomask bit (infomask & HEAP_MOVED > == HEAP_MOVED in my prototype). Ugh. That's not very desirable at all. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Reviewing freeze map code
On 2016-06-21 15:38:25 -0400, Robert Haas wrote: > On Tue, Jun 21, 2016 at 1:49 PM, Andres Freund wrote: > >> I'm also a bit dubious that LockAcquire is safe to call in general > >> with interrupts held. > > > > Looks like we could just acquire the tuple-lock *before* doing the > > toast_insert_or_update/RelationGetBufferForTuple, but after releasing > > the buffer lock. That'd allow us to do avoid doing the nested locking, > > should make the recovery just a goto l2;, ... > > Why isn't that racey? Somebody else can grab the tuple lock after we > release the buffer content lock and before we acquire the tuple lock. Sure, but by the time the tuple lock is released, they'd have updated xmax. So once we acquired that we can just do if (xmax_infomask_changed(oldtup.t_data->t_infomask, infomask) || !TransactionIdEquals(HeapTupleHeaderGetRawXmax(oldtup.t_data), xwait)) goto l2; which is fine, because we've not yet done the toasting. I'm not sure wether this approach is better than deleting potentially toasted data though. It's probably faster, but will likely touch more places in the code, and eat up a infomask bit (infomask & HEAP_MOVED == HEAP_MOVED in my prototype). Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
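The recheck Andres quotes is an optimistic-concurrency pattern: snapshot xmax and the infomask while holding the lock, drop the lock for the expensive work (toasting, finding a buffer), and on reacquire restart if anything changed. The sketch below models that shape only; the struct and the single simulated "interfering" locker are invented for illustration and are not PostgreSQL code.

```c
#include <assert.h>
#include <stdint.h>

/* Toy tuple header: just the fields the recheck compares. */
struct toytuple { uint32_t xmax; uint16_t infomask; };

/* Generic shape of the "goto l2" recovery: snapshot, unlock, work,
 * relock, and restart if xmax/infomask changed in the meantime.
 * interfering_xid != 0 simulates one concurrent locker slipping in
 * while the lock is released on the first pass. */
static int lock_tuple_with_retry(struct toytuple *t, uint32_t my_xid,
                                 uint32_t interfering_xid, int *retries)
{
    int first_pass = 1;
    for (;;)
    {
        uint32_t xwait = t->xmax;       /* snapshot while "locked" */
        uint16_t mask  = t->infomask;

        /* buffer lock released here for toast/buffer work */
        if (first_pass && interfering_xid != 0)
            t->xmax = interfering_xid;  /* concurrent session intervenes */
        first_pass = 0;

        /* lock reacquired: the equivalent of "goto l2" on any change */
        if (t->xmax != xwait || t->infomask != mask)
        {
            (*retries)++;
            continue;
        }
        t->xmax = my_xid;               /* nothing changed; proceed */
        return 0;
    }
}
```

The point Andres makes holds here too: once the snapshot matches on reacquire, the expensive work has not yet been committed to, so restarting is cheap; it only becomes painful when the restart has to undo toasted data.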
Re: [HACKERS] Reviewing freeze map code
On Tue, Jun 21, 2016 at 1:49 PM, Andres Freund wrote: >> I'm also a bit dubious that LockAcquire is safe to call in general >> with interrupts held. > > Looks like we could just acquire the tuple-lock *before* doing the > toast_insert_or_update/RelationGetBufferForTuple, but after releasing > the buffer lock. That'd allow us to do avoid doing the nested locking, > should make the recovery just a goto l2;, ... Why isn't that racey? Somebody else can grab the tuple lock after we release the buffer content lock and before we acquire the tuple lock. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Reviewing freeze map code
On 2016-06-21 13:03:24 -0400, Robert Haas wrote: > On Tue, Jun 21, 2016 at 12:54 PM, Andres Freund wrote: > >> I don't quite understand the intended semantics of this proposed flag. > > > > Whenever the flag is set, we have to acquire the heavyweight tuple lock > > before continuing. That guarantees nobody else can modify the tuple, > > while the lock is released, without requiring to modify more than one > > hint bit. That should fix the torn page issue, no? > > Yeah, I guess that would work. > > >> If we don't already have the tuple lock at that point, we can't go > >> acquire it before releasing the content lock, can we? > > > > Why not? Afaics the way that tuple locks are used, the nesting should > > be fine. > > Well, the existing places where we acquire the tuple lock within > heap_update() are all careful to release the page lock first, so I'm > skeptical that doing it the other order is safe. Certainly, if we've > got some code that grabs the page lock and then the tuple lock and > other code that grabs the tuple lock and then the page lock, that's a > deadlock waiting to happen. Just noticed this piece of code while looking into this: UnlockReleaseBuffer(buffer); if (have_tuple_lock) UnlockTupleTuplock(relation, &(tp.t_self), LockTupleExclusive); if (vmbuffer != InvalidBuffer) ReleaseBuffer(vmbuffer); return result; seems weird to release the vmbuffer after the tuplelock... > I'm also a bit dubious that LockAcquire is safe to call in general > with interrupts held. Looks like we could just acquire the tuple-lock *before* doing the toast_insert_or_update/RelationGetBufferForTuple, but after releasing the buffer lock. That'd allow us to do avoid doing the nested locking, should make the recovery just a goto l2;, ... Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
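The deadlock Robert warns about in the quoted text is the classic lock-ordering hazard: if one code path takes the page lock then the tuple lock while another takes them in the opposite order, two backends can each hold one lock and wait forever for the other. A common discipline is to give every lock a rank and acquire only in increasing rank order; the minimal checker below (not PostgreSQL code; ranks chosen to match Andres's suggestion of tuple lock before buffer content lock) flags the mixed ordering.

```c
#include <assert.h>

/* Minimal lock-ordering checker: acquiring a lock whose rank is <= the
 * highest rank already held flags a potential deadlock. Illustrative. */
enum { RANK_TUPLE_LOCK = 1, RANK_BUFFER_LOCK = 2 };

static int highest_held = 0;

static int acquire_ranked(int rank)
{
    if (rank <= highest_held)
        return -1;              /* order violation: deadlock risk */
    highest_held = rank;
    return 0;
}

static void release_all_ranked(void)
{
    highest_held = 0;
}
```

This is why the existing heap_update code releases the page lock before taking the tuple lock, and why Andres's alternative keeps a single order (tuple lock first, then buffer lock) rather than nesting them both ways.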
Re: [HACKERS] Reviewing freeze map code
On Tue, Jun 21, 2016 at 12:54 PM, Andres Freund wrote: >> I don't quite understand the intended semantics of this proposed flag. > > Whenever the flag is set, we have to acquire the heavyweight tuple lock > before continuing. That guarantees nobody else can modify the tuple, > while the lock is released, without requiring to modify more than one > hint bit. That should fix the torn page issue, no? Yeah, I guess that would work. >> If we don't already have the tuple lock at that point, we can't go >> acquire it before releasing the content lock, can we? > > Why not? Afaics the way that tuple locks are used, the nesting should > be fine. Well, the existing places where we acquire the tuple lock within heap_update() are all careful to release the page lock first, so I'm skeptical that doing it the other order is safe. Certainly, if we've got some code that grabs the page lock and then the tuple lock and other code that grabs the tuple lock and then the page lock, that's a deadlock waiting to happen. I'm also a bit dubious that LockAcquire is safe to call in general with interrupts held. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Reviewing freeze map code
On 2016-06-21 10:50:36 -0400, Robert Haas wrote: > On Mon, Jun 20, 2016 at 11:51 PM, Andres Freund wrote: > >> > So far the best idea I have - and it's really not a good one - is to > >> > invent a new hint-bit that tells concurrent updates to acquire a > >> > heavyweight tuple lock, while releasing the page-level lock. If that > >> > hint bit does not require any other modifications - and we don't need an > >> > xid in xmax for this use case - that'll avoid doing all the other > >> > `already_marked` stuff early, which should address the correctness > >> > issue. > >> > > >> > >> Don't we need to clear such a flag in case of error? Also don't we need to > >> reset it later, like when modifying the old page later before WAL. > > > > If the flag just says "acquire a heavyweight lock", then there's no need > > for that. That's cheap enough to just do if it's errorneously set. At > > least I can't see any reason. > > I don't quite understand the intended semantics of this proposed flag. Whenever the flag is set, we have to acquire the heavyweight tuple lock before continuing. That guarantees nobody else can modify the tuple, while the lock is released, without requiring to modify more than one hint bit. That should fix the torn page issue, no? > If we don't already have the tuple lock at that point, we can't go > acquire it before releasing the content lock, can we? Why not? Afaics the way that tuple locks are used, the nesting should be fine. Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Reviewing freeze map code
On Tue, Jun 21, 2016 at 10:47 AM, Tom Lane wrote: > Robert Haas writes: >> On Mon, Jun 20, 2016 at 5:59 PM, Andres Freund wrote: >>> Consider what happens if we, after restarting at l2, notice that we >>> can't actually insert, but return in the !HeapTupleMayBeUpdated >>> branch. > >> OK, I see what you mean. Still, that doesn't seem like such a >> terrible cost. If you try to update a tuple and if it looks like you >> can update it but then after TOASTing you find that the status of the >> tuple has changed such that you can't update it after all, then you >> might need to go set xmax = MyTxid() on all of the TOAST tuples you >> created (whose CTIDs we could save someplace, so that it's just a >> matter of finding them by CTID to kill them). > > ... and if you get an error or crash partway through that, what happens? Then the transaction is aborted anyway, and we haven't leaked anything because VACUUM will clean it up. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Reviewing freeze map code
On Mon, Jun 20, 2016 at 11:51 PM, Andres Freund wrote: >> > So far the best idea I have - and it's really not a good one - is to >> > invent a new hint-bit that tells concurrent updates to acquire a >> > heavyweight tuple lock, while releasing the page-level lock. If that >> > hint bit does not require any other modifications - and we don't need an >> > xid in xmax for this use case - that'll avoid doing all the other >> > `already_marked` stuff early, which should address the correctness >> > issue. >> > >> >> Don't we need to clear such a flag in case of error? Also don't we need to >> reset it later, like when modifying the old page later before WAL. > > If the flag just says "acquire a heavyweight lock", then there's no need > for that. That's cheap enough to just do if it's errorneously set. At > least I can't see any reason. I don't quite understand the intended semantics of this proposed flag. If we don't already have the tuple lock at that point, we can't go acquire it before releasing the content lock, can we? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Reviewing freeze map code
Robert Haas writes: > On Mon, Jun 20, 2016 at 5:59 PM, Andres Freund wrote: >> Consider what happens if we, after restarting at l2, notice that we >> can't actually insert, but return in the !HeapTupleMayBeUpdated >> branch. > OK, I see what you mean. Still, that doesn't seem like such a > terrible cost. If you try to update a tuple and if it looks like you > can update it but then after TOASTing you find that the status of the > tuple has changed such that you can't update it after all, then you > might need to go set xmax = MyTxid() on all of the TOAST tuples you > created (whose CTIDs we could save someplace, so that it's just a > matter of finding them by CTID to kill them). ... and if you get an error or crash partway through that, what happens? regards, tom lane
Re: [HACKERS] Reviewing freeze map code
On Mon, Jun 20, 2016 at 5:59 PM, Andres Freund wrote: > On 2016-06-20 17:55:19 -0400, Robert Haas wrote: >> On Mon, Jun 20, 2016 at 4:24 PM, Andres Freund wrote: >> > On 2016-06-20 16:10:23 -0400, Robert Haas wrote: >> >> I >> >> mean, suppose we just don't do any of that before we go off to do >> >> toast_insert_or_update and RelationGetBufferForTuple. Eventually, >> >> when we reacquire the page lock, we might find that somebody else has >> >> already updated the tuple, but couldn't that be handled by >> >> (approximately) looping back up to l2 just as we do in several other >> >> cases? >> > >> > We'd potentially have to undo a fair amount more work: the toasted data >> > would have to be deleted and such, just to retry. Which isn't going to >> > super easy, because all of it will be happening with the current cid (we >> > can't just increase CommandCounterIncrement() for correctness reasons). >> >> Why would we have to delete the TOAST data? AFAIUI, the tuple points >> to the TOAST data, but not the other way around. So if we change our >> mind about where to put the tuple, I don't think that requires >> re-TOASTing. > > Consider what happens if we, after restarting at l2, notice that we > can't actually insert, but return in the !HeapTupleMayBeUpdated > branch. If the caller doesn't error out - and there's certainly callers > doing that - we'd "leak" a toasted datum. Unless the transaction aborts, > the toasted datum would never be cleaned up, because there's no datum > pointing to it, so no heap_delete will ever recurse into the toast > datum (via toast_delete()). OK, I see what you mean. Still, that doesn't seem like such a terrible cost. 
If you try to update a tuple and if it looks like you can update it but then after TOASTing you find that the status of the tuple has changed such that you can't update it after all, then you might need to go set xmax = MyTxid() on all of the TOAST tuples you created (whose CTIDs we could save someplace, so that it's just a matter of finding them by CTID to kill them). That's not likely to happen particularly often, though, and when it does happen it's not insanely expensive. We could also reduce the cost by letting the caller of heap_update() decide whether to back out the work; if the caller intends to throw an error anyway, then there's no point. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Reviewing freeze map code
On Tue, Jun 21, 2016 at 9:21 AM, Andres Freund wrote: > > On 2016-06-21 08:59:13 +0530, Amit Kapila wrote: > > Can we consider to use some strategy to avoid deadlocks without releasing > > the lock on old page? Consider if we could have a mechanism such that > > RelationGetBufferForTuple() will ensure that it always returns a new buffer > > which has targetblock greater than the old block (on which we already held > > a lock). I think here tricky part is whether we can get anything like that > > from FSM. Also, there could be cases where we need to extend the heap when > > there were pages in heap with space available, but we have ignored them > > because there block number is smaller than the block number on which we > > have lock. > > I can't see that being acceptable, from a space-usage POV. > > > > So far the best idea I have - and it's really not a good one - is to > > > invent a new hint-bit that tells concurrent updates to acquire a > > > heavyweight tuple lock, while releasing the page-level lock. If that > > > hint bit does not require any other modifications - and we don't need an > > > xid in xmax for this use case - that'll avoid doing all the other > > > `already_marked` stuff early, which should address the correctness > > > issue. > > > > > > > Don't we need to clear such a flag in case of error? Also don't we need to > > reset it later, like when modifying the old page later before WAL. > > If the flag just says "acquire a heavyweight lock", then there's no need > for that. That's cheap enough to just do if it's errorneously set. At > least I can't see any reason. > I think it will just increase the chances of other backends to acquire a heavy weight lock. > > > It's kinda invasive though, and probably has performance > > > implications. > > > > > > > Do you see performance implication due to requirement of heavywieht tuple > > lock in more cases than now or something else? > > Because of that, yes. 
> > > > Some others ways could be: > > > > Before releasing the lock on buffer containing old tuple, clear the VM and > > visibility info from page and WAL log it. I think this could impact > > performance depending on how frequently we need to perform this action. > > Doubling the number of xlog inserts in heap_update would certainly be > measurable :(. My guess is that the heavyweight tuple lock approach will > be less expensive. > Probably, but I think a heavyweight tuple lock is more invasive. I think increasing the number of xlog inserts could surely impact performance, depending upon how frequently we need to call it. I think we might want to combine it with the idea of having RelationGetBufferForTuple() return a higher block number, such that if we don't find a higher block number from the FSM, then we can release the lock on the old page and try to get the locks on old and new buffers as we do now. This will further reduce the chances of increasing xlog insert calls and address the issue of space wastage. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Reviewing freeze map code
On Tue, Jun 21, 2016 at 9:16 AM, Thomas Munro wrote: > > On Tue, Jun 21, 2016 at 3:29 PM, Amit Kapila wrote: > > Some other ways could be: > > > > Before releasing the lock on buffer containing old tuple, clear the VM and > > visibility info from page and WAL log it. I think this could impact > > performance depending on how frequently we need to perform this action. > > > > Have a new flag like HEAP_XMAX_UNLOGGED (as it was there when this logic was > > introduced in commit f2bfe8a24c46133f81e188653a127f939eb33c4a ) and set the > > same in old tuple header before releasing lock on buffer and teach tqual.c > > to honor the flag. I think tqual.c should consider HEAP_XMAX_UNLOGGED as > > an indication of an aborted transaction unless it is currently in-progress. > > Also, I think we need to clear this flag before WAL logging in heap_update. > > I also noticed that and wondered whether it was a mistake to take that > out. It appears to have been removed as part of the logic to clear > away UNDO log support in 11919160, but it may have been an important > part of the heap_update protocol. Though (as I mentioned nearby in a > reply to Noah) I'm not sure if the tqual.c side which would ignore > the unlogged xmax in the event of a badly timed crash was ever > implemented. > Right, my observation is similar to yours and that's what I am suggesting as one alternative to fix this issue. I think making this approach work (even if this doesn't have any problems) might turn out to be tricky. However, the plus point of this approach seems to be that it shouldn't impact performance in most of the cases. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Reviewing freeze map code
On Tue, Jun 21, 2016 at 9:08 AM, Thomas Munro wrote: > > On Tue, Jun 21, 2016 at 3:29 PM, Amit Kapila wrote: > > On Tue, Jun 21, 2016 at 1:03 AM, Andres Freund wrote: > >> Well, I think generally nobody seriously looked at actually refactoring > >> heap_update(), even though that'd be a good idea. But in this instance, > >> the problem seems relatively fundamental: > >> > >> We need to lock the origin page, to do visibility checks, etc. Then we > >> need to figure out the target page. Even disregarding toasting - which > >> we could be doing earlier with some refactoring - we're going to have to > >> release the page level lock, to lock them in ascending order. Otherwise > >> we'll risk kinda likely deadlocks. > > > > Can we consider to use some strategy to avoid deadlocks without releasing > > the lock on old page? Consider if we could have a mechanism such that > > RelationGetBufferForTuple() will ensure that it always returns a new buffer > > which has targetblock greater than the old block (on which we already held a > > lock). I think here tricky part is whether we can get anything like that > > from FSM. Also, there could be cases where we need to extend the heap when > > there were pages in heap with space available, but we have ignored them > > because their block number is smaller than the block number on which we have > > lock. > > Doesn't that mean that over time, given a workload that does only or > mostly updates, your records tend to migrate further and further away > from the start of the file, leaving a growing unusable space at the > beginning, until you eventually need to CLUSTER/VACUUM FULL? > Updates should ideally fit in the same page as the old tuple in many of the cases, if fillfactor is properly configured, considering update-mostly loads. Why would the records always migrate further away? They should get the space freed by other updates in intermediate pages. 
I think there could be some impact space-wise, but freed-up space will be eventually used. > I was wondering about speculatively asking for a free page with a > lower block number than the origin page, if one is available, before > locking the origin page. Do you want to lock it as well? In any case, I think adding the code without deciding whether the update can be accommodated in the current page can prove to be costly. > Then after locking the origin page, if it > turns out you need a page but didn't get it earlier, asking for a free > page with a higher block number than the origin page. > Something like that might work out if it is feasible and people agree on pursuing such an approach. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Reviewing freeze map code
On 2016-06-21 08:59:13 +0530, Amit Kapila wrote: > Can we consider to use some strategy to avoid deadlocks without releasing > the lock on old page? Consider if we could have a mechanism such that > RelationGetBufferForTuple() will ensure that it always returns a new buffer > which has targetblock greater than the old block (on which we already held > a lock). I think here tricky part is whether we can get anything like that > from FSM. Also, there could be cases where we need to extend the heap when > there were pages in heap with space available, but we have ignored them > because their block number is smaller than the block number on which we > have lock. I can't see that being acceptable, from a space-usage POV. > > So far the best idea I have - and it's really not a good one - is to > > invent a new hint-bit that tells concurrent updates to acquire a > > heavyweight tuple lock, while releasing the page-level lock. If that > > hint bit does not require any other modifications - and we don't need an > > xid in xmax for this use case - that'll avoid doing all the other > > `already_marked` stuff early, which should address the correctness > > issue. > > > > Don't we need to clear such a flag in case of error? Also don't we need to > reset it later, like when modifying the old page later before WAL. If the flag just says "acquire a heavyweight lock", then there's no need for that. That's cheap enough to just do if it's erroneously set. At least I can't see any reason. > > It's kinda invasive though, and probably has performance > > implications. > > > > Do you see performance implication due to requirement of heavyweight tuple > lock in more cases than now or something else? Because of that, yes. 
Doubling the number of xlog inserts in heap_update would certainly be measurable :(. My guess is that the heavyweight tuple lock approach will be less expensive. Andres
Re: [HACKERS] Reviewing freeze map code
On Tue, Jun 21, 2016 at 3:29 PM, Amit Kapila wrote: > Some other ways could be: > > Before releasing the lock on buffer containing old tuple, clear the VM and > visibility info from page and WAL log it. I think this could impact > performance depending on how frequently we need to perform this action. > > Have a new flag like HEAP_XMAX_UNLOGGED (as it was there when this logic was > introduced in commit f2bfe8a24c46133f81e188653a127f939eb33c4a ) and set the > same in old tuple header before releasing lock on buffer and teach tqual.c > to honor the flag. I think tqual.c should consider HEAP_XMAX_UNLOGGED as > an indication of an aborted transaction unless it is currently in-progress. > Also, I think we need to clear this flag before WAL logging in heap_update. I also noticed that and wondered whether it was a mistake to take that out. It appears to have been removed as part of the logic to clear away UNDO log support in 11919160, but it may have been an important part of the heap_update protocol. Though (as I mentioned nearby in a reply to Noah) I'm not sure if the tqual.c side which would ignore the unlogged xmax in the event of a badly timed crash was ever implemented. -- Thomas Munro http://www.enterprisedb.com
Re: [HACKERS] Reviewing freeze map code
On Tue, Jun 21, 2016 at 3:29 PM, Amit Kapila wrote: > On Tue, Jun 21, 2016 at 1:03 AM, Andres Freund wrote: >> Well, I think generally nobody seriously looked at actually refactoring >> heap_update(), even though that'd be a good idea. But in this instance, >> the problem seems relatively fundamental: >> >> We need to lock the origin page, to do visibility checks, etc. Then we >> need to figure out the target page. Even disregarding toasting - which >> we could be doing earlier with some refactoring - we're going to have to >> release the page level lock, to lock them in ascending order. Otherwise >> we'll risk kinda likely deadlocks. > > Can we consider to use some strategy to avoid deadlocks without releasing > the lock on old page? Consider if we could have a mechanism such that > RelationGetBufferForTuple() will ensure that it always returns a new buffer > which has targetblock greater than the old block (on which we already held a > lock). I think here tricky part is whether we can get anything like that > from FSM. Also, there could be cases where we need to extend the heap when > there were pages in heap with space available, but we have ignored them > because there block number is smaller than the block number on which we have > lock. Doesn't that mean that over time, given a workload that does only or mostly updates, your records tend to migrate further and further away from the start of the file, leaving a growing unusable space at the beginning, until you eventually need to CLUSTER/VACUUM FULL? I was wondering about speculatively asking for a free page with a lower block number than the origin page, if one is available, before locking the origin page. Then after locking the origin page, if it turns out you need a page but didn't get it earlier, asking for a free page with a higher block number than the origin page. 
-- Thomas Munro http://www.enterprisedb.com
Re: [HACKERS] Reviewing freeze map code
On Tue, Jun 21, 2016 at 1:03 AM, Andres Freund wrote: > > On 2016-06-15 08:56:52 -0400, Robert Haas wrote: > > Yikes: heap_update() sets the tuple's XMAX, CMAX, infomask, infomask2, > > and CTID without logging anything or clearing the all-visible flag and > > then releases the lock on the heap page to go do some more work that > > might even ERROR out. Only if that other work all goes OK do we > > relock the page and perform the WAL-logged actions. > > > > That doesn't seem like a good idea even in existing releases, because > > you've taken a tuple on an all-visible page and made it not > > all-visible, and you've made a page modification that is not > > necessarily atomic without logging it. > > Right, that's broken. > > > > I'm not sure what to do about this: this part of the heap_update() > > logic has been like this forever, and I assume that if it were easy to > > refactor this away, somebody would have done it by now. > > Well, I think generally nobody seriously looked at actually refactoring > heap_update(), even though that'd be a good idea. But in this instance, > the problem seems relatively fundamental: > > We need to lock the origin page, to do visibility checks, etc. Then we > need to figure out the target page. Even disregarding toasting - which > we could be doing earlier with some refactoring - we're going to have to > release the page level lock, to lock them in ascending order. Otherwise > we'll risk kinda likely deadlocks. > Can we consider to use some strategy to avoid deadlocks without releasing the lock on old page? Consider if we could have a mechanism such that RelationGetBufferForTuple() will ensure that it always returns a new buffer which has targetblock greater than the old block (on which we already held a lock). I think here tricky part is whether we can get anything like that from FSM. 
Also, there could be cases where we need to extend the heap when there were pages in heap with space available, but we have ignored them because their block number is smaller than the block number on which we have lock. > We also certainly don't want to nest > the lwlocks for the toast stuff, inside a content lock for the old > tuple's page. > > So far the best idea I have - and it's really not a good one - is to > invent a new hint-bit that tells concurrent updates to acquire a > heavyweight tuple lock, while releasing the page-level lock. If that > hint bit does not require any other modifications - and we don't need an > xid in xmax for this use case - that'll avoid doing all the other > `already_marked` stuff early, which should address the correctness > issue. > Don't we need to clear such a flag in case of error? Also don't we need to reset it later, like when modifying the old page later before WAL. > It's kinda invasive though, and probably has performance > implications. > Do you see performance implication due to requirement of heavyweight tuple lock in more cases than now or something else? Some other ways could be: Before releasing the lock on buffer containing old tuple, clear the VM and visibility info from page and WAL log it. I think this could impact performance depending on how frequently we need to perform this action. Have a new flag like HEAP_XMAX_UNLOGGED (as it was there when this logic was introduced in commit f2bfe8a24c46133f81e188653a127f939eb33c4a ) and set the same in old tuple header before releasing lock on buffer and teach tqual.c to honor the flag. I think tqual.c should consider HEAP_XMAX_UNLOGGED as an indication of an aborted transaction unless it is currently in-progress. Also, I think we need to clear this flag before WAL logging in heap_update. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Reviewing freeze map code
On Fri, Jun 17, 2016 at 3:36 PM, Noah Misch wrote: > I agree the non-atomic, unlogged change is a problem. A related threat > doesn't require a torn page:
>
>   AssignTransactionId() xid=123
>   heapam.c:3931 HeapTupleHeaderSetXmax(oldtup.t_data, 123);
>   some ERROR before heap_update() finishes
>   rollback;  -- xid=123
>   some backend flushes the modified page
>   immediate shutdown
>   AssignTransactionId() xid=123
>   commit;  -- xid=123
>
> If nothing wrote an xlog record that witnesses xid 123, the cluster can > reassign it after recovery. The failed update is now considered a successful > update, and the row improperly becomes dead. That's important. I wonder if that was originally supposed to be handled with the HEAP_XMAX_UNLOGGED flag which was removed in 11919160. A comment in the heap WAL logging commit f2bfe8a2 said that tqual routines would see the HEAP_XMAX_UNLOGGED flag in the event of a crash before logging (though I'm not sure if the tqual routines ever actually did that). -- Thomas Munro http://www.enterprisedb.com
Re: [HACKERS] Reviewing freeze map code
On 2016-06-20 17:55:19 -0400, Robert Haas wrote: > On Mon, Jun 20, 2016 at 4:24 PM, Andres Freund wrote: > > On 2016-06-20 16:10:23 -0400, Robert Haas wrote: > >> What exactly is the point of all of that already_marked stuff? > > > > Preventing the old tuple from being locked/updated by another backend, > > while unlocking the buffer. > > > >> I > >> mean, suppose we just don't do any of that before we go off to do > >> toast_insert_or_update and RelationGetBufferForTuple. Eventually, > >> when we reacquire the page lock, we might find that somebody else has > >> already updated the tuple, but couldn't that be handled by > >> (approximately) looping back up to l2 just as we do in several other > >> cases? > > > > We'd potentially have to undo a fair amount more work: the toasted data > > would have to be deleted and such, just to retry. Which isn't going to > > super easy, because all of it will be happening with the current cid (we > > can't just increase CommandCounterIncrement() for correctness reasons). > > Why would we have to delete the TOAST data? AFAIUI, the tuple points > to the TOAST data, but not the other way around. So if we change our > mind about where to put the tuple, I don't think that requires > re-TOASTing. Consider what happens if we, after restarting at l2, notice that we can't actually insert, but return in the !HeapTupleMayBeUpdated branch. If the caller doesn't error out - and there's certainly callers doing that - we'd "leak" a toasted datum. Unless the transaction aborts, the toasted datum would never be cleaned up, because there's no datum pointing to it, so no heap_delete will ever recurse into the toast datum (via toast_delete()). Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Reviewing freeze map code
On Mon, Jun 20, 2016 at 4:24 PM, Andres Freund wrote: > On 2016-06-20 16:10:23 -0400, Robert Haas wrote: >> What exactly is the point of all of that already_marked stuff? > > Preventing the old tuple from being locked/updated by another backend, > while unlocking the buffer. > >> I >> mean, suppose we just don't do any of that before we go off to do >> toast_insert_or_update and RelationGetBufferForTuple. Eventually, >> when we reacquire the page lock, we might find that somebody else has >> already updated the tuple, but couldn't that be handled by >> (approximately) looping back up to l2 just as we do in several other >> cases? > > We'd potentially have to undo a fair amount more work: the toasted data > would have to be deleted and such, just to retry. Which isn't going to > super easy, because all of it will be happening with the current cid (we > can't just increase CommandCounterIncrement() for correctness reasons). Why would we have to delete the TOAST data? AFAIUI, the tuple points to the TOAST data, but not the other way around. So if we change our mind about where to put the tuple, I don't think that requires re-TOASTing. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Reviewing freeze map code
On 2016-06-20 16:10:23 -0400, Robert Haas wrote: > What exactly is the point of all of that already_marked stuff? Preventing the old tuple from being locked/updated by another backend, while unlocking the buffer. > I > mean, suppose we just don't do any of that before we go off to do > toast_insert_or_update and RelationGetBufferForTuple. Eventually, > when we reacquire the page lock, we might find that somebody else has > already updated the tuple, but couldn't that be handled by > (approximately) looping back up to l2 just as we do in several other > cases? We'd potentially have to undo a fair amount more work: the toasted data would have to be deleted and such, just to retry. Which isn't going to be super easy, because all of it will be happening with the current cid (we can't just increase CommandCounterIncrement() for correctness reasons). Andres
Re: [HACKERS] Reviewing freeze map code
On Mon, Jun 20, 2016 at 3:33 PM, Andres Freund wrote: >> I'm not sure what to do about this: this part of the heap_update() >> logic has been like this forever, and I assume that if it were easy to >> refactor this away, somebody would have done it by now. > > Well, I think generally nobody seriously looked at actually refactoring > heap_update(), even though that'd be a good idea. But in this instance, > the problem seems relatively fundamental: > > We need to lock the origin page, to do visibility checks, etc. Then we > need to figure out the target page. Even disregarding toasting - which > we could be doing earlier with some refactoring - we're going to have to > release the page level lock, to lock them in ascending order. Otherwise > we'll risk kinda likely deadlocks. We also certainly don't want to nest > the lwlocks for the toast stuff, inside a content lock for the old > tupe's page. > > So far the best idea I have - and it's really not a good one - is to > invent a new hint-bit that tells concurrent updates to acquire a > heavyweight tuple lock, while releasing the page-level lock. If that > hint bit does not require any other modifications - and we don't need an > xid in xmax for this use case - that'll avoid doing all the other > `already_marked` stuff early, which should address the correctness > issue. It's kinda invasive though, and probably has performance > implications. > > Does anybody have a better idea? What exactly is the point of all of that already_marked stuff? I mean, suppose we just don't do any of that before we go off to do toast_insert_or_update and RelationGetBufferForTuple. Eventually, when we reacquire the page lock, we might find that somebody else has already updated the tuple, but couldn't that be handled by (approximately) looping back up to l2 just as we do in several other cases? 
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Reviewing freeze map code
On 2016-06-15 08:56:52 -0400, Robert Haas wrote:
> Yikes: heap_update() sets the tuple's XMAX, CMAX, infomask, infomask2,
> and CTID without logging anything or clearing the all-visible flag and
> then releases the lock on the heap page to go do some more work that
> might even ERROR out. Only if that other work all goes OK do we
> relock the page and perform the WAL-logged actions.
>
> That doesn't seem like a good idea even in existing releases, because
> you've taken a tuple on an all-visible page and made it not
> all-visible, and you've made a page modification that is not
> necessarily atomic without logging it.

Right, that's broken.

> I'm not sure what to do about this: this part of the heap_update()
> logic has been like this forever, and I assume that if it were easy to
> refactor this away, somebody would have done it by now.

Well, I think generally nobody seriously looked at actually refactoring heap_update(), even though that'd be a good idea. But in this instance, the problem seems relatively fundamental:

We need to lock the origin page, to do visibility checks, etc. Then we need to figure out the target page. Even disregarding toasting - which we could be doing earlier with some refactoring - we're going to have to release the page-level lock, to lock the pages in ascending order. Otherwise we'll risk fairly likely deadlocks. We also certainly don't want to nest the lwlocks for the toast stuff inside a content lock for the old tuple's page.

So far the best idea I have - and it's really not a good one - is to invent a new hint bit that tells concurrent updates to acquire a heavyweight tuple lock, while releasing the page-level lock. If that hint bit does not require any other modifications - and we don't need an xid in xmax for this use case - that'll avoid doing all the other `already_marked` stuff early, which should address the correctness issue. It's kinda invasive though, and probably has performance implications.

Does anybody have a better idea?
Regards, Andres
Re: [HACKERS] Reviewing freeze map code
On Wed, Jun 15, 2016 at 08:56:52AM -0400, Robert Haas wrote: > On Wed, Jun 15, 2016 at 2:41 AM, Thomas Munro > wrote: > > I spent some time chasing down the exact circumstances. I suspect > > that there may be an interlocking problem in heap_update. Using the > > line numbers from cae1c788 [1], I see the following interaction > > between the VACUUM, UPDATE and SELECT (pg_check_visible) backends, all > > in reference to the same block number: > > > > [VACUUM] sets all visible bit > > > > [UPDATE] heapam.c:3931 HeapTupleHeaderSetXmax(oldtup.t_data, > > xmax_old_tuple); > > [UPDATE] heapam.c:3938 LockBuffer(buffer, BUFFER_LOCK_UNLOCK); > > > > [SELECT] LockBuffer(buffer, BUFFER_LOCK_SHARE); > > [SELECT] observes VM_ALL_VISIBLE as true > > [SELECT] observes tuple in HEAPTUPLE_DELETE_IN_PROGRESS state > > [SELECT] barfs > > > > [UPDATE] heapam.c:4116 visibilitymap_clear(...) > > Yikes: heap_update() sets the tuple's XMAX, CMAX, infomask, infomask2, > and CTID without logging anything or clearing the all-visible flag and > then releases the lock on the heap page to go do some more work that > might even ERROR out. Only if that other work all goes OK do we > relock the page and perform the WAL-logged actions. > > That doesn't seem like a good idea even in existing releases, because > you've taken a tuple on an all-visible page and made it not > all-visible, and you've made a page modification that is not > necessarily atomic without logging it. This is is particularly bad in > 9.6, because if that page is also all-frozen then XMAX will eventually > be pointing into space and VACUUM will never visit the page to > re-freeze it the way it would have done in earlier releases. However, > even in older releases, I think there's a remote possibility of data > corruption. Backend #1 makes these changes to the page, releases the > lock, and errors out. Backend #2 writes the page to the OS. DBA > takes a hot backup, tearing the page in the middle of XMAX. Oops. 
I agree the non-atomic, unlogged change is a problem. A related threat doesn't require a torn page:

    AssignTransactionId() xid=123
    heapam.c:3931 HeapTupleHeaderSetXmax(oldtup.t_data, 123);
    some ERROR before heap_update() finishes
    rollback;  -- xid=123
    some backend flushes the modified page
    immediate shutdown
    AssignTransactionId() xid=123
    commit;  -- xid=123

If nothing wrote an xlog record that witnesses xid 123, the cluster can reassign it after recovery. The failed update is now considered a successful update, and the row improperly becomes dead. That's important. I don't know whether the 9.6 all-frozen mechanism materially amplifies the consequences of this bug. The interaction with the visibility map and freeze map is not all bad; indeed, it can reduce the risk of experiencing consequences from the non-atomic, unlogged change bug. If the row is all-visible when heap_update() starts, every transaction should continue to consider the row visible until heap_update() finishes successfully. If an ERROR interrupts heap_update(), visibility verdicts should be as though the heap_update() never happened. If one of the previously-described mechanisms would make an xmax visibility test give the wrong answer, an all-visible bit could mask the problem for a while. Having said that, the freeze map hurts in scenarios involving toast_insert_or_update() failures and no crash recovery. Instead of VACUUM cleaning up the aborted xmax, that xmax could persist long enough for its xid to be reused in a successful transaction. When some other modification finally clears all-frozen and all-visible, the row improperly becomes dead. Both scenarios are fairly rare; I don't know which is more rare. [Disclaimer: I have not built test cases to verify those alleged failure mechanisms.] If we made this pre-9.6 bug a 9.6 open item, would anyone volunteer to own it? Then we wouldn't need to guess whether 9.6 will be safer with the freeze map or safer without the freeze map.
Thanks, nm
Re: [HACKERS] Reviewing freeze map code
On Wed, Jun 15, 2016 at 9:59 AM, Amit Kapila wrote:
>> That just kicks the can down the road. Then you have PD_ALL_VISIBLE
>> clear but the VM bit is still set.
>
> I mean to say clear both as we are doing currently in code:
> if (PageIsAllVisible(BufferGetPage(buffer)))
> {
>     all_visible_cleared = true;
>     PageClearAllVisible(BufferGetPage(buffer));
>     visibilitymap_clear(relation, BufferGetBlockNumber(buffer),
>                         vmbuffer);
> }

Sure, but without emitting a WAL record, that's just broken. You could have the heap page get flushed to disk and the VM page not get flushed to disk, and then crash, and now you have the classic VM corruption scenario.

>> And you still haven't WAL-logged
>> anything.
>
> Yeah, I think WAL requirement is more difficult to meet and I think
> releasing the lock on buffer before writing WAL could lead to flush of such
> a buffer before WAL.
>
> I feel this is an existing-bug and should go to Older Bugs Section in open
> items page.

It does seem to be an existing bug, but the freeze map makes the problem more serious, I think.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Reviewing freeze map code
On Wed, Jun 15, 2016 at 10:03 AM, Masahiko Sawada wrote: >> I'm not sure what to do about this: this part of the heap_update() >> logic has been like this forever, and I assume that if it were easy to >> refactor this away, somebody would have done it by now. > > How about changing collect_corrupt_items to acquire > AccessExclusiveLock for safely checking? Well, that would make it a lot less likely for pg_check_{visible,frozen} to detect the bug in heap_update(), but it wouldn't fix the bug in heap_update(). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Reviewing freeze map code
On Wed, Jun 15, 2016 at 9:56 PM, Robert Haas wrote: > On Wed, Jun 15, 2016 at 2:41 AM, Thomas Munro > wrote: >> I spent some time chasing down the exact circumstances. I suspect >> that there may be an interlocking problem in heap_update. Using the >> line numbers from cae1c788 [1], I see the following interaction >> between the VACUUM, UPDATE and SELECT (pg_check_visible) backends, all >> in reference to the same block number: >> >> [VACUUM] sets all visible bit >> >> [UPDATE] heapam.c:3931 HeapTupleHeaderSetXmax(oldtup.t_data, >> xmax_old_tuple); >> [UPDATE] heapam.c:3938 LockBuffer(buffer, BUFFER_LOCK_UNLOCK); >> >> [SELECT] LockBuffer(buffer, BUFFER_LOCK_SHARE); >> [SELECT] observes VM_ALL_VISIBLE as true >> [SELECT] observes tuple in HEAPTUPLE_DELETE_IN_PROGRESS state >> [SELECT] barfs >> >> [UPDATE] heapam.c:4116 visibilitymap_clear(...) > > Yikes: heap_update() sets the tuple's XMAX, CMAX, infomask, infomask2, > and CTID without logging anything or clearing the all-visible flag and > then releases the lock on the heap page to go do some more work that > might even ERROR out. Only if that other work all goes OK do we > relock the page and perform the WAL-logged actions. > > That doesn't seem like a good idea even in existing releases, because > you've taken a tuple on an all-visible page and made it not > all-visible, and you've made a page modification that is not > necessarily atomic without logging it. This is is particularly bad in > 9.6, because if that page is also all-frozen then XMAX will eventually > be pointing into space and VACUUM will never visit the page to > re-freeze it the way it would have done in earlier releases. However, > even in older releases, I think there's a remote possibility of data > corruption. Backend #1 makes these changes to the page, releases the > lock, and errors out. Backend #2 writes the page to the OS. DBA > takes a hot backup, tearing the page in the middle of XMAX. Oops. 
> > I'm not sure what to do about this: this part of the heap_update() > logic has been like this forever, and I assume that if it were easy to > refactor this away, somebody would have done it by now. > How about changing collect_corrupt_items to acquire AccessExclusiveLock for safely checking? Regards, -- Masahiko Sawada
Re: [HACKERS] Reviewing freeze map code
On Wed, Jun 15, 2016 at 7:13 PM, Robert Haas wrote: > > On Wed, Jun 15, 2016 at 9:43 AM, Amit Kapila wrote: > > On Wed, Jun 15, 2016 at 6:26 PM, Robert Haas wrote: > >> On Wed, Jun 15, 2016 at 2:41 AM, Thomas Munro > >> wrote: > >> > I spent some time chasing down the exact circumstances. I suspect > >> > that there may be an interlocking problem in heap_update. Using the > >> > line numbers from cae1c788 [1], I see the following interaction > >> > between the VACUUM, UPDATE and SELECT (pg_check_visible) backends, all > >> > in reference to the same block number: > >> > > >> > [VACUUM] sets all visible bit > >> > > >> > [UPDATE] heapam.c:3931 HeapTupleHeaderSetXmax(oldtup.t_data, > >> > xmax_old_tuple); > >> > [UPDATE] heapam.c:3938 LockBuffer(buffer, BUFFER_LOCK_UNLOCK); > >> > > >> > [SELECT] LockBuffer(buffer, BUFFER_LOCK_SHARE); > >> > [SELECT] observes VM_ALL_VISIBLE as true > >> > [SELECT] observes tuple in HEAPTUPLE_DELETE_IN_PROGRESS state > >> > [SELECT] barfs > >> > > >> > [UPDATE] heapam.c:4116 visibilitymap_clear(...) > >> > >> Yikes: heap_update() sets the tuple's XMAX, CMAX, infomask, infomask2, > >> and CTID without logging anything or clearing the all-visible flag and > >> then releases the lock on the heap page to go do some more work that > >> might even ERROR out. > > > > Can't we clear the all-visible flag before releasing the lock? We can use > > logic of already_marked as it is currently used in code to clear it just > > once. > > That just kicks the can down the road. Then you have PD_ALL_VISIBLE > clear but the VM bit is still set. I mean to say clear both as we are doing currently in code: if (PageIsAllVisible(BufferGetPage(buffer))) { all_visible_cleared = true; PageClearAllVisible(BufferGetPage(buffer)); visibilitymap_clear(relation, BufferGetBlockNumber(buffer), vmbuffer); } > > And you still haven't WAL-logged > anything. 
> 

Yeah, I think the WAL requirement is more difficult to meet, and I think releasing the lock on the buffer before writing WAL could lead to a flush of such a buffer before the WAL. I feel this is an existing bug and should go to the Older Bugs section of the open items page.

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Reviewing freeze map code
On Wed, Jun 15, 2016 at 9:43 AM, Amit Kapila wrote: > On Wed, Jun 15, 2016 at 6:26 PM, Robert Haas wrote: >> On Wed, Jun 15, 2016 at 2:41 AM, Thomas Munro >> wrote: >> > I spent some time chasing down the exact circumstances. I suspect >> > that there may be an interlocking problem in heap_update. Using the >> > line numbers from cae1c788 [1], I see the following interaction >> > between the VACUUM, UPDATE and SELECT (pg_check_visible) backends, all >> > in reference to the same block number: >> > >> > [VACUUM] sets all visible bit >> > >> > [UPDATE] heapam.c:3931 HeapTupleHeaderSetXmax(oldtup.t_data, >> > xmax_old_tuple); >> > [UPDATE] heapam.c:3938 LockBuffer(buffer, BUFFER_LOCK_UNLOCK); >> > >> > [SELECT] LockBuffer(buffer, BUFFER_LOCK_SHARE); >> > [SELECT] observes VM_ALL_VISIBLE as true >> > [SELECT] observes tuple in HEAPTUPLE_DELETE_IN_PROGRESS state >> > [SELECT] barfs >> > >> > [UPDATE] heapam.c:4116 visibilitymap_clear(...) >> >> Yikes: heap_update() sets the tuple's XMAX, CMAX, infomask, infomask2, >> and CTID without logging anything or clearing the all-visible flag and >> then releases the lock on the heap page to go do some more work that >> might even ERROR out. > > Can't we clear the all-visible flag before releasing the lock? We can use > logic of already_marked as it is currently used in code to clear it just > once. That just kicks the can down the road. Then you have PD_ALL_VISIBLE clear but the VM bit is still set. And you still haven't WAL-logged anything. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Reviewing freeze map code
On Wed, Jun 15, 2016 at 6:26 PM, Robert Haas wrote: > > On Wed, Jun 15, 2016 at 2:41 AM, Thomas Munro > wrote: > > I spent some time chasing down the exact circumstances. I suspect > > that there may be an interlocking problem in heap_update. Using the > > line numbers from cae1c788 [1], I see the following interaction > > between the VACUUM, UPDATE and SELECT (pg_check_visible) backends, all > > in reference to the same block number: > > > > [VACUUM] sets all visible bit > > > > [UPDATE] heapam.c:3931 HeapTupleHeaderSetXmax(oldtup.t_data, xmax_old_tuple); > > [UPDATE] heapam.c:3938 LockBuffer(buffer, BUFFER_LOCK_UNLOCK); > > > > [SELECT] LockBuffer(buffer, BUFFER_LOCK_SHARE); > > [SELECT] observes VM_ALL_VISIBLE as true > > [SELECT] observes tuple in HEAPTUPLE_DELETE_IN_PROGRESS state > > [SELECT] barfs > > > > [UPDATE] heapam.c:4116 visibilitymap_clear(...) > > Yikes: heap_update() sets the tuple's XMAX, CMAX, infomask, infomask2, > and CTID without logging anything or clearing the all-visible flag and > then releases the lock on the heap page to go do some more work that > might even ERROR out. > Can't we clear the all-visible flag before releasing the lock? We can use logic of already_marked as it is currently used in code to clear it just once. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Reviewing freeze map code
On Wed, Jun 15, 2016 at 2:41 AM, Thomas Munro wrote:
> I spent some time chasing down the exact circumstances. I suspect
> that there may be an interlocking problem in heap_update. Using the
> line numbers from cae1c788 [1], I see the following interaction
> between the VACUUM, UPDATE and SELECT (pg_check_visible) backends, all
> in reference to the same block number:
>
> [VACUUM] sets all visible bit
>
> [UPDATE] heapam.c:3931 HeapTupleHeaderSetXmax(oldtup.t_data, xmax_old_tuple);
> [UPDATE] heapam.c:3938 LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
>
> [SELECT] LockBuffer(buffer, BUFFER_LOCK_SHARE);
> [SELECT] observes VM_ALL_VISIBLE as true
> [SELECT] observes tuple in HEAPTUPLE_DELETE_IN_PROGRESS state
> [SELECT] barfs
>
> [UPDATE] heapam.c:4116 visibilitymap_clear(...)

Yikes: heap_update() sets the tuple's XMAX, CMAX, infomask, infomask2, and CTID without logging anything or clearing the all-visible flag and then releases the lock on the heap page to go do some more work that might even ERROR out. Only if that other work all goes OK do we relock the page and perform the WAL-logged actions.

That doesn't seem like a good idea even in existing releases, because you've taken a tuple on an all-visible page and made it not all-visible, and you've made a page modification that is not necessarily atomic without logging it. This is particularly bad in 9.6, because if that page is also all-frozen then XMAX will eventually be pointing into space and VACUUM will never visit the page to re-freeze it the way it would have done in earlier releases. However, even in older releases, I think there's a remote possibility of data corruption. Backend #1 makes these changes to the page, releases the lock, and errors out. Backend #2 writes the page to the OS. The DBA takes a hot backup, tearing the page in the middle of XMAX. Oops.
I'm not sure what to do about this: this part of the heap_update() logic has been like this forever, and I assume that if it were easy to refactor this away, somebody would have done it by now. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Reviewing freeze map code
On Wed, Jun 15, 2016 at 11:43 AM, Thomas Munro wrote: > On Wed, Jun 15, 2016 at 12:44 AM, Robert Haas wrote: >> On Tue, Jun 14, 2016 at 8:11 AM, Robert Haas wrote: > I noticed that the tuples that it reported were always offset 1 in a > page, and that the page always had a maxoff over a couple of hundred, > and that we called record_corrupt_item because VM_ALL_VISIBLE returned > true but HeapTupleSatisfiesVacuum on the first tuple returned > HEAPTUPLE_DELETE_IN_PROGRESS instead of the expected HEAPTUPLE_LIVE. > It did that because HEAP_XMAX_COMMITTED was not set and > TransactionIdIsInProgress returned true for xmax. So this seems like it might be a visibility map bug rather than a bug in the test code, but I'm not completely sure of that. How was it legitimate to mark the page as all-visible if a tuple on the page still had a live xmax? If xmax is live and not just a locker then the tuple is not visible to the transaction that wrote xmax, at least. >>> >>> Ah, wait a minute. I see how this could happen. Hang on, let me >>> update the pg_visibility patch. >> >> The problem should be fixed in the attached revision of >> pg_check_visible. I think what happened is: >> >> 1. pg_check_visible computed an OldestXmin. >> 2. Some transaction committed. >> 3. VACUUM computed a newer OldestXmin and marked a page all-visible with it. >> 4. pg_check_visible then used its older OldestXmin to check the >> visibility of tuples on that page, and saw delete-in-progress as a >> result. >> >> I added a guard against a similar scenario involving xmin in the last >> version of this patch, but forgot that we need to protect xmax in the >> same way. With this version of the patch, I can no longer get any >> TIDs to pop up out of pg_check_visible in my testing. (I haven't run >> your test script for lack of the proper Python environment...) > > I can still reproduce the problem with this new patch. 
What I see is > that the OldestXmin, the new RecomputedOldestXmin and the tuple's xmax > are all the same.

I spent some time chasing down the exact circumstances. I suspect that there may be an interlocking problem in heap_update. Using the line numbers from cae1c788 [1], I see the following interaction between the VACUUM, UPDATE and SELECT (pg_check_visible) backends, all in reference to the same block number:

    [VACUUM] sets all visible bit

    [UPDATE] heapam.c:3931 HeapTupleHeaderSetXmax(oldtup.t_data, xmax_old_tuple);
    [UPDATE] heapam.c:3938 LockBuffer(buffer, BUFFER_LOCK_UNLOCK);

    [SELECT] LockBuffer(buffer, BUFFER_LOCK_SHARE);
    [SELECT] observes VM_ALL_VISIBLE as true
    [SELECT] observes tuple in HEAPTUPLE_DELETE_IN_PROGRESS state
    [SELECT] barfs

    [UPDATE] heapam.c:4116 visibilitymap_clear(...)

[1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/access/heap/heapam.c;hb=cae1c788b9b43887e4a4fa51a11c3a8ffa334939

-- Thomas Munro http://www.enterprisedb.com
Re: [HACKERS] Reviewing freeze map code
On Wed, Jun 15, 2016 at 12:44 AM, Robert Haas wrote: > On Tue, Jun 14, 2016 at 8:11 AM, Robert Haas wrote: I noticed that the tuples that it reported were always offset 1 in a page, and that the page always had a maxoff over a couple of hundred, and that we called record_corrupt_item because VM_ALL_VISIBLE returned true but HeapTupleSatisfiesVacuum on the first tuple returned HEAPTUPLE_DELETE_IN_PROGRESS instead of the expected HEAPTUPLE_LIVE. It did that because HEAP_XMAX_COMMITTED was not set and TransactionIdIsInProgress returned true for xmax. >>> >>> So this seems like it might be a visibility map bug rather than a bug >>> in the test code, but I'm not completely sure of that. How was it >>> legitimate to mark the page as all-visible if a tuple on the page >>> still had a live xmax? If xmax is live and not just a locker then the >>> tuple is not visible to the transaction that wrote xmax, at least. >> >> Ah, wait a minute. I see how this could happen. Hang on, let me >> update the pg_visibility patch. > > The problem should be fixed in the attached revision of > pg_check_visible. I think what happened is: > > 1. pg_check_visible computed an OldestXmin. > 2. Some transaction committed. > 3. VACUUM computed a newer OldestXmin and marked a page all-visible with it. > 4. pg_check_visible then used its older OldestXmin to check the > visibility of tuples on that page, and saw delete-in-progress as a > result. > > I added a guard against a similar scenario involving xmin in the last > version of this patch, but forgot that we need to protect xmax in the > same way. With this version of the patch, I can no longer get any > TIDs to pop up out of pg_check_visible in my testing. (I haven't run > your test script for lack of the proper Python environment...) I can still reproduce the problem with this new patch. What I see is that the OldestXmin, the new RecomputedOldestXmin and the tuple's xmax are all the same. 
-- Thomas Munro http://www.enterprisedb.com
Re: [HACKERS] Reviewing freeze map code
On Tue, Jun 14, 2016 at 8:11 AM, Robert Haas wrote:
>>> I noticed that the tuples that it reported were always offset 1 in a
>>> page, and that the page always had a maxoff over a couple of hundred,
>>> and that we called record_corrupt_item because VM_ALL_VISIBLE returned
>>> true but HeapTupleSatisfiesVacuum on the first tuple returned
>>> HEAPTUPLE_DELETE_IN_PROGRESS instead of the expected HEAPTUPLE_LIVE.
>>> It did that because HEAP_XMAX_COMMITTED was not set and
>>> TransactionIdIsInProgress returned true for xmax.
>>
>> So this seems like it might be a visibility map bug rather than a bug
>> in the test code, but I'm not completely sure of that. How was it
>> legitimate to mark the page as all-visible if a tuple on the page
>> still had a live xmax? If xmax is live and not just a locker then the
>> tuple is not visible to the transaction that wrote xmax, at least.
>
> Ah, wait a minute. I see how this could happen. Hang on, let me
> update the pg_visibility patch.

The problem should be fixed in the attached revision of pg_check_visible. I think what happened is:

1. pg_check_visible computed an OldestXmin.
2. Some transaction committed.
3. VACUUM computed a newer OldestXmin and marked a page all-visible with it.
4. pg_check_visible then used its older OldestXmin to check the visibility of tuples on that page, and saw delete-in-progress as a result.

I added a guard against a similar scenario involving xmin in the last version of this patch, but forgot that we need to protect xmax in the same way. With this version of the patch, I can no longer get any TIDs to pop up out of pg_check_visible in my testing. (I haven't run your test script for lack of the proper Python environment...)
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company From 18815b0d6fcfc2048e47f104ef85ee981687d4de Mon Sep 17 00:00:00 2001 From: Robert Haas Date: Fri, 10 Jun 2016 14:42:46 -0400 Subject: [PATCH 1/2] Add integrity-checking functions to pg_visibility. The new pg_check_visible() and pg_check_frozen() functions can be used to verify that the visibility map bits for a relation's data pages match the actual state of the tuples on those pages. Amit Kapila and Robert Haas, reviewed by Andres Freund. Additional testing help by Thomas Munro. --- contrib/pg_visibility/Makefile| 2 +- contrib/pg_visibility/pg_visibility--1.0--1.1.sql | 17 ++ contrib/pg_visibility/pg_visibility--1.0.sql | 52 contrib/pg_visibility/pg_visibility--1.1.sql | 67 + contrib/pg_visibility/pg_visibility.c | 313 ++ contrib/pg_visibility/pg_visibility.control | 2 +- doc/src/sgml/pgvisibility.sgml| 28 +- src/tools/pgindent/typedefs.list | 1 + 8 files changed, 427 insertions(+), 55 deletions(-) create mode 100644 contrib/pg_visibility/pg_visibility--1.0--1.1.sql delete mode 100644 contrib/pg_visibility/pg_visibility--1.0.sql create mode 100644 contrib/pg_visibility/pg_visibility--1.1.sql diff --git a/contrib/pg_visibility/Makefile b/contrib/pg_visibility/Makefile index fbbaa2e..379591a 100644 --- a/contrib/pg_visibility/Makefile +++ b/contrib/pg_visibility/Makefile @@ -4,7 +4,7 @@ MODULE_big = pg_visibility OBJS = pg_visibility.o $(WIN32RES) EXTENSION = pg_visibility -DATA = pg_visibility--1.0.sql +DATA = pg_visibility--1.1.sql pg_visibility--1.0--1.1.sql PGFILEDESC = "pg_visibility - page visibility information" ifdef USE_PGXS diff --git a/contrib/pg_visibility/pg_visibility--1.0--1.1.sql b/contrib/pg_visibility/pg_visibility--1.0--1.1.sql new file mode 100644 index 000..2c97dfd --- /dev/null +++ b/contrib/pg_visibility/pg_visibility--1.0--1.1.sql @@ -0,0 +1,17 @@ +/* contrib/pg_visibility/pg_visibility--1.0--1.1.sql */ + +-- complain if script is sourced in psql, 
rather than via ALTER EXTENSION +\echo Use "ALTER EXTENSION pg_visibility UPDATE TO '1.1'" to load this file. \quit + +CREATE FUNCTION pg_check_frozen(regclass, t_ctid OUT tid) +RETURNS SETOF tid +AS 'MODULE_PATHNAME', 'pg_check_frozen' +LANGUAGE C STRICT; + +CREATE FUNCTION pg_check_visible(regclass, t_ctid OUT tid) +RETURNS SETOF tid +AS 'MODULE_PATHNAME', 'pg_check_visible' +LANGUAGE C STRICT; + +REVOKE ALL ON FUNCTION pg_check_frozen(regclass) FROM PUBLIC; +REVOKE ALL ON FUNCTION pg_check_visible(regclass) FROM PUBLIC; diff --git a/contrib/pg_visibility/pg_visibility--1.0.sql b/contrib/pg_visibility/pg_visibility--1.0.sql deleted file mode 100644 index da511e5..000 --- a/contrib/pg_visibility/pg_visibility--1.0.sql +++ /dev/null @@ -1,52 +0,0 @@ -/* contrib/pg_visibility/pg_visibility--1.0.sql */ - --- complain if script is sourced in psql, rather than via CREATE EXTENSION -\echo Use "CREATE EXTENSION pg_visibility" to load this file. \quit - --- Show visibility map information. -CREATE FUNCTION pg_visibility_map(regclass, blkno bigint, - all_visible OUT boolean, - all_frozen OUT
Re: [HACKERS] Reviewing freeze map code
On Tue, Jun 14, 2016 at 8:08 AM, Robert Haas wrote: > On Tue, Jun 14, 2016 at 2:53 AM, Thomas Munro > wrote: >> On Tue, Jun 14, 2016 at 4:02 AM, Robert Haas wrote: >>> On Sat, Jun 11, 2016 at 5:00 PM, Robert Haas wrote: > How about changing the return tuple of heap_prepare_freeze_tuple to > a bitmap? Two flags: "Freeze [not] done" and "[No] more freezing > needed" Yes, I think something like that sounds about right. >>> >>> Here's a patch. I took the approach of adding a separate bool out >>> parameter instead. I am also attaching an update of the >>> check-visibility patch which responds to assorted review comments and >>> adjusting it for the problems found on Friday which could otherwise >>> lead to false positives. I'm still getting occasional TIDs from the >>> pg_check_visible() function during pgbench runs, though, so evidently >>> not all is well with the world. >> >> I'm still working out how half this stuff works, but I managed to get >> pg_check_visible() to spit out a row every few seconds with the >> following brute force approach: >> >> CREATE TABLE foo (n int); >> INSERT INTO foo SELECT generate_series(1, 10); >> >> Three client threads (see attached script): >> 1. Run VACUUM in a tight loop. >> 2. Run UPDATE foo SET n = n + 1 in a tight loop. >> 3. Run SELECT pg_check_visible('foo'::regclass) in a tight loop, and >> print out any rows it produces. >> >> I noticed that the tuples that it reported were always offset 1 in a >> page, and that the page always had a maxoff over a couple of hundred, >> and that we called record_corrupt_item because VM_ALL_VISIBLE returned >> true but HeapTupleSatisfiesVacuum on the first tuple returned >> HEAPTUPLE_DELETE_IN_PROGRESS instead of the expected HEAPTUPLE_LIVE. >> It did that because HEAP_XMAX_COMMITTED was not set and >> TransactionIdIsInProgress returned true for xmax. > > So this seems like it might be a visibility map bug rather than a bug > in the test code, but I'm not completely sure of that. 
How was it > legitimate to mark the page as all-visible if a tuple on the page > still had a live xmax? If xmax is live and not just a locker then the > tuple is not visible to the transaction that wrote xmax, at least. Ah, wait a minute. I see how this could happen. Hang on, let me update the pg_visibility patch. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company