Re: pgsql: Compute XID horizon for page level index vacuum on primary.

2019-05-15 Thread Thomas Munro
On Thu, May 16, 2019 at 3:53 AM Tom Lane  wrote:
> Andres Freund  writes:
> > On 2019-05-15 12:01:07 +1200, Thomas Munro wrote:
> >> This is listed as an open item to resolve for 12.  IIUC the options on
> >> the table are:
> >>
> >> 1.  Do nothing, and ship with effective_io_concurrency + 10.
> >> 2.  Just use effective_io_concurrency without the hardwired boost.
> >> 3.  Switch to a new GUC maintenance_io_concurrency (or some better name).
> >>
> >> I vote for option 3.  I have no clue how to set it, but at least users
> >> have a fighting chance of experimenting and figuring it out that way.
> >> I volunteer to write the patch if we get a consensus.
>
> > I'd personally, unsurprisingly perhaps, go with 1 for v12. I think 3 is
> > also a good option - it's easy to imagine later using it for
> > VACUUM, ANALYZE and the like.  I think 2 is a bad idea.
>
> FWIW, I also agree with settling for #1 at this point.  A new GUC would
> make more sense if we have multiple use-cases for it, which we probably
> will at some point, but not today.  I'm concerned that if we invent a
> GUC now, we might find out that it's not really usable for other cases
> in future (e.g., default value is no good for other cases).  It's the
> old story that inventing an API with only one use-case in mind leads
> to a bad API.
>
> So yeah, let's leave this be for now, ugly as it is.  Improving it
> can be future work.

Cool, I moved it to the resolved section.

-- 
Thomas Munro
https://enterprisedb.com




Re: pgsql: Compute XID horizon for page level index vacuum on primary.

2019-05-15 Thread Tom Lane
Andres Freund  writes:
> On 2019-05-15 12:01:07 +1200, Thomas Munro wrote:
>> This is listed as an open item to resolve for 12.  IIUC the options on
>> the table are:
>> 
>> 1.  Do nothing, and ship with effective_io_concurrency + 10.
>> 2.  Just use effective_io_concurrency without the hardwired boost.
>> 3.  Switch to a new GUC maintenance_io_concurrency (or some better name).
>> 
>> I vote for option 3.  I have no clue how to set it, but at least users
>> have a fighting chance of experimenting and figuring it out that way.
>> I volunteer to write the patch if we get a consensus.

> I'd personally, unsurprisingly perhaps, go with 1 for v12. I think 3 is
> also a good option - it's easy to imagine later using it for
> VACUUM, ANALYZE and the like.  I think 2 is a bad idea.

FWIW, I also agree with settling for #1 at this point.  A new GUC would
make more sense if we have multiple use-cases for it, which we probably
will at some point, but not today.  I'm concerned that if we invent a
GUC now, we might find out that it's not really usable for other cases
in future (e.g., default value is no good for other cases).  It's the
old story that inventing an API with only one use-case in mind leads
to a bad API.

So yeah, let's leave this be for now, ugly as it is.  Improving it
can be future work.

regards, tom lane




Re: pgsql: Compute XID horizon for page level index vacuum on primary.

2019-05-14 Thread Andres Freund
Hi,

On 2019-05-15 12:01:07 +1200, Thomas Munro wrote:
> On Thu, May 2, 2019 at 5:10 AM Robert Haas  wrote:
> > On Wed, May 1, 2019 at 12:50 PM Tom Lane  wrote:
> > > > Not strongly enough to argue about it very hard.  The current behavior
> > > > is a little weird, but it's a long way from being the weirdest thing
> > > > we ship, and it appears that we have no tangible evidence that it
> > > > causes a problem in practice.
> > >
> > > I think there's nothing that fails to suck about a hardwired "+ 10".
> >
> > It avoids a performance regression without adding another GUC.
> >
> > That may not be enough reason to keep it like that, but it is one
> > thing that does fail to suck.
> 
> This is listed as an open item to resolve for 12.  IIUC the options on
> the table are:
> 
> 1.  Do nothing, and ship with effective_io_concurrency + 10.
> 2.  Just use effective_io_concurrency without the hardwired boost.
> 3.  Switch to a new GUC maintenance_io_concurrency (or some better name).
> 
> The rationale for using a different number is that this backend is
> working on behalf of multiple sessions, so you might want to give it
> some more juice, much like maintenance_work_mem.
> 
> I vote for option 3.  I have no clue how to set it, but at least users
> have a fighting chance of experimenting and figuring it out that way.
> I volunteer to write the patch if we get a consensus.

I'd personally, unsurprisingly perhaps, go with 1 for v12. I think 3 is
also a good option - it's easy to imagine later using it for
VACUUM, ANALYZE and the like.  I think 2 is a bad idea.

Greetings,

Andres Freund




Re: pgsql: Compute XID horizon for page level index vacuum on primary.

2019-05-14 Thread Thomas Munro
On Thu, May 2, 2019 at 5:10 AM Robert Haas  wrote:
> On Wed, May 1, 2019 at 12:50 PM Tom Lane  wrote:
> > > Not strongly enough to argue about it very hard.  The current behavior
> > > is a little weird, but it's a long way from being the weirdest thing
> > > we ship, and it appears that we have no tangible evidence that it
> > > causes a problem in practice.
> >
> > I think there's nothing that fails to suck about a hardwired "+ 10".
>
> It avoids a performance regression without adding another GUC.
>
> That may not be enough reason to keep it like that, but it is one
> thing that does fail to suck.

This is listed as an open item to resolve for 12.  IIUC the options on
the table are:

1.  Do nothing, and ship with effective_io_concurrency + 10.
2.  Just use effective_io_concurrency without the hardwired boost.
3.  Switch to a new GUC maintenance_io_concurrency (or some better name).

The rationale for using a different number is that this backend is
working on behalf of multiple sessions, so you might want to give it
some more juice, much like maintenance_work_mem.

I vote for option 3.  I have no clue how to set it, but at least users
have a fighting chance of experimenting and figuring it out that way.
I volunteer to write the patch if we get a consensus.
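
In configuration terms, option 3 would amount to something like the
following postgresql.conf fragment (both the GUC name and its default are
exactly what is under discussion, so treat them as placeholders):

```ini
# existing knob, sized for per-backend prefetching such as bitmap heap scans
effective_io_concurrency = 1

# hypothetical new knob for bulk work done on behalf of many sessions,
# by analogy with maintenance_work_mem (name and default undecided)
maintenance_io_concurrency = 10
```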

-- 
Thomas Munro
https://enterprisedb.com




Re: pgsql: Compute XID horizon for page level index vacuum on primary.

2019-05-01 Thread Robert Haas
On Wed, May 1, 2019 at 12:50 PM Tom Lane  wrote:
> > Not strongly enough to argue about it very hard.  The current behavior
> > is a little weird, but it's a long way from being the weirdest thing
> > we ship, and it appears that we have no tangible evidence that it
> > causes a problem in practice.
>
> I think there's nothing that fails to suck about a hardwired "+ 10".

It avoids a performance regression without adding another GUC.

That may not be enough reason to keep it like that, but it is one
thing that does fail to suck.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: pgsql: Compute XID horizon for page level index vacuum on primary.

2019-05-01 Thread Tom Lane
Robert Haas  writes:
> On Wed, May 1, 2019 at 12:15 PM Andres Freund  wrote:
>> My current inclination is to not do anything for v12. Robert, do you
>> disagree?

> Not strongly enough to argue about it very hard.  The current behavior
> is a little weird, but it's a long way from being the weirdest thing
> we ship, and it appears that we have no tangible evidence that it
> causes a problem in practice.

I think there's nothing that fails to suck about a hardwired "+ 10".

We should either remove that and use effective_io_concurrency as-is,
or decide that it's worth having a separate GUC for bulk operations.
At this stage of the cycle I'd incline to the former, but if somebody
is excited enough to prepare a patch for a new GUC, I wouldn't push
back on that solution.

regards, tom lane




Re: pgsql: Compute XID horizon for page level index vacuum on primary.

2019-05-01 Thread Robert Haas
On Wed, May 1, 2019 at 12:15 PM Andres Freund  wrote:
> On 2019-04-01 18:26:59 -0700, Andres Freund wrote:
> > I'm not yet convinced it's necessary to create a new GUC, but also not
> > strongly opposed.  I've created an open items issue for it, so we don't
> > forget.
>
> My current inclination is to not do anything for v12. Robert, do you
> disagree?

Not strongly enough to argue about it very hard.  The current behavior
is a little weird, but it's a long way from being the weirdest thing
we ship, and it appears that we have no tangible evidence that it
causes a problem in practice.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: pgsql: Compute XID horizon for page level index vacuum on primary.

2019-05-01 Thread Andres Freund
Hi,

On 2019-04-01 18:26:59 -0700, Andres Freund wrote:
> I'm not yet convinced it's necessary to create a new GUC, but also not
> strongly opposed.  I've created an open items issue for it, so we don't
> forget.

My current inclination is to not do anything for v12. Robert, do you
disagree?

Greetings,

Andres Freund




Re: pgsql: Compute XID horizon for page level index vacuum on primary.

2019-04-01 Thread Andres Freund
Hi,

On 2019-03-30 11:44:36 -0400, Robert Haas wrote:
> On Sat, Mar 30, 2019 at 6:33 AM Thomas Munro  wrote:
> > I didn't understand that last sentence.
> >
> > Here's an attempt to write a suitable comment for the quick fix.  And
> > I suppose effective_io_concurrency is a reasonable default.
> >
> > It's pretty hard to think of a good way to get your hands on the real
> > value safely from here.  I wondered if there was a way to narrow this
> > to just GLOBALTABLESPACE_OID since that's where pg_tablespace lives,
> > but that doesn't work, we access other catalog too in that path.
> >
> > Hmm, it seems a bit odd that 0 is supposed to mean "disable issuance
> > of asynchronous I/O requests" according to config.sgml, but here 0
> > will prefetch 10 buffers.
> 
> Mmmph.  I'm starting to think we're not going to get a satisfactory
> result here unless we make this controlled by something other than
> effective_io_concurrency.  There's just no reason to suppose that the
> same setting that we use to control prefetching for bitmap index scans
> is also going to be right for what's basically a bulk operation.
> 
> Interestingly, Dilip Kumar ran into similar issues recently while
> working on bulk processing for undo records for zheap.  In that case,
> you definitely want to prefetch the undo aggressively, because you're
> reading it front to back and backwards scans suck without prefetching.
> And you possibly also want to prefetch the data pages to which the
> undo that you are prefetching applies, but maybe not as aggressively
> because you're going to be doing a WAL write for each data page and
> flooding the system with too many reads could be counterproductive, at
> least if pg_wal and the rest of $PGDATA are not on separate spindles.
> And even if they are, it's possible that as you suck in undo pages and
> the zheap pages that they need to update, you might evict dirty pages,
> generating write activity against the data directory.

I'm not yet convinced it's necessary to create a new GUC, but also not
strongly opposed.  I've created an open items issue for it, so we don't
forget.


> Overall I'm inclined to think that we're making the same mistake here
> that we did with work_mem, namely, assuming that you can control a
> bunch of different prefetching behaviors with a single GUC and things
> will be OK.  Let's just create a new GUC for this and default it to 10
> or something and go home.

I agree that we needed to split work_mem, but a) that was far less clear
for many years, and b) there was no logic to use more work_mem in
maintenance-y cases...

Greetings,

Andres Freund




Re: pgsql: Compute XID horizon for page level index vacuum on primary.

2019-04-01 Thread Simon Riggs
On Fri, 29 Mar 2019 at 16:32, Andres Freund  wrote:

> On 2019-03-29 16:20:54 +, Simon Riggs wrote:
> > On Fri, 29 Mar 2019 at 16:12, Andres Freund  wrote:
> >
> > > On 2019-03-29 15:58:14 +, Simon Riggs wrote:
> > > > On Fri, 29 Mar 2019 at 15:29, Andres Freund  wrote:
> > > > > That's far from a trivial feature imo. It seems quite possible
> > > > > that we'd end up with increased overhead, because the current
> > > > > logic can get away with only doing hint bit style writes - but
> > > > > would that be true if we started actually replacing the item
> > > > > pointers? Because I don't see any guarantee they couldn't cross a
> > > > > page boundary etc? So I think we'd need to do WAL logging during
> > > > > index searches, which seems prohibitively expensive.
> > > >
> > > > Don't see that.
> > > >
> > > > I was talking about reusing the first 4 bytes of an index tuple's
> > > > ItemPointerData, which is the first field of an index tuple. Index
> > > > tuples are MAXALIGNed, so I can't see how that would ever cross a
> > > > page boundary.
> > >
> > > They're 8 bytes, and MAXALIGN often is 4 bytes:
> >
> > xids are 4 bytes, so we're good.
>
> I literally went on to explain why that's not sufficient? You can't
> *just* replace the block number with an xid. You *also* need to set a
> flag denoting that it's an xid and dead now. Which can't fit in the same
> 4 bytes.  You either have to set it in the IndexTuple's t_info, or in
> the ItemIdData's lp_flags. And both can be on different sectors.  If
> the flag persists, and the xid doesn't, you're going to interpret a block
> number as an xid - not great; but even worse, if the xid survives, but
> the flag doesn't, you're going to access the xid as a block - definitely
> not ok, because you're going to return wrong results.

Yes, I agree - I was thinking the same thing after my last comment, but was
afk. The idea requires the atomic update of at least 4 bytes plus at least
1 bit, and so would require at least 8-byte MAXALIGN to be useful. Your
other points, suggesting that multiple things all need to be set accurately
for this to work, aren't correct: the idea was that we would write a hint
that would avoid later I/O, so the overall idea is still viable.

Anyway, thinking some more, I think the whole idea of generating
lastRemovedXid is moot and there are better ways in the future of doing
this that would avoid a performance issue altogether. Clearly not PG12.

The main issue relates to the potential overhead of moving this to the
master. I agree it's the right thing to do, but we should have some way of
checking that it is not a performance issue in practice.

-- 
Simon Riggs    http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: pgsql: Compute XID horizon for page level index vacuum on primary.

2019-03-30 Thread Andres Freund
Hi,

On March 30, 2019 5:33:12 PM EDT, Thomas Munro  wrote:
> On Sun, Mar 31, 2019 at 8:20 AM Peter Geoghegan  wrote:
> > On Sat, Mar 30, 2019 at 8:44 AM Robert Haas  wrote:
> > > Overall I'm inclined to think that we're making the same mistake here
> > > that we did with work_mem, namely, assuming that you can control a
> > > bunch of different prefetching behaviors with a single GUC and things
> > > will be OK.  Let's just create a new GUC for this and default it to 10
> > > or something and go home.
> >
> > I agree. If you invent a new GUC, then everybody notices, and it
> > usually has to be justified quite rigorously. There is a strong
> > incentive to use an existing GUC, if only because the problem that
> > this creates is harder to measure than the supposed problem that it
> > avoids. This can perversely work against the goal of making the system
> > easy to use. Stretching the original definition of a GUC is bad.
> >
> > I take issue with the general assumption that not adding a GUC at
> > least makes things easier for users. In reality, it depends entirely
> > on the situation at hand.
>
> I'm not sure I understand why this is any different from the bitmap
> heapscan case though, or in fact why we are adding 10 in this case.
> In both cases we will soon be reading the referenced buffers, and it
> makes sense to queue up prefetch requests for the blocks if they
> aren't already in shared buffers.  In both cases, the number of
> prefetch requests we want to send to the OS is somehow linked to the
> amount of IO requests we think the OS can handle concurrently at once
> (since that's one factor determining how fast it drains them), but
> it's not necessarily the same as that number, AFAICS.  It's useful to
> queue some number of prefetch requests even if you have no IO
> concurrency at all (a single old school spindle), just because the OS
> will chew on that queue in the background while we're also doing
> stuff, which is probably what that "+ 10" is expressing.  But that
> seems to apply to bitmap heapscan too, doesn't it?

The index page deletion code does work on behalf of multiple backends; bitmap
scans don't. If your system is busy it makes sense to limit resource usage of
per-backend work, but not really of work on shared resources like page reuse.
A bit like work_mem vs maintenance_work_mem.

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.




Re: pgsql: Compute XID horizon for page level index vacuum on primary.

2019-03-30 Thread Thomas Munro
On Sun, Mar 31, 2019 at 8:20 AM Peter Geoghegan  wrote:
> On Sat, Mar 30, 2019 at 8:44 AM Robert Haas  wrote:
> > Overall I'm inclined to think that we're making the same mistake here
> > that we did with work_mem, namely, assuming that you can control a
> > bunch of different prefetching behaviors with a single GUC and things
> > will be OK.  Let's just create a new GUC for this and default it to 10
> > or something and go home.
>
> I agree. If you invent a new GUC, then everybody notices, and it
> usually has to be justified quite rigorously. There is a strong
> incentive to use an existing GUC, if only because the problem that
> this creates is harder to measure than the supposed problem that it
> avoids. This can perversely work against the goal of making the system
> easy to use. Stretching the original definition of a GUC is bad.
>
> I take issue with the general assumption that not adding a GUC at
> least makes things easier for users. In reality, it depends entirely
> on the situation at hand.

I'm not sure I understand why this is any different from the bitmap
heapscan case though, or in fact why we are adding 10 in this case.
In both cases we will soon be reading the referenced buffers, and it
makes sense to queue up prefetch requests for the blocks if they
aren't already in shared buffers.  In both cases, the number of
prefetch requests we want to send to the OS is somehow linked to the
amount of IO requests we think the OS can handle concurrently at once
(since that's one factor determining how fast it drains them), but
it's not necessarily the same as that number, AFAICS.  It's useful to
queue some number of prefetch requests even if you have no IO
concurrency at all (a single old school spindle), just because the OS
will chew on that queue in the background while we're also doing
stuff, which is probably what that "+ 10" is expressing.  But that
seems to apply to bitmap heapscan too, doesn't it?

-- 
Thomas Munro
https://enterprisedb.com




Re: pgsql: Compute XID horizon for page level index vacuum on primary.

2019-03-30 Thread Peter Geoghegan
On Sat, Mar 30, 2019 at 8:44 AM Robert Haas  wrote:
> Overall I'm inclined to think that we're making the same mistake here
> that we did with work_mem, namely, assuming that you can control a
> bunch of different prefetching behaviors with a single GUC and things
> will be OK.  Let's just create a new GUC for this and default it to 10
> or something and go home.

I agree. If you invent a new GUC, then everybody notices, and it
usually has to be justified quite rigorously. There is a strong
incentive to use an existing GUC, if only because the problem that
this creates is harder to measure than the supposed problem that it
avoids. This can perversely work against the goal of making the system
easy to use. Stretching the original definition of a GUC is bad.

I take issue with the general assumption that not adding a GUC at
least makes things easier for users. In reality, it depends entirely
on the situation at hand.

-- 
Peter Geoghegan




Re: pgsql: Compute XID horizon for page level index vacuum on primary.

2019-03-30 Thread Robert Haas
On Sat, Mar 30, 2019 at 6:33 AM Thomas Munro  wrote:
> I didn't understand that last sentence.
>
> Here's an attempt to write a suitable comment for the quick fix.  And
> I suppose effective_io_concurrency is a reasonable default.
>
> It's pretty hard to think of a good way to get your hands on the real
> value safely from here.  I wondered if there was a way to narrow this
> to just GLOBALTABLESPACE_OID since that's where pg_tablespace lives,
> but that doesn't work, we access other catalog too in that path.
>
> Hmm, it seems a bit odd that 0 is supposed to mean "disable issuance
> of asynchronous I/O requests" according to config.sgml, but here 0
> will prefetch 10 buffers.

Mmmph.  I'm starting to think we're not going to get a satisfactory
result here unless we make this controlled by something other than
effective_io_concurrency.  There's just no reason to suppose that the
same setting that we use to control prefetching for bitmap index scans
is also going to be right for what's basically a bulk operation.

Interestingly, Dilip Kumar ran into similar issues recently while
working on bulk processing for undo records for zheap.  In that case,
you definitely want to prefetch the undo aggressively, because you're
reading it front to back and backwards scans suck without prefetching.
And you possibly also want to prefetch the data pages to which the
undo that you are prefetching applies, but maybe not as aggressively
because you're going to be doing a WAL write for each data page and
flooding the system with too many reads could be counterproductive, at
least if pg_wal and the rest of $PGDATA are not on separate spindles.
And even if they are, it's possible that as you suck in undo pages and
the zheap pages that they need to update, you might evict dirty pages,
generating write activity against the data directory.

Overall I'm inclined to think that we're making the same mistake here
that we did with work_mem, namely, assuming that you can control a
bunch of different prefetching behaviors with a single GUC and things
will be OK.  Let's just create a new GUC for this and default it to 10
or something and go home.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: pgsql: Compute XID horizon for page level index vacuum on primary.

2019-03-29 Thread Andres Freund
On 2019-03-29 16:20:54 +, Simon Riggs wrote:
> On Fri, 29 Mar 2019 at 16:12, Andres Freund  wrote:
> 
> 
> > On 2019-03-29 15:58:14 +, Simon Riggs wrote:
> > > On Fri, 29 Mar 2019 at 15:29, Andres Freund  wrote:
> > > > That's far from a trivial feature imo. It seems quite possible that
> > we'd
> > > > end up with increased overhead, because the current logic can get away
> > > > with only doing hint bit style writes - but would that be true if we
> > > > started actually replacing the item pointers? Because I don't see any
> > > > guarantee they couldn't cross a page boundary etc? So I think we'd need
> > > > to do WAL logging during index searches, which seems prohibitively
> > > > expensive.
> > > >
> > >
> > > Don't see that.
> > >
> > > I was talking about reusing the first 4 bytes of an index tuple's
> > > ItemPointerData,
> > > which is the first field of an index tuple. Index tuples are MAXALIGNed,
> > so
> > > I can't see how that would ever cross a page boundary.
> >
> > They're 8 bytes, and MAXALIGN often is 4 bytes:
> >
> 
> xids are 4 bytes, so we're good.

I literally went on to explain why that's not sufficient? You can't
*just* replace the block number with an xid. You *also* need to set a
flag denoting that it's an xid and dead now. Which can't fit in the same
4 bytes.  You either have to set it in the IndexTuple's t_info, or in
the ItemIdData's lp_flags. And both can be on different sectors.  If
the flag persists, and the xid doesn't, you're going to interpret a block
number as an xid - not great; but even worse, if the xid survives, but
the flag doesn't, you're going to access the xid as a block - definitely
not ok, because you're going to return wrong results.

Greetings,

Andres Freund




Re: pgsql: Compute XID horizon for page level index vacuum on primary.

2019-03-29 Thread Peter Geoghegan
On Fri, Mar 29, 2019 at 9:12 AM Andres Freund  wrote:
> But even so, you can't have unlogged changes that you then rely on. Even
> if there's no torn page issue. Currently BTP_HAS_GARBAGE and
> ItemIdMarkDead() are treated as hints - if we want to guarantee all
> these are accurate, I don't quite see how we'd get around WAL logging
> those.

It might be possible to WAL-log the _bt_check_unique() item killing.
That seems to be much more effective than the similar and better known
kill_prior_tuple optimization in practice. I don't think that that
should be in scope for v12, though. I for one am satisfied with your
explanation.

> > The code can do literally hundreds of random I/Os with an 8192-byte
> > block size. What happens with 16 or 32kB?
>
> It's really hard to construct such cases after the sorting changes, but
> obviously not impossible. But to make it actually painful you need a
> workload where the implied randomness of accesses isn't already a major
> bottleneck - and that's hard.

There is also the fact that in many cases you'll just have accessed
the same buffers from within _bt_check_unique() anyway.

-- 
Peter Geoghegan




Re: pgsql: Compute XID horizon for page level index vacuum on primary.

2019-03-29 Thread Simon Riggs
On Fri, 29 Mar 2019 at 16:12, Andres Freund  wrote:


> On 2019-03-29 15:58:14 +, Simon Riggs wrote:
> > On Fri, 29 Mar 2019 at 15:29, Andres Freund  wrote:
> > > That's far from a trivial feature imo. It seems quite possible that
> > > we'd end up with increased overhead, because the current logic can get
> > > away with only doing hint bit style writes - but would that be true if
> > > we started actually replacing the item pointers? Because I don't see
> > > any guarantee they couldn't cross a page boundary etc? So I think we'd
> > > need to do WAL logging during index searches, which seems
> > > prohibitively expensive.
> >
> > Don't see that.
> >
> > I was talking about reusing the first 4 bytes of an index tuple's
> > ItemPointerData, which is the first field of an index tuple. Index
> > tuples are MAXALIGNed, so I can't see how that would ever cross a page
> > boundary.
>
> They're 8 bytes, and MAXALIGN often is 4 bytes:

xids are 4 bytes, so we're good.

If MAXALIGN could ever be 2 bytes, we'd have a problem.

> So as a whole they definitely can cross sector boundaries. You might be
> able to argue your way out of that by saying that the blkid is going to
> be aligned, but that's not that trivial, as t_info isn't guaranteed
> that.
>
> But even so, you can't have unlogged changes that you then rely on. Even
> if there's no torn page issue. Currently BTP_HAS_GARBAGE and
> ItemIdMarkDead() are treated as hints - if we want to guarantee all
> these are accurate, I don't quite see how we'd get around WAL logging
> those.

You can have unlogged changes that you rely on - that is exactly how hints
work.

If the hint is lost, we do the I/O. Worst case it would be the same as what
you have now.

I'm talking about saving many I/Os - this doesn't need to provably avoid
all I/Os to work; it's an incremental benefit all the way.

-- 
Simon Riggs    http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: pgsql: Compute XID horizon for page level index vacuum on primary.

2019-03-29 Thread Andres Freund
Hi,

On 2019-03-29 15:58:14 +, Simon Riggs wrote:
> On Fri, 29 Mar 2019 at 15:29, Andres Freund  wrote:
> > That's far from a trivial feature imo. It seems quite possible that we'd
> > end up with increased overhead, because the current logic can get away
> > with only doing hint bit style writes - but would that be true if we
> > started actually replacing the item pointers? Because I don't see any
> > guarantee they couldn't cross a page boundary etc? So I think we'd need
> > to do WAL logging during index searches, which seems prohibitively
> > expensive.
> >
> 
> Don't see that.
> 
> I was talking about reusing the first 4 bytes of an index tuple's
> ItemPointerData,
> which is the first field of an index tuple. Index tuples are MAXALIGNed, so
> I can't see how that would ever cross a page boundary.

They're 8 bytes, and MAXALIGN often is 4 bytes:

struct ItemPointerData {
        BlockIdData        ip_blkid;         /* offset 0, size 4 */
        OffsetNumber       ip_posid;         /* offset 4, size 2 */

        /* size: 6, cachelines: 1, members: 2 */
        /* last cacheline: 6 bytes */
};

struct IndexTupleData {
        ItemPointerData    t_tid;            /* offset 0, size 6 */
        short unsigned int t_info;           /* offset 6, size 2 */

        /* size: 8, cachelines: 1, members: 2 */
        /* last cacheline: 8 bytes */
};

So as a whole they definitely can cross sector boundaries. You might be
able to argue your way out of that by saying that the blkid is going to
be aligned, but that's not that trivial, as t_info isn't guaranteed
that.

But even so, you can't have unlogged changes that you then rely on. Even
if there's no torn page issue. Currently BTP_HAS_GARBAGE and
ItemIdMarkDead() are treated as hints - if we want to guarantee all
these are accurate, I don't quite see how we'd get around WAL logging
those.


> > And I'm also doubtful it's worth it because:
> >
> > > Since this point of the code is clearly going to be a performance
> > > issue it seems like something we should do now.
> >
> > I've tried quite a bit to find a workload where this matters, but after
> > avoiding redundant buffer accesses by sorting, and prefetching, I was
> > unable to do so.  What workload do you see where this would really be
> > bad? Without the performance optimization I'd found a very minor
> > regression by trying to force the heap visits to happen in a pretty
> > random order, but after sorting that went away.  I'm sure it's possible
> > to find a case on overloaded rotational disks where you'd find a small
> > regression, but I don't think it'd be particularly bad.

> The code can do literally hundreds of random I/Os with an 8192-byte block
> size. What happens with 16 or 32kB?

It's really hard to construct such cases after the sorting changes, but
obviously not impossible. But to make it actually painful you need a
workload where the implied randomness of accesses isn't already a major
bottleneck - and that's hard.

This has been discussed publicly for a few months...

Greetings,

Andres Freund




Re: pgsql: Compute XID horizon for page level index vacuum on primary.

2019-03-29 Thread Simon Riggs
On Fri, 29 Mar 2019 at 15:29, Andres Freund  wrote:


> On 2019-03-29 09:37:11 +, Simon Riggs wrote:
>


> > While trying to understand this, I see there is an even better way to
> > optimize this. Since we are removing dead index tuples, we could alter
> the
> > killed index tuple interface so that it returns the xmax of the tuple
> being
> > marked as killed, rather than just a boolean to say it is dead.
>
> Wouldn't that quite possibly result in additional and unnecessary
> conflicts? Right now the page level horizon is computed whenever the
> page is actually reused, rather than when an item is marked as
> deleted. As it stands right now, the computed horizons are commonly very
> "old", because of that delay, leading to lower rates of conflicts.
>

I wasn't suggesting we change when the horizon is calculated, so no change
there.

The idea was to cache the data for later use, replacing the hint bit with a
hint xid.

That won't change the rate of conflicts, up or down - but it does avoid I/O.


> > Indexes can then mark the killed tuples with the xmax that killed them
> > rather than just a hint bit. This is possible since the index tuples
> > are dead and cannot be used to follow the htid to the heap, so the
> > htid is redundant and so the block number of the tid could be
> > overwritten with the xmax, zeroing the itemid. Each killed item we
> > mark with its xmax means one less heap fetch we need to perform when
> > we delete the page - it's possible we optimize that away completely by
> > doing this.
>
> That's far from a trivial feature imo. It seems quite possible that we'd
> end up with increased overhead, because the current logic can get away
> with only doing hint bit style writes - but would that be true if we
> started actually replacing the item pointers? Because I don't see any
> guarantee they couldn't cross a page boundary etc? So I think we'd need
> to do WAL logging during index searches, which seems prohibitively
> expensive.
>

Don't see that.

I was talking about reusing the first 4 bytes of an index tuple's
ItemPointerData, which is the first field of an index tuple. Index tuples
are MAXALIGNed, so I can't see how that would ever cross a page boundary.


> And I'm also doubtful it's worth it because:
>
> > Since this point of the code is clearly going to be a performance
> > issue it seems like something we should do now.
>
> I've tried quite a bit to find a workload where this matters, but after
> avoiding redundant buffer accesses by sorting, and prefetching, I was
> unable to do so.  What workload do you see where this would really be
> bad? Without the performance optimization I'd found a very minor
> regression by trying to force the heap visits to happen in a pretty
> random order, but after sorting that went away.  I'm sure it's possible
> to find a case on overloaded rotational disks where you'd find a small
> regression, but I don't think it'd be particularly bad.
>

The code can do literally hundreds of random I/Os in an 8192 blocksize.
What happens with 16 or 32kB?

"Small regression" ?

-- 
Simon Riggs    http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: pgsql: Compute XID horizon for page level index vacuum on primary.

2019-03-29 Thread Andres Freund
Hi,

On 2019-03-29 09:37:11 +, Simon Riggs wrote:
> This commit message was quite confusing. It took me a while to realize this
> relates to btree index deletes and that what you mean is that we are
> calculating the latestRemovedXid for index entries. That is related to but
> not the same thing as the horizon itself.

Well, it's the page level horizon...


> While trying to understand this, I see there is an even better way to
> optimize this. Since we are removing dead index tuples, we could alter the
> killed index tuple interface so that it returns the xmax of the tuple being
> marked as killed, rather than just a boolean to say it is dead.

Wouldn't that quite possibly result in additional and unnecessary
conflicts? Right now the page level horizon is computed whenever the
page is actually reused, rather than when an item is marked as
deleted. As it stands right now, the computed horizons are commonly very
"old", because of that delay, leading to lower rates of conflicts.


> Indexes can then mark the killed tuples with the xmax that killed them
> rather than just a hint bit. This is possible since the index tuples
> are dead and cannot be used to follow the htid to the heap, so the
> htid is redundant and so the block number of the tid could be
> overwritten with the xmax, zeroing the itemid. Each killed item we
> mark with its xmax means one less heap fetch we need to perform when
> we delete the page - it's possible we optimize that away completely by
> doing this.

That's far from a trivial feature imo. It seems quite possible that we'd
end up with increased overhead, because the current logic can get away
with only doing hint bit style writes - but would that be true if we
started actually replacing the item pointers? Because I don't see any
guarantee they couldn't cross a page boundary etc? So I think we'd need
to do WAL logging during index searches, which seems prohibitively
expensive.

And I'm also doubtful it's worth it because:

> Since this point of the code is clearly going to be a performance issue it
> seems like something we should do now.

I've tried quite a bit to find a workload where this matters, but after
avoiding redundant buffer accesses by sorting, and prefetching, I was
unable to do so.  What workload do you see where this would really be
bad? Without the performance optimization I'd found a very minor
regression by trying to force the heap visits to happen in a pretty
random order, but after sorting that went away.  I'm sure it's possible
to find a case on overloaded rotational disks where you'd find a small
regression, but I don't think it'd be particularly bad.

Greetings,

Andres Freund




Re: pgsql: Compute XID horizon for page level index vacuum on primary.

2019-03-29 Thread Simon Riggs
On Wed, 27 Mar 2019 at 00:06, Andres Freund  wrote:

> Compute XID horizon for page level index vacuum on primary.
>
> Previously the xid horizon was only computed during WAL replay.


This commit message was quite confusing. It took me a while to realize this
relates to btree index deletes and that what you mean is that we are
calculating the latestRemovedXid for index entries. That is related to but
not the same thing as the horizon itself. So now I understand the "was
computed only during WAL replay" part, since it seemed obvious that the
xmin horizon was calculated regularly on the master, but as you say the
latestRemovedXid was not.

Now that I understand, I'm happy that you've moved this from redo into
mainline.
And you've optimized it, which is also important (since performance was the
original objection and why it was placed in redo). I can see you've removed
duplicate code in hash indexes as well, which is good.

The term "xid horizon" was only used once in the code in PG11. That usage
appears to be a typo, since in many other places the term "xmin horizon" is
used to mean the point at which we can finally mark tuples as dead. Now we
have some new, undocumented APIs that use the term "xid horizon" yet still
code that refers to "xmin horizon", with neither term being defined. I'm
hoping you'll do some later cleanup of that to avoid confusion.

While trying to understand this, I see there is an even better way to
optimize this. Since we are removing dead index tuples, we could alter the
killed index tuple interface so that it returns the xmax of the tuple being
marked as killed, rather than just a boolean to say it is dead. Indexes can
then mark the killed tuples with the xmax that killed them rather than just
a hint bit. This is possible since the index tuples are dead and cannot be
used to follow the htid to the heap, so the htid is redundant and so the
block number of the tid could be overwritten with the xmax, zeroing the
itemid. Each killed item we mark with its xmax means one less heap fetch we
need to perform when we delete the page - it's possible we optimize that
away completely by doing this.

Since this point of the code is clearly going to be a performance issue it
seems like something we should do now.

-- 
Simon Riggs    http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services