Re: Using old master as new replica after clean switchover

2019-02-22 Thread Claudio Freire
On Fri, Feb 22, 2019 at 5:47 AM Jehan-Guillaume de Rorthais
 wrote:
>
> On Thu, 21 Feb 2019 15:38:21 -0300
> Claudio Freire  wrote:
>
> > On Tue, Feb 19, 2019 at 9:44 PM Michael Paquier  wrote:
> > >
> > > On Tue, Feb 19, 2019 at 04:27:02PM -0800, RSR999GMAILCOM wrote:
> > > > So I wanted to clarify if this procedure really requires the WAL archive
> > > > location to be on shared storage?
> > >
> > > Shared storage for WAL archives is not a requirement.  It is perfectly
> > > possible to use streaming replication to get correct WAL changes.
> > > Using an archive is recommended for some deployments and depending on
> > > your requirements and data retention policy, still you could have
> > > those archives on a different host and have the restore_command of the
> > > standby in recovery or the archive_command of the primary save the
> > > segments to it.  Depending on the frequency new WAL segments are
> > > generated, this depends of course.
> >
> > If I'm not mistaken, if you don't have WAL archive set up (a shared
> > filesystem isn't necessary, but the standby has to be able to restore
> > WAL segments from the archive), a few transactions that haven't been
> > streamed at primary shutdown could be lost, since the secondary won't
> > be able to stream anything after the primary has shut down.
>
> This has been fixed in 9.3. The primary node waits for all WAL records to be
> streamed to the connected standbys before shutting down, including its
> shutdown checkpoint. See:
>
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=985bd7d49726c9f178558491d31a570d47340459
>
> Because a standby could disconnect because of some failure during the shutdown
> process, you still need to make sure the standby-to-be-promoted received the
> shutdown checkpoint though.
>
> > WAL archive can always be restored even without a primary running, hence
> > why a WAL archive is needed.
>
> No. Primary does not force a WAL switch/archive during shutdown.

That's good to know, both of the above.
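
For reference, one rough way to check that the standby-to-be-promoted did
receive the shutdown checkpoint (hypothetical paths, adjust to your setup):

    # on the old primary, after its clean shutdown
    pg_controldata "$PGDATA" | grep "Latest checkpoint location"

    # on the standby that will be promoted
    psql -Atc "SELECT pg_last_wal_replay_lsn()"

Roughly speaking, the standby's replay LSN should be past the primary's
latest checkpoint location; if it isn't, the standby missed the tail of the
primary's WAL, including the shutdown checkpoint.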



Re: Using old master as new replica after clean switchover

2019-02-21 Thread Claudio Freire
On Tue, Feb 19, 2019 at 9:44 PM Michael Paquier  wrote:
>
> On Tue, Feb 19, 2019 at 04:27:02PM -0800, RSR999GMAILCOM wrote:
> > So I wanted to clarify if this procedure really requires the WAL archive
> > location to be on shared storage?
>
> Shared storage for WAL archives is not a requirement.  It is perfectly
> possible to use streaming replication to get correct WAL changes.
> Using an archive is recommended for some deployments and depending on
> your requirements and data retention policy, still you could have
> those archives on a different host and have the restore_command of the
> standby in recovery or the archive_command of the primary save the
> segments to it.  Depending on the frequency new WAL segments are
> generated, this depends of course.

If I'm not mistaken, if you don't have WAL archive set up (a shared
filesystem isn't necessary, but the standby has to be able to restore
WAL segments from the archive), a few transactions that haven't been
streamed at primary shutdown could be lost, since the secondary won't
be able to stream anything after the primary has shut down. WAL
archive can always be restored even without a primary running, hence
why a WAL archive is needed.

Or am I missing something?
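
For what it's worth, the kind of setup Michael describes (archives on a
separate host, no shared storage) might look roughly like this; host names
and paths are made up:

    # postgresql.conf on the primary
    archive_mode = on
    archive_command = 'scp %p archivehost:/archive/wal/%f'

    # recovery.conf on the standby (postgresql.conf from v12 on)
    primary_conninfo = 'host=primaryhost user=replication'
    restore_command = 'scp archivehost:/archive/wal/%f %p'

Streaming covers the normal case; the restore_command only kicks in when
the standby falls behind what the primary still has on hand.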



Re: Vacuum: allow usage of more than 1GB of work mem

2018-07-16 Thread Claudio Freire
On Mon, Jul 16, 2018 at 3:30 PM Andrew Dunstan
 wrote:
>
>
>
> On 07/16/2018 10:34 AM, Claudio Freire wrote:
> > On Fri, Jul 13, 2018 at 5:43 PM Andrew Dunstan
> >  wrote:
> >>
> >>
> >> On 07/13/2018 09:44 AM, Heikki Linnakangas wrote:
> >>> On 13/07/18 01:39, Andrew Dunstan wrote:
> >>>> On 07/12/2018 06:34 PM, Alvaro Herrera wrote:
> >>>>> On 2018-Jul-12, Andrew Dunstan wrote:
> >>>>>
> >>>>>> I fully understand. I think this needs to go back to "Waiting on
> >>>>>> Author".
> >>>>> Why?  Heikki's patch applies fine and passes the regression tests.
> >>>> Well, I understood Claudio was going to do some more work (see
> >>>> upthread).
> >>> Claudio raised a good point, that doing small pallocs leads to
> >>> fragmentation, and in particular, it might mean that we can't give
> >>> back the memory to the OS. The default glibc malloc() implementation
> >>> has a threshold of 4 or 32 MB or something like that - allocations
> >>> larger than the threshold are mmap()'d, and can always be returned to
> >>> the OS. I think a simple solution to that is to allocate larger
> >>> chunks, something like 32-64 MB at a time, and carve out the
> >>> allocations for the nodes from those chunks. That's pretty
> >>> straightforward, because we don't need to worry about freeing the
> >>> nodes in retail. Keep track of the current half-filled chunk, and
> >>> allocate a new one when it fills up.
> >>
> >> Google seems to suggest the default threshold is much lower, like 128K.
> >> Still, making larger allocations seems sensible. Are you going to work
> >> on that?
> > Below a few MB the threshold is dynamic, and if a block bigger than
> > 128K but smaller than the higher threshold (32-64MB IIRC) is freed,
> > the dynamic threshold is set to the size of the freed block.
> >
> > See M_MMAP_MAX and M_MMAP_THRESHOLD in the man page for mallopt[1]
> >
> > So I'd suggest allocating blocks bigger than M_MMAP_MAX.
> >
> > [1] http://man7.org/linux/man-pages/man3/mallopt.3.html
>
>
> That page says:
>
> M_MMAP_MAX
>This parameter specifies the maximum number of allocation
>requests that may be simultaneously serviced using mmap(2).
>This parameter exists because some systems have a limited
>number of internal tables for use by mmap(2), and using more
>than a few of them may degrade performance.
>
>The default value is 65,536, a value which has no special
>significance and which serves only as a safeguard. Setting
>this parameter to 0 disables the use of mmap(2) for servicing
>large allocation requests.
>
>
> I'm confused about the relevance.

It isn't relevant. See my next message; it should have read
DEFAULT_MMAP_THRESHOLD_MAX.



Re: Vacuum: allow usage of more than 1GB of work mem

2018-07-16 Thread Claudio Freire
On Mon, Jul 16, 2018 at 11:34 AM Claudio Freire  wrote:
>
> On Fri, Jul 13, 2018 at 5:43 PM Andrew Dunstan
>  wrote:
> >
> >
> >
> > On 07/13/2018 09:44 AM, Heikki Linnakangas wrote:
> > > On 13/07/18 01:39, Andrew Dunstan wrote:
> > >> On 07/12/2018 06:34 PM, Alvaro Herrera wrote:
> > >>> On 2018-Jul-12, Andrew Dunstan wrote:
> > >>>
> > >>>> I fully understand. I think this needs to go back to "Waiting on
> > >>>> Author".
> > >>> Why?  Heikki's patch applies fine and passes the regression tests.
> > >>
> > >> Well, I understood Claudio was going to do some more work (see
> > >> upthread).
> > >
> > > Claudio raised a good point, that doing small pallocs leads to
> > > fragmentation, and in particular, it might mean that we can't give
> > > back the memory to the OS. The default glibc malloc() implementation
> > > has a threshold of 4 or 32 MB or something like that - allocations
> > > larger than the threshold are mmap()'d, and can always be returned to
> > > the OS. I think a simple solution to that is to allocate larger
> > > chunks, something like 32-64 MB at a time, and carve out the
> > > allocations for the nodes from those chunks. That's pretty
> > > straightforward, because we don't need to worry about freeing the
> > > nodes in retail. Keep track of the current half-filled chunk, and
> > > allocate a new one when it fills up.
> >
> >
> > Google seems to suggest the default threshold is much lower, like 128K.
> > Still, making larger allocations seems sensible. Are you going to work
> > on that?
>
> Below a few MB the threshold is dynamic, and if a block bigger than
> 128K but smaller than the higher threshold (32-64MB IIRC) is freed,
> the dynamic threshold is set to the size of the freed block.
>
> See M_MMAP_MAX and M_MMAP_THRESHOLD in the man page for mallopt[1]
>
> So I'd suggest allocating blocks bigger than M_MMAP_MAX.
>
> [1] http://man7.org/linux/man-pages/man3/mallopt.3.html

Sorry, substitute M_MMAP_MAX with DEFAULT_MMAP_THRESHOLD_MAX; the
former is something else.



Re: Vacuum: allow usage of more than 1GB of work mem

2018-07-16 Thread Claudio Freire
On Fri, Jul 13, 2018 at 5:43 PM Andrew Dunstan
 wrote:
>
>
>
> On 07/13/2018 09:44 AM, Heikki Linnakangas wrote:
> > On 13/07/18 01:39, Andrew Dunstan wrote:
> >> On 07/12/2018 06:34 PM, Alvaro Herrera wrote:
> >>> On 2018-Jul-12, Andrew Dunstan wrote:
> >>>
>  I fully understand. I think this needs to go back to "Waiting on
>  Author".
> >>> Why?  Heikki's patch applies fine and passes the regression tests.
> >>
> >> Well, I understood Claudio was going to do some more work (see
> >> upthread).
> >
> > Claudio raised a good point, that doing small pallocs leads to
> > fragmentation, and in particular, it might mean that we can't give
> > back the memory to the OS. The default glibc malloc() implementation
> > has a threshold of 4 or 32 MB or something like that - allocations
> > larger than the threshold are mmap()'d, and can always be returned to
> > the OS. I think a simple solution to that is to allocate larger
> > chunks, something like 32-64 MB at a time, and carve out the
> > allocations for the nodes from those chunks. That's pretty
> > straightforward, because we don't need to worry about freeing the
> > nodes in retail. Keep track of the current half-filled chunk, and
> > allocate a new one when it fills up.
>
>
> Google seems to suggest the default threshold is much lower, like 128K.
> Still, making larger allocations seems sensible. Are you going to work
> on that?

Below a few MB the threshold is dynamic, and if a block bigger than
128K but smaller than the higher threshold (32-64MB IIRC) is freed,
the dynamic threshold is set to the size of the freed block.

See M_MMAP_MAX and M_MMAP_THRESHOLD in the man page for mallopt[1]

So I'd suggest allocating blocks bigger than M_MMAP_MAX.

[1] http://man7.org/linux/man-pages/man3/mallopt.3.html
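
To make that concrete, here's a rough sketch (not from any patch, hypothetical
names) of the kind of chunk-based node allocation Heikki suggests: each chunk
is allocated well above the glibc mmap threshold, so malloc() serves it with
mmap() and free() hands it straight back to the OS. Alignment and requests
larger than a chunk are ignored for brevity.

    #include <stdlib.h>

    #define CHUNK_SIZE (64UL * 1024 * 1024)   /* > DEFAULT_MMAP_THRESHOLD_MAX */

    typedef struct Chunk
    {
        struct Chunk *next;     /* previously filled chunks */
        size_t      used;       /* bytes handed out from data[] */
        char        data[];     /* node storage is carved out of this */
    } Chunk;

    static Chunk *current_chunk = NULL;

    static void *
    node_alloc(size_t size)
    {
        void       *p;

        if (current_chunk == NULL ||
            current_chunk->used + size > CHUNK_SIZE)
        {
            Chunk      *c = malloc(sizeof(Chunk) + CHUNK_SIZE);

            if (c == NULL)
                return NULL;
            c->next = current_chunk;
            c->used = 0;
            current_chunk = c;
        }
        p = current_chunk->data + current_chunk->used;
        current_chunk->used += size;    /* no retail free: nodes die with the chunk */
        return p;
    }

    static void
    node_free_all(void)
    {
        while (current_chunk != NULL)
        {
            Chunk      *next = current_chunk->next;

            free(current_chunk);    /* each chunk goes straight back to the OS */
            current_chunk = next;
        }
    }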



Re: Vacuum: allow usage of more than 1GB of work mem

2018-07-12 Thread Claudio Freire
On Thu, Jul 12, 2018 at 10:44 AM Andrew Dunstan
 wrote:
>
>
>
> On 04/06/2018 08:00 PM, Claudio Freire wrote:
> > On Fri, Apr 6, 2018 at 5:25 PM, Claudio Freire  
> > wrote:
> >> On Fri, Apr 6, 2018 at 10:39 AM, Heikki Linnakangas  
> >> wrote:
> >>> On 06/04/18 01:59, Claudio Freire wrote:
> >>>> The iteration interface, however, seems quite specific for the use
> >>>> case of vacuumlazy, so it's not really a good abstraction.
> >>>
> >>> Can you elaborate? It does return the items one block at a time. Is that
> >>> what you mean by being specific for vacuumlazy? I guess that's a bit
> >>> special, but if you imagine some other users for this abstraction, it's
> >>> probably not that unusual. For example, if we started using it in bitmap
> >>> heap scans, a bitmap heap scan would also want to get the TIDs one block
> >>> number at a time.
> >> But you're also tying the caller to the format of the buffer holding
> >> those TIDs, for instance. Why would you, when you can have an
> >> interface that just iterates TIDs and let the caller store them
> >> if/however they want?
> >>
> >> I do believe a pure iterator interface is a better interface.
> > Between the b-tree or not discussion and the refactoring to separate
> > the code, I don't think we'll get this in the next 24hs.
> >
> > So I guess we'll have ample time to ponder on both issues during the
> > next commit fest.
> >
>
>
>
> There doesn't seem to have been much pondering done since then, at least
> publicly. Can we make some progress on this? It's been around for a long
> time now.

Yeah, life has kept me busy and I haven't had much time to make
progress here, but I was planning on doing the refactoring as we were
discussing soon. Can't give a time frame for that, but "soonish".



Re: [WIP] [B-Tree] Retail IndexTuple deletion

2018-06-18 Thread Claudio Freire
On Mon, Jun 18, 2018 at 4:59 PM Peter Geoghegan  wrote:
> > Note, guaranteed allowable time of index scans (used for quick deletion)
> > will be achieved by storing equal-key index tuples in physical TID order [2]
> > with patch [3].
>
> I now have greater motivation to develop that patch into a real project.
>
> I bet that my heap-tid-sort patch will allow you to refine your
> interface when there are many logical duplicates: You can create one
> explicit scan key, but have a list of heap TIDs that need to be killed
> within the range of matching index tuples. That could be a lot more
> efficient in the event of many non-HOT updates, where most index tuple
> values won't actually change. You can sort the list of heap TIDs that
> you want to kill once, and then "merge" it with the tuples at the leaf
> level as they are matched/killed. It should be safe to avoid
> rechecking anything other than the heap TID values.
>
> [1] 
> http://pgeoghegan.blogspot.com/2017/07/postgresql-index-bloat-microscope.html
> [2] 
> https://www.postgresql.org/message-id/CAH2-Wzmf6intNY1ggiNzOziiO5Eq=DsXfeptODGxO=2j-i1...@mail.gmail.com

Actually, once btree tids are sorted, you can continue tree descent
all the way to the exact leaf page that contains the tuple to be
deleted.

Thus, the single-tuple interface ends up being quite OK. Sure, you can
optimize things a bit by scanning a range, but only if vacuum is able
to group keys in order to produce the optimized calls, and I don't see
that as terribly likely.

So, IMHO the current interface may be quite enough.



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2018-06-18 Thread Claudio Freire
On Mon, Jun 18, 2018 at 2:03 PM Peter Geoghegan  wrote:
>
> On Mon, Jun 18, 2018 at 7:57 AM, Claudio Freire  
> wrote:
> > Way back when I was dabbling in this kind of endeavor, my main idea to
> > counteract that, and possibly improve performance overall, was a
> > microvacuum kind of thing that would do some on-demand cleanup to
> > remove duplicates or make room before page splits. Since nbtree
> > uniqueification enables efficient retail deletions, that could end up
> > as a net win.
>
> That sounds like a mechanism that works a bit like
> _bt_vacuum_one_page(), which we run at the last second before a page
> split. We do this to see if a page split that looks necessary can
> actually be avoided.
>
> I imagine that retail index tuple deletion (the whole point of this
> project) would be run by a VACUUM-like process that kills tuples that
> are dead to everyone. Even with something like zheap, you cannot just
> delete index tuples until you establish that they're truly dead. I
> guess that the delete marking stuff that Robert mentioned marks tuples
> as dead when the deleting transaction commits. Maybe we could justify
> having _bt_vacuum_one_page() do cleanup to those tuples (i.e. check if
> they're visible to anyone, and if not recycle), because we at least
> know that the deleting transaction committed there. That is, they
> could be recently dead or dead, and it may be worth going to the extra
> trouble of checking which when we know that it's one of the two
> possibilities.

Yes, but currently _bt_vacuum_one_page does local work on the pinned
page. Doing dead tuple deletion, however, involves reading the heap to
check visibility at the very least, and doing it on the whole page
might involve several heap fetches, so it's an order of magnitude
heavier if done naively.

But the idea is to do just that, only not naively.



Re: Making all nbtree entries unique by having heap TIDs participate in comparisons

2018-06-18 Thread Claudio Freire
On Fri, Jun 15, 2018 at 8:47 PM Peter Geoghegan  wrote:

> > I think it would be helpful if you could talk more about these
> > regressions (and the wins).
>
> I think that the performance regressions are due to the fact that when
> you have a huge number of duplicates today, it's useful to be able to
> claim space to fit further duplicates from almost any of the multiple
> leaf pages that contain or have contained duplicates. I'd hoped that
> the increased temporal locality that the patch gets would more than
> make up for that. As far as I can tell, the problem is that temporal
> locality doesn't help enough. I saw that performance was somewhat
> improved with extreme Zipf distribution contention, but it went the
> other way with less extreme contention. The details are not that fresh
> in my mind, since I shelved this patch for a while following limited
> performance testing.
>
> The code could certainly use more performance testing, and more
> general polishing. I'm not strongly motivated to do that right now,
> because I don't quite see a clear path to making this patch useful.
> But, as I said, I have an open mind about what the next step should
> be.

Way back when I was dabbling in this kind of endeavor, my main idea to
counteract that, and possibly improve performance overall, was a
microvacuum kind of thing that would do some on-demand cleanup to
remove duplicates or make room before page splits. Since nbtree
uniqueification enables efficient retail deletions, that could end up
as a net win.

I never got around to implementing it though, and it does get tricky
if you don't want to allow unbounded latency spikes.



Re: Locking B-tree leafs immediately in exclusive mode

2018-06-14 Thread Claudio Freire
On Thu, Jun 14, 2018 at 9:44 AM Alexander Korotkov
 wrote:
> > > Our B-tree is currently maintaining duplicates unordered.  So, during 
> > > insertion
> > > we can traverse rightlinks in order to find page, which would fit new
> > > index tuple.
> > > However, in this case we're traversing pages in exclusive lock mode, and
> > > that happens already after re-lock.  _bt_search() doesn't do anything 
> > > with that.
> > > So, I assume there shouldn't be any degradation in the case of many
> > > duplicate entries.
> >
> > BTW, I have a prototype patch that makes the keys at the leaf level
> > unique. It is actually an implementation of nbtree suffix truncation
> > that sometimes *adds* a new attribute in pivot tuples, rather than
> > truncating away non-distinguishing attributes -- it adds a special
> > heap TID attribute when it must. The patch typically increases fan-in,
> > of course, but the real goal was to make all entries at the leaf level
> truly unique, as L&Y intended (we cannot do it without suffix
> > truncation because that will kill fan-in). My prototype has full
> > amcheck coverage, which really helped me gain confidence in my
> > approach.
>
> Could you, please, clarify what do you mean by "fan-in"?  As I
> understood, higher "fan-in" means more children on single non-leaf
> page, and in turn "better branching".  Is my understanding correct?
>
> If my understanding is correct, then making leaf level truly unique
> without suffix truncation will kill fan-in, because additional heap
> TID attribute will increase pivot tuple size.  However, that doesn't
> look like inevitable shortcoming, because we could store heap TID in
> t_tid of pivot index tuples.  Yes, we've used t_tid offset for storing
> number of attributes in truncated tuples.  But heap TID is going to be
> virtually the last attribute of tuple.  So, pivot tuples containing
> heap TID are not going to be suffix-truncated anyway.  So, for such
> tuples we can just don't set INDEX_ALT_TID_MASK, and they would be
> assumed to have all the attributes.

I had a patch[1] that did pretty much just that. I saw no contention
issues or adverse effects, but my testing of it was rather limited.

The patch has rotted significantly since then, sadly, and it's quite complex.

> So, my idea that it's not necessary to couple suffix truncation with
> making leaf tuples unique.

No, but the work involved for one and the other overlaps so much that it
lends itself to being done in the same patch.

> Regarding just making leaf tuples unique,
> I understand that it wouldn't be always win.  When we have a lot of
> duplicates, current implementation searches among the pages containing
> those duplicates for the first page containing enough of free space.
> While with unique index, we would have to always insert into
> particular page.  Thus, current implementation gives us better filling
> of pages, and (probably) faster insertion.

Not at all. Insertion cost in unique indexes with lots of duplicates
(which happens, e.g. with dead duplicates) grows quadratically with the
number of duplicates, and that's improved by making the index unique and sorted.

> Therefore, my idea is that there is a tradeoff between making btree
> index unique or non-unique.  I think we will need an option for that.

We will need a flag somewhere to avoid having to require index
rebuilds during pg_upgrade. My old patch maintained backward
compatibility, but when using an older version index, the sort order
of tids could not be assumed, and thus many optimizations had to be
disabled.

But it is totally doable, and necessary unless you accept a very
painful pg_upgrade.


[1] 
https://www.postgresql.org/message-id/flat/CAGTBQpZ-kTRQiAa13xG1GNe461YOwrA-s-ycCQPtyFrpKTaDBQ%40mail.gmail.com#cagtbqpz-ktrqiaa13xg1gne461yowra-s-yccqptyfrpkta...@mail.gmail.com



Re: Spilling hashed SetOps and aggregates to disk

2018-06-13 Thread Claudio Freire
On Tue, Jun 5, 2018 at 5:05 AM Tomas Vondra
 wrote:
>
> On 06/05/2018 07:46 AM, Jeff Davis wrote:
> > On Tue, 2018-06-05 at 07:04 +0200, Tomas Vondra wrote:
> >> I expect the eviction strategy to be the primary design challenge of
> >> this patch. The other bits will be mostly determined by this one
> >> piece.
> >
> > Not sure I agree that this is the primary challenge.
> >
> > The cases that benefit from eviction are probably a minority. I see two
> > categories that would benefit:
> >
> >* Natural clustering in the heap. This sounds fairly common, but a
> > lot of the cases that come to mind are too low-cardinality to be
> > compelling; e.g. timestamps grouped by hour/day/month. If someone has
> > run into a high-cardinality natural grouping case, let me know.
> >* ARRAY_AGG (or similar): individual state values can be large enough
> > that we need to evict to avoid exceeding work_mem even if not adding
> > any new groups.
> >
> > In either case, it seems like a fairly simple eviction strategy would
> > work. For instance, we could just evict the entire hash table if
> > work_mem is exceeded or if the hit rate on the hash table falls below a
> > certain threshold. If there was really something important that should
> > have stayed in the hash table, it will go back in soon anyway.
> >
> > So why should eviction be a major driver for the entire design? I agree
> > it should be an area of improvement for the future, so let me know if
> > you see a major problem, but I haven't been as focused on eviction.
> >
>
> My concern is more about what happens when the input tuple ordering is
> inherently incompatible with the eviction strategy, greatly increasing
> the amount of data written to disk during evictions.
>
> Say for example that we can fit 1000 groups into work_mem, and that
> there are 2000 groups in the input data set. If the input is correlated
> with the groups, everything is peachy because we'll evict the first
> batch, and then group the remaining 1000 groups (or something like
> that). But if the input data is random (which can easily happen, e.g.
> for IP addresses, UUIDs and such) we'll hit the limit repeatedly and
> will evict much sooner.
>
> I know you suggested simply dumping the whole hash table and starting
> from scratch while we talked about this at pgcon, but ISTM it has
> exactly this issue.
>
> But I don't know if there actually is a better option - maybe we simply
> have to accept this problem. After all, running slowly is still better
> than OOM (which may or may not happen now).
>
> I wonder if we can somehow detect this at run-time and maybe fall-back
> to groupagg. E.g. we could compare number of groups vs. number of input
> tuples when we first hit the limit. It's a rough heuristics, but maybe
> sufficient.

I've been applying a strategy like that to do massive streaming
aggregation quite successfully.

The code I have is in python and in a private repo. There have been
talks of both opensourcing it, and of porting it into postgres as a
kind of aggregate node, because it sounds like a good idea. It seems
very related to this thread.

In essence, the technique I've been using always uses a fixed-size
hash table to do partial grouping. The table is never grown; when
collisions do happen, the existing entry in the hash table is flushed
to disk and the aggregate state in that bucket is reset for the incoming
key. To avoid excessive spilling due to frequent collisions, we use a
kind of "lazy cuckoo" hash table. Lazy in the sense that it does no
relocations, it just checks 2 hash values, and if it has to evict, it
evicts the "oldest" value (with some cheap definition of "old").

The technique works very well to reduce temporary data size with a
fixed amount of working memory. The resulting spill file can then be
processed again to finalize the computation.
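
As a rough illustration of the scheme (not the actual code, which is Python;
hypothetical names, keys reduced to int64 and the aggregate to a count):

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define NBUCKETS 4096           /* fixed size, never grown */

    typedef struct
    {
        bool        used;
        int64_t     key;
        int64_t     count;          /* partial aggregate state */
        uint64_t    stamp;          /* cheap notion of "age" */
    } Bucket;

    static Bucket table[NBUCKETS];
    static uint64_t tick = 0;

    static uint32_t
    h1(int64_t k)
    {
        return (uint32_t) (((uint64_t) k * 0x9E3779B97F4A7C15ULL) % NBUCKETS);
    }

    static uint32_t
    h2(int64_t k)
    {
        return (uint32_t) (((uint64_t) k * 0xC2B2AE3D27D4EB4FULL) % NBUCKETS);
    }

    /* flush one partially aggregated group to the spill file */
    static void
    spill(FILE *spillfile, Bucket *b)
    {
        fwrite(&b->key, sizeof(b->key), 1, spillfile);
        fwrite(&b->count, sizeof(b->count), 1, spillfile);
    }

    static void
    partial_agg_tuple(FILE *spillfile, int64_t key)
    {
        Bucket     *b1 = &table[h1(key)];
        Bucket     *b2 = &table[h2(key)];
        Bucket     *victim;

        /* check both candidate buckets, cuckoo-style, but never relocate */
        if (b1->used && b1->key == key) { b1->count++; b1->stamp = tick++; return; }
        if (b2->used && b2->key == key) { b2->count++; b2->stamp = tick++; return; }

        if (!b1->used)
            victim = b1;
        else if (!b2->used)
            victim = b2;
        else
            victim = (b1->stamp <= b2->stamp) ? b1 : b2;    /* evict the "oldest" */

        if (victim->used)
            spill(spillfile, victim);   /* lazy: just flush it and reuse the slot */

        victim->used = true;
        victim->key = key;
        victim->count = 1;
        victim->stamp = tick++;
    }

At end of input the surviving buckets are flushed as well, and the spill
file still needs a final combine pass.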

What I was pondering PG could do, is feed the spilled tuples to a sort
node, using the partial hash aggregation as a data-reducing node only.

scan -> partial hash agg -> sort -> final group agg

The group agg would have to know to consume and combine aggregate
states instead of producing them, but in essence it seems a relatively
efficient process.

An adaptive hash agg node would start as a hash agg, and turn into a
"partial hash agg + sort + final group agg" when OOM is detected.

The benefit over ordinary sort+group agg is that the sort is happening
on a potentially much smaller data set. When the buffer is large
enough to capture enough key repetitions, the output of the partial
hash agg can be orders of magnitude smaller than the scan.

My proposal was to use this for all group aggs, not only when the
planner chooses a hash agg, because it can speed up the sort and
reduce temp storage considerably. I guess the trick lies in estimating
that cardinality reduction to avoid applying this when there's no hope
of getting a payoff. The overhead of such a lazy hash table isn't
much, really. But yes, its cache locality is 

Re: POC: GROUP BY optimization

2018-06-06 Thread Claudio Freire
On Wed, Jun 6, 2018 at 8:06 PM Tomas Vondra
 wrote:
> >>> Comparison cost can be approximated probabilistically as:
> >>>
> >>> cost_comp = sum(cost_op_n * (1.0 / ndistinct(col_1_to_n)))
> >>>
> >>> Where cost_op_n is the cost of the comparison function for column N,
> >>> and ndistinct(col_1_to_n) is an estimation of the number of distinct
> >>> values for columns prior to N in the sort order.
> >>>
> >>> You can approximate ndistinct as the product of ndistinct of previous
> >>> columns, or use extended statistics when available.
> >>>
> >>
> >> Sure. The trouble is this also assumes uniform distribution, which is
> >> not always the case.
> >
> > Well, (1.0 / ndistinct) = p(left == right).
> >
> > If you can compute a better p(left == right) with an MCV, you can get
> > a better estimate. If you have an MCV. But that'd be a bit expensive,
> > and I'm not sure it's all that relevant.
> >
> > To what degree does improving this produce better plans? As long as
> > average expected cost is relatively well estimated (as in, one sort
> > order relative to another sort order), it won't affect the decision.
> >
>
> I'd bet I can construct corner-case examples with vastly different
> behavior depending on column data distributions, so it's not entirely
> insignificant. How important is it in practice I don't know.

I guess that can only be answered by building that solution and
testing it against the corner cases.



Re: POC: GROUP BY optimization

2018-06-06 Thread Claudio Freire
On Wed, Jun 6, 2018 at 7:18 PM Claudio Freire  wrote:
> > > Comparison cost can be approximated probabilistically as:
> > >
> > > cost_comp = sum(cost_op_n * (1.0 / ndistinct(col_1_to_n)))
> > >
> > > Where cost_op_n is the cost of the comparison function for column N,
> > > and ndistinct(col_1_to_n) is an estimation of the number of distinct
> > > values for columns prior to N in the sort order.
> > >
> > > You can approximate ndistinct as the product of ndistinct of previous
> > > columns, or use extended statistics when available.
> > >
> >
> > Sure. The trouble is this also assumes uniform distribution, which is
> > not always the case.
>
> Well, (1.0 / ndistinct) = p(left == right).
>
> If you can compute a better p(left == right) with an MCV, you can get
> a better estimate. If you have an MCV. But that'd be a bit expensive,
> and I'm not sure it's all that relevant.
>
> To what degree does improving this produce better plans? As long as
> average expected cost is relatively well estimated (as in, one sort
> order relative to another sort order), it won't affect the decision.
>
> But if needed, the MCV can be used for this.
>
> So, in essence, you need to account for:
>
> - restrictions on that column that constrain the domain
> - distribution skew (MCV, nulls)
> - ndistinct
>
> To compute p(left == right).
>
> Maybe something as simple as the following?
>
> p_special = sum(mcv_i * mcv_i) + null_frac * null_frac
> p_other = (1 - p_special) * (1 - p_special) / ndistinct(restr)

Well, that came out with a slew of math errors, but the idea is clear:
compute p(left == right) given the available statistics and
constrained by any applicable restrictions.
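
Cleaned up, one plausible version of that estimate (my reading of the idea,
a sketch rather than vetted code) would be: NULLs are assumed to group
together for sorting purposes, and non-MCV, non-null values are assumed
uniform over the remaining distinct values.

    static double
    estimate_p_equal(const double *mcv_freq, int n_mcv,
                     double null_frac, double ndistinct)
    {
        double      p = null_frac * null_frac;  /* NULLs compare equal when sorting */
        double      mcv_total = 0.0;
        double      rest;
        int         i;

        for (i = 0; i < n_mcv; i++)
        {
            p += mcv_freq[i] * mcv_freq[i];
            mcv_total += mcv_freq[i];
        }

        /* remaining values, spread uniformly over the remaining distincts */
        rest = 1.0 - mcv_total - null_frac;
        if (rest > 0.0 && ndistinct > n_mcv)
            p += rest * rest / (ndistinct - n_mcv);

        return p;
    }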



Re: POC: GROUP BY optimization

2018-06-06 Thread Claudio Freire
On Wed, Jun 6, 2018 at 6:57 PM Tomas Vondra
 wrote:
>
> On 06/06/2018 11:22 PM, Claudio Freire wrote:
> > On Wed, Jun 6, 2018 at 5:43 PM Tomas Vondra
> > As such, estimating sort performance correctly in the various plan
> > variants being considered seems to be a very central aspect of it.
> >
> > This means it's pretty much time to consider the effect of operator
> > cost *and* distinct values in the cost computation.
> >
>
> Yes, until now that was not really needed because the optimizer does not
> consider reordering of the pathkeys - it simply grabs either GROUP BY or
> ORDER BY pathkeys and runs with that.
>
> So the costing was fairly trivial, we simply do something like
>
> comparison_cost = 2.0 * cpu_operator_cost;
>
> sort_cost = comparison_cost * tuples * LOG2(tuples);
>
> which essentially ignores that there might be multiple columns, or that
> the columns may have sort operator with different costs.
>
> The question is how reliable the heuristics can be. The current patch
> uses just plain ndistinct, but that seems rather unreliable but I don't
> have a clear idea how to improve that - we may have MCV for the columns
> and perhaps some extended statistics, but I'm not sure how far we can
> run with that.
>
> Essentially what we need to estimate the number of comparisons for each
> column, to compute better comparison_cost.

This could be approached by adjusting statistics by any relevant
restrictions applicable to the columns being grouped, as is done for
selectivity estimations.

I'm not sure how far that would get us, though. It would be rather
easy to lose sight of those restrictions if there are complex
operations involved.

> > Assuming cost_sort does its thing, it's just a matter of building the
> > desired variants and picking the one with the smallest cost_sort. One
> > can use heuristics to build a subset of all possible column orderings
> > with a guiding principle that tries to maximize the likelihood of
> > finding a better order than the one in the query (like sorting by
> > ndistinct), but I'd suggest:
> >
> > - Always considering the sort order verbatim as given in the SQL (ie:
> > what the user requests)
> > - Relying on cost_sort to distinguish among them, and pick a winner, and
> > - When in doubt (if cost_sort doesn't produce a clear winner), use the
> > order given by the user
> >
>
> I don't see why not to generate all possible orderings (keeping only the
> cheapest path for each pathkey variant) and let the optimizer to sort it
> out.

I'm assuming costing the full N! possible orderings would be
prohibitively expensive.

> If the user specified an explicit ORDER BY, we'll slap an extra
> Sort by at the end, increasing the cost.

I think you misunderstood me. I'm saying the exact opposite.

When I mention handicap, I mean to give the explicitly requested group
by order a *reduced* cost, to give it an advantage over the
heuristics.

This is to try to attack the problem you mentioned where users already
accounting for operator costs in their queries would get overridden by
the planner, perhaps to the detriment of overall execution time.

In essence:

- If the user requested that order, we assume it is "somewhat
subjectively better" (by giving it a slightly reduced cost).
- If there is an explicit order by clause that differs from a
considered path, the required sort node will already penalize it
appropriately, nothing needs to be done in relation to sort costs.
- All other things being equal, cost_sort will make the decision. If a
plan beats the user-provided order in spite of the handicap, it will
be used. So it's still optimizing clear cases.

>
> > Comparison cost can be approximated probabilistically as:
> >
> > cost_comp = sum(cost_op_n * (1.0 / ndistinct(col_1_to_n)))
> >
> > Where cost_op_n is the cost of the comparison function for column N,
> > and ndistinct(col_1_to_n) is an estimation of the number of distinct
> > values for columns prior to N in the sort order.
> >
> > You can approximate ndistinct as the product of ndistinct of previous
> > columns, or use extended statistics when available.
> >
>
> Sure. The trouble is this also assumes uniform distribution, which is
> not always the case.

Well, (1.0 / ndistinct) = p(left == right).

If you can compute a better p(left == right) with an MCV, you can get
a better estimate. If you have an MCV. But that'd be a bit expensive,
and I'm not sure it's all that relevant.

To what degree does improving this produce better plans? As long as
average expected cost is relatively well estimated (as in, one sort
order relative to another sort order), it won't affect the decision.

But if needed, the MCV can be used for this.

So, in

Re: POC: GROUP BY optimization

2018-06-06 Thread Claudio Freire
On Wed, Jun 6, 2018 at 5:43 PM Tomas Vondra
 wrote:
>
>  For example, it seems to disregard that different data types have
>  different comparison costs. For example comparing bytea will be far
>  more expensive compared to int4, so it may be much more efficient to
>  compare int4 columns first, even if there are far fewer distinct
>  values in them.
> >>> as I can see cost_sort() doesn't pay attention to this details. And
> >>> it should be a separated patch to improve that.
> >>>
> >>
> >> But sort also does not reorder columns.
> > Yes. But estimation of cost of comparing function/number of unique
> > values in column could be not very accurate and so planner could make
> > a wrong choice.

...
>
> >> Imagine you have a custom data type that is expensive for comparisons.
> >> You know that, so you place it at the end of GROUP BY clauses, to
> >> reduce the number of comparisons on that field. And then we come along
> >> and just reorder the columns, placing it first, because it happens to
> >> have a high ndistinct statistic. And there's no way to get the
> >> original behavior :-(
> > Hm. that it could be, but I don't know how to improve here.  Current
> > cost_sort() will return the same cost for any columns order.
> >
> > Okay, here we know an estimation of nrow, we could somehow find a
> > comparing function oid and find a its procost field. And then? we also
> > need to know width of tuple here and mostly repeat the cost_sort.
> >
> > Another option is an introducing enable_groupby_reorder GUC variable
> > although personally I don't like an idea to add new GUC variable.
> >
>
> I don't like the idea of yet another GUC either, as it's rather blunt
> instrument. Which is why I'm suggesting to split the patch into two
> parts. Then we can apply the index stuff (which seems relatively
> straightforward) and keep working on this second part.
>
> FWIW I think it would be useful to have "development GUC" that would
> allow us to enable/disable these options during development, because
> that makes experiments much easier. But then remove them before commit.

Correct me if I'm wrong, but doesn't this patch concern itself
precisely with sort performance?

As such, estimating sort performance correctly in the various plan
variants being considered seems to be a very central aspect of it.

This means it's pretty much time to consider the effect of operator
cost *and* distinct values in the cost computation.

Assuming cost_sort does its thing, it's just a matter of building the
desired variants and picking the one with the smallest cost_sort. One
can use heuristics to build a subset of all possible column orderings
with a guiding principle that tries to maximize the likelihood of
finding a better order than the one in the query (like sorting by
ndistinct), but I'd suggest:

- Always considering the sort order verbatim as given in the SQL (ie:
what the user requests)
- Relying on cost_sort to distinguish among them, and pick a winner, and
- When in doubt (if cost_sort doesn't produce a clear winner), use the
order given by the user

Comparison cost can be approximated probabilistically as:

cost_comp = sum(cost_op_n * (1.0 / ndistinct(col_1_to_n)))

Where cost_op_n is the cost of the comparison function for column N,
and ndistinct(col_1_to_n) is an estimation of the number of distinct
values for columns prior to N in the sort order.

You can approximate ndistinct as the product of ndistinct of previous
columns, or use extended statistics when available.

I think that should give a good enough approximation of
ndistinct-enriched sort costs that considers operator cost
appropriately. That operator cost is basically an average and can be
used as a constant, so it's just a matter of passing the right
comparison_cost to cost_sort.
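
In code form, the estimate above amounts to something like this sketch
(hypothetical function; it assumes per-column ndistincts can simply be
multiplied, i.e. independence - extended statistics would replace that
product where available):

    static double
    estimate_comparison_cost(const double *op_cost,     /* per-column comparison cost */
                             const double *ndistinct,   /* per-column ndistinct estimate */
                             int ncols)
    {
        double      total = 0.0;
        double      prefix_ndistinct = 1.0;     /* distinct prefixes before column 0 */
        int         i;

        for (i = 0; i < ncols; i++)
        {
            /*
             * Column i is only compared when all previous columns compare
             * equal, which happens with probability 1 / ndistinct(prefix).
             */
            total += op_cost[i] / prefix_ndistinct;
            prefix_ndistinct *= ndistinct[i];
        }
        return total;
    }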

Even if ndistinct estimates are off, the fact that operator cost is
involved should be a good enough tool for the user should the planner
pick the wrong path - it's only a matter of boosting operator cost to
let the planner know that sorting that way is expensive.

Prioritization of the user-provided order can be as simple as giving
that comparison_cost a small handicap.



Re: Faster inserts with mostly-monotonically increasing values

2018-04-10 Thread Claudio Freire
On Tue, Apr 10, 2018 at 11:10 AM, Heikki Linnakangas  wrote:
>> /* XLOG stuff */
>> if (RelationNeedsWAL(rel))
>> {
>> ...
>>
>> if (P_ISLEAF(lpageop))
>> {
>> xlinfo = XLOG_BTREE_INSERT_LEAF;
>>
>> /*
>>  * Cache the block information if we just
>> inserted into the
>>  * rightmost leaf page of the index.
>>  */
>> if (P_RIGHTMOST(lpageop))
>> RelationSetTargetBlock(rel,
>> BufferGetBlockNumber(buf));
>> }
>> ...
>
>
>
> Why is this RelationSetTargetBlock() call inside the "XLOG stuff" block?
> ISTM that we're failing to take advantage of this optimization for unlogged
> tables, for no particular reason. Just an oversight?
>
> - Heikki

Indeed.

Maybe Pavan knows of one, but I don't see any reason not to apply this
to unlogged tables as well. It slipped through review.
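
Presumably the fix is just to hoist the call out of the WAL-only block,
along the lines of (a sketch, not the committed change):

    /*
     * Cache the block if we just inserted into the rightmost leaf page,
     * whether or not the relation is WAL-logged.
     */
    if (P_ISLEAF(lpageop) && P_RIGHTMOST(lpageop))
        RelationSetTargetBlock(rel, BufferGetBlockNumber(buf));

    /* XLOG stuff */
    if (RelationNeedsWAL(rel))
    {
        ...
        if (P_ISLEAF(lpageop))
            xlinfo = XLOG_BTREE_INSERT_LEAF;
        ...
    }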



Re: Vacuum: allow usage of more than 1GB of work mem

2018-04-06 Thread Claudio Freire
On Fri, Apr 6, 2018 at 5:25 PM, Claudio Freire <klaussfre...@gmail.com> wrote:
> On Fri, Apr 6, 2018 at 10:39 AM, Heikki Linnakangas <hlinn...@iki.fi> wrote:
>> On 06/04/18 01:59, Claudio Freire wrote:
>>>
>>> The iteration interface, however, seems quite specific for the use
>>> case of vacuumlazy, so it's not really a good abstraction.
>>
>>
>> Can you elaborate? It does return the items one block at a time. Is that
>> what you mean by being specific for vacuumlazy? I guess that's a bit
>> special, but if you imagine some other users for this abstraction, it's
>> probably not that unusual. For example, if we started using it in bitmap
>> heap scans, a bitmap heap scan would also want to get the TIDs one block
>> number at a time.
>
> But you're also tying the caller to the format of the buffer holding
> those TIDs, for instance. Why would you, when you can have an
> interface that just iterates TIDs and let the caller store them
> if/however they want?
>
> I do believe a pure iterator interface is a better interface.

Between the b-tree or not discussion and the refactoring to separate
the code, I don't think we'll get this in the next 24hs.

So I guess we'll have ample time to ponder on both issues during the
next commit fest.



Re: Vacuum: allow usage of more than 1GB of work mem

2018-04-05 Thread Claudio Freire
On Thu, Apr 5, 2018 at 5:02 PM, Heikki Linnakangas <hlinn...@iki.fi> wrote:
> On 03/04/18 17:20, Claudio Freire wrote:
>>
>> Ok, rebased patches attached
>
>
> Thanks! I took a look at this.
>
> First, now that the data structure is more complicated, I think it's time to
> abstract it, and move it out of vacuumlazy.c. The Tid Map needs to support
> the following operations:
>
> * Add TIDs, in order (in 1st phase of vacuum)
> * Random lookup, by TID (when scanning indexes)
> * Iterate through all TIDs, in order (2nd pass over heap)
>
> Let's add a new source file to hold the code for the tid map data structure,
> with functions corresponding those operations.
>
> I took a stab at doing that, and I think it makes vacuumlazy.c nicer.

About the refactoring to split this into its own set of files and
abstract away the underlying structure, I can totally get behind that.

The iteration interface, however, seems quite specific for the use
case of vacuumlazy, so it's not really a good abstraction. It also
copies stuff a lot, so it's quite heavyweight. I'd suggest trying to
go for a lighter weight interface with less overhead that is more
general at the same time.

If it were C++, I'd say build an iterator class.

In C, you'd probably do it with macros: one macro to get the current
element, another to advance to the next element, and another to check
whether you've reached the end.

I can do that if we agree on the points below:

> Secondly, I'm not a big fan of the chosen data structure. I think the only
> reason that the segmented "multi-array" was chosen is that each "segment"
> works is similar to the simple array that we used to have. After putting it
> behind the abstraction, it seems rather ad hoc. There are many standard
> textbook data structures that we could use instead, and would be easier to
> understand and reason about, I think.
>
> So, I came up with the attached patch. I used a B-tree as the data
> structure. Not sure it's the best one, I'm all ears for suggestions and
> bikeshedding on alternatives, but I'm pretty happy with that. I would expect
> it to be pretty close to the simple array with binary search in performance
> characteristics. It'd be pretty straightforward to optimize further, and
> e.g. use a bitmap of OffsetNumbers or some other more dense data structure
> in the B-tree leaf pages, but I resisted doing that as part of this patch.

About the B-tree, however, I don't think a B-tree is a good idea.

Trees' main benefit is that they support efficient insertion. When all
your data is loaded sequentially, in order, in memory, and is immutable,
a tree is pointless: more costly to build, and harder to maintain in
terms of code complexity.

In this use case, the only benefit of B-trees would be that they're
optimized for disk access. If we planned to store this on-disk,
perhaps I'd grant you that. But we don't plan to do that, and it's not
even clear doing it would be efficient enough for the intended use.

On the other hand, using B-trees incurs memory overhead due to the
need for internal nodes, can fragment memory because internal nodes
aren't the same size as leaf nodes, and is easier to get wrong and
introduce bugs... I don't see a gain. If you propose its use, at least
benchmark it to show some gain.

So I don't think B-tree is a good idea, the sorted array already is
good enough, and if not, it's at least close to the earlier
implementation and less likely to introduce bugs.
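
For illustration, lookup in the segmented sorted array is just two binary
searches, which keeps it O(log N). A simplified sketch (not the patch code),
assuming PostgreSQL's ItemPointer/ItemPointerCompare(), a non-empty segment
list, and sorted input:

    #include "postgres.h"
    #include "storage/itemptr.h"

    typedef struct
    {
        ItemPointerData last_tid;   /* highest TID stored in this segment */
        int         ntids;
        ItemPointer tids;           /* sorted array of dead TIDs */
    } TidSegment;

    static bool
    tid_lookup(TidSegment *segs, int nsegs, ItemPointer tid)
    {
        int         lo = 0,
                    hi = nsegs - 1;
        TidSegment *seg;
        int         l,
                    h;

        /* binary search for the first segment whose last_tid is >= tid */
        while (lo < hi)
        {
            int         mid = (lo + hi) / 2;

            if (ItemPointerCompare(&segs[mid].last_tid, tid) < 0)
                lo = mid + 1;
            else
                hi = mid;
        }
        seg = &segs[lo];

        /* plain binary search inside that segment */
        for (l = 0, h = seg->ntids - 1; l <= h;)
        {
            int         m = (l + h) / 2;
            int         c = ItemPointerCompare(&seg->tids[m], tid);

            if (c == 0)
                return true;
            else if (c < 0)
                l = m + 1;
            else
                h = m - 1;
        }
        return false;
    }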

Furthermore, among the 200-ish messages this thread has accumulated,
better ideas have been proposed (better because they use less memory
and are faster, like using bitmaps when possible), but if we can't push
a simple refactoring first, there's no chance a bigger rewrite will
fare better. Remember, in this use case, using less memory far
outweighs any other consideration. Less memory directly translates to
fewer iterations over the indexes, because more can be crammed into
m_w_m, which is a huge time saving. Far more than any
micro-optimization.

About 2 years ago, I chose to try to push this simple algorithm first,
then try to improve on it with better data structures. Nobody
complained at the time (IIRC), and I don't think it's fair to go and
revisit that now. It just delays getting a solution for this issue in
pursuit of "the perfect implementation" that might never arrive. And
even if it does, there's nothing stopping us from pushing another patch
in the future with that better implementation if we wish. Let's get
something simple and proven first.

> I haven't done any performance testing of this (and not much testing in
> general), but at least the abstraction seems better this way. Performance
> testing would be good, too. In particular, I'd like to know how this might
> affect the performance of lazy_tid_reaped(). That's

Re: Vacuum: allow usage of more than 1GB of work mem

2018-04-03 Thread Claudio Freire
On Tue, Apr 3, 2018 at 11:09 AM, Claudio Freire <klaussfre...@gmail.com> wrote:
> On Tue, Apr 3, 2018 at 11:06 AM, Claudio Freire <klaussfre...@gmail.com> 
> wrote:
>> I didn't receive your comment, I just saw it. Nevertheless, I rebased the 
>> patches a while ago just because I noticed they didn't apply anymore in 
>> cputube, and they still seem to apply.
>
> Sorry, that is false.
>
> They appear green in cputube, so I was confident they did apply, but I
> just double-checked on a recent pull and they don't. I'll rebase them
> shortly.


Ok, rebased patches attached
From 712aff9a856c2bae1d87555057e6546d029ecc47 Mon Sep 17 00:00:00 2001
From: Claudio Freire <klaussfre...@gmail.com>
Date: Mon, 12 Sep 2016 23:36:42 -0300
Subject: [PATCH 1/2] Vacuum: allow using more than 1GB work mem

Turn the dead_tuples array into a structure composed of several
exponentially bigger arrays, to enable usage of more than 1GB
of work mem during vacuum and thus reduce the number of full
index scans necessary to remove all dead tids when the memory is
available.

Improve test ability to spot vacuum errors
---
 src/backend/commands/vacuumlazy.c| 402 ---
 src/test/regress/expected/vacuum.out |  72 ++-
 src/test/regress/sql/vacuum.sql  |  40 +++-
 3 files changed, 438 insertions(+), 76 deletions(-)

diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index d2a0066..2f82dc6 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -11,11 +11,24 @@
  *
  * We are willing to use at most maintenance_work_mem (or perhaps
  * autovacuum_work_mem) memory space to keep track of dead tuples.  We
- * initially allocate an array of TIDs of that size, with an upper limit that
+ * initially allocate an array of TIDs of 128MB, or an upper limit that
  * depends on table size (this limit ensures we don't allocate a huge area
- * uselessly for vacuuming small tables).  If the array threatens to overflow,
+ * uselessly for vacuuming small tables). Additional arrays of increasingly
+ * large sizes are allocated as they become necessary.
+ *
+ * The TID array is thus represented as a list of multiple segments of
+ * varying size, beginning with the initial size of up to 128MB, and growing
+ * exponentially until the whole budget of (autovacuum_)maintenance_work_mem
+ * is used up.
+ *
+ * Lookup in that structure happens with a binary search of segments, and then
+ * with a binary search within each segment. Since segment's size grows
+ * exponentially, this retains O(log N) lookup complexity.
+ *
+ * If the multiarray's total size threatens to grow beyond maintenance_work_mem,
  * we suspend the heap scan phase and perform a pass of index cleanup and page
- * compaction, then resume the heap scan with an empty TID array.
+ * compaction, then resume the heap scan with an array of logically empty but
+ * already preallocated TID segments to be refilled with more dead tuple TIDs.
  *
  * If we're processing a table with no indexes, we can just vacuum each page
  * as we go; there's no need to save up multiple tuples to minimize the number
@@ -101,6 +114,14 @@
 #define LAZY_ALLOC_TUPLES		MaxHeapTuplesPerPage
 
 /*
+ * Minimum (starting) size of the dead_tuples array segments. Will allocate
+ * space for 128MB worth of tid pointers in the first segment, further segments
+ * will grow in size exponentially. Don't make it too small or the segment list
+ * will grow bigger than the sweetspot for search efficiency on big vacuums.
+ */
+#define LAZY_INIT_TUPLES		Max(MaxHeapTuplesPerPage, (128<<20) / sizeof(ItemPointerData))
+
+/*
  * Before we consider skipping a page that's marked as clean in
  * visibility map, we must've seen at least this many clean pages.
  */
@@ -112,6 +133,27 @@
  */
 #define PREFETCH_SIZE			((BlockNumber) 32)
 
+typedef struct DeadTuplesSegment
+{
+	ItemPointerData last_dead_tuple;	/* Copy of the last dead tuple (unset
+		 * until the segment is fully
+		 * populated). Keep it first to simplify
+		 * binary searches */
+	int			num_dead_tuples;	/* # of entries in the segment */
+	int			max_dead_tuples;	/* # of entries allocated in the segment */
+	ItemPointer dt_tids;		/* Array of dead tuples */
+}	DeadTuplesSegment;
+
+typedef struct DeadTuplesMultiArray
+{
+	int			num_entries;	/* current # of entries */
+	int			max_entries;	/* total # of slots that can be allocated in
+ * array */
+	int			num_segs;		/* number of dead tuple segments allocated */
+	int			last_seg;		/* last dead tuple segment with data (or 0) */
+	DeadTuplesSegment *dt_segments;		/* array of num_segs segments */
+}	DeadTuplesMultiArray;
+
 typedef struct LVRelStats
 {
 	/* hasindex = true means two-pass strategy; false means one-pass */
@@ -132,14 +174,20 @@ typedef struct LVRelStats
 	BlockNumber nonempty_pages; /* actually, last nonempty page + 1 */
 	/

Re: Vacuum: allow usage of more than 1GB of work mem

2018-04-03 Thread Claudio Freire
On Tue, Apr 3, 2018 at 11:06 AM, Claudio Freire <klaussfre...@gmail.com> wrote:
> I didn't receive your comment, I just saw it. Nevertheless, I rebased the 
> patches a while ago just because I noticed they didn't apply anymore in 
> cputube, and they still seem to apply.

Sorry, that is false.

They appear green in cputube, so I was confident they did apply, but I
just double-checked on a recent pull and they don't. I'll rebase them
shortly.



Re: Vacuum: allow usage of more than 1GB of work mem

2018-04-03 Thread Claudio Freire
I didn't receive your comment, I just saw it. Nevertheless, I rebased the 
patches a while ago just because I noticed they didn't apply anymore in 
cputube, and they still seem to apply.

Patch number 2 was committed a long while ago; that's why it's missing. It was
a simple patch, and it landed rewritten as commit
7e26e02eec90370dd222f35f00042f8188488ac4

Re: [HACKERS] [PATCH] Vacuum: Update FSM more frequently

2018-04-03 Thread Claudio Freire
On Tue, Apr 3, 2018 at 10:19 AM, Claudio Freire <klaussfre...@gmail.com> wrote:
> On Thu, Mar 29, 2018 at 2:09 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>> Claudio Freire <klaussfre...@gmail.com> writes:
>>> On Wed, Mar 28, 2018 at 6:54 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>>>> After 0001,
>>>> there's no reason to assume that vacuum is particularly likely to get
>>>> cancelled between having made cleanups and having updated the upper FSM
>>>> levels.  (Maybe the odds are a bit more for the no-indexes case, but
>>>> that doesn't seem like it's common enough to justify a special mechanism
>>>> either.)
>>
>>> Why not?
>>
>>> Any kind of DDL (even those that don't rewrite the heap) would cancel
>>> autovacuum.
>>
>>> You might think DDL isn't common enough to worry about, but I've seen
>>> cases where regular reindex were required to keep index bloat in check
>>> (and were cron'd), and those cancel autovacuum all the time.
>>
>> If you've got a situation where every vacuum gets canceled partway
>> through, you've got bloat problems regardless, because the heap tuples are
>> never getting removed in the first place; worrying about whether the FSM
>> is up to date is pretty pointless.  The 0001 patch basically updates the
>> FSM as soon as it can after the tuples are actually deleted, so I think
>> we've made the window as small as practical, and I don't really see a need
>> to do extra work (and add substantial extra complexity) to deal with
>> missed cleanup a bit sooner.
>>
>> People who are dealing with this sort of scenario a lot might be well
>> advised to reduce autovacuum's maintenance_work_mem, so that the cleanup
>> cycles happen more often.  That's kind of costly in terms of the number
>> of index scans, but it reduces the amount of cleanup work that can be
>> lost to a cancel.
>>
>> (I'd also argue that a setup such as you describe is very possibly making
>> things worse not better.  Perhaps the 0001 patch will go some way towards
>> making it less necessary to do that.)
>
> Alright, so we just drop 2.

So, that's it then.

Thanks



Re: [HACKERS] [PATCH] Vacuum: Update FSM more frequently

2018-04-03 Thread Claudio Freire
On Thu, Mar 29, 2018 at 7:55 PM, Tom Lane  wrote:
> I wrote:
>> I have to go do something
>> else right now, but I'll take a closer look at 0004 later.
>
> OK, so after studying 0004, it seems to me that we could do it more
> simply as attached; that is, move the IndexFreeSpaceMapVacuum calls
> into btvacuumscan/spgvacuumscan, do them only if we found any recyclable
> pages, and drop the calls in btvacuumcleanup/spgvacuumcleanup altogether.
>
> The reason I think this is sufficient is that the scans find and record
> every reusable page in the index, no matter whether they were recorded
> before or not.  Therefore, if we don't find any such pages, there's
> nothing useful in the FSM and no particular urgency about making its
> upper pages up-to-date.  It's true that if the FSM is actually corrupt,
> leaving that to be fixed retail by searches is probably less efficient
> than doing an IndexFreeSpaceMapVacuum call would be --- but *only* if
> you assume that the problem is just in the upper pages and the leaf
> pages are all fine.  That doesn't seem to be a case we should optimize
> for.

+1, LGTM.



Re: [HACKERS] [PATCH] Vacuum: Update FSM more frequently

2018-03-28 Thread Claudio Freire
On Wed, Mar 28, 2018 at 6:54 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Claudio Freire <klaussfre...@gmail.com> writes:
>> Attached patches, rebased and modified as discussed:
>> 1 no longer does tree pruning, it just vacuums a range of the FSM
>> 2 reintroduces tree pruning for the initial FSM vacuum
>> 3 and 4 remain as they were, but rebased
>
> I reviewed and cleaned up 0001.  The API for FreeSpaceMapVacuumRange was
> underspecified, and the use of it seemed to have off-by-one errors.  Also,
> you still had vacuum doing a full FreeSpaceMapVacuum call at the end;
> I thought the point was to get rid of that.  We then need to make sure
> we clean up after a truncation, but we can do that by introducing a
> FreeSpaceMapVacuumRange call into FreeSpaceMapTruncateRel.  I think the
> attached 0001 is committable, but feel free to take another look.

+
+ /*
+  * Periodically do incremental FSM vacuuming to make newly-freed
+  * space visible on upper FSM pages.  Note: although we've cleaned
+  * the current block, we haven't yet updated its FSM entry (that
+  * happens further down), so passing end == blkno is correct.
+  */
+ if (blkno - next_fsm_block_to_vacuum >= VACUUM_FSM_EVERY_PAGES)
+ {
+ FreeSpaceMapVacuumRange(onerel, next_fsm_block_to_vacuum,
+ blkno);
+ next_fsm_block_to_vacuum = blkno;
+ }
  }

Using next_fsm_block_to_vacuum there seems strange.

v10 counted the number of blocks with updated free space to vacuum the
FSM only after a lot of changes to it were made. This will vacuum the
FSM after *scanning* a lot of pages, even if few modifications were
made to them. Not sure which is better, but the logic in v10 sounded
right to me.

! if (fsm_end.logpageno == addr.logpageno)
! end_slot = fsm_end_slot;
! else if (fsm_end.logpageno > addr.logpageno)
! end_slot = SlotsPerFSMPage;
! else
! end_slot = -1;
!
! for (slot = start_slot; slot < end_slot; slot++)

This doesn't seem correct.

The +1 in v10 was intentional. Suppose a leaf node of the FSM has a
parent in slot 3, and that one has a parent at slot 10.

This would vacuum slots 0-9 on the upper level, but not 10. That would
leave a whole branch (the one where the end block belongs) unvisited.

That's why the end_slot has to be inclusive of the end block. We still
have work to do when recursing into the end block's slot.
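
In other words, something along these lines (a sketch only, reusing the
variable names from the quoted hunk):

if (fsm_end.logpageno == addr.logpageno)
    end_slot = fsm_end_slot;            /* stop at the end block's own slot */
else if (fsm_end.logpageno > addr.logpageno)
    end_slot = SlotsPerFSMPage - 1;     /* whole page lies inside the range */
else
    end_slot = -1;                      /* page lies past the range */

for (slot = start_slot; slot <= end_slot; slot++)   /* note: now inclusive */
{
    /* recurse into the child / update the parent as before */
}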


> I still don't like 0002.  It's adding a lot of complexity, and
> not-negligible overhead, to solve yesterday's problem.

Are you sure it's non-negligible?

Most of my benchmarks couldn't measure any change whatsoever after
this patch in regards to run/cpu time.

The size of the FSM is so much smaller than the table, that the cost
of vacuuming it is drowned by all the other work done and buried under
the noise.

Maybe under very special cases where vacuum does nothing, skipping all
rows, it might be measurable. A heavily bloated-then-cleaned table
with few live rows per page, but all frozen, that would be a total
worst-case. But reading the visibility map to skip rows is comparable
work to vacuuming the FSM. There's no reason to think it would worsen
by *that* much. I might try to benchmark that a bit after the long
weekend.

Anyway, it's a separate patch to be independently committed/vetted.

> After 0001,
> there's no reason to assume that vacuum is particularly likely to get
> cancelled between having made cleanups and having updated the upper FSM
> levels.  (Maybe the odds are a bit more for the no-indexes case, but
> that doesn't seem like it's common enough to justify a special mechanism
> either.)

Why not?

Any kind of DDL (even those that don't rewrite the heap) would cancel
autovacuum.

You might think DDL isn't common enough to worry about, but I've seen
cases where regular reindexes were required to keep index bloat in check
(and were cron'd), and those cancel autovacuum all the time.

> Not sure about 0004 either.  The fact that we can't localize what part of
> the index needs to be updated means that repeated IndexFreeSpaceMapVacuum
> calls are a roughly quadratic cost.

Well, the index bulk delete itself already is a huge elephant-sized
quadratic cost.

The FSM is only the cherry on top.

The updated part can't be localized because it isn't local. All blocks could
potentially be changed. Even in correlated indexes, upper levels need
not be physically correlated and would screw with the "vacuum block
range" optimization.

I could try to optimize the case where it's possible, by recording the
first and last blocks modified, but that would be a hugely invasive
patch (it would have to touch a lot of places in btree code).

And index bloat is a very real issue, as bad as heap bloat.

Re: Index scan prefetch?

2018-03-26 Thread Claudio Freire
On Mon, Mar 26, 2018 at 4:59 PM, Justin Pryzby  wrote:
>> But now effective_io_concurrency parameter is applicable only for bitmap
> ...
>> Will it be useful to support it also for index scan?
>> Or there are some other ways to address this problem?
>
> Does your case perform well with bitmap heap scan (I mean bitmap scan of the
> single index)?  It seems to me that prefetch wouldn't help, as it would just
> incur the same random cost you're already seeing; the solution may be to 
> choose
> another plan(bitmap) with sequential access to enable read-ahead,
>
> Also: Claudio mentioned here that bitmap prefetch can cause the kernel to 
> avoid
> its own readahead, negatively affecting some queries:
> https://www.postgresql.org/message-id/flat/8fb758a1-d7fa-4dcc-fb5b-07a992ae6a32%40gmail.com#20180207054227.ge17...@telsasoft.com
>
> What's the pg_stats "correlation" for the table column with index being
> scanned?  How many tuples?  Would you send explain(analyze,buffers) for the
> problem query, and with SET enable_bitmapscan=off; ?

Also, check out this thread:

http://www.postgresql-archive.org/Prefetch-index-pages-for-B-Tree-index-scans-td5728926.html



Re: [HACKERS] [PATCH] Vacuum: Update FSM more frequently

2018-03-26 Thread Claudio Freire
On Mon, Mar 26, 2018 at 2:46 PM, Claudio Freire <klaussfre...@gmail.com> wrote:
> On Mon, Mar 26, 2018 at 11:26 AM, Claudio Freire <klaussfre...@gmail.com> 
> wrote:
>> On Mon, Mar 26, 2018 at 11:19 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>>> Claudio Freire <klaussfre...@gmail.com> writes:
>>>> On Sat, Mar 24, 2018 at 4:17 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>>>>> I hadn't paid any attention to this patch previously, so maybe I'm
>>>>> missing something ... but this sure seems like a very bizarre approach
>>>>> to the problem.  If the idea is to fix the FSM's upper levels after
>>>>> vacuuming a known sub-range of the table, why would you not proceed
>>>>> by teaching FreeSpaceMapVacuum to recurse only within that sub-range
>>>>> of page numbers?  This setup with a threshold seems entirely Rube
>>>>> Goldbergian.  It's dependent on a magic threshold number that you can
>>>>> only select by trial and error, and it's inevitably going to spend time
>>>>> updating segments of the FSM tree that have nothing to do with the part
>>>>> that's been vacuumed.
>>>
>>>> Well, the point is to not only update the range we know we've
>>>> vacuumed, but also to finish the updates done by a potential
>>>> previously cancelled autovacuum.
>>>
>>> I think that's not an important consideration, or at least would stop
>>> being one after a patch like this.  The reason it's a problem now is
>>> precisely that we don't try to vacuum the FSM till the very end; if
>>> we did FSM cleanup every X pages --- in particular, before not after
>>> the final relation-truncation attempt --- there wouldn't be risk of
>>> skipping so much FSM work that we need to worry about forcing the
>>> issue just in case there'd been a prior cancellation.
>>
>> I'm thinking that in conjunction with the high MWM patch for vacuum,
>> it could still happen that considerable amount of vacuuming is left
>> unexposed upon cancellation.
>>
>> The "bizarre" approach provides some relief.
>>
>> I'll see about implementing the FSM range vacuuming operation for
>> non-initial runs, there seems to be consensus that it's a good idea.
>>
>> But I still think this partial run at the very beginning is useful still.
>
> Attached patches, rebased and modified as discussed:
>
> 1 no longer does tree pruning, it just vacuums a range of the FSM
>
> 2 reintroduces tree pruning for the initial FSM vacuum
>
> 3 and 4 remain as they were, but rebased

Sorry, ignore patch number 3 in the earlier mail; I selected the wrong
patch to attach.

Attached now is the correct number 3.
From c069f7590fd0ff24aa77d8025eda7e17527bca71 Mon Sep 17 00:00:00 2001
From: Claudio Freire <klaussfre...@gmail.com>
Date: Mon, 26 Feb 2018 13:23:59 -0300
Subject: [PATCH 3/4] FSM: Fix relation extension FSM update

When updating the FSM recursively during relation extension,
the update could step on higher-level entries with an incorrect
value, by always bubbling up the leaf value instead of the root
value that is the result of the bubble up in fsm_set_avail.

If somehow the root contained a different value from the new value
set by fsm_update_recursive, upper levels would be set to the
wrong value instead.

This didn't happen often since the value set during relation
extension is usually pretty high and unlikely to be beaten by
other pre-existing values, but the possibility existed nonetheless,
especially during bulk loads.
---
 src/backend/storage/freespace/freespace.c | 36 ++-
 1 file changed, 35 insertions(+), 1 deletion(-)

diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index 2f48092..80f5e80 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -107,6 +107,8 @@ static Size fsm_space_cat_to_avail(uint8 cat);
 /* workhorse functions for various operations */
 static int fsm_set_and_search(Relation rel, FSMAddress addr, uint16 slot,
    uint8 newValue, uint8 minValue);
+static uint8 fsm_set_and_get_maxval(Relation rel, FSMAddress addr, uint16 slot,
+	uint8 newValue);
 static BlockNumber fsm_search(Relation rel, uint8 min_cat);
 static uint8 fsm_vacuum_page(Relation rel, FSMAddress addr, uint8 threshold, bool *eof,
 			 BlockNumber start, BlockNumber end);
@@ -719,6 +721,33 @@ fsm_set_and_search(Relation rel, FSMAddress addr, uint16 slot,
 }
 
 /*
+ * Set value in given FSM page and slot.
+ *
+ * The new value at the root node is returned.
+ */
+static uint8
+fsm_set_and_get_maxval(Relation rel, FSMAddress addr, uint16

Re: [HACKERS] [PATCH] Vacuum: Update FSM more frequently

2018-03-26 Thread Claudio Freire
On Mon, Mar 26, 2018 at 11:26 AM, Claudio Freire <klaussfre...@gmail.com> wrote:
> On Mon, Mar 26, 2018 at 11:19 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>> Claudio Freire <klaussfre...@gmail.com> writes:
>>> On Sat, Mar 24, 2018 at 4:17 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>>>> I hadn't paid any attention to this patch previously, so maybe I'm
>>>> missing something ... but this sure seems like a very bizarre approach
>>>> to the problem.  If the idea is to fix the FSM's upper levels after
>>>> vacuuming a known sub-range of the table, why would you not proceed
>>>> by teaching FreeSpaceMapVacuum to recurse only within that sub-range
>>>> of page numbers?  This setup with a threshold seems entirely Rube
>>>> Goldbergian.  It's dependent on a magic threshold number that you can
>>>> only select by trial and error, and it's inevitably going to spend time
>>>> updating segments of the FSM tree that have nothing to do with the part
>>>> that's been vacuumed.
>>
>>> Well, the point is to not only update the range we know we've
>>> vacuumed, but also to finish the updates done by a potential
>>> previously cancelled autovacuum.
>>
>> I think that's not an important consideration, or at least would stop
>> being one after a patch like this.  The reason it's a problem now is
>> precisely that we don't try to vacuum the FSM till the very end; if
>> we did FSM cleanup every X pages --- in particular, before not after
>> the final relation-truncation attempt --- there wouldn't be risk of
>> skipping so much FSM work that we need to worry about forcing the
>> issue just in case there'd been a prior cancellation.
>
> I'm thinking that in conjunction with the high MWM patch for vacuum,
> it could still happen that considerable amount of vacuuming is left
> unexposed upon cancellation.
>
> The "bizarre" approach provides some relief.
>
> I'll see about implementing the FSM range vacuuming operation for
> non-initial runs, there seems to be consensus that it's a good idea.
>
> But I still think this partial run at the very beginning is useful still.

Attached patches, rebased and modified as discussed:

1 no longer does tree pruning, it just vacuums a range of the FSM

2 reintroduces tree pruning for the initial FSM vacuum

3 and 4 remain as they were, but rebased
From f640e1b75b0644ac41b006ac2c69b81b7ddf Mon Sep 17 00:00:00 2001
From: Claudio Freire <klaussfre...@gmail.com>
Date: Fri, 28 Jul 2017 21:31:59 -0300
Subject: [PATCH 1/4] Vacuum: Update FSM more frequently

Make Vacuum update the FSM more frequently, to avoid the case where
autovacuum fails to reach the point where it updates the FSM in
highly contended tables.

Implement heap block range FSM vacuuming, and make vacuumlazy use
it to vacuum the affected FSM at intermediate steps.
Intermediate FSM vacuums are only supposed to make enough
free space visible to avoid extension until the final (non-partial)
FSM vacuum.

Run partial FSM vacuums after each index pass, which is a point
at which whole ranges of the heap have been thoroughly cleaned, and
we can expect no further updates to those ranges of the FSM save
for concurrent activity. When there are no indexes, and thus no
index passes, run partial FSM vacuums every 8GB of dirtied pages
or 1/8th of the relation, whichever is highest. This allows some
partial work to be made visible without incurring quadratic cost.

In any case, the FSM is small in relation to the table, so even when
quadratic cost is involved, it should not be problematic. Index
passes already incur quadratic cost, and the addition of the FSM
is unlikely to be measurable.
---
 src/backend/commands/vacuumlazy.c | 57 +-
 src/backend/storage/freespace/freespace.c | 67 ---
 src/include/storage/freespace.h   |  2 +
 3 files changed, 118 insertions(+), 8 deletions(-)

diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index f9da24c..c44996f 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -85,6 +85,19 @@
 #define VACUUM_TRUNCATE_LOCK_TIMEOUT			5000	/* ms */
 
 /*
+ * When a table has no indexes, vacuum the FSM at most every 1/Nth
+ * of the relation has been vacuumed to prevent bloat during long-running
+ * vacuums. This specifies the N.
+ */
+#define VACUUM_FSM_EVERY_FRACTION 8
+
+/*
+ * When a table has no indexes, vacuum the FSM at most every 8GB
+ * worth of dirty pages.
+ */
+#define VACUUM_FSM_EVERY_PAGES (((Size)8 * 1024 * 1024 * 1024) / BLCKSZ)
+
+/*
  * Guesstimation of number of dead tuples per page.  This is used to
  * provide an upper limit to memory allocated when vacuuming small
  * tables.
@@ -465,7 +478

Re: [HACKERS] [PATCH] Vacuum: Update FSM more frequently

2018-03-26 Thread Claudio Freire
On Sat, Mar 24, 2018 at 4:17 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Claudio Freire <klaussfre...@gmail.com> writes:
>> [ 0001-Vacuum-Update-FSM-more-frequently-v9.patch ]
>
> I hadn't paid any attention to this patch previously, so maybe I'm
> missing something ... but this sure seems like a very bizarre approach
> to the problem.  If the idea is to fix the FSM's upper levels after
> vacuuming a known sub-range of the table, why would you not proceed
> by teaching FreeSpaceMapVacuum to recurse only within that sub-range
> of page numbers?  This setup with a threshold seems entirely Rube
> Goldbergian.  It's dependent on a magic threshold number that you can
> only select by trial and error, and it's inevitably going to spend time
> updating segments of the FSM tree that have nothing to do with the part
> that's been vacuumed.

Well, the point is to not only update the range we know we've
vacuumed, but also to finish the updates done by a potential
previously cancelled autovacuum.

But I guess that's only for the first run. Your approach seems like a
better idea for the other runs.

Just FTR, the number isn't so magic. During vacuum, the patch records
the highest amount of free space produced, and will only recurse into
branches that don't already record that much free space. So it's quite
deterministic in those cases; only the first run has to guess, and your
approach doesn't apply to that first run.
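
Roughly, the recursion test looks like this (a sketch with made-up names,
not the literal patch):

/*
 * While recursing down the FSM, prune any subtree whose slot already
 * advertises at least as much free space as this vacuum run produced;
 * it cannot need updating for our purposes.
 */
for (slot = 0; slot < SlotsPerFSMPage; slot++)
{
    if (fsm_get_avail(page, slot) >= max_freespace_this_run)
        continue;           /* branch already up to date: prune it */
    /* otherwise recurse into the child page as usual */
}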



Re: Faster inserts with mostly-monotonically increasing values

2018-03-22 Thread Claudio Freire
On Thu, Mar 22, 2018 at 3:29 AM, Pavan Deolasee
<pavan.deola...@gmail.com> wrote:
>
> On Thu, Mar 22, 2018 at 7:22 AM, Claudio Freire <klaussfre...@gmail.com>
> wrote:
>>
>>
>> If you're not planning on making any further changes, would you mind
>> posting a coalesced patch?
>>
>> Notice that I removed the last offset thing only halfway, so it would
>> need some cleanup.
>
>
> Here is an updated patch. I removed the last offset caching thing completely
> and integrated your changes for conditional lock access. Some other
> improvements to test cases  too. I realised that we must truncate and
> re-insert to test index fastpath correctly.

Thanks.

Some comments

+/*
+ * It's very common to have an index on an auto-incremented or
+ * monotonically increasing value. In such cases, every insertion happens
+ * towards the end of the index. We try to optimise that case by caching
+ * the right-most block of the index. If our cached block is still the
+ * rightmost block, has enough free space to accommodate a new entry and
+ * the insertion key is greater or equal to the first key in this page,
+ * then we can safely conclude that the new key will be inserted in the
+ * cached block. So we simply search within the cached page and insert the
+ * key at the appropriate location. We call it a fastpath.

It should say "the insertion key is strictly greater than the first key"

Equal cases cannot be accepted since they would break uniqueness checks,
so the comment should say that. The code is already correctly checking
for strict inequality; it's just the comment that is out of sync.

You might accept equal keys for non-unique indexes, but I don't
believe making that distinction in the code is worth the hassle.

Also, "rightmost block" != "rightmost leaf page" ("leaf" being the key
difference). So it should say "rightmost leaf page".


+if (P_ISLEAF(lpageop) && P_RIGHTMOST(lpageop) &&
+!P_INCOMPLETE_SPLIT(lpageop) &&
+!P_IGNORE(lpageop) &&
+(PageGetFreeSpace(page) > itemsz) &&
+PageGetMaxOffsetNumber(page) >= P_FIRSTDATAKEY(lpageop) &&
+_bt_compare(rel, natts, itup_scankey, page,
+P_FIRSTDATAKEY(lpageop)) > 0)
+{
+offset = InvalidOffsetNumber;
+fastpath = true;
+}
...later...
+if (!fastpath)
+{
+/* find the first page containing this key */
+stack = _bt_search(rel, natts, itup_scankey, false, , BT_WRITE,
+   NULL);
+
+offset = InvalidOffsetNumber;

Setting "offset = InvalidOffsetNumber" in that contorted way is
unnecessary. You can remove the first assignment and instead
initialize unconditionally right after the fastpath block (that
doesn't make use of offset anyway):
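
Something like this (just a sketch, eliding the full set of checks):

if (P_ISLEAF(lpageop) && P_RIGHTMOST(lpageop) /* ... rest of the checks ... */)
    fastpath = true;

/* unconditional, right after the fastpath block */
offset = InvalidOffsetNumber;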

In indexing.out:

+explain select sum(a) from fastpath where a = 6456;
+ QUERY PLAN
+
+ Aggregate  (cost=4.31..4.32 rows=1 width=8)
+   ->  Index Only Scan using fpindex1 on fastpath  (cost=0.29..4.30
rows=1 width=4)
+ Index Cond: (a = 6456)
+(3 rows)

Having costs in explain tests can be fragile. Better use "explain
(costs off)". If you run "make check" continuously in a loop, you
should get some failures related to that pretty quickly.

+truncate fastpath;
+insert into fastpath select y.x, 'b' || (y.x/10)::text, 100 from
(select generate_series(1,1) as x) y;
+vacuum fastpath;

Vacuum can also be a pain. In parallel tests, vacuum won't do what you
expect it to do, as other concurrent transactions will prevent it
from, say, marking all-visible pages.

In fact, you can get those spurious failures by running "make -j8
check-world" repeatedly (with tap tests enabled). That causes enough
concurrency to cause trouble and sporadic failures.

Question whether you need it. I'm not sure why you need the explain
tests, which I imagine are the reason why you need that vacuum there.

If you do need it, I successfully used the following helper in another patch:

CREATE FUNCTION wait_barrier() RETURNS void LANGUAGE plpgsql AS $$
DECLARE vxids text[];
BEGIN
  -- wait for all transactions to commit to make deleted tuples vacuumable
  SELECT array_agg(DISTINCT virtualtransaction) FROM pg_locks
    WHERE pid <> pg_backend_pid() INTO vxids;
  WHILE (SELECT count(virtualtransaction) FROM pg_locks
         WHERE pid <> pg_backend_pid()
           AND virtualtransaction = ANY(vxids)) > 0 LOOP
    PERFORM pg_sleep(0.1);
  END LOOP;
END
$$;

You call it right before vacuum to make sure all potentially
troublesome transactions finished by the time vacuum runs:

SELECT wait_barrier();
VACUUM blah;

Name the function something else, though, to avoid a name clash with
that patch. Also... that function up there is itself still up for review,
so take it with a grain of salt. It worked for me.



Re: Faster inserts with mostly-monotonically increasing values

2018-03-21 Thread Claudio Freire
On Mon, Mar 19, 2018 at 12:06 PM, Claudio Freire <klaussfre...@gmail.com> wrote:
> On Mon, Mar 19, 2018 at 11:57 AM, Pavan Deolasee
> <pavan.deola...@gmail.com> wrote:
>>
>>
>> On Thu, Mar 15, 2018 at 7:51 AM, Claudio Freire <klaussfre...@gmail.com>
>> wrote:
>>>
>>>
>>>
>>> I applied the attached patches on top of your patch, so it would be
>>> nice to see if you can give it a try in your test hardware to see
>>> whether it helps.
>>
>>
>> I can confirm that I no longer see any regression at 8 or even 16 clients.
>> In fact, the patched version consistent do better than master even at higher
>> number of clients.
>>
>> The only thing I am a bit puzzled is that I am no longer able to reproduce
>> the 40-50% gains that I'd earlier observed with a single client. I now get
>> 20-25% with smaller number of client and 10-15% with larger number of
>> clients. I haven't been able to establish why it's happening, but since it's
>> a different AWS instance (though of the same type), I am inclined to blame
>> it on that for now.
>
> Your original patch also skipped the check for serializable conflicts.
>
> *IF* that was correct, it probably further reduced contention. I'm not
> entirely sure if skipping that check is correct, but if it is, you can
> still accomplish this checking whether the new key is beyond the
> current rightmost key, and setting a flag so that check is later
> skipped.
>
> But there are lots of complex interactions in that code and I'm no
> expert there, I'd prefer if someone more knowledgeable could confirm
> whether it's safe to skip that check or not. Or leave it just in case.

If you're not planning on making any further changes, would you mind
posting a coalesced patch?

Notice that I removed the last offset thing only halfway, so it would
need some cleanup.



Re: Faster inserts with mostly-monotonically increasing values

2018-03-19 Thread Claudio Freire
On Mon, Mar 19, 2018 at 11:57 AM, Pavan Deolasee
<pavan.deola...@gmail.com> wrote:
>
>
> On Thu, Mar 15, 2018 at 7:51 AM, Claudio Freire <klaussfre...@gmail.com>
> wrote:
>>
>>
>>
>> I applied the attached patches on top of your patch, so it would be
>> nice to see if you can give it a try in your test hardware to see
>> whether it helps.
>
>
> I can confirm that I no longer see any regression at 8 or even 16 clients.
> In fact, the patched version consistent do better than master even at higher
> number of clients.
>
> The only thing I am a bit puzzled is that I am no longer able to reproduce
> the 40-50% gains that I'd earlier observed with a single client. I now get
> 20-25% with smaller number of client and 10-15% with larger number of
> clients. I haven't been able to establish why it's happening, but since it's
> a different AWS instance (though of the same type), I am inclined to blame
> it on that for now.

Your original patch also skipped the check for serializable conflicts.

*IF* that was correct, it probably further reduced contention. I'm not
entirely sure if skipping that check is correct, but if it is, you can
still accomplish this by checking whether the new key is beyond the
current rightmost key, and setting a flag so that check is later
skipped.

But there are lots of complex interactions in that code and I'm no
expert there, I'd prefer if someone more knowledgeable could confirm
whether it's safe to skip that check or not. Or leave it just in case.



Re: [HACKERS] GUC for cleanup indexes threshold.

2018-03-19 Thread Claudio Freire
On Mon, Mar 19, 2018 at 8:50 AM, Masahiko Sawada  wrote:
>>> require the bulk-delete method of scanning whole index and of logging
>>> WAL. But it leads some extra overhead. With this patch we no longer
>>> need to depend on the full scan on b-tree index. This might be useful
>>> for a future when we make the bulk-delete of b-tree index not scan
>>> whole index.
>>
>> Perhaps I'm taking something incorrectly, but is it just the
>> result of skipping 'maybe needed' scans without condiering the
>> actual necessity?
>
> I meant to scan only index pages that are relevant with garbages TIDs
> on a table. The current b-tree index bulk-deletion is very slow and
> heavy because we always scans the whole index even if there is only 1
> dead tuples in a table. To address this problem I'm thinking a way to
> make bulk-delete not scan whole index if there is a few dead tuples in
> a table. That is, we do index scans to collect the stack of btree
> pages and reclaim garbage. Maybe we will full index scan if there are
> a lot of dead tuples, which would be same as what we're doing on
> planning access paths.

In theory it's not very difficult to do. I was pondering doing some
PoC after the other vacuum patches get through.

TL;DR version is, as long as there's enough MWM to fit the keys, they
can be stashed before vacuuming the heap, and used to perform index
scans instead for index cleanup. If MWM runs out, it just goes back to
bulk-delete scans (it would imply there's a lot to clean up anyway, so
index scans wouldn't be worth it). A finer decision can be made with
random_page_cost on which technique is likely faster as well.
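
In very rough pseudo-code, the choice would be something along these lines
(entirely hypothetical, all names made up, no patch written yet):

if (keys_stashed_in_mwm)
{
    /*
     * Few dead tuples: delete each stashed key with an ordinary index
     * descent, touching only the relevant leaf pages.
     */
    cleanup_via_targeted_index_scans(indexes, stashed_keys);
}
else
{
    /*
     * maintenance_work_mem ran out: there's a lot to clean up anyway,
     * so the classic full-index bulk-delete scan is the better deal.
     */
    cleanup_via_bulk_delete(indexes, dead_tuple_tids);
}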

So yes, let's not paint ourselves into a corner where full index scans
are required for correct operation; that would make the above
impossible.



Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2018-03-16 Thread Claudio Freire
On Fri, Feb 9, 2018 at 1:05 PM, Claudio Freire <klaussfre...@gmail.com> wrote:
> Turns out that it was a tad oversized. 300k tuples seems enough.
>
> Attached is a new patch version that:
>
> - Uses an unlogged table to make the large mwm test faster
> - Uses a wait_barrier helper that waits for concurrent transactions
>   to finish before vacuuming tables, to make sure deleted tuples
>   actually are vacuumable
> - Tweaks the size of the large mwm test to be as small as possible
> - Optimizes the delete to avoid expensive operations yet attain
>   the same end result

Attached rebased versions of the patches (they weren't applying to
current master)
From d8241ba70bc70c5dd989d2dccdee03a71151c814 Mon Sep 17 00:00:00 2001
From: Claudio Freire <klaussfre...@gmail.com>
Date: Mon, 12 Sep 2016 23:36:42 -0300
Subject: [PATCH 1/2] Vacuum: allow using more than 1GB work mem

Turn the dead_tuples array into a structure composed of several
exponentially bigger arrays, to enable usage of more than 1GB
of work mem during vacuum and thus reduce the number of full
index scans necessary to remove all dead tids when the memory is
available.

Improve test ability to spot vacuum errors
---
 src/backend/commands/vacuumlazy.c| 402 ---
 src/test/regress/expected/vacuum.out |  72 ++-
 src/test/regress/sql/vacuum.sql  |  40 +++-
 3 files changed, 438 insertions(+), 76 deletions(-)

diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 9ac84e8..793ce74 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -11,11 +11,24 @@
  *
  * We are willing to use at most maintenance_work_mem (or perhaps
  * autovacuum_work_mem) memory space to keep track of dead tuples.  We
- * initially allocate an array of TIDs of that size, with an upper limit that
+ * initially allocate an array of TIDs of 128MB, or an upper limit that
  * depends on table size (this limit ensures we don't allocate a huge area
- * uselessly for vacuuming small tables).  If the array threatens to overflow,
+ * uselessly for vacuuming small tables). Additional arrays of increasingly
+ * large sizes are allocated as they become necessary.
+ *
+ * The TID array is thus represented as a list of multiple segments of
+ * varying size, beginning with the initial size of up to 128MB, and growing
+ * exponentially until the whole budget of (autovacuum_)maintenance_work_mem
+ * is used up.
+ *
+ * Lookup in that structure happens with a binary search of segments, and then
+ * with a binary search within each segment. Since segment's size grows
+ * exponentially, this retains O(log N) lookup complexity.
+ *
+ * If the multiarray's total size threatens to grow beyond maintenance_work_mem,
  * we suspend the heap scan phase and perform a pass of index cleanup and page
- * compaction, then resume the heap scan with an empty TID array.
+ * compaction, then resume the heap scan with an array of logically empty but
+ * already preallocated TID segments to be refilled with more dead tuple TIDs.
  *
  * If we're processing a table with no indexes, we can just vacuum each page
  * as we go; there's no need to save up multiple tuples to minimize the number
@@ -92,6 +105,14 @@
 #define LAZY_ALLOC_TUPLES		MaxHeapTuplesPerPage
 
 /*
+ * Minimum (starting) size of the dead_tuples array segments. Will allocate
+ * space for 128MB worth of tid pointers in the first segment, further segments
+ * will grow in size exponentially. Don't make it too small or the segment list
+ * will grow bigger than the sweetspot for search efficiency on big vacuums.
+ */
+#define LAZY_INIT_TUPLES		Max(MaxHeapTuplesPerPage, (128<<20) / sizeof(ItemPointerData))
+
+/*
  * Before we consider skipping a page that's marked as clean in
  * visibility map, we must've seen at least this many clean pages.
  */
@@ -103,6 +124,27 @@
  */
 #define PREFETCH_SIZE			((BlockNumber) 32)
 
+typedef struct DeadTuplesSegment
+{
+	ItemPointerData last_dead_tuple;	/* Copy of the last dead tuple (unset
+		 * until the segment is fully
+		 * populated). Keep it first to simplify
+		 * binary searches */
+	int			num_dead_tuples;	/* # of entries in the segment */
+	int			max_dead_tuples;	/* # of entries allocated in the segment */
+	ItemPointer dt_tids;		/* Array of dead tuples */
+}	DeadTuplesSegment;
+
+typedef struct DeadTuplesMultiArray
+{
+	int			num_entries;	/* current # of entries */
+	int			max_entries;	/* total # of slots that can be allocated in
+ * array */
+	int			num_segs;		/* number of dead tuple segments allocated */
+	int			last_seg;		/* last dead tuple segment with data (or 0) */
+	DeadTuplesSegment *dt_segments;		/* array of num_segs segments */
+}	DeadTuplesMultiArray;
+
 typedef struct LVRelStats
 {
 	/* hasindex = true means two-pass strategy; false means one-pass */
@@ -123,14 +165,20 @@ typedef struct LVRelStats
 	BlockNumber 

Re: Faster inserts with mostly-monotonically increasing values

2018-03-14 Thread Claudio Freire
On Wed, Mar 14, 2018 at 12:05 PM, Claudio Freire <klaussfre...@gmail.com> wrote:
> On Wed, Mar 14, 2018 at 1:36 AM, Pavan Deolasee
> <pavan.deola...@gmail.com> wrote:
>>
>>
>> On Sun, Mar 11, 2018 at 9:18 PM, Claudio Freire <klaussfre...@gmail.com>
>> wrote:
>>>
>>> On Sun, Mar 11, 2018 at 2:27 AM, Pavan Deolasee
>>>
>>> >
>>> > Yes, I will try that next - it seems like a good idea. So the idea would
>>> > be:
>>> > check if the block is still the rightmost block and the insertion-key is
>>> > greater than the first key in the page. If those conditions are
>>> > satisfied,
>>> > then we do a regular binary search within the page to find the correct
>>> > location. This might add an overhead of binary search when keys are
>>> > strictly
>>> > ordered and a single client is inserting the data. If that becomes a
>>> > concern, we might be able to look for that special case too and optimise
>>> > for
>>> > it too.
>>>
>>> Yeah, pretty much that's the idea. Beware, if the new item doesn't
>>> fall in the rightmost place, you still need to check for serialization
>>> conflicts.
>>
>>
>> So I've been toying with this idea since yesterday and I am quite puzzled
>> with the results. See the attached patch which compares the insertion key
>> with the last key inserted by this backend, if the cached block is still the
>> rightmost block in the tree. I initially only compared with the first key in
>> the page, but I tried this version because of the strange performance
>> regression which I still have no answers.
>>
>> For a small number of clients, the patched version does better. But as the
>> number of clients go up, the patched version significantly underperforms
>> master. I roughly counted the number of times the fastpath is taken and I
>> noticed that almost 98% inserts take the fastpath. I first thought that the
>> "firstkey" location in the page might be becoming a hot-spot for concurrent
>> processes and hence changed that to track the per-backend last offset and
>> compare against that the next time. But that did not help much.
>
> +_bt_compare(rel, natts, itup_scankey, page,
> +RelationGetLastOffset(rel)) >= 0)
>
> Won't this access garbage if the last offset is stale and beyond the
> current rightmost page's last offset?
>
> I'd suggest simply using P_FIRSTDATAKEY after checking that the page
> isn't empty (see _bt_binsrch).
>
> On Wed, Mar 14, 2018 at 11:12 AM, Pavan Deolasee
> <pavan.deola...@gmail.com> wrote:
>>> > Hmm. I can try that. It's quite puzzling though that slowing down things
>>> > actually make them better.
>>>
>>> That's not what is happening though.
>>>
>>> The cache path is 1) try to lock cached block, 2) when got lock check
>>> relevance, if stale 3) recheck from top
>>>
>>> The non-cached path is just 3) recheck from top
>>>
>>> The overall path length is longer in the cached case but provides
>>> benefit if we can skip step 3 in high % of cases. The non-cached path
>>> is more flexible because it locates the correct RHS block, even if it
>>> changes dynamically between starting the recheck from top.
>>>
>>
>> So as I noted in one of the previous emails, the revised patch still takes
>> fast path in 98% cases. So it's not clear why the taking steps 1, 2 and 3 in
>> just 2% cases should cause such dramatic slowdown.
>
> Real-world workloads will probably take the slow path more often, so
> it's probably worth keeping the failure path as contention-free as
> possible.
>
> Besides, even though it may be just 2% the times it lands there, it
> could still block for a considerable amount of time for no benefit.
>
> So I guess a conditional lock is not a bad idea.

Re all of the above, I did some tests.

I couldn't reproduce the regression you showed, in fact not at all,
but I was able to get consistently lock-bound behavior with 8 clients
by setting fsync=off.

Something noteworthy is that both master and patched are lock-bound.
It just must be that, when that happens, the extra check for the
rightmost page really hurts.

So, I tried the conditional locks, and indeed it (at least in my
meager hardware) turns the lock-bound test into an I/O bound one.
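
The gist of it is just this (a sketch of the idea, not necessarily the
literal patch):

/*
 * Never block on the cached rightmost block: if someone else holds it,
 * fall back to the normal descent from the root instead of waiting.
 */
if (RelationGetTargetBlock(rel) != InvalidBlockNumber)
{
    buf = ReadBuffer(rel, RelationGetTargetBlock(rel));
    if (ConditionalLockBuffer(buf))
    {
        /* got the lock without sleeping: re-check that this is still the
         * rightmost leaf and that the new key fits, as before */
    }
    else
        ReleaseBuffer(buf);     /* contended: give up on the fastpath */
}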

I applied the attached patches on top of your patch, so it would be
nice to see if you can give it a try in your test hardware to see
whether it helps.
From 95d558c50b95729d109aa0135795c1887812ee1d Mon Sep 

Re: Faster inserts with mostly-monotonically increasing values

2018-03-14 Thread Claudio Freire
On Wed, Mar 14, 2018 at 1:36 AM, Pavan Deolasee
<pavan.deola...@gmail.com> wrote:
>
>
> On Sun, Mar 11, 2018 at 9:18 PM, Claudio Freire <klaussfre...@gmail.com>
> wrote:
>>
>> On Sun, Mar 11, 2018 at 2:27 AM, Pavan Deolasee
>>
>> >
>> > Yes, I will try that next - it seems like a good idea. So the idea would
>> > be:
>> > check if the block is still the rightmost block and the insertion-key is
>> > greater than the first key in the page. If those conditions are
>> > satisfied,
>> > then we do a regular binary search within the page to find the correct
>> > location. This might add an overhead of binary search when keys are
>> > strictly
>> > ordered and a single client is inserting the data. If that becomes a
>> > concern, we might be able to look for that special case too and optimise
>> > for
>> > it too.
>>
>> Yeah, pretty much that's the idea. Beware, if the new item doesn't
>> fall in the rightmost place, you still need to check for serialization
>> conflicts.
>
>
> So I've been toying with this idea since yesterday and I am quite puzzled
> with the results. See the attached patch which compares the insertion key
> with the last key inserted by this backend, if the cached block is still the
> rightmost block in the tree. I initially only compared with the first key in
> the page, but I tried this version because of the strange performance
> regression which I still have no answers.
>
> For a small number of clients, the patched version does better. But as the
> number of clients go up, the patched version significantly underperforms
> master. I roughly counted the number of times the fastpath is taken and I
> noticed that almost 98% inserts take the fastpath. I first thought that the
> "firstkey" location in the page might be becoming a hot-spot for concurrent
> processes and hence changed that to track the per-backend last offset and
> compare against that the next time. But that did not help much.

+_bt_compare(rel, natts, itup_scankey, page,
+RelationGetLastOffset(rel)) >= 0)

Won't this access garbage if the last offset is stale and beyond the
current rightmost page's last offset?

I'd suggest simply using P_FIRSTDATAKEY after checking that the page
isn't empty (see _bt_binsrch).

On Wed, Mar 14, 2018 at 11:12 AM, Pavan Deolasee
<pavan.deola...@gmail.com> wrote:
>> > Hmm. I can try that. It's quite puzzling though that slowing down things
>> > actually make them better.
>>
>> That's not what is happening though.
>>
>> The cache path is 1) try to lock cached block, 2) when got lock check
>> relevance, if stale 3) recheck from top
>>
>> The non-cached path is just 3) recheck from top
>>
>> The overall path length is longer in the cached case but provides
>> benefit if we can skip step 3 in high % of cases. The non-cached path
>> is more flexible because it locates the correct RHS block, even if it
>> changes dynamically between starting the recheck from top.
>>
>
> So as I noted in one of the previous emails, the revised patch still takes
> fast path in 98% cases. So it's not clear why the taking steps 1, 2 and 3 in
> just 2% cases should cause such dramatic slowdown.

Real-world workloads will probably take the slow path more often, so
it's probably worth keeping the failure path as contention-free as
possible.

Besides, even though it may land there only 2% of the time, it
could still block for a considerable amount of time for no benefit.

So I guess a conditional lock is not a bad idea.



Re: Faster inserts with mostly-monotonically increasing values

2018-03-14 Thread Claudio Freire
On Wed, Mar 14, 2018 at 10:51 AM, Simon Riggs
 wrote:
> If there is enough delay in step 1 then any benefit is lost anyway, so
> there is no point taking the cached path even when successful, so its
> better to ignore in that case.

I find myself agreeing with that.



Re: Faster inserts with mostly-monotonically increasing values

2018-03-13 Thread Claudio Freire
On Wed, Mar 14, 2018 at 1:36 AM, Pavan Deolasee
<pavan.deola...@gmail.com> wrote:
>
>
> On Sun, Mar 11, 2018 at 9:18 PM, Claudio Freire <klaussfre...@gmail.com>
> wrote:
>>
>> On Sun, Mar 11, 2018 at 2:27 AM, Pavan Deolasee
>>
>> >
>> > Yes, I will try that next - it seems like a good idea. So the idea would
>> > be:
>> > check if the block is still the rightmost block and the insertion-key is
>> > greater than the first key in the page. If those conditions are
>> > satisfied,
>> > then we do a regular binary search within the page to find the correct
>> > location. This might add an overhead of binary search when keys are
>> > strictly
>> > ordered and a single client is inserting the data. If that becomes a
>> > concern, we might be able to look for that special case too and optimise
>> > for
>> > it too.
>>
>> Yeah, pretty much that's the idea. Beware, if the new item doesn't
>> fall in the rightmost place, you still need to check for serialization
>> conflicts.
>
>
> So I've been toying with this idea since yesterday and I am quite puzzled
> with the results. See the attached patch which compares the insertion key
> with the last key inserted by this backend, if the cached block is still the
> rightmost block in the tree. I initially only compared with the first key in
> the page, but I tried this version because of the strange performance
> regression which I still have no answers.
>
> For a small number of clients, the patched version does better. But as the
> number of clients go up, the patched version significantly underperforms
> master. I roughly counted the number of times the fastpath is taken and I
> noticed that almost 98% inserts take the fastpath. I first thought that the
> "firstkey" location in the page might be becoming a hot-spot for concurrent
> processes and hence changed that to track the per-backend last offset and
> compare against that the next time. But that did not help much.
>
> +---------+-------------------------------+--------------------------------+
> | clients | Master - Avg load time in sec | Patched - Avg load time in sec |
> +---------+-------------------------------+--------------------------------+
> | 1       | 500.0725203                   | 347.632079                     |
> +---------+-------------------------------+--------------------------------+
> | 2       | 308.4580771                   | 263.9120163                    |
> +---------+-------------------------------+--------------------------------+
> | 4       | 359.4851779                   | 514.7187444                    |
> +---------+-------------------------------+--------------------------------+
> | 8       | 476.4062592                   | 780.2540855                    |
> +---------+-------------------------------+--------------------------------+
>
> The perf data does not show anything interesting either. I mean there is a
> reduction in CPU time spent in btree related code in the patched version,
> but the overall execution time to insert the same number of records go up
> significantly.
>
> Perf (master):
> ===
>
> +   72.59% 1.81%  postgres  postgres[.] ExecInsert
> +   47.55% 1.27%  postgres  postgres[.]
> ExecInsertIndexTuples
> +   44.24% 0.48%  postgres  postgres[.] btinsert
> -   42.40% 0.87%  postgres  postgres[.] _bt_doinsert
>- 41.52% _bt_doinsert
>   + 21.14% _bt_search
>   + 12.57% _bt_insertonpg
>   + 2.03% _bt_binsrch
> 1.60% _bt_mkscankey
> 1.20% LWLockAcquire
>   + 1.03% _bt_freestack
> 0.67% LWLockRelease
> 0.57% _bt_check_unique
>+ 0.87% _start
> +   26.03% 0.95%  postgres  postgres[.] ExecScan
> +   21.14% 0.82%  postgres  postgres[.] _bt_search
> +   20.70% 1.31%  postgres  postgres[.] ExecInterpExpr
> +   19.05% 1.14%  postgres  postgres[.] heap_insert
> +   18.84% 1.16%  postgres  postgres[.] nextval_internal
> +   18.08% 0.84%  postgres  postgres[.] ReadBufferExtended
> +   17.24% 2.03%  postgres  postgres[.] ReadBuffer_common
> +   12.57% 0.59%  postgres  postgres[.] _bt_insertonpg
> +   11.12% 1.63%  postgres  postgres[.] XLogInsert
> +9.90% 0.10%  postgres  postgres[.] _bt_relandgetbuf
> +8.97% 1.16%  postgres  postgres[.] LWLockAcquire
> +8.42% 2.03%  postgres  postgres[.] XLogInsertRecord
> +7.26% 1.01%  postgres  postgres

Re: Faster inserts with mostly-monotonically increasing values

2018-03-11 Thread Claudio Freire
On Sun, Mar 11, 2018 at 2:27 AM, Pavan Deolasee
<pavan.deola...@gmail.com> wrote:
>
>
> On Sat, Mar 10, 2018 at 12:11 AM, Claudio Freire <klaussfre...@gmail.com>
> wrote:
>>
>> On Fri, Mar 9, 2018 at 2:54 PM, Pavan Deolasee <pavan.deola...@gmail.com>
>> wrote:
>> >
>> >
>>
>> >
>> > So yes, the benefits of the patch go down with higher number of clients,
>> > but
>> > it does not entirely vanish.
>>
>> What if you implement my suggestion?
>>
>> That should improve the multi-client case considerably.
>
>
>
> Yes, I will try that next - it seems like a good idea. So the idea would be:
> check if the block is still the rightmost block and the insertion-key is
> greater than the first key in the page. If those conditions are satisfied,
> then we do a regular binary search within the page to find the correct
> location. This might add an overhead of binary search when keys are strictly
> ordered and a single client is inserting the data. If that becomes a
> concern, we might be able to look for that special case too and optimise for
> it too.

Yeah, pretty much that's the idea. Beware, if the new item doesn't
fall in the rightmost place, you still need to check for serialization
conflicts.
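
(For reference, the check in question is the CheckForSerializableConflictIn()
call that _bt_doinsert() makes right before the actual insertion; a sketch of
the relevant line:)

/* must still run if the fastpath can land the new item anywhere other
 * than past the current rightmost key */
CheckForSerializableConflictIn(rel, NULL, buf);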



Re: Faster inserts with mostly-monotonically increasing values

2018-03-09 Thread Claudio Freire
On Fri, Mar 9, 2018 at 2:54 PM, Pavan Deolasee <pavan.deola...@gmail.com> wrote:
>
>
> On Tue, Mar 6, 2018 at 10:10 AM, Pavan Deolasee <pavan.deola...@gmail.com>
> wrote:
>>
>>
>>
>> On Tue, Mar 6, 2018 at 7:29 AM, Peter Geoghegan <p...@bowt.ie> wrote:
>>>
>>> On Mon, Mar 5, 2018 at 5:48 PM, Claudio Freire <klaussfre...@gmail.com>
>>> wrote:
>>>
>>> > I believe PKs are a prime candidate for this optimization, and
>>> > expecting it to apply only when no concurrency is involved is severely
>>> > dumbing down the optimization.
>>>
>>> Pavan justified the patch using a benchmark that only involved a
>>> single client -- hardly typical for a patch that changes the B-Tree
>>> code. If the benefits with many clients can be shown to matter, that
>>> will make this much more interesting to me.
>>
>>
>> Ok. I will repeat those tests with more number of clients and report back.
>>
>
> So I repeated the tests with 1,2,4 and 8 clients, each running the following
> statement and a total of 1024 transactions. So roughly 100M rows are
> inserted.
>
> INSERT INTO testtab(b) SELECT generate_series(1,10);
>
> The table definition is:
> postgres=# \d+ testtab
>                                        Table "public.testtab"
>  Column |  Type  | Collation | Nullable |              Default               | Storage | Stats target | Description
> --------+--------+-----------+----------+------------------------------------+---------+--------------+-------------
>  a      | bigint |           | not null | nextval('testtab_a_seq'::regclass) | plain   |              |
>  b      | bigint |           |          |                                    | plain   |              |
> Indexes:
>     "testtab_a_key" UNIQUE CONSTRAINT, btree (a)
>
>
> After taking average of 3-runs:
>
> +---------+-----------------------+----------------------+
> | clients | Patched - time in sec | Master - time in sec |
> +---------+-----------------------+----------------------+
> | 1       | 311.8643602           | 411.832757           |
> +---------+-----------------------+----------------------+
> | 2       | 252.5433              | 300.7875613          |
> +---------+-----------------------+----------------------+
> | 4       | 337.0414279           | 350.9636766          |
> +---------+-----------------------+----------------------+
> | 8       | 444.2035582           | 477.1903417          |
> +---------+-----------------------+----------------------+
>
> So yes, the benefits of the patch go down with higher number of clients, but
> it does not entirely vanish.

What if you implement my suggestion?

That should improve the multi-client case considerably.



Re: disable SSL compression?

2018-03-08 Thread Claudio Freire
On Thu, Mar 8, 2018 at 11:06 PM, Peter Eisentraut
<peter.eisentr...@2ndquadrant.com> wrote:
> On 3/8/18 14:23, Claudio Freire wrote:
>> On Thu, Mar 8, 2018 at 3:40 PM, Peter Eisentraut
>> <peter.eisentr...@2ndquadrant.com> wrote:
>>> It appears that SSL compression is nowadays deprecated as insecure.
>>> Yet, it is still enabled by libpq by default, and there is no way to
>>> disable it in the server.  Should we make some changes here?  Does
>>> anyone know more about this?
>>
>> Even if libpq enables it, it has to be enabled both in the client and
>> the server for it to work.
>>
>> OpenSSL disables the whole feature by default, and enabling it is
>> rather cumbersome. The result is that, at least with OpenSSL, the
>> server and client won't accept compression without extensive fiddling
>> by the user.
>
> But however that may be, libpq appears to enable it by default.  This is
> what I get from psql:
>
> SSL connection (protocol: TLSv1.2, cipher: ECDHE-RSA-AES256-GCM-SHA384,
> bits: 256, compression: on)

I don't get that:

SSL connection (protocol: TLSv1.2, cipher:
ECDHE-RSA-AES256-GCM-SHA384, bits: 256, compression: off)

Even if I set OPENSSL_DEFAULT_ZLIB=1 on the client, I get the same.
The server side refuses.



Re: disable SSL compression?

2018-03-08 Thread Claudio Freire
On Thu, Mar 8, 2018 at 3:40 PM, Peter Eisentraut
 wrote:
> It appears that SSL compression is nowadays deprecated as insecure.
> Yet, it is still enabled by libpq by default, and there is no way to
> disable it in the server.  Should we make some changes here?  Does
> anyone know more about this?

Even if libpq enables it, it has to be enabled both in the client and
the server for it to work.

OpenSSL disables the whole feature by default, and enabling it is
rather cumbersome. The result is that, at least with OpenSSL, the
server and client won't accept compression without extensive fiddling
by the user.

So I don't think libpq has to change anything here.



Re: Faster inserts with mostly-monotonically increasing values

2018-03-06 Thread Claudio Freire
On Tue, Mar 6, 2018 at 9:06 AM, Simon Riggs  wrote:
>> Simon had raised concerns about DESC indexes and whether we need to do the
>> checks for leftmost page in that case. I haven't yet figured out if DESC
>> indexes are actually stored in the reverse order. I am gonna look at that
>> too.
>
> No, I meant that you were testing whether the value was higher (> 0),
> whereas it should be lower (< 0) on DESC indexes.

Isn't that already handled by _bt_compare?



Re: Faster inserts with mostly-monotonically increasing values

2018-03-06 Thread Claudio Freire
On Tue, Mar 6, 2018 at 1:45 AM, Peter Geoghegan <p...@bowt.ie> wrote:
> On Mon, Mar 5, 2018 at 7:10 PM, Claudio Freire <klaussfre...@gmail.com> wrote:
>>> Introducing any case that allows us to land on a recycled page, and
>>> reason that it must at least not be the page we were looking for based
>>> on *any* criteria about the page itself seems very brittle. Yeah, it
>>> probably won't break in practice, but it's a bad design.
>>
>> How is this any different from btvacuumscan?
>>
>> I don't think it can be argued that somehow btvacuumscan has
>> permission to be brittle and the rest of the code doesn't.
>
> VACUUM doesn't have to worry about concurrent page recycling because
> it is already the only thing that performs page deletion. It's already
> the process that has the exclusive right to give index pages back to
> the FSM.

Not *concurrent* recycling, but it does have to handle *recycled*
pages correctly.

With the write lock, I don't think there's an issue with *concurrent*
recycling. According to the readme, leaf pages aren't deleted at all,
so only concurrent splitting is a concern.

The only issue is that we may land on a *recycled* page. Now, it
depends on what you mean by recycled here... if you mean "deleted",
!P_IGNORE will skip that page. If you mean it was deleted and freed
but not reused, again, !P_IGNORE will skip it (much as it happens for
vacuum). If you mean that someone reused it, then it will be a valid
page with valid headers, so we need not worry about it not being
consistent. It can be some other page than the one we expect it to be,
but the checks ought to be sufficient to quickly verify whether that's
the case.

Unless you see a way in which a write-locked page could say it is a
rightmost leaf when it is not?



Re: Faster inserts with mostly-monotonically increasing values

2018-03-05 Thread Claudio Freire
On Mon, Mar 5, 2018 at 10:59 PM, Peter Geoghegan <p...@bowt.ie> wrote:
> On Mon, Mar 5, 2018 at 5:48 PM, Claudio Freire <klaussfre...@gmail.com> wrote:
>> But in thise case there's no right link to follow, so it's a non-issue.
>>
>> BTree doesn't truncate AFAIK, so the cached block number can't be
>> nonexisting (beyond the end of the relation). It's probably fragile to
>> assume BTree will never truncate, so maybe it would be wise to check
>> for that. But if the block exists, I believe it contains a page that
>> can be safely interpreted by that condition. As long as there can be
>> no other pages with the same flags on the index while the cached page
>> is write-locked, that code will be safe.
>
> Introducing any case that allows us to land on a recycled page, and
> reason that it must at least not be the page we were looking for based
> on *any* criteria about the page itself seems very brittle. Yeah, it
> probably won't break in practice, but it's a bad design.

How is this any different from btvacuumscan?

I don't think it can be argued that somehow btvacuumscan has
permission to be brittle and the rest of the code doesn't.

Point is, it's not brittle. The on-disk format of the tree is such
that any block can be catalogued by inspecting its header;
btvacuumscan depends on that. Questions that can be answered by looking
at a page header alone are:

* Is it deleted or new?
* Is it a leaf?
* Is it rightmost?

Only question remaining is whether it's the *only* rightmost leaf, and
that can be guaranteed by locking.
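
Concretely, all of the above can be answered with the standard page-header
macros; a sketch (buf is the cached block, already read and write-locked):

Page        page = BufferGetPage(buf);
BTPageOpaque opaque;

if (!PageIsNew(page))
{
    opaque = (BTPageOpaque) PageGetSpecialPointer(page);
    if (!P_IGNORE(opaque) &&    /* not deleted or half-dead */
        P_ISLEAF(opaque) &&     /* a leaf */
        P_RIGHTMOST(opaque))    /* no right sibling */
    {
        /* with the buffer exclusively locked, this is the rightmost leaf */
    }
}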

So, can a flag check result in a random outcome? No. That would also
cause a random outcome for btvacuumscan and then vacuum would corrupt
the index just as well.

And now that we mention this... why is the code not using _bt_getbuf?
It already does quite a bit of sanity checking that is IMO quite
desirable here.



Re: Faster inserts with mostly-monotonically increasing values

2018-03-05 Thread Claudio Freire
On Mon, Mar 5, 2018 at 9:52 PM, Peter Geoghegan <p...@bowt.ie> wrote:
> On Mon, Mar 5, 2018 at 4:37 PM, Claudio Freire <klaussfre...@gmail.com> wrote:
>> Correct me if I'm wrong, but there's lots of code in nbtree already
>> that risks reading recycled pages for various reasons. Presumably,
>> checking P_ISDELETED and P_ISHALFDEAD should be enough, which is what
>> !P_IGNORE implies.
>
> You're mistaken. Read the nbtree README on page deletion, and the
> RecentGlobalXmin interlock. Note also that there are 3 distinct page
> states in play as far as deletion/recycling is concerned:
>
> 1. The first phase of deletion
> 2. The second phase of deletion.
> 3. Post-recycling, when in general the page could look like just about 
> anything.
>
> It's page deletion's job to make sure that an index scan cannot land
> on a page after it was recycled. The corollary is that it is not
> everyone else's job to make sure that that didn't happen -- how could
> that possibly work?

From what I read, both phases 1 & 2 are served by the !P_IGNORE check.

For the third phase:

> A deleted page can only be reclaimed once there is no scan or search that
> has a reference to it; until then, it must stay in place with its
> right-link undisturbed.

That's because:

> A deleted page cannot be reclaimed immediately, since there may be other
> processes waiting to reference it (ie, search processes that just left the
> parent, or scans moving right or left from one of the siblings).  These
> processes must observe that the page is marked dead and recover
> accordingly.  Searches and forward scans simply follow the right-link
> until they find a non-dead page --- this will be where the deleted page's
> key-space moved to.

But in this case there's no right link to follow, so it's a non-issue.

BTree doesn't truncate AFAIK, so the cached block number can't be
nonexisting (beyond the end of the relation). It's probably fragile to
assume BTree will never truncate, so maybe it would be wise to check
for that. But if the block exists, I believe it contains a page that
can be safely interpreted by that condition. As long as there can be
no other pages with the same flags on the index while the cached page
is write-locked, that code will be safe.
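
If we wanted to be paranoid about it, the guard is cheap (a sketch;
cached_block is whatever name the cached block number ends up having):

/*
 * Only trust the cached block number if it still falls inside the
 * relation.  nbtree never truncates today, so this should never fail,
 * but it's cheap insurance.
 */
if (cached_block != InvalidBlockNumber &&
    cached_block < RelationGetNumberOfBlocks(rel))
{
    /* safe to ReadBuffer(rel, cached_block) and inspect its header */
}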

>>>> You could allow for some slack, by comparing against the leftmost key
>>>> instead. You just need to know whether the key fits in the page. Then,
>>>> the insert can proceed as usual, doing the binsearch to find the
>>>> proper insertion point, etc.
>>>
>>> The whole way that this patch opts out of using an exclusive buffer
>>> lock on the "first page the value could be on" (the _bt_check_unique()
>>> + _bt_findinsertloc() stuff) makes me uneasy. I know that that isn't
>>> terribly concrete.
>>
>> Assuming the rightmost page is the first page the value could be on,
>> it already does get an exclusive buffer lock.
>
> This can be quite fiddly. The downlink in the parent can be equal to
> the value in the target page, meaning that we actually lock the
> target's left sibling (see _bt_binsrch() special case for internal
> pages).

I'm not sure which special case you're referring to; I don't see
anything relevant in _bt_binsrch.

Yes, if the scankey is equal to the leftmost key, the first occurrence
of that key could be on the left, so the check would have to be for
strictly greater. But that's about as complex as it would get.

> I didn't say that the patch is necessarily wrong here. Just that it
> makes me nervous.

Any change to btree would make anyone nervous ;-)

>> If one wanted to relax the condition as I proposed, that uniqueness
>> check is necessary, so that's why I propose shortcircuiting the search
>> only, and not the actual insertion on the page. I believe IMO it's
>> more important to try and maximize the conditions under which the
>> optimization can be applied.
>
> I didn't get what the point of checking the first item on the page
> (your proposal) was.

Right now, if you've got concurrent transactions, there's absolutely
no chance this optimization would kick in. For it to happen,
insertions on the index need to happen in order, each insertion a key
greater than any other key (visible or not) on the index.

Concurrent inserts on PKs are unlikely to arrive in order; a slight
disorder is to be expected, as serial sequence numbers get generated a
significant amount of time before the insert actually takes place.
Different backends can get to the index insertion at different times,
causing unordered inserts.

I believe PKs are a prime candidate for this optimization, and
expecting it to apply only when no concurrency is involved is severely
dumbing down the optimization.

Consider 3 trans

Re: Faster inserts with mostly-monotonically increasing values

2018-03-05 Thread Claudio Freire
On Mon, Mar 5, 2018 at 9:37 PM, Claudio Freire <klaussfre...@gmail.com> wrote:
> Assuming the rightmost page is the first page the value could be on,
> it already does get an exclusive buffer lock.

That made me check, and:

+/*
+ * Acquire exclusive lock on the buffer before doing any checks. This
+ * ensures that the index state cannot change, as far as the rightmost
+ * part of the index is concerned.
+ */
+LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);

BTree code uses BT_READ and BT_WRITE instead, so that should be:

LockBuffer(buf, BT_WRITE)



Re: Faster inserts with mostly-monotonically increasing values

2018-03-05 Thread Claudio Freire
On Mon, Mar 5, 2018 at 9:12 PM, Peter Geoghegan <p...@bowt.ie> wrote:
> On Thu, Mar 1, 2018 at 3:18 PM, Claudio Freire <klaussfre...@gmail.com> wrote:
>> The key there is strictly greater than the rightmost key. As such, it
>> must precede the first page with an equal key, and since it's the
>> rightmost page... there's no such key. But if there was, the check for
>> uniqueness only needs the insert point to point to the first such
>> page, and this guarantees it.
>
> This code assumes that it can use block number as a stable way of
> finding the same point in the B-Tree over time, without any interlock.
> That may actually work out in practice, since even if the page is
> recycled there can only ever be one rightmost leaf page, but it seems
> like a brittle approach to me. The page image may be recycled by the
> FSM already, and we really shouldn't be depending on the page not
> looking a particular way once that has happened. I guess that you can
> say the same thing about using page LSN to determine cached block
> staleness instead, but that still seems a lot safer.
>
> Basically, I'm worried that we could do something entirely
> unpredictable when a page has actually been recycled, since we're
> entirely opting out of the RecentGlobalXmin interlock business on the
> assumption that we can be sure that recycling hasn't occurred in some
> other way. Imagine what could happen if we ever teach the FSM to share
> freespace among small relations, which seems quite possible to me.

Correct me if I'm wrong, but there's lots of code in nbtree already
that risks reading recycled pages for various reasons. Presumably,
checking P_ISDELETED and P_ISHALFDEAD should be enough, which is what
!P_IGNORE implies.

>> You could allow for some slack, by comparing against the leftmost key
>> instead. You just need to know whether the key fits in the page. Then,
>> the insert can proceed as usual, doing the binsearch to find the
>> proper insertion point, etc.
>
> The whole way that this patch opts out of using an exclusive buffer
> lock on the "first page the value could be on" (the _bt_check_unique()
> + _bt_findinsertloc() stuff) makes me uneasy. I know that that isn't
> terribly concrete.

Assuming the rightmost page is the first page the value could be on,
it already does get an exclusive buffer lock.

And it seems that assumption is correct in the patch as-is. In fact,
the current patch checks for a much stronger situation than needed to
apply this optimization, so it can even skip checking for conflicting
concurrent transactions. With an exclusive lock on the buffer, and
being the rightmost page, it means there can be no conflicting key
since the check is that the scan key is strictly greater than the
rightmost key (let's forget for a moment empty rightmost pages). Unless
there can be a new rightmost page in construction somehow (which I
don't believe can happen, since new rightmost pages are created by
splitting the current rightmost page, and that would conflict with the
exclusive lock), I don't see how this can fail.

If one wanted to relax the condition as I proposed, that uniqueness
check is necessary, so that's why I propose shortcircuiting the search
only, and not the actual insertion on the page. I believe it's more
important to try to maximize the conditions under which the
optimization can be applied.



Re: [HACKERS] [PATCH] Vacuum: Update FSM more frequently

2018-03-05 Thread Claudio Freire
On Sun, Mar 4, 2018 at 10:25 PM, Masahiko Sawada <sawada.m...@gmail.com> wrote:
> On Mon, Mar 5, 2018 at 10:21 AM, Masahiko Sawada <sawada.m...@gmail.com> 
> wrote:
>> On Fri, Mar 2, 2018 at 10:50 PM, Claudio Freire <klaussfre...@gmail.com> 
>> wrote:
>>> On Fri, Mar 2, 2018 at 10:47 AM, Claudio Freire <klaussfre...@gmail.com> 
>>> wrote:
>>>> On Fri, Mar 2, 2018 at 7:38 AM, Masahiko Sawada <sawada.m...@gmail.com> 
>>>> wrote:
>>>>> Thank you for updating the patches!
>>>>>
>>>>> +/*
>>>>> + * When a table has no indexes, vacuum the FSM at most every this
>>>>> + * many dirty pages. With a default page size of 8kb, this value
>>>>> + * basically means 8GB of dirtied pages.
>>>>> + */
>>>>> +#define VACUUM_FSM_EVERY_PAGES 1048576
>>>>>
>>>>> Is it better if we vacuum fsm every 8GB regardless of the block size?
>>>>> Because an installation that uses >8GB block size is likely to have
>>>>> the pages less than what an 8GB block size installation has, the
>>>>> current patch might lead to delay fsm vacuum. What do you think? If
>>>>> it's better, we can define it as follows.
>>>>>
>>>>> #define VACUUM_FSM_EVERY_PAGES ((8 * 1024 * 1024) / BLCKSZ)
>>>>
>>>> I honestly don't know. I've never tweaked page size, and know nobody
>>>> that did, so I have no experience with those installations.
>>>>
>>>> Lets go with your proposal.
>>>>
>>>>>
>>>>> --
>>>>> @@ -470,7 +484,9 @@ lazy_scan_heap(Relation onerel, int options,
>>>>> LVRelStats *vacrelstats,
>>>>>  TransactionId relfrozenxid = onerel->rd_rel->relfrozenxid;
>>>>>  TransactionId relminmxid = onerel->rd_rel->relminmxid;
>>>>>  BlockNumber empty_pages,
>>>>> -vacuumed_pages;
>>>>> +vacuumed_pages,
>>>>> +fsm_updated_pages,
>>>>> +vacuum_fsm_every_pages;
>>>>>  doublenum_tuples,
>>>>>  tups_vacuumed,
>>>>>  nkeep,
>>>>>
>>>>> Regarding fsm_updated_pages variable name, I think it's better to be
>>>>> freespace_updated_pages because we actually counts the page updated
>>>>> its freespace, not fsm.
>>>>
>>>> Good point.
>>>>
>>>> New version attached.
>>>
>>> Sorry, forgot to actually squash. Now the real new version is attached.
>>
>> Thank you for updating. I think the patches has enough discussion and
>> quality, so can go to the committer's review. I've marked them as
>> Ready for Committer.
>>
>
> Oops, the following comment needs to be updated. Since it would be
> better to defer decision of this behavior to committers I'll keep the
> status of this patch as Ready for Committer behavior.
>
> +/*
> + * When a table has no indexes, vacuum the FSM at most every this
> + * many dirty pages. With a default page size of 8kb, this value
> + * basically means 8GB of dirtied pages.
> + */
> +#define VACUUM_FSM_EVERY_PAGES ((8 * 1024 * 1024) / BLCKSZ)
> +

I just noticed that it actually computes 8MB (not GB) of pages.

Attached a fix for that and the comment update.
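
For the record, the corrected definition boils down to something like
this (the widening cast is just so the constant doesn't overflow a
32-bit int; the attached patch is what actually counts):

/*
 * When a table has no indexes, vacuum the FSM at most every this
 * many dirty pages. With the default 8kb page size, this really
 * means 8GB of dirtied pages.
 */
#define VACUUM_FSM_EVERY_PAGES \
	(((uint64) 8 * 1024 * 1024 * 1024) / BLCKSZ)
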
From 4f6d694dd2bc8e0c1185682716175ecb08ca6c04 Mon Sep 17 00:00:00 2001
From: Claudio Freire <klaussfre...@gmail.com>
Date: Fri, 28 Jul 2017 21:31:59 -0300
Subject: [PATCH 1/4] Vacuum: Update FSM more frequently

Make Vacuum update the FSM more frequently, to avoid the case where
autovacuum fails to reach the point where it updates the FSM in
highly contended tables.

Introduce a tree pruning threshold to FreeSpaceMapVacuum that avoids
recursing into branches that already contain enough free space, to
avoid having to traverse the whole FSM and thus induce quadratic
costs. Intermediate FSM vacuums are only supposed to make enough
free space visible to avoid extension until the final (non-partial)
FSM vacuum.

Run partial FSM vacuums after each index pass, which is a point
at which whole ranges of the heap have been thoroughly cleaned, and
we can expect no further updates to those ranges of the FSM save
for concurrent activity. When there are no indexes, and thus no
index passes, run partial FSM vacuums every 8GB of dirtied pages
or 1/8th of the relation, whichever is higher. This allows some
partial work to be made visible without incurring quadratic cost.

In any case, FSM are 

Re: [HACKERS] [PATCH] Vacuum: Update FSM more frequently

2018-03-02 Thread Claudio Freire
On Fri, Mar 2, 2018 at 10:47 AM, Claudio Freire <klaussfre...@gmail.com> wrote:
> On Fri, Mar 2, 2018 at 7:38 AM, Masahiko Sawada <sawada.m...@gmail.com> wrote:
>> Thank you for updating the patches!
>>
>> +/*
>> + * When a table has no indexes, vacuum the FSM at most every this
>> + * many dirty pages. With a default page size of 8kb, this value
>> + * basically means 8GB of dirtied pages.
>> + */
>> +#define VACUUM_FSM_EVERY_PAGES 1048576
>>
>> Is it better if we vacuum fsm every 8GB regardless of the block size?
>> Because an installation that uses >8GB block size is likely to have
>> the pages less than what an 8GB block size installation has, the
>> current patch might lead to delay fsm vacuum. What do you think? If
>> it's better, we can define it as follows.
>>
>> #define VACUUM_FSM_EVERY_PAGES ((8 * 1024 * 1024) / BLCKSZ)
>
> I honestly don't know. I've never tweaked page size, and know nobody
> that did, so I have no experience with those installations.
>
> Lets go with your proposal.
>
>>
>> --
>> @@ -470,7 +484,9 @@ lazy_scan_heap(Relation onerel, int options,
>> LVRelStats *vacrelstats,
>>  TransactionId relfrozenxid = onerel->rd_rel->relfrozenxid;
>>  TransactionId relminmxid = onerel->rd_rel->relminmxid;
>>  BlockNumber empty_pages,
>> -vacuumed_pages;
>> +vacuumed_pages,
>> +fsm_updated_pages,
>> +vacuum_fsm_every_pages;
>>  doublenum_tuples,
>>  tups_vacuumed,
>>  nkeep,
>>
>> Regarding fsm_updated_pages variable name, I think it's better to be
>> freespace_updated_pages because we actually counts the page updated
>> its freespace, not fsm.
>
> Good point.
>
> New version attached.

Sorry, forgot to actually squash. Now the real new version is attached.
From 76946b2220b94de821ecc8035b89f62da492e9fa Mon Sep 17 00:00:00 2001
From: Claudio Freire <klaussfre...@gmail.com>
Date: Fri, 28 Jul 2017 21:31:59 -0300
Subject: [PATCH 1/4] Vacuum: Update FSM more frequently

Make Vacuum update the FSM more frequently, to avoid the case where
autovacuum fails to reach the point where it updates the FSM in
highly contended tables.

Introduce a tree pruning threshold to FreeSpaceMapVacuum that avoids
recursing into branches that already contain enough free space, to
avoid having to traverse the whole FSM and thus induce quadratic
costs. Intermediate FSM vacuums are only supposed to make enough
free space visible to avoid extension until the final (non-partial)
FSM vacuum.

Run partial FSM vacuums after each index pass, which is a point
at which whole ranges of the heap have been thoroughly cleaned, and
we can expect no further updates to those ranges of the FSM save
for concurrent activity. When there are no indexes, and thus no
index passes, run partial FSM vacuums every 8GB of dirtied pages
or 1/8th of the relation, whichever is higher. This allows some
partial work to be made visible without incurring quadratic cost.

In any case, FSM are small in relation to the table, so even when
quadratic cost is involved, it should not be problematic. Index
passes already incur quadratic cost, and the addition of the FSM
is unlikely to be measurable.

Run a partial FSM vacuum with a low pruning threshold right at
the beginning of the VACUUM to finish up any work left over from
prior canceled vacuum runs, something that is common in highly
contended tables when running only autovacuum.
---
 src/backend/access/brin/brin.c|  2 +-
 src/backend/access/brin/brin_pageops.c| 10 ++--
 src/backend/commands/vacuumlazy.c | 81 ---
 src/backend/storage/freespace/freespace.c | 27 +--
 src/backend/storage/freespace/indexfsm.c  |  2 +-
 src/include/storage/freespace.h   |  2 +-
 6 files changed, 105 insertions(+), 19 deletions(-)

diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 68b3371..24d6df7 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -1461,5 +1461,5 @@ brin_vacuum_scan(Relation idxrel, BufferAccessStrategy strategy)
 	 * the way to the top.
 	 */
 	if (vacuum_fsm)
-		FreeSpaceMapVacuum(idxrel);
+		FreeSpaceMapVacuum(idxrel, 0);
 }
diff --git a/src/backend/access/brin/brin_pageops.c b/src/backend/access/brin/brin_pageops.c
index 60a7025..d4c7a87 100644
--- a/src/backend/access/brin/brin_pageops.c
+++ b/src/backend/access/brin/brin_pageops.c
@@ -136,7 +136,7 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 brin_initialize_empty_new_buffer(idxrel, newbuf);
 			UnlockReleaseBuffer(newbuf);
 			if (extended)
-FreeSpaceMapVacuum(idxrel);
+FreeSpaceMapVacuum(idxrel, 0);
 		}
 

Re: [HACKERS] [PATCH] Vacuum: Update FSM more frequently

2018-03-02 Thread Claudio Freire
On Fri, Mar 2, 2018 at 7:38 AM, Masahiko Sawada <sawada.m...@gmail.com> wrote:
> Thank you for updating the patches!
>
> +/*
> + * When a table has no indexes, vacuum the FSM at most every this
> + * many dirty pages. With a default page size of 8kb, this value
> + * basically means 8GB of dirtied pages.
> + */
> +#define VACUUM_FSM_EVERY_PAGES 1048576
>
> Is it better if we vacuum fsm every 8GB regardless of the block size?
> Because an installation that uses >8GB block size is likely to have
> the pages less than what an 8GB block size installation has, the
> current patch might lead to delay fsm vacuum. What do you think? If
> it's better, we can define it as follows.
>
> #define VACUUM_FSM_EVERY_PAGES ((8 * 1024 * 1024) / BLCKSZ)

I honestly don't know. I've never tweaked page size, and know nobody
that did, so I have no experience with those installations.

Let's go with your proposal.

>
> --
> @@ -470,7 +484,9 @@ lazy_scan_heap(Relation onerel, int options,
> LVRelStats *vacrelstats,
>  TransactionId relfrozenxid = onerel->rd_rel->relfrozenxid;
>  TransactionId relminmxid = onerel->rd_rel->relminmxid;
>  BlockNumber empty_pages,
> -vacuumed_pages;
> +vacuumed_pages,
> +fsm_updated_pages,
> +vacuum_fsm_every_pages;
>  doublenum_tuples,
>  tups_vacuumed,
>  nkeep,
>
> Regarding fsm_updated_pages variable name, I think it's better to be
> freespace_updated_pages because we actually counts the page updated
> its freespace, not fsm.

Good point.

New version attached.
From bc79add4171beb1c8a0a714436cb4d7c27d6a2fc Mon Sep 17 00:00:00 2001
From: Claudio Freire <klaussfre...@gmail.com>
Date: Fri, 28 Jul 2017 21:31:59 -0300
Subject: [PATCH 1/4] Vacuum: Update FSM more frequently

Make Vacuum update the FSM more frequently, to avoid the case where
autovacuum fails to reach the point where it updates the FSM in
highly contended tables.

Introduce a tree pruning threshold to FreeSpaceMapVacuum that avoids
recursing into branches that already contain enough free space, to
avoid having to traverse the whole FSM and thus induce quadratic
costs. Intermediate FSM vacuums are only supposed to make enough
free space visible to avoid extension until the final (non-partial)
FSM vacuum.

Run partial FSM vacuums after each index pass, which is a point
at which whole ranges of the heap have been thoroughly cleaned, and
we can expect no further updates to those ranges of the FSM save
for concurrent activity. When there are no indexes, and thus no
index passes, run partial FSM vacuums every 8GB of dirtied pages
or 1/8th of the relation, whichever is higher. This allows some
partial work to be made visible without incurring quadratic cost.

In any case, FSM are small in relation to the table, so even when
quadratic cost is involved, it should not be problematic. Index
passes already incur quadratic cost, and the addition of the FSM
is unlikely to be measurable.

Run a partial FSM vacuum with a low pruning threshold right at
the beginning of the VACUUM to finish up any work left over from
prior canceled vacuum runs, something that is common in highly
contended tables when running only autovacuum.
---
 src/backend/access/brin/brin.c|  2 +-
 src/backend/access/brin/brin_pageops.c| 10 ++--
 src/backend/commands/vacuumlazy.c | 81 ---
 src/backend/storage/freespace/freespace.c | 27 +--
 src/backend/storage/freespace/indexfsm.c  |  2 +-
 src/include/storage/freespace.h   |  2 +-
 6 files changed, 105 insertions(+), 19 deletions(-)

diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 68b3371..24d6df7 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -1461,5 +1461,5 @@ brin_vacuum_scan(Relation idxrel, BufferAccessStrategy strategy)
 	 * the way to the top.
 	 */
 	if (vacuum_fsm)
-		FreeSpaceMapVacuum(idxrel);
+		FreeSpaceMapVacuum(idxrel, 0);
 }
diff --git a/src/backend/access/brin/brin_pageops.c b/src/backend/access/brin/brin_pageops.c
index 60a7025..d4c7a87 100644
--- a/src/backend/access/brin/brin_pageops.c
+++ b/src/backend/access/brin/brin_pageops.c
@@ -136,7 +136,7 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 brin_initialize_empty_new_buffer(idxrel, newbuf);
 			UnlockReleaseBuffer(newbuf);
 			if (extended)
-FreeSpaceMapVacuum(idxrel);
+FreeSpaceMapVacuum(idxrel, 0);
 		}
 		return false;
 	}
@@ -156,7 +156,7 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 brin_initialize_empty_new_buffer(idxrel, newbuf);
 			UnlockReleaseBuffer(newbuf);
 			if (extended)
-FreeSpaceMapVacuum(idxrel);
+FreeSpaceMapVacuum(idxrel, 0);
 		}
 		return false;
 	}
@@ -211,7 +211,7 @@ brin_doupdate(Relati

Re: Faster inserts with mostly-monotonically increasing values

2018-03-01 Thread Claudio Freire
On Sun, Dec 31, 2017 at 8:06 AM, Peter Geoghegan  wrote:
> I also have my
> doubts about unique index enforcement remaining correct with the patch
> when there are many physical duplicates, to the extent that more than
> a single leaf page is needed for a single value.

given...

+    if (P_ISLEAF(lpageop) && P_RIGHTMOST(lpageop) &&
+        !P_INCOMPLETE_SPLIT(lpageop) &&
+        !P_IGNORE(lpageop) &&
+        (PageGetFreeSpace(page) > itemsz) &&
+        _bt_compare(rel, natts, itup_scankey, page, PageGetMaxOffsetNumber(page)) > 0)

The key there is strictly greater than the rightmost key. As such, it
must precede the first page with an equal key, and since it's the
rightmost page... there's no such key. But if there was, the check for
uniqueness only needs the insert point to point to the first such
page, and this guarantees it.

You could allow for some slack, by comparing against the leftmost key
instead. You just need to know whether the key fits in the page. Then,
the insert can proceed as usual, doing the binsearch to find the
proper insertion point, etc.

That would allow this case to be applied not only to perfectly
monotonic sequences, but for nearly monotonic ones as well (think
timestamps).
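
Concretely, the relaxed test would be the same condition as in the
snippet above, but comparing against the page's first data key and
letting the normal code find the exact insertion point (rough sketch,
reusing the names from the patch; the max-offset check covers the
empty-page case):

if (P_ISLEAF(lpageop) && P_RIGHTMOST(lpageop) &&
    !P_INCOMPLETE_SPLIT(lpageop) &&
    !P_IGNORE(lpageop) &&
    (PageGetFreeSpace(page) > itemsz) &&
    PageGetMaxOffsetNumber(page) >= P_FIRSTDATAKEY(lpageop) &&
    _bt_compare(rel, natts, itup_scankey, page,
                P_FIRSTDATAKEY(lpageop)) > 0)
{
    /* fastpath: do the usual _bt_binsrch()/_bt_check_unique() on this page */
}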



Re: [HACKERS] [PATCH] Vacuum: Update FSM more frequently

2018-02-27 Thread Claudio Freire
On Tue, Feb 6, 2018 at 4:56 AM, Masahiko Sawada <sawada.m...@gmail.com> wrote:
> For vacuuming fsm of index, we might have to consider to
> vacuum fsm of index after lazy_vacuum_index.

I've been thinking about that, and I think you're right.

So here's a fourth patch that adds that to nbtree's bulkdelete implementation.
Seems to be the best place to add such a thing.

GIN and GIST don't delete pages until vacuumcleanup, so they can't do
the same, sadly.
From 505f3143f85d42cea5adf6f04332443a61edcac0 Mon Sep 17 00:00:00 2001
From: Claudio Freire <klaussfre...@gmail.com>
Date: Tue, 27 Feb 2018 12:51:46 -0300
Subject: [PATCH] Index vacuum: Vacuum FSM after each bulkdelete call

If any pages have been deleted during bulkdelete, vacuum
the FSM to expose those pages to concurrent activity.
Try to avoid redundant FSM vacuum at vacuumcleanup.
---
 src/backend/access/nbtree/nbtree.c| 22 --
 src/backend/access/spgist/spgvacuum.c | 18 --
 2 files changed, 36 insertions(+), 4 deletions(-)

diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 8158508..d673b88 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -798,6 +798,12 @@ btbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 		cycleid = _bt_start_vacuum(rel);
 
 		btvacuumscan(info, stats, callback, callback_state, cycleid);
+
+		if (stats->pages_deleted > 0)
+		{
+			/* vacuum the FSM to expose deleted pages, if any */
+			IndexFreeSpaceMapVacuum(info->index);
+		}
 	}
 	PG_END_ENSURE_ERROR_CLEANUP(_bt_end_vacuum_callback, PointerGetDatum(rel));
 	_bt_end_vacuum(rel);
@@ -813,6 +819,8 @@ btbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 IndexBulkDeleteResult *
 btvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 {
+	bool	needs_fsm_vacuum;
+
 	/* No-op in ANALYZE ONLY mode */
 	if (info->analyze_only)
 		return stats;
@@ -825,15 +833,25 @@ btvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 	 *
 	 * Since we aren't going to actually delete any leaf items, there's no
 	 * need to go through all the vacuum-cycle-ID pushups.
+	 *
+	 * If there was a btbulkdelete call, it will vacuum the FSM too if it
+	 * deleted any pages, so we can skip our FSM vacuum in that case only.
 	 */
 	if (stats == NULL)
 	{
 		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
 		btvacuumscan(info, stats, NULL, NULL, 0);
+
+		needs_fsm_vacuum = true;
 	}
+	else
+		needs_fsm_vacuum = (stats->pages_deleted == 0);
 
-	/* Finally, vacuum the FSM */
-	IndexFreeSpaceMapVacuum(info->index);
+	if (needs_fsm_vacuum)
+	{
+		/* Finally, vacuum the FSM */
+		IndexFreeSpaceMapVacuum(info->index);
+	}
 
 	/*
 	 * It's quite possible for us to be fooled by concurrent page splits into
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index 72839cb..e9ed3fb 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -898,6 +898,12 @@ spgbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 
 	spgvacuumscan(&bds);
 
+	if (stats->pages_deleted > 0)
+	{
+		/* vacuum the FSM to expose deleted pages, if any */
+		IndexFreeSpaceMapVacuum(info->index);
+	}
+
 	return stats;
 }
 
@@ -918,6 +924,7 @@ spgvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 {
 	Relation	index = info->index;
 	spgBulkDeleteState bds;
+	bool		needs_fsm_vacuum;
 
 	/* No-op in ANALYZE ONLY mode */
 	if (info->analyze_only)
@@ -938,10 +945,17 @@ spgvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 		bds.callback_state = NULL;
 
 		spgvacuumscan(&bds);
+
+		needs_fsm_vacuum = true;
 	}
+	else
+		needs_fsm_vacuum = stats->pages_deleted == 0;
 
-	/* Finally, vacuum the FSM */
-	IndexFreeSpaceMapVacuum(index);
+	if (needs_fsm_vacuum)
+	{
+		/* Finally, vacuum the FSM */
+		IndexFreeSpaceMapVacuum(index);
+	}
 
 	/*
 	 * It's quite possible for us to be fooled by concurrent tuple moves into
-- 
1.8.4.5



Re: [HACKERS] [PATCH] Vacuum: Update FSM more frequently

2018-02-27 Thread Claudio Freire
On Mon, Feb 26, 2018 at 10:20 PM, Masahiko Sawada <sawada.m...@gmail.com> wrote:
> Thank you for updating patches!
>
> 0001 patch looks good to me except for the following unnecessary empty lines.
>
> +* If there are no indexes then we should periodically
> vacuum the FSM
> +* on huge relations to make free space visible early.
> +*/
> +   else if (nindexes == 0 && fsm_updated_pages >
> vacuum_fsm_every_pages)
> +   {
> +   /* Vacuum the Free Space Map */
> +   FreeSpaceMapVacuum(onerel, max_freespace);
> +   fsm_updated_pages = 0;
> +   max_freespace = 0;
> +   }
> +
> +
> +   /*
>
> @@ -913,10 +967,14 @@ lazy_scan_heap(Relation onerel, int options,
> LVRelStats *vacrelstats,
>
> vmbuffer, InvalidTransactionId,
>
> VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN);
> END_CRIT_SECTION();
> +
> }
>
> 0002 patch looks good to me.
>
> For 0003 patch, I think fsm_set() function name could lead misreading
> because it actually not only sets the value but also returns the
> value. fsm_set_and_get() might be better?

Attached is the patchset modified as requested.
From bc79add4171beb1c8a0a714436cb4d7c27d6a2fc Mon Sep 17 00:00:00 2001
From: Claudio Freire <klaussfre...@gmail.com>
Date: Fri, 28 Jul 2017 21:31:59 -0300
Subject: [PATCH 1/3] Vacuum: Update FSM more frequently

Make Vacuum update the FSM more frequently, to avoid the case where
autovacuum fails to reach the point where it updates the FSM in
highly contended tables.

Introduce a tree pruning threshold to FreeSpaceMapVacuum that avoids
recursing into branches that already contain enough free space, to
avoid having to traverse the whole FSM and thus induce quadratic
costs. Intermediate FSM vacuums are only supposed to make enough
free space visible to avoid extension until the final (non-partial)
FSM vacuum.

Run partial FSM vacuums after each index pass, which is a point
at which whole ranges of the heap have been thoroughly cleaned, and
we can expect no further updates to those ranges of the FSM save
for concurrent activity. When there are no indexes, and thus no
index passes, run partial FSM vacuums every 8GB of dirtied pages
or 1/8th of the relation, whichever is higher. This allows some
partial work to be made visible without incurring quadratic cost.

In any case, FSM are small in relation to the table, so even when
quadratic cost is involved, it should not be problematic. Index
passes already incur quadratic cost, and the addition of the FSM
is unlikely to be measurable.

Run a partial FSM vacuum with a low pruning threshold right at
the beginning of the VACUUM to finish up any work left over from
prior canceled vacuum runs, something that is common in highly
contended tables when running only autovacuum.
---
 src/backend/access/brin/brin.c|  2 +-
 src/backend/access/brin/brin_pageops.c| 10 ++--
 src/backend/commands/vacuumlazy.c | 81 ---
 src/backend/storage/freespace/freespace.c | 27 +--
 src/backend/storage/freespace/indexfsm.c  |  2 +-
 src/include/storage/freespace.h   |  2 +-
 6 files changed, 105 insertions(+), 19 deletions(-)

diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 68b3371..24d6df7 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -1461,5 +1461,5 @@ brin_vacuum_scan(Relation idxrel, BufferAccessStrategy strategy)
 	 * the way to the top.
 	 */
 	if (vacuum_fsm)
-		FreeSpaceMapVacuum(idxrel);
+		FreeSpaceMapVacuum(idxrel, 0);
 }
diff --git a/src/backend/access/brin/brin_pageops.c b/src/backend/access/brin/brin_pageops.c
index 60a7025..d4c7a87 100644
--- a/src/backend/access/brin/brin_pageops.c
+++ b/src/backend/access/brin/brin_pageops.c
@@ -136,7 +136,7 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 brin_initialize_empty_new_buffer(idxrel, newbuf);
 			UnlockReleaseBuffer(newbuf);
 			if (extended)
-FreeSpaceMapVacuum(idxrel);
+FreeSpaceMapVacuum(idxrel, 0);
 		}
 		return false;
 	}
@@ -156,7 +156,7 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 brin_initialize_empty_new_buffer(idxrel, newbuf);
 			UnlockReleaseBuffer(newbuf);
 			if (extended)
-FreeSpaceMapVacuum(idxrel);
+FreeSpaceMapVacuum(idxrel, 0);
 		}
 		return false;
 	}
@@ -211,7 +211,7 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 		LockBuffer(oldbuf, BUFFER_LOCK_UNLOCK);
 
 		if (extended)
-			FreeSpaceMapVacuum(idxrel);
+			FreeSpaceMapVacuum(idxrel, 0);
 
 		return true;
 	}
@@ -313,7 +313,7 @@ brin_doupdate(Relation idxrel, BlockNumber pagesPerRange,
 		{
 			Assert(BlockNumberIsValid(newblk));
 			RecordP

Re: [HACKERS] [PATCH] Vacuum: Update FSM more frequently

2018-02-26 Thread Claudio Freire
On Mon, Feb 26, 2018 at 1:32 PM, Claudio Freire <klaussfre...@gmail.com> wrote:
> On Mon, Feb 26, 2018 at 11:31 AM, Claudio Freire <klaussfre...@gmail.com> 
> wrote:
>> On Mon, Feb 26, 2018 at 6:00 AM, Masahiko Sawada <sawada.m...@gmail.com> 
>> wrote:
>>> Here is review comment for v4 patch.
>>>
>>> @@ -1922,6 +1988,8 @@ count_nondeletable_pages(Relation onerel,
>>> LVRelStats *vacrelstats)
>>>  * We don't insert a vacuum delay point here, because we 
>>> have an
>>>  * exclusive lock on the table which we want to hold
>>> for as short a
>>>  * time as possible.  We still need to check for
>>> interrupts however.
>>> +* We might have to acquire the autovacuum lock,
>>> however, but that
>>> +* shouldn't pose a deadlock risk.
>>>  */
>>> CHECK_FOR_INTERRUPTS();
>>>
>>> Is this change necessary?
>>
>> I don't recall doing that change. Maybe a rebase gone wrong.
>>
>>> 
>>> +   /*
>>> +* If there are no indexes then we should periodically
>>> vacuum the FSM
>>> +* on huge relations to make free space visible early.
>>> +*/
>>> +   if (nindexes == 0 &&
>>> +   (vacuumed_pages - vacuumed_pages_at_fsm_vac) >
>>> vacuum_fsm_every_pages)
>>> +   {
>>> +   /* Vacuum the Free Space Map */
>>> +   FreeSpaceMapVacuum(onerel, max_freespace);
>>> +   vacuumed_pages_at_fsm_vac = vacuumed_pages;
>>> +   max_freespace = 0;
>>> +   }
>>>
>>> I think this code block should be executed before we check if the page
>>> is whether new or empty and then do 'continue'. Otherwise we cannot
>>> reach this code if the table has a lot of new or empty pages.
>>
>> In order for the counter (vacuumed_pages) to increase, there have to
>> be plenty of opportunities for this code to run, and I specifically
>> wanted to avoid vacuuming the FSM too often for those cases
>> particularly (when Vacuum scans lots of pages but does no writes).
>>
>>> 
>>> @@ -663,6 +663,8 @@ fsm_extend(Relation rel, BlockNumber fsm_nblocks)
>>>   * If minValue > 0, the updated page is also searched for a page with at
>>>   * least minValue of free space. If one is found, its slot number is
>>>   * returned, -1 otherwise.
>>> + *
>>> + * If minValue == 0, the value at the root node is returned.
>>>   */
>>>  static int
>>>  fsm_set_and_search(Relation rel, FSMAddress addr, uint16 slot,
>>> @@ -687,6 +689,8 @@ fsm_set_and_search(Relation rel, FSMAddress addr,
>>> uint16 slot,
>>>
>>> addr.level == FSM_BOTTOM_LEVEL,
>>>true);
>>> }
>>> +   else
>>> +   newslot = fsm_get_avail(page, 0);
>>>
>>> I think it's not good idea to make fsm_set_and_search return either a
>>> slot number or a category value according to the argument. Category
>>> values is actually uint8 but this function returns it as int.
>>
>> Should be fine, uint8 can be contained inside an int in all platforms.
>>
>>> Also we
>>> can call fsm_set_and_search with minValue = 0 at
>>> RecordAndGetPageWithFreeSpace() when search_cat is 0. With you patch,
>>> fsm_set_and_search() then returns the root value but it's not correct.
>>
>> I guess I can add another function to do that.
>>
>>> Also, is this change necessary for this patch? ISTM this change is
>>> used for the change in fsm_update_recursive() as follows but I guess
>>> this change can be a separate patch.
>>>
>>> +   new_cat = fsm_set_and_search(rel, parent, parentslot, new_cat, 0);
>>> +
>>> +   /*
>>> +* Bubble up, not the value we just set, but the one now in the root
>>> +* node of the just-updated page, which is the page's highest value.
>>> +*/
>>
>> I can try to separate them I guess.
>
> Patch set with the requested changes attached.
>
> 2 didn't change AFAIK, it's just rebased on top of the new 1.

Oops. Forgot about the comment changes requested. Fixed those in v6, attached.
From 3bde81cd9a0474e65c487899d

Re: [HACKERS] [PATCH] Vacuum: Update FSM more frequently

2018-02-26 Thread Claudio Freire
On Mon, Feb 26, 2018 at 11:31 AM, Claudio Freire <klaussfre...@gmail.com> wrote:
> On Mon, Feb 26, 2018 at 6:00 AM, Masahiko Sawada <sawada.m...@gmail.com> 
> wrote:
>> Here is review comment for v4 patch.
>>
>> @@ -1922,6 +1988,8 @@ count_nondeletable_pages(Relation onerel,
>> LVRelStats *vacrelstats)
>>  * We don't insert a vacuum delay point here, because we 
>> have an
>>  * exclusive lock on the table which we want to hold
>> for as short a
>>  * time as possible.  We still need to check for
>> interrupts however.
>> +* We might have to acquire the autovacuum lock,
>> however, but that
>> +* shouldn't pose a deadlock risk.
>>  */
>> CHECK_FOR_INTERRUPTS();
>>
>> Is this change necessary?
>
> I don't recall doing that change. Maybe a rebase gone wrong.
>
>> 
>> +   /*
>> +* If there are no indexes then we should periodically
>> vacuum the FSM
>> +* on huge relations to make free space visible early.
>> +*/
>> +   if (nindexes == 0 &&
>> +   (vacuumed_pages - vacuumed_pages_at_fsm_vac) >
>> vacuum_fsm_every_pages)
>> +   {
>> +   /* Vacuum the Free Space Map */
>> +   FreeSpaceMapVacuum(onerel, max_freespace);
>> +   vacuumed_pages_at_fsm_vac = vacuumed_pages;
>> +   max_freespace = 0;
>> +   }
>>
>> I think this code block should be executed before we check if the page
>> is whether new or empty and then do 'continue'. Otherwise we cannot
>> reach this code if the table has a lot of new or empty pages.
>
> In order for the counter (vacuumed_pages) to increase, there have to
> be plenty of opportunities for this code to run, and I specifically
> wanted to avoid vacuuming the FSM too often for those cases
> particularly (when Vacuum scans lots of pages but does no writes).
>
>> 
>> @@ -663,6 +663,8 @@ fsm_extend(Relation rel, BlockNumber fsm_nblocks)
>>   * If minValue > 0, the updated page is also searched for a page with at
>>   * least minValue of free space. If one is found, its slot number is
>>   * returned, -1 otherwise.
>> + *
>> + * If minValue == 0, the value at the root node is returned.
>>   */
>>  static int
>>  fsm_set_and_search(Relation rel, FSMAddress addr, uint16 slot,
>> @@ -687,6 +689,8 @@ fsm_set_and_search(Relation rel, FSMAddress addr,
>> uint16 slot,
>>
>> addr.level == FSM_BOTTOM_LEVEL,
>>true);
>> }
>> +   else
>> +   newslot = fsm_get_avail(page, 0);
>>
>> I think it's not good idea to make fsm_set_and_search return either a
>> slot number or a category value according to the argument. Category
>> values is actually uint8 but this function returns it as int.
>
> Should be fine, uint8 can be contained inside an int in all platforms.
>
>> Also we
>> can call fsm_set_and_search with minValue = 0 at
>> RecordAndGetPageWithFreeSpace() when search_cat is 0. With you patch,
>> fsm_set_and_search() then returns the root value but it's not correct.
>
> I guess I can add another function to do that.
>
>> Also, is this change necessary for this patch? ISTM this change is
>> used for the change in fsm_update_recursive() as follows but I guess
>> this change can be a separate patch.
>>
>> +   new_cat = fsm_set_and_search(rel, parent, parentslot, new_cat, 0);
>> +
>> +   /*
>> +* Bubble up, not the value we just set, but the one now in the root
>> +* node of the just-updated page, which is the page's highest value.
>> +*/
>
> I can try to separate them I guess.

Patch set with the requested changes attached.

Patch 2 didn't change AFAIK; it's just rebased on top of the new patch 1.
From f9c5c0e0c8a79d241796a9b26d305c2b18bb665d Mon Sep 17 00:00:00 2001
From: Claudio Freire <klaussfre...@gmail.com>
Date: Fri, 28 Jul 2017 21:31:59 -0300
Subject: [PATCH 1/3] Vacuum: Update FSM more frequently

Make Vacuum update the FSM more frequently, to avoid the case where
autovacuum fails to reach the point where it updates the FSM in
highly contended tables.

Introduce a tree pruning threshold to FreeSpaceMapVacuum that avoids
recursing into branches that already contain enough free space, to
avoid having to traverse the whole FSM and thus induce quadratic
costs. Interme

Re: [HACKERS] [PATCH] Vacuum: Update FSM more frequently

2018-02-26 Thread Claudio Freire
On Mon, Feb 26, 2018 at 11:31 AM, Claudio Freire <klaussfre...@gmail.com> wrote:
>> 
>> +   /*
>> +* If there are no indexes then we should periodically
>> vacuum the FSM
>> +* on huge relations to make free space visible early.
>> +*/
>> +   if (nindexes == 0 &&
>> +   (vacuumed_pages - vacuumed_pages_at_fsm_vac) >
>> vacuum_fsm_every_pages)
>> +   {
>> +   /* Vacuum the Free Space Map */
>> +   FreeSpaceMapVacuum(onerel, max_freespace);
>> +   vacuumed_pages_at_fsm_vac = vacuumed_pages;
>> +   max_freespace = 0;
>> +   }
>>
>> I think this code block should be executed before we check if the page
>> is whether new or empty and then do 'continue'. Otherwise we cannot
>> reach this code if the table has a lot of new or empty pages.
>
> In order for the counter (vacuumed_pages) to increase, there have to
> be plenty of opportunities for this code to run, and I specifically
> wanted to avoid vacuuming the FSM too often for those cases
> particularly (when Vacuum scans lots of pages but does no writes).

Wait, I see what you mean. Entirely empty pages.

Ok, I can move it.



Re: [HACKERS] [PATCH] Vacuum: Update FSM more frequently

2018-02-26 Thread Claudio Freire
On Mon, Feb 26, 2018 at 6:00 AM, Masahiko Sawada  wrote:
> Here is review comment for v4 patch.
>
> @@ -1922,6 +1988,8 @@ count_nondeletable_pages(Relation onerel,
> LVRelStats *vacrelstats)
>  * We don't insert a vacuum delay point here, because we have 
> an
>  * exclusive lock on the table which we want to hold
> for as short a
>  * time as possible.  We still need to check for
> interrupts however.
> +* We might have to acquire the autovacuum lock,
> however, but that
> +* shouldn't pose a deadlock risk.
>  */
> CHECK_FOR_INTERRUPTS();
>
> Is this change necessary?

I don't recall doing that change. Maybe a rebase gone wrong.

> 
> +   /*
> +* If there are no indexes then we should periodically
> vacuum the FSM
> +* on huge relations to make free space visible early.
> +*/
> +   if (nindexes == 0 &&
> +   (vacuumed_pages - vacuumed_pages_at_fsm_vac) >
> vacuum_fsm_every_pages)
> +   {
> +   /* Vacuum the Free Space Map */
> +   FreeSpaceMapVacuum(onerel, max_freespace);
> +   vacuumed_pages_at_fsm_vac = vacuumed_pages;
> +   max_freespace = 0;
> +   }
>
> I think this code block should be executed before we check if the page
> is whether new or empty and then do 'continue'. Otherwise we cannot
> reach this code if the table has a lot of new or empty pages.

In order for the counter (vacuumed_pages) to increase, there have to
be plenty of opportunities for this code to run, and I specifically
wanted to avoid vacuuming the FSM too often for those cases
particularly (when Vacuum scans lots of pages but does no writes).

> 
> @@ -663,6 +663,8 @@ fsm_extend(Relation rel, BlockNumber fsm_nblocks)
>   * If minValue > 0, the updated page is also searched for a page with at
>   * least minValue of free space. If one is found, its slot number is
>   * returned, -1 otherwise.
> + *
> + * If minValue == 0, the value at the root node is returned.
>   */
>  static int
>  fsm_set_and_search(Relation rel, FSMAddress addr, uint16 slot,
> @@ -687,6 +689,8 @@ fsm_set_and_search(Relation rel, FSMAddress addr,
> uint16 slot,
>
> addr.level == FSM_BOTTOM_LEVEL,
>true);
> }
> +   else
> +   newslot = fsm_get_avail(page, 0);
>
> I think it's not good idea to make fsm_set_and_search return either a
> slot number or a category value according to the argument. Category
> values is actually uint8 but this function returns it as int.

Should be fine; a uint8 can be contained inside an int on all platforms.

> Also we
> can call fsm_set_and_search with minValue = 0 at
> RecordAndGetPageWithFreeSpace() when search_cat is 0. With you patch,
> fsm_set_and_search() then returns the root value but it's not correct.

I guess I can add another function to do that.

> Also, is this change necessary for this patch? ISTM this change is
> used for the change in fsm_update_recursive() as follows but I guess
> this change can be a separate patch.
>
> +   new_cat = fsm_set_and_search(rel, parent, parentslot, new_cat, 0);
> +
> +   /*
> +* Bubble up, not the value we just set, but the one now in the root
> +* node of the just-updated page, which is the page's highest value.
> +*/

I can try to separate them I guess.



Re: Hash Joins vs. Bloom Filters / take 2

2018-02-22 Thread Claudio Freire
On Thu, Feb 22, 2018 at 5:11 PM, Tomas Vondra
<tomas.von...@2ndquadrant.com> wrote:
> On 02/22/2018 08:33 PM, Claudio Freire wrote:
>> That's kinda slow to do per-item. I tried to "count" distinct items by
>> checking the BF before adding (don't add redundantly), but that's less
>> precise than a HLL in my experience.
>
> But you don't need to do that for each item you try to add - you know
> that with M bits and K hash functions you can't reach 50% before at
> least M/(2*K) additions. And you don't really need to track the number
> of bits exactly - if it gets 55% full, it's not going to explode.
>
> So just wait until M/(2*K) additions, see how many bits remain until the
> threshold - rinse and repeat. And use some reasonable minimum distance
> (say, 5% of the capacity), not to do the check too often.

Ok, that works too.

The point is, SBFs help keep the BF size as small as possible, keep it
below work_mem, and avoid having to care about the total number of
distinct items.

You may want the planner to try and estimate that to figure out
whether it's worth trying BFs or not, but once you decide to try, SBFs
remove any dependency on the accuracy of that estimate.
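
For the archives, the cadence described above works out to something
like this (a sketch; popcount_bytes is an assumed helper that counts set
bits, and the 5% floor is the suggested minimum distance between checks):

typedef struct ScalableBloomFilter
{
    uint64  nbits;          /* M: size of the current filter, in bits */
    int     nhashes;        /* K: bits set per insertion */
    uint64  until_check;    /* insertions left before the next check */
    uint8  *bits;
} ScalableBloomFilter;

/* Returns true when the current filter is (roughly) half full. */
static bool
sbf_check_saturation(ScalableBloomFilter *bf)
{
    uint64  nset = popcount_bytes(bf->bits, bf->nbits / 8);

    if (nset >= bf->nbits / 2)
        return true;        /* time to switch to the next, bigger filter */

    /*
     * Each insertion sets at most K bits, so the 50% threshold can't be
     * crossed for at least (free bits below threshold) / K insertions.
     * Don't bother checking more often than every 5% of capacity.
     */
    bf->until_check = Max((bf->nbits / 2 - nset) / bf->nhashes,
                          bf->nbits / 20);
    return false;
}

The insert path just decrements until_check and only calls this when it
reaches zero, so the bit counting stays off the per-item hot path.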



Re: Hash Joins vs. Bloom Filters / take 2

2018-02-22 Thread Claudio Freire
On Thu, Feb 22, 2018 at 12:45 PM, Tomas Vondra
<tomas.von...@2ndquadrant.com> wrote:
>
>
> On 02/22/2018 12:44 PM, Claudio Freire wrote:
>> Let me reiterate, you can avoid both issues with scalable bloom filters[1].
>>
>
> I'm afraid it's not as straight-forward as "Use scalable bloom filters!"
>
> This is not merely a question of unreliable estimates of number of
> entries. That could have been solved by scalable bloom filters, which
> are essentially a sequence of larger and larger bloom filters, added
> when the smaller bloom filter "fills up" (1/2 full).
>
> The problem is twofold:
>
> (a) we need to decide what false positive rate to use (i.e. what is a
> reasonable trade-off between filter size and how much work it saves)
>
> (b) we also need to consider work_mem (which I assume we all agree we
> must respect)
> So for example we can't size the first bloom filter to just perfectly
> fit into work_mem, only to add larger bloom filters later (each 2x the
> size of the previous one). Not only will that increase the memory usage
> to 7x the initial estimate

Aside from (a), (b) is exactly the problem SBFs solve.

You decide how much of work_mem you're willing to devote to BFs, and
then keep creating bigger and bigger BFs until you've allocated that
much.

Then you either keep updating the biggest filter, or give up entirely.

You can use a rather conservative initial bloom filter size for that,
and let scalability expand until you hit the limit.
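
With doubling filter sizes, the budget math is trivial: an initial
filter of s bytes under a budget of B bytes admits floor(log2(B/s + 1))
filters before you stop growing. A throwaway helper, just to illustrate
(names are made up):

static int
sbf_filters_within_budget(size_t initial, size_t budget)
{
    int     nfilters = 0;
    size_t  total = 0;
    size_t  next = initial;

    /* keep doubling while the next filter still fits in the budget */
    while (total + next <= budget)
    {
        total += next;
        next *= 2;
        nfilters++;
    }
    return nfilters;
}

For example, with a 1MB initial filter and a 16MB budget you get filters
of 1, 2, 4 and 8MB (15MB total); the next doubling would blow the
budget, so that's where the growing stops.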

> but it will also make the bloom filter less
> efficient (having to probe larger and larger filters, likely not fitting
> into CPU cache).

Again, you take that into account when choosing the limit.

>> An HLL can be used to estimate set size, the paper makes no mention of
>> it, probably assuming only distinct items are added to the set.
>>
>
> The problem with HLL is, it's only an estimate of how many entries you
> saw so far. It only tells you that after observing the items, and it
> only tells you how many items you saw so far. What we need for sizing a
> bloom filter is an estimate of number of distinct values in advance.
>
> In other words, HLL is entirely useless for sizing Bloom Filters.

Normal BFs, yes. But that's exactly what you need for scalable BFs.
You need an estimate of the number of distinct entries you've added to
your current filter, not of the total set size.

> Furthermore, we could estimate number of observed distinct values from
> the number of 1s in the bloom filter

That's kinda slow to do per-item. I tried to "count" distinct items by
checking the BF before adding (don't add redundantly), but that's less
precise than a HLL in my experience.



Re: Hash Joins vs. Bloom Filters / take 2

2018-02-22 Thread Claudio Freire
On Wed, Feb 21, 2018 at 11:21 PM, Tomas Vondra
 wrote:
> On 02/21/2018 02:10 AM, Peter Geoghegan wrote:
>> On Tue, Feb 20, 2018 at 3:54 PM, Tomas Vondra
>>  wrote:
 I suspect that it could make sense to use a Bloom filter to
 summarize the entire inner side of the join all at once, even
 when there are multiple batches. I also suspect that this is
 particularly beneficial with parallel hash joins, where
 IPC/synchronization overhead can be a big problem.

>>>
>>> But that's what the patch does, currently - the filter is built
>>> during the initial pass through the data, and then used for all
>>> batches.
>>
>> I misunderstood. I would probably do something like double or triple
>> the original rows estimate instead, though. The estimate must be at
>> least slightly inaccurate when we get to this point, but I don't
>> think that that's a good enough reason to give up on the estimate
>> completely.
>>
>
> That's a problem only for the multi-batch case, though.
>
> With a single batch we can walk the hash table and count non-empty
> buckets, to get a good ndistinct estimate cheaply. And then size the
> filter considering both memory requirements (fits into CPU cache) and
> false positive rate. There are other things we may need to consider
> (memory usage vs. work_mem) but that's a separate issue.
>
> With multiple batches I think we could use the "size the bloom filter
> for a fraction of work_mem" which the current patch uses when switching
> to multiple batches halfway-through. That pretty much entirely ignores
> the estimate and essentially replaces it with a "fictional" estimate.
>
> I think that's a better approach than using some arbitrary multiple of
> the estimate. When we have to start batching halfway through, the
> estimate is proven to be rather bogus anyway, but we may treat it as a
> lower boundary for the bloom filter size.

...

>> As I said, X should not be a portion of work_mem, because that has
>> only a weak relationship to what really matters.
>>
>
> I agree a fixed fraction of work_mem may not be the right thing, but the
> goal was to make the bloom filter part of the Hash memory budget, i.e.
>
> bloom filter + hash table <= work_mem
>
> (which I think we agree should be the case), without increasing the
> number of batches too much. For example, if you size the filter ignoring
> this, and it end up being 90% of work_mem, you may need to do the hash
> join in 128 batches instead of just 16. Or something like that.
>
> Maybe that would still be a win, though. Firstly, the higher number of
> batches may not have a huge impact - in one case we need to serialie
> 15/16 and in the other one 127/128. That's 93% vs. 99%. And if the more
> accurate filter allows us to discard much more data from the outer
> relation ...

Let me reiterate, you can avoid both issues with scalable bloom filters[1].

An HLL can be used to estimate set size, the paper makes no mention of
it, probably assuming only distinct items are added to the set.

[1] http://gsd.di.uminho.pt/members/cbm/ps/dbloom.pdf



Re: Hash Joins vs. Bloom Filters / take 2

2018-02-20 Thread Claudio Freire
On Tue, Feb 20, 2018 at 8:23 PM, Peter Geoghegan <p...@bowt.ie> wrote:
> On Tue, Feb 20, 2018 at 3:17 PM, Claudio Freire <klaussfre...@gmail.com> 
> wrote:
>> I've worked a lot with bloom filters, and for large false positive
>> rates and large sets (multi-million entries), you get bloom filter
>> sizes of about 10 bits per distinct item.
>
> It's generally true that you need 9.6 bits per element to get a 1%
> false positive rate. 1% could be considered much too low here.
>
> Do we need to eliminate 99% of all hash join probes (that find nothing
> to join on) to make this Bloom filter optimization worthwhile?
> Personally, I doubt it.

Even for 90% it's about 4.6 bits per element.

What I'm saying is that you can't assume you can fit the whole thing
in work_mem. If it were guaranteed that any set worth a hash join will
fit, you could build a work_mem-sized bloom filter and just accept that
if you exceed its capacity its filtering efficiency will drop
benignly. But IMO that can't be taken for granted, so at some point
you should just give up; the overhead of querying a low-quality BF
won't be worth it.

The HLL is good for that. You can keep adding to the BF until the HLL
tells you you're way past the capacity of a work_mem-sized BF, then
you free that memory and go on without it.

You might also want to consider scalable BFs. The smaller you can keep
the BF, the better chance you'll have of fitting it into the L3 cache
(you can fit quite a lot of entries into a decent-sized L3 cache).



Re: Hash Joins vs. Bloom Filters / take 2

2018-02-20 Thread Claudio Freire
On Tue, Feb 20, 2018 at 8:06 PM, Peter Geoghegan  wrote:
> You should try to exploit the fact that a Bloom filter can summarize a
> large set reasonably well with a very compact, simple representation.
> A false positive rate of 10% sounds a lot worse than 1% or 0.1%, but
> for cases where Bloom probes will save a lot of work, it probably
> doesn't make all that much difference -- the hash join is still much
> faster. If you're willing to accept a 10% false positive rate, then
> you can summarize a set of 40 million distinct values with only
> 22.85MB of memory and 3 hash functions. I think that the smallest
> possible amount of memory that any hash table of 40 million elements
> would require a much greater amount of memory than 22.85MB --
> certainly closer to 20x than to 8x. Even that is pretty conservative.
> I bet there are plenty of hash joins were the ratio could be 30x or
> more. Not to mention cases with multiple batches.

I've worked a lot with bloom filters, and for large false positive
rates and large sets (multi-million entries), you get bloom filter
sizes of about 10 bits per distinct item.

That's efficient, but it's not magic. It can still happen that the
whole set can't fit in work_mem with an acceptable false positive
rate.
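
For reference, those bits-per-item figures fall out of the standard
sizing formulas; a throwaway helper makes the trade-off easy to eyeball:

#include <math.h>

/*
 * Optimal Bloom filter sizing: m/n = -ln(p) / (ln 2)^2 bits per element,
 * with k = (m/n) * ln 2 hash functions.  p = 0.01 gives ~9.6 bits and
 * k = 7; p = 0.1 gives ~4.8 bits and k = 3.
 */
static double
bloom_bits_per_element(double p, int *nhashes)
{
    double  bits = -log(p) / (log(2.0) * log(2.0));

    *nhashes = (int) round(bits * log(2.0));
    return bits;
}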



Re: [HACKERS] [PATCH] Vacuum: Update FSM more frequently

2018-02-15 Thread Claudio Freire
On Thu, Feb 15, 2018 at 4:47 PM, Claudio Freire <klaussfre...@gmail.com> wrote:
> On Wed, Feb 14, 2018 at 3:59 AM, Masahiko Sawada <sawada.m...@gmail.com> 
> wrote:

>>> The final FSM vacuum pass isn't partial, to finally correct all those
>>> small inconsistencies.
>>
>> Yes, but the purpose of this patch is to prevent table bloating from
>> concurrent modifications?
>
> Yes, by *marking* unmarked space. If the slot is above the threshold,
> it means there's already marked free space on that branch, and search
> will go into it already, so no pressing need to refine the mark. The
> space already marked can serve while vacuum makes further progress.
> Ie: it can wait.

Although... I think I see what you mean.

If you have this:

255
.   0
.   .   0
.   .   255
.   0
.   .   64
.   .   64
.   0
.   .   0
.   .   64


A partial vacuum won't even recurse beyond the root node, so it won't
expose the free space 2 levels down.

This could be arrived at by having an almost fully packed table that
contains some free space near the end, and it gets deletions near the
beginning. Vacuum will mark the leaf levels at the beginning of the
table, and we'd like for that free space to be discoverable to avoid
packing more rows near the end where they prevent truncation.

The patch as it is now will only vacuum that part of the FSM when the
root value drops, which usually requires all the free space on that
region of the heap to be exhausted.

With the current recursive FSM vacuum method, however, that's
unavoidable. We can't detect that there's some free space below the
current level to be exposed without recursing into the tree, and
recursing is exactly the kind of work we want to prevent with tree
pruning.

In all my trials, however, I did not run into that case. It must not
be easy to get into that situation, because the root node already has
~4000 slots in it. Each of those 4000 slots would have to be in that
situation to keep the space at the end of the table as the sole
discoverable free space, which seems like a rather contorted worst
case.



Re: [HACKERS] [PATCH] Vacuum: Update FSM more frequently

2018-02-15 Thread Claudio Freire
On Wed, Feb 14, 2018 at 3:59 AM, Masahiko Sawada <sawada.m...@gmail.com> wrote:
> On Fri, Feb 9, 2018 at 11:48 PM, Claudio Freire <klaussfre...@gmail.com> 
> wrote:
>> On Fri, Feb 9, 2018 at 1:36 AM, Masahiko Sawada <sawada.m...@gmail.com> 
>> wrote:
>>> On Fri, Feb 9, 2018 at 12:45 AM, Claudio Freire <klaussfre...@gmail.com> 
>>> wrote:
>>>> On Thu, Feb 8, 2018 at 1:36 AM, Masahiko Sawada <sawada.m...@gmail.com> 
>>>> wrote:
>>>>> On Tue, Feb 6, 2018 at 9:51 PM, Claudio Freire <klaussfre...@gmail.com> 
>>>>> wrote:
>>>>>> I can look into doing 3, that *might* get rid of the need to do that
>>>>>> initial FSM vacuum, but all other intermediate vacuums are still
>>>>>> needed.
>>>>>
>>>>> Understood. So how about that this patch focuses only make FSM vacuum
>>>>> more frequently and leaves the initial FSM vacuum and the handling
>>>>> cancellation cases? The rest can be a separate patch.
>>>>
>>>> Ok.
>>>>
>>>> Attached split patches. All but the initial FSM pass is in 1, the
>>>> initial FSM pass as in the original patch is in 2.
>>>>
>>>> I will post an alternative patch for 2 when/if I have one. In the
>>>> meanwhile, we can talk about 1.
>>>
>>> Thank you for updating the patch!
>>>
>>> +/* Tree pruning for partial vacuums */
>>> +if (threshold)
>>> +{
>>> +child_avail = fsm_get_avail(page, slot);
>>> +if (child_avail >= threshold)
>>> +continue;
>>> +}
>>>
>>> I think slots in a non-leaf page might not have a correct value
>>> because fsm_set_avail doesn't propagate up beyond pages. So do we need
>>> to check the root of child page rather than a slot in the non-lead
>>> page? I might be missing something though.
>>
>> I'm not sure I understand what you're asking.
>>
>> Yes, a partial FSM vacuum won't correct those inconsistencies. It
>> doesn't matter though, because as long as the root node (or whichever
>> inner node we're looking at the time) reports free space, the lookup
>> for free space will recurse and find the lower nodes, even if they
>> report more space than the inner node reported. The goal of making
>> free space discoverable will have been achieved nonetheless.
>
> For example, if a slot of internal node of fsm tree has a value
> greater than the threshold while the root of leaf node of fsm tree has
> a value less than threshold, we want to update the internal node of
> fsm tree but we will skip it with your patch. I wonder if this can
> happen.

If I understand your point correctly, that would mean that space was
used since the last vacuum. Partial FSM vacuums don't concern
themselves with that case; the normal free space search algorithm will
handle that, retrying with the next slot until it succeeds (or until
it gives up).

IIUC, each time a failure of such kind is found while searching, the
FSM gets updated to avoid following that link a second time.

See, in fsm_search:

/*
 * At lower level, failure can happen if the value in the upper-
 * level node didn't reflect the value on the lower page. Update
 * the upper node, to avoid falling into the same trap again, and
 * start over.


>
>> The final FSM vacuum pass isn't partial, to finally correct all those
>> small inconsistencies.
>
> Yes, but the purpose of this patch is to prevent table bloating from
> concurrent modifications?

Yes, by *marking* unmarked space. If the slot is above the threshold,
it means there's already marked free space on that branch, and search
will go into it already, so no pressing need to refine the mark. The
space already marked can serve while vacuum makes further progress.
Ie: it can wait.

> Here is other review comments.
>
> +/* Tree pruning for partial vacuums */
> +if (threshold)
> +{
> +child_avail = fsm_get_avail(page, slot);
> +if (child_avail >= threshold)
> +continue;
> +}
>
> 'child_avail' is a category value while 'threshold' is an actual size
> of freespace.

Good catch. It was meant to be a category, but I see the public
interface of the FSM doesn't usually expose categories, so fixing
that.

> The logic of finding the max_freespace seems not work fine if table
> with indices because max_freespace is not updated based on
> post-compaction freespace. I think we need to get the max freesp

Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2018-02-09 Thread Claudio Freire
On Fri, Feb 9, 2018 at 10:32 AM, Alvaro Herrera <alvhe...@alvh.no-ip.org> wrote:
> Claudio Freire wrote:
>> On Thu, Feb 8, 2018 at 8:39 PM, Alvaro Herrera <alvhe...@alvh.no-ip.org> 
>> wrote:
>
>> During the process of developing the patch, I got seriously broken
>> code that passed the tests nonetheless. The test as it was was very
>> ineffective at actually detecting issues.
>>
>> This new test may be slow, but it's effective. That's a very good
>> reason to make it slower, if you ask me.
>
> OK, I don't disagree with improving the test, but if we can make it fast
> *and* effective, that's better than slow and effective.

I'd love to have a test that uses multiple segments of dead tuples,
but for that, it needs to use more than 64MB of mwm. That amounts to,
basically, ~12M rows.

Is there a "slow test suite" where such a test could be added that
won't bother regular "make check"?

That, or we turn the initial segment size into a GUC, but I don't
think it's a useful GUC outside of the test suite.

>> > 3. Figure out the minimum size for the table that triggers the behavior
>> >you want.  Right now you use 400k tuples -- maybe 100k are sufficient?
>> >Don't know.
>>
>> For that test, I need enough *dead* tuples to cause several passes.
>> Even small mwm settings require tons of tuples for this. In fact, I'm
>> thinking that number might be too low for its purpose, even. I'll
>> re-check, but I doubt it's too high. If anything, it's too low.
>
> OK.

Turns out that it was a tad oversized. 300k tuples seems enough.

Attached is a new patch version that:

- Uses an unlogged table to make the large mwm test faster
- Uses a wait_barrier helper that waits for concurrent transactions
  to finish before vacuuming tables, to make sure deleted tuples
  actually are vacuumable
- Tweaks the size of the large mwm test to be as small as possible
- Optimizes the delete to avoid expensive operations yet attain
  the same end result
From 20ca670215a7bf63f1bc2290efd5a89bcafabb20 Mon Sep 17 00:00:00 2001
From: Claudio Freire <klaussfre...@gmail.com>
Date: Mon, 12 Sep 2016 23:36:42 -0300
Subject: [PATCH 1/2] Vacuum: allow using more than 1GB work mem

Turn the dead_tuples array into a structure composed of several
exponentially bigger arrays, to enable usage of more than 1GB
of work mem during vacuum and thus reduce the number of full
index scans necessary to remove all dead tids when the memory is
available.

Improve test ability to spot vacuum errors
---
 src/backend/commands/vacuumlazy.c| 402 ---
 src/test/regress/expected/vacuum.out |  72 ++-
 src/test/regress/sql/vacuum.sql  |  40 +++-
 3 files changed, 438 insertions(+), 76 deletions(-)

diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index cf7f5e1..1e82d26 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -11,11 +11,24 @@
  *
  * We are willing to use at most maintenance_work_mem (or perhaps
  * autovacuum_work_mem) memory space to keep track of dead tuples.  We
- * initially allocate an array of TIDs of that size, with an upper limit that
+ * initially allocate an array of TIDs of 128MB, or an upper limit that
  * depends on table size (this limit ensures we don't allocate a huge area
- * uselessly for vacuuming small tables).  If the array threatens to overflow,
+ * uselessly for vacuuming small tables). Additional arrays of increasingly
+ * large sizes are allocated as they become necessary.
+ *
+ * The TID array is thus represented as a list of multiple segments of
+ * varying size, beginning with the initial size of up to 128MB, and growing
+ * exponentially until the whole budget of (autovacuum_)maintenance_work_mem
+ * is used up.
+ *
+ * Lookup in that structure happens with a binary search of segments, and then
+ * with a binary search within each segment. Since segment's size grows
+ * exponentially, this retains O(log N) lookup complexity.
+ *
+ * If the multiarray's total size threatens to grow beyond maintenance_work_mem,
  * we suspend the heap scan phase and perform a pass of index cleanup and page
- * compaction, then resume the heap scan with an empty TID array.
+ * compaction, then resume the heap scan with an array of logically empty but
+ * already preallocated TID segments to be refilled with more dead tuple TIDs.
  *
  * If we're processing a table with no indexes, we can just vacuum each page
  * as we go; there's no need to save up multiple tuples to minimize the number
@@ -92,6 +105,14 @@
 #define LAZY_ALLOC_TUPLES		MaxHeapTuplesPerPage
 
 /*
+ * Minimum (starting) size of the dead_tuples array segments. Will allocate
+ * space for 128MB worth of tid pointers in the first segment, further segments
+ * will grow in size exponentially. Don't make it too small 

Re: [HACKERS] [PATCH] Vacuum: Update FSM more frequently

2018-02-09 Thread Claudio Freire
On Fri, Feb 9, 2018 at 1:36 AM, Masahiko Sawada <sawada.m...@gmail.com> wrote:
> On Fri, Feb 9, 2018 at 12:45 AM, Claudio Freire <klaussfre...@gmail.com> 
> wrote:
>> On Thu, Feb 8, 2018 at 1:36 AM, Masahiko Sawada <sawada.m...@gmail.com> 
>> wrote:
>>> On Tue, Feb 6, 2018 at 9:51 PM, Claudio Freire <klaussfre...@gmail.com> 
>>> wrote:
>>>> I can look into doing 3, that *might* get rid of the need to do that
>>>> initial FSM vacuum, but all other intermediate vacuums are still
>>>> needed.
>>>
>>> Understood. So how about that this patch focuses only make FSM vacuum
>>> more frequently and leaves the initial FSM vacuum and the handling
>>> cancellation cases? The rest can be a separate patch.
>>
>> Ok.
>>
>> Attached split patches. All but the initial FSM pass is in 1, the
>> initial FSM pass as in the original patch is in 2.
>>
>> I will post an alternative patch for 2 when/if I have one. In the
>> meanwhile, we can talk about 1.
>
> Thank you for updating the patch!
>
> +/* Tree pruning for partial vacuums */
> +if (threshold)
> +{
> +child_avail = fsm_get_avail(page, slot);
> +if (child_avail >= threshold)
> +continue;
> +}
>
> I think slots in a non-leaf page might not have a correct value
> because fsm_set_avail doesn't propagate up beyond pages. So do we need
> to check the root of child page rather than a slot in the non-leaf
> page? I might be missing something though.

I'm not sure I understand what you're asking.

Yes, a partial FSM vacuum won't correct those inconsistencies. It
doesn't matter though, because as long as the root node (or whichever
inner node we're looking at at the time) reports free space, the lookup
for free space will recurse and find the lower nodes, even if they
report more space than the inner node reported. The goal of making
free space discoverable will have been achieved nonetheless.

The final FSM vacuum pass isn't partial, to finally correct all those
small inconsistencies.



Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2018-02-09 Thread Claudio Freire
On Thu, Feb 8, 2018 at 8:39 PM, Alvaro Herrera <alvhe...@alvh.no-ip.org> wrote:
> Claudio Freire wrote:
>
>> I don't like looping, though; it seems overly cumbersome. What's worse:
>> maintaining that fragile weird loop that might break (by causing
>> unexpected output), or a slight slowdown of the test suite?
>>
>> I don't know how long it might take on slow machines, but in my
>> machine, which isn't a great machine, while the vacuum test isn't fast
>> indeed, it's just a tiny fraction of what a simple "make check" takes.
>> So it's not a huge slowdown in any case.
>
> Well, what about a machine running tests under valgrind, or the weird
> cache-clobbering infuriatingly slow code?  Or buildfarm members running
> on really slow hardware?  These days, a few people have spent a lot of
> time trying to reduce the total test time, and it'd be bad to lose back
> the improvements for no good reason.

It's not for no good reason. The old tests were woefully inadequate.

During the process of developing the patch, I got seriously broken
code that passed the tests nonetheless. The test as it was was very
ineffective at actually detecting issues.

This new test may be slow, but it's effective. That's a very good
reason to make it slower, if you ask me.

> I grant you that the looping I proposed is more complicated, but I don't
> see any reason why it would break.
>
> Another argument against the LOCK pg_class idea is that it causes an
> unnecessary contention point across the whole parallel test group --
> with possible weird side effects.  How about a deadlock?

The real issue with lock pg_class is that locks on pg_class are
short-lived, so I'm not waiting for whole transactions.

> Other than the wait loop I proposed, I think we can make a couple of
> very simple improvements to this test case to avoid a slowdown:
>
> 1. the DELETE takes about 1/4th of the time and removes about the same
> number of rows as the one using the IN clause:
>   delete from vactst where random() < 3.0 / 4;

I did try this at first, but it causes random output, so the test
breaks randomly.

> 2. Use a new temp table rather than vactst.  Everything is then faster.

I might try that.

> 3. Figure out the minimum size for the table that triggers the behavior
>you want.  Right now you use 400k tuples -- maybe 100k are sufficient?
>Don't know.

For that test, I need enough *dead* tuples to cause several passes.
Even small mwm settings require tons of tuples for this. In fact, I'm
thinking that number might be too low for its purpose, even. I'll
re-check, but I doubt it's too high. If anything, it's too low.



Re: [HACKERS] [PATCH] Vacuum: Update FSM more frequently

2018-02-08 Thread Claudio Freire
On Thu, Feb 8, 2018 at 1:36 AM, Masahiko Sawada <sawada.m...@gmail.com> wrote:
> On Tue, Feb 6, 2018 at 9:51 PM, Claudio Freire <klaussfre...@gmail.com> wrote:
>> I can look into doing 3, that *might* get rid of the need to do that
>> initial FSM vacuum, but all other intermediate vacuums are still
>> needed.
>
> Understood. So how about that this patch focuses only make FSM vacuum
> more frequently and leaves the initial FSM vacuum and the handling
> cancellation cases? The rest can be a separate patch.

Ok.

Attached split patches. All but the initial FSM pass is in 1, the
initial FSM pass as in the original patch is in 2.

I will post an alternative patch for 2 when/if I have one. In the
meanwhile, we can talk about 1.
From 0a980794c90973383e5a63f64839616a79542260 Mon Sep 17 00:00:00 2001
From: Claudio Freire <klaussfre...@gmail.com>
Date: Thu, 8 Feb 2018 12:39:15 -0300
Subject: [PATCH 2/2] Vacuum: do a partial FSM vacuum at the beginning

A partial FSM vacuum right at the start of the vacuum process
helps ameliorate issues with cancelled autovacuums never updating
the FSM.
---
 src/backend/commands/vacuumlazy.c | 24 
 1 file changed, 24 insertions(+)

diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 9ad0f34..2965204 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -103,6 +103,19 @@
  */
 #define PREFETCH_SIZE			((BlockNumber) 32)
 
+/*
+ * If autovacuums get regularly cancelled, the FSM may never be
+ * vacuumed. To work around that, we perform an initial partial
+ * FSM vacuum at the beginning of the vacuum process, to quickly
+ * make existing unmarked free space visible. To avoid useless
+ * redundant work, however, we avoid recursing into branches
+ * that already have a set amount of free space, we only try
+ * to discover unmarked free space. This value controls how
+ * much free space is enough free space in that context. The
+ * value is in the terms of the FSM map.
+ */
+#define INITIAL_PARTIAL_FSM_VACUUM_THRESHOLD	64
+
 typedef struct LVRelStats
 {
 	/* hasindex = true means two-pass strategy; false means one-pass */
@@ -250,6 +263,17 @@ lazy_vacuum_rel(Relation onerel, int options, VacuumParams *params,
 	vacrelstats->pages_removed = 0;
 	vacrelstats->lock_waiter_detected = false;
 
+	/*
+	 * Vacuum the Free Space Map partially before we start.
+	 * If an earlier vacuum was canceled, and that's likely in
+	 * highly contended tables, we may need to finish up. If we do
+	 * it now, we make the space visible to other backends regardless
+	 * of whether we succeed in finishing this time around.
+	 * Don't bother checking branches that already have usable space,
+	 * though.
+	 */
+	FreeSpaceMapVacuum(onerel, INITIAL_PARTIAL_FSM_VACUUM_THRESHOLD);
+
 	/* Open all indexes of the relation */
	vac_open_indexes(onerel, RowExclusiveLock, &nindexes, &Irel);
 	vacrelstats->hasindex = (nindexes > 0);
-- 
1.8.4.5

From fdb2e0ae486fc6160df48c6a668eafa50b32814a Mon Sep 17 00:00:00 2001
From: Claudio Freire <klaussfre...@gmail.com>
Date: Fri, 28 Jul 2017 21:31:59 -0300
Subject: [PATCH 1/2] Vacuum: Update FSM more frequently

Make Vacuum update the FSM more frequently, to avoid the case where
autovacuum fails to reach the point where it updates the FSM in
highly contended tables.

Introduce a tree pruning threshold to FreeSpaceMapVacuum that avoids
recursing into branches that already contain enough free space, to
avoid having to traverse the whole FSM and thus induce quadratic
costs. Intermediate FSM vacuums are only supposed to make enough
free space visible to avoid extension until the final (non-partial)
FSM vacuum.

Run partial FSM vacuums after each index pass, which is a point
at which whole ranges of the heap have been thoroughly cleaned, and
we can expect no further updates to those ranges of the FSM save
for concurrent activity. When there are no indexes, and thus no
index passes, run partial FSM vacuums every 8GB of dirtied pages
or 1/8th of the relation, whichever is higher. This allows some
partial work to be made visible without incurring quadratic cost.

In any case, FSM are small in relation to the table, so even when
quadratic cost is involved, it should not be problematic. Index
passes already incur quadratic cost, and the addition of the FSM
is unlikely to be measurable.

Run a partial FSM vacuum with a low pruning threshold right at
the beginning of the VACUUM to finish up any work left over from
prior canceled vacuum runs, something that is common in highly
contended tables when running only autovacuum.
---
 src/backend/access/brin/brin.c|  2 +-
 src/backend/access/brin/brin_pageops.c| 10 +++
 src/backend/commands/vacuumlazy.c | 50 +--
 src/backend/storage/freespace/freespace.c | 29 ++
 src/backend/storage/freespace/indexfsm.c  |  2 +-
 src/include/storage

Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2018-02-07 Thread Claudio Freire
On Thu, Feb 8, 2018 at 2:44 AM, Kyotaro HORIGUCHI
 wrote:
> Hello, I looked this a bit closer.
>
> In upthread[1] Robert mentioned the exponentially increasing size
> of additional segments.
>
>>> Hmm, I had imagined making all of the segments the same size rather
>>> than having the size grow exponentially.  The whole point of this is
>>> to save memory, and even in the worst case you don't end up with that
>>> many segments as long as you pick a reasonable base size (e.g. 64MB).
>>
>> Wastage is bound by a fraction of the total required RAM, that is,
>> it's proportional to the amount of required RAM, not the amount
>> allocated. So it should still be fine, and the exponential strategy
>> should improve lookup performance considerably.
>
> It seems that you are getting him wrong. (Anyway I'm not sure
> what you meant by the above. not-yet-allocated memory won't be a
> waste.) The conclusive number of dead tuples in a heap scan is
> undeterminable until the scan ends. If we had a new dead tuple
> required a, say 512MB new segment and the scan ends just after,
> the wastage will be almost the whole of the segment.

And the segment size is bound by a fraction of total needed memory.

When I said "allocated", I meant m_w_m. Wastage is not proportional to m_w_m.

> On the other hand, I don't think the exponential strategy make
> things considerably better. bsearch iterations in
> lazy_tid_reaped() are distributed between segment search and tid
> search. Intuitively more or less the increased segment size just
> moves some iterations of the former to the latter.
>
> I made a calculation[2]. With maintemance_work_mem of 4096MB, the
> number of segments is 6 and expected number of bsearch iteration
> is about 20.8 for the exponential strategy. With 64MB fixed size
> segments, we will have 64 segments (that is not so many) and the
> expected iteration is 20.4. (I suppose the increase comes from
> the imbalanced size among segments.) Addition to that, as Robert
> mentioned, the possible maximum memory wastage of the exponential
> strategy is about 2GB and 64MB in fixed size strategy.

That calculation has a slight bug in that it should be log2, and that
segment size is limited to 1GB at the top end.

But in any case, the biggest issue is that it's ignoring the effect of
cache locality. The way in which the exponential strategy helps is by
keeping the segment list small and comfortably fitting in fast cache
memory, while also keeping wastage at a minimum for small lists. 64MB
segments with 4GB mwm would be about 2kb of segment list. It fits in
L1, if there's nothing else contending for it, but it's already
starting to get big, and it would be expected that settings larger
than 4GB mwm could be used.

I guess I could tune the starting/ending sizes a bit.

Say, with an upper limit of 256M (instead of 1G), and after fixing the
other issues, we get:

exponential sized strategy
...
#18 : segsize=64MB total=4096MB, (tuples = 715827882, min
tsize=18.8GB), iterseg(18)=4.169925, iteritem(11184810) = 23.415037,
expected iter=29.491213


fixed sized strategy
...
#64 : segsize=64MB total=4096MB, (tuples = 715827882, min
tsize=18.2GB), interseg(64)=6.00, iteritem(11184810) = 23.415037,
expected iter=29.415037

Almost identical, and we get all the benefits of cache locality with
the exponential strategy. The fixed strategy might fit in L1 too, but
that becomes less likely the bigger the mwm is.

The scaling factor could also be tuned I guess, but I'm wary of using
anything other than a doubling strategy, since it might cause memory
fragmentation.
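
If anyone wants to play with these numbers, here's the kind of
back-of-the-envelope calculation I'm doing, as a small standalone
program (my own simplified cost model, not the script that produced
the figures above; it assumes 6-byte item pointers, a 64MB starting
segment, a 256MB cap and a 4GB budget, so the exact values won't line
up with the quoted output, but the comparison comes out the same):

#include <stdio.h>
#include <math.h>

int
main(void)
{
    const double tuple_sz = 6.0;        /* sizeof(ItemPointerData) */
    const double budget_mb = 4096.0;    /* maintenance_work_mem */
    double seg_mb[64];
    int    nsegs = 0;
    double used = 0.0;
    double sz = 64.0;                   /* starting segment size */

    /* exponential layout: double each segment, capped at 256MB */
    while (used < budget_mb)
    {
        double this_sz = (sz < budget_mb - used) ? sz : budget_mb - used;

        seg_mb[nsegs++] = this_sz;
        used += this_sz;
        if (sz < 256.0)
            sz *= 2.0;
    }

    /*
     * Expected lookup cost: log2(#segments) to find the segment, plus
     * log2(#tuples in the segment the target lands in), weighted by
     * segment size (bigger segments are proportionally more likely).
     */
    double item_iters = 0.0;

    for (int i = 0; i < nsegs; i++)
        item_iters += (seg_mb[i] / budget_mb) *
            log2(seg_mb[i] * 1024.0 * 1024.0 / tuple_sz);

    printf("exponential: %d segments, ~%.1f expected iterations\n",
           nsegs, log2(nsegs) + item_iters);

    /* fixed 64MB layout for comparison */
    int nfixed = (int) (budget_mb / 64.0);

    printf("fixed 64MB:  %d segments, ~%.1f expected iterations\n",
           nfixed, log2(nfixed) + log2(64.0 * 1024.0 * 1024.0 / tuple_sz));
    return 0;
}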



Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2018-02-07 Thread Claudio Freire
On Thu, Feb 8, 2018 at 12:13 AM, Claudio Freire <klaussfre...@gmail.com> wrote:
> On Wed, Feb 7, 2018 at 11:29 PM, Alvaro Herrera <alvhe...@alvh.no-ip.org> 
> wrote:
>> Claudio Freire wrote:
>>> On Wed, Feb 7, 2018 at 8:52 PM, Alvaro Herrera <alvhe...@alvh.no-ip.org> 
>>> wrote:
>>> >> Waiting as you say would be akin to what the patch does by putting
>>> >> vacuum on its own parallel group.
>>> >
>>> > I don't think it's the same.  We don't need to wait until all the
>>> > concurrent tests are done -- we only need to wait until the transactions
>>> > that were current when the delete finished are done, which is very
>>> > different since each test runs tons of small transactions rather than
>>> > one single big transaction.
>>>
>>> Um... maybe "lock pg_class" ?
>>
>> I was thinking in first doing
>>   SELECT array_agg(DISTINCT virtualtransaction) vxids
>> FROM pg_locks \gset
>>
>> and then in a DO block loop until
>>
>>SELECT DISTINCT virtualtransaction
>>  FROM pg_locks
>> INTERSECT
>>SELECT (unnest(:'vxids'::text[]));
>>
>> returns empty; something along those lines.
>
> Isn't it the same though?
>
> I can't think how a transaction wouldn't be holding at least an access
> share on pg_class.

Never mind, I just saw the error of my ways.

I don't like looping, though; it seems overly cumbersome. What's worse:
maintaining that fragile weird loop that might break (by causing
unexpected output), or a slight slowdown of the test suite? I don't
know how long it might take on slow machines, but in my machine, which
isn't a great machine, while the vacuum test isn't fast indeed, it's
just a tiny fraction of what a simple "make check" takes. So it's not
a huge slowdown in any case.

I'll give it some thought, maybe there's a simpler way.



Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2018-02-07 Thread Claudio Freire
On Wed, Feb 7, 2018 at 11:29 PM, Alvaro Herrera <alvhe...@alvh.no-ip.org> wrote:
> Claudio Freire wrote:
>> On Wed, Feb 7, 2018 at 8:52 PM, Alvaro Herrera <alvhe...@alvh.no-ip.org> 
>> wrote:
>> >> Waiting as you say would be akin to what the patch does by putting
>> >> vacuum on its own parallel group.
>> >
>> > I don't think it's the same.  We don't need to wait until all the
>> > concurrent tests are done -- we only need to wait until the transactions
>> > that were current when the delete finished are done, which is very
>> > different since each test runs tons of small transactions rather than
>> > one single big transaction.
>>
>> Um... maybe "lock pg_class" ?
>
> I was thinking in first doing
>   SELECT array_agg(DISTINCT virtualtransaction) vxids
> FROM pg_locks \gset
>
> and then in a DO block loop until
>
>SELECT DISTINCT virtualtransaction
>  FROM pg_locks
> INTERSECT
>SELECT (unnest(:'vxids'::text[]));
>
> returns empty; something along those lines.

Isn't it the same though?

I can't think how a transaction wouldn't be holding at least an access
share on pg_class.



Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2018-02-07 Thread Claudio Freire
On Wed, Feb 7, 2018 at 8:52 PM, Alvaro Herrera <alvhe...@alvh.no-ip.org> wrote:
>> Waiting as you say would be akin to what the patch does by putting
>> vacuum on its own parallel group.
>
> I don't think it's the same.  We don't need to wait until all the
> concurrent tests are done -- we only need to wait until the transactions
> that were current when the delete finished are done, which is very
> different since each test runs tons of small transactions rather than
> one single big transaction.

Um... maybe "lock pg_class" ?

That should conflict with basically any other running transaction and
have pretty much that effect.

Attached is a version of patch 1 with that approach.
From 3f1eb8829f9b396d1e5aef33114df44b04631455 Mon Sep 17 00:00:00 2001
From: Claudio Freire <klaussfre...@gmail.com>
Date: Mon, 12 Sep 2016 23:36:42 -0300
Subject: [PATCH 1/2] Vacuum: allow using more than 1GB work mem

Turn the dead_tuples array into a structure composed of several
exponentially bigger arrays, to enable usage of more than 1GB
of work mem during vacuum and thus reduce the number of full
index scans necessary to remove all dead tids when the memory is
available.

Improve test ability to spot vacuum errors
---
 src/backend/commands/vacuumlazy.c| 402 ---
 src/test/regress/expected/vacuum.out |  31 ++-
 src/test/regress/sql/vacuum.sql  |  23 +-
 3 files changed, 380 insertions(+), 76 deletions(-)

diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index cf7f5e1..1e82d26 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -11,11 +11,24 @@
  *
  * We are willing to use at most maintenance_work_mem (or perhaps
  * autovacuum_work_mem) memory space to keep track of dead tuples.  We
- * initially allocate an array of TIDs of that size, with an upper limit that
+ * initially allocate an array of TIDs of 128MB, or an upper limit that
  * depends on table size (this limit ensures we don't allocate a huge area
- * uselessly for vacuuming small tables).  If the array threatens to overflow,
+ * uselessly for vacuuming small tables). Additional arrays of increasingly
+ * large sizes are allocated as they become necessary.
+ *
+ * The TID array is thus represented as a list of multiple segments of
+ * varying size, beginning with the initial size of up to 128MB, and growing
+ * exponentially until the whole budget of (autovacuum_)maintenance_work_mem
+ * is used up.
+ *
+ * Lookup in that structure happens with a binary search of segments, and then
+ * with a binary search within each segment. Since segment's size grows
+ * exponentially, this retains O(log N) lookup complexity.
+ *
+ * If the multiarray's total size threatens to grow beyond maintenance_work_mem,
  * we suspend the heap scan phase and perform a pass of index cleanup and page
- * compaction, then resume the heap scan with an empty TID array.
+ * compaction, then resume the heap scan with an array of logically empty but
+ * already preallocated TID segments to be refilled with more dead tuple TIDs.
  *
  * If we're processing a table with no indexes, we can just vacuum each page
  * as we go; there's no need to save up multiple tuples to minimize the number
@@ -92,6 +105,14 @@
 #define LAZY_ALLOC_TUPLES		MaxHeapTuplesPerPage
 
 /*
+ * Minimum (starting) size of the dead_tuples array segments. Will allocate
+ * space for 128MB worth of tid pointers in the first segment, further segments
+ * will grow in size exponentially. Don't make it too small or the segment list
+ * will grow bigger than the sweetspot for search efficiency on big vacuums.
+ */
+#define LAZY_INIT_TUPLES		Max(MaxHeapTuplesPerPage, (128<<20) / sizeof(ItemPointerData))
+
+/*
  * Before we consider skipping a page that's marked as clean in
  * visibility map, we must've seen at least this many clean pages.
  */
@@ -103,6 +124,27 @@
  */
 #define PREFETCH_SIZE			((BlockNumber) 32)
 
+typedef struct DeadTuplesSegment
+{
+	ItemPointerData last_dead_tuple;	/* Copy of the last dead tuple (unset
+		 * until the segment is fully
+		 * populated). Keep it first to simplify
+		 * binary searches */
+	int			num_dead_tuples;	/* # of entries in the segment */
+	int			max_dead_tuples;	/* # of entries allocated in the segment */
+	ItemPointer dt_tids;		/* Array of dead tuples */
+}	DeadTuplesSegment;
+
+typedef struct DeadTuplesMultiArray
+{
+	int			num_entries;	/* current # of entries */
+	int			max_entries;	/* total # of slots that can be allocated in
+ * array */
+	int			num_segs;		/* number of dead tuple segments allocated */
+	int			last_seg;		/* last dead tuple segment with data (or 0) */
+	DeadTuplesSegment *dt_segments;		/* array of num_segs segments */
+}	DeadTuplesMultiArray;
+
 typedef struct LVRelStats
 {
 	/* hasindex = true means two-pass strategy; false means one-pass */
@@ -123,14 +165,20 @@ typedef s

Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2018-02-07 Thread Claudio Freire
On Wed, Feb 7, 2018 at 7:57 PM, Alvaro Herrera <alvhe...@alvh.no-ip.org> wrote:
> Claudio Freire wrote:
>
>> - vacuum test on its own parallel group
>
> Hmm, this solution is not very friendly to the goal of reducing test
> runtime, particularly since the new test creates a nontrivial-sized
> table.  Maybe we can find a better alternative.  Can we use some wait
> logic instead?  Maybe something like grab a snapshot of running VXIDs
> and loop waiting until they're all gone before doing the vacuum?

I'm not sure there's any alternative. I did some tests and any active
snapshot on any other table, not necessarily on the one being
vacuumed, distorted the test. And it makes sense, since that snapshot
makes those deleted tuples unvacuumable.

Waiting as you say would be akin to what the patch does by putting
vacuum on its own parallel group.

I'm guessing all tests write something to the database, so all tests
will create a snapshot. Maybe if there were read-only tests, those
might be safe to include in vacuum's parallel group, but otherwise I
don't see any alternative.

> Also, I don't understand why pg_relation_size() is not a better solution
> to determining the table size compared to explain.

I was told pg_relation_size can return stale information. I didn't
check, I took that at face value.



Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2018-02-07 Thread Claudio Freire
On Wed, Feb 7, 2018 at 12:50 AM, Kyotaro HORIGUCHI
<horiguchi.kyot...@lab.ntt.co.jp> wrote:
> Hello,
>
> At Tue, 6 Feb 2018 10:41:22 -0300, Claudio Freire <klaussfre...@gmail.com> 
> wrote in 

Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2018-02-06 Thread Claudio Freire
On Tue, Feb 6, 2018 at 10:18 AM, Claudio Freire <klaussfre...@gmail.com> wrote:
> On Tue, Feb 6, 2018 at 4:35 AM, Kyotaro HORIGUCHI
> <horiguchi.kyot...@lab.ntt.co.jp> wrote:
>>> It's starting to look like a timing effect indeed.
>>
>> It seems to be truncation skip, maybe caused by concurrent
>> autovacuum.
>
> Good point, I'll also disable autovacuum on vactst.
>
>> See lazy_truncate_heap() for details. Updates of
>> pg_stat_*_tables can be delayed so looking it also can fail. Even
>> though I haven't looked the patch closer, the "SELECT
>> pg_relation_size()" doesn't seem to give something meaningful
>> anyway.
>
> Maybe then "explain (analyze, buffers, costs off, timing off, summary
> off) select * from vactst" then.
>
> The point is to check that the relation's heap has 0 pages.

Attached rebased patches with those changes mentioned above, namely:

- vacuum test now creates vactst with autovacuum disabled for it
- vacuum test on its own parallel group
- use explain analyze instead of pg_relation_size to check the
relation is properly truncated
From 35ce2e22760e83c45ec776dab60ab59d8da801a3 Mon Sep 17 00:00:00 2001
From: Claudio Freire <klaussfre...@gmail.com>
Date: Tue, 28 Mar 2017 22:40:39 -0300
Subject: [PATCH 2/2] Vacuum: free dead tuples array as early as possible

Allow other parts of the system to benefit from the possibly
large amount of memory reserved for dead tuples after they're
no longer necessary.

While the memory would be freed when the command finishes, it
can sometimes be a considerable time between when vacuum
is done with the array and when the command finishes - mostly due
to truncate taking a long time.
---
 src/backend/commands/vacuumlazy.c| 25 +
 src/test/regress/expected/vacuum.out | 10 +-
 src/test/regress/parallel_schedule   |  7 ++-
 src/test/regress/sql/vacuum.sql  |  4 ++--
 4 files changed, 38 insertions(+), 8 deletions(-)

diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 1e82d26..2c7850f 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -212,6 +212,7 @@ static void lazy_space_alloc(LVRelStats *vacrelstats, BlockNumber relblocks);
 static void lazy_record_dead_tuple(LVRelStats *vacrelstats,
 	   ItemPointer itemptr);
 static void lazy_clear_dead_tuples(LVRelStats *vacrelstats);
+static void lazy_free_dead_tuples(LVRelStats *vacrelstats);
 static bool lazy_tid_reaped(ItemPointer itemptr, void *state);
 static bool heap_page_is_all_visible(Relation rel, Buffer buf,
 		 TransactionId *visibility_cutoff_xid, bool *all_frozen);
@@ -1394,6 +1395,9 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
 	pgstat_progress_update_param(PROGRESS_VACUUM_PHASE,
  PROGRESS_VACUUM_PHASE_INDEX_CLEANUP);
 
+	/* dead tuples no longer needed past this point */
+	lazy_free_dead_tuples(vacrelstats);
+
 	/* Do post-vacuum cleanup and statistics update for each index */
 	for (i = 0; i < nindexes; i++)
 		lazy_cleanup_index(Irel[i], indstats[i], vacrelstats);
@@ -2221,6 +2225,27 @@ lazy_clear_dead_tuples(LVRelStats *vacrelstats)
 }
 
 /*
+ * lazy_free_dead_tuples - reset all dead tuples segments
+ * and free all allocated memory
+ */
+static void
+lazy_free_dead_tuples(LVRelStats *vacrelstats)
+{
+	int			nseg;
+
+	for (nseg = 0; nseg < GetNumDeadTuplesSegments(vacrelstats); nseg++)
+		if (GetDeadTuplesSegment(vacrelstats, nseg)->dt_tids != NULL)
+			pfree(GetDeadTuplesSegment(vacrelstats, nseg)->dt_tids);
+	if (vacrelstats->dead_tuples.dt_segments != NULL)
+		if (vacrelstats->dead_tuples.dt_segments != NULL)
+			pfree(vacrelstats->dead_tuples.dt_segments);
+	vacrelstats->dead_tuples.last_seg = 0;
+	vacrelstats->dead_tuples.num_segs = 0;
+	vacrelstats->dead_tuples.num_entries = 0;
+	vacrelstats->dead_tuples.dt_segments = NULL;
+}
+
+/*
  *	vac_itemptr_binsrch() -- search a sorted array of item pointers
  *
  *		Returns the offset of the first item pointer greater than or
diff --git a/src/test/regress/expected/vacuum.out b/src/test/regress/expected/vacuum.out
index 57550d9..828960d 100644
--- a/src/test/regress/expected/vacuum.out
+++ b/src/test/regress/expected/vacuum.out
@@ -1,7 +1,7 @@
 --
 -- VACUUM
 --
-CREATE TABLE vactst (i INT);
+CREATE TABLE vactst (i INT) WITH (autovacuum_enabled = false);
 INSERT INTO vactst VALUES (1);
 INSERT INTO vactst SELECT * FROM vactst;
 INSERT INTO vactst SELECT * FROM vactst;
@@ -125,10 +125,10 @@ INSERT INTO vactst SELECT * from generate_series(1,40);
 CREATE INDEX ix_vactst ON vactst (i);
 DELETE FROM vactst;
 VACUUM vactst;
-SELECT pg_relation_size('vactst', 'main');
- pg_relation_size 
---
-0
+EXPLAIN (ANALYZE, BUFFERS, COSTS off, TIMING off, SUMMARY off) SELECT * FROM vactst;
+ QUERY PLAN  

Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2018-02-06 Thread Claudio Freire
On Tue, Feb 6, 2018 at 4:35 AM, Kyotaro HORIGUCHI
 wrote:
>> It's starting to look like a timing effect indeed.
>
> It seems to be truncation skip, maybe caused by concurrent
> autovacuum.

Good point, I'll also disable autovacuum on vactst.

> See lazy_truncate_heap() for details. Updates of
> pg_stat_*_tables can be delayed so looking it also can fail. Even
> though I haven't looked the patch closer, the "SELECT
> pg_relation_size()" doesn't seem to give something meaningful
> anyway.

Maybe then "explain (analyze, buffers, costs off, timing off, summary
off) select * from vactst" then.

The point is to check that the relation's heap has 0 pages.



Re: [HACKERS] [PATCH] Vacuum: Update FSM more frequently

2018-02-06 Thread Claudio Freire
On Tue, Feb 6, 2018 at 4:56 AM, Masahiko Sawada <sawada.m...@gmail.com> wrote:
> On Tue, Feb 6, 2018 at 2:55 AM, Claudio Freire <klaussfre...@gmail.com> wrote:
>> On Mon, Feb 5, 2018 at 1:53 AM, Masahiko Sawada <sawada.m...@gmail.com> 
>> wrote:
>>> On Fri, Feb 2, 2018 at 11:13 PM, Claudio Freire <klaussfre...@gmail.com> 
>>> wrote:
>>>> After autovacuum gets cancelled, the next time it wakes up it will
>>>> retry vacuuming the cancelled relation. That's because a cancelled
>>>> autovacuum doesn't update the last-vacuumed stats.
>>>>
>>>> So the timing between an autovacuum work item and the next retry for
>>>> that relation is more or less an autovacuum nap time, except perhaps
>>>> in the case where many vacuums get cancelled, and they have to be
>>>> queued.
>>>
>>> I think that's not true if there are multiple databases.
>>
>> I'd have to re-check.
>>
>> The loop is basically, IIRC:
>>
>> while(1) { vacuum db ; work items ; nap }
>>
>> Now, if that's vacuuming one db, not all of them, and if the decision on the
>> second run doesn't pick the same db because that big table failed to
>> be vacuumed, then you're right.
>>
>> In that case we could add the FSM vacuum as a work item *in addition*
>> to what this patch does. If the work item queue is full and the FSM
>> vacuum doesn't happen, it'd be no worse than with the patch as-is.
>>
>> Is that what you suggest?
>
> That's one of the my suggestion. I might had to consider this issue
> for each case. To be clear let me summarize for each case.
>
> For table with indices, vacuum on fsm of heap is done after
> lazy_vacuum_heap(). Therefore I think that the case where a table got
> vacuumed but fsm couldn't get vacuumed doesn't happen unless the
> autovacuum gets cancelled before or while vacuuming fsm.

Well, that's the whole point. Autovacuums get cancelled all the time
in highly contended tables. I have more than a handful of tables in
production that never finish autovacuum, so I have to trigger manual
vacuums periodically.

> (1) using autovacuum work-item, (2) vacuuming fsm of table in
> PG_CATCH, (3) remembering the tables got cancelled and vacuuming them
> after finished a loop of table_oids.
...
> So I'm in favor of (3).
...
> However, it's quite possible that I'm not seeing the whole picture here.

Well, none of that handles the case where the autovacuum of a table
doesn't get cancelled, but takes a very long time.

No free space becomes visible during long-running vacuums. That means
bloat keeps accumulating even though vacuum is freeing space, because
the FSM doesn't expose that free space.

The extra work incurred in those FSM vacuums isn't useless, it's
preventing bloat in long-running vacuums.

I can look into doing 3, that *might* get rid of the need to do that
initial FSM vacuum, but all other intermediate vacuums are still
needed.



Re: [HACKERS] [PATCH] Vacuum: Update FSM more frequently

2018-02-05 Thread Claudio Freire
On Mon, Feb 5, 2018 at 2:55 PM, Claudio Freire <klaussfre...@gmail.com> wrote:
> With that in mind, I'm noticing WorkItems have a avw_database that
> isn't checked by do_autovacuum. Is that right? Shouldn't only work
> items that belong to the database being autovacuumed be processed?

NVM. I had to git pull, it's fixed in head.



Re: [HACKERS] [PATCH] Vacuum: Update FSM more frequently

2018-02-05 Thread Claudio Freire
On Mon, Feb 5, 2018 at 1:53 AM, Masahiko Sawada <sawada.m...@gmail.com> wrote:
> On Fri, Feb 2, 2018 at 11:13 PM, Claudio Freire <klaussfre...@gmail.com> 
> wrote:
>> After autovacuum gets cancelled, the next time it wakes up it will
>> retry vacuuming the cancelled relation. That's because a cancelled
>> autovacuum doesn't update the last-vacuumed stats.
>>
>> So the timing between an autovacuum work item and the next retry for
>> that relation is more or less an autovacuum nap time, except perhaps
>> in the case where many vacuums get cancelled, and they have to be
>> queued.
>
> I think that's not true if there are multiple databases.

I'd have to re-check.

The loop is basically, IIRC:

while(1) { vacuum db ; work items ; nap }

Now, if that's vacuuming one db, not all of them, and if the decision on the
second run doesn't pick the same db because that big table failed to
be vacuumed, then you're right.

In that case we could add the FSM vacuum as a work item *in addition*
to what this patch does. If the work item queue is full and the FSM
vacuum doesn't happen, it'd be no worse than with the patch as-is.

Is that what you suggest?

With that in mind, I'm noticing WorkItems have a avw_database that
isn't checked by do_autovacuum. Is that right? Shouldn't only work
items that belong to the database being autovacuumed be processed?

>> That's why an initial FSM vacuum makes sense. It has a similar timing
>> to the autovacuum work item, it has the benefit that it can be
>> triggered manually (manual vacuum), and it's cheap and efficient.
>>
>>> Also the patch always vacuums fsm at beginning of the vacuum with a
>>> threshold but it is useless if the table has been properly vacuumed. I
>>> don't think it's good idea to add an additional step that "might be"
>>> efficient, because current vacuum is already heavy.
>>
>> FSMs are several orders of magnitude smaller than the tables
>> themselves. A TB-sized table I have here has a 95M FSM. If you add
>> threshold skipping, that initial FSM vacuum *will* be efficient.
>>
>> By definition, the FSM will be less than 1/4000th of the table, so
>> that initial FSM pass takes less than 1/4000th of the whole vacuum.
>> Considerably less considering the simplicity of the task.
>
> I agree the fsm is very smaller than heap and vacuum on fsm will not
> be comparatively heavy but I'm afraid that the vacuum will get more
> heavy in the future if we pile up such improvement that are small but
> might not be efficient. For example, a feature for reporting the last
> vacuum status has been proposed[1]. I wonder if we can use it to
> determine whether we do the fsm vacuum at beginning of vacuum.

Yes, such a feature would allow skipping that initial FSM vacuum. That
can be improved in a future patch if that proposed feature gets
merged. This patch can be treated independently from that in the
meanwhile, don't you think?



Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2018-02-02 Thread Claudio Freire
On Thu, Jan 25, 2018 at 6:21 PM, Thomas Munro
<thomas.mu...@enterprisedb.com> wrote:
> On Fri, Jan 26, 2018 at 9:38 AM, Claudio Freire <klaussfre...@gmail.com> 
> wrote:
>> I had the tests running in a loop all day long, and I cannot reproduce
>> that variance.
>>
>> Can you share your steps to reproduce it, including configure flags?
>
> Here are two build logs where it failed:
>
> https://travis-ci.org/postgresql-cfbot/postgresql/builds/332968819
> https://travis-ci.org/postgresql-cfbot/postgresql/builds/332592511
>
> Here's one where it succeeded:
>
> https://travis-ci.org/postgresql-cfbot/postgresql/builds/333139855
>
> The full build script used is:
>
> ./configure --enable-debug --enable-cassert --enable-coverage
> --enable-tap-tests --with-tcl --with-python --with-perl --with-ldap
> --with-icu && make -j4 all contrib docs && make -Otarget -j3
> check-world
>
> This is a virtualised 4 core system.  I wonder if "make -Otarget -j3
> check-world" creates enough load on it to produce some weird timing
> effect that you don't see on your development system.

I can't reproduce it, not even with the same build script.

It's starting to look like a timing effect indeed.

I get a similar effect if there's an active snapshot in another
session while vacuum runs. I don't know how the test suite ends up in
that situation, but it seems to be the case.

How do you suggest we go about fixing this? The test in question is
important; I've caught actual bugs in the implementation with it,
because it checks that vacuum effectively frees up space.

I'm thinking this vacuum test could be put on its own parallel group
perhaps? Since I can't reproduce it, I can't know whether that will
fix it, but it seems sensible.



Re: [HACKERS] [PATCH] Vacuum: Update FSM more frequently

2018-02-02 Thread Claudio Freire
On Thu, Feb 1, 2018 at 9:34 PM, Masahiko Sawada <sawada.m...@gmail.com> wrote:
> On Mon, Jan 29, 2018 at 11:31 PM, Claudio Freire <klaussfre...@gmail.com> 
> wrote:
>> On Mon, Jan 29, 2018 at 4:12 AM, Masahiko Sawada <sawada.m...@gmail.com> 
>> wrote:
>>> On Sat, Jul 29, 2017 at 9:42 AM, Claudio Freire <klaussfre...@gmail.com> 
>>> wrote:
>>>> Introduce a tree pruning threshold to FreeSpaceMapVacuum that avoids
>>>> recursing into branches that already contain enough free space, to
>>>> avoid having to traverse the whole FSM and thus induce quadratic
>>>> costs. Intermediate FSM vacuums are only supposed to make enough
>>>> free space visible to avoid extension until the final (non-partial)
>>>> FSM vacuum.
>>>
>>> Hmm, I think this resolve a  part of the issue. How about calling
>>> AutoVacuumRequestWork() in PG_CATCH() if VACOPT_VACUUM is specified
>>> and give the relid that we were vacuuming but could not complete as a
>>> new autovacuum work-item? The new autovacuum work-item makes the
>>> worker vacuum FSMs of the given relation and its indices.
>>
>> Well, I tried that in fact, as I mentioned in the OP.
>>
>> I abandoned it due to the conjunction of the 2 main blockers I found
>> and mentioned there. In essence, those issues defeat the purpose of
>> the patch (to get the free space visible ASAP).
>>
>> Don't forget, this is aimed at cases where autovacuum of a single
>> relation takes a very long time. That is, very big relations. Maybe
>> days, like in my case. A whole autovacuum cycle can take weeks, so
>> delaying FSM vacuum that much is not good, and using work items still
>> cause those delays, not to mention the segfaults.
>
> Yeah, I agree to vacuum fsm more frequently because it can prevent
> table bloating from concurrent modifying. But considering the way to
> prevent the table bloating from cancellation of autovacuum, I guess we
> need more things. This proposal seems to provide us an ability that is
> we "might be" able to prevent table bloating due to cancellation of
> autovacuum. Since we can know that autovacuum is cancelled, I'd like
> to have a way so that we can surely vacuum fsm even if vacuum is
> cancelled. Thoughts?

After autovacuum gets cancelled, the next time it wakes up it will
retry vacuuming the cancelled relation. That's because a cancelled
autovacuum doesn't update the last-vacuumed stats.

So the timing between an autovacuum work item and the next retry for
that relation is more or less an autovacuum nap time, except perhaps
in the case where many vacuums get cancelled, and they have to be
queued.

That's why an initial FSM vacuum makes sense. It has a similar timing
to the autovacuum work item, it has the benefit that it can be
triggered manually (manual vacuum), and it's cheap and efficient.

> Also the patch always vacuums fsm at beginning of the vacuum with a
> threshold but it is useless if the table has been properly vacuumed. I
> don't think it's good idea to add an additional step that "might be"
> efficient, because current vacuum is already heavy.

FSMs are several orders of magnitude smaller than the tables
themselves. A TB-sized table I have here has a 95M FSM. If you add
threshold skipping, that initial FSM vacuum *will* be efficient.

By definition, the FSM will be less than 1/4000th of the table, so
that initial FSM pass takes less than 1/4000th of the whole vacuum.
Considerably less considering the simplicity of the task.
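
For anyone wondering where the 1/4000 upper bound comes from, the
arithmetic is simple (rough numbers from memory of the FSM layout: one
leaf byte per heap page, and roughly 4000 leaf slots per 8kB FSM page,
the other half of the page being the in-page tree; this is an upper
bound, actual FSMs are often smaller):

#include <stdio.h>

int
main(void)
{
    const double leaf_slots = 4000.0;   /* approx. leaf nodes per FSM page */
    const double heap_tb = 1.0;         /* a 1TB heap, as an example */

    /*
     * One FSM page covers about leaf_slots heap pages, and both use the
     * same 8kB page size, so FSM size / heap size is about 1/leaf_slots
     * (upper FSM levels add roughly another 1/leaf_slots of that).
     */
    printf("FSM/heap ratio: ~1/%.0f\n", leaf_slots);
    printf("upper bound for a %.0fTB heap: ~%.0f MB of FSM\n",
           heap_tb, heap_tb * 1024.0 * 1024.0 / leaf_slots);
    return 0;
}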

> On detail of the patch,
>
> --- a/src/backend/storage/freespace/indexfsm.c
> +++ b/src/backend/storage/freespace/indexfsm.c
> @@ -70,5 +70,5 @@ RecordUsedIndexPage(Relation rel, BlockNumber usedBlock)
>  void
>  IndexFreeSpaceMapVacuum(Relation rel)
>  {
> -  FreeSpaceMapVacuum(rel);
> +  FreeSpaceMapVacuum(rel, 0);
>  }
>
> @@ -816,11 +820,19 @@ fsm_vacuum_page(Relation rel, FSMAddress addr,
> bool *eof_p)
>   {
>  int child_avail;
>
> +/* Tree pruning for partial vacuums */
> +if (threshold)
> +{
> +   child_avail = fsm_get_avail(page, slot);
> +   if (child_avail >= threshold)
> +  continue;
> +}
>
> Don't we skip all fsm pages if we set the threshold to 0?

No, that doesn't skip anything if threshold is 0, since it doesn't get
to the continue, which is the one skipping nodes.



Re: [HACKERS] [PATCH] Vacuum: Update FSM more frequently

2018-01-29 Thread Claudio Freire
On Mon, Jan 29, 2018 at 4:12 AM, Masahiko Sawada <sawada.m...@gmail.com> wrote:
> On Sat, Jul 29, 2017 at 9:42 AM, Claudio Freire <klaussfre...@gmail.com> 
> wrote:
>> Introduce a tree pruning threshold to FreeSpaceMapVacuum that avoids
>> recursing into branches that already contain enough free space, to
>> avoid having to traverse the whole FSM and thus induce quadratic
>> costs. Intermediate FSM vacuums are only supposed to make enough
>> free space visible to avoid extension until the final (non-partial)
>> FSM vacuum.
>
> Hmm, I think this resolve a  part of the issue. How about calling
> AutoVacuumRequestWork() in PG_CATCH() if VACOPT_VACUUM is specified
> and give the relid that we were vacuuming but could not complete as a
> new autovacuum work-item? The new autovacuum work-item makes the
> worker vacuum FSMs of the given relation and its indices.

Well, I tried that in fact, as I mentioned in the OP.

I abandoned it due to the conjunction of the 2 main blockers I found
and mentioned there. In essence, those issues defeat the purpose of
the patch (to get the free space visible ASAP).

Don't forget, this is aimed at cases where autovacuum of a single
relation takes a very long time. That is, very big relations. Maybe
days, like in my case. A whole autovacuum cycle can take weeks, so
delaying FSM vacuum that much is not good, and using work items still
causes those delays, not to mention the segfaults.

> That way, we
> can ensure that FSM gets vacuumed by the cancelled autovacuum process
> or other autovacuum processes. Since a work-item can be handled by
> other autovacuum process I think 256 work-item limit would not be a
> problem.

Why do you think it wouldn't? In particular if you take into account
the above. If you have more than 256 relations in the cluster, it
could very well happen that you've queued the maximum amount and no
autovac worker has had a chance to take a look at them, because
they're all stuck vacuuming huge relations.

Not to mention the need to de-duplicate work items. We wouldn't want
to request repeated FSM vacuums, or worse, queue an FSM vacuum of a
single table 256 times and fill up the queue with redundant items.
With the current structure, de-duplication is O(N), so if we wanted to
raise the limit of 256 work items, we'd need a structure that would
let us de-duplicate in less than O(N). In essence, it's a ton of work
for very little gain. Hence why I abandoned it.

> Or it might be better to make autovacuum launcher launch
> worker process if there is pending work-item in a database even if
> there is no table needs to be vacuumed/analyzed.

Quite a must, in fact.

>> For one, it would sometimes crash when adding the work item from
>> inside autovacuum itself. I didn't find the cause of the crash, but I suspect
>> AutoVacuumRequestWork was being called in a context where it
>> was not safe.
>
> Perhaps the cause of this might be that autovacuum work-item is
> implemented using DSA, which has been changed before by commit
> 31ae1638ce35c23979f9bcbb92c6bb51744dbccb.

I thought about that. Perhaps. But the work-item approach didn't fail
just because of the segfaults. It's also that it doesn't address the
free space visibility issue quickly enough.



Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2018-01-25 Thread Claudio Freire
On Thu, Jan 25, 2018 at 10:56 AM, Claudio Freire <klaussfre...@gmail.com> wrote:
> On Thu, Jan 25, 2018 at 4:11 AM, Thomas Munro
> <thomas.mu...@enterprisedb.com> wrote:
>> On Thu, Jan 18, 2018 at 9:17 AM, Claudio Freire <klaussfre...@gmail.com> 
>> wrote:
>>> Huh. That was simpler than I thought.
>>>
>>> Attached rebased versions.
>>
>> Hi Claudio,
>>
>> FYI the regression test seems to have some run-to-run variation.
>> Though it usually succeeds, recently I have seen a couple of failures
>> like this:
>>
>> = Contents of ./src/test/regress/regression.diffs
>> *** 
>> /home/travis/build/postgresql-cfbot/postgresql/src/test/regress/expected/vacuum.out
>> 2018-01-24 01:41:28.200454371 +
>> --- 
>> /home/travis/build/postgresql-cfbot/postgresql/src/test/regress/results/vacuum.out
>> 2018-01-24 01:51:07.970049937 +
>> ***
>> *** 128,134 
>>   SELECT pg_relation_size('vactst', 'main');
>>pg_relation_size
>>   --
>> ! 0
>>   (1 row)
>>
>>   SELECT count(*) FROM vactst;
>> --- 128,134 
>>   SELECT pg_relation_size('vactst', 'main');
>>pg_relation_size
>>   --
>> !  8192
>>   (1 row)
>>
>>   SELECT count(*) FROM vactst;
>> ==
>>
>> --
>> Thomas Munro
>> http://www.enterprisedb.com
>
> I'll look into it

I had the tests running in a loop all day long, and I cannot reproduce
that variance.

Can you share your steps to reproduce it, including configure flags?



Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2018-01-25 Thread Claudio Freire
On Thu, Jan 25, 2018 at 4:11 AM, Thomas Munro
<thomas.mu...@enterprisedb.com> wrote:
> On Thu, Jan 18, 2018 at 9:17 AM, Claudio Freire <klaussfre...@gmail.com> 
> wrote:
>> Huh. That was simpler than I thought.
>>
>> Attached rebased versions.
>
> Hi Claudio,
>
> FYI the regression test seems to have some run-to-run variation.
> Though it usually succeeds, recently I have seen a couple of failures
> like this:
>
> = Contents of ./src/test/regress/regression.diffs
> *** 
> /home/travis/build/postgresql-cfbot/postgresql/src/test/regress/expected/vacuum.out
> 2018-01-24 01:41:28.200454371 +
> --- 
> /home/travis/build/postgresql-cfbot/postgresql/src/test/regress/results/vacuum.out
> 2018-01-24 01:51:07.970049937 +
> ***
> *** 128,134 
>   SELECT pg_relation_size('vactst', 'main');
>pg_relation_size
>   --
> ! 0
>   (1 row)
>
>   SELECT count(*) FROM vactst;
> --- 128,134 
>   SELECT pg_relation_size('vactst', 'main');
>pg_relation_size
>   --
> !  8192
>   (1 row)
>
>   SELECT count(*) FROM vactst;
> ==
>
> --
> Thomas Munro
> http://www.enterprisedb.com

I'll look into it

However, shouldn't an empty relation have an initial page anyway? In
that case shouldn't the correct value be 8192?



Re: Built-in connection pooling

2018-01-19 Thread Claudio Freire
On Fri, Jan 19, 2018 at 2:22 PM, Tomas Vondra <tomas.von...@2ndquadrant.com>
wrote:

>
>
> On 01/19/2018 06:13 PM, Claudio Freire wrote:
> >
> >
> > On Fri, Jan 19, 2018 at 2:07 PM, Konstantin Knizhnik
> > <k.knizh...@postgrespro.ru <mailto:k.knizh...@postgrespro.ru>> wrote:
> >
> >
> >
> >
> > Well, I haven't said it has to be single-threaded like
> pgbouncer. I
> > don't see why the bgworker could not use multiple threads
> > internally (of
> > course, it'd need to be not to mess the stuff that is not
> > thread-safe).
> >
> >
> > Certainly architecture with N multiple scheduling bgworkers and M
> > executors (backends) may be more flexible
> > than solution when scheduling is done in executor itself. But we
> > will have to pay extra cost for redirection.
> > I am not sure that finally it will allow to reach better performance.
> > More flexible solution in many cases doesn't mean more efficient
> > solution.
> >
> >
> > I think you can take the best of both worlds.
> >
> > You can take your approach of passing around fds, and build a "load
> > balancing protocol" in a bgworker.
> >
> > The postmaster sends the socket to the bgworker, the bgworker waits for
> > a command as pgbouncer does, but instead of proxying everything, when
> > commands arrive, it passes the socket to a backend to handle.
> >
> > That way, the bgworker can do what pgbouncer does, handle different
> > pooling modes, match backends to databases, etc, but it doesn't have to
> > proxy all data, it just delegates handling of a command to a backend,
> > and forgets about that socket.
> >
> > Sounds like it could work.
> >
>
> How could it do all that without actually processing all the data? For
> example, how could it determine the statement/transaction boundaries?
>

It only needs to determine statement/transaction start.

After that, it hands off the connection to a backend, and the backend
determines when to give it back.

So instead of processing all the data, it only processes a tiny part of it.
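
To make that concrete: on an established protocol-3 connection every
frontend message is a 1-byte type plus a 4-byte big-endian length (the
length counts itself but not the type byte), so noticing that a new
query has arrived is a 5-byte peek. A rough standalone sketch (mine,
not from the patch; it ignores SSL, the untyped startup packet and
partial reads, and the hand-off helpers named in the comment are
hypothetical):

#include <stdint.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <arpa/inet.h>

/*
 * Peek at the next frontend message header without consuming it.
 * Returns the message type byte ('Q' = simple query, 'P' = Parse,
 * 'X' = terminate, ...) or 0 if a full header isn't available yet.
 */
static char
peek_fe_message(int sock, uint32_t *payload_len)
{
    unsigned char hdr[5];
    ssize_t n = recv(sock, hdr, sizeof(hdr), MSG_PEEK);

    if (n < (ssize_t) sizeof(hdr))
        return 0;

    uint32_t len;

    memcpy(&len, hdr + 1, sizeof(len));
    *payload_len = ntohl(len) - 4;      /* length field counts itself */
    return (char) hdr[0];
}

/*
 * A dispatcher loop would then do something like:
 *
 *     uint32_t len;
 *     char     type = peek_fe_message(client_sock, &len);
 *
 *     if (type == 'Q' || type == 'P')
 *         hand_socket_to_backend(client_sock);   (hypothetical)
 *     else if (type == 'X')
 *         close_client(client_sock);             (hypothetical)
 */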


Re: Built-in connection pooling

2018-01-19 Thread Claudio Freire
On Fri, Jan 19, 2018 at 2:07 PM, Konstantin Knizhnik <
k.knizh...@postgrespro.ru> wrote:

>
>
>
>> Well, I haven't said it has to be single-threaded like pgbouncer. I
>> don't see why the bgworker could not use multiple threads internally (of
>> course, it'd need to be not to mess the stuff that is not thread-safe).
>>
>>
> Certainly architecture with N multiple scheduling bgworkers and M
> executors (backends) may be more flexible
> than solution when scheduling is done in executor itself. But we will have
> to pay extra cost for redirection.
> I am not sure that finally it will allow to reach better performance.
> More flexible solution in many cases doesn't mean more efficient solution.


I think you can take the best of both worlds.

You can take your approach of passing around fds, and build a "load
balancing protocol" in a bgworker.

The postmaster sends the socket to the bgworker, the bgworker waits for a
command as pgbouncer does, but instead of proxying everything, when
commands arrive, it passes the socket to a backend to handle.

That way, the bgworker can do what pgbouncer does, handle different pooling
modes, match backends to databases, etc, but it doesn't have to proxy all
data, it just delegates handling of a command to a backend, and forgets
about that socket.

Sounds like it could work.
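
For reference, the hand-off itself is plain Unix fd passing with
SCM_RIGHTS ancillary data over a unix-domain socket. A minimal sketch
of the sending side (mine, not from the patch; the receiving backend
does the mirror-image recvmsg() and pulls the descriptor out of
CMSG_DATA()):

#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>

/*
 * Send file descriptor 'fd' to the peer on unix-domain socket 'chan'.
 * Returns 0 on success, -1 on error (check errno).
 */
static int
send_fd(int chan, int fd)
{
    char byte = 'F';            /* at least one byte of real data is required */
    struct iovec iov = {.iov_base = &byte, .iov_len = 1};
    union
    {
        struct cmsghdr hdr;
        char    buf[CMSG_SPACE(sizeof(int))];
    } ctl;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    memset(&msg, 0, sizeof(msg));
    memset(&ctl, 0, sizeof(ctl));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof(ctl.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return (sendmsg(chan, &msg, 0) < 0) ? -1 : 0;
}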


Re: Built-in connection pooling

2018-01-19 Thread Claudio Freire
On Fri, Jan 19, 2018 at 2:06 PM, Tomas Vondra <tomas.von...@2ndquadrant.com>
wrote:

>
>
> On 01/19/2018 06:03 PM, Claudio Freire wrote:
> >
> >
> > On Fri, Jan 19, 2018 at 1:53 PM, Konstantin Knizhnik
> > <k.knizh...@postgrespro.ru <mailto:k.knizh...@postgrespro.ru>> wrote:
> >
> >
> >
> > On 19.01.2018 19:28, Pavel Stehule wrote:
> >>
> >>
> >> When I've been thinking about adding a built-in connection
> >> pool, my
> >> rough plan was mostly "bgworker doing something like
> >> pgbouncer" (that
> >> is, listening on a separate port and proxying everything
> >> to regular
> >> backends). Obviously, that has pros and cons, and probably
> >> would not
> >> work serve the threading use case well.
> >>
> >>
> >> And we will get the same problem as with pgbouncer: one
> >> process will not be able to handle all connections...
> >> Certainly it is possible to start several such scheduling
> >> bgworkers... But in any case it is more efficient to multiplex
> >> session in backend themselves.
> >>
> >>
> >> pgbouncer hold all time client connect. When we implement the
> >> listeners, then all work can be done by worker processes not by
> >> listeners.
> >>
> >
> > Sorry, I do not understand your point.
> > In my case pgbench establish connection to the pgbouncer only  once
> > at the beginning of the test.
> > And pgbouncer spends all time in context switches (CPU usage is 100%
> > and it is mostly in kernel space: top of profile are kernel
> functions).
> > The same picture will be if instead of pgbouncer you will do such
> > scheduling in one bgworker.
> > For the modern systems are not able to perform more than several
> > hundreds of connection switches per second.
> > So with single multiplexing thread or process you can not get speed
> > more than 100k, while at powerful NUMA system it is possible to
> > achieve millions of TPS.
> > It is illustrated by the results I have sent in the previous mail:
> > by spawning 10 instances of pgbouncer I was able to receive 7 times
> > bigger speed.
> >
> >
> > I'm sure pgbouncer can be improved. I've seen async code handle millions
> > of packets per second (zmq), pgbouncer shouldn't be radically different.
> >
>
> The trouble is pgbouncer is not handling individual packets. It needs to
> do additional processing to assemble the messages, understand the state
> of the connection (e.g. to do transaction pooling) etc. Or handle SSL.
>

I understand. But zmq also has to process framing very similar to the FE/BE
protocol, so I'm still hopeful.
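
For context, the per-message work a pooler has to do on the FE/BE wire
format is essentially this framing check (one type byte plus a 4-byte
big-endian length that counts itself but not the type byte). The sketch
below is a simplified illustration, not pgbouncer's actual code:

#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>

/*
 * Returns the total frame size (type byte + body) if a complete message is
 * buffered, or 0 if more data is needed.
 */
size_t
fe_message_size(const uint8_t *buf, size_t buflen)
{
	uint32_t	len;

	if (buflen < 5)				/* type byte + length word not yet complete */
		return 0;

	memcpy(&len, buf + 1, 4);
	len = ntohl(len);			/* length counts itself but not the type byte */

	if (len < 4)
		return 0;				/* malformed; real code would drop the client */

	return (buflen >= (size_t) len + 1) ? (size_t) len + 1 : 0;
}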


Re: Built-in connection pooling

2018-01-19 Thread Claudio Freire
On Fri, Jan 19, 2018 at 1:53 PM, Konstantin Knizhnik <
k.knizh...@postgrespro.ru> wrote:

>
>
> On 19.01.2018 19:28, Pavel Stehule wrote:
>
>
>
> When I've been thinking about adding a built-in connection pool, my
>>> rough plan was mostly "bgworker doing something like pgbouncer" (that
>>> is, listening on a separate port and proxying everything to regular
>>> backends). Obviously, that has pros and cons, and probably would not
>>> work serve the threading use case well.
>>>
>>
>> And we will get the same problem as with pgbouncer: one process will not
>> be able to handle all connections...
>> Certainly it is possible to start several such scheduling bgworkers...
>> But in any case it is more efficient to multiplex session in backend
>> themselves.
>>
>
> pgbouncer hold all time client connect. When we implement the listeners,
> then all work can be done by worker processes not by listeners.
>
>
> Sorry, I do not understand your point.
> In my case pgbench establish connection to the pgbouncer only  once at the
> beginning of the test.
> And pgbouncer spends all time in context switches (CPU usage is 100% and
> it is mostly in kernel space: top of profile are kernel functions).
> The same picture will be if instead of pgbouncer you will do such
> scheduling in one bgworker.
> For the modern systems are not able to perform more than several hundreds
> of connection switches per second.
> So with single multiplexing thread or process you can not get speed more
> than 100k, while at powerful NUMA system it is possible to achieve millions
> of TPS.
> It is illustrated by the results I have sent in the previous mail: by
> spawning 10 instances of pgbouncer I was able to receive 7 times bigger
> speed.
>

I'm sure pgbouncer can be improved. I've seen async code handle millions of
packets per second (zmq), pgbouncer shouldn't be radically different.


Re: Built-in connection pooling

2018-01-18 Thread Claudio Freire
On Thu, Jan 18, 2018 at 11:48 AM, Konstantin Knizhnik <
k.knizh...@postgrespro.ru> wrote:

>
> Attached please find new version of the patch with few fixes.
> And more results at NUMA system with 144 cores and 3Tb of RAM.
>
> Read-only pgbench (-S):
>
>
> #Connections | Vanilla Postgres (kTPS) | Session pool size 256 (kTPS)
> 1k           | 1300                    | 1505
> 10k          | 633                     | 1519
> 100k         | -                       | 1425
>
>
>
> Read-write contention test: access to small number of records with 1% of
> updates.
>
> #Clients | Vanilla Postgres (TPS) | Session pool size 256 (TPS)
> 100      | 557232                 | 573319
> 200      | 520395                 | 551670
> 300      | 511423                 | 533773
> 400      | 468562                 | 523091
> 500      | 442268                 | 514056
> 600      | 401860                 | 526704
> 700      | 363912                 | 530317
> 800      | 325148                 | 512238
> 900      | 301310                 | 512844
> 1000     | 278829                 | 554516
>
> So, as you can see, there is no degrade of performance with increased number 
> of connections in case of using session pooling.
>
>
TBH, the tests you should be running are comparisons with a similar pool
size managed by pgbouncer, not just vanilla unlimited postgres.

Of course a limited pool size will beat thousands of concurrent queries by
a large margin. The real question is whether a pthread-based approach beats
the pgbouncer approach.
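
For reference, such a comparison could use a pgbouncer instance in
transaction pooling mode with the same pool size, with pgbench driving
clients through the pooler port. The config below is a hypothetical sketch;
database names, addresses and sizes are illustrative only:

; pgbouncer.ini (sketch)
[databases]
bench = host=127.0.0.1 port=5432 dbname=bench

[pgbouncer]
listen_addr = 127.0.0.1
listen_port = 6432
auth_type = trust
pool_mode = transaction
default_pool_size = 256
max_client_conn = 100000

On the client side, something like "pgbench -S -M prepared -c 1000 -j 32
-T 300 -h 127.0.0.1 -p 6432 bench", scaling -c as far as pgbench itself can
drive, would then be directly comparable to the session-pool numbers above.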


Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2018-01-17 Thread Claudio Freire
On Wed, Jan 17, 2018 at 5:02 PM, Claudio Freire <klaussfre...@gmail.com>
wrote:

>
>
> On Sat, Jan 6, 2018 at 7:35 PM, Stephen Frost <sfr...@snowman.net> wrote:
>
>> Greetings,
>>
>> * Michael Paquier (michael.paqu...@gmail.com) wrote:
>> > On Mon, Dec 4, 2017 at 2:38 PM, Claudio Freire <klaussfre...@gmail.com>
>> wrote:
>> > > They did apply at the time, but I think major work on vacuum was
>> > > pushed since then, and also I was traveling so out of reach.
>> > >
>> > > It may take some time to rebase them again. Should I move to needs
>> > > review myself after that?
>> >
>> > Sure, if you can get into this state, please feel free to update the
>> > status of the patch yourself.
>>
>> We're now over a month since this status update- Claudio, for this to
>> have a chance during this commitfest to be included (which, personally,
>> I think would be great as it solves a pretty serious issue..), we really
>> need to have it be rebased and updated.  Once that's done, as Michael
>> says, please change the patch status back to 'Needs Review'.
>>
>
> Sorry, had tons of other stuff that took priority.
>
> I'll get to rebase this patch now.
>
>
>
Huh. That was simpler than I thought.

Attached rebased versions.
From df993486d3847c59605c8e35ae8b27183a3c9ae7 Mon Sep 17 00:00:00 2001
From: Claudio Freire <klaussfre...@gmail.com>
Date: Mon, 12 Sep 2016 23:36:42 -0300
Subject: [PATCH 1/2] Vacuum: allow using more than 1GB work mem

Turn the dead_tuples array into a structure composed of several
exponentially bigger arrays, to enable usage of more than 1GB
of work mem during vacuum and thus reduce the number of full
index scans necessary to remove all dead tids when the memory is
available.

Improve test ability to spot vacuum errors
---
 src/backend/commands/vacuumlazy.c| 402 ---
 src/test/regress/expected/vacuum.out |  27 +++
 src/test/regress/sql/vacuum.sql  |  19 ++
 3 files changed, 374 insertions(+), 74 deletions(-)

diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index cf7f5e1..1e82d26 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -11,11 +11,24 @@
  *
  * We are willing to use at most maintenance_work_mem (or perhaps
  * autovacuum_work_mem) memory space to keep track of dead tuples.  We
- * initially allocate an array of TIDs of that size, with an upper limit that
+ * initially allocate an array of TIDs of 128MB, or an upper limit that
  * depends on table size (this limit ensures we don't allocate a huge area
- * uselessly for vacuuming small tables).  If the array threatens to overflow,
+ * uselessly for vacuuming small tables). Additional arrays of increasingly
+ * large sizes are allocated as they become necessary.
+ *
+ * The TID array is thus represented as a list of multiple segments of
+ * varying size, beginning with the initial size of up to 128MB, and growing
+ * exponentially until the whole budget of (autovacuum_)maintenance_work_mem
+ * is used up.
+ *
+ * Lookup in that structure happens with a binary search of segments, and then
+ * with a binary search within each segment. Since segment's size grows
+ * exponentially, this retains O(log N) lookup complexity.
+ *
+ * If the multiarray's total size threatens to grow beyond maintenance_work_mem,
  * we suspend the heap scan phase and perform a pass of index cleanup and page
- * compaction, then resume the heap scan with an empty TID array.
+ * compaction, then resume the heap scan with an array of logically empty but
+ * already preallocated TID segments to be refilled with more dead tuple TIDs.
  *
  * If we're processing a table with no indexes, we can just vacuum each page
  * as we go; there's no need to save up multiple tuples to minimize the number
@@ -92,6 +105,14 @@
 #define LAZY_ALLOC_TUPLES		MaxHeapTuplesPerPage
 
 /*
+ * Minimum (starting) size of the dead_tuples array segments. Will allocate
+ * space for 128MB worth of tid pointers in the first segment, further segments
+ * will grow in size exponentially. Don't make it too small or the segment list
+ * will grow bigger than the sweetspot for search efficiency on big vacuums.
+ */
+#define LAZY_INIT_TUPLES		Max(MaxHeapTuplesPerPage, (128<<20) / sizeof(ItemPointerData))
+
+/*
  * Before we consider skipping a page that's marked as clean in
  * visibility map, we must've seen at least this many clean pages.
  */
@@ -103,6 +124,27 @@
  */
 #define PREFETCH_SIZE			((BlockNumber) 32)
 
+typedef struct DeadTuplesSegment
+{
+	ItemPointerData last_dead_tuple;	/* Copy of the last dead tuple (unset
+		 * until the segment is fully
+		 * populated). Keep it first to simplify
+		 * binary searches */
+	int			num_dead_tuples
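
(The attached patch is truncated in this archive. As a rough, self-contained
illustration of the two-level lookup its header comment describes, a search
over sorted, exponentially growing segments could look like the sketch
below; the type and field names are simplified stand-ins, not the patch's
actual definitions.)

#include "postgres.h"
#include "storage/itemptr.h"

typedef struct
{
	ItemPointerData last_dead_tuple;	/* upper bound of this segment */
	int			num_dead_tuples;		/* entries used in dead_tuples[] */
	ItemPointerData *dead_tuples;		/* sorted TIDs */
} DeadTuplesSegmentSketch;

/* Return true if itemptr is recorded in any segment. */
static bool
tid_reaped_sketch(DeadTuplesSegmentSketch *segs, int nsegs,
				  ItemPointer itemptr)
{
	int			lo = 0,
				hi = nsegs - 1,
				seg = -1;

	/* First level: find the first segment whose upper bound is >= itemptr */
	while (lo <= hi)
	{
		int			mid = lo + (hi - lo) / 2;

		if (ItemPointerCompare(&segs[mid].last_dead_tuple, itemptr) >= 0)
		{
			seg = mid;
			hi = mid - 1;
		}
		else
			lo = mid + 1;
	}
	if (seg < 0)
		return false;			/* beyond every stored TID */

	/* Second level: binary search within that (sorted) segment */
	lo = 0;
	hi = segs[seg].num_dead_tuples - 1;
	while (lo <= hi)
	{
		int			mid = lo + (hi - lo) / 2;
		int			cmp = ItemPointerCompare(&segs[seg].dead_tuples[mid],
											 itemptr);

		if (cmp == 0)
			return true;
		if (cmp < 0)
			lo = mid + 1;
		else
			hi = mid - 1;
	}
	return false;
}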

Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2018-01-17 Thread Claudio Freire
On Sat, Jan 6, 2018 at 7:35 PM, Stephen Frost <sfr...@snowman.net> wrote:

> Greetings,
>
> * Michael Paquier (michael.paqu...@gmail.com) wrote:
> > On Mon, Dec 4, 2017 at 2:38 PM, Claudio Freire <klaussfre...@gmail.com>
> wrote:
> > > They did apply at the time, but I think major work on vacuum was
> > > pushed since then, and also I was traveling so out of reach.
> > >
> > > It may take some time to rebase them again. Should I move to needs
> > > review myself after that?
> >
> > Sure, if you can get into this state, please feel free to update the
> > status of the patch yourself.
>
> We're now over a month since this status update- Claudio, for this to
> have a chance during this commitfest to be included (which, personally,
> I think would be great as it solves a pretty serious issue..), we really
> need to have it be rebased and updated.  Once that's done, as Michael
> says, please change the patch status back to 'Needs Review'.
>

Sorry, had tons of other stuff that took priority.

I'll get to rebase this patch now.


Re: [HACKERS] [PATCH] Vacuum: Update FSM more frequently

2018-01-17 Thread Claudio Freire
On Sat, Jan 6, 2018 at 7:33 PM, Stephen Frost <sfr...@snowman.net> wrote:

> Greetings Claudio,
>
> * Michael Paquier (michael.paqu...@gmail.com) wrote:
> > On Mon, Nov 27, 2017 at 2:39 PM, Jing Wang <jingwang...@gmail.com>
> wrote:
> > > A few general comments.
> > >
> > > +   FreeSpaceMapVacuum(onerel, 64);
> > >
> > > Just want to know why '64' is used here? It's better to give a
> description.
> > >
> > > +else
> > > +   {
> > > +   newslot = fsm_get_avail(page, 0);
> > > +   }
> > >
> > > Since there is only one line in the else the bracket will not be
> needed. And
> > > there in one more space ahead of else which should be removed.
> > >
> > >
> > > +   if (nindexes == 0 &&
> > > +   (vacuumed_pages_at_fsm_vac - vacuumed_pages) >
> > > vacuum_fsm_every_pages)
> > > +   {
> > > +   /* Vacuum the Free Space Map */
> > > +   FreeSpaceMapVacuum(onerel, 0);
> > > +   vacuumed_pages_at_fsm_vac = vacuumed_pages;
> > > +   }
> > >
> > > vacuumed_pages_at_fsm_vac is initialised with value zero and seems no
> chance
> > > to go into the bracket.
> >
> > The patch presented still applies, and there has been a review two
> > days ago. This is a short delay to answer for the author, so I am
> > moving this patch to next CF with the same status of waiting on
> > author.
>
> Looks like this patch probably still applies and the changes suggested
> above seem pretty small, any chance you can provide an updated patch
> soon Claudio, so that it can be reviewed properly during this
> commitfest?
>

I failed to notice the comments among the list's noise, sorry.

I'll get on to it now.

On Mon, Nov 27, 2017 at 2:39 AM, Jing Wang <jingwang...@gmail.com> wrote:

> A few general comments.
>
> +   FreeSpaceMapVacuum(onerel, 64);
>
> Just want to know why '64' is used here? It's better to give a
> description.
>

It's just a random cutoff point.

The point of that early vacuum run is to mark free space early, to avoid
needless relation extension before long vacuum runs finish. This only
matters if the FSM is mostly zeroes; if there's already free space recorded
in there, it will be used.

To make this run fast (it's all redundant work before the final FSM vacuum
pass), branches with more than that amount of free space are skipped.
There's little point in vacuuming those early, since they already advertise
free space. This also helps avoid the quadratic cost of vacuuming the FSM
every N dead tuples: already-vacuumed branches will show free space and thus
be skipped. It could still be quadratic in the worst case, but this avoids
it often enough.

As for the value itself, it's an arbitrary choice. In-between index passes,
we have a rather precise cutoff we can use to avoid this redundant work.
But in the first run, we don't, so I made an arbitrary choice there that
felt right. I have no empirical performance data about alternative values.
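
To illustrate the pruning idea only (this is not the patch, and the data
structure below is a hypothetical stand-in for the real page-based FSM), a
recursive walk that skips branches already advertising more free space than
a given threshold might look like this:

#include <stdint.h>

typedef struct FsmNodeSketch FsmNodeSketch;
struct FsmNodeSketch
{
	int			nchildren;
	uint8_t    *slots;			/* per-child max free space category */
	FsmNodeSketch **children;	/* NULL entries at the leaf level */
};

static uint8_t
fsm_vacuum_sketch(FsmNodeSketch *node, uint8_t threshold)
{
	uint8_t		max_avail = 0;

	for (int slot = 0; slot < node->nchildren; slot++)
	{
		uint8_t		avail = node->slots[slot];

		/* Prune: this branch already advertises enough free space. */
		if (avail > threshold)
		{
			if (avail > max_avail)
				max_avail = avail;
			continue;
		}

		if (node->children[slot] != NULL)
		{
			avail = fsm_vacuum_sketch(node->children[slot], threshold);
			node->slots[slot] = avail;	/* propagate the corrected value up */
		}

		if (avail > max_avail)
			max_avail = avail;
	}

	return max_avail;
}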


>
> +else
> +   {
> +   newslot = fsm_get_avail(page, 0);
> +   }
>
> Since there is only one line in the else the bracket will not be needed.
> And there in one more space ahead of else which should be removed.
>

Ok


>
> +   if (nindexes == 0 &&
> +   (vacuumed_pages_at_fsm_vac - vacuumed_pages) >
> vacuum_fsm_every_pages)
> +   {
> +   /* Vacuum the Free Space Map */
> +   FreeSpaceMapVacuum(onerel, 0);
> +   vacuumed_pages_at_fsm_vac = vacuumed_pages;
> +   }
>
> vacuumed_pages_at_fsm_vac is initialised with value zero and seems no
> chance to go into the bracket.
>

You're right, the difference there is computed backwards.

Attached patch with the proposed changes.
From c1f9a70172d2284508eb877f0f2e647237b43095 Mon Sep 17 00:00:00 2001
From: Claudio Freire <klaussfre...@gmail.com>
Date: Fri, 28 Jul 2017 21:31:59 -0300
Subject: [PATCH] Vacuum: Update FSM more frequently

Make Vacuum update the FSM more frequently, to avoid the case where
autovacuum fails to reach the point where it updates the FSM in
highly contended tables.

Introduce a tree pruning threshold to FreeSpaceMapVacuum that avoids
recursing into branches that already contain enough free space, to
avoid having to traverse the whole FSM and thus induce quadratic
costs. Intermediate FSM vacuums are only supposed to make enough
free space visible to avoid extension until the final (non-partial)
FSM vacuum.

Run partial FSM vacuums after each index pass, which is a point
at which whole ranges of the heap have been thoroughly cleaned, and
we can expect no further updates to those ranges of the FSM save
for concurren

Re: Changing WAL Header to reduce contention during ReserveXLogInsertLocation()

2018-01-12 Thread Claudio Freire
On Fri, Jan 12, 2018 at 10:49 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:

> Claudio Freire <klaussfre...@gmail.com> writes:
> > On Sat, Dec 30, 2017 at 7:32 AM, Simon Riggs <si...@2ndquadrant.com>
> wrote:
> >> So we can't completely remove xl_prev field, without giving up some
> >> functionality.
>
> > Or, you can use the lower 16-bits of the previous record's CRC
>
> Hmm ... that is an interesting idea, but I'm not sure it helps much
> towards Simon's actual objective.  AIUI the core problem here is the
> contention involved in retrieving the previous WAL record's address.
> Changing things so that we need the previous record's CRC isn't really
> gonna improve that --- if anything, it'll be worse, because the
> record's address can (in principle) be known sooner than its CRC.
>
> Still, if we were just looking to shave some bits off of WAL record
> headers, it might be possible to do something with this idea.
>
> regards, tom lane
>

I realized that later as well; that's why I corrected myself to mean the
first record's CRC, not the previous one's.

Now, that assumes there's enough entropy in CRC values to actually make
good use of those 16 bits... there may not. WAL segments are highly
compressible after all.

So, maybe a hash of the LSN of the first record instead? That should be
guaranteed to have good entropy (given a good hash).

In any case, there are many rather good alternatives to the segment number
that should be reasonably safe from consistent collisions with garbage data.
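
As a toy example of that last idea, any decent 64-bit mixing function would
do. The constants below are the MurmurHash3 64-bit finalizer, used here
purely for illustration and not anything the WAL code actually uses:

#include <stdint.h>

static uint16_t
xl_check_from_lsn(uint64_t first_record_lsn)
{
	uint64_t	h = first_record_lsn;

	/* MurmurHash3 fmix64 finalizer, used here only as an example mixer */
	h ^= h >> 33;
	h *= UINT64_C(0xff51afd7ed558ccd);
	h ^= h >> 33;
	h *= UINT64_C(0xc4ceb9fe1a85ec53);
	h ^= h >> 33;

	return (uint16_t) (h & 0xffff);		/* keep the low 16 bits */
}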


Re: Changing WAL Header to reduce contention during ReserveXLogInsertLocation()

2018-01-12 Thread Claudio Freire
On Fri, Jan 12, 2018 at 8:43 PM, Claudio Freire <klaussfre...@gmail.com>
wrote:

>
>
> On Sat, Dec 30, 2017 at 7:32 AM, Simon Riggs <si...@2ndquadrant.com>
> wrote:
>
>> So we can't completely remove xl_prev field, without giving up some
>> functionality. But we don't really need to store the 8-byte previous
>> WAL pointer in order to detect torn pages. Something else which can
>> tell us that the WAL record does not belong to current WAL segno would
>> be enough as well. I propose that we replace it with a much smaller
>> 2-byte field (let's call it xl_walid). The "xl_walid" (or whatever we
>> decide to call it) is the low order 16-bits of the WAL segno to which
>> the WAL record belongs. While reading WAL, we always match that the
>> "xl_walid" value stored in the WAL record matches with the current WAL
>> segno's lower order 16-bits and if not, then consider that as the end
>> of the stream.
>>
>> For this to work, we must ensure that WAL files are either recycled in
>> such a way that the "xl_walid" of the previous (to be recycled) WAL
>> differs from the new WAL or we zero-out the new WAL file. Seems quite
>> easy to do with the existing infrastructure.
>>
>
> Or, you can use the lower 16-bits of the previous record's CRC
>
>
>
Sorry, missed the whole point. Of the *first* record's CRC


Re: Huge backend memory footprint

2017-12-22 Thread Claudio Freire
On Fri, Dec 22, 2017 at 10:07 AM, Konstantin Knizhnik <
k.knizh...@postgrespro.ru> wrote:

> While my experiments with pthreads version of Postgres I find out that I
> can not create more than 100k backends even at the system with 4Tb of RAM.
> I do not want to discuss now the idea of creating so large number of
> backends - yes, most of the real production systems are using pgbouncer or
> similar connection pooling
> tool allowing to restrict number of connections to the database. But there
> are 144 cores at this system and if we want to utilize all system resources
> then optimal number of
> backends will be several hundreds (especially taken in account that
> Postgres backends are usually not CPU bounded and have to read data from
> the disk, so number of backends
> should be much larger than number of cores).
>
> There are several per-backend arrays in postgres which size depends on
> maximal number of backends.
> For max_connections=100k Postgres allocates 26Mb for each snapshot:
>
> CurrentRunningXacts->xids = (TransactionId *)
> malloc(TOTAL_MAX_CACHED_SUBXIDS * sizeof(TransactionId));
>
> It seems to be too overestimated value, because TOTAL_MAX_CACHED_SUBXIDS
> is defined as:
>
> /*
>  * During Hot Standby processing we have a data structure called
>  * KnownAssignedXids, created in shared memory. Local data structures
> are
>  * also created in various backends during GetSnapshotData(),
>  * TransactionIdIsInProgress() and GetRunningTransactionData(). All of
> the
>  * main structures created in those functions must be identically
> sized,
>  * since we may at times copy the whole of the data structures around.
> We
>  * refer to this size as TOTAL_MAX_CACHED_SUBXIDS.
>  *
>  * Ideally we'd only create this structure if we were actually doing
> hot
>  * standby in the current run, but we don't know that yet at the time
>  * shared memory is being set up.
>  */
> #define TOTAL_MAX_CACHED_SUBXIDS \
> ((PGPROC_MAX_CACHED_SUBXIDS + 1) * PROCARRAY_MAXPROCS)
>
>
> Another 12Mb array is used for deadlock detection:
>
> #2  0x008ac397 in InitDeadLockChecking () at deadlock.c:196
> 196(EDGE *) palloc(maxPossibleConstraints * sizeof(EDGE));
> (gdb) list
> 191 * last MaxBackends entries in possibleConstraints[] are
> reserved as
> 192 * output workspace for FindLockCycle.
> 193 */
> 194maxPossibleConstraints = MaxBackends * 4;
> 195possibleConstraints =
> 196(EDGE *) palloc(maxPossibleConstraints * sizeof(EDGE));
> 197
>
>
> As a result, the amount of dynamic memory allocated for each backend exceeds
> 50Mb, and so 100k backends can not be launched even on a system with 4Tb!
> I think that we should use a more accurate allocation policy in these places
> and not waste memory in this manner (even if it is virtual).
>

Don't forget each thread also has its own stack. I don't think you can
expect 100k threads to ever work.

If you get to that point, you really need to consider async query
execution. There was a lot of work related to that in other mailing list
threads; you may want to take a look.
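
Incidentally, the 26Mb per-snapshot figure quoted above follows directly
from the array sizing. A quick back-of-the-envelope check, assuming the
current defaults of PGPROC_MAX_CACHED_SUBXIDS = 64 and 4-byte
TransactionIds (both treated as assumptions here):

#include <stdio.h>

int
main(void)
{
	const long	max_backends = 100000;		/* ~100k connections */
	const long	cached_subxids = 64 + 1;	/* PGPROC_MAX_CACHED_SUBXIDS + 1 */
	const long	xid_size = 4;				/* sizeof(TransactionId) */

	long		bytes = cached_subxids * max_backends * xid_size;

	/* 65 * 100000 * 4 = 26,000,000 bytes, i.e. roughly 26Mb per snapshot */
	printf("%.1f MB per cached-xids array\n", bytes / 1e6);
	return 0;
}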


Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2017-12-03 Thread Claudio Freire
On Tue, Nov 28, 2017 at 10:37 PM, Michael Paquier
<michael.paqu...@gmail.com> wrote:
> On Mon, Oct 2, 2017 at 11:02 PM, Claudio Freire <klaussfre...@gmail.com> 
> wrote:
>> Rebased version of the patches attached
>
> The status of the patch is misleading:
> https://commitfest.postgresql.org/15/844/. This was marked as waiting
> on author but a new version has been published. Let's be careful.
>
> The last patches I am aware of, aka those from
> https://www.postgresql.org/message-id/CAGTBQpZHTf2JtShC=ijc9wzeipo3xokwqhx+8wip7zjpc3f...@mail.gmail.com,
> do not apply. I am moving the patch to the next commit fest with a
> waiting on author status, as this should be reviewed, but those need a
> rebase.

They did apply at the time, but I think major work on vacuum was
pushed since then, and also I was traveling so out of reach.

It may take some time to rebase them again. Should I move to needs
review myself after that?