Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

2016-03-28 Thread Claudio Freire
On Thu, Mar 24, 2016 at 7:12 PM, Peter Geoghegan  wrote:
> On Thu, Mar 24, 2016 at 7:17 AM, Robert Haas  wrote:
>> I really like this idea, and the performance results seem impressive,
>> but I think we should push this out to 9.7.  A btree patch that didn't
>> have WAL support until two and a half weeks into the final CommitFest
>> just doesn't seem to me like a good candidate.  First, as a general
>> matter, if a patch isn't code-complete at the start of a CommitFest,
>> it's reasonable to say that it should be reviewed but not necessarily
>> committed in that CommitFest.  This patch has had some review, but I'm
>> not sure how deep that review is, and I think it's had no code review
>> at all of the WAL logging changes, which were submitted only a week
>> ago, well after the CF deadline.  Second, the btree AM is a
>> particularly poor place to introduce possibly destabilizing changes.
>> Everybody depends on it, all the time, for everything.  And despite
>> new tools like amcheck, it's not a particularly easy thing to debug.
>
> Regrettably, I must agree. I don't see a plausible path to commit for
> this patch in the ongoing CF.
>
> I think that Anastasia did an excellent job here, and I wish I could
> have been of greater help sooner. Nevertheless, it would be unwise to
> commit this given the maturity of the code. There have been very few
> instances of performance improvements to the B-Tree code for as long
> as I've been interested, because it's so hard, and the standard is so
> high. The only examples I can think of from the last few years are
> Kevin's commit 2ed5b87f96 and Tom's commit 1a77f8b63d, both of which
> were far less invasive, and Simon's commit c7111d11b1, which we just
> outright reverted from 9.5 due to subtle bugs (and even that was
> significantly less invasive than this patch). Improving nbtree is
> something that requires several rounds of expert review, and that's
> something that's in short supply for the B-Tree code in particular. I
> think that a new testing strategy is needed to make this easier, and I
> hope to get that going with amcheck. I need help with formalizing a
> "testing first" approach for improving the B-Tree code, because I
> think it's the only way that we can move forward with projects like
> this. It's *incredibly* hard to push forward patches like this given
> our current, limited testing strategy.

I've been toying (without getting anywhere concrete, really) with
prefix compression myself, and I agree that messing with the btree code
is harder than it ought to be.

Perhaps trying experimental format changes in a separate experimental
am wouldn't be all that bad (say, nxbtree?). People could opt in to
those by creating their indexes with nxbtree instead of plain btree
(say, in development environments) and get some testing going without
risking much.

Normally the same effect should be achievable with mere flags, but
since format changes to btree tend to be rather invasive, ensuring the
patch doesn't change behavior with the flag off is hard as well, hence
the wholly separate am idea.




Re: [HACKERS] Use pread and pwrite instead of lseek + write and read

2016-08-17 Thread Claudio Freire
On Wed, Aug 17, 2016 at 10:40 AM, Tom Lane  wrote:
> Oskari Saarenmaa  writes:
>> On my laptop a simple pgbench run (scale 100, 15 minutes) shows a 1.5%
>> performance improvement.
>
> I would have hoped for a lot better result before anyone would propose
> that we should deal with all the portability issues this'll create.
>
>> A 1.5% performance improvement is small but
>> measurable - and IMV more importantly it allows us to drop more than 100
>> lines of backwards (compatible?) code; maybe we could start targeting
>> more recent platforms in v10?
>
> That's basically nonsense: we'll end up adding way more than that to
> deal with platforms that haven't got these APIs.

Perhaps a better target would then be to try to make use of readv and
writev, which should both be useful for sequential access of various
kinds and network I/O. For instance, when reading 10 sequential
buffers, a readv could fill 10 buffers at a time.
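
For illustration, here's a minimal sketch of the sequential-read case
in plain POSIX C (using preadv, the offset-taking variant, which also
fits this thread's lseek-elimination theme; error handling elided):

#define _GNU_SOURCE             /* for preadv on glibc */
#include <sys/types.h>
#include <sys/uio.h>

#define BUFSZ 8192
#define NBUFS 10

/*
 * Fill NBUFS consecutive BUFSZ-sized buffers starting at 'offset' with
 * a single system call, instead of NBUFS separate read()s.
 */
static ssize_t
read_buffer_run(int fd, char bufs[NBUFS][BUFSZ], off_t offset)
{
    struct iovec iov[NBUFS];
    int         i;

    for (i = 0; i < NBUFS; i++)
    {
        iov[i].iov_base = bufs[i];
        iov[i].iov_len = BUFSZ;
    }
    return preadv(fd, iov, NBUFS, offset);
}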

I remember a project where we got a linear improvement in throughput by
using them for network I/O, because we were limited by packet
throughput rather than byte throughput, and using the iovec utilities
reduced the overhead considerably.

But all this is anecdotal evidence in any case, YMMV.




Re: [HACKERS] [WIP] [B-Tree] Keep indexes sorted by heap physical location

2016-08-18 Thread Claudio Freire
On Thu, Aug 18, 2016 at 5:05 PM, Robert Haas  wrote:
> On Wed, Aug 17, 2016 at 10:54 PM, Claudio Freire  
> wrote:
>> The attached patch tries to maintain the initial status of B-Tree
>> indexes, which are created with equal-key runs in physical order,
>> during the whole life of the B-Tree, and make key-tid pairs
>> efficiently searchable in the process.
>
> I have two pretty important questions that don't seem to be
> addressed in your email.

I believe I addressed that in the README, but if I wasn't clear, let
me know and I'll expand there too.

Summarizing here:

> 1. How does it do that?

It adds the block number and the offset number of the index tuple's
t_tid (that is, the pointer into the heap) as a final (implicit) sort
key, as is done already in tuplesort. It just keeps doing that while
comparing keys in the B-Tree when searching the leaf page for
insertion, so the insertion point now points to the exact point where
the insertion has to happen to maintain that condition, and not just
to the first valid position within the same-key run.
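
In code terms, the tie-breaker amounts to something like this sketch
(built on the stock ItemPointer accessors; it models the rule rather
than reproducing the patch's literal code):

#include "postgres.h"
#include "storage/itemptr.h"

/*
 * Final implicit comparison step, applied only when all user-visible
 * key columns compare equal: order by heap block, then line pointer.
 */
static int
heap_ptr_compare(ItemPointer a, ItemPointer b)
{
    BlockNumber  ablk = ItemPointerGetBlockNumber(a);
    BlockNumber  bblk = ItemPointerGetBlockNumber(b);
    OffsetNumber aoff;
    OffsetNumber boff;

    if (ablk != bblk)
        return (ablk < bblk) ? -1 : 1;

    aoff = ItemPointerGetOffsetNumber(a);
    boff = ItemPointerGetOffsetNumber(b);

    if (aoff != boff)
        return (aoff < boff) ? -1 : 1;

    return 0;
}

With that extra step in place, the binary search on a leaf page lands
on the exact heap-ordered position instead of the start of the
equal-key run.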

In fact, that's why non-leaf index tuples need a different format,
because while leaf index tuples contain the heap pointer already,
non-leaf ones contain only the downlink, not the pointer into the
heap. To be able to do comparisons and pick the right downlink, when
splitting pages the original heap pointer in the leaf index tuple is
copied into the downlink index tuple, in an additional IndexTupleData
header that is prepended only to non-leaf index tuples.

This representation is chosen to minimize code changes, such that
(itup+1) is usable as before, and also because it's reasonably
compact. But it *is* necessary to check whether it's a leaf or
non-leaf tuple to choose whether to use (itup+1) or just itup, which
is why the patch requires so many changes in so many places.
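
To make the layout concrete, a rough sketch (the accessor names here
are mine, not the patch's):

#include "postgres.h"
#include "access/itup.h"

/*
 * Non-leaf tuple layout under this patch, roughly:
 *
 *   +--------------------------+-------------------------------------+
 *   | IndexTupleData header    | conventional IndexTuple             |
 *   | (t_tid = heap pointer    | (t_tid = downlink, then key datums) |
 *   |  copied from the leaf)   |                                     |
 *   +--------------------------+-------------------------------------+
 */
static inline IndexTuple
inner_tuple_body(IndexTuple itup)
{
    return itup + 1;            /* (itup + 1) stays usable as before */
}

static inline ItemPointer
inner_tuple_heap_ptr(IndexTuple itup)
{
    return &itup->t_tid;        /* the heap pointer copied from the leaf */
}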

> 2. What's the benefit?

The idea came up in the discussion about WARM updates.

Basically, in various situations it is convenient to be able to find
the index tuples that point to a specific heap tuple. Without this,
doing so isn't efficient - the only way to find such index tuples is
to find the leftmost index tuple that has the same key, and walk the
index right links until the right tid is found. For non-unique
indexes, but also for unique indexes with heavy bloat, this could
involve a large amount of I/O, so it isn't efficient in several
contexts. Think vacuum and WARM updates mostly (like HOT updates, but
where some indexes don't need updating and some do).

Having the ability to find an index tuple by key-ctid pair is thus
beneficial because it allows efficient implementations for those
operations. It may benefit other parts of the code, but those are the
ones that come to mind.

It also makes index scans of the form

SELECT * FROM t WHERE some_col = some_const;

scan the heap in sequential order, even if some_col has low
cardinality (i.e., the query returns lots of tuples), which is a nice
performance side effect.




Re: [HACKERS] [WIP] [B-Tree] Keep indexes sorted by heap physical location

2016-08-18 Thread Claudio Freire
On Thu, Aug 18, 2016 at 6:04 PM, Kevin Grittner  wrote:
> On Thu, Aug 18, 2016 at 3:41 PM, Claudio Freire  
> wrote:
>
>> It also makes index scans of the form
>>
>> SELECT * FROM t WHERE some_col = some_const;
>>
>> scan the heap in sequential order, even if some_col has low
>> cardinality (i.e., the query returns lots of tuples), which is a nice
>> performance side effect.
>
> Speaking of performance side effects, does this avoid O(N^2)
> performance on index tuple insertion with duplicate values, for all
> insertion orderings?  For example, does it descend directly to the
> right leaf page for the insert rather than starting at the front of
> the block of duplicate values and scanning to the right for a
> block with space, with a random chance to split a full block on
> each page it moves through?

Yes, but only on non-unique indexes.

Unique indexes still need to scan all duplicates to check visibility
and will become O(N^2) there.

The code with the random chance to split is still there, but it should
never have a chance to run because the comparison against the heap
tuple pointer would stop it very quickly (before walking a single
offset I believe).

I thought about cleaning that up, but to make review easier I thought
I'd leave it there for now. Removing it would add a lot of diff noise.




Re: [HACKERS] [WIP] [B-Tree] Keep indexes sorted by heap physical location

2016-08-18 Thread Claudio Freire
On Thu, Aug 18, 2016 at 6:15 PM, Peter Geoghegan  wrote:
> On Thu, Aug 18, 2016 at 1:41 PM, Claudio Freire  
> wrote:
>> In fact, that's why non-leaf index tuples need a different format,
>> because while leaf index tuples contain the heap pointer already,
>> non-leaf ones contain only the downlink, not the pointer into the
>> heap. To be able to do comparisons and pick the right downlink, when
>> splitting pages the original heap pointer in the leaf index tuple is
>> copied into the downlink index tuple, in an additional IndexTupleData
>> header that is prepended only to non-leaf index tuples.
>
> I think that this is a bad idea. We need to implement suffix
> truncation of internal page index tuples at some point, to make them
> contain less information from the original leaf page index tuple.
> That's an important optimization, because it increases fan-in. This
> seems like a move in the opposite direction.

I see that. I could try to measure average depth to gauge the impact
this has on fan-in.

While it should cut it in half for narrow indexes, half of very high
is still high. Wide indexes, which are the ones that would suffer
from poor fan-in, would feel this far less, since the overhead is
constant.

Even if it does have an impact, I don't see an alternative, without
also implementing suffix truncation. Perhaps I could try to avoid
adding the leaf tid header if it isn't necessary, though I would have
to use up the last available flag bit in t_info for that.

> ISTM that the way to address this problem is with a duplicate list
> and/or prefix compression in leaf pages.

Prefix compression is another one I will be looking into eventually,
but last time I tried, it was far more invasive, so I abandoned it
until I could find a less invasive way to do it.




Re: [HACKERS] [WIP] [B-Tree] Keep indexes sorted by heap physical location

2016-08-18 Thread Claudio Freire
On Thu, Aug 18, 2016 at 6:27 PM, Tom Lane  wrote:
> Claudio Freire  writes:
>> On Thu, Aug 18, 2016 at 6:04 PM, Kevin Grittner  wrote:
>>> Speaking of performance side effects, does this avoid O(N^2)
>>> performance on index tuple insertion with duplicate values, for all
>>> insertion orderings?  For example, does it descend directly to the
>>> right leaf page for the insert rather than starting at the front of
>>> the block of duplicate values and scanning to the right for a
>>> block with space, with a random chance to split a full block on
>>> each page it moves through?
>
>> Yes, but only on non-unique indexes.
>
> How's that work if the existing entries aren't in TID order (which they
> will not be, in a pre-existing index)?  Or are you assuming you can blow
> off on-disk compatibility of indexes?
>
> regards, tom lane

This patch does blow off on-disk compatibility, but the plan is to
re-add it later on.

A flag in the meta page would indicate whether it's a sorted index or
not, and the only way to get a sorted index would be with a reindex.
The code would have to support both for a while until whatever point
we'd figure we can drop support for old format indexes.

Since this is a sort-order change I see no alternative: either the
whole index is sorted by TID or not.
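
In code terms the switch could look something like this (both the flag
and the btm_flags field are hypothetical; today's metapage has
neither):

#include "postgres.h"
#include "access/nbtree.h"

#define BTM_TID_SORTED  0x0001  /* hypothetical: set by CREATE INDEX/REINDEX */

/*
 * Sketch: search/insert code would branch on the metapage, keeping the
 * historical key-only ordering for old-format indexes until support
 * for them is dropped.
 */
static bool
bt_is_tid_sorted(BTMetaPageData *metad)
{
    /* btm_flags is hypothetical; the current metapage has no flags word */
    return (metad->btm_flags & BTM_TID_SORTED) != 0;
}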




Re: [HACKERS] [WIP] [B-Tree] Keep indexes sorted by heap physical location

2016-08-18 Thread Claudio Freire
On Thu, Aug 18, 2016 at 6:38 PM, Peter Geoghegan  wrote:
> On Thu, Aug 18, 2016 at 2:26 PM, Claudio Freire  
> wrote:
>> I see that. I could try to measure average depth to gauge the impact
>> this has on fan-in.
>>
>> While it should cut it in half for narrow indexes, half of very high
>> is still high. Wide indexes, which are the ones that would suffer
>> from poor fan-in, would feel this far less, since the overhead is
>> constant.
>
> You'll also have to consider the impact on the choice of split point,
> FWIW. There is currently leeway about what page an index tuple can
> land on, if and when it happens to be equal to the high-key. I'm not
> sure how important that is, but it's a consideration.

When I read the code, it seemed generic enough, using ItemIdGetLength
(which works on non-leaf index tuples correctly, reporting the right
size), so it should still work.

But the choice of split point may make a difference for future
insertions, so I'll look into that.


>> Even if it does have an impact, I don't see an alternative, without
>> also implementing suffix truncation. Perhaps I could try to avoid
>> adding the leaf tid header if it isn't necessary, though I would have
>> to use up the last available flag bit in t_info for that.
>
> To be clear, I'm particularly worried about painting ourselves into a
> corner, suffix-truncation-wise. It's a very common optimization, that
> we should really have.

Well, page version numbers could be used to switch between
semantically similar page formats later on. So if another format is
necessary (say, prefix compression, suffix truncation, etc.), it can be
changed on the fly on existing indexes by checking page version
numbers and doing the proper switch.

So we won't be painting ourselves into a corner, but picking the wrong
format may indeed be a headache.

I can go the way of the flag in t_info if that's preferable. Both
would work. It's just that it's the last flag available, and that
would be painting ourselves into a corner regarding flags. To avoid
that, the flag itself could be "has extra data" (instead of "has leaf
tid"), with a whole set of flags in the extra data itself. That could
also help for prefix compression and suffix truncation.
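
Concretely, something along these lines (every name is made up; the one
hard fact is that 0x2000 is the only t_info bit not already claimed by
INDEX_SIZE_MASK, INDEX_VAR_MASK or INDEX_NULL_MASK):

#include "postgres.h"
#include "access/itup.h"

#define INDEX_HAS_EXTRA     0x2000      /* hypothetical: last free t_info bit */

/* the first uint16 of the extra area would be its own flag word: */
#define EXTRA_LEAF_TID      (1 << 0)    /* non-leaf tuple carries a heap ptr */
#define EXTRA_SUFFIX_TRUNC  (1 << 1)    /* reserved for suffix truncation */
#define EXTRA_PREFIX_COMP   (1 << 2)    /* reserved for prefix compression */

static inline bool
index_tuple_has_extra(IndexTuple itup)
{
    return (itup->t_info & INDEX_HAS_EXTRA) != 0;
}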

>>> ISTM that the way to address this problem is with a duplicate list
>>> and/or prefix compression in leaf pages.
>>
>> Prefix compression is another one I will be looking into eventually,
>> but last time I tried, it was far more invasive, so I abandoned it
>> until I could find a less invasive way to do it.
>
> Unfortunately, sometimes it is necessary to be very ambitious in order
> to solve a problem. The understandable and usually well-reasoned
> approach of making progress as incremental as possible occasionally
> works against contributors. It's worth considering if this is such a
> case.

I'd agree if this regressed performance considerably without the other
improvements, so I guess this will depend on the fan-in measurements.




Re: [HACKERS] [WIP] [B-Tree] Keep indexes sorted by heap physical location

2016-08-19 Thread Claudio Freire
On Thu, Aug 18, 2016 at 6:26 PM, Claudio Freire  wrote:
> On Thu, Aug 18, 2016 at 6:15 PM, Peter Geoghegan  wrote:
>> On Thu, Aug 18, 2016 at 1:41 PM, Claudio Freire  
>> wrote:
>>> In fact, that's why non-leaf index tuples need a different format,
>>> because while leaf index tuples contain the heap pointer already,
>>> non-leaf ones contain only the downlink, not the pointer into the
>>> heap. To be able to do comparisons and pick the right downlink, when
>>> splitting pages the original heap pointer in the leaf index tuple is
>>> copied into the downlink index tuple, in an additional IndexTupleData
>>> header that is prepended only to non-leaf index tuples.
>>
>> I think that this is a bad idea. We need to implement suffix
>> truncation of internal page index tuples at some point, to make them
>> contain less information from the original leaf page index tuple.
>> That's an important optimization, because it increases fan-in. This
>> seems like a move in the opposite direction.
>
> I see that. I could try to measure average depth to gauge the impact
> this has on fan-in.
>
> While it should cut it in half for narrow indexes, half of very high
> is still high. Wide indexes, which are the ones that would suffer
> from poor fan-in, would feel this far less, since the overhead is
> constant.
>
> Even if it does have an impact, I don't see an alternative, without
> also implementing suffix truncation. Perhaps I could try to avoid
> adding the leaf tid header if it isn't necessary, though I would have
> to use up the last available flag bit in t_info for that.

So, I did some tests.

The TL;DR version is: as expected, there is a big impact on narrow
indexes, but I don't believe it's something to be worried about.

**Begin LONG version** (skip to end long version if not interested)

Regardless of the fanout (or fanin, if you look at it that way), what
matters is tree depth, so I used pageinspect to check how the depth of
indexes changed after patching.

This is all on pristine, recently built indexes. I will try to measure
the same on bloated indexes later on, but the tests are painfully
slow.

I used the following query to inspect all user indexes, since
pageinspect failed to work for toast indexes for some reason (it
probably needed qualified relnames or something like that). Anyway,
user indexes should be enough.

select level, count(*), pg_size_pretty(max(relpages) * 8192.0) as biggest
from pg_class, bt_metap(relname)
where relkind = 'i' and relnamespace = 2200 and relam = 403
group by level
order by level desc;

I tested it on both patched and unpatched versions of my test
database (roughly 23GB), pgbench scale 1000 (15GB), and pgbench scale
10000 (146GB).

I believe pgbench is a worst-case example, because it only has very
narrow PK indexes, which are the ones that will suffer the most from
this patch.

The numbers:

mat, patched

level;count;biggest
3;18;"1369 MB"
2;60;"322 MB"
1;26;"808 kB"
0;50;"16 kB"

mat, unpatched

level;count;biggest
3;15;"1367 MB"
2;63;"334 MB"
1;26;"800 kB"
0;49;"16 kB"

pgbench, scale 1000, patched

level;count;biggest
3;1;"2145 MB"
1;2;"456 kB"

pgbench, scale 1000, unpatched

level;count;biggest
3;1;"2142 MB"
1;2;"456 kB"

pgbench, scale 10000, patched

level;count;biggest
3;1;"21 GB"
2;1;"2224 kB"
1;1;"240 kB"

pgbench, scale 10000, unpatched

level;count;biggest
3;1;"21 GB"
1;2;"2208 kB"

So, clearly some indexes are pushed to the next level, but it isn't a
widespread effect - it looks more like the threshold index size for
needing a root split is slightly pushed downwards.

It totally agrees with the math. The index tuple header is 8 bytes, and
the datums need to be maxaligned so they will also be at least 8 bytes.
Prepending another 8-byte header thus grows the worst-case inner tuple
from 16 to 24 bytes, cutting the effective fan-in B to 2/3 of its
former value - but only for inner tuples. The resulting depth ratio is

a = log_(2B/3)(N) / log_(B)(N) = log(B) / (log(B) - log(3/2))

which stays very close to 1 for big enough B. So the extra level only
materializes for indexes whose size sits near a level boundary - the
threshold for growing a new level is pushed slightly downwards, which
is exactly what the numbers above show. Level 2 for such a narrow index
seems to begin at about 2GB, so we're talking about very big indexes
already. If level 2 can hold 2GB of index, level 3 can hold around
2GB x 8k / 16 = 1TB of index, and I have yet to see (even in very big
databases) a 1TB index on a PK.

If anyone has such an index, yes, this patch will hurt performance by
25% (4 page reads instead of 3).

Bad, but since it only affects very extreme cases, doesn't seem so bad.
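
For anyone who wants to sanity-check that, here's a self-contained
back-of-envelope calculation (it assumes 8192-byte pages and the
16-vs-24-byte inner tuples from above, ignoring page headers and fill
factor):

#include <math.h>
#include <stdio.h>

int
main(void)
{
    double  N = 1e9;                        /* a billion leaf entries */
    double  fan_before = 8192.0 / 16.0;     /* ~512 */
    double  fan_after = 8192.0 / 24.0;      /* ~341 */

    /* ~3.32 vs ~3.55: same integer depth unless near a level boundary */
    printf("levels before: %.2f\n", log(N) / log(fan_before));
    printf("levels after:  %.2f\n", log(N) / log(fan_after));
    return 0;
}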

**End LONG version**

So, that said, I started mocking up a version of the patch that used a
"has aux data"

Re: [HACKERS] [WIP] [B-Tree] Keep indexes sorted by heap physical location

2016-08-19 Thread Claudio Freire
On Sat, Aug 20, 2016 at 1:05 AM, Amit Kapila  wrote:
> On Thu, Aug 18, 2016 at 8:24 AM, Claudio Freire  
> wrote:
>>
>> A couple of points make me uneasy about this patch, yet I can think of
>> no better alternative, so I seek feedback:
>>
>>  - introducing a different format for inner index tuples makes for an
>> invasive patch and quite difficult-to-reason-about code (it's easy to forget
>> whether a page is leaf or inner and that now causes assertion failures
>> or segfaults)
>>  - the ascent-descent to check for uniqueness when there are large
>> dead tuple runs could have locking subtleties that escape me. Perhaps
>> there's better ways to solve this.
>
> I have tried to study this part of your patch and it seems a somewhat
> non-trivial and risky part of the proposal.
>
> + } else {
> + /*
> + * We found the actual first item on another block, so we have to perform
> + * a two-step search - first half, until our write-locked buffer, then 
> another
> + * starting from our write-locked buffer.
> + */
> + LockBuffer(buf, BUFFER_LOCK_UNLOCK);
> + LockBuffer(buf, BT_WRITE);
> +
> + buf = _bt_moveright(rel, buf, natts, itup_scankey, &(itup->t_tid), false,
> + true, stack, BT_WRITE, NULL);
> +
> + first_offset = _bt_binsrch(rel, buf, natts, itup_scankey, NULL, false, 
> NULL);
> +
> + xwait = _bt_check_unique(rel, itup, heapRel, nbuf, buf,
> first_offset, itup_scankey,
> + checkUnique, &is_unique, &speculativeToken);
> +
> + _bt_relbuf(rel, nbuf);
> + }
>
> The idea for uniqueness check is that we hold the write lock on
> buffer/page on which we are trying to operate (we do take read locks
> on the consecutive pages during this check).  Here, in your changed
> proposal, you have two buffers (one to which the search led you and
> one buffer previous in the chain) and before checking uniqueness, on
> one of the buffers you seem to have write lock and on other you have
> read lock.  This seems to change the way uniqueness check works till
> now; can you explain how this works (can we safely assume that all
> searches for uniqueness check will lead to the same buffer first).

All uniqueness checks will find the same "nbuf" (read-locked), which
is the first one that matches the key.

They could have different "buf" buffers (the write-locked ones). But
since read locks cannot be held while a write-lock is held, I believe
it doesn't matter. All uniqueness checks will try to read-lock nbuf
unless the uniqueness check for that insertion is done. When they do,
they will block until the other insertion is finished, and when that
happens, either the insert happened, or it returned an xid to wait for
and retry. If it succeeded, the check attempting the read-lock will
see the inserted tuple and return the xid of the earlier transaction
and wait on it. If it failed, no insert happened because there's a
visible tuple there, and the read-lock will succeed, find the visible
tuple, and the insert will fail too.

In essence, it is still the case, I believe, that only one insert can
succeed at a time; all the others will retry. It only happens that
with the patch, the contention will be between a read lock and a write
lock, and not two write locks, and there's a little less contention
(some more work can be done in parallel).

> With this new mechanism, do we have two type of search interfaces
> where one would work for keys (as now) and other for key-ctid or it
> will be just a single interface which works both ways?  I think either
> way there are pros and cons.

The patch doesn't add another interface, but the idea is to add it.
The implementation will probably fall on a common function that can do
both, but some index ams can't implement the key-ctid operation (hash
indexes), so there will have to be separate interfaces to be able to
properly represent the separate capabilities (besides, other
implementations may well use completely different paths for the two
operations).

> Also, in the thread below you are talking about using the last bit in
> t_info, I want to bring it to your notice that there is a patch of
> mine [1] in which I have used it to avoid changing on-disk
> compatibility of hash indexes.  I am not saying that we should not
> plan it for some other use, but just to keep in mind that there are
> other proposals to use it.  We can decide what is the best way to proceed,
> if we are aware of all the proposals that want to use it.
>
>
> [1] - 
> https://www.postgresql.org/message-id/caa4ek1lkq_udism-z2dq6cuvjh3db5fnfnnezzbpsrjw0ha...@mail.gmail.com

I wasn't aware of that particular thread, but there's another one (the
WARM thread) with yet another proposed use of that bit.

IMHO, any use of the bit that doesn't leave

Re: [HACKERS] [WIP] [B-Tree] Keep indexes sorted by heap physical location

2016-08-19 Thread Claudio Freire
On Sat, Aug 20, 2016 at 1:53 AM, Claudio Freire  wrote:
> All uniqueness checks will try to read-lock nbuf
> unless the uniqueness check for that insertion is done.


That should read "all uniqueness checks will try to read-lock the buf
unless the uniqueness check for that insertion is done."

(nbuf -> buf)




Re: [HACKERS] [WIP] [B-Tree] Keep indexes sorted by heap physical location

2016-08-20 Thread Claudio Freire
On Sat, Aug 20, 2016 at 4:27 AM, Amit Kapila  wrote:
>> All uniqueness checks will find the same "nbuf" (read-locked), which
>> is the first one that matches the key.
>>
>> They could have different "buf" buffers (the write-locked ones). But
>> since read locks cannot be held while a write-lock is held, I believe
>> it doesn't matter. All uniqueness checks will try to read-lock nbuf
>> unless the uniqueness check for that insertion is done. When they do,
>> they will block until the other insertion is finished, and when that
>> happens, either the insert happened, or it returned an xid to wait for
>> and retry. If it succeeded, the check attempting the read-lock will
>> see the inserted tuple and return the xid of the earlier transaction
>> and wait on it. If it failed, no insert happened because there's a
>> visible tuple there, and the read-lock will succeed, find the visible
>> tuple, and the insert will fail too.
>>
>> In essence, it is still the case, I believe, that only one insert can
>> succeed at a time; all the others will retry. It only happens that
>> with the patch, the contention will be between a read lock and a write
>> lock, and not two write locks, and there's a little less contention
>> (some more work can be done in parallel).
>>
>>> With this new mechanism, do we have two type of search interfaces
>>> where one would work for keys (as now) and other for key-ctid or it
>>> will be just a single interface which works both ways?  I think either
>>> way there are pros and cons.
>>
>> The patch doesn't add another interface, but the idea is to add it.
>> The implementation will probably fall on a common function that can do
>> both
>>
>
> That makes sense, but this means there is a chance that the searches
> could lead to different buffers in case of uniqueness checks (the
> search with key-ctid could lead to a different buffer than the search
> with just the key).  I am not clear whether we have a requirement for doing
> this uniqueness check for the key-ctid search API, because as I understand
> it you want it mainly for vacuum and WARM tuples. Vacuum won't add new
> tuples, so is this required for WARM tuples?

Well, I'm not really sure what exactly would need to be done when doing
the WARM conditional insert in the case of unique indexes, and that's
a strong motivation not to add the interface for inserts just yet.
Vacuum will only need to delete, and in the case of deletes the
operation would be quite straightforward.




Re: [HACKERS] Block level parallel vacuum WIP

2016-08-23 Thread Claudio Freire
On Tue, Aug 23, 2016 at 8:02 AM, Masahiko Sawada  wrote:
>
> 2. Vacuum table and index (after 1 transaction executed)
>   1 worker   : 12 sec
>   2 workers : 49 sec
>   3 workers : 54 sec
>   4 workers : 53 sec
>
> As a result of my test, since multiple processes could frequently try to
> acquire the cleanup lock on the same index buffer, execution time of
> parallel vacuum got worse.
> And it seems to be effective for only table vacuum so far, but is not
> improved as expected (maybe disk bottleneck).

Not only that, but from your description (I haven't read the patch,
sorry), you'd be scanning the whole index multiple times (one per
worker).




Re: [HACKERS] [Patch] Temporary tables that do not bloat pg_catalog (a.k.a fast temp tables)

2016-08-23 Thread Claudio Freire
On Tue, Aug 23, 2016 at 7:11 PM, Tomas Vondra
 wrote:
> On 08/22/2016 10:32 PM, Robert Haas wrote:
>>
>>
>> ...
>>
>> 1. The number of tables for which we would need to add a duplicate,
>> unlogged table is formidable.  You need pg_attribute, pg_attrdef,
>> pg_constraint, pg_description, pg_type, pg_trigger, pg_rewrite, etc.
>> And the backend changes needed so that we used the unlogged copy for
>> temp tables and the permanent copy for regular tables is probably
>> really large.
>>
>> 2. You can't write to unlogged tables on standby servers, so this
>> doesn't help solve the problem of wanting to use temporary tables on
>> standbys.
>>
>> 3. While it makes creating temporary tables a lighter-weight
>> operation, because you no longer need to write WAL for the catalog
>> entries, there's probably still substantially more overhead than just
>> stuffing them in backend-local RAM.  So the performance benefits are
>> probably fairly modest.
>>
>> Overall I feel like the development effort that it would take to make
>> this work would almost certainly be better-expended elsewhere.  But of
>> course I'm not in charge of how people who work for other companies
>> spend their time...
>>
>
> Could someone please explain how the unlogged tables are supposed to fix the
> catalog bloat problem, as stated in the initial patch submission? We'd still
> need to insert/delete the catalog rows when creating/dropping the temporary
> tables, causing the bloat. Or is there something I'm missing?

Wouldn't more aggressive vacuuming of catalog tables fix the bloat?

Perhaps reserving a worker or N to run only on catalog schemas?

That'd be far simpler.




Re: [HACKERS] [Patch] Temporary tables that do not bloat pg_catalog (a.k.a fast temp tables)

2016-08-23 Thread Claudio Freire
On Tue, Aug 23, 2016 at 7:20 PM, Andres Freund  wrote:
>> Wouldn't more aggressive vacuuming of catalog tables fix the bloat?
>
> Not really in my experience, at least not without more drastic vacuum
> changes. The issue is that if you have a single "long running"
> transaction (in some workloads that can even be just a 3-minute
> query/xact), nothing will be cleaned up during that time. If you have a
> few hundred temp tables created per sec, you'll be in trouble even
> then. Not to speak of the case where you have queries taking hours (say
> a backup).

Well, my experience isn't as extreme as that (just a few dozen temp
tables per minute), but when I see bloat in catalog tables it's
because all autovacuum workers are stuck vacuuming huge tables for
long periods of time (hours or days).

So that's certainly another bloat case to consider.




Re: [HACKERS] [Patch] Temporary tables that do not bloat pg_catalog (a.k.a fast temp tables)

2016-08-23 Thread Claudio Freire
On Tue, Aug 23, 2016 at 7:25 PM, Tomas Vondra
 wrote:
>>> Could someone please explain how the unlogged tables are supposed to fix
>>> the
>>> catalog bloat problem, as stated in the initial patch submission? We'd
>>> still
>>> need to insert/delete the catalog rows when creating/dropping the
>>> temporary
>>> tables, causing the bloat. Or is there something I'm missing?
>>
>>
>> Wouldn't more aggressive vacuuming of catalog tables fix the bloat?
>>
>> Perhaps reserving a worker or N to run only on catalog schemas?
>>
>> That'd be far simpler.
>
>
> Maybe, although IIRC the issues with catalog bloat were due to a combination
> of long queries and many temporary tables being created/dropped. In that
> case simply ramping up autovacuum (or even having dedicated workers for
> catalogs) would not really help due to the xmin horizon being blocked by the
> long-running queries.
>
> Maybe it's an entirely crazy idea due to the wine I drank at dinner, but
> couldn't we vacuum the temporary table records differently? For example,
> couldn't we just consider them removable as soon as the backend that owns
> them disappears?

Or perhaps go all the way and generalize that to rows that never
become visible outside their parent transaction.

As in, deletion of rows created by the deleting transaction itself
could clean them up (carefully, to avoid voiding indexes and all that)
but more aggressively than regular deletes.




Re: [HACKERS] Re: [Patch] Temporary tables that do not bloat pg_catalog (a.k.a fast temp tables)

2016-08-23 Thread Claudio Freire
On Tue, Aug 23, 2016 at 8:50 PM, Greg Stark  wrote:
> On Tue, Aug 23, 2016 at 4:15 PM, Aleksander Alekseev
>  wrote:
>> Frankly I have much more faith in Tom's idea of using the virtual part of
>> the catalog for all temporary tables, i.e. turning all temporary tables
>> into "fast" temporary tables. Instead of introducing a new type of
>> temporary tables that solves the catalog bloating problem and forces
>> users to rewrite applications, why not just not create a problem in the
>> first place?
>
> Would applications really need to be rewritten? Are they really
> constructing temporary tables where the definition of the table is
> dynamic, not just the content?

Mine is. But it wouldn't be a big deal to adapt.




Re: [HACKERS] [Patch] Temporary tables that do not bloat pg_catalog (a.k.a fast temp tables)

2016-08-23 Thread Claudio Freire
On Tue, Aug 23, 2016 at 9:12 PM, Tomas Vondra
 wrote:
> On 08/24/2016 12:38 AM, Claudio Freire wrote:
>>
>> On Tue, Aug 23, 2016 at 7:25 PM, Tomas Vondra
>>  wrote:
>>>>>
>>>>> Could someone please explain how the unlogged tables are supposed to
>>>>> fix
>>>>> the
>>>>> catalog bloat problem, as stated in the initial patch submission? We'd
>>>>> still
>>>>> need to insert/delete the catalog rows when creating/dropping the
>>>>> temporary
>>>>> tables, causing the bloat. Or is there something I'm missing?
>>>>
>>>>
>>>>
>>>> Wouldn't more aggressive vacuuming of catalog tables fix the bloat?
>>>>
>>>> Perhaps reserving a worker or N to run only on catalog schemas?
>>>>
>>>> That'd be far simpler.
>>>
>>>
>>>
>>> Maybe, although IIRC the issues with catalog bloat were due to a
>>> combination
>>> of long queries and many temporary tables being created/dropped. In that
>>> case simply ramping up autovacuum (or even having dedicated workers for
>>> catalogs) would not really help due to the xmin horizon being blocked by
>>> the
>>> long-running queries.
>>>
>>> Maybe it's an entirely crazy idea due to the wine I drank at dinner, but
>>> couldn't we vacuum the temporary table records differently? For example,
>>> couldn't we just consider them removable as soon as the backend that owns
>>> them disappears?
>>
>>
>> Or perhaps go all the way and generalize that to rows that never
>> become visible outside their parent transaction.
>>
>> As in, deletion of rows created by the deleting transaction itself
>> could clean them up (carefully, to avoid voiding indexes and all that)
>> but more aggressively than regular deletes.
>>
>
> Maybe, but I wouldn't be surprised if such generalization would be an order
> of magnitude more complicated - and even the vacuuming changes I mentioned
> are undoubtedly a fair amount of work.

After looking at it from a bird's-eye view, I agree it's conceptually
complex (reading HeapTupleSatisfiesSelf already makes one dizzy).

But other than that, the implementation seems rather simple. It seems
to me that, if one figures out that it is safe to do so (a priori:
xmin not committed, xmax is the current transaction), it would simply
be a matter of chasing the HOT chain root, setting all LPs except the
first to LP_UNUSED and the first one to LP_DEAD.

Of course I may be missing a ton of stuff.
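
To make the shape of it concrete, a very rough sketch (this is not real
prune code; it assumes every safety check above already passed and that
the chain's line pointer offsets were collected root-first):

#include "postgres.h"
#include "storage/bufpage.h"
#include "storage/itemid.h"

/*
 * Sketch only: unlink all line pointers of an all-invisible chain
 * except the root, which stays LP_DEAD so that index entries still
 * have something to point at.
 */
static void
reap_invisible_chain(Page page, OffsetNumber *chain, int nchain)
{
    int     i;

    for (i = nchain - 1; i > 0; i--)
        ItemIdSetUnused(PageGetItemId(page, chain[i]));

    ItemIdSetDead(PageGetItemId(page, chain[0]));
}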

> Sadly, I don't see how this might fix the other issues mentioned in this
> thread (e.g. impossibility to create temp tables on standbys),

No it doesn't :(




Re: [HACKERS] [Patch] Temporary tables that do not bloat pg_catalog (a.k.a fast temp tables)

2016-08-23 Thread Claudio Freire
On Wed, Aug 24, 2016 at 2:04 AM, Alvaro Herrera
 wrote:
> Claudio Freire wrote:
>
>> After looking at it from a bird's-eye view, I agree it's conceptually
>> complex (reading HeapTupleSatisfiesSelf already makes one dizzy).
>>
>> But other than that, the implementation seems rather simple. It seems
>> to me that, if one figures out that it is safe to do so (a priori:
>> xmin not committed, xmax is the current transaction), it would simply
>> be a matter of chasing the HOT chain root, setting all LPs except the
>> first to LP_UNUSED and the first one to LP_DEAD.
>>
>> Of course I may be missing a ton of stuff.
>
> What you seem to be missing is that rows corresponding to temp tables
> are not "visible to its own transaction only".  The rows are valid
> after the transaction is gone; what makes the tables temporary is the
> fact that they are in a temporary schema.  And what makes them invisible
> to one backend is the fact that they are in the temporary schema for
> another backend.  Not that they are uncommitted.

Yeah, I was thinking of "on commit drop" behavior, but granted, there
are all the others.




Re: [HACKERS] increasing the default WAL segment size

2016-08-24 Thread Claudio Freire
On Wed, Aug 24, 2016 at 10:31 PM, Robert Haas  wrote:
> 1. Transaction rates are vastly higher these days.  In 1999, I think
> we were still limited to ~2^32 transactions during the entire lifetime
> of the server; transaction ID wraparound hadn't been invented yet.[1]
> Today, some installations do that many write transactions in under a
> week.  The practical consequence of this is that WAL files fill up in
> extremely short periods of time. Some users generate multiple
> terabytes of WAL per day, which means they are generating - and very
> likely archiving - WAL files at a rate of greater than 1 per second!
> That poses multiple problems. For example, if your archive command
> happens to involve ssh, you might run into trouble because of this
> sort of thing:
>
> [rhaas pgsql]$ /usr/bin/time ssh hydra true
> 1.57 real 0.00 user 0.00 sys
...
> Considering those three factors, I think we should consider pushing
> the default value up somewhat higher for v10.  Reverting to the 64MB
> size that we had prior to 47937403676d913c0e740eec6b85113865c6c8ab
> sounds pretty reasonable.  Users with really high transaction rates
> might even prefer a higher value (e.g. 256MB, 1GB) but that's hardly
> practical for small installs given our default of max_wal_size = 1GB.
> Possibly it would make sense for this to be configurable at initdb
> time instead of requiring a recompile; we probably don't save any
> significant number of cycles by compiling this into the server.

FWIW, +1

We're already hurt by the small segments due to a phenomenon similar
to the ssh case: TCP slow start. Designing the archive/recovery
command to work around TCP slow start is quite complex, and bigger
segments would just be a better thing.

Not to mention that bigger segments compress better.




Re: [HACKERS] increasing the default WAL segment size

2016-08-25 Thread Claudio Freire
On Thu, Aug 25, 2016 at 12:21 PM, Simon Riggs  wrote:
> On 25 August 2016 at 02:31, Robert Haas  wrote:
>
>> Furthermore, there is an enforced, synchronous fsync at the end of
>> every segment, which actually does hurt performance on write-heavy
>> workloads.[2] Of course, if that were the only reason to consider
>> increasing the segment size, it would probably make more sense to just
>> try to push that extra fsync into the background, but that's not
>> really the case.  From what I hear, the gigantic number of files is a
>> bigger pain point.
>
> I think we should fully describe the problem before finding a solution.
>
> This is too big a change to just tweak a value without discussing the
> actual issue.
>
> And if the problem is as described, how can a change of x4 be enough
> to make it worth the pain of change? I think you're already admitting
> it can't be worth it by discussing initdb configuration.
>
> If we do have the pain of change, should we also consider making WAL
> files variable length? What do we gain by having the files all the
> same size? ISTM better to have WAL files that vary in length up to 1GB
> in size.
>
> (This is all about XLOG_SEG_SIZE; I presume XLOG_BLCKSZ can stay as it
> is, right?)

Avoiding variable sizes does avoid some failure modes on the
filesystem side in the face of crashes/power loss.

So making them variable size, while possible, wouldn't be simple at
all (it would involve figuring out how filesystems behave when a crash
hits while a file is changing in size).




Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)

2016-08-31 Thread Claudio Freire
On Wed, Aug 31, 2016 at 1:45 PM, Pavan Deolasee 
wrote:

> We discussed a few ideas to address the "Duplicate Scan" problem. For
> example, we can teach Index AMs to discard any duplicate (key, CTID) insert
> requests. Or we could guarantee uniqueness by only allowing updates
> in one lexical order. While the former is a more complete solution to avoid
> duplicate entries, searching through a large number of keys for non-unique
> indexes could be a drag on performance. The latter approach may not be
> sufficient for many workloads. Also tracking increment/decrement for many
> indexes will be non-trivial.
>
> There is another problem with allowing many index entries pointing to the
> same WARM chain. It will be non-trivial to know how many index entries are
> currently pointing to the WARM chain and index/heap vacuum will throw up
> more challenges.
>
> Instead, what I would like to propose and the patch currently implements
> is to restrict WARM update to once per chain. So the first non-HOT update
> to a tuple or a HOT chain can be a WARM update. The chain can further be
> HOT updated any number of times. But it can no further be WARM updated.
> This might look too restrictive, but it can still bring down the number of
> regular updates by almost 50%. Further, if we devise a strategy to convert
> a WARM chain back to HOT chain, it can again be WARM updated. (This part is
> currently not implemented). A good side effect of this simple strategy is
> that we know there can maximum two index entries pointing to any given WARM
> chain.
>

We should probably think about coordinating with my btree patch.

From the description above, the strategy is quite readily "upgradable" to
one in which the indexam discards duplicate (key,ctid) pairs, and that
would remove the limitation of only one WARM update... right?


[HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-02 Thread Claudio Freire
The attached patch allows setting maintenance_work_mem or
autovacuum_work_mem higher than 1GB (and have it be effective), by
turning the allocation of the dead_tuples into a huge allocation.

This results in fewer index scans for heavily bloated tables, and
could be a lifesaver in many situations (in particular, the situation
I'm facing right now in production, where we don't have enough room
for a vacuum full, and have just deleted 75% of a table to make room
but have to rely on regular lazy vacuum to free the space).
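
For a sense of scale, a quick standalone check (sizeof(ItemPointerData)
is 6 bytes and MaxAllocSize is 1GB - 1):

#include <stdio.h>

int
main(void)
{
    long long   item_sz = 6;    /* sizeof(ItemPointerData) */
    long long   max_alloc = 1024LL * 1024 * 1024 - 1;   /* MaxAllocSize */
    long long   four_gb = 4LL * 1024 * 1024 * 1024;

    /* ~178M dead TIDs under the old cap vs ~715M at 4GB */
    printf("capped:  %lld tuples\n", max_alloc / item_sz);
    printf("at 4GB:  %lld tuples\n", four_gb / item_sz);
    return 0;
}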

The patch also makes vacuum free the dead_tuples before starting
truncation. It didn't seem necessary to hold onto it beyond that
point, and it might help give the OS more cache, especially if work
mem is configured very high to avoid multiple index scans.

Tested with pgbench scale 4000 after deleting the whole
pgbench_accounts table; it seemed to work fine.
From feddd4f58c30ca6a446ce0eed343f2562030f0e5 Mon Sep 17 00:00:00 2001
From: Claudio Freire 
Date: Fri, 2 Sep 2016 23:21:01 -0300
Subject: [PATCH 1/2] Vacuum: allow using more than 1GB work mem

Turns the palloc for dead_tuples into a huge allocation to allow
using more than 1GB worth of dead_tuples, saving index scans on
heavily bloated tables
---
 src/backend/commands/vacuumlazy.c | 19 +--
 1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 231e92d..dbe2040 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -155,6 +155,7 @@ static void lazy_truncate_heap(Relation onerel, LVRelStats *vacrelstats);
 static BlockNumber count_nondeletable_pages(Relation onerel,
 		 LVRelStats *vacrelstats);
 static void lazy_space_alloc(LVRelStats *vacrelstats, BlockNumber relblocks);
+static void lazy_space_dealloc(LVRelStats *vacrelstats);
 static void lazy_record_dead_tuple(LVRelStats *vacrelstats,
 	   ItemPointer itemptr);
 static bool lazy_tid_reaped(ItemPointer itemptr, void *state);
@@ -1310,6 +1311,8 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
 	for (i = 0; i < nindexes; i++)
 		lazy_cleanup_index(Irel[i], indstats[i], vacrelstats);
 
+	lazy_space_dealloc(vacrelstats);
+
 	/* If no indexes, make log report that lazy_vacuum_heap would've made */
 	if (vacuumed_pages)
 		ereport(elevel,
@@ -1952,7 +1955,7 @@ lazy_space_alloc(LVRelStats *vacrelstats, BlockNumber relblocks)
 	{
 		maxtuples = (vac_work_mem * 1024L) / sizeof(ItemPointerData);
 		maxtuples = Min(maxtuples, INT_MAX);
-		maxtuples = Min(maxtuples, MaxAllocSize / sizeof(ItemPointerData));
+		maxtuples = Min(maxtuples, MaxAllocHugeSize / sizeof(ItemPointerData));
 
 		/* curious coding here to ensure the multiplication can't overflow */
 		if ((BlockNumber) (maxtuples / LAZY_ALLOC_TUPLES) > relblocks)
@@ -1969,7 +1972,19 @@ lazy_space_alloc(LVRelStats *vacrelstats, BlockNumber relblocks)
 	vacrelstats->num_dead_tuples = 0;
 	vacrelstats->max_dead_tuples = (int) maxtuples;
 	vacrelstats->dead_tuples = (ItemPointer)
-		palloc(maxtuples * sizeof(ItemPointerData));
+		MemoryContextAllocHuge(CurrentMemoryContext, maxtuples * sizeof(ItemPointerData));
+}
+
+/*
+ * lazy_space_dealloc - free lazy vacuum work mem
+ */
+static void
+lazy_space_dealloc(LVRelStats *vacrelstats)
+{
+	if (vacrelstats->dead_tuples != NULL) {
+		pfree(vacrelstats->dead_tuples);
+		vacrelstats->dead_tuples = NULL;
+	}
 }
 
 /*
-- 
2.6.6




Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-05 Thread Claudio Freire
On Sun, Sep 4, 2016 at 3:46 AM, Simon Riggs  wrote:
> On 3 September 2016 at 04:25, Claudio Freire  wrote:
>> The patch also makes vacuum free the dead_tuples before starting
>> truncation. It didn't seem necessary to hold onto it beyond that
>> point, and it might help give the OS more cache, especially if work
>> mem is configured very high to avoid multiple index scans.
>
> How long does that part ever take? Is there any substantial gain from this?
>
> Lets discuss that as a potential second patch.

In the test case I mentioned, it takes longer than the vacuum part itself.

Other than freeing RAM there's no gain. I didn't measure any speed
difference while testing, but that's probably because the backward
scan doesn't benefit from the cache, but other activity on the system
might. So, depending on the workload on the server, extra available
RAM may be a significant gain on its own or not. It just didn't seem
there was a reason to keep that RAM reserved, especially after making
it a huge allocation.

I'm fine either way. I can remove that from the patch or leave it
as-is. It just seemed like a good idea at the time.




Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-05 Thread Claudio Freire
On Mon, Sep 5, 2016 at 11:50 AM, Claudio Freire  wrote:
> On Sun, Sep 4, 2016 at 3:46 AM, Simon Riggs  wrote:
>> On 3 September 2016 at 04:25, Claudio Freire  wrote:
>>> The patch also makes vacuum free the dead_tuples before starting
>>> truncation. It didn't seem necessary to hold onto it beyond that
>>> point, and it might help give the OS more cache, especially if work
>>> mem is configured very high to avoid multiple index scans.
>>
>> How long does that part ever take? Is there any substantial gain from this?
>>
>> Lets discuss that as a potential second patch.
>
> In the test case I mentioned, it takes longer than the vacuum part itself.
>
> Other than freeing RAM there's no gain. I didn't measure any speed
> difference while testing, but that's probably because the backward
> scan doesn't benefit from the cache, but other activity on the system
> might. So, depending on the workload on the server, extra available
> RAM may be a significant gain on its own or not. It just didn't seem
> there was a reason to keep that RAM reserved, especially after making
> it a huge allocation.
>
> I'm fine either way. I can remove that from the patch or leave it
> as-is. It just seemed like a good idea at the time.


Rebased and split versions attached
From 7b47b59d89bf3edaeb11c4ffe3660f3d5b7b86de Mon Sep 17 00:00:00 2001
From: Claudio Freire 
Date: Fri, 2 Sep 2016 23:21:01 -0300
Subject: [PATCH 1/2] Vacuum: allow using more than 1GB work mem

Turns the palloc for dead_tuples into a huge allocation to allow
using more than 1GB worth of dead_tuples, saving index scans on
heavily bloated tables
---
 src/backend/commands/vacuumlazy.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 231e92d..2c98da4 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -1952,7 +1952,7 @@ lazy_space_alloc(LVRelStats *vacrelstats, BlockNumber relblocks)
 	{
 		maxtuples = (vac_work_mem * 1024L) / sizeof(ItemPointerData);
 		maxtuples = Min(maxtuples, INT_MAX);
-		maxtuples = Min(maxtuples, MaxAllocSize / sizeof(ItemPointerData));
+		maxtuples = Min(maxtuples, MaxAllocHugeSize / sizeof(ItemPointerData));
 
 		/* curious coding here to ensure the multiplication can't overflow */
 		if ((BlockNumber) (maxtuples / LAZY_ALLOC_TUPLES) > relblocks)
@@ -1969,7 +1969,7 @@ lazy_space_alloc(LVRelStats *vacrelstats, BlockNumber relblocks)
 	vacrelstats->num_dead_tuples = 0;
 	vacrelstats->max_dead_tuples = (int) maxtuples;
 	vacrelstats->dead_tuples = (ItemPointer)
-		palloc(maxtuples * sizeof(ItemPointerData));
+		MemoryContextAllocHuge(CurrentMemoryContext, maxtuples * sizeof(ItemPointerData));
 }
 
 /*
-- 
2.6.6

From 24fa0815d915cc31ae455ef60585f54a33dbd09c Mon Sep 17 00:00:00 2001
From: Claudio Freire 
Date: Fri, 2 Sep 2016 23:21:01 -0300
Subject: [PATCH 2/2] Vacuum: free dead_tuples sooner

Frees the dead_tuples array sooner, decreasing the impact of huge settings
for autovacuum_work_mem or maintenance_work_mem during lazy vacuum.
---
 src/backend/commands/vacuumlazy.c | 15 +++
 1 file changed, 15 insertions(+)

diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 2c98da4..dbe2040 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -155,6 +155,7 @@ static void lazy_truncate_heap(Relation onerel, LVRelStats *vacrelstats);
 static BlockNumber count_nondeletable_pages(Relation onerel,
 		 LVRelStats *vacrelstats);
 static void lazy_space_alloc(LVRelStats *vacrelstats, BlockNumber relblocks);
+static void lazy_space_dealloc(LVRelStats *vacrelstats);
 static void lazy_record_dead_tuple(LVRelStats *vacrelstats,
 	   ItemPointer itemptr);
 static bool lazy_tid_reaped(ItemPointer itemptr, void *state);
@@ -1310,6 +1311,8 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
 	for (i = 0; i < nindexes; i++)
 		lazy_cleanup_index(Irel[i], indstats[i], vacrelstats);
 
+	lazy_space_dealloc(vacrelstats);
+
 	/* If no indexes, make log report that lazy_vacuum_heap would've made */
 	if (vacuumed_pages)
 		ereport(elevel,
@@ -1973,6 +1976,18 @@ lazy_space_alloc(LVRelStats *vacrelstats, BlockNumber relblocks)
 }
 
 /*
+ * lazy_space_dealloc - free lazy vacuum work mem
+ */
+static void
+lazy_space_dealloc(LVRelStats *vacrelstats)
+{
+	if (vacrelstats->dead_tuples != NULL) {
+		pfree(vacrelstats->dead_tuples);
+		vacrelstats->dead_tuples = NULL;
+	}
+}
+
+/*
  * lazy_record_dead_tuple - remember one deletable tuple
  */
 static void
-- 
2.6.6




Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-05 Thread Claudio Freire
On Mon, Sep 5, 2016 at 5:36 PM, Simon Riggs  wrote:
> On 5 September 2016 at 15:50, Claudio Freire  wrote:
>> On Sun, Sep 4, 2016 at 3:46 AM, Simon Riggs  wrote:
>>> On 3 September 2016 at 04:25, Claudio Freire  wrote:
>>>> The patch also makes vacuum free the dead_tuples before starting
>>>> truncation. It didn't seem necessary to hold onto it beyond that
>>>> point, and it might help give the OS more cache, especially if work
>>>> mem is configured very high to avoid multiple index scans.
>>>
>>> How long does that part ever take? Is there any substantial gain from this?
>>>
>>> Lets discuss that as a potential second patch.
>>
>> In the test case I mentioned, it takes longer than the vacuum part itself.
>
> Please provide a test case and timings so we can see what's happening.

The referenced test case is the one I mentioned on the OP:

- createdb pgbench
- pgbench -i -s 4000 pgbench
- psql pgbench -c 'delete from pgbench_accounts;'
- vacuumdb -v -t pgbench_accounts pgbench

fsync=off, autovacuum=off, maintenance_work_mem=4GB

From what I remember, it used ~2.7GB of RAM up until the truncate
phase, where it freed it. It performed a single index scan over the
PK.

I don't remember timings, and I didn't take them, so I'll have to
repeat the test to get them. It takes all day and makes my laptop
unusably slow, so I'll post them later, but they're not very
interesting. The only interesting bit is that it does a single index
scan instead of several, which is kinda nice on TB-or-more tables.

Btw, without a further patch to prefetch pages on the backward scan
for truncate, however, my patience ran out before it finished
truncating. I haven't submitted that patch because there was an
identical patch in an older thread that was discussed and more or less
rejected since it slightly penalized SSDs. While I'm confident my
version of the patch is a little easier on SSDs, I haven't got an SSD
at hand to confirm, so that patch is still waiting on the queue until
I get access to an SSD. The tests I'll make include that patch, so the
timing regarding truncate won't be representative of HEAD (I really
can't afford to run the tests on a scale=4000 pgbench without that
patch; it crawls, and smaller scales don't fill the dead_tuples
array).




Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-06 Thread Claudio Freire
On Sun, Sep 4, 2016 at 8:10 PM, Jim Nasby  wrote:
> On 9/4/16 1:46 AM, Simon Riggs wrote:
>>>
>>> > The patch also makes vacuum free the dead_tuples before starting
>>> > truncation. It didn't seem necessary to hold onto it beyond that
>>> > point, and it might help give the OS more cache, especially if work
>>> > mem is configured very high to avoid multiple index scans.
>>
>> How long does that part ever take? Is there any substantial gain from
>> this?
>
>
> If you're asking about how long the dealloc takes, we're going to have to
> pay that cost anyway when the context gets destroyed/reset, no? Doing that
> sooner rather than later certainly seems like a good idea since we've seen
> that truncation can take quite some time. Might as well give the memory back
> to the OS ASAP.

AFAIK, except on debug builds where it has to memset the whole thing,
the cost is constant (unrelated to the allocated block size), so it
should be rather small in this context.


On Tue, Sep 6, 2016 at 1:42 PM, Robert Haas  wrote:
> On Sat, Sep 3, 2016 at 8:55 AM, Claudio Freire  wrote:
>> The attached patch allows setting maintenance_work_mem or
>> autovacuum_work_mem higher than 1GB (and be effective), by turning the
>> allocation of the dead_tuples into a huge allocation.
>>
>> This results in fewer index scans for heavily bloated tables, and
>> could be a lifesaver in many situations (in particular, the situation
>> I'm living right now in production, where we don't have enough room
>> for a vacuum full, and have just deleted 75% of a table to make room
>> but have to rely on regular lazy vacuum to free the space).
>>
>> The patch also makes vacuum free the dead_tuples before starting
>> truncation. It didn't seem necessary to hold onto it beyond that
>> point, and it might help give the OS more cache, especially if work
>> mem is configured very high to avoid multiple index scans.
>>
>> Tested with pgbench scale 4000 after deleting the whole
>> pgbench_accounts table, seemed to work fine.
>
> The problem with this is that we allocate the entire amount of
> maintenance_work_mem even when the number of actual dead tuples turns
> out to be very small.  That's not so bad if the amount of memory we're
> potentially wasting is limited to ~1 GB, but it seems pretty dangerous
> to remove the 1 GB limit, because somebody might have
> maintenance_work_mem set to tens or hundreds of gigabytes to speed
> index creation, and allocating that much space for a VACUUM that
> encounters 1 dead tuple does not seem like a good plan.
>
> What I think we need to do is make some provision to initially
> allocate only a small amount of memory and then grow the allocation
> later if needed.  For example, instead of having
> vacrelstats->dead_tuples be declared as ItemPointer, declare it as
> ItemPointer * and allocate the array progressively in segments.  I'd
> actually argue that the segment size should be substantially smaller
> than 1 GB, like say 64MB; there are still some people running systems
> which are small enough that allocating 1 GB when we may need only 6
> bytes can drive the system into OOM.

This would, however, incur the cost of having to copy the whole GB-sized
chunk every time it's expanded. It wouldn't be cheap.

I've monitored the vacuum as it runs and the OS doesn't map the whole
block unless it's touched, which it isn't until dead tuples are found.
Surely, if overcommit is disabled (as it should be), it could exhaust the
virtual address space if set very high, but it wouldn't really use the
memory unless it's needed; it would merely reserve it.

To fix that, rather than repalloc the whole thing, dead_tuples would
have to be an ItemPointer** of sorted chunks. That'd be a
significantly more complex patch, but at least it wouldn't incur the
memcpy. I could attempt that, but I don't see the difference between
vacuum and create index in this case. Both could allocate a huge chunk
of the virtual address space if maintenance work mem says so, both
proportional to the size of the table. I can't see how that could take
any DBA by surprise.
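
To make that concrete, here is a minimal, self-contained sketch of the
segmented approach (all names and sizes are hypothetical, plain malloc
stands in for palloc, and a long stands in for ItemPointerData; the real
patch would differ):

#include <stdlib.h>

#define MAX_SEGMENTS		32
#define FIRST_SEGMENT_SIZE	1024

typedef struct DeadTupleSegment
{
	size_t		num_items;	/* items used in this segment */
	size_t		max_items;	/* allocated capacity */
	long	   *items;		/* stand-in for an ItemPointerData array */
} DeadTupleSegment;

typedef struct DeadTuplesMultiArray
{
	int			num_segs;
	DeadTupleSegment segs[MAX_SEGMENTS];
} DeadTuplesMultiArray;

/*
 * Append one dead TID, growing by whole segments: each new segment
 * doubles the previous capacity, so existing entries are never copied
 * and memory is only reserved as dead tuples are actually found.
 */
static void
record_dead_tuple(DeadTuplesMultiArray *a, long tid)
{
	DeadTupleSegment *seg;

	if (a->num_segs == 0 ||
		a->segs[a->num_segs - 1].num_items ==
		a->segs[a->num_segs - 1].max_items)
	{
		DeadTupleSegment *new_seg;
		size_t		newsize;

		newsize = (a->num_segs == 0) ? FIRST_SEGMENT_SIZE :
			a->segs[a->num_segs - 1].max_items * 2;
		/* a real version would check MAX_SEGMENTS and malloc failure */
		new_seg = &a->segs[a->num_segs++];
		new_seg->num_items = 0;
		new_seg->max_items = newsize;
		new_seg->items = malloc(newsize * sizeof(long));
	}
	seg = &a->segs[a->num_segs - 1];
	seg->items[seg->num_items++] = tid;
}

Since vacuum records TIDs in physical scan order, each segment ends up
covering a disjoint, ascending range, which keeps lookups cheap: a short
linear scan over the segments plus one binary search inside the matching
segment.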

A sensible compromise could be dividing the maintenance_work_mem by
autovacuum_max_workers when used in autovacuum, as is done for cost
limits, to protect those that set both rather high.




Re: [HACKERS] Re: [COMMITTERS] pgsql: Make initdb's suggested "pg_ctl start" command line more reliabl

2016-09-06 Thread Claudio Freire
On Tue, Sep 6, 2016 at 2:08 PM, Tom Lane  wrote:
>> The not-quoting-if-not-needed doesn't appear to do anything useful for me:
>> 'pg-install/bin/pg_ctl' -D 'pg-install/var/data' -l logfile start
>
> Dash is considered a character that needs quoting.  It might be possible
> to avoid that if we could be certain that appendShellString's output would
> never be placed in a spot where it could be taken for a switch, but that
> seems like a large assumption to me.  Or maybe we could treat it as
> forcing quotes only if it starts the string; but I don't personally care
> enough to write the code for that.  Feel free if you want to.


Wouldn't it be taken for a switch even with quoting?

Quoting "-D" seems to work fine, which would suggest the above is true.




Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-06 Thread Claudio Freire
On Tue, Sep 6, 2016 at 2:39 PM, Robert Haas  wrote:
>> I could attempt that, but I don't see the difference between
>> vacuum and create index in this case. Both could allocate a huge chunk
>> of the virtual address space if maintainance work mem says so, both
>> proportional to the size of the table. I can't see how that could take
>> any DBA by surprise.
>
> Really?  CREATE INDEX isn't going to allocate more storage space than
> the size of the data actually being sorted, because tuplesort.c is
> smart about that kind of thing.  But VACUUM will very happily allocate
> vastly more memory than the number of dead tuples.  It is thankfully
> smart enough not to allocate more storage than the number of line
> pointers that could theoretically exist in a relation of the given
> size, but that only helps for very small relations.  In a large
> relation that divergence between the amount of storage space that
> could theoretically be needed and the amount that is actually needed
> is likely to be extremely high.  1 TB relation = 2^27 blocks, each of
> which can contain MaxHeapTuplesPerPage dead line pointers.  On my
> system, MaxHeapTuplesPerPage is 291, so that's 291 * 2^27 possible
> dead line pointers, which at 6 bytes each is 291 * 6 * 2^27 = ~218GB,
> but the expected number of dead line pointers is much less than that.
> Even if this is a vacuum triggered by autovacuum_vacuum_scale_factor
> and you're using the default of 0.2 (probably too high for such a
> large table), assuming there are about 60 tuples for page (which is
> what I get with pgbench -i) the table would have about 2^27 * 60 = 7.7
> billion tuples of which 1.5 billion would be dead, meaning we need
> about 9-10GB of space to store all of those dead tuples.  Allocating
> as much as 218GB when we need 9-10GB is going to sting, and I don't
> see how you will get a comparable distortion with CREATE INDEX.  I
> might be missing something, though.

CREATE INDEX could also allocate 218GB; you just need to index enough
columns and you'll get that.

Aside from the fact that CREATE INDEX will only allocate what is going
to be used and VACUUM will overallocate, the potential to fully
allocate the amount given is still there for both cases.

> There's no real issue when there's only one process running on the
> system at a time.  If the user set maintenance_work_mem to an amount
> of memory that he can't afford to pay even once, then that's simple
> misconfiguration and it's not really our problem.  The issue is that
> when there are 3 or potentially more VACUUM processes running plus a
> CREATE INDEX or two at the same time.  If you set maintenance_work_mem
> to a value that is large enough to make the CREATE INDEX run fast, now
> with your patch that is also going to cause each VACUUM process to
> gobble up lots of extra memory that it probably doesn't need, and now
> you may well start to get failures.  I've seen this happen even with
> the current 1GB limit, though you need a pretty small system - e.g.
> 8GB RAM - for it to be a problem.  I think it is really really likely
> to cause big problems for us if we dramatically increase that limit
> without making the allocation algorithm smarter.

Ok, a pity it will invalidate all the testing already done though (I
was almost done with the testing).

I guess I'll send the results anyway.




Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-06 Thread Claudio Freire
On Tue, Sep 6, 2016 at 3:11 PM, Tom Lane  wrote:
> We could get around (1) by something like Robert's idea of segmented
> allocation, but TBH I've seen nothing on this thread to make me think
> it's necessary or would even result in any performance improvement
> at all.  The bigger we make that array, the worse index-cleaning
> is going to perform, and complicating the data structure will add
> another hit on top of that.

I wouldn't be so sure. I've seen cases where two binary searches were
faster than a single binary search, especially when working with
humongous arrays like this tid array, because touching fewer (memory)
pages per search does pay off considerably.

I'd try before giving up on the idea.

The test results (which I'll post in a second) do give credit to your
expectation that making the array bigger/more complex does impact
index scan performance. It's still faster than scanning several times
though.




Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-06 Thread Claudio Freire
On Tue, Sep 6, 2016 at 3:45 AM, Simon Riggs  wrote:
> On 5 September 2016 at 21:58, Claudio Freire  wrote:
>
>>>>> How long does that part ever take? Is there any substantial gain from 
>>>>> this?
>
>> Btw, without a further patch to prefetch pages on the backward scan
>> for truncate, however, my patience ran out before it finished
>> truncating. I haven't submitted that patch because there was an
>> identical patch in an older thread that was discussed and more or less
>> rejected since it slightly penalized SSDs.
>
> OK, thats enough context. Sorry for being forgetful on that point.
>
> Please post that new patch also.

Attached.

On Mon, Sep 5, 2016 at 5:58 PM, Claudio Freire  wrote:
> On Mon, Sep 5, 2016 at 5:36 PM, Simon Riggs  wrote:
>> On 5 September 2016 at 15:50, Claudio Freire  wrote:
>>> On Sun, Sep 4, 2016 at 3:46 AM, Simon Riggs  wrote:
>>>> On 3 September 2016 at 04:25, Claudio Freire  
>>>> wrote:
>>>>> The patch also makes vacuum free the dead_tuples before starting
>>>>> truncation. It didn't seem necessary to hold onto it beyond that
>>>>> point, and it might help give the OS more cache, especially if work
>>>>> mem is configured very high to avoid multiple index scans.
>>>>
>>>> How long does that part ever take? Is there any substantial gain from this?
>>>>
>>>> Lets discuss that as a potential second patch.
>>>
>>> In the test case I mentioned, it takes longer than the vacuum part itself.
>>
>> Please provide a test case and timings so we can see what's happening.

Robert made a strong point for a change in the approach, so the
information below is applicable only to the old patch (to be
rewritten).

I'm sending this merely to document the testing done, it will be a
while before I can get the proposed design running and tested.

> The referenced test case is the one I mentioned on the OP:
>
> - createdb pgbench
> - pgbench -i -s 4000 pgbench
> - psql pgbench -c 'delete from pgbench_accounts;'
> - vacuumdb -v -t pgbench_accounts pgbench
>
> fsync=off, autovacuum=off, maintenance_work_mem=4GB
>
> From what I remember, it used ~2.7GB of RAM up until the truncate
> phase, where it freed it. It performed a single index scan over the
> PK.
>
> I don't remember timings, and I didn't take them, so I'll have to
> repeat the test to get them. It takes all day and makes my laptop
> unusably slow, so I'll post them later, but they're not very
> interesting. The only interesting bit is that it does a single index
> scan instead of several, which is kinda nice on TB-or-more tables.


So, the test results below:

During setup, maybe for context, the delete took 52m 50s real time
(measured with time psql pgbench -c 'delete from pgbench_accounts;')

During the delete, my I/O was on average like the following, which
should give an indication of what my I/O subsystem is capable of (not
much, granted):

Device:  rrqm/s  wrqm/s    r/s    w/s  rMB/s  wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda        0.47    5.27  35.53  77.42  17.58  42.95  1097.51   145.22 1295.23   33.47 1874.36   8.85 100.00

Since it's a 5.4k RPM laptop drive, it's rather slow on IOPS, and since
I'm using the defaults for shared buffers and checkpoints, write
throughput isn't stellar either. But that's not the point of the test
anyway; it's just for context.

The hardware is an HP envy laptop with a 1TB 5.4k RPM hard drive, 12GB
RAM, core i7-4722HQ, no weird performance tweaking of any kind (ie:
cpu scaling left intact). The system was not dedicated of course,
being a laptop, but it had little else going on while the test was
running. Given the size of the test, I don't believe there's any
chance concurrent activity could invalidate the results.

The timing for setup was comparable with both versions (patched and
unpatched), so I'm reporting the patched times only.


The vacuum phase:

patched:

$ vacuumdb -v -t pgbench_accounts pgbench
INFO:  vacuuming "public.pgbench_accounts"
INFO:  scanned index "pgbench_accounts_pkey" to remove 4 row versions
DETAIL:  CPU 12.46s/48.76u sec elapsed 566.47 sec.
INFO:  "pgbench_accounts": removed 4 row versions in 6557378 pages
DETAIL:  CPU 56.68s/28.90u sec elapsed 1872.76 sec.
INFO:  index "pgbench_accounts_pkey" now contains 0 row versions in
1096762 pages
DETAIL:  4 index row versions were removed.
1092896 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.47 sec.
INFO:  "pgbench_accounts": found 4 remova

Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2016-09-06 Thread Claudio Freire
On Mon, Aug 15, 2016 at 9:33 PM, Peter Geoghegan  wrote:
> The patch is intended to be applied on top of parallel B-Tree patches
> 0001-* and 0002-* [1]. I happened to test it with parallelism, but
> these are all independently useful, and will be entered as a separate
> CF entry (perhaps better to commit the earlier two patches first, to
> avoid merge conflicts). I'm optimistic that we can get those 3 patches
> in the series out of the way early, without blocking on discussing
> parallel sort.

Applied patches 1 and 2, builds fine, regression tests run fine. It
was a prerequisite to reviewing patch 3 (which I'm going to do below),
so I thought I might as well report on that tidbit of info, fwiw.

> The patch makes tuplesort merging shift down and displace the root
> tuple with the tape's next preread tuple, rather than compacting and
> then inserting into the heap anew. This approach to maintaining the
> heap as tuples are returned to caller will always produce fewer
> comparisons overall. The new approach is also simpler. We were already
> shifting down to compact the heap within the misleadingly named [2]
> function tuplesort_heap_siftup() -- why not instead just use the
> caller tuple (the tuple that we currently go on to insert) when
> initially shifting down (not the heap's preexisting last tuple, which
> is guaranteed to go straight to the leaf level again)? That way, we
> don't need to enlarge the heap at all through insertion, shifting up,
> etc. We're done, and are *guaranteed* to have performed less work
> (fewer comparisons and swaps) than with the existing approach (this is
> the reason for my optimism about getting this stuff out of the way
> early).

Patch 3 applies fine to git master as of
25794e841e5b86a0f90fac7f7f851e5d950e51e2 (on top of patches 1 and 2).

Builds fine and without warnings on gcc 4.8.5 AFAICT, regression test
suite runs without issues as well.

Patch lacks any new tests, but the changed code paths seem covered
sufficiently by existing tests. A little bit of fuzzing on the patch
itself, like reverting some key changes, or flipping some key
comparisons, induces test failures as it should, mostly in cluster.

The logic in tuplesort_heap_root_displace seems sound, except:

+*/
+   memtuples[i] = memtuples[imin];
+   i = imin;
+   }
+
+   Assert(state->memtupcount > 1 || imin == 0);
+   memtuples[imin] = *newtup;
+}

Why that assert? Wouldn't it make more sense to Assert(imin < n) ?


In the meantime, I'll go and do some perf testing.

Assuming the speedup is realized during testing, LGTM.
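
For readers following along, here is a generic, stand-alone sketch of the
operation under review, on a plain int min-heap (the actual patch works
on SortTuples and tape indexes; names here are mine):

/*
 * Replace the root of the min-heap heap[0..n-1] with newval and sift
 * it down to its proper place. Compared with "delete the root, then
 * insert newval", this does a single descent, hence strictly fewer
 * comparisons and swaps.
 */
static void
heap_root_displace(int *heap, int n, int newval)
{
	int			i = 0;

	for (;;)
	{
		int			child = 2 * i + 1;

		if (child >= n)
			break;				/* reached a leaf */
		/* pick the smaller of the two children, if both exist */
		if (child + 1 < n && heap[child + 1] < heap[child])
			child++;
		if (newval <= heap[child])
			break;				/* heap property restored */
		heap[i] = heap[child];	/* pull the child up into the hole */
		i = child;
	}
	heap[i] = newval;
}

Note that calling it with n == 0 would write outside the array, which is
essentially the corner case the assertion discussion below turns on.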




Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2016-09-06 Thread Claudio Freire
On Tue, Sep 6, 2016 at 8:28 PM, Peter Geoghegan  wrote:
> On Tue, Sep 6, 2016 at 12:57 PM, Claudio Freire  
> wrote:
>> Patch lacks any new tests, but the changed code paths seem covered
>> sufficiently by existing tests. A little bit of fuzzing on the patch
>> itself, like reverting some key changes, or flipping some key
>> comparisons, induces test failures as it should, mostly in cluster.
>>
>> The logic in tuplesort_heap_root_displace seems sound, except:
>>
>> +*/
>> +   memtuples[i] = memtuples[imin];
>> +   i = imin;
>> +   }
>> +
>> +   Assert(state->memtupcount > 1 || imin == 0);
>> +   memtuples[imin] = *newtup;
>> +}
>>
>> Why that assert? Wouldn't it make more sense to Assert(imin < n) ?
>
> There might only be one or two elements in the heap. Note that the
> heap size is indicated by state->memtupcount at this point in the
> sort, which is a little confusing (that differs from how memtupcount
> is used elsewhere, where we don't partition memtuples into a heap
> portion and a preread tuples portion, as we do here).

I noticed, but here n = state->memtupcount

+   Assert(memtuples[0].tupindex == newtup->tupindex);
+
+   CHECK_FOR_INTERRUPTS();
+
+   n = state->memtupcount; /* n is heap's size,
including old root */
+   imin = 0;   /*
start with caller's "hole" in root */
+   i = imin;

In fact, the assert in the patch would allow writing memtuples outside
the heap, as in calling tuplesort_heap_root_displace if
memtupcount==0, but I don't think that should be legal (memtuples[0]
== memtuples[imin] would be outside the heap).

Sure, that's a weird enough case (that assert up there already reads
memtuples[0], which would be equally illegal if memtupcount==0), but it
goes to show that the assert expression just seems odd for its
intent.

BTW, I know it's not the scope of the patch, but shouldn't
root_displace be usable on the TSS_BOUNDED phase?




Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2016-09-06 Thread Claudio Freire
On Tue, Sep 6, 2016 at 9:19 PM, Peter Geoghegan  wrote:
> On Tue, Sep 6, 2016 at 4:55 PM, Claudio Freire  wrote:
>> I noticed, but here n = state->memtupcount
>>
>> +   Assert(memtuples[0].tupindex == newtup->tupindex);
>> +
>> +   CHECK_FOR_INTERRUPTS();
>> +
>> +   n = state->memtupcount; /* n is heap's size,
>> including old root */
>> +   imin = 0;   /*
>> start with caller's "hole" in root */
>> +   i = imin;
>
> I'm fine with using "n" in the later assertion you mentioned, if
> that's clearer to you. memtupcount is broken out as "n" simply because
> that's less verbose, in a place where that makes things far clearer.
>
>> In fact, the assert on the patch would allow writing memtuples outside
>> the heap, as in calling tuplesort_heap_root_displace if
>> memtupcount==0, but I don't think that should be legal (memtuples[0]
>> == memtuples[imin] would be outside the heap).
>
> You have to have a valid heap (i.e. there must be at least one
> element) to call tuplesort_heap_root_displace(), and it doesn't
> directly compact the heap, so it must remain valid on return. The
> assertion exists to make sure that everything is okay with a
> one-element heap, a case which is quite possible.

More than using "n" or "memtupcount", what I'm saying is to assert that
memtuples[imin] is inside the heap, which would catch the same errors
the original assert would, and more.

Assert(imin < state->memtupcount)

If you prefer.

The original assert allows any value of imin for memtupcount>1, and
that's my main concern. It shouldn't.


On Tue, Sep 6, 2016 at 9:19 PM, Peter Geoghegan  wrote:
>> Sure, that's a weird enough case (that assert up there already reads
>> memtuples[0] which would be equally illegal if memtupcount==0), but it
>> goes on to show that the assert expression just seems odd for its
>> intent.
>>
>> BTW, I know it's not the scope of the patch, but shouldn't
>> root_displace be usable on the TSS_BOUNDED phase?
>
> I don't think it should be, no. With a top-n heap sort, the
> expectation is that after a little while, we can immediately determine
> that most tuples do not belong in the heap (this will require more
> than one comparison per tuple when the tuple that may be entered into
> the heap will in fact go in the heap, which should be fairly rare
> after a time). That's why that general strategy can be so much faster,
> of course.

I wasn't proposing getting rid of that optimization, but just
replacing the siftup+insert step with root_displace...

> Note that that heap is "reversed" -- the sort order is inverted, so
> that we can use a minheap. The top of the heap is the most marginal
> tuple in the top-n heap so far, and so is the next to be removed from
> consideration entirely (not the next to be returned to caller, when
> merging).

...but I didn't pause to consider that point.

It still looks like a valid optimization: instead of rearranging the
heap twice (siftup + insert), do it once (replace + relocate).

However, I agree that it's not worth the risk conflating the two
optimizations. That one can be done later as a separate patch.




Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-07 Thread Claudio Freire
On Wed, Sep 7, 2016 at 12:12 PM, Greg Stark  wrote:
> On Wed, Sep 7, 2016 at 1:45 PM, Simon Riggs  wrote:
>> On 6 September 2016 at 19:59, Tom Lane  wrote:
>>
>>> The idea of looking to the stats to *guess* about how many tuples are
>>> removable doesn't seem bad at all.  But imagining that that's going to be
>>> exact is folly of the first magnitude.
>>
>> Yes.  Bear in mind I had already referred to allowing +10% to be safe,
>> so I think we agree that a reasonably accurate, yet imprecise
>> calculation is possible in most cases.
>
> That would all be well and good if it weren't trivial to do what
> Robert suggested. This is just a large unsorted list that we need to
> iterate throught. Just allocate chunks of a few megabytes and when
> it's full allocate a new chunk and keep going. There's no need to get
> tricky with estimates and resizing and whatever.

I agree. While the idea of estimating the right size sounds promising
a priori, considering the estimate can go wrong and over- or
under-allocate quite severely, the risks outweigh the benefits when you
consider the alternative of a dynamic allocation strategy.

Unless the dynamic strategy has a bigger CPU impact than expected, I
believe it's a superior approach.




Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-08 Thread Claudio Freire
On Thu, Sep 8, 2016 at 11:54 AM, Pavan Deolasee
 wrote:
> For example, for a table with 60 bytes wide tuple (including 24 byte
> header), each page can approximately have 8192/60 = 136 tuples. Say we
> provision for 136*2 = 272 bits per page i.e. 34 bytes per page for the
> bitmap. First 272 offsets in every page are represented in the bitmap and
> anything greater than are in overflow region. On the other hand, the current
> representation will need about 16 bytes per page assuming 2% dead tuples, 40
> bytes per page assuming 5% dead tuples and 80 bytes assuming 10% dead
> tuples. So bitmap will take more space for small tuples or when vacuum is
> run very aggressively, both seems unlikely for very large tables. Of course
> the calculation does not take into account the space needed by the overflow
> area, but I expect that too be small.

I thought about something like this, but it could be extremely
inefficient for mostly frozen tables, since the bitmap cannot account
for frozen pages without losing the O(1) lookup characteristic (the
bitmap is addressed directly by block number, so skipping frozen pages
would require an extra mapping).




Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2016-09-08 Thread Claudio Freire
On Tue, Sep 6, 2016 at 8:28 PM, Peter Geoghegan  wrote:
>> In the meanwhile, I'll go and do some perf testing.
>>
>> Assuming the speedup is realized during testing, LGTM.
>
> Thanks. I suggest spending at least as much time on unsympathetic
> cases (e.g., only 2 or 3 tapes must be merged). At the same time, I
> suggest focusing on a type that has relatively expensive comparisons,
> such as collated text, to make differences clearer.

The tests are still running (the benchmark script I came up with runs
for a lot longer than I anticipated, about 2 days), but preliminary
results are very promising: I can see a clear and consistent speedup.
We'll have to wait for the complete results to see if there's any
significant regression, though. I'll post the full results when I have
them, but until now it all looks like this:

setup:

create table lotsofitext(i text, j text, w text, z integer, z2 bigint);
insert into lotsofitext select cast(random() * 10.0 as text)
|| 'blablablawblabla', cast(random() * 10.0 as text) ||
'blablablawjjjblabla', cast(random() * 10.0 as text) ||
'blablablawwwabla', random() * 10.0, random() * 1.0 from
generate_series(1, 1000);

timed:

select count(*) FROM (select * from lotsofitext order by i, j, w, z, z2) t;

Unpatched Time: 100351.251 ms
Patched Time: 75180.787 ms

That's like a 25% speedup on random input. As we say over here, rather
badly translated, not a turkey's boogers (meaning "nice!")


On Tue, Sep 6, 2016 at 9:50 PM, Claudio Freire  wrote:
> On Tue, Sep 6, 2016 at 9:19 PM, Peter Geoghegan  wrote:
>> On Tue, Sep 6, 2016 at 4:55 PM, Claudio Freire  
>> wrote:
>>> I noticed, but here n = state->memtupcount
>>>
>>> +   Assert(memtuples[0].tupindex == newtup->tupindex);
>>> +
>>> +   CHECK_FOR_INTERRUPTS();
>>> +
>>> +   n = state->memtupcount; /* n is heap's size,
>>> including old root */
>>> +   imin = 0;   /*
>>> start with caller's "hole" in root */
>>> +   i = imin;
>>
>> I'm fine with using "n" in the later assertion you mentioned, if
>> that's clearer to you. memtupcount is broken out as "n" simply because
>> that's less verbose, in a place where that makes things far clearer.
>>
>>> In fact, the assert on the patch would allow writing memtuples outside
>>> the heap, as in calling tuplesort_heap_root_displace if
>>> memtupcount==0, but I don't think that should be legal (memtuples[0]
>>> == memtuples[imin] would be outside the heap).
>>
>> You have to have a valid heap (i.e. there must be at least one
>> element) to call tuplesort_heap_root_displace(), and it doesn't
>> directly compact the heap, so it must remain valid on return. The
>> assertion exists to make sure that everything is okay with a
>> one-element heap, a case which is quite possible.
>
> More than using "n" or "memtupcount" what I'm saying is to assert that
> memtuples[imin] is inside the heap, which would catch the same errors
> the original assert would, and more.
>
> Assert(imin < state->memtupcount)
>
> If you prefer.
>
> The original asserts allows any value of imin for memtupcount>1, and
> that's my main concern. It shouldn't.

So, for the assertions to properly avoid clobbering/reading out of
bounds memory, you need both the above assert:

 +*/
 +   memtuples[i] = memtuples[imin];
 +   i = imin;
 +   }
 +
>+   Assert(imin < state->memtupcount);
 +   memtuples[imin] = *newtup;
 +}

And another one at the beginning, asserting:

 +   SortTuple  *memtuples = state->memtuples;
 +   int n,
 +   imin,
 +   i;
 +
>+   Assert(state->memtupcount > 0 && memtuples[0].tupindex == newtup->tupindex);
 +
 +   CHECK_FOR_INTERRUPTS();

It's worth making that change, IMHO, unless I'm missing something.




Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2016-09-08 Thread Claudio Freire
On Thu, Sep 8, 2016 at 2:13 PM, Peter Geoghegan  wrote:
> On Thu, Sep 8, 2016 at 8:53 AM, Claudio Freire  wrote:
>> setup:
>>
>> create table lotsofitext(i text, j text, w text, z integer, z2 bigint);
>> insert into lotsofitext select cast(random() * 10.0 as text)
>> || 'blablablawblabla', cast(random() * 10.0 as text) ||
>> 'blablablawjjjblabla', cast(random() * 10.0 as text) ||
>> 'blablablawwwabla', random() * 10.0, random() * 1.0 from
>> generate_series(1, 1000);
>>
>> timed:
>>
>> select count(*) FROM (select * from lotsofitext order by i, j, w, z, z2) t;
>>
>> Unpatched Time: 100351.251 ms
>> Patched Time: 75180.787 ms
>>
>> That's like a 25% speedup on random input. As we say over here, rather
>> badly translated, not a turkey's boogers (meaning "nice!")
>
> Cool! What work_mem setting were you using here?

The script iterates over a few variations of string patterns (easy
comparisons vs hard comparisons), work mem (4MB, 64MB, 256MB, 1GB,
4GB), and table sizes (~350M, ~650M, ~1.5G).

That particular case I believe is using work_mem=4MB, easy strings, 1.5GB table.




Re: [HACKERS] Is tuplesort_heap_siftup() a misnomer?

2016-09-08 Thread Claudio Freire
On Thu, Sep 8, 2016 at 4:20 PM, Peter Geoghegan  wrote:
> On Thu, Sep 8, 2016 at 10:40 AM, Tom Lane  wrote:
>>> On Thu, Sep 8, 2016 at 12:01 AM, Heikki Linnakangas  wrote:
 I still think tuplesort_heap_siftup is a confusing name, although I'm not
 sure that Peter's "compact" is much better. I suggest that we rename it to
 tuplesort_heap_delete_top(). In comments within the function, explain that
 the *loop* corresponds to the "siftup" in Knuth's book.
>>
>>> I'm also fine with that name.
>>
>> I can live with it too.
>
> Attached patch does it that way, then. I stuck with the reference to
> "shift down", though, since I think we all agree that that is
> unambiguous.

I believe the term is "sift", not "shift".




Re: [HACKERS] Is tuplesort_heap_siftup() a misnomer?

2016-09-08 Thread Claudio Freire
On Thu, Sep 8, 2016 at 2:19 PM, Peter Geoghegan  wrote:
>> Peter, looking at your "displace" patch in this light, I think
>> tuplesort_heap_root_displace() and tuplesort_heap_delete_top() (as I'm
>> calling it now), should share a common subroutine. Displace replaces the top
>> node with the new node passed as argument, and then calls the subroutine to
>> push it down to the right place. Delete_top replaces the top node with the
>> last value in the heap, and then calls the subroutine. Or perhaps delete_top
>> should remove the last value from the heap, and then call displace() with
>> it, to re-insert it.
>
> I can see the value in that, but I'd still rather not. The code reuse
> win isn't all that large, and having a separate
> tuplesort_heap_root_displace() routine allows us to explain what's
> going on for merging, to assert what we're able to assert with tape
> numbers vs. heap position, and to lose the HEAPCOMPARE()/checkIndex
> cruft that the existing routines need (if only barely) to support
> replacement selection. I would be surprised if with a very tight inner
> loop like this, HEAPCOMPARE() has an appreciable overhead (maybe it
> increases pipeline stalls).

BTW, regarding this, since I read in some other thread that it was ok
to use static inline functions, I believe the compiler is smart enough
to optimize out the constant true/false in checkIndex for inlined
calls, so that may be viable (on modern compilers).
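
As a toy illustration of that point (hypothetical names and a simplified
comparator, not the actual tuplesort code): when the flag is a
compile-time constant at an inlined call site, the dead branch simply
disappears from the generated code:

static inline int
compare_entries(int a_index, int a_key, int b_index, int b_key,
				int check_index)
{
	/* the flag test is free at call sites that pass a constant */
	if (check_index && a_index != b_index)
		return (a_index > b_index) - (a_index < b_index);
	return (a_key > b_key) - (a_key < b_key);
}

static int
compare_for_merge(int a_index, int a_key, int b_index, int b_key)
{
	/* constant 0: the compiler folds the index branch away entirely */
	return compare_entries(a_index, a_key, b_index, b_key, 0);
}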




Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2016-09-09 Thread Claudio Freire
...
On Fri, Sep 9, 2016 at 9:22 PM, Claudio Freire  wrote:
> Since it is true that doing so would make it impossible to keep the
> asserts about tupindex in tuplesort_heap_root_displace, I guess it
> depends on how useful those asserts are (ie: how likely it is that
> those conditions could be violated, and how damaging it could be if
> they were). If it is decided the refactor is desirable, I'd suggest
> making the common siftup procedure static inline, to allow
> tuplesort_heap_root_displace to inline and specialize it, since it
> will be called with checkIndex=False and that simplifies the resulting
> code considerably.
>
> Peter also mentioned that there were some other changes going on in
> the surrounding code that could impact this patch, so I'm marking the
> patch Waiting on Author.
>
> Overall, however, I believe the patch is in good shape. Only minor
> form issues need to be changed, the functionality seems both desirable
> and ready.


Sorry, forgot to specify: that was all about patch 3, the one about
tuplesort_heap_root_displace.




Re: [HACKERS] Tuplesort merge pre-reading

2016-09-09 Thread Claudio Freire
On Fri, Sep 9, 2016 at 8:13 AM, Heikki Linnakangas  wrote:
>
> Claudio, if you could also repeat the tests you ran on Peter's patch set on
> the other thread, with these patches, that'd be nice. These patches are
> effectively a replacement for
> 0002-Use-tuplesort-batch-memory-for-randomAccess-sorts.patch. And review
> would be much appreciated too, of course.
>
> Attached are new versions. Compared to last set, they contain a few comment
> fixes, and a change to the 2nd patch to not allocate tape buffers for tapes
> that were completely unused.


Will do so




Re: [HACKERS] Tuplesort merge pre-reading

2016-09-09 Thread Claudio Freire
On Fri, Sep 9, 2016 at 9:51 PM, Claudio Freire  wrote:
> On Fri, Sep 9, 2016 at 8:13 AM, Heikki Linnakangas  wrote:
>>
>> Claudio, if you could also repeat the tests you ran on Peter's patch set on
>> the other thread, with these patches, that'd be nice. These patches are
>> effectively a replacement for
>> 0002-Use-tuplesort-batch-memory-for-randomAccess-sorts.patch. And review
>> would be much appreciated too, of course.
>>
>> Attached are new versions. Compared to last set, they contain a few comment
>> fixes, and a change to the 2nd patch to not allocate tape buffers for tapes
>> that were completely unused.
>
>
> Will do so

It seems both 1 and 1+2 break make check.

Did I misunderstand something? I'm applying the first patch of Peter's
series (cap number of tapes), then your first one (remove prefetch)
and second one (use larger read buffers).

Peter's patch needs some rebasing on top of those but nothing major.




Re: [HACKERS] Tuplesort merge pre-reading

2016-09-12 Thread Claudio Freire
On Sun, Sep 11, 2016 at 12:47 PM, Heikki Linnakangas  wrote:
> Here's a new version of these patches, rebased over current master. I
> squashed the two patches into one, there's not much point to keep them
> separate.


I don't know what was up with the other ones, but this one works fine.

Benchmarking now.




Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-13 Thread Claudio Freire
On Tue, Sep 13, 2016 at 3:51 PM, Robert Haas  wrote:
> On Fri, Sep 9, 2016 at 3:04 AM, Masahiko Sawada  wrote:
>> Attached PoC patch changes the representation of dead tuple locations
>> to the hashmap having tuple bitmap.
>> The one hashmap entry consists of the block number and the TID bitmap
>> of corresponding block, and the block number is the hash key of
>> hashmap.
>> Current implementation of this patch is not smart yet because each
>> hashmap entry allocates the tuple bitmap with fixed
>> size(LAZY_ALLOC_TUPLES), so each hashentry can store up to
>> LAZY_ALLOC_TUPLES(291 if block size is 8kB) tuples.
>> In case where one block can store only the several tens tuples, the
>> most bits are would be waste.
>>
>> After improved this patch as you suggested, I will measure performance 
>> benefit.
>
> We also need to consider the amount of memory gets used.  What I
> proposed - replacing the array with an array of arrays - would not
> increase memory utilization significantly.  I don't think it would
> have much impact on CPU utilization either.

I've finished writing that patch; I'm in the process of testing its CPU impact.

First test seemed to hint at a 40% increase in CPU usage, which seems
rather steep compared to what I expected, so I'm trying to rule out
some methodology error here.

> It would require
> replacing the call to bsearch() in lazy_heap_reaptid() with an
> open-coded implementation of bsearch, or with one bsearch to find the
> chunk and another to find the TID within the chunk, but that shouldn't
> be very expensive.

I did a linear search to find the chunk, with exponentially growing
chunks, and then a bsearch to find the item inside the chunk.

With the typical number of segments and given the 12GB limit, the
segment array size is well within the range that favors linear search.

> For example, we could keep a bitmap with one bit per K
> pages.  If the bit is set, there is at least 1 dead tuple on that
> page; if clear, there are none.  When we see an index tuple, we
> consult the bitmap to determine whether we need to search the TID
> list.  We select K to be the smallest power of 2 such that the bitmap
> uses less memory than some threshold, perhaps 64kB.

I've been pondering something like that, but that's an optimization
that's quite orthogonal to the multiarray stuff.

>  Assuming that
> updates and deletes to the table have some locality, we should be able
> to skip a large percentage of the TID searches with a probe into this
> very compact bitmap.

I don't think you can assume locality.
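
For concreteness, the coarse filter quoted above might look like this
sketch (one bit per 2^RANGE_SHIFT pages; the names and the constant are
made up):

#include <stdint.h>

#define RANGE_SHIFT 8			/* one bit per 256 pages, as an example */

/*
 * Probe the coarse bitmap before paying for the full TID search: a 0
 * bit proves the whole page range holds no dead tuples, letting the
 * caller skip the far more expensive lookup.
 */
static inline int
range_may_have_dead_tuples(const uint8_t *bitmap, uint32_t blkno)
{
	uint32_t	bit = blkno >> RANGE_SHIFT;

	return (bitmap[bit >> 3] >> (bit & 7)) & 1;
}

Whether the probe helps then hinges exactly on the locality question:
with uniformly scattered dead tuples, most bits end up set and the
filter filters out almost nothing.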




Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-13 Thread Claudio Freire
On Tue, Sep 13, 2016 at 4:06 PM, Robert Haas  wrote:
> On Tue, Sep 13, 2016 at 2:59 PM, Claudio Freire  
> wrote:
>> I've finished writing that patch, I'm in the process of testing its CPU 
>> impact.
>>
>> First test seemed to hint at a 40% increase in CPU usage, which seems
>> rather steep compared to what I expected, so I'm trying to rule out
>> some methodology error here.
>
> Hmm, wow.  That's pretty steep.  Maybe lazy_heap_reaptid() is hotter
> than I think it is, but even if it accounts for 10% of total CPU usage
> within a vacuum, which seems like an awful lot, you'd have to make it
> 4x as expensive, which also seems like an awful lot.

IIRC perf top reported a combined 45% between lazy_heap_reaptid +
vac_cmp_itemptr (after patching).

vac_cmp_itemptr was around 15% on its own

Debug build, of course (I need the assertions and the debug symbols);
I'll retest with optimizations once the debug tests make sense.

>>> For example, we could keep a bitmap with one bit per K
>>> pages.  If the bit is set, there is at least 1 dead tuple on that
>>> page; if clear, there are none.  When we see an index tuple, we
>>> consult the bitmap to determine whether we need to search the TID
>>> list.  We select K to be the smallest power of 2 such that the bitmap
>>> uses less memory than some threshold, perhaps 64kB.
>>
>> I've been pondering something like that, but that's an optimization
>> that's quite orthogonal to the multiarray stuff.
>
> Sure, but if this really does increase CPU time, it'd be reasonable to
> do something to decrease it again in order to get the other benefits
> of this patch - i.e. increasing the maintenance_work_mem limit while
> reducing the chances that overallocation will cause OOM.

I was hoping it wouldn't regress performance so much. I'd rather
micro-optimize the multiarray implementation until it doesn't and then
think of orthogonal optimizations.

>>>  Assuming that
>>> updates and deletes to the table have some locality, we should be able
>>> to skip a large percentage of the TID searches with a probe into this
>>> very compact bitmap.
>>
>> I don't think you can assume locality.
>
> Really?  If you have a 1TB table, how many 2MB ranges of that table do
> you think will contain dead tuples for a typical vacuum?  I think most
> tables of that size are going to be mostly static, and the all-visible
> and all-frozen bits are going to be mostly set.  You *could* have
> something like a pgbench-type workload that does scattered updates
> across the entire table, but that's going to perform pretty poorly
> because you'll constantly be updating blocks that have to be pulled in
> from disk.

I have a few dozen of those in my biggest database. They do updates
and deletes all over the place and, even if they were few, they're
scattered almost uniformly.

Thing is, I think we really need to not worsen that case, which seems
rather common (almost any OLTP with a big enough user base, or a K-V
type of table, or TOAST tables).




Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-14 Thread Claudio Freire
On Wed, Sep 14, 2016 at 12:17 PM, Robert Haas  wrote:
>
> I am kind of doubtful about this whole line of investigation because
> we're basically trying pretty hard to fix something that I'm not sure
> is broken. I do agree that, all other things being equal, the TID
> lookups will probably be faster with a bitmap than with a binary
> search, but maybe not if the table is large and the number of dead
> TIDs is small, because cache efficiency is pretty important.  But even
> if it's always faster, does TID lookup speed even really matter to
> overall VACUUM performance? Claudio's early results suggest that it
> might, but maybe that's just a question of some optimization that
> hasn't been done yet.

FYI, the reported impact was on CPU time, not runtime. There was no
significant difference in runtime (real time), because my test is
heavily I/O bound.

I tested with a few small tables and there was no significant
difference either, but small tables don't stress the array lookup
anyway so that's expected.

But on the assumption that some systems may be CPU bound during vacuum
(particularly those able to do more than 300-400MB/s sequential I/O),
in those cases the increased or decreased cost of lazy_tid_reaped will
directly correlate to runtime. It's just none of my systems, which all
run on Amazon and are heavily bandwidth constrained (the fastest I/O
subsystem I can get my hands on does 200MB/s).




Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-14 Thread Claudio Freire
On Wed, Sep 14, 2016 at 12:17 PM, Robert Haas  wrote:
> For instance, one idea to grow memory usage incrementally would be to
> store dead tuple information separately for each 1GB segment of the
> relation.  So we have an array of dead-tuple-representation objects,
> one for every 1GB of the relation.  If there are no dead tuples in a
> given 1GB segment, then this pointer can just be NULL.  Otherwise, it
> can point to either the bitmap representation (which will take ~4.5MB)
> or it can point to an array of TIDs (which will take 6 bytes/TID).
> That could handle an awfully wide variety of usage patterns
> efficiently; it's basically never worse than what we're doing today,
> and when the dead tuple density is high for any portion of the
> relation it's a lot better.

If you compress the list into a bitmap a posteriori, you know the
number of tuples per page, so you could encode the bitmap even more
efficiently.

It's not a bad idea, one that can be slapped on top of the multiarray
patch - when closing a segment, it can be decided whether to turn it
into a bitmap or not.
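
For a rough sense of the break-even point, using figures quoted earlier
in this thread (6 bytes per ItemPointerData, MaxHeapTuplesPerPage = 291,
hence ceil(291 / 8) = 37 bytes of bitmap per page): the bitmap wins once
a segment averages more than about 6 dead tuples per page. A sketch of
the per-segment decision (hypothetical, constants hard-coded for
clarity):

static int
bitmap_is_smaller(unsigned long npages, unsigned long ndead)
{
	unsigned long array_bytes = ndead * 6;	/* sizeof(ItemPointerData) */
	unsigned long bitmap_bytes = npages * ((291 + 7) / 8);

	/* true when the bitmap encoding uses less memory than the array */
	return bitmap_bytes < array_bytes;
}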




Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-15 Thread Claudio Freire
On Thu, Sep 15, 2016 at 12:50 PM, Tomas Vondra
 wrote:
> On 09/14/2016 07:57 PM, Tom Lane wrote:
>>
>> Pavan Deolasee  writes:
>>>
>>> On Wed, Sep 14, 2016 at 10:53 PM, Alvaro Herrera
>>> 
>>> wrote:
>>>>
>>>> One thing not quite clear to me is how do we create the bitmap
>>>> representation starting from the array representation in midflight
>>>> without using twice as much memory transiently.  Are we going to write
>>>> the array to a temp file, free the array memory, then fill the bitmap by
>>>> reading the array from disk?
>>
>>
>>> We could do that.
>>
>>
>> People who are vacuuming because they are out of disk space will be very
>> very unhappy with that solution.
>
>
> The people are usually running out of space for data, while these files
> would be temporary files placed wherever temp_tablespaces points to. I'd
> argue if this is a source of problems, the people are already in deep
> trouble due to sorts, CREATE INDEX, ... as those commands may also generate
> a lot of temporary files.

One would not expect "CREATE INDEX" to succeed when space is tight,
but VACUUM is quite the opposite.

Still, temporary storage could be used if available, and gracefully
fall back to some other technique (like not using bitmaps) when not.

Not sure it's worth the trouble, though.


On Wed, Sep 14, 2016 at 12:24 PM, Claudio Freire  wrote:
> On Wed, Sep 14, 2016 at 12:17 PM, Robert Haas  wrote:
>>
>> I am kind of doubtful about this whole line of investigation because
>> we're basically trying pretty hard to fix something that I'm not sure
>> is broken. I do agree that, all other things being equal, the TID
>> lookups will probably be faster with a bitmap than with a binary
>> search, but maybe not if the table is large and the number of dead
>> TIDs is small, because cache efficiency is pretty important.  But even
>> if it's always faster, does TID lookup speed even really matter to
>> overall VACUUM performance? Claudio's early results suggest that it
>> might, but maybe that's just a question of some optimization that
>> hasn't been done yet.
>
> FYI, the reported impact was on CPU time, not runtime. There was no
> significant difference in runtime (real time), because my test is
> heavily I/O bound.
>
> I tested with a few small tables and there was no significant
> difference either, but small tables don't stress the array lookup
> anyway so that's expected.
>
> But on the assumption that some systems may be CPU bound during vacuum
> (particularly those able to do more than 300-400MB/s sequential I/O),
> in those cases the increased or decreased cost of lazy_tid_reaped will
> directly correlate to runtime. It's just none of my systems, which all
> run on Amazon and are heavily bandwidth constrained (the fastest I/O
> subsystem I can get my hands on does 200MB/s).

Attached is the patch with the multiarray version.

The tests are weird. Comparing best cases over several runs, to remove
the impact of concurrent activity on this host (I couldn't remove all
background activity even when running the tests overnight; the distro
adds tons of crons and background cleanup tasks, it would seem),
there's only very mild CPU impact. I'd say it's insignificant, as it's
well below the mean variance.

Worst case:

DETAIL:  CPU 9.90s/80.94u sec elapsed 1232.42 sec.

Best case:

DETAIL:  CPU 12.10s/63.82u sec elapsed 832.79 sec.

There seems to be more variance with the multiarray approach than the
single array one, but I could not figure out why. Even I/O seems less
stable:

Worst case:

INFO:  "pgbench_accounts": removed 4 row versions in 6557382 pages
DETAIL:  CPU 64.31s/37.60u sec elapsed 2573.88 sec.

Best case:

INFO:  "pgbench_accounts": removed 4 row versions in 6557378 pages
DETAIL:  CPU 54.48s/31.78u sec elapsed 1552.18 sec.

Since this test takes several hours to complete, I could only do a
few runs of each version, so the statistical significance of the test
isn't great.

I'll try to compare with smaller pgbench scale numbers and more runs
over the weekend (gotta script that). It's certainly puzzling; I
cannot explain the increased variance, especially in I/O, since the
I/O should be exactly the same. I'm betting it's my system that's
unpredictable somehow. So I'm posting the patch in case someone gets
inspired and can spot the reason, and because there's been a lot of
talk about this very same approach, so I thought I'd better post the
code ;)

I'll also try to get a more predictable system to run the tests on.

Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-09-15 Thread Claudio Freire
On Thu, Sep 15, 2016 at 3:48 PM, Tomas Vondra
 wrote:
> For example, we always allocate the TID array as large as we can fit into
> m_w_m, but maybe we don't need to wait with switching to the bitmap until
> filling the whole array - we could wait as long as the bitmap fits into the
> remaining part of the array, build it there and then copy it to the
> beginning (and use the bitmap from that point).

The bitmap can be created like that, but growing from the end of the
segment backwards.

So the scan can proceed until the bitmap fills the whole segment
(filling backwards), with no copy required.

I'll try that later, but first I'd like to get the multiarray
approach right, since that's the basis of it.




Re: [HACKERS] Tuplesort merge pre-reading

2016-09-20 Thread Claudio Freire
On Fri, Sep 9, 2016 at 9:51 PM, Claudio Freire  wrote:
> On Fri, Sep 9, 2016 at 8:13 AM, Heikki Linnakangas  wrote:
>>
>> Claudio, if you could also repeat the tests you ran on Peter's patch set on
>> the other thread, with these patches, that'd be nice. These patches are
>> effectively a replacement for
>> 0002-Use-tuplesort-batch-memory-for-randomAccess-sorts.patch. And review
>> would be much appreciated too, of course.
>>
>> Attached are new versions. Compared to last set, they contain a few comment
>> fixes, and a change to the 2nd patch to not allocate tape buffers for tapes
>> that were completely unused.
>
>
> Will do so

Well, here they are, the results.

ODS format only (unless you've got issues opening the ODS).

The results seem all over the map. Some regressions seem significant
(both in the amount of performance lost and their significance, since
all 4 runs show a similar regression). The worst is "CREATE INDEX
ix_lotsofitext_zz2ijw ON lotsofitext (z, z2, i, j, w);" with 4GB
work_mem, which should be an in-memory sort, which makes it odd.

I will re-run it overnight just in case to confirm the outcome.


logtape_preload_timings.ods
Description: application/vnd.oasis.opendocument.spreadsheet



Re: [HACKERS] Tuplesort merge pre-reading

2016-09-21 Thread Claudio Freire
On Tue, Sep 20, 2016 at 3:34 PM, Claudio Freire  wrote:
> On Fri, Sep 9, 2016 at 9:51 PM, Claudio Freire  wrote:
>> On Fri, Sep 9, 2016 at 8:13 AM, Heikki Linnakangas  wrote:
>>>
>>> Claudio, if you could also repeat the tests you ran on Peter's patch set on
>>> the other thread, with these patches, that'd be nice. These patches are
>>> effectively a replacement for
>>> 0002-Use-tuplesort-batch-memory-for-randomAccess-sorts.patch. And review
>>> would be much appreciated too, of course.
>>>
>>> Attached are new versions. Compared to last set, they contain a few comment
>>> fixes, and a change to the 2nd patch to not allocate tape buffers for tapes
>>> that were completely unused.
>>
>>
>> Will do so
>
> Well, here they are, the results.
>
> ODS format only (unless you've got issues opening the ODS).
>
> The results seem all over the map. Some regressions seem significant
> (both in the amount of performance lost and their significance, since
> all 4 runs show a similar regression). The worst is "CREATE INDEX
> ix_lotsofitext_zz2ijw ON lotsofitext (z, z2, i, j, w);" with 4GB
> work_mem, which should be an in-memory sort, which makes it odd.
>
> I will re-run it overnight just in case to confirm the outcome.

A new run for "patched" gives better results; it seems it was some
kind of glitch in the run (maybe some cron decided to do something
while running those queries).

Attached

In essence, it doesn't look like it's harmfully affecting CPU
efficiency. Results seem neutral on the CPU front.


logtape_preload_timings.ods
Description: application/vnd.oasis.opendocument.spreadsheet

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Tuplesort merge pre-reading

2016-09-22 Thread Claudio Freire
On Thu, Sep 22, 2016 at 4:17 AM, Heikki Linnakangas  wrote:
> On 09/22/2016 03:40 AM, Claudio Freire wrote:
>>
>> On Tue, Sep 20, 2016 at 3:34 PM, Claudio Freire 
>> wrote:
>>>
>>> The results seem all over the map. Some regressions seem significant
>>> (both in the amount of performance lost and their significance, since
>>> all 4 runs show a similar regression). The worst being "CREATE INDEX
>>> ix_lotsofitext_zz2ijw ON lotsofitext (z, z2, i, j, w);" with 4GB
>>> work_mem, which should be an in-memory sort, which makes it odd.
>>>
>>> I will re-run it overnight just in case to confirm the outcome.
>>
>>
>> A new run for "patched" gives better results, it seems it was some
>> kind of glitch in the run (maybe some cron decided to do something
>> while running those queries).
>>
>> Attached
>>
>> In essence, it doesn't look like it's harmfully affecting CPU
>> efficiency. Results seem neutral on the CPU front.
>
>
> Looking at the spreadsheet, there is a 40% slowdown in the "slow" "CREATE
> INDEX ix_lotsofitext_zz2ijw ON lotsofitext (z, z2, i, j, w);" test with 4GB
> of work_mem. I can't reproduce that on my laptop, though. Got any clue
> what's going on there?

It's not present in other runs, so I think it's a fluke the
spreadsheet isn't filtering out. Especially considering that one
should be a fully in-memory fast sort and thus unaffected by the
current patch (z and z2 being integers, IIRC, most comparisons should
be about comparing the first columns and thus rarely involve the big
strings).

I'll try to confirm that's the case though.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Small improvement to compactify_tuples

2017-11-03 Thread Claudio Freire
On Thu, Nov 2, 2017 at 11:46 PM, Tom Lane  wrote:
> Sokolov Yura  writes:
>> [ 0001-Improve-compactify_tuples.patch, v5 or thereabouts ]
>
> I started to review this patch.  I spent a fair amount of time on
> beautifying the code, because I found it rather ugly and drastically
> undercommented.  Once I had it to the point where it seemed readable,
> I went to check the shellsort algorithm against Wikipedia's entry,
> and found that this appears to be an incorrect implementation of
> shellsort: where pg_shell_sort_pass has
>
> for (_i = off; _i < _n; _i += off) \
>
> it seems to me that we need to have
>
> for (_i = off; _i < _n; _i += 1) \
>
> or maybe just _i++.  As-is, this isn't h-sorting the whole file,
> but just the subset of entries that have multiple-of-h indexes
> (ie, the first of the h distinct subfiles that should get sorted).
> The bug is masked by the final pass of plain insertion sort, but
> we are not getting the benefit we should get from the earlier passes.
>
> However, I'm a bit dubious that it's worth fixing that; instead
> my inclination would be to rip out the shellsort implementation
> entirely.  The code is only using it for the nitems <= 48 case
> (which makes the first three offset steps certainly no-ops) and
> I am really unconvinced that it's worth expending the code space
> for a shellsort rather than plain insertion sort in that case,
> especially when we have good reason to think that the input data
> is nearly sorted.

I actually noticed that and benchmarked some variants. None of them
made any noticeable difference in performance, so I decided not
to complain about them.

I guess the same case can be made for removing the shell sort.
So I'm inclined to agree.
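
For reference, the fix Tom describes amounts to advancing by 1 so that
every h-spaced subfile gets sorted in each pass. A standalone sketch in
plain C (not the patch's macro formulation):

/* One shell sort pass with gap "h": note i advances by 1, not by h. */
static void
shell_sort_pass(int *a, int n, int h)
{
	for (int i = h; i < n; i++)
	{
		int		val = a[i];
		int		j;

		/* insertion-sort a[i] into its h-spaced subfile */
		for (j = i; j >= h && a[j - h] > val; j -= h)
			a[j] = a[j - h];
		a[j] = val;
	}
}

With h = 1 this degenerates into plain insertion sort, which is why the
final pass masked the bug.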

> BTW, the originally given test case shows no measurable improvement
> on my box.

I did manage to reproduce the original test and got a consistent improvement.

> I was eventually able to convince myself by profiling
> that the patch makes us spend less time in compactify_tuples, but
> this test case isn't a very convincing one.
>
> So, quite aside from the bug, I'm not excited about committing the
> attached as-is.  I think we should remove pg_shell_sort and just
> use pg_insertion_sort.  If somebody can show a test case that
> provides a measurable speed improvement from the extra code,
> I could be persuaded to reconsider.

My tests modifying the shell sort didn't produce any measurable
difference, but I didn't test removing it altogether.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Small improvement to compactify_tuples

2017-11-03 Thread Claudio Freire
On Fri, Nov 3, 2017 at 4:30 PM, Tom Lane  wrote:
> Claudio Freire  writes:
>> On Thu, Nov 2, 2017 at 11:46 PM, Tom Lane  wrote:
>>> BTW, the originally given test case shows no measurable improvement
>>> on my box.
>
>> I did manage to reproduce the original test and got a consistent improvement.
>
> This gave me the idea to memcpy the page into some workspace and then use
> memcpy, not memmove, to put the tuples back into the caller's copy of the
> page.  That gave me about a 50% improvement in observed TPS, and a perf
> profile like this:
>
> +   38.50%38.40%299520  postmaster   postgres 
>   [.] compactify_tuples
> +   31.11%31.02%241975  postmaster   libc-2.12.so 
>   [.] memcpy
> +8.74% 8.72% 68022  postmaster   postgres 
>   [.] itemoffcompare
> +6.51% 6.49% 50625  postmaster   postgres 
>   [.] compactify_tuples_loop
> +4.21% 4.19% 32719  postmaster   postgres 
>   [.] pg_qsort
> +1.70% 1.69% 13213  postmaster   postgres 
>   [.] memcpy@plt
>
> There still doesn't seem to be any point in replacing the qsort,
> but it does seem like something like the second attached patch
> might be worth doing.
>
> So I'm now wondering why my results seem to be so much different
> from those of other people who have tried this, both as to whether
> compactify_tuples is worth working on at all and as to what needs
> to be done to it if so.  Thoughts?
>
> regards, tom lane
>

I'm going to venture a guess that the versions of gcc and libc, and
the build options used for both libc (ie: the distro's) and postgres,
may play a part here.

I'm running with glibc 2.22, for instance, and building with gcc 4.8.5.

I will try and benchmark memcpy vs memmove and see what the
performance difference is there with my versions, too. This may
heavily depend on compiler optimizations that may vary between
versions, since memcpy/memmove tend to be inlined a lot.
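
For what it's worth, the scratch-buffer variant Tom describes comes
down to something like this. Types and names are simplified stand-ins,
not the actual compactify_tuples() code:

#include <string.h>

#define PAGE_SIZE 8192

/* simplified stand-in for an item's old and new location on the page */
typedef struct
{
	unsigned short old_off;		/* current offset of the tuple data */
	unsigned short len;			/* tuple length */
	unsigned short new_off;		/* target offset after compaction */
} ItemMove;

/*
 * Copy the page once into a local scratch buffer, then place each
 * tuple with memcpy. Unlike the in-place approach, source and
 * destination never overlap, so the non-overlapping (memcpy) fast
 * path can always be used.
 */
static void
compactify_with_scratch(char *page, const ItemMove *moves, int nmoves)
{
	char		scratch[PAGE_SIZE];

	memcpy(scratch, page, PAGE_SIZE);
	for (int i = 0; i < nmoves; i++)
		memcpy(page + moves[i].new_off,
			   scratch + moves[i].old_off,
			   moves[i].len);
}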


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Small improvement to compactify_tuples

2017-11-05 Thread Claudio Freire
On Sat, Nov 4, 2017 at 8:07 PM, Юрий Соколов  wrote:
> 2017-11-03 5:46 GMT+03:00 Tom Lane :
>>
>> Sokolov Yura  writes:
>> > [ 0001-Improve-compactify_tuples.patch, v5 or thereabouts ]
>>
>> I went to check the shellsort algorithm against Wikipedia's entry,
>> and found that this appears to be an incorrect implementation of
>> shellsort: where pg_shell_sort_pass has
>>
>> for (_i = off; _i < _n; _i += off) \
>>
>> it seems to me that we need to have
>>
>> for (_i = off; _i < _n; _i += 1) \
>>
>> or maybe just _i++.
>
>
> Shame on me :-(
> I've written shell sort several times, so I forgot to recheck myself once
> again.
> And it looks like the best gap sequence from Wikipedia really is the best
> ({301, 132, 57, 23, 10, 4} in my notation),
>
>
> 2017-11-03 17:37 GMT+03:00 Claudio Freire :
>> On Thu, Nov 2, 2017 at 11:46 PM, Tom Lane  wrote:
>>> BTW, the originally given test case shows no measurable improvement
>>> on my box.
>>
>> I did manage to reproduce the original test and got a consistent
>> improvement.
>
> I've rechecked myself using my benchmark.
> Without memmove, compactify_tuples consumes:
> - with qsort 11.66% cpu (pg_qsort + med3 + swapfunc + itemoffcompare +
> compactify_tuples = 5.97 + 0.51 + 2.87 + 1.88 + 0.44)
> - with just insertion sort 6.65% cpu (sort is inlined, itemoffcompare also
> inlined, so the whole is compactify_tuples)
> - with just shell sort 5.98% cpu (sort is inlined again)
> - with bucket sort 1.76% cpu (sort_itemIds + compactify_tuples = 1.30 +
> 0.46)

Is that just insertion sort without bucket sort?

Because I think shell sort has little impact in your original patch
because it's rarely exercised. With bucket sort, most buckets are very
small, too small for shell sort to do any useful work.

That's why I'm inclined to agree with Tom in that we could safely
simplify it out, remove it, without much impact.

Maybe leave a fallback to qsort if some corner case produces big buckets?
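
For clarity, the bucket sort being discussed is roughly the following,
with made-up names and bare offsets instead of the patch's itemId
arrays:

#include <stdlib.h>
#include <string.h>

#define NBUCKETS 256			/* 8kB page / 32 bytes per bucket */

/*
 * Bucket-sort page offsets (0..8191) by their 32-byte bucket in one
 * counting pass, then finish with insertion sort: after bucketing the
 * array is nearly sorted, so the final pass does very little work.
 */
static void
bucket_sort_offsets(unsigned short *off, int n)
{
	int			count[NBUCKETS + 1] = {0};
	unsigned short *tmp = malloc(n * sizeof(*tmp));
	int			i;

	/* counting pass: bucket = offset / 32 */
	for (i = 0; i < n; i++)
		count[(off[i] >> 5) + 1]++;
	for (i = 0; i < NBUCKETS; i++)
		count[i + 1] += count[i];
	for (i = 0; i < n; i++)
		tmp[count[off[i] >> 5]++] = off[i];

	/* buckets hold only an item or two here, so this is nearly free */
	for (i = 1; i < n; i++)
	{
		unsigned short val = tmp[i];
		int			j = i;

		while (j > 0 && tmp[j - 1] > val)
		{
			tmp[j] = tmp[j - 1];
			j--;
		}
		tmp[j] = val;
	}
	memcpy(off, tmp, n * sizeof(*tmp));
	free(tmp);
}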


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Small improvement to compactify_tuples

2017-11-06 Thread Claudio Freire
On Mon, Nov 6, 2017 at 11:50 AM, Юрий Соколов  wrote:
>> Maybe leave a fallback to qsort if some corner case produces big buckets?
>
> For 8kb pages, each bucket covers 32 bytes. So, for heap pages it is at
> most 1 heap tuple per bucket, and for index pages it is at most 2 index
> tuples per bucket. For 32kb pages it is 4 heap tuples and 8 index tuples
> per bucket.
> It would be unnecessary overhead to call the non-inlineable qsort in these cases.
>
> So, I think, shell sort could be removed, but insertion sort has to remain.
>
> I'd prefer shell sort to remain as well. It could be useful in other
> places too, because it is easily inlinable, and provides performance
> comparable to qsort for up to several hundred elements.

I'd rather have an inlineable qsort.

And I'd recommend doing that when there is a need, and I don't think
this patch really needs it, since bucket sort handles most cases
anyway.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Small improvement to compactify_tuples

2017-11-06 Thread Claudio Freire
On Mon, Nov 6, 2017 at 6:58 PM, Юрий Соколов  wrote:
>
> 2017-11-06 17:55 GMT+03:00 Claudio Freire :
>>
>> On Mon, Nov 6, 2017 at 11:50 AM, Юрий Соколов 
>> wrote:
>> >> Maybe leave a fallback to qsort if some corner case produces big
>> >> buckets?
>> >
>> > For 8kb pages, each bucket is per 32 bytes. So, for heap pages it is at
>> > most 1 heap-tuple per bucket, and for index pages it is at most 2 index
>> > tuples per bucket. For 32kb pages it is 4 heap-tuples and 8 index-tuples
>> > per bucket.
>> > It will be unnecessary overhead to call non-inlineable qsort in this
>> > cases
>> >
>> > So, I think, shell sort could be removed, but insertion sort have to
>> > remain.
>> >
>> > I'd prefer shell sort to remain also. It could be useful in other places
>> > also,
>> > because it is easily inlinable, and provides comparable to qsort
>> > performance
>> > up to several hundreds of elements.
>>
>> I'd rather have an inlineable qsort.
>
> But qsort is recursive. It is quite hard to make it inlineable. And still
> it will be much heavier than insertion sort (btw, all qsort
> implementations use insertion sort for small arrays). And it will be
> heavier than shell sort for small arrays.

I haven't seen this trick used in postgres, nor do I know whether it
would be well received, so this is more like throwing an idea to see
if it sticks...

But a way to do this without macros is to have an includable
"template" algorithm that simply doesn't define the comparison
function/type, it rather assumes it:

qsort_template.h

/* two-level concat so QSORT_SUFFIX expands before being pasted */
#define QSORT_CONCAT_(a, b) a##b
#define QSORT_CONCAT(a, b) QSORT_CONCAT_(a, b)
#define QSORT_NAME QSORT_CONCAT(qsort_, QSORT_SUFFIX)

static void QSORT_NAME(ELEM_TYPE *arr, size_t num_elems)
{
... if (ELEM_LESS(arr[a], arr[b]))
...
}

#undef QSORT_NAME

Then, in "offset_qsort.h":

#define QSORT_SUFFIX offset
#define ELEM_TYPE offset
#define ELEM_LESS(a,b) ((a) < (b))

#include "qsort_template.h"

#undef QSORT_SUFFIX
#undef ELEM_TYPE
#undef ELEM_LESS

Now, I realize this may have its cons, but it does simplify
maintenance of type-specific or parameterized variants of
performance-critical functions.

> I can do specialized qsort for this case. But it will be larger bunch of
> code, than
> shell sort.
>
>> And I'd recommend doing that when there is a need, and I don't think
>> this patch really needs it, since bucket sort handles most cases
>> anyway.
>
> And it still needs insertion sort for buckets.
> I can agree to get rid of shell sort. But insertion sort is necessary.

I didn't suggest getting rid of insertion sort. But the trick above is
equally applicable to insertion sort.
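
To illustrate, the same include-template trick applied to insertion
sort could look like this; the file and macro names are hypothetical:

/*
 * insertion_sort_template.h -- hypothetical template header.
 * The including file must provide <stddef.h> plus the ISORT_SUFFIX,
 * ELEM_TYPE and ELEM_LESS macros.
 */
#define ISORT_CONCAT_(a, b) a##b
#define ISORT_CONCAT(a, b) ISORT_CONCAT_(a, b)
#define ISORT_NAME ISORT_CONCAT(insertion_sort_, ISORT_SUFFIX)

static inline void
ISORT_NAME(ELEM_TYPE *arr, size_t n)
{
	for (size_t i = 1; i < n; i++)
	{
		ELEM_TYPE	val = arr[i];
		size_t		j = i;

		while (j > 0 && ELEM_LESS(val, arr[j - 1]))
		{
			arr[j] = arr[j - 1];
			j--;
		}
		arr[j] = val;
	}
}

#undef ISORT_NAME
#undef ISORT_CONCAT
#undef ISORT_CONCAT_

And then, in some .c file:

#include <stddef.h>

#define ISORT_SUFFIX offset
#define ELEM_TYPE unsigned short
#define ELEM_LESS(a, b) ((a) < (b))
#include "insertion_sort_template.h"
#undef ISORT_SUFFIX
#undef ELEM_TYPE
#undef ELEM_LESS

/* ... later: insertion_sort_offset(offsets, noffsets); ... */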


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Small improvement to compactify_tuples

2017-11-07 Thread Claudio Freire
On Mon, Nov 6, 2017 at 9:08 PM, Юрий Соколов  wrote:
> 2017-11-07 1:14 GMT+03:00 Claudio Freire :
>>
>> I haven't seen this trick used in postgres, nor do I know whether it
>> would be well received, so this is more like throwing an idea to see
>> if it sticks...
>>
>> But a way to do this without macros is to have an includable
>> "template" algorithm that simply doesn't define the comparison
>> function/type, it rather assumes it:
>>
>> qsort_template.h
>>
>> #define QSORT_CONCAT_(a, b) a##b
>> #define QSORT_CONCAT(a, b) QSORT_CONCAT_(a, b)
>> #define QSORT_NAME QSORT_CONCAT(qsort_, QSORT_SUFFIX)
>>
>> static void QSORT_NAME(ELEM_TYPE *arr, size_t num_elems)
>> {
>> ... if (ELEM_LESS(arr[a], arr[b]))
>> ...
>> }
>>
>> #undef QSORT_NAME
>>
>> Then, in "offset_qsort.h":
>>
>> #define QSORT_SUFFIX offset
>> #define ELEM_TYPE offset
>> #define ELEM_LESS(a,b) ((a) < (b))
>>
>> #include "qsort_template.h"
>>
>> #undef QSORT_SUFFIX
>> #undef ELEM_TYPE
>> #undef ELEM_LESS
>>
>> Now, I realize this may have its cons, but it does simplify
>> maintainance of type-specific or parameterized variants of
>> performance-critical functions.
>>
>> > I can do specialized qsort for this case. But it will be larger bunch of
>> > code, than
>> > shell sort.
>> >
>> >> And I'd recommend doing that when there is a need, and I don't think
>> >> this patch really needs it, since bucket sort handles most cases
>> >> anyway.
>> >
>> > And it still needs insertion sort for buckets.
>> > I can agree to get rid of shell sort. But insertion sort is necessary.
>>
>> I didn't suggest getting rid of insertion sort. But the trick above is
>> equally applicable to insertion sort.
>
> This trick is used in simplehash.h. I agree, it could be useful for qsort.
> This will not make qsort inlineable, but it will reduce overhead a lot.
>
> This trick is too heavy-weight for insertion sort alone, though. Without
> shellsort, insertion sort can be expressed in a 14-line macro (8 lines
> without curly braces). But if insertion sort is defined together with
> qsort (because qsort still needs it), then it is justifiable.

What do you mean by heavy-weight?

Aside from requiring all that include magic, if you place specialized
sort functions in a reusable header, using it is as simple as
including the type-specific header (or declaring the type macros and
including the template), and using them as regular functions. There's
no runtime overhead involved, especially if you declare the comparison
function as a macro or a static inline function. The sort itself can
be declared static inline as well, and the compiler will decide
whether it's worth inlining.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Small improvement to compactify_tuples

2017-11-07 Thread Claudio Freire
On Tue, Nov 7, 2017 at 11:42 AM, Юрий Соколов  wrote:
>
>
> 2017-11-07 17:15 GMT+03:00 Claudio Freire :
>> Aside from requiring all that include magic, if you place specialized
>> sort functions in a reusable header, using it is as simple as
>> including the type-specific header (or declaring the type macros and
>> including the template), and using them as regular functions. There's
>> no runtime overhead involved, especially if you declare the comparison
>> function as a macro or a static inline function. The sort itself can
>> be declared static inline as well, and the compiler will decide
>> whether it's worth inlining.
>
> Ok, if no one complains about yet another qsort implementation,
> I will add a template header for qsort. Since qsort needs insertion sort,
> it will be in the same file.
> Do you approve of this?
>
> With regards,
> Sokolov Yura

If you need it. I'm not particularly fond of writing code before it's needed.

If you can measure the impact for that particular case where qsort
might be needed, and it's a real-world case, then by all means.

Otherwise, if it's a rarely-encountered corner case, I'd recommend
simply calling the stdlib's qsort.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Small improvement to compactify_tuples

2017-11-08 Thread Claudio Freire
On Wed, Nov 8, 2017 at 12:33 PM, Tom Lane  wrote:
> Robert Haas  writes:
>> On Tue, Nov 7, 2017 at 4:39 PM, Tom Lane  wrote:
>>> What I'm getting from the standard pgbench measurements, on both machines,
>>> is that this patch might be a couple percent slower than HEAD, but that is
>>> barely above the noise floor so I'm not too sure about it.
>
>> Hmm.  It seems like slowing down single client performance by a couple
>> of percent is something that we really don't want to do.
>
> I do not think there is any change here that can be proven to always be a
> win.  Certainly the original patch, which proposes to replace an O(n log n)
> sort algorithm with an O(n^2) one, should not be thought to be that.
> The question to focus on is what's the average case, and I'm not sure how
> to decide what the average case is.  But more than two test scenarios
> would be a good start.
>
> regards, tom lane

Leaving the overall algorithm unchanged and replacing qsort with an
inlineable, type-specific one should be a net win in all cases.

Doing bucket sort with a qsort of large buckets (or small tuple
arrays) should also be a net win in all cases.

Whether to use shell sort might not seem as clear-cut, but let's not
forget the original patch only uses it on very small arrays, and very
infrequently at that.

What's perhaps not clear is whether there are better ideas, like
rebuilding the page as Tom proposes, which doesn't seem like a bad
idea. Bucket sort is already O(bytes), just like memcpy, only with a
lower constant factor (it's bytes/256 in the original patch), which
might make copying the whole page an extra time lose against bucket
sort in a few cases.

Deciding that last point does need more benchmarking. That doesn't
mean the other improvements can't be pursued in the meanwhile, right?


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Logical tape pause/resume

2016-10-04 Thread Claudio Freire
On Tue, Oct 4, 2016 at 7:47 AM, Heikki Linnakangas  wrote:
> 1. The first patch changes the way we store the logical tapes on disk.
> Instead of using indirect blocks, HFS style, the blocks form a linked list.
> Each block has a trailer, with the block numbers of the previous and next
> block of the tape. That eliminates the indirect blocks, which simplifies the
> code quite a bit, and makes it simpler to implement the pause/resume
> functionality in the second patch. It also immediately reduces the memory
> needed for the buffers, from 3 to 1 block per tape, as we don't need to hold
> the indirect blocks in memory.

Doesn't that make prefetching more than a buffer ahead harder?
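
For context, the layout described comes down to something like the
sketch below (field names are mine, not necessarily the patch's).
It's also why I'm asking about prefetching: each block's successor
only becomes known after that block has been read, whereas indirect
blocks exposed many upcoming block numbers up front.

#define TAPE_BLOCK_SIZE 8192

/* trailer stored at the end of every tape block (sketch) */
typedef struct TapeBlockTrailer
{
	long		prev;			/* block number of previous block, or -1 */
	long		next;			/* block number of next block, or -1 */
} TapeBlockTrailer;

#define TapeBlockPayloadSize \
	(TAPE_BLOCK_SIZE - sizeof(TapeBlockTrailer))
#define TapeBlockGetTrailer(buf) \
	((TapeBlockTrailer *) ((char *) (buf) + TapeBlockPayloadSize))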


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Merge Join with an Index Optimization

2016-10-11 Thread Claudio Freire
On Sun, Sep 11, 2016 at 1:20 PM, Tom Lane  wrote:
> Michael Malis  writes:
>> As I understand it, a merge join will currently read all tuples from both
>> subqueries (besides early termination). I believe it should be possible to
>> take advantages of the indexes on one or both of the tables being read from
>> to skip a large number of tuples that would currently be read.
>
> IIUC, what you're proposing is to replace "read past some tuples in the
> index" with "re-descend the index btree".  Yes, that would win if it
> allows skipping over a large number of index entries, but it could lose
> big-time if not.  The difficulty is that I don't see how the merge join
> level could predict whether it would win for any particular advance step.
> You'd really need most of the smarts to be in the index AM.
>
> You might want to troll the archives for past discussions of index skip
> scan, which is more or less the same idea (could possibly be exactly the
> same idea, depending on how it's implemented).  I think we had arrived
> at the conclusion that re-descent from the root might be appropriate
> when transitioning to another index page, but not intra-page.  Anyway
> no one's produced a concrete patch yet.

Interesting it should be brought up.

I was considering the index skip optimization for vacuum at some
point; I might as well consider it for regular scans too if I get to
that.

Basically, instead of a simple get next, I was considering adding a
"get next skip until". The caller would be the one responsible for
sending the hint, and the index am would be free to implement the skip
in a smart way if possible.

The interface for vacuum is a bit different in practice, but
conceptually the same.
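
As a sketch of the interface I have in mind -- hypothetical names,
nothing like this exists in the AM API today:

#include <stdbool.h>

typedef struct IndexScanDescData *IndexScanDesc;	/* opaque here */
typedef struct ScanKeyData *ScanKey;				/* opaque here */

/*
 * Like "get next", but the caller promises it is not interested in
 * tuples whose keys sort below "skipuntil". A naive AM can ignore the
 * hint and step to the next tuple as usual; a smarter one can
 * re-descend from the root when the hint would skip past the current
 * leaf page.
 */
extern bool index_getnext_skipuntil(IndexScanDesc scan, ScanKey skipuntil);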


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Indirect indexes

2016-10-18 Thread Claudio Freire
On Tue, Oct 18, 2016 at 3:28 PM, Alvaro Herrera
 wrote:
> I propose we introduce the concept of "indirect indexes".  I have a toy
> implementation and before I go further with it, I'd like this assembly's
> input on the general direction.
>
> Indirect indexes are similar to regular indexes, except that instead of
> carrying a heap TID as payload, they carry the value of the table's
> primary key.  Because this is laid out on top of existing index support
> code, values indexed by the PK can only be six bytes long (the length of
> ItemPointerData); in other words, 281,474,976,710,656 rows are
> supported, which should be sufficient for most use cases.[1]


You don't need that limitation (and vacuum will be simpler) if you add
the PK as another key, akin to:

CREATE INDIRECT INDEX idx ON tab (a, b, c);

turns into

CREATE INDEX idx ON tab (a, b, c, pk);

And is queried appropriately (using an index-only scan, extracting the
PK from the index tuple, and then querying the PK index to get the
tids).

In fact, I believe that can work with all index ams supporting index-only scans.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Indirect indexes

2016-10-18 Thread Claudio Freire
On Tue, Oct 18, 2016 at 5:48 PM, Simon Riggs  wrote:
> On 18 October 2016 at 22:04, Claudio Freire  wrote:
>> On Tue, Oct 18, 2016 at 3:28 PM, Alvaro Herrera
>>  wrote:
>>> I propose we introduce the concept of "indirect indexes".  I have a toy
>>> implementation and before I go further with it, I'd like this assembly's
>>> input on the general direction.
>>>
>>> Indirect indexes are similar to regular indexes, except that instead of
>>> carrying a heap TID as payload, they carry the value of the table's
>>> primary key.  Because this is laid out on top of existing index support
>>> code, values indexed by the PK can only be six bytes long (the length of
>>> ItemPointerData); in other words, 281,474,976,710,656 rows are
>>> supported, which should be sufficient for most use cases.[1]
>>
>>
>> You don't need that limitation (and vacuum will be simpler) if you add
>> the PK as another key, akin to:
>>
>> CREATE INDIRECT INDEX idx ON tab (a, b, c);
>>
>> turns into
>>
>> CREATE INDEX idx ON tab (a, b, c, pk);
>>
>> And is queried appropriately (using an index-only scan, extracting the
>> PK from the index tuple, and then querying the PK index to get the
>> tids).
>>
>> In fact, I believe that can work with all index ams supporting index-only 
>> scans.
>
> But if you did that, an UPDATE of a b or c would cause a non-HOT
> update, so would defeat the purpose of indirect indexes.

I meant besides all the other work, omitting the tid from the index
(as only the PK matters), marking them indirect, and all that.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Indirect indexes

2016-10-19 Thread Claudio Freire
On Wed, Oct 19, 2016 at 10:21 AM, Simon Riggs  wrote:
>> Simon objected that putting the PK
>> into the index tuple would disable HOT, but I don't think that's a
>> valid objection.
>
> Just to be clear, that's not what I objected to. Claudio appeared to
> be suggesting that an indirect index is the same thing as an index
> with PK tacked onto the end, which I re-confirm is not the case since
> doing that would not provide the primary objective of indirect
> indexes.

No, I was suggesting using the storage format of those indexes.
Perhaps I wasn't clear.

CREATE INDEX could be implemented entirely as the rewrite I mention, I
believe. But everything else can't, as you say.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Indirect indexes

2016-10-19 Thread Claudio Freire
On Wed, Oct 19, 2016 at 2:04 PM, Bruce Momjian  wrote:
>> What we should ask is what is the difference between indirect indexes
>> and WARM and to what extent they overlap.
>>
>> My current understanding is that WARM won't help you if you update
>> parts of a JSON document and/or use GIN indexes, but is effective
>> without needing to add a new index type and will be faster for
>> retrieval than indirect indexes.
>>
>> So everybody please chirp in with benefits or comparisons.
>
> I am not sure we have even explored all the limits of WARM with btree
> indexes --- I haven't heard anyone talk about non-btree indexes yet.

AFAIK there's no fundamental reason why it wouldn't work for other
index ams, but it does require quite a bit of legwork to get
everything working there.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Indirect indexes

2016-10-20 Thread Claudio Freire
On Thu, Oct 20, 2016 at 12:14 PM, Petr Jelinek  wrote:
> WARM can do a WARM update 50% of the time; an indirect index can do a HOT
> update 100% of the time (provided the column is not changed). I don't see
> why we could not have both solutions.
>
> That all being said, it would be interesting to hear Álvaro's thoughts
> about which use-cases he expects indirect indexes to work better than WARM.

I'm not Alvaro, but it's quite evident that indirect indexes don't
need space on the same page to get the benefits of HOT update (even
though it wouldn't be HOT).

That's a big difference IMO.

That said, WARM isn't inherently limited to 50%, but it *is* limited
to HOT-like updates (new tuple is in the same page as the old), and
since in many cases that is a limiting factor for HOT updates, one can
expect WARM will be equally limited.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Indirect indexes

2016-10-20 Thread Claudio Freire
On Thu, Oct 20, 2016 at 12:30 PM, Pavan Deolasee
 wrote:
>
>
> On Thu, Oct 20, 2016 at 8:44 PM, Petr Jelinek  wrote:
>>
>>
>>
>> WARM can do a WARM update 50% of the time; an indirect index can do a HOT
>> update 100% of the time (provided the column is not changed). I don't see
>> why we could not have both solutions.
>>
>
> I think the reason why I restricted WARM to one update per chain also
> applies to indirect indexes. For example, if an indirect column value is
> changed from 'a' to 'b' and back to 'a', there will be two pointers from 'a'
> to the PK and AFAICS that would lead to the same duplicate scan issue.
>
> We have a design to convert WARM chains back to HOT and that will increase
> the percentage of WARM updates much beyond 50%. I was waiting for feedback
> on the basic patch before putting in more effort, but it went unnoticed
> last CF.

With indirect indexes, since you don't need to insert a tid, you can
just "insert on conflict do nothing" on the index.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Indirect indexes

2016-10-20 Thread Claudio Freire
On Thu, Oct 20, 2016 at 1:08 PM, Pavan Deolasee
 wrote:
> On Thu, Oct 20, 2016 at 9:20 PM, Claudio Freire 
> wrote:
>>
>>
>>
>> With indirect indexes, since you don't need to insert a tid, you can
>> just "insert on conflict do nothing" on the index.
>
>
> Would that work with non-unique indexes? Anyways, the point I was trying to
> make is that there are similar technical challenges and we could solve it
> for WARM as well with your work for finding conflicting <key, pk> pairs and
> then not doing inserts. My thinking currently is that it will lead to other
> challenges, especially around vacuum, but I could be wrong.

Consider this:

Have a vacuum_cycle_id field in the metapage, with one bit reserved to
whether there's a vacuum in progress.

While there is a vacuum in progress on the index, all kinds of
modifications will look up the <key, pk> entry, and store the current
vacuum_cycle_id in the unused space for the tid pointer on the index
entry. When not, only new entries will be added (with the current
vacuum cycle id). So, during vacuum, indirect indexes incur a similar
cost to that of regular indexes, but only during vacuum.

When vacuuming, allocate 1/2 maintenance_work_mem for a bloom filter,
and increase all vacuum cycle ids (on the metapage) and mark a vacuum
in progress.

Scan the heap, add <key, pk> pairs of *non-dead* tuples to the bloom
filter. That's one BF per index, sadly, but bear with me.

Then scan the indexes. <key, pk> pairs *not* in the BF that have the
*old* vacuum cycle id get removed.

Clear the vacuum in progress flag on all indexes' metapage.

The only drawback here is that maintenance_work_mem dictates the
amount of uncleanable waste left on the indexes (BF false positives).
Surely, the BF could be replaced with an accurate set rather than an
approximate one, but that could require a lot of memory if keys are
big, and a lot of scans of the indexes. The BF trick bounds the amount
of waste left while minimizing work.
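
To make the hashing concrete, a minimal sketch of the BF itself, with
made-up names, a seeded FNV-1a as a stand-in for a real hash, and a
fixed two-probe scheme:

#include <stdint.h>
#include <stddef.h>

typedef struct BloomFilter
{
	uint64_t   *words;			/* bit array, sized from half of m_w_m */
	uint64_t	nbits;
} BloomFilter;

/* seeded FNV-1a over the key bytes plus the 6-byte PK value */
static uint64_t
hash_key_pk(const void *key, size_t keylen, uint64_t pk, uint64_t seed)
{
	const unsigned char *p = key;
	uint64_t	h = 14695981039346656037ULL ^ seed;

	for (size_t i = 0; i < keylen; i++)
		h = (h ^ p[i]) * 1099511628211ULL;
	for (int i = 0; i < 6; i++)
		h = (h ^ ((pk >> (8 * i)) & 0xff)) * 1099511628211ULL;
	return h;
}

/* heap scan side: remember every live <key, pk> pair */
static void
bloom_add(BloomFilter *bf, const void *key, size_t keylen, uint64_t pk)
{
	uint64_t	h1 = hash_key_pk(key, keylen, pk, 1) % bf->nbits;
	uint64_t	h2 = hash_key_pk(key, keylen, pk, 2) % bf->nbits;

	bf->words[h1 / 64] |= UINT64_C(1) << (h1 % 64);
	bf->words[h2 / 64] |= UINT64_C(1) << (h2 % 64);
}

/* index scan side: an entry that tests negative is certainly removable */
static int
bloom_might_contain(const BloomFilter *bf,
					const void *key, size_t keylen, uint64_t pk)
{
	uint64_t	h1 = hash_key_pk(key, keylen, pk, 1) % bf->nbits;
	uint64_t	h2 = hash_key_pk(key, keylen, pk, 2) % bf->nbits;

	return ((bf->words[h1 / 64] >> (h1 % 64)) & 1) &&
		   ((bf->words[h2 / 64] >> (h2 % 64)) & 1);
}

Entries that test positive but are in fact dead are the bounded
"uncleanable waste" mentioned above.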


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] CLUSTER, reform_and_rewrite_tuple(), and parallelism

2016-10-27 Thread Claudio Freire
On Wed, Oct 26, 2016 at 9:33 PM, Peter Geoghegan  wrote:
> Besides, parallel CREATE INDEX alone will probably
> be quite effective at speeding up CLUSTER in practice, simply because
> that's often where most wall clock time is spent during a CLUSTER
> operation.

Creating all indexes in parallel could also help considerably in the
case where there are multiple indexes, and seems far simpler.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-11-17 Thread Claudio Freire
On Thu, Nov 17, 2016 at 2:34 PM, Masahiko Sawada  wrote:
> I glanced at the patches, but both patches don't follow the coding
> style of PostgreSQL.
> Please refer to [1].
>
> [1] 
> http://wiki.postgresql.org/wiki/Developer_FAQ#What.27s_the_formatting_style_used_in_PostgreSQL_source_code.3F.

I thought I had. I'll go through that list to check what I missed.

>> In the results archive, the .vacuum prefix is the patched version with
>> both patch 1 and 2, .git.ref is just patch 1 (without which the
>> truncate takes unholily long).
>
> Did you measure the performance benefit of 0001 patch by comparing
> HEAD with HEAD+0001 patch?

Not the whole test, but yes. Without the 0001 patch the backward scan
step during truncate goes between 3 and 5 times slower. I don't have
timings because the test never finished without the patch. It would
have finished, but it would have taken about a day.

>> Grepping the results a bit, picking an average run out of all runs on
>> each scale:
>>
>> Timings:
>>
>> Patched:
>>
>> s100: CPU: user: 3.21 s, system: 1.54 s, elapsed: 18.95 s.
>> s400: CPU: user: 14.03 s, system: 6.35 s, elapsed: 107.71 s.
>> s4000: CPU: user: 228.17 s, system: 108.33 s, elapsed: 3017.30 s.
>>
>> Unpatched:
>>
>> s100: CPU: user: 3.39 s, system: 1.64 s, elapsed: 18.67 s.
>> s400: CPU: user: 15.39 s, system: 7.03 s, elapsed: 114.91 s.
>> s4000: CPU: user: 282.21 s, system: 105.95 s, elapsed: 3017.28 s.
>>
>> Total I/O (in MB)
>>
>> Patched:
>>
>> s100: R:2.4 - W:5862
>> s400: R:1337.4 - W:29385.6
>> s4000: R:318631 - W:370154
>>
>> Unpatched:
>>
>> s100: R:1412.4 - W:7644.6
>> s400: R:3180.6 - W:36281.4
>> s4000: R:330683 - W:370391
>>
>>
>> So, in essence, CPU time didn't get adversely affected. If anything,
>> it got improved by about 20% on the biggest case (scale 4000).
>
> And this test case deletes all tuples in the relation and then measures
> the duration of vacuum.
> It would not have much effect in a practical use case.

Well, this patch set started because I had to do exactly that, and
realized just how inefficient vacuum was in that case.

But it doesn't mean it won't benefit more realistic use cases. Almost
any big database ends up hitting this 1GB limit because big tables
take very long to vacuum and accumulate a lot of bloat in-between
vacuums.

If you have a specific test case in mind I can try to run it.

>> While total runtime didn't change much, I believe this is only due to
>> the fact that the index is perfectly correlated (clustered?) since
>> it's a pristine index, so index scans either remove or skip full
>> pages, never leaving things half-way. A bloated index would probably
>> show a substantially different behavior, I'll try to get a script that
>> does it by running pgbench a while before the vacuum or something like
>> that.
>>
>> However, the total I/O metric already shows remarkable improvement.
>> This metric is measuring all the I/O including pgbench init, the
>> initial vacuum pgbench init does, the delete and the final vacuum. So
>> it's not just the I/O for the vacuum itself, but the whole run. We can
>> see the patched version reading a lot less (less scans over the
>> indexes will do that), and in some cases writing less too (again, less
>> index scans may be performing less redundant writes when cleaning
>> up/reclaiming index pages).
>
> What value of maintenance_work_mem did you use for this test?

4GB on both, patched and HEAD.

> Since the
> DeadTuplesSegment struct still stores an array of ItemPointerData (6 bytes)
> representing dead tuples, I supposed that the frequency of index vacuums
> does not change. But according to the test result, an index vacuum is
> invoked once and removes about 400 million rows at a time. It means that the
> vacuum stored about 2289 MB of memory during the heap vacuum. On the other
> hand, the result of the test without the 0002 patch shows that an index vacuum
> removes 178956737 rows at a time, which means 1GB of memory was used.

1GB is a hardcoded limit on HEAD for vacuum work mem. This patch
removes that limit and lets vacuum use all the memory the user gave to
vacuum.

In the test, in both cases, 4GB was used as maintenance_work_mem
value, but HEAD cannot use all the 4GB, and it will limit itself to
just 1GB.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-11-17 Thread Claudio Freire
On Thu, Nov 17, 2016 at 2:51 PM, Claudio Freire  wrote:
> On Thu, Nov 17, 2016 at 2:34 PM, Masahiko Sawada  
> wrote:
>> I glanced at the patches, but both patches don't follow the coding
>> style of PostgreSQL.
>> Please refer to [1].
>>
>> [1] 
>> http://wiki.postgresql.org/wiki/Developer_FAQ#What.27s_the_formatting_style_used_in_PostgreSQL_source_code.3F.
>
> I thought I had. I'll go through that list to check what I missed.

Attached is patch 0002 with pgindent applied over it

I don't think there's any other formatting issue, but feel free to
point out anything I missed
From b9782cef0d971de341115bd3474d192419674219 Mon Sep 17 00:00:00 2001
From: Claudio Freire 
Date: Mon, 12 Sep 2016 23:36:42 -0300
Subject: [PATCH 2/2] Vacuum: allow using more than 1GB work mem

Turn the dead_tuples array into a structure composed of several
exponentially bigger arrays, to enable usage of more than 1GB
of work mem during vacuum and thus reduce the number of full
index scans necessary to remove all dead tids when the memory is
available.
---
 src/backend/commands/vacuumlazy.c | 295 +-
 1 file changed, 230 insertions(+), 65 deletions(-)

diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 854bce3..dfb0612 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -91,6 +91,7 @@
  * tables.
  */
 #define LAZY_ALLOC_TUPLES		MaxHeapTuplesPerPage
+#define LAZY_MIN_TUPLES		Max(MaxHeapTuplesPerPage, (128<<20) / sizeof(ItemPointerData))
 
 /*
  * Before we consider skipping a page that's marked as clean in
@@ -98,6 +99,27 @@
  */
 #define SKIP_PAGES_THRESHOLD	((BlockNumber) 32)
 
+typedef struct DeadTuplesSegment
+{
+	int			num_dead_tuples;	/* # of entries in the segment */
+	int			max_dead_tuples;	/* # of entries allocated in the segment */
+	ItemPointerData last_dead_tuple;	/* Copy of the last dead tuple (unset
+		 * until the segment is fully
+		 * populated) */
+	unsigned short padding;
+	ItemPointer dead_tuples;	/* Array of dead tuples */
+}	DeadTuplesSegment;
+
+typedef struct DeadTuplesMultiArray
+{
+	int			num_entries;	/* current # of entries */
+	int			max_entries;	/* total # of slots that can be allocated in
+ * array */
+	int			num_segs;		/* number of dead tuple segments allocated */
+	int			last_seg;		/* last dead tuple segment with data (or 0) */
+	DeadTuplesSegment *dead_tuples;		/* array of num_segs segments */
+}	DeadTuplesMultiArray;
+
 typedef struct LVRelStats
 {
 	/* hasindex = true means two-pass strategy; false means one-pass */
@@ -117,14 +139,14 @@ typedef struct LVRelStats
 	BlockNumber nonempty_pages; /* actually, last nonempty page + 1 */
 	/* List of TIDs of tuples we intend to delete */
 	/* NB: this list is ordered by TID address */
-	int			num_dead_tuples;	/* current # of entries */
-	int			max_dead_tuples;	/* # slots allocated in array */
-	ItemPointer dead_tuples;	/* array of ItemPointerData */
+	DeadTuplesMultiArray dead_tuples;
 	int			num_index_scans;
 	TransactionId latestRemovedXid;
 	bool		lock_waiter_detected;
 } LVRelStats;
 
+#define DeadTuplesCurrentSegment(lvrelstats) (&((lvrelstats)->dead_tuples.dead_tuples[ \
+	(lvrelstats)->dead_tuples.last_seg ]))
 
 /* A few variables that don't seem worth passing around as parameters */
 static int	elevel = -1;
@@ -149,7 +171,7 @@ static void lazy_cleanup_index(Relation indrel,
    IndexBulkDeleteResult *stats,
    LVRelStats *vacrelstats);
 static int lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
- int tupindex, LVRelStats *vacrelstats, Buffer *vmbuffer);
+ int tupindex, LVRelStats *vacrelstats, DeadTuplesSegment * seg, Buffer *vmbuffer);
 static bool should_attempt_truncation(LVRelStats *vacrelstats);
 static void lazy_truncate_heap(Relation onerel, LVRelStats *vacrelstats);
 static BlockNumber count_nondeletable_pages(Relation onerel,
@@ -157,8 +179,9 @@ static BlockNumber count_nondeletable_pages(Relation onerel,
 static void lazy_space_alloc(LVRelStats *vacrelstats, BlockNumber relblocks);
 static void lazy_record_dead_tuple(LVRelStats *vacrelstats,
 	   ItemPointer itemptr);
+static void lazy_clear_dead_tuples(LVRelStats *vacrelstats);
 static bool lazy_tid_reaped(ItemPointer itemptr, void *state);
-static int	vac_cmp_itemptr(const void *left, const void *right);
+static inline int vac_cmp_itemptr(const void *left, const void *right);
 static bool heap_page_is_all_visible(Relation rel, Buffer buf,
 	 TransactionId *visibility_cutoff_xid, bool *all_frozen);
 
@@ -498,7 +521,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
 	/* Report that we're scanning the heap, advertising total # of blocks */
 	initprog_val[0] = PROGRESS_VACUUM_PHASE_SCAN_HEAP;
 	initprog_val[1] = nblocks;
-	initprog_val[2] = vacrelstats->max_dead_tu

Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-11-17 Thread Claudio Freire
On Thu, Nov 17, 2016 at 6:34 PM, Robert Haas  wrote:
> On Thu, Nov 17, 2016 at 1:42 PM, Claudio Freire  
> wrote:
>> Attached is patch 0002 with pgindent applied over it
>>
>> I don't think there's any other formatting issue, but feel free to
>> point a finger to it if I missed any
>
> Hmm, I had imagined making all of the segments the same size rather
> than having the size grow exponentially.  The whole point of this is
> to save memory, and even in the worst case you don't end up with that
> many segments as long as you pick a reasonable base size (e.g. 64MB).

Wastage is bounded by a fraction of the total required RAM; that is,
it's proportional to the amount of RAM actually required, not the
amount allocated. Since segments double in size, at worst the last
(largest) segment sits nearly empty, so the total allocated never
exceeds about twice what the dead-tuple list really needs. So it
should still be fine, and the exponential strategy should improve
lookup performance considerably.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [WIP] [B-Tree] Keep indexes sorted by heap physical location

2016-11-18 Thread Claudio Freire
On Fri, Nov 18, 2016 at 10:05 PM, Peter Geoghegan  wrote:
> On Thu, Aug 18, 2016 at 2:15 PM, Peter Geoghegan  wrote:
>> I think that this is a bad idea. We need to implement suffix
>> truncation of internal page index tuples at some point, to make them
>> contain less information from the original leaf page index tuple.
>> That's an important optimization, because it increases fan-in. This
>> seems like a move in the opposite direction.
>
> Maybe I was too hasty, here. Can you rebase this patch, Claudio?

I've been changing the on-disk format considerably, to one that makes
more sense.

I haven't posted it because it still doesn't have suffix (and prefix)
truncation, but it should be compatible with it (ie: it could be
implemented as an optimization that doesn't change the on-disk
format).

However, I haven't tested the current state of the patch thoroughly.

If a WIP is fine, I can post the what I have, rebased.

> It's possible that this idea could have further non-obvious benefits.
> I see some potential for this to help with nbtree page deletion, item
> recycling, etc. For example, maybe VACUUM can manage to do something
> much closer to a regular index scan when that seems to make sense --

That was on my mind too

> I also think that this might help with bitmap index scans.

How so?

AFAIK it helps regular scans on low-cardinality indexes more than
bitmap index scans. With low-cardinality indexes, the resulting heap
access pattern will be an unordered series of sequential range scans,
which is better than the fully random scan the current layout causes.
Bitmap index scans, however, already sort heap accesses themselves, so
they don't see that benefit.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [WIP] [B-Tree] Keep indexes sorted by heap physical location

2016-11-21 Thread Claudio Freire
On Fri, Nov 18, 2016 at 11:09 PM, Peter Geoghegan  wrote:
> On Fri, Nov 18, 2016 at 5:27 PM, Claudio Freire  
> wrote:
>> I've been changing the on-disk format considerably, to one that makes
>> more sense.
>
> I can see how that would be necessary. I'm going to suggest some more
> things that need a new btree version number now, too.
>
> I realized today, quite suddenly, why I disliked your original patch:
> it didn't go far enough with embracing the idea of
> heap-item-pointer-as-key. In other words, I didn't like that you
> didn't change anything in the internal pages.

But it did. In fact it only changed internal pages.

> Maybe you should put
> heap TIDs someplace in new internal pages, so that even there we treat
> them as part of the key.

That is indeed what's happening (both in the old and new version).

The new version also opens up the possibility of omitting the heap
TID in inner pages when it's redundant (ie: if two consecutive keys
already differ, the heap TID part of the key is redundant, since it's
not needed to make tree descent decisions).

> This may significantly benefit UPDATE-heavy
> workloads where HOT doesn't work out so well. Particularly for
> non-unique indexes, we currently have to wade through a lot of bloat
> from the leaf level, rather than jumping straight to the correct leaf
> page when updating or inserting.

That is improved in the patch as well. The lookup of an insertion
point in a heavily bloated (unique or not) index is done efficiently,
instead of the O(N^2) way it used to be.

> You might not want to do the same with unique indexes, where we must
> initially buffer lock "the first leaf page that a duplicate could be
> on" while we potentially look in one or more additional sibling pages
> (but bloat won't be so bad there, perhaps).

It's done even for unique indexes. Locking in that case becomes
complex, true, but I believe it's not an insurmountable problem.

> Or, you might want to make
> sure that B-Tree tuple insertions still find that "first page" to
> check, despite still generally treating heap item pointer as part of
> the key proper (in unique indexes, it can still behave like NULL, and
> not participate in uniqueness checking, while still essentially being
> part of the key/sort order).

It behaves like that (even though it's not coded like that)

> It also occurs to me that our performance for updates in the event of
> many NULL values in indexes is probably really bad, and could be
> helped by this. You'll want to test all of this with a workload that
> isn't very sympathetic to HOT, of course -- most of these benefits are
> seen when HOT doesn't help.

I haven't really tested with an overabundance of NULLs, I'll add that
to the tests. I have tested low-cardinality indexes though, but only
for correctness.

>> However, I haven't tested the current state of the patch thoroughly.
>>
>> If a WIP is fine, I can post the what I have, rebased.
>
> I'm mostly curious about the effects on bloat. I now feel like you
> haven't sufficiently examined the potential benefits there, since you
> never made heap item pointer a first class part of the key space.
> Maybe you'd like to look into that yourself first?

What do you mean with "first class part"?

It's not included in the scankey for a variety of reasons, not the
least of which is not breaking the interface for current uses, and
because the tid type is quite limited in its capabilities ATM. Also,
the heap TID is usually there on most AM calls that care about it (ie:
insertions have it, of course), so adding it to the scankey also
seemed redundant.

If not adding to the scankey, what do you mean by "first class" then?


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2016-11-21 Thread Claudio Freire
On Mon, Nov 21, 2016 at 2:15 PM, Masahiko Sawada  wrote:
> On Fri, Nov 18, 2016 at 6:54 AM, Claudio Freire  
> wrote:
>> On Thu, Nov 17, 2016 at 6:34 PM, Robert Haas  wrote:
>>> On Thu, Nov 17, 2016 at 1:42 PM, Claudio Freire  
>>> wrote:
>>>> Attached is patch 0002 with pgindent applied over it
>>>>
>>>> I don't think there's any other formatting issue, but feel free to
>>>> point a finger to it if I missed any
>>>
>>> Hmm, I had imagined making all of the segments the same size rather
>>> than having the size grow exponentially.  The whole point of this is
>>> to save memory, and even in the worst case you don't end up with that
>>> many segments as long as you pick a reasonable base size (e.g. 64MB).
>>
>> Wastage is bound by a fraction of the total required RAM, that is,
>> it's proportional to the amount of required RAM, not the amount
>> allocated. So it should still be fine, and the exponential strategy
>> should improve lookup performance considerably.
>
> I'm concerned that it could use repalloc on a large memory area when
> vacrelstats->dead_tuples.dead_tuples is bloated. It would add overhead
> and be slow.

How large?

That array cannot be very large. It contains pointers to
exponentially-growing arrays, but the repalloc'd array itself is
small: one struct per segment, with segments starting at 128MB and
growing exponentially in size.

In fact, IIRC, it can be proven that such a repalloc strategy has an
amortized cost of O(log log n) per tuple. If it repalloc'd the whole
tid array it would be O(log n), but since it handles only pointers to
segments of exponentially growing size, it becomes O(log log n).

Furthermore, n there is still limited to MAX_INT, which means the cost
per tuple is bounded by O(log log 2^32) = 5. And that's an absolute
worst case that ignores the 128MB constant factor, which is indeed
relevant.

> What about using semi-fixed memory space without repalloc:
> allocate an array of ItemPointerData arrays, where each ItemPointerData
> array stores dead tuple locations. The size of the ItemPointerData
> array starts small (e.g. 16MB or 32MB). After we use an
> array up, we then allocate the next segment at twice the size of the
> previous segment.

That's what the patch does.

> But to prevent over allocating memory, it would be better to
> set a limit of allocating size of ItemPointerData array to 512MB or
> 1GB.

There already is a limit of 1GB (the maximum amount palloc can allocate)
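
To put numbers on the growth policy, a toy program showing the segment
progression under those two constraints (128MB start, doubling, 1GB
palloc cap); illustrative only, not the patch's code:

#include <stdio.h>

#define MAX_ALLOC	(1024LL * 1024 * 1024)	/* palloc's 1GB ceiling */

int
main(void)
{
	long long	seg = 128LL * 1024 * 1024;	/* first segment: 128MB */
	long long	total = 0;
	long long	target = 4LL * 1024 * 1024 * 1024;	/* e.g. 4GB m_w_m */
	int			nsegs = 0;

	while (total < target)
	{
		printf("segment %d: %lld MB\n", nsegs++, seg / (1024 * 1024));
		total += seg;
		if (seg * 2 <= MAX_ALLOC)
			seg *= 2;			/* grow exponentially... */
		/* ...else stay capped at 1GB per allocation */
	}
	/* prints 128, 256, 512, 1024, 1024, 1024, 1024: 7 segments for 4GB */
	return 0;
}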


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [WIP] [B-Tree] Keep indexes sorted by heap physical location

2016-11-21 Thread Claudio Freire
On Mon, Nov 21, 2016 at 5:42 PM, Peter Geoghegan  wrote:
>>> Or, you might want to make
>>> sure that B-Tree tuple insertions still find that "first page" to
>>> check, despite still generally treating heap item pointer as part of
>>> the key proper (in unique indexes, it can still behave like NULL, and
>>> not participate in uniqueness checking, while still essentially being
>>> part of the key/sort order).
>>
>> It behaves like that (even though it's not coded like that)
>
> Why not? What do you mean?
>
> What I'm interested in evaluating is an implementation where an index
> on (foo) has a sort order/key space that is precisely the same as an
> index on (foo, ctid), with zero exceptions.

Well, the patch does the same sorting, but without explicitly adding
the ctid to the scankey.

When inserting into a unique index, the lookup for the insertion point
proceeds as it would if the key was (foo, ctid), finding a page in the
middle of the range that contain item pointers for foo.

When that happens on a regular index, the insertion is done at that
point and nothing else needs to be done. But when it happens on a
unique index, the code first checks to see if the first item pointer
for foo is on the same page (allegedly a frequent case). If so, the
uniqueness check is done in a very straightforward and efficient
manner. If not, however, the code retraces its steps up the tree to
the first time it followed a key other than foo (that's the only way
it could land at a middle page), and then follows the downlinks
looking at a truncated key (just foo, ignoring ctid).

Kind of what you say, but without the explicit null.

>
>>> It also occurs to me that our performance for updates in the event of
>>> many NULL values in indexes is probably really bad, and could be
>>> helped by this. You'll want to test all of this with a workload that
>>> isn't very sympathetic to HOT, of course -- most of these benefits are
>>> seen when HOT doesn't help.
>>
>> I haven't really tested with an overabundance of NULLs, I'll add that
>> to the tests. I have tested low-cardinality indexes though, but only
>> for correctness.
>
> I think we'll have to model data distribution to evaluate the idea
> well. After all, if there is a uniform distribution of values anyway,
> you're going to end up with many dirty leaf pages. Whereas, in the
> more realistic case where updates are localized, we can hope to better
> amortize the cost of inserting index tuples, and recycling index
> tuples.

Quite possibly

>> What do you mean with "first class part"?
>>
>> It's not included in the scankey for a variety of reasons, not the
>> least of which is not breaking the interface for current uses, and
>> because the tid type is quite limited in its capabilities ATM. Also,
>> the heap TID is usually there on most AM calls that care about it (ie:
>> insertions have it, of course), so adding it to the scankey also
>> seemed redundant.
>
> I mean that insertions into indexes that are low cardinality should be
> largely guided by TID comparisons.

...

>> If not adding to the scankey, what do you mean by "first class" then?
>
> Just what I said about the sort order of an index on "(foo)" now
> precisely matching what we'd get for "(foo, ctid)".

It is the case in both versions of the patch

> There are a couple
> of tricky issues with that that you'd have to look out for, like
> making sure that the high key continues to hold a real TID, which at a
> glance looks like something that just happens already. I'm not sure
> that we preserve the item pointer TID in all cases, since the current
> assumption is that a high key's TID is just filler -- maybe we can
> lose that at some point.

I thought long about that, and inner pages don't need a real TID in
their keys, as those keys only specify key-space cutoff points and not
real tids, while high keys in leaf pages always carry real TIDs.

I haven't had any issue with that, aside from the headaches resulting
from thinking about that for protracted periods of time.

> You should use amcheck to specifically verify that that happens
> reliably in all cases. Presumably, its use of an insertion scankey
> would automatically see the use of TID as a tie-breaker with patched
> Postgres amcheck verification, and so amcheck will work for this
> purpose unmodified.

It's totally on my roadmap. I'll have to adapt it for the new index
format though, it doesn't work without modification.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: ParallelFinish-hook of FDW/CSP (Re: [HACKERS] Steps inside ExecEndGather)

2017-02-24 Thread Claudio Freire
On Fri, Feb 24, 2017 at 1:24 PM, Robert Haas  wrote:
> On Fri, Feb 24, 2017 at 7:29 PM, Kohei KaiGai  wrote:
>> This attached patch re-designs the previous v2 according to Robert's
>> comment.
>> All I had to do was put an entrypoint for ForeignScan/CustomScan at
>> ExecShutdownNode,
>> because it is already modified to call the child node first, earlier than
>> ExecShutdownGather,
>> which releases the DSM segment.
>
> Aside from the documentation, which needs some work, this looks fine
> to me on a quick read.

LGTM too.

You may want to clarify in the docs when the hook is called in
relation to other hooks, though.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-06-12 Thread Claudio Freire
On Thu, Jun 12, 2014 at 1:25 PM, Robert Haas  wrote:
>
> It appears that any string starting with the letter "a" will create
> output that begins with 001S00 and the seventh character always
> appears to be 0 or 1:
>
> [rhaas ~]$ ./strxfrm en_US ab ac ad ae af a% a0 "a "
> "ab" -> "001S001T001S001T"
> "ac" -> "001S001U001S001U"
> "ad" -> "001S001V001S001V"
> "ae" -> "001S001W001S001W"
> "af" -> "001S001X001S001X"
> "a%" -> "001S000W001S000W"
> "a0" -> "001S000b001S000b"
> "a " -> "001S000R001S000R"

...

> [rhaas@hydra ~]$ ./strxfrm-binary en_US  aaaA
> "" -> 0c0c0c0c0c0c0c0c010202020202020202010202020202020202 (26 bytes)
> "aaaA" -> 0c0c0c0c0c0c0c0c010202020202020202010202020202020204 (26 bytes)
>
> Still, it's fair to say that on this Linux system, the first 8 bytes
> capture a significant portion of the entropy of the first 8 bytes of
> the string, whereas on MacOS X you only get entropy from the first 2
> bytes of the string.  It would be interesting to see results from
> other platforms people might care about also.

If you knew the Mac's output character set with some certainty, you
could compress it rather efficiently by applying a tabulated decode
back into non-printable bytes, much like base64 decoding (the set
appears to be a subset of base64's alphabet, but it's hard to be
sure).


I.e.:

x = strxfrm(s)
xz = hex(tab[x[0]] + 64*tab[x[1]] + 64^2*tab[x[2]] + ...)

This can be made rather efficient, but the first step is defining the
output character set with some certainty.
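
For illustration, here's a minimal C sketch of that kind of repacking,
assuming a 64-character output alphabet (the decode table is
hypothetical; it would have to be filled in by enumerating strxfrm()'s
actual output set):

#include <stddef.h>
#include <stdint.h>

/* Hypothetical decode table mapping each printable strxfrm() output
 * byte back to a 6-bit value; it would be filled in by enumerating
 * the real output character set. */
static uint8_t tab[256];

/* Repack n printable bytes into ceil(n * 6 / 8) raw bytes, recovering
 * the entropy the printable encoding dilutes over 8-bit slots. */
static size_t
pack6(const unsigned char *in, size_t n, unsigned char *out)
{
    uint32_t acc = 0;
    int      nbits = 0;
    size_t   outlen = 0;
    size_t   i;

    for (i = 0; i < n; i++)
    {
        acc = (acc << 6) | tab[in[i]];
        nbits += 6;
        while (nbits >= 8)
        {
            nbits -= 8;
            out[outlen++] = (acc >> nbits) & 0xFF;
        }
    }
    if (nbits > 0)
        out[outlen++] = (acc << (8 - nbits)) & 0xFF;
    return outlen;
}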


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Minmax indexes

2014-06-17 Thread Claudio Freire
On Tue, Jun 17, 2014 at 1:04 PM, Andres Freund  wrote:
> For me minmax indexes are helpful because they allow to generate *small*
> 'coarse' indexes over large volumes of data. From my pov that's possible
> possible because they don't contain item pointers for every contained
> row.


But minmax is just a specific form of bloom filter.

This could certainly be generalized to a bloom filter index with some
set of bloom & hashing operators (min/max being just one such set).


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2014-06-17 Thread Claudio Freire
On Tue, Jun 17, 2014 at 8:47 AM, Abhijit Menon-Sen  wrote:
> if (compress_backup_block = BACKUP_BLOCK_COMPRESSION_SNAPPY)


You mean ==, right?


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Minmax indexes

2014-06-19 Thread Claudio Freire
On Thu, Jun 19, 2014 at 10:06 AM, Heikki Linnakangas
 wrote:
> On 06/18/2014 06:09 PM, Claudio Freire wrote:
>>
>> On Tue, Jun 17, 2014 at 8:48 PM, Greg Stark  wrote:
>>>
>>> An aggregate to generate a min/max "bounding box" from several values
>>> A function which takes an "bounding box" and a new value and returns
>>> the new "bounding box"
>>> A function which tests if a value is in a "bounding box"
>>> A function which tests if a "bounding box" overlaps a "bounding box"
>>
>>
>> Which I'd generalize a bit further by renaming "bounding box" with
>> "compressed set", and allow it to be parameterized.
>
>
> What do you mean by parameterized?

Bloom filters are parameterized by the number of hashes, the number of
bit positions, and the hash function, so it wouldn't be a simple bloom
filter index, but a bloom filter index with, say, N SHA-1-based hashes
spread over a K-length bitmap.
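
To make that concrete, here's a toy C sketch of such a parameterized
filter (the FNV-style hash is just a stand-in for whatever hash an
opclass would really use):

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct BloomFilter
{
    int     nhashes;        /* N: number of hash probes */
    int     nbits;          /* K: length of the bitmap */
    uint8_t bits[1];        /* nbits/8 bytes follow */
} BloomFilter;

/* Stand-in hash (seeded FNV-1a); a real opclass would pick this. */
static uint32_t
bloom_hash(const unsigned char *key, size_t len, uint32_t seed)
{
    uint32_t h = 2166136261u ^ seed;
    size_t   i;

    for (i = 0; i < len; i++)
        h = (h ^ key[i]) * 16777619u;
    return h;
}

static void
bloom_add(BloomFilter *bf, const unsigned char *key, size_t len)
{
    int i;

    for (i = 0; i < bf->nhashes; i++)
    {
        uint32_t bit = bloom_hash(key, len, i) % bf->nbits;

        bf->bits[bit / 8] |= 1 << (bit % 8);
    }
}

/* True means "maybe present", false means "definitely not present":
 * an approximate set with no false negatives. */
static bool
bloom_test(const BloomFilter *bf, const unsigned char *key, size_t len)
{
    int i;

    for (i = 0; i < bf->nhashes; i++)
    {
        uint32_t bit = bloom_hash(key, len, i) % bf->nbits;

        if ((bf->bits[bit / 8] & (1 << (bit % 8))) == 0)
            return false;
    }
    return true;
}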

>> So, you have:
>>
>> An aggregate to generate a "compressed set" from several values
>> A function which adds a new value to the "compressed set" and returns
>> the new "compressed set"
>> A function which tests if a value is in a "compressed set"
>> A function which tests if a "compressed set" overlaps another
>> "compressed set" of equal type
>
>
> Yeah, something like that. I'm not sure I like the "compressed set" term any
> more than bounding box, though. GiST seems to have avoided naming the thing,
> and just talks about "index entries". But if we can come up with a good
> name, that would be more clear.

I don't want to use the term bloom filter, since that names one
specific technique, but it's basically that: an approximate set
without false negatives. I.e.: a compressed set.

It's not a bounding box either when using bloom filters. So...

>> One problem with such a generalized implementation would be, that I'm
>> not sure in-place modification of the "compressed set" on-disk can be
>> assumed to be safe on all cases. Surely, for strictly-enlarging sets
>> it would, but while min/max and bloom filters both fit the bill, it's
>> not clear that one can assume this for all structures.
>
>
> I don't understand what you mean. It's a fundamental property of minmax
> indexes that you can always replace the "min" or "max" or "compressing set"
> or "bounding box" or whatever with another datum that represents all the
> keys that the old one did, plus some.

Yes, and bloom filters happen to fall on that category too.

Never mind what I said. I was thinking of another potential, imaginary
implementation that supports removals or updates, which might need
care with transaction lifetimes; but that's easily fixed by letting
vacuum or some lazy process do the deleting, just as happens with
other indexes anyway.

So, I guess the interface must include also the invariant that
compressed sets only grow, never shrink unless from a rebuild or a
vacuum operation.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal and patch

2014-06-19 Thread Claudio Freire
On Thu, Jun 19, 2014 at 2:49 PM, Greg Stark  wrote:
> I don't think the threaded implementation on Linux is the one to use
> though. I find this *super* confusing but the kernel definitely
> supports aio syscalls, glibc also has a threaded implementation it
> uses if run on a kernel that doesn't implement the syscalls, and I
> think there are existing libaio and librt libraries from outside glibc
> that do one or the other. Which you build against seems to make a big
> difference. My instinct is that anything but the kernel native
> implementation will be worthless. The overhead of thread communication
> will completely outweigh any advantage over posix_fadvise's partial
> win.

What overhead?

The only communication is through a "done" bit and associated
synchronization structure when *and only when* you want to wait on it.

Furthermore, posix_fadvise is braindead on this use case; been there,
done that. What you win with threads is better postgres-kernel
interaction: even if you lose some CPU performance, it's going to beat
posix_fadvise by a large margin.


> The main advantage of posix aio is that we can actually receive the
> data out of order. With posix_fadvise we can get the i/o and cpu
> overlap but we will never process the later blocks until the earlier
> requests are satisfied and processed in order. With aio you could do a
> sequential scan, initiating i/o on 1,000 blocks and then processing
> them as they arrive, initiating new requests as those blocks are
> handled.

And each and every I/O you issue with it counts as such on the kernel side.

It's not the case with posix_fadvise, mind you, and that's terribly
damaging for performance.

> When I investigated this I found the buffer manager's I/O bits seemed
> to already be able to represent the state we needed (i/o initiated on
> this buffer but not completed). The problem was in ensuring that a
> backend would process the i/o completion promptly when it might be in
> the midst of handling other tasks and might even get an elog() stack
> unwinding. The interface that actually fits Postgres best might be the
> threaded interface (orthogonal to the threaded implementation
> question) which is you give aio a callback which gets called on a
> separate thread when the i/o completes. The alternative is you give
> aio a list of operation control blocks and it tells you the state of
> all the i/o operations. But it's not clear to me how you arrange to do
> that regularly, promptly, and reliably.

Indeed we've been musing about using librt's support of completion
callbacks for that.

> The other gotcha here is that the kernel implementation only does
> anything useful on DIRECT_IO files. That means you have to do *all*
> the prefetching and i/o scheduling yourself.

If that's the case, we should discard kernel-based implementations and
stick to thread-based ones. Postgres cannot do scheduling as
efficiently as the kernel, and it shouldn't try.

> You would be doing that
> anyways for sequential scans and bitmap scans -- and we already do it
> with things like synchronised scans and posix_fadvise

That only works because the patterns are semi-sequential. If you try
to schedule random access, it becomes effectively impossible without
hardware info.

The kernel is the one with hardware info.

> Finally, when I did the posix_fadvise work I wrote a synthetic
> benchmark for testing the equivalent i/o pattern of a bitmap scan. It
> let me simulate bitmap scans of varying densities with varying
> parameters, notably how many i/o to keep in flight at once. It
> supported posix_fadvise or aio. You should look it up in the archives,
> it made for some nice looking graphs. IIRC I could not find any build
> environment where aio offered any performance boost at all. I think
> this means I just didn't know how to build it against the right
> libraries or wasn't using the right kernel or there was some skew
> between them at the time.

If it's old, it's probable you didn't hit the kernel's braindeadness,
since it was introduced somewhat recently (i.e., before 3.x, I
believe).

Even if you did hit it, bitmap heap scans are blessed with sequential
ordering. The technique doesn't work nearly as well with random I/O
that might be sorted from time to time.

When traversing an index, you do a mostly sequential pattern due to
physical correlation, but not completely sequential. Not only that,
with the assumption of random I/O, and the uncertainty of when will
the scan be aborted too, you don't read ahead as much as you could if
you knew it was sequential or a full scan. That kills performance. You
don't fetch enough ahead of time to avoid stalls, and the kernel
doesn't do read-ahead either because posix_fadvise effectively
disables it, resulting in the equivalent of direct I/O with bad
scheduling.

Solving this for index scans isn't just a little more complex. It's
insanely more complex, because you need hardware information to do it
right. How many spindles, how many...

Re: [HACKERS] Minmax indexes

2014-06-19 Thread Claudio Freire
On Wed, Jun 18, 2014 at 8:51 AM, Heikki Linnakangas
 wrote:
>
> I liked Greg's sketch of what the opclass support functions would be. It
> doesn't seem significantly more complicated than what's in the patch now.

Which was


On Tue, Jun 17, 2014 at 8:48 PM, Greg Stark  wrote:
> An aggregate to generate a min/max "bounding box" from several values
> A function which takes an "bounding box" and a new value and returns
> the new "bounding box"
> A function which tests if a value is in a "bounding box"
> A function which tests if a "bounding box" overlaps a "bounding box"

Which I'd generalize a bit further by renaming "bounding box" with
"compressed set", and allow it to be parameterized.

So, you have:

An aggregate to generate a "compressed set" from several values
A function which adds a new value to the "compressed set" and returns
the new "compressed set"
A function which tests if a value is in a "compressed set"
A function which tests if a "compressed set" overlaps another
"compressed set" of equal type

If you can define different compressed sets, you can use this to
generate both min/max indexes as well as bloom filter indexes. Whether
we'd want to have both is perhaps questionable, but having the ability
to is probably desirable.
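
In rough C terms, the interface could look something like this (every
name here is made up, just to illustrate the shape):

#include <stdbool.h>

/* Opaque, opclass-defined summary: a min/max pair, a bloom filter's
 * bitmap, or anything else satisfying the no-false-negatives
 * contract. */
typedef struct CompressedSet CompressedSet;
typedef struct Value Value;     /* stand-in for an indexed datum */

/* The aggregate: build a compressed set from several values. */
CompressedSet *cset_build(const Value *values, int nvalues);

/* Add a value, returning the (possibly enlarged) compressed set. */
CompressedSet *cset_add(CompressedSet *set, const Value *value);

/* Could this value be in the set? No false negatives allowed. */
bool cset_contains(const CompressedSet *set, const Value *value);

/* Could these two sets of the same type share any element? */
bool cset_overlaps(const CompressedSet *a, const CompressedSet *b);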

One problem with such a generalized implementation would be, that I'm
not sure in-place modification of the "compressed set" on-disk can be
assumed to be safe on all cases. Surely, for strictly-enlarging sets
it would, but while min/max and bloom filters both fit the bill, it's
not clear that one can assume this for all structures.

Also, adding an "in-place updateable" bit to the "type" would perhaps
inflate the complexity of the patch, due to the need to provide both
code paths?


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Extended Prefetching using Asynchronous IO - proposal and patch

2014-06-25 Thread Claudio Freire
On Tue, Jun 24, 2014 at 12:08 PM, John Lumby  wrote:
>> The question is, if you receive the notification of the I/O completion
>> using a signal or a thread, is it safe to release the lwlock from the
>> signal handler or a separate thread?
>
> In the forthcoming  new version of the patch that uses sigevent,
> the originator locks a LWlock associated with that BAaiocb eXclusive,
> and ,   when signalled,  in the signal handler it places that LWlock
> on a process-local queue of LWlocks awaiting release.
> (No, It cannot be safely released inside the signal handler or in a
> separate thread). Whenever the mainline passes a CHECK_INTERRUPTS macro
> and at a few additional points in bufmgr,  the backend walks this 
> process-local
> queue and releases those LWlocks.This is also done if the originator
> itself issues a ReadBuffer,  which is the most frequent case in which it
> is released.

I suggest using a semaphore instead.

Semaphores are supposed to be incremented/decremented from multiple
threads or processes already. So, in theory, the callback itself
should be able to do it.

The problem with the process-local queue is that it may take time to
be processed (the time it takes to get to a CHECK_FOR_INTERRUPTS()
call, which, as happened with regexes, can be quite long).
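
A minimal sketch of the semaphore idea, using POSIX aio with glibc's
thread-based completion notification (sem_post() is async-signal-safe,
so a signal-based sigevent could do the same; older glibc needs -lrt):

#include <aio.h>
#include <fcntl.h>
#include <semaphore.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>

static sem_t io_done;

/* Runs on a helper thread when the I/O completes; sem_post() is one
 * of the few calls that are safe from this context. */
static void
on_complete(union sigval sv)
{
    (void) sv;
    sem_post(&io_done);
}

int
main(void)
{
    static char  buf[8192];
    struct aiocb cb;
    int          fd = open("/etc/hosts", O_RDONLY);

    if (fd < 0)
    {
        perror("open");
        return 1;
    }
    sem_init(&io_done, 0, 0);

    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_buf = buf;
    cb.aio_nbytes = sizeof(buf);
    cb.aio_offset = 0;
    cb.aio_sigevent.sigev_notify = SIGEV_THREAD;
    cb.aio_sigevent.sigev_notify_function = on_complete;

    if (aio_read(&cb) != 0)
    {
        perror("aio_read");
        return 1;
    }

    /* ... do other work; synchronize only when the result is needed. */
    sem_wait(&io_done);
    printf("read %zd bytes\n", aio_return(&cb));
    return 0;
}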


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] NUMA packaging and patch

2014-06-26 Thread Claudio Freire
On Thu, Jun 26, 2014 at 11:18 AM, Kohei KaiGai  wrote:
> One thing I concern is, it may conflict with numa-balancing
> features that is supported in the recent Linux kernel; that
> migrates physical pages according to the location of tasks
> which references the page beyond the numa zone.
> # I'm not sure whether it is applied on shared memory region.
> # Please correct me if I misunderstood. But it looks to me
> # physical page in shared memory is also moved.
> http://events.linuxfoundation.org/sites/events/files/slides/summit2014_riel_chegu_w_0340_automatic_numa_balancing_0.pdf


Sadly, it excludes the OS cache explicitly (when it mentions libc.so),
which is one of the hottest sources of memory bandwidth consumption in
a database.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Minmax indexes

2014-07-10 Thread Claudio Freire
On Wed, Jul 9, 2014 at 6:16 PM, Alvaro Herrera  wrote:
> Another thing I noticed is that version 8 of the patch blindly believed
> the "pages_per_range" declared in catalogs.  This meant that if somebody
> did "alter index foo set pages_per_range=123" the index would
> immediately break (i.e. return corrupted results when queried).  I have
> fixed this by storing the pages_per_range value used to construct the
> index in the metapage.  Now if you do the ALTER INDEX thing, the new
> value is only used when the index is recreated by REINDEX.

This seems a lot like parameterizing. So I guess the only thing left
is to issue a NOTICE when said ALTER takes place (I don't see that in
the patch, but maybe it's there?)


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Minmax indexes

2014-07-11 Thread Claudio Freire
On Thu, Jul 10, 2014 at 4:20 PM, Alvaro Herrera
 wrote:
> Claudio Freire wrote:
>> On Wed, Jul 9, 2014 at 6:16 PM, Alvaro Herrera  
>> wrote:
>> > Another thing I noticed is that version 8 of the patch blindly believed
>> > the "pages_per_range" declared in catalogs.  This meant that if somebody
>> > did "alter index foo set pages_per_range=123" the index would
>> > immediately break (i.e. return corrupted results when queried).  I have
>> > fixed this by storing the pages_per_range value used to construct the
>> > index in the metapage.  Now if you do the ALTER INDEX thing, the new
>> > value is only used when the index is recreated by REINDEX.
>>
>> This seems a lot like parameterizing.
>
> I don't understand what that means -- care to elaborate?

We've been talking about bloom filters, and how their shape differs
according to the parameters of the bloom filter (number of hashes,
hash type, etc).

But after seeing this case of pages_per_range, I noticed it's an
effective-enough mechanism. Like:

CREATE INDEX ix_blah ON some_table USING bloom (somecol)
  WITH (BLOOM_HASHES=15, BLOOM_BUCKETS=1024, PAGES_PER_RANGE=64);

Marking them as read-only would be OK; emitting a NOTICE, so that
anyone who changes parameters that alter the shape of the index knows
it needs a rebuild, would be OK too. Both mechanisms work for me.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Minmax indexes

2014-07-11 Thread Claudio Freire
On Fri, Jul 11, 2014 at 3:47 PM, Greg Stark  wrote:
> On Fri, Jul 11, 2014 at 6:00 PM, Claudio Freire  
> wrote:
>> Marking as read-only is ok, or emitting a NOTICE so that if anyone
>> changes those parameters that change the shape of the index, they know
>> it needs a rebuild would be OK too. Both mechanisms work for me.
>
> We don't actually have any of these mechanisms. They wouldn't be bad
> things to have but I don't think we should gate adding new types of
> indexes on adding them. In particular, the index could just hard code
> a value for these parameters and having them be parameterized is
> clearly better even if that doesn't produce all the warnings or
> rebuild things automatically or whatever.


No, I agree, it's just a nice-to-have.

But at least the docs should mention it.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2014-07-14 Thread Claudio Freire
On Mon, Jul 14, 2014 at 2:53 PM, Peter Geoghegan  wrote:
> My concern is that it won't be worth it to do the extra work,
> particularly given that I already have 8 bytes to work with. Supposing
> I only had 4 bytes to work with (as researchers writing [2] may have
> only had in 1994), that would leave me with a relatively small number
> of distinct normalized keys in many representative cases. For example,
> I'd have a mere 40,665 distinct normalized keys in the case of my
> "cities" database, rather than 243,782 (out of a set of 317,102 rows)
> for 8 bytes of storage. But if I double that to 16 bytes (which might
> be taken as a proxy for what a good compression scheme could get me),
> I only get a modest improvement - 273,795 distinct keys. To be fair,
> that's in no small part because there are only 275,330 distinct city
> names overall (and so most dups get away with a cheap memcmp() on
> their tie-breaker), but this is a reasonably organic, representative
> dataset.


Are those numbers measured with the Mac's strxfrm?

That was the one with suboptimal entropy on the first 8 bytes.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: Incremental Backup

2014-07-25 Thread Claudio Freire
On Fri, Jul 25, 2014 at 10:14 AM, Marco Nenciarini
 wrote:
> 1. Proposal
> =
> Our proposal is to introduce the concept of a backup profile. The backup
> profile consists of a file with one line per file detailing tablespace,
> path, modification time, size and checksum.
> Using that file the BASE_BACKUP command can decide which file needs to
> be sent again and which is not changed. The algorithm should be very
> similar to rsync, but since our files are never bigger than 1 GB per
> file that is probably granular enough not to worry about copying parts
> of files, just whole files.

That wouldn't be nearly as useful as the LSN-based approach mentioned before.

I've had my share of rsyncing live databases (when resizing
filesystems, not for backup, but the anecdotal evidence applies
anyhow) and with moderately write-heavy databases, even if you only
modify a tiny portion of the records, you end up modifying a huge
portion of the segments, because the free space choice is random.

There have been patches going around to change the random nature of
that choice, but none are very likely to make a huge difference for
this application. In essence, file-level comparisons get you only a
mild speed-up, and are not worth the effort.

I'd go for the hybrid file+LSN method, or nothing. The hybrid avoids
the I/O of inspecting the LSNs of entire segments (a necessary
optimization for huge multi-TB databases) and backs up only the
portions modified when segments do contain changes, so it's the best
of both worlds. Any partial implementation would either require lots
of I/O (LSN only) or save very little (file only), unless it's an
almost read-only database.
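
A rough C sketch of what I mean (the types mirror PostgreSQL's, but
every helper here is hypothetical):

#include <stdint.h>

typedef uint64_t XLogRecPtr;    /* mirrors PostgreSQL's WAL pointer */
typedef uint32_t BlockNumber;
typedef char *Page;

/* Hypothetical helpers: per-segment metadata lookup, block access
 * and backup output. */
extern XLogRecPtr seg_max_lsn(const char *path);
extern BlockNumber seg_nblocks(const char *path);
extern Page read_block(const char *path, BlockNumber blk);
extern XLogRecPtr page_lsn(Page page);  /* what PageGetLSN() reads */
extern void ship_block(const char *path, BlockNumber blk, Page page);

void
backup_segment(const char *path, XLogRecPtr last_backup_lsn)
{
    BlockNumber blk;

    /* File-level check: if nothing in the segment was touched since
     * the last backup, skip it without reading a single block. */
    if (seg_max_lsn(path) <= last_backup_lsn)
        return;

    /* Otherwise read it, but ship only the blocks that changed. */
    for (blk = 0; blk < seg_nblocks(path); blk++)
    {
        Page page = read_block(path, blk);

        if (page_lsn(page) > last_backup_lsn)
            ship_block(path, blk, page);
    }
}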


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: Incremental Backup

2014-07-25 Thread Claudio Freire
On Fri, Jul 25, 2014 at 3:44 PM, Robert Haas  wrote:
> On Fri, Jul 25, 2014 at 2:21 PM, Claudio Freire  
> wrote:
>> On Fri, Jul 25, 2014 at 10:14 AM, Marco Nenciarini
>>  wrote:
>>> 1. Proposal
>>> =
>>> Our proposal is to introduce the concept of a backup profile. The backup
>>> profile consists of a file with one line per file detailing tablespace,
>>> path, modification time, size and checksum.
>>> Using that file the BASE_BACKUP command can decide which file needs to
>>> be sent again and which is not changed. The algorithm should be very
>>> similar to rsync, but since our files are never bigger than 1 GB per
>>> file that is probably granular enough not to worry about copying parts
>>> of files, just whole files.
>>
>> That wouldn't nearly as useful as the LSN-based approach mentioned before.
>>
>> I've had my share of rsyncing live databases (when resizing
>> filesystems, not for backup, but the anecdotal evidence applies
>> anyhow) and with moderately write-heavy databases, even if you only
>> modify a tiny portion of the records, you end up modifying a huge
>> portion of the segments, because the free space choice is random.
>>
>> There have been patches going around to change the random nature of
>> that choice, but none are very likely to make a huge difference for
>> this application. In essence, file-level comparisons get you only a
>> mild speed-up, and are not worth the effort.
>>
>> I'd go for the hybrid file+lsn method, or nothing. The hybrid avoids
>> the I/O of inspecting the LSN of entire segments (necessary
>> optimization for huge multi-TB databases) and backups only the
>> portions modified when segments do contain changes, so it's the best
>> of both worlds. Any partial implementation would either require lots
>> of I/O (LSN only) or save very little (file only) unless it's an
>> almost read-only database.
>
> I agree with much of that.  However, I'd question whether we can
> really seriously expect to rely on file modification times for
> critical data-integrity operations.  I wouldn't like it if somebody
> ran ntpdate to fix the time while the base backup was running, and it
> set the time backward, and the next differential backup consequently
> omitted some blocks that had been modified during the base backup.

I was thinking the same. But that timestamp could be saved in the file
itself, or in some other catalog (a kind of "trusted metadata"
maintained by pg itself), and it could be an LSN range instead of a
timestamp, really.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: Incremental Backup

2014-07-25 Thread Claudio Freire
On Fri, Jul 25, 2014 at 7:38 PM, Josh Berkus  wrote:
> On 07/25/2014 11:49 AM, Claudio Freire wrote:
>>> I agree with much of that.  However, I'd question whether we can
>>> > really seriously expect to rely on file modification times for
>>> > critical data-integrity operations.  I wouldn't like it if somebody
>>> > ran ntpdate to fix the time while the base backup was running, and it
>>> > set the time backward, and the next differential backup consequently
>>> > omitted some blocks that had been modified during the base backup.
>> I was thinking the same. But that timestamp could be saved on the file
>> itself, or some other catalog, like a "trusted metadata" implemented
>> by pg itself, and it could be an LSN range instead of a timestamp
>> really.
>
> What about requiring checksums to be on instead, and checking the
> file-level checksums?   Hmmm, wait, do we have file-level checksums?  Or
> just page-level?

It would be very computationally expensive to have up-to-date
file-level checksums, so I highly doubt it.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: Incremental Backup

2014-07-29 Thread Claudio Freire
On Tue, Jul 29, 2014 at 1:24 PM, Marco Nenciarini
 wrote:
>> On Fri, Jul 25, 2014 at 10:14 AM, Marco Nenciarini
>>  wrote:
>>> 1. Proposal
>>> =
>>> Our proposal is to introduce the concept of a backup profile. The backup
>>> profile consists of a file with one line per file detailing tablespace,
>>> path, modification time, size and checksum.
>>> Using that file the BASE_BACKUP command can decide which file needs to
>>> be sent again and which is not changed. The algorithm should be very
>>> similar to rsync, but since our files are never bigger than 1 GB per
>>> file that is probably granular enough not to worry about copying parts
>>> of files, just whole files.
>>
>> That wouldn't nearly as useful as the LSN-based approach mentioned before.
>>
>> I've had my share of rsyncing live databases (when resizing
>> filesystems, not for backup, but the anecdotal evidence applies
>> anyhow) and with moderately write-heavy databases, even if you only
>> modify a tiny portion of the records, you end up modifying a huge
>> portion of the segments, because the free space choice is random.
>>
>> There have been patches going around to change the random nature of
>> that choice, but none are very likely to make a huge difference for
>> this application. In essence, file-level comparisons get you only a
>> mild speed-up, and are not worth the effort.
>>
>> I'd go for the hybrid file+lsn method, or nothing. The hybrid avoids
>> the I/O of inspecting the LSN of entire segments (necessary
>> optimization for huge multi-TB databases) and backups only the
>> portions modified when segments do contain changes, so it's the best
>> of both worlds. Any partial implementation would either require lots
>> of I/O (LSN only) or save very little (file only) unless it's an
>> almost read-only database.
>>
>
> From my experience, if a database is big enough and there is any kind of
> historical data in the database, the "file only" approach works well.
> Moreover it has the advantage of being simple and easily verifiable.

I don't see how that would be true if it's not full of read-only or
append-only tables.

Furthermore, even in that case, you need to have the database locked
while performing the file-level backup, and computing all the
checksums means processing the whole thing. That's a huge amount of
time to be locked for multi-TB databases, so how is that good enough?


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Performance issue in pg_dump's dependency loop searching

2014-07-29 Thread Claudio Freire
On Tue, Jul 29, 2014 at 3:06 PM, Tom Lane  wrote:
> Simon Riggs  writes:
>> On 25 July 2014 20:47, Tom Lane  wrote:
>>> Another idea would be to
>
>> ...persist the optimal dump order in the database.
>
>> That way we can maintain the correct dump order each time we do DDL,
>> which is only a small incremental cost, no matter how many objects we
>> have.
>
> I don't see any obvious way to make it incremental; so I doubt that
> it would be a small extra cost.  In any case I disagree that making DDL
> slower to make pg_dump faster is a good tradeoff.  Many people seldom
> or never use pg_dump.
>
> regards, tom lane


Not to mention slowing down temp tables.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: Incremental Backup

2014-07-31 Thread Claudio Freire
On Thu, Jul 31, 2014 at 5:26 AM, desmodemone  wrote:
> b) yes the backends need to update the map, but it's in memory, and as I
> show, could be very small if we you chunk of blocks.If we not compress the
> map, I not think could be a bottleneck.

If it's in memory, it's not crash-safe. For something aimed at
backups, I think crash safety is a requirement. So it's at least one
extra I/O per commit, maybe less if many can be coalesced at
checkpoints, but I wouldn't count on it too much, because worst cases
are easy to come by (sparse enough updates).

I think this could be pegged to WAL replay / checkpoint processing
alone, so it would be very asynchronous, but not free.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: Incremental Backup

2014-08-01 Thread Claudio Freire
On Fri, Aug 1, 2014 at 12:35 AM, Amit Kapila  wrote:
>> c) the map is not crash safe by design, because it needs only for
>> incremental backup to track what blocks needs to be backuped, not for
>> consistency or recovery of the whole cluster, so it's not an heavy cost for
>> the whole cluster to maintain it. we could think an option (but it's heavy)
>> to write it at every flush  on file to have crash-safe map, but I not think
>> it's so usefull . I think it's acceptable, and probably it's better to force
>> that, to say: "if your db will crash, you need a fullbackup ",
>
> I am not sure if your this assumption is right/acceptable, how can
> we say that in such a case users will be okay to have a fullbackup?
> In general, taking fullbackup is very heavy operation and we should
> try to avoid such a situation.


Besides, the one taking the backup (i.e., a script) may not be aware
of the need to take a full one.

It's a bad design to allow broken backups at all, IMNSHO.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: Incremental Backup

2014-08-01 Thread Claudio Freire
On Fri, Aug 1, 2014 at 1:43 PM, desmodemone  wrote:
>
>
>
> 2014-08-01 18:20 GMT+02:00 Claudio Freire :
>
>> On Fri, Aug 1, 2014 at 12:35 AM, Amit Kapila 
>> wrote:
>> >> c) the map is not crash safe by design, because it needs only for
>> >> incremental backup to track what blocks needs to be backuped, not for
>> >> consistency or recovery of the whole cluster, so it's not an heavy cost
>> >> for
>> >> the whole cluster to maintain it. we could think an option (but it's
>> >> heavy)
>> >> to write it at every flush  on file to have crash-safe map, but I not
>> >> think
>> >> it's so usefull . I think it's acceptable, and probably it's better to
>> >> force
>> >> that, to say: "if your db will crash, you need a fullbackup ",
>> >
>> > I am not sure if your this assumption is right/acceptable, how can
>> > we say that in such a case users will be okay to have a fullbackup?
>> > In general, taking fullbackup is very heavy operation and we should
>> > try to avoid such a situation.
>>
>>
>> Besides, the one taking the backup (ie: script) may not be aware of
>> the need to take a full one.
>>
>> It's a bad design to allow broken backups at all, IMNSHO.
>
>
> Hi Claudio,
>  thanks for your observation
> First: the case it's after a crash of a database, and it's not something
> happens every day or every week. It's something that happens in rare
> conditions, or almost my experience is so. If it happens very often probably
> there are other problems.

Not so much. In this case, the design isn't even software-crash safe;
it's not just that it isn't hardware-crash safe.

What I mean is that an in-memory bitmap will also be out of sync if
you kill -9 (or if one of the backends is killed by the OOM killer),
or if the system runs out of disk space.

Normally, a simple restart fixes it because pg will do crash recovery
just fine, but now the bitmap is out of sync, and further backups are
broken. It's not a situation I want to face unless there's a huge
reason to go for such design.

If you make it so that the commit includes flipping the bitmap, it can
be done cleverly enough to avoid too much overhead (though it will
have some), and then any to-be-touched block is guaranteed to be part
of the next backup. You just apply all the bitmap changes in batch
after a checkpoint, before syncing to disk and before erasing the WAL
segments. Simple, relatively efficient, and far more robust than an
in-memory thing.
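
Roughly, with every function name made up, since only the ordering
matters:

/* Hypothetical sketch of the checkpoint-time sequence that keeps the
 * to-backup map crash safe; none of these functions exist as such. */
extern void flush_dirty_buffers(void);
extern void apply_pending_map_changes(void);
extern void fsync_backup_map(void);
extern void write_checkpoint_record(void);
extern void recycle_old_wal_segments(void);

void
checkpoint_with_backup_map(void)
{
    flush_dirty_buffers();          /* usual checkpoint write-out */
    apply_pending_map_changes();    /* batch-flip the accumulated bits */
    fsync_backup_map();             /* map durable before WAL goes away */
    write_checkpoint_record();
    recycle_old_wal_segments();     /* only now is dropping WAL safe */
}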

Still, it *can* double checkpoint I/O on the worst case, and it's not
an unfathomable case either.

> Second: to avoid the problem to know if the db needed to have a full backup
> to rebuild the map we could think to write in the map header the backup
> reference (with an id and LSN reference for example ) so  if the
> someone/something try to do an incremental backup after a crash, the map
> header will not have noone full backup listed [because it will be empty] ,
> and automaticcaly switch to a full one. I think after a crash it's a good
> practice to do a full backup, to see if there are some problems on files or
> on filesystems, but if I am wrong I am happy to know :) .

After a crash I don't take a backup; I do a verification of the data
(VACUUM and some consistency checks, usually), lest I end up with a
useless backup. The backup comes after that.

But then, I'm no DBA guru.

> Remember that I propose a map in ram to reduce the impact on performances,
> but we could create an option to leave the choose to the user, if you want a
> crash safe map, at every flush will be updated also a map file , if not, the
> map will be in ram.

I think the performance impact of a WAL-linked map isn't so big as to
prefer the possibility of broken backups. I wouldn't even allow it.

It's not free, making it crash safe, but it's not that expensive
either. If you want to support incremental backups, you really really
need to make sure those backups are correct and usable, and IMV
anything short of full crash safety will be too fragile for that
purpose. I don't want to be in a position of needing the backup and
finding out it's inconsistent after the fact, and I don't want to
encourage people to set themselves up for that by adding that "faster
but unsafe backups" flag.

I'd do it either safe, or not at all.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: Incremental Backup

2014-08-04 Thread Claudio Freire
On Mon, Aug 4, 2014 at 5:15 AM, Gabriele Bartolini
 wrote:
>I really like the proposal of working on a block level incremental
> backup feature and the idea of considering LSN. However, I'd suggest
> to see block level as a second step and a goal to keep in mind while
> working on the first step. I believe that file-level incremental
> backup will bring a lot of benefits to our community and users anyway.

Thing is, I don't see how the LSN method is that much harder than an
on-disk bitmap. In-memory bitmap IMO is just a recipe for disaster.

Keeping a last-updated-LSN for each segment (or group of blocks) is
just as easy as keeping a bitmap, and far more flexible and robust.

The complexity and cost of safely keeping the map up-to-date is what's
in question here, but as was pointed before, there's no really safe
alternative. Nor modification times nor checksums (nor in-memory
bitmaps IMV) are really safe enough for backups, so you really want to
use something like the LSN. It's extra work, but opens up a world of
possibilities.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: Incremental Backup

2014-08-05 Thread Claudio Freire
On Tue, Aug 5, 2014 at 3:23 PM, Simon Riggs  wrote:
> On 4 August 2014 19:30, Claudio Freire  wrote:
>> On Mon, Aug 4, 2014 at 5:15 AM, Gabriele Bartolini
>>  wrote:
>>>I really like the proposal of working on a block level incremental
>>> backup feature and the idea of considering LSN. However, I'd suggest
>>> to see block level as a second step and a goal to keep in mind while
>>> working on the first step. I believe that file-level incremental
>>> backup will bring a lot of benefits to our community and users anyway.
>>
>> Thing is, I don't see how the LSN method is that much harder than an
>> on-disk bitmap. In-memory bitmap IMO is just a recipe for disaster.
>>
>> Keeping a last-updated-LSN for each segment (or group of blocks) is
>> just as easy as keeping a bitmap, and far more flexible and robust.
>>
>> The complexity and cost of safely keeping the map up-to-date is what's
>> in question here, but as was pointed before, there's no really safe
>> alternative. Nor modification times nor checksums (nor in-memory
>> bitmaps IMV) are really safe enough for backups, so you really want to
>> use something like the LSN. It's extra work, but opens up a world of
>> possibilities.
>
> OK, some comments on all of this.
>
> * Wikipedia thinks the style of backup envisaged should be called 
> "Incremental"
> https://en.wikipedia.org/wiki/Differential_backup
>
> * Base backups are worthless without WAL right up to the *last* LSN
> seen during the backup, which is why pg_stop_backup() returns an LSN.
> This is the LSN that is the effective viewpoint of the whole base
> backup. So if we wish to get all changes since the last backup, we
> must re-quote this LSN. (Or put another way - file level LSNs don't
> make sense - we just need one LSN for the whole backup).

File-level LSNs are an optimization. When you want to back up all
files modified since the last base or incremental backup (yes, you
need the previous backup label at least), you check the file-level LSN
range. That tells you which "changesets" touched that file, so you
know whether to process it or not.

Block-level LSNs (or, rather, block-segment-level) are just a
refinement of that.

> * When we take an incremental backup we need the WAL from the backup
> start LSN through to the backup stop LSN. We do not need the WAL
> between the last backup stop LSN and the new incremental start LSN.
> That is a huge amount of WAL in many cases and we'd like to avoid
> that, I would imagine. (So the space savings aren't just the delta
> from the main data files, we should also look at WAL savings).

Yes, probably something along the lines of removing redundant FPWs
(full-page writes) and stuff like that.

> * For me, file based incremental is a useful and robust feature.
> Block-level incremental is possible, but requires either significant
> persistent metadata (1 MB per GB file) or access to the original
> backup. One important objective here is to make sure we do NOT have to
> re-read the last backup when taking the next backup; this helps us to
> optimize the storage costs for backups. Plus, block-level recovery
> requires us to have a program that correctly re-writes data into the
> correct locations in a file, which seems likely to be a slow and bug
> ridden process to me. Nice, safe, solid file-level incremental backup
> first please. Fancy, bug prone, block-level stuff much later.

Ok. You could do incremental first without any kind of optimization,
then file-level optimization by keeping a file-level LSN range, and
then extend that to block-segment-level LSN ranges. That sounds like a
plan to me.

But I don't see how you'd do the unoptimized one without reading the
previous backup to compare deltas. Remember, checksums are deemed
untrustworthy (not just by me), so that approach (which was the
original proposition) doesn't work.

> * If we don't want/have file checksums, then we don't need a profile
> file and using just the LSN seems fine. I don't think we should
> specify that manually - the correct LSN is written to the backup_label
> file in a base backup and we should read it back from there.

Agreed


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

