Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple

2017-11-09 Thread Peter Geoghegan
On Thu, Nov 9, 2017 at 4:53 PM, Andres Freund  wrote:
> Primarily because it's not an anti-corruption tool. I'd be surprised if
> there weren't ways to corrupt the page using these corruptions that
> aren't detected by it.

It's very hard to assess with total confidence the risk of missing
something that's actually detectable, but I think that the check is
very thorough.

> But even if it were, I don't think there's
> enough information to do so in the general case. You very well can end
> up with pages where subsequent hot pruning has removed a good bit of the
> direct evidence of this bug.

Sure, but maybe those are cases that can't get any worse anyway. So the
question of avoiding making it worse doesn't arise.

> But I'm not really sure why the error detection capabilities matter
> much for the principal point I raised, which is how much work we need to
> do to not further worsen the corruption.

You're right. Just trying to put the risk in context, and to understand the
extent of the concern that you have.

-- 
Peter Geoghegan


Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple

2017-11-09 Thread Peter Geoghegan
On Thu, Nov 9, 2017 at 4:17 PM, Andres Freund  wrote:
>> Actually, on second thought, I take that back -- I don't think that
>> REINDEXing will even finish once a HOT chain is broken by the bug.
>> IndexBuildHeapScan() actually does quite a good job of making sure
>> that HOT chains are sane, which is how the enhanced amcheck notices
>> the bug here in practice.
>
> I think that's too optimistic.

Why? Because the "find the TID of the root" logic in
IndexBuildHeapScan()/heap_get_root_tuples() won't reliably find the
actual root (it might be some other HOT chain root following TID
recycling by VACUUM)?

Assuming that's what you meant: I would have thought that the
xmin/xmax matching within heap_get_root_tuples() makes the sanity
checking fairly reliable in practice.

-- 
Peter Geoghegan




Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple

2017-11-09 Thread Peter Geoghegan
On Thu, Nov 9, 2017 at 4:17 PM, Andres Freund  wrote:
>> I don't follow you here. Why would REINDEXing make the rows that
>> should be dead disappear again, even for a short period of time?
>
> It's not the REINDEX that makes them reappear.

Of course. I was just trying to make sense of what you said.

> It's the second
> vacuum. The reindex part was about $user trying to fix the problem...
> As you need two vacuums with appropriate cutoffs to hit the "rows
> revive" problem, that'll often in practice not happen immediately.

This explanation clears things up, though.

-- 
Peter Geoghegan




Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple

2017-11-09 Thread Peter Geoghegan
On Thu, Nov 9, 2017 at 2:24 PM, Andres Freund  wrote:
> Attached is a version of the already existing regression test that both
> reproduces the broken hot chain (and thus failing index lookups) and
> then also the tuple reviving.  I don't see any need for letting this run
> with arbitrary permutations.

I thought that the use of every possible permutation was excessive,
myself. It left us with an isolation test that didn't precisely
describe the behavior that is tested. What you came up with seems far,
far better, especially because of the comments you included. The mail
message-id references seem to add a lot, too.

> What I'm currently wondering about is how much we need to harden
> postgres against such existing corruption. If e.g. the hot chains are
> broken somebody might have reindexed thinking the problem is fixed - but
> if they then later vacuum everything goes to shit again, with dead rows
> reappearing.

I don't follow you here. Why would REINDEXing make the rows that
should be dead disappear again, even for a short period of time? It
might do so for index scans, I suppose, but not for sequential scans.
Are you concerned about a risk of somebody not noticing that
sequential scans are still broken?

Actually, on second thought, I take that back -- I don't think that
REINDEXing will even finish once a HOT chain is broken by the bug.
IndexBuildHeapScan() actually does quite a good job of making sure
that HOT chains are sane, which is how the enhanced amcheck notices
the bug here in practice. (Before this bug was discovered, I would
have expected amcheck to catch problems like it slightly later, during
the Bloom filter probe for that HOT chain...but, in fact, it never
gets there with corruption from this bug in practice, AFAIK.)
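
For reference, the enhanced verification being referred to is run from SQL
along roughly these lines (a sketch; at the time of writing the
heapallindexed check only exists in the out-of-tree "enhanced" amcheck, so
the exact signature may differ, and the index name is made up):

CREATE EXTENSION amcheck;

-- Check B-Tree invariants, and also verify that every heap tuple is found
-- in the index; the underlying IndexBuildHeapScan() call is what trips
-- over HOT chains broken by this bug.
SELECT bt_index_check('some_table_pkey'::regclass, true);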

-- 
Peter Geoghegan




Re: [HACKERS] pageinspect option to forgo buffer locking?

2017-11-09 Thread Peter Geoghegan
On Thu, Nov 9, 2017 at 9:49 AM, Andres Freund  wrote:
> Currently the locking in get_raw_page_internal() prevents that.  If it's
> an option defaulting to off, I don't see why we couldn't allow that to
> skip locking the page's contents. Obviously you can get corrupted
> contents that way, but we already allow passing arbitrary stuff to
> heap_page_items().  Since pinning wouldn't be changed, there's no danger
> of the page being moved out from under us.

+1. I've done things like this before myself.
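
For context, this is roughly how the existing (locking) path gets used from
SQL today -- a minimal sketch with a made-up table name; the proposed
option to skip the content lock is hypothetical at this point:

-- get_raw_page() currently takes a share lock on the buffer contents before
-- copying the page; the idea is an opt-in variant that skips that lock
-- while still keeping the buffer pinned.
SELECT lp, lp_flags, t_xmin, t_xmax, t_infomask
FROM heap_page_items(get_raw_page('some_table', 0));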


-- 
Peter Geoghegan




Re: [HACKERS] Small improvement to compactify_tuples

2017-11-08 Thread Peter Geoghegan
On Wed, Nov 8, 2017 at 12:59 PM, Andres Freund  wrote:
> I complained about multiple related things, I'm not exactly sure what
> exactly you're referring to here:
> - The fact that HeapTupleHeaderData's are commonly iterated over in
>   reverse order is bad for performance. For shared buffers resident
>   workloads involving seqscans that yields 15-25% slowdowns for me. It's
>   trivial to fix that by just changing iteration order, but that
>   obviously changes results. But we could reorder the page during heap
>   pruning.

FWIW, the classic page layout (the one that appears in Gray's
Transaction Processing Systems, at any rate) has the ItemId array at
the end of the page and the tuples at the start (immediately after a
generic page header) -- it's the other way around.

I think that that has its pros and cons.

> - The layout of items in index pages is suboptimal. We regularly do
>   binary searches over the linearly ordered items, which is cache
>   inefficient. So instead we should sort items as [1/2, 1/4, 3/4, ...]
>   elements, which will access items in a close-ish to linear manner.

I still think that we can repurpose each ItemId's lp_len as an
abbreviated key in internal index pages [1], and always get IndexTuple
size through the index tuple header. I actually got as far as writing a
very rough prototype of that. That's obviously a significant project,
but it seems doable.

[1] 
https://www.postgresql.org/message-id/CAH2-Wz=mv4dmoapficrsyntv2kinxeotbwuy5r7fxxoc-oe...@mail.gmail.com
-- 
Peter Geoghegan




Re: [HACKERS] Small improvement to compactify_tuples

2017-11-08 Thread Peter Geoghegan
On Wed, Nov 8, 2017 at 8:19 AM, Robert Haas  wrote:
> I don't remember any more just how much faster qsort_tuple() and
> qsort_ssup() are than plain qsort(), but it was significant enough to
> convince me to commit 337b6f5ecf05b21b5e997986884d097d60e4e3d0...

IIRC, qsort_ssup() was about 20% faster at the time, while
qsort_tuple() was 5% - 10% faster.

-- 
Peter Geoghegan




Re: [HACKERS] MERGE SQL Statement for PG11

2017-11-07 Thread Peter Geoghegan
On Tue, Nov 7, 2017 at 3:29 PM, Nico Williams  wrote:
> On Thu, Nov 02, 2017 at 03:25:48PM -0700, Peter Geoghegan wrote:
>> Nico Williams  wrote:
>> >A MERGE mapped to a DML like this:
>
> I needed to spend more time reading MERGE docs from other RDBMSes.

Please don't hijack this thread. It's about the basic question of
semantics, and is already hard enough for others to follow as-is.

-- 
Peter Geoghegan




Re: [HACKERS] Small improvement to compactify_tuples

2017-11-07 Thread Peter Geoghegan
On Tue, Nov 7, 2017 at 2:40 PM, Юрий Соколов  wrote:
>> The same is true of unique indexes vs. non-unique.
>
> offtopic: recently I had a look at setting LP_DEAD in indexes.
> I didn't find a huge difference between unique and non-unique indices.
> There is a codepath that works only for unique indexes, but it is called less
> frequently than the common codepath that also sets LP_DEAD.

I meant to say that this is only important with UPDATEs + contention.
The extra LP_DEAD setting within _bt_check_unique() makes quite a
noticeable difference, at least in terms of index bloat (though less
so in terms of raw TPS).

-- 
Peter Geoghegan




Re: [HACKERS] Small improvement to compactify_tuples

2017-11-07 Thread Peter Geoghegan
On Tue, Nov 7, 2017 at 2:36 PM, Tom Lane  wrote:
> Peter Geoghegan  writes:
>> My point is only that it's worth considering that this factor affects
>> how representative your sympathetic case is. It's not clear how many
>> PageIndexMultiDelete() calls are from opportunistic calls to
>> _bt_vacuum_one_page(), how important that subset of calls is, and so
>> on. Maybe it doesn't matter at all.
>
> According to the perf measurements I took earlier, essentially all the
> compactify_tuple calls in this test case are from PageRepairFragmentation
> (from heap_page_prune), not PageIndexMultiDelete.

For a workload with high contention (e.g., lots of updates that follow
a Zipfian distribution) lots of important cleanup has to occur within
_bt_vacuum_one_page(), and with an exclusive buffer lock held. It may
be that making PageIndexMultiDelete() faster pays off
disproportionately well there, but I'd only expect to see that at
higher client count workloads with lots of contention -- workloads
that we still do quite badly on (note that we have never done well
here, even prior to commit 2ed5b87f9 -- Yura showed this at one
point).

It's possible that this work influenced Yura in some way.

When Postgres Pro did some benchmarking of this at my request, we saw
that the bloat got really bad past a certain client count. IIRC there
was a clear point at around 32 or 64 clients where TPS nosedived,
presumably because cleanup could not keep up. This was a 128 core box,
or something like that, so you'll probably have difficulty recreating
it with what's at hand.

-- 
Peter Geoghegan




Re: [HACKERS] Small improvement to compactify_tuples

2017-11-07 Thread Peter Geoghegan

On Tue, Nov 7, 2017 at 1:39 PM, Tom Lane  wrote:
> So I think we should seriously consider the attached, but it'd be a
> good idea to benchmark it on a wider variety of platforms and test
> cases.

> create unlogged table test3 (
>  id integer PRIMARY KEY with (fillfactor=85),
>  val text
>  ) WITH (fillfactor=85);

Passing observation:  Unlogged table B-Tree indexes have a much
greater tendency for LP_DEAD setting/kill_prior_tuple() working out
following commit 2ed5b87f9 [1], because unlogged tables were
unaffected by that commit. (I've been meaning to follow up with my
analysis of that regression, actually.)

The same is true of unique indexes vs. non-unique. There are workloads
where the opportunistic LP_DEAD setting performed by
_bt_check_unique() is really important (it calls ItemIdMarkDead()).
Think high contention workloads, like when Postgres is used to
implement a queue table.
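
As a concrete (made-up) sketch of the kind of queue-table workload meant
here -- not taken from any benchmark under discussion:

CREATE TABLE job_slots (
  slot_id    int PRIMARY KEY,
  claimed_by text,
  claimed_at timestamptz
);
INSERT INTO job_slots SELECT g, NULL, NULL FROM generate_series(1, 100) g;

-- Many concurrent sessions hammering a small set of rows.  Each non-HOT
-- update inserts another index entry for the same slot_id value, and the
-- opportunistic ItemIdMarkDead() calls in _bt_check_unique() are what keep
-- those dead duplicates from bloating the primary key index.
UPDATE job_slots
SET claimed_by = 'worker-42', claimed_at = now()
WHERE slot_id = (1 + floor(random() * 100))::int;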

My point is only that it's worth considering that this factor affects
how representative your sympathetic case is. It's not clear how many
PageIndexMultiDelete() calls are from opportunistic calls to
_bt_vacuum_one_page(), how important that subset of calls is, and so
on. Maybe it doesn't matter at all.

[1] 
https://postgr.es/m/cah2-wzmyry7mnjf0gw5wtk3cszh3gqfhhoxvsyuno5pk8cu...@mail.gmail.com
-- 
Peter Geoghegan




Re: [HACKERS] Parallel Hash take II

2017-11-07 Thread Peter Geoghegan
On Tue, Nov 7, 2017 at 1:01 PM, Andres Freund  wrote:
> +/*
> + * Build the name for a given segment of a given BufFile.
> + */
> +static void
> +MakeSharedSegmentName(char *name, const char *buffile_name, int segment)
> +{
> +   snprintf(name, MAXPGPATH, "%s.%d", buffile_name, segment);
> +}
>
> Not a fan of this name - you're not "making" a filename here (as in
> allocating or such). I think I'd just remove the Make prefix.

+1

Can we document the theory behind file naming here, if that isn't in
the latest version? This is a routine private to parallel hash join
(or shared tuplestore), not Buffile. Maybe Buffile should have some
opinion on this, though. Just as a matter of style.

> +/*
> + * Delete a BufFile that was created by BufFileCreateShared in the given
> + * SharedFileSet using the given name.
> + *
> + * It is not necessary to delete files explicitly with this function.  It is
> + * provided only as a way to delete files proactively, rather than waiting 
> for
> + * the SharedFileSet to be cleaned up.
> + *
> + * Only one backend should attempt to delete a given name, and should know
> + * that it exists and has been exported or closed.
> + */

This part is new to me. We now want one backend to delete a given
filename. What changed? Please provide a Message-Id reference if that
will help me to understand.

For now, I'm going to guess that this development had something to do
with the need to deal with virtual FDs that do a close() on an FD to
keep under backend limits. Do I have that right?

> +   /*
> +* We don't set FD_DELETE_AT_CLOSE for files opened this way, but we 
> still
> +* want to make sure they get closed at end of xact.
> +*/
> +   ResourceOwnerEnlargeFiles(CurrentResourceOwner);
> +   ResourceOwnerRememberFile(CurrentResourceOwner, file);
> +   VfdCache[file].resowner = CurrentResourceOwner;
>
> So maybe I'm being pedantic here, but wouldn't the right order be to do
> ResourceOwnerEnlargeFiles() *before* creating the file? It's a memory
> allocating operation, so it can fail, which'd leak the file.

I remember going to pains to get this right with my own unifiable
BufFile concept. I'm going to wait for an answer to my question about file
deletion + resowners before commenting further, though.

> +   if (vfdP->fdstate & FD_TEMP_FILE_LIMIT)
> +   {
> +   /* Subtract its size from current usage (do first in case of 
> error) */
> +   temporary_files_size -= vfdP->fileSize;
> +   vfdP->fileSize = 0;
> +   }
>
> So, is it right to do so unconditionally and without regard for errors?
> If the file isn't deleted, it shouldn't be subtracted from fileSize. I
> guess you're managing that through the flag, but that's not entirely
> obvious.

I think that the problem here is that the accounting is expected to
always work. It's not like there is a resowner style error path in
which temporary_files_size gets reset to 0.

> Is that entirely unproblematic? Are there any DSM callbacks that rely on
> locks still being held? Please split this part into a separate commit
> with such analysis.

I was always confused about the proper use of DSM callbacks myself.
They seemed generally underdocumented.

> +   /* Create the output buffer. */
> +   if (accessor->write_chunk != NULL)
> +   pfree(accessor->write_chunk);
> +   accessor->write_chunk = (SharedTuplestoreChunk *)
> +   palloc0(accessor->write_pages * BLCKSZ);
>
> Are we guaranteed to be in a long-lived memory context here?

I imagine that Thomas looked to tuplestore_begin_heap() + interXact as
a kind of precedent here. See comments above that function.

-- 
Peter Geoghegan




Re: [HACKERS] MERGE SQL Statement for PG11

2017-11-06 Thread Peter Geoghegan

Simon Riggs  wrote:

In step 3 we discover that an entry exists in the index for a committed row.

Since we have a unique index we use it to locate the row we know
exists and UPDATE that.

We don't use a new MVCC snapshot, we do what EPQ does. EPQ is already
violating MVCC for UPDATEs, so why does it matter if we do it for
INSERTs also?


Before I go on to say why I think that this approach is problematic, I
want to point out a few things that I think we actually agree on:

* EPQ is fairly arbitrary as a behavior for READ COMMITTED UPDATE
 conflict handling. It has more to do with how VACUUM works than about
 some platonic ideal that everyone agrees on.

* We can imagine other alternatives, such as the behavior in Oracle
 (statement level rollback + optimistic retry).

* Those alternatives are probably better in some ways but worse in other
 ways.

* EPQ violates snapshot consistency, even though that's not inherently
 necessary to avoid "READ COMMITTED serialization errors".

* ON CONFLICT also violates snapshot consistency, in rather a different
 way. (Whether or not this is necessary is more debatable.)

I actually think that other MVCC systems don't copy Oracle here, either,
and for similar pragmatic reasons. It's a mixed bag.


Where hides the problem?


The problem is violating MVCC is something that can be done in different
ways, and by meaningful degrees:

* EPQ semantics are believed to be fine because we don't get complaints
 about it. I think that that's because it's specialized to UPDATEs and
 UPDATE-like operations, where we walk an UPDATE chain specifically,
 and only use a dirty snapshot for the chain's newer tuples.

* ON CONFLICT doesn't care about UPDATE chains. Unlike EPQ, it makes no
 distinction between a concurrent UPDATE, and a concurrent DELETE + fresh
 INSERT. It's specialized to CONFLICTs.

This might seem abstract, but it has real, practical implications.
Certain contradictions exist when you start with MVCC semantics, then
fall back to EPQ semantics, then finally fall back to ON CONFLICT
semantics.

Questions about mixing these two things:

* What do we do if someone concurrently UPDATEs in a way that makes the
 qual not pass during EPQ traversal? Should we INSERT when that
 happens?

* If so, what about the case when the MERGE join qual/unique index
 values didn't change (just some other attributes that do not pass the
 additional WHEN MATCHED qual)?

* What about when there was a concurrent DELETE -- should we INSERT then?

ON CONFLICT goes from a CONFLICT, and then applies its own qual. That's
hugely different to doing it the other way around: starting from your
own MVCC snapshot qual, and going to a CONFLICT. This is because
evaluating the DO UPDATE's WHERE clause is just one little extra step
after the one and only latest row for that value has been locked.  You
could theoretically go this way with 2PL, I think, because that's a bit
like locking every row that the predicate touches, but of course that
isn't at all practical.

I should stop trying to make a watertight case against this, even though
I still think that's possible. For now, instead, I'll just say that this
is *extremely* complicated, and still has unresolved questions about
semantics.

--
Peter Geoghegan




Re: [HACKERS] MERGE SQL Statement for PG11

2017-11-06 Thread Peter Geoghegan

Simon Riggs  wrote:

APPROACH1
1. Join to produce results based upon snapshot at start of query
2. Apply results for INSERT, UPDATE or DELETE



Such failures are of great concern in practice because the time
between 1 and 2 could be very long for large statements, or for
smaller statements we might have sufficiently high concurrency to
allow us to see regular failures.


I'm not sure that they're a *great* concern in a world with something
that targets UPSERT use cases, which is a situation that does not exist
in DBMSs with MERGE (with the notable exception of Teradata). But it's
clearly a concern that users may expect to avoid duplicate violations in
READ COMMITTED, since this caused confusion among users of other
database systems with MERGE.


APPROACH2 (modified from my original proposal slightly)


This write-up actually begins to confront the issues that I've raised.
I'm glad to see this.


1. Join...
2. Apply results for UPDATE, if present not visible via the snapshot
taken at 1, do EPQ to ensure we locate current live tuple
3. If still not visible, do speculative insertion if we have a unique
index available, otherwise ERROR. If spec insertion fails, go to 2

The loop created above can live-lock, meaning that an infinite loop
could be created.


The loop is *guaranteed* to live-lock once you "goto 2". So you might as
well just throw an error at that point, which is the behavior that I've
been arguing for all along!

If this isn't guaranteed to live-lock at "goto 2", then it's not clear
why. The outcome of step 2 is clearly going to be identical if you don't
acquire a new MVCC snapshot, but you don't address that.

You might have meant "apply an equivalent ON CONFLICT DO UPDATE", or
something like that, despite the fact that the use of ON CONFLICT DO
NOTHING was clearly implied by the "goto 2". I also see problems with
that, but I'll wait for you to clarify what you meant before going into
what they are.


In practice, such live-locks are rare and we could detect them by
falling out of the loop after a few tries. Approach2's purpose is to
alleviate errors in Approach1, so falling out of the loop merely takes
us back to the error we would have got if we didn't try, so Approach2
has considerable benefit over Approach1.


I don't hate the idea of retrying a fixed number of times for things
like this, but I don't like it either. I'm going to assume that it's
fine for now.


I read that step 3 in Approach2 is some kind of problem in MVCC
semantics. My understanding is that SQL Standard allows us to define
what the semantics of the statement are in relation to concurrency, so
any semantic issue can be handled by defining it to work the way we
want.


My only concern is that our choices here should be good ones, based on
practical considerations. We both more or less agree on how this should
be assessed, I think; we just reach different conclusions.


As you point out, whichever we choose, we will be bound by those
semantics. So if we take Approach1, as has been indicated currently,
what is the written explanation for that, so we can show that to the
people who ask in the future about our decisions?


Well, Approach1 is what other systems implement. I think that it would
be important to point out that MERGE with Approach1 isn't special, but
ON CONFLICT DO UPDATE is special. We'd also say that higher isolation
levels will not have duplicate violations.

--
Peter Geoghegan




Re: [HACKERS] Display number of heap accesses for index scans

2017-11-05 Thread Peter Geoghegan

Andres Freund  wrote:

The number of index lookups that failed to return anything can be a
critical performance factor in OLTP workloads.  Therefore it seems like
it'd be a good idea to extend the explain analyze output to include that
information.


I certainly agree.

I've sometimes wondered if we could do better here by exploiting unique
indexes' uniqueness property. The general idea is that we could avoid
visiting duplicates in the heap when we've determined that they cannot
be visible to our MVCC snapshot - we already encountered the visible
version for that value, so this must be true. This is somewhat like what
we do with HOT chains, except it occurs across index tuples with the
same value rather than across heap tuples within a HOT chain.

The big difficulty is LP_DEAD bit setting within B-Tree pages; if no one
is visiting the heap, no one can opportunistically mark the bit to
indicate that nobody needs to visit the heap tuple pointed to by the
index tuple (because the heap tuple is dead to everyone). Now, there
isn't an immediate problem with that, because the main point of both
optimizations is to avoid heap visits. But there is a problem when there
needs to be a page split and there is a lack of LP_DEAD bits set from
which to reclaim space to defer the page split (in nbtree, we can
reclaim the space needed for an index tuple that needs to be
inserted without actually splitting, in the end). We cannot undermine the
secondary goal of the LP_DEAD/kill_prior_tuple optimization, which is
reclaiming space early within B-Tree pages.

I can imagine a way of addressing this problem, though it is very
invasive - versioning distinct index tuple versions in the index with
ordinal version numbers. I wonder if anyone else has thought about doing
something like this, and has a better idea, though. I think that we
could be cleverer about unique indexes in a number of different ways
[1]. Reducing heap accesses is a big one.

[1] 
https://wiki.postgresql.org/wiki/Key_normalization#Making_all_items_in_the_index_unique_by_treating_heap_TID_as_an_implicit_last_attribute
--
Peter Geoghegan




Re: [HACKERS] Small improvement to compactify_tuples

2017-11-04 Thread Peter Geoghegan

Юрий Соколов  wrote:

tps is also reflects changes:
~17ktps with qsort
~19ktps with bucket sort

Also vacuum of benchmark's table is also improved:
~3s with qsort,
~2.4s with bucket sort


One thing that you have to be careful with when it comes to our qsort
with partially presorted inputs is what I like to call "banana skin
effects":

https://postgr.es/m/cah2-wzku2xk2dpz7n8-a1mvuuttuvhqkfna+eutwnwctgyc...@mail.gmail.com

This may have nothing at all to do with your results; I'm just pointing
it out as a possibility.

--
Peter Geoghegan




Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple

2017-11-03 Thread Peter Geoghegan

Alvaro Herrera  wrote:

He means that the tuple that heap_update moves to page 1 (which will no
longer be processed by vacuum) will contain a multixact that's older
than relminmxid -- because it is copied unchanged by heap_update instead
of properly checking against age limit.


I see. The problem is more or less with this heap_update() code:

    /*
     * And also prepare an Xmax value for the new copy of the tuple.  If there
     * was no xmax previously, or there was one but all lockers are now gone,
     * then use InvalidXid; otherwise, get the xmax from the old tuple.  (In
     * rare cases that might also be InvalidXid and yet not have the
     * HEAP_XMAX_INVALID bit set; that's fine.)
     */
    if ((oldtup.t_data->t_infomask & HEAP_XMAX_INVALID) ||
        HEAP_LOCKED_UPGRADED(oldtup.t_data->t_infomask) ||
        (checked_lockers && !locker_remains))
        xmax_new_tuple = InvalidTransactionId;
    else
        xmax_new_tuple = HeapTupleHeaderGetRawXmax(oldtup.t_data);

My naive guess is that we have to create a new MultiXactId here in at
least some cases, just like FreezeMultiXactId() sometimes does.

--
Peter Geoghegan




Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple

2017-11-03 Thread Peter Geoghegan

Andres Freund  wrote:

Here's that patch.  I've stared at this some, and Robert did too. Robert
mentioned that the commit message might need some polish and I'm not
100% sure about the error message texts yet.


The commit message should probably say that the bug involves the
resurrection of previously dead tuples, which is different to there
being duplicates because a constraint is not enforced because HOT chains
are broken (that's a separate, arguably less serious problem).


Staring at the vacuumlazy hunk I think I might have found a related bug:
heap_update_tuple() just copies the old xmax to the new tuple's xmax if
a multixact and still running.  It does so without verifying liveliness
of members.  Isn't that buggy? Consider what happens if we have three
blocks: 1 has free space, two is being vacuumed and is locked, three is
full and has a tuple that's key share locked by a live tuple and is
updated by a dead xmax from before the xmin horizon. In that case afaict
the multi will be copied from the third page to the first one.  Which is
quite bad, because vacuum already processed it, and we'll set
relfrozenxid accordingly.  I hope I'm missing something here?


Can you be more specific about what you mean here? I think that I
understand where you're going with this, but I'm not sure.

--
Peter Geoghegan




Re: [HACKERS] MERGE SQL Statement for PG11

2017-11-03 Thread Peter Geoghegan

Simon Riggs  wrote:

The *only* behavioural difference I have proposed would be the *lack*
of an ERROR in (some) concurrent cases.


I think that's a big difference.  Error vs. non-error is a big deal by
itself;


Are you saying avoiding an ERROR is a bad or good thing?


Are you really asking Robert to repeat what has already been said about
a dozen different ways?

That's *not* the only difference. You need to see a couple of steps
ahead to see further differences, as the real dilemma comes when you
have to reconcile having provided the UPSERT guarantees with cases that
those guarantees don't map onto (which can happen in a number of different ways).

I don't understand why you'll talk about just about anything but that.
This is a high-level concern about the overarching design. Do you really
not understand the concern at this point?


also, the non-error case involves departing from MVCC
semantics just as INSERT .. ON CONFLICT UPDATE does.


Meaning what exactly? What situation occurs that a user would be concerned with?

Please describe exactly what you mean so we get it clear.

The concurrent behaviour for MERGE is allowed to be
implementation-specific, so we can define it any way we want.


Agreed -- we can. It isn't controversial at all to say that the SQL
standard has nothing to say on this question. The problem is that the
semantics you argue for are ill-defined, and seem to create more
problems than they solve. Why keep bringing up the SQL standard?

--
Peter Geoghegan




Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2017-11-02 Thread Peter Geoghegan

Thomas Munro  wrote:

I'm going to make an item on my personal TODO list for that. No useful
insights on that right now, though.


I decided to try that, but it didn't really work: fd.h gets included
by front-end code, so I can't very well define a struct and declare
functions that deal in dsm_segment and slock_t.  On the other hand it
does seem a bit better to for these shared file sets to work in terms
of File, not BufFile.


Realistically, fd.h has a number of functions that are really owned by
buffile.c already. This sounds fine.


That way you don't have to opt in to BufFile's
double buffering and segmentation schemes just to get shared file
clean-up, if for some reason you want direct file handles.


Is that something that you really think is possible?

--
Peter Geoghegan




Re: [HACKERS] MERGE SQL Statement for PG11

2017-11-02 Thread Peter Geoghegan

Nico Williams  wrote:

A MERGE mapped to a DML like this:

  WITH
  updated AS (
    UPDATE <target>
    SET ...
    WHERE <condition>
    RETURNING <key>
  )
  , inserted AS (
    INSERT INTO <target>
    SELECT ...
    WHERE <key> NOT IN (SELECT <key> FROM updated) AND ..
    ON CONFLICT DO NOTHING -- see below!
    RETURNING <key>
  )
  DELETE FROM <target>
  WHERE <key> NOT IN (SELECT <key> FROM updated) AND
        <key> NOT IN (SELECT <key> FROM inserted) AND ...;



This is a bad idea. An implementation like this is not at all
maintainable.


can handle concurrency via ON CONFLICT DO NOTHING in the INSERT CTE.


That's not handling concurrency -- it's silently ignoring an error. Who
is to say that the conflict that IGNORE ignored is associated with a row
visible to the MVCC snapshot of the statement? IOW, why should the DELETE
affect any row?

There are probably a great many reasons why you need a ModifyTable
executor node that keeps around state, and explicitly indicates that a
MERGE is a MERGE. For example, we'll probably want statement level
triggers to execute in a fixed order, regardless of the MERGE; RLS will
probably require explicit knowledge of MERGE semantics, and so on.

FWIW, your example doesn't actually have a source (just a target), so it
isn't actually like MERGE.

--
Peter Geoghegan




Re: [HACKERS] MERGE SQL Statement for PG11

2017-11-02 Thread Peter Geoghegan

Simon Riggs  wrote:

I think people imagined you had worked out how to make MERGE run
concurrently, I certainly did, but in fact you're just saying you
don't believe it ever should.


I'm certain that they didn't think that at all. But I'll let them speak
for themselves.


That is strange since the SQL Standard specifically allows the
implementation to decide upon concurrent behaviour.


And yet nobody else decided to do what you propose with this apparent
leeway.  (See the UPSERT wiki page for many references that confirm
this.)


So in your view we should make no attempt to avoid concurrent errors,
even when we have the capability to do so (in some cases) and doing so
would be perfectly compliant with the SQL Standard.


Yes. That's what I believe. I believe this because I can't see a way to
do this that isn't a mess, and because ON CONFLICT DO UPDATE exists and
works well for the cases where we can do better in READ COMMITTED mode.

Did you know that Oracle doesn't have an EPQ style mechanism at all?
Instead, it rolls back the entire statement and retries it from scratch.
That is not really any further from or closer to the standard than the
EPQ stuff, because the standard doesn't say anything about what should
happen as far as READ COMMITTED conflict handling goes.

My point here is that all of the stuff I'm talking about is only
relevant in READ COMMITTED mode, in areas where the standard never
provides us with guidance. If you can rely on SSI, then there is no
difference between what you propose and what I propose anyway, except
that what I propose is more general and will have better performance,
especially for batch MERGEs. If READ COMMITTED didn't exist,
implementing ON CONFLICT would have been more or less free of
controversy.


Yes, that certainly will make an easier patch for MERGE.


Indeed, it will.

--
Peter Geoghegan




Re: [HACKERS] MERGE SQL Statement for PG11

2017-11-02 Thread Peter Geoghegan

Simon Riggs  wrote:

So if I understand you correctly, in your view MERGE should just fail
with an ERROR if it runs concurrently with other DML?


That's certainly my opinion on the matter. It seems like that might be
the consensus, too.

Obviously there are things that you as a user can do about this on your
own, like opt to use a higher isolation level, or manually LOCK TABLE.
For some use cases, including bulk loading for OLAP, users might just
know that there isn't going to be concurrent activity because it's not
an OLTP system.
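
For instance (a sketch -- MERGE itself is still only proposed syntax here,
and the table names are borrowed from the mock documentation):

-- Run the MERGE at a higher isolation level, accepting serialization
-- failures and retrying...
BEGIN ISOLATION LEVEL SERIALIZABLE;
MERGE INTO wines w
USING wine_stock_changes s ON s.winename = w.winename
WHEN MATCHED THEN UPDATE SET stock = w.stock + s.stock_delta
WHEN NOT MATCHED THEN INSERT VALUES (s.winename, s.stock_delta);
COMMIT;

-- ...or take a lock that conflicts with concurrent DML for the duration:
BEGIN;
LOCK TABLE wines IN SHARE ROW EXCLUSIVE MODE;
-- MERGE as above
COMMIT;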

If this still seems odd to you, then consider that exactly the same
situation exists with UPDATE. A user could want their UPDATE to affect a
row where no row version is actually visible to their MVCC snapshot,
because they have an idea about reliably updating the latest row. UPDATE
doesn't work like that, of course. Is this unacceptable because the user
expects that it should work that way?

Bear in mind that ON CONFLICT DO UPDATE *can* actually update a row when
there is no version of it visible to the snapshot. It can also update a
row where there is a concurrent DELETE + INSERT, and the tuples with the
relevant unique index values end up not even being part of the same
update chain in each case (MVCC-snapshot-visible vs. latest). IOW, you
may end up updating a completely different logical row to the row with
the conflicting value that is visible to your MVCC snapshot!
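
To make that concrete (a sketch with made-up names):

CREATE TABLE counters (k text PRIMARY KEY, v int NOT NULL);

-- Under READ COMMITTED this may lock and update the *latest* committed row
-- with k = 'x', even when no version of that row is visible to the
-- statement's MVCC snapshot -- e.g. because the row was deleted and
-- re-inserted by a transaction that committed after our snapshot was taken.
INSERT INTO counters (k, v) VALUES ('x', 1)
ON CONFLICT (k) DO UPDATE SET v = counters.v + EXCLUDED.v;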


i.e. if a race condition between the query and an INSERT runs
concurrently with another INSERT

We have no interest in making that work?


Without meaning to sound glib: we already did make it work for a
special, restricted case that is important enough to justify introducing
a couple of kludges -- ON CONFLICT DO UPDATE/upsert.

I do agree that what I propose for MERGE will probably cause confusion;
just look into Oracle's MERGE implementation for examples of this.  We
ought to go out of our way to make it clear that MERGE doesn't provide
these guarantees.

--
Peter Geoghegan




Re: [HACKERS] MERGE SQL Statement for PG11

2017-11-02 Thread Peter Geoghegan

Nico Williams  wrote:

If you want to ignore conflicts arising from concurrency you could
always add an ON CONFLICT DO NOTHING to the INSERT DML in the mapping I
proposed earlier.  Thus a MERGE CONCURRENTLY could just do that.

Is there any reason not to map MERGE as I proposed?


Performance, for one. MERGE generally has a join that can be optimized
like an UPDATE FROM join.

I haven't studied this question in any detail, but FWIW I think that
using CTEs for merging is morally equivalent to a traditional MERGE
implementation. It may actually be possible to map from CTEs to a MERGE
statement, but I don't think that that's a good approach to implementing
MERGE.

Most of the implementation time will probably be spent doing things like
making sure MERGE behaves appropriately with triggers, RLS, updatable
views, and so on. That will take quite a while, but isn't particularly
technically challenging IMV.

--
Peter Geoghegan




Re: [HACKERS] MERGE SQL Statement for PG11

2017-11-02 Thread Peter Geoghegan

Robert Haas  wrote:

And if, in the meantime, MERGE can only handle the cases where there
is a unique index, then it can only handle the cases INSERT .. ON
CONFLICT UPDATE can cover, which makes it, as far as I can see,
syntactic sugar over what we already have.  Maybe it's not entirely -
you might be planning to make some minor functional enhancements - but
it's not clear what those are, and I feel like whatever it is could be
done with less work and more elegance by just extending the INSERT ..
ON CONFLICT UPDATE syntax.


+1

Marko Tiikkaja's INSERT ... ON CONFLICT SELECT patch, which is in the
current CF [1], moves things in this direction. I myself have
occasionally wondered if it was worth adding an alternative DO DELETE
conflict_action. This could appear alongside DO UPDATE, and be applied
using MERGE-style conditions.
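
Purely to illustrate the shape of that idea (hypothetical syntax -- nothing
like this exists or has been formally proposed):

-- Hypothetical: delete the conflicting row instead of updating it, subject
-- to an extra qual, in the spirit of MERGE's WHEN MATCHED ... DELETE.
INSERT INTO session_cache (key, val) VALUES ('k1', 'v1')
ON CONFLICT (key) DO DELETE WHERE session_cache.expires_at < now();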

All of these things seem like small adjuncts to ON CONFLICT because
they're all just an alternative way of modifying or projecting the tuple
that is locked by ON CONFLICT. Everything new would have to happen after
the novel ON CONFLICT handling has already completed.

The only reason that I haven't pursued this is because it doesn't seem
that compelling. I mention it now because it's worth acknowledging that
ON CONFLICT could be pushed a bit further in this direction. Of course,
this still falls far short of making ON CONFLICT entirely like MERGE.

[1] https://commitfest.postgresql.org/15/1241/
--
Peter Geoghegan




Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple

2017-11-02 Thread Peter Geoghegan
On Thu, Nov 2, 2017 at 9:44 AM, Robert Haas  wrote:
> The second commit (22576734b805fb0952f9be841ca8f643694ee868) is where
> I think things get a lot more dangerous.  The problem (as Andres
> pointed out to me this afternoon) is that it seems possible to end up
> with a situation where there should be two HOT chains on a page, and
> because of the weakened xmin/xmax checking rules, we end up thinking
> that the second one is a continuation of the first one, which will be
> all kinds of bad news.  That would be particularly likely to happen in
> releases from before we invented HEAP_XMIN_FROZEN, when there's no
> xmin/xmax matching at all, but could happen on later releases if we
> are extraordinarily unlucky (i.e. xmin of the first tuple in the
> second chain happens to be the same as the pre-freeze xmax in the old
> chain, probably because the same XID was used to update the page in
> two consecutive epochs).  Fortunately, that commit is (I think) not
> released anywhere.

FWIW, if you look at the second commit
(22576734b805fb0952f9be841ca8f643694ee868) carefully, you'll realize
that it doesn't even treat those two cases differently. It was buggy
even on its own terms. The FrozenTransactionId test used an xmin from
HeapTupleHeaderGetXmin(), not HeapTupleHeaderGetRawXmin().

-- 
Peter Geoghegan




Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple

2017-11-02 Thread Peter Geoghegan
On Thu, Nov 2, 2017 at 4:20 AM, Andres Freund  wrote:
> I think the problem is on the pruning, rather than the freezing side. We
> can't freeze a tuple if it has an alive predecessor - rather than
> weakening this, we should be fixing the pruning to not have the alive
> predecessor.

Excellent catch.

> If the update xmin is actually below the cutoff we can remove the tuple
> even if there's live lockers - the lockers will also be present in the
> newer version of the tuple.  I verified that for me that fixes the
> problem. Obviously that'd require some comment work and more careful
> diagnosis.

I didn't even know that that was safe.

> I think a5736bf754c82d8b86674e199e232096c679201d might be dangerous in
> the face of previously corrupted tuple chains and pg_upgraded clusters -
> it can lead to tuples being considered related, even though they're
> from entirely independent hot chains. Especially when upgrading 9.3 post
> your fix, to current releases.

Frankly, I'm relieved that you got to this. I was highly suspicious of
a5736bf754c82d8b86674e199e232096c679201d, even beyond my specific,
actionable concern about how it failed to handle the
9.3/FrozenTransactionId xmin case as special. As I went into in the
"heap/SLRU verification, relfrozenxid cut-off, and freeze-the-dead
bug" thread, these commits left us with a situation where there didn't
seem to be a reliable way of knowing whether or not it is safe to
interrogate clog for a given heap tuple using a tool like amcheck.
And, it wasn't obvious that you couldn't have a codepath that failed
to account for pre-cutoff non-frozen tuples -- codepaths that call
TransactionIdDidCommit() despite it actually being unsafe.

If I'm not mistaken, your proposed fix restores sanity there.

-- 
Peter Geoghegan




Re: [HACKERS] MERGE SQL Statement for PG11

2017-11-01 Thread Peter Geoghegan
On Wed, Nov 1, 2017 at 10:19 AM, Simon Riggs  wrote:
>> The problem here is: Iff the first statement uses ON CONFLICT
>> infrastructure, doesn't the absence of WHEN NOT MATCHED imply
>> different semantics for the remaining updates and deletes in the
>> second version of the query?
>
> Not according to the SQL Standard, no. I have no plans for such
> differences to exist.
>
> Spec says: If we hit HeapTupleSelfUpdated then we throw an ERROR.

Your documentation said that the MERGE was driven by a speculative
insertion (BTW, I don't think that this internal implementation detail
should be referenced in user-facing docs). I inferred that that could
not always be true, since there won't always be an INSERT/WHEN NOT
MATCHED case, assuming that you allow that at all (I now gather that you
will).

>> You've removed what seems like a neat
>> adjunct to the MERGE, but it actually changes everything else too when
>> using READ COMMITTED. Isn't that pretty surprising?
>
> I think you're presuming things I haven't said and don't mean, so
> we're both surprised.

You're right -- I'm surmising what I think might be true, because I
don't have the information available to know one way or the other. As
far as this issue with using speculative insertions in one context but
not in another goes, I still don't really know where you stand. I can
still only surmise that you must want both implementations, and will use
one or the other as circumstances dictate (to avoid dup violations in
the style of ON CONFLICT where that's possible).

This seems true because you now say that it will be possible to omit
WHEN NOT MATCHED, and yet there is no such thing as a speculative
insertion without the insertion. You haven't said that that conclusion
is true yourself, but it's the only conclusion that I can draw based on
what you have said.

> I think we need some way of expressing the problems clearly.

It's certainly hard to talk about these problems. I know this from
experience.

> "a general purpose solution is one that more or
> less works like an UPDATE FROM, with an outer join, whose ModifyTable
> node is capable of insert, update, or delete (and accepts quals for
> MATCHED and NOT matched cases, etc). You could still get duplicate
> violations due to concurrent activity in READ COMMITTED mode".
>
> Surely the whole point of this is to avoid duplicate violations due to
> concurrent activity?

Now we're getting somewhere.

I *don't* think that that's the whole point of MERGE. No other MERGE
implementation does that, or claims to do that. The SQL standard says
nothing about this. Heikki found this to be acceptable when working on
the GSoC MERGE implementation that went nowhere.

My position is that we ought to let MERGE be MERGE, and let ON CONFLICT
be ON CONFLICT.

In Postgres, you can avoid duplicate violations with MERGE by using a
higher isolation level (these days, those are turned into a
serialization error at higher isolation levels when no duplicate is
visible to the xact's snapshot).  MERGE isn't and shouldn't be special
when it comes to concurrency.

> I'm not seeing how either design sketch rigorously avoids live locks,
> but those are fairly unlikely and easy to detect and abort.

My MERGE semantics (which really are not mine at all) avoid live
lock/lock starvation by simply never retrying anything without making
forward progress. MERGE doesn't take any special interest in
concurrency, just like any other DML statement that isn't INSERT with ON
CONFLICT.

ON CONFLICT would have live locks if it didn't always have the choice of
inserting [1]. In many ways, the syntax of INSERT ON CONFLICT DO UPDATE
is restricted in exactly the way it needs to be in order to function
correctly. It wasn't an accident that it didn't end up being UPDATE ...
ON NOUPDATE DO INSERT, or something like that, which Robert proposed at
one point.

ON CONFLICT plays by its own rules to a certain extent, because that's
what you need in order to get the desired guarantees in READ COMMITTED
mode [2]. This is the main reason why it was as painful a project as it
was. Further generalizing that seems fraught with difficulties. It seems
logically impossible to generalize it in a way where you don't end up
with two behaviors masquerading as one.

[1] https://wiki.postgresql.org/wiki/UPSERT#Theoretical_lock_starvation_hazards
[2] https://wiki.postgresql.org/wiki/UPSERT#Goals_for_implementation
--
Peter Geoghegan




Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2017-10-31 Thread Peter Geoghegan
On Tue, Oct 31, 2017 at 5:07 PM, Thomas Munro wrote:
> So that's this bit:
>
> + pg_itoa(worker, filename);
> + lts->pfile = BufFileCreateShared(fileset, filename);
>
> ... and:
>
> + pg_itoa(i, filename);
> + file = BufFileOpenShared(fileset, filename);

Right.

> What's wrong with using a worker number like this?

I guess nothing, though there is the question of discoverability for
DBAs, etc. You do address this separately, by having (potentially)
descriptive filenames, as you go into.

> It's not random choice: buffile.c creates a uniquely named directory
> (or directories, if you have more than one location configured in the
> temp_tablespaces GUC) to hold all the backing files involved in each
> BufFileSet.  Naming of BufFiles within the BufFileSet is the caller's
> problem, and a worker number seems like a reasonable choice to me.  It
> won't collide with a concurrent parallel CREATE INDEX because that'll
> be using its own BufFileSet.

Oh, I see. I may have jumped the gun on that one.

> One complaint about the current coding that someone might object to:
> MakeSharedSegmentPath() just dumps the caller's BufFile name into a
> path without sanitisation: I should fix that so that we only accept
> fairly limited strings here.  Another complaint is that perhaps fd.c
> knows too much about buffile.c's business.  For example,
> RemovePgTempFilesInDir() knows about the ".set" directories created by
> buffile.c, which might be called a layering violation.  Perhaps the
> set/directory logic should move entirely into fd.c, so you'd call
> FileSetInit(FileSet *), not BufFileSetInit(BufFileSet *), and then
> BufFileOpenShared() would take a FileSet *, not a BufFileSet *.
> Thoughts?

I'm going to make an item on my personal TODO list for that. No useful
insights on that right now, though.

> 3.  sharedtuplestore.c takes a caller-supplied BufFileSet and creates
> its shared BufFiles in there.  Earlier versions created and owned a
> BufFileSet, but in the current Parallel Hash patch I create loads of
> separate SharedTuplestore objects but I didn't want to create load of
> directories to back them, so you can give them all the same
> BufFileSet.  That works because SharedTuplestores are also given a
> name, and it's the caller's job (in my case nodeHash.c) to make sure
> the SharedTuplestores are given unique names within the same
> BufFileSet.  For Parallel Hash you'll see names like 'i3of8' (inner
> batch 3 of 8).  There is no need for there to be in any sort of
> central registry for that though, because it rides on top of the
> guarantees from 2 above: buffile.c will put those files into a
> uniquely named directory, and that works as long as no one else is
> allowed to create files or directories in the temp directory that
> collide with its reserved pattern /^pgsql_tmp.+\.set$/.  For the same
> reason, parallel CREATE INDEX is free to use worker numbers as BufFile
> names, since it has its own BufFileSet to work within.

If the new standard is that you have temp file names that suggest the
purpose of each temp file, then that may be something that parallel
CREATE INDEX should buy into.

> In an earlier version, BufFileSet was one of those annoying data
> structures with a FLEXIBLE_ARRAY_MEMBER that you'd use as an
> incomplete type (declared but not defined in the includable header),
> and here it was being used "inside" (or rather after) SharedSort,
> which *itself* had a FLEXIBLE_ARRAY_MEMBER.  The reason for the
> variable sized object was that I needed all backends to agree on the
> set of temporary tablespace OIDs, of which there could be any number,
> but I also needed a 'flat' (pointer-free) object I could stick in
> relocatable shared memory.  In the newest version I changed that
> flexible array to tablespaces[8], because 8 should be enough
> tablespaces for anyone (TM).

I guess that that's something that you'll need to take up with Andres,
if you haven't already. I have a hard time imagining a single query
needing to use more than that many tablespaces at once, so maybe this
is fine.

> I don't really believe anyone uses
> temp_tablespaces for IO load balancing anymore and I hate code like
> the above.  So I think Rushabh should now remove the above-quoted code
> and just use a BufFileSet directly as a member of SharedSort.

FWIW, I agree with you that nobody uses temp_tablespaces this way
these days. This seems like a discussion for your hash join patch,
though. I'm happy to buy into that.

-- 
Peter Geoghegan




Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2017-10-31 Thread Peter Geoghegan
On Thu, Oct 26, 2017 at 4:22 AM, Rushabh Lathia wrote:
> Attaching the rebased patch according to the v22 parallel-hash patch sets

I took a quick look at this today, and noticed a few issues:

* make_name() is used to name files in sharedtuplestore.c, which is
what is passed to BufFileOpenShared() for parallel hash join. Your
using your own logic for that within the equivalent logtape.c call to
BufFileOpenShared(), presumably because make_name() wants to identify
participants by PID rather than by an ordinal identifier number.

I think that we need some kind of central registry for things that use
shared buffiles. It could be that sharedtuplestore.c is further
generalized to support this, or it could be that they both call
something else that takes care of naming. It's not okay to have this
left to random chance.

You're going to have to ask Thomas about this. You should also use
MAXPGPATH for the char buffer on the stack.

* This logtape.c comment needs to be updated, as it's no longer true:

 * successfully.  In general, workers can take it that the leader will
 * reclaim space in files under their ownership, and so should not
 * reread from tape.

* Robert hated the comment changes in the header of nbtsort.c. You
might want to change it back, because he is likely to be the one that
commits this.

* You should look for similar comments in tuplesort.c (IIRC a couple
of places will need to be revised).

* tuplesort_begin_common() should actively reject a randomAccess
parallel case using elog(ERROR).

* tuplesort.h should note that randomAccess isn't supported, too.

* What's this all about?:

+ /* Accessor for the SharedBufFileSet that is at the end of Sharedsort. */
+ #define GetSharedBufFileSet(shared)\
+   ((BufFileSet *) (&(shared)->tapes[(shared)->nTapes]))

You can't just cast from one type to the other without regard for the
underlying size of the shared memory buffer, which is what this looks
like to me. This only fails to crash because you're only abusing the
last member in the tapes array for this purpose, and there happens to
be enough shared memory slop that you get away with it. I'm pretty
sure that ltsUnify() ends up clobbering the last/leader tape, which is
a place where BufFileSet is also used, so this is just wrong. You
should rethink the shmem structure a little bit.

* There is still that FIXME comment within leader_takeover_tapes(). I
believe that you should still have a leader tape (at least in local
memory in the leader), even though you'll never be able to do anything
with it, since randomAccess is no longer supported. You can remove the
FIXME, and just note that you have a leader tape to be consistent with
the serial case, though recognize that it's not useful. Note that even
with randomAccess, we always had the leader tape, so it's not that
different, really.

I suppose it might make sense to make shared->tapes not have a leader
tape. It hardly matters -- perhaps you should leave it there in order
to keep the code simple, as you'll be keeping the leader tape in local
memory, too. (But it still won't fly to continue to clobber it, of
course -- you still need to find a dedicated place for BufFileSet in
shared memory.)

That's all I have right now.
-- 
Peter Geoghegan




Re: [HACKERS] MERGE SQL Statement for PG11

2017-10-31 Thread Peter Geoghegan
On Tue, Oct 31, 2017 at 2:25 AM, Simon Riggs  wrote:
> If there are challenges ahead, its reasonable to ask for test cases
> for that now especially if you think you know what they already are.
> Imagine we go forwards 2 months - if you dislike my patch when it
> exists you will submit a test case showing the fault. Why not save us
> all the trouble and describe that now? Test Driven Development.

I already have, on several occasions now. But if you're absolutely
insistent on my constructing the test case in terms of a real SQL
statement, then that's what I'll do.

Consider this MERGE statement, from your mock documentation:

MERGE INTO wines w
USING wine_stock_changes s
ON s.winename = w.winename
WHEN NOT MATCHED AND s.stock_delta > 0 THEN
  INSERT VALUES(s.winename, s.stock_delta)
WHEN MATCHED AND w.stock + s.stock_delta > 0 THEN
  UPDATE SET stock = w.stock + s.stock_delta
ELSE
  DELETE;

Suppose we remove the WHEN NOT MATCHED case, leaving us with:

MERGE INTO wines w
USING wine_stock_changes s
ON s.winename = w.winename
WHEN MATCHED AND w.stock + s.stock_delta > 0 THEN
  UPDATE SET stock = w.stock + s.stock_delta
ELSE
  DELETE;

We now have a MERGE that will not INSERT, but will continue to UPDATE
and DELETE. (It's implied that your syntax cannot do this at all,
because you propose to use the ON CONFLICT infrastructure, but I think we
have to imagine a world in which that restriction either never existed
or has subsequently been lifted.)

The problem here is: Iff the first statement uses ON CONFLICT
infrastructure, doesn't the absence of WHEN NOT MATCHED imply
different semantics for the remaining updates and deletes in the
second version of the query? You've removed what seems like a neat
adjunct to the MERGE, but it actually changes everything else too when
using READ COMMITTED. Isn't that pretty surprising? If you're not
clear on what I mean, see my previous remarks on EPQ, live lock, and
what a CONFLICT could be in READ COMMITTED mode. Concurrent activity
at READ COMMITTED mode can be expected to significantly alter the
outcome here.

Why not just always use the behavior that the second query requires,
which is very much like an UPDATE FROM with an outer join that can
sometimes do deletes (and inserts)? We should always use an MVCC
snapshot, and never play ON CONFLICT style games with visibility/dirty
snapshots.
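
(To be concrete about "very much like an UPDATE FROM": the WHEN MATCHED
arm of that second statement is roughly equivalent to the following --
an illustration only, not something I expect any patch to literally
generate:)

UPDATE wines w
SET stock = w.stock + s.stock_delta
FROM wine_stock_changes s
WHERE s.winename = w.winename
AND w.stock + s.stock_delta > 0;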

> It's difficult to discuss anything with someone that refuses to
> believe that there are acceptable ways around things. I believe there
> are.

Isn't that blind faith? Again, it seems like you haven't really tried
to convince me.

> If you can calm down the rhetoric we can work together, but if you
> continue to grandstand it makes it more difficult.

I'm trying to break the deadlock, by getting you to actually consider
what I'm saying. I don't enjoy confrontation. Currently, it seems like
you're just ignoring my objections, which you actually could fairly
easily work through. That is rather frustrating.

> You've said its possible another way. Show that assertion is actually
> true. We're all listening, me especially, for the technical details.

My proposal, if you want to call it that, has the merit of actually
being how MERGE works in every other system. Both Robert and Stephen
seem to be almost sold on what I suggest, so I think that I've
probably already explained my position quite well.

-- 
Peter Geoghegan




Re: [HACKERS] MERGE SQL Statement for PG11

2017-10-30 Thread Peter Geoghegan
On Mon, Oct 30, 2017 at 11:07 AM, Simon Riggs  wrote:
> Please explain in detail the MERGE SQL statements that you think will
> be problematic and why.

Your proposal is totally incomplete, so I can only surmise its
behavior in certain cases, to make a guess at what the problems might
be (see my remarks on EPQ, live lock, etc). This makes it impossible
to do what you ask right now.

Besides, you haven't answered the question from my last e-mail
("What's wrong with that [set of MERGE semantics]?"), so why should I
go to further trouble? You're just not constructively engaging with me
at this point. We're going around in circles.

-- 
Peter Geoghegan




Re: [HACKERS] MERGE SQL Statement for PG11

2017-10-30 Thread Peter Geoghegan
On Mon, Oct 30, 2017 at 6:21 AM, Simon Riggs  wrote:
>> That's not really true.  Nobody's going to be happy if MERGE has one
>> behavior in one set of cases and an astonishingly different behavior
>> in another set of cases.  If you adopt a behavior for certain cases
>> that can't be extended to other cases, then you're blocking a
>> general-purpose MERGE.
>
> If a general purpose solution exists, please explain what it is.

For the umpteenth time, a general purpose solution is one that more or
less works like an UPDATE FROM, with an outer join, whose ModifyTable
node is capable of insert, update, or delete (and accepts quals for
MATCHED and NOT matched cases, etc). You could still get duplicate
violations due to concurrent activity in READ COMMITTED mode, but not
at higher isolation levels thanks to Thomas Munro's work there. In
this world, ON CONFLICT and MERGE are fairly distinct things.

What's wrong with that? You haven't actually told us why you don't like that.

> The problem is that nobody has ever done so, so its not like we are
> discussing the difference between bad solution X and good solution Y,
> we are comparing reasonable solution X with non-existent solution Y.

Nobody knows what your proposal would be like when time came to remove
the restrictions that you suggest could be removed later. You're the
one with the information here. We need to know what those semantics
would be up-front, since you're effectively committing us down that
path.

You keep making vague appeals to pragmatism, but, in all sincerity, I
don't understand where you're coming from at all. I strongly believe
that generalizing from ON CONFLICT doesn't even make the
implementation easier in the short term. ISTM that you're making this
difficult for yourself for reasons that are known only to you.

>> And, indeed, it seems that you're proposing an implementation that
>> adds no new functionality, just syntax compatibility.  Do we really
>> want or need two syntaxes  for the same thing in core?  I kinda think
>> Peter might have the right idea here.  Under his proposal, we'd be
>> getting something that is, in a way, new.
>
> Partitioning looked like "just new syntax", but it has been a useful
> new feature.

False equivalency. Nobody, including you, ever argued that that work
risked painting us into a corner. (IIRC you said something like the
progress was too small to justify putting into a single release.)

> MERGE provides new capabilities that we do not have and is much more
> powerful than INSERT/UPDATE, in a syntax that follow what other
> databases use today. Just like partitioning.

But you haven't told us *how* it is more powerful. Again, the
semantics of a MERGE that is a generalization of ON CONFLICT are not
at all obvious, and seem like they might be very surprising and risky.
There is no question that it's your job to (at a minimum) define those
semantics ahead of time, since you're going to commit us to them in
the long term if you continue down this path. It is most emphatically
*not* just a "small matter of programming".

-- 
Peter Geoghegan




Re: [HACKERS] MERGE SQL Statement for PG11

2017-10-29 Thread Peter Geoghegan
On Sun, Oct 29, 2017 at 4:48 AM, Simon Riggs  wrote:
> I have no objection to you writing a better version than me and if my
> work inspires you to complete that in a reasonable timescale then we
> all win.

My whole point is that the way that you seem determined to go on this
is a dead end. I don't think that *anyone* can go improve on what you
come up with if that's based heavily on ON CONFLICT, for the simple
reason that the final user visible design is totally unclear. There is
an easy way to make me shut up - come up with a design for MERGE that
more or less builds on how UPDATE FROM works, rather than building
MERGE on ON CONFLICT. (You might base things like RLS handling on ON
CONFLICT, but in the main MERGE should be like an UPDATE FROM with an
outer join, that can do INSERTs and DELETEs, too.)

The original effort to add MERGE didn't do anything upsert-like, which
Heikki (the GSOC mentor of the project) was perfectly comfortable
with. I'm too lazy to go search the archives right now, but it's
there. Heikki cites the SQL standard.

This is what MERGE *actually is*, which you can clearly see from the
Oracle/SQL Server/DB2 docs. It says this in the first paragraph of
their MERGE documentation. It's crystal clear from their docs -- "This
statement is a convenient way to combine multiple operations. It lets
you avoid multiple INSERT, UPDATE, and DELETE DML statements."

> I'm also very happy to work together on writing the version
> described - you have already done much work in this area.

You seem to want to preserve the ON CONFLICT guarantees at great cost.
But you haven't even defended that based on a high level goal, or a
use case, or something that makes sense to users (describing how it is
possible is another matter). You haven't even tried to convince me.

> I don't see any major problems in points you've raised so far, though
> obviously there is much detail.

Did you even read them? They are not mere details. They're fundamental
to the semantics of the feature (if you base it on ON CONFLICT). It's
not actually important that you understand them all; the important
message is that generalizing ON CONFLICT has all kinds of terrible
problems.

> All of this needs to be written and then committed, so I'll get on and
> write my proposal. We can then see whether that is an 80% solution or
> something less. There are more obvious barriers to completion at this
> point, like time and getting on with it.

Getting on with *what*, exactly?

In general, I have nothing against an 80% solution, or even a 50%
solution, provided there is a plausible path to a 100% solution. I
don't think that you have such a path, but only because you're tacitly
inserting requirements that no other MERGE implementation has to live
with, that I doubt any implementation *could* live with. Again, I'm
not the one making this complicated, or adding requirements that will
be difficult for you to get in to your v1 -- you're the one doing
that.

The semantics that I suggest (the SQL standard's semantics) will
require less code, and will be far simpler. Right now, I simply don't
understand why you're insisting on using ON CONFLICT without even
saying why. I can only surmise that you think that doing so will
simplify the implementation, but I can guarantee you that it won't.

-- 
Peter Geoghegan




Re: [HACKERS] MERGE SQL Statement for PG11

2017-10-28 Thread Peter Geoghegan
On Sat, Oct 28, 2017 at 12:49 PM, Simon Riggs  wrote:
> Nothing I am proposing blocks later work.

Actually, many things will block future work if you go down that road.
You didn't respond to the specific points I raised, but that doesn't
mean that they're not real.

> Everything you say makes it clear that a fully generalized solution is
> going to be many years in the making, assuming we agree.

I think that it's formally impossible as long as you preserve the ON
CONFLICT guarantees, unless you somehow define the problems out of
existence. Those are guarantees which no other MERGE implementation
has ever made, and which the SQL standard says nothing about. And for
good reasons.

> "The extent to which an SQL-implementation may disallow independent
> changes that are not significant is implementation-defined”.
>
> So we get to choose. I recommend that we choose something practical.
> We're approaching the 10 year anniversary of my first serious attempt
> to do MERGE. I say that its time to move forwards with useful
> solutions, rather than wait another 10 years for the perfect one, even
> assuming it exists.

As far as I'm concerned, you're the one arguing for an unobtainable
solution over a good one, not me. I *don't* think you should solve the
problems that I raise -- you should instead implement MERGE without
any of the ON CONFLICT guarantees, just like everyone else has.
Building MERGE on top of the ON CONFLICT guarantees, and ultimately
arriving at something that is comparable to other implementations over
many releases might be okay if anyone had the slightest idea of what
that would look like. You haven't even _described the semantics_,
which you could do by addressing the specific points that I raised.

-- 
Peter Geoghegan




Re: [HACKERS] MERGE SQL Statement for PG11

2017-10-28 Thread Peter Geoghegan
On Fri, Oct 27, 2017 at 3:00 PM, Serge Rielau  wrote:
>> What other systems *do* have this restriction? I've never seen one that did.
>
> Not clear what you are leading up to here.
> When I did MERGE in DB2 there was also no limitation:
> "Each row in the target can only be operated on once. A row in the target can 
> only be identified as MATCHED with one row in the result table of the 
> table-reference”
> What there was however was a significant amount of code I had to write and 
> test to enforce the above second sentence.

Then it seems that we were talking about two different things all along.

> Maybe in PG there is a trivial way to detect an expanding join and block it 
> at runtime.

There is for ON CONFLICT. See the cardinality violation logic within
ExecOnConflictUpdate(). (There are esoteric cases where this error can
be raised due to a wCTE that does an insert "from afar", which is
theoretically undesirable but not actually a problem.)

The MERGE implementation that I have in mind would probably do almost
the same thing, and make the "HeapTupleSelfUpdated" case within
ExecUpdate() raise an error when the caller happened to be a MERGE,
rather than following the historic UPDATE behavior. (The behavior is
to silently suppress a second or subsequent UPDATE attempt from the
same command, a behavior that Simon's mock MERGE documentation
references.)
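
(A sketch of what that might look like, with a hypothetical flag that
tells ExecUpdate() it was reached from MERGE -- not meant as a concrete
proposal:)

case HeapTupleSelfUpdated:
    /*
     * The row was already updated by the current command.  Historic
     * UPDATE behavior is to silently ignore it; a MERGE caller would
     * raise a cardinality violation here instead.
     */
    if (reached_from_merge)     /* hypothetical flag */
        ereport(ERROR,
                (errcode(ERRCODE_CARDINALITY_VIOLATION),
                 errmsg("MERGE must not update a row more than once")));
    break;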

> So the whole point I’m trying to make is that I haven’t seen the need for the 
> extra work I had to do once the feature appeared in the wild.

That seems pretty reasonable to me.

My whole point is that I think it's a mistake to do things like lock
rows ahead of evaluating any UPDATE predicate, in the style of ON
CONFLICT, in order to replicate the ON CONFLICT guarantees [1].

I'm arguing for implementation simplicity, too. Trying to implement
MERGE in a way that extends ON CONFLICT seems like a big mistake to
me, because ON CONFLICT updates rows on the basis of a would-be
duplicate violation, along with all the baggage that that carries.
This is actually enormously different to an equi-join that is fed by a
scan using an MVCC snapshot. The main difference is that there
actually is no MVCC snapshot in play in most cases [2]. If *no* row
with the PK value of 5 is visible to our MVCC snapshot, but an xact
committed having inserted such a row, that still counts as a CONFLICT
with READ COMMITTED.
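
(You can see this for yourself with a made-up table t(pk int PRIMARY
KEY, val text) and two ordinary READ COMMITTED sessions:)

-- Session 2:
BEGIN;
INSERT INTO t VALUES (5, 'from session 2');   -- not yet committed

-- Session 1:
INSERT INTO t VALUES (5, 'upsert')
ON CONFLICT (pk) DO UPDATE SET val = EXCLUDED.val;
-- ...blocks, waiting for session 2's transaction...

-- Session 2:
COMMIT;

-- Session 1 now takes the DO UPDATE path, updating a row that was never
-- visible to any MVCC snapshot of its own.  A plain equi-join would have
-- found no row with pk = 5 to update.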

[1] https://wiki.postgresql.org/wiki/UPSERT#Goals_for_implementation
[2] 
https://www.postgresql.org/docs/devel/static/transaction-iso.html#xact-read-committed
-- 
Peter Geoghegan




Re: [HACKERS] MERGE SQL Statement for PG11

2017-10-28 Thread Peter Geoghegan
On Sat, Oct 28, 2017 at 3:10 AM, Simon Riggs  wrote:
> SQL:2011 specifically states "The extent to which an
> SQL-implementation may disallow independent changes that are not
> significant is implementation-defined”, so in my reading the above
> behaviour would make us fully spec compliant. Thank you to Peter for
> providing the infrastructure on which this is now possible for PG11.
>
> Serge puts this very nicely by identifying two different use cases for MERGE.

MERGE benefits from having a join that is more or less implemented in
the same way as any other join. It can be a merge join, hash join, or
nestloop join. ON CONFLICT doesn't work using a join.

Should I to take it that you won't be supporting any of these
alternative join algorithms? If not, then you'll have something that
really isn't comparable to MERGE as implemented in Oracle, SQL Server,
or DB2. They *all* do this.

Would the user be able to omit WHEN NOT MATCHED/INSERT, as is the case
with every existing MERGE implementation? If so, what actually happens
under the hood when WHEN NOT MATCHED is omitted? For example, would
you actually use a regular "UPDATE FROM" style join, as opposed to the
ON CONFLICT infrastructure? And, if that is so, can you justify the
semantic difference for rows that are updated in each scenario
(omitted vs. not omitted) in READ COMMITTED mode? Note that this could
be the difference between updating a row when *no* version is visible
to our MVCC snapshot, as opposed to doing the EPQ stuff and updating
the latest row version if possible. That's a huge, surprising
difference. On top of all this, you risk live-lock if INSERT isn't a
possible outcome (this is also why ON CONFLICT can never accept a
predicate on its INSERT portion -- again, quite unlike MERGE).

Why not just follow what other systems do? It's actually easier to go
that way, and you get a better outcome. ON CONFLICT involves what you
could call a sleight of hand, and I fear that you don't appreciate
just how specialized the internal infrastructure is.

> Now, I accept that you might also want a MERGE statement that
> continues to work even if there is no unique constraint, but it would
> need to have different properties to the above. I do not in any way
> argue against adding that.

Maybe you *should* be arguing against it, though, and arguing against
ever supporting anything but equijoins, because these things will
*become* impossible if you go down that road. By starting with the ON
CONFLICT infrastructure, while framing no-unique-index-support as work
for some unspecified future release, you're leaving it up to someone
else to resolve the problems. Someone else must square the circle of
mixing ON CONFLICT semantics with fully generalized MERGE semantics.
But who?

-- 
Peter Geoghegan




Re: [HACKERS] MERGE SQL Statement for PG11

2017-10-27 Thread Peter Geoghegan
On Fri, Oct 27, 2017 at 2:13 PM, srielau  wrote:
> While the standard may not require a unique index for the ON clause I have
> never seen a MERGE statement that did not have this property.  So IMHO this
> is a reasonable restriction.

The Oracle docs on MERGE say nothing about unique indexes or
constraints. They don't even mention them in passing. They do say
"This statement is a convenient way to combine multiple operations. It
lets you avoid multiple INSERT, UPDATE, and DELETE DML statements."

SQL Server's MERGE docs do mention unique indexes, but only in
passing, saying something about unique violations, and that unique
violations *cannot* be suppressed in MERGE, even though that's
possible with other DML statements (with something called
IGNORE_DUP_KEY).

What other systems *do* have this restriction? I've never seen one that did.

-- 
Peter Geoghegan




Re: [HACKERS] MERGE SQL Statement for PG11

2017-10-27 Thread Peter Geoghegan
On Fri, Oct 27, 2017 at 6:24 AM, Robert Haas  wrote:
> I think one of the reasons why Peter Geoghegan decided to pursue
> INSERT .. ON CONFLICT UPDATE was that, because it is non-standard SQL
> syntax, he felt free to mandate a non-standard SQL requirement, namely
> the presence of a unique index on the arbiter columns.

That's true, but what I was really insistent on, more than anything
else, was that the user would get a practical guarantee about an
insert-or-update outcome under concurrency. There could be no
"unprincipled deadlocks", nor could there be spurious unique
violations.

This is the kind of thing that the SQL standard doesn't really concern
itself with, and yet it's of significant practical importance to
users. Both Oracle and SQL Server allow these things that I
specifically set out to avoid. I think that that's mostly a good
thing, though; they do a really bad job of explaining what's what, and
don't provide for a very real need ("upsert") in some other way, but
their MERGE semantics do make sense to me.

-- 
Peter Geoghegan




Re: [HACKERS] MERGE SQL Statement for PG11

2017-10-27 Thread Peter Geoghegan
On Fri, Oct 27, 2017 at 7:41 AM, Simon Riggs  wrote:
> Good points.
>
> I didn't say it but my intention was to just throw an ERROR if no
> single unique index can be identified.

You'd also throw an error when there were no "upsert compatible" join
quals, I take it?

I don't see the point in that. That's just mapping one syntax on to another.

> It could be possible to still run MERGE in that situaton but we would
> need to take a full table lock at ShareRowExclusive. It's quite likely
> that such statements would throw duplicate update errors, so I
> wouldn't be aiming to do anything with that for PG11.

I would avoid mixing up ON CONFLICT DO UPDATE and MERGE. The
"anomalies" you describe in MERGE are not really anomalies IMV.
They're simply how the feature is supposed to operate, and how it's
possible to make MERGE use alternative join algorithms based only on
the underlying cost. You might use a merge join for a bulk load
use-case, for example.

I think an SQL MERGE feature would be compelling, but I don't think
that it should take much from ON CONFLICT. As I've said many times
[1], they're really two different features (Teradata had features
similar to each, for example). I suggest that you come up with
something that has the semantics that the standard requires, and
therefore makes none of the ON CONFLICT guarantees about an outcome
under concurrency (INSERT or UPDATE). Those guarantees are basically
incompatible with how MERGE needs to work.

In case it matters, I think that the idea of varying relation
heavyweight lock strength based on subtle semantics within a DML
statement is a bad one. Frankly, I think that that's going to be a
nonstarter.

[1] 
https://www.postgresql.org/message-id/CAM3SWZRP0c3g6+aJ=yydgyactzg0xa8-1_fcvo5xm7hrel3...@mail.gmail.com
-- 
Peter Geoghegan




Re: [HACKERS] unique index violation after pg_upgrade to PG10

2017-10-24 Thread Peter Geoghegan
On Tue, Oct 24, 2017 at 10:20 PM, Justin Pryzby  wrote:
> I think you must have compared these:

Yes, I did. My mistake.

> On Tue, Oct 24, 2017 at 03:11:44PM -0500, Justin Pryzby wrote:
>> ts=# SELECT * FROM bt_page_items(get_raw_page('sites_idx', 1));
>>
>> itemoffset | 48
>> ctid   | (1,37)
>> itemlen| 32
>> nulls  | f
>> vars   | t
>> data   | 1b 43 52 43 4c 4d 54 2d 43 45 4d 53 30 0b 31 31 31 31 00 00 00 
>> 00 00 00
> ...
>> itemoffset | 37
>> ctid   | (0,97)
>> itemlen| 24
>> nulls  | f
>> vars   | t
>> data   | 1b 43 52 43 4c 4d 54 2d 43 45 4d 53 30 03 00 00
>
> ..but note those are both items in sites_idx (48 and 37, which I seem to have
> pasted out of order)..  I included both because I'm not confident I know what
> the "index tid=(1,37)" referred to, but I gather it means item at offset=37
> (and not item with ctid=(1,37).)
>
> | [pryzbyj@database amcheck]$ psql --port 5678 ts -c "SELECT 
> bt_index_check('sites_idx'::regclass::oid, heapallindexed=>True)"
> | ERROR:  high key invariant violated for index "sites_idx"
> | DETAIL:  Index tid=(1,37) points to heap tid=(0,97) page lsn=0/0.

This means that the item at (1,37) in the index is not <= the high
key, which is located at (1,1). (The high key comes first, despite
being an upper bound rather than a lower bound, per pageinspect docs.)
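
(If you want to look at the high key directly, it's just the first item
on the page:)

SELECT *
FROM bt_page_items(get_raw_page('sites_idx', 1))
WHERE itemoffset = 1;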

I find it suspicious that the page lsn is 0 here, btw.

> ts=# SELECT * FROM page_header(get_raw_page('sites_idx', 1));
>  lsn       | 0/0
>  checksum  | 0
>  flags     | 0
>  lower     | 872
>  upper     | 1696
>  special   | 8176
>  pagesize  | 8192
>  version   | 4
>  prune_xid | 0
>
> Here is its heap page:
>
> ts=# SELECT * FROM heap_page_items(get_raw_page('sites', 0)) WHERE lp=97;
>  lp          | 97
>  lp_off      | 968
>  lp_flags    | 1
>  lp_len      | 52
>  t_xmin      | 21269
>  t_xmax      | 33567444
>  t_field3    | 0
>  t_ctid      | (3,27)
>  t_infomask2 | 8204
>  t_infomask  | 2307
>  t_hoff      | 32
>  t_bits      | 11101001
>  t_oid       |
>  t_data      | \x70001b4352434c4d542d43454d5330030303
>
> Which I see ends with 0303 vs ..

Looks like I was accidentally right, then -- the heap and index do differ.

You might try this tool I published recently, to get a better sense of
details like this:

https://github.com/petergeoghegan/pg_hexedit

(Don't do so with a data directory that you cannot afford to corrupt
again, though!)

> Maybe this is relevant ?
> ts=# SELECT * FROM heap_page_items(get_raw_page('sites', 3)) WHERE lp=27;
>  lp          | 27
>  lp_off      | 0
>  lp_flags    | 0
>  lp_len      | 0
>  t_xmin      |
>  t_xmax      |
>  t_field3    |
>  t_ctid      |
>  t_infomask2 |
>  t_infomask  |
>  t_hoff      |
>  t_bits      |
>  t_oid       |
>  t_data      |

This looks like an LP_DEAD item.

-- 
Peter Geoghegan




Re: [HACKERS] unique index violation after pg_upgrade to PG10

2017-10-24 Thread Peter Geoghegan
On Tue, Oct 24, 2017 at 1:11 PM, Justin Pryzby  wrote:
> ..which I gather just verifies that the index is corrupt, not sure if there's
> anything else to do with it?  Note, we've already removed the duplicate rows.

Yes, the index itself is definitely corrupt -- this failed before the
new "heapallindexed" check even started. Though it looks like that
would have failed too, if you got that far, since the index points to
a row that does not contain the same data. (I only know this because
you dumped the heap tuple and the index tuple.)

Maybe you could try verifying a different index on the same table with
"heapallindexed", too. Perhaps that would fail in a more interesting
way.
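
(For example, picking any other index on the table -- the index name
here is made up:)

SELECT bt_index_check('sites_some_other_idx'::regclass, heapallindexed => true);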

I don't know how pg_repack works in any detail, but I have a hard time
imagining it causing corruption like this, where a single B-Tree page
is corrupt (high key invariant fails), possibly because of a torn page
(possibly due to recovery not replaying all the WAL needed, for
whatever reason).

You're using LVM snapshots -- I hope that you're aware that they're
not guaranteed to be consistent across logical volumes. There are a
few different ways that they could cause corruption like this if you
weren't careful. (In general, I wouldn't recommend using LVM snapshots
as any kind of backup solution.)

-- 
Peter Geoghegan




Re: [HACKERS] unique index violation after pg_upgrade to PG10

2017-10-24 Thread Peter Geoghegan
On Tue, Oct 24, 2017 at 11:48 AM, Kenneth Marshall  wrote:
>> Really ?  pg_repack "found" and was victim to the duplicate keys, and rolled
>> back its work.  The CSV logs clearly show that our application INSERTed rows
>> which are duplicates.
>>
>> [pryzbyj@database ~]$ rpm -qa pg_repack10
>> pg_repack10-1.4.2-1.rhel6.x86_64
>>
>> Justin
>
> Hi Justin,
>
> I just dealt with a similar problem with pg_repack and a PostgreSQL 9.5 DB,
> the exact same error. It seemed to caused by a tuple visibility issue that
> allowed the "working" unique index to be built, even though a duplicate row
> existed. Then the next pg_repack would fail with the error you got. In our
> case I needed the locality of reference to keep the DB performance acceptable
> and it was not a critical issue if there was a duplicate. We would remove the
> duplicates if we had a failure. We never had a problem with the NO pg_repack
> scenarios.

A new, enhanced version of the corruption detection tool amcheck is
now available, with both apt and yum packages:

https://github.com/petergeoghegan/amcheck

Unlike the version in Postgres 10, this enhanced version (V1.2) has
"heapallindexed" verification, which is really what you want here. If
you install it, and run it against the unique index in question (with
"heapallindexed" verification), that could help. It might provide a
more useful diagnostic message.
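
(Something along these lines, once the package is installed --
substitute the name of your unique index:)

CREATE EXTENSION amcheck_next;
SELECT bt_index_check('your_unique_index'::regclass, heapallindexed => true);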

This is very new, so do let me know how you get on if you try it out.

-- 
Peter Geoghegan




Re: [HACKERS] How to determine that a TransactionId is really aborted?

2017-10-22 Thread Peter Geoghegan
On Sun, Oct 22, 2017 at 2:19 PM, Eric Ridge  wrote:
> I'm looking for the status as any concurrent open transaction might see it.  
> For example, if any concurrent transaction might see it as "in progress", 
> that's what I'd want returned.  Does that make sense?

Maybe, but note that that's fundamentally something that can become
stale immediately. And, just because an MVCC snapshot cannot see a row
does not mean that it cannot affect its transaction/statement in some
other way (e.g. unique index enforcement).

Again, you'll probably need to put this low level requirement into
context if you want sound advice from this list.

-- 
Peter Geoghegan




Re: [HACKERS] How to determine that a TransactionId is really aborted?

2017-10-22 Thread Peter Geoghegan
On Sun, Oct 22, 2017 at 12:23 PM, Eric Ridge  wrote:
> Can anyone confirm or deny that this is correct?  I feel like it is correct, 
> but I'm no expert.

What are you going to use the code for? I think that that context is
likely to matter here.

-- 
Peter Geoghegan




Re: [HACKERS] A design for amcheck heapam verification

2017-10-20 Thread Peter Geoghegan
On Thu, Oct 5, 2017 at 7:00 PM, Peter Geoghegan  wrote:
> v3 of the patch series, attached, does it that way -- it adds a
> bloom_create(). The new bloom_create() function still allocates its
> own memory, but does so while using a FLEXIBLE_ARRAY_MEMBER. A
> separate bloom_init() function (that works with dynamic shared memory)
> could easily be added later, for the benefit of parallel hash join.
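
(For reference, the filter that bloom_create() builds is roughly this
shape -- field names quoted from memory, so don't treat this as
authoritative:)

typedef struct bloom_filter
{
    /* K hash functions are used, seeded by caller's seed */
    int         k_hash_funcs;
    uint64      seed;
    /* Bitset is sized directly in bits, and allocated in line */
    uint64      bitset_bits;
    unsigned char bitset[FLEXIBLE_ARRAY_MEMBER];
} bloom_filter;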

Since Peter E's work on making the documentation sgml files more
XML-like has broken the v3 patch doc build, attached is v4, which
fixes this bit rot. It also has a few small tweaks here and there to
the docs. Nothing worth noting specifically, really -- I just don't
like to leave my patches with bit rot for long. (Hat-tip to Thomas
Munro for making this easy to detect with his new CF continuous
integration tooling.)

I should point out that I shipped virtually the same code yesterday,
as v1.1 of the Github version of amcheck (also known as amcheck_next).
Early adopters will be able to use this new "heapallindexed"
functionality in the next few days, once packages become available for
the apt and yum community repos. Just as before, the Github version
will work on versions of Postgres >= 9.4.

This seems like good timing on my part, because we know that this new
"heapallindexed" verification will detect the "freeze the dead" bugs
that the next point release is set to have fixes for -- that is
actually kind of how one of the bugs was found [1]. We may even want
to advertise the availability of this check within amcheck_next, in the
release notes for the next Postgres point release.

[1] 
https://www.postgresql.org/message-id/cah2-wznm4rcrhfaiwkpwtpew2bxdtgrozk7jwwgucxeh3d1...@mail.gmail.com
-- 
Peter Geoghegan
From 7906c7391a9f52d334c2cbc7d3e245ff014629f2 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan 
Date: Tue, 2 May 2017 00:19:24 -0700
Subject: [PATCH 2/2] Add amcheck verification of indexes against heap.

Add a new, optional capability to bt_index_check() and
bt_index_parent_check():  callers can check that each heap tuple that
ought to have an index entry does in fact have one.  This happens at the
end of the existing verification checks.

This is implemented by using a Bloom filter data structure.  The
implementation performs set membership tests within a callback (the same
type of callback that each index AM registers for CREATE INDEX).  The
Bloom filter is populated during the initial index verification scan.
---
 contrib/amcheck/Makefile |   2 +-
 contrib/amcheck/amcheck--1.0--1.1.sql|  28 +++
 contrib/amcheck/amcheck.control  |   2 +-
 contrib/amcheck/expected/check_btree.out |  14 +-
 contrib/amcheck/sql/check_btree.sql  |   9 +-
 contrib/amcheck/verify_nbtree.c  | 298 ---
 doc/src/sgml/amcheck.sgml| 173 ++
 src/include/utils/snapmgr.h  |   2 +-
 8 files changed, 454 insertions(+), 74 deletions(-)
 create mode 100644 contrib/amcheck/amcheck--1.0--1.1.sql

diff --git a/contrib/amcheck/Makefile b/contrib/amcheck/Makefile
index 43bed91..c5764b5 100644
--- a/contrib/amcheck/Makefile
+++ b/contrib/amcheck/Makefile
@@ -4,7 +4,7 @@ MODULE_big	= amcheck
 OBJS		= verify_nbtree.o $(WIN32RES)
 
 EXTENSION = amcheck
-DATA = amcheck--1.0.sql
+DATA = amcheck--1.0--1.1.sql amcheck--1.0.sql
 PGFILEDESC = "amcheck - function for verifying relation integrity"
 
 REGRESS = check check_btree
diff --git a/contrib/amcheck/amcheck--1.0--1.1.sql b/contrib/amcheck/amcheck--1.0--1.1.sql
new file mode 100644
index 000..e6cca0a
--- /dev/null
+++ b/contrib/amcheck/amcheck--1.0--1.1.sql
@@ -0,0 +1,28 @@
+/* contrib/amcheck/amcheck--1.0--1.1.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "ALTER EXTENSION amcheck UPDATE TO '1.1'" to load this file. \quit
+
+--
+-- bt_index_check()
+--
+DROP FUNCTION bt_index_check(regclass);
+CREATE FUNCTION bt_index_check(index regclass,
+heapallindexed boolean DEFAULT false)
+RETURNS VOID
+AS 'MODULE_PATHNAME', 'bt_index_check'
+LANGUAGE C STRICT PARALLEL RESTRICTED;
+
+--
+-- bt_index_parent_check()
+--
+DROP FUNCTION bt_index_parent_check(regclass);
+CREATE FUNCTION bt_index_parent_check(index regclass,
+heapallindexed boolean DEFAULT false)
+RETURNS VOID
+AS 'MODULE_PATHNAME', 'bt_index_parent_check'
+LANGUAGE C STRICT PARALLEL RESTRICTED;
+
+-- Don't want these to be available to public
+REVOKE ALL ON FUNCTION bt_index_check(regclass, boolean) FROM PUBLIC;
+REVOKE ALL ON FUNCTION bt_index_parent_check(regclass, boolean) FROM PUBLIC;
diff --git a/contrib/amcheck/amcheck.control b/contrib/amcheck/amcheck.control
index 05e2861..4690484 100644
--- a/contrib/amcheck/amcheck.control
+++ b/contrib/amcheck/amcheck.control
@@ -1,5 +1,5 @@
 # amcheck extension
 comment = 'functions for verifying r

Re: [HACKERS] [COMMITTERS] pgsql: Fix traversal of half-frozen update chains

2017-10-19 Thread Peter Geoghegan
On Thu, Oct 19, 2017 at 9:03 AM, Peter Geoghegan  wrote:
>> /me studies the problem for a while.
>>
>> What's bothering me about this is: how is cutoff_xid managing to be a
>> new enough transaction ID for this to happen in the first place?  The
>> cutoff XID should certainly be older than anything any current
>> snapshot can see, because it's computed using GetOldestXmin() -- which
>> means that XIDs older than the cutoff can't still be running (except
>> maybe if old_snapshot_threshold is set to a non-default value).

> Now that the problem is fixed by a5736bf7, this test case will prune
> and get an LP_REDIRECT ItemId (not an LP_DEAD one), as we're no longer
> confused about the continuity of HOT chains within heap_prune_chain().

In case I was unclear: the key point here is that pruning the tuples
wasn't wrong on the grounds that they might still be visible to
somebody's snapshot. It was wrong to do so in a specific way that could still
happen prior to a5736bf7, leaving only a stub LP_DEAD line pointer,
because index scans needed to do HOT chain traversal in order to get
to the visible tuple that wasn't pruned (the one that was still
frozen). You might say that the HOT chain "appeared to split in two"
from heap_prune_chain()'s perspective, a problem that has nothing to
do with an XID cutoff.

That's why after 20b65522 there was no problem with sequential scans.
There were remaining problems with index scans, though, purely because
of this HOT pruning business -- that was fixed in a5736bf7. (And, as
I've said, problems with "traversal of half-frozen update chains" are
presumed to have existed for routines like EvalPlanQual() -- even more
subtle problems.)

One thing I'm not clear on is if we could or should have limited this
new HeapTupleUpdateXmaxMatchesXmin() stuff to HOT chains, and so not
altered behavior for cross-page cases (for routines like
EvalPlanQualFetch() -- not routines like heap_prune_chain(), that only
ever deal with a single page at a time).

-- 
Peter Geoghegan




Re: [HACKERS] [COMMITTERS] pgsql: Fix traversal of half-frozen update chains

2017-10-19 Thread Peter Geoghegan
On Thu, Oct 19, 2017 at 7:21 AM, Robert Haas  wrote:
> The commit message for a5736bf7 doesn't say anything about a race; it
> just claims that it is fixing traversal of half-frozen update chains,
> without making any reference to how such a thing as a half-frozen
> update chain came to exist in the first place.  I think the commit
> that talks about that race condition is 20b65522.

Well, the race was that the wrong tuple got pruned, because of naivety
in the heap_prune_chain() update chain handling, which is what started
work that became commit a5736bf7. That's by far the worst consequence
of the traversal of half-frozen update chain business that a5736bf7's
commit message describes, since we know it leads to wrong answers to
index scans (and REINDEX fails with "failed to find parent tuple").
It's the only consequence that's been directly observed, but probably
not the only one that's possible. Most other places that suffer the
problem are in obscure-to-users paths like EvalPlanQual().

> /me studies the problem for a while.
>
> What's bothering me about this is: how is cutoff_xid managing to be a
> new enough transaction ID for this to happen in the first place?  The
> cutoff XID should certainly be older than anything any current
> snapshot can see, because it's computed using GetOldestXmin() -- which
> means that XIDs older than the cutoff can't still be running (except
> maybe if old_snapshot_threshold is set to a non-default value).

The problem that we directly observed during pruning, which caused
"failed to find parent tuple" even after the first freeze-the-dead
commit (commit 20b65522), was that the mixture of frozen and unfrozen
tuples in a hot chain caused heap_prune_chain() to make an incorrect
conclusion about the continuity of a HOT chain. We'd prune the wrong
tuples -- it failed to find the visible tuple at the end of the HOT
chain, and so failed to recognize that it could not prune down to a stub
ItemId (blowing away what it *incorrectly* thought was an entire HOT
chain).

Now that the problem is fixed by a5736bf7, this test case will prune
and get an LP_REDIRECT ItemId (not an LP_DEAD one), as we're no longer
confused about the continuity of HOT chains within heap_prune_chain().

-- 
Peter Geoghegan




Re: [HACKERS] [COMMITTERS] pgsql: Fix traversal of half-frozen update chains

2017-10-18 Thread Peter Geoghegan
On Wed, Oct 18, 2017 at 1:54 PM, Robert Haas  wrote:
> Well, I guess what I don't understand is, suppose we have a HOT chain
> that looks like this, where [x,y] denotes a tuple with an xmin of x
> and an xmax of y.
>
> [a,b]->[b,c]->[c,d]
>
> If b is eligible to be frozen, then b is committed and all-visible,
> which means that the [a,b] tuple should be pruned altogether.  IOW, I
> don't think that a non-root tuple should ever have a frozen xmin; if
> that happens, I think we've already screwed up.  So I don't quite
> understand how this problem arises in the first place.

Technically, we don't freeze the heap-only tuples here, because we
cannot: freezing their xmin would render them visible (so we don't set
XMIN_FROZEN). Instead, we set hint bits with WAL-logging, on the
assumption that nobody will ever *need* to interrogate the clog -- see
commit 20b65522 from several weeks back. To be clear, this *does* mean
that hint bits stop being mere hints, which I find troubling.

I described these heap-only tuples as "logically frozen" over on the
"heap/SLRU verification, relfrozenxid cut-off, and freeze-the-dead
bug" thread, which is where I brought up my general concern.

> I think that the way we are supposed to be guaranteeing this is to
> first prune the page and then freeze it.

There is a race where we cannot prune the page, though. That's why we
had to revisit what I suppose was a tacit assumption, and address its
problems in the commit that started this thread (commit a5736bf7).

> But that also seems like something
> that shouldn't happen - our notion of the freeze threshold should be
> frozen at the beginning of the scan and should not advance, and it
> should be prior to our freeze horizon, which could be allowed to
> advance but not retreat in the course of the scan.

It is not allowed to retreat -- kind of. (Alvaro mentioned something in
passing about an *alternative* fix where it really was allowed to
retreat in the simplest sense [1], but I don't think that that's going
to happen.)

> Apologies if this is stupid; please clue me on what I'm missing.

I don't think that your questions are stupid at all. It intuitively
seems bad that xmin freezing is only something we can do to HOT root
and non-HOT tuples, while xmax freezing needs to happen to heap-only
tuples, as well as HOT root tuples and non-HOT tuples. But, as I said,
I'm still playing catch-up on MultiXacts, and I too feel like I might
still be missing important details.

[1] https://postgr.es/m/20171017100200.ruszi2c6qqwetce5@alvherre.pgsql
-- 
Peter Geoghegan




[HACKERS] Re: heap/SLRU verification, relfrozenxid cut-off, and freeze-the-dead bug (Was: amcheck (B-Tree integrity checking tool))

2017-10-18 Thread Peter Geoghegan
On Mon, Oct 16, 2017 at 8:09 PM, Noah Misch  wrote:
> That presupposes construction of two pieces of software, the server and the
> checker, such that every disagreement is a bug in the server.  But checkers
> get bugs just like servers get bugs.

You make a good point, which is that *some* code must be wrong when an
error is raised and hardware is not to blame, but ISTM that the nuance
of that really matters. The checker seems much less likely to be where
bugs are, for three reasons:

* There is far less code for us to maintain as compared to the volume
of backend code that is effectively tested (again, not including the
hidden universe of complex, unauditable firmware code that could be
involved these days).

* Much of the actual checking (as much as possible) is outsourced to
core code that is already critically important. If that has bugs in
it, then it is unlikely to be defined as an amcheck bug.

* Knowing all this, we can go out of our way to do a good job of
getting the design right the first time. (A sound design is far more
important than actually having zero bugs.)

Obviously there could be unambiguous bugs; I'm not arguing otherwise.
I just hope that we can push this model as far as we need to, and
perhaps accommodate verifiability as a goal for *future* development
projects. We're almost doing that today; debuggability of on-disk
structures is something that the community already values. This is the
logical next step, IMV.

> Checkers do provide a sort of
> double-entry bookkeeping.  When a reproducible test case prompts a checker
> complaint, we'll know *some* code is wrong.

I really like your double entry bookkeeping analogy. A tiny
discrepancy will bubble up, even in a huge organization, and yet the
underlying principles are broad and not all that complicated.

> That's an admirable contribution.

Thank you. I just hope that it becomes something that other
contributors have some sense of ownership over.

> I'm essentially saying that the server is innocent until proven guilty.  It
> would be cool to have a self-contained specification of PostgreSQL data files,
> but where the server disagrees with the spec without causing problem
> behaviors, we'd ultimately update the spec to fit the server.

I might not have done a good job of explaining my position. I agree
with everything you say here. I would like to see amcheck become a
kind of vehicle for discussing things that we already discuss. You get
a nice tool at the end, that clarifies and increases confidence in the
original understanding over time (or acts as a canary-in-the-coalmine
forcing function when the original understanding turns out to be
faulty). The tool itself is ultimately just a bonus.

Bringing it back to the concrete freeze-the-dead issue, and the
question of an XID-cutoff for safely interrogating CLOG: I guess it
will be possible to assess a HOT chain as a whole. We can probably do
this safely while holding a super-exclusive lock on the buffer. I can
probably find a way to ensure this only needs to happen in a rare slow
path, when it looks like the invariant might be violated but we need
to make sure (I'm already following this pattern in a couple of
places). Realistically, there will be some amount of "try it and see"
here.

-- 
Peter Geoghegan




Re: [HACKERS] [COMMITTERS] pgsql: Fix traversal of half-frozen update chains

2017-10-17 Thread Peter Geoghegan
On Tue, Oct 17, 2017 at 3:02 AM, Alvaro Herrera  wrote:
> Yeah, me too.  If you see another way to fix the problem, let's discuss
> it.

I doubt that there is a better way.

> I think a possible way is to avoid considering that the relfrozenxid
> value computed by the caller is final.

While that alternative seems possible, it also seems riskier.

> One thing I didn't quite investigate is why this bug only shows up with
> multixacts so far.  Is it just because multixacts provide an easy way to
> reproduce it, and that there are others, more difficult ways to cause
> the same problem without involving multixacts?  If so, then the problem
> is likely present in 9.2 as well.

The obvious explanation (although not necessarily the correct one) is
that freezing didn't have a MultiXactIdGetUpdateXid() call in 9.2. The
way we pass down both cutoff_xid and cutoff_multi to
FreezeMultiXactId() seems like it might be involved in the data
corruption that we saw (the incorrect pruning/failed to find parent
tuple thing).

I might spend some time figuring this out later in the week. It's hard
to pin down, and I've only really started to learn about MultiXacts in
the past few months.

-- 
Peter Geoghegan




Re: [HACKERS] [PATCH] pageinspect function to decode infomasks

2017-10-16 Thread Peter Geoghegan
On Sat, Oct 14, 2017 at 2:47 PM, Peter Geoghegan  wrote:
> I am working on an experimental version of pg_filedump, customized to
> output XML that can be interpreted by an open source hex editor. The
> XML makes the hex editor produce color coded, commented
> tags/annotations for any given heap or B-Tree relation. This includes
> tooltips with literal values for all status bits (including
> t_infomask/t_infomask2 bits, IndexTuple bits, B-Tree meta page status
> bits, PD_* page-level bits, ItemId bits, and others).

This is now available from: https://github.com/petergeoghegan/pg_hexedit

-- 
Peter Geoghegan




[HACKERS] Re: heap/SLRU verification, relfrozenxid cut-off, and freeze-the-dead bug (Was: amcheck (B-Tree integrity checking tool))

2017-10-16 Thread Peter Geoghegan
On Fri, Oct 13, 2017 at 7:09 PM, Noah Misch  wrote:
> All good questions; I don't know offhand.  Discovering those answers is
> perhaps the chief labor required of such a project.

ISTM that by far the hardest part of the project is arriving at a
consensus around what a good set of invariants for CLOG and MultiXact
looks like.

I think that it's fair to say that this business with relfrozenxid now
appears to be more complicated than many of us would have thought. Or,
at least, more complicated than I thought when I first started
thinking about it. Once we're measuring this complexity (by having
checks), we should be in a better position to keep it under control,
and to avoid ambiguity.

> The checker should
> consider circumstances potentially carried from past versions via pg_upgrade.

Right. False positives are simply unacceptable.

> Fortunately, if you get some details wrong, it's cheap to recover from checker
> bugs.

Ideally, amcheck will become a formal statement of the contracts
provided by major subsystems, such as the heapam, the various SLRUs,
and so on. While I agree that having bugs there is much less severe
than having bugs in backend code, I would like the tool to reach a
point where it actually *defines* correctness (by community
consensus). If a bug in amcheck reflects a bug in our high level
thinking about correctness, then that actually is a serious problem.
Arguably, it's the most costly variety of bug that Postgres can have.

I may never be able to get general buy-in here; building broad
consensus like that is a lot harder than writing some code for a
contrib module. Making the checking code the *authoritative* record of
how invariants are *expected* to work is a major goal of the project,
though.

-- 
Peter Geoghegan




Re: [HACKERS] [PATCH] pageinspect function to decode infomasks

2017-10-14 Thread Peter Geoghegan
On Sat, Oct 14, 2017 at 10:58 AM, Robert Haas  wrote:
> I think it's perfectly sensible to view those 2 bits as making up a
> 2-bit field with 4 states rather than displaying each bit
> individually, but you obviously disagree.  Fair enough.

I guess it is that simple.

> I can think of two possible explanations for that.  Number one, the
> tool was written before HEAP_XMIN_FROZEN was invented and hasn't been
> updated for those changes.

Have we invented our last t_infomask/t_infomask2 (logical) status already?

> Number two, the author of the tool agrees
> with your position rather than mine.

I am working on an experimental version of pg_filedump, customized to
output XML that can be interpreted by an open source hex editor. The
XML makes the hex editor produce color coded, commented
tags/annotations for any given heap or B-Tree relation. This includes
tooltips with literal values for all status bits (including
t_infomask/t_infomask2 bits, IndexTuple bits, B-Tree meta page status
bits, PD_* page-level bits, ItemId bits, and others). I tweeted about
this several months ago, when it was just a tool I wrote for myself,
and received a surprisingly positive response. It seems like I'm on to
something, and should release the tool to the community.

I mention this project because it very much informs my perspective
here. Having spent quite a while deliberately corrupting test data in
novel ways, just to see what happens, the "work backwards from the
storage format" perspective feels very natural to me. I do think that
I understand where you're coming from too, though.

-- 
Peter Geoghegan




Re: [HACKERS] [PATCH] pageinspect function to decode infomasks

2017-10-13 Thread Peter Geoghegan
On Fri, Oct 13, 2017 at 1:02 PM, Robert Haas  wrote:
>> I don't think it's our place to "interpret" the bits. Are we *also*
>> going to show HEAP_XMIN_FROZEN when xmin is physically set to
>> FrozenTransactionId?
>
> No, of course not.  We're talking about how to display the 256 and 512
> bits of t_infomask.  Those have four states: nothing, committed,
> invalid, frozen.  You're arguing that frozen isn't a real state, that
> it's somehow just a combination of committed and invalid, but I think
> that's the wrong way of thinking about it.

No, I'm arguing that they're just bits. Show the bits, rather than
interpreting what is displayed. Document that there are other logical
states that are represented as composites of contradictory/mutually
exclusive states.

Anyone who hopes to interpret these values has to be an expert anyway,
or willing to become something of an expert. There is a good chance
that they've taken an interest because something is already wrong.

> Before HEAP_XMIN_FROZEN existed, it would have been right to display
> the state where both bits are set as committed|invalid, because that
> would clearly show you that two things had been set that should never
> both be set at the same time.  But now that's a valid state with a
> well-defined meaning and I think we should display the actual meaning
> of that state.
>
>> Where does it end?
>
> I guess it ends wherever we decide to stop.

You can take what you're saying much further. What about
HEAP_XMAX_SHR_LOCK, and HEAP_MOVED? Code like HEAP_LOCKED_UPGRADED()
pretty strongly undermines the idea that these composite values are
abstractions.

>> I think that we should prominently document that HEAP_XMIN_COMMITTED
>> |HEAP_XMIN_ABORTED == HEAP_XMIN_FROZEN, rather than trying to hide
>> complexity that we have no business hiding in a tool like pageinspect.
>
> I respect that opinion, but I don't think I'm trying to hide anything.
> I think I'm proposing that we display the information in what I
> believed to be the clearest and most accurate way.

pg_filedump doesn't display HEAP_XMIN_FROZEN, either. (Nor does it
ever display any of the other composite t_infomask/t_infomask2
values.)

-- 
Peter Geoghegan




Re: [HACKERS] Pluggable storage

2017-10-13 Thread Peter Geoghegan
On Thu, Oct 12, 2017 at 2:23 PM, Robert Haas  wrote:
> On Thu, Oct 12, 2017 at 4:38 PM, Alexander Korotkov
>> However I imply that alternative storage would share our "MVCC model".  So,
>> it
>> should share our transactional model including transactions,
>> subtransactions, snapshots etc.
>> Therefore, if alternative storage is transactional, then in particular it
>> should be able to fetch tuple with
>> given TID according to given snapshot.  However, how it's implemented
>> internally is
>> a black box for us.  Thus, we don't insist that tuple should have different
>> TID after update;
>> we don't insist there is any analogue of HOT; we don't insist alternative
>> storage needs vacuum
>> (or if even it needs vacuum, it might be performed in completely different
>> way) and so on.
>
> Fully agreed.

If we implement that interface, where does that leave EvalPlanQual()?
Do those semantics have to be preserved?

-- 
Peter Geoghegan




Re: [HACKERS] [COMMITTERS] pgsql: Fix traversal of half-frozen update chains

2017-10-13 Thread Peter Geoghegan
On Fri, Oct 13, 2017 at 4:54 AM, Robert Haas  wrote:
> I haven't really followed this thread in depth, but I'm very nervous
> about the idea that we should ever be examining the raw-xmin of a
> tuple that has been marked HEAP_XMIN_FROZEN for anything other than
> forensic purposes.

I'm nervous about it as well. These problems are very difficult to reason about.

> The design principle I followed in writing the
> original patch was that marking a tuple HEAP_XMIN_FROZEN was identical
> to setting the xmin to 2 except that the original xmin was still
> available for manual inspection.

I did point this out myself.

> If we're finding that we need the
> raw xmin after freezing, doesn't that mean that our freezing algorithm
> was flat busted prior to that commit?

I wouldn't put it that way at all. That commit provided us with the
opportunity to put in a better fix for a problem with update chain
traversal, a problem that was particularly critical when certain HOT
pruning + freezing races occur (the "failed to find parent tuple"
thing). It's clear that freezing was just as busted before and after
that commit, though.

> And maybe still busted when
> things wrap around multiple times and raw-xmins are reused?

I'm more concerned about the situation that will remain (or has now
been created) on 9.3, where we don't have raw xmin, and so must use
very forgiving FrozenTransactionId matching within
HeapTupleUpdateXmaxMatchesXmin().

-- 
Peter Geoghegan




Re: [HACKERS] [PATCH] pageinspect function to decode infomasks

2017-10-12 Thread Peter Geoghegan
On Tue, Aug 15, 2017 at 10:54 AM, Robert Haas  wrote:
>> Or at least make the filtering optional.
>
> I don't think "filtering" is the right way to think about it.  It's
> just labeling each combination of bits with the meaning appropriate to
> that combination of bits.

I do. -1 to not just showing what's on the page -- if the
HEAP_XMIN_COMMITTED and HEAP_XMIN_ABORTED bits are set, then I think
we should show them. Yeah, I accept that there is a real danger of
confusing people with that. Unfortunately, I think that displaying
HEAP_XMIN_FROZEN will cause even more confusion. I don't think that
HEAP_XMIN_FROZEN is an abstraction at all. It's a notational
convenience.

I don't think it's our place to "interpret" the bits. Are we *also*
going to show HEAP_XMIN_FROZEN when xmin is physically set to
FrozenTransactionId? Where does it end?

I think that we should prominently document that HEAP_XMIN_COMMITTED
|HEAP_XMIN_ABORTED == HEAP_XMIN_FROZEN, rather than trying to hide
complexity that we have no business hiding in a tool like pageinspect.

-- 
Peter Geoghegan




Re: [HACKERS] Pluggable storage

2017-10-11 Thread Peter Geoghegan
On Wed, Oct 11, 2017 at 1:08 PM, Robert Haas  wrote:
> On Mon, Oct 9, 2017 at 10:22 AM, Alexander Korotkov
>  wrote:
>> For me, it's crucial point that pluggable storages should be able to have
>> different MVCC implementation, and correspondingly have full control over
>> its interactions with indexes.
>> Thus, it would be good if we would get consensus on that point.  I'd like
>> other discussion participants to comment whether they agree/disagree and
>> why.
>> Any comments?
>
> I think it's good for new storage managers to have full control over
> interactions with indexes.  I'm not sure about the MVCC part.  I think
> it would be legitimate to want a storage manager to ignore MVCC
> altogether - e.g. to build a non-transactional table.

I agree with Alexander -- if you're going to have a new MVCC
implementation, you have to do significant work within index access
methods. Adding "retail index tuple deletion" is probably just the
beginning. ISTM that you need something like InnoDB's purge thread
when index values change, since two versions of the same index tuple
(each with distinct attribute values) have to physically co-exist for
a time.

> I don't know
> that it would be a very good idea to have two different full-fledged
> MVCC implementations, though.  Like Tom says, that would be
> replicating a lot of the awfulness of the MySQL model.

It's not just the MySQL model, FWIW. SQL-on-Hadoop systems like
Impala, certain NoSQL systems, and AFAIK any database system that
claims to have pluggable storage all do it this way. That is, core
transaction management functions (e.g. MVCC snapshot acquisition) is
outsourced to the storage engine. It *is* very cumbersome, but that's
what they do.

-- 
Peter Geoghegan




Re: [HACKERS] On markers of changed data

2017-10-09 Thread Peter Geoghegan
On Sat, Oct 7, 2017 at 6:34 AM, Stephen Frost  wrote:
>> > That’s actually what pg_rman is doing for what it calls incremental
>> > backups (perhaps that would be differential backup in PG
>> > terminology?), and the performance is bad as you can imagine. We could
>> > have a dedicated LSN map to do such things with 4 bytes per page. I am
>> > still not convinced that this much facility and the potential bug
>> > risks are worth it though, Postgres already knows about differential
>> > backups if you shape it as a delta of WAL segments. I think that, in
>> > order to find a LSN map more convincing, we should find first other
>> > use cases where it could become useful. Some use cases may pop up with
>> > VACUUM, but I have not studied the question hard enough...
>>
>> The case I've discussed with barman developers is a large database
>> (couple dozen of TBs should be enough) where a large fraction (say 95%)
>> is read-only but there are many changes to the active part of the data,
>> so that WAL is more massive than size of active data.
>
> Yes, we've seen environments like that also.

I'm pretty sure that those are cases where there are many more
FPIs than might be expected, due to a lack of locality. (UUID PKs can
make the size of WAL balloon, for example.)
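
The effect is easy to observe directly. Here is a rough sketch (table
names and row counts are made up, and it assumes pgcrypto for
gen_random_uuid(), with full_page_writes on): after a checkpoint, a
batch of random UUID insertions dirties a large number of distinct
index leaf pages, each of which costs a full-page image, while a
serial key keeps landing on the same rightmost leaf page.

CREATE EXTENSION IF NOT EXISTS pgcrypto;  -- for gen_random_uuid()
CREATE TABLE serial_pk (id bigserial PRIMARY KEY, filler text);
CREATE TABLE uuid_pk   (id uuid DEFAULT gen_random_uuid() PRIMARY KEY, filler text);
INSERT INTO serial_pk (filler) SELECT 'x' FROM generate_series(1, 1000000);
INSERT INTO uuid_pk   (filler) SELECT 'x' FROM generate_series(1, 1000000);

CHECKPOINT;   -- first touch of each page after this costs an FPI
SELECT pg_current_wal_lsn() AS start_lsn \gset
INSERT INTO serial_pk (filler) SELECT 'x' FROM generate_series(1, 10000);
SELECT pg_current_wal_lsn() - :'start_lsn' AS wal_bytes_serial_pk;

CHECKPOINT;
SELECT pg_current_wal_lsn() AS start_lsn \gset
INSERT INTO uuid_pk (filler) SELECT 'x' FROM generate_series(1, 10000);
SELECT pg_current_wal_lsn() - :'start_lsn' AS wal_bytes_uuid_pk;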

-- 
Peter Geoghegan




[HACKERS] heap/SLRU verification, relfrozenxid cut-off, and freeze-the-dead bug (Was: amcheck (B-Tree integrity checking tool))

2017-10-09 Thread Peter Geoghegan
On Sun, Oct 16, 2016 at 6:46 PM, Noah Misch  wrote:
>> Noah and I discussed possible future directions for amcheck in person
>> recently. I would like to get Noah's thoughts again here on how a tool
>> like amcheck might reasonably target other access methods for
>> verification. In particular, the heapam. MultiXacts were mentioned as
>> a structure that could receive verification in a future iteration of
>> this tool, but I lack expertise there.
>
> Yes, a heap checker could examine plenty of things.  Incomplete list:

The heap checking enhancement that is under review in the next CF [1],
which checks an index against the table it indexes, doesn't directly
test any of the things you outlined in this mail almost exactly a year
ago. (That said, some of the same things are checked indirectly,
simply by using IndexBuildHeapScan().)

I do still think that there is a very real need for a "direct
heap/SLRU checker" function, though. I only put off implementing one
because there was still some low hanging fruit. I would like to go
over your suggestions again now. My understanding of what we *can* do
has evolved in the past several months.

> - Detect impossible conditions in the hint bits.  A tuple should not have both
>   HEAP_XMAX_COMMITTED and HEAP_XMAX_INVALID.  Every tuple bearing
>   HEAP_ONLY_TUPLE should bear HEAP_UPDATED.  HEAP_HASVARWIDTH should be true
>   if and only if the tuple has a non-NULL value in a negative-typlen column,
>   possibly a dropped column.  A tuple should not have both HEAP_KEYS_UPDATED
>   and HEAP_XMAX_LOCK_ONLY.

It certainly makes sense to check for clearly contradictory hint bits like this.
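
For illustration only, a couple of those conditions can already be
expressed as pageinspect queries today (relation name 't' assumed; a
real checker would obviously do this in C, page by page):

-- both HEAP_XMAX_COMMITTED (0x0400) and HEAP_XMAX_INVALID (0x0800) set:
SELECT lp, t_xmax, to_hex(t_infomask) AS infomask
FROM heap_page_items(get_raw_page('t', 0))
WHERE (t_infomask & x'0400'::int) <> 0
  AND (t_infomask & x'0800'::int) <> 0;

-- HEAP_ONLY_TUPLE (t_infomask2 & 0x8000) without HEAP_UPDATED (t_infomask & 0x2000):
SELECT lp, to_hex(t_infomask) AS infomask, to_hex(t_infomask2) AS infomask2
FROM heap_page_items(get_raw_page('t', 0))
WHERE (t_infomask2 & x'8000'::int) <> 0
  AND (t_infomask  & x'2000'::int) = 0;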

> - Verify agreement between CLOG, MULTIXACT, and hint bits.

This is where it gets complicated, I think. This is what I really want
to talk about.

The recent commit 20b65522 pretty clearly established that a tuple
could technically escape freezing (having HEAP_XMIN_FROZEN set)
despite actually being before a relfrozenxid cut-off. The idea was
that as long as we reliably set hint bits on heap-only tuples through
WAL-logging, it doesn't matter that their CLOG could be truncated,
because nobody will ever need to interrogate the CLOG anyway (to coin
a phrase, the heap-only tuple nevertheless still had its xmax
"logically frozen"). If nothing else, this leaves me with a very
squishy set of invariant conditions to work with when I go to verify
agreement with CLOG, MULTIXACT, and hint bits.

So, the question is: What is the exact set of invariant conditions
that I can check when I go to verify agreement between a heap relation
and the various SLRUs? In particular, at what precise XID-wise point
do CLOG and MULTIXACT stop reliably storing transaction status? And,
is there a different point for the xmax of tuples that happen to be
heap-only tuples?

Another important concern here following 20b65522 is: Why is it safe
to assume that nobody will ever call TransactionIdDidCommit() for
"logically frozen" heap-only tuples that are not at the end of their
HOT chain, and in so doing get a wrong answer? I can't find a rule
that implies that there is no dangerous ambiguity that's written down
anywhere. I *can* find a comment that suggests that it's wrong, though
-- the "N.B." comment at the top of heap_prepare_freeze_tuple().
(Granted, that comment doesn't seem to acknowledge the fact that the
caller does WAL-log as part of freezing; there has been some churn in
this area.)

I'm pretty sure that the classic rule for TransactionIdDidCommit()
when called from tqual.c was that the tuple cannot have
HEAP_XMIN_FROZEN set, which was straightforward enough back when a
frozen tuple was assumed to have necessarily committed (and to have a
"raw" xmin point that should logically be visible to everyone). This
business of "logically frozen" xmax values for dead heap-only tuples
makes things *way* more complicated, though. I'm concerned that we've
failed to account for that, and that TransactionIdDidCommit() calls
concerning pre-relfrozenxid XIDs can still happen.

I should point out that, for whatever reason, there is evidence
that we do in fact risk TransactionIdDidCommit() calls that give wrong
answers (independent of the ~2014 MULTIXACT bugs where that was
clearly the case, and even before 20b65522): Jeff Janes showed [2]
that there is probably some unaccounted-for case where "premature"
truncation takes place. This may be related to the recent HOT
chain/freezing bugs, and our (only partial [3]) fixes may have fixed
that, too -- I just don't know.

[1] https://commitfest.postgresql.org/15/1263/
[2] 
https://postgr.es/m/CAMkU=1y-tnp_7jdfg_ubxqdb8esmo280acdgdwsa429hrte...@mail.gmail.com
[3] https://postgr.es/m/20171007232524.sdpi3jk2636fvjge@alvherre.pgsql
-- 
Peter Geoghegan




Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple

2017-10-08 Thread Peter Geoghegan
On Sat, Oct 7, 2017 at 4:25 PM, Alvaro Herrera  wrote:
> Hmm, I think I added a random sleep (max. 100ms) right after the
> HeapTupleSatisfiesVacuum call in vacuumlazy.c (lazy_scan_heap), and that
> makes the race easier to hit.

I still cannot reproduce. Perhaps you can be more specific?

-- 
Peter Geoghegan




Re: [HACKERS] Prepared statements assume text type in PG10

2017-10-07 Thread Peter Geoghegan
On Sat, Oct 7, 2017 at 4:27 PM, Peter Geoghegan  wrote:
> I suspect commit d8d32d9 is involved here, though I haven't verified that.

Weirdly, there is a git hash collision here, so you'll have to put in
d8d32d9a (8 characters -- the default of 7 for a short git hash isn't
cutting it).

-- 
Peter Geoghegan




Re: [HACKERS] Prepared statements assume text type in PG10

2017-10-07 Thread Peter Geoghegan
On Sat, Oct 7, 2017 at 2:56 PM, Jack Christensen
 wrote:
> The test suite for the Go PostgreSQL driver pgx
> (https://github.com/jackc/pgx) found an unexpected behavior change in PG10.
> Previously, it was impossible to prepare a statement with a unknown or
> ambiguous parameter type.
>
> Pre-version 10:
>
> jack=# prepare ps as select $1;
> ERROR:  could not determine data type of parameter $1
>
> But on PG10 the type defaults to text:
>
> jack=# prepare ps as select $1;
> PREPARE
> Time: 0.183 ms
> jack=# execute ps('Hello, there');
>?column?
> --
>  Hello, there
> (1 row)
>
> Time: 0.437 ms
>
> I looked through the git log and couldn't find any commits referencing this.
> Is this an intended behavior change?

I suspect commit d8d32d9 is involved here, though I haven't verified that.

-- 
Peter Geoghegan




Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple

2017-10-07 Thread Peter Geoghegan
On Sat, Oct 7, 2017 at 1:31 AM, Alvaro Herrera  wrote:
>> As you must have seen, Alvaro said he has a variant of Dan's original
>> script that demonstrates that a problem remains, at least on 9.6+,
>> even with today's fix. I think it's the stress-test that plays with
>> fillfactor, many clients, etc [1].
>
> I just execute setup.sql once and then run this shell command,
>
> while :; do
> psql -e -P pager=off -f ./repro.sql
> for i in `seq 1 5`; do
> psql -P pager=off -e --no-psqlrc -f ./lock.sql &
> done
> wait && psql -P pager=off -e --no-psqlrc -f ./reindex.sql
> psql -P pager=off -e --no-psqlrc -f ./report.sql
> echo "done"
> done

I cannot reproduce the problem on my personal machine using this
script/stress-test. I tried to do so on the master branch git tip.
This reinforces the theory that there is some timing sensitivity,
because the remaining race condition is very narrow.


-- 
Peter Geoghegan




Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple

2017-10-06 Thread Peter Geoghegan
On Fri, Oct 6, 2017 at 2:09 PM, Wong, Yi Wen  wrote:
> Yesterday, I've been spending time with pg_visibility on the pages when I 
> reproduce the issue in 9.6.
> None of the all-frozen or all-visible bits are necessarily set in problematic 
> pages.

Since this happened yesterday, I assume it was with an unfixed version?

As you must have seen, Alvaro said he has a variant of Dan's original
script that demonstrates that a problem remains, at least on 9.6+,
even with today's fix. I think it's the stress-test that plays with
fillfactor, many clients, etc [1].

I've tried to independently reproduce the problem on the master
branch's current tip, with today's new fix, but cannot break things
despite trying many variations. I cannot reproduce the problem that
Alvaro still sees.

I'll have to wait until Alvaro posts his repro to the list before
commenting further, which I assume he'll post as soon as he can. There
doesn't seem to be much point in not waiting for that.

[1] https://postgr.es/m/20171005162402.jahqflf3mekileqm@alvherre.pgsql
-- 
Peter Geoghegan




Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple

2017-10-06 Thread Peter Geoghegan
On Fri, Oct 6, 2017 at 11:34 AM, Peter Geoghegan  wrote:
>> I don't know if it's really the freeze map at fault or something else.
>
> Ideally, it would be possible to effectively disable the new freeze
> map stuff in a minimal way, for testing purposes. Perhaps the authors
> of that patch, CC'd, can suggest a way to do that.

Actually, the simplest thing might be to just use pg_visibility's
pg_check_frozen() to check that the visibility/freeze map accurately
summarizes the all-frozen status of tuples in the heap. If that
doesn't indicate that there is corruption, we can be fairly confident
that the problem is elsewhere. The metadata in the visibility/freeze
map should be accurate when a bit is set to indicate that an entire
heap page is all-frozen (or, separately, all-visible). We can hardly
expect it to have better information than the authoritative source of
truth, the heap itself.
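
To be concrete, this is all it takes (a sketch; 't' stands in for
whichever table is suspect):

CREATE EXTENSION IF NOT EXISTS pg_visibility;

-- TIDs of tuples that are not actually frozen, on pages whose
-- all-frozen bit is set in the visibility/freeze map:
SELECT * FROM pg_check_frozen('t');

-- likewise for the all-visible bit:
SELECT * FROM pg_check_visible('t');

If both of those return zero rows, the map is consistent with the
heap, and any remaining corruption is elsewhere.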

The more I think about it, the more I tend to doubt that the remaining
problems are with the freeze map. If the freeze map was wrong, and
incorrectly said that a page was all-frozen, then surely the outward
symptoms would take a long time to show up, as they always do when we
accidentally fail to freeze a tuple before a relfrozenxid cutoff. ISTM
that that's the only meaningful way that the freeze map can be wrong
-- it only promises to be accurate when it says that no further
freezing is needed for a page/bit.

-- 
Peter Geoghegan




Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple

2017-10-06 Thread Peter Geoghegan
On Fri, Oct 6, 2017 at 10:49 AM, Alvaro Herrera  wrote:
> I can tell that, in 9.6, REINDEX still reports the error we saw in
> earlier releases, after some of the runs of my reproducer scripts.  I'm
> unable to reproduce it anymore in 9.3 to 9.5.  I can't see the one Dan
> originally reported anywhere, either.

You mean the enhanced stress-test that varied fillfactor, added filler
columns, and so on [1]? Can you post that to the list, please? I think
that several of us would like to have a reproducible test case.

> I don't know if it's really the freeze map at fault or something else.

Ideally, it would be possible to effectively disable the new freeze
map stuff in a minimal way, for testing purposes. Perhaps the authors
of that patch, CC'd, can suggest a way to do that.

If I had to guess, I'd say that it's just as likely that the issue is
only reproducible on 9.6 because of the enhancements added in that
release that improved buffer pinning (the use of atomic ops to pin
buffers, moving buffer content locks into buffer descriptors, etc). It
was already a bit tricky to get the problem that remained after
20b6552 but before today's a5736bf to reproduce with Dan's script. It
often took me 4 or 5 attempts. (I wonder what it looks like with your
enhanced version of that script -- the one that I just asked about.)

It seems possible that we've merely reduced the window for the race to
the point that it's practically (though not theoretically) impossible
to reproduce the problem on versions < 9.6, though not on 9.6+.
Applying Occam's razor, the problem doesn't seem particularly likely
to be in the freeze map stuff, which isn't actually all that closely
related.

[1] https://postgr.es/m/20171005162402.jahqflf3mekileqm@alvherre.pgsql
-- 
Peter Geoghegan




Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple

2017-10-06 Thread Peter Geoghegan
On Fri, Oct 6, 2017 at 7:59 AM, Alvaro Herrera  wrote:
> By the way, I still wonder if there's any way for a new tuple to get
> inserted in the place where a HOT redirect would be pointing to, and
> have it be marked as Frozen, where the old redirect contains a
> non-invalid Xmax.  I tried to think of a way for that to happen, but
> couldn't think of anything.
>
> What I imagine is a sequence like this:
>
> 1. insert a tuple
> 2. HOT-update a tuple
> 3. prune the page, making lp 1 be a redirect (lp 2 is the new tuple)
> 4. start transaction
> 5. HOT-update the tuple again, creating HOT in lp 3
> 6. abort transaction (leaving aborted update in lp 3)
> 7. somehow remove tuple from lp 3, make slot available
> 8. new transaction comes along, inserts tuple in lp 3
> 9. somebody freezes tuple in lp3 (???)
>
> Then we follow the HOT chain, see that Xmin=2 in lp3 and conclude that
> the tuple is part of the chain because of an xid "match".
>
> Basically from step 7 onwards I don't think this is possible, but maybe
> I'm just blind.

For the record, I also think that this is impossible, in part because
pruning requires a cleanup buffer lock (and because HOT chains cannot
span pages). I wouldn't say that I am 100% confident about this,
though.

BTW, is this comment block that appears above
heap_prepare_freeze_tuple() now obsolete, following 20b65522 (and
maybe much earlier commits)?

 * NB: It is not enough to set hint bits to indicate something is
 * committed/invalid -- they might not be set on a standby, or after crash
 * recovery.  We really need to remove old xids.
 */

We WAL-log setting hint bits during freezing now, iff tuple xmin is
before the Xid cutoff and tuple is a heap-only tuple.

-- 
Peter Geoghegan




Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple

2017-10-06 Thread Peter Geoghegan
On Fri, Oct 6, 2017 at 6:18 AM, Alvaro Herrera  wrote:
> Wood, Dan wrote:
>> Yes, I’ve been testing 9.6.  I’ll try Alvaro’s patch today.
>>
>> I would prefer to focus on either latest 9X or 11dev.
>
> I tested my patch on 9.4 and 9.5 today and it seems to close the problem
> (with the patch, I waited 10x as many iterations as it took for the
> problem to occur ~10 times without the patch), but I can reproduce a
> problem in 9.6 with my patch installed.  There must be something new in
> 9.6 that is causing the problem to reappear.

What problem persists? The original one (or, at least, the original
symptom of pruning HOT chains incorrectly)? If that's what you mean, I
wouldn't be so quick to assume that it's the freeze map.

-- 
Peter Geoghegan




Re: [HACKERS] A design for amcheck heapam verification

2017-10-05 Thread Peter Geoghegan
On Fri, Sep 29, 2017 at 10:54 AM, Peter Geoghegan  wrote:
>> Something that allocates new memory as the patch's bloom_init()
>> function does I'd tend to call 'make' or 'create' or 'new' or
>> something, rather than 'init'.
>
> I tend to agree. I'll adopt that style in the next version. I just
> didn't want the caller to have to manage the memory themselves.

v3 of the patch series, attached, does it that way -- it adds a
bloom_create(). The new bloom_create() function still allocates its
own memory, but does so while using a FLEXIBLE_ARRAY_MEMBER. A
separate bloom_init() function (that works with dynamic shared memory)
could easily be added later, for the benefit of parallel hash join.

Other notable changes:

* We now support bloom filters that have bitsets of up to 512MB in
size. The previous limit was 128MB.

* We now use TransactionXmin for our AccessShareLock xmin cutoff,
rather than calling GetOldestXmin(). This is the same cut-off used by
xacts that must avoid broken hot chains for their earliest snapshot.
This avoids a scan of the proc array, and allows more thorough
verification, since GetOldestXmin() was overly restrictive here.

* Expanded code comments describing the kinds of problems the new
verification capability is expected to be good at catching.

For example, there is now a passing reference to the CREATE INDEX
CONCURRENTLY bug that was fixed back in February (it's given as an
example of a more general problem -- faulty HOT safety assessment).
With the new heapallindexed enhancement added by this patch, amcheck
can be expected to catch that issue much of the time. We also go into
heap-only tuple handling within IndexBuildHeapScan(). The way that
CREATE INDEX tries to index the most recent tuple in a HOT chain
(while locating the root tuple in the chain, to get the right heap TID
for the index) has proven to be very useful as a smoke test while
investigating HOT/VACUUM FREEZE bugs in the past couple of weeks [1].
I believe it would have caught several historic MultiXact/recovery
bugs, too. This all seems worth noting explicitly.
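
For reference, with the patch applied the new check is invoked like
this (index names are made up; heapallindexed defaults to false, so
existing callers are unaffected):

-- structural checks plus a Bloom-filter probe for every heap tuple
-- that ought to have an index entry:
SELECT bt_index_check('idx_foo', heapallindexed => true);

-- the more thorough parent/child variant, which takes heavier locks:
SELECT bt_index_parent_check('idx_foo', heapallindexed => true);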

[1] 
https://postgr.es/m/cah2-wznm4rcrhfaiwkpwtpew2bxdtgrozk7jwwgucxeh3d1...@mail.gmail.com
-- 
Peter Geoghegan
From 3bed03a9e0506c0b81097b634c5f1b5534a2dcb3 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan 
Date: Tue, 2 May 2017 00:19:24 -0700
Subject: [PATCH 2/2] Add amcheck verification of indexes against heap.

Add a new, optional capability to bt_index_check() and
bt_index_parent_check():  callers can check that each heap tuple that
ought to have an index entry does in fact have one.  This happens at the
end of the existing verification checks.

This is implemented by using a Bloom filter data structure.  The
implementation performs set membership tests within a callback (the same
type of callback that each index AM registers for CREATE INDEX).  The
Bloom filter is populated during the initial index verification scan.
---
 contrib/amcheck/Makefile |   2 +-
 contrib/amcheck/amcheck--1.0--1.1.sql|  28 
 contrib/amcheck/amcheck.control  |   2 +-
 contrib/amcheck/expected/check_btree.out |  14 +-
 contrib/amcheck/sql/check_btree.sql  |   9 +-
 contrib/amcheck/verify_nbtree.c  | 275 ---
 doc/src/sgml/amcheck.sgml| 157 ++
 src/include/utils/snapmgr.h  |   2 +-
 8 files changed, 423 insertions(+), 66 deletions(-)
 create mode 100644 contrib/amcheck/amcheck--1.0--1.1.sql

diff --git a/contrib/amcheck/Makefile b/contrib/amcheck/Makefile
index 43bed91..c5764b5 100644
--- a/contrib/amcheck/Makefile
+++ b/contrib/amcheck/Makefile
@@ -4,7 +4,7 @@ MODULE_big	= amcheck
 OBJS		= verify_nbtree.o $(WIN32RES)
 
 EXTENSION = amcheck
-DATA = amcheck--1.0.sql
+DATA = amcheck--1.0--1.1.sql amcheck--1.0.sql
 PGFILEDESC = "amcheck - function for verifying relation integrity"
 
 REGRESS = check check_btree
diff --git a/contrib/amcheck/amcheck--1.0--1.1.sql b/contrib/amcheck/amcheck--1.0--1.1.sql
new file mode 100644
index 000..e6cca0a
--- /dev/null
+++ b/contrib/amcheck/amcheck--1.0--1.1.sql
@@ -0,0 +1,28 @@
+/* contrib/amcheck/amcheck--1.0--1.1.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "ALTER EXTENSION amcheck UPDATE TO '1.1'" to load this file. \quit
+
+--
+-- bt_index_check()
+--
+DROP FUNCTION bt_index_check(regclass);
+CREATE FUNCTION bt_index_check(index regclass,
+heapallindexed boolean DEFAULT false)
+RETURNS VOID
+AS 'MODULE_PATHNAME', 'bt_index_check'
+LANGUAGE C STRICT PARALLEL RESTRICTED;
+
+--
+-- bt_index_parent_check()
+--
+DROP FUNCTION bt_index_parent_check(regclass);
+CREATE FUNCTION bt_index_parent_check(index regclass,
+heapallindexed boolean DEFAULT false)
+RETURNS VOID
+AS 'MODULE_PATHNAME', 'bt_index_parent_check'
+LANGUAGE

Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple

2017-10-05 Thread Peter Geoghegan
On Thu, Oct 5, 2017 at 4:35 AM, Alvaro Herrera  wrote:
> At any rate, I was thinking in a new routine to encapsulate the logic,
>
> /*
>  * Check the tuple XMIN against prior XMAX, if any
>  */
> if (!HeapTupleUpdateXmaxMatchesXmin(priorXmax, 
> HeapTupleHeaderGetXmin(htup)))
> break;
>
> where ..XmaxMatchesXmin would know about checking for possibly frozen
> tuples.

Makes sense.

>> I can see why it
>> would be safe (or at least no more dangerous) to rely on
>> HeapTupleHeaderGetRawXmin() in the way mentioned here, at least on
>> installations that initdb'd on a version after commit 37484ad2a
>> (version 9.4+). However, I'm not sure why what you propose here would
>> be safe when even raw xmin happens to be FrozenTransactionId. Are you
>> sure that that's truly race-free? If it's really true that we only
>> need to check for FrozenTransactionId on 9.3, why not just do that on
>> all versions, and never bother with HeapTupleHeaderGetRawXmin()?
>> ("Sheer paranoia" is a valid answer; I just want us to be clear on the
>> reasoning.)
>
> I think the RawXmin is the better mechanism.  I'm not absolutely certain
> that the windows is completely closed in 9.3; as I understand things, it
> is possible for transaction A to prune an aborted heap-only tuple, then
> transaction B to insert a frozen tuple in the same location, then
> transaction C follows a link to the HOT that was pruned.  I think we'd
> end up considering that the new frozen tuple is part of the chain, which
> is wrong ...  In 9.4 we can compare the real xmin.

Good point: the race doesn't exist on 9.4+ with pg_upgrade from 9.3,
when raw xmin happens to be FrozenTransactionId, because there can be
no new tuples that have that property. This possible race is strictly
a 9.3 issue. (You have to deal with a raw xmin of FrozenTransactionId
on 9.4+, but there are no race conditions, because affected tuples are
all pre-pg_upgrade tuples -- they were frozen months ago, not
microseconds ago.)

> I hope I am proved wrong about this, because if not, I think we're
> looking at an essentially unsolvable problem in 9.3.

Well, as noted within README.HOT, the xmin/xmax matching did not
appear in earlier versions of the original HOT patch, because it was
thought that holding a pin of the buffer was a sufficient interlock
against concurrent line pointer recycling (pruning must have a buffer
cleanup lock). Clearly it is more robust to match xmin to xmax, but is
there actually any evidence that it was really necessary at all (for
single-page HOT chain traversals) when HOT went in in 2007, or that it
has since become necessary? heap_page_prune()'s caller has to have a
buffer cleanup lock.

I'm certainly not advocating removing the xmin/xmax matching within
heap_prune_chain() on 9.3. However, it may be acceptable to rely on
holding a buffer cleanup lock within heap_prune_chain() on 9.3, just
for the rare/theoretical cases where the FrozenTransactionId raw xmin
ambiguity means that xmin/xmax matching alone might not be enough. As
you say, it seems unsolvable through looking at state on the page
directly on 9.3, so there may be no other way. And, that's still
strictly better than what we have today.

> As far as I understand, this problem only emerges if one part of a HOT
> chain reaches the min freeze age while another part of the same chain is
> still visible by some running transaction.  It is particularly
> noticeably in our current test case because we use a min freeze age of
> 0, with many concurrrent modifying the same page.  What this says to me
> is that VACUUM FREEZE is mildly dangerous when there's lots of high
> concurrent HOT UPDATE activity.

I'm not sure what you mean here. It is dangerous right now, because
there is a bug, but we can squash the bug.

-- 
Peter Geoghegan




Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple

2017-10-04 Thread Peter Geoghegan
On Wed, Oct 4, 2017 at 6:46 AM, Alvaro Herrera  wrote:
> Wong, Yi Wen wrote:
>> My interpretation of README.HOT is the check is just to ensure the chain is 
>> continuous; in which case the condition should be:
>>
>> > if (TransactionIdIsValid(priorXmax) &&
>> > !TransactionIdEquals(priorXmax, 
>> > HeapTupleHeaderGetRawXmin(htup)))
>> > break;
>>
>> So the difference is GetRawXmin vs GetXmin, because otherwise we get the 
>> FreezeId instead of the Xmin when the transaction happened

As you know, on version 9.4+, as of commit 37484ad2a, we decided that
we are "largely ignoring the value to which it [xmin] is set". The
expectation became that raw xmin is available after freezing, but
mostly for forensic purposes. I think Alvaro should now memorialize
the idea that its value is actually critical in some place
(htup_details.h?).

> I independently arrived at the same conclusion.  Since I was trying with
> 9.3, the patch differs -- in the old version we must explicitely test
> for the FrozenTransactionId value, instead of using GetRawXmin.

Obviously you're going to have to be prepared for a raw xmin of
FrozenTransactionId, even on 9.4+, due to pg_upgrade. I can see why it
would be safe (or at least no more dangerous) to rely on
HeapTupleHeaderGetRawXmin() in the way mentioned here, at least on
installations that initdb'd on a version after commit 37484ad2a
(version 9.4+). However, I'm not sure why what you propose here would
be safe when even raw xmin happens to be FrozenTransactionId. Are you
sure that that's truly race-free? If it's really true that we only
need to check for FrozenTransactionId on 9.3, why not just do that on
all versions, and never bother with HeapTupleHeaderGetRawXmin()?
("Sheer paranoia" is a valid answer; I just want us to be clear on the
reasoning.)

Obviously any race would have a ridiculously tiny window, but it's not
obvious why this protocol would be completely race-free (in the event
of a FrozenTransactionId raw xmin).

-- 
Peter Geoghegan




Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple

2017-10-04 Thread Peter Geoghegan
On Wed, Oct 4, 2017 at 11:00 AM, Wood, Dan  wrote:
> The early “break;” here is likely the xmin frozen reason as I found in the 
> other loop.

It looks that way.

Since we're already very defensive when it comes to this xmin/xmax
matching business, and we're defensive while following an update chain
more generally, I wonder if we should be checking
HeapTupleHeaderIsSpeculative() on versions >= 9.5 (versions with ON
CONFLICT DO UPDATE, where t_ctid block number might actually be a
speculative insertion token). Or, at least acknowledging that case in
comments. I remember expressing concern that something like this was
possible at the time that that went in.

We certainly don't want to have heap_abort_speculative() "super
delete" the wrong tuple in the event of item pointer recycling. There
are at least defensive sanity checks that would turn that into an
error within heap_abort_speculative(), so it wouldn't be a disaster if
it was attempted. I don't think that it's possible in practice, and
maybe it's sufficient that routines like heap_get_latest_tid() check
for a sane item offset, which will discriminate against
SpecTokenOffsetNumber, which must be well out of range for ItemIds on
the page. Could be worth a comment.

(I guess that heap_prune_chain() wouldn't need to be changed if we
decide to add such comments, because the speculative tuple ItemId is
going to be skipped over due to being ItemIdIsUsed() before we even
get there.)
-- 
Peter Geoghegan




Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple

2017-10-04 Thread Peter Geoghegan
On Tue, Oct 3, 2017 at 8:43 PM, Wong, Yi Wen  wrote:
> My interpretation of README.HOT is the check is just to ensure the chain is 
> continuous; in which case the condition should be:
>
>> if (TransactionIdIsValid(priorXmax) &&
>> !TransactionIdEquals(priorXmax, 
>> HeapTupleHeaderGetRawXmin(htup)))
>> break;
>
> So the difference is GetRawXmin vs GetXmin, because otherwise we get the 
> FreezeId instead of the Xmin when the transaction happened

I was thinking along similar lines.

> The interesting consequence of changing that is the prune seems to get the 
> entire chain altogether with Dan's repro... I've run it a couple of times and 
> have consistently gotten the following page
>
>  lp | t_ctid | lp_off | lp_flags | t_infomask | t_infomask2
> ----+--------+--------+----------+------------+-------------
>   1 | (0,1)  |   8152 |        1 |       2818 |           3
>   2 |        |      7 |        2 |            |
>   3 |        |      0 |        0 |            |
>   4 |        |      0 |        0 |            |
>   5 |        |      0 |        0 |            |
>   6 |        |      0 |        0 |            |
>   7 | (0,7)  |   8112 |        1 |      11010 |       32771
> (7 rows)

That's also what I see. This is a good thing, I think; that's how we
ought to prune.

-- 
Peter Geoghegan




Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple

2017-10-03 Thread Peter Geoghegan
On Tue, Oct 3, 2017 at 6:09 PM, Wood, Dan  wrote:
> I’ve just started looking at this again after a few weeks break.

> if (TransactionIdIsValid(priorXmax) &&
> !TransactionIdEquals(priorXmax, 
> HeapTupleHeaderGetXmin(htup)))
> break;

> We need to understand why these TXID equal checks exist.  Can we 
> differentiate the cases they are protecting against with the two exceptions 
> I’ve found?

I haven't read your remarks here in full, since I'm about to stop
working for the day, but I will point out that
src/backend/access/heap/README.HOT says a fair amount about this,
under "Abort Cases".


-- 
Peter Geoghegan




Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple

2017-10-03 Thread Peter Geoghegan
On Tue, Oct 3, 2017 at 5:15 PM, Michael Paquier
 wrote:
>> FYI, the repro case page contents looks like this with the patch applied:
>> postgres=# select lp, lp_flags, t_xmin, t_xmax, t_ctid,
>> to_hex(t_infomask) as infomask,
>> to_hex(t_infomask2) as infomask2
>> from heap_page_items(get_raw_page('t', 0));
>>  lp | lp_flags | t_xmin  | t_xmax | t_ctid | infomask | infomask2
>> ----+----------+---------+--------+--------+----------+-----------
>>   1 |        1 | 1845995 |      0 | (0,1)  | b02      | 3
>>   2 |        2 |         |        |        |          |
>>   3 |        0 |         |        |        |          |
>>   4 |        0 |         |        |        |          |
>>   5 |        0 |         |        |        |          |
>>   6 |        0 |         |        |        |          |
>>   7 |        1 | 1846001 |      0 | (0,7)  | 2b02     | 8003
>> (7 rows)
>
> Is lp_off for tid (0,2) pointing to (0,7)? A hot chain preserved is
> what would look correct to me.

Yes, it is:

postgres=# select * from bt_page_items('foo', 1);
 itemoffset | ctid  | itemlen | nulls | vars |           data
------------+-------+---------+-------+------+--------------------------
          1 | (0,1) |      16 | f     | f    | 01 00 00 00 00 00 00 00
          2 | (0,2) |      16 | f     | f    | 03 00 00 00 00 00 00 00
(2 rows)

I can tell from looking at my hex editor that the 4 bytes of ItemId
that we see for position '(0,2)' in the ItemId array are "07 00 01
00", meaning that '(0,2)' is an LP_REDIRECT item, pointing us to
'(0,7)'. Everything here looks sane to me, at least at first blush.
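
(For anyone without a hex editor handy: heap_page_items() decodes the
same ItemId fields, since for a redirect lp_off stores the offset
number being linked to. Something like this, given the page shown
above:

postgres=# select lp, lp_off, lp_flags, lp_len
           from heap_page_items(get_raw_page('t', 0)) where lp = 2;
 lp | lp_off | lp_flags | lp_len
----+--------+----------+--------
  2 |      7 |        2 |      0
(1 row)

lp_flags = 2 is LP_REDIRECT.)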

> - * Check the tuple XMIN against prior XMAX, if any
> - */
> -if (TransactionIdIsValid(priorXmax) &&
> -!TransactionIdEquals(HeapTupleHeaderGetXmin(htup), priorXmax))
> -break;
> If you remove this check, you could also remove completely priorXmax.
>
> Actually, I may be missing something, but why is priorXmax updated
> even for dead tuples? For example just doing that is also taking care
> of the problem:

I'll study what you suggest here some more tomorrow.

-- 
Peter Geoghegan




Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple

2017-10-03 Thread Peter Geoghegan
On Tue, Oct 3, 2017 at 9:48 AM, Alvaro Herrera  wrote:
> But that still doesn't fix the problem;
> as far as I can see, vacuum removes the root of the chain, not yet sure
> why, and then things are just as corrupted as before.

Are you sure it's not opportunistic pruning? Another thing that I've
noticed with this problem is that the relevant IndexTuple will pretty
quickly vanish, presumably due to LP_DEAD setting (but maybe not
actually due to LP_DEAD setting).

(Studies the problem some more...)

I now think that it actually is a VACUUM problem, specifically a
problem with VACUUM pruning. You see the HOT xmin-to-xmax check
pattern that you mentioned within heap_prune_chain(), which looks like
where the incorrect tuple prune (or possibly, at times, redirect?)
takes place. (I refer to the prune/kill that you mentioned today, that
frustrated your first attempt at a fix -- "I modified the multixact
freeze code...".)

The attached patch "fixes" the problem -- I cannot get amcheck to
complain about corruption with this applied. And, "make check-world"
passes. Hopefully it goes without saying that this isn't actually my
proposed fix. It tells us something that this at least *masks* the
problem, though; it's a start.

FYI, the repro case page contents looks like this with the patch applied:

postgres=# select lp, lp_flags, t_xmin, t_xmax, t_ctid,
to_hex(t_infomask) as infomask,
to_hex(t_infomask2) as infomask2
from heap_page_items(get_raw_page('t', 0));
 lp | lp_flags | t_xmin  | t_xmax | t_ctid | infomask | infomask2
----+----------+---------+--------+--------+----------+-----------
  1 |        1 | 1845995 |      0 | (0,1)  | b02      | 3
  2 |        2 |         |        |        |          |
  3 |        0 |         |        |        |          |
  4 |        0 |         |        |        |          |
  5 |        0 |         |        |        |          |
  6 |        0 |         |        |        |          |
  7 |        1 | 1846001 |      0 | (0,7)  | 2b02     | 8003
(7 rows)

-- 
Peter Geoghegan
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 52231ac..90eb39e 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -470,13 +470,6 @@ heap_prune_chain(Relation relation, Buffer buffer, OffsetNumber rootoffnum,
 		ItemPointerSet(&(tup.t_self), BufferGetBlockNumber(buffer), offnum);
 
 		/*
-		 * Check the tuple XMIN against prior XMAX, if any
-		 */
-		if (TransactionIdIsValid(priorXmax) &&
-			!TransactionIdEquals(HeapTupleHeaderGetXmin(htup), priorXmax))
-			break;
-
-		/*
 		 * OK, this tuple is indeed a member of the chain.
 		 */
 		chainitems[nchain++] = offnum;



Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple

2017-10-03 Thread Peter Geoghegan
On Tue, Oct 3, 2017 at 9:48 AM, Alvaro Herrera  wrote:
> which shows a HOT-update chain, where the t_xmax are multixacts.  Then a
> vacuum freeze comes, and because the multixacts are below the freeze
> horizon for multixacts, we get this:
>
> select lp, lp_flags, t_xmin, t_xmax, t_ctid, to_hex(t_infomask) as infomask,
> to_hex(t_infomask2) as infomask2
> from heap_page_items(get_raw_page('t', 0));
>  lp | lp_flags | t_xmin | t_xmax | t_ctid | infomask | infomask2
> ----+----------+--------+--------+--------+----------+-----------
>   1 |        1 |      2 |      0 | (0,1)  | 902      | 3
>   2 |        0 |        |        |        |          |
>   3 |        1 |      2 |  14662 | (0,4)  | 2502     | c003
>   4 |        1 |      2 |  14663 | (0,5)  | 2502     | c003
>   5 |        1 |      2 |  14664 | (0,6)  | 2502     | c003
>   6 |        1 |      2 |  14665 | (0,7)  | 2502     | c003
>   7 |        1 |      2 |      0 | (0,7)  | 2902     | 8003
> (7 filas)
>
> where the xmin values have all been frozen, and the xmax values are now
> regular Xids.

I thought that we no longer store FrozenTransactionId (xid 2) as our
"raw" xmin while freezing, and yet that's what we see here. It looks
like pageinspect is looking at raw xmin (it's calling
HeapTupleHeaderGetRawXmin(), not HeapTupleHeaderGetXmin()). What's the
deal with that? Shouldn't the original xmin be preserved for forensic
analysis, following commit 37484ad2?

I guess what you mean is that this is what you see after having
modified the code to actually store FrozenTransactionId as xmin once
more, in an effort to fix this?

-- 
Peter Geoghegan




Re: [HACKERS] 64-bit queryId?

2017-10-02 Thread Peter Geoghegan
On Mon, Oct 2, 2017 at 9:10 AM, Robert Haas  wrote:
> On Mon, Oct 2, 2017 at 11:02 AM, Joshua D. Drake  
> wrote:
>> +1 to both of these as well.
>
> OK, so here's a patch.  Review appreciated.

You need to change the SQL interface as well, although I'm not sure
exactly how. The problem is that you are now passing a uint64 queryId
to Int64GetDatumFast() within pg_stat_statements_internal(). That
worked when queryId was a uint32, because you can easily represent
values <= UINT_MAX as an int64/int8. However, you cannot represent the
second half of the range of uint64 within a int64/int8. I think that
this will behave different depending on USE_FLOAT8_BYVAL, if nothing
else.

The background here is that we used int8 as the output type for the
function when queryId was first exposed via the SQL interface because
there was no 32-bit unsigned int type that we could have used (except
possibly Oid, but that's weird). You see the same pattern in a couple
of other places.
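
To spell out the range problem: any queryId with its high bit set has
no non-negative int8 representation at all, so it will either have to
be displayed as a negative number or be routed through numeric. A
quick illustration, nothing pg_stat_statements-specific about it:

-- a 64-bit hash value with the high bit set exceeds int8's maximum:
SELECT 17293822569102704640 > 9223372036854775807 AS exceeds_int8_max;

-- reinterpreting those same 64 bits as signed flips the sign:
SELECT (x'F000000000000000'::bit(64))::bigint AS as_signed_int8;
-- => -1152921504606846976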

-- 
Peter Geoghegan




Re: [HACKERS] CREATE COLLATION does not sanitize ICU's BCP 47 language tags. Should it?

2017-10-02 Thread Peter Geoghegan
On Mon, Oct 2, 2017 at 9:42 AM, Peter Eisentraut
 wrote:
> I understand where you are coming from.  My experience with developing
> this feature has been that it is quite fragile in some ways.  We have
> had this out there for testing for many months, and we have fixed many
> bugs, and follow-up bugs, up to very recently.  We don't have good
> automatic test coverage, so we need to rely on user testing.  Making
> significant code and design changes a week or two before the final
> release is just too late, even if the changes were undisputed.  And this
> wasn't just one specific isolated change, it was a set of overlapping
> changes that seemingly kept changing and expanding.

For the record, I don't accept that summary. We could have not
bothered with the sanitization thing at all, and I wouldn't have cared
very much. I even revised my proposal such that you would never have
needed to do the language tag mapping at all [1] (albeit while
desupporting ICU versions prior to 4.6). And, as recently as 7 days
ago, you yourself said:

"Reading over this code again, it is admittedly not quite clear why
this "canonicalization" is in there right now.  I think it had
something to do with how we built the keyword variants at one point.
It might not make sense.  I'd be glad to take that out and use the
result straight  from uloc_getAvailable() for collcollate." [2]

This would have been far more radical than what I proposed.

> I'm satisfied that what we are shipping is "good enough".  It has been
> in development for over a year, it has been reviewed and tested for
> months, and a lot of credit for that goes to you.

It is "good enough", I suppose. I am disappointed by this particular
outcome, but that's mostly because it could have been avoided. I'm
still very happy that you took on this project, and I don't want that
to be overshadowed by this particular disagreement.

> I'm available to discuss the changes you are envisioning, preferably one
> at a time, in the near future as part of the PG11 development process.

I don't really think that there is anything more to be done about the
issues raised on this thread, with the possible exception of the
DEBUG1 display name string thing. The sanitization just isn't valuable
enough to add after the fact. Apart from the risks of adding it after
the fact, another reason not to bother with it is that it's
*incredibly* forgiving. So forgiving that you could reasonably argue
that we might as well not have it at all.

I may well do more on ICU with you in the future, but not in the area
of how things are canonicalized or sanitized. It's too late for that
now.

[1] 
https://www.postgresql.org/message-id/CAH2-WzmVtRyNg2gT=u=ktEC-jM3aKq4bYzJ0u2=osxe+o3k...@mail.gmail.com
[2] 
https://www.postgresql.org/message-id/f6c0fca7-e277-3f46-c0c1-adc001bff...@2ndquadrant.com
-- 
Peter Geoghegan




Re: [HACKERS] CREATE COLLATION does not sanitize ICU's BCP 47 language tags. Should it?

2017-10-01 Thread Peter Geoghegan
On Sun, Oct 1, 2017 at 6:27 AM, Peter Eisentraut
 wrote:
> I'm still pondering committing my documentation patch I had posted, and
> I've been reviewing your patches to see if there is anything else
> documentation-wise that could be added.

-1 from me -- I don't think we should tell users about the deprecated
locale format at all.

I hope that ICU continue to support the deprecated locale string
format within ucol_open() into the future, and that users will only
ever use BCP 47 within CREATE COLLATION (as the 'locale' string).
Perhaps it won't matter that much in the end, and will turn out to be
more of a wart than something that causes pain for users. It's hard to
predict.

> As I had mentioned before, I disagree with the substance of the
> functionality changes you are proposing, and I think it would be way too
> late to make such significant changes.  The general community opinion
> also seems to be in favor of letting things be.

It does seem too late. It's disappointing that we did not do better
here. This problem was entirely avoidable.

-- 
Peter Geoghegan




Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple

2017-09-30 Thread Peter Geoghegan
On Thu, Sep 28, 2017 at 1:20 PM, Alvaro Herrera  wrote:
> Maybe what this means is that we need to do both Dan's initially
> proposed patch (or something related to it) apart from the fixes already
> pushed.  IOW we need to put back some of the "tupkeep" business ...

I took the time to specifically check if that would fix the problem.
Unfortunately, it did not. We see exactly the same problem, or at
least amcheck/REINDEX produces exactly the same error. I checked both
Dan's original update_freeze.patch, and your revision that retained
some of the "tupkeep" stuff,
0002-Don-t-freeze-recently-dead-HOT-tuples, which you posted on
September 6th.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: CREATE COLLATION does not sanitize ICU's BCP 47 language tags. Should it?

2017-09-30 Thread Peter Geoghegan
On Sat, Sep 30, 2017 at 12:28 PM, Tom Lane  wrote:
> This suggests to me that arguing about canonicalization is moot so
> far as avoiding reindexing goes: if you change ICU library versions,
> you're screwed and will have to jump through all the reindexing hoops,
> no matter what we do or don't do.  (Maybe we are looking at the wrong
> information to populate collversion?)

Just to be clear, I was never arguing that canonicalization would
avoid reindexing, and I didn't think anyone else was. I believe that
the version string will change when the behavior of its collator
changes for any reason and in any way. This includes changes to how
binary sort keys are generated.

(We don't currently store binary keys on disk, so a change to that
representation doesn't really necessitate a REINDEX for us. The UCA
spec explicitly decouples compatibility for indexing with binary keys
from changes needed due to human requirements. ICU binary sort keys
are compressed, and this is sometimes improved, independently of the
evolution of how natural language experts say text should be sorted.)
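
(The version string I'm talking about is what gets captured in
pg_collation.collversion when an ICU collation is created. For
example, on a stock installation:

SELECT collname, collcollate, collversion
FROM pg_collation
WHERE collprovider = 'i'
ORDER BY collname
LIMIT 5;

A mismatch between that stored value and what the linked ICU library
reports at runtime is what produces the version-mismatch warning at
first use of the collation.)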

> Now, it may still be worthwhile to argue about whether canonicalization
> will help the other component of the problem, which is whether you can
> dump and reload CREATE COLLATION commands into a new ICU version and
> expect to get more-or-less-the-same behavior as before.

Again, to be clear, that's what I was arguing all along. I think that
users will get very good results with this approach. When the sorting
behavior of a locale is refined by natural language experts, it's
almost certainly a very small change that most users that are affected
won't actually notice. For example, when en-us-x-icu is changed,
that's actually due to a general change in the inherited root collator
that doesn't really affect English speakers. There is no practical
question about how you're supposed to sort English text.

-- 
Peter Geoghegan




Re: [HACKERS] Re: CREATE COLLATION does not sanitize ICU's BCP 47 language tags. Should it?

2017-09-30 Thread Peter Geoghegan
On Sat, Sep 30, 2017 at 8:25 AM, Tom Lane  wrote:
> I'd also argue that the point of adopting ICU was exactly so we *could*
> distinguish those cases, and limit the scope of a normal upgrade to
> "reindex these identifiable indexes and you're done".  In the libc world,
> when you upgrade libc's locale definitions, you have no idea what the
> consequences are.

Right. With libc, we think of collations as something that there is a
small, fixed number of on a system, that we cannot safely assume
anything about. But with ICU, all of the semantics of how natural
languages should be sorted are exposed via various APIs, and there are
literally more possible sets of collation behaviors than there are
grains of sand in the Sahara (there are hundreds of distinct scripts,
which we can change the overall ordering of arbitrarily, on top of all
the other customizations). Clearly the libc way of looking at things
doesn't really carry over.

BCP 47 is supposed to be universal -- it's an IETF standard. That's
where all the stability guarantees are. The officially recognized 'u'
extension that ICU uses is a CLDR/Unicode thing, not an ICU thing. The
same format could, in the future, be used by other collation
providers, since there actually are other CLDR consumers/UCA
implementations. And, ICU have said that they have deprecated the old
locale format, and have standardized on BCP 47. As of ICU 54, it is
recommended that ucol_open() be passed a string in BCP 47 format.
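
For illustration (a standalone sketch of my own, assuming an installed ICU
of version 54 or later), this is the style of call being described; the
version reported by ucol_getVersion() is the kind of information a
collation version check can be keyed on:

#include <stdio.h>
#include <unicode/ucol.h>
#include <unicode/uversion.h>

int main(void)
{
    UErrorCode   status = U_ZERO_ERROR;
    UVersionInfo version;
    char         versionstr[U_MAX_VERSION_STRING_LENGTH];

    /* "de-u-co-phonebk": German phonebook collation, as a BCP 47 tag */
    UCollator   *coll = ucol_open("de-u-co-phonebk", &status);
    if (U_FAILURE(status))
        return 1;

    ucol_getVersion(coll, version);
    u_versionToString(version, versionstr);
    printf("collator version: %s\n", versionstr);

    ucol_close(coll);
    return 0;
}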

I'm surprised that this issue was not resolved earlier in the week. I
presumed that all of this was obvious to Peter E., but I seem to have
been wrong about that.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 64-bit queryId?

2017-09-30 Thread Peter Geoghegan
On Sat, Sep 30, 2017 at 8:39 AM, Peter Geoghegan  wrote:
> Isn't that already true in the case of queryId? I've never heard any
> complaints about collisions. Most people don't change
> pg_stat_statements.max, so the probability of a collision is more like
> 1%. And, that's the probability of *any* collision, not the
> probability of a collision that the user actually cares about. The
> majority of entries in pg_stat_statements among those ten thousand
> will not be interesting.

Correction: ten thousand is an example value of pg_stat_statements.max
in the docs, not the default value. The default is five thousand.
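
For anyone who wants to check the arithmetic, the standard birthday-bound
approximation (a standalone sketch; values are approximate) gives about
1.2% for ten thousand 32-bit queryIds, about 0.3% for the default five
thousand, and something vanishingly small with a 64-bit queryId:

#include <math.h>
#include <stdio.h>

/*
 * Birthday-bound approximation for the chance of at least one collision
 * among n uniformly distributed b-bit hash values:
 *     1 - exp(-n*(n-1) / (2 * 2^b))
 */
static double
collision_probability(double n, double bits)
{
    return 1.0 - exp(-(n * (n - 1.0)) / (2.0 * pow(2.0, bits)));
}

int main(void)
{
    printf("n=10000, 32-bit: %.2f%%\n", 100.0 * collision_probability(10000, 32));
    printf("n=5000,  32-bit: %.2f%%\n", 100.0 * collision_probability(5000, 32));
    printf("n=5000,  64-bit: %.1e%%\n", 100.0 * collision_probability(5000, 64));
    return 0;
}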

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 64-bit queryId?

2017-09-30 Thread Peter Geoghegan
On Sat, Sep 30, 2017 at 7:34 AM, Robert Haas  wrote:
> Assuming, however, that you don't manage to prove all known
> mathematics inconsistent, what one might reasonably hope to do is
> render collisions remote enough that one need not worry about them too
> much in practice.

Isn't that already true in the case of queryId? I've never heard any
complaints about collisions. Most people don't change
pg_stat_statements.max, so the probability of a collision is more like
1%. And, that's the probability of *any* collision, not the
probability of a collision that the user actually cares about. The
majority of entries in pg_stat_statements among those ten thousand
will not be interesting. Often, 90%+ will be one-hit wonders. If that
isn't true, then you're probably not going to find pg_stat_statements
very useful, because you have nothing to focus on.

I have heard complaints about a number of different things in
pg_stat_statements, like the fact that we don't always manage to
replace constants with '?'/'$n' symbols. I heard about
that quite a few times during my time at Heroku. But never this.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] CREATE COLLATION does not sanitize ICU's BCP 47 language tags. Should it?

2017-09-29 Thread Peter Geoghegan
On Mon, Sep 25, 2017 at 11:40 AM, Peter Eisentraut
 wrote:
> Reading over this code again, it is admittedly not quite clear why this
> "canonicalization" is in there right now.  I think it had something to
> do with how we built the keyword variants at one point.  It might not
> make sense.  I'd be glad to take that out and use the result straight
> from uloc_getAvailable() for collcollate.  That is, after all, the
> "canonical" version that ICU chooses to report to us.

Is anything going to happen here ahead of shipping v10? This remains
an open item.


-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2017-09-29 Thread Peter Geoghegan
On Wed, Sep 20, 2017 at 2:32 AM, Rushabh Lathia
 wrote:
> Yes, I haven't touched the randomAccess part yet. My initial goal was
> to incorporate the BufFileSet api's here.

This is going to need a rebase, due to the commit today to remove
replacement selection sort. That much should be easy.

> Sorry, I didn't get this part. Are you talking about the your patch changes
> into OpenTemporaryFileInTablespace(),  BufFileUnify() and other changes
> related to ltsUnify() ?  If that's the case, I don't think it require with
> the
> BufFileSet. Correct me if I am wrong here.

I thought that you'd have multiple BufFiles, which would be
multiplexed (much like a single BufFile itself multiplexes 1GB
segments), so that logtape.c could still recycle space in the
randomAccess case. I guess that that's not a goal now.

> To be frank its too early for me to comment anything in this area.  I need
> to study this more closely. As an initial goal I was just focused on
> understanding the current implementation of the patch and incorporate
> the BufFileSet APIs.

Fair enough.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgbench stuck with 100% cpu usage

2017-09-29 Thread Peter Geoghegan
On Thu, Sep 28, 2017 at 10:36 PM, Pavan Deolasee
 wrote:
> Well, I think it's a very legitimate test, not for testing performance, but
> for testing crash recovery, and I use it very often. This particular test was run
> to catch another bug which will be reported separately.

Yeah, I use pgbench for stuff like that all the time.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] A design for amcheck heapam verification

2017-09-29 Thread Peter Geoghegan
On Fri, Sep 29, 2017 at 10:29 AM, Robert Haas  wrote:
> I am also wondering whether this patch should consider
> 81c5e46c490e2426db243eada186995da5bb0ba7 as a way of obtaining
> multiple hash values.  I suppose that's probably inferior to what is
> already being done on performance grounds, but I'll just throw out a
> mention of it here all the same in case it was overlooked or the
> relevance not spotted...

Well, we sometimes only want one hash value. This happens when we're
very short on memory (especially relative to the estimated final size
of the set), so it's a fairly common requirement. And, we have a
convenient way to get a second independent uint32 hash function now
(murmurhash32()).
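
For what it's worth, once you have two independent 32-bit hashes, the usual
way to derive however many probe positions a Bloom filter needs is Kirsch
and Mitzenmacher style double hashing. A rough sketch, with the two
underlying hash functions left as assumptions (they stand in for the
existing hash plus something like murmurhash32()):

#include <stddef.h>
#include <stdint.h>

/* Assumed to exist: two independent 32-bit hash functions. */
extern uint32_t hash_a(const void *elem, size_t len);
extern uint32_t hash_b(const void *elem, size_t len);

/*
 * Derive k Bloom filter probe positions from just two base hashes:
 *     position_i = (h1 + i * h2) mod nbits
 */
static void
bloom_probe_positions(const void *elem, size_t len,
                      int k, uint64_t nbits, uint64_t *positions)
{
    uint32_t h1 = hash_a(elem, len);
    uint32_t h2 = hash_b(elem, len);

    for (int i = 0; i < k; i++)
        positions[i] = (h1 + (uint64_t) i * h2) % nbits;
}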

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] A design for amcheck heapam verification

2017-09-29 Thread Peter Geoghegan
On Thu, Sep 28, 2017 at 8:34 PM, Thomas Munro
 wrote:
> FWIW I think if I were attacking that problem the first thing I'd
> probably try would be getting rid of that internal pointer
> filter->bitset in favour of a FLEXIBLE_ARRAY_MEMBER and then making
> the interface look something like this:
>
> extern size_t bloom_estimate(int64 total elems, int work_mem);
> extern void bloom_init(bloom_filter *filter, int64 total_elems, int work_mem);
>
> Something that allocates new memory as the patch's bloom_init()
> function does I'd tend to call 'make' or 'create' or 'new' or
> something, rather than 'init'.

I tend to agree. I'll adopt that style in the next version. I just
didn't want the caller to have to manage the memory themselves.
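
An illustrative sketch of that style (names and the sizing helper here are
placeholders of mine, not the eventual amcheck code): the bitset lives at
the end of the struct as a flexible array member, so one allocation covers
everything and the caller never manages the bitset separately.

#include "postgres.h"

typedef struct bloom_filter
{
    int           k_hashes;                         /* number of hash functions */
    uint64        bitset_bits;                      /* bitset size, in bits */
    unsigned char bitset[FLEXIBLE_ARRAY_MEMBER];    /* bitset proper */
} bloom_filter;

/* Placeholder for whatever derives a bitset size from work_mem. */
extern uint64 bloom_sizing(int64 total_elems, int work_mem);

bloom_filter *
bloom_create(int64 total_elems, int work_mem)
{
    uint64        bitset_bytes = bloom_sizing(total_elems, work_mem);
    bloom_filter *filter;

    /* Single palloc; zeroed bitset is an empty filter. */
    filter = palloc0(offsetof(bloom_filter, bitset) + bitset_bytes);
    filter->bitset_bits = bitset_bytes * BITS_PER_BYTE;
    filter->k_hashes = 2;       /* placeholder; derived from sizing in reality */

    return filter;
}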

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] The case for removing replacement selection sort

2017-09-29 Thread Peter Geoghegan
On Fri, Sep 29, 2017 at 7:19 AM, Robert Haas  wrote:
> That supports your theory that there's some confounding factor in the
> CREATE INDEX case, such as I/O scheduling.  Since this machine has an
> SSD, I guess I don't have a mental model for how that works.  We're
> not waiting for the platter to rotate...

Random I/O is still significantly more expensive with SSDs, especially
random writes, where all the wear leveling stuff comes into play.
There is a tiny universe of very complicated firmware within every SSD
[1]. (I am generally concerned about the trend towards increasingly
complicated, unauditable firmware like this, but that's another
story.)

> ...but I guess that's all irrelevant as far as this patch goes.  The
> point of this patch is to simplify things from removing a technique
> that is no longer effective, and the evidence we have supports the
> contention that it is no longer effective.  I'll go commit this.

Thanks.

[1] https://lwn.net/Articles/353411/
-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] The case for removing replacement selection sort

2017-09-28 Thread Peter Geoghegan
On Thu, Sep 28, 2017 at 3:18 PM, Robert Haas  wrote:
> On Fri, Jul 14, 2017 at 6:20 PM, Peter Geoghegan  wrote:
>> With the additional enhancements made to Postgres 10, I doubt that
>> there are any remaining cases where it wins.
>
> I tried my favorite sorting test case -- pgbench -i -s 300 and then
> reindex index pgbench_accounts_pkey.  I set maintenance_work_mem to
> 4MB so that replacement selection would be chosen.
>
> On master, this takes ~21.5 seconds; with replacement_sort_tuples = 0,
> it takes ~19.1 seconds.  So, disabling replacement selection is a win.
>
> On 9.6, this takes ~19.1 seconds; with replacement_sort_tuples = 0, it
> takes ~23 seconds.  So, disabling replacement selection is a loss.
>
> This supports your theory that we should go ahead and remove
> replacement selection.

I'm glad to hear that. But, I should reiterate that if sorting
actually gets faster when my patch is applied, then that's something
that I consider to be a bonus. (This is primarily a refactoring patch,
to remove a bunch of obsolete code.)

> It does however bother me a bit that this test
> case is apparently slower in master than in 9.6, and not just with
> very small values of work_mem.  With default maintenance_work_mem
> (64MB), this takes about 13 seconds on 9.6 but about 15 seconds on
> master -- and with that setting replacement selection is not chosen at
> all.

As I mentioned, the picture changed for replacement selection during
the Postgres 10 development cycle, which caused me to finally push for
it to be killed in this thread. So, anyone that just started following
the thread should note that it's totally expected that replacement
selection still just about pulls its weight in 9.6, but not in 10.

> Any idea what causes this regression?

I don't know. My best guess is that the overall I/O scheduling is now
suboptimal due to commit e94568e (Heikki's preloading thing), because
this is CREATE INDEX, and there is new competitive pressure. You might
find it hard to replicate the problem with a "SELECT COUNT(DISTINCT
aid) FROM pgbench_accounts", which would confirm this explanation. Or,
you could also see what happens with a separate temp tablespace.

It's really, really hard to have a 100%, unambiguous win within
tuplesort.c. We've seen this time and again over the years.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple

2017-09-28 Thread Peter Geoghegan
On Thu, Sep 28, 2017 at 3:20 PM, Robert Haas  wrote:
> On Thu, Sep 28, 2017 at 5:47 PM, Peter Geoghegan  wrote:
>> In the end, commit 6bfa88a fixed that old recovery bug by making sure
>> the recovery routine heap_xlog_lock() did the right thing. In both
>> cases (Feb 2014 and today), the index wasn't really corrupt -- it just
>> pointed to the root of a HOT chain when it should point to some child
>> tuple (or maybe a successor HOT chain).
>
> Unless I'm very confused, it's really not OK to point at a child tuple
> rather than the root of the HOT chain.

Please see my later clarification.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple

2017-09-28 Thread Peter Geoghegan
On Thu, Sep 28, 2017 at 2:47 PM, Peter Geoghegan  wrote:
> In the end, commit 6bfa88a fixed that old recovery bug by making sure
> the recovery routine heap_xlog_lock() did the right thing. In both
> cases (Feb 2014 and today), the index wasn't really corrupt -- it just
> pointed to the root of a HOT chain when it should point to some child
> tuple (or maybe a successor HOT chain).

Just to be clear, obviously I don't mean that the index should have
been magically updated to compensate for that 2014 heap_lock_tuple()
recovery bug (say, via an in-place IndexTuple update during recovery).
Certainly, the index AM was entitled to assume that the heap TID it
pointed to would continue to be the root of the same HOT chain, at
least until the next VACUUM. That was what I meant by "the index
wasn't actually corrupt".

All I meant here is that the index scan and sequential scan would have
been in agreement if something like that *had* happened artificially
(say, when the index tuple was edited with a hex editor). And, that
the actual high level problem was that we failed to treat a
HOT-updated tuple as a HOT-updated tuple, leading to consequences that
were mostly noticeable during index scans. Probably on both occasions
(back in 2014, and now).

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple

2017-09-28 Thread Peter Geoghegan
On Thu, Sep 28, 2017 at 2:39 PM, Alvaro Herrera  wrote:
>> FWIW, I am reminded a little bit of the MultiXact/recovery bug I
>> reported way back in February of 2014 [1], which also had a HOT
>> interaction that caused index scans to give wrong answers, despite
>> more or less structurally sound indexes.
>>
>> [1] 
>> https://www.postgresql.org/message-id/CAM3SWZTMQiCi5PV5OWHb+bYkUcnCk=o67w0csswpvv7xfuc...@mail.gmail.com
>
> Thanks for the reference.  I didn't remember this problem and it's not
> (wasn't) in my list of things to look into.  Perhaps these are both the
> same bug.

I was reminded of that old bug because initially, at the time, it
looked very much like a corrupt index: sequential scans were fine, but
index scans gave wrong answers. This is what I saw today.

In the end, commit 6bfa88a fixed that old recovery bug by making sure
the recovery routine heap_xlog_lock() did the right thing. In both
cases (Feb 2014 and today), the index wasn't really corrupt -- it just
pointed to the root of a HOT chain when it should point to some child
tuple (or maybe a successor HOT chain).

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple

2017-09-28 Thread Peter Geoghegan
On Thu, Sep 28, 2017 at 1:20 PM, Alvaro Herrera  wrote:
> Alvaro Herrera wrote:
>
>> Odd that it's not fixed.  I guess there's still some more work to do
>> here ...
>
> Maybe what this means is that we need to do both Dan's initially
> proposed patch (or something related to it) apart from the fixes already
> pushed.  IOW we need to put back some of the "tupkeep" business ...

We certainly do still see wrong answers to queries here:

postgres=# select ctid, xmin, xmax, * from t;
 ctid  | xmin  | xmax | id | name | x
-------+-------+------+----+------+---
 (0,1) | 21171 |    0 |  1 | 111  | 0
 (0,7) | 21177 |    0 |  3 | 333  | 5
(2 rows)

postgres=# select * from t where id = 3;
 id | name | x
----+------+---
  3 | 333  | 5
(1 row)

postgres=# set enable_seqscan = off;
SET
postgres=# select * from t where id = 3;
 id | name | x
----+------+---
(0 rows)

FWIW, I am reminded a little bit of the MultiXact/recovery bug I
reported way back in February of 2014 [1], which also had a HOT
interaction that caused index scans to give wrong answers, despite
more or less structurally sound indexes.

[1] 
https://www.postgresql.org/message-id/CAM3SWZTMQiCi5PV5OWHb+bYkUcnCk=o67w0csswpvv7xfuc...@mail.gmail.com

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] A design for amcheck heapam verification

2017-09-27 Thread Peter Geoghegan
add_element(), when they have one close at hand anyway -- much
like addHyperLogLog(). Again, that seems like work for the ultimate
consumer of that functionality. It's a trivial tweak that can happen
later.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] The case for removing replacement selection sort

2017-09-27 Thread Peter Geoghegan
On Wed, Sep 27, 2017 at 12:06 AM, Simon Riggs  wrote:
>> I think we should remove the replacement_sort_tuples GUC, and kill
>> replacement selection entirely. There is no need to do this for
>> Postgres 10. I don't feel very strongly about it. It just doesn't make
>> sense to continue to support replacement selection.
>
> Forgive me if I missed the explanation, but how will we handle bounded sorts?

No change there at all. If anything they will probably be slightly
faster, because the heap management routines no longer have to work
with a special heap comparator that also compares run number (which was
only needed while replacement selection was in use).
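
To spell out where the small win comes from, compare the shape of the heap
comparison with and without the run number (illustrative only, not actual
tuplesort.c code; compare_tuples() is a stand-in for the real comparator):

typedef struct SortTuple
{
    int   tupindex;     /* run number, only meaningful for replacement selection */
    void *tuple;
} SortTuple;

extern int compare_tuples(const void *a, const void *b);   /* stand-in */

/* With replacement selection, every heap sift pays for this extra branch: */
static int
heap_compare_with_runno(const SortTuple *a, const SortTuple *b)
{
    if (a->tupindex != b->tupindex)
        return (a->tupindex < b->tupindex) ? -1 : 1;
    return compare_tuples(a->tuple, b->tuple);
}

/* Without it, bounded (top-N) sorts compare tuples directly: */
static int
heap_compare_plain(const SortTuple *a, const SortTuple *b)
{
    return compare_tuples(a->tuple, b->tuple);
}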

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

