Re: [bug fix] Cascaded standby cannot start after a clean shutdown

2018-02-21 Thread Michael Paquier
On Mon, Feb 19, 2018 at 03:01:15AM +, Tsunakawa, Takayuki wrote:
> From: Michael Paquier [mailto:mich...@paquier.xyz]

Sorry for my late reply.  I was looking at this problem for the last
couple of days here and there, still thinking about it.

>> It seems to me that the consolidation of the page read should happen directly
>> in xlogreader.c and not even in one of its callbacks so as no garbage data
>> is presented back to the caller using its own XLogReader.
>> I think that you need to initialize XLogReaderState->readBuf directly in
>> ReadPageInternal() before reading a page and you should be good.  With your
>> patch you get visibly only one portion of things addressed, what of other
>> code paths using xlogreader.c's APIs like pg_rewind, 2PC code and such?
> 
> ReadPageInternal() doesn't know where the end of valid WAL is, so it
> cannot determine where to do memset(0).  For example, in non-streaming
> cases, it reads the entire WAL block into readbuf, including the
> garbage.

Why couldn't it know about it?  It would be perfectly fine to feed it
the end LSN position as well, and it is an internal API of xlogreader.c.
Note that both of its callers, XLogFindNextRecord and XLogReadRecord,
know about that as well.
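
Roughly speaking, something like this inside ReadPageInternal(), assuming
the valid end position is passed down (purely illustrative; "validEnd" is
an invented name):

	/* zero out anything past the known end of valid WAL on this page */
	if (validEnd >= targetPagePtr &&
		validEnd < targetPagePtr + XLOG_BLCKSZ)
	{
		Size		validBytes = validEnd - targetPagePtr;

		memset(state->readBuf + validBytes, 0, XLOG_BLCKSZ - validBytes);
	}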

>   /*
>* If the current segment is being streamed from master, calculate how
>* much of the current page we have received already. We know the
>* requested record has been received, but this is for the benefit of
>* future calls, to allow quick exit at the top of this function.
>*/
>   if (readSource == XLOG_FROM_STREAM)
>   {
>   if (((targetPagePtr) / XLOG_BLCKSZ) != (receivedUpto / XLOG_BLCKSZ))
>   readLen = XLOG_BLCKSZ;
>   else
>   readLen = receivedUpto % XLogSegSize - targetPageOff;
>   }
>   else
>   readLen = XLOG_BLCKSZ;
> 
> So we have to zero-fill the empty space of a WAL block before writing
> it.  Currently, the master does that in AdvanceXLInsertBuffer(),
> StartupXLOG(), XLogFileCopy().  The cascading standby receives WAL
> from the master block by block, so it doesn't suffer from the garbage.
> pg_receivexlog zero-fills a new WAL file.

I cannot completely parse this statement.  Physical replication sends up
to 16 WAL pages per message, rounded down to the last page completed.
Even a cascading standby sends WAL following this protocol, using the
last flush position as a base for the maximum amount of data sent.

>> +if (zbuffer == NULL)
>> +zbuffer = palloc0(XLOG_BLCKSZ);
>> You could just use a static buffer which is MemSet'd with 0 only if 
>> necessary.
> 
> I wanted to allocate memory only when necessary.  Currently, only the
> cascaded standby needs this.  To that end, I moved the memory
> allocation code to the right place.  Thanks for letting me notice this.

Still, it seems to me that the problem is not specific to cascading
standbys.  As [1] notes as well, similar problems have been reported on a
standby after restarting its primary.

I am definitely ready to buy that it is possible to read garbage into
the length field, which can cause allocate_recordbuf to fail, as that's
the only code path in xlogreader.c which does such an allocation.
Still, it seems to me that we should first try to see whether strange
allocation patterns actually happen, and whether we can get a
reproducible test case or a pattern which gives us confidence that we
are on the right track.  One idea I have is to monitor those
allocations like the following:
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -162,6 +162,10 @@ allocate_recordbuf(XLogReaderState *state, uint32 reclength)
newSize += XLOG_BLCKSZ - (newSize % XLOG_BLCKSZ);
newSize = Max(newSize, 5 * Max(BLCKSZ, XLOG_BLCKSZ));

+#ifndef FRONTEND
+   elog(LOG, "Allocation for xlogreader increased to %u", newSize);
+#endif

This could be upgraded to a PANIC or such if it sees a larger
allocation; as pgbench generates records with known lengths, that would
be a good fit to see if there is some garbage being read.  No need to go
up to 1GB before seeing a failure.
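
For instance, something like this (a rough sketch; the threshold is
arbitrary and only for illustration):

#ifndef FRONTEND
	/* the 64MB cap is arbitrary and only for illustration */
	if (newSize > (uint32) (64 * 1024 * 1024))
		elog(PANIC, "suspicious xlogreader allocation: %u bytes", newSize);
	else
		elog(LOG, "Allocation for xlogreader increased to %u", newSize);
#endif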

[1]:
https://www.postgresql.org/message-id/CAE2gYzzVZNsGn%3D-E6grO4sVQs04J02zNKQofQEO8gu8%3DqCFR%2BQ%40mail.gmail.com
--
Michael




Re: PATCH: pgbench - break out timing data for initialization phases

2018-02-21 Thread Fabien COELHO


Hello Doug,

> Doing the "in progress" way suffers from everything before 'generating
> data' possibly scrolling off the screen/window.


Yeah, that is a point.

I tend to "| less" when I want to see a long output in detail, so it is
not an issue for me.


Also, I like to have the information when it is available, and not have
to wait for the end.


Finally, note that pre-data-loading operations (create table, ...) are
expected to be quite fast, thus not that significant.  Longer operations
occur after a large data set is loaded, thus after the large loading
progress output which scrolls the screen.


> For me, it is much handier to look at one set of duration times all
> reported together after all of the initialize phases are done.


Then for this approach the duration data must be stored and printed at
the end.
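
Something along these lines, as a rough sketch (names invented, not
actual pgbench code):

#include <stdio.h>

/* one entry per completed initialization phase */
typedef struct PhaseTiming
{
	const char *phase;			/* e.g. "drop tables", "create tables" */
	double		duration;		/* seconds */
} PhaseTiming;

static PhaseTiming timings[16];
static int	n_timings = 0;

static void
recordPhase(const char *phase, double start, double end)
{
	timings[n_timings].phase = phase;
	timings[n_timings].duration = end - start;
	n_timings++;
}

static void
printPhaseTimings(double total)
{
	int			i;

	printf("total time: %.2f s (", total);
	for (i = 0; i < n_timings; i++)
		printf("%s%s %.2f s", i > 0 ? ", " : "",
			   timings[i].phase, timings[i].duration);
	printf(")\n");
}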


--
Fabien.



Re: SHA-2 functions

2018-02-21 Thread Michael Paquier
On Wed, Feb 21, 2018 at 03:45:17PM -0500, Peter Eisentraut wrote:
> On 2/20/18 23:04, Michael Paquier wrote:
>> I think that crypto_hash.c or hash_crypt.c would be adapted as well.
>> crypt.c is too much generic, so including both concepts in the name is
>> the way to go.  The name given by Tom here sounds actually nice.
> 
> Updated patches

I have been reviewing both patches, and those look good to me.

git diff --check has one complaint:
src/backend/utils/adt/cryptohashes.c:170: new blank line at EOF.
--
Michael




Re: ALTER TABLE ADD COLUMN fast default

2018-02-21 Thread Andrew Dunstan
On Wed, Feb 21, 2018 at 7:48 PM, Andres Freund  wrote:
> Hi,
>

[ Long and useful review]

> This doesn't seem ready yet.
>


Thanks.

I'm working through the issues you raised. Will reply in a few days.

cheers

andrew

-- 
Andrew Dunstan   https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Constifying numeric.c's local vars

2018-02-21 Thread Tom Lane
Mark Dilger  writes:
>> This means that the const variable 'const_zero' contains a pointer that is
>> non-const, pointing at something that is static const, stored in read only
>> memory.  Yikes.

> I still believe this is unsafe.

I'm with Andres: I don't see the problem.  It's true that we've cast
away a chance for the compiler to notice a problem, but that's not the
only defense against problems.  If we did try to modify const_zero,
what should happen now is that we get a SIGSEGV from trying to scribble
on read-only memory.  But that's actually a step forward from before.
Before, we'd have successfully modified the value of const_zero and
thereby silently bollixed subsequent computations using it.  Since,
in fact, the code should never try to modify const_zero, the SIGSEGV
should never happen.  So effectively we have a hardware-enforced Assert
that we don't modify it, and that seems good.
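
As a standalone illustration of the situation being discussed (a sketch,
not the actual numeric.c declarations):

static const int zero_digits[1] = {0};

typedef struct
{
	int			ndigits;
	int		   *digits;			/* deliberately not const-qualified */
} FakeNumericVar;

/* a const variable holding a non-const pointer into read-only storage */
static const FakeNumericVar const_zero = {0, (int *) zero_digits};

/*
 * const_zero.digits[0] = 1;  -- the compiler accepts this (the pointer's
 * target type is plain int), but at run time it writes into read-only
 * memory and normally dies with SIGSEGV instead of silently corrupting
 * the "constant".
 */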

As far as compiler-detectable mistakes go, Andres' changes to declare
various function inputs as const seem like a pretty considerable
improvement too, even if they aren't offering 100% coverage.

regards, tom lane



Re: [HACKERS] Partition-wise aggregation/grouping

2018-02-21 Thread Robert Haas
On Thu, Feb 8, 2018 at 8:05 AM, Jeevan Chalke
 wrote:
> 0003 - 0006 are refactoring patches as before.

I have committed 0006 with some modifications.  In particular, [1] I
revised the comments and formatting; [2] I made cost_merge_append()
add cpu_tuple_cost * APPEND_CPU_COST_MULTIPLIER in lieu of, rather
than in addition to, cpu_operator_cost; and [3] I modified the
regression test so that the overall plan shape didn't change.

[2] was proposed upthread, but not adopted.  I had the same thought
while reading the patch (having forgotten the previous discussion) and
that seemed like a good enough reason to do it according to the
previous proposal.  If there is a good reason to think MergeAppend
needs that extra cost increment to be fairly-costed, I don't see it on
this thread.

[3] was also remarked upon upthread -- Ashutosh mentioned that the
change in plan shape was "sad" but there was no further discussion of
the matter.  I also found it sad; hence the change.  This is, by the
way, an interesting illustration of how partition-wise join could
conceivably lose.  Up until now I've thought that it seemed to be a
slam dunk to always win or at least break even, but if you've got a
relatively unselective join, such that the output is much larger than
either input, then doing the join partition-wise means putting all of
the output rows through an Append node, whereas doing it the normal
way means putting only the input rows through Append nodes.  If the
smaller number of rows being joined at one time doesn't help -- e.g.
all of the inner rows across all partitions fit in a tiny little hash
table -- then we're just feeding more rows through the Append for no
gain.  Not a common case, perhaps, but not impossible.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [bug fix] Produce a crash dump before main() on Windows

2018-02-21 Thread Craig Ringer
On 20 February 2018 at 23:43, Magnus Hagander  wrote:

>
> I've seen a number of usecases where apps start it alongside the app
> instead of as a service. I'm not sure how recent those apps are though, and
> I'm not sure it's better than using a service in the first place (but it
> does let you install things without being an admin).
>
> We really shouldn't *break* that scenario for people. But making it work
> well for the service usecase should definitely be the priority.
> 
>

Good point and agreed.

The patch proposed here means that early crashes will invoke WER. If we're
going to allow WER we should probably just do so unconditionally.

I suggest changing this to a command line flag or environment variable test
that suppresses Pg's default disabling of WER. A GUC probably doesn't make
sense; it's too niche, and too early.
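
In other words, something along these lines in the early startup path
(only a sketch; the environment variable name is invented):

#ifdef WIN32
	/*
	 * Keep Windows Error Reporting enabled only if explicitly requested;
	 * PG_ALLOW_WER is an invented name for illustration.
	 */
	if (getenv("PG_ALLOW_WER") == NULL)
		SetErrorMode(SEM_FAILCRITICALERRORS | SEM_NOGPFAULTERRORBOX);
#endif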

I'd be in favour of leaving WER on when we find out we're in a
noninteractive service too, but that'd be a separate patch for pg11+ only.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Add support for tuple routing to foreign partitions

2018-02-21 Thread Amit Langote
Fujita-san,

On 2018/02/21 20:54, Etsuro Fujita wrote:
> (2018/02/02 19:33), Etsuro Fujita wrote:
>> (2018/01/25 23:33), Stephen Frost wrote:
>>> I'm afraid a good bit of this patch is now failing to apply. I don't
>>> have much else to say except to echo the performance concern that Amit
>>> Langote raised about expanding the inheritence tree twice.
>>
>> To address that concern, I'm thinking to redesign the patch so that it
>> wouldn't expand the tree at planning time anymore. I don't have any
>> clear solution for that yet, but what I have in mind now is to add new
>> FDW APIs to the executor, instead, so that the FDW could 1) create stuff
>> such as a query for remote INSERT as PlanForeignModify and 2)
>> initialize/end the remote INSERT operation as BeginForeignModify and
>> EndForeignModify, somewhere in the executor.
> 
> New FDW APIs I would like to propose for that are:

Thanks for updating the proposal.

> void
> BeginForeignRouting(ModifyTableState *mtstate,
>     ResultRelInfo *resultRelInfo,
>     int partition_index);
>
> Prepare for a tuple-routing operation on a foreign table.  This is called
> from ExecSetupPartitionTupleRouting and ExecInitPartitionInfo.

I wonder why partition_index needs to be made part of this API?

> TupleTableSlot *
> ExecForeignRouting(EState *estate,
>    ResultRelInfo *resultRelInfo,
>    TupleTableSlot *slot);
> 
> Route one tuple to the foreign table.  This is called from ExecInsert.
> 
> void
> EndForeignRouting(EState *estate,
>   ResultRelInfo *resultRelInfo);
> 
> End the operation and release resources.  This is called from
> ExecCleanupTupleRouting.
> 
> Attached are WIP patches for that:
> 
> Patch postgres-fdw-refactoring-WIP.patch: refactoring patch for
> postgres_fdw.c to reduce duplicate code.
> 
> Patch foreign-routing-fdwapi-WIP.patch: main patch to add new FDW APIs,
> which is created on top of patch postgres-fdw-refactoring-WIP.patch and
> the lazy-initialization-of-partition-info patch [1].

Noticed a typo in the patch (s/parition/partition/g):

+* Also let the FDW init itself if this parition is foreign.

+* Also let the FDW init itself if this parition is foreign.

> By this change we don't need to expand the inheritance tree at planning
> time, so no need to worry about the performance concern.  Maybe I'm
> missing something, though.  Early feedback would be greatly appreciated.

Perhaps an independent concern, but one thing I noticed is that it does
not seem to play well with the direct modification (update push-down)
feature.  Now because updates (at least local, let's say) support
re-routing, I thought we'd be able to move rows across servers via the local
server, but with direct modification we'd never get the chance.  However,
since update tuple routing is triggered by partition constraint failure,
which we don't enforce for foreign tables/partitions anyway, I'm not sure
if we need to do anything about that, and even if we did, whether it
concerns this proposal.

That said, I saw in the changes to ExecSetupPartitionTupleRouting() that
BeginForeignRouting() is called for a foreign partition even if direct
modification might already have been set up.  If direct modification is
set up, then ExecForeignRouting() will never get called, because we'd
never call ExecUpdate() or ExecInsert().

Thanks,
Amit




Re: Hash Joins vs. Bloom Filters / take 2

2018-02-21 Thread Tomas Vondra

On 02/21/2018 08:17 AM, Thomas Munro wrote:
> On Wed, Feb 21, 2018 at 10:23 AM, Tomas Vondra
>  wrote:
>> In 2015/2016 I've been exploring if we could improve hash joins by
>> leveraging bloom filters [1], and I was reminded about this idea in a
>> thread about amcheck [2]. I also see that bloom filters were briefly
>> mentioned in the thread about parallel hash [3].
>>
>> So I've decided to revive the old patch, rebase it to current
>> master, and see if we can resolve the issues that killed it in
>> 2016.
> 
> Nice!
> 
>> Opinions?
> 
> I'm definitely following this and interested in helping in some way 
> if I can. I have wondered about this subject and discussed it a bit 
> with Peter Geoghegan off-list.
> 

Good ;-)

I think one important thing we need to figure out is the costing, or
some other way that would allow us to decide when to build the Bloom
filters (and perhaps whether to prefer a larger and more accurate one,
or a smaller one).

But if you want to look into adding support for parallel hash, or
pushing the bloom filter down to the scans, feel free to do so.


> Some assorted thoughts:
> 
> In the old thread, Peter pointed at a curious undergrad student
> project from 2008[1] evaluating Bloom filters for hash joins in
> PostgreSQL 8.3, inspired by a couple of older papers[2][3].  While
> your patch uses a Bloom filter to short-circuit the regular bucket
> probe in ExecHashJoinImpl(), these approaches push the Bloom filter down
> into the outer relation scan.  I suspect you're right about the fixed
> sizing being a problem, but the general idea seems pretty interesting
> to me and there seems to be no reason you couldn't make the filter
> size dynamic as you have it and then share it via a parameter or
> something.  But is there any point?
> 
> On the one hand, pushing down Bloom filters requires the hash value 
> to be computed by the lower scan, and then computed again if the 
> tuple survives the filter and makes it into the Hash Join node 
> (unless there is some way to attach it to the tuple...). On the
> other hand, throwing away tuples sooner can avoid more work,
> particularly in the case of multi-joins.
> 

I do agree it's an interesting idea, and being able to push the filter
down would be great, particularly in case of very selective joins (i.e.
when many outer rows have no match in the hash table). I have no idea
how much infrastructure it would require, though, or how widely it could
be used.

Judging by your thoughts on impact of left-deep vs. right-deep joins
etc. you've already given this far more thought than I did ;-)


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Hash Joins vs. Bloom Filters / take 2

2018-02-21 Thread Tomas Vondra
On 02/21/2018 02:10 AM, Peter Geoghegan wrote:
> On Tue, Feb 20, 2018 at 3:54 PM, Tomas Vondra
>  wrote:
>>> I suspect that it could make sense to use a Bloom filter to 
>>> summarize the entire inner side of the join all at once, even
>>> when there are multiple batches. I also suspect that this is
>>> particularly beneficial with parallel hash joins, where
>>> IPC/synchronization overhead can be a big problem.
>>>
>>
>> But that's what the patch does, currently - the filter is built
>> during the initial pass through the data, and then used for all
>> batches.
> 
> I misunderstood. I would probably do something like double or triple 
> the original rows estimate instead, though. The estimate must be at 
> least slightly inaccurate when we get to this point, but I don't
> think that that's a good enough reason to give up on the estimate 
> completely.
> 

That's a problem only for the multi-batch case, though.

With a single batch we can walk the hash table and count non-empty
buckets, to get a good ndistinct estimate cheaply. And then size the
filter considering both memory requirements (fits into CPU cache) and
false positive rate. There are other things we may need to consider
(memory usage vs. work_mem) but that's a separate issue.
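
For reference, the textbook sizing math looks roughly like this (a
sketch, not what the patch actually does):

#include <math.h>

/* bits needed to hold ndistinct keys at the target false-positive rate */
static double
bloom_filter_bits(double ndistinct, double fpr)
{
	return -ndistinct * log(fpr) / (log(2.0) * log(2.0));
}

/* corresponding optimal number of hash functions */
static int
bloom_filter_nhashes(double nbits, double ndistinct)
{
	return (int) rint(nbits / ndistinct * log(2.0));
}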

With multiple batches I think we could use the "size the bloom filter
for a fraction of work_mem" which the current patch uses when switching
to multiple batches halfway-through. That pretty much entirely ignores
the estimate and essentially replaces it with a "fictional" estimate.

I think that's a better approach than using some arbitrary multiple of
the estimate. When we have to start batching halfway through, the
estimate is proven to be rather bogus anyway, but we may treat it as a
lower boundary for the bloom filter size.

>> Actually, now that I think about it - I think the patch should
>> throw away the filter away after the initial pass over the outer
>> relation, because at that point we've used all the information in
>> the filter.
> 
> Makes sense.
> 

Actually, the patch already does that - it stops using the filter if
(curbatch != 0). We don't throw it away, though, because it also
includes some additional instrumentation that is shown by explain analyze.

>> I'm not sure it would make sense to then build smaller bloom
>> filters for individual batches, but maybe it would?
> 
> I doubt it.
> 

I think it might help if the global bloom filter ended up having high
false positive rate. But only if the per-batch filters fit into CPU
cache (i.e. it's the same reasoning as for single-batch case).

But those "per-batch" filters are rather incompatible with pushing the
filter to scan nodes, I think.

>> Yeah, I admit those are rather crude rules.
> 
> You have to start somewhere.
> 
>> The trouble is that when we start with a single batch and then find
>> out the estimate was wrong and we need to start batching, all bets
>> are off. At that point it seems reasonable to just say "Here is X
>> MBs of RAM, do what you can".
> 
> As I said above, I wouldn't say all bets are off when this happens
> -- not at all. Estimates are likely to often be somewhat wrong. If 
> they're completely wrong, we can probably swallow the cost of giving 
> up on a Bloom filter relatively late.
> 
> As I said, X should not be a portion of work_mem, because that has 
> only a weak relationship to what really matters.
> 

I agree a fixed fraction of work_mem may not be the right thing, but the
goal was to make the bloom filter part of the Hash memory budget, i.e.

bloom filter + hash table <= work_mem

(which I think we agree should be the case), without increasing the
number of batches too much. For example, if you size the filter ignoring
this, and it ends up being 90% of work_mem, you may need to do the hash
join in 128 batches instead of just 16. Or something like that.

Maybe that would still be a win, though. Firstly, the higher number of
batches may not have a huge impact - in one case we need to serialize
15/16 and in the other one 127/128. That's 93% vs. 99%. And if the more
accurate filter allows us to discard much more data from the outer
relation ...


>>> You should try to exploit the fact that a Bloom filter can summarize a
>>> large set reasonably well with a very compact, simple representation.
>>> A false positive rate of 10% sounds a lot worse than 1% or 0.1%, but
>>> for cases where Bloom probes will save a lot of work, it probably
>>> doesn't make all that much difference -- the hash join is still much
>>> faster.
> 
>> But the problem is that I don't know what is the total size of the
>> hash table, because we're building the bloom filter for all the
>> batches at once. And we don't know how many batches will be there -
>> if we knew that, we could estimate the number of distinct values
>> and we could use that to size the filter instead of doing this.
>> (All of this only applies to the state where we start with a single
>> batch and then 

Re: Duplicate Item Pointers in Gin index

2018-02-21 Thread Masahiko Sawada
On Thu, Feb 22, 2018 at 8:28 AM, Peter Geoghegan  wrote:
> On Wed, Feb 21, 2018 at 3:02 PM, R, Siva  wrote:
>> Did you mean pin on the metapage buffer during ginInsertCleanup and not lock
>> during addition of tuples to the accumulator? The exclusive lock on metapage
>> buffer is released after reading/locking head of pending list and before we
>> process pages/add tuples to the accumulator in ginInsertCleanup [1].
>
> AFAICT, nobody ever holds just a pin on the metapage as some kind of
> interlock (since nobody else ever acquires a "super exclusive lock" on
> the metapage -- if anybody else ever did that, then simply holding a
> pin might make sense as a way of blocking the "super exclusive" lock
> acquisition). Maybe you're thinking of the root page of posting trees?
>
> I think that Sawada-san simply means that holding an ExclusiveLock on
> the metapage makes writers block each other, and concurrent VACUUMs.
> At least, for as long as they're in ginInsertCleanup().

Yes, but I realized my previous mail was wrong, sorry.  Insertion into
the pending list doesn't acquire an ExclusiveLock on the metapage, so we
can insert tuples into the pending list while cleanup is running.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: file cloning in pg_upgrade and CREATE DATABASE

2018-02-21 Thread Tomas Vondra
On 02/21/2018 04:00 AM, Peter Eisentraut wrote:
> ...
> 
> Some example measurements:
> 
> 6 GB database, pg_upgrade unpatched 30 seconds, patched 3 seconds (XFS
> and APFS)
> 
> similar for a CREATE DATABASE from a large template
> 

Nice improvement, of course. How does that affect performance on the
cloned database? If I understand this correctly, it essentially enables
CoW on the files, so what's the overhead on that? It'd be unfortunate to
speed up CREATE DATABASE only to get degraded performance later.

In any case, I find this interesting mainly for the pg_upgrade use case.
On running systems I think the main issue with CREATE DATABASE is that it
forces a checkpoint.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Duplicate Item Pointers in Gin index

2018-02-21 Thread Peter Geoghegan
On Wed, Feb 21, 2018 at 3:02 PM, R, Siva  wrote:
> Did you mean pin on the metapage buffer during ginInsertCleanup and not lock
> during addition of tuples to the accumulator? The exclusive lock on metapage
> buffer is released after reading/locking head of pending list and before we
> process pages/add tuples to the accumulator in ginInsertCleanup [1].

AFAICT, nobody ever holds just a pin on the metapage as some kind of
interlock (since nobody else ever acquires a "super exclusive lock" on
the metapage -- if anybody else ever did that, then simply holding a
pin might make sense as a way of blocking the "super exclusive" lock
acquisition). Maybe you're thinking of the root page of posting trees?

I think that Sawada-san simply means that holding an ExclusiveLock on
the metapage makes writers block each other, and concurrent VACUUMs.
At least, for as long as they're in ginInsertCleanup().

BTW, are you aware of this basic fact about GIN?:

/*
 * First, scan the pending list and collect any matching entries into the
 * bitmap.  After we scan a pending item, some other backend could post it
 * into the main index, and so we might visit it a second time during the
 * main scan.  This is okay because we'll just re-set the same bit in the
 * bitmap.  (The possibility of duplicate visits is a major reason why GIN
 * can't support the amgettuple API, however.) Note that it would not do
 * to scan the main index before the pending list, since concurrent
 * cleanup could then make us miss entries entirely.
 */

(This comment is from gingetbitmap().)

> The metapage (not metapage buffer) is locked using LockPage upon entry
> in ginInsertCleanup and unlocked after processing of pending list is complete.
> Does that prevent concurrent insertions into the pending list?

It's supposed to be. I suggested that there might be a similar though
distinct problem there, right before discussion trailed off [1] --
"The metapage HW lock is sufficient for writers to block writers, but
AFAICT it is not sufficient for writers to block readers."

Is that related to what you're asking about?

[1] 
https://www.postgresql.org/message-id/CAH2-Wz%3DGTnAPzEEZqYELOv3h1Fxpo5xhMrBP6aMGEKLKv95csQ%40mail.gmail.com
-- 
Peter Geoghegan



Re: Two small patches for the isolationtester lexer

2018-02-21 Thread Tom Lane
I wrote:
> Daniel Gustafsson  writes:
>> I also (again) forgot about the # comments not being allowed inside setup and
>> teardown blocks, so patch 0002 proposes adding support for these as the
>> documentation implies.

> Hmm, not sure this is a good idea.  # is a valid SQL operator name, so
> doing this would create some risk of breaking legal queries.

Actually, looking closer, this would also trigger on '#' used inside a
SQL literal, which seems to move the problem cases into the "pretty
likely" category instead of the "far-fetched" one.  So I'd only be OK
with it if we made the lexer smart enough to distinguish inside-a-SQL-
literal from not.  That might be a good thing anyway, since it would
allow us to not choke on literals containing '}', but it'd be a great
deal more work.  You might be able to steal code from the psql lexer
though.

regards, tom lane



ERROR: left and right pathkeys do not match in mergejoin

2018-02-21 Thread Alexander Kuzmenkov

Hi hackers,

I found a bug related to the planning of merge joins. The steps to 
reproduce are as follows:

```
CREATE TABLE J1_TBL (
    i integer,
    j integer,
    t text
);
CREATE TABLE J2_TBL (
    i integer,
    k integer
);
set enable_hashjoin to off;
explain select * from j1_tbl full join (select * from j2_tbl order by 
j2_tbl.i desc, j2_tbl.k) j2_tbl on j1_tbl.i = j2_tbl.i and j1_tbl.i = 
j2_tbl.k;


ERROR:  left and right pathkeys do not match in mergejoin
```

It can be reproduced on the latest 9.6, 10 and devel versions. 
Basically, j2_tbl is used as the outer relation in merge join. We try to 
use the existing order of j2_tbl, that is, 'i2 desc' and 'k asc'. As a 
result, there is no order for i1 that would allow it to be merge-joined 
to both i2 and k. The corresponding code path begins in 
`generate_mergejoin_pathkeys()`, when it is first called by 
`match_unsorted_outer()`. `find_mergeclauses_for_pathkeys()` selects 
both mergeclauses for the given outer pathkeys, and then 
`make_inner_pathkeys_for_merge()` generates inner pathkey for the first 
mergeclause. Inner pathkey for the second mergeclause is skipped as 
redundant. It looks suspicious already, because this new pathkey has 
different direction and is more like conflicting than redundant. 
Finally, `create_mergejoin_plan()` detects that the inner pathkey 
doesn't match the second mergeclause and throws the error.


I found this while working on the inequality merge join patch. I don't 
know yet how to fix this, so any help is welcome.


--
Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: Two small patches for the isolationtester lexer

2018-02-21 Thread Daniel Gustafsson
> On 21 Feb 2018, at 21:41, Tom Lane  wrote:
> 
> Daniel Gustafsson  writes:
>> When writing an isolation testcase recently I bumped into the 1024 line 
>> buffer
>> size limit in the lexer for my setup block.  Adding some stored procedures to
>> the test makes it quite easy to break 1024 characters, and while these could 
>> be
>> added as steps it, it’s not a good workaround since the permutation order
>> becomes trickier (and more set in stone).  As far as I can see in the 
>> history,
>> this limit is chosen as a decent sized buffer and not rooted in a specific
>> requirement, so I propose to bump it slightly to 2048 instead (an equally
>> arbitrarily chosen number).  Is there a reason to keep it at 1024 that I’m
>> missing?
> 
> I can't think of one; but I wonder if it's worth working a bit harder and
> removing the fixed limit altogether, probably by using a PQExpBuffer.
> If you've hit 1024 today, somebody will bump up against 2048 tomorrow.

The thought did cross my mind, but I opted for the simple hack first.  I can
take a stab at using a PQExpBuffer to see where that leads.
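
Roughly something like this, I imagine (just a sketch of the idea, not
what the lexer would actually end up looking like):

#include "pqexpbuffer.h"

static PQExpBuffer litbuf = NULL;

static void
addlitchar(char c)
{
	if (litbuf == NULL)
		litbuf = createPQExpBuffer();
	/* grows as needed, so no fixed-size line buffer to overflow */
	appendPQExpBufferChar(litbuf, c);
}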

>> I also (again) forgot about the # comments not being allowed inside setup and
>> teardown blocks, so patch 0002 proposes adding support for these as the
>> documentation implies.  Since SQL comments will be counted towards the line
>> buffer, and sent with the command, supporting both kinds of comments seems
>> reasonable and consistent.
> 
> Hmm, not sure this is a good idea.  # is a valid SQL operator name, so
> doing this would create some risk of breaking legal queries.  Admittedly,
> those operators are rare enough that maybe nobody would ever need them in
> isolationtester scripts, but I'm not sure that providing an additional
> way to spell "comment" is worth that.

Good point, didn’t think about that.

cheers ./daniel


Online enabling of checksums

2018-02-21 Thread Magnus Hagander
Once more, here is an attempt to solve the problem of on-line enabling of
checksums that Daniel and I have been hacking on for a bit. See for example
https://www.postgresql.org/message-id/CABUevEx8KWhZE_XkZQpzEkZypZmBp3GbM9W90JLp%3D-7OJWBbcg%40mail.gmail.com
and
https://www.postgresql.org/message-id/flat/FF393672-5608-46D6-9224-6620EC532693%40endpoint.com#ff393672-5608-46d6-9224-6620ec532...@endpoint.com
for some previous discussions.

Base design:

Change the checksum flag from on/off to an enum: off/inprogress/on. When
checksums are off or on, they work like today. When checksums are in
progress, checksums are *written* but not verified. State can go from
“off” to “inprogress”, from “inprogress” to either “on” or “off”, or from
“on” to “off”.

Two new functions are added, pg_enable_data_checksums() and
pg_disable_data_checksums(). The disable one is easy -- it just changes
the state to off. The enable one will change the state to inprogress, and
then start a background worker (the “checksumhelper launcher”). This
worker in turn will start one sub-worker (“checksumhelper worker”) in each
database (currently all done sequentially). This worker will enumerate all
tables/indexes/etc in the database and validate their checksums. If there
is no checksum, or the checksum is incorrect, it will compute a new
checksum and write it out. When all databases have been processed, the
checksum state changes to “on” and the launcher shuts down. At this point,
the cluster has checksums enabled as if it was initdb’d with checksums
turned on.

If the cluster shuts down while “inprogress”, the DBA will have to
manually either restart the worker (by calling pg_enable_checksums()) or
turn checksums off again. The “in progress” state only carries a cost and
no benefit.

The change of the checksum state is WAL logged with a new xlog record. All
the buffers written by the background worker have full page writes
forcibly enabled to make sure the checksum is fully updated on the standby
even if no actual contents of the buffer changed.

We’ve also included a small commandline tool, bin/pg_verify_checksums,
that can be run against an offline cluster to validate all checksums.
Future improvements include being able to use the background
worker/launcher to perform an online check as well. Being able to run more
parallel workers in the checksumhelper might also be of interest.

The patch includes two sets of tests: an isolation test turning on
checksums while one session is writing to the cluster and another is
continuously reading, to simulate turning on checksums in a production
database. There is also a TAP test which enables checksums with streaming
replication turned on to test the new xlog record. The isolation test ran
into the 1024 character limit of the isolation test lexer, with a separate
patch and discussion at
https://www.postgresql.org/message-id/8d628be4-6606-4ff6-a3ff-8b2b0e9b4...@yesql.se
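
The allowed state transitions boil down to something like this
(illustrative sketch only; the names in the actual patch may differ):

#include <stdbool.h>

typedef enum
{
	CHECKSUMS_OFF,
	CHECKSUMS_INPROGRESS,
	CHECKSUMS_ON
} ChecksumState;

static bool
checksum_transition_allowed(ChecksumState from, ChecksumState to)
{
	switch (from)
	{
		case CHECKSUMS_OFF:
			return to == CHECKSUMS_INPROGRESS;
		case CHECKSUMS_INPROGRESS:
			return to == CHECKSUMS_ON || to == CHECKSUMS_OFF;
		case CHECKSUMS_ON:
			return to == CHECKSUMS_OFF;
	}
	return false;
}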

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 4c998fe51f..dc05ac3e55 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -8537,7 +8537,8 @@ LOG:  CleanUpLock: deleting: lock(0xb7acd844) id(24688,24696,0,0,0,1)
 or hide corruption, or other serious problems.  However, it may allow
 you to get past the error and retrieve undamaged tuples that might still be
 present in the table if the block header is still sane. If the header is
-corrupt an error will be reported even if this option is enabled. The
+corrupt an error will be reported even if this option is enabled. This
+option can only enabled when data checksums are enabled. The
 default setting is off, and it can only be changed by a superuser.

   
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 1e535cf215..8000ce89df 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -19412,6 +19412,64 @@ postgres=# SELECT * FROM pg_walfile_name_offset(pg_stop_backup());
 
   
 
+  
+   Data Checksum Functions
+
+   
+The functions shown in  can
+be used to enable or disable data checksums in a running cluster.
+See  for details.
+   
+
+   
+Checksum SQL Functions
+
+ 
+  
+   Function
+   Return Type
+   Description
+  
+ 
+ 
+  
+   
+
+ pg_enable_data_checksums
+
+pg_enable_data_checksums()
+   
+   
+bool
+   
+   
+Initiates data checksums 

Re: SHA-2 functions

2018-02-21 Thread Peter Eisentraut
On 2/20/18 23:04, Michael Paquier wrote:
> I think that crypto_hash.c or hash_crypt.c would be adapted as well.
> crypt.c is too much generic, so including both concepts in the name is
> the way to go.  The name given by Tom here sounds actually nice.

Updated patches

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>From 2debe6cf7eec396e269a2c3d89ee56f5aea711e2 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Tue, 6 Feb 2018 21:46:46 -0500
Subject: [PATCH v2 1/2] Add user-callable SHA-2 functions

Add the user-callable functions sha224, sha256, sha384, sha512.  We
already had these in the C code to support SCRAM, but there was no test
coverage outside of the SCRAM tests.  Adding these as user-callable
functions allows writing some tests.  Also, we have a user-callable md5
function but no more modern alternative, which led to wide use of md5 as
a general-purpose hash function, which leads to occasional complaints
about using md5.

Also mark the existing md5 functions as leak-proof.
---
 doc/src/sgml/func.sgml   |  71 -
 src/backend/utils/adt/Makefile   |   3 +-
 src/backend/utils/adt/cryptohashes.c | 170 +++
 src/backend/utils/adt/varlena.c  |  48 -
 src/include/catalog/pg_proc.h|  12 ++-
 src/test/regress/expected/opr_sanity.out |   6 ++
 src/test/regress/expected/strings.out|  53 ++
 src/test/regress/sql/strings.sql |  18 
 8 files changed, 329 insertions(+), 52 deletions(-)
 create mode 100644 src/backend/utils/adt/cryptohashes.c

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 1e535cf215..2f59af25a6 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -3640,7 +3640,7 @@ Other Binary String Functions
returning the result in hexadecimal
   
   md5(E'Th\\000omas'::bytea)
-  8ab2d3c9689aaf18 b4958c334c82d8b1
+  
8ab2d3c9689aaf18b4958c334c82d8b1
  
 
   
@@ -3674,6 +3674,66 @@ Other Binary String Functions
set_byte(E'Th\\000omas'::bytea, 4, 64)
Th\000o@as
   
+
+  
+   
+
+ sha224
+
+sha224(bytea)
+   
+   bytea
+   
+SHA-224 hash
+   
+   sha224('abc')
+   
\x23097d223405d8228642a477bda255b32aadbce4bda0b3f7e36c9da7
+  
+
+  
+   
+
+ sha256
+
+sha256(bytea)
+   
+   bytea
+   
+SHA-256 hash
+   
+   sha256('abc')
+   
\xba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad
+  
+
+  
+   
+
+ sha384
+
+sha384(bytea)
+   
+   bytea
+   
+SHA-384 hash
+   
+   sha384('abc')
+   
\xcb00753f45a35e8bb5a03d699ac65007272c32ab0eded1631a8b605a43ff5bed8086072ba1e7cc2358baeca134c825a7
+  
+
+  
+   
+
+ sha512
+
+sha512(bytea)
+   
+   bytea
+   
+SHA-512 hash
+   
+   sha512('abc')
+   
\xddaf35a193617abacc417349ae20413112e6fa4e89a97ea20a964b55d39a2192992a274fc1a836ba3c23a3feebbd454d4423643ce80e2a9ac94fa54ca49f
+  
 

   
@@ -3686,6 +3746,15 @@ Other Binary String Functions
the first byte, and bit 15 is the most significant bit of the second byte.
   
 
+  
+   Note that for historic reasons, the function md5
+   returns a hex-encoded value of type text whereas the SHA-2
+   functions return type bytea.  Use the functions
+   encode and decode to convert
+   between the two, for example encode(sha256('abc'),
+   'hex') to get a hex-encoded text representation.
+  
+
   
See also the aggregate function string_agg in
 and the large object functions
diff --git a/src/backend/utils/adt/Makefile b/src/backend/utils/adt/Makefile
index 61ca90312f..4b35dbb8bb 100644
--- a/src/backend/utils/adt/Makefile
+++ b/src/backend/utils/adt/Makefile
@@ -11,7 +11,8 @@ include $(top_builddir)/src/Makefile.global
 # keep this list arranged alphabetically or it gets to be a mess
 OBJS = acl.o amutils.o arrayfuncs.o array_expanded.o array_selfuncs.o \
array_typanalyze.o array_userfuncs.o arrayutils.o ascii.o \
-   bool.o cash.o char.o date.o datetime.o datum.o dbsize.o domains.o \
+   bool.o cash.o char.o cryptohashes.o \
+   date.o datetime.o datum.o dbsize.o domains.o \
encode.o enum.o expandeddatum.o expandedrecord.o \
float.o format_type.o formatting.o genfile.o \
geo_ops.o geo_selfuncs.o geo_spgist.o inet_cidr_ntop.o inet_net_pton.o \
diff --git a/src/backend/utils/adt/cryptohashes.c 
b/src/backend/utils/adt/cryptohashes.c
new file mode 100644
index 00..1f242eba71
--- /dev/null
+++ b/src/backend/utils/adt/cryptohashes.c
@@ -0,0 +1,170 @@
+/*-
+ *
+ * cryptohashes.c
+ *   

Re: [doc fix] Correct calculation of vm.nr_hugepages

2018-02-21 Thread Robert Haas
On Mon, Feb 19, 2018 at 9:43 PM, Tsunakawa, Takayuki
 wrote:
> Thanks, I'd like to take this.

Why are these values so large?  The example in the documentation shows
6490428 kB, and in my test I got 8733888 kB.  But 8733888 kB = 8.3 TB!
 8.3 GB would make sense, but 8.3 TB does not.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Two small patches for the isolationtester lexer

2018-02-21 Thread Daniel Gustafsson
When writing an isolation testcase recently I bumped into the 1024 line buffer
size limit in the lexer for my setup block.  Adding some stored procedures to
the test makes it quite easy to break 1024 characters, and while these could be
added as steps, it’s not a good workaround since the permutation order
becomes trickier (and more set in stone).  As far as I can see in the history,
this limit is chosen as a decent sized buffer and not rooted in a specific
requirement, so I propose to bump it slightly to 2048 instead (an equally
arbitrarily chosen number).  Is there a reason to keep it at 1024 that I’m
missing?

I also (again) forgot about the # comments not being allowed inside setup and
teardown blocks, so patch 0002 proposes adding support for these as the
documentation implies.  Since SQL comments will be counted towards the line
buffer, and sent with the command, supporting both kinds of comments seems
reasonable and consistent.

make check passes with these patches applied, but I’m quite rusty in this area
so I might be missing something.

cheers ./daniel



0001-Increase-the-linebuf-in-the-isolation-spec-scanner.patch
Description: Binary data


0002-Allow-comments-in-the-isolation-spec-SQL-blocks.patch
Description: Binary data


Re: Add PGDLLIMPORT to enable_hashagg

2018-02-21 Thread Brian Cloutier
On Wed, Feb 21, 2018 at 10:14 AM, Andres Freund  wrote:

> Could you take the relevant commit, backport it to the
> relevant branches, resolve conflicts, make possibly appropriate
> adaptions, and post?
>

The original commit touched some new variables and therefore didn't apply
cleanly. Attached are equivalent patches for REL_10_STABLE and
REL9_6_STABLE.
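
For context, the sort of extension code that fails to link on Windows
without PGDLLIMPORT on the variable looks roughly like this (a minimal
illustration, not taken from any real extension):

#include "postgres.h"
#include "optimizer/cost.h"

/*
 * Reads a core planner GUC from extension code; without PGDLLIMPORT on
 * enable_hashagg this does not link on Windows.
 */
bool
my_extension_hashagg_enabled(void)
{
	return enable_hashagg;
}
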
From 3e2c0a444a0e07792408841a629d83797ff5883a Mon Sep 17 00:00:00 2001
From: Robert Haas 
Date: Fri, 9 Feb 2018 15:54:45 -0500
Subject: [PATCH] Mark assorted GUC variables as PGDLLIMPORT.

This makes life easier for extension authors.
---
 src/include/miscadmin.h   |  2 +-
 src/include/optimizer/cost.h  | 30 +++---
 src/include/optimizer/paths.h |  8 
 src/include/utils/guc.h   |  2 +-
 4 files changed, 21 insertions(+), 21 deletions(-)

diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 6eacd2a..e76b4b9 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -158,7 +158,7 @@ extern PGDLLIMPORT int NBuffers;
 extern PGDLLIMPORT int MaxBackends;
 extern PGDLLIMPORT int MaxConnections;
 extern PGDLLIMPORT int max_worker_processes;
-extern int	max_parallel_workers;
+extern PGDLLIMPORT int max_parallel_workers;
 
 extern PGDLLIMPORT int MyProcPid;
 extern PGDLLIMPORT pg_time_t MyStartTime;
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 63feba0..7925e4c 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -53,21 +53,21 @@ extern PGDLLIMPORT double cpu_operator_cost;
 extern PGDLLIMPORT double parallel_tuple_cost;
 extern PGDLLIMPORT double parallel_setup_cost;
 extern PGDLLIMPORT int effective_cache_size;
-extern Cost disable_cost;
-extern int	max_parallel_workers_per_gather;
-extern bool enable_seqscan;
-extern bool enable_indexscan;
-extern bool enable_indexonlyscan;
-extern bool enable_bitmapscan;
-extern bool enable_tidscan;
-extern bool enable_sort;
-extern bool enable_hashagg;
-extern bool enable_nestloop;
-extern bool enable_material;
-extern bool enable_mergejoin;
-extern bool enable_hashjoin;
-extern bool enable_gathermerge;
-extern int	constraint_exclusion;
+extern PGDLLIMPORT Cost disable_cost;
+extern PGDLLIMPORT int	max_parallel_workers_per_gather;
+extern PGDLLIMPORT bool enable_seqscan;
+extern PGDLLIMPORT bool enable_indexscan;
+extern PGDLLIMPORT bool enable_indexonlyscan;
+extern PGDLLIMPORT bool enable_bitmapscan;
+extern PGDLLIMPORT bool enable_tidscan;
+extern PGDLLIMPORT bool enable_sort;
+extern PGDLLIMPORT bool enable_hashagg;
+extern PGDLLIMPORT bool enable_nestloop;
+extern PGDLLIMPORT bool enable_material;
+extern PGDLLIMPORT bool enable_mergejoin;
+extern PGDLLIMPORT bool enable_hashjoin;
+extern PGDLLIMPORT bool enable_gathermerge;
+extern PGDLLIMPORT int	constraint_exclusion;
 
 extern double clamp_row_est(double nrows);
 extern double index_pages_fetched(double tuples_fetched, BlockNumber pages,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 4e06b2e..f22fe0a 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -20,10 +20,10 @@
 /*
  * allpaths.c
  */
-extern bool enable_geqo;
-extern int	geqo_threshold;
-extern int	min_parallel_table_scan_size;
-extern int	min_parallel_index_scan_size;
+extern PGDLLIMPORT bool enable_geqo;
+extern PGDLLIMPORT int	geqo_threshold;
+extern PGDLLIMPORT int	min_parallel_table_scan_size;
+extern PGDLLIMPORT int	min_parallel_index_scan_size;
 
 /* Hook for plugins to get control in set_rel_pathlist() */
 typedef void (*set_rel_pathlist_hook_type) (PlannerInfo *root,
diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
index b89e8e8..6bcc904 100644
--- a/src/include/utils/guc.h
+++ b/src/include/utils/guc.h
@@ -262,7 +262,7 @@ extern char *HbaFileName;
 extern char *IdentFileName;
 extern char *external_pid_file;
 
-extern char *application_name;
+extern PGDLLIMPORT char *application_name;
 
 extern int	tcp_keepalives_idle;
 extern int	tcp_keepalives_interval;
-- 
2.7.4

From 4d7e9c6dd32d68e2b9d87325f5dbbedbef6ce886 Mon Sep 17 00:00:00 2001
From: Robert Haas 
Date: Fri, 9 Feb 2018 15:54:45 -0500
Subject: [PATCH] Mark assorted GUC variables as PGDLLIMPORT.

This makes life easier for extension authors.
---
 src/include/optimizer/cost.h  | 28 ++--
 src/include/optimizer/paths.h |  6 +++---
 src/include/utils/guc.h   |  2 +-
 3 files changed, 18 insertions(+), 18 deletions(-)

diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 2a4df2f..214e868 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -53,20 +53,20 @@ extern PGDLLIMPORT double cpu_operator_cost;
 extern PGDLLIMPORT double parallel_tuple_cost;
 extern PGDLLIMPORT double parallel_setup_cost;
 extern PGDLLIMPORT int effective_cache_size;
-extern Cost disable_cost;
-extern int	

Re: Two small patches for the isolationtester lexer

2018-02-21 Thread Tom Lane
Daniel Gustafsson  writes:
> When writing an isolation testcase recently I bumped into the 1024 line buffer
> size limit in the lexer for my setup block.  Adding some stored procedures to
> the test makes it quite easy to break 1024 characters, and while these could 
> be
> added as steps it, it’s not a good workaround since the permutation order
> becomes trickier (and more set in stone).  As far as I can see in the history,
> this limit is chosen as a decent sized buffer and not rooted in a specific
> requirement, so I propose to bump it slightly to 2048 instead (an equally
> arbitrarily chosen number).  Is there a reason to keep it at 1024 that I’m
> missing?

I can't think of one; but I wonder if it's worth working a bit harder and
removing the fixed limit altogether, probably by using a PQExpBuffer.
If you've hit 1024 today, somebody will bump up against 2048 tomorrow.

> I also (again) forgot about the # comments not being allowed inside setup and
> teardown blocks, so patch 0002 proposes adding support for these as the
> documentation implies.  Since SQL comments will be counted towards the line
> buffer, and sent with the command, supporting both kinds of comments seems
> reasonable and consistent.

Hmm, not sure this is a good idea.  # is a valid SQL operator name, so
doing this would create some risk of breaking legal queries.  Admittedly,
those operators are rare enough that maybe nobody would ever need them in
isolationtester scripts, but I'm not sure that providing an additional
way to spell "comment" is worth that.

regards, tom lane



Re: pgsql: Avoid valgrind complaint about write() of uninitalized bytes.

2018-02-21 Thread Peter Geoghegan
On Tue, Feb 20, 2018 at 9:34 PM, Andres Freund  wrote:
> Doesn't appear to have fixed the problem entirely:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink=2018-02-20%2017%3A10%3A01
>
> relevant excerpt:
> ==12452== Syscall param write(buf) points to uninitialised byte(s)
> ==12452==at 0x4E49C64: write (write.c:26)
> ==12452==by 0x4BF8BF: FileWrite (fd.c:2017)
> ==12452==by 0x4C1B69: BufFileDumpBuffer (buffile.c:513)
> ==12452==by 0x4C1C61: BufFileFlush (buffile.c:657)
> ==12452==by 0x4C21D6: BufFileRead (buffile.c:561)
> ==12452==by 0x63ADA8: ltsReadBlock (logtape.c:274)
> ==12452==by 0x63AEF6: ltsReadFillBuffer (logtape.c:304)
> ==12452==by 0x63B560: LogicalTapeRewindForRead (logtape.c:771)
> ==12452==by 0x640567: mergeruns (tuplesort.c:2671)
> ==12452==by 0x645122: tuplesort_performsort (tuplesort.c:1866)
> ==12452==by 0x23357D: _bt_parallel_scan_and_sort (nbtsort.c:1626)
> ==12452==by 0x234F71: _bt_parallel_build_main (nbtsort.c:1527)

> Note that the above path doesn't appear to go through
> LogicalTapeFreeze(), therfore not hitting the VALGRIND_MAKE_MEM_DEFINED
> added in the above commit.

Sure, but it looks like it has the exact same underlying cause as the
LogicalTapeFreeze() issue. It shouldn't be very hard to write an
equivalent patch for LogicalTapeRewindForRead() -- I pointed out that
this could happen there instead before the first patch went in, in
fact. My mistake was to imagine that that could never happen during
the regression tests.

I think I see why this happened, and why it happens inconsistently on
skink. Because memtuples is generally much smaller during merging than
it is during initial run generation, you can have enough tuples to
have to spill a run to disk, and yet not enough to fully fill even a
small buffer during the subsequent merge (logtape.c buffers also don't
have per-tuple palloc() header overhead, which is a huge proportion of
total memory used for an int4 primary key CREATE INDEX). This failure
is sensitive to the scheduling of worker processes, which is why we
only see it occasionally on skink.

The multiple-row-versions isolation test, which is where we see the
failure, seems to want to use multiple parallel workers for its ALTER
TABLE ... ADD PRIMARY KEY, in part because the table fillfactor is
rather low -- you end up with a case that gets a parallel sort, and
yet still has very few tuples to sort, a bit like the
LogicalTapeFreeze() issue already fixed.

Should we even be doing a parallel external sort here? It's hardly
part of an intentional effort to test the code. I had to push to get
us to give external sorts test coverage at one point about 18 months
ago, because of concerns about the overhead/duration of external
sorts. Not that I feel strongly about this myself.

-- 
Peter Geoghegan



Re: [HACKERS] pageinspect option to forgo buffer locking?

2018-02-21 Thread Peter Geoghegan
On Thu, Nov 9, 2017 at 2:21 PM, Andres Freund  wrote:
> On 2017-11-09 17:14:11 -0500, Tom Lane wrote:
>> If we do this, I'd suggest exposing it as a separate SQL function
>> get_raw_page_unlocked() rather than as an option to get_raw_page().
>>
>> The reasoning is that if we ever allow these functions to be controlled
>> via GRANT instead of hardwired superuser checks (cf nearby debate about
>> lo_import/lo_export), one might reasonably consider the unlocked form as
>> more risky than the locked form, and hence not wish to have to give out
>> privileges to both at once.
>
> Good idea!

I hope that you follow up on this soon.

-- 
Peter Geoghegan



Re: [doc fix] Correct calculation of vm.nr_hugepages

2018-02-21 Thread Justin Pryzby
On Wed, Feb 21, 2018 at 03:14:57PM -0500, Robert Haas wrote:
> On Mon, Feb 19, 2018 at 9:43 PM, Tsunakawa, Takayuki
>  wrote:
> > Thanks, I'd like to take this.
> 
> Why are these values so large?  The example in the documentation shows
> 6490428 kB, and in my test I got 8733888 kB.  But 8733888 kB = 8.3 TB!
>  8.3 GB would make sense, but 8.3 TB does not.

pryzbyj@pryzbyj:~$ units -t -v 8733888kB GiB
8733888kB = 8.1340671 GiB



Re: pgsql: Avoid valgrind complaint about write() of uninitalized bytes.

2018-02-21 Thread Peter Geoghegan
On Wed, Feb 21, 2018 at 10:55 AM, Peter Geoghegan  wrote:
> Sure, but it looks like it has the exact same underlying cause to the
> LogicalTapeFreeze() issue. It shouldn't be very hard to write an
> equivalent patch for LogicalTapeRewindForRead() -- I pointed out that
> this could happen there instead before the first patch went in, in
> fact. My mistake was to imagine that that could never happen during
> the regression tests.

Attached patch does this. I cannot recreate this issue locally, but
this should still fix it on skink.

> Should we even be doing a parallel external sort here? It's hardly
> part of an intentional effort to test the code. I had to push to get
> us to give external sorts test coverage at one point about 18 months
> ago, because of concerns about the overhead/duration of external
> sorts. Not that I feel strongly about this myself.

I suppose that we should commit this patch, even if we subsequently
suppress parallel CREATE INDEX in the multiple-row-versions isolation
test.

-- 
Peter Geoghegan
From 9e3ead04bf86a3e0fd9e16a7c823e2f7335337ff Mon Sep 17 00:00:00 2001
From: Peter Geoghegan 
Date: Wed, 21 Feb 2018 11:57:56 -0800
Subject: [PATCH] Avoid another Valgrind complaint about a logtape.c write().

LogicalTapeRewindForRead() may write out its first block when it is
dirty but not full, just like LogicalTapeFreeze().  To avoid Valgrind
complaints, tell it to treat the tail of logtape.c's buffer as defined
within LogicalTapeRewindForRead().  This is a mechanical duplication of
the code from commit 9fafa413ac602624e10f61ef44a20c17029d43d8.

Valgrind complained inconsistently when isolation tests are run because
the precise scheduling of worker processes affects whether or not
uninitialized/poisoned bytes can be flushed in the
LogicalTapeRewindForRead() path.  There has to be enough tuples for a
worker to need to spill and merge multiple runs, and yet not so many
that even a BLCKSZ buffer is filled.  This is just barely possible.

Per buildfarm member skink.  It's not completely clear that this will
fix the issue, since it's hard to reproduce, but let's see what the
buildfarm thinks of it.

Peter Geoghegan
---
 src/backend/utils/sort/logtape.c | 12 
 1 file changed, 12 insertions(+)

diff --git a/src/backend/utils/sort/logtape.c b/src/backend/utils/sort/logtape.c
index d6794bf..05dde63 100644
--- a/src/backend/utils/sort/logtape.c
+++ b/src/backend/utils/sort/logtape.c
@@ -739,6 +739,18 @@ LogicalTapeRewindForRead(LogicalTapeSet *lts, int tapenum, size_t buffer_size)
 		 */
 		if (lt->dirty)
 		{
+			/*
+			 * As long as we've filled the buffer at least once, its contents
+			 * are entirely defined from valgrind's point of view, even though
+			 * contents beyond the current end point may be stale.  But it's
+			 * possible - at least in the case of a parallel sort - to sort
+			 * such small amount of data that we do not fill the buffer even
+			 * once.  Tell valgrind that its contents are defined, so it
+			 * doesn't bleat.
+			 */
+			VALGRIND_MAKE_MEM_DEFINED(lt->buffer + lt->nbytes,
+	  lt->buffer_size - lt->nbytes);
+
 			TapeBlockSetNBytes(lt->buffer, lt->nbytes);
 			ltsWriteBlock(lts, lt->curBlockNumber, (void *) lt->buffer);
 		}
-- 
2.7.4



Typo in predicate.c comment

2018-02-21 Thread Thomas Munro
Hi,

Here's a tiny patch to fix a typo.

-- 
Thomas Munro
http://www.enterprisedb.com


typo.patch
Description: Binary data


Re: [PATCH] pgbench - refactor some connection finish/null into common function

2018-02-21 Thread Rady, Doug

On 1/30/18, 03:41, "Fabien COELHO"  wrote:

> Hello Doug,

Hi Fabien,

>> This patch refactors all of the connection state PQfinish() and NULL’ing
>> into a single function.
>> Excludes PQfinish() in doConnect().
>
> My 0.02€:
>
> The argument could be "PGconn **" instead of a "CState *"?
> If so, it may be used in a few more places. What is your opinion?

I should have named finishCon() as finishCStateCon() since it was specific
to that use pattern.
I'll resubmit with that change if you think it helps.

> I'm fine with this kind of factorization which takes out a three-line
> pattern, but I'm wondering whether it would please committers.

Guess we'll find out ...

> --
> Fabien.

Thanks!
doug



Weird failures on lorikeet

2018-02-21 Thread Andres Freund
Hi Andrew,

I noticed your animal lorikeet failed in the last two runs:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lorikeet&dt=2018-02-21%2009%3A47%3A17
TRAP: FailedAssertion("!(((PageHeader) (page))->pd_special >= 
(__builtin_offsetof (PageHeaderData, pd_linp)))", File: 
"/home/andrew/bf64/root/HEAD/pgsql/src/include/storage/bufpage.h", Line: 313)

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lorikeet&dt=2018-02-20%2012%3A46%3A17
TRAP: FailedAssertion("!(PMSignalState->PMChildFlags[slot] == 1)", File: 
"/home/andrew/bf64/root/HEAD/pgsql.build/../pgsql/src/backend/storage/ipc/pmsignal.c",
 Line: 229)
2018-02-20 08:07:14.054 EST [5a8c1c3b.21d0:3] LOG:  select() failed in 
postmaster: Bad address
2018-02-20 08:07:14.073 EST [5a8c1c3b.21d0:4] LOG:  database system is shut down

The difference between the last successful and the last failing build
is a single comment typo commit.

It kinda looks like there might be some underlying issue on the machine
with shared memory going away or such?

Greetings,

Andres Freund



Re: PATCH: pgbench - break out timing data for initialization phases

2018-02-21 Thread Rady, Doug

On 1/29/18, 23:52, "Fabien COELHO"  wrote:

Hello Doug,
Hi Fabien,

> With patch and ‘-I dtgvpf’ options:
> pgrun pgbench -i -s 2000 -F 90 -q -I dtgvpf
> dropping old tables...
> creating tables...
> generating data...
> …
> 2 of 2 tuples (100%) done (elapsed 168.76 s, remaining 
0.00 s)
> vacuuming...
> creating primary keys...
> creating foreign keys...
> total time: 353.52 s (drop 1.67 s, tables 0.11 s, insert 168.82 s, commit 
0.46 s, primary 92.32 s, foreign 40.11 s, vacuum 50.03 s)
> done.

I'm in favor of such a feature.

However, I think that the durations should be shown in the order in which 
the initialization is performed.

Agreed.

I would suggest to:

- move the time measure in the initialization loop, instead of doing it
  in each function, so that it is done just in one place.

I will do this.

- maybe store the actions in some array/list data structure, eg:
   "{ char * phase; double duration; }", so that they can be kept
   in order and eventually repeated.
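
A minimal sketch of that kind of structure and measurement loop, with
made-up names rather than actual pgbench code, could look like:

#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <time.h>

typedef struct InitStep
{
    const char *phase;          /* e.g. "drop tables", "create tables" */
    void      (*run) (void);    /* the work for this phase */
    double      duration;       /* filled in by the loop below */
} InitStep;

static double
now_seconds(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

static void
run_init_steps(InitStep *steps, int nsteps)
{
    int         i;

    /* time each phase in the order it is executed */
    for (i = 0; i < nsteps; i++)
    {
        double      start = now_seconds();

        steps[i].run();
        steps[i].duration = now_seconds() - start;
    }

    /* report in execution order, one line per phase or one summary line */
    for (i = 0; i < nsteps; i++)
        printf("%s: %.2f s\n", steps[i].phase, steps[i].duration);
}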

In order to extract the commit time, I'd say that explicit begin and 
commit should be separate instructions triggered by '(' and ')'.

Also, I'm not sure of the one line display, maybe it could be done while 
it is in progress, i.e. something like:
   dropping table...
   table drop: 1.67 s
   creating table...
   table creation: 0.11 s
   ...
In which case there is no need for storing the actions and their 
durations, only the running total is needed.

The one line display was for ease of parsing it from the output flow.
Doing the "in progress" way suffers from everything before 'generating data' 
possibly scrolling off the screen/window.
For me, it is much handier to look at one set of duration times all reported 
together after all of the initialize phases are done.

-- 
Fabien.

Thanks!
doug



Re: Drop --disable-floatN-byval configure options?

2018-02-21 Thread Tom Lane
Robert Haas  writes:
> On Wed, Feb 21, 2018 at 10:50 AM, Tom Lane  wrote:
>> I have a modest
>> substitute proposal: let's just drop the --disable-float4-byval and
>> --disable-float8-byval configure options as of v11.  Those don't have any
>> impact on on-disk storage.

> USE_FLOAT4_BYVAL seems completely pointless to me, but don't we need
> USE_FLOAT8_BYVAL on machines where Datum is only 4 bytes wide?

Yes.  The point is to have one configuration for 32-bit machines and one
for 64-bit, not three possible configurations on 32-bit and four on
64-bit.

I don't actually envision changing the C code much at all; we might want
to resurrect the old code at some point.  I just want to reduce the number
of supported configurations.

> Being able to test such things without digging up a 32-bit machine is
> useful.

Agreed.  It'd still be possible to manually force this, it just wouldn't
be a documented/supported configuration.

regards, tom lane



Re: [HACKERS] Constifying numeric.c's local vars

2018-02-21 Thread Mark Dilger

> This patch got committed as c1898c3e1e235ae35b4759d233253eff221b976a
> on Sun Sep 10 16:20:41 2017 -0700, but I've only just gotten around to
> reviewing it.
> 
> I believe this is wrong and should be reverted, at least in part.
> 
> The NumericVar struct has the field 'digits' as non-const:
> 
> typedef struct NumericVar
> {
>int ndigits;/* # of digits in digits[] - can be 0! */
>int weight; /* weight of first digit */
>int sign;   /* NUMERIC_POS, NUMERIC_NEG, or NUMERIC_NAN */
>int dscale; /* display scale */
>NumericDigit *buf;  /* start of palloc'd space for digits[] */
>NumericDigit *digits;   /* base-NBASE digits */
> } NumericVar;
> 
> The static const data which is getting put in read only memory sets that data
> by casting away const as follows:
> 
> static const NumericDigit const_zero_data[1] = {0};
> static const NumericVar const_zero =
> {0, 0, NUMERIC_POS, 0, NULL, (NumericDigit *) const_zero_data};
> 
> This means that the const variable 'const_zero' contains a pointer that is
> non-const, pointing at something that is static const, stored in read only
> memory.  Yikes.

I still believe this is unsafe.

> The function set_var_from_var(const NumericVar *value, NumericVar *dest)
> uses memcpy to copy the contents of value into dest.  In cases where the value
> is a static const variable (eg, const_zero), the memcpy is copying a pointer 
> to
> static const read only data into the dest and implicitly casting away const.
> Since that static const data is stored in read only memory, this has undefined
> semantics, and I believe could lead to a server crash, at least on some
> architectures with some compilers.

This is probably safe, though, since NumericDigit seems to just be a typedef
to int16.  I should have checked that definition before complaining about that
part.




Re: [HACKERS] Constifying numeric.c's local vars

2018-02-21 Thread Andres Freund


On February 21, 2018 8:49:51 AM PST, Mark Dilger  
wrote:
>
>The idea that set_var_from_var might be called on const_zero (or
>const_one,
>etc.) is not hypothetical.  It is being done in numeric.c.
>
>If this is safe, somebody needs to be a lot clearer about why that is
>so.  There
>are no comments explaining it in the file, and the conversation in this
>thread
>never got into any details about it either.

Just getting started for the day, no coffee yet. But, uh, what you appear to 
be saying is that we have code modifying the digits of the zero value? That 
seems unlikely.

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: missing toast table for pg_policy

2018-02-21 Thread Robert Haas
On Sun, Feb 18, 2018 at 10:43 AM, Joe Conway  wrote:
> Is there really a compelling reason to not just create toast tables for
> all system catalogs as in the attached? Then we could just check for 0
> rows in misc_sanity.sql.

+1.  I don't have a huge problem with excluding a few key catalogs for
which we think it might be unsafe, but in general it seems like a good
idea to settle on a policy of including them everywhere else.
Omitting one or even half a dozen TOAST tables on system catalogs
doesn't save anything material for users, but does succeed in annoying
some user who is trying to do something a little off the beaten path.
It also doesn't save anything for developers; indeed, the cognitive
load comes mostly from having to argue about which things should get
TOAST tables.  If we just add them everywhere, we can stop arguing
about this; no other policy will have that effect.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Constifying numeric.c's local vars

2018-02-21 Thread Mark Dilger

> On Sep 11, 2017, at 5:10 AM, Tom Lane  wrote:
> 
> Andres Freund  writes:
>> One large user of unnecessary non-constant static variables is
>> numeric.c.  More out of curiosity - numeric is slow enough in itself to
>> make inlining not a huge win - I converted it to use consts.
> 
> LGTM.
> 
>> It's a bit ugly that some consts have to be casted away in the constant
>> definitions, but aside from just inlining the values, I don't quite see
>> a better solution?
> 
> No, I don't either.  I'm not sure that writing the constant inline would
> produce the desired results - the compiler might well decide that it had
> to be in read-write storage.
> 
>   regards, tom lane

This patch got committed as c1898c3e1e235ae35b4759d233253eff221b976a
on Sun Sep 10 16:20:41 2017 -0700, but I've only just gotten around to
reviewing it.

I believe this is wrong and should be reverted, at least in part.

The NumericVar struct has the field 'digits' as non-const:

typedef struct NumericVar
{
int ndigits;/* # of digits in digits[] - can be 0! */
int weight; /* weight of first digit */
int sign;   /* NUMERIC_POS, NUMERIC_NEG, or NUMERIC_NAN */
int dscale; /* display scale */
NumericDigit *buf;  /* start of palloc'd space for digits[] */
NumericDigit *digits;   /* base-NBASE digits */
} NumericVar;

The static const data which is getting put in read only memory sets that data
by casting away const as follows:

static const NumericDigit const_zero_data[1] = {0};
static const NumericVar const_zero =
{0, 0, NUMERIC_POS, 0, NULL, (NumericDigit *) const_zero_data};

This means that the const variable 'const_zero' contains a pointer that is
non-const, pointing at something that is static const, stored in read only
memory.  Yikes.

The function set_var_from_var(const NumericVar *value, NumericVar *dest)
uses memcpy to copy the contents of value into dest.  In cases where the value
is a static const variable (eg, const_zero), the memcpy is copying a pointer to
static const read only data into the dest and implicitly casting away const.
Since that static const data is stored in read only memory, this has undefined
semantics, and I believe could lead to a server crash, at least on some
architectures with some compilers.
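
To make the concern concrete, here is a minimal standalone sketch of the
pattern, with made-up names rather than the actual numeric.c declarations:

#include <string.h>

typedef short Digit;

typedef struct Var
{
    int         ndigits;
    Digit      *digits;         /* non-const pointer field */
} Var;

static const Digit zero_digits[1] = {0};

/* a const struct whose pointer member aliases read-only storage */
static const Var const_zero_example = {1, (Digit *) zero_digits};

static void
set_var_from_var_example(const Var *value, Var *dest)
{
    /* copies the pointer too; dest->digits now aliases read-only memory */
    memcpy(dest, value, sizeof(Var));
}

int
main(void)
{
    Var         v;

    set_var_from_var_example(&const_zero_example, &v);

    /*
     * Reading v.digits[0] is fine.  Writing through it, for example
     * "v.digits[0] = 1;", would be undefined behavior, since the
     * underlying storage is const and typically mapped read-only.
     */
    return v.digits[0];
}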

The idea that set_var_from_var might be called on const_zero (or const_one,
etc.) is not hypothetical.  It is being done in numeric.c.

If this is safe, somebody needs to be a lot clearer about why that is so.  There
are no comments explaining it in the file, and the conversation in this thread
never got into any details about it either.

Mark Dilger






Re: Drop --disable-floatN-byval configure options?

2018-02-21 Thread Robert Haas
On Wed, Feb 21, 2018 at 10:50 AM, Tom Lane  wrote:
> Per the discussion at
> https://www.postgresql.org/message-id/flat/54dfd2022c205eda9aa35b88923f027a%40postgrespro.ru
> it's become evident that use of --disable-float8-byval breaks at least one
> current regression test case, because it results in a large change in hash
> table size for float8 hash keys.  While we could hack around that somehow,
> I think it would be a poor use of development effort.  I have a modest
> substitute proposal: let's just drop the --disable-float4-byval and
> --disable-float8-byval configure options as of v11.  Those don't have any
> impact on on-disk storage.

USE_FLOAT4_BYVAL seems completely pointless to me, but don't we need
USE_FLOAT8_BYVAL on machines where Datum is only 4 bytes wide?  If
we've got to support that case on such machines, I'm not sure I'm in
favor of removing the option to have that behavior on other machines.
Being able to test such things without digging up a 32-bit machine is
useful.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: file cloning in pg_upgrade and CREATE DATABASE

2018-02-21 Thread Robert Haas
On Tue, Feb 20, 2018 at 10:00 PM, Peter Eisentraut
 wrote:
> Some example measurements:
>
> 6 GB database, pg_upgrade unpatched 30 seconds, patched 3 seconds (XFS
> and APFS)
>
> similar for a CREATE DATABASE from a large template
>
> Even if you don't have a file system with cloning support, the special
> library calls make copying faster.  For example, on APFS, in this
> example, an unpatched CREATE DATABASE takes 30 seconds, with the library
> call (but without cloning) it takes 10 seconds.
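
(For readers unfamiliar with the mechanism: on Linux one such library
call is copy_file_range(), available since glibc 2.27.  The sketch below
is illustrative only and is not the patch, which also covers other
platforms and falls back to ordinary copies.)

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
    int         src, dst;
    struct stat st;

    if (argc != 3)
        return 1;

    src = open(argv[1], O_RDONLY);
    dst = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (src < 0 || dst < 0 || fstat(src, &st) < 0)
        return 1;

    /* the kernel copies (or clones, where the filesystem supports it)
     * without bouncing the data through user space */
    while (st.st_size > 0)
    {
        ssize_t     copied = copy_file_range(src, NULL, dst, NULL,
                                             st.st_size, 0);

        if (copied <= 0)
            return 1;
        st.st_size -= copied;
    }
    return 0;
}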

Nice results.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: master check fails on Windows Server 2008

2018-02-21 Thread Marina Polyakova

On 21-02-2018 18:51, Tom Lane wrote:

Marina Polyakova  writes:

On 20-02-2018 21:23, Tom Lane wrote:

I continue to wonder if it's not better to just remove
the option and thereby simplify our lives.  What's the actual value of
having it anymore?


I agree with you, but I have too little experience to vote for removing
this option.


I've started a separate thread to propose removal of the option, at
https://postgr.es/m/10862.1519228...@sss.pgh.pa.us


Thank you!

--
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: master check fails on Windows Server 2008

2018-02-21 Thread Tom Lane
Marina Polyakova  writes:
> On 20-02-2018 21:23, Tom Lane wrote:
>> I continue to wonder if it's not better to just remove
>> the option and thereby simplify our lives.  What's the actual value of
>> having it anymore?

> I agree with you, but I have too little experience to vote for removing 
> this option.

I've started a separate thread to propose removal of the option, at
https://postgr.es/m/10862.1519228...@sss.pgh.pa.us

regards, tom lane



Drop --disable-floatN-byval configure options?

2018-02-21 Thread Tom Lane
Per the discussion at
https://www.postgresql.org/message-id/flat/54dfd2022c205eda9aa35b88923f027a%40postgrespro.ru
it's become evident that use of --disable-float8-byval breaks at least one
current regression test case, because it results in a large change in hash
table size for float8 hash keys.  While we could hack around that somehow,
I think it would be a poor use of development effort.  I have a modest
substitute proposal: let's just drop the --disable-float4-byval and
--disable-float8-byval configure options as of v11.  Those don't have any
impact on on-disk storage.  As best I can recall, the whole argument
for having them was to allow building backends that retained backwards
compatibility with old version-0 C functions that might expect pass-
by-reference behavior for these types.  Now that we've dropped version-0
C function support, there's no point in that; any version-1 function that
is written per conventions won't notice the difference.  So the extra
complication isn't buying useful flexibility, just more cases that we'd
have to test if we want to claim that we're supporting these options in
any meaningful way.

We would need to double-check that pg_upgrade allows converting from
byref to byval cases, but offhand I see no check for that in its code,
so I think it will just work.

(You might be wondering why these flags are reflected in pg_control,
if there's no impact on on-disk storage.  The reason is that the backend's
compiled-in assumptions have to match what pg_type.typbyval says, so
pg_control memorializes what initdb put into the catalogs.  But we should
be able to cope with changing that during a pg_upgrade cycle, just like
other catalog content changes.)

regards, tom lane



Off-cycle back-branch releases next week

2018-02-21 Thread Tom Lane
We will make a set of minor releases next week, ie wrap tarballs
Monday 2/26 for announcement Thursday 3/1.

The reason for this unusual scheduling is that an external security
researcher found a problem and did not wish to wait for our next
quarterly releases before publishing his results.  So we compromised
on this timing.

regards, tom lane



Re: CURRENT OF causes an error when IndexOnlyScan is used

2018-02-21 Thread Anastasia Lubennikova

20.02.2018 12:52, Aleksander Alekseev:

Hi Anastasia,


I'd like to propose the patch that fixes the issue.
We already have a way to return heaptuple from IndexOnlyScan,
but it was not applied to b-tree for some reason.

Attached patch solves the reported bug.
Moreover, it will come in handy for "index with included attributes" feature
[1],
where we can store long (and even TOASTed) attributes in indextuple.

[1] https://commitfest.postgresql.org/17/1350/

I believe the patch should include a test that tries to reproduce an
issue it tries to fix.

Also maybe this code that repeats 3 times can be moved to a separate
procedure?


Good point. Updated version with test is attached.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

commit 391107028d4597e9721a3b274904e49ee69de0cf
Author: Anastasia 
Date:   Wed Feb 21 17:25:11 2018 +0300

return_heaptuple_in_btree_indexonlyscan_v2.patch

diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 8158508..db8a55c 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -374,6 +374,8 @@ btbeginscan(Relation rel, int nkeys, int norderbys)
 	so->currTuples = so->markTuples = NULL;
 
 	scan->xs_itupdesc = RelationGetDescr(rel);
+	scan->xs_hitupdesc = NULL;
+	scan->xs_hitup = NULL;
 
 	scan->opaque = so;
 
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 51dca64..278f43d 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -25,6 +25,9 @@
 #include "utils/tqual.h"
 
 
+static HeapTuple _bt_fetch_tuple(IndexScanDesc scandesc,
+ ItemPointerData heapTid);
+static void	_bt_fill_hitupdesc(IndexScanDesc scan);
 static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
 			 OffsetNumber offnum);
 static void _bt_saveitem(BTScanOpaque so, int itemIndex,
@@ -38,7 +41,52 @@ static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
 static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
 static inline void _bt_initialize_more_data(BTScanOpaque so, ScanDirection dir);
 
+/*
+ * Fetch all keys in tuple.
+ * Returns a new HeapTuple containing the originally-indexed data.
+ */
+static HeapTuple
+_bt_fetch_tuple(IndexScanDesc scandesc, ItemPointerData heapTid)
+{
+	Relation index = scandesc->indexRelation;
+	Datum		fetchatt[INDEX_MAX_KEYS];
+	bool		isnull[INDEX_MAX_KEYS];
+	int			i;
+	HeapTuple htuple;
+
+	for (i = 0; i < index->rd_att->natts; i++)
+	{
+		fetchatt[i] = index_getattr(scandesc->xs_itup, i + 1,
+	scandesc->xs_itupdesc, [i]);
+	}
+
+	htuple = heap_form_tuple(scandesc->xs_hitupdesc, fetchatt, isnull);
+	htuple->t_tableOid = scandesc->heapRelation->rd_id;
+	htuple->t_self = heapTid;
 
+
+	return htuple;
+}
+
+static void
+_bt_fill_hitupdesc(IndexScanDesc scan)
+{
+	int natts = RelationGetNumberOfAttributes(scan->indexRelation);
+	int attno;
+	/*
+	* The storage type of the index can be different from the original
+	* datatype being indexed, so we cannot just grab the index's tuple
+	* descriptor. Instead, construct a descriptor with the original data
+	* types.
+	*/
+	scan->xs_hitupdesc = CreateTemplateTupleDesc(natts, false);
+	for (attno = 1; attno <= natts; attno++)
+	{
+		TupleDescInitEntry(scan->xs_hitupdesc, attno, NULL,
+		scan->indexRelation->rd_opcintype[attno - 1],
+		-1, 0);
+	}
+}
 /*
  *	_bt_drop_lock_and_maybe_pin()
  *
@@ -1105,9 +1153,19 @@ readcomplete:
 	/* OK, itemIndex says what to return */
 	currItem = >currPos.items[so->currPos.itemIndex];
 	scan->xs_ctup.t_self = currItem->heapTid;
+
 	if (scan->xs_want_itup)
+	{
 		scan->xs_itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
 
+		if (!scan->xs_hitupdesc)
+			_bt_fill_hitupdesc(scan);
+
+		if (scan->xs_hitup)
+			pfree(scan->xs_hitup);
+		scan->xs_hitup = _bt_fetch_tuple(scan, currItem->heapTid);
+	}
+
 	return true;
 }
 
@@ -1155,9 +1213,19 @@ _bt_next(IndexScanDesc scan, ScanDirection dir)
 	/* OK, itemIndex says what to return */
 	currItem = >currPos.items[so->currPos.itemIndex];
 	scan->xs_ctup.t_self = currItem->heapTid;
+
 	if (scan->xs_want_itup)
+	{
 		scan->xs_itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
 
+		if (!scan->xs_hitupdesc)
+			_bt_fill_hitupdesc(scan);
+
+		if (scan->xs_hitup)
+			pfree(scan->xs_hitup);
+		scan->xs_hitup = _bt_fetch_tuple(scan, currItem->heapTid);
+	}
+
 	return true;
 }
 
@@ -1932,9 +2000,19 @@ _bt_endpoint(IndexScanDesc scan, ScanDirection dir)
 	/* OK, itemIndex says what to return */
 	currItem = >currPos.items[so->currPos.itemIndex];
 	scan->xs_ctup.t_self = currItem->heapTid;
+
 	if (scan->xs_want_itup)
+	{
 		scan->xs_itup = (IndexTuple) (so->currTuples + currItem->tupleOffset);
 
+		if (!scan->xs_hitupdesc)
+			_bt_fill_hitupdesc(scan);
+
+		if (scan->xs_hitup)
+			pfree(scan->xs_hitup);
+		scan->xs_hitup = 

Re: [HACKERS] Runtime Partition Pruning

2018-02-21 Thread David Rowley
On 21 February 2018 at 22:45, Rajkumar Raghuwanshi
 wrote:
> select count(*) from prt1 x where (x.a,x.b) in (select t1.a,t2.b from prt1
> t1,prt2 t2 where t1.a=t2.b)
> and (x.c) in (select t3.c from plt1 t3,plt2 t4 where t3.c=t4.c);

Thanks for the test case.

It seems like the x.c Param for some reason has a ParamId of 0. I need
to go learn a bit more about Params to understand why this is.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Add support for tuple routing to foreign partitions

2018-02-21 Thread Etsuro Fujita

(2018/02/02 19:33), Etsuro Fujita wrote:

(2018/01/25 23:33), Stephen Frost wrote:

I'm afraid a good bit of this patch is now failing to apply. I don't
have much else to say except to echo the performance concern that Amit
Langote raised about expanding the inheritence tree twice.


To address that concern, I'm thinking to redesign the patch so that it
wouldn't expand the tree at planning time anymore. I don't have any
clear solution for that yet, but what I have in mind now is to add new
FDW APIs to the executor, instead, so that the FDW could 1) create stuff
such as a query for remote INSERT as PlanForeignModify and 2)
initialize/end the remote INSERT operation as BeginForeignModify and
EndForeignModify, somewhere in the executor.


New FDW APIs I would like to propose for that are:

void
BeginForeignRouting(ModifyTableState *mtstate,
ResultRelInfo *resultRelInfo,
int partition_index);

Prepare for a tuple-routing operation on a foreign table.  This is 
called from ExecSetupPartitionTupleRouting and ExecInitPartitionInfo.


TupleTableSlot *
ExecForeignRouting(EState *estate,
   ResultRelInfo *resultRelInfo,
   TupleTableSlot *slot);

Route one tuple to the foreign table.  This is called from ExecInsert.

void
EndForeignRouting(EState *estate,
  ResultRelInfo *resultRelInfo);

End the operation and release resources.  This is called from 
ExecCleanupTupleRouting.
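
To visualize how an FDW would plug into these, a skeleton set of
callbacks might look like the following (just a sketch against the
proposed signatures; the "my" names are invented and this is not part of
the attached patches):

#include "postgres.h"

#include "foreign/fdwapi.h"
#include "nodes/execnodes.h"

static void
myBeginForeignRouting(ModifyTableState *mtstate,
                      ResultRelInfo *resultRelInfo,
                      int partition_index)
{
    /* e.g. open the remote connection and prepare the remote INSERT */
}

static TupleTableSlot *
myExecForeignRouting(EState *estate,
                     ResultRelInfo *resultRelInfo,
                     TupleTableSlot *slot)
{
    /* ship the routed tuple to the remote partition */
    return slot;
}

static void
myEndForeignRouting(EState *estate, ResultRelInfo *resultRelInfo)
{
    /* flush and release any per-partition remote resources */
}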


Attached are WIP patches for that:

Patch postgres-fdw-refactoring-WIP.patch: refactoring patch for 
postgres_fdw.c to reduce duplicate code.


Patch foreign-routing-fdwapi-WIP.patch: main patch to add new FDW APIs, 
which is created on top of patch postgres-fdw-refactoring-WIP.patch and 
the lazy-initialization-of-partition-info patch [1].


By this change we don't need to expand the inheritance tree at planning 
time, so no need to worry about the performance concern.  Maybe I'm 
missing something, though.  Early feedback would be greatly appreciated.


Best regards,
Etsuro Fujita

[1] https://www.postgresql.org/message-id/5a8bfb31.6030...@lab.ntt.co.jp
*** a/contrib/file_fdw/output/file_fdw.source
--- b/contrib/file_fdw/output/file_fdw.source
***
*** 315,321  SELECT tableoid::regclass, * FROM p2;
  (0 rows)
  
  COPY pt FROM '@abs_srcdir@/data/list2.bad' with (format 'csv', delimiter ','); -- ERROR
! ERROR:  cannot route inserted tuples to a foreign table
  CONTEXT:  COPY pt, line 2: "1,qux"
  COPY pt FROM '@abs_srcdir@/data/list2.csv' with (format 'csv', delimiter ',');
  SELECT tableoid::regclass, * FROM pt;
--- 315,321 
  (0 rows)
  
  COPY pt FROM '@abs_srcdir@/data/list2.bad' with (format 'csv', delimiter ','); -- ERROR
! ERROR:  cannot route copied tuples to a foreign table
  CONTEXT:  COPY pt, line 2: "1,qux"
  COPY pt FROM '@abs_srcdir@/data/list2.csv' with (format 'csv', delimiter ',');
  SELECT tableoid::regclass, * FROM pt;
***
*** 342,351  SELECT tableoid::regclass, * FROM p2;
  (2 rows)
  
  INSERT INTO pt VALUES (1, 'xyzzy'); -- ERROR
! ERROR:  cannot route inserted tuples to a foreign table
  INSERT INTO pt VALUES (2, 'xyzzy');
  UPDATE pt set a = 1 where a = 2; -- ERROR
! ERROR:  cannot route inserted tuples to a foreign table
  SELECT tableoid::regclass, * FROM pt;
   tableoid | a |   b   
  --+---+---
--- 342,351 
  (2 rows)
  
  INSERT INTO pt VALUES (1, 'xyzzy'); -- ERROR
! ERROR:  cannot route inserted tuples to foreign table "p1"
  INSERT INTO pt VALUES (2, 'xyzzy');
  UPDATE pt set a = 1 where a = 2; -- ERROR
! ERROR:  cannot route inserted tuples to foreign table "p1"
  SELECT tableoid::regclass, * FROM pt;
   tableoid | a |   b   
  --+---+---
*** a/contrib/postgres_fdw/expected/postgres_fdw.out
--- b/contrib/postgres_fdw/expected/postgres_fdw.out
***
*** 7364,7369  NOTICE:  drop cascades to foreign table bar2
--- 7364,7448 
  drop table loct1;
  drop table loct2;
  -- ===
+ -- test tuple routing for foreign-table partitions
+ -- ===
+ create table pt (a int, b int, c text) partition by list (a);
+ create table loct1 (a int check (a in (1)), b int, c text, constraint locp1_pkey primary key (b));
+ create table loct2 (b int, c text, a int check (a in (2)), constraint locp2_pkey primary key (b));
+ create foreign table ptp1 partition of pt for values in (1) server loopback options (table_name 'loct1');
+ create foreign table ptp2 partition of pt for values in (2) server loopback options (table_name 'loct2');
+ insert into pt values (1, 1, 'foo');
+ insert into pt values (1, 2, 'bar') returning *;
+  a | b |  c  
+ ---+---+-
+  1 | 2 | bar
+ (1 row)
+ 
+ insert into pt values (2, 1, 'baz') returning *;
+  a | b |  c  
+ ---+---+-
+  2 | 1 | baz
+ (1 row)
+ 
+ select tableoid::regclass, * FROM pt;
+  tableoid | a | 

Re: NEXT VALUE FOR sequence

2018-02-21 Thread Laurenz Albe
Ashutosh Bapat wrote:
> On Tue, Feb 20, 2018 at 8:39 PM, Laurenz Albe  
> wrote:
> > Tom Lane wrote:
> > > Laurenz Albe  writes:
> > > > The SQL standard has the expression "NEXT VALUE FOR asequence" to do
> > > > what we traditionally do with "nextval('asequence')".
> > > > This is an attempt to implement this on top of the recently introduced
> > > > NextValueExpr node.
> > > 
> > > This has been proposed repeatedly, and rejected repeatedly, because in
> > > fact the standard's semantics for NEXT VALUE FOR are *not* like nextval().
> > > See SQL:2011 4.22.2 "Operations involving sequence generators":
> > > 
> > > If there are multiple instances of <next value expression>s specifying
> > > the same sequence generator within a single SQL-statement, all those
> > > instances return the same value for a given row processed by that
> > > SQL-statement.
> > > 
> > > This is not terribly exact --- what is a "processed row" in a join query,
> > > for instance?  But it's certainly not supposed to act like independent
> > > executions of nextval() or NextValueExpr would.  Pending somebody doing
> > > the legwork to produce something that at least arguably conforms to the
> > > spec's semantics, we've left the syntax unimplemented.
> > 
> > Would it be reasonable to say that any two NextValueExpr in the same
> > target list are "in one row"?
> 
> I think, "processed row" thing gets pretty complicated. Consider
> simple case. What happens when NextValueExpr appears in one of the
> conditions and that row gets eliminated, do we consider that as a
> processed row and increment the NextValueExpr or do not increment it?

I think that is an unrelated question.

In my opinion both would be ok.  SQL does not decree in which order
conditions are evaluated, and it is ok to evaluate an expression even if
it is never used, right?

All that has to be guaranteed is that using "NEXT VALUE FOR asequence"
several times "in one row" (whatever exactly that is) will return the
same value each time.

The exact rules seem to be:

a) If NVE is directly contained in a <query specification> QS, then the
   General Rules of Subclause 9.21, “Generation of the next value of a
   sequence generator”, are applied once per row in the result of QS
   with SEQ as SEQUENCE. The result of each evaluation of NVE for a given
   row is the RESULT returned by the General Rules of Subclause 9.21,
   “Generation of the next value of a sequence generator”.

I understand:
For each NEXT VALUE FOR that appears in a select list (not a subselect),
return the same value per row.

b) If NVE is directly contained in a <table value constructor>
   TVC, then the General Rules of Subclause 9.21, “Generation of the
   next value of a sequence generator”, are applied once per
contained in TVC.
   The result of each evaluation of NVE for a given 
   is the RESULT returned by the General Rules of Subclause 9.21,
   “Generation of the next value of a sequence generator”.

I understand:
In a VALUES clause, each invocation of NEXT VALUE FOR should return the same 
value.

c) If NVE is directly contained in an , then the General Rules
   of Subclause 9.21, “Generation of the next value of a sequence generator”,
   are applied once per row to be updated by the 
   or . The result of each evaluation of NVE for a
   given row is the RESULT returned by the General Rules of Subclause 9.21,
   “Generation of the next value of a sequence generator”.

I understand:
In an UPDATE statement, all invocations of NEXT VALUE FOR in the SET clause
should return the same value per row.

Yours,
Laurenz Albe



Re: NEXT VALUE FOR sequence

2018-02-21 Thread Ashutosh Bapat
On Tue, Feb 20, 2018 at 8:39 PM, Laurenz Albe  wrote:
> Tom Lane wrote:
>> Laurenz Albe  writes:
>> > The SQL standard has the expression "NEXT VALUE FOR asequence" to do
>> > what we traditionally do with "nextval('asequence')".
>> > This is an attempt to implement this on top of the recently introduced
>> > NextValueExpr node.
>>
>> This has been proposed repeatedly, and rejected repeatedly, because in
>> fact the standard's semantics for NEXT VALUE FOR are *not* like nextval().
>> See SQL:2011 4.22.2 "Operations involving sequence generators":
>>
>> If there are multiple instances of <next value expression>s specifying
>> the same sequence generator within a single SQL-statement, all those
>> instances return the same value for a given row processed by that
>> SQL-statement.
>>
>> This is not terribly exact --- what is a "processed row" in a join query,
>> for instance?  But it's certainly not supposed to act like independent
>> executions of nextval() or NextValueExpr would.  Pending somebody doing
>> the legwork to produce something that at least arguably conforms to the
>> spec's semantics, we've left the syntax unimplemented.
>
> Would it be reasonable to say that any two NextValueExpr in the same
> target list are "in one row"?

I think, "processed row" thing gets pretty complicated. Consider
simple case. What happens when NextValueExpr appears in one of the
conditions and that row gets eliminated, do we consider that as a
processed row and increment the NextValueExpr or do not increment it?

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: master check fails on Windows Server 2008

2018-02-21 Thread Marina Polyakova

On 20-02-2018 21:23, Tom Lane wrote:

Marina Polyakova  writes:

On 20-02-2018 3:37, Tom Lane wrote:
4. Try to tweak the stats_ext.sql test conditions in some more refined
way to get the test to pass everywhere.  This'd be a lot of work with
no guarantee of success, so I'm not too excited about it.



Thank you for your explanations! I'll try to do something in this
direction..


OK.  The least painful fix might be to establish a different work_mem
setting just for that one query.

However, if you're intent on putting work into continued support of
--disable-float8-byval, I would *strongly* suggest setting up a buildfarm
member that runs that way, because otherwise we're pretty much guaranteed
to break it again.


Oh, thank you again!


I continue to wonder if it's not better to just remove
the option and thereby simplify our lives.  What's the actual value of
having it anymore?


I agree with you, but I have too little experience to vote for removing 
this option.


--
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: [HACKERS] path toward faster partition pruning

2018-02-21 Thread David Rowley
On 21 February 2018 at 14:53, Amit Langote
 wrote:
> On 2018/02/21 10:19, David Rowley wrote:
>> v30-0004-Faster-partition-pruning.patch contains:
>>
>> +create table coll_pruning_multi (a text) partition by range
>> (substr(a, 1) collate "en_GB", substr(a, 1) collate "en_US");
>>
>> This'll likely work okay on Linux. Other collate tests seem to use
>> COLLATE "POSIX or "C" so that work cross-platform.
>
> Thanks.  I completely forgot about that.  I've rewritten those tests to
> use "POSIX" and "C" in the attached.

Thanks for fixing. I made a pass over v31 and only see a few small things:

1. In get_partitions_for_keys() why is the
get_partitions_excluded_by_ne_datums call not part of
get_partitions_for_keys_list?

2. Still a stray "minoff += 1;" in get_partitions_for_keys_range

3. You're also preferring to minoff--/++, but maxoff -= 1/maxoff += 1;
would be nice to see the style unified here.

4. "other other"

 * that is, each of its fields other other than clauseinfo must be valid before

5. "a IS NULL" -> "an IS NULL":

 * Based on a IS NULL or IS NOT NULL clause that was matched to a partition

6. Can you add a warning in the header comment for
extract_partition_clauses() to explain "Note: the 'clauses' List may
be modified inside this function. Callers may like to make a copy of
important lists before passing them to this function.", or something
like that...

7. "null" -> "nulls"

* Only allow strict operators.  This will guarantee null are

8. "dicard" -> "discard"

* contains a <= 2, then because 3 <= 2 is false, we dicard a < 3 as

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Runtime Partition Pruning

2018-02-21 Thread Rajkumar Raghuwanshi
On Wed, Feb 21, 2018 at 2:36 PM, David Rowley 
wrote:

> I've attached v11 of the patch.
>

Hi,

I have applied the attached patch on head
"6f1d723b6359507ef55a81617167507bc25e3e2b" over Amit's v30 patches. While
testing further I got a server crash with the test case below. Please take a
look.

CREATE TABLE prt1 (a int, b int, c varchar) PARTITION BY RANGE(a);
CREATE TABLE prt1_p1 PARTITION OF prt1 FOR VALUES FROM (0) TO (250);
CREATE TABLE prt1_p3 PARTITION OF prt1 FOR VALUES FROM (500) TO (600);
CREATE TABLE prt1_p2 PARTITION OF prt1 FOR VALUES FROM (250) TO (500);
INSERT INTO prt1 SELECT i, i, to_char(i, 'FM') FROM generate_series(0,
599, 2) i;
CREATE INDEX iprt1_p1_a on prt1_p1(a);
CREATE INDEX iprt1_p2_a on prt1_p2(a);
CREATE INDEX iprt1_p3_a on prt1_p3(a);
ANALYZE prt1;

CREATE TABLE prt2 (a int, b int, c varchar) PARTITION BY RANGE(b);
CREATE TABLE prt2_p1 PARTITION OF prt2 FOR VALUES FROM (0) TO (250);
CREATE TABLE prt2_p2 PARTITION OF prt2 FOR VALUES FROM (250) TO (500);
CREATE TABLE prt2_p3 PARTITION OF prt2 FOR VALUES FROM (500) TO (600);
INSERT INTO prt2 SELECT i, i, to_char(i, 'FM') FROM generate_series(0,
599, 3) i;
CREATE INDEX iprt2_p1_b on prt2_p1(b);
CREATE INDEX iprt2_p2_b on prt2_p2(b);
CREATE INDEX iprt2_p3_b on prt2_p3(b);
ANALYZE prt2;

CREATE TABLE plt1 (a int, b int, c text) PARTITION BY LIST(c);
CREATE TABLE plt1_p1 PARTITION OF plt1 FOR VALUES IN ('', '0003',
'0004', '0010');
CREATE TABLE plt1_p2 PARTITION OF plt1 FOR VALUES IN ('0001', '0005',
'0002', '0009');
CREATE TABLE plt1_p3 PARTITION OF plt1 FOR VALUES IN ('0006', '0007',
'0008', '0011');
INSERT INTO plt1 SELECT i, i, to_char(i/50, 'FM') FROM
generate_series(0, 599, 2) i;
CREATE INDEX iplt1_p1_c on plt1_p1(c);
CREATE INDEX iplt1_p2_c on plt1_p2(c);
CREATE INDEX iplt1_p3_c on plt1_p3(c);
ANALYZE plt1;

CREATE TABLE plt2 (a int, b int, c text) PARTITION BY LIST(c);
CREATE TABLE plt2_p1 PARTITION OF plt2 FOR VALUES IN ('', '0003',
'0004', '0010');
CREATE TABLE plt2_p2 PARTITION OF plt2 FOR VALUES IN ('0001', '0005',
'0002', '0009');
CREATE TABLE plt2_p3 PARTITION OF plt2 FOR VALUES IN ('0006', '0007',
'0008', '0011');
INSERT INTO plt2 SELECT i, i, to_char(i/50, 'FM') FROM
generate_series(0, 599, 3) i;
CREATE INDEX iplt2_p1_c on plt2_p1(c);
CREATE INDEX iplt2_p2_c on plt2_p2(c);
CREATE INDEX iplt2_p3_c on plt2_p3(c);
ANALYZE plt2;

select count(*) from prt1 x where (x.a,x.b) in (select t1.a,t2.b from prt1
t1,prt2 t2 where t1.a=t2.b)
and (x.c) in (select t3.c from plt1 t3,plt2 t4 where t3.c=t4.c);

/*
postgres=# select count(*) from prt1 x where (x.a,x.b) in (select t1.a,t2.b
from prt1 t1,prt2 t2 where t1.a=t2.b)
postgres-# and (x.c) in (select t3.c from plt1 t3,plt2 t4 where t3.c=t4.c);
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
*/

stack-trace give below :

/*
(gdb) bt
#0  0x006ce6dc in ExecEvalParamExec (state=0x26e9ee0, op=0x26e9f78,
econtext=0x26ea390) at execExprInterp.c:
#1  0x006cc66a in ExecInterpExpr (state=0x26e9ee0,
econtext=0x26ea390, isnull=0x7ffe0f75d77f "") at execExprInterp.c:1024
#2  0x006cdd8c in ExecInterpExprStillValid (state=0x26e9ee0,
econtext=0x26ea390, isNull=0x7ffe0f75d77f "") at execExprInterp.c:1819
#3  0x007db078 in ExecEvalExprSwitchContext (state=0x26e9ee0,
econtext=0x26ea390, isNull=0x7ffe0f75d77f "") at
../../../../src/include/executor/executor.h:305
#4  0x007e2072 in evaluate_expr (expr=0x26a3cb0, result_type=25,
result_typmod=-1, result_collation=0) at clauses.c:4890
#5  0x007e588a in partkey_datum_from_expr (context=0x26d3180,
parttypid=25, expr=0x26a3cb0, value=0x7ffe0f75da00) at partprune.c:1504
#6  0x007e5243 in extract_bounding_datums (context=0x26d3180,
minimalclauses=0x7ffe0f75d900, keys=0x7ffe0f75da00) at partprune.c:1307
#7  0x007e377d in get_partitions_from_clauses (context=0x26d3180)
at partprune.c:273
#8  0x006ea2ec in set_valid_runtime_subplans_recurse
(node=0x269bf90, pinfo=0x7f6cf6765cf0, ctxcache=0x26d3158,
validsubplans=0x7ffe0f75de10) at nodeAppend.c:771
#9  0x006e9ebf in set_valid_runtime_subplans (node=0x269bf90) at
nodeAppend.c:640
#10 0x006e99b5 in choose_next_subplan_locally (node=0x269bf90) at
nodeAppend.c:426
#11 0x006e9598 in ExecAppend (pstate=0x269bf90) at nodeAppend.c:224
#12 0x006deb3a in ExecProcNodeFirst (node=0x269bf90) at
execProcnode.c:446
#13 0x006fb9ee in ExecProcNode (node=0x269bf90) at
../../../src/include/executor/executor.h:239
#14 0x006fbcc4 in ExecHashJoinImpl (pstate=0x2697808, parallel=0
'\000') at nodeHashjoin.c:262
#15 0x006fc3fd in ExecHashJoin (pstate=0x2697808) at
nodeHashjoin.c:565
#16 0x006deb3a in ExecProcNodeFirst (node=0x2697808) at
execProcnode.c:446
#17 0x0070c376 in ExecProcNode 

Re: ALTER TABLE ADD COLUMN fast default

2018-02-21 Thread Andres Freund
Hi,

On 2018-02-20 17:03:07 +1030, Andrew Dunstan wrote:
> + 
> +  attmissingval
> +  bytea
> +  
> +  
> +   This column has a binary representation of the value used when the
> +   column is entirely missing from the row, as happens when the column is
> +   added after the row is created. The value is only used when
> +   atthasmissing is true.
> +  
> + 

Hm. Why is this a bytea, and why is that a good idea? In pg_statistics,
which has to deal with something similar-ish, we deal with the ambiguity
of the storage by storing anyarrays, which then can display the contents
properly thanks to the embedded oid.  Now it'd be a bit sad to store a
one element array in the table. But that seems better than storing a
bytea that's typeless and can't be viewed reasonably. The best solution
would be to create a variadic any type that can actually be stored, but
I see why you'd not want to go there necessarily.

But also, a bit higher level, is it a good idea to store this
information directly into pg_attribute? That table is already one of the
hotter system catalog tables, and very frequently the largest. If we
suddenly end up storing a lot of default values in there, possibly ones
that aren't ever actually used because all the default values are long
since actually stored in the table, we'll spend more time doing cache
lookups.

I'm somewhat tempted to say that the default data should be in a
separate catalog table.  Or possibly use pg_attrdef.

>  
> +/*
> + * Return the missing value of an attribute, or NULL if there isn't one.
> + */
> +static Datum
> +getmissingattr(TupleDesc tupleDesc,
> +int attnum, bool *isnull)
> +{
> + int missingnum;
> + Form_pg_attribute att;
> + AttrMissing *attrmiss;
> +
> + Assert(attnum <= tupleDesc->natts);
> + Assert(attnum > 0);
> +
> + att = TupleDescAttr(tupleDesc, attnum - 1);
> +
> + if (att->atthasmissing)
> + {
> + Assert(tupleDesc->constr);
> + Assert(tupleDesc->constr->num_missing > 0);
> +
> + attrmiss = tupleDesc->constr->missing;
> +
> + /*
> +  * Look through the tupledesc's attribute missing values
> +  * for the one that corresponds to this attribute.
> +  */
> + for (missingnum = tupleDesc->constr->num_missing - 1;
> +  missingnum >= 0; missingnum--)
> + {
> + if (attrmiss[missingnum].amnum == attnum)
> + {
> + if (attrmiss[missingnum].ammissingNull)
> + {
> + *isnull = true;
> + return (Datum) 0;
> + }
> + else
> + {
> + *isnull = false;
> + return attrmiss[missingnum].ammissing;
> + }
> + }
> + }

I think requiring this is a bad idea. If, and only if, there are default
attributes with missing values, we should build an array that can be
directly indexed by attribute number, accessing the missing value.  What
you're doing here is essentially O(n^2) afaict, with n being the number
of columns with missing values.
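
Something along these lines, purely as an illustrative sketch and not
actual PostgreSQL code:

#include <stdbool.h>
#include <stdint.h>

typedef uintptr_t Datum;        /* stand-in for PostgreSQL's Datum */

typedef struct MissingSlot
{
    bool        hasmissing;     /* is a "missing" value stored for this column? */
    bool        isnull;         /* the stored missing value is NULL */
    Datum       value;          /* otherwise, the datum itself */
} MissingSlot;

/* hypothetical: one entry per attribute, built once per TupleDesc */
static MissingSlot *missing_by_attnum;

static Datum
get_missing_attr(int attnum, bool *isnull)
{
    MissingSlot *slot = &missing_by_attnum[attnum - 1];

    if (!slot->hasmissing || slot->isnull)
    {
        *isnull = true;
        return (Datum) 0;
    }
    *isnull = false;
    return slot->value;
}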


> +/*
> + * Fill in missing values for a TupleTableSlot
> + */
> +static void
> +slot_getmissingattrs(TupleTableSlot *slot, int startAttNum, int lastAttNum)
> +{
> + AttrMissing *attrmiss;
> + int missingnum;
> + int missattnum;
> + int i;
> +
> + /* initialize all the missing attributes to null */
> + for (missattnum = lastAttNum - 1; missattnum >= startAttNum; 
> missattnum--)
> + {
> + slot->tts_values[missattnum] = PointerGetDatum(NULL);
> + slot->tts_isnull[missattnum] = true;
> + }

Any reason not to memset this in one (well two) go with memset? Also,
how come you're using PointerGetDatum()? Normally we just use (Datum) 0,
imo it's confusing to use PointerGetDatum(), as it implies things that
are quite possibly not the case.


> +/*
> + * Fill in one attribute, either a data value or a bit in the null bitmask
> + */

Imo this isn't sufficient documentation to explain what this function
does.  Even if you just add "per-attribute helper for heap_fill_tuple
and other routines building tuples" or such, I think it'd be clearer.

> +static inline void
> +fill_val(Form_pg_attribute att,
> +  bits8 **bit,
> +  int *bitmask,
> +  char **dataP,
> +  uint16 *infomask,
> +  Datum datum,
> +  bool isnull)
> +{
> + Sizedata_length;
> + char   *data = *dataP;
> +
> + /*
> +  * If we're building a null bitmap, set the appropriate bit for the
> +  * current 

Re: [HACKERS] asynchronous execution

2018-02-21 Thread Kyotaro HORIGUCHI
At Thu, 11 Jan 2018 17:08:39 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
 wrote in 
<20180111.170839.23674040.horiguchi.kyot...@lab.ntt.co.jp>
> At Mon, 11 Dec 2017 20:07:53 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
>  wrote in 
> <20171211.200753.191768178.horiguchi.kyot...@lab.ntt.co.jp>
> > > The attached PoC patch theoretically has no impact on the normal
> > > code paths and just brings gain in async cases.
> > 
> > The parallel append just committed hit this and the attached are
> > the rebased version to the current HEAD. The result of a concise
> > performance test follows.
> > 
> >patched(ms)  unpatched(ms)   gain(%)
> > A: simple table scan :  3562.32  3444.81-3.4
> > B: local partitioning:  1451.25  1604.38 9.5
> > C: single remote table   :  8818.92  9297.76 5.1
> > D: sharding (single con) :  5966.14  6646.7310.2
> > E: sharding (multi con)  :  1802.25  6515.4972.3
> > 
> > > A and B are degradation checks, which are expected to show no
> > > degradation.  C is the gain only by postgres_fdw's command
> > > presending on a remote table.  D is the gain of sharding on a
> > > connection. The number of partitions/shards is 4.  E is the gain
> > > using dedicate connection per shard.
> > 
> > Test A is accelerated by parallel sequential scan. Introducing
> > parallel append accelerates test B. Comparing A and B, I doubt
> > that degradation is stably measurable at least my environment but
> > I believe that there's no degradation theoreticaly. The test C to
> > E still shows apparent gain.
> > regards,
> 
> The patch conflicts with 3cac0ec. This is the rebased version.

It hadn't been workable itself for a long time.

- Rebased to current master.
  (Removed some wrongly-inserted lines)
- Fixed wrong-positioned  assertion in postgres_fdw.c
  (Caused assertion failure on normal usage)
- Properly reset persistent (static) variable.
  (Caused SEGV under certain condition)
- Fixed explain output of async-mixed append plan.
  (Choose proper subnode as the referent node)

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

>From 6ab58d3fb02f716deaa207824747646dd8c2a448 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Mon, 22 May 2017 12:42:58 +0900
Subject: [PATCH 1/3] Allow wait event set to be registered to resource owner

WaitEventSet needs to be released using resource owner for a certain
case. This change adds WaitEventSet reowner and allow the creator of a
WaitEventSet to specify a resource owner.
---
 src/backend/libpq/pqcomm.c|  2 +-
 src/backend/storage/ipc/latch.c   | 18 ++-
 src/backend/storage/lmgr/condition_variable.c |  2 +-
 src/backend/utils/resowner/resowner.c | 68 +++
 src/include/storage/latch.h   |  4 +-
 src/include/utils/resowner_private.h  |  8 
 6 files changed, 97 insertions(+), 5 deletions(-)

diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index a4f6d4d..890972b 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -220,7 +220,7 @@ pq_init(void)
 (errmsg("could not set socket to nonblocking mode: %m")));
 #endif
 
-	FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, 3);
+	FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, NULL, 3);
 	AddWaitEventToSet(FeBeWaitSet, WL_SOCKET_WRITEABLE, MyProcPort->sock,
 	  NULL, NULL);
 	AddWaitEventToSet(FeBeWaitSet, WL_LATCH_SET, -1, MyLatch, NULL);
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index e6706f7..5457899 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -51,6 +51,7 @@
 #include "storage/latch.h"
 #include "storage/pmsignal.h"
 #include "storage/shmem.h"
+#include "utils/resowner_private.h"
 
 /*
  * Select the fd readiness primitive to use. Normally the "most modern"
@@ -77,6 +78,8 @@ struct WaitEventSet
 	int			nevents;		/* number of registered events */
 	int			nevents_space;	/* maximum number of events in this set */
 
+	ResourceOwner	resowner;	/* Resource owner */
+
 	/*
 	 * Array, of nevents_space length, storing the definition of events this
 	 * set is waiting for.
@@ -359,7 +362,7 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
 	int			ret = 0;
 	int			rc;
 	WaitEvent	event;
-	WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, 3);
+	WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, NULL, 3);
 
 	if (wakeEvents & WL_TIMEOUT)
 		Assert(timeout >= 0);
@@ -517,12 +520,15 @@ ResetLatch(volatile Latch *latch)
  * WaitEventSetWait().
  */
 WaitEventSet *
-CreateWaitEventSet(MemoryContext context, int nevents)
+CreateWaitEventSet(MemoryContext context, ResourceOwner res, int nevents)
 {
 	WaitEventSet *set;
 	char	   *data;
 	Size		sz = 0;
 
+	if 

RE: Speed up the removal of WAL files

2018-02-21 Thread Tsunakawa, Takayuki
From: Michael Paquier [mailto:mich...@paquier.xyz]
> It seems to me that you would reintroduce partially the problems that
> 1d4a0ab1 has fixed.  In short, if a crash happens in the code paths calling
> RemoveXlogFile with durable = false before fsync'ing pg_wal, then a rename
> has no guarantee to be durable, so you could finish again with a file that
> has an old name, but new contents.  A crucial thing which matters for a rename

Hmm, you're right.  Even during recovery, RemoveOldXlogFiles() can't skip 
fsyncing pg_wal/ because new WAL records streamed from the master are written 
to recycled WAL files.
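
For illustration, "durable" here means roughly the following (a
hypothetical helper, not the server's durable_rename()):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* rename() alone is not crash-safe: the renamed file and the directory
 * entry both have to reach stable storage before the rename can be
 * trusted to survive a crash. */
static int
rename_durably(const char *oldpath, const char *newpath, const char *dirpath)
{
    int         fd;

    if (rename(oldpath, newpath) < 0)
        return -1;

    /* flush the renamed file's contents */
    fd = open(newpath, O_RDONLY);
    if (fd < 0)
        return -1;
    if (fsync(fd) < 0)
    {
        close(fd);
        return -1;
    }
    close(fd);

    /* flush the directory entry change itself */
    fd = open(dirpath, O_RDONLY);
    if (fd < 0)
        return -1;
    if (fsync(fd) < 0)
    {
        close(fd);
        return -1;
    }
    close(fd);

    return 0;
}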

After all, it seems to me that we have to stay with the current patch, which 
only handles RemoveNonParentXlogFiles().


Regards
Takayuki Tsunakawa