Re: [HACKERS] path toward faster partition pruning

2018-02-07 Thread Amit Langote
On 2018/02/07 1:36, Robert Haas wrote:
> On Tue, Feb 6, 2018 at 4:55 AM, Amit Langote
>  wrote:
>>> I understand why COLLATION_MATCH thinks that a collation OID match is
>>> OK, but why is InvalidOid also OK?  Can you add a comment?  Maybe some
>>> test cases, too?
>>
>> partcollid == InvalidOid means the partition key is of uncollatable type,
>> so further checking the collation is unnecessary.
> 
> Yeah, but in that case wouldn't BOTH OIDs be InvalidOid, and thus the
> equality test would match anyway?

It seems that that's not necessarily true.  I remember copying that
logic from the following macro in indxpath.c:

#define IndexCollMatchesExprColl(idxcollation, exprcollation) \
((idxcollation) == InvalidOid || (idxcollation) == (exprcollation))

which was added by the following commit:

commit cb37c291060dd13b1a8ff61fceee09efcfbc34e1
Author: Tom Lane 
Date:   Thu Sep 29 00:43:42 2011 -0400

Fix index matching for operators with mixed collatable/noncollatable inputs.

If an indexable operator for a non-collatable indexed datatype has a
collatable right-hand input type, any OpExpr for it will be marked with a
nonzero inputcollid (since having one collatable input is sufficient to
make that happen).  However, an index on a non-collatable column certainly
doesn't have any collation.  This caused us to fail to match such
operators to their indexes, because indxpath.c required an exact match of
index collation and clause collation.  It seems correct to allow a match
when the index is collation-less regardless of the clause's inputcollid:
an operator with both noncollatable and collatable inputs could perhaps
depend on the collation of the collatable input, but it could hardly
expect the index for the noncollatable input to have that same collation.

[ ... ]

+ *      If the index has a collation, the clause must have the same collation.
+ *      For collation-less indexes, we assume it doesn't matter; this is
+ *      necessary for cases like "hstore ? text", wherein hstore's operators
+ *      don't care about collation but the clause will get marked with a
+ *      collation anyway because of the text argument.  (This logic is
+ *      embodied in the macro IndexCollMatchesExprColl.)
+ *


Discussion leading to the above commit occurred here:

https://www.postgresql.org/message-id/flat/201109282050.p8SKoA4O084649%40wwwmaster.postgresql.org

It seems that we can think similarly for partitioning and let the
partition pruning proceed with a clause even if the partition key is
non-collatable while the clause's other argument is collatable, even
though it seems we don't yet allow the kind of partitioning that would
lead to such a situation.
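
For what it's worth, here is a tiny self-contained sketch of the check
being discussed.  The macro name COLLATION_MATCH is the one Robert quoted
from the pruning patch; the Oid typedef and InvalidOid below are
simplified stand-ins for illustration, not the real PostgreSQL
definitions:

#include <stdio.h>

/* Simplified stand-ins for the PostgreSQL typedefs. */
typedef unsigned int Oid;
#define InvalidOid ((Oid) 0)

/*
 * A collation-less partition key (partcollid == InvalidOid) matches any
 * clause collation; otherwise the two collation OIDs must be equal.
 * This mirrors the IndexCollMatchesExprColl logic quoted above.
 */
#define COLLATION_MATCH(partcollid, collid) \
    (((partcollid) == InvalidOid) || ((partcollid) == (collid)))

int
main(void)
{
    /* non-collatable key (e.g. int): any clause collation is acceptable */
    printf("%d\n", COLLATION_MATCH(InvalidOid, (Oid) 100));     /* 1 */
    /* collatable key: the collations must agree exactly */
    printf("%d\n", COLLATION_MATCH((Oid) 100, (Oid) 100));      /* 1 */
    printf("%d\n", COLLATION_MATCH((Oid) 100, (Oid) 101));      /* 0 */
    return 0;
}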

Thanks,
Amit




Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2018-02-07 Thread Claudio Freire
On Thu, Feb 8, 2018 at 2:44 AM, Kyotaro HORIGUCHI
 wrote:
> Hello, I looked this a bit closer.
>
> In upthread[1] Robert mentioned the exponentially increasing size
> of additional segments.
>
>>> Hmm, I had imagined making all of the segments the same size rather
>>> than having the size grow exponentially.  The whole point of this is
>>> to save memory, and even in the worst case you don't end up with that
>>> many segments as long as you pick a reasonable base size (e.g. 64MB).
>>
>> Wastage is bound by a fraction of the total required RAM, that is,
>> it's proportional to the amount of required RAM, not the amount
>> allocated. So it should still be fine, and the exponential strategy
>> should improve lookup performance considerably.
>
> It seems that you are misunderstanding him.  (Anyway, I'm not sure
> what you meant by the above; not-yet-allocated memory won't be a
> waste.)  The conclusive number of dead tuples in a heap scan cannot be
> determined until the scan ends.  If a new dead tuple requires a new,
> say 512MB, segment and the scan ends just after that, the wastage will
> be almost the whole of that segment.

And the segment size is bound by a fraction of total needed memory.

When I said "allocated", I meant m_w_m. Wastage is not proportional to m_w_m.

> On the other hand, I don't think the exponential strategy makes
> things considerably better.  bsearch iterations in
> lazy_tid_reaped() are distributed between the segment search and the
> tid search.  Intuitively, the increased segment size more or less just
> moves some iterations from the former to the latter.
>
> I made a calculation[2].  With maintenance_work_mem of 4096MB, the
> number of segments is 6 and the expected number of bsearch iterations
> is about 20.8 for the exponential strategy.  With 64MB fixed-size
> segments, we will have 64 segments (which is not so many) and the
> expected iteration count is 20.4.  (I suppose the increase comes from
> the imbalanced sizes among segments.)  In addition to that, as Robert
> mentioned, the possible maximum memory wastage of the exponential
> strategy is about 2GB, versus 64MB for the fixed-size strategy.

That calculation has a slight bug in that it should be log2, and that
segment size is limited to 1GB at the top end.

But in any case, the biggest issue is that it's ignoring the effect of
cache locality.  The way the exponential strategy helps is by keeping
the segment list small enough to fit comfortably in fast cache memory,
while also keeping wastage at a minimum for small lists.  64MB segments
with 4GB m_w_m would make for about 2KB of segment list.  That fits in
L1 if there's nothing else contending for it, but it's already starting
to get big, and settings larger than 4GB m_w_m can be expected in
practice.

I guess I could tune the starting/ending sizes a bit.

Say, with an upper limit of 256M (instead of 1G), and after fixing the
other issues, we get:

exponential sized strategy
...
#18 : segsize=64MB total=4096MB, (tuples = 715827882, min tsize=18.8GB), iterseg(18)=4.169925, iteritem(11184810) = 23.415037, expected iter=29.491213


fixed sized strategy
...
#64 : segsize=64MB total=4096MB, (tuples = 715827882, min tsize=18.2GB), interseg(64)=6.00, iteritem(11184810) = 23.415037, expected iter=29.415037

Almost identical, and we get all the benefits of cache locality with
the exponential strategy.  The fixed strategy's segment list might fit
in L1, but that becomes less likely the bigger m_w_m is.

The scaling factor could also be tuned I guess, but I'm wary of using
anything other than a doubling strategy, since it might cause memory
fragmentation.
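
To make the two-level lookup concrete, here is a small self-contained
sketch (illustrative only, not the patch's actual code; the struct and
the toy data below merely stand in for the patch's DeadTuplesSegment
machinery): an outer binary search over the per-segment maxima picks one
segment, and an inner binary search finds the item within it.  The point
about cache locality is that only the small outer array of headers needs
to stay hot; at 4GB m_w_m, 64 fixed-size 64MB segments mean 64 such
headers (roughly the 2KB of segment list mentioned above), whereas
exponentially growing segments keep that list to a handful of entries.

#include <stdio.h>
#include <stdlib.h>

/* Toy stand-in for a dead-tuple segment; the real thing stores ItemPointers. */
typedef struct Segment
{
    int         last;           /* copy of the largest value in the segment */
    int         nitems;         /* number of values stored */
    int        *items;          /* sorted array of values */
} Segment;

static int
cmp_int(const void *a, const void *b)
{
    int         x = *(const int *) a;
    int         y = *(const int *) b;

    return (x > y) - (x < y);
}

/*
 * Two-level lookup: first a binary search over the (small) segment list
 * using each segment's last element, then a binary search inside the one
 * candidate segment.  With exponentially growing segments the outer list
 * stays tiny, while the total lookup cost remains O(log N).
 */
static int
lookup(Segment *segs, int nsegs, int key)
{
    int         lo = 0,
                hi = nsegs - 1;

    while (lo < hi)             /* find the first segment whose last >= key */
    {
        int         mid = (lo + hi) / 2;

        if (segs[mid].last < key)
            lo = mid + 1;
        else
            hi = mid;
    }
    if (segs[lo].last < key)
        return 0;               /* larger than everything stored */
    return bsearch(&key, segs[lo].items, segs[lo].nitems,
                   sizeof(int), cmp_int) != NULL;
}

int
main(void)
{
    /* two toy segments of sizes 4 and 8, mimicking exponential growth */
    int         a[] = {2, 3, 5, 7};
    int         b[] = {11, 13, 17, 19, 23, 29, 31, 37};
    Segment     segs[] = {{7, 4, a}, {37, 8, b}};

    printf("%d %d %d\n", lookup(segs, 2, 5), lookup(segs, 2, 12),
           lookup(segs, 2, 37));    /* prints: 1 0 1 */
    return 0;
}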



Re: JIT compiling with LLVM v10.0

2018-02-07 Thread Andres Freund
On 2018-02-08 11:50:17 +1300, Thomas Munro wrote:
> You are asking LLVM to dlopen(""), which doesn't work on my not-Linux,
> explaining the errors I reported in the older thread.  The portable
> way to dlopen the main binary is dlopen(NULL), so I think you need to
> pass NULL in to LLVMLoadLibraryPermanently(), even though that isn't
> really clear from any LLVM documentation I've looked at.

Ugh. Thanks for figuring that out, will incorporate!

Greetings,

Andres Freund



Fix a typo in walsender.c

2018-02-07 Thread atorikoshi

Hi,

Attached is a minor patch fixing a variable name in a comment:
s/progress_update/update_progress/


  ---include/server/replication/logical.h
  ...
  35 typedef struct LogicalDecodingContext
  36 {
  ...
  68 LogicalOutputPluginWriterUpdateProgress update_progress;

Regards,

--
Atsushi Torikoshi
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 130ecd5..d46374d 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1243,7 +1243,7 @@ WalSndWriteData(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid,
 }
 
 /*
- * LogicalDecodingContext 'progress_update' callback.
+ * LogicalDecodingContext 'update_progress' callback.
  *
  * Write the current position to the log tracker (see XLogSendPhysical).
  */


Disabling src/test/[ssl|ldap] when not building with SSL/LDAP support

2018-02-07 Thread Michael Paquier
Hi all,

In order to run tests consistently on the whole tree, I use a simple
alias which also tests things like src/test/ssl and src/test/ldap along
the way.

Lately, I have been getting annoyed by $subject when working on OpenSSL
stuff, as I sometimes need to test things with and without SSL support
to make sure that a patch is correctly shaped.  However, if the tree is
configured without SSL support, then src/test/ssl just fails.  The same
applies to src/test/ldap.

Could it be possible to disable them using something like the attached
if a build is done without SSL and/or LDAP?

Thanks,
--
Michael
diff --git a/configure.in b/configure.in
index 4d26034579..aee3ab0867 100644
--- a/configure.in
+++ b/configure.in
@@ -682,6 +682,7 @@ PGAC_ARG_BOOL(with, ldap, no,
   [build with LDAP support],
   [AC_DEFINE([USE_LDAP], 1, [Define to 1 to build with LDAP support. (--with-ldap)])])
 AC_MSG_RESULT([$with_ldap])
+AC_SUBST(with_ldap)
 
 
 #
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index d980f81046..dcb8dc5d90 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -186,6 +186,7 @@ with_tcl	= @with_tcl@
 with_openssl	= @with_openssl@
 with_selinux	= @with_selinux@
 with_systemd	= @with_systemd@
+with_ldap	= @with_ldap@
 with_libxml	= @with_libxml@
 with_libxslt	= @with_libxslt@
 with_system_tzdata = @with_system_tzdata@
diff --git a/src/test/ldap/Makefile b/src/test/ldap/Makefile
index fef5742b82..983b37e71a 100644
--- a/src/test/ldap/Makefile
+++ b/src/test/ldap/Makefile
@@ -14,10 +14,18 @@ top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
 check:
+ifeq ($(with_ldap),yes)
 	$(prove_check)
+else
+	@echo "No tests as LDAP is not supported."
+endif
 
 installcheck:
+ifeq ($(with_ldap),yes)
 	$(prove_installcheck)
+else
+	@echo "No tests as LDAP is not supported."
+endif
 
 clean distclean maintainer-clean:
 	rm -rf tmp_check
diff --git a/src/test/ssl/Makefile b/src/test/ssl/Makefile
index 4e9095529a..87872d07ae 100644
--- a/src/test/ssl/Makefile
+++ b/src/test/ssl/Makefile
@@ -131,7 +131,15 @@ clean distclean maintainer-clean:
 	rm -rf tmp_check
 
 check:
+ifeq ($(with_openssl),yes)
 	$(prove_check)
+else
+	@echo "No tests as OpenSSL is not supported."
+endif
 
 installcheck:
+ifeq ($(with_openssl),yes)
 	$(prove_installcheck)
+else
+	@echo "No tests as OpenSSL is not supported."
+endif




Re: postgres_fdw: perform UPDATE/DELETE .. RETURNING on a join directly

2018-02-07 Thread Etsuro Fujita

(2018/02/08 10:40), Robert Haas wrote:

On Wed, Feb 7, 2018 at 6:01 PM, Tom Lane  wrote:

The buildfarm's opinion of it is lower than yours.  Just eyeballing
the failures, I'd say there was some naivete about the reproducibility
of tuple CTIDs across different platforms.  Is there a good reason
these test cases need to print CTID?


Uggh, I missed the fact that they were doing that.  It's probably
actually useful test coverage, but it's not surprising that it isn't
stable.


That was my purpose, but I agree about the instability.  Thanks again,
Robert!


Best regards,
Etsuro Fujita



Re: postgres_fdw: perform UPDATE/DELETE .. RETURNING on a join directly

2018-02-07 Thread Etsuro Fujita

(2018/02/08 5:39), Robert Haas wrote:

On Thu, Jan 11, 2018 at 2:58 AM, Etsuro Fujita
  wrote:

(2017/12/27 20:55), Etsuro Fujita wrote:

Attached is an updated version of the patch.


I revised code/comments a little bit.  PFA new version.


I spent a while reading through this today.  I see a few decisions
here or there that are debatable, in the sense that somebody else
might have chosen to do it differently, but I don't see anything that
actually looks wrong.  So, committed.


Thanks for committing, Robert!  Thanks for reviewing, Rushabh and Ashutosh!

Best regards,
Etsuro Fujita



Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2018-02-07 Thread Kyotaro HORIGUCHI
Hello, I looked this a bit closer.

In upthread[1] Robert mentioned the exponentially increasing size
of additional segments.

>> Hmm, I had imagined making all of the segments the same size rather
>> than having the size grow exponentially.  The whole point of this is
>> to save memory, and even in the worst case you don't end up with that
>> many segments as long as you pick a reasonable base size (e.g. 64MB).
>
> Wastage is bound by a fraction of the total required RAM, that is,
> it's proportional to the amount of required RAM, not the amount
> allocated. So it should still be fine, and the exponential strategy
> should improve lookup performance considerably.

It seems that you are misunderstanding him.  (Anyway, I'm not sure
what you meant by the above; not-yet-allocated memory won't be a
waste.)  The conclusive number of dead tuples in a heap scan cannot be
determined until the scan ends.  If a new dead tuple requires a new,
say 512MB, segment and the scan ends just after that, the wastage will
be almost the whole of that segment.

On the other hand, I don't think the exponential strategy makes
things considerably better.  bsearch iterations in
lazy_tid_reaped() are distributed between the segment search and the
tid search.  Intuitively, the increased segment size more or less just
moves some iterations from the former to the latter.

I made a calculation[2].  With maintenance_work_mem of 4096MB, the
number of segments is 6 and the expected number of bsearch iterations
is about 20.8 for the exponential strategy.  With 64MB fixed-size
segments, we will have 64 segments (which is not so many) and the
expected iteration count is 20.4.  (I suppose the increase comes from
the imbalanced sizes among segments.)  In addition to that, as Robert
mentioned, the possible maximum memory wastage of the exponential
strategy is about 2GB, versus 64MB for the fixed-size strategy.

Seeing these numbers, I'm not inclined to adopt the exponential
strategy.


[1] 
https://www.postgresql.org/message-id/cagtbqpbzx5s4qrnb6yp-2nk+a9bxbavktzkwsgvmeov3mtg...@mail.gmail.com

[2] See attached perl script. I hope it is correct.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
#! /usr/bin/perl

$maxmem=1024 * 4;
#=
print "exponential sized strategy\n";
$ss = 64;
$ts = 0;
$sumiteritem = 0;
for ($i = 1 ; $ts < $maxmem ; $i++) {
    $ss = $ss * 2;
    if ($ts + $ss > $maxmem) {
        $ss = $maxmem - $ts;
    }
    $ts += $ss;
    $ntups = $ts*1024*1024 / 6;
    $ntupinseg = $ss*1024*1024 / 6;
    $npages = $ntups / 291;
    $tsize = $npages * 8192.0 / 1024 / 1024 / 1024;
    $sumiteritem += log($ntupinseg) * $ntupinseg; # weight by percentage in all tuples
    printf("#%d : segsize=%dMB total=%dMB, (tuples = %ld, min tsize=%.1fGB), iterseg(%d)=%f, iteritem(%d) = %f, expected iter=%f\n",
           $i, $ss, $ts, $ntups, $tsize,
           $i, log($i), $ntupinseg, log($ntupinseg), log($i) + $sumiteritem/$ntups);
}

print "\n\nfixed sized strategy\n";
$ss = 64;
$ts = 0;
for ($i = 1 ; $ts < $maxmem ; $i++) {
    $ts += $ss;
    $ntups = $ts*1024*1024 / 6;
    $ntupinseg = $ss*1024*1024 / 6;
    $npages = $ntups / 300;
    $tsize = $npages * 8192.0 / 1024 / 1024 / 1024;
    printf("#%d : segsize=%dMB total=%dMB, (tuples = %ld, min tsize=%.1fGB), interseg(%d)=%f, iteritem(%d) = %f, expected iter=%f\n",
           $i, $ss, $ts, $ntups, $tsize,
           $i, log($i), $ntupinseg, log($ntupinseg), log($i) + log($ntupinseg));
}




Re: [HACKERS] advanced partition matching algorithm for partition-wise join

2018-02-07 Thread Amit Langote
On 2018/02/08 11:55, Amit Langote wrote:
> Hi Ashutosh.
> 
> On 2018/02/07 13:51, Ashutosh Bapat wrote:
>> Here's a new patchset with following changes
>>
>> 1. Rebased on the latest head taking care of partition bound
>> comparison function changes
> 
> I was about to make these changes myself while revising the fast pruning
> patch.  Instead, I decided to take a look at your patch and try to use it
> in my tree.

I also noticed that a later patch adds partsupfunc to PartitionScheme,
which the pruning patch needs too.  So perhaps it would be nice to take
out that portion of the patch -- that is, the changes to the
PartitionScheme struct definition and those to find_partition_scheme().

Regarding the latter, wouldn't it be nice to have a comment before the
code that does the copying, explaining why we don't compare the
partsupfunc field to decide whether we have a match?  I understand it's
because the partsupfunc array contains pointers, not OIDs.  But maybe
that's too obvious to warrant a comment.

Thanks,
Amit




Re: it's a feature, but it feels like a bug

2018-02-07 Thread David Fetter
On Wed, Feb 07, 2018 at 10:26:50PM -0500, Tom Lane wrote:
> Rafal Pietrak  writes:
> > ztk=# create table test (a int, b int, c int, d bool, e int, primary key
> > (a,b,c,d));
> > CREATE TABLE
> > ztk=# create unique index leftone on test (a,b) where d is true;
> > CREATE INDEX
> > ztk=# create unique index rightone on test (b,c) where d is false;
> > CREATE INDEX
> > ztk=# alter table ONLY test ADD CONSTRAINT e2b_fk FOREIGN KEY (a,e)
> > REFERENCES test(a,b) ON UPDATE CASCADE;
> > ERROR:  there is no unique constraint matching given keys for referenced
> > table "test"
> 
> > And it is sort of "counterintuitive" - as you can see, there is a UNIQUE
> > index for the test(a,b) target; admittedly partial, but why should that
> > matter?
> 
> Because the index fails to guarantee uniqueness of (a,b) in rows where d
> isn't true.  There could be many duplicates in such rows, possibly even of
> (a,b) pairs that also appear --- though only once --- in rows where d is
> true.
> 
> If there were a way to say that the FK is only allowed to reference rows
> where d is true, then this index could support an FK like that.  But
> there's no way to express such a thing in SQL.

There will be as soon as we implement ASSERTIONs.

> Personally I'd think about separating your rows-where-d-is-true into
> their own table, which could have a normal PK index.  You could
> still create a union view over that table and the one with the other
> rows to satisfy whatever queries want to think the two kinds of rows
> are the same thing.  But I'd offer that if one set of rows has (a,b)
> as a PK and the other does not, they are not really the same kind of
> thing.

Another way might be to partition the table on the boolean and make a
foreign key to the "true" partition, e.g.:

CREATE TABLE foo(b BOOLEAN, i INTEGER NOT NULL, t TEXT NOT NULL) PARTITION BY LIST (b);
CREATE TABLE foo_true PARTITION OF foo (PRIMARY KEY(i, t)) FOR VALUES IN ('true');
CREATE TABLE bar(foo_i INTEGER NOT NULL, foo_t TEXT NOT NULL, FOREIGN KEY(foo_i, foo_t) REFERENCES foo_true);

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate



Re: [HACKERS] [PATCH] Vacuum: Update FSM more frequently

2018-02-07 Thread Masahiko Sawada
On Tue, Feb 6, 2018 at 9:51 PM, Claudio Freire  wrote:
> On Tue, Feb 6, 2018 at 4:56 AM, Masahiko Sawada  wrote:
>> On Tue, Feb 6, 2018 at 2:55 AM, Claudio Freire  
>> wrote:
>>> On Mon, Feb 5, 2018 at 1:53 AM, Masahiko Sawada  
>>> wrote:
 On Fri, Feb 2, 2018 at 11:13 PM, Claudio Freire  
 wrote:
> After autovacuum gets cancelled, the next time it wakes up it will
> retry vacuuming the cancelled relation. That's because a cancelled
> autovacuum doesn't update the last-vacuumed stats.
>
> So the timing between an autovacuum work item and the next retry for
> that relation is more or less an autovacuum nap time, except perhaps
> in the case where many vacuums get cancelled, and they have to be
> queued.

 I think that's not true if there are multiple databases.
>>>
>>> I'd have to re-check.
>>>
>>> The loop is basically, IIRC:
>>>
>>> while(1) { vacuum db ; work items ; nap }
>>>
>>> Now, if that's vacuum one db, not all, and if the decision on the
>>> second run doesn't pick the same db because that big table failed to
>>> be vacuumed, then you're right.
>>>
>>> In that case we could add the FSM vacuum as a work item *in addition*
>>> to what this patch does. If the work item queue is full and the FSM
>>> vacuum doesn't happen, it'd be no worse than with the patch as-is.
>>>
>>> Is that what you suggest?
>>
>> That's one of my suggestions.  I might have to consider this issue
>> for each case.  To be clear, let me summarize each case.
>>
>> For a table with indexes, vacuuming the heap's FSM is done after
>> lazy_vacuum_heap().  Therefore, I think the case where a table got
>> vacuumed but its FSM couldn't get vacuumed doesn't happen unless the
>> autovacuum gets cancelled before or while vacuuming the FSM.
>
> Well, that's the whole point.  Autovacuums get cancelled all the time
> in highly contended tables.  I have more than a handful of tables in
> production that never finish autovacuum, so I have to trigger manual
> vacuums periodically.
>
>> (1) using an autovacuum work-item, (2) vacuuming the table's FSM in
>> PG_CATCH, (3) remembering which tables got cancelled and vacuuming them
>> after finishing a loop over table_oids.
> ...
>> So I'm in favor of (3).
> ...
>> However, it's quite possible that I'm not seeing the whole picture here.
>
> Well, none of that handles the case where the autovacuum of a table
> doesn't get cancelled, but takes a very long time.
>
> No free space becomes visible during long-running vacuums. That means
> bloat keeps accumulating even though vacuum is freeing space, because
> the FSM doesn't expose that free space.
>
> The extra work incurred in those FSM vacuums isn't useless, it's
> preventing bloat in long-running vacuums.
>
> I can look into doing 3, that *might* get rid of the need to do that
> initial FSM vacuum, but all other intermediate vacuums are still
> needed.

Understood.  So how about this patch focusing only on making FSM vacuum
happen more frequently, and leaving out the initial FSM vacuum and the
handling of cancellation cases?  The rest can be a separate patch.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: SSL test names

2018-02-07 Thread Michael Paquier
On Wed, Feb 07, 2018 at 11:54:52AM -0500, Peter Eisentraut wrote:
> Here is a patch that gives the tests in the SSL test suite proper names
> instead of just writing out the connection strings.  So instead of
> 
> # running client tests
> # test that the server doesn't accept non-SSL connections
> ok 1 - sslmode=disable (should fail)
> # connect without server root cert
> ok 2 - sslrootcert=invalid sslmode=require
> ok 3 - sslrootcert=invalid sslmode=verify-ca (should fail)
> ok 4 - sslrootcert=invalid sslmode=verify-full (should fail)
> 
> you get something like
> 
> # running client tests
> ok 1 - server doesn't accept non-SSL connections
> ok 2 - connect without server root cert sslmode=require
> ok 3 - connect without server root cert sslmode=verify-ca
> ok 4 - connect without server root cert sslmode=verify-full
> ok 5 - connect with wrong server root cert sslmode=require
> ok 6 - connect with wrong server root cert sslmode=verify-ca
> ok 7 - connect with wrong server root cert sslmode=verify-full
> 
> I have found the old way very confusing while working with several
> SSL-related patches recently.

No objections against that.

You need to update the comment on top of test_connect_ok in
ServerSetup.pm.  Wouldn't it be better to use the expected result
as an argument and merge test_connect_ok and test_connect_fails?
--
Michael




Re: [HACKERS] MERGE SQL Statement for PG11

2018-02-07 Thread Pavan Deolasee
On Wed, Feb 7, 2018 at 10:52 PM, Peter Geoghegan  wrote:

>
>
> Apparently there is a pending patch to do better there - the commit
> message of 2f178441 refers to this.
>
>
Thanks. Will look at it.


> > The revised version also supports subqueries in SET targetlist as well as
> > WHEN AND conditions. As expected, this needed a minor tweak in
> > query/expression mutators. So once I worked on that for partitioned
> > tables,
> > subqueries started working with very minimal adjustments elsewhere. Other
> > things such as whole-var references are still broken. A few new tests are
> > also added.
>
> Great!
>
> Wholerow references are expected to be a bit trickier. See commit
> ad227837 for some hints on how you could fix this.
>

Thanks again.


>
> > Next I am going to look at RLS. While I've no prior experience with RLS,
> > I
> > am expecting it to be less cumbersome in comparison to partitioning. I am
> > out of action till Monday and may not be able to respond in time. But any
> > reviews and comments are appreciated as always.
>
> I don't think that RLS support will be particularly challenging. It
> might take a while.
>

Ok.  I would start by writing a test case to check what works and what
doesn't with the current patch, and work from there.  My understanding
of RLS is limited right now, but from what I've seen in the code (while
hacking on other stuff), my guess is it will require us to evaluate a
set of quals and then decide on the action.


>
> If your rapid progress here is any indication, most open items are not
> likely to be particularly challenging. Once again, I suggest that a
> good area for us to focus on is the semantics of READ COMMITTED
> conflict handling.


I understand getting the EPQ semantics right is very important.  Can
you please (once again) summarise your thoughts on what you think is
the *most* appropriate behaviour?  I can then think about how much
effort might be involved in that.  If the effort is disproportionately
high, we can discuss settling for some not-so-nice semantics, like we
apparently did for partition key updates.

I am sorry -- I know you and Simon have probably done that a few times
already, and I should really study those proposals first.  So it's okay
if you don't want to repeat them; I will work on this next week once I
am back from holidays.

> Maybe you'd prefer to just blast through the
> simpler open items, which is fine, but do bear in mind that the EPQ
> and EPQ-adjacent stuff is probably going to be the thing that makes or
> breaks this patch for v11.
>

TBH I did not consider partitioning any less complex, and it was indeed
very complex, requiring at least 3 reworks by me.  And from what I
understood, it would have been a blocker too.  So are subquery handling
and RLS.  That's why I focused on addressing those items while you and
Simon were still debating the EPQ semantics.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2018-02-07 Thread Claudio Freire
On Thu, Feb 8, 2018 at 12:13 AM, Claudio Freire  wrote:
> On Wed, Feb 7, 2018 at 11:29 PM, Alvaro Herrera  
> wrote:
>> Claudio Freire wrote:
>>> On Wed, Feb 7, 2018 at 8:52 PM, Alvaro Herrera  
>>> wrote:
>>> >> Waiting as you say would be akin to what the patch does by putting
>>> >> vacuum on its own parallel group.
>>> >
>>> > I don't think it's the same.  We don't need to wait until all the
>>> > concurrent tests are done -- we only need to wait until the transactions
>>> > that were current when the delete finished are done, which is very
>>> > different since each test runs tons of small transactions rather than
>>> > one single big transaction.
>>>
>>> Um... maybe "lock pg_class" ?
>>
>> I was thinking of first doing
>>   SELECT array_agg(DISTINCT virtualtransaction) vxids
>> FROM pg_locks \gset
>>
>> and then in a DO block loop until
>>
>>SELECT DISTINCT virtualtransaction
>>  FROM pg_locks
>> INTERSECT
>>SELECT (unnest(:'vxids'::text[]));
>>
>> returns empty; something along those lines.
>
> Isn't it the same though?
>
> I can't think how a transaction wouldn't be holding at least an access
> share on pg_class.

Never mind, I just saw the error of my ways.

I don't like looping, though; it seems overly cumbersome.  What's
worse: maintaining that fragile, weird loop that might break (by causing
unexpected output), or a slight slowdown of the test suite?  I don't
know how long it might take on slow machines, but on my machine, which
isn't a great one, the vacuum test, while indeed not fast, is just a
tiny fraction of what a simple "make check" takes.  So it's not a huge
slowdown in any case.

I'll give it some thought, maybe there's a simpler way.



Re: it's a feature, but it feels like a bug

2018-02-07 Thread Tom Lane
Rafal Pietrak  writes:
> ztk=# create table test (a int, b int, c int, d bool, e int, primary key
> (a,b,c,d));
> CREATE TABLE
> ztk=# create unique index leftone on test (a,b) where d is true;
> CREATE INDEX
> ztk=# create unique index rightone on test (b,c) where d is false;
> CREATE INDEX
> ztk=# alter table ONLY test ADD CONSTRAINT e2b_fk FOREIGN KEY (a,e)
> REFERENCES test(a,b) ON UPDATE CASCADE;
> ERROR:  there is no unique constraint matching given keys for referenced
> table "test"

> And it is sort of "counterintuitive" - as you can see, there is a UNIQUE
> index for the test(a,b) target; admittedly partial, but why should that
> matter?

Because the index fails to guarantee uniqueness of (a,b) in rows where d
isn't true.  There could be many duplicates in such rows, possibly even of
(a,b) pairs that also appear --- though only once --- in rows where d is
true.

If there were a way to say that the FK is only allowed to reference rows
where d is true, then this index could support an FK like that.  But
there's no way to express such a thing in SQL.

Personally I'd think about separating your rows-where-d-is-true into
their own table, which could have a normal PK index.  You could still
create a union view over that table and the one with the other rows
to satisfy whatever queries want to think the two kinds of rows
are the same thing.  But I'd offer that if one set of rows has (a,b)
as a PK and the other does not, they are not really the same kind
of thing.

regards, tom lane



Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2018-02-07 Thread Claudio Freire
On Wed, Feb 7, 2018 at 11:29 PM, Alvaro Herrera  wrote:
> Claudio Freire wrote:
>> On Wed, Feb 7, 2018 at 8:52 PM, Alvaro Herrera  
>> wrote:
>> >> Waiting as you say would be akin to what the patch does by putting
>> >> vacuum on its own parallel group.
>> >
>> > I don't think it's the same.  We don't need to wait until all the
>> > concurrent tests are done -- we only need to wait until the transactions
>> > that were current when the delete finished are done, which is very
>> > different since each test runs tons of small transactions rather than
>> > one single big transaction.
>>
>> Um... maybe "lock pg_class" ?
>
> I was thinking of first doing
>   SELECT array_agg(DISTINCT virtualtransaction) vxids
> FROM pg_locks \gset
>
> and then in a DO block loop until
>
>SELECT DISTINCT virtualtransaction
>  FROM pg_locks
> INTERSECT
>SELECT (unnest(:'vxids'::text[]));
>
> returns empty; something along those lines.

Isn't it the same though?

I can't think how a transaction wouldn't be holding at least an access
share on pg_class.



Re: PostgreSQL crashes with SIGSEGV

2018-02-07 Thread Peter Geoghegan
On Wed, Feb 7, 2018 at 4:41 PM, Tom Lane  wrote:
> Peter Geoghegan  writes:
>> It would be nice to get an opinion on this mode_final() + tuplesort
>> memory lifetime business from you, Tom.
>
> I'm fairly sure that that bit in mode_final() was just a hack to make
> it work.  If there's a better, more principled way, let's go for it.

This is the more principled way. It is much easier to make every
single tuplesort caller on every release branch follow this rule (or
those on 9.5+, at least).

My big concern with making tuplesort_getdatum() deliberately copy into
caller's context is that that could introduce memory leaks in caller
code that evolved in a world where tuplesort_end() frees
pass-by-reference datum memory. Seems like the only risky case is
certain ordered set aggregate code, though. And, it's probably just a
matter of fixing the comments. I'll read through the bug report thread
that led up to October's commit be0ebb65 [1] tomorrow, to make sure of
this.

I just noticed that process_ordered_aggregate_single() says that the
behavior I propose for tuplesort_getdatum() is what it *already*
expects:

/*
 * Note: if input type is pass-by-ref, the datums returned by the sort are
 * freshly palloc'd in the per-query context, so we must be careful to
 * pfree them when they are no longer needed.
 */

This supports the idea that the ordered set aggregate code is the odd
one out when it comes to beliefs about tuplesort memory contexts --
even just among tuplesort_getdatum() callers, and even more so among
tuplesort callers in general.  One simple rule for all tuplesort fetch
routines is the high-level goal here.

>> Note that you removed the quoted comment within be0ebb65, back in October.
>
> There were multiple instances of basically that same comment before.
> AFAICS I just consolidated them into one place, in the header comment for
> ordered_set_shutdown.

I see. So, to restate my earlier remarks in terms of the newer
organization: The last line in that header comment will need to be
removed, since it will become false under my scheme. The line is:
"Note that many of the finalfns could *not* free the tuplesort object,
at least not without extra data copying, because what they return is a
pointer to a datum inside the tuplesort object".

[1] 
https://www.postgresql.org/message-id/flat/CAB4ELO5RZhOamuT9Xsf72ozbenDLLXZKSk07FiSVsuJNZB861A%40mail.gmail.com#cab4elo5rzhoamut9xsf72ozbendllxzksk07fisvsujnzb8...@mail.gmail.com
-- 
Peter Geoghegan



Re: Add more information_schema columns

2018-02-07 Thread Michael Paquier
On Wed, Feb 07, 2018 at 10:50:12AM -0500, Peter Eisentraut wrote:
> Committed with the separate entries.

Thanks.  The result looks fine to me.
--
Michael




Re: [HACKERS] advanced partition matching algorithm for partition-wise join

2018-02-07 Thread Amit Langote
Hi Ashutosh.

On 2018/02/07 13:51, Ashutosh Bapat wrote:
> Here's a new patchset with following changes
> 
> 1. Rebased on the latest head taking care of partition bound
> comparison function changes

I was about to make these changes myself while revising the fast pruning
patch.  Instead, I decided to take a look at your patch and try to use it
in my tree.

I looked at the patch 0001 and noticed that git diff --check says:

src/backend/catalog/partition.c:2900: trailing whitespace.
+partition_rbound_datum_cmp(FmgrInfo *partsupfunc, Oid *partcollation,

Also, it might be a good idea to briefly describe the new arguments in
the header comment -- something like noting that they are PartitionKey
elements.

Thanks,
Amit




it's a feature, but it feels like a bug

2018-02-07 Thread Rafal Pietrak
Hi,

I've bumped onto the following problem:
-screenshot
ztk=# create table test (a int, b int, c int, d bool, e int, primary key
(a,b,c,d));
CREATE TABLE
ztk=# create unique index leftone on test (a,b) where d is true;
CREATE INDEX
ztk=# create unique index rightone on test (b,c) where d is false;
CREATE INDEX
ztk=# alter table ONLY test ADD CONSTRAINT e2b_fk FOREIGN KEY (a,e)
REFERENCES test(a,b) ON UPDATE CASCADE;
ERROR:  there is no unique constraint matching given keys for referenced
table "test"
ztk=#
---

BTW: the "rightone" index above only "reflects my actual schema" and
does not give much to the axample.

Now I know it's a "feature".

But it really hurts my dataset/schema "coexistence".

And it is sort of "counterintuitive" - as you can see, there is a UNIQUE
index for the test(a,b) target; admittedly partial, but why should that
matter?  A unique index is there to indicate exactly one row as the
target for the FK.  This is what the LEFTONE index does.  Surely, some
part of the TEST table will never be accessible as an FK target for
TEST(A,E); but that (in the case of my dataset) is exactly what is
expected.

Thus, it feels like a bug.

So my questions are:
1. Is there a well-known reason for not allowing it (i.e. for ignoring
partial unique indexes when looking up an FK support index)?
2. I hope there isn't; so, would it break things if I tried to patch the
sources to allow it?
3. I admit I'm not much of a hacker, but at this point I'm so desperate
to have this done that I'll try anyway.  So, if the above is doable at
all, where should I start reading the sources?

Thank you,

-R



Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2018-02-07 Thread Alvaro Herrera
Claudio Freire wrote:
> On Wed, Feb 7, 2018 at 8:52 PM, Alvaro Herrera  
> wrote:
> >> Waiting as you say would be akin to what the patch does by putting
> >> vacuum on its own parallel group.
> >
> > I don't think it's the same.  We don't need to wait until all the
> > concurrent tests are done -- we only need to wait until the transactions
> > that were current when the delete finished are done, which is very
> > different since each test runs tons of small transactions rather than
> > one single big transaction.
> 
> Um... maybe "lock pg_class" ?

I was thinking of first doing
  SELECT array_agg(DISTINCT virtualtransaction) vxids
FROM pg_locks \gset

and then in a DO block loop until

   SELECT DISTINCT virtualtransaction
 FROM pg_locks
INTERSECT
   SELECT (unnest(:'vxids'::text[]));

returns empty; something along those lines.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] A design for amcheck heapam verification

2018-02-07 Thread Peter Geoghegan
On Mon, Feb 5, 2018 at 12:55 PM, Peter Geoghegan  wrote:
> Anyway, parallel CREATE INDEX added a new "scan" argument to
> IndexBuildHeapScan(), which caused this patch to bitrot. At a minimum,
> an additional NULL argument should be passed by amcheck. However, I
> have a better idea.
>
> ISTM that verify_nbtree.c should manage the heap scan itself, in the
> style of parallel CREATE INDEX. It can acquire its own MVCC snapshot
> for bt_index_check() (which pretends to be a CREATE INDEX
> CONCURRENTLY). There can be an MVCC snapshot acquired per index
> verified, a snapshot that is under the direct control of amcheck. The
> snapshot would be acquired at the start of verification on an index
> (when "heapallindex = true"), before the verification of the index
> structure even begins, and released at the very end of verification.

Attached patch fixes the parallel index build bitrot in this way. This
is version 6 of the patch.

This approach resulted in a nice reduction in complexity:
bt_index_check() and bt_index_parent_check() heapallindexed
verification operations both work in exactly the same way now, except
that bt_index_check() imitates a CREATE INDEX CONCURRENTLY (to match
the heavyweight relation locks acquired). This doesn't really need to
be explained as a special case anymore; bt_index_parent_check() is
like an ordinary CREATE INDEX, without any additional "TransactionXmin
heap tuple xmin recheck" complication.

A further benefit is that this makes running bt_index_check() checks
against many indexes more thorough, and easier to reason about. Users
won't have to worry about TransactionXmin becoming very stale when
many indexes are verified within a single command.

I made the following additional, unrelated changes based on various feedback:

* Faster modulo operations.

Andrey Borodin suggested that I make k_hashes() do fewer modulo
operations. While I don't want to change the algorithm to make this
happen, the overhead has been reduced. Modulo operations are now
performed through bitwise AND operations, which is possible only
because the bitset size is always a power of two. Apparently this is a
fairly common optimization for Bloom filters that use (enhanced)
double-hashing; we might as well do it this way.

I've really just transcribed the enhanced double hashing pseudo-code
from the Georgia Tech/Dillinger & Manolios paper into C code, so no
real change there; bloomfilter.c's k_hashes() is still closely based
on "5.2 Enhanced Double Hashing" from that same paper. Experience
suggests that we ought to be very conservative about developing novel
hashing techniques. Paranoid, even.
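
For illustration only (this is not the patch's bloomfilter.c; the two
starting hashes below are hard-coded stand-ins for what hash_any() /
hash_uint32_extended() would produce), here is a self-contained sketch
of enhanced double hashing generating k probe positions, with the modulo
computed as a bitwise AND because the bitset size is a power of two:

#include <stdio.h>
#include <stdint.h>

/*
 * Generate k probe positions from two 32-bit hashes using enhanced double
 * hashing.  Because nbits is a power of two, "x % nbits" can be computed
 * as "x & (nbits - 1)".
 */
static void
k_probes(uint32_t *out, int k, uint32_t h1, uint32_t h2, uint64_t nbits)
{
    uint64_t    mask = nbits - 1;   /* valid only for power-of-two nbits */
    uint64_t    x = h1 & mask;
    uint64_t    y = h2 & mask;
    int         i;

    out[0] = (uint32_t) x;
    for (i = 1; i < k; i++)
    {
        x = (x + y) & mask;     /* same result as (x + y) % nbits */
        y = (y + i) & mask;     /* the "enhanced" incrementing step */
        out[i] = (uint32_t) x;
    }
}

int
main(void)
{
    uint32_t    probes[5];
    int         i;

    /* stand-in hashes; a real filter derives these from the element */
    k_probes(probes, 5, 0xdeadbeefU, 0x12345678U, UINT64_C(1) << 20);
    for (i = 0; i < 5; i++)
        printf("probe %d -> bit %u\n", i, (unsigned) probes[i]);
    return 0;
}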

* New reference to the modulo bias effect.

Michael Paquier wondered why the Bloom filter was always a
power-of-two, which this addresses. (Of course, the "modulo bitwise
AND" optimization I just mentioned is another reason to limit
ourselves to power-of-two bitset sizes, albeit a new one.)

* Removed sdbmhash().

Michael also wanted to know more about sdbmhash(), due to some general
concern about its quality. I realized that it is best to avoid adding
a new general-purpose hash function, whose sole purpose is to be
different to hash_any(), when I could instead use
hash_uint32_extended() to get two 32-bit values all at once. Robert
suggested this approach at one point, actually, but for some reason I
didn't follow up until now.

-- 
Peter Geoghegan
From 2ff9dcace49ea590762701717235d87e13b03c6b Mon Sep 17 00:00:00 2001
From: Peter Geoghegan 
Date: Thu, 24 Aug 2017 20:58:21 -0700
Subject: [PATCH 1/2] Add Bloom filter data structure implementation.

A Bloom filter is a space-efficient, probabilistic data structure that
can be used to test set membership.  Callers will sometimes incur false
positives, but never false negatives.  The rate of false positives is a
function of the total number of elements and the amount of memory
available for the Bloom filter.

Two classic applications of Bloom filters are cache filtering, and data
synchronization testing.  Any user of Bloom filters must accept the
possibility of false positives as a cost worth paying for the benefit in
space efficiency.

This commit adds a test harness extension module, test_bloomfilter.  It
can be used to get a sense of how the Bloom filter implementation
performs under varying conditions.
---
 src/backend/lib/Makefile   |   4 +-
 src/backend/lib/README |   2 +
 src/backend/lib/bloomfilter.c  | 303 +
 src/include/lib/bloomfilter.h  |  27 ++
 src/test/modules/Makefile  |   1 +
 src/test/modules/test_bloomfilter/.gitignore   |   4 +
 src/test/modules/test_bloomfilter/Makefile |  21 ++
 src/test/modules/test_bloomfilter/README   |  71 +
 .../test_bloomfilter/expected/test_bloomfilter.out |  25 ++
 .../test_bloomfilter/sql/test_bloomfilter.sql  |  22 ++
 

Re: postgres_fdw: perform UPDATE/DELETE .. RETURNING on a join directly

2018-02-07 Thread Robert Haas
On Wed, Feb 7, 2018 at 6:01 PM, Tom Lane  wrote:
> Robert Haas  writes:
>> I spent a while reading through this today.  I see a few decisions
>> here or there that are debatable, in the sense that somebody else
>> might have chosen to do it differently, but I don't see anything that
>> actually looks wrong.  So, committed.
>
> The buildfarm's opinion of it is lower than yours.  Just eyeballing
> the failures, I'd say there was some naivete about the reproducibility
> of tuple CTIDs across different platforms.  Is there a good reason
> these test cases need to print CTID?

Uggh, I missed the fact that they were doing that.  It's probably
actually useful test coverage, but it's not surprising that it isn't
stable.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: update tuple routing and triggers

2018-02-07 Thread Amit Langote
Thanks for the review.

On 2018/02/08 0:04, Robert Haas wrote:
> On Tue, Feb 6, 2018 at 12:52 AM, Amit Langote
>  wrote:
>> About renaming es_leaf_result_relations to
>> es_tuple_routing_result_relations, I will defer that to committer.  But on
>> second though, maybe we don't need to make this patch larger than it has
>> to be.
> 
> +1 for renaming it.  I like keeping the patch small, but not at the
> price of being misleading.

OK, done in the attached.

> +/*
> + * Since we're newly creating this ResultRelInfo, add it to
> + * someplace where others could find it.
> + */
> 
> How about: "Since we've just initialized this ResultRelInfo, it's not
> in any list attached to the estate as yet.  Add it, so that it can be
> found later."

Wrote that way.

Thanks,
Amit
From 146f4d64fc9c085d7bc26435613eddd96d972d0d Mon Sep 17 00:00:00 2001
From: amit 
Date: Tue, 6 Feb 2018 10:40:45 +0900
Subject: [PATCH v4] Fix trigger behavior with update tuple routing

Trigger stats shown by EXPLAIN ANALYZE may contain duplicate
entries, because leaf partitions' ResultRelInfo is redundantly
present in multiple EState lists.

While at it, rename es_leaf_result_relations to something that
more accurately describes what it is.

Reported by: Etsuro Fujita
---
 src/backend/commands/explain.c   |  3 ++-
 src/backend/executor/execMain.c  |  7 +--
 src/backend/executor/execPartition.c | 12 +---
 src/backend/executor/execUtils.c |  2 +-
 src/include/nodes/execnodes.h|  7 +--
 5 files changed, 22 insertions(+), 9 deletions(-)

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 41cd47e8bc..3339a8a0fc 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -652,7 +652,8 @@ ExplainPrintTriggers(ExplainState *es, QueryDesc *queryDesc)
boolshow_relname;
int numrels = 
queryDesc->estate->es_num_result_relations;
int numrootrels = 
queryDesc->estate->es_num_root_result_relations;
-   List   *leafrels = queryDesc->estate->es_leaf_result_relations;
+   List   *leafrels =
+   
queryDesc->estate->es_tuple_routing_result_relations;
List   *targrels = queryDesc->estate->es_trig_target_relations;
int nr;
ListCell   *l;
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 410921cc40..5d3e923cca 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -1413,8 +1413,11 @@ ExecGetTriggerResultRel(EState *estate, Oid relid)
rInfo++;
nr--;
}
-   /* Third, search through the leaf result relations, if any */
-   foreach(l, estate->es_leaf_result_relations)
+   /*
+* Third, search through the result relations that were created during
+* tuple routing, if any.
+*/
+   foreach(l, estate->es_tuple_routing_result_relations)
{
rInfo = (ResultRelInfo *) lfirst(l);
if (RelationGetRelid(rInfo->ri_RelationDesc) == relid)
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index ba6b52c32c..4048c3ebc6 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -178,6 +178,15 @@ ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
  resultRTindex,
  rel,
  
estate->es_instrument);
+
+   /*
+* Since we've just initialized this ResultRelInfo, 
it's not in
+* any list attached to the estate as yet.  Add it, so 
that it can
+* be found later.
+*/
+   estate->es_tuple_routing_result_relations =
+   
lappend(estate->es_tuple_routing_result_relations,
+   leaf_part_rri);
}
 
part_tupdesc = RelationGetDescr(partrel);
@@ -210,9 +219,6 @@ ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
mtstate != NULL &&
mtstate->mt_onconflict 
!= ONCONFLICT_NONE);
 
-   estate->es_leaf_result_relations =
-   lappend(estate->es_leaf_result_relations, 
leaf_part_rri);
-
proute->partitions[i] = leaf_part_rri;
i++;
}
diff --git a/src/backend/executor/execUtils.c b/src/backend/executor/execUtils.c
index e29f7aaf7b..50b6edce63 100644
--- 

Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2018-02-07 Thread Claudio Freire
On Wed, Feb 7, 2018 at 8:52 PM, Alvaro Herrera  wrote:
>> Waiting as you say would be akin to what the patch does by putting
>> vacuum on its own parallel group.
>
> I don't think it's the same.  We don't need to wait until all the
> concurrent tests are done -- we only need to wait until the transactions
> that were current when the delete finished are done, which is very
> different since each test runs tons of small transactions rather than
> one single big transaction.

Um... maybe "lock pg_class" ?

That should conflict with basically any other running transaction and
have pretty much that effect.

Attached is a version of patch 1 with that approach.
From 3f1eb8829f9b396d1e5aef33114df44b04631455 Mon Sep 17 00:00:00 2001
From: Claudio Freire 
Date: Mon, 12 Sep 2016 23:36:42 -0300
Subject: [PATCH 1/2] Vacuum: allow using more than 1GB work mem

Turn the dead_tuples array into a structure composed of several
exponentially bigger arrays, to enable usage of more than 1GB
of work mem during vacuum and thus reduce the number of full
index scans necessary to remove all dead tids when the memory is
available.

Improve test ability to spot vacuum errors
---
 src/backend/commands/vacuumlazy.c| 402 ---
 src/test/regress/expected/vacuum.out |  31 ++-
 src/test/regress/sql/vacuum.sql  |  23 +-
 3 files changed, 380 insertions(+), 76 deletions(-)

diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index cf7f5e1..1e82d26 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -11,11 +11,24 @@
  *
  * We are willing to use at most maintenance_work_mem (or perhaps
  * autovacuum_work_mem) memory space to keep track of dead tuples.  We
- * initially allocate an array of TIDs of that size, with an upper limit that
+ * initially allocate an array of TIDs of 128MB, or an upper limit that
  * depends on table size (this limit ensures we don't allocate a huge area
- * uselessly for vacuuming small tables).  If the array threatens to overflow,
+ * uselessly for vacuuming small tables). Additional arrays of increasingly
+ * large sizes are allocated as they become necessary.
+ *
+ * The TID array is thus represented as a list of multiple segments of
+ * varying size, beginning with the initial size of up to 128MB, and growing
+ * exponentially until the whole budget of (autovacuum_)maintenance_work_mem
+ * is used up.
+ *
+ * Lookup in that structure happens with a binary search of segments, and then
+ * with a binary search within each segment. Since segment's size grows
+ * exponentially, this retains O(log N) lookup complexity.
+ *
+ * If the multiarray's total size threatens to grow beyond maintenance_work_mem,
  * we suspend the heap scan phase and perform a pass of index cleanup and page
- * compaction, then resume the heap scan with an empty TID array.
+ * compaction, then resume the heap scan with an array of logically empty but
+ * already preallocated TID segments to be refilled with more dead tuple TIDs.
  *
  * If we're processing a table with no indexes, we can just vacuum each page
  * as we go; there's no need to save up multiple tuples to minimize the number
@@ -92,6 +105,14 @@
 #define LAZY_ALLOC_TUPLES		MaxHeapTuplesPerPage
 
 /*
+ * Minimum (starting) size of the dead_tuples array segments. Will allocate
+ * space for 128MB worth of tid pointers in the first segment, further segments
+ * will grow in size exponentially. Don't make it too small or the segment list
+ * will grow bigger than the sweetspot for search efficiency on big vacuums.
+ */
+#define LAZY_INIT_TUPLES		Max(MaxHeapTuplesPerPage, (128<<20) / sizeof(ItemPointerData))
+
+/*
  * Before we consider skipping a page that's marked as clean in
  * visibility map, we must've seen at least this many clean pages.
  */
@@ -103,6 +124,27 @@
  */
 #define PREFETCH_SIZE			((BlockNumber) 32)
 
+typedef struct DeadTuplesSegment
+{
+	ItemPointerData last_dead_tuple;	/* Copy of the last dead tuple (unset
+		 * until the segment is fully
+		 * populated). Keep it first to simplify
+		 * binary searches */
+	int			num_dead_tuples;	/* # of entries in the segment */
+	int			max_dead_tuples;	/* # of entries allocated in the segment */
+	ItemPointer dt_tids;		/* Array of dead tuples */
+}	DeadTuplesSegment;
+
+typedef struct DeadTuplesMultiArray
+{
+	int			num_entries;	/* current # of entries */
+	int			max_entries;	/* total # of slots that can be allocated in
+ * array */
+	int			num_segs;		/* number of dead tuple segments allocated */
+	int			last_seg;		/* last dead tuple segment with data (or 0) */
+	DeadTuplesSegment *dt_segments;		/* array of num_segs segments */
+}	DeadTuplesMultiArray;
+
 typedef struct LVRelStats
 {
 	/* hasindex = true means two-pass strategy; false means one-pass */
@@ -123,14 +165,20 @@ typedef struct LVRelStats
 	BlockNumber nonempty_pages; /* actually, 

Re: PDF Builds on borka (Debian/stretch) broken - 9.6

2018-02-07 Thread Peter Eisentraut
On 2/7/18 17:21, Stephen Frost wrote:
> Looks like Nov 15 (which I believe is when the stretch upgrade was done)
> it was upgraded:
> 
> 2017-11-15 17:38:55 upgrade openjade:amd64 1.4devel1-21.1 1.4devel1-21.3+b1
> 
> That doesn't look like a terribly large version bump to me tho..

You probably had the openjade1.3 package installed, and maybe it got
deleted or the alternative uninstalled during the upgrade.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [COMMITTERS] pgsql: Rearm statement_timeout after each executed query.

2018-02-07 Thread Thomas Munro
On Tue, Feb 6, 2018 at 5:21 PM, Peter Eisentraut
 wrote:
> On 9/18/17 22:41, Andres Freund wrote:
>> Rearm statement_timeout after each executed query.
>
> This appears to have broken statement_timeout behavior in master such
> that only every second query is affected by it.

Yeah, I also just ran into this while testing a nearby complaint about
statement timeouts vs parallel query.  In the error path
stmt_timeout_active remains true, so the next statement does nothing
in enable_statement_timeout().  I think we just need to clear that
flag in the error path, right where we call disable_all_timeouts().
See attached.
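
To illustrate why clearing that flag matters, here is a toy,
self-contained model of the behaviour described above;
stmt_timeout_active, enable_statement_timeout() and the error-path
handling below are mocked stand-ins for the real backend machinery, not
PostgreSQL code:

#include <stdbool.h>
#include <stdio.h>

/* Mock stand-in for the backend's flag. */
static bool stmt_timeout_active = false;

static void
enable_statement_timeout(void)
{
    /* Modeled on the description above: only arm when not marked active. */
    if (!stmt_timeout_active)
    {
        printf("  timeout armed\n");
        stmt_timeout_active = true;
    }
    else
        printf("  timeout NOT armed (flag still set)\n");
}

static void
error_recovery(bool apply_fix)
{
    printf("  ERROR path: disable_all_timeouts()\n");   /* mocked call */
    if (apply_fix)
        stmt_timeout_active = false;    /* the proposed one-line fix */
}

static void
run(bool apply_fix)
{
    printf("statement 1 (hits an error):\n");
    enable_statement_timeout();
    error_recovery(apply_fix);
    printf("statement 2:\n");
    enable_statement_timeout();
    stmt_timeout_active = false;        /* normal completion disarms it */
}

int
main(void)
{
    printf("-- without the fix --\n");
    run(false);
    stmt_timeout_active = false;
    printf("-- with the fix --\n");
    run(true);
    return 0;
}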

-- 
Thomas Munro
http://www.enterprisedb.com


fix-statement-timeout.patch
Description: Binary data


Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2018-02-07 Thread Alvaro Herrera
Claudio Freire wrote:
> On Wed, Feb 7, 2018 at 7:57 PM, Alvaro Herrera  
> wrote:
> > Claudio Freire wrote:

> > Hmm, this solution is not very friendly to the goal of reducing test
> > runtime, particularly since the new test creates a nontrivial-sized
> > table.  Maybe we can find a better alternative.  Can we use some wait
> > logic instead?  Maybe something like grab a snapshot of running VXIDs
> > and loop waiting until they're all gone before doing the vacuum?
> 
> I'm not sure there's any alternative. I did some tests and any active
> snapshot on any other table, not necessarily on the one being
> vacuumed, distorted the test. And it makes sense, since that snapshot
> makes those deleted tuples unvacuumable.

Sure.

> Waiting as you say would be akin to what the patch does by putting
> vacuum on its own parallel group.

I don't think it's the same.  We don't need to wait until all the
concurrent tests are done -- we only need to wait until the transactions
that were current when the delete finished are done, which is very
different since each test runs tons of small transactions rather than
one single big transaction.

> > Also, I don't understand why pg_relation_size() is not a better solution
> > to determining the table size compared to explain.
> 
> I was told pg_relation_size can return stale information. I didn't
> check, I took that at face value.

Hmm, it uses stat() on the table files.  I think those files would be
truncated at the time the transaction commits, so they shouldn't be
stale.  (I don't think the system waits for a checkpoint to flush a
truncation.)  Maybe relying on that is not reliable or future-proof
enough.  Anyway this is a minor point -- the one above worries me most.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2018-02-07 Thread Claudio Freire
On Wed, Feb 7, 2018 at 7:57 PM, Alvaro Herrera  wrote:
> Claudio Freire wrote:
>
>> - vacuum test on its own parallel group
>
> Hmm, this solution is not very friendly to the goal of reducing test
> runtime, particularly since the new test creates a nontrivial-sized
> table.  Maybe we can find a better alternative.  Can we use some wait
> logic instead?  Maybe something like grab a snapshot of running VXIDs
> and loop waiting until they're all gone before doing the vacuum?

I'm not sure there's any alternative. I did some tests and any active
snapshot on any other table, not necessarily on the one being
vacuumed, distorted the test. And it makes sense, since that snapshot
makes those deleted tuples unvacuumable.

Waiting as you say would be akin to what the patch does by putting
vacuum on its own parallel group.

I'm guessing all tests write something to the database, so all tests
will create a snapshot. Maybe if there were read-only tests, those
might be safe to include in vacuum's parallel group, but otherwise I
don't see any alternative.

> Also, I don't understand why pg_relation_size() is not a better solution
> to determining the table size compared to explain.

I was told pg_relation_size can return stale information. I didn't
check, I took that at face value.



Re: postgres_fdw: perform UPDATE/DELETE .. RETURNING on a join directly

2018-02-07 Thread Tom Lane
Robert Haas  writes:
> I spent a while reading through this today.  I see a few decisions
> here or there that are debatable, in the sense that somebody else
> might have chosen to do it differently, but I don't see anything that
> actually looks wrong.  So, committed.

The buildfarm's opinion of it is lower than yours.  Just eyeballing
the failures, I'd say there was some naivete about the reproducibility
of tuple CTIDs across different platforms.  Is there a good reason
these test cases need to print CTID?

regards, tom lane



Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2018-02-07 Thread Alvaro Herrera
Claudio Freire wrote:

> - vacuum test on its own parallel group

Hmm, this solution is not very friendly to the goal of reducing test
runtime, particularly since the new test creates a nontrivial-sized
table.  Maybe we can find a better alternative.  Can we use some wait
logic instead?  Maybe something like grab a snapshot of running VXIDs
and loop waiting until they're all gone before doing the vacuum?

Also, I don't understand why pg_relation_size() is not a better solution
to determining the table size compared to explain.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PDF Builds on borka (Debian/stretch) broken - 9.6

2018-02-07 Thread Tom Lane
Stephen Frost  writes:
> * Tom Lane (t...@sss.pgh.pa.us) wrote:
>> We've seen this before, eg
>> https://www.postgresql.org/message-id/51CB39BD.8020205%40pgexperts.com
>> and the conclusion seemed to be that it was from having the wrong version
>> of openjade installed.  Was borka upgraded recently?

> Looks like Nov 15 (which I believe is when the stretch upgrade was done)
> it was upgraded:
> 2017-11-15 17:38:55 upgrade openjade:amd64 1.4devel1-21.1 1.4devel1-21.3+b1
> That doesn't look like a terribly large version bump to me tho..

Digging around further in our archives, it seems that this bug is specific
to openjade 1.4devel, and 1.3.x works fine.  But if borka was not running
1.3 before then it's hard to know what to make of it.  Since this is the
first report of this failure since 2013, I'd sort of thought the problem
was gone, but evidently not :-(

Another line of thought is that it might be an interaction between
particular versions of openjade and particular versions of the dsssl
style sheets.  Were the latter upgraded?

regards, tom lane



Re: JIT compiling with LLVM v10.0

2018-02-07 Thread Thomas Munro
On Thu, Feb 8, 2018 at 3:54 AM, Andres Freund  wrote:
> I've pushed v10.0. The big (and pretty painful to make) change is that
> now all the LLVM specific code lives in src/backend/jit/llvm, which is
> built as a shared library which is loaded on demand.
>
> The layout is now as follows:
>
> src/backend/jit/jit.c:
> Part of JITing always linked into the server. Supports loading the
> LLVM using JIT library.
>
> src/backend/jit/llvm/
> Infrastructure:
>  llvmjit.c:
> General code generation and optimization infrastructure
>  llvmjit_error.cpp, llvmjit_wrap.cpp:
> Error / backward compat wrappers
>  llvmjit_inline.cpp:
> Cross module inlining support
> Code-Gen:
>   llvmjit_expr.c
> Expression compilation
>   llvmjit_deform.c
> Deform compilation

You are asking LLVM to dlopen(""), which doesn't work on my not-Linux,
explaining the errors I reported in the older thread.  The portable
way to dlopen the main binary is dlopen(NULL), so I think you need to
pass NULL in to LLVMLoadLibraryPermanently(), even though that isn't
really clear from any LLVM documentation I've looked at.

diff --git a/src/backend/jit/llvm/llvmjit.c b/src/backend/jit/llvm/llvmjit.c
index a1bc6449f7..874bddf81e 100644
--- a/src/backend/jit/llvm/llvmjit.c
+++ b/src/backend/jit/llvm/llvmjit.c
@@ -443,7 +443,7 @@ llvm_session_initialize(void)
cpu = NULL;

/* force symbols in main binary to be loaded */
-   LLVMLoadLibraryPermanently("");
+   LLVMLoadLibraryPermanently(NULL);

llvm_opt0_orc = LLVMOrcCreateInstance(llvm_opt0_targetmachine);
llvm_opt3_orc = LLVMOrcCreateInstance(llvm_opt3_targetmachine);
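
For anyone who wants to see the dlopen() convention in isolation, here is a
minimal standalone sketch (not part of the patch above; the flags, error
handling and printed text are just illustrative).  NULL, not the empty
string, is what asks for a handle on the already-running executable:

#include <dlfcn.h>
#include <stdio.h>

int
main(void)
{
	/* NULL is the portable way to refer to the main program itself */
	void	   *self = dlopen(NULL, RTLD_NOW | RTLD_GLOBAL);

	if (self == NULL)
	{
		fprintf(stderr, "dlopen(NULL) failed: %s\n", dlerror());
		return 1;
	}

	/*
	 * Symbols exported by the executable (and the libraries it already
	 * links against) are now reachable through dlsym(self, ...).
	 */
	printf("got a handle for the running executable: %p\n", self);
	dlclose(self);
	return 0;
}

(Build with something like "cc demo.c -ldl" on glibc systems.)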

-- 
Thomas Munro
http://www.enterprisedb.com



Re: PDF Builds on borka (Debian/stretch) broken - 9.6

2018-02-07 Thread Stephen Frost
Tom,

* Tom Lane (t...@sss.pgh.pa.us) wrote:
> Stephen Frost  writes:
> > While trying to do the PDF builds on borka for 9.6 for tomorrow's
> > release, I'm getting:
> 
> > openjade  -D . -D . -c 
> > /usr/share/sgml/docbook/stylesheet/dsssl/modular/catalog -d 
> > ./stylesheet.dsl -t tex -V tex-backend -i output-print -i include-index -V 
> > texpdf-output -V '%paper-type%'=A4 -o postgres-A4.tex-pdf postgres.sgml
> > openjade:./stylesheet.dsl:670:9:E: flow object not accepted by port; only 
> > display flow objects accepted
> > Makefile:180: recipe for target 'postgres-A4.tex-pdf' failed
> > make: *** [postgres-A4.tex-pdf] Segmentation fault
> > make: *** Deleting file 'postgres-A4.tex-pdf'
> 
> Bleah.
> 
> We've seen this before, eg
> https://www.postgresql.org/message-id/51CB39BD.8020205%40pgexperts.com
> 
> and the conclusion seemed to be that it was from having the wrong version
> of openjade installed.  Was borka upgraded recently?

Looks like Nov 15 (which I believe is when the stretch upgrade was done)
it was upgraded:

2017-11-15 17:38:55 upgrade openjade:amd64 1.4devel1-21.1 1.4devel1-21.3+b1

That doesn't look like a terribly large version bump to me tho..

Thanks!

Stephen


signature.asc
Description: PGP signature


Re: PDF Builds on borka (Debian/stretch) broken - 9.6

2018-02-07 Thread Tom Lane
Stephen Frost  writes:
> While trying to do the PDF builds on borka for 9.6 for tomorrow's
> release, I'm getting:

> openjade  -D . -D . -c 
> /usr/share/sgml/docbook/stylesheet/dsssl/modular/catalog -d ./stylesheet.dsl 
> -t tex -V tex-backend -i output-print -i include-index -V texpdf-output -V 
> '%paper-type%'=A4 -o postgres-A4.tex-pdf postgres.sgml
> openjade:./stylesheet.dsl:670:9:E: flow object not accepted by port; only 
> display flow objects accepted
> Makefile:180: recipe for target 'postgres-A4.tex-pdf' failed
> make: *** [postgres-A4.tex-pdf] Segmentation fault
> make: *** Deleting file 'postgres-A4.tex-pdf'

Bleah.

We've seen this before, eg
https://www.postgresql.org/message-id/51CB39BD.8020205%40pgexperts.com

and the conclusion seemed to be that it was from having the wrong version
of openjade installed.  Was borka upgraded recently?

regards, tom lane



Re: [HACKERS] More stats about skipped vacuums

2018-02-07 Thread Tom Lane
Robert Haas  writes:
> It seems to me that there was a thread where Tom proposed removing
> support for dynamic_shared_memory_type = none.

I think you're recalling <32138.1502675...@sss.pgh.pa.us>, wherein
I pointed out that

>>> Whether that's worth the trouble is debatable.  The current code
>>> in initdb believes that every platform has some type of DSM support
>>> (see choose_dsm_implementation).  Nobody's complained about that,
>>> and it certainly works on every buildfarm animal.  So for all we know,
>>> dynamic_shared_memory_type = none is broken already.

(That was in fact in the same thread Kyotaro-san just linked to about
reimplementing the stats collector.)

It's still true that we've no reason to believe there are any supported
platforms that haven't got some sort of DSM.  Performance might be a
different question, of course ... but it's hard to believe that
transferring stats through DSM wouldn't be better than writing them
out to files.

> I suggest we remove support for dynamic_shared_memory_type = none first,
> and see if we get any complaints.  If we don't, then future patches can
> rely on it being present.

If we remove it in v11, it'd still be maybe a year from now before we'd
have much confidence from that alone that nobody cares.  I think the lack
of complaints about it in 9.6 and 10 is a more useful data point.

regards, tom lane



Re: [HACKERS] Bug in to_timestamp().

2018-02-07 Thread Dmitry Dolgov
> On 6 February 2018 at 10:17, Arthur Zakirov  wrote:
> It is strange. I still can apply both v9 [1] and v10 [2] via 'git
> apply'. And Patch Tester [3] says that it is applied. But maybe
> it is because of my git (git version 2.16.1).
>
> You can try also 'patch -p1':

Yes, looks like the problem is on my side, sorry.



Re: JIT compiling with LLVM v10.0

2018-02-07 Thread Andres Freund
Hi,

On 2018-02-07 20:35:12 +0100, Pierre Ducroquet wrote:
> I also find it more readable and it looks cleaner, insane guys could be able 
> to write their own JIT engines for PostgreSQL by patching a single
> file :)

Right - we could easily make the libname configurable if requested.


> Since it's now in its own .so file, does it still make as much sense using 
> mostly the LLVM C API ?

Yes, I definitely want to continue that. For one the C API is a *lot*
more stable, for another postgres is C.


> I included a small addition to the gitignore file, I'm surprised you were not 
> bothered by the various .bc files generated.

I use a VPATH build (i.e. source code is in a different directory than
the build products), so I do not really see that.  But yes, it makes
sense to add the ignore entries.

Thanks for looking,

Andres Freund



Re: Add PGDLLIMPORT to enable_hashagg

2018-02-07 Thread legrand legrand
Thank you Metin !

Regards
PAscal



--
Sent from: http://www.postgresql-archive.org/PostgreSQL-hackers-f1928748.html



Re: postgres_fdw: perform UPDATE/DELETE .. RETURNING on a join directly

2018-02-07 Thread Robert Haas
On Thu, Jan 11, 2018 at 2:58 AM, Etsuro Fujita
 wrote:
> (2017/12/27 20:55), Etsuro Fujita wrote:
>> Attached is an updated version of the patch.
>
> I revised code/comments a little bit.  PFA new version.

I spent a while reading through this today.  I see a few decisions
here or there that are debatable, in the sense that somebody else
might have chosen to do it differently, but I don't see anything that
actually looks wrong.  So, committed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: JIT compiling with LLVM v10.0

2018-02-07 Thread Pierre Ducroquet
On Wednesday, February 7, 2018 3:54:05 PM CET Andres Freund wrote:
> Hi,
> 
> I've pushed v10.0. The big (and pretty painful to make) change is that
> now all the LLVM specific code lives in src/backend/jit/llvm, which is
> built as a shared library which is loaded on demand.
> 
> The layout is now as follows:
> 
> src/backend/jit/jit.c:
> Part of JITing always linked into the server. Supports loading the
> LLVM using JIT library.
> 
> src/backend/jit/llvm/
> Infrastructure:
>  llvmjit.c:
> General code generation and optimization infrastructure
>  llvmjit_error.cpp, llvmjit_wrap.cpp:
> Error / backward compat wrappers
>  llvmjit_inline.cpp:
> Cross module inlining support
> Code-Gen:
>   llvmjit_expr.c
> Expression compilation
>   llvmjit_deform.c
> Deform compilation
> 
> I generally like how this shaped out. There's a good amount of followup
> cleanup needed, but I'd appreciate some early feedback.

Hi

I also find it more readable and it looks cleaner, insane guys could be able 
to write their own JIT engines for PostgreSQL by patching a single file :) 
Since it's now in its own .so file, does it still make as much sense using 
mostly the LLVM C API ?
I'll really look in the jit code itself later, right now I've just rebased my 
previous patches and did a quick check that everything worked for LLVM4 and 
3.9.
I included a small addition to the gitignore file, I'm surprised you were not 
bothered by the various .bc files generated.

Anyway, great work, and I look forward exploring the code :)

 Pierre
>From f461c3d6d0deec52b66f3276a87cfdb3ab65b259 Mon Sep 17 00:00:00 2001
From: Pierre 
Date: Fri, 2 Feb 2018 09:11:55 +0100
Subject: [PATCH 1/8] Add support for LLVM4 in llvmjit.c

---
 src/backend/jit/llvm/llvmjit.c | 21 ++---
 1 file changed, 18 insertions(+), 3 deletions(-)

diff --git a/src/backend/jit/llvm/llvmjit.c b/src/backend/jit/llvm/llvmjit.c
index a1bc6449f7..e81764af98 100644
--- a/src/backend/jit/llvm/llvmjit.c
+++ b/src/backend/jit/llvm/llvmjit.c
@@ -215,12 +215,19 @@ llvm_get_function(LLVMJitContext *context, const char *funcname)
 
 		addr = 0;
 		if (LLVMOrcGetSymbolAddressIn(handle->stack, &addr, handle->orc_handle, mangled))
-			elog(ERROR, "failed to lookup symbol");
+			elog(ERROR, "failed to lookup symbol %s", mangled);
 		if (addr)
 			return (void *) addr;
 	}
 #endif
 
+#if LLVM_VERSION_MAJOR < 5
+	if ((addr = LLVMOrcGetSymbolAddress(llvm_opt0_orc, mangled)))
+		return (void *) addr;
+	if ((addr = LLVMOrcGetSymbolAddress(llvm_opt3_orc, mangled)))
+		return (void *) addr;
+	elog(ERROR, "failed to lookup symbol %s for %s", mangled, funcname);
+#else
 	if (LLVMOrcGetSymbolAddress(llvm_opt0_orc, &addr, mangled))
 		elog(ERROR, "failed to lookup symbol");
 	if (addr)
@@ -229,7 +236,7 @@ llvm_get_function(LLVMJitContext *context, const char *funcname)
 		elog(ERROR, "failed to lookup symbol");
 	if (addr)
 		return (void *) addr;
-
+#endif
 	elog(ERROR, "failed to JIT: %s", funcname);
 
 	return NULL;
@@ -365,11 +372,18 @@ llvm_compile_module(LLVMJitContext *context)
 	 * faster instruction selection mechanism is used.
 	 */
 	{
-		LLVMSharedModuleRef smod;
 		instr_time tb, ta;
 
 		/* emit the code */
 		INSTR_TIME_SET_CURRENT(ta);
+#if LLVM_VERSION_MAJOR < 5
+		orc_handle = LLVMOrcAddEagerlyCompiledIR(compile_orc, context->module,
+			 llvm_resolve_symbol, NULL);
+		// It seems there is no error return from that function in LLVM < 5.
+#else
+		LLVMSharedModuleRef smod;
+
 		smod = LLVMOrcMakeSharedModule(context->module);
 		if (LLVMOrcAddEagerlyCompiledIR(compile_orc, &orc_handle, smod,
 		llvm_resolve_symbol, NULL))
@@ -377,6 +391,7 @@ llvm_compile_module(LLVMJitContext *context)
 			elog(ERROR, "failed to jit module");
 		}
 		LLVMOrcDisposeSharedModuleRef(smod);
+#endif
 		INSTR_TIME_SET_CURRENT(tb);
 		INSTR_TIME_SUBTRACT(tb, ta);
 		ereport(DEBUG1, (errmsg("time to emit: %.3fs",
-- 
2.16.1

>From 7c34e2264edad3d2d99ba2ba6fbf41ec3a73e0ed Mon Sep 17 00:00:00 2001
From: Pierre 
Date: Fri, 2 Feb 2018 09:13:40 +0100
Subject: [PATCH 2/8] Add LLVM4 support in llvmjit_error.cpp

---
 src/backend/jit/llvm/llvmjit_error.cpp | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/src/backend/jit/llvm/llvmjit_error.cpp b/src/backend/jit/llvm/llvmjit_error.cpp
index 625ba2d25d..1c78bd956d 100644
--- a/src/backend/jit/llvm/llvmjit_error.cpp
+++ b/src/backend/jit/llvm/llvmjit_error.cpp
@@ -56,7 +56,9 @@ llvm_enter_fatal_on_oom(void)
 	if (fatal_new_handler_depth == 0)
 	{
 		old_new_handler = std::set_new_handler(fatal_system_new_handler);
+#if LLVM_VERSION_MAJOR > 4
 		llvm::install_bad_alloc_error_handler(fatal_llvm_new_handler);
+#endif
 		llvm::install_fatal_error_handler(fatal_llvm_error_handler);
 	}
 	fatal_new_handler_depth++;
@@ -72,7 +74,9 @@ llvm_leave_fatal_on_oom(void)
 	if (fatal_new_handler_depth == 0)
 	{
 		

Re: [HACKERS] [POC] Faster processing at Gather node

2018-02-07 Thread Andres Freund
Hi,

On 2018-01-25 12:09:23 -0500, Robert Haas wrote:
> > Perhaps a short benchmark for 32bit systems using shm_mq wouldn't hurt?
> > I suspect there won't be much of a performance impact, but it's probably
> > worth checking.
>
> I don't think I understand your concern here.  If this is used on a
> system where we're emulating atomics and barriers in painful ways, it
> might hurt performance, but I think we have a policy of not caring.

Well, it's more than just systems like that - for 64bit atomics we
sometimes do fall back to spinlock based atomics on 32bit systems, even
if they support 32 bit atomics.


> Also, I don't know where I'd find a 32-bit system to test.

You can compile with -m32 on reasonable systems ;)


> >>   Assert(used <= ringsize);
> >>   available = Min(ringsize - used, nbytes - sent);
> >>
> >>   /* Bail out if the queue has been detached. */
> >> - if (detached)
> >> + if (mq->mq_detached)
> >
> > Hm, do all paths here guarantee that mq->mq_detached won't be stored on
> > the stack / register in the first iteration, and then not reread any
> > further? I think it's fine because every branch of the if below ends up
> > in a syscall / barrier, but it might be worth noting on that here.
>
> Aargh.  I hate compilers.  I added a comment.  Should I think about
> changing mq_detached to pg_atomic_uint32 instead?

I think a pg_compiler_barrier() would suffice to alleviate my concern,
right? If you wanted to go for an atomic, using pg_atomic_flag would
probably be more appropriate than pg_atomic_uint32.
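
To make the compiler-caching concern concrete, here is a toy standalone
program, unrelated to shm_mq itself; the barrier macro mirrors what
pg_compiler_barrier() expands to on GCC-like compilers, and the flag name is
only meant to echo mq_detached.  With optimization and without the barrier,
the compiler may legally hoist the flag load out of the loop and spin
forever.  This only addresses the compiler, not CPU reordering, which is
exactly the distinction being discussed:

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

/* roughly what pg_compiler_barrier() is on GCC-like compilers */
#define compiler_barrier()	__asm__ __volatile__("" ::: "memory")

static int	detached = 0;		/* deliberately neither volatile nor atomic */

static void *
detach_soon(void *arg)
{
	sleep(1);
	detached = 1;
	return NULL;
}

int
main(void)
{
	pthread_t	thr;

	pthread_create(&thr, NULL, detach_soon, NULL);

	for (;;)
	{
		if (detached)			/* load may be cached in a register ...      */
			break;
		compiler_barrier();		/* ... unless we force a fresh load per loop */
	}

	puts("noticed the detach");
	pthread_join(thr, NULL);
	return 0;
}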


> >> - /* Write as much data as we can via a single 
> >> memcpy(). */
> >> + /*
> >> +  * Write as much data as we can via a single 
> >> memcpy(). Make sure
> >> +  * these writes happen after the read of 
> >> mq_bytes_read, above.
> >> +  * This barrier pairs with the one in 
> >> shm_mq_inc_bytes_read.
> >> +  */
> >
> > s/above/above. Otherwise a newer mq_bytes_read could become visible
> > before the corresponding reads have fully finished./?
>
> I don't find that very clear.  A newer mq_bytes_read could become
> visible whenever, and a barrier doesn't prevent that from happening.

Well, my point was that the barrier prevents the write to
mq_bytes_read from becoming visible before the corresponding reads have
finished. Which then would mean the memcpy() could overwrite them. And a
barrier *does* prevent that from happening.

I don't think this is the same as:

> What it does is ensure (together with the one in
> shm_mq_inc_bytes_read) that we don't try to read bytes that aren't
> fully *written* yet.

which seems much more about the barrier in shm_mq_inc_bytes_written()?


> Generally, my mental model is that barriers make things happen in
> program order rather than some other order that the CPU thinks would
> be fun.  Imagine a single-core server doing all of this stuff the "old
> school" way.  If the reader puts data into the queue before
> advertising its presence and the writer finishes using the data from
> the queue before advertising its consumption, then everything works.
> If you do anything else, it's flat busted, even on that single-core
> system, because a context switch could happen at any time, and then
> you might read data that isn't written yet.  The barrier just ensures
> that we get that order of execution even on fancy modern hardware, but
> I'm not sure how much of that we really need to explain here.

IDK, I find it nontrivial to understand individual uses of
barriers. There are often multiple non-isomorphic ways to use barriers, and
the logic for why a specific one is correct isn't always obvious.


> >> + pg_memory_barrier();
> >>   memcpy(&mq->mq_ring[mq->mq_ring_offset + offset],
> >>  (char *) data + sent, sendnow);
> >>   sent += sendnow;
> >
> > Btw, this mq_ring_offset stuff seems a bit silly, why don't we use
> > proper padding/union in the struct to make it unnecessary to do that bit
> > of offset calculation every time? I think it currently prevents
> > efficient address calculation instructions from being used.
>
> Well, the root cause -- aside from me being a fallible human being
> with only limited programming skills -- is that I wanted the parallel
> query code to be able to request whatever queue size it preferred
> without having to worry about how many bytes of that space was going
> to get consumed by overhead.

What I meant is that instead of doing
struct shm_mq
{
...
boolmq_detached;
uint8   mq_ring_offset;
charmq_ring[FLEXIBLE_ARRAY_MEMBER];
};

it'd be possible to do something like

{
...
boolmq_detached;
union {
charmq_ring[FLEXIBLE_ARRAY_MEMBER];
double  

Re: [HACKERS] Transactions involving multiple postgres foreign servers

2018-02-07 Thread Robert Haas
On Tue, Jan 9, 2018 at 9:49 AM, Masahiko Sawada  wrote:
>> If I understand correctly, XactLastRecEnd can be set by, for example,
>> a HOT cleanup record, so that doesn't seem like a good thing to use.
>
> Yes, that's right.
>
>> Whether we need to use 2PC across remote nodes seems like it shouldn't
>> depend on whether a local SELECT statement happened to do a HOT
>> cleanup or not.
>
> So I think we need to check if the top transaction is invalid or not as well.

Even if you check both, it doesn't sound like it really does what you
want.  Won't you still end up partially dependent on whether a HOT
cleanup happened, if not in quite the same way as before?  How about
defining a new bit in MyXactFlags for XACT_FLAGS_WROTENONTEMPREL?
Just have heap_insert, heap_update, and heap_delete do something like:

if (RelationNeedsWAL(relation))
MyXactFlags |= XACT_FLAGS_WROTENONTEMPREL;
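
A toy standalone model of that bookkeeping, to make the proposal concrete
(the flag name follows the suggestion above, but the bit value and the
commit-time check are assumptions, not existing code):

#include <stdio.h>

#define XACT_FLAGS_WROTENONTEMPREL	(1U << 2)	/* hypothetical bit value */

static unsigned int MyXactFlags = 0;

/* stand-in for the one-liner added to heap_insert/heap_update/heap_delete */
static void
mark_wal_logged_write(void)
{
	MyXactFlags |= XACT_FLAGS_WROTENONTEMPREL;
}

int
main(void)
{
	mark_wal_logged_write();

	/*
	 * At pre-commit time, the foreign-transaction code could test the bit to
	 * decide whether the local server actually modified anything, instead of
	 * inferring that from XactLastRecEnd (which a HOT cleanup also advances).
	 */
	if (MyXactFlags & XACT_FLAGS_WROTENONTEMPREL)
		puts("local write happened: local node participates in 2PC");
	else
		puts("no local write: local node can skip the prepare step");

	return 0;
}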

Overall, what's the status of this patch?  Are we hung up on this
issue only, or are there other things?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] More stats about skipped vacuums

2018-02-07 Thread Robert Haas
On Tue, Feb 6, 2018 at 8:34 PM, Kyotaro HORIGUCHI
 wrote:
> For that reason, it fails to run when
> dynamic_shared_memory_type = none, and it is accompanied by
> several cleanup complexities. The decision there was that we should
> just use static shared memory rather than adding complexity
> for nothing. If it really needs to be expandable in the future,
> that will be the time to use DSA. (But we would still maintain a
> fallback.)

It seems to me that there was a thread where Tom proposed removing
support for dynamic_shared_memory_type = none.  The main reason that I
included that option initially was because it seemed silly to risk
causing problems for users whose dynamic shared memory facilities
didn't work for the sake of a feature that, at the time (9.4), had no
in-core users.

But things have shifted a bit since then.  We have had few complaints
about dynamic shared memory causing portability problems (except for
performance: apparently some implementations perform better than
others on some systems, and we need support for huge pages, but
neither of those things are a reason to disable it) and we now have
in-core use that is enabled by default.  I suggest we remove support
for dynamic_shared_memory_type = none first, and see if we get any
complaints.  If we don't, then future patches can rely on it being
present.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] [PATCH] Vacuum: Update FSM more frequently

2018-02-07 Thread Robert Haas
On Tue, Feb 6, 2018 at 7:51 AM, Claudio Freire  wrote:
> No free space becomes visible during long-running vacuums. That means
> bloat keeps accumulating even though vacuum is freeing space, because
> the FSM doesn't expose that free space.
>
> The extra work incurred in those FSM vacuums isn't useless, it's
> preventing bloat in long-running vacuums.

+1.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



SSL test names

2018-02-07 Thread Peter Eisentraut
Here is a patch that gives the tests in the SSL test suite proper names
instead of just writing out the connection strings.  So instead of

# running client tests
# test that the server doesn't accept non-SSL connections
ok 1 - sslmode=disable (should fail)
# connect without server root cert
ok 2 - sslrootcert=invalid sslmode=require
ok 3 - sslrootcert=invalid sslmode=verify-ca (should fail)
ok 4 - sslrootcert=invalid sslmode=verify-full (should fail)

you get something like

# running client tests
ok 1 - server doesn't accept non-SSL connections
ok 2 - connect without server root cert sslmode=require
ok 3 - connect without server root cert sslmode=verify-ca
ok 4 - connect without server root cert sslmode=verify-full
ok 5 - connect with wrong server root cert sslmode=require
ok 6 - connect with wrong server root cert sslmode=verify-ca
ok 7 - connect with wrong server root cert sslmode=verify-full

I have found the old way very confusing while working with several
SSL-related patches recently.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From 30014a0669f9a357a7013e2ee235075196acd497 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Fri, 26 Jan 2018 16:14:34 -0500
Subject: [PATCH] Refine SSL tests test name reporting

Instead of using the psql/libpq connection string as the displayed test
name and relying on "notes" and source code comments to explain the
tests, give the tests self-explanatory names, like we do elsewhere.
---
 src/test/ssl/ServerSetup.pm|  11 +---
 src/test/ssl/t/001_ssltests.pl | 142 +
 2 files changed, 88 insertions(+), 65 deletions(-)

diff --git a/src/test/ssl/ServerSetup.pm b/src/test/ssl/ServerSetup.pm
index b4d5746e20..c0f21290af 100644
--- a/src/test/ssl/ServerSetup.pm
+++ b/src/test/ssl/ServerSetup.pm
@@ -39,7 +39,6 @@ our @EXPORT = qw(
 sub run_test_psql
 {
my $connstr   = $_[0];
-   my $logstring = $_[1];
 
my $cmd = [
'psql', '-X', '-A', '-t', '-c', "SELECT \$\$connected with 
$connstr\$\$",
@@ -59,9 +58,7 @@ sub test_connect_ok
my $connstr = $_[1];
my $test_name = $_[2];
 
-   my $result =
- run_test_psql("$common_connstr $connstr", "(should succeed)");
-   ok($result, $test_name || $connstr);
+   ok(run_test_psql("$common_connstr $connstr"), $test_name);
 }
 
 sub test_connect_fails
@@ -70,8 +67,7 @@ sub test_connect_fails
my $connstr = $_[1];
my $test_name = $_[2];
 
-   my $result = run_test_psql("$common_connstr $connstr", "(should fail)");
-   ok(!$result, $test_name || "$connstr (should fail)");
+   ok(!run_test_psql("$common_connstr $connstr"), $test_name);
 }
 
 # Copy a set of files, taking into account wildcards
@@ -151,9 +147,6 @@ sub switch_server_cert
my $cafile   = $_[2] || "root+client_ca";
my $pgdata   = $node->data_dir;
 
-   note
- "reloading server with certfile \"$certfile\" and cafile \"$cafile\"";
-
open my $sslconf, '>', "$pgdata/sslconfig.conf";
print $sslconf "ssl=on\n";
print $sslconf "ssl_ca_file='$cafile.crt'\n";
diff --git a/src/test/ssl/t/001_ssltests.pl b/src/test/ssl/t/001_ssltests.pl
index 28837a1391..e53bd12ae9 100644
--- a/src/test/ssl/t/001_ssltests.pl
+++ b/src/test/ssl/t/001_ssltests.pl
@@ -47,113 +47,134 @@
 "user=ssltestuser dbname=trustdb sslcert=invalid hostaddr=$SERVERHOSTADDR 
host=common-name.pg-ssltest.test";
 
 # The server should not accept non-SSL connections.
-note "test that the server doesn't accept non-SSL connections";
-test_connect_fails($common_connstr, "sslmode=disable");
+test_connect_fails($common_connstr, "sslmode=disable",
+  "server doesn't accept non-SSL connections");
 
 # Try without a root cert. In sslmode=require, this should work. In verify-ca
 # or verify-full mode it should fail.
-note "connect without server root cert";
-test_connect_ok($common_connstr, "sslrootcert=invalid sslmode=require");
-test_connect_fails($common_connstr, "sslrootcert=invalid sslmode=verify-ca");
-test_connect_fails($common_connstr, "sslrootcert=invalid sslmode=verify-full");
+test_connect_ok($common_connstr, "sslrootcert=invalid sslmode=require",
+   "connect without server root cert 
sslmode=require");
+test_connect_fails($common_connstr, "sslrootcert=invalid sslmode=verify-ca",
+   "connect without server root cert 
sslmode=verify-ca");
+test_connect_fails($common_connstr, "sslrootcert=invalid sslmode=verify-full",
+  "connect without server root cert 
sslmode=verify-full");
 
 # Try with wrong root cert, should fail. (We're using the client CA as the
 # root, but the server's key is signed by the server CA.)
-note "connect with wrong server root cert";
 test_connect_fails($common_connstr,
-   

Re: [PROPOSAL] Shared Ispell dictionaries

2018-02-07 Thread Arthur Zakirov
On Thu, Jan 25, 2018 at 07:51:58PM +0300, Arthur Zakirov wrote:
> Attached new version of the patch.

Here is rebased version of the patch due to changes into dict_ispell.c.
The patch itself wasn't changed.

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
diff --git a/src/backend/tsearch/spell.c b/src/backend/tsearch/spell.c
index b9fdd77e19..e071994523 100644
--- a/src/backend/tsearch/spell.c
+++ b/src/backend/tsearch/spell.c
@@ -78,6 +78,8 @@
 #define tmpalloc(sz)  MemoryContextAlloc(Conf->buildCxt, (sz))
 #define tmpalloc0(sz)  MemoryContextAllocZero(Conf->buildCxt, (sz))
 
+#define tmpstrdup(str) MemoryContextStrdup(Conf->buildCxt, (str))
+
 /*
  * Prepare for constructing an ISpell dictionary.
  *
@@ -498,7 +500,7 @@ NIAddSpell(IspellDict *Conf, const char *word, const char 
*flag)
Conf->Spell[Conf->nspell] = (SPELL *) tmpalloc(SPELLHDRSZ + 
strlen(word) + 1);
strcpy(Conf->Spell[Conf->nspell]->word, word);
Conf->Spell[Conf->nspell]->p.flag = (*flag != '\0')
-   ? cpstrdup(Conf, flag) : VoidString;
+   ? tmpstrdup(flag) : VoidString;
Conf->nspell++;
 }
 
@@ -1040,7 +1042,7 @@ setCompoundAffixFlagValue(IspellDict *Conf, 
CompoundAffixFlag *entry,
entry->flag.i = i;
}
else
-   entry->flag.s = cpstrdup(Conf, s);
+   entry->flag.s = tmpstrdup(s);
 
entry->flagMode = Conf->flagMode;
entry->value = val;
@@ -1536,6 +1538,9 @@ nextline:
return;
 
 isnewformat:
+   pfree(recoded);
+   pfree(pstr);
+
if (oldformat)
ereport(ERROR,
(errcode(ERRCODE_CONFIG_FILE_ERROR),
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index c45979dee4..725473b7c2 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1364,6 +1364,35 @@ include_dir 'conf.d'
   
  
 
+ 
+  max_shared_dictionaries_size 
(integer)
+  
+   max_shared_dictionaries_size configuration 
parameter
+  
+  
+  
+   
+Sets the maximum size of all text search dictionaries loaded into 
shared
+memory.  The default is 100 megabytes (100MB).  This
+parameter can only be set at server start.
+   
+
+   
+Currently controls only loading of Ispell
+dictionaries (see ).
+After compiling the dictionary it will be copied into shared memory.
+On first use of the dictionary, other backends will use it from shared
+memory, so they don't need to compile the dictionary a second time.
+   
+
+   
+If the total size of simultaneously loaded dictionaries reaches the maximum
+allowed size, then a new dictionary will be loaded into the local memory of
+a backend instead.
+   
+  
+ 
+
  
   huge_pages (enum)
   
diff --git a/src/backend/commands/tsearchcmds.c 
b/src/backend/commands/tsearchcmds.c
index 3a843512d1..b6aeae449b 100644
--- a/src/backend/commands/tsearchcmds.c
+++ b/src/backend/commands/tsearchcmds.c
@@ -39,6 +39,7 @@
 #include "nodes/makefuncs.h"
 #include "parser/parse_func.h"
 #include "tsearch/ts_cache.h"
+#include "tsearch/ts_shared.h"
 #include "tsearch/ts_utils.h"
 #include "utils/builtins.h"
 #include "utils/fmgroids.h"
@@ -396,7 +397,8 @@ verify_dictoptions(Oid tmplId, List *dictoptions)
 * Call the init method and see if it complains.  We don't 
worry about
 * it leaking memory, since our command will soon be over 
anyway.
 */
-   (void) OidFunctionCall1(initmethod, 
PointerGetDatum(dictoptions));
+   (void) OidFunctionCall2(initmethod, 
PointerGetDatum(dictoptions),
+   
ObjectIdGetDatum(InvalidOid));
}
 
ReleaseSysCache(tup);
@@ -513,6 +515,8 @@ RemoveTSDictionaryById(Oid dictId)
 
	CatalogTupleDelete(relation, &tup->t_self);
 
+   ts_dict_shmem_release(dictId);
+
ReleaseSysCache(tup);
 
heap_close(relation, RowExclusiveLock);
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 0c86a581c0..c7dce8cac5 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -44,6 +44,7 @@
 #include "storage/procsignal.h"
 #include "storage/sinvaladt.h"
 #include "storage/spin.h"
+#include "tsearch/ts_shared.h"
 #include "utils/backend_random.h"
 #include "utils/snapmgr.h"
 
@@ -150,6 +151,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
size = add_size(size, SyncScanShmemSize());
size = add_size(size, AsyncShmemSize());
size = add_size(size, BackendRandomShmemSize());
+   size = add_size(size, TsearchShmemSize());
 #ifdef EXEC_BACKEND
size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -271,6 +273,11 @@ 

Re: Temporary tables prevent autovacuum, leading to XID wraparound

2018-02-07 Thread Robert Haas
On Tue, Feb 6, 2018 at 6:03 PM, Michael Paquier
 wrote:
> I am not sure that we would like to give up that easily the property
> that we have now to clean up past temporary files only at postmaster
> startup and only when not in recovery.  If you implement that, there is
> a risk that the backend you are starting is eating the connection slot
> and, as a consequence, its temporary schema and its set of temporary tables,
> which one may want to look into after a crash.

You can't actually do that from SQL, because as soon as you try to
access your pg_temp schema, it'll be cleared out.  All this changes
does is move that forward from time-of-first-access to session start.
That's a significant change that needs discussion, but I don't think
it has the effect you are supposing.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] PATCH: enabling parallel execution for cursors explicitly (experimental)

2018-02-07 Thread Robert Haas
On Mon, Jan 22, 2018 at 7:05 AM, Amit Kapila  wrote:
> On error, workers should be terminated.  What kind of problem do you
> have in mind?

Hmm.  Yeah, I guess that makes sense.  If the only thing you can do is
fetch from the cursor -- and you have to make sure to lock down
protocol messages as well as SQL commands -- and if any error kills
the workers -- then I guess this might be workable.  However, I think
there are still problems.  Say the worker hits an error while the
leader is idle; how do we report the error?  It's a protocol violation
for the leader to send an unsolicited ErrorResponse, which is why we
have to use FATAL rather than ERROR in various places for recovery
conflicts, query cancellation, etc.

Also, if you're OK with not being able to do anything except fetch
from the cursor until you reach the end, then couldn't you just issue
the query without the cursor and use PQsetSingleRowMode?
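
For reference, a minimal libpq sketch of that single-row-mode alternative
(connection string and query are placeholders; error handling is trimmed):

#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
	PGconn	   *conn = PQconnectdb("dbname=postgres");	/* placeholder */
	PGresult   *res;

	if (PQstatus(conn) != CONNECTION_OK)
		return 1;

	/* send the query without a cursor, then ask for one row at a time */
	if (PQsendQuery(conn, "SELECT generate_series(1, 1000000)") &&
		PQsetSingleRowMode(conn))
	{
		while ((res = PQgetResult(conn)) != NULL)
		{
			if (PQresultStatus(res) == PGRES_SINGLE_TUPLE)
			{
				/* process one row; here we only touch the first column */
				(void) PQgetvalue(res, 0, 0);
			}
			/* the final PGRES_TUPLES_OK result carries no rows */
			PQclear(res);
		}
	}

	PQfinish(conn);
	return 0;
}

The rows stream back one PGRES_SINGLE_TUPLE result at a time, so the client
never needs the whole result set in memory, which is most of what the cursor
was buying in the first place.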

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: MCV lists for highly skewed distributions

2018-02-07 Thread Dean Rasheed
On 7 February 2018 at 15:25, Robert Haas  wrote:
> Do you plan to press forward with this, then, or what's
> the next step?
>

Yes, I think the results are pretty good so far, especially for the
more non-uniform distributions. AFAICS it solves the 2 original
complaints, and I've not yet seen a case where it makes things worse.

I plan to do more testing (and if anyone else wants to, that would be
most welcome), and then barring any unexpected issues/objections, I'll
commit it. Probably not this week though.

Regards,
Dean



Re: [HACKERS] [PATCH] Improve geometric types

2018-02-07 Thread Emre Hasegeli
>  - line_eq looks too complex in the normal (not containing NANs)
>cases.  We should avoid such complexity if possible.
>
>One problem here is that comparison conceals NANness of
>operands. Conversely, arithmetic propagates it. We can converge
>NANness into a number. The attached line_eq() does that. This
>adds almost no additional complexity when NAN is
>involved. I believe it behaves in the same way
>and shares a dubious behavior like this:
>
>=# select '{nan, 1, nan}'::line = '{nan, 2, nan}'::line;
> ?column?
>--
> t
>
>But probably no point in fixing(?) it.

I think we should fix it.

>The attached file contains line_eq, point_eq_point and
>circle_same. I expect that line_eq is fast but the other two are
>dubious.

I haven't got an attachment.

>. . . Mmm.. The function seems broken. The fix for
>the existing version has been posted, and line_perp() in the attached
>file will work fine.

I am incorporating the fix you posted to the other thread into this patch.



Re: New gist vacuum.

2018-02-07 Thread Andrey Borodin
Hi, David!

> On 7 Feb 2018, at 18:39, David Steele wrote:
> 
> Hi Andrey,
> 
> On 1/21/18 5:34 AM, Andrey Borodin wrote:
>> Hello, Alexander!
>>> On 16 Jan 2018, at 21:42, Andrey Borodin wrote:
>>> Please find README patch attached.
>> 
>> Here's v2 version. Same code, but x2 comments. Also fixed important typo in 
>> readme BFS->DFS. Feel free to ping me any time with questions.
> 
> This patch has not gotten review and does not seem like a good fit for
> the last PG11 CF so I am marking it Returned with Feedback.

Why do you think this patch does not seem like a good fit for the CF?

I was talking with Alexander just yesterday at PgConf.Russia, and he was
going to provide a review.

Best regards, Andrey Borodin.


Re: Re: New gist vacuum.

2018-02-07 Thread David Steele
Hi Andrey,

On 1/21/18 5:34 AM, Andrey Borodin wrote:
> Hello, Alexander!
>> On 16 Jan 2018, at 21:42, Andrey Borodin wrote:
>> Please find README patch attached.
> 
> Here's v2 version. Same code, but x2 comments. Also fixed important typo in 
> readme BFS->DFS. Feel free to ping me any time with questions.

This patch has not gotten review and does not seem like a good fit for
the last PG11 CF so I am marking it Returned with Feedback.

Regards,
-- 
-David
da...@pgmasters.net



Re: Re: [HACKERS] Proposal: generic WAL compression

2018-02-07 Thread David Steele
Hi Oleg,

On 1/22/18 4:37 PM, Stephen Frost wrote:
> Oleg,
> 
> I'm not really sure why this is still in Needs Review as a review was
> posted and I don't see any follow-up.  I've changed this to be Waiting
> for Author.
> 
> * Antonin Houska (a...@cybertec.at) wrote:
>> Oleg Ivanov  wrote:
>>
>>> In order to overcome that issue, I would like to propose the patch, which
>>> provides possibility to use another approach of the WAL record
>>> construction.
>>
>> After having spent several hours reviewing this patch I dare to send the
>> following comments:
> 
> Thanks for the review Antonin!
> 
>> * writeToPtr() and readFromPtr() are applied to the existing code. I think
>>   this is a reason for a separate diff, to be applied before the actual 
>> patch.
> 
> I'm not really a fan of using these, for my 2c.  I'm not sure how others
> feel, but having these macros which change one of the arguments to the
> macro (by advancing the pointer) doesn't result in a net improvement to
> readability for me.
> 
> The other review comments also seem pretty reasonable to me.

This proposal is still in need of review and hasn't really gotten it in
the last few CFs.

Since the feature does not seem like a good fit for the last CF of PG11
I am marking it Returned with Feedback.

Regards,
-- 
-David
da...@pgmasters.net



Re: MCV lists for highly skewed distributions

2018-02-07 Thread Robert Haas
On Wed, Feb 7, 2018 at 10:20 AM, Dean Rasheed  wrote:
> One thing this new algorithm does do is improve the user's ability to
> get more MCVs by increasing the stats target. I'm not yet convinced
> there should be a separate knob for the RSE cutoff. For that to be
> useful, there would need to be some data distributions for which 10%
> (say) was clearly better, and some for which 20% was better. So far,
> there doesn't appear to be a massive difference between the two, and
> it's nothing that can't compensated for using the existing stats
> target knob.

Fair enough.  Do you plan to press forward with this, then, or what's
the next step?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: update tuple routing and triggers

2018-02-07 Thread Robert Haas
On Tue, Feb 6, 2018 at 12:52 AM, Amit Langote
 wrote:
> About renaming es_leaf_result_relations to
> es_tuple_routing_result_relations, I will defer that to committer.  But on
> second thought, maybe we don't need to make this patch larger than it has
> to be.

+1 for renaming it.  I like keeping the patch small, but not at the
price of being misleading.

+/*
+ * Since we're newly creating this ResultRelInfo, add it to
+ * someplace where others could find it.
+ */

How about: "Since we've just initialized this ResultRelInfo, it's not
in any list attached to the estate as yet.  Add it, so that it can be
found later."

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Re: [HACKERS] WIP: Separate log file for extension

2018-02-07 Thread David Steele
Hi Antonin,

On 1/10/18 5:38 PM, Tom Lane wrote:
> Antonin Houska  writes:
>> After having read the thread on your patch I think that the reason you were
>> asked to evaluate performance was that your patch can possibly make syslogger
>> a bottleneck. In contrast, my patch does not prevent user from disabling the
>> syslogger if it (the syslogger) seems to cause performance issues.
> 
> Just to clarify that: we know that in workloads that emit lots of log
> output, the log collector *already is* a bottleneck; there are reports
> that some people can't use it because it's too slow.  See e.g. towards
> the end of this thread:
> 
> https://www.postgresql.org/message-id/flat/CABUevExztL0GORyWM9S4tR_Ft3FmJbRaxQdxj%2BBQZjpvmRurdw%40mail.gmail.com
> 
> and particularly the referenced thread from 2011.  (I seem to recall other
> reports but didn't have much luck finding them.)
> 
> I'm quite concerned by the proposed patch, and not even so much any
> performance issues; what bothers me is that it adds complexity into a
> portion of the system where we can ill afford it.  Bugs in the logging
> mechanism compromise one's ability to have any faith in tracking down
> other bugs.  The difficulty of reconfiguring the logger on the fly
> is another reason to not want more configuration options for it.
> On the whole, therefore, I'd just as soon not go there --- especially
> seeing that there's been little field demand for such features.

I think this feature would be useful, especially for an extension like
pgaudit.  It's a request I hear fairly frequently.

However, there doesn't seem to be consensus that this is a viable
approach, so marked as Returned with Feedback for this CF.

This may be too invasive a feature to be a good fit for the last PG11 CF
in March but I hope you keep working on the idea.

Regards,
-- 
-David
da...@pgmasters.net



Re: JIT compiling with LLVM v10.0

2018-02-07 Thread Andres Freund
Hi,

I've pushed v10.0. The big (and pretty painful to make) change is that
now all the LLVM specific code lives in src/backend/jit/llvm, which is
built as a shared library which is loaded on demand.

The layout is now as follows:

src/backend/jit/jit.c:
Part of JITing always linked into the server. Supports loading the
LLVM using JIT library.

src/backend/jit/llvm/
Infrastructure:
 llvmjit.c:
General code generation and optimization infrastructure
 llvmjit_error.cpp, llvmjit_wrap.cpp:
Error / backward compat wrappers
 llvmjit_inline.cpp:
Cross module inlining support
Code-Gen:
  llvmjit_expr.c
Expression compilation
  llvmjit_deform.c
Deform compilation

I generally like how this shaped out. There's a good amount of followup
cleanup needed, but I'd appreciate some early feedback.


I've also rebased onto a recent master version.

postgres[21915][1]=# SELECT pg_llvmjit_available();
┌──────────────────────┐
│ pg_llvmjit_available │
├──────────────────────┤
│ t                    │
└──────────────────────┘
(1 row)

make -C src/backend/jit/llvm/ uninstall
postgres[21915][1]=# \c
You are now connected to database "postgres" as user "andres".
postgres[21922][1]=# SELECT pg_llvmjit_available();
┌──────────────────────┐
│ pg_llvmjit_available │
├──────────────────────┤
│ f                    │
└──────────────────────┘
(1 row)

Yeha ;)

Greetings,

Andres Freund



Re: Obsolete fmgr() declaration in fmgr.h

2018-02-07 Thread Dagfinn Ilmari Mannsåker
Robert Haas  writes:

> On Tue, Feb 6, 2018 at 9:37 AM, Dagfinn Ilmari Mannsåker
>  wrote:
>> Commit 5ded4bd21403e143dd3eb66b92d52732fdac1945 removed support for
>> version-0 function calling convention, and with it the fmgr() function.
>> However, the declaration was left behind in fmgr.h.  The attached patch
>> finishes the cleanup.
>
> Committed.

Thanks. Commitfest entry updated
(https://commitfest.postgresql.org/17/1512/).

- ilmari
-- 
- Twitter seems more influential [than blogs] in the 'gets reported in
  the mainstream press' sense at least.   - Matt McLeod
- That'd be because the content of a tweet is easier to condense down
  to a mainstream media article.  - Calle Dybedahl



Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2018-02-07 Thread Claudio Freire
On Wed, Feb 7, 2018 at 12:50 AM, Kyotaro HORIGUCHI
 wrote:
> Hello,
>
> At Tue, 6 Feb 2018 10:41:22 -0300, Claudio Freire  
> wrote in 

Re: Why does load_external_function() return PGFunction?

2018-02-07 Thread Tom Lane
Andres Freund  writes:
> On 2018-02-06 15:43:29 -0500, Tom Lane wrote:
>> void* isn't necessarily compatible with function pointers --- there are
>> platforms where they're physically different widths, though possibly
>> you'd never get PG to run on such hardware anyway.

> Fair point. Although we're relying on dlsym like infrastructure, which
> returns just a void *.

Yeah.  Presumably, a platform where they were really different would have
to provide some unstandardized variant of dlsym for fetching function
pointers.  We could cope with that fairly easily as things stand, since
we have platform-specific wrappers for dlsym anyway.  But if we made the
API for the wrappers dependent on data and code pointers being the same,
we'd be in trouble.
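
A sketch of what such a wrapper boils down to on POSIX-ish platforms (the
names here are illustrative, not the actual dynloader API): keeping the
wrapper's return type a function pointer confines the data-pointer /
code-pointer conversion to one place, which is also where a hypothetical
platform-specific dlsym variant would be substituted.

#include <dlfcn.h>
#include <stdio.h>

typedef void (*generic_funcptr) (void);

/*
 * The cast below is the conventional POSIX idiom.  A platform where code
 * and data pointers really differ would replace only this function with
 * its own symbol-lookup primitive; callers would not change.
 */
static generic_funcptr
lookup_function(void *handle, const char *name)
{
	return (generic_funcptr) dlsym(handle, name);
}

int
main(void)
{
	void	   *self = dlopen(NULL, RTLD_NOW);
	generic_funcptr fn = lookup_function(self, "puts");

	if (fn != NULL)
		((int (*) (const char *)) fn) ("resolved puts via the wrapper");

	dlclose(self);
	return 0;
}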

regards, tom lane



Re: [HACKERS] path toward faster partition pruning

2018-02-07 Thread Robert Haas
On Wed, Feb 7, 2018 at 8:37 AM, Ashutosh Bapat
 wrote:
> While looking at the changes in partition.c I happened to look at the
> changes in try_partition_wise_join(). They mark partitions deemed
> dummy by pruning as dummy relations. If we accept those changes, we
> could very well change the way we handle dummy partitioned tables,
> which would mean that we also revert the recent commit
> f069c91a5793ff6b7884120de748b2005ee7756f. But I guess, those changes
> haven't been reviewed yet and so not final.

Well, if you have an opinion on those proposed changes, I'd like to hear it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Obsolete fmgr() declaration in fmgr.h

2018-02-07 Thread Robert Haas
On Tue, Feb 6, 2018 at 9:37 AM, Dagfinn Ilmari Mannsåker
 wrote:
> Commit 5ded4bd21403e143dd3eb66b92d52732fdac1945 removed support for
> version-0 function calling convention, and with it the fmgr() function.
> However, the declaration was left behind in fmgr.h.  The attached patch
> finishes the cleanup.

Committed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] path toward faster partition pruning

2018-02-07 Thread Ashutosh Bapat
On Wed, Feb 7, 2018 at 6:49 PM, Robert Haas  wrote:
> On Wed, Feb 7, 2018 at 3:42 AM, Ashutosh Bapat
>  wrote:
>> partprune.c looks too tightly tied to one feature. I am sure that the
>> functions used for partition pruning can be used by other
>> optimizations as well.
>
> Uh, I don't know about that, this code looks like it does partition
> pruning specifically, and nothing else.  How else do you think it
> could be used?

I didn't have any specific thing in mind, and the more I think about it,
the less likely it looks that it will be used for something else.

While looking at the changes in partition.c I happened to look at the
changes in try_partition_wise_join(). They mark partitions deemed
dummy by pruning as dummy relations. If we accept those changes, we
could very well change the way we handle dummy partitioned tables,
which would mean that we also revert the recent commit
f069c91a5793ff6b7884120de748b2005ee7756f. But I guess, those changes
haven't been reviewed yet and so not final.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: MCV lists for highly skewed distributions

2018-02-07 Thread Robert Haas
On Wed, Feb 7, 2018 at 3:51 AM, Dean Rasheed  wrote:
> Thanks for testing. I agree, this new algorithm seems to stand up
> pretty well in the testing I've done too. One thing about it that can
> be tuned is the cutoff threshold for the relative standard error -- I
> chose 10% completely arbitrarily, but it could just as easily be 20%,
> 30% or even higher (in the patch s/0.01/0.04/ or s/0.01/0.09).

Maybe it'd be worth having a separate GUC for this, and a reloption to
override the GUC.  It seems to me that it would be very reasonable to
want to separately control (a) how much sampling you want to do and
(b) how aggressively you want to be about including things in the MCV
list.  Of course, even if we do that, it doesn't excuse us from
needing to set a good default value.  And it might not be necessary to
do the extra work for this.

Looking at your data, it definitely seems like 10% would be too strict
-- if I'm reading this correctly, with a stats target in the 10-50
range, your normally-distributed table gets no MCVs at all, rather
than a number equal to the statistics target.  That doesn't seem good.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] path toward faster partition pruning

2018-02-07 Thread Robert Haas
On Wed, Feb 7, 2018 at 3:42 AM, Ashutosh Bapat
 wrote:
> partprune.c looks too tightly tied to one feature. I am sure that the
> functions used for partition pruning can be used by other
> optimizations as well.

Uh, I don't know about that, this code looks like it does partition
pruning specifically, and nothing else.  How else do you think it
could be used?

> partition.c seems to have two kinds of functions 1. that build and
> manage relcache, creates quals from bounds etc. which are metadata
> management kind 2. partition bound comparison functions, and other
> optimizer related functions. Maybe we should divide the file that
> way. The first category of code remains in catalog/ as it is today. The
> second category of functions moves to optimizer/.

It would be sensible to separate functions that build and manage data
in the relcache from other functions.  I think we should consider
moving the existing functions of that type from partition.c to
src/backend/utils/cache/partcache.c.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Restrict concurrent update/delete with UPDATE of partition key

2018-02-07 Thread amul sul
On Tue, Feb 6, 2018 at 7:05 PM, amul sul  wrote:
> On Sun, Feb 4, 2018 at 10:47 AM, Amit Kapila  wrote:
>> On Fri, Feb 2, 2018 at 2:11 PM, amul sul  wrote:
>>> On Fri, Jan 26, 2018 at 11:58 AM, Amit Kapila  
>>> wrote:
[]
>>
>> I wonder what will be the behavior of this patch with
>> wal_consistency_checking [1].  I think it will generate a failure as
>> there is nothing in WAL to replay it.  Can you once try it?  If we see
>> a failure with wal consistency checker, then we need to think whether
>> (a) we want to deal with it by logging this information, or (b) do we
>> want to mask it or (c) something else?
>>
>>
>> [1] -  
>> https://www.postgresql.org/docs/devel/static/runtime-config-developer.html
>>
>
> Yes, you are correct, the standby stopped with the following error:
>
>  FATAL:  inconsistent page found, rel 1663/13260/16390, forknum 0, blkno 0
>  CONTEXT:  WAL redo at 0/3002510 for Heap/DELETE: off 6 KEYS_UPDATED
>  LOG:  startup process (PID 22791) exited with exit code 1
>  LOG:  terminating any other active server processes
>  LOG:  database system is shut down
>
> I have tested a warm standby replication setup using the attached script.
> Without the wal_consistency_checking setting, it works fine and data from
> master to standby is replicated as expected. If this guarantee is enough,
> then I think we could skip this error from the wal consistency check for
> such deleted tuples (I guess option b that you have suggested). Thoughts?

I tried to mask ctid.ip_blkid if it is set to InvalidBlockId with
following change in heap_mask:

- PATCH -
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 682f4f07a8..e7c011f9a5 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -9323,6 +9323,10 @@ heap_mask(char *pagedata, BlockNumber blkno)
 */
if (HeapTupleHeaderIsSpeculative(page_htup))
ItemPointerSet(&page_htup->t_ctid, blkno, off);
+
+   /* TODO : comments ?  */
+   if (!BlockNumberIsValid(BlockIdGetBlockNumber(&((page_htup->t_ctid).ip_blkid))))
+   BlockIdSet(&((page_htup->t_ctid).ip_blkid), blkno);
}

/*
- END -

The test script [1] works as expected with this change, but I don't have much
confidence in it due to my lack of knowledge of the wal_consistency_checking
routine. Any suggestions/comments will be much appreciated, thanks!

[1] 
https://postgr.es/m/CAAJ_b94_29wiUA83W8LQjtfjv9XNV=+pt8+iowrpjnnfhe3...@mail.gmail.com

Regards,
Amul



Re: In logical replication concurrent update of partition key creates a duplicate record on standby.

2018-02-07 Thread Amit Kapila
On Wed, Feb 7, 2018 at 3:42 PM, amul sul  wrote:
> On Wed, Feb 7, 2018 at 3:03 PM, Amit Khandekar  wrote:
>> On 7 February 2018 at 13:53, amul sul  wrote:
>>> Hi,
>>>
>>> If an update of the partition key involves tuple movement from one partition
>>> to another, then a separate delete on one partition and an insert on the
>>> other partition are made.
>>>
>>> In logical replication, if an update is performed on the master and the
>>> standby at the same moment, the replication worker tries to replicate the
>>> delete + insert operations on the standby. While replaying the master's
>>> changes on the standby, for the delete operation the worker will log a
>>> "concurrent update, retrying" message (because the update on the standby
>>> has already deleted the tuple) and move forward to replay the next insert
>>> operation. The standby's own update already did the same delete+insert as
>>> part of the partition-key update, so as a result two records end up
>>> inserted on the standby.
>>
>> A quick thinking on how to resolve this makes me wonder if we can
>> manage to pass some information through logical decoding that the
>> delete is part of a partition key update. This is analogous to how we
>> set some information locally in the tuple by setting
>> tp.t_data->t_ctid.ip_blkid to InvalidBlockNumber.
>>
>
> +1,
>

I also mentioned the same thing in the other thread [1], but I think
that alone won't solve the dual-record problem you are seeing.  I
think we need to do something for the next insert, as you are suggesting.

> also, if the worker failed to replay the delete operation on the standby,
> then we need to decide what the next step will be: should we skip the
> following insert operation, error out, or something else?
>

That would be tricky, do you see any simple way of doing either of those.


[1] - 
https://www.postgresql.org/message-id/CAA4eK1%2BHopDbA3h0oYXE1kuhsU0rLT-hONeeS0SoG36YpeSnGw%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: In logical replication concurrent update of partition key creates a duplicate record on standby.

2018-02-07 Thread Craig Ringer
On 7 February 2018 at 17:33, Amit Khandekar  wrote:


>
> A quick thinking on how to resolve this makes me wonder if we can
> manage to pass some information through logical decoding that the
> delete is part of a partition key update. This is analogous to how we
> set some information locally in the tuple by setting
> tp.t_data->t_ctid.ip_blkid to InvalidBlockNumber.
>
>
We already do something similar for UPDATEs that change the REPLICA
IDENTITY; we include the oldkey in extra WAL.

The main question is whether the required knowledge is available at a
suitable level.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Restrict concurrent update/delete with UPDATE of partition key

2018-02-07 Thread Amit Kapila
On Sat, Feb 3, 2018 at 4:04 AM, Robert Haas  wrote:
> On Fri, Jan 26, 2018 at 1:28 AM, Amit Kapila  wrote:
>> So, this means that in case of logical replication, it won't generate
>> the error this patch is trying to introduce.  I think if we want to
>> handle this we need some changes in WAL and logical decoding as well.
>>
>> Robert, others, what do you think?  I am not very comfortable leaving
>> this unaddressed, if we don't want to do anything about it, at least
>> we should document it.
>
> As I said on the other thread, I'm not sure how reasonable it really
> is to try to do anything about this.  For both the issue you raised
> there, I think we'd need to introduce a new WAL record type that
> represents a delete from one table and an insert to another that
> should be considered as a single operation.
>

I think that to solve the issue in this thread, a flag should be sufficient,
one that logical replication can use instead of the InvalidBlockNumber in the
CTID for such Deletes.

> I'm not keen on that idea,
> but you can make an argument that it's the Right Thing To Do.  I would
> be more inclined, at least for v11, to just document that the
> delete+insert will be replayed separately on replicas.
>

Even if we do what you are suggesting, we need something in WAL (probably a
flag to indicate this special type of Delete); otherwise, the WAL consistency
checker will fail.  Another idea would be to mask the ctid change so that the
WAL consistency checker doesn't cry.
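
To make the first idea concrete, here is a minimal sketch of the shape such a
change could take (the names, bit value and parameter below are hypothetical,
purely for illustration; an actual patch would pick its own):

    /* heapam_xlog.h: a new bit in xl_heap_delete.flags (hypothetical) */
    #define XLH_DELETE_IS_PARTITION_MOVE    (1 << 4)

    /* heap_delete(): set it when the delete is caused by the row moving to
     * another partition, i.e. the same case in which the old tuple's ctid is
     * stamped with InvalidBlockNumber. */
    if (changing_partition)                 /* hypothetical parameter */
        xlrec.flags |= XLH_DELETE_IS_PARTITION_MOVE;

Logical decoding (and the WAL consistency checker, for the masking idea) could
then key off that flag instead of having to guess from the ctid alone.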


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Add PGDLLIMPORT to enable_hashagg

2018-02-07 Thread Metin Doslu
Hey all,

I'm attaching the updated patch; it includes

i. The variables from Pascal's list (max_worker_processes already had
PGDLLIMPORT, so I also added it to max_parallel_workers)
ii. Some others in cost.h to make the file more readable (a sketch of the
shape of the change is below).
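
For reference, the change itself is mechanical -- roughly this, per variable
(the exact set of declarations touched is in the attached patch):

    /* e.g. in src/include/optimizer/cost.h */
    extern PGDLLIMPORT bool enable_hashagg;
    extern PGDLLIMPORT int  max_parallel_workers_per_gather;

PGDLLIMPORT only matters for Windows builds, where extensions cannot reference
a backend global variable unless it is explicitly exported like this.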

Best,
Metin

On Tue, Feb 6, 2018 at 10:40 PM, Peter Geoghegan  wrote:
> On Tue, Feb 6, 2018 at 12:39 PM, Robert Haas  wrote:
>> Yeah, let's get them all into one list and I'll commit the whole thing 
>> together.
>
> +1
>
> --
> Peter Geoghegan
>


add_pgdllimport-v2.patch
Description: Binary data


Re: ERROR: could not map dynamic shared memory segment

2018-02-07 Thread Thomas Munro
On Wed, Feb 7, 2018 at 11:42 PM, tushar  wrote:
> I am getting  ERROR:  could not map dynamic shared memory segment in the log
> file
>
> - Please refer this scenario-
>
> in V11/V10 latest sources
>
>  set parallel_setup_cost=0;
>  set parallel_tuple_cost=0;
>  set max_parallel_workers_per_gather=4;
>  create table r(n int);
> insert into r values (generate_series(1,100));
> insert into r values (generate_series(100,200));
> analyze r;
>
> postgres=# select * from r where n < (select n from r where n<=1 limit
> 6644);
> ERROR:  more than one row returned by a subquery used as an expression
>
> in the log file -
>
> 2018-02-07 10:28:27.615 GMT [20569] ERROR:  more than one row returned by a
> subquery used as an expression
> 2018-02-07 10:28:27.615 GMT [20569] STATEMENT:  select * from r where n <
> (select n from r where n<=1 limit 6644);
> 2018-02-07 10:28:27.616 GMT [20586] ERROR:  could not map dynamic shared
> memory segment
> 2018-02-07 10:28:27.616 GMT [20587] ERROR:  could not map dynamic shared
> memory segment
> 2018-02-07 10:28:27.617 GMT [20559] LOG:  background worker "parallel
> worker" (PID 20586) exited with exit code 1
> 2018-02-07 10:28:27.617 GMT [20559] LOG:  background worker "parallel
> worker" (PID 20587) exited with exit code 1
>
> Is this something already reported ?

I think this is expected.  The leader process errors out and detaches
from the reference-counted DSM segment before the worker processes
manage to start up, and when they try to attach they can't because the
DSM segment is gone.  This happens with anything that errors out very
quickly in the leader, like SELECT COUNT(r.n/0) FROM r using your
example table.
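
For reference, the message comes from the worker-side attach, roughly this
code in ParallelWorkerMain() in access/transam/parallel.c (quoted from memory,
so check the source for the exact form):

    /* Attach to the DSM segment the leader set up; if the leader has already
     * errored out and dropped its reference, the segment no longer exists. */
    seg = dsm_attach(DatumGetUInt32(main_arg));
    if (seg == NULL)
        ereport(ERROR,
                (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
                 errmsg("could not map dynamic shared memory segment")));

In other words, the worker errors are a harmless consequence of the leader's
earlier error, not an independent problem.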

-- 
Thomas Munro
http://www.enterprisedb.com



ERROR: could not map dynamic shared memory segment

2018-02-07 Thread tushar

Hi ,

I am getting  ERROR:  could not map dynamic shared memory segment in the log file


- Please refer this scenario-

in V11/V10 latest sources

 set parallel_setup_cost=0;
 set parallel_tuple_cost=0;
 set max_parallel_workers_per_gather=4;
 create table r(n int);
insert into r values (generate_series(1,100));
insert into r values (generate_series(100,200));
analyze r;

postgres=# select * from r where n < (select n from r where n<=1 limit 6644);

ERROR:  more than one row returned by a subquery used as an expression

in the log file -

2018-02-07 10:28:27.615 GMT [20569] ERROR:  more than one row returned by a subquery used as an expression
2018-02-07 10:28:27.615 GMT [20569] STATEMENT:  select * from r where n < (select n from r where n<=1 limit 6644);
2018-02-07 10:28:27.616 GMT [20586] ERROR:  could not map dynamic shared memory segment
2018-02-07 10:28:27.616 GMT [20587] ERROR:  could not map dynamic shared memory segment
2018-02-07 10:28:27.617 GMT [20559] LOG:  background worker "parallel worker" (PID 20586) exited with exit code 1
2018-02-07 10:28:27.617 GMT [20559] LOG:  background worker "parallel worker" (PID 20587) exited with exit code 1


Is this something already reported ?

--
regards,tushar
EnterpriseDB  https://www.enterprisedb.com/
The Enterprise PostgreSQL Company




Re: Incorrect grammar

2018-02-07 Thread Etsuro Fujita

(2018/02/07 5:51), Robert Haas wrote:

On Tue, Feb 6, 2018 at 5:22 AM, Etsuro Fujita
  wrote:

While reviewing the
lazy-initialization-of-partition-info-for-tuple-routing patch, I ran
into a grammar mistake in a comment in ExecSetupChildParentMapForLeaf.
Attached is a patch for fixing that.


Committed.


Thank you.

Best regards,
Etsuro Fujita



Re: non-bulk inserts and tuple routing

2018-02-07 Thread Etsuro Fujita

(2018/02/05 19:43), Etsuro Fujita wrote:

(2018/02/05 14:34), Amit Langote wrote:

The code in tupconv_map_for_subplan() currently assumes that it can rely
on all leaf partitions having been initialized.


On reflection I noticed this analysis is not 100% correct; I think what
that function actually assumes is that all subplans' partitions have
already been initialized, not all leaf partitions.



Since we're breaking that
assumption with this proposal, that needed to be changed. So the patch
contained some refactoring to make it not rely on that assumption.


I don't think we really need this refactoring, because, as in another patch
you posted, we could initialize all subplans' partitions in
ExecSetupPartitionTupleRouting; then I think tupconv_map_for_subplan could be
called without any changes to that function, because of what I said above.



Here is the updated version that contains two patches as described above.


Thanks for updating the patches!  I'll post my next comments in a few days.


Here are comments for the other patch (patch 
v24-0002-During-tuple-routing-initialize-per-partition-ob.patch):


o On changes to ExecSetupPartitionTupleRouting:

* The comment below wouldn't be correct; no ExecInitResultRelInfo in the 
patch.


-   proute->partitions = (ResultRelInfo **) palloc(proute->num_partitions *
-                                                  sizeof(ResultRelInfo *));
+   /*
+    * Actual ResultRelInfo's and TupleConversionMap's are allocated in
+    * ExecInitResultRelInfo().
+    */
+   proute->partitions = (ResultRelInfo **) palloc0(proute->num_partitions *
+                                                   sizeof(ResultRelInfo *));


* The patch removes this from the initialization step for a subplan's 
partition, but I think it would be better to keep this here because I 
think it's a good thing to put the initialization stuff together into 
one place.


-   /*
-    * This is required in order to we convert the partition's
-    * tuple to be compatible with the root partitioned table's
-    * tuple descriptor.  When generating the per-subplan result
-    * rels, this was not set.
-    */
-   leaf_part_rri->ri_PartitionRoot = rel;

* I think it would be better to keep this comment here.

-   /* Remember the subplan offset for this ResultRelInfo */

* Why is this removed from that initialization?

-   proute->partitions[i] = leaf_part_rri;

o On changes to ExecInitPartitionInfo:

* I don't understand the step starting from the code below, but I'm wondering
if that step can be removed by keeping the above setup of proute->partitions
for the subplan's partition in ExecSetupPartitionTupleRouting.


+   /*
+    * If we are doing tuple routing for update, try to reuse the
+    * per-subplan resultrel for this partition that ExecInitModifyTable()
+    * might already have created.
+    */
+   if (mtstate && mtstate->operation == CMD_UPDATE)

That's all I have for now.

Best regards,
Etsuro Fujita



Re: In logical replication concurrent update of partition key creates a duplicate record on standby.

2018-02-07 Thread amul sul
On Wed, Feb 7, 2018 at 3:03 PM, Amit Khandekar  wrote:
> On 7 February 2018 at 13:53, amul sul  wrote:
>> Hi,
>>
>> If an update of the partition key involves tuple movement from one partition
>> to another partition, then there will be a separate delete made on one
>> partition and an insert on the other partition.
>>
>> In logical replication, if an update is performed on the master and the
>> standby at the same moment, then the replication worker tries to replicate
>> the delete + insert operation on the standby. While replaying the master's
>> changes on the standby, for the delete operation the worker will log a
>> "concurrent update, retrying" message (because the update on the standby has
>> already deleted the tuple) and move forward to replay the next insert
>> operation. The standby's update also did the same delete+insert as part of
>> the update of the partition key, and as a result there will be two records
>> inserted on the standby.
>
> Thinking quickly about how to resolve this makes me wonder if we can
> manage to pass some information through logical decoding indicating that
> the delete is part of a partition-key update. This is analogous to how we
> set some information locally in the tuple by setting
> tp.t_data->t_ctid.ip_blkid to InvalidBlockNumber.
>

+1, also if the worker fails to replay the delete operation on the standby,
then we need to decide what the next step will be: should we skip the
following insert operation, error out, or do something else?

> I guess, at node 2 where this issue reproduces, the issue can also be
> reproduced if there is a non-partition-key UPDATE going on, and the tuple
> gets deleted as part of the replaying of the partition-key update?  This
> UPDATE will skip the update, thinking that the tuple is deleted.  This is
> similar to what's happening now in the case of local concurrent updates,
> for which the fix is being worked on.

Yes, you are correct; at node 2 the reported issue is also reproducible
without an update of the partition key.

== NODE 2 ==

postgres=# update foo set  b='node2_update' where a=1;
UPDATE 1
postgres=# select * from foo;
 a |      b
---+--------------
 1 | node2_update
(1 row)

< -- continued replication worker -->

postgres=# 2018-02-07 15:26:53.323 IST [86449] LOG:  concurrent update, retrying

postgres=# select tableoid::regclass, * from foo;
 tableoid | a |      b
----------+---+--------------
 foo1     | 1 | node2_update
 foo2     | 2 | node1_update
(2 rows)

Regards,
Amul Sul



Re: git instructions

2018-02-07 Thread Magnus Hagander
On Tue, Feb 6, 2018 at 9:59 PM, David G. Johnston <
david.g.johns...@gmail.com> wrote:

> On Tue, Feb 6, 2018 at 1:46 PM, Stefan Kaltenbrunner <
> ste...@kaltenbrunner.cc> wrote:
>
>>
>> >
>> > Yes, this used to be the case, and is the reason behind the original
>> > recommendation. It's what they call the "dumb HTTP protocol" in the
>> > docs. This is not the case when you use git-http-backend, which is the
>> > change we made a few months back.
>>
>> Agreed - wrt the actual patch - not sure it is accurate to classify the
>> current way as the "older git protocol" as I cannot find that wording
>> used in the git docs - maybe "classic"?
>
>
> Neither "older" nor "classic" appeals to me.  If you want to convey an
> opinion of quality I'd say something like "the more limited git protocol"
> otherwise it's just "the git protocol" and we can explain the pros and cons
> between the http and git protocols.  Noting the improvement of the http
> protocol from its former "dumb" version, early on so people have the new
> paradigm in their head when they get to the quality comparison, will be
> worthwhile for some period of time.
>
>
Just "the git protocol" is probably best here, so I changed it to that. I also
changed the http -> https URLs per Stefan's suggestion.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: In logical replication concurrent update of partition key creates a duplicate record on standby.

2018-02-07 Thread Amit Khandekar
On 7 February 2018 at 13:53, amul sul  wrote:
> Hi,
>
> If an update of the partition key involves tuple movement from one partition
> to another partition, then there will be a separate delete made on one
> partition and an insert on the other partition.
>
> In logical replication, if an update is performed on the master and the
> standby at the same moment, then the replication worker tries to replicate
> the delete + insert operation on the standby. While replaying the master's
> changes on the standby, for the delete operation the worker will log a
> "concurrent update, retrying" message (because the update on the standby has
> already deleted the tuple) and move forward to replay the next insert
> operation. The standby's update also did the same delete+insert as part of
> the update of the partition key, and as a result there will be two records
> inserted on the standby.

Thinking quickly about how to resolve this makes me wonder if we can
manage to pass some information through logical decoding indicating that
the delete is part of a partition-key update. This is analogous to how we
set some information locally in the tuple by setting
tp.t_data->t_ctid.ip_blkid to InvalidBlockNumber.
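
For reference, the local marking being referred to looks roughly like this in
that other patch (a sketch, not the exact code; the condition name below is
hypothetical):

    /* heap_delete(): when the DELETE is only the first half of a
     * cross-partition UPDATE, stamp an invalid block number into the old
     * tuple's ctid, so anyone following the update chain can distinguish
     * "row moved to another partition" from "row deleted". */
    if (row_moved_to_another_partition)     /* hypothetical flag */
        BlockIdSet(&(tp.t_data->t_ctid.ip_blkid), InvalidBlockNumber);
    else
        tp.t_data->t_ctid = tp.t_self;      /* ordinary delete */

The open question is whether an equivalent hint can be made to survive into
WAL and logical decoding, since the subscriber never sees the ctid.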

I guess, at node 2 where this issue reproduces, the issue can also be
reproduced if there is a non-partition-key UPDATE going on, and the tuple
gets deleted as part of the replaying of the partition-key update?  This
UPDATE will skip the update, thinking that the tuple is deleted.  This is
similar to what's happening now in the case of local concurrent updates,
for which the fix is being worked on.



Re: MCV lists for highly skewed distributions

2018-02-07 Thread Dean Rasheed
On 4 February 2018 at 12:18, John Naylor  wrote:
> I did the same basic eyeball testing I did on earlier patches, and
> this is the best implementation so far. I've attached some typical
> pg_stats contents for HEAD and this patch. More rigorous testing,
> including of planner estimates on real data, is still needed of
> course, but I think this is definitely in the right direction.
>

Thanks for testing. I agree, this new algorithm seems to stand up
pretty well in the testing I've done too. One thing about it that can
be tuned is the cutoff threshold for the relative standard error -- I
chose 10% completely arbitrarily, but it could just as easily be 20%,
30% or even higher (in the patch, s/0.01/0.04/ or s/0.01/0.09/).

Looking at the too-many-MCVs example from [1], the table has 100
million rows, and each distinct value occurs 10 times. With the
default stats target of 100, the MCV-list is fully populated with
values, most having been seen two (or occasionally three) times. Thus,
if you query for one of those MCVs, the estimate is 6667 or 10,000
rather than 10, which is a pretty bad over-estimate. Increasing the
stats target improves things, because the sample frequencies go down
correspondingly. The problem is also lessened a bit if you randomise
the order of the values in the second column so that it isn't
correlated to the storage order:

insert into qqq select a, random()*10^7 from generate_series(1, (10^8)::int) a;

However, in a sample of 30,000 rows it remains reasonably likely that
the same value will be seen twice (I typically observed 40 to 50 MCVs
in this case, cf. the birthday paradox), and in a sample of 300,000
rows it goes back to giving 1000 MCVs, each seen 2 or 3 times. So this
isn't really the fault of the non-uniform ANALYSE sample, but rather
of the current algorithm's belief that seeing a value just twice is
sufficient for it to qualify as an MCV.
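
For anyone who wants the back-of-the-envelope arithmetic behind those numbers
(my own working from the figures above, not output from the patch):

    estimate for a value seen c times   = (c / 30000) * 10^8
                                        ~ 6667 for c = 2, 10000 for c = 3
    expected same-value pairs in sample ~ C(30000,2) * 9 / 10^8 ~ 40

which matches both the 6667/10,000 estimates and the 40-50 spurious MCVs
mentioned above.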

The new MCV algorithm solves this by demanding that a value be seen
many more times before being included in cases where the table size is
much larger than the sample size. The exact threshold depends on the
table size, the sample size and the relative error cutoff value. In
this example, with a table size of 100 million rows, the sample size
makes little difference, because it's always going to be much smaller
than the table size, so the number of instances required in the sample
depends mostly on the RSE cutoff chosen:

 rse_cutoff | sample=30000 | sample=300000 | sample=3000000
------------+--------------+---------------+----------------
 10%        |          100 |           100 |             97
 20%        |           25 |            25 |             25
 30%        |           12 |            12 |             11
 40%        |            7 |             7 |              7
 50%        |            4 |             4 |              4
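
For reference, my reading of where those numbers come from (this may not be
exactly the formula used in the patch, but it reproduces the table): treating
a value's sample count c as having relative standard error

    RSE(c) ~ sqrt((1 - n/N) / c)    for a sample of n rows from an N-row table

the requirement RSE(c) < cutoff gives c >= (1 - n/N) / cutoff^2.  With N = 10^8
and n much smaller than N, that is essentially 1/cutoff^2, i.e. about 100, 25,
12, 7 and 4 occurrences for cutoffs of 10%, 20%, 30%, 40% and 50%; the 97 and
11 in the last column are the (1 - n/N) factor starting to matter for the
largest sample.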

So any of those RSE cutoffs solves the too-many-MCVs problem in this
particular case, giving no MCVs, although 50% is pushing it a bit, and
in any case, the theoretical basis for this algorithm breaks down well
before then.

The advantage of a larger RSE cutoff is that it gives more MCVs for a
non-uniform data distribution, where the current algorithm tends to
give too few MCVs. In the attached example, the table has 1 million
rows of integer data in a normal distribution with a standard
deviation of 50. This typically gives somewhere in the region of 430
to 440 distinct values. Running against HEAD and with this patch, for
varying sample sizes and RSE cutoffs gives the following MCV-list
sizes:

stats_target  mcvs (HEAD)  mcvs (10% RSE)  mcvs (20% RSE)  mcvs (30% RSE)
          10           10               0              10              10
          20           20               0              20              20
          30           30               0              30              30
          40           40              17              40              40
          50           50              50              50              50
         100          100             100             100             100
         300          132             204             261             290
        1000          205             266             315             338
        3000          252             364             400             411
       10000          296             413             414             412

One thing to note is that the HEAD algorithm never includes more than
around 300 values, because of its requirement for a value to be more
common than average. The new algorithm has no such requirement, so it
will include nearly every value in the MCV list (except the most
uncommon ones) if the stats target is set high enough. Also, the
detailed output from the test shows that the resulting estimates based
on those MCVs are pretty decent.

(Note: this example shows that the too-few-MCVs problem can occur in
any non-uniform distribution, not just highly skewed ones.)

I've also tried this out against some real-world data, and in the
testing I've done so far, the new algorithm is not actually that
sensitive to 

Re: [HACKERS] Restrict concurrent update/delete with UPDATE of partition key

2018-02-07 Thread amul sul
On Tue, Feb 6, 2018 at 7:05 PM, amul sul  wrote:
> On Sun, Feb 4, 2018 at 10:47 AM, Amit Kapila  wrote:
>> On Fri, Feb 2, 2018 at 2:11 PM, amul sul  wrote:
>>> On Fri, Jan 26, 2018 at 11:58 AM, Amit Kapila  
>>> wrote:
>>> []
 I think you can manually (via debugger) hit this by using
 PUBLICATION/SUBSCRIPTION syntax for logical replication.  I think what
 you need to do is in node-1, create a partitioned table and subscribe
 it on node-2.  Now, perform an Update on node-1, then stop the logical
 replication worker before it calls heap_lock_tuple.  Now, in node-2,
 update the same row such that it moves the row.  Now, continue the
 logical replication worker.  I think it should hit your new code, if
 not then we need to think of some other way.

>>>
>>> I am able to hit the change log using above steps. Thanks a lot for the
>>> step by step guide, I really needed that.
>>>
>>> One strange behavior I found in the logical replication which is 
>>> reproducible
>>> without attached patch as well -- when I have updated on node2 by keeping
>>> breakpoint before the heap_lock_tuple call in replication worker, I can see
>>> a duplicate row was inserted on the node2, see this:
>>>
>> ..
>>>
>>> I am thinking to report this in a separate thread, but not sure if
>>> this is already known behaviour or not.
>>>
>>
>> I think it is worth to discuss this behavior in a separate thread.
>> However, if possible, try to reproduce it without partitioning and
>> then report it.
>>
> Logical replication behavior for the normal table is as expected, this happens
> only with partition table, will start a new thread for this on hacker.
>
Posted on hackers:
https://postgr.es/m/CAAJ_b94bYxLsX0erZXVH-anQPbWqcYUPWX4xVRa1YJY=ph6...@mail.gmail.com

Regards,
Amul Sul



Re: MCV lists for highly skewed distributions

2018-02-07 Thread Dean Rasheed
On 1 February 2018 at 17:49, Robert Haas  wrote:
> One point which I want to emphasize is that the length of the MCV list
> bounds the estimated frequency of non-MCVs in two ways: no non-MCV is
> ever thought to be more frequent than the least-common MCVs, and
> however many non-MCVs we think we have (probably fewer than we
> actually have) have to fit into whatever percentage of the table is
> consumed by MCVs.  This would be less important if we had reliable
> n_distinct estimates, but we don't.  So, even throwing things into the
> MCV list that are no more common than the average item can improve
> planning in some cases.
>

That's a good point, and a nice explanation. I think that lends more
weight to the argument that we should be including as many MCVs as
possible, provided there's enough evidence to justify their inclusion.

Regards,
Dean



Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2018-02-07 Thread Ashutosh Bapat
On Fri, Dec 22, 2017 at 3:00 PM, Rajkumar Raghuwanshi
 wrote:
> On Wed, Dec 20, 2017 at 5:21 PM, Ashutosh Bapat
>  wrote:
>>
>> Thanks. Here are some comments
>>
> Thanks Ashutosh for review and suggestions.
>
>>
>> +-- test default partition behavior for range
>> +ALTER TABLE prt1 DETACH PARTITION prt1_p3;
>> +ALTER TABLE prt1 ATTACH PARTITION prt1_p3 DEFAULT;
>> +ALTER TABLE prt2 DETACH PARTITION prt2_p3;
>> +ALTER TABLE prt2 ATTACH PARTITION prt2_p3 DEFAULT;
>>
>> I think we need an ANALYZE here in case the statistics gets updated while
>> DETACH and ATTACH is going on. Other testcases also need to be updated
>> with
>> ANALYZE, including the negative one.
>
> Done.
>
>>
>>
>> +-- partition-wise join can not be applied if the only one of joining
>> table have
>>
>> Correction: ... if only one of the joining tables has ...
>
> Done.
>
>>
>> Please add the patch to the next commitfest so that it's not
>> forgotten.
>
> Done.
> Added to CF: https://commitfest.postgresql.org/16/1426/
>
>>
>> I think we can get rid of the multi-level partition-wise
>> testcase as well. Also, since we are re-attaching existing partition
>> tables as default partitions, we don't need to check the output as
>> well; just plan should be enough.
>
> Ok. Done.
>
> updated test patch attached.
>

The patch looks good to me. I don't think we can reduce it further.
But we need some tests to test PWJ with default partitions. Marking
this as ready for committer.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



In logical replication concurrent update of partition key creates a duplicate record on standby.

2018-02-07 Thread amul sul
Hi,

If an update of the partition key involves tuple movement from one partition
to another partition, then there will be a separate delete made on one
partition and an insert on the other partition.

In logical replication, if an update is performed on the master and the
standby at the same moment, then the replication worker tries to replicate
the delete + insert operation on the standby. While replaying the master's
changes on the standby, for the delete operation the worker will log a
"concurrent update, retrying" message (because the update on the standby has
already deleted the tuple) and move forward to replay the next insert
operation. The standby's update also did the same delete+insert as part of
the update of the partition key, and as a result there will be two records
inserted on the standby.

Here is a quick demonstration:

== NODE 1 ==

postgres=# insert into foo values(1, 'initial insert');
INSERT 0 1
postgres=# select tableoid::regclass, * from foo;
 tableoid | a |       b
----------+---+----------------
 foo1     | 1 | initial insert
(1 row)


== NODE 2 ==

postgres=# select tableoid::regclass, * from foo;
 tableoid | a |       b
----------+---+----------------
 foo1     | 1 | initial insert
(1 row)


-- Now attach GDB to the replication worker & break on heap_lock_tuple() call


== NODE 1 ==

postgres=#  update foo set a=2, b='node1_update' where a=1;
UPDATE 1

< replication worker hits break point on heap_lock_tuple() --->


== NODE 2 ==

postgres=# update foo set a=2, b='node2_update' where a=1;
UPDATE 1

< continue replication worker --->

postgres=# 2018-02-07 13:32:46.307 IST [81613] LOG:  concurrent update, retrying


== NODE 1 ==

postgres=# select tableoid::regclass, * from foo;
 tableoid | a |      b
----------+---+--------------
 foo2     | 2 | node1_update
(1 row)


== NODE 2 ==

postgres=# select tableoid::regclass, * from foo;
 tableoid | a |      b
----------+---+--------------
 foo2     | 2 | node2_update
 foo2     | 2 | node1_update
(2 rows)

=== Script to create partitioned table, publication  & subscription ==
-- node1
CREATE TABLE foo (a int2, b text) PARTITION BY LIST (a);
CREATE TABLE foo1 PARTITION OF foo FOR VALUES IN (1);
CREATE TABLE foo2 PARTITION OF foo FOR VALUES IN (2);

CREATE PUBLICATION update_row_mov_pub FOR ALL TABLES;

ALTER TABLE foo REPLICA IDENTITY FULL;
ALTER TABLE foo1 REPLICA IDENTITY FULL;
ALTER TABLE foo2 REPLICA IDENTITY FULL;

-- node2
CREATE TABLE foo (a int2, b text) PARTITION BY LIST (a);
CREATE TABLE foo1 PARTITION OF foo FOR VALUES IN (1);
CREATE TABLE foo2 PARTITION OF foo FOR VALUES IN (2);

CREATE SUBSCRIPTION update_row_mov_sub CONNECTION 'host=localhost
dbname=postgres' PUBLICATION update_row_mov_pub;

== END ==

Here is a link to the previous discussion:
https://postgr.es/m/caaj_b97w_ggv-k4erxwtpz5sagfj4auymx0khfysvanmwrz...@mail.gmail.com

Regards,
Amul  Sul



Re: Add RANGE with values and exclusions clauses to the Window Functions

2018-02-07 Thread Pantelis Theodosiou
On Sun, Feb 4, 2018 at 6:10 PM, Tom Lane  wrote:

> Oliver Ford  writes:
> > [ 0001-window-frame-v13.patch ]
>
> I've been hacking on this all week (with breaks for release notes) and
> have gotten it into a state that I think is close to committable.
>
> There was quite a lot I didn't like about the patch initially, notably
> that the interaction with operator classes/families was done all wrong.
> The idea is to add one support function per opclass, not jam them all
> into one opclass that breaks every rule for B-tree opclasses.  For one
> reason, with this approach there's no chance of dealing with non-default
> sort orders imposed by non-default opclasses.  (As a concrete example,
> suppose that we have two btree opclasses for complex numbers, one that
> sorts by real part and one that sorts by imaginary part.  You can write
> a well-defined in_range function for each of these, but one of them has
> to increment the real part and the other the imaginary part.)  I whacked
> that around and also wrote the missing documentation for the API spec
> for in_range functions.  The path of least resistance was to dump it
> into the nbtree/README text file, which I'm not that satisfied with;
> probably it should go in the main SGML docs, but I did not find a good
> place to put it.
>
> I also really didn't like the implementation you'd chosen in
> nodeWindowAgg.c to scan the entire partition and build an array of peer
> group lengths.  That risks running OOM with a large partition.  Even if
> the array doesn't lead to OOM, the tuplestore will spill to disk with
> nasty performance consequences.  We should try to ensure that the
> tuplestore needn't get larger than the frame, so that well-written queries
> with narrow frames can execute without spilling to disk.  So I rewrote
> that using an idea that had been speculated about in the original
> comments, but nobody had gotten to yet: add some more read pointers to
> track the frame boundaries, and advance them as needed.  I'm not really
> sure if this ends up as more or fewer row comparisons than the other way,
> but in any case it uses a fixed amount of memory, which is good.
>
> Also, the last patch's reimplementation of WinGetFuncArgInFrame isn't
> right: AFAICS, it results in any "relpos" that would point to a row
> in the exclusion range returning the row just after/before that range,
> which is already wrong if the exclusion range is more than one row,
> plus it doesn't renumber the rows beyond the exclusion.  The behavior
> we want is that the frame rows surviving after exclusion should appear
> consecutively numbered.  (This could be exposed with some tests using
> nth_value.)  I think the attached rewrite gets this right.  Also, punting
> entirely on the set-mark problem for SEEK_TAIL cases doesn't satisfy me,
> for the same reason as above that we don't want the tuplestore to bloat.
> What I did below is to set the mark at the frame start, which at least
> gives an opportunity for efficient queries.
>
> I hacked around on various other things too, for instance the behavior
> for null values in RANGE mode didn't seem to be per spec.
>
> I'm reasonably happy with all the code now, though surely it could use
> another look by someone else.  I've not yet reviewed the docs (other than
> the implementor-oriented details I added), nor have I really looked at the
> test cases.  I do have a couple suggestions on the test cases: for one,
> rather than duplicating the same window definition N times in each query,
> use one WINDOW clause and reference it with "OVER windowname".  Also,
> adding a bunch of columns of different types to a single table seems like
> a messy and not easily extensible way of testing different data types.
> I'd suggest leaving the existing table alone and adding a new test table
> per additional data type you want to test, so that there's an easy
> template for testing future additions of more in_range support.
>
> BTW, something I've not done here but am strongly tempted to do is
> run around and change all the uses of "RANGE value PRECEDING/FOLLOWING"
> terminology to, say, "RANGE offset PRECEDING/FOLLOWING".  "value" is
> just way too generic a term for this situation, making documentation
> confusing, plus you end up contorting sentences to avoid constructions
> like "value of the value".  I'm not wedded to "offset" if somebody's got a
> better word, but let's try to pick something more specific than "value".
> (In the ROWS and GROUPS cases, maybe write "count"?  Not entirely sure
> what to do for text that's trying to address all three cases, though.)
>
>
What about "extent_size" or just "size"? I see the SQL spec refers to
"preceding or following size" in an error message: ("data exception —
invalid preceding or following size in window function")

Best regards
Pantelis Theodosiou