Re: [HACKERS] GSoC 2017: weekly progress reports (week 4) and patch for hash index

2018-03-09 Thread Amit Kapila
On Sat, Mar 10, 2018 at 1:10 PM, Amit Kapila  wrote:
> On Tue, Mar 6, 2018 at 11:26 AM, Amit Kapila  wrote:
>> On Mon, Mar 5, 2018 at 8:58 AM, Thomas Munro
>>  wrote:
>>> On Sun, Mar 4, 2018 at 12:53 AM, Amit Kapila  
>>> wrote:
 Yes, but I think it would be better if we call this once we are sure
 that at least one tuple from the old bucket has been transferred
 (consider if all tuples in the old bucket are dead).  Apart from this,
 I think this patch has missed handling the cases where we scan the
 buckets when the split is in progress.  In such cases, we scan both
 old and new bucket, so I think we need to ensure that we take
 PredicateLock on both the buckets during such scans.
>>>
>>> Hmm.  Yeah.
>>>
>>> So, in _hash_first(), do you think we might just need this?
>>>
>>>   if (H_BUCKET_BEING_POPULATED(opaque))
>>>   {
>>>   ...
>>>   old_blkno = _hash_get_oldblock_from_newbucket(rel, bucket);
>>>   ...
>>>   old_buf = _hash_getbuf(rel, old_blkno, HASH_READ, LH_BUCKET_PAGE);
>>> + PredicateLockPage(rel, BufferGetBlockNumber(old_buf),
>>> scan->xs_snapshot);
>>>   TestForOldSnapshot(scan->xs_snapshot, rel, 
>>> BufferGetPage(old_buf));
>>>
>>> That is, if you begin scanning a 'new' bucket, we remember the old
>>> bucket and go and scan that too, so we'd better predicate-lock both up
>>> front (or I suppose we could do it later when we visit that page, but
>>> here it can be done in a single place).
>>>
>>
>> Yeah, that can work, but I am slightly worried that we might actually
>> never scan the old bucket (say for queries with Limit clause) in which
>> case it might give false positives for insertions in old buckets.
>>
>
> I have changed the patch to address this point by acquiring the
> predicate lock in _hash_readnext, so the lock is taken only when the
> scan actually moves to the old bucket. I have also addressed another
> problem related to the transfer of predicate locks during a split:
> locks are now transferred only when at least one tuple is moved from
> the old bucket to the new one.
>

Added some additional text to the README in the attached patch to explain
the change in the mechanism.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Predicate-Locking-in-hash-index_v7.patch
Description: Binary data


Re: [HACKERS] GSoC 2017: weekly progress reports (week 4) and patch for hash index

2018-03-09 Thread Amit Kapila
On Tue, Mar 6, 2018 at 11:26 AM, Amit Kapila  wrote:
> On Mon, Mar 5, 2018 at 8:58 AM, Thomas Munro
>  wrote:
>> On Sun, Mar 4, 2018 at 12:53 AM, Amit Kapila  wrote:
>>> Yes, but I think it would be better if we call this once we are sure
>>> that at least one tuple from the old bucket has been transferred
>>> (consider if all tuples in the old bucket are dead).  Apart from this,
>>> I think this patch has missed handling the cases where we scan the
>>> buckets when the split is in progress.  In such cases, we scan both
>>> old and new bucket, so I think we need to ensure that we take
>>> PredicateLock on both the buckets during such scans.
>>
>> Hmm.  Yeah.
>>
>> So, in _hash_first(), do you think we might just need this?
>>
>>   if (H_BUCKET_BEING_POPULATED(opaque))
>>   {
>>   ...
>>   old_blkno = _hash_get_oldblock_from_newbucket(rel, bucket);
>>   ...
>>   old_buf = _hash_getbuf(rel, old_blkno, HASH_READ, LH_BUCKET_PAGE);
>> + PredicateLockPage(rel, BufferGetBlockNumber(old_buf),
>> scan->xs_snapshot);
>>   TestForOldSnapshot(scan->xs_snapshot, rel, BufferGetPage(old_buf));
>>
>> That is, if you begin scanning a 'new' bucket, we remember the old
>> bucket and go and scan that too, so we'd better predicate-lock both up
>> front (or I suppose we could do it later when we visit that page, but
>> here it can be done in a single place).
>>
>
> Yeah, that can work, but I am slightly worried that we might actually
> never scan the old bucket (say for queries with Limit clause) in which
> case it might give false positives for insertions in old buckets.
>

I have changed the patch to address this point by acquiring the
predicate lock in _hash_readnext, so the lock is taken only when the
scan actually moves to the old bucket. I have also addressed another
problem related to the transfer of predicate locks during a split:
locks are now transferred only when at least one tuple is moved from
the old bucket to the new one.
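
For illustration, the lazy approach described above amounts to something
like the following sketch. The function name and structure are made up for
this example; only PredicateLockPage(), BufferGetBlockNumber() and
scan->xs_snapshot are taken from the discussion upthread.

/*
 * Instead of predicate-locking the old bucket up front in _hash_first(),
 * take the lock only at the point where the scan actually steps from the
 * new (being-populated) bucket into the old one.
 */
static void
scan_enter_old_bucket(Relation rel, IndexScanDesc scan, Buffer old_buf)
{
	/* Register an SSI predicate lock on the old bucket's primary page */
	PredicateLockPage(rel,
					  BufferGetBlockNumber(old_buf),
					  scan->xs_snapshot);

	/* ... then read tuples from the old bucket as usual ... */
}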

>
>> I'm wondering how to test all this.  I'm thinking about a program that
>> repeatedly creates a hash index and then slowly adds more things to it
>> so that buckets split (maybe using distinct keys carefully crafted to
>> hit the same bucket?), while concurrently hammering it with a ton of
>> scans and then ... somehow checking correctness...
>>
>
> Yeah, that will generate the required errors, but not sure how to
> verify correctness.  One idea could be that when the split is in
> progress, we somehow stop it in-between (say by cancel request) and
> then run targeted selects and inserts on the bucket being scanned and
> bucket being populated.
>

I have verified the bucket split scenario manually as below:

Setup

create table hash_tbl(id int4, p integer);
insert into hash_tbl (id, p) select g, 10 from generate_series(1, 10) g;
Analyze hash_tbl;
create index hash_idx on hash_tbl using hash(p);

Session-1

begin isolation level serializable;
set enable_seqscan=off;
set enable_bitmapscan=off;
set enable_indexonlyscan=on;
select sum(p) from hash_tbl where p=10;
 sum
-----
 100
(1 row)

insert into hash_tbl (id, p) select g, 10 from generate_series(10, 1000) g;
-- Via debugger, stop in _hash_splitbucket at line 1283:
--   { ... LockBuffer(bucket_obuf, BUFFER_LOCK_EXCLUSIVE); ... }

By this time the split of bucket 1 is done but the split flag is not
cleared.  So, the predicate lock from bucket 1 has been transferred to
bucket 3 (the new bucket).

Session-2

begin isolation level serializable;
set enable_seqscan=off;
set enable_bitmapscan=off;
set enable_indexonlyscan=on;
select sum(p) from hash_tbl where p=10;
 sum
-----
 100
(1 row)

Session-1
--
Commit;

Session-2
--
postgres=# insert into hash_tbl (id, p) select g, 10 from
generate_series(51, 60) g;
ERROR:  could not serialize access due to read/write dependencies
among transactions
DETAIL:  Reason code: Canceled on identification as a pivot, during write.
HINT:  The transaction might succeed if retried.

It got a conflict while inserting into the new bucket (bucket number 3).

I think this patch now addresses all the open issues.  Let me know what
you think about it.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Predicate-Locking-in-hash-index_v6.patch
Description: Binary data


Re: csv format for psql

2018-03-09 Thread Fabien COELHO



recordsep in the unaligned mode doesn't play the role of a line ending
because the last line is not finished by recordsep. According to the source
code, this is intended, see print_unaligned_text() in print.c:


Something else comes to mind: CSV allows linefeeds inside fields, and
we don't want to replace these with record separators.


Sure.


So the notion that recordsep can be used to choose line endings
is even less okay than if there was just the last line issue.


I'm not following. ISTM that the escaping mechanism would work in pretty
much all reasonable cases, although it might be possible to shoot oneself
in the foot by manually setting strange values for recordsep, e.g. '"'. I
do not see that as a significant issue. If the user asks for something
stupid, they get something stupid, fine.


--
Fabien.



Re: csv format for psql

2018-03-09 Thread Fabien COELHO


Hello Daniel,

About "-C", I'm fine if it is used and if it is not used, really. psql 
begins to be quite full of one letter options, currently 34 of them, with 
upper & lower cases and numbers included.




The solution to set fieldsep automatically to ',' with
\pset format csv is problematic.


I agree. I'm really advocating that --csv would set fieldsep, but a manual
pset on format would still do what is expected, and only that, i.e. --csv
is NOT a simple shortcut for "-P format=csv".



Same problem on the command line. Options are evaluated left-to-right:

$ psql --csv -F';'
would work as expected, but
$ psql -F';' --csv
would not.


ISTM that having a later option override an earlier one is standard
practice.


I would be fine with that if --csv is documented as "setting format,
fieldsep and recordsep to defaults suited for outputting CSV".


Now this is just a personal opinion.


The "\n" eol style is hardcoded. Should it use "recordsep"? For instance,
https://tools.ietf.org/html/rfc4180 seems to specify CRLF end of lines.
The definition is evolving, eg https://www.w3.org/TR/tabular-data-model/
accepts both "\r" and "\r\n". I do not like using windows eol, but I think
that it should be possible to do it, which is not the case with this
version.


Interesting point. The output stream is opened in text mode so printing
'\n' should generate LF on Unix, CR LF on Windows, and I think CR on MacOS.
I think that's for the best.


I did not know that C's putc/printf/... would silently change the output on
different systems. I'm not sure I like it. It would still mean that one
cannot choose between \r\n and \n at will, which is kind of
disappointing.



recordsep in the unaligned mode doesn't play the role of a line ending
because the last line is not finished by recordsep.


It could be done just for the CSV format? As you point out, there is already
an exception when the separator is '\0'. Note that the last line of a CSV
file may or may not end with \n or \r\n, according to the specs.



I'd suggest that tests should include more types, not just strings. I
would suggest int, float, timestamp, bytea, an array (which uses , as a
separator), json (which uses both " and ,)...


I'll do that, but the printout code is type-agnostic so it's not supposed
to make a difference compared to mere literals.


Sure, but it seems better to actually see that it works properly for
non-trivial cases.



Cases with NULLs are missing though, I'll go add some too.


Indeed.

--
Fabien.



Re: constraint exclusion and nulls in IN (..) clause

2018-03-09 Thread Tom Lane
I wrote:
> I think it'd make more sense to see about incorporating that idea in
> predicate_implied_by_simple_clause/predicate_refuted_by_simple_clause.

After further thought, it seems like the place to deal with this is
really operator_predicate_proof(), as in the attached draft patch
against HEAD.  This passes the smell test for me, in the sense that
it's an arguably correct and general extension of the proof rules,
but it could use more testing.

TBH, the change in the existing regression test case in inherit.sql
makes me itch.  We've got

create table list_parted (
a   varchar
) partition by list (a);
...
create table part_null_xy partition of list_parted for values in (null, 'xy');
...
explain (costs off) select * from list_parted where a = 'ab' or a in (null, 'cd');

Now, the fact that "null" is included in this query's IN clause is a
complete no-op, because the IN is using a strict equality operator.
So it's nice that the planner can see that and realize that there's
no point in scanning part_null_xy ... but this means that the syntax
that's been chosen for list partitioning is seriously misleading.
"in (null, 'xy')" in the CREATE TABLE command has nothing to do with
the semantics of that identical clause in any other context, and indeed
it seems chosen in a way to confuse even (or especially?) seasoned
experts.

I suppose it's too late to do anything about that now, but it sure
seems like NULL should've been handled some other way.

regards, tom lane

diff --git a/src/backend/optimizer/util/predtest.c b/src/backend/optimizer/util/predtest.c
index fb963d0..446207d 100644
*** a/src/backend/optimizer/util/predtest.c
--- b/src/backend/optimizer/util/predtest.c
*** static Node *extract_not_arg(Node *claus
*** 100,106 
  static Node *extract_strong_not_arg(Node *clause);
  static bool clause_is_strict_for(Node *clause, Node *subexpr);
  static bool operator_predicate_proof(Expr *predicate, Node *clause,
! 		 bool refute_it);
  static bool operator_same_subexprs_proof(Oid pred_op, Oid clause_op,
  			 bool refute_it);
  static bool operator_same_subexprs_lookup(Oid pred_op, Oid clause_op,
--- 100,106 
  static Node *extract_strong_not_arg(Node *clause);
  static bool clause_is_strict_for(Node *clause, Node *subexpr);
  static bool operator_predicate_proof(Expr *predicate, Node *clause,
! 		 bool refute_it, bool weak);
  static bool operator_same_subexprs_proof(Oid pred_op, Oid clause_op,
  			 bool refute_it);
  static bool operator_same_subexprs_lookup(Oid pred_op, Oid clause_op,
*** predicate_implied_by_simple_clause(Expr 
*** 1137,1143 
  	}
  
  	/* Else try operator-related knowledge */
! 	return operator_predicate_proof(predicate, clause, false);
  }
  
  /*--
--- 1137,1143 
  	}
  
  	/* Else try operator-related knowledge */
! 	return operator_predicate_proof(predicate, clause, false, weak);
  }
  
  /*--
*** predicate_refuted_by_simple_clause(Expr 
*** 1232,1238 
  	}
  
  	/* Else try operator-related knowledge */
! 	return operator_predicate_proof(predicate, clause, true);
  }
  
  
--- 1232,1238 
  	}
  
  	/* Else try operator-related knowledge */
! 	return operator_predicate_proof(predicate, clause, true, weak);
  }
  
  
*** static const StrategyNumber BT_refute_ta
*** 1498,1506 
   * values, since then the operators aren't being given identical inputs.  But
   * we only support that for btree operators, for which we can assume that all
   * non-null inputs result in non-null outputs, so that it doesn't matter which
!  * two non-null constants we consider.  Currently the code below just reports
!  * "proof failed" if either constant is NULL, but in some cases we could be
!  * smarter (and that likely would require checking strong vs. weak proofs).
   *
   * We can make proofs involving several expression forms (here "foo" and "bar"
   * represent subexpressions that are identical according to equal()):
--- 1498,1505 
   * values, since then the operators aren't being given identical inputs.  But
   * we only support that for btree operators, for which we can assume that all
   * non-null inputs result in non-null outputs, so that it doesn't matter which
!  * two non-null constants we consider.  If either constant is NULL, we have
!  * to think harder, but sometimes the proof still works, as explained below.
   *
   * We can make proofs involving several expression forms (here "foo" and "bar"
   * represent subexpressions that are identical according to equal()):
*** static const StrategyNumber BT_refute_ta
*** 1528,1534 
   * and we dare not make deductions with those.
   */
  static bool
! operator_predicate_proof(Expr *predicate, Node *clause, bool refute_it)
  {
  	OpExpr	   *pred_opexpr,
  			   *clause_opexpr;
--- 1527,1534 
   * and we dare not make deductions with those.
   */
  static bool
! operator_predicate_proof

Re: csv format for psql

2018-03-09 Thread David G. Johnston
On Fri, Mar 9, 2018 at 3:18 PM, Daniel Verite 
wrote:

> I think that the point of recordsep in unaligned mode is you can set it
> to something that never appears in the data, especially when embedded
> newlines might be in the data. In CSV this is solved differently so
> we don't need it.


I'd rather argue it from the standpoint that \copy doesn't use recordsep
or fieldsep and thus neither should --csv, which is arguably a convenience
invocation of \copy that pipes to psql's stdout (and overcomes \copy's
single-line limitation - which I think still exists... - and its inability
to use variables - does it?...).  COPY doesn't allow changing the record
separator, and the newline output is system-dependent.  I can accept the
same limitation with this feature.

I suppose the question is how many "COPY" options do we want to expose on
the command line, and how does it look?

I'll put a -1 on having a short option (-C or otherwise); "that is the way
it's always been done" doesn't work for me here. By way of example, "-a and
-A" is ill-advised: --echo-all does not seem important enough to warrant a
short option (especially not a lower-case one), and so the more useful
unaligned mode is forced into the secondary capital-A position.

David J.


Re: [HACKERS] Lazy hash table for XidInMVCCSnapshot (helps Zipfian a bit)

2018-03-09 Thread Yura Sokolov
08.03.2018 03:42, Tomas Vondra wrote:
> On 03/06/2018 06:23 AM, Yura Sokolov wrote:
>> 05.03.2018 18:00, Tom Lane wrote:
>>> Tomas Vondra  writes:
 Snapshots are static (we don't really add new XIDs into existing ones,
 right?), so why don't we simply sort the XIDs and then use bsearch to
 lookup values? That should fix the linear search, without need for any
 local hash table.
>>>
>>> +1 for looking into that, since it would avoid adding any complication
>>> to snapshot copying.  In principle, with enough XIDs in the snap, an
>>> O(1) hash probe would be quicker than an O(log N) bsearch ... but I'm
>>> dubious that we are often in the range where that would matter.
>>> We do need to worry about the cost of snapshot copying, too.
>>
>> Sorting is the first thing I tried. Unfortunately, sorting
>> itself eats too much CPU. Filling a hash table is much faster.
>>
> 
> I've been interested in the sort-based approach, so I've spent a bit of
> time hacking on it (patch attached). It's much less invasive compared to
> the hash-table, but Yura is right the hashtable gives better results.
> 
> I've tried to measure the benefits using the same script I shared on
> Tuesday, but I kept getting really strange numbers. In fact, I've been
> unable to even reproduce the results I shared back then. And after a bit
> of head-scratching I think the script is useless - it can't possibly
> generate snapshots with many XIDs because all the update threads sleep
> for exactly the same time. And first one to sleep is first one to wake
> up and commit, so most of the other XIDs are above xmax (and thus not
> included in the snapshot). I have no idea why I got the numbers :-/
> 
> But with this insight, I quickly modified the script to also consume
> XIDs by another thread (which simply calls txid_current). With that I'm
> getting snapshots with as many XIDs as there are UPDATE clients (this
> time I actually checked that using gdb).
> 
> And for a 60-second run the tps results look like this (see the attached
> chart, showing the same data):
> 
> writers   master   hash   sort
> -------------------------------
>      16     1068   1057   1070
>      32     1005    995   1033
>      64     1064   1042   1117
>     128     1058   1021   1051
>     256      977   1004    928
>     512      773    935    808
>     768      576    815    670
>    1000      413    752    573
> 
> The sort certainly does improve performance compared to master, but it's
> only about half of the hashtable improvement.
> 
> I don't know how much this simple workload resembles the YCSB benchmark,
> but I seem to recall it's touching individual rows. In which case it's
> likely worse due to the pg_qsort being more expensive than building the
> hash table.
> 
> So I agree with Yura the sort is not a viable alternative to the hash
> table, in this case.
> 
>> Can I rely on snapshots being static? Maybe then there is no need
>> for a separate raw representation and hash table. I may try to fill the
>> hash table directly in GetSnapshotData instead of filling it lazily.
>> Though it will increase the time spent under the lock, so it is
>> debatable and should be carefully measured.
>>
> 
> Yes, I think you can rely on snapshots not being modified later. That's
> pretty much the definition of a snapshot.
> 
> You may do that in GetSnapshotData, but you certainly can't do that in
> the for loop there. Not only do we not want to add more work there, but
> you don't know the number of XIDs in that loop. And we don't want to
> build hash tables for small numbers of XIDs.
> 
> One reason against building the hash table in GetSnapshotData is that
> we'd build it even when the snapshot is never queried. Or when it is
> queried, but we only need to check xmin/xmax.

Thank you for the analysis, Tomas.

Stephen is right about the bug in snapmgr.c.
The attached version fixes that bug, and also simplifies XidInXip a bit.

With regards,
Sokolov Yura.
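
To illustrate the approach in a self-contained way, here is a standalone
sketch of the "lazy hash for snapshot XIDs" idea (not the actual tqual.c
code): keep the xip array untouched, fall back to a linear scan for small
arrays, and build an open-addressing hash table on the first lookup
otherwise. The threshold of 60 mirrors the commit message below; the type
names and constants are simplified assumptions, and xid 0 is taken to mean
"invalid".

#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t TransactionId;		/* simplified; 0 means "invalid" */

#define XIP_HASH_THRESHOLD 60

typedef struct XipLookup
{
	const TransactionId *xip;	/* borrowed pointer to the snapshot array */
	uint32_t	nxip;
	uint32_t	hashsize;		/* power of two; 0 until the hash is built */
	TransactionId *slots;		/* 0 marks an empty slot */
} XipLookup;

static void
build_xip_hash(XipLookup *lk)
{
	uint32_t	size = 4;

	while (size < lk->nxip * 2)		/* keep load factor below 50% */
		size <<= 1;
	lk->hashsize = size;
	lk->slots = calloc(size, sizeof(TransactionId));
	for (uint32_t i = 0; i < lk->nxip; i++)
	{
		uint32_t	pos = (lk->xip[i] * 2654435761u) & (size - 1);

		while (lk->slots[pos] != 0)	/* linear probing */
			pos = (pos + 1) & (size - 1);
		lk->slots[pos] = lk->xip[i];
	}
}

static bool
xid_in_xip(XipLookup *lk, TransactionId xid)
{
	if (lk->nxip < XIP_HASH_THRESHOLD)
	{
		/* small snapshot: a plain linear scan is cheapest */
		for (uint32_t i = 0; i < lk->nxip; i++)
			if (lk->xip[i] == xid)
				return true;
		return false;
	}

	if (lk->hashsize == 0)
		build_xip_hash(lk);			/* built lazily, on the first lookup */

	for (uint32_t pos = (xid * 2654435761u) & (lk->hashsize - 1);
		 lk->slots[pos] != 0;
		 pos = (pos + 1) & (lk->hashsize - 1))
	{
		if (lk->slots[pos] == xid)
			return true;
	}
	return false;
}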
From 8484ae3f2c2f8af20ae8ce2f6d7960b6519e65c0 Mon Sep 17 00:00:00 2001
From: Sokolov Yura aka funny_falcon 
Date: Fri, 9 Mar 2018 22:49:01 +0300
Subject: [PATCH] Make hash table for xip in XidInMVCCSnapshot

When a lot of concurrent transactions attempt to update a single
row, the linear scan through the running-transactions list in
XidInMVCCSnapshot becomes a noticeable bottleneck.

With this change, a hash table is built on the first search for an
xid in the snapshot->xip and snapshot->subxip arrays.

If the size of the array is smaller than 60, a linear scan is still
used, because there is not much benefit from building a hash table
then (at least on Intel Xeon).
---
 src/backend/storage/ipc/procarray.c | 67 ++--
 src/backend/utils/time/snapmgr.c| 25 +++
 src/backend/utils/time/tqual.c  | 88 -
 src/include/utils/snapshot.h| 11 +
 4 files changed, 148 insertions(+), 43 deletions(-)

diff --

Bogus use of canonicalize_qual

2018-03-09 Thread Tom Lane
Whilst fooling about with predtest.c, I noticed a rather embarrassing
error.  Consider the following, rather silly, CHECK constraint:

regression=# create table pp (f1 int);
CREATE TABLE
regression=# create table cc (check (f1 = 1 or f1 = null)) inherits(pp);
CREATE TABLE

Because "f1 = null" reduces to constant NULL, this check constraint is
actually a no-op, because it can never evaluate to FALSE:

regression=# insert into cc values(1);
INSERT 0 1
regression=# insert into cc values(2);
INSERT 0 1
regression=# select * from pp;
 f1 
----
  1
  2
(2 rows)

But:

regression=# select * from pp where f1 = 2;
 f1 
----
(0 rows)

Huh?  The reason is that the planner is deciding that it can exclude
cc from the plan:

regression=# explain select * from pp where f1 = 2;
   QUERY PLAN   
---------------------------------------------
 Append  (cost=0.00..0.01 rows=1 width=4)
   ->  Seq Scan on pp  (cost=0.00..0.00 rows=1 width=4)
 Filter: (f1 = 2)
(3 rows)

and that ultimately traces to the part of canonicalize_qual() that throws
away constant-NULL subexpressions of AND/OR clauses.  It's clearly
documented in canonicalize_qual() that it should only be applied to actual
WHERE clauses, where that's a valid optimization.  But there is lots of
code that didn't get that memo and is calling it on CHECK constraints,
allowing the NULL to be thrown away when it should not be.  The particular
culprit here, I think, is get_relation_constraints(), but there's a lot of
similar code elsewhere.

So, what to do?  We have a few choices, none ideal:

1. Just remove that optimization from canonicalize_qual().  This would
result in some inefficiency in badly-written queries, but it might be
acceptable.

2. Run around and remove all the bogus canonicalize_qual() calls.  The
main demerit here is that this'd mean CHECK constraints also don't get the
other effect of canonicalize_qual(), which is:

 * The following code attempts to apply the inverse OR distributive law:
 *  ((A AND B) OR (A AND C))  =>  (A AND (B OR C))
 * That is, locate OR clauses in which every subclause contains an
 * identical term, and pull out the duplicated terms.

This'd possibly make it harder to match WHERE clauses, which do get that
processing, to CHECK clauses which wouldn't.  I think that possibly
predtest.c is smart enough to make proofs even in the face of that, but
I'm not sure.  Another concern is whether external code might still
contain incorrect canonicalize_qual() calls, or whether we might not
accidentally put some back in future.

3. Change canonicalize_qual() to take an additional parameter indicating
whether it's working on a true qual or not.  This might be the best fix
for HEAD, but it doesn't seem very back-patchable.

4. Split canonicalize_qual() into two functions, one for the inverse OR
business and one for NULL removal.  This would result in an additional
tree traversal (and reconstruction) for every WHERE clause, slowing
planning somewhat.

5. Remove NULL-simplification from canonicalize_qual(), but put it
back somewhere later in the planner where we're traversing qual trees
anyway.  I think that it might work to charge the RestrictInfo-building
code with this, though I'm not sure about it.  It seems kind of high
risk for a back-patchable change in any case.
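
For concreteness, option 3 above could take roughly the following shape (a
sketch only; the parameter name and call sites are illustrative, not a
worked-out patch):

/*
 * Tell canonicalize_qual() whether it is looking at a true qual (WHERE or
 * JOIN condition) or a CHECK-style expression.  Only in the former case is
 * it valid to discard constant-NULL subexpressions of AND/OR: a qual treats
 * NULL as "don't return the row", while a CHECK constraint treats NULL as
 * "constraint satisfied".
 */
Expr *canonicalize_qual(Expr *qual, bool is_check);

/* existing callers that really have WHERE-style quals would pass false */
whereQual = canonicalize_qual(whereQual, false);

/* callers handing in CHECK constraints would pass true instead */
checkExpr = canonicalize_qual(checkExpr, true);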

Thoughts?

regards, tom lane



Re: PATCH: Unlogged tables re-initialization tests

2018-03-09 Thread Peter Eisentraut
This seems like a useful test.

On 3/5/18 12:35, David Steele wrote:
> +mkdir($tablespaceDir)
> + or die "unable to mkdir \"$tablespaceDir\"";

Use BAIL_OUT instead of die in tests.

> + $ts1UnloggedPath = $node->safe_psql('postgres',
> + q{select pg_relation_filepath('ts1_unlogged')});

strange indentation

> +# Write forks to test that they are removed during recovery
> +$node->command_ok(['touch', "$pgdata/${baseUnloggedPath}_vm"],
> + 'touch vm fork in base');
> +$node->command_ok(['touch', "$pgdata/${baseUnloggedPath}_fsm"],
> + 'touch fsm fork in base');

These are not tests, just some prep work.  So they should not use
command_ok.

It would probably also be better to avoid the Unix-specific touch
command and instead use Perl code to open and write to the files.

> +# Check unlogged table in base
> +ok(-f "$pgdata/${baseUnloggedPath}_init", 'init fork in base');
> +ok(-f "$pgdata/$baseUnloggedPath", 'main fork in base');
> +ok(!-f "$pgdata/${baseUnloggedPath}_vm", 'vm fork not in base');
> +ok(!-f "$pgdata/${baseUnloggedPath}_fsm", 'fsm fork not in base');

These test names could be a bit more verbose and distinct, for example,
'main fork was recreated after restart'.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: csv format for psql

2018-03-09 Thread Daniel Verite
David G. Johnston wrote:

> I'm not following - if recordsep is not something that would be
> interpreted as a newline then the file we output would have no
> structural newlines.
> It might have data newlines but those would be quoted.

They would be, but I don't quite understand the point
in your first sentence.
All I'm saying is that setting recordsep to '\r\n' does
not achieve the goal of obtaining CRLF line endings.

Here's an example with the unaligned mode, linux host:

(psql -At -P recordsep=$'\r\n' << EOF
  select E'A\nB' union E'C\nD';
EOF
) | hexdump -C

The result is:
  41 0a 42 0d 0a 43 0a 44  0a |A.B..C.D.|

The expectation of CRLF line endings is that every LF would be
preceded by CR, but here we get that only for 1 LF out of 4. That's
useless.

It's not a bug. We asked for a CRLF to separate our two
records and that's exactly what we got, no more no less.

In CSV output, the difference would be that there would be a double quote
character before A and after B, and before C and after D, but otherwise
it would be the same mix of LF and CRLF.

I think that the point of recordsep in unaligned mode is you can set it
to something that never appears in the data, especially when embedded
newlines might be in the data. In CSV this is solved differently so
we don't need it.

Best regards,
-- 
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite



Re: PQHost() undefined behavior if connecting string contains both host and hostaddr types

2018-03-09 Thread Peter Eisentraut
On 1/13/18 22:19, Haribabu Kommi wrote:
> While working on [1], we found an inconsistency in PQhost() behavior
> if the connection string that is passed to connect to the server contains
> multiple hosts with both host and hostaddr types. For example,
> 
> host=host1,host2 hostaddr=127.0.0.1,127.0.0.1 port=5434,5432
> 
> As hostaddr is given preference when both host and hostaddr are
> specified, the connection type for both addresses of the above
> conninfo is CHT_HOST_ADDRESS. So PQhost() returns the
> conn->pghost value, i.e. "host1,host2", instead of the actual host that
> is connected.
> 
> Instead of checking the connection type while returning the host
> details, it should check whether the host is NULL or not. With this
> change it returns the expected value for all the connection types.

I agree that something is wrong here.

It seems, however, that PQhost() has always been broken for hostaddr
use.  In 9.6 (before the multiple-hosts stuff was introduced), when
connecting to "hostaddr=127.0.0.1", PQhost() returns "/tmp".  Urgh.

I think we should decide what PQhost() is supposed to mean when hostaddr
is in use, and then make a fix for that consistently across all versions.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [PATCH] Verify Checksums during Basebackups

2018-03-09 Thread Michael Banck
Hi,

Am Mittwoch, den 28.02.2018, 19:08 +0100 schrieb Michael Banck:
> some installations have data which is only rarely read, and if they are
> so large that dumps are not routinely taken, data corruption would only
> be detected with a large delay even with checksums enabled.
> 
> The attached small patch verifies checksums (in case they are enabled)
> during a basebackup. The rationale is that we are reading every block in
> this case anyway, so this is a good opportunity to check them as well.
> Other and complementary ways of checking the checksums are possible of
> course, like the offline checking tool that Magnus just submitted.

I've attached a second version of this patch. Changes are:

1. I've included some code from Magnus' patch, notably the way the
segment numbers are determined and the skipfile() function, along with
the array of files to skip.

2. I am now checking the LSN in the pageheader and compare it against
the LSN of the basebackup start, so that no checksums are verified for
pages changed after basebackup start. I am not sure whether this
addresses all concerns by Stephen and David, as I am not re-reading the
page on a checksum mismatch as they are doing in pgbackrest.

3. pg_basebackup now exits with 1 if a checksum mismatch occurred, but it
keeps the data around.

4. I added an Assert() that the TAR_SEND_SIZE is a multiple of BLCKSZ.
AFAICT we support block sizes of 1, 2, 4, 8, 16 and 32 KB, while
TAR_SEND_SIZE is set to 32 KB, so this should be fine unless somebody
mucks around with BLCKSZ manually, in which case the Assert should fire.
I compiled --with-blocksize=32 and checked that this still works as
intended.

5. I also check that the buffer we read is a multiple of BLCKSZ. If that
is not the case I emit a WARNING that the checksum cannot be checked and
pg_basebackup will exit with 1 as well.
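
For readers who want to picture the per-block check described in points 2,
4 and 5, here is a rough sketch (not the patch itself; buffer handling, the
all-zero-page case and error reporting are simplified, and the helper name
is made up):

/*
 * Verify checksums of all pages in a read buffer while streaming a relation
 * file.  "startptr" is the LSN at which the base backup started; pages with
 * a newer LSN are skipped, since they may legitimately have been rewritten
 * after the backup began.  The buffer length must be a multiple of BLCKSZ.
 */
static bool
verify_buffer_checksums(char *buf, int len, BlockNumber first_blkno,
						XLogRecPtr startptr)
{
	bool		ok = true;

	Assert(len % BLCKSZ == 0);

	for (int off = 0; off < len; off += BLCKSZ)
	{
		Page		page = (Page) (buf + off);
		BlockNumber blkno = first_blkno + off / BLCKSZ;
		uint16		expected;

		/* Page changed after backup start: cannot be validated here */
		if (PageGetLSN(page) > startptr)
			continue;

		expected = pg_checksum_page((char *) page, blkno);
		if (expected != ((PageHeader) page)->pd_checksum)
		{
			ereport(WARNING,
					(errmsg("checksum mismatch in block %u", blkno)));
			ok = false;
		}
	}

	return ok;
}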

This is what it looks like right now from pg_basebackup's POV:

postgres@fock:~$ initdb -k --pgdata=data1 1> /dev/null

WARNING: enabling "trust" authentication for local connections
You can change this by editing pg_hba.conf or using the option -A, or
--auth-local and --auth-host, the next time you run initdb.
postgres@fock:~$ pg_ctl --pgdata=data1 --log=pg1.log start
waiting for server to start done
server started
postgres@fock:~$ psql -h /tmp -c "SELECT pg_relation_filepath('pg_class')"
 pg_relation_filepath 
--
 base/12367/1259
(1 row)

postgres@fock:~$ echo -n "Bang!" | dd conv=notrunc oflag=seek_bytes seek=4000 
bs=9 count=1 of=data1/base/12367/1259
0+1 records in
0+1 records out
5 bytes copied, 3.7487e-05 s, 133 kB/s
postgres@fock:~$ pg_basebackup --pgdata=data2 -h /tmp 
WARNING:  checksum mismatch in file "./base/12367/1259", segment 0, block 0: 
expected CC05, found CA4D
pg_basebackup: checksum error occured
postgres@fock:~$ echo $?
1
postgres@fock:~$ ls data2
backup_label  pg_dynshmempg_multixact  pg_snapshots  pg_tblspcpg_xact
base  pg_hba.confpg_notify pg_stat   pg_twophase  
postgresql.auto.conf
globalpg_ident.conf  pg_replslot   pg_stat_tmp   PG_VERSION   
postgresql.conf
pg_commit_ts  pg_logical pg_serial pg_subtrans   pg_wal
postgres@fock:~$

Possible open questions:

1. I have so far not changed the replication protocol to make verifying
checksums optional. I can look at that next if the consensus is that we
need such an option (and cannot just check every time)?

2. The isolation tester test uses dd (similar to the above); is that
allowed, or do I have to come up with some internal Perl thing that also
works on Windows?

3. I am using basename() to get the filename. I haven't seen that used a
lot in the codebase (nor did I find an obvious internal implementation);
is that fine?


Cheers,

Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.ba...@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index 185f32a5f9..8e84c7c38e 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -16,6 +16,7 @@
 #include 
 #include 
 
+#include "access/xlog.h"
 #include "access/xlog_internal.h"	/* for pg_start/stop_backup */
 #include "catalog/catalog.h"
 #include "catalog/pg_type.h"
@@ -30,6 +31,8 @@
 #include "replication/basebackup.h"
 #include "replication/walsender.h"
 #include "replication/walsender_private.h"
+#include "storage/bufpage.h"
+#include "storage/checksum.h"
 #include "storage/dsm_impl.h"
 #include "storage/fd.h"
 #include "storage/ipc.h"
@@ -75,6 +78,9 @@ static bool backup_started_in_recovery = false;
 /* Relative path of temporary statistics directory */
 static char *statrelpath = NULL;
 
+/* The starting XLOG position of the base backup */
+static XLogRecPtr startptr;

Re: JIT compiling with LLVM v11

2018-03-09 Thread Andres Freund
On 2018-03-09 15:42:24 -0500, Peter Eisentraut wrote:
> For jit_optimize_above_cost, in my testing, any query where JIT paid
> off was even faster with optimizing.  So right now I don't see a need to
> make this a separate setting.  Maybe just make it an on/off setting for
> experimenting.

I'd prefer to be more defensive here. The time needed for JITing without
optimization is roughly linear, whereas optimization is definitely not
linear with input size.


> For inlining, I haven't been able to get a clear picture.  It's a bit
> faster perhaps, but the optimizing dominates it.  I don't have a clear
> mental model for what kind of returns to expect from this.

Yea, you need long-running queries to benefit significantly. There's a
*lot* more potential once some structural issues with the expression
format (both with and without JIT) are fixed.


> What I'd quite like is if EXPLAIN or EXPLAIN ANALYZE showed something
> about what kind of JIT processing was done, if any, to help with this
> kind of testing.

Yea, I like that. I think we can only show that when timing is on,
because otherwise the tests will not be stable depending on --with-jit
being specified or not.

So I'm thinking of displaying it similarly to the "Planning time" piece,
i.e. depending on es->summary being enabled. It'd be good to display the
inline/optimize/emit times too. I think we can just store it in the
JitContext. But the inline/optimize/emission times will only be
meaningful when the query is actually executed, I don't see a way around
that...
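
As a purely illustrative sketch, such a summary line could be emitted along
these lines; the JitContext field names used here are placeholders, not an
existing API:

/* In the EXPLAIN output path, only when a summary is requested */
if (es->summary && es->format == EXPLAIN_FORMAT_TEXT && jit_context != NULL)
	appendStringInfo(es->str,
					 "JIT: functions %d, inlining %.3f ms, "
					 "optimization %.3f ms, emission %.3f ms\n",
					 jit_context->created_functions,
					 1000.0 * jit_context->inlining_seconds,
					 1000.0 * jit_context->optimization_seconds,
					 1000.0 * jit_context->emission_seconds);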

Greetings,

Andres Freund



Re: FOR EACH ROW triggers on partitioned tables

2018-03-09 Thread Alvaro Herrera
... and this little addendum makes pg_dump work correctly.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 867bbe8f1e..ca0a66753e 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1363,7 +1363,7 @@ index_constraint_create(Relation heapRelation,
 
(void) CreateTrigger(trigger, NULL, 
RelationGetRelid(heapRelation),
 InvalidOid, conOid, 
indexRelationId, InvalidOid,
-InvalidOid, true);
+InvalidOid, true, 
false);
}
 
/*
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 4303c5a131..f5fc0938a6 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -8467,7 +8467,7 @@ CreateFKCheckTrigger(Oid myRelOid, Oid refRelOid, 
Constraint *fkconstraint,
fk_trigger->args = NIL;
 
(void) CreateTrigger(fk_trigger, NULL, myRelOid, refRelOid, 
constraintOid,
-indexOid, InvalidOid, 
InvalidOid, true);
+indexOid, InvalidOid, 
InvalidOid, true, false);
 
/* Make changes-so-far visible */
CommandCounterIncrement();
@@ -8541,7 +8541,7 @@ createForeignKeyTriggers(Relation rel, Oid refRelOid, 
Constraint *fkconstraint,
fk_trigger->args = NIL;
 
(void) CreateTrigger(fk_trigger, NULL, refRelOid, myRelOid, 
constraintOid,
-indexOid, InvalidOid, 
InvalidOid, true);
+indexOid, InvalidOid, 
InvalidOid, true, false);
 
/* Make changes-so-far visible */
CommandCounterIncrement();
@@ -8596,7 +8596,7 @@ createForeignKeyTriggers(Relation rel, Oid refRelOid, 
Constraint *fkconstraint,
fk_trigger->args = NIL;
 
(void) CreateTrigger(fk_trigger, NULL, refRelOid, myRelOid, 
constraintOid,
-indexOid, InvalidOid, 
InvalidOid, true);
+indexOid, InvalidOid, 
InvalidOid, true, false);
 
/* Make changes-so-far visible */
CommandCounterIncrement();
@@ -14324,7 +14324,7 @@ CloneRowTriggersToPartition(Oid parentId, Oid 
partitionId)
 
CreateTrigger(trigStmt, NULL, partitionId,
  InvalidOid, InvalidOid, InvalidOid,
- trigForm->tgfoid, 
HeapTupleGetOid(tuple), false);
+ trigForm->tgfoid, 
HeapTupleGetOid(tuple), false, true);
pfree(trigStmt);
}
 
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index c4f63c8b90..6a857df566 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -151,7 +151,8 @@ static bool before_stmt_triggers_fired(Oid relid, CmdType 
cmdType);
 ObjectAddress
 CreateTrigger(CreateTrigStmt *stmt, const char *queryString,
  Oid relOid, Oid refRelOid, Oid constraintOid, Oid 
indexOid,
- Oid funcoid, Oid parentTriggerOid, bool isInternal)
+ Oid funcoid, Oid parentTriggerOid, bool isInternal,
+ bool in_partition)
 {
int16   tgtype;
int ncolumns;
@@ -780,6 +781,11 @@ CreateTrigger(CreateTrigStmt *stmt, const char 
*queryString,
 
/*
 * Build the new pg_trigger tuple.
+*
+* When we're creating a trigger in a partition, we mark it as internal,
+* even though we don't do the isInternal magic in this function.  This
+* makes the triggers in partitions identical to the ones in the
+* partitioned tables, except that they are marked internal.
 */
memset(nulls, false, sizeof(nulls));
 
@@ -789,7 +795,7 @@ CreateTrigger(CreateTrigStmt *stmt, const char *queryString,
values[Anum_pg_trigger_tgfoid - 1] = ObjectIdGetDatum(funcoid);
values[Anum_pg_trigger_tgtype - 1] = Int16GetDatum(tgtype);
values[Anum_pg_trigger_tgenabled - 1] = 
CharGetDatum(TRIGGER_FIRES_ON_ORIGIN);
-   values[Anum_pg_trigger_tgisinternal - 1] = BoolGetDatum(isInternal);
+   values[Anum_pg_trigger_tgisinternal - 1] = BoolGetDatum(isInternal || 
in_partition);
values[Anum_pg_trigger_tgconstrrelid - 1] = 
ObjectIdGetDatum(constrrelid);
values[Anum_pg_trigger_tgconstrindid - 1] = ObjectIdGetDatum(indexOid);
values[Anum_pg_trigger_tgconstraint - 1] = 
ObjectIdGetDatum(constraintOid);
@@ -1089,7 +1095,7 @@ CreateTrigger(CreateTrigStmt *stmt, const char 
*queryString,

Re: using worker_spi as pattern

2018-03-09 Thread Jeremy Finzel
On Fri, Mar 9, 2018 at 12:34 AM, Michael Paquier 
wrote:

> On Thu, Mar 08, 2018 at 11:04:20PM -0600, Jeremy Finzel wrote:
> > Since you mention, can anyone elaborate further on the memory leak danger
> > here?
> >
> > Line 193 in src/test/modules/worker_spi/worker_spi.c read:
> > # Note some memory might be leaked here.
> >
> > Is this any reason *not *to use this pattern in production?
>
> quote_identifier may palloc its result, so the pstrdup calls at the top
> that save "schema" and "table" refer to pointers which may perhaps get
> lost.  Those are just a couple of bytes, so the code complication is not
> worth the cleanup IMO.
> --
> Michael
>

Makes sense, thank you.
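
For context, the pattern being discussed looks roughly like this (a
simplified sketch, not the exact worker_spi code; schema_name is a
hypothetical variable):

StringInfoData buf;

initStringInfo(&buf);

/*
 * quote_identifier() returns its argument unchanged when no quoting is
 * needed, but palloc()s a new string otherwise.  In this sketch that
 * palloc'd result is only referenced while building the query and is never
 * pfree'd, hence "some memory might be leaked here".  Since it is a
 * handful of bytes in a memory context that goes away anyway, extra
 * cleanup code isn't worth it.
 */
appendStringInfo(&buf,
				 "CREATE SCHEMA IF NOT EXISTS %s",
				 quote_identifier(schema_name));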


Re: JIT compiling with LLVM v11

2018-03-09 Thread Andres Freund
On 2018-03-09 15:28:19 -0500, Peter Eisentraut wrote:
> On 3/6/18 15:16, Andres Freund wrote:
> > 2) Don't load the JIT provider until fully needed. Right now
> >jit_compile_expr() will load the jit provider even if not really
> >needed. We should probably move the first two return blocks in
> >llvm_compile_expr() into jit_compile_expr(), to avoid that.
> 
> I see that you have implemented that, but it doesn't seem to have helped
> with my make installcheck times.

What's the exact comparison you're looking at?

I think that's largely because unnecessary trivial queries get JITed and
optimized, since the stats are completely off.

Greetings,

Andres Freund



Re: JIT compiling with LLVM v11

2018-03-09 Thread Peter Eisentraut
On 3/6/18 10:29, Peter Eisentraut wrote:
> I think taking the total cost as the triggering threshold is probably
> good enough for a start.  The cost modeling can be refined over time.

I looked into this a bit more.

The default of jit_above_cost = 50 seems pretty good.  I constructed
a query that cost about 45 where the run time with and without JIT
were about even.  This is obviously very limited testing, but it's a
good start.

For jit_optimize_above_cost, in my testing, any query where JIT paid
off was even faster with optimizing.  So right now I don't see a need to
make this a separate setting.  Maybe just make it an on/off setting for
experimenting.

For inlining, I haven't been able to get a clear picture.  It's a bit
faster perhaps, but the optimizing dominates it.  I don't have a clear
mental model for what kind of returns to expect from this.

What I'd quite like is if EXPLAIN or EXPLAIN ANALYZE showed something
about what kind of JIT processing was done, if any, to help with this
kind of testing.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: FOR EACH ROW triggers on partitioned tables

2018-03-09 Thread Alvaro Herrera
Alvaro Herrera wrote:

> One thing I'd like to add before claiming this committable (backend-
> side) is enabling constraint triggers.  AFAICT that requires a bit of
> additional logic, but it shouldn't be too terrible.  This would allow
> for deferrable unique constraints, for example.

v7 supports constraint triggers.  I added an example using a UNIQUE
DEFERRABLE constraint, and another one using plain CREATE CONSTRAINT TRIGGER.
It's neat to see that the WHEN clause is executed at the time of the
operation, and the trigger action is deferred (or not) till COMMIT time.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>From a34e786924b54d94dbf28c182aed27d0e92dba06 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Thu, 8 Mar 2018 14:01:39 -0300
Subject: [PATCH v7 1/3] add missing CommandCounterIncrement in
 StorePartitionBound

---
 src/backend/catalog/heap.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c
index cf36ce4add..2b5377bdf2 100644
--- a/src/backend/catalog/heap.c
+++ b/src/backend/catalog/heap.c
@@ -3299,6 +3299,9 @@ StorePartitionBound(Relation rel, Relation parent, 
PartitionBoundSpec *bound)
heap_freetuple(newtuple);
heap_close(classRel, RowExclusiveLock);
 
+   /* Make update visible */
+   CommandCounterIncrement();
+
/*
 * The partition constraint for the default partition depends on the
 * partition bounds of every other partition, so we must invalidate the
-- 
2.11.0

>From 1165c0438c627ea214de9ee4cffa83d89b0aa485 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Thu, 8 Mar 2018 14:04:13 -0300
Subject: [PATCH v7 2/3] Add missing CommandCounterIncrement() in partitioned
 index code

---
 src/backend/catalog/pg_constraint.c | 4 
 src/backend/commands/indexcmds.c| 6 ++
 src/backend/commands/tablecmds.c| 2 --
 3 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/src/backend/catalog/pg_constraint.c 
b/src/backend/catalog/pg_constraint.c
index 731c5e4317..38fdf72877 100644
--- a/src/backend/catalog/pg_constraint.c
+++ b/src/backend/catalog/pg_constraint.c
@@ -18,6 +18,7 @@
 #include "access/heapam.h"
 #include "access/htup_details.h"
 #include "access/sysattr.h"
+#include "access/xact.h"
 #include "catalog/dependency.h"
 #include "catalog/indexing.h"
 #include "catalog/objectaccess.h"
@@ -781,6 +782,9 @@ ConstraintSetParentConstraint(Oid childConstrId, Oid 
parentConstrId)
recordDependencyOn(&depender, &referenced, DEPENDENCY_INTERNAL_AUTO);
 
heap_close(constrRel, RowExclusiveLock);
+
+   /* make update visible */
+   CommandCounterIncrement();
 }
 
 
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 504806b25b..9ca632865b 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1003,6 +1003,9 @@ DefineIndex(Oid relationId,
ReleaseSysCache(tup);
heap_close(pg_index, RowExclusiveLock);
heap_freetuple(newtup);
+
+   /* make update visible */
+   CommandCounterIncrement();
}
}
else
@@ -2512,5 +2515,8 @@ IndexSetParentIndex(Relation partitionIdx, Oid parentOid)
 
recordDependencyOn(&partIdx, &partitionTbl, 
DEPENDENCY_AUTO);
}
+
+   /* make our updates visible */
+   CommandCounterIncrement();
}
 }
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 74e020bffc..7ecfbc17a0 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -14571,8 +14571,6 @@ ATExecAttachPartitionIdx(List **wqueue, Relation 
parentIdx, RangeVar *name)
 
pfree(attmap);
 
-   CommandCounterIncrement();
-
validatePartitionedIndex(parentIdx, parentTbl);
}
 
-- 
2.11.0

>From 975902aa033877165d0aaec556edc09bb02808c7 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Tue, 21 Nov 2017 15:53:11 -0300
Subject: [PATCH v7 3/3] Allow FOR EACH ROW triggers on partitioned tables

---
 src/backend/catalog/heap.c |   1 +
 src/backend/catalog/index.c|   4 +-
 src/backend/catalog/pg_constraint.c|   3 +
 src/backend/commands/tablecmds.c   |  92 +-
 src/backend/commands/trigger.c | 183 +--
 src/backend/commands/typecmds.c|   1 +
 src/backend/tcop/utility.c |   3 +-
 src/include/catalog/indexing.h |   2 +
 src/include/catalog/pg_constraint.h|  39 ++--
 src/include/catalog/pg_constraint_fn.h |   1 +
 src/include/commands/trigger.h |   4 +-
 src/test/regress/expected/triggers.out | 277 

Re: JIT compiling with LLVM v11

2018-03-09 Thread Peter Eisentraut
On 3/6/18 15:16, Andres Freund wrote:
> 2) Don't load the JIT provider until fully needed. Right now
>jit_compile_expr() will load the jit provider even if not really
>needed. We should probably move the first two return blocks in
>llvm_compile_expr() into jit_compile_expr(), to avoid that.

I see that you have implemented that, but it doesn't seem to have helped
with my make installcheck times.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Implementing SQL ASSERTION

2018-03-09 Thread Thomas Munro
On Sat, Mar 10, 2018 at 6:37 AM, Robert Haas  wrote:
> On Mon, Jan 15, 2018 at 11:35 AM, David Fetter  wrote:
>> - We follow the SQL standard and make SERIALIZABLE the default
>>   transaction isolation level, and
>
> The consequences of such a decision would include:
>
> - pgbench -S would run up to 10x slower, at least if these old
> benchmark results are still valid:
>
> https://www.postgresql.org/message-id/ca+tgmozog1wfbyrqzjukilsxw5sdujjguey0c2bqsg-tcis...@mail.gmail.com
>
> - pgbench without -S would fail outright, because it doesn't have
> provision to retry failed transactions.
>
> https://commitfest.postgresql.org/16/1419/
>
> - Many user applications would probably also experience similar difficulties.
>
> - Parallel query would no longer work by default, unless this patch
> gets committed:
>
> https://commitfest.postgresql.org/17/1004/
>
> I think a good deal of work to improve the performance of serializable
> would need to be done before we could even think about making it the
> default -- and even then, the fact that it really requires the
> application to be retry-capable seems like a pretty major obstacle.

Also:

- It's not available on hot standbys.  Experimental patches have been
developed based on the read only safe snapshot concept, but some
tricky problems remain unsolved.

- Performance is terrible (conflicts are maximised) if you use any
index type except btree, unless some of these get committed:

https://commitfest.postgresql.org/17/1172/
https://commitfest.postgresql.org/17/1183/
https://commitfest.postgresql.org/17/1466/

-- 
Thomas Munro
http://www.enterprisedb.com



Re: FOR EACH ROW triggers on partitioned tables

2018-03-09 Thread Alvaro Herrera
Peter Eisentraut wrote:
> On 3/7/18 20:57, Alvaro Herrera wrote:
> > So, unless someone has a brilliant idea on how to construct a column
> > mapping from partitioned table to partition, I'm going back to the
> > design I was proposing earlier, ie., creating individual pg_trigger rows
> > for each partition that are essentially adjusted copies of the ones for
> > the partitioned table.
> 
> Yes, that seems easiest.
> 
> The idea of having only one pg_trigger entry was derived from the
> assumption that we wouldn't need the other ones for anything.  But if
> that doesn't apply, then it's better to just go with the straightforward
> way instead of bending the single-pg_trigger way to our will.

Agreed.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: csv format for psql

2018-03-09 Thread David G. Johnston
On Fri, Mar 9, 2018 at 12:42 PM, Daniel Verite 
wrote:

> I wrote:
>
> > recordsep in the unaligned mode doesn't play the role of a line ending
> > because the last line is not finished by recordsep. According to the
> > source code, this is intended, see print_unaligned_text() in print.c:
>
> Something else comes to mind: CSV allows linefeeds inside fields, and
> we don't want to replace these with record separators.
> So the notion that recordsep can be used to choose line endings
> is even less okay than if there was just the last line issue.
>
>
I'm not following - if recordsep is not something that would be interpreted
as a newline then the file we output would have no structural newlines.  It
might have data newlines but those would be quoted.

David J.


Re: csv format for psql

2018-03-09 Thread Daniel Verite
I wrote:

> recordsep in the unaligned mode doesn't play the role of a line ending
> because the last line is not finished by recordsep. According to the
> source code, this is intended, see print_unaligned_text() in print.c:

Something else comes to mind: CSV allows linefeeds inside fields, and
we don't want to replace these with record separators.
So the notion that recordsep can be used to choose line endings
is even less okay than if there was just the last line issue.


Best regards,
-- 
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite



Re: csv format for psql

2018-03-09 Thread Daniel Verite
Fabien COELHO wrote:

> I also think that a short option brings little value, and "--csv" is good 
> enough for the purpose, so I would agree to remove the "-C" binding.

It's not that accepting -C brings much value by itself, it's that
losing the consistency across all options comes at a cost.
The point is that up to now all options have a short form and a long
form, so --csv would be a deliberate exception. I'm rather unconvinced
it's justified, but I seem to be alone on that, so I'll comply.


> About "fieldsep_csv", I do not like much the principle of having different 
> output variables to represent the same concept depending on the format. I 
> would rather have reused fieldsep as in your previous submission and set 
> it to "," when under --csv

The trouble with fieldsep is that it defaults to '|', which
both you and Pavel say you dislike. Fair enough, it's better
to have ',' by default, but the cleanest solution to that
is fieldsep_csv with its own default.
The solution to set fieldsep automatically to ',' with
\pset format csv is problematic.

For instance
 \pset format csv
 \pset fieldsep ';'

changes fieldsep to ';' as expected, but in the other order

 \pset fieldsep ';'
 \pset format csv

you get ',' while reasonably you'd expect ';'

Same problem on the command line. Options are evaluated left-to-right:

 $ psql --csv -F';'
would work as expected, but
 $ psql -F';' --csv
would not.

I don't feel good about solving these issues with ad-hoc rules,
or inventing the notion that a pset variable has been defined
but not user-redefined. This stuff has "inconsistent" written all over it
and I don't see a maintainer going along with that.

> The "\n" eol style is hardcoded. Should it use "recordsep"? For instance, 
> https://tools.ietf.org/html/rfc4180 seems to specify CRLF end of lines. 
> The definition is evolving, eg https://www.w3.org/TR/tabular-data-model/ 
> accepts both "\r" and "\r\n". I do not like using windows eol, but I think 
> that it should be possible to do it, which is not the case with this 
> version.

Interesting point. The output stream is opened in text mode so printing
'\n' should generate LF on Unix, CR LF on Windows, and I think CR on MacOS.
I think that's for the best.
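
A minimal illustration of that text-mode behavior in plain C: a stream
opened with "w" translates '\n' to CR LF on output on Windows, while "wb"
(binary mode) writes the byte unchanged; on Unix both modes produce a bare
LF.

#include <stdio.h>

int
main(void)
{
	FILE   *f = fopen("out.csv", "w");	/* text mode: platform line endings */

	if (f == NULL)
		return 1;
	fprintf(f, "a,b\n1,2\n");		/* becomes "a,b\r\n1,2\r\n" on Windows */
	fclose(f);
	return 0;
}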

recordsep in the unaligned mode doesn't play the role of a line ending
because the last line is not finished by recordsep. According to the source
code, this is intended, see print_unaligned_text() in print.c:
  /*
   * The last record is terminated by a newline, independent of the set
   * record separator.  But when the record separator is a zero byte, we
   * use that (compatible with find -print0 and xargs).
   */


> The "\pset format" error message in "do_pset" shows values in seemingly 
> random order. The situation is pre-existing but not really satisfactory. 
> I'd suggest to put all values in alphabetical order.

ok

> In csv_print_field & csv_print_text, you are not consistent when putting 
> braces on blocks with only one instruction. I'd suggest not to put braces 
> in that case.

ok

> I'd suggest that tests should include more types, not just strings. I 
> would suggest int, float, timestamp, bytea, an array (which uses , as a 
> separator), json (which uses both " and ,)...

I'll do that, but the printout code is type-agnostic so it's not supposed
to make a difference compared to mere literals.
Cases with NULLs are missing though; I'll go add some too.


Best regards,
-- 
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite



Re: [HACKERS] [FEATURE PATCH] pg_stat_statements with plans (v02)

2018-03-09 Thread Arthur Zakirov
Greetings,

On Thu, Mar 08, 2018 at 02:58:34PM +0100, Julian Markwort wrote:
> > I'd love to hear more feedback, here are two ideas to polish this
> > patch:
> > a) Right now, good_plan and bad_plan collection can be activated and
> > deactivated with separate GUCs. I think it would be sensible to
> > collect
> > either both or none. (This would result in fewer convoluted
> > conditionals.)
> > b) Would you like to be able to tune the percentiles yourself, to
> > adjust for the point at which a new plan is stored?

I'd like to review the patch and leave some feedback. I tested it with
different indexes on the same table and with the same queries, and it works fine.

First of all, GUC variables and functions. How about uniting
'pg_stat_statements.good_plan_enable' and
'pg_stat_statements.bad_plan_enable' into
'pg_stat_statements.track_plan' with the GUC_LIST_INPUT flag? It could accept
the comma-separated values 'good' and 'bad'. That would allow adding another
tracking type in the future without adding a new variable.
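
A sketch of what that single GUC could look like, e.g. in _PG_init(); the
variable name, default value and hooks are assumptions for illustration,
not part of the existing patch:

static char *pgss_track_plan_str = NULL;

DefineCustomStringVariable("pg_stat_statements.track_plan",
						   "Selects which plan kinds are tracked "
						   "(comma-separated list of 'good' and 'bad').",
						   NULL,
						   &pgss_track_plan_str,
						   "good, bad",
						   PGC_SUSET,
						   GUC_LIST_INPUT,
						   NULL,	/* check_hook: validate the list items */
						   NULL,	/* assign_hook */
						   NULL);	/* show_hook */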

In the same manner, the pg_stat_statements_good_plan_reset() and
pg_stat_statements_bad_plan_reset() functions can be combined into one
function:

pg_stat_statements_plan_reset(in queryid bigint, in good boolean, in bad
boolean)

Further comments on the code.

+-- test to see if any plans have been recorded.
+SELECT
+  CASE WHEN good_plan_time > 0 THEN 1 ELSE 0 END,
+  CASE WHEN bad_plan_time > 0 THEN 1 ELSE 0 END,
+  CASE WHEN good_plan_timestamp >= timestamp '1970-01-01 00:00:00' THEN 1 ELSE 
0 END,
+  CASE WHEN good_plan_timestamp >= timestamp '1970-01-01 00:00:00' THEN 1 ELSE 
0 END

I think there is a typo here: the last CASE should use bad_plan_timestamp.

+   good_plan_str = palloc(1 * sizeof(char));
+   *good_plan_str = '\0';
...
+   bad_plan_str = palloc(1 * sizeof(char));
+   *bad_plan_str = '\0';

Here we can use empty string literals instead of pallocs. For example:

const char *good_plan_str;
const char *bad_plan_str;
...
good_plan_str = "";
...
bad_plan_str = "";

+   interquartile_dist = 2.0*(0.6745 * 
sqrt(e->counters.sum_var_time / e->counters.calls));

It is worth checking e->counters.calls for zero here, because the entry
can be sticky.
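
Something along these lines, for example (only a sketch around the quoted
line):

    if (e->counters.calls > 0)
        interquartile_dist = 2.0 * (0.6745 *
            sqrt(e->counters.sum_var_time / e->counters.calls));
    else
        interquartile_dist = 0.0;   /* sticky entry, no completed calls yet */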

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company



Re: PATCH: Configurable file mode mask

2018-03-09 Thread David Steele
Hi Michael,

On 3/7/18 8:51 PM, Michael Paquier wrote:
> On Wed, Mar 07, 2018 at 10:56:32AM -0500, David Steele wrote:
>> On 3/6/18 10:04 PM, Michael Paquier wrote:
>>> Seems like you forgot to re-add the chmod calls in initdb.c.
>>
>> Hmmm, I thought we were talking about moving the position of umask().
>>
>> I don't think the chmod() calls are needed because umask() is being set.
>>  The tests show that the config files have the proper permissions after
>> inidb, so this just looks like redundant code to me.
> 
> Let's discuss that on a separate thread then, there could be something
> we are missing, but I agree that those should not be needed.  Looking at
> the code history, those calls have been around since the beginning of
> PG-times.

Done.

>> Another way to do this would be to make the function generic and
>> stipulate that the postmaster must be shut down before running the
>> function.  We could check postmaster.pid permissions as a separate
>> test.
> 
> Yeah, that looks like a sensitive approach as this could be run
> post-initdb or after taking a backup.  This way we would avoid other
> similar behaviors in the future...  Still postmaster.pid is an
> exception.

Done.

>>> sub clean_rewind_test
>>> {
>>> +   ok (check_pg_data_perm($node_master->data_dir(), 0));
>>> +
>>> $node_master->teardown_node  if defined $node_master;
>>> I have also a pending patch for pg_rewind which adds read-only files in
>>> the data folders of a new test, so this would cause this part to blow
>>> up.  Testing that for all the test sets does not bring additional value
>>> as well, and doing it at cleanup phase is also confusing.  So could you
>>> move that check into only one test's script?  You could remove the umask
>>> call in 003_extrafiles.pl as well this way, and reduce the patch diffs.
>>
>> I think I would rather have a way to skip the permission test rather
>> than disable it for most of the tests.  pg_rewind does more writes into
>> PGDATA that anything other than a backend.  Good coverage seems like a
>> plus.
> 
> All the tests basically run the same scenario, with minimal variations,
> so let's do the test once in the test touching the highest amount of
> files and call it good.

OK, test 001 is used to check default mode and 002 is used to check
group mode.  The rest are left as-is.  Does that work for you?

> I have begun to read through patch 3.
> 
> -if (stat_buf.st_mode & (S_IRWXG | S_IRWXO))
> +if (stat_buf.st_mode & PG_MODE_MASK_ALLOW_GROUP)
>  ereport(FATAL,
>  (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> - errmsg("data directory \"%s\" has group or world access",
> + errmsg("data directory \"%s\" has invalid permissions",
>  DataDir),
> - errdetail("Permissions should be u=rwx (0700).")));
> + errdetail("Permissions should be u=rwx (0700) or u=rwx,g=rx 
> (0750).")));
> Hm.  This relaxes the checks and concerns me a lot.  What if users want
> to keep strict permission all the time and rely on the existing check to
> be sure that this gets never changed post-initdb?  For such users, we
> may want to track if cluster has been initialized with group access, in
> which case tracking that in the control file would be more adapted. 
> Then the startup checks should use this configuration.  Another idea
> would be a startup option.  So, I cannot believe that all users would
> like to see such checks relaxed.  Some of my users would surely complain
> about such sanity checks relaxed unconditionally, so making this
> configurable would be fine, and the current approach is not acceptable
> in my opinion.

How about a GUC that enforces one mode or the other on startup?  Default
would be 700.  The GUC can be set automatically by initdb based on the
-g option.  We had this GUC originally, but since the front-end tools
can't read it we abandoned it.  Seems like it would be good as an
enforcing mechanism, though.
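
In rough terms, the startup check could then key off that GUC, e.g. (a
sketch only; the GUC name and the strict mask are assumptions, only
PG_MODE_MASK_ALLOW_GROUP comes from the patch):

    if (stat_buf.st_mode & (allow_group_access ?
                            PG_MODE_MASK_ALLOW_GROUP : (S_IRWXG | S_IRWXO)))
        ereport(FATAL,
                (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
                 errmsg("data directory \"%s\" has invalid permissions",
                        DataDir)));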

Thanks,
-- 
-David
da...@pgmasters.net
From 19d827b9d9c0285556e977d3962619deb84c3c0e Mon Sep 17 00:00:00 2001
From: David Steele 
Date: Mon, 5 Mar 2018 14:33:10 -0500
Subject: [PATCH 1/3] pg_resetwal tests.

Adds a very basic test suite for pg_resetwal.
---
 src/bin/pg_resetwal/.gitignore |  1 +
 src/bin/pg_resetwal/Makefile   |  6 +
 src/bin/pg_resetwal/t/001_basic.pl | 53 ++
 3 files changed, 60 insertions(+)
 create mode 100644 src/bin/pg_resetwal/t/001_basic.pl

diff --git a/src/bin/pg_resetwal/.gitignore b/src/bin/pg_resetwal/.gitignore
index 236abb4323..a950255fd7 100644
--- a/src/bin/pg_resetwal/.gitignore
+++ b/src/bin/pg_resetwal/.gitignore
@@ -1 +1,2 @@
 /pg_resetwal
+/tmp_check
diff --git a/src/bin/pg_resetwal/Makefile b/src/bin/pg_resetwal/Makefile
index 5ab7ad33e0..13c9ca6279 100644
--- a/src/bin/pg_resetwal/Makefile
+++ b/src/bin/pg_resetwal/Makefile
@@ -33,3 +33,9 @@ uninstall:
 
 clean distclean maintainer-clean:
rm -f p

Re: [HACKERS] Runtime Partition Pruning

2018-03-09 Thread Jesper Pedersen

Hi David,

On 03/01/2018 08:29 PM, David Rowley wrote:

0004 fails "make check-world" due to

pg_restore: [archiver (db)] Error while PROCESSING TOC:
pg_restore: [archiver (db)] Error from TOC entry 670; 1259 49954 TABLE
boolp_f jpedersen
pg_restore: [archiver (db)] could not execute query: ERROR:  syntax error at
or near "false"
LINE 24: ..." ATTACH PARTITION "public"."boolp_f" FOR VALUES IN (false);


The tests seem to have stumbled on a pg_dump bug which causes it to
produce syntax that's not valid (currently)

I should be able to stop my patch failing the test by dropping that
table, which I should have been doing anyway.



I've added that thread to the Open Items list.


Thanks for the review and in advance for the future review.

I'll delay releasing a new patch as there's some discussion over on
the faster partition pruning thread which affects this too [1]

[1] 
https://www.postgresql.org/message-id/CA+Tgmoa4D1c4roj7L8cx8gkkeBWAZD=mtcxkxtwbnslrhd3...@mail.gmail.com



Sure, 0003-0005 depends on that thread. 0002 is refactoring so that one 
is ready.


In 0004 should we add a HASH based test case,

-- test.sql --
-- verify pruning in hash partitions
create table hashp (a int primary key, b int) partition by hash (a);
create table hashp_0 partition of hashp for values with (modulus 2, 
remainder 0);
create table hashp_1 partition of hashp for values with (modulus 2, 
remainder 1);

insert into hashp values (0,0), (1,1), (2,2), (3,3);
prepare q1 (int) as select * from hashp where a = $1;
execute q1 (1);
execute q1 (1);
execute q1 (1);
execute q1 (1);
execute q1 (1);
explain (analyze, costs off, summary off, timing off)  execute q1 (1);
explain (analyze, costs off, summary off, timing off)  execute q1 (3);
deallocate q1;
drop table hashp;
-- test.sql --

Also, should 0004 consider the "Parallel Append" case, aka

-- parallel.sql --
CREATE TABLE t1 (
a integer NOT NULL,
b integer NOT NULL
) PARTITION BY HASH (b);

CREATE TABLE t1_p00 PARTITION OF t1 FOR VALUES WITH (MODULUS 4, 
REMAINDER 0);
CREATE TABLE t1_p01 PARTITION OF t1 FOR VALUES WITH (MODULUS 4, 
REMAINDER 1);
CREATE TABLE t1_p02 PARTITION OF t1 FOR VALUES WITH (MODULUS 4, 
REMAINDER 2);
CREATE TABLE t1_p03 PARTITION OF t1 FOR VALUES WITH (MODULUS 4, 
REMAINDER 3);

INSERT INTO t1 (SELECT i, i FROM generate_series(1, 100) AS i);
PREPARE q1 (int) AS SELECT * FROM t1 WHERE a = $1;
EXECUTE q1 (5432);
EXECUTE q1 (5432);
EXECUTE q1 (5432);
EXECUTE q1 (5432);
EXECUTE q1 (5432);
EXPLAIN (ANALYZE, COSTS OFF, SUMMARY OFF, TIMING OFF)  EXECUTE q1 (5432);
DEALLOCATE q1;
DROP TABLE t1;
-- parallel.sql --

Best regards,
 Jesper



Re: [HACKERS] GUC for cleanup indexes threshold.

2018-03-09 Thread Alexander Korotkov
On Fri, Mar 9, 2018 at 3:12 PM, Masahiko Sawada 
wrote:

> On Fri, Mar 9, 2018 at 8:43 AM, Alexander Korotkov
>  wrote:
> > 2) These parameters are reset during btbulkdelete() and set during
> > btvacuumcleanup().
>
> Can't we set these parameters even during btbulkdelete()? By keeping
> them up to date, we will be able to avoid unnecessary cleanup vacuums
> even after an index bulk-delete.
>

We certainly can update the cleanup-related parameters during btbulkdelete().
However, in that case we would update the B-tree meta-page during each
VACUUM cycle.  That may cause some overhead for non-append-only
workloads.  I don't think this overhead would be noticeable, because in
non-append-only scenarios VACUUM typically writes much more information.
But I would like this patch, which is oriented towards append-only workloads,
to be as harmless as possible for other workloads.

> I've not reviewed the code deeply yet, but the regression test of this
> patch seems to wrongly pass the regression tests, and the bt_metap()
> function of pageinspect needs to be updated.


Right.  Thank you for fixing this issue.


> Attached is an updated patch
> that fixes these issues. I will review the patch again.


Thank you!

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: Faster inserts with mostly-monotonically increasing values

2018-03-09 Thread Claudio Freire
On Fri, Mar 9, 2018 at 2:54 PM, Pavan Deolasee  wrote:
>
>
> On Tue, Mar 6, 2018 at 10:10 AM, Pavan Deolasee 
> wrote:
>>
>>
>>
>> On Tue, Mar 6, 2018 at 7:29 AM, Peter Geoghegan  wrote:
>>>
>>> On Mon, Mar 5, 2018 at 5:48 PM, Claudio Freire 
>>> wrote:
>>>
>>> > I believe PKs are a prime candidate for this optimization, and
>>> > expecting it to apply only when no concurrency is involved is severely
>>> > dumbing down the optimization.
>>>
>>> Pavan justified the patch using a benchmark that only involved a
>>> single client -- hardly typical for a patch that changes the B-Tree
>>> code. If the benefits with many clients can be shown to matter, that
>>> will make this much more interesting to me.
>>
>>
>> Ok. I will repeat those tests with more number of clients and report back.
>>
>
> So I repeated the tests with 1,2,4 and 8 clients, each running the following
> statement and a total of 1024 transactions. So roughly 100M rows are
> inserted.
>
> INSERT INTO testtab(b) SELECT generate_series(1,10);
>
> The table definition is:
> postgres=# \d+ testtab
>                                                Table "public.testtab"
>  Column |  Type  | Collation | Nullable |              Default               | Storage | Stats target | Description
> --------+--------+-----------+----------+------------------------------------+---------+--------------+-------------
>  a      | bigint |           | not null | nextval('testtab_a_seq'::regclass) | plain   |              |
>  b      | bigint |           |          |                                    | plain   |              |
> Indexes:
>     "testtab_a_key" UNIQUE CONSTRAINT, btree (a)
>
>
> After taking average of 3-runs:
>
> +---------+-----------------------+----------------------+
> | clients | Patched - time in sec | Master - time in sec |
> +---------+-----------------------+----------------------+
> | 1       | 311.8643602           | 411.832757           |
> | 2       | 252.5433              | 300.7875613          |
> | 4       | 337.0414279           | 350.9636766          |
> | 8       | 444.2035582           | 477.1903417          |
> +---------+-----------------------+----------------------+
>
> So yes, the benefits of the patch go down with higher number of clients, but
> it does not entirely vanish.

What if you implement my suggestion?

That should improve the multi-client case considerably.



Re: Failed to request an autovacuum work-item in silence

2018-03-09 Thread Masahiko Sawada
Thank you for reviewing!

On Thu, Mar 8, 2018 at 6:07 PM, Ildar Musin  wrote:
> Just couple remarks. I would rename 'requested' variable in
> AutoVacuumRequestWork() func to something like 'success' or 'result'.
> Because request is something caller does. And I would also rephrase log
> message as follows:
>
>request for autovacuum work item "%s" for "%s" failed

Agreed.

On Thu, Mar 8, 2018 at 10:46 PM, Alvaro Herrera
 wrote:
> Hi
>
> I was thinking that the BRIN code requesting the workitem would print
> the error message based on the return value.  There is no point to
> returning a boolean indicator if the caller isn't going to do anything
> with it ...  This means you don't need to convert the type to string in
> autovacuum.c (which would defeat attempts at generalizing this code).
>

Agreed.

Attached an updated patch. Please review it.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 68b3371..6897b7f 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -33,6 +33,7 @@
 #include "utils/index_selfuncs.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
+#include "utils/lsyscache.h"
 
 
 /*
@@ -187,9 +188,19 @@ brininsert(Relation idxRel, Datum *values, bool *nulls,
 brinGetTupleForHeapBlock(revmap, lastPageRange, &buf, &off,
 		 NULL, BUFFER_LOCK_SHARE, NULL);
 			if (!lastPageTuple)
-AutoVacuumRequestWork(AVW_BRINSummarizeRange,
-	  RelationGetRelid(idxRel),
-	  lastPageRange);
+			{
+bool requested;
+
+requested = AutoVacuumRequestWork(AVW_BRINSummarizeRange,
+  RelationGetRelid(idxRel),
+  lastPageRange);
+
+if (!requested)
+	ereport(LOG, (errcode(ERRCODE_CONFIGURATION_LIMIT_EXCEEDED),
+  errmsg("request for autovacuum work item BrinSummarizeRange for \"%s\" failed",
+		 RelationGetRelationName(idxRel))));
+			}
+
 			else
 LockBuffer(buf, BUFFER_LOCK_UNLOCK);
 		}
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 21f5e2c..fa4b42c 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -343,6 +343,7 @@ static void perform_work_item(AutoVacuumWorkItem *workitem);
 static void autovac_report_activity(autovac_table *tab);
 static void autovac_report_workitem(AutoVacuumWorkItem *workitem,
 		const char *nspname, const char *relname);
+static const char *autovac_get_workitem_name(AutoVacuumWorkItemType type);
 static void av_sighup_handler(SIGNAL_ARGS);
 static void avl_sigusr2_handler(SIGNAL_ARGS);
 static void avl_sigterm_handler(SIGNAL_ARGS);
@@ -3207,11 +3208,12 @@ AutoVacuumingActive(void)
 /*
  * Request one work item to the next autovacuum run processing our database.
  */
-void
+bool
 AutoVacuumRequestWork(AutoVacuumWorkItemType type, Oid relationId,
 	  BlockNumber blkno)
 {
 	int			i;
+	bool		result = false;
 
 	LWLockAcquire(AutovacuumLock, LW_EXCLUSIVE);
 
@@ -3231,12 +3233,15 @@ AutoVacuumRequestWork(AutoVacuumWorkItemType type, Oid relationId,
 		workitem->avw_database = MyDatabaseId;
 		workitem->avw_relation = relationId;
 		workitem->avw_blockNumber = blkno;
+		result = true;
 
 		/* done */
 		break;
 	}
 
 	LWLockRelease(AutovacuumLock);
+
+	return result;
 }
 
 /*
diff --git a/src/include/postmaster/autovacuum.h b/src/include/postmaster/autovacuum.h
index 18cff54..96752ca 100644
--- a/src/include/postmaster/autovacuum.h
+++ b/src/include/postmaster/autovacuum.h
@@ -71,7 +71,7 @@ extern void AutovacuumWorkerIAm(void);
 extern void AutovacuumLauncherIAm(void);
 #endif
 
-extern void AutoVacuumRequestWork(AutoVacuumWorkItemType type,
+extern bool AutoVacuumRequestWork(AutoVacuumWorkItemType type,
 	  Oid relationId, BlockNumber blkno);
 
 /* shared memory stuff */


Re: Faster inserts with mostly-monotonically increasing values

2018-03-09 Thread Pavan Deolasee
On Tue, Mar 6, 2018 at 10:10 AM, Pavan Deolasee 
wrote:

>
>
> On Tue, Mar 6, 2018 at 7:29 AM, Peter Geoghegan  wrote:
>
>> On Mon, Mar 5, 2018 at 5:48 PM, Claudio Freire 
>> wrote:
>>
>> > I believe PKs are a prime candidate for this optimization, and
>> > expecting it to apply only when no concurrency is involved is severely
>> > dumbing down the optimization.
>>
>> Pavan justified the patch using a benchmark that only involved a
>> single client -- hardly typical for a patch that changes the B-Tree
>> code. If the benefits with many clients can be shown to matter, that
>> will make this much more interesting to me.
>
>
> Ok. I will repeat those tests with more number of clients and report back.
>
>
So I repeated the tests with 1,2,4 and 8 clients, each running the
following statement and a total of 1024 transactions. So roughly 100M rows
are inserted.

INSERT INTO testtab(b) SELECT generate_series(1,10);

The table definition is:
postgres=# \d+ testtab
                                               Table "public.testtab"
 Column |  Type  | Collation | Nullable |              Default               | Storage | Stats target | Description
--------+--------+-----------+----------+------------------------------------+---------+--------------+-------------
 a      | bigint |           | not null | nextval('testtab_a_seq'::regclass) | plain   |              |
 b      | bigint |           |          |                                    | plain   |              |
Indexes:
    "testtab_a_key" UNIQUE CONSTRAINT, btree (a)


After taking average of 3-runs:

+---------+-----------------------+----------------------+
| clients | Patched - time in sec | Master - time in sec |
+---------+-----------------------+----------------------+
| 1       | 311.8643602           | 411.832757           |
| 2       | 252.5433              | 300.7875613          |
| 4       | 337.0414279           | 350.9636766          |
| 8       | 444.2035582           | 477.1903417          |
+---------+-----------------------+----------------------+

So yes, the benefits of the patch go down with higher number of clients,
but it does not entirely vanish.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Implementing SQL ASSERTION

2018-03-09 Thread Robert Haas
On Mon, Jan 15, 2018 at 11:35 AM, David Fetter  wrote:
> - We follow the SQL standard and make SERIALIZABLE the default
>   transaction isolation level, and

The consequences of such a decision would include:

- pgbench -S would run up to 10x slower, at least if these old
benchmark results are still valid:

https://www.postgresql.org/message-id/ca+tgmozog1wfbyrqzjukilsxw5sdujjguey0c2bqsg-tcis...@mail.gmail.com

- pgbench without -S would fail outright, because it doesn't have
provision to retry failed transactions.

https://commitfest.postgresql.org/16/1419/

- Many user applications would probably also experience similar difficulties.

- Parallel query would no longer work by default, unless this patch
gets committed:

https://commitfest.postgresql.org/17/1004/

I think a good deal of work to improve the performance of serializable
would need to be done before we could even think about making it the
default -- and even then, the fact that it really requires the
application to be retry-capable seems like a pretty major obstacle.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Precision loss casting float to numeric

2018-03-09 Thread Emre Hasegeli
>> I wonder if an alternative to making a cast that can't be immutable,
>> because it looks at a GUC, could be to offer a choice of cast
>> functions: if you need the other behavior, create a schema, do a
>> CREATE CAST in it, and put it on your search path ahead of pg_catalog.
>
> Nope, because casts aren't schema-local, since they don't have names.
> There can be only one cast between given source and target types.

In this case, I cannot see any other option than adding those as
separate cast functions.  Should we mark this entry as "returned with
feedback"?

We can also consider turning the current float to numeric casts into
explicit ones, as they are causing data loss.  I am not sure how much that
would impact backwards compatibility.  The counter-argument is that the
numeric to float casts are IMPLICIT.  They are causing data loss for
sure, but I believe there are different reasons to keep them as
IMPLICIT.
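
For illustration, the loss we are talking about (the displayed digits depend
on extra_float_digits, so treat the exact output as approximate):

  -- float8 -> numeric goes through the ~15-digit text form, so the extra
  -- binary precision of the double nearest to 0.1 is dropped:
  SELECT 0.1::float8::numeric;                   -- 0.1
  -- numeric -> float8 (currently IMPLICIT) also drops digits:
  SELECT 12345678901234567890::numeric::float8;  -- 1.23456789012346e+19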



Re: csv format for psql

2018-03-09 Thread Pavel Stehule
2018-03-09 8:40 GMT+01:00 Fabien COELHO :

>
> About "fieldsep_csv", I do not like much the principle of having different
> output variables to represent the same concept depending on the format. I
> would rather have reused fieldsep as in your previous submission and set it
> to "," when under --csv.

>>>
>>> yes
>>>
>>>
>> how will be possible to set different separator ';'? I don't see it with
>> described design
>>
>
> Indeed, it should be possible. I think that the following should be made
> to work:
>
>   psql --csv -P fieldsep=; -c 'TABLE foo' > foo.csv
>
> So that it can be changed the semi-colon (or tab or whatever) style if
> required.
>

ok, then all is ok

Regards

Pavel

>
> --
> Fabien.
>


Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2018-03-09 Thread Peter Eisentraut
I think this patch is not going to be ready for PG11.

- It depends on some work in the thread "logical decoding of two-phase
transactions", which is still in progress.

- Various details in the logical_work_mem patch (0001) are unresolved.

- This being partially a performance feature, we haven't seen any
performance tests (e.g., which settings result in which latencies under
which workloads).

That said, the feature seems useful and desirable, and the
implementation makes sense.  There are documentation and tests.  But
there is a significant amount of design and coding work still necessary.

Attached is a fixup patch that I needed to make it compile.

The last two patches in your series (0008, 0009) are labeled as bug
fixes.  Would you like to argue that they should be applied
independently of the rest of the feature?

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From 7ac3c2b16f9976c75a0feea3131a36bdf50da2f8 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Fri, 9 Mar 2018 10:50:33 -0500
Subject: [PATCH] fixup! Track statistics for streaming/spilling

---
 src/include/catalog/pg_proc.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 9d6c88f0c1..f1cea24379 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -2901,7 +2901,7 @@ DATA(insert OID = 2022 (  pg_stat_get_activity
PGNSP PGUID 12 1 100 0 0 f f f
 DESCR("statistics: information about currently active backends");
 DATA(insert OID = 3318 (  pg_stat_get_progress_info  PGNSP 
PGUID 12 1 100 0 0 f f f t t s r 1 0 2249 "25" 
"{25,23,26,26,20,20,20,20,20,20,20,20,20,20}" "{i,o,o,o,o,o,o,o,o,o,o,o,o,o}" 
"{cmdtype,pid,datid,relid,param1,param2,param3,param4,param5,param6,param7,param8,param9,param10}"
 _null_ _null_ pg_stat_get_progress_info _null_ _null_ _null_ ));
 DESCR("statistics: information about progress of backends running maintenance 
command");
-DATA(insert OID = 3099 (  pg_stat_get_wal_senders  PGNSP PGUID 12 1 10 0 0 
f f f f t s r 0 0 2249 "" 
"{23,25,3220,3220,3220,3220,1186,1186,1186,23,25,20,20,20,20,20,20}" 
"{o,o,o,o,o,o,o,o,o,o,o}" 
"{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state,spill_txns,spill_count,spill_bytes,stream_txns,stream_count,stream_bytes}"
 _null_ _null_ pg_stat_get_wal_senders _null_ _null_ _null_ ));
+DATA(insert OID = 3099 (  pg_stat_get_wal_senders  PGNSP PGUID 12 1 10 0 0 
f f f f t s r 0 0 2249 "" 
"{23,25,3220,3220,3220,3220,1186,1186,1186,23,25,20,20,20,20,20,20}" 
"{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}" 
"{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state,spill_txns,spill_count,spill_bytes,stream_txns,stream_count,stream_bytes}"
 _null_ _null_ pg_stat_get_wal_senders _null_ _null_ _null_ ));
 DESCR("statistics: information about currently active replication");
 DATA(insert OID = 3317 (  pg_stat_get_wal_receiver PGNSP PGUID 12 1 0 0 0 
f f f f f s r 0 0 2249 "" "{23,25,3220,23,3220,23,1184,1184,3220,1184,25,25}" 
"{o,o,o,o,o,o,o,o,o,o,o,o}" 
"{pid,status,receive_start_lsn,receive_start_tli,received_lsn,received_tli,last_msg_send_time,last_msg_receipt_time,latest_end_lsn,latest_end_time,slot_name,conninfo}"
 _null_ _null_ pg_stat_get_wal_receiver _null_ _null_ _null_ ));
 DESCR("statistics: information about WAL receiver");
-- 
2.16.2



Re: Changing WAL Header to reduce contention during ReserveXLogInsertLocation()

2018-03-09 Thread Peter Eisentraut
On 2/1/18 19:21, Simon Riggs wrote:
> If we really can't persuade you of that, it doesn't sink the patch. We
> can have the WAL pointer itself - it wouldn't save space but it would
> at least alleviate the spinlock.

Do you want to send in an alternative patch that preserves the WAL
pointer and only changes the locking?

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] MERGE SQL Statement for PG11

2018-03-09 Thread Alvaro Herrera
Please don't forget to remove the xlog.c and miscadmin.h hunks from the
patch.

-- 
Álvaro Herrera    https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: using index or check in ALTER TABLE SET NOT NULL

2018-03-09 Thread Sergei Kornilov
Hello all!
Summarizing, how should I update the tests?
Do we want to test the actual skipping of the verify phase, or only the
correctness of the result?

With the tests in v4 of my patch I can verify that I do not break ALTER TABLE,
but they do not exercise the new code path, so my feature could be broken in
the future without any test failing; ALTER TABLE would still be correct
regardless of this feature.
Or must I test the code path too? I understood Tom and will remove the INFO
ereport from the patch anyway. But currently I have no idea how to check the
code path. As mentioned earlier, using a debug ereport is awkward as well.
There is no suitable event trigger to track what exactly happens with the
table, and adding a new one just for test purposes seems like overkill.

regards, Sergei



Re: Testbed for predtest.c ... and some arguable bugs therein

2018-03-09 Thread Tom Lane
Robert Haas  writes:
> The use of the terms "refutes", "non-truth", and "non-falsity" drive
> me a little batty; they've all got an internal negation built into
> them, and that's confusing.  It seems very tempting to propose getting
> rid of predicate_refuted_by altogether and to simply have one
> function:

> if we know that everything in clause_list is { true | false } [ or
> null ], does that imply that everything in predicate_list must be {
> true | false } [ or null ]?

I don't think that's terribly practical, because the proof rules are
different.  To prove an AND condition true, you must prove all its
elements true; to prove it false, you need only prove one element
false.  The reverse for OR.  No doubt we could physically merge the two
functions, but I do not think it'd end up being any sort of improvement.

> saying that "A weakly refutes B" is a heck of a lot less clear, at
> least in my book, than saying "If A is true or null, B must also be
> true or null."

Well, I beg to differ.  Once you've internalized the definitions, the
former is a lot shorter.  Or in other words: programming is all about
inventing a suitable notation for expressing your thoughts.

> I realize this is all a little to one side of what you're talking
> about here, but the fact that even you got confused about whether the
> existing logic was correct, and that you found that there were
> multiple possible definitions of what "correct" means, seems like
> pretty good evidence that this is not as clear as it could be.

To my mind, the pre-v10 logic was fine, and the confusion was created
by muddled thinking while introducing the weak proof rules.  What we're
trying to do now is pin down exactly what the weak rules need to be
so we can document them properly.  It's conceivable that a different
code-level representation of the rules would be clearer, but I'm not
convinced of that, and anyway there's no hope of finding such a
representation when we don't have the rules straight.

As an example of the sort of documentation I'm thinking of, here's
what I've written so far about why operator_predicate_proof isn't
already broken:

 * We mostly need not distinguish strong vs. weak implication/refutation here.
 * This depends on an assumption (which we cannot check) that a pair of
 * related operators will not return one NULL and one non-NULL result for the
 * same inputs.  Then, for the proof types where we start with an assumption
 * of truth of the clause, the predicate operator could not return NULL
 * either, so it doesn't matter whether we are trying to make a strong or weak
 * proof.  For weak implication, it could be that the clause operator returned
 * NULL, but then the predicate operator would as well, so that the weak
 * implication still holds.  This argument doesn't apply in the case where we
 * are considering two different constant values, since then the operators
 * aren't being given identical inputs.  But we only support that for btree
 * operators, for which we can assume that all non-null inputs result in
 * non-null outputs, so that it doesn't matter which two non-null constants
 * we consider.  Currently the code below just reports "proof failed" if
 * either constant is NULL, but in some cases we could be smarter.

(I expect the last sentence, as well as the code, will get some revisions
arising from the "constraint exclusion and null values" thread; but this
is correct today.)

I don't really think this would be any clearer if I'd avoided the
implication/refutation jargon.

regards, tom lane



Re: Fixes for missing schema qualifications

2018-03-09 Thread David Steele
On 3/9/18 2:55 AM, Michael Paquier wrote:
> 
> In light of CVE-2018-1058, user's applications need to be careful about
> the use of schema-unqualified queries.  A lookup at the upstream code is
> showing four areas which are missing such handling:
> - psql has one problem in get_create_object_cmd which misses twice to
> qualify array_remove().
> - isolationtester is missing one for a call to pg_backend_pid()
> - information_schema.sql has one problem as well: the function
> _pg_interval_type does not qualify upper().  Please note that there is
> no need to care about view's bodies because those use OID references, so
> only the function body need to be taken care of.
> - worker_spi scans pg_namespace and uses count() without schema
> qualification.
> 
> Attached is a patch which fixes all four of them, and which should be
> back-patched.  For information_schema.sql, users can always replace the
> body of the function by redefining them (using SET search_path in CREATE
> FUNCTION would work as well however this is more costly than a simple
> qualification).

These look sane to me.  Did you check the back branches for anything
that might not exist in HEAD?

Regards,
-- 
-David
da...@pgmasters.net



Re: disable SSL compression?

2018-03-09 Thread Magnus Hagander
On Fri, Mar 9, 2018 at 3:06 AM, Peter Eisentraut <
peter.eisentr...@2ndquadrant.com> wrote:

> On 3/8/18 14:23, Claudio Freire wrote:
> > On Thu, Mar 8, 2018 at 3:40 PM, Peter Eisentraut
> >  wrote:
> >> It appears that SSL compression is nowadays deprecated as insecure.
> >> Yet, it is still enabled by libpq by default, and there is no way to
> >> disable it in the server.  Should we make some changes here?  Does
> >> anyone know more about this?
> >
> > Even if libpq enables it, it has to be enabled both in the client and
> > the server for it to work.
> >
> > OpenSSL disables the whole feature by default, and enabling it is
> > rather cumbersome. The result is that, at least with OpenSSL, the
> > server and client won't accept compression without extensive fiddling
> > by the user.
>
> But however that may be, libpq appears to enable it by default.  This is
> what I get from psql:
>
> SSL connection (protocol: TLSv1.2, cipher: ECDHE-RSA-AES256-GCM-SHA384,
> bits: 256, compression: on)
>
>
What platform does that actually work out of the box on? I have customers
who actively want to use it (for compression, not security -- replication
across limited and metered links), and the amount of workarounds they have
to put in place at the OS level to get it working is increasingly complicated.

That said, I think it makes sense to not have libpq enable it by default.
The simple change is to just have libpq default to it being off while still
allowing it to be turned on. I don't really see any downside of that at all
(given the state of the libraries), and it's supposedly a trivial change.
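
Clients that really want it for bandwidth reasons could then still ask for it
explicitly through the existing connection option, e.g. (assuming the OS-level
OpenSSL hurdles mentioned above are dealt with):

  psql "host=replica.example.com dbname=postgres sslmode=require sslcompression=1"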



-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: FOR EACH ROW triggers on partitioned tables

2018-03-09 Thread Peter Eisentraut
On 3/7/18 20:57, Alvaro Herrera wrote:
> So, unless someone has a brilliant idea on how to construct a column
> mapping from partitioned table to partition, I'm going back to the
> design I was proposing earlier, ie., creating individual pg_trigger rows
> for each partition that are essentially adjusted copies of the ones for
> the partitioned table.

Yes, that seems easiest.

The idea of having only one pg_trigger entry was derived from the
assumption that we wouldn't need the other ones for anything.  But if
that doesn't apply, then it's better to just go with the straightforward
way instead of bending the single-pg_trigger way to our will.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [PATCH] pg_hba.conf : new auth option : clientcert=verify-full

2018-03-09 Thread Julian Markwort
Hello Magnus,

> I think this makes a lot of sense, and can definitely be a useful
> option.

I was hesitant to write a long and elaborate patch as I wasn't certain
whether there was any interest in such an addition, but I'm thankful for
your input.

> However, the patch is completely lacking documentation, which
> obviously make it a no-starter. 

I'll write the missing documentation shortly.

> Also if I read it right, if the CN is not correct, it will give the
> error message "certificate authentication failed for user ...". I
> realize this comes from the re-use of the code, but I don't think
> this makes it very useful. We  need to separate these two things.

The error message "certificate authentication failed for user XYZ:
client certificate contains no user name" is the result of calling
CheckCertAuth when the user presented a certificate without a CN in it.

The error message that is presented to the user upon trying to connect
with a certificate containing a CN other than the username is:

-
psql: FATAL: password authentication failed for user "nottestuser"
-

The server's log contains the lines:

-
2018-03-09 13:06:43.111 CET [3310] LOG:  provided user name
(nottestuser) and authenticated user name (testuser) do not match
2018-03-09 13:06:43.111 CET [3310] FATAL:  password authentication
failed for user "nottestuser"
2018-03-09 13:06:43.111 CET [3310] DETAIL:  Connection matched
pg_hba.conf line 97: "hostssl all nottestuser 127.0.0.1/32 password
clientcert=verify-full"
-

I'd argue that the message in the log file is consistent and useful;
however, the message given by psql (or any libpq application for that
matter) leaves uncertainty regarding the correctness of the provided
password, for example.
I could attach the log message of CheckCertAuth to the logdetail,
however then I'd have issues if there is already something written to
the logdetail.
I could also use an additional ereport() call whenever clientcert was
set to verify-full and the user name didn't match the CN.

Kind regards
Julian

smime.p7s
Description: S/MIME cryptographic signature


Re: [HACKERS] GUC for cleanup indexes threshold.

2018-03-09 Thread Masahiko Sawada
On Fri, Mar 9, 2018 at 8:43 AM, Alexander Korotkov
 wrote:
> Hi!
>

Sorry for my late reply.

> I'd like to propose a revised patch based on various ideas upthread.

Thank you for proposing the patch!

>
> This patch works as following.
>
> 1) B-tree meta page is extended with 2 additional parameters:
>  * btm_oldest_btpo_xact – oldest btpo_xact among of deleted pages,
>  * btm_last_cleanup_num_heap_tuples – number of heap tuples during last
> cleanup scan.
>
> 2) These parameters are reset during btbulkdelete() and set during
> btvacuumcleanup().

Can't we set these parameters even during btbulkdelete()? By keeping
them up to date, we will be able to avoid unnecessary cleanup vacuums
even after an index bulk-delete.

>
> 3) Index scans during second and subsequent btvacuumcleanup() happen only if
>    btm_oldest_btpo_xact is older than RecentGlobalXmin
>    OR num_heap_tuples >= btm_last_cleanup_num_heap_tuples * (1 +
> vacuum_cleanup_index_scale_factor).
>
> In other words btvacuumcleanup() scans the index only if there are
> recyclable pages,
> or index statistics is stalled (inserted more than
> vacuum_cleanup_index_scale_factor
> since last index statistics collection).
>
> 4) vacuum_cleanup_index_scale_factor can be set either by GUC or reloption.
> Default value is 0.1.  So, by default cleanup scan is triggered after
> increasing of
> table size by 10%.
>
> 5) Since new fields are added to the metapage, BTREE_VERSION is bumped.
> In order to support pg_upgrade, read of previous metapage version is
> supported.
> On metapage rewrite, it's upgraded to the new version.
>
> So, since we don't skip scan of recyclable pages, there is no risk of xid
> wraparound.
> Risk of stalled statistics is also small, because
> vacuum_cleanup_index_scale_factor
> default value is quite low.  User can increase
> vacuum_cleanup_index_scale_factor
> on his own risk and have less load of B-tree cleanup scan bought by more gap
> in
> index statistics.

Agreed.
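
For readability, the trigger condition from point 3 could be written roughly
as follows (field names taken from the description above; the rest is an
assumed sketch, not the patch itself):

    bool    need_cleanup = false;

    if (TransactionIdIsValid(metad->btm_oldest_btpo_xact) &&
        TransactionIdPrecedes(metad->btm_oldest_btpo_xact, RecentGlobalXmin))
        need_cleanup = true;    /* there may be recyclable pages */
    else if (info->num_heap_tuples >= metad->btm_last_cleanup_num_heap_tuples *
             (1.0 + vacuum_cleanup_index_scale_factor))
        need_cleanup = true;    /* index statistics considered stalled */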

I've not reviewed the code deeply yet, but the regression test of this
patch seems to wrongly pass the regression tests, and the bt_metap()
function of pageinspect needs to be updated. Attached is an updated patch
that fixes these issues. I will review the patch again.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


0001-lazy-btree-cleanup-4.patch
Description: Binary data


Re: Testbed for predtest.c ... and some arguable bugs therein

2018-03-09 Thread Robert Haas
On Thu, Mar 8, 2018 at 6:24 PM, Tom Lane  wrote:
> * R1: A refutes B if truth of A implies falsity of B.
> * R2: A refutes B if truth of A implies non-truth of B.
> * R3: A refutes B if non-falsity of A implies non-truth of B.
> * R4: A refutes B if non-falsity of A implies falsity of B.

The use of the terms "refutes", "non-truth", and "non-falsity" drive
me a little batty; they've all got an internal negation built into
them, and that's confusing.  It seems very tempting to propose getting
rid of predicate_refuted_by altogether and to simply have one
function:

if we know that everything in clause_list is { true | false } [ or
null ], does that imply that everything in predicate_list must be {
true | false } [ or null ]?

Internally, we could represent what we know about various clauses by
some combination of CLAUSE_MIGHT_BE_TRUE = 0x01, CLAUSE_MIGHT_BE_FALSE
= 0x02, CLAUSE_MIGHT_BE_NULL = 0x04.  I have a feeling this might work
out to be a lot easier to understand than what we've got now, because
saying that "A weakly refutes B" is a heck of a lot less clear, at
least in my book, than saying "If A is true or null, B must also be
true or null."

I realize this is all a little to one side of what you're talking
about here, but the fact that even you got confused about whether the
existing logic was correct, and that you found that there were
multiple possible definitions of what "correct" means, seems like
pretty good evidence that this is not as clear as it could be.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Another oddity in handling of WCO constraints in postgres_fdw

2018-03-09 Thread Etsuro Fujita

Hi Ashutosh,

(2018/03/08 14:24), Ashutosh Bapat wrote:

Etsuro said [2] that WCO constraints can not be implemented on foreign
server and normal check constraints can be, and for that he provides
an example in [3]. But I think that example is going the wrong
direction.


More precisely, what I'm saying there is: for WCO constraints created by 
an auto-updatable view over a foreign table, we cannot always implement 
constraints on the remote side that match with those WCO constraints.



For local constraints to be enforced, we use remote
constraints. For local WCO we need to use remote WCO. That means we
create many foreign tables pointing to same local table on the foreign
server through many views, but it's not impossible.


Maybe I don't understand this correctly, but I guess that it would be 
the user's responsibility to not create foreign tables in such a way.


Best regards,
Etsuro Fujita



Re: inserts into partitioned table may cause crash

2018-03-09 Thread Etsuro Fujita

(2018/03/06 21:26), Etsuro Fujita wrote:

One thing I notice while working on this is this in ExecInsert/CopyFrom,
which I moved to ExecPrepareTupleRouting as-is for the former:

/*
* If we're capturing transition tuples, we might need to convert from the
* partition rowtype to parent rowtype.
*/
if (mtstate->mt_transition_capture != NULL)
{
if (resultRelInfo->ri_TrigDesc &&
(resultRelInfo->ri_TrigDesc->trig_insert_before_row ||
resultRelInfo->ri_TrigDesc->trig_insert_instead_row))
{
/*
* If there are any BEFORE or INSTEAD triggers on the partition,
* we'll have to be ready to convert their result back to
* tuplestore format.
*/
mtstate->mt_transition_capture->tcs_original_insert_tuple = NULL;
mtstate->mt_transition_capture->tcs_map =
TupConvMapForLeaf(proute, rootRelInfo, leaf_part_index);
}

Do we need to consider INSTEAD triggers here? The partition is either a
plain table or a foreign table, so I don't think it can have those
triggers. Am I missing something?


There seems to be no objections, so I removed the INSTEAD-trigger 
condition from this.  Here are updated patches for PG10 and HEAD.


Other changes:
* Add regression tests based on your test cases shown upthread
* Adjust code/comments a little bit

Best regards,
Etsuro Fujita
*** a/src/backend/commands/copy.c
--- b/src/backend/commands/copy.c
***
*** 2633,2645  CopyFrom(CopyState cstate)
  			if (cstate->transition_capture != NULL)
  			{
  if (resultRelInfo->ri_TrigDesc &&
! 	(resultRelInfo->ri_TrigDesc->trig_insert_before_row ||
! 	 resultRelInfo->ri_TrigDesc->trig_insert_instead_row))
  {
  	/*
! 	 * If there are any BEFORE or INSTEAD triggers on the
! 	 * partition, we'll have to be ready to convert their
! 	 * result back to tuplestore format.
  	 */
  	cstate->transition_capture->tcs_original_insert_tuple = NULL;
  	cstate->transition_capture->tcs_map =
--- 2633,2644 
  			if (cstate->transition_capture != NULL)
  			{
  if (resultRelInfo->ri_TrigDesc &&
! 	resultRelInfo->ri_TrigDesc->trig_insert_before_row)
  {
  	/*
! 	 * If there are any BEFORE triggers on the partition,
! 	 * we'll have to be ready to convert their result back to
! 	 * tuplestore format.
  	 */
  	cstate->transition_capture->tcs_original_insert_tuple = NULL;
  	cstate->transition_capture->tcs_map =
***
*** 2768,2779  CopyFrom(CopyState cstate)
  			 * tuples inserted by an INSERT command.
  			 */
  			processed++;
  
! 			if (saved_resultRelInfo)
! 			{
! resultRelInfo = saved_resultRelInfo;
! estate->es_result_relation_info = resultRelInfo;
! 			}
  		}
  	}
  
--- 2767,2779 
  			 * tuples inserted by an INSERT command.
  			 */
  			processed++;
+ 		}
  
! 		/* Restore the saved ResultRelInfo */
! 		if (saved_resultRelInfo)
! 		{
! 			resultRelInfo = saved_resultRelInfo;
! 			estate->es_result_relation_info = resultRelInfo;
  		}
  	}
  
*** a/src/backend/executor/nodeModifyTable.c
--- b/src/backend/executor/nodeModifyTable.c
***
*** 67,72  static void ExecSetupChildParentMapForTcs(ModifyTableState *mtstate);
--- 67,77 
  static void ExecSetupChildParentMapForSubplan(ModifyTableState *mtstate);
  static TupleConversionMap *tupconv_map_for_subplan(ModifyTableState *node,
  		int whichplan);
+ static TupleTableSlot *ExecPrepareTupleRouting(ModifyTableState *mtstate,
+ 		PartitionTupleRouting *proute,
+ 		EState *estate,
+ 		ResultRelInfo *targetRelInfo,
+ 		TupleTableSlot *slot);
  
  /*
   * Verify that the tuples to be produced by INSERT or UPDATE match the
***
*** 265,271  ExecInsert(ModifyTableState *mtstate,
  {
  	HeapTuple	tuple;
  	ResultRelInfo *resultRelInfo;
- 	ResultRelInfo *saved_resultRelInfo = NULL;
  	Relation	resultRelationDesc;
  	Oid			newId;
  	List	   *recheckIndexes = NIL;
--- 270,275 
***
*** 282,381  ExecInsert(ModifyTableState *mtstate,
  	 * get information on the (current) result relation
  	 */
  	resultRelInfo = estate->es_result_relation_info;
- 
- 	/* Determine the partition to heap_insert the tuple into */
- 	if (mtstate->mt_partition_tuple_routing)
- 	{
- 		int			leaf_part_index;
- 		PartitionTupleRouting *proute = mtstate->mt_partition_tuple_routing;
- 
- 		/*
- 		 * Away we go ... If we end up not finding a partition after all,
- 		 * ExecFindPartition() does not return and errors out instead.
- 		 * Otherwise, the returned value is to be used as an index into arrays
- 		 * proute->partitions[] and proute->partition_tupconv_maps[] that will
- 		 * get us the ResultRelInfo and TupleConversionMap for the partition,
- 		 * respectively.
- 		 */
- 		leaf_part_index = ExecFindPartition(resultRelInfo,
- 			proute->partition_dispatch_info,
- 			slot,
- 			estate);
- 		Assert(leaf_part_index >= 0 &&
- 			   leaf_part_index < proute->num_partitions);
- 
- 		/*
- 		 * Save the old ResultRelInfo and switch to the 

Re: [HACKERS] Partition-wise aggregation/grouping

2018-03-09 Thread Ashutosh Bapat
On Thu, Mar 8, 2018 at 7:31 PM, Robert Haas  wrote:
>
> This kind of goes along with the suggestion I made yesterday to
> introduce a new function, which at the time I proposed calling
> initialize_grouping_rel(), to set up new grouped or partially grouped
> relations.  By doing that it would be easier to ensure the
> initialization is always done in a consistent way but only for the
> relations we actually need.  But maybe we should call it
> fetch_grouping_rel() instead.  The idea would be that instead of
> calling fetch_upper_rel() we would call fetch_grouping_rel() when it
> is a question of the grouped or partially grouped relation.  It would
> either return the existing relation or initialize a new one for us.  I
> think that would make it fairly easy to initialize only the ones we're
> going to need.

Hmm. I am working on refactoring the code to do something like this.

> On a related note, I'm not sure that this code is correct:
>
> +   if (!isPartialAgg)
> +   {
> +   grouped_rel->part_scheme = input_rel->part_scheme;
> +   grouped_rel->nparts = nparts;
> +   grouped_rel->boundinfo = input_rel->boundinfo;
> +   grouped_rel->part_rels = part_rels;
> +   }
>
> It's not obvious to me why this should be done only when
> !isPartialAgg.  The comments claim that the partially grouped child
> rels can't be considered partitions of the top-level partitially
> grouped rel, but it seems to me that we could consider them that way.
> Maybe I'm missing something.

When we are performing partial aggregates, the GROUP BY clause does not
contain the partition keys. This means that the targetlists of the grouped
relation and the partially grouped relation do not contain bare partition
keys. So, for a relation sitting on top of this (partially) grouped relation
the partition key doesn't exist, and we can't consider the grouped or
partially grouped relation as partitioned.
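
A concrete way to see it (made-up example): with a table t partitioned
by (a), a partially aggregated query such as

    SELECT b, count(*) FROM t GROUP BY b;

groups only on b, so every child can produce partial groups for any value
of b, and the partially grouped child rels cannot be treated as partitions
of the partially grouped parent rel.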


-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: [PATCH] Improve geometric types

2018-03-09 Thread Emre Hasegeli
> Thank you for the revised version. This seems fine for me so
> marked this as "Ready for Committer".

Thank you for bearing with me for longer than a year.

> By the way I think that line_perp is back patched, could you
> propose it for the older versions? (I already marked my entry as
> Rejected)

Do you mean back-patching?

> Brief summary follows (almost same to the header of patch files):
>
> - 0001 looks fine.
>   > * Eliminate SQL level function calls
>   > * Re-use more functions to implement others
>   > * Unify internal function names and signatures
>   > * Remove private functions from geo_decls.h
>   > * Replace not-possible checks with Assert()
>   > * Add comments to describe some functions
>   > * Remove some unreachable code
>   > * Define delimiter symbols of line datatype like the other ones
>   > * Remove debugging print()s from the refactored functions
>
>   And one bug fix.

What is that one?  Should I incorporate it into the commit message?



Re: [HACKERS] Restrict concurrent update/delete with UPDATE of partition key

2018-03-09 Thread amul sul
On Thu, Mar 8, 2018 at 12:31 PM, Amit Kapila  wrote:
> On Thu, Mar 8, 2018 at 11:04 AM, Pavan Deolasee
>  wrote:
>>
>> On Tue, Feb 13, 2018 at 12:41 PM, amul sul  wrote:
>>>
>>> Thanks for the confirmation, updated patch attached.
>>>
>>
>> I am actually very surprised that 0001-Invalidate-ip_blkid-v5.patch does not
>> do anything to deal with the fact that t_ctid may no longer point to itself
>> to mark end of the chain. I just can't see how that would work.
>>
>
> I think it is not that patch doesn't care about the end of the chain.
>  For example, ctid pointing to itself is used to ensure that for
> deleted rows, nothing more needs to be done like below check in the
> ExecUpdate/ExecDelete code path.
> if (!ItemPointerEquals(tupleid, &hufd.ctid))
> {
> ..
> }
> ..
>
> It will deal with such cases by checking invalidblockid before these
> checks.  So, we should be fine in such cases.
>
>
>> But if it
>> does, it needs good amount of comments explaining why and most likely
>> updating comments at other places where chain following is done. For
>> example, how's this code in heap_get_latest_tid() is still valid? Aren't we
>> setting "ctid" to some invalid value here?
>>
>> 2302 /*
>> 2303  * If there's a valid t_ctid link, follow it, else we're done.
>> 2304  */
>> 2305 if ((tp.t_data->t_infomask & HEAP_XMAX_INVALID) ||
>> 2306 HeapTupleHeaderIsOnlyLocked(tp.t_data) ||
>> 2307 ItemPointerEquals(&tp.t_self, &tp.t_data->t_ctid))
>> 2308 {
>> 2309 UnlockReleaseBuffer(buffer);
>> 2310 break;
>> 2311 }
>> 2312
>> 2313 ctid = tp.t_data->t_ctid;
>>
>
> I have not tested, but it seems this could be problematic, but I feel
> we could deal with such cases by checking invalid block id in the
> above if check.  Another one such case is in EvalPlanQualFetch
>
>> This is just one example. I am almost certain there are many such cases that
>> will require careful attention.
>>
>
> Right, I think we should be able to detect and fix such cases.
>

I found a couple of places (in heap_lock_updated_tuple, rewrite_heap_tuple,
EvalPlanQualFetch & heap_lock_updated_tuple_rec) where ItemPointerEquals is
used to check whether a tuple has been updated/deleted.  With the proposed
patch ItemPointerEquals() will no longer work as before; we need an additional
check for updated/deleted tuples, which I have proposed in the latest
patch [1]. Do let me know your thoughts/suggestions on this, thanks.

Regards,
Amul

1] 
https://postgr.es/m/caaj_b96mw5xn5osqgxpgn5dwfrs1j7oebphrmxkdsny+70y...@mail.gmail.com



Re: [HACKERS] Restrict concurrent update/delete with UPDATE of partition key

2018-03-09 Thread amul sul
Hi Andres,

Thanks for your time and the review comments/suggestions.

On Tue, Mar 6, 2018 at 4:53 AM, Andres Freund  wrote:
> Hi,
>
> On 2018-02-13 12:41:26 +0530, amul sul wrote:
>> From 08c8c7ece7d9411e70a780dbeed89d81419db6b6 Mon Sep 17 00:00:00 2001
>> From: Amul Sul 
>> Date: Tue, 13 Feb 2018 12:37:52 +0530
>> Subject: [PATCH 1/2] Invalidate ip_blkid v5
[]
>
> Very nice and instructive to keep this in a submission's commit message.
>

Thank you.

>
>
>> diff --git a/src/backend/access/heap/heapam.c 
>> b/src/backend/access/heap/heapam.c
>> index 8a846e7dba..f4560ee9cb 100644
>> --- a/src/backend/access/heap/heapam.c
>> +++ b/src/backend/access/heap/heapam.c
>> @@ -3037,6 +3037,8 @@ xmax_infomask_changed(uint16 new_infomask, uint16 
>> old_infomask)
>>   *   crosscheck - if not InvalidSnapshot, also check tuple against this
>>   *   wait - true if should wait for any conflicting update to commit/abort
>>   *   hufd - output parameter, filled in failure cases (see below)
>> + *   row_moved - true iff the tuple is being moved to another partition
>> + *   table due to an update of partition key. 
>> Otherwise, false.
>>   *
>
> I don't think 'row_moved' is a good variable name for this. Moving a row
> in our heap format can mean a lot of things. Maybe 'to_other_part' or
> 'changing_part'?
>

Okay, renamed to  'changing_part' in the attached version.

>
>> + /*
>> +  * Sets a block identifier to the InvalidBlockNumber to indicate such 
>> an
>> +  * update being moved tuple to another partition.
>> +  */
>> + if (row_moved)
>> + BlockIdSet(&((tp.t_data->t_ctid).ip_blkid), 
>> InvalidBlockNumber);
>
> The parens here are set in a bit werid way. I assume that's from copying
> it out of ItemPointerSet()?  Why aren't you just using 
> ItemPointerSetBlockNumber()?
>
>
> I think it'd be better if we followed the example of specultive inserts
> and created an equivalent of HeapTupleHeaderSetSpeculativeToken. That'd
> be a heck of a lot easier to grep for...
>

Added HeapTupleHeaderValidBlockNumber, HeapTupleHeaderSetBlockNumber and
ItemPointerValidBlockNumber macro, but not exactly same as the
HeapTupleHeaderSetSpeculativeToken. Do let me know your thoughts/suggestions.
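
Roughly, I imagine them along these lines (only a sketch here; the real
definitions are in the attached patch and may differ):

    #define HeapTupleHeaderSetBlockNumber(tup, blkno) \
        BlockIdSet(&((tup)->t_ctid.ip_blkid), (blkno))

    #define HeapTupleHeaderValidBlockNumber(tup) \
        BlockNumberIsValid(BlockIdGetBlockNumber(&((tup)->t_ctid.ip_blkid)))

    #define ItemPointerValidBlockNumber(pointer) \
        BlockNumberIsValid(BlockIdGetBlockNumber(&((pointer)->ip_blkid)))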

>
>> @@ -9314,6 +9323,18 @@ heap_mask(char *pagedata, BlockNumber blkno)
>>*/
>>   if (HeapTupleHeaderIsSpeculative(page_htup))
>>   ItemPointerSet(&page_htup->t_ctid, blkno, off);
>> +
>> + /*
>> +  * For a deleted tuple, a block identifier is set to 
>> the
>
> I think this 'the' is superflous.
>

Fixed in the attached version.

>
>> +  * InvalidBlockNumber to indicate that the tuple has 
>> been moved to
>> +  * another partition due to an update of partition key.
>
> But I think it should be 'the partition key'.
>

Fixed in the attached version.

>
>> +  * Like speculative tuple, to ignore any inconsistency 
>> set block
>> +  * identifier to current block number.
>
> This doesn't quite parse.
>

I tried to explain it a little bit more; any help or suggestions to improve it
further will be appreciated.

>
>> +  */
>> + if (!BlockNumberIsValid(
>> + 
>> BlockIdGetBlockNumber((&((page_htup->t_ctid).ip_blkid)
>> + BlockIdSet(&((page_htup->t_ctid).ip_blkid), 
>> blkno);
>>   }
>
> That formatting looks wrong. I think it should be replaced by a macro
> like mentioned above.
>

Used HeapTupleHeaderValidBlockNumber & HeapTupleHeaderSetBlockNumber
macro in the attached version.

>
>>   /*
>> diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
>> index 160d941c00..a770531e14 100644
>> --- a/src/backend/commands/trigger.c
>> +++ b/src/backend/commands/trigger.c
>> @@ -3071,6 +3071,11 @@ ltrmark:;
>>   ereport(ERROR,
>>   
>> (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
>>errmsg("could not 
>> serialize access due to concurrent update")));
>> + if 
>> (!BlockNumberIsValid(BlockIdGetBlockNumber(&((hufd.ctid).ip_blkid
>> + ereport(ERROR,
>> + 
>> (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
>> +  errmsg("tuple to be 
>> locked was already moved to another partition due to concurrent update")));
>
> Yes, given that we repeat this in multiple places, I *definitely* want
> to see this wrapped in a macro with a descriptive name.
>

Used ItemPointerValidBlockNumber macro all such places.

>
>> diff --git a/src/backend/executor/nodeLockRows.c 
>> 

Re: Protect syscache from bloating with negative cache entries

2018-03-09 Thread Kyotaro HORIGUCHI
At Wed, 07 Mar 2018 23:12:29 -0500, Tom Lane  wrote in 
<352.1520482...@sss.pgh.pa.us>
> Kyotaro HORIGUCHI  writes:
> > At Thu, 8 Mar 2018 00:28:04 +, "Tsunakawa, Takayuki" 
> >  wrote in 
> > <0A3221C70F24FB45833433255569204D1F8FF0D9@G01JPEXMBYT05>
> >> Yes.  We are now facing the problem of too much memory use by PostgreSQL, 
> >> where about some applications randomly access about 200,000 tables.  It is 
> >> estimated based on a small experiment that each backend will use several 
> >> to ten GBs of local memory for CacheMemoryContext.  The total memory use 
> >> will become over 1 TB when the expected maximum connections are used.
> >> 
> >> I haven't looked at this patch, but does it evict all kinds of entries in 
> >> CacheMemoryContext, ie. relcache, plancache, etc?
> 
> > This works only for syscaches, which could bloat with entries for
> > nonexistent objects.
> 
> > Plan cache is a utterly deferent thing. It is abandoned at the
> > end of a transaction or such like.
> 
> When I was at Salesforce, we had *substantial* problems with plancache
> bloat.  The driving factor there was plans associated with plpgsql
> functions, which Salesforce had a huge number of.  In an environment
> like that, there would be substantial value in being able to prune
> both the plancache and plpgsql's function cache.  (Note that neither
> of those things are "abandoned at the end of a transaction".)

Mmm. Right. Thanks for pointing that out. Anyway, the plan cache seems to
be a different thing.

> > Relcache is not based on catcache and out of the scope of this
> > patch since it doesn't get bloat with nonexistent entries. It
> > uses dynahash and we could introduce a similar feature to it if
> > we are willing to cap relcache size.
> 
> I think if the case of concern is an application with 200,000 tables,
> it's just nonsense to claim that relcache size isn't an issue.
> 
> In short, it's not really apparent to me that negative syscache entries
> are the major problem of this kind.  I'm afraid that you're drawing very
> large conclusions from a specific workload.  Maybe we could fix that
> workload some other way.

The current patch doesn't consider whether an entry is negative or
positive; it just cleans up all entries based on time.
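
To illustrate what "clean up all entries based on time" means, a rough
sketch follows. The field names cc_lru_list, lru_node and lastaccess are
invented for the illustration; the actual patch may structure this
differently.

static void
CatCachePruneByAge(CatCache *cache, TimestampTz cutoff)
{
	dlist_mutable_iter iter;

	/* hypothetical LRU list kept in access order, oldest entries first */
	dlist_foreach_modify(iter, &cache->cc_lru_list)
	{
		CatCTup    *ct = dlist_container(CatCTup, lru_node, iter.cur);

		if (ct->lastaccess >= cutoff)	/* hypothetical per-entry timestamp */
			break;
		/* note: ct->negative is not checked; both kinds are evicted */
		if (ct->refcount == 0)
			CatCacheRemoveCTup(cache, ct);
	}
}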

If the relcache needs to have the same characteristics as syscaches, it
might be better to build it on the catcache mechanism, instead of adding
a similar pruning mechanism to dynahash.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




RE: [bug fix] pg_rewind creates corrupt WAL files, and the standby cannot catch up the primary

2018-03-09 Thread Tsunakawa, Takayuki
From: Michael Paquier [mailto:mich...@paquier.xyz]
> I see.  So the only reason why this flag exists is that if a file is large
> enough so as it is split into multiple chunks, then the first unlink will
> work but not the successive ones.  One chunk can be at most
> 1 million bytes, which is why it is essential for WAL segments.  Instead
> of ignoring *all* errors, let's ignore only ENOENT and rename ignore_error
> to missing_ok as well.
> 
> You need to update the comment in receiveFileChunks where an entry gets
> deleted with basically what I mentioned above, say:
> "If a file has been deleted on the source, remove it on the target as well.
> Note that multiple unlink() calls may happen on the same file if multiple
> data chunks are associated with it, hence ignore unconditionally anything
> missing.  If this file is not a relation data file, then it has been already
> truncated when creating the file chunk list at the previous execution of
> the filemap."
> 
> Adding also a comment on top of remove_target_file to explain what
> missing_ok does would be nice to keep track of what the code should do.

Thanks for reviewing.  All done.
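
Roughly, the resulting behaviour looks like the sketch below (simplified;
the real function is in src/bin/pg_rewind/file_ops.c and the attached patch
has the exact version):

static void
remove_target_file(const char *path, bool missing_ok)
{
	char		dstpath[MAXPGPATH];

	snprintf(dstpath, sizeof(dstpath), "%s/%s", datadir_target, path);
	if (unlink(dstpath) != 0)
	{
		/*
		 * When a file is split into multiple chunks, later unlink() calls
		 * can find it already gone; with missing_ok, ENOENT is not an
		 * error.
		 */
		if (errno == ENOENT && missing_ok)
			return;

		pg_fatal("could not remove file \"%s\": %s\n",
				 dstpath, strerror(errno));
	}
}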

Regards
Takayuki Tsunakawa



pg_rewind_corrupt_wal_v2.patch
Description: pg_rewind_corrupt_wal_v2.patch


Problem while setting the fpw with SIGHUP

2018-03-09 Thread Dilip Kumar
While setting full_page_writes with SIGHUP, I hit an assert in the
checkpointer process. That is because we call RecoveryInProgress() inside
a CRITICAL section, and it in turn allocates memory. So I have moved the
RecoveryInProgress() call out of the CRITICAL section, and the problem got
solved.
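
Roughly, the fix in UpdateFullPageWrites() looks like this (simplified
sketch of the attached patch; surrounding code elided):

	bool		recoveryInProgress;

	/*
	 * Evaluate this before entering the critical section: the first call
	 * in a backend goes through InitXLOGAccess()/InitXLogInsert(), which
	 * allocates memory and must not happen inside a critical section.
	 */
	recoveryInProgress = RecoveryInProgress();

	START_CRIT_SECTION();
	/* ... update Insert->fullPageWrites as before ... */
	if (XLogStandbyInfoActive() && !recoveryInProgress)
	{
		XLogBeginInsert();
		XLogRegisterData((char *) (&fullPageWrites), sizeof(bool));
		XLogInsert(RM_XLOG_ID, XLOG_FPW_CHANGE);
	}
	END_CRIT_SECTION();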

command executed: killall -SIGHUP postgres
Crash call stack:
#0  0x7fa19560d5d7 in raise () from /lib64/libc.so.6
#1  0x7fa19560ecc8 in abort () from /lib64/libc.so.6
#2  0x009fc991 in ExceptionalCondition (conditionName=0xc5ab1c
"!(CritSectionCount == 0)", errorType=0xc5a739 "FailedAssertion",
fileName=0xc5a8a5 "mcxt.c", lineNumber=635) at assert.c:54
#3  0x00a34e56 in MemoryContextCreate (node=0x192edc0,
tag=T_AllocSetContext, size=216, nameoffset=216, methods=0xc58620
,
parent=0x18fe1b0, name=0xac1137 "WAL record construction", flags=0) at
mcxt.c:635
#4  0x00a2aaa1 in AllocSetContextCreateExtended (parent=0x18fe1b0,
name=0xac1137 "WAL record construction", flags=0, minContextSize=0,
initBlockSize=8192, maxBlockSize=8388608) at aset.c:463
#5  0x0055983c in InitXLogInsert () at xloginsert.c:1033
#6  0x0054e4e5 in InitXLOGAccess () at xlog.c:8183
#7  0x0054df71 in RecoveryInProgress () at xlog.c:7952
#8  0x005507f6 in UpdateFullPageWrites () at xlog.c:9566
#9  0x007ea821 in UpdateSharedMemoryConfig () at checkpointer.c:1366
#10 0x007e95a1 in CheckpointerMain () at checkpointer.c:383

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


fix_fpwupdate.patch
Description: Binary data


Re: [PATCH] Improve geometric types

2018-03-09 Thread Kyotaro HORIGUCHI
Hello, sorry for the late reply.

At Fri, 2 Mar 2018 15:26:25 +0100, Emre Hasegeli  wrote in 

> > Unfortunately according to http://commitfest.cputube.org/ this patch 
> > doesn't apply anymore.
> New versions are attached.

Thank you for the revised version. This seems fine to me, so I have
marked it as "Ready for Committer".

- This applies cleanly on the master head and the regression tests pass.

- The new behavior looks sane (except for the EPSILON issue, which is
  outside the scope of this patch).

- The tests are complete as far as I can see, or at least far more
  completely filled in than in the previous state. Some test items
  might seem a bit big, but that seems to be needed to raise
  coverage of the required combinations of dimension values.

By the way, I think the line_perp fix should be back-patched; could you
propose it for the older versions? (I already marked my entry as
Rejected.)


A brief summary follows (almost the same as the headers of the patch files):

- 0001 looks fine.
  > * Eliminate SQL level function calls
  > * Re-use more functions to implement others
  > * Unify internal function names and signatures
  > * Remove private functions from geo_decls.h
  > * Replace not-possible checks with Assert()
  > * Add comments to describe some functions
  > * Remove some unreachable code
  > * Define delimiter symbols of line datatype like the other ones
  > * Remove debugging print()s from the refactored functions

  And one bug fix.

- 0002 looks fine.

  Refactors adt/float.c and utils/float.h, turning the float checking
  *macros* into inline functions and making the float comparison
  operators more efficient (see the sketch after this list).

  The other changes are consequences of the above, and it also fixes
  the NaN problem in the GiST code.

- 0003 looks fine.

  Just changes the usage of double to float8 as the more proper type,
  and uses the operator functions instead of bare arithmetic to handle
  arithmetic errors more properly.

- 0004 looks fine. (Sorry for overlooking that this one also fixes bugs.)

  All the previously overlooked failure cases seem to be fixed.
  
- 0005 looks fine.

  This unifies -0.0 to +0.0 for the convenience of later processing.

- 0006 It seems to cover all types of operations.
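
(Referring back to 0002: the inline helpers it adds to utils/float.h look
roughly like the following; this is a simplified illustration, not the
exact definitions from the patch.)

static inline bool
float8_eq(const float8 val1, const float8 val2)
{
	return val1 == val2;
}

static inline bool
float8_lt(const float8 val1, const float8 val2)
{
	return val1 < val2;
}

/* a float checking macro turned into an inline function */
static inline bool
float8_isnan(const float8 val)
{
	return isnan(val);
}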

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center