Re: [HACKERS] Transactions involving multiple postgres foreign servers

2017-11-27 Thread Antonin Houska
Masahiko Sawada  wrote:

> On Fri, Nov 24, 2017 at 10:28 PM, Antonin Houska  wrote:
> > Masahiko Sawada  wrote:
> >
> >> On Mon, Oct 30, 2017 at 5:48 PM, Ashutosh Bapat
> >>  wrote:
> >> > On Thu, Oct 26, 2017 at 7:41 PM, Masahiko Sawada  
> >> > wrote:
> >> >>
> >> >> Because I don't want to break the current user semantics. that is,
> >> >> currently it's guaranteed that the subsequent reads can see the
> >> >> committed result of previous writes even if the previous transactions
> >> >> were distributed transactions. And it's ensured by writer side. If we
> >> >> can make the reader side ensure it, the backend process don't need to
> >> >> wait for the resolver process.
> >> >>
> >> >> The waiting backend process are released by resolver process after the
> >> >> resolver process tried to resolve foreign transactions. Even if
> >> >> resolver process failed to either connect to foreign server or to
> >> >> resolve foreign transaction the backend process will be released and
> >> >> the foreign transactions are leaved as dangling transaction in that
> >> >> case, which are processed later. Also if resolver process takes a long
> >> >> time to resolve foreign transactions for whatever reason the user can
> >> >> cancel it by Ctl-c anytime.
> >> >>
> >> >
> >> > So, there's no guarantee that the next command issued from the
> >> > connection *will* see the committed data, since the foreign
> >> > transaction might not have committed because of a network glitch
> >> > (say). If we go this route of making backends wait for resolver to
> >> > resolve the foreign transaction, we will have add complexity to make
> >> > sure that the waiting backends are woken up in problematic events like
> >> > crash of the resolver process OR if the resolver process hangs in a
> >> > connection to a foreign server etc. I am not sure that the complexity
> >> > is worth the half-guarantee.
> >> >
> >>
> >> Hmm, maybe I was wrong. I now think that the waiting backends can be
> >> woken up only in following cases;
> >> - The resolver process succeeded to resolve all foreign transactions.
> >> - The user did the cancel (e.g. ctl-c).
> >> - The resolver process failed to resolve foreign transaction for a
> >> reason of there is no such prepared transaction on foreign server.
> >>
> >> In other cases the resolver process should not release the waiters.
> >
> > I'm not sure I see consensus here. What Ashutosh says seems to be: "Special
> > effort is needed to ensure that backend does not keep waiting if the 
> > resolver
> > can't finish it's work in forseable future. But this effort is not worth
> > because by waking the backend up you might prevent the next transaction from
> > seeing the changes the previous one tried to make."
> >
> > On the other hand, your last comments indicate that you try to be even more
> > stringent in letting the backend wait. However even this stringent approach
> > does not guarantee that the next transaction will see the data changes made 
> > by
> > the previous one.
> >
> 
> What I'd like to guarantee is that the subsequent read can see the
> committed result of previous writes if the transaction involving
> multiple foreign servers is committed without cancellation by user.

I missed the point that the user should not expect atomicity of the commit to
be guaranteed if he has cancelled his request.

The other things are clear to me, including the fact that atomic commit and
atomic visibility will be implemented separately. Thanks.

-- 
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at



Re: ERROR: too many dynamic shared memory segments

2017-11-27 Thread Dilip Kumar
On Tue, Nov 28, 2017 at 4:18 AM, Thomas Munro  wrote:

> On Tue, Nov 28, 2017 at 10:05 AM, Jakub Glapa 
> wrote:
> > As for the crash. I dug up the initial log and it looks like a
> segmentation
> > fault...
> >
> > 2017-11-23 07:26:53 CET:192.168.10.83(35238):user@db:[30003]: ERROR:
> too
> > many dynamic shared memory segments
>
> I think there are two failure modes: one of your sessions showed the
> "too many ..." error (that's good, ran out of slots and said so and
> our error machinery worked as it should), and another crashed with a
> segfault, because it tried to use a NULL "area" pointer (bad).  I
> think this is a degenerate case where we completely failed to launch
> parallel query, but we ran the parallel query plan anyway and this
> code thinks that the DSA is available.  Oops.
>

I think BitmapHeapScan should check whether the DSA is valid; if the DSA
is not valid, it should assume it's a non-parallel plan.

The attached patch should fix the issue.
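
(The attached patch is not reproduced here. Purely as a toy sketch of the
guard being described, with hypothetical names: if the shared parallel
state / DSA pointer was never set up, take the serial path instead of
dereferencing it.)

#include <stdio.h>
#include <stddef.h>

struct scan_state
{
    void   *dsa;    /* hypothetical stand-in for the per-query DSA area */
};

static void
scan(struct scan_state *node)
{
    if (node->dsa == NULL)
    {
        /* parallel setup failed: behave as a plain, non-parallel scan */
        printf("running as a non-parallel bitmap heap scan\n");
        return;
    }
    printf("using the shared parallel state\n");
}

int
main(void)
{
    struct scan_state node = {NULL};

    scan(&node);
    return 0;
}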

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


bug_fix_in_pbhs_when_dsa_not_initialized.patch
Description: Binary data


Re: explain analyze output with parallel workers - question about meaning of information for explain.depesz.com

2017-11-27 Thread Amit Kapila
On Mon, Nov 27, 2017 at 11:26 PM, hubert depesz lubaczewski
 wrote:
> On Mon, Nov 27, 2017 at 05:24:49PM +0530, Amit Kapila wrote:
>> I think it is "actual_time * 1" for anything below Gather.
>
> Well, I think I found another problem.
>
> Please take a look at:
> https://explain.depesz.com/s/Bs8c
>
> Node 15 is Gather with loops = 2974 (because it was ran by nested loop).
>
> There were 4 workers, so 5 threads working, but the loops on parallel
> seq scan is 17.698 million ?!
>

The number of loops displayed on parallel seq scan seems to be wrong, see below.

> The problem is that for explain.depesz.com I'm calculating how much time
> pg actually spent doing given thing.
>
> So, if an operation has actual time of 1ms, but 100 loops, it took
> 100ms.
>
> In case of parallel operations it looks like I can't realistically find
> the information anywhere?
>
> In the explain above, we see that Nested loop took, in total, around
> 2 million miliseconds (step 9).
> Out of which, ~ 7000 ms was hash join and it's subnodes.
> This leaves most of the time for Gather step, which was called 2974
> times with actual time reported as 618ms.
> 2974 * 618 is 1837932 miliseconds, which seems reasonable.
>
> But then - how much of this time was spent in Parallel Seq Scan in step
> #16 ?
>
> 2974 * 609ms?
>

Yeah.

>Why is loops there 17 million?
>

That is wrong and I think you have hit a bug.  It should be 2974 * 5 =
14870, as you have seen in other cases.  The problem is that during a
rescan we generally reinitialize the required state, but we forgot to
reinitialize the instrumentation-related memory that is used to accumulate
the stats.  Fixing that addresses part of the problem: the Parallel node
itself will no longer show wrong values.  However, we also need to ensure
that the per-worker details get accumulated across rescans.  The attached
patch should fix the problem you are seeing.  I think this needs some more
analysis and testing to see if everything works in the desired way.
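
(To illustrate the effect, here is a toy model of that accumulation
pattern. It is not PostgreSQL code, the variable names are made up, and
the numbers will not match the plan exactly, but it shows how re-adding
an un-reset per-rescan buffer inflates the total far beyond 2974 * 5 =
14870.)

#include <stdio.h>

int main(void)
{
    double shared_nloops = 0;   /* per-rescan buffer the workers fill in */
    double accumulated = 0;     /* what the leader ends up reporting */
    int    rescans = 2974;
    int    loops_per_rescan = 5;    /* leader + 4 workers */

    for (int r = 0; r < rescans; r++)
    {
        /* shared_nloops = 0;       <-- the missing reinitialization */
        shared_nloops += loops_per_rescan;  /* this rescan's real count */
        accumulated += shared_nloops;       /* stale totals get re-added */
    }
    printf("reported loops = %.0f, expected = %d\n",
           accumulated, rescans * loops_per_rescan);
    return 0;
}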

Is it possible for you to test the attached patch and see if you are
still seeing any unexpected values?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


fix_accum_instr_parallel_workers_v1.patch
Description: Binary data


Re: [HACKERS] Partition-wise aggregation/grouping

2017-11-27 Thread Rajkumar Raghuwanshi
On Thu, Nov 23, 2017 at 6:38 PM, Jeevan Chalke
 wrote:
> Let me know if I missed any comment to be fixed.

Hi,

I have applied v8 patches on commit id 8735978e7aebfbc499843630131c18d1f7346c79,
and got the observation below; please take a look.

Observation:
"when joining a foreign partition table with local partition table
getting wrong output
with partition_wise_join enabled, same is working fine on PG-head
without aggregates patch."

Test-case:
CREATE EXTENSION postgres_fdw;
CREATE SERVER pwj_server FOREIGN DATA WRAPPER postgres_fdw OPTIONS
(dbname 'postgres',port '5432',use_remote_estimate 'true');
CREATE USER MAPPING FOR PUBLIC SERVER pwj_server;

CREATE TABLE fplt1 (a int,  c text) PARTITION BY LIST(c);
CREATE TABLE fplt1_p1 (a int,  c text);
CREATE TABLE fplt1_p2 (a int,  c text);
CREATE FOREIGN TABLE ftplt1_p1 PARTITION OF fplt1 FOR VALUES IN
('', '0001', '0002', '0003') SERVER pwj_server OPTIONS (TABLE_NAME
'fplt1_p1');
CREATE FOREIGN TABLE ftplt1_p2 PARTITION OF fplt1 FOR VALUES IN
('0004', '0005', '0006', '0007') SERVER pwj_server OPTIONS (TABLE_NAME
'fplt1_p2');
INSERT INTO fplt1_p1 SELECT i, to_char(i%8, 'FM') FROM
generate_series(0, 199, 2) i;
INSERT INTO fplt1_p2 SELECT i, to_char(i%8, 'FM') FROM
generate_series(200, 398, 2) i;

CREATE TABLE lplt2 (a int,  c text) PARTITION BY LIST(c);
CREATE TABLE lplt2_p1 PARTITION OF lplt2 FOR VALUES IN ('',
'0001', '0002', '0003');
CREATE TABLE lplt2_p2 PARTITION OF lplt2 FOR VALUES IN ('0004',
'0005', '0006', '0007');
INSERT INTO lplt2 SELECT i, to_char(i%8, 'FM') FROM
generate_series(0, 398, 3) i;

SELECT t1.c, t2.c,count(*) FROM fplt1 t1 JOIN lplt2 t2 ON (t1.c = t2.c
and t1.a = t2.a)  WHERE t1.a % 25 = 0 GROUP BY 1,2 ORDER BY t1.c,
t2.c;
  c   |  c   | count
--+--+---
  |  | 1
 0004 | 0004 | 1
 0006 | 0006 | 1
(3 rows)

SET enable_partition_wise_join = on;
SELECT t1.c, t2.c,count(*) FROM fplt1 t1 JOIN lplt2 t2 ON (t1.c = t2.c
and t1.a = t2.a)  WHERE t1.a % 25 = 0 GROUP BY 1,2 ORDER BY t1.c,
t2.c;
  c   |  c   | count
--+--+---
  |  | 1
 0004 | 0004 | 1
(2 rows)


Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation



Re: [HACKERS] new function for tsquery creartion

2017-11-27 Thread Victor Drobny

On 2017-11-19 04:30, Tomas Vondra wrote:
Hello,

Hi,

On 09/13/2017 10:57 AM, Victor Drobny wrote:

On 2017-09-09 06:03, Thomas Munro wrote:

Please send a rebased version of the patch for people to review and
test as that one has bit-rotted.

Hello,
Thank you for interest. In the attachment you can find rebased
version(based on 69835bc8988812c960f4ed5aeee86b62ac73602a commit)



I did a quick review of the patch today. The patch unfortunately no
longer applies, so I had to use an older commit from September. Please
rebase to current master.


Thank you for your time. In the attachment you can find a rebased version
(based on commit e842791b0).


I've only looked on the diff at this point, will do more testing once
the rebase happens.

Some comments:

1) This seems to mix multiple improvements into one patch. There's the
support for alternative query syntax, and then there are the new
operators (AROUND and <n,m>). I propose to split the patch into two or
more parts, each addressing one of those bits.


I agree. I have split it into 3 parts: support for the AROUND operator,
the queryto_tsquery function, and documentation.


I guess there will be two or three parts - first adding the syntax,
second adding <n,m> and third adding the AROUND(n). Seems reasonable?

2) I don't think we should mention Google in the docs explicitly. Not
that I'm somehow anti-google, but this syntax was certainly not invented
by Google - I vividly remember using something like that on Altavista
(yeah, I'm old). And it's used by pretty much every other web search
engine out there ...


Yes, this syntax was not introduced by Google, but, as for me, it was the
easiest way to give a brief description of it. Of course it can be changed,
I just don't know how. Any suggestions are welcome! ;)


3) In the SGML docs, please use <literal> instead of just
quoting the values. So it should be <literal>|</literal> instead of '|'
etc. Just like in the parts describing plainto_tsquery, for example.


Fixed. I hope that I didn't miss anything.


4) Also, I recommend adding a brief explanation what the examples do.
Right now there's just a bunch of queryto_tsquery, and the reader is
expected to understand the output. I suggest adding a sentence or two,
explaining what's happening (just like for plainto_tsquery examples).

5) I'm not sure about negative values in the <n,m> operator. I don't
find it particularly confusing - once you understand that (a <n,m> b)
means "there are 'k' words between 'a' and 'b' (n <= k <= m)", then
negative values seem like a fairly straightforward extension.

But I guess the main question is "Is there really a demand for the new
<n,m> operator, or have we just invented it because we can?"


The <n,m> operator is not introduced yet. It's just a concept; these were our
thoughts about implementing the AROUND operator through <n,m> in the future.

6) There seem to be some new constants defined, with names not following
the naming convention. I mean this

#define WAITOPERAND         1
#define WAITOPERATOR        2
#define WAITFIRSTOPERAND    3
#define WAITSINGLEOPERAND   4
#define INSIDEQUOTES        5   <-- the new one

and

#define TSPO_L_ONLY    0x01
#define TSPO_R_ONLY    0x02
#define TSPO_BOTH      0x04
#define TS_NOT_EXAC    0x08 <-- the new one

Perhaps that's OK, but it seems a bit inconsistent.


I agree. I have fixed it.



regards


--
Victor Drobny
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/src/backend/tsearch/ts_selfuncs.c b/src/backend/tsearch/ts_selfuncs.c
index 046f543..375cb5c 100644
--- a/src/backend/tsearch/ts_selfuncs.c
+++ b/src/backend/tsearch/ts_selfuncs.c
@@ -396,6 +396,7 @@ tsquery_opr_selec(QueryItem *item, char *operand,
 break;
 
 			case OP_PHRASE:
+			case OP_AROUND:
 			case OP_AND:
 s1 = tsquery_opr_selec(item + 1, operand,
 	   lookup, length, minfreq);
diff --git a/src/backend/utils/adt/tsginidx.c b/src/backend/utils/adt/tsginidx.c
index aba456e..1c852bd 100644
--- a/src/backend/utils/adt/tsginidx.c
+++ b/src/backend/utils/adt/tsginidx.c
@@ -239,10 +239,11 @@ TS_execute_ternary(GinChkVal *gcv, QueryItem *curitem, bool in_phrase)
 			return !result;
 
 		case OP_PHRASE:
+		case OP_AROUND:
 
 			/*
 			 * GIN doesn't contain any information about positions, so treat
-			 * OP_PHRASE as OP_AND with recheck requirement
+			 * OP_PHRASE and OP_AROUND as OP_AND with recheck requirement
 			 */
 			*(gcv->need_recheck) = true;
 			/* Pass down in_phrase == true in case there's a NOT below */
diff --git a/src/backend/utils/adt/tsquery.c b/src/backend/utils/adt/tsquery.c
index 5cdfe4d..07d56b0 100644
--- a/src/backend/utils/adt/tsquery.c
+++ b/src/backend/utils/adt/tsquery.c
@@ -29,7 +29,8 @@ const int	tsearch_op_priority[OP_COUNT] =
 	4,			/* OP_NOT */
 	2,			/* OP_AND */
 	1,			/* OP_OR */
-	3			/* OP_PHRASE */
+	3,			/* OP_PHRASE */
+	3			/* OP_AROUND */
 };
 
 

Re: [HACKERS] [PATCH] Move all am-related reloption code into src/backend/access/[am-name] and get rid of relopt_kind for custom AM

2017-11-27 Thread Nikolay Shaplov
In a message of November 28, 2017 11:23:38, Michael Paquier wrote:
> On Tue, Nov 28, 2017 at 11:22 AM, Michael Paquier
> 
>  wrote:
> > On Tue, Nov 14, 2017 at 4:44 PM, Michael Paquier
> > 
> >  wrote:
> >> On Mon, Oct 30, 2017 at 9:49 AM, Nikolay Shaplov  
wrote:
> >>> In a message of September 3, 2017 11:45:43, Alvaro Herrera
> >>> wrote:
> >>> 
> >>> Yet another git rebase. This patch can be applied to master branch
> >>> again.
> >>> 
> >>> For this patch no review needed now, as I will offer part of it (one
> >>> that
> >>> refuses to set toast reloptions when there is no TOAST table) as a
> >>> separate
> >>> patch.
> >> 
> >> This patch needs a rebase, there are conflicts with HEAD.
> > 
> > Bumping the patch to next CF as no updates have happened for the last two
> > weeks.
> Er, returned with feedback.. This will be my last patch for the day.
Maybe you are right. This patch should be split into smaller patches. I will
do it bit by bit when I have time. But I see no reason for the big one to
remain in the commitfest, as it is not for direct commit.

PS. I will do the rebase soon; this patch overlaps with two recent commits,
so it requires very careful rebasing.


-- 
Do code for fun. Can do it for money (Perl & C/C++ ~10h/week)



Re: [HACKERS] PATCH: pgbench - option to build using ppoll() for larger connection counts

2017-11-27 Thread Fabien COELHO


Hello Michaël,


Could you rebase the v11 patch?


This patch has been waiting for a rebase for more than three weeks as
of today, I am marking it as returned with feedback. It would be a
good idea to reply to Robert's input in
https://www.postgresql.org/message-id/ca+tgmoyybnv-ddhvpmxlb2nx2sqenjirtwhmdeazucgotb2...@mail.gmail.com.


ISTM that this was done: If -c is high enough, pgbench fails without the 
patch.


--
Fabien.

Re: [HACKERS] Useless code in ExecInitModifyTable

2017-11-27 Thread Etsuro Fujita

(2017/11/28 11:18), Michael Paquier wrote:

On Tue, Sep 5, 2017 at 12:41 PM, Ryan Murphy  wrote:

The new status of this patch is: Waiting on Author


This status is misleading, so I switched it back to "needs review"
(please be careful about that!).


I think I forgot to change that status.  Sorry about that.


I can still apply the patch
correctly. Sorry I am not taking the time to have a meaningful opinion
about this patch. The patch passes all regression tests. As I am on
this entry, I am bumping the patch to next CF..


Ok, thanks!

Best regards,
Etsuro Fujita



Re: simplehash: tb->sizemask = 0

2017-11-27 Thread Tom Lane
Tomas Vondra  writes:
> I'm a bit puzzled by this code in SH_COMPUTE_PARAMETERS:

> if (tb->size == SH_MAX_SIZE)
> tb->sizemask = 0;
> else
> tb->sizemask = tb->size - 1;

> Doesn't that mean that with SH_MAX_SIZE we end up with sizemask being 0
> (i.e. no bits set)?

Yeah, which is very obviously broken: for one thing, the Asserts
in SH_NEXT/SH_PREV would surely go off.
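
(For illustration, assuming buckets are picked as hash & sizemask as the
surrounding code does, a zero mask sends every hash to bucket 0; a tiny
standalone check:)

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint32_t sizemask = 0;      /* what SH_COMPUTE_PARAMETERS sets at SH_MAX_SIZE */
    uint32_t hashes[] = {0x1, 0xDEADBEEF, 0xFFFFFFFF};

    for (int i = 0; i < 3; i++)
        printf("hash %08x -> bucket %u\n",
               (unsigned) hashes[i], (unsigned) (hashes[i] & sizemask));
    return 0;                   /* every bucket index prints as 0 */
}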

(Why are those assertions, anyway, and not test-and-elog?
I do not think an assertion failure is a suitable way to
report "hash table full".)

> I don't think we're building hash tables with 2^32 buckets, though.

What this proves is that nobody has ever tested the behavior at
SH_MAX_SIZE.  I would suggest building a test version with that
set small enough to be conveniently reachable, and then exercising
the behavior as the limit is approached and reached.

regards, tom lane



Re: [HACKERS] Transactions involving multiple postgres foreign servers

2017-11-27 Thread Ashutosh Bapat
On Tue, Nov 28, 2017 at 3:04 AM, Masahiko Sawada  wrote:
> On Fri, Nov 24, 2017 at 10:28 PM, Antonin Houska  wrote:
>> Masahiko Sawada  wrote:
>>
>>> On Mon, Oct 30, 2017 at 5:48 PM, Ashutosh Bapat
>>>  wrote:
>>> > On Thu, Oct 26, 2017 at 7:41 PM, Masahiko Sawada  
>>> > wrote:
>>> >>
>>> >> Because I don't want to break the current user semantics. that is,
>>> >> currently it's guaranteed that the subsequent reads can see the
>>> >> committed result of previous writes even if the previous transactions
>>> >> were distributed transactions. And it's ensured by writer side. If we
>>> >> can make the reader side ensure it, the backend process don't need to
>>> >> wait for the resolver process.
>>> >>
>>> >> The waiting backend process are released by resolver process after the
>>> >> resolver process tried to resolve foreign transactions. Even if
>>> >> resolver process failed to either connect to foreign server or to
>>> >> resolve foreign transaction the backend process will be released and
>>> >> the foreign transactions are leaved as dangling transaction in that
>>> >> case, which are processed later. Also if resolver process takes a long
>>> >> time to resolve foreign transactions for whatever reason the user can
>>> >> cancel it by Ctl-c anytime.
>>> >>
>>> >
>>> > So, there's no guarantee that the next command issued from the
>>> > connection *will* see the committed data, since the foreign
>>> > transaction might not have committed because of a network glitch
>>> > (say). If we go this route of making backends wait for resolver to
>>> > resolve the foreign transaction, we will have add complexity to make
>>> > sure that the waiting backends are woken up in problematic events like
>>> > crash of the resolver process OR if the resolver process hangs in a
>>> > connection to a foreign server etc. I am not sure that the complexity
>>> > is worth the half-guarantee.
>>> >
>>>
>>> Hmm, maybe I was wrong. I now think that the waiting backends can be
>>> woken up only in following cases;
>>> - The resolver process succeeded to resolve all foreign transactions.
>>> - The user did the cancel (e.g. ctl-c).
>>> - The resolver process failed to resolve foreign transaction for a
>>> reason of there is no such prepared transaction on foreign server.
>>>
>>> In other cases the resolver process should not release the waiters.
>>
>> I'm not sure I see consensus here. What Ashutosh says seems to be: "Special
>> effort is needed to ensure that backend does not keep waiting if the resolver
>> can't finish it's work in forseable future. But this effort is not worth
>> because by waking the backend up you might prevent the next transaction from
>> seeing the changes the previous one tried to make."
>>
>> On the other hand, your last comments indicate that you try to be even more
>> stringent in letting the backend wait. However even this stringent approach
>> does not guarantee that the next transaction will see the data changes made 
>> by
>> the previous one.
>>
>
> What I'd like to guarantee is that the subsequent read can see the
> committed result of previous writes if the transaction involving
> multiple foreign servers is committed without cancellation by user. In
> other words, the backend should not be waken up and the resolver
> should continue to resolve at certain intervals even if the resolver
> fails to connect to the foreign server or fails to resolve it. This is
> similar to what synchronous replication guaranteed today. Keeping this
> semantics is very important for users. Note that the reading a
> consistent result by concurrent reads is a separated problem.

The question I have is how we would deal with a foreign server that is
not available for a longer duration due to a crash, an extended network
outage, etc. An example is the foreign server crashing/getting disconnected
after PREPARE but before COMMIT/ROLLBACK was issued. The backend would
remain blocked for a much longer duration without the user having an idea
of what's going on. Maybe we should add some timeout.

>
> The read result including foreign servers can be inconsistent if the
> such transaction is cancelled or the coordinator server crashes during
> two-phase commit processing. That is, if there is in-doubt transaction
> the read result can be inconsistent, even for subsequent reads. But I
> think this behaviour can be accepted by users. For the resolution of
> in-doubt transactions, the resolver process will try to resolve such
> transactions after the coordinator server recovered. On the other
> hand, for the reading a consistent result on such situation by
> subsequent reads, for example, we can disallow backends to inquiry SQL
> to the foreign server if a foreign transaction of the foreign server
> is remained.

+1 for the last sentence. If we do that, we don't need the backend to
be blocked by resolver since a subsequent read accessing 

Re: Commit fest 2017-11

2017-11-27 Thread Michael Paquier
On Wed, Nov 1, 2017 at 11:30 PM, Michael Paquier
 wrote:
> On Wed, Nov 1, 2017 at 9:04 AM, Michael Paquier
>  wrote:
>> Anybody willing to take the hat of the commit fest manager? If nobody,
>> I am fine to take the hat as default choice this time.
>
> And now it is open. Let's the fest begin.

The current commit fest is coming to an end, and as many may have
noticed, I have begun classifying patches depending on their status.
This will likely take a couple of days. As a last push, I would like
to point out that there are 22 patches marked as ready for committer:
https://commitfest.postgresql.org/15/?status=3.
-- 
Michael



Re: [HACKERS] [PATCH] Move all am-related reloption code into src/backend/access/[am-name] and get rid of relopt_kind for custom AM

2017-11-27 Thread Michael Paquier
On Tue, Nov 28, 2017 at 11:22 AM, Michael Paquier
 wrote:
> On Tue, Nov 14, 2017 at 4:44 PM, Michael Paquier
>  wrote:
>> On Mon, Oct 30, 2017 at 9:49 AM, Nikolay Shaplov  wrote:
> >>> In a message of September 3, 2017 11:45:43, Alvaro Herrera wrote:
>>>
>>> Yet another git rebase. This patch can be applied to master branch again.
>>>
>>> For this patch no review needed now, as I will offer part of it (one that
>>> refuses to set toast reloptions when there is no TOAST table) as a separate
>>> patch.
>>
>> This patch needs a rebase, there are conflicts with HEAD.
>
> Bumping the patch to next CF as no updates have happened for the last two 
> weeks.

Er, returned with feedback.. This will be my last patch for the day.
-- 
Michael



Re: [HACKERS] GnuTLS support

2017-11-27 Thread Michael Paquier
On Mon, Nov 27, 2017 at 2:37 PM, Michael Paquier
 wrote:
> On Mon, Nov 27, 2017 at 10:28 AM, Andreas Karlsson  wrote:
>> Hm, after reading more of our MSVC code it seems like building with MSVC
>> does not really use switch, people rather have to create a config.pl. Is
>> that correct? The MSVC scripts also seems to only support uuid-ossp which it
>> just calls $config->{uuid}.
>>
>> If so we could either have:
>>
>> $config->{ssl} = "openssl";
>> $config->{sslpath} = "/path/to/openssl";
>
> Using this one may actually finish by being cleaner as there is no
> need to cross-check for both options defined.

Andreas, as I can see that you are still actively working on this
patch, I am bumping it to next CF.
-- 
Michael



Re: [HACKERS] Useless code in ExecInitModifyTable

2017-11-27 Thread Michael Paquier
On Tue, Sep 5, 2017 at 12:41 PM, Ryan Murphy  wrote:
> The new status of this patch is: Waiting on Author

This status is misleading, so I switched it back to "needs review"
(please be careful about that!). I can still apply the patch
correctly. Sorry I am not taking the time to have a meaningful opinion
about this patch. The patch passes all regression tests. As I am on
this entry, I am bumping the patch to next CF..
-- 
Michael



Re: [HACKERS] Patch: add --if-exists to pg_recvlogical

2017-11-27 Thread Michael Paquier
On Fri, Sep 29, 2017 at 10:06 PM, Peter Eisentraut
 wrote:
> On 9/22/17 15:31, Peter Eisentraut wrote:
>> I suggest you create a patch that focuses on the --create part.
>>
>> You can create a separate follow-on patch for the --start part if you
>> wish, but I think that that patch would be rejected.
>
> I have moved this entry to the next commit fest, awaiting your updated
> patch.

Still waiting for an update, and marked as returned with feedback. It
seems to me that your patch adds a couple of option interactions, so
this should be double-checked with existing options, and a set of TAP
tests should be added as well.
-- 
Michael



Re: [HACKERS] PATCH: pgbench - break out timing data for initialization phases

2017-11-27 Thread Michael Paquier
On Tue, Sep 26, 2017 at 3:07 AM, Fabien COELHO  wrote:
>
> Hello Doug,
>
>> total time: 316.03 s (insert 161.60 s, commit 0.64 s, vacuum 60.77 s,
>> index 93.01 s)
>
>
> Definitely interesting.
>
> There is a "ready for committers" patch in the CF which extensively rework
> the initialization: it becomes customizable, and this approach may not work
> as is after that...
>
> Maybe you could investigate how it may be implemented on top of that?
>
> Either show times when the phases are performed computed, or maybe use some
> auxiliary data structure to keep the information (hmmm...).

The patch needs as well a rebase, I am marking it as returned with
feedback per lack of activity. When sending a new version, please make
sure to:
- reply to the feedback which has been provided previously.
- register a new version of the patch in the commit fest app.
Thanks,
-- 
Michael



Re: [HACKERS] proposal: Support Unicode host variable in ECPG

2017-11-27 Thread Michael Paquier
On Tue, Nov 21, 2017 at 4:02 PM, Michael Paquier
 wrote:
> The patch does not apply anymore and needs a rebase. There are also a
> bunch of whitespace errors, no documentation in the patch, and it
> seems to me that this patch introduces many concepts so it could be
> broken down into many individual steps. Finally, I strongly recommend
> that you review other's patches. You have two patches, including this
> one, registered into this commit fest but your name is showing nowhere
> as a reviewer. The patch you are proposing here is large, the usual
> recommendation being to review one patch of equal difficulty for each
> patch sent to keep the inflow and outflow of patches balanced.

With a high-level look, this patch requires a bit more work. I am
unsure as well if that's a feature which would prove to be useful, but
without documentation it is hard to form an opinion of it. I am
marking the patch as returned with feedback.
-- 
Michael



Re: [HACKERS] [PATCH] A hook for session start

2017-11-27 Thread Michael Paquier
On Tue, Nov 21, 2017 at 4:09 AM, Fabrízio de Royes Mello
 wrote:
> Due to some "Blackfriday" commitments I'll be able to work again with this
> patch on next week.

Okay, this has proved to require broader changes than thought first. I
am marking the patch as returned with feedback.
-- 
Michael



Re: [HACKERS] PATCH: pgbench - option to build using ppoll() for larger connection counts

2017-11-27 Thread Michael Paquier
On Fri, Nov 3, 2017 at 2:29 PM, Fabien COELHO  wrote:
> Could you rebase the v11 patch?

This patch has been waiting for a rebase for more than three weeks as
of today, I am marking it as returned with feedback. It would be a
good idea to reply to Robert's input in
https://www.postgresql.org/message-id/ca+tgmoyybnv-ddhvpmxlb2nx2sqenjirtwhmdeazucgotb2...@mail.gmail.com.
-- 
Michael



Re: [HACKERS] WAL logging problem in 9.4.3?

2017-11-27 Thread Michael Paquier
On Thu, Sep 14, 2017 at 3:34 PM, Kyotaro HORIGUCHI
 wrote:
> At Wed, 13 Sep 2017 17:42:39 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
>  wrote in 
> <20170913.174239.25978735.horiguchi.kyot...@lab.ntt.co.jp>
>> filterdiff seems to did something wrong..
>
> # to did...
>
> The patch is broken by filterdiff so I send a new patch made
> directly by git format-patch. I confirmed that a build completes
> with applying this.

To my surprise this patch still applies but fails recovery tests. I am
bumping it to next CF, for what will be its 8th registration as it is
for a bug fix, switching the status to "waiting on author".
-- 
Michael



Fix a typo in xact.c

2017-11-27 Thread Masahiko Sawada
Hi,

Attached patch for $subject.

s/recoreds/record/

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


fix_typo_in_xact_c.patch
Description: Binary data


Re: [HACKERS] Refactoring identifier checks to consistently use strcmp

2017-11-27 Thread Michael Paquier
On Tue, Nov 28, 2017 at 12:11 AM, Daniel Gustafsson  wrote:
>> The patch needs a rebase, and there are a couple of places that need
>> an extra lookup I think:
>> $ git grep defname -- *.c | grep strcasecmp | wc -l
>>  39
>
> Rebased and handled a few more places which I had either missed in the last
> round, or that had been added in the meantime.  “PARALLEL” in aggregatecmds.c
> is intentionally using pg_strcasecmp() due to the old-style syntax which is
> still supported.

This merits a comment; code readers may get confused.

> AFAICS this covers all relevant codepaths from the 39 above.

I was just looking at the tsearch code, which uses pg_strcasecmp, and
those options are defined with makeDefElem(), so you should switch to
strcmp in this case as well, no? If I patch the code myself I would get
an error when double-quoting, making those commands more consistent with
the rest of what you are patching here:
create extension unaccent;
alter text search dictionary unaccent (Rules = 'unaccent'); -- ok
alter text search dictionary unaccent (RuLes = 'unaccent'); -- ok
alter text search dictionary unaccent ("Rules" = 'unaccent'); -- error
-- 
Michael



Re: [PATCH] Atomic pgrename on Windows

2017-11-27 Thread Andres Freund
On 2017-11-28 09:47:45 +0900, Michael Paquier wrote:
> On Mon, Nov 27, 2017 at 3:28 PM, Alexander Korotkov
>  wrote:
> > Attached patch atomic-pgrename-windows-1.patch fixes this problem.  It
> > appears to be possible to atomically replace file on Windows – ReplaceFile()
> > does that.  ReplaceFile() requires target file to exist, this is why we
> > still need to call MoveFileEx() when it doesn't exist.
> 
> Do you think that it could be safer to unlink the target file first
> with pgunlink()? This way you make sure that the target file is
> removed and not locked. This change makes me worrying about the
> introduction of more race conditions.

That seems like a *seriously* bad idea. What if we crash in between the
unlink and the rename?


I'm confused about the need for this. Shouldn't opening all files with
FILE_SHARE_DELETE normally take care of this? See
https://msdn.microsoft.com/en-us/library/windows/desktop/aa363858(v=vs.85).aspx
"Note  Delete access allows both delete and rename operations."

Is there an external process active that doesn't set that flag? Are we
missing setting it somewhere?

Greetings,

Andres Freund



Re: [HACKERS] Timeline ID in backup_label file

2017-11-27 Thread Michael Paquier
On Tue, Nov 28, 2017 at 9:40 AM, David Steele  wrote:
> Perhaps I'm the one who is misunderstanding. If you propose a patch I'll be
> happy to review it, though I doubt there is a lot to be gained even if it
> would be a better implementation.

OK. I'll keep that in mind. Thanks for the input.
-- 
Michael



Re: [PATCH] Atomic pgrename on Windows

2017-11-27 Thread Michael Paquier
On Mon, Nov 27, 2017 at 3:28 PM, Alexander Korotkov
 wrote:
> Attached patch atomic-pgrename-windows-1.patch fixes this problem.  It
> appears to be possible to atomically replace file on Windows – ReplaceFile()
> does that.  ReplaceFile() requires target file to exist, this is why we
> still need to call MoveFileEx() when it doesn't exist.
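
(For orientation, a bare-bones sketch of that ReplaceFile()-with-fallback
approach; this is not the attached patch, and error handling is reduced to
a minimum.)

#include <windows.h>

/*
 * Replace "to" with "from" atomically.  ReplaceFile() needs the target to
 * exist already; if it does not, fall back to MoveFileEx().
 */
int
rename_replace(const char *from, const char *to)
{
    if (ReplaceFileA(to, from, NULL, 0, NULL, NULL))
        return 0;
    if (GetLastError() == ERROR_FILE_NOT_FOUND &&
        MoveFileExA(from, to, MOVEFILE_REPLACE_EXISTING))
        return 0;
    return -1;
}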

Do you think that it could be safer to unlink the target file first
with pgunlink()? This way you make sure that the target file is
removed and not locked. This change makes me worry about the
introduction of more race conditions.

> This patch is based on work of Victor Spirin who was asked by Postgres Pro
> to research this problem.

Victor has no community account so his name cannot be registered as a
co-author of the patch. I have added your name though.
-- 
Michael



Re: [HACKERS] Transactions involving multiple postgres foreign servers

2017-11-27 Thread Robert Haas
On Mon, Nov 27, 2017 at 4:35 PM Masahiko Sawada 
wrote:

> What I'd like to guarantee is that the subsequent read can see the
> committed result of previous writes if the transaction involving
> multiple foreign servers is committed without cancellation by user. In
> other words, the backend should not be waken up and the resolver
> should continue to resolve at certain intervals even if the resolver
> fails to connect to the foreign server or fails to resolve it. This is
> similar to what synchronous replication guaranteed today. Keeping this
> semantics is very important for users. Note that the reading a
> consistent result by concurrent reads is a separated problem.
>
> The read result including foreign servers can be inconsistent if the
> such transaction is cancelled or the coordinator server crashes during
> two-phase commit processing. That is, if there is in-doubt transaction
> the read result can be inconsistent, even for subsequent reads. But I
> think this behaviour can be accepted by users. For the resolution of
> in-doubt transactions, the resolver process will try to resolve such
> transactions after the coordinator server recovered. On the other
> hand, for the reading a consistent result on such situation by
> subsequent reads, for example, we can disallow backends to inquiry SQL
> to the foreign server if a foreign transaction of the foreign server
> is remained.
>
> For the concurrent reads, the reading an inconsistent result can be
> happen even without in-doubt transaction because we can read data on a
> foreign server between PREPARE and COMMIT PREPARED while other foreign
> servers have committed. I think we should deal with this problem by
> other feature on reader side, for example, atomic visibility. If we
> have atomic visibility feature, we also can solve the above problem.


+1 to all of that.

...Robert

>
> --
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Isn't partition drop code seriously at risk of deadlock?

2017-11-27 Thread Amit Langote
On 2017/11/28 9:04, Tom Lane wrote:
> The complaint in bug #14927 that heap_drop_with_catalog is not bothering
> to check for SearchSysCache lookup failure (in code evidently newly added
> for the partition feature) seems to me to be only scratching the surface
> of what's wrong with that code.  In particular, I do not understand how
> it can possibly be deadlock-free to be trying to grab AccessExclusiveLock
> on a partition's parent table when we already have such a lock on the
> partition.  Which we do, or at least had better, long before we get to
> heap_drop_with_catalog.

We do that as of 258cef12540fa1 [1].  The lock on the parent is taken in
RangeVarCallbackForDropRelation() before the partition itself is locked.

Thanks,
Amit

[1]
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=258cef12540fa1




Re: [HACKERS] Timeline ID in backup_label file

2017-11-27 Thread David Steele

On 11/27/17 7:11 PM, Michael Paquier wrote:

On Mon, Nov 27, 2017 at 11:06 PM, David Steele  wrote:

On 11/15/17 10:09 PM, Michael Paquier wrote:


read_backup_label() is a static function in the backend code. With #2
I do not imply to change the order of the elements written in the
backup_label file, just to make the way they are parsed smarter.


My point is that the order *could* be changed and it wouldn't be noticed
if the read function were improved as you propose.

I'm not against the idea, just think we should continue to enforce the
order unless we decide an interface break is OK.


I still don't quite understand here. The order the items are read does
not cause a backward-incompatible change. True that there is no reason
to change that either now.


Perhaps I'm the one who is misunderstanding. If you propose a patch I'll 
be happy to review it, though I doubt there is a lot to be gained even 
if it would be a better implementation.


Regards,
--
-David
da...@pgmasters.net



Re: [HACKERS] Challenges preventing us moving to 64 bit transaction id (XID)?

2017-11-27 Thread Masahiko Sawada
On Tue, Nov 28, 2017 at 4:56 AM, Robert Haas  wrote:
> On Fri, Nov 24, 2017 at 5:33 AM, Alexander Korotkov
>  wrote:
>> pg_prune_xid makes sense only for heap pages.  Once we introduce special
>> area for heap pages, we can move pg_prune_xid there and save some bytes in
>> index pages.  However, this is an optimization not directly related to
>> 64-bit xids.  Idea is that if we anyway change page format, why don't apply
>> this optimization as well?  But if we have any doubts in this, it can be
>> removed with no problem.
>
> My first reaction is that changing the page format seems like a
> non-starter, because it would break pg_upgrade.

IIUC the xid-lsn ranges patch[1] doesn't require the page format
conversion. Is there any reason we cannot go that way? FWIW, I've
rebased the patch to current HEAD.

[1] https://www.postgresql.org/message-id/5242F8BF.6010807%40vmware.com

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Expression based aggregate transition / combine function invocation

2017-11-27 Thread Andres Freund
Hi,

this is part of my work to make expression evaluation JITable. In a lot
of analytics queries the major bottleneck is transition function
invocation (makes sense, hardly anyone wants to see billions of
rows). Therefore for JITing to be really valuable transition function
stuff needs to be JITable.

Excerpt from the preliminary commit message:

  Previously aggregate transition and combination functions were invoked
  by special case code in nodeAgg.c, evaluating input and filters
  separately using the expression evaluation machinery. That turns out
  to not be great for performance for several reasons:
  - repeated expression evaluations have some cost
  - the transition functions invocations are poorly predicted
  - filter and input computation had to be done separately
  - the special case code made it hard to implement JITing of the whole
transition function invocation

  Address this by building one large expression that computes input,
  evaluates filters, and invokes transition functions.

  This leads to moderate speedups in queries bottlenecked by aggregate
  computations, and enables large speedups for similar cases once JITing
  is done.


While this gets rid of a substantial amount of duplication between the
infrastructure for transition and combine functions, it still increases
codesize a bit.

Todo / open Questions:
- Location of transition function building functions. Currently they're
  in execExpr.c. That allows not to expose a bunch of functions local to
  it, but requires exposing some aggregate structs to the world. We
  could go the other way round as well.

- Right now we waste a bunch of time by having to access transition
  states indexed by both grouping set number and the transition state
  offset therein. It'd be nicer if we could cheaply reduce the number of
  indirections, but I can't quite see how without adding additional
  complications.
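
(Purely as a generic illustration of what trading the per-set pointer hop
for plain index arithmetic could look like, ignoring the complications
mentioned above; this is not part of the patch.)

#include <stdlib.h>

typedef struct PerGroupState
{
    double transValue;          /* stand-in for the real per-group data */
} PerGroupState;

int main(void)
{
    int nsets = 4;              /* grouping sets */
    int ntrans = 8;             /* transition states per set */

    /* today: load pergroups[setno] first, then index into it */
    PerGroupState **pergroups = malloc(nsets * sizeof(*pergroups));
    for (int s = 0; s < nsets; s++)
        pergroups[s] = calloc(ntrans, sizeof(PerGroupState));
    PerGroupState *a = &pergroups[2][5];

    /* flattened: one allocation, the index is computed directly */
    PerGroupState *flat = calloc(nsets * ntrans, sizeof(PerGroupState));
    PerGroupState *b = &flat[2 * ntrans + 5];

    (void) a;
    (void) b;
    return 0;
}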


Here's some example tpch Q01 timings:
master: 11628 ms (best of three)
patches: 10330 ms (best of three)

other tpch queries are similar; the aggregate improvement is a factor of
about 1.04.

Greetings,

Andres Freund
>From 3ff1371aae4615c15f6bcdc232170a3a295e8021 Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Fri, 24 Nov 2017 14:14:35 -0800
Subject: [PATCH 1/3] Simplify representation of aggregate transition values a
 bit.

Previously aggregate transition values for hash and direct (sort + no
group by) aggregates were represented differently. That made upcoming
changes hard, so represent both as grouping set indexed array of
per-group data.

Author: Andres Freund
---
 src/backend/executor/nodeAgg.c | 94 --
 src/include/nodes/execnodes.h  |  6 ++-
 2 files changed, 48 insertions(+), 52 deletions(-)

diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index da6ef1a94c4..a939ae99fa8 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -532,13 +532,13 @@ static void select_current_set(AggState *aggstate, int setno, bool is_hash);
 static void initialize_phase(AggState *aggstate, int newphase);
 static TupleTableSlot *fetch_input_tuple(AggState *aggstate);
 static void initialize_aggregates(AggState *aggstate,
-	  AggStatePerGroup pergroup,
-	  int numReset);
+	  AggStatePerGroup *pergroups,
+	  bool isHash, int numReset);
 static void advance_transition_function(AggState *aggstate,
 			AggStatePerTrans pertrans,
 			AggStatePerGroup pergroupstate);
-static void advance_aggregates(AggState *aggstate, AggStatePerGroup pergroup,
-   AggStatePerGroup *pergroups);
+static void advance_aggregates(AggState *aggstate, AggStatePerGroup *sort_pergroups,
+   AggStatePerGroup *hash_pergroups);
 static void advance_combine_function(AggState *aggstate,
 		 AggStatePerTrans pertrans,
 		 AggStatePerGroup pergroupstate);
@@ -793,15 +793,14 @@ initialize_aggregate(AggState *aggstate, AggStatePerTrans pertrans,
  * If there are multiple grouping sets, we initialize only the first numReset
  * of them (the grouping sets are ordered so that the most specific one, which
  * is reset most often, is first). As a convenience, if numReset is 0, we
- * reinitialize all sets. numReset is -1 to initialize a hashtable entry, in
- * which case the caller must have used select_current_set appropriately.
+ * reinitialize all sets.
  *
  * When called, CurrentMemoryContext should be the per-query context.
  */
 static void
 initialize_aggregates(AggState *aggstate,
-	  AggStatePerGroup pergroup,
-	  int numReset)
+	  AggStatePerGroup *pergroups,
+	  bool isHash, int numReset)
 {
 	int			transno;
 	int			numGroupingSets = Max(aggstate->phase->numsets, 1);
@@ -812,31 +811,19 @@ initialize_aggregates(AggState *aggstate,
 	if (numReset == 0)
 		numReset = numGroupingSets;
 
-	for (transno = 0; transno < numTrans; transno++)
+	for (setno = 0; setno < numReset; setno++)
 	{
-		AggStatePerTrans pertrans = 

Re: [HACKERS] Timeline ID in backup_label file

2017-11-27 Thread Michael Paquier
On Mon, Nov 27, 2017 at 11:06 PM, David Steele  wrote:
> On 11/15/17 10:09 PM, Michael Paquier wrote:
>> On Thu, Nov 16, 2017 at 9:20 AM, David Steele  wrote:
>>> For this patch at least, I think we should do #1.  Getting rid of the order
>>> dependency is attractive but there may be other programs that are depending
>>> on the order.  I know you are not proposing to change the order now, but it
>>> *could* be changed with #2.
>>
>> read_backup_label() is a static function in the backend code. With #2
>> I do not imply to change the order of the elements written in the
>> backup_label file, just to make the way they are parsed smarter.
>
> My point is that the order *could* be changed and it wouldn't be noticed
> if the read function were improved as you propose.
>
> I'm not against the idea, just think we should continue to enforce the
> order unless we decide an interface break is OK.

I still don't quite understand here. The order the items are read does
not cause a backward-incompatible change. True that there is no reason
to change that either now.

>>> Also, I think DEBUG1 would be a more appropriate log level if any logging is
>>> done.
>>
>> OK, the label and time strings can include spaces. The label string is
>> assumed to not be bigger than 1028 bytes, and the timestamp is assumed
>> that it can get up to 128. So it is possible to use stuff like
>> fscanf("%127[^\n]\n") to avoid any buffer overflow problems on the
>> backend. If a backup_label is generated with strings longer than that,
>> then read_backup_label would fail to parse the next entries but that's
>> not really something to worry about IMO as those are only minor sanity
>> checks. Having a safer coding in the backend is more important, and
>> the new checks should not trigger any hard failures.
>
> I have tested and get an error as expected when I munge the backup_label
> file:
>
> FATAL:  invalid data in file "backup_label"
> DETAIL:  Timeline ID parsed is 2, but expected 1
>
> Everything else looks good so I will mark it ready for committer.

Thanks. This maps what I saw.
-- 
Michael



Isn't partition drop code seriously at risk of deadlock?

2017-11-27 Thread Tom Lane
The complaint in bug #14927 that heap_drop_with_catalog is not bothering
to check for SearchSysCache lookup failure (in code evidently newly added
for the partition feature) seems to me to be only scratching the surface
of what's wrong with that code.  In particular, I do not understand how
it can possibly be deadlock-free to be trying to grab AccessExclusiveLock
on a partition's parent table when we already have such a lock on the
partition.  Which we do, or at least had better, long before we get to
heap_drop_with_catalog.

regards, tom lane



Re: [HACKERS] More stats about skipped vacuums

2017-11-27 Thread Tom Lane
Magnus Hagander  writes:
> What I've been thinking about for that one before is if we could just
> invent a protocol (shmq based maybe) whereby autovacuum can ask the stats
> collector for a single table or index stat. If autovacuum never needs to
> see a consistent view between multiple tables, I would think that's going
> to be a win in a lot of cases.

Perhaps.  Autovac might run through quite a few tables before it finds
one in need of processing, though, so I'm not quite certain this would
yield such great benefits in isolation.

> However, when it comes to the stats system, I'd say that on any busy system
> (which would be the ones to care about), the stats structures are still
> going to be *written* a lot more than they are read.

Uh, what?  The stats collector doesn't write those files at all except
on-demand.

regards, tom lane



Re: [HACKERS] Bug in ExecModifyTable function and trigger issues for foreign tables

2017-11-27 Thread Tom Lane
Dean Rasheed  writes:
> On 27 November 2017 at 16:35, Tom Lane  wrote:
>> On looking closer, the reason it's like that in Fujita-san's patch
>> is to minimize the API churn seen by FDW AddForeignUpdateTargets
>> functions, specifically whether they see a tlist that's before or
>> after what expand_targetlist() does.  I'm doubtful that the
>> potential savings is worth taking risks there.  In particular,
>> it seems like a good thing that expand_targetlist() verifies the
>> correct tlist ordering *after* the FDW function has acted.
>> So now my inclination is to leave this alone.

> Ah yes, that seems like a worthwhile check to keep. Never mind then.

Pushed with that and some cosmetic fiddling with comments and docs.
Thanks for the discussion!

regards, tom lane



Re: [HACKERS] Challenges preventing us moving to 64 bit transaction id (XID)?

2017-11-27 Thread Alexander Korotkov
On Mon, Nov 27, 2017 at 10:56 PM, Robert Haas  wrote:

> On Fri, Nov 24, 2017 at 5:33 AM, Alexander Korotkov
>  wrote:
> > pg_prune_xid makes sense only for heap pages.  Once we introduce special
> > area for heap pages, we can move pg_prune_xid there and save some bytes
> in
> > index pages.  However, this is an optimization not directly related to
> > 64-bit xids.  Idea is that if we anyway change page format, why don't
> apply
> > this optimization as well?  But if we have any doubts in this, it can be
> > removed with no problem.
>
> My first reaction is that changing the page format seems like a
> non-starter, because it would break pg_upgrade.  If we get the heap
> storage API working, then we could have a heap AM that works as it
> does today and a newheap AM with such changes, but I have a bit of a
> hard time imagining a patch that causes a hard compatibility break
> ever being accepted.


Thank you for raising this question.  There was a discussion about 64-bit
xids during PGCon 2017.  A couple of ways to support pg_upgrade were discussed.

1) We have a page layout version in the page (currently number 4).  So, we can
define a new page layout version 5.  Pages with the new layout version would
contain 64-bit base values for xid and multixact.  The question is how to
deal with pages of layout version 4.  If such a page has enough free space
to fit the extra 16 bytes, then it can be upgraded on the fly.  If it doesn't
contain enough space for that, then things become more complicated: we
can't upgrade it to the new format, but we still need to fit a new xmax value
there in case a tuple is updated or deleted.  pg_upgrade requires a
server restart.  Thus, once we set hint bits, a pre-pg_upgrade xmin is not
really meaningful – the corresponding xid is visible for every post-pg_upgrade
snapshot.  So, the idea is to use both the xmin and xmax tuple fields on such
an unupgradable page to store the 64-bit xmax.  This idea was proposed by me,
but was criticized by some session attendees (sorry, I don't recall who
they were) for its complexity and suspected overhead.

2) An alternative idea was to use unused bits in the page header.  If
we look for unused bits in pd_flags (3 bits of 16 are
used), pd_pagesize_version (we can keep 1 bit of 16 to distinguish between
the old and new format) and pd_special (we can keep 1 bit to distinguish
sequence pages), we can scrape together 43 bits.  That would be more than
enough for a single base value, because we definitely don't need all of the
lower 32 bits of the base value (21 bits is more than enough).  But I'm not
sure about two base values: if we leave 2 bits for the lower part of the base
value, that leaves us 19 bits for the high part of the base value.  This
solution would give us 2^51 maximum values for xids and multixacts.  I'm not
sure if that's enough to assume these counters are infinite.  AFAIK, there
are products on the market that have 48-bit transaction identifiers and don't
care about wraparound or anything...

A new heap AM for 64-bit xids is an interesting idea too.  I would even say
that the pluggable storage API being discussed now is excessive for this
particular purpose (but could still fit!), because in most aspects a heap
with 64-bit xids is exactly the same as the current heap (in contrast to a
heap with an undo log, for example).  The best-fitting API for heap with
64-bit xid support would be a pluggable heap page format.  But I don't think
that deserves a separate API, though.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] Transactions involving multiple postgres foreign servers

2017-11-27 Thread Masahiko Sawada
On Fri, Nov 24, 2017 at 10:28 PM, Antonin Houska  wrote:
> Masahiko Sawada  wrote:
>
>> On Mon, Oct 30, 2017 at 5:48 PM, Ashutosh Bapat
>>  wrote:
>> > On Thu, Oct 26, 2017 at 7:41 PM, Masahiko Sawada  
>> > wrote:
>> >>
>> >> Because I don't want to break the current user semantics. that is,
>> >> currently it's guaranteed that the subsequent reads can see the
>> >> committed result of previous writes even if the previous transactions
>> >> were distributed transactions. And it's ensured by writer side. If we
>> >> can make the reader side ensure it, the backend process don't need to
>> >> wait for the resolver process.
>> >>
>> >> The waiting backend process are released by resolver process after the
>> >> resolver process tried to resolve foreign transactions. Even if
>> >> resolver process failed to either connect to foreign server or to
>> >> resolve foreign transaction the backend process will be released and
>> >> the foreign transactions are leaved as dangling transaction in that
>> >> case, which are processed later. Also if resolver process takes a long
>> >> time to resolve foreign transactions for whatever reason the user can
>> >> cancel it by Ctl-c anytime.
>> >>
>> >
>> > So, there's no guarantee that the next command issued from the
>> > connection *will* see the committed data, since the foreign
>> > transaction might not have committed because of a network glitch
>> > (say). If we go this route of making backends wait for resolver to
>> > resolve the foreign transaction, we will have add complexity to make
>> > sure that the waiting backends are woken up in problematic events like
>> > crash of the resolver process OR if the resolver process hangs in a
>> > connection to a foreign server etc. I am not sure that the complexity
>> > is worth the half-guarantee.
>> >
>>
>> Hmm, maybe I was wrong. I now think that the waiting backends can be
>> woken up only in following cases;
>> - The resolver process succeeded to resolve all foreign transactions.
>> - The user did the cancel (e.g. ctl-c).
>> - The resolver process failed to resolve foreign transaction for a
>> reason of there is no such prepared transaction on foreign server.
>>
>> In other cases the resolver process should not release the waiters.
>
> I'm not sure I see consensus here. What Ashutosh says seems to be: "Special
> effort is needed to ensure that backend does not keep waiting if the resolver
> can't finish it's work in forseable future. But this effort is not worth
> because by waking the backend up you might prevent the next transaction from
> seeing the changes the previous one tried to make."
>
> On the other hand, your last comments indicate that you try to be even more
> stringent in letting the backend wait. However even this stringent approach
> does not guarantee that the next transaction will see the data changes made by
> the previous one.
>

What I'd like to guarantee is that the subsequent read can see the
committed result of previous writes if the transaction involving
multiple foreign servers is committed without cancellation by user. In
other words, the backend should not be waken up and the resolver
should continue to resolve at certain intervals even if the resolver
fails to connect to the foreign server or fails to resolve it. This is
similar to what synchronous replication guaranteed today. Keeping this
semantics is very important for users. Note that the reading a
consistent result by concurrent reads is a separated problem.

A read involving foreign servers can return an inconsistent result if
such a transaction is cancelled or the coordinator server crashes
during two-phase commit processing. That is, if there is an in-doubt
transaction, the read result can be inconsistent even for subsequent
reads. But I think this behaviour can be accepted by users. For the
resolution of in-doubt transactions, the resolver process will try to
resolve such transactions after the coordinator server has recovered.
To get a consistent result for subsequent reads in such a situation,
we could, for example, disallow backends from issuing SQL to a foreign
server while a foreign transaction of that server remains unresolved.

For concurrent reads, an inconsistent result can happen even without
an in-doubt transaction, because we can read data on a foreign server
between PREPARE and COMMIT PREPARED while other foreign servers have
already committed. I think we should deal with this problem with a
separate feature on the reader side, for example atomic visibility,
which would also solve the above problem.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [PATCH] Fix crash in int8_avg_combine().

2017-11-27 Thread Hadi Moshayedi
On Sat, Nov 25, 2017 at 10:47 PM, Andres Freund 
wrote:

> diff --git a/src/include/utils/memutils.h b/src/include/utils/memutils.h
> > index 869c59dc85..2dc59e89cd 100644
> > --- a/src/include/utils/memutils.h
> > +++ b/src/include/utils/memutils.h
> > @@ -189,7 +189,7 @@ extern MemoryContext SlabContextCreate(MemoryContext
> parent,
> >   * Few callers should be interested in this, but tuplesort/tuplestore
> need
> >   * to know it.
> >   */
> > -#define ALLOCSET_SEPARATE_THRESHOLD  8192
> > +#define ALLOCSET_SEPARATE_THRESHOLD  16384
>
> Huh, what's that about in this context?
>

There is the following static assert:

StaticAssertStmt(ALLOC_CHUNK_LIMIT == ALLOCSET_SEPARATE_THRESHOLD, ...)

and ALLOC_CHUNK_LIMIT is defined as:

#define ALLOC_CHUNK_LIMIT (1 << (ALLOCSET_NUM_FREELISTS-1+ALLOC_MINBITS))

and there is this comment:

"CAUTION: ALLOC_MINBITS must be large enough so that 1<

Re: [PATCH] Fix crash in int8_avg_combine().

2017-11-27 Thread Hadi Moshayedi
On Sat, Nov 25, 2017 at 10:55 PM, Tom Lane  wrote:

> I believe we already dealt with this:
>
> Author: Tom Lane 
> Branch: REL_10_STABLE [619a8c47d] 2017-11-14 17:49:49 -0500
> Branch: REL9_6_STABLE [4a15f87d2] 2017-11-14 17:49:49 -0500
> Branch: REL9_5_STABLE [d4e38489f] 2017-11-14 17:49:49 -0500
>
> Prevent int128 from requiring more than MAXALIGN alignment.
>
> Our initial work with int128 neglected alignment considerations, an
> oversight that came back to bite us in bug #14897 from Vincent
> Lachenal.
> It is unsurprising that int128 might have a 16-byte alignment
> requirement;
> what's slightly more surprising is that even notoriously lax Intel
> chips
> sometimes enforce that.
>
> Raising MAXALIGN seems out of the question: the costs in wasted disk
> and
> memory space would be significant, and there would also be an on-disk
> compatibility break.  Nor does it seem very practical to try to allow
> some
> data structures to have more-than-MAXALIGN alignment requirement, as
> we'd
> have to push knowledge of that throughout various code that copies data
> structures around.
>
> The only way out of the box is to make type int128 conform to the
> system's
> alignment assumptions.  Fortunately, gcc supports that via its
> __attribute__(aligned()) pragma; and since we don't currently support
> int128 on non-gcc-workalike compilers, we shouldn't be losing any
> platform
> support this way.
>
> Although we could have just done pg_attribute_aligned(MAXIMUM_ALIGNOF)
> and
> called it a day, I did a little bit of extra work to make the code more
> portable than that: it will also support int128 on compilers without
> __attribute__(aligned()), if the native alignment of their 128-bit-int
> type is no more than that of int64.
>
> Add a regression test case that exercises the one known instance of the
> problem, in parallel aggregation over a bigint column.
>
> Back-patch of commit 751804998.  The code known to be affected only
> exists
> in 9.6 and later, but we do have some stuff using int128 in 9.5, so
> patch
> back to 9.5.
>
> Discussion: https://postgr.es/m/20171110185747.31519.28038@
> wrigleys.postgresql.org
>
> Does that solution not work for you?
>

It works for me. My REL_10_STABLE was out of date. A git pull fixed it.


Re: [HACKERS] More stats about skipped vacuums

2017-11-27 Thread Magnus Hagander
On Mon, Nov 27, 2017 at 7:53 PM, Robert Haas  wrote:

> On Sun, Nov 26, 2017 at 3:19 PM, Tom Lane  wrote:
> > Robert Haas  writes:
> >> On Sat, Nov 25, 2017 at 12:09 PM, Tom Lane  wrote:
> >>> Mumble.  It's a property I'm pretty hesitant to give up, especially
> >>> since the stats views have worked like that since day one.  It's
> >>> inevitable that weakening that guarantee would break peoples' queries,
> >>> probably subtly.
> >
> >> You mean, queries against the stats views, or queries in general?  If
> >> the latter, by what mechanism would the breakage happen?
> >
> > Queries against the stats views, of course.
>
> Hmm.  Those are probably rare.  If we only took a snapshot of the
> statistics for the backends that explicitly access those views, that
> probably wouldn't be too crazy.
>
> Sorry if this is a stupid question, but how often and for what purpose
> do regular backends need the stats collector data for purposes other
> than querying the stats views?  I thought that the data was only used
> to decide whether to VACUUM/ANALYZE, and therefore would be accessed
> mostly by autovacuum, and for that you'd actually want the most
> up-to-date view of the stats for a particular table that is available,
> not any older snapshot.
>
>
Autovacuum resets its stats snapshot to make sure it sees fresh data. Autovacuum in particular can
probably be made a lot more efficient, because it only ever looks at one
relation at a time, I think.

What I've been thinking about for that one before is if we could just
invent a protocol (shmq based maybe) whereby autovacuum can ask the stats
collector for a single table or index stat. If autovacuum never needs to
see a consistent view between multiple tables, I would think that's going
to be a win in a lot of cases.

I don't think regular backends use them at all. But anybody looking at the
stats views does, and it is pretty important there.

However, when it comes to the stats system, I'd say that on any busy system
(which would be the ones to care about), the stats structures are still
going to be *written* a lot more than they are read. We certainly don't
read them at the rate of once per transaction. A lot of the reads are also
limited to one database of course.

I wonder if we want to implement some sort of copy-on-read snapshot in the
stats collector itself. So instead of unconditionally publishing
everything, have the backends ask for it. When a backend asks for it, it
gets a "snapshot counter" or something from the stats collector, and on the
next write after that we do a copy-on-write if the snapshot is still
available. (No, I have not thought this through in detail.)

Or -- if we keep a per-database hashtable in dynamic shared memory (which
we can do now): can we copy it into local memory in the backend fast enough
that we can hold a lock and just queue up the stats updates during the
copy? If we can copy the complete structure, that would fix one of the
bigger bottlenecks with it today, which is that we dump and rebuild the
hashtables as we go through the tempfiles.
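To make the snapshot-counter idea above a little more concrete, here is a
very rough single-entry sketch -- all names are invented, and there is no
locking or shared-memory plumbing; it only shows the
copy-on-first-write-after-request behaviour I have in mind:

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct TableCounters
{
    int64_t     n_tup_ins;
    int64_t     n_tup_upd;
    int64_t     n_tup_del;
} TableCounters;

typedef struct StatsEntry
{
    TableCounters   live;        /* always-current counters */
    TableCounters  *frozen;      /* lazily taken copy for readers, or NULL */
    uint64_t        req_counter; /* bumped when a reader asks for a snapshot */
    uint64_t        frz_counter; /* request the frozen copy belongs to */
} StatsEntry;

/* reader: register interest in a snapshot, get a counter back */
uint64_t
stats_request_snapshot(StatsEntry *e)
{
    return ++e->req_counter;
}

/* collector: before the first write after a new request, preserve the
 * state that pending readers are supposed to see */
void
stats_count_insert(StatsEntry *e, int64_t ntuples)
{
    if (e->req_counter != e->frz_counter)
    {
        if (e->frozen == NULL)
            e->frozen = malloc(sizeof(TableCounters));
        memcpy(e->frozen, &e->live, sizeof(TableCounters));
        e->frz_counter = e->req_counter;
    }
    e->live.n_tup_ins += ntuples;
}

/* reader: use the frozen copy if one was taken, otherwise the live one
 * (nothing was written since the request, so they are the same) */
const TableCounters *
stats_read(const StatsEntry *e)
{
    return e->frozen ? e->frozen : &e->live;
}

int
main(void)
{
    StatsEntry  e = {0};
    uint64_t    snap;

    stats_count_insert(&e, 10);     /* live = 10 */
    snap = stats_request_snapshot(&e);
    stats_count_insert(&e, 5);      /* freezes 10, then live = 15 */
    (void) snap;
    return stats_read(&e)->n_tup_ins == 10 ? 0 : 1;
}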

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: default range partition and constraint exclusion

2017-11-27 Thread Robert Haas
On Mon, Nov 27, 2017 at 4:04 AM, Kyotaro HORIGUCHI
 wrote:
> This is the story in my understanding.

Thanks, that's helpful.  Sorry I didn't have time yet to study this in
detail myself.  If we're routing tuples to a partition for which the
partition constraint is evaluating to null, that's OK, but if we're
routing tuples to a partition for which the partition constraint is
evaluating to false, that's a bug, and the right solution is to
correct either the tuple routing or the partition constraint.  In this
case, it looks like the tuple routing is working as expected, so that
would say we ought to fix the constraint.

I am out of time for today but will try to look at this some more tomorrow.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



simplehash: tb->sizemask = 0

2017-11-27 Thread Tomas Vondra
Hi,

I'm a bit puzzled by this code in SH_COMPUTE_PARAMETERS:

if (tb->size == SH_MAX_SIZE)
tb->sizemask = 0;
else
tb->sizemask = tb->size - 1;

Doesn't that mean that with SH_MAX_SIZE we end up with sizemask being 0
(i.e. no bits set)? At least that's what I get from

printf("%#x\n", (unsigned int)0);

That would mean SH_INITIAL_BUCKET/SH_NEXT/SH_PREV can only ever return
bucket 0, no?

I don't think we're building hash tables with 2^32 buckets, though.
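Just to illustrate the concern with a standalone snippet (the masking below
mirrors what SH_INITIAL_BUCKET/SH_NEXT effectively compute; the hash values
are made up):

#include <stdio.h>
#include <stdint.h>

int
main(void)
{
    /* what SH_COMPUTE_PARAMETERS sets when size == SH_MAX_SIZE */
    uint32_t    sizemask = 0;
    uint32_t    hashes[] = {0xdeadbeefU, 0x12345678U, 0xffffffffU};

    for (int i = 0; i < 3; i++)
        printf("hash %#x -> bucket %u\n",
               (unsigned) hashes[i], (unsigned) (hashes[i] & sizemask));

    /* prints bucket 0 three times: with no mask bits set, every hash
     * collapses onto bucket 0 */
    return 0;
}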


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Challenges preventing us moving to 64 bit transaction id (XID)?

2017-11-27 Thread Peter Geoghegan
On Mon, Nov 27, 2017 at 11:56 AM, Robert Haas  wrote:
> On Fri, Nov 24, 2017 at 5:33 AM, Alexander Korotkov
>  wrote:
>> pg_prune_xid makes sense only for heap pages.  Once we introduce special
>> area for heap pages, we can move pg_prune_xid there and save some bytes in
>> index pages.  However, this is an optimization not directly related to
>> 64-bit xids.  Idea is that if we anyway change page format, why don't apply
>> this optimization as well?  But if we have any doubts in this, it can be
>> removed with no problem.
>
> My first reaction is that changing the page format seems like a
> non-starter, because it would break pg_upgrade.  If we get the heap
> storage API working, then we could have a heap AM that works as it
> does today and a newheap AM with such changes, but I have a bit of a
> hard time imagining a patch that causes a hard compatibility break
> ever being accepted.

I actually think that we could use that field in indexes for storing
an epoch. This could be used to avoid having to worry about
anti-wraparound VACUUM for deleted index pages that contain a cached
XID -- the one we compare to RecentGlobalXmin as part of the recycling
interlock. (I suggested this to Sawada-san at one point, in the
context of avoiding vacuuming indexes on large append-mostly tables.)

In any case, we'd hardly go to all that effort to save just 4 bytes in
the page header.

-- 
Peter Geoghegan



Re: [HACKERS] Challenges preventing us moving to 64 bit transaction id (XID)?

2017-11-27 Thread Robert Haas
On Fri, Nov 24, 2017 at 5:33 AM, Alexander Korotkov
 wrote:
> pg_prune_xid makes sense only for heap pages.  Once we introduce special
> area for heap pages, we can move pg_prune_xid there and save some bytes in
> index pages.  However, this is an optimization not directly related to
> 64-bit xids.  Idea is that if we anyway change page format, why don't apply
> this optimization as well?  But if we have any doubts in this, it can be
> removed with no problem.

My first reaction is that changing the page format seems like a
non-starter, because it would break pg_upgrade.  If we get the heap
storage API working, then we could have a heap AM that works as it
does today and a newheap AM with such changes, but I have a bit of a
hard time imagining a patch that causes a hard compatibility break
ever being accepted.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] More stats about skipped vacuums

2017-11-27 Thread Robert Haas
On Sun, Nov 26, 2017 at 3:19 PM, Tom Lane  wrote:
> Robert Haas  writes:
>> On Sat, Nov 25, 2017 at 12:09 PM, Tom Lane  wrote:
>>> Mumble.  It's a property I'm pretty hesitant to give up, especially
>>> since the stats views have worked like that since day one.  It's
>>> inevitable that weakening that guarantee would break peoples' queries,
>>> probably subtly.
>
>> You mean, queries against the stats views, or queries in general?  If
>> the latter, by what mechanism would the breakage happen?
>
> Queries against the stats views, of course.

Hmm.  Those are probably rare.  If we only took a snapshot of the
statistics for the backends that explicitly access those views, that
probably wouldn't be too crazy.

Sorry if this is a stupid question, but how often and for what purpose
do regular backends need the stats collector data for purposes other
than querying the stats views?  I thought that the data was only used
to decide whether to VACUUM/ANALYZE, and therefore would be accessed
mostly by autovacuum, and for that you'd actually want the most
up-to-date view of the stats for a particular table that is available,
not any older snapshot.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] More stats about skipped vacuums

2017-11-27 Thread Robert Haas
On Mon, Nov 27, 2017 at 1:49 AM, Kyotaro HORIGUCHI
 wrote:
> Hmmm. Okay, we must make the stats collector more efficient if we
> want to have additional counters with smaller significance in the
> table stats. Currently sizeof(PgStat_StatTabEntry) is 168
> bytes. The whole of the patchset increases it to 232 bytes. Thus
> the size of a stat file for a database with 10000 tables
> increases from about 1.7MB to 2.4MB.  DSM and shared dynahash is
> not dynamically expandable so placing stats on shared hash
> doesn't seem effective. Stats as a regular table could work but
> it seems too-much.

dshash, which is already committed, is both DSM-based and dynamically
expandable.

> Is it acceptable that adding a new section containing this new
> counters, which is just loaded as a byte sequence and parsing
> (and filling the corresponding hash) is postponed until a counter
> in the section is really requested?  The new counters need to be
> shown in a separate stats view (maybe named pg_stat_vacuum).

Still makes the stats file bigger.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] path toward faster partition pruning

2017-11-27 Thread Jesper Pedersen

Hi Amit,

On 11/24/2017 12:00 AM, Amit Langote wrote:

On 2017/11/23 3:56, Jesper Pedersen wrote:
EXPLAIN (ANALYZE) SELECT t1.a, t1.b, t2.c, t2.d FROM t1 INNER JOIN t2 ON
t2.c = t1.b WHERE t2.d = 1;

I just wanted to highlight that the "JOIN ON" partition isn't pruned - the
"WHERE" one is.


Did you mean to write ON t2.d = t1.b?  If so, equivalence class mechanism
will give rise to a t1.b = 1 and hence help prune t1's partition as well:



No, I meant 't2.c = t1.b'. If you take the same example but don't
partition, you will get the following plan:


test=# EXPLAIN (COSTS OFF) SELECT t1.a, t1.b, t2.c, t2.d FROM t1 INNER 
JOIN t2 ON t2.c = t1.b WHERE t2.d = 1;

  QUERY PLAN
--
 Nested Loop
   ->  Index Scan using idx_t2_d on t2
 Index Cond: (d = 1)
   ->  Index Only Scan using idx_t1_b_a on t1
 Index Cond: (b = t2.c)
(5 rows)

Maybe "5.10.2. Declarative Partitioning" could be expanded to include 
some general "guidelines" of where partition based plans should be 
checked against their non-partition counterparts (at least the first 
bullet in 5.10.1 says ".. in certain situations .."). Probably a 
separate patch from this.


[snip]


Should pruning of partitions for UPDATEs (where the partition key isn't
updated) and DELETEs be added to the TODO list?


Note that partition pruning *does* work for UPDATE and DELETE, but only if
you use list/range partitioning.  The reason it doesn't work in this case
(t1 is hash partitioned) is that the pruning is still based on constraint
exclusion in the UPDATE/DELETE case and constraint exclusion cannot handle
hash partitioning.



Thanks for your description.



I can see how that seems a bit odd.  If you use hash partitioning,
UPDATE/DELETE do not benefit from partition-pruning, even though SELECT
does.  That's because SELECT uses the new partition-pruning method (this
patch set) which supports hash partitioning, whereas UPDATE and DELETE use
constraint exclusion which doesn't.  It would be a good idea to make even
UPDATE and DELETE use the new method thus bringing everyone on the same
page, but that requires us to make some pretty non-trivial changes to how
UPDATE/DELETE planning works for inheritance/partitioned tables, which we
should undertake separately, imho.



Agreed.

Best regards,
 Jesper



Re: Do we accept doc changes to back branches?

2017-11-27 Thread Stephen Frost
Greetings,

* Joshua D. Drake (j...@commandprompt.com) wrote:
> I don't recall, do we allow non-bug fix (or what constitutes a bug)
> for back branches with docs? I have been reviewing the logical
> replication docs and they could use some love but I need to know
> which branch I should start work on.

The short answer, I believe, is to always start with master because we
certainly want doc fixes to be included in the next release (assuming
that documentation exists in master...  if it doesn't then start with
the latest release it's in).  Following that, patches for prior branches
are certainly welcome where they improve the documentation in a
meaningful way.  A good example is Dean's recent improvements on the RLS
documentation by adding a table which lists out the privileges and
policies matrix that RLS follows.

See commit 87c2a17fee784c7e1004ba3d3c5d8147da676783.

Thanks!

Stephen


signature.asc
Description: Digital signature


Do we accept doc changes to back branches?

2017-11-27 Thread Joshua D. Drake

-hackers,


I don't recall, do we allow non-bug fix (or what constitutes a bug) for 
back branches with docs? I have been reviewing the logical replication 
docs and they could use some love but I need to know which branch I 
should start work on.



Thanks,


JD

--
Command Prompt, Inc. || http://the.postgres.company/ || @cmdpromptinc

PostgreSQL Centered full stack support, consulting and development.
Advocate: @amplifypostgres || Learn: https://pgconf.org
* Unless otherwise stated, opinions are my own.   *




Re: [HACKERS] Bug in ExecModifyTable function and trigger issues for foreign tables

2017-11-27 Thread Dean Rasheed
On 27 November 2017 at 16:35, Tom Lane  wrote:
> I wrote:
>> Dean Rasheed  writes:
>>> A separate point -- it might be marginally more efficient to have the
>>> work of rewriteTargetListUD() done after expand_targetlist() to avoid
>>> the possible renumbering of the resjunk entries.
>
>> Hm.  It wouldn't save a lot, but yeah, doing it in this order seems
>> a bit silly when you put it like that.
>
> On looking closer, the reason it's like that in Fujita-san's patch
> is to minimize the API churn seen by FDW AddForeignUpdateTargets
> functions, specifically whether they see a tlist that's before or
> after what expand_targetlist() does.  I'm doubtful that the
> potential savings is worth taking risks there.  In particular,
> it seems like a good thing that expand_targetlist() verifies the
> correct tlist ordering *after* the FDW function has acted.
> So now my inclination is to leave this alone.
>

Ah yes, that seems like a worthwhile check to keep. Never mind then.

Regards,
Dean



Re: [HACKERS] Custom compression methods

2017-11-27 Thread Tomas Vondra
Hi,

On 11/27/2017 04:52 PM, Ildus Kurbangaliev wrote:
> ...
>
> Hi. This looks like a serious bug, but I couldn't reproduce it yet.
> Did you upgrade some old database or this bug happened after 
> insertion of all data to new database? I tried using your 'archie' 
> tool to download mailing lists and insert them to database, but
> couldn't catch any errors.
> 

I can trigger it pretty reliably with these steps:

git checkout f65d21b258085bdc8ef2cc282ab1ff12da9c595c
patch -p1 < ~/custom_compression_methods_v6.patch
./configure --enable-debug --enable-cassert \
 CFLAGS="-fno-omit-frame-pointer -O0 -DRANDOMIZE_ALLOCATED_MEMORY" \
 --prefix=/home/postgres/pg-compress
make -s clean && make -s -j4 install
cd contrib/
make -s clean && make -s -j4 install

export PATH=/home/postgres/pg-compress/bin:$PATH
pg_ctl -D /mnt/raid/pg-compress init
pg_ctl -D /mnt/raid/pg-compress -l compress.log start
createdb archie
cd ~/archie/sql/
psql archie < create.sql

~/archie/bin/load.py --workers 4 --db archie */* > load.log 2>&1


I guess the trick might be -DRANDOMIZE_ALLOCATED_MEMORY (I first tried
without it, and it seemed to work fine). If that's the case, I bet there
is a palloc that should have been palloc0, or something like that.
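As an illustration of that class of bug (the struct and field names here
are invented, not taken from the patch):

#include "postgres.h"

typedef struct CompressionScratch
{
    int         nitems;
    void       *items;      /* the field somebody forgot to initialize */
} CompressionScratch;

CompressionScratch *
make_scratch_buggy(void)
{
    CompressionScratch *cs = palloc(sizeof(CompressionScratch));

    cs->nitems = 0;
    /*
     * cs->items is left undefined: without RANDOMIZE_ALLOCATED_MEMORY it
     * often happens to be zero and the bug stays hidden, but with
     * randomized allocations it points to garbage and the first
     * "if (cs->items != NULL)" goes down the wrong path and crashes.
     */
    return cs;
}

CompressionScratch *
make_scratch_fixed(void)
{
    /* palloc0 zeroes the whole struct, so items is reliably NULL */
    return palloc0(sizeof(CompressionScratch));
}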

If you still can't reproduce that, I may give you access to this machine
so that you can debug it there.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Add RANGE with values and exclusions clauses to the Window Functions

2017-11-27 Thread Oliver Ford
On Mon, Nov 27, 2017 at 4:40 PM, Erik Rijkers  wrote:
> On 2017-11-27 17:34, Erik Rijkers wrote:
>>
>> On 2017-11-27 16:01, Oliver Ford wrote:
>>>
>>> Attached is it in bare diff form.
>>
>>
>> [0001-window-frame-v3.patch]
>>
>> Thanks, that did indeed fix it:
>>
>> make && make check  now  ok.
>>
>> There were errors in the doc build (unmatched tags); I fixed them in
>> the attached doc-patch (which should go on top of yours).
>
>
> 0001-window-frame-v3-fixtags.diff
>
> now attached, I hope...
>

Cheers here's v4 with the correct docs.


0001-window-frame-v4.patch
Description: Binary data


Re: [HACKERS] PATCH: multivariate histograms and MCV lists

2017-11-27 Thread Tomas Vondra
Hi,

Attached is an updated version of the patch series, fixing the issues
reported by Mark Dilger:

1) Fix fabs() issue in histogram.c.

2) Do not rely on extra_data being StdAnalyzeData, and instead look up
the LT operator explicitly. This also adds a simple regression test to
make sure ANALYZE on arrays works fine, but perhaps we should invent
some simple queries too.

3) I've removed / clarified some of the comments mentioned by Mark.

4) I haven't changed how the statistics kinds are defined in relation.h,
but I agree there should be a comment explaining how STATS_EXT_INFO_*
relate to StatisticExtInfo.kinds.

5) The most significant change happened histograms. There used to be two
structures for histograms:

  - MVHistogram - expanded (no deduplication etc.), result of histogram
build and never used for estimation

  - MVSerializedHistogram - deduplicated to save space, produced from
MVHistogram before storing in pg_statistic_ext, and used for
estimation

So there wasn't really any reason to expose the "non-serialized" version
outside histogram.c. It was just confusing and unnecessary, so I've
moved MVHistogram to histogram.c (and renamed it to MVHistogramBuild),
and renamed MVSerializedHistogram. And same for the MVBucket stuff.

So now we only deal with MVHistogram everywhere, except in histogram.c.

6) I've also made MVHistogram include a varlena header directly (and
be packed as a bytea), which allows us to store it without having to
call any serialization functions.
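For reference, the shape I mean is roughly the following (a simplified
sketch; the actual fields in the patch differ):

#include "postgres.h"

typedef struct MVHistogram
{
    int32       vl_len_;        /* varlena header, never accessed directly */
    uint32      magic;          /* magic constant, to detect corruption */
    uint32      type;           /* type of histogram */
    uint32      nbuckets;       /* number of buckets */
    uint32      ndimensions;    /* number of dimensions */
    /* deduplicated boundary values and bucket data follow in-line */
} MVHistogram;

/* after building, just SET_VARSIZE(histogram, total_len) and store it */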

I guess we should do (5) and (6) for the MCV lists too, as it seems more
convenient than the current approach. And perhaps even for the
statistics added in 9.6 (it does not change the storage format).


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


0001-multivariate-MCV-lists.patch.gz
Description: application/gzip


0002-multivariate-histograms.patch.gz
Description: application/gzip


Re: Add RANGE with values and exclusions clauses to the Window Functions

2017-11-27 Thread Erik Rijkers

On 2017-11-27 17:34, Erik Rijkers wrote:

On 2017-11-27 16:01, Oliver Ford wrote:

Attached is it in bare diff form.


[0001-window-frame-v3.patch]

Thanks, that did indeed fix it:

make && make check  now  ok.

There were errors in the doc build (unmatched tags); I fixed them in
the attached doc-patch (which should go on top of yours).


0001-window-frame-v3-fixtags.diff

now attached, I hope...

--- doc/src/sgml/syntax.sgml.orig	2017-11-27 17:19:30.253810944 +0100
+++ doc/src/sgml/syntax.sgml	2017-11-27 17:20:55.515626931 +0100
@@ -1805,8 +1805,8 @@
 and the optional frame_clause
 can be one of
 
-{ RANGE | ROWS } frame_start [ frame_exclusion_clause ]
-{ RANGE | ROWS } BETWEEN frame_start AND frame_end [ frame_exclusion_clause ]
+{ RANGE | ROWS } frame_start [ frame_exclusion_clause ]
+{ RANGE | ROWS } BETWEEN frame_start AND frame_end [ frame_exclusion_clause ]
 
 where frame_start and frame_end can be
 one of
@@ -1817,7 +1817,7 @@
 value FOLLOWING
 UNBOUNDED FOLLOWING
 
-where the optional frame_exclusion_clause can be one of
+where the optional frame_exclusion_clause can be one of
 
 EXCLUDE CURRENT ROW
 EXCLUDE TIES
@@ -1888,24 +1888,24 @@
 

 The value PRECEDING and
-value FOLLOWING cases, when used
-in ROWS mode, indicate that the frame starts or ends the specified
-number of rows before or after the current row. In ROWS mode,
-value must be an integer expression not containing any variables,
+value FOLLOWING cases, when used
+in ROWS mode, indicate that the frame starts or ends the specified
+number of rows before or after the current row. In ROWS mode,
+value must be an integer expression not containing any variables,
 aggregate functions, or window functions.
-When used in RANGE mode, they indicate that the frame starts or ends when the value of
-each row's ORDER BY column is within the start value and end value bounds. In RANGE mode,
-value can be either an integer expression or a date/time interval.
-In RANGE mode, there must be exactly one ORDER BY column and if the column is an integer column,
-then value must be an integer.
-If it is a date/time column, then value must be an interval. In both modes,
+When used in RANGE mode, they indicate that the frame starts or ends when the value of
+each row's ORDER BY column is within the start value and end value bounds. In RANGE mode,
+value can be either an integer expression or a date/time interval.
+In RANGE mode, there must be exactly one ORDER BY column and if the column is an integer column,
+then value must be an integer.
+If it is a date/time column, then value must be an interval. In both modes,
 the value must not be null or negative; but it can be zero, which just selects the current row.

 

-For the frame_exclusion_clause, EXCLUDE CURRENT ROW
-excludes the current row from the frame. EXCLUDE TIES excludes any peers of the current row from the
-frame. EXCLUDE NO OTHERS does nothing, but is provided in order to optionally document the intention
+For the frame_exclusion_clause, EXCLUDE CURRENT ROW
+excludes the current row from the frame. EXCLUDE TIES excludes any peers of the current row from the
+frame. EXCLUDE NO OTHERS does nothing, but is provided in order to optionally document the intention
 not to exclude any other rows.

 


Re: [HACKERS] Bug in ExecModifyTable function and trigger issues for foreign tables

2017-11-27 Thread Tom Lane
I wrote:
> Dean Rasheed  writes:
>> A separate point -- it might be marginally more efficient to have the
>> work of rewriteTargetListUD() done after expand_targetlist() to avoid
>> the possible renumbering of the resjunk entries.

> Hm.  It wouldn't save a lot, but yeah, doing it in this order seems
> a bit silly when you put it like that.

On looking closer, the reason it's like that in Fujita-san's patch
is to minimize the API churn seen by FDW AddForeignUpdateTargets
functions, specifically whether they see a tlist that's before or
after what expand_targetlist() does.  I'm doubtful that the
potential savings is worth taking risks there.  In particular,
it seems like a good thing that expand_targetlist() verifies the
correct tlist ordering *after* the FDW function has acted.
So now my inclination is to leave this alone.

regards, tom lane



Re: Add RANGE with values and exclusions clauses to the Window Functions

2017-11-27 Thread Erik Rijkers

On 2017-11-27 16:01, Oliver Ford wrote:

Attached is it in bare diff form.


[0001-window-frame-v3.patch]

Thanks, that did indeed fix it:

make && make check  now  ok.

There were errors in the doc build (unmatched tags); I fixed them in the 
attached doc-patch (which should go on top of yours).


(In very limited testing I did not find any problems yet)


thanks,

Erik Rijkers





Re: [HACKERS] Bug in ExecModifyTable function and trigger issues for foreign tables

2017-11-27 Thread Tom Lane
Dean Rasheed  writes:
> I wonder if, years from now, it might look a bit odd that
> rewriteTargetListUD() is doing part of work of preptlist.c, is only
> called from there, and yet is located in the rewriter.

Yeah, I probably wouldn't have done it like this in a green field,
but maintaining traceability to the existing code is valuable IMO.

> A separate point -- it might be marginally more efficient to have the
> work of rewriteTargetListUD() done after expand_targetlist() to avoid
> the possible renumbering of the resjunk entries.

Hm.  It wouldn't save a lot, but yeah, doing it in this order seems
a bit silly when you put it like that.

regards, tom lane



Re: Add RANGE with values and exclusions clauses to the Window Functions

2017-11-27 Thread Oliver Ford
On Mon, Nov 27, 2017 at 12:06 PM, Oliver Ford  wrote:
> On Fri, Nov 24, 2017 at 3:08 PM, Erikjan Rijkers  wrote:
>>   SELECT pg_get_viewdef('v_window');
>> ! pg_get_viewdef
>> ! --
>> !   SELECT i.i,+
>> !  sum(i.i) OVER (ORDER BY i.i) AS sum_rows+
>>   FROM generate_series(1, 10) i(i);
>>   (1 row)
>>
>> --- 1034,1043 
>>   (10 rows)
>>
>>   SELECT pg_get_viewdef('v_window');
>> ! pg_get_viewdef
>> !
>> ---
>> !   SELECT i.i,
>> +
>> !  sum(i.i) OVER (ORDER BY i.i ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING)
>> AS sum_rows+
>>   FROM generate_series(1, 10) i(i);
>>   (1 row)
>>
>>
>> This small hickup didn't prevent building an instance but obviously I
>> haven't done any real tests yet.
>>
>>
>> thanks,
>>
>>
>> Erik Rijkers
>
> After another clone and rebuild it works alright with the correct
> spacing on mine, so the attached v2 should all pass. I noticed that I
> hadn't added the exclusions clauses to the view defs code, so that's
> also in this patch with extra tests to cover it.

Sorry previous patch was in full-commit form and not just a diff.
Attached is it in bare diff form.


0001-window-frame-v3.patch
Description: Binary data


Re: [HACKERS] Timeline ID in backup_label file

2017-11-27 Thread David Steele
Hi Michael,

On 11/15/17 10:09 PM, Michael Paquier wrote:
> On Thu, Nov 16, 2017 at 9:20 AM, David Steele  wrote:
>> For this patch at least, I think we should do #1.  Getting rid of the order
>> dependency is attractive but there may be other programs that are depending
>> on the order.  I know you are not proposing to change the order now, but it
>> *could* be changed with #2.
> 
> read_backup_label() is a static function in the backend code. With #2
> I do not imply to change the order of the elements written in the
> backup_label file, just to make the way they are parsed smarter.

My point is that the order *could* be changed and it wouldn't be noticed
if the read function were improved as you propose.

I'm not against the idea, just think we should continue to enforce the
order unless we decide an interface break is OK.

>> Also, I think DEBUG1 would be a more appropriate log level if any logging is
>> done.
> 
> OK, the label and time strings can include spaces. The label string is
> assumed to not be bigger than 1028 bytes, and the timestamp is assumed
> that it can get up to 128. So it is possible to use stuff like
> fscanf("%127[^\n]\n") to avoid any buffer overflow problems on the
> backend. If a backup_label is generated with strings longer than that,
> then read_backup_label would fail to parse the next entries but that's
> not really something to worry about IMO as those are only minor sanity
> checks. Having a safer coding in the backend is more important, and
> the new checks should not trigger any hard failures.
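For reference, a minimal standalone sketch of the bounded-scanset parsing
being described (the line layout and buffer sizes here are simplified
assumptions, not the actual read_backup_label() code):

#include <stdio.h>

int
main(void)
{
    FILE       *fp = fopen("backup_label", "r");
    char        tstamp[128];
    char        label[1028];

    if (fp == NULL)
        return 1;

    /*
     * The field widths cap how many bytes fscanf may store, so an
     * oversized line cannot overflow the buffers; at worst the later
     * entries fail to parse and the sanity checks complain.
     */
    if (fscanf(fp, "START TIME: %127[^\n]\n", tstamp) == 1 &&
        fscanf(fp, "LABEL: %1027[^\n]\n", label) == 1)
        printf("backup \"%s\" started at %s\n", label, tstamp);

    fclose(fp);
    return 0;
}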

I have tested and get an error as expected when I munge the backup_label
file:

FATAL:  invalid data in file "backup_label"
DETAIL:  Timeline ID parsed is 2, but expected 1

Everything else looks good so I will mark it ready for committer.

Regards,
-- 
-David
da...@pgmasters.net



Re: [HACKERS] Fix performance degradation of contended LWLock on NUMA

2017-11-27 Thread Юрий Соколов
2017-11-26 19:51 GMT+03:00 Юрий Соколов :
>
> 2017-11-06 18:07 GMT+03:00 Sokolov Yura :
> >
> > On 2017-10-20 11:54, Sokolov Yura wrote:
> >>
> >> Hello,
> >>
> >> On 2017-10-19 19:46, Andres Freund wrote:
> >>>
> >>> On 2017-10-19 14:36:56 +0300, Sokolov Yura wrote:
> 
>  > > +   init_local_spin_delay();
>  >
>  > The way you moved this around has the disadvantage that we now do
this -
>  > a number of writes - even in the very common case where the lwlock
can
>  > be acquired directly.
> 
>  Excuse me, I don't understand fine.
>  Do you complain against init_local_spin_delay placed here?
> >>>
> >>>
> >>> Yes.
> >>
> >>
> >> I could place it before perform_spin_delay under `if (!spin_inited)`
if you
> >> think it is absolutely must.
> >
> >
> > I looked at assembly, and remembered, that last commit simplifies
> > `init_local_spin_delay` to just two-three writes of zeroes (looks
> > like compiler combines 2*4byte write into 1*8 write). Compared to
> > code around (especially in LWLockAcquire itself), this overhead
> > is negligible.
> >
> > Though, I found that there is benefit in calling LWLockAttemptLockOnce
> > before entering loop with calls to LWLockAttemptLockOrQueue in the
> > LWLockAcquire (in there is not much contention). And this way, `inline`
> > decorator for LWLockAttemptLockOrQueue could be omitted. Given, clang
> > doesn't want to inline this function, it could be the best way.
>
> In attach version with LWLockAcquireOnce called before entering loop
> in LWLockAcquire.
>

Oh... there was a stupid error in the previous file.
Attached is a fixed version.


lwlock_v6.patch.gz
Description: GNU Zip compressed data


Re: explain analyze output with parallel workers - question about meaning of information for explain.depesz.com

2017-11-27 Thread Amit Kapila
On Mon, Nov 27, 2017 at 2:45 PM, hubert depesz lubaczewski
 wrote:
> On Sat, Nov 25, 2017 at 07:08:03AM +0530, Amit Kapila wrote:
>> > For example, check step 13 in https://explain.depesz.com/s/gNBd
>> >
>> > It shows time of 3ms, but loops of 1873, so the actual time is ~ 5600ms.
>> >
>> > But with parallel execution it seems to be no longer the case.
>> >
>> > For example:
>> > https://explain.depesz.com/s/LTMp
>> > or
>> > https://explain.depesz.com/s/QHRi
>> > Should I, for explain.depesz.com, when dealing
>> > with partial* and parallel* nodes, use "loops=1" for calculation of
>> > exclusive/inclusive time? always? some other nodes?
>> >
>>
>> I am not sure what exactly inclusive or exclusive means, but for
>> parallel nodes, total stats are accumulated so you are seeing loops as
>> 'worker nodes + 1'.  Now, as presumably workers run parallelly, so I
>> think the actual time will be what will be shown in the node not
>> actual time * nloops.
>
> Please check the plans:
> https://explain.depesz.com/s/gNBd (step 13)
> and https://explain.depesz.com/s/LTMp (step 3)
>
> Inclusive time is basically "loops * actual time", so for Index Scan,
> which had 1873 loops and actual time of 3.002..3.016, we got 1873
> * 3.016 = 5648.968ms.
>
> In case of parallel workers it looks like the inclusive time is
> basically the upper value from actual time.
>
> The question now is: how can I tell which nodes should use "actual_time
> * 1" and which "actual_time * loops" time?
>
> Anything "below" "Gather"?
>

I think it is "actual_time * 1" for anything below Gather.
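Roughly, the rule in code form (the function and flag are invented for the
explain.depesz.com side of things):

/*
 * Nodes under a Gather run concurrently in the workers, so their
 * per-loop time should not be multiplied by the loop count.
 */
double
inclusive_time_ms(double actual_total_ms, double loops, int below_gather)
{
    return below_gather ? actual_total_ms : actual_total_ms * loops;
}

So the Index Scan from the first plan still gives 1873 * 3.016 ~= 5649 ms,
while a parallel node below Gather just contributes its actual total time
once.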

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Transform for pl/perl

2017-11-27 Thread Anthony Bykov
On Wed, 15 Nov 2017 08:58:54 +
Aleksander Alekseev  wrote:

> The following review has been posted through the commitfest
> application: make installcheck-world:  tested, passed
> Implements feature:   tested, passed
> Spec compliant:   tested, passed
> Documentation:tested, passed
> 
> Hello Anthony,
> 
> Great patch! Everything is OK and I almost agree with Pavel.
> 
> The only thing that I would like to suggest is to add a little more
> tests for various corner cases. For instance:
> 
> 1. Converting in both directions (Perl <-> JSONB) +/- infinity, NaN,
> MAX_INT, MIN_INT.
> 
> 2. Converting in both directions strings that contain non-ASCII
> (Russian / Japanese / etc) characters and special characters like \n,
> \t, \.
> 
> 3. Make sure that converting Perl objects that are not representable
> in JSONB (blessed hashes, file descriptors, regular expressions, ...)
> doesn't crash everything and shows a reasonable error message.
> 
> The new status of this patch is: Waiting on Author

Hello Aleksander,
Thank you for your review.

I've added more tests, and I had to change the behavior of the transform
when working with objects not representable in JSONB - now it will
throw an exception. You can find an example in the tests.

Please, find the 4-th version of patch in attachments.

--
Anthony Bykov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Companydiff --git a/contrib/Makefile b/contrib/Makefile
index 8046ca4..53d44fe 100644
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -75,9 +75,9 @@ ALWAYS_SUBDIRS += sepgsql
 endif
 
 ifeq ($(with_perl),yes)
-SUBDIRS += hstore_plperl
+SUBDIRS += hstore_plperl jsonb_plperl
 else
-ALWAYS_SUBDIRS += hstore_plperl
+ALWAYS_SUBDIRS += hstore_plperl jsonb_plperl
 endif
 
 ifeq ($(with_python),yes)
diff --git a/contrib/jsonb_plperl/Makefile b/contrib/jsonb_plperl/Makefile
new file mode 100644
index 000..cd86553
--- /dev/null
+++ b/contrib/jsonb_plperl/Makefile
@@ -0,0 +1,40 @@
+# contrib/jsonb_plperl/Makefile
+
+MODULE_big = jsonb_plperl
+OBJS = jsonb_plperl.o $(WIN32RES)
+PGFILEDESC = "jsonb_plperl - jsonb transform for plperl"
+
+PG_CPPFLAGS = -I$(top_srcdir)/src/pl/plperl
+
+EXTENSION = jsonb_plperlu jsonb_plperl
+DATA = jsonb_plperlu--1.0.sql jsonb_plperl--1.0.sql
+
+REGRESS = jsonb_plperl jsonb_plperl_relocatability jsonb_plperlu jsonb_plperlu_relocatability
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/jsonb_plperl
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
+
+# We must link libperl explicitly
+ifeq ($(PORTNAME), win32)
+# these settings are the same as for plperl
+override CPPFLAGS += -DPLPERL_HAVE_UID_GID -Wno-comment
+# ... see silliness in plperl Makefile ...
+SHLIB_LINK += $(sort $(wildcard ../../src/pl/plperl/libperl*.a))
+else
+rpathdir = $(perl_archlibexp)/CORE
+SHLIB_LINK += $(perl_embed_ldflags)
+endif
+
+# As with plperl we need to make sure that the CORE directory is included
+# last, probably because it sometimes contains some header files with names
+# that clash with some of ours, or with some that we include, notably on
+# Windows.
+override CPPFLAGS := $(CPPFLAGS) $(perl_embed_ccflags) -I$(perl_archlibexp)/CORE
diff --git a/contrib/jsonb_plperl/expected/jsonb_plperl.out b/contrib/jsonb_plperl/expected/jsonb_plperl.out
new file mode 100644
index 000..b5c0980
--- /dev/null
+++ b/contrib/jsonb_plperl/expected/jsonb_plperl.out
@@ -0,0 +1,239 @@
+CREATE EXTENSION jsonb_plperl CASCADE;
+NOTICE:  installing required extension "plperl"
+-- test hash -> jsonb
+CREATE FUNCTION testHVToJsonb() RETURNS jsonb
+LANGUAGE plperl
+TRANSFORM FOR TYPE jsonb
+AS $$
+$val = {a => 1, b => 'boo', c => undef};
+return $val;
+$$;
+SELECT testHVToJsonb();
+  testhvtojsonb  
+-
+ {"a": 1, "b": "boo", "c": null}
+(1 row)
+
+-- test array -> jsonb
+CREATE FUNCTION testAVToJsonb() RETURNS jsonb
+LANGUAGE plperl
+TRANSFORM FOR TYPE jsonb
+AS $$
+$val = [{a => 1, b => 'boo', c => undef}, {d => 2}];
+return $val;
+$$;
+SELECT testAVToJsonb();
+testavtojsonb
+-
+ [{"a": 1, "b": "boo", "c": null}, {"d": 2}]
+(1 row)
+
+-- test scalar -> jsonb
+CREATE FUNCTION testSVToJsonb() RETURNS jsonb
+LANGUAGE plperl
+TRANSFORM FOR TYPE jsonb
+AS $$
+$val = 1;
+return $val;
+$$;
+SELECT testSVToJsonb();
+ testsvtojsonb 
+---
+ 1
+(1 row)
+
+-- test blessed scalar -> jsonb
+CREATE FUNCTION testBlessedToJsonb() RETURNS jsonb
+LANGUAGE plperl
+TRANSFORM FOR TYPE jsonb
+AS $$
+my $class = shift;
+my $tmp = { a=>"a", 1=>"1" };
+bless $tmp, $class;
+return $tmp;
+$$;
+SELECT testBlessedToJsonb();
+  testblessedtojsonb  
+--
+ {"1": "1", "a": "a"}
+(1 row)
+
+-- test blessed scalar -> jsonb
+CREATE FUNCTION 

Re: explain analyze output with parallel workers - question about meaning of information for explain.depesz.com

2017-11-27 Thread hubert depesz lubaczewski
On Sat, Nov 25, 2017 at 07:08:03AM +0530, Amit Kapila wrote:
> > For example, check step 13 in https://explain.depesz.com/s/gNBd
> >
> > It shows time of 3ms, but loops of 1873, so the actual time is ~ 5600ms.
> >
> > But with parallel execution it seems to be no longer the case.
> >
> > For example:
> > https://explain.depesz.com/s/LTMp
> > or
> > https://explain.depesz.com/s/QHRi
> > Should I, for explain.depesz.com, when dealing
> > with partial* and parallel* nodes, use "loops=1" for calculation of
> > exclusive/inclusive time? always? some other nodes?
> >
> 
> I am not sure what exactly inclusive or exclusive means, but for
> parallel nodes, total stats are accumulated so you are seeing loops as
> 'worker nodes + 1'.  Now, as presumably workers run parallelly, so I
> think the actual time will be what will be shown in the node not
> actual time * nloops.

Please check the plans:
https://explain.depesz.com/s/gNBd (step 13)
and https://explain.depesz.com/s/LTMp (step 3)

Inclusive time is basically "loops * actual time", so for Index Scan,
which had 1873 loops and actual time of 3.002..3.016, we got 1873
* 3.016 = 5648.968ms.

In case of parallel workers it looks like the inclusive time is
basically the upper value from actual time.

The question now is: how can I tell which nodes should use "actual_time
* 1" and which "actual_time * loops" time?

Anything "below" "Gather"?
Anything starting with "Partial?"
Anything starting with "Parallel"?
Anything with "Worker" in node "description" in explain?

depesz



RE: How is the PostgreSQL debuginfo file generated

2017-11-27 Thread Rui Hai Jiang
Thank you for your help.



I have tried many of the ways you suggested.  I was not familiar with this
area, so I spent a lot of time trying.



At last, I found the easiest way to do this is to build a new SRPM package,
and then build the other RPMs from the new SRPM.



Thanks a lot,

Ruihai




From: Tom Lane 
Sent: Friday, November 24, 2017 12:50:37 AM
To: Rui Hai Jiang
Cc: pgsql-hack...@postgresql.org
Subject: Re: How is the PostgreSQL debuginfo file generated

Rui Hai Jiang  writes:
> I’m wondering how to build the debuginfo package from  the PostgreSQL source.
> For example to generate something like this one :   
> postgresql91-debuginfo.x86_64.

On platforms where such things exist, that's handled by the packaging
system, not by PG proper.  You should proceed by making a new SRPM
and building RPMs from that.

regards, tom lane


Re: [HACKERS] Challenges preventing us moving to 64 bit transaction id (XID)?

2017-11-27 Thread Amit Kapila
On Fri, Nov 24, 2017 at 4:03 PM, Alexander Korotkov
 wrote:
>> > 0002-heap-page-special-1.patch
>> > Putting xid and multixact bases into PageHeaderData would take extra 16
>> > bytes on index pages too.  That would be waste of space for indexes.
>> > This
>> > is why I decided to put bases into special area of heap pages.
>> > This patch adds special area for heap pages contaning prune xid and
>> > magic
>> > number.  Magic number is different for regular heap page and sequence
>> > page.
>> >
>>
>> uint16 pd_pagesize_version;
>> - TransactionId pd_prune_xid; /* oldest prunable XID, or zero if none */
>>   ItemIdData pd_linp[FLEXIBLE_ARRAY_MEMBER]; /* line pointer array */
>>   } PageHeaderData;
>>
>> Why have you moved pd_prune_xid from page header?
>
>
> pg_prune_xid makes sense only for heap pages.  Once we introduce special
> area for heap pages, we can move pg_prune_xid there and save some bytes in
> index pages.  However, this is an optimization not directly related to
> 64-bit xids.  Idea is that if we anyway change page format, why don't apply
> this optimization as well?
>

Sure, but I think this patch could have been proposed on top of your
main patch, not the other way around.  Another similar thing I have noticed is
below:

*** typedef struct CheckPoint
*** 39,45 
  TimeLineID PrevTimeLineID; /* previous TLI, if this record begins a new
   * timeline (equals ThisTimeLineID otherwise) */
  bool fullPageWrites; /* current full_page_writes */
- uint32 nextXidEpoch; /* higher-order bits of nextXid */
  TransactionId nextXid; /* next free XID */
  Oid nextOid; /* next free OID */
  MultiXactId nextMulti; /* next free MultiXactId */


I think if we have 64-bit transaction IDs, then we might not need the
epoch at all, but I think this can also be a separate change from the
main patch.  The point is that the main patch is already big enough
that we should not try to squeeze other related changes into it.

>  But if we have any doubts in this, it can be
> removed with no problem.
>
>>
>> > 0003-64bit-xid-1.patch
..
>> >
>>
>> It seems there is no README or some detailed explanation of how all
>> this works like how the value of pd_xid_base is maintained.  I don't
>> think there are enough comments in the patch to explain the things.  I
>> think it will be easier to understand and review the patch if you
>> provide some more details either in email or in the patch.
>
>
> OK.  I'm going to write README and include it into the patch.
>

Thanks.  Few assorted comments on the main patch:

1.
+ /*
+  * Ensure that given xid fits base of given page.
+  */
+ bool
+ heap_page_prepare_for_xid(Relation relation, Buffer buffer,
+   TransactionId xid, bool multi)

I couldn't find any use of the return value of this function.
Basically, if the specified xid can already update the page, it returns
false; otherwise it updates the tuples on the page and the base xid on
the page such that the new xid can update the tuple. So either you
should make this function return void, or split it into two parts such
that the first function checks whether the new xid can update the
tuple and, if so, proceeds with updating the tuple; otherwise, it
updates the base xid in the page and xmin/xmax in the tuples.

2.
heap_page_prepare_for_xid()
{
..
+ /* Find minimum and maximum xids in the page */
+ found = heap_page_xid_min_max(page, multi, , );
+
+ /* No items on the page? */
+ if (!found)
+ {
+ int64 delta;
+
+ if (!multi)
+ delta = (xid - FirstNormalTransactionId) - pageSpecial->pd_xid_base;
+ else
+ delta = (xid - FirstNormalTransactionId) - pageSpecial->pd_multi_base;
+
+ heap_page_shift_base(RelationNeedsWAL(relation) ? buffer : InvalidBuffer,
+ page, multi, delta);
+ MarkBufferDirty(buffer);
+ return false;
+ }
..
}

When there are no items on the page, what is the need to traverse all
items again (via heap_page_shift_base)? You can ideally shift the base
directly to (xid - FirstNormalTransactionId) in this case.

3.
heap_page_prepare_for_xid()
{
..
+ /* Can we just shift base on the page */
+ if (xid < base + FirstNormalTransactionId)
+ {
+ int64 freeDelta = MaxShortTransactionId - max,
+ requiredDelta = (base + FirstNormalTransactionId) - xid;
+
+ if (requiredDelta <= freeDelta)
+ {
+ heap_page_shift_base(RelationNeedsWAL(relation) ? buffer : InvalidBuffer,
+ page, multi, - (freeDelta + requiredDelta) / 2);
+ MarkBufferDirty(buffer);
+ return true;
+ }
+ }
+ else
+ {
+ int64 freeDelta = min - FirstNormalTransactionId,
+ requiredDelta = xid - (base + MaxShortTransactionId);
+
+ if (requiredDelta <= freeDelta)
+ {
+ heap_page_shift_base(RelationNeedsWAL(relation) ? buffer : InvalidBuffer,
+ page, multi, (freeDelta + requiredDelta) / 2);
+ MarkBufferDirty(buffer);
+ return true;
+ }
+ }
..
}

I think the above code doesn't follow the guidelines for writing WAL.
Currently, it (a) modifies the page, (b) writes WAL, and (c) marks the
buffer dirty. You need to perform step (c) before step (b).  Also, all
these steps need to be performed in a critical section.
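For reference, the usual order (per the transam README) looks roughly like
the sketch below; the function, rmgr id and info byte are placeholders, not
what the patch should literally use:

#include "postgres.h"

#include "access/xloginsert.h"
#include "miscadmin.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"
#include "utils/rel.h"

/* placeholder sketch of a WAL-logged page modification */
void
shift_base_logged(Relation rel, Buffer buffer, bool multi, int64 delta)
{
    Page        page = BufferGetPage(buffer);

    START_CRIT_SECTION();

    /* (a) modify the page -- apply the shift by "delta" (and "multi") here */

    /* (c) mark the buffer dirty before writing WAL */
    MarkBufferDirty(buffer);

    /* (b) then emit the WAL record and stamp the page with its LSN */
    if (RelationNeedsWAL(rel))
    {
        XLogRecPtr  recptr;

        XLogBeginInsert();
        XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
        /* rmgr id and info byte stand in for the patch's own record type */
        recptr = XLogInsert(RM_HEAP2_ID, 0);
        PageSetLSN(page, recptr);
    }

    END_CRIT_SECTION();
}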


Re: Typo in config_default.pl comment

2017-11-27 Thread Magnus Hagander
On Mon, Nov 27, 2017 at 2:21 AM, Andreas Karlsson  wrote:

> Hi,
>
> There is a --with-tls in the comments in config_default.pl which should
> be --with-tcl.
>

Oops. Applied, thanks.

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: [HACKERS] UPDATE of partition key

2017-11-27 Thread Amit Khandekar
On 24 November 2017 at 10:52, Amit Langote
 wrote:
> On 2017/11/23 21:57, Amit Khandekar wrote:
>> If we collect the partition keys in expand_partitioned_rtentry(), we
>> need to pass the root relation also, so that we can convert the
>> partition key attributes to root rel descriptor. And the other thing
>> is, may be, we can check beforehand (in expand_inherited_rtentry)
>> whether the rootrte's updatedCols is empty, which I think implies that
>> it's not an UPDATE operation. If yes, we can just skip collecting the
>> partition keys.
>
> Yeah, it seems like a good idea after all to check in
> expand_inherited_rtentry() whether the root RTE's updatedCols is non-empty
> and if so check if any of the updatedCols are partition keys.  If we find
> some, then it will suffice to just set a simple flag in the
> PartitionedChildRelInfo that will be created for the root table.  That
> should be done *after* we have visited all the tables in the partition
> tree including some that might be partitioned and hence will provide their
> partition keys.  The following block in expand_inherited_rtentry() looks
> like a good spot:
>
> if (rte->inh && partitioned_child_rels != NIL)
> {
> PartitionedChildRelInfo *pcinfo;
>
> pcinfo = makeNode(PartitionedChildRelInfo);

Yes, I am thinking about something like that. Thanks.

I am also working on your suggestion of moving the
convert-to-root-descriptor logic from ExecInsert() to ExecUpdate().

So, in the upcoming patch version, I am intending to include the above
two, and if possible, Robert's idea of re-using is_partition_attr()
for pull_child_partition_columns().


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company



Re: [HACKERS] Re: proposal - psql: possibility to specify sort for describe commands, when size is printed

2017-11-27 Thread Alexander Korotkov
On Tue, Nov 14, 2017 at 3:26 PM, Magnus Hagander 
wrote:

> On Mon, Nov 13, 2017 at 3:17 PM, Dean Rasheed 
> wrote:
>
>> On 28 October 2017 at 13:46, Pavel Stehule 
>> wrote:
>> > I though about Alexander proposal, and I am thinking so it can be
>> probably
>> > best if we respect psql design. I implemented two command suffixes
>> > (supported only when it has sense) "s" sorted by size and "d" as descent
>> >
>> > so list of tables can be sorted with commands:
>> >
>> > \dt+sd (in this case, the order is not strict), so command
>> > \dtsd+ is working too (same \disd+ or \di+sd)
>> >
>> > These two chars are acceptable. Same principle is used for \l command
>> >
>> > \lsd+ or \l+sd
>> >
>> > What do you think about it?
>> >
>>
>> I really hate that syntax. This is going to turn into an
>> incomprehensible mess, and isn't easily extended to support other
>> options.
>>
>
> +1. While useful in itself, I think it's definitely a dangerous pattern to
> go down, as it'll only get worse.
>
>
> I agree with people who have said they would prefer this to be
>> available as a per-command option rather than as a variable that you
>> have to set, but it needs a clearer syntax. I actually like Stephen's
>> idea of using a user-defined SQL snippet, because that's a familiar
>> syntax to people, and it avoids adding an ever-increasing number of
>> options to these commands. Instead, the syntax could simply be:
>>
>
> +1 here as well. And anybody who is actually going to need this level of
> control definitely will know SQL...
>
> And if one wants to save some "standard patterns", it should be doable to
> save the pattern itself in a variable and then use it with something like
> "\dt :mysort" and have it expand the normal way there.
>

+1
I agree, that would look better, especially with "standard patterns", which
could help with SQL snippets that are too long to type each time.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company