Re: [HACKERS] path toward faster partition pruning

2018-04-10 Thread Amit Langote
Thanks for the comment.

On 2018/04/10 21:11, Ashutosh Bapat wrote:
> On Tue, Apr 10, 2018 at 5:32 PM, David Rowley
>  wrote:
>> Apart from that confusion, looking at the patch:
>>
>> +CREATE OR REPLACE FUNCTION pp_hashint4_noop(int4, int8) RETURNS int8 AS
>> +$$SELECT coalesce($1)::int8$$ LANGUAGE sql IMMUTABLE STRICT;
>> +CREATE OPERATOR CLASS pp_test_int4_ops FOR TYPE int4 USING HASH AS
>> +OPERATOR 1 = , FUNCTION 2 pp_hashint4_noop(int4, int8);
>> +CREATE OR REPLACE FUNCTION pp_hashtext_length(text, int8) RETURNS int8 AS
>> +$$SELECT length(coalesce($1))::int8$$ LANGUAGE sql IMMUTABLE STRICT;
>>
>>
>> Why coalesce here? Maybe I've not thought of something, but coalesce
>> only seems useful to me if there's > 1 argument. Plus the function is
>> strict, so not sure it's really doing anything even if you added a default.
> 
> I think Amit Langote wanted to write coalesce($1, $2), $2 being the
> seed for hash function. See how hash operator class functions are
> defined in sql/insert.sql.

Actually, I referenced the functions and operator classes defined in
hash_part.sql, not insert.sql.  Although, as you point out, I didn't think
very hard about the significance of the $2 passed to coalesce in those
functions.  I will fix that and add the $2 back, along with some other changes
that make them almost identical to the definitions in hash_part.sql.
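
Concretely, the corrected definition would presumably end up along these
lines (just a sketch; dropping STRICT is my assumption, so that the coalesce
fallback can actually do something, and the final form will simply follow
hash_part.sql):

CREATE OR REPLACE FUNCTION pp_hashint4_noop(int4, int8) RETURNS int8 AS
$$SELECT coalesce($1, $2)::int8$$ LANGUAGE sql IMMUTABLE;
CREATE OPERATOR CLASS pp_test_int4_ops FOR TYPE int4 USING HASH AS
OPERATOR 1 = , FUNCTION 2 pp_hashint4_noop(int4, int8);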

> Maybe we should just use the same
> functions or even the same tables.

Because the hash_part.sql and partition_prune.sql tests run in parallel, I've
decided to rename the functions, operator classes, and tables in
partition_prune.sql.  That seems like a good idea in any case.  Also, since
the existing pruning tests were written around their own table, I decided
not to switch them over to the tables used in hash_part.sql.

Will post an updated patch after addressing David's comment.

Regards,
Amit




Re: Boolean partitions syntax

2018-04-10 Thread Amit Langote
On 2018/04/11 13:39, David Rowley wrote:
> On 11 April 2018 at 05:22, Tom Lane  wrote:
>> David Rowley  writes:
>>> On 11 April 2018 at 03:34, Tom Lane  wrote:
 Well, that just begs the question: why do these expressions need to
 be immutable?  What we really want, I think, is to evaluate them
 and reduce them to constants.  After that, it hardly matters whether
 the original expression was volatile.  I see absolutely no moral
 difference between "for values in (random())" and cases like
 this, which works today:
>>
>>> I'd personally be pretty surprised if this worked.
>>
>> Well, my point is that we're *already* behaving that way in some cases,
>> simply because of the existence of macro-like inputs for some of these
>> datatypes.  I'm not sure that users are going to perceive a big difference
>> between 'now'::timestamptz and now(), for example.  If we take one but
>> not the other, I don't think anybody will see that as a feature.
> 
> To me, it seems a bit inconsistent to treat 'now'::timestamp and now()
> the same way.
> 
> I found this in the 7.4 release notes [1]:
> 
> "String literals specifying time-varying date/time values, such as
> 'now' or 'today' will no longer work as expected in column default
> expressions; they now cause the time of the table creation to be the
> default, not the time of the insertion. Functions such as now(),
> current_timestamp, or current_date should be used instead.
> 
> In previous releases, there was special code so that strings such as
> 'now' were interpreted at INSERT time and not at table creation time,
> but this work around didn't cover all cases. Release 7.4 now requires
> that defaults be defined properly using functions such as now() or
> current_timestamp. These will work in all situations."
> 
> So isn't 'now' being different from now() in DDL something users
> should be quite used to by now?
> 
> I've got to admit, I'm somewhat less concerned about evaluating
> volatile functions in this case because you don't seem that concerned,
> but if you happen to be wrong, then it's going to be a much harder
> thing to fix.  Really, is anyone going to complain if we don't
> evaluate these and reject them with an error instead? It seems like a
> safer option to me, also less work, and we're probably less likely to
> regret it.

As someone said upthread, we should just document that we *always*
evaluate the expression specified for a partition bound during create
table, and not at some other time.  That seems easier than figuring out
what to say in the error message; saying "cannot use non-immutable
expression for partition bound" is likely to confuse a user even more by
introducing ambiguity about when partition bounds are evaluated.  Most
users would expect it to be create table time anyway.
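
For illustration (the table and column names here are made up), that is
essentially the behaviour we already have for literal inputs: the bound is
reduced to a constant once, when the partition is created, and that value
is what gets stored in relpartbound:

create table sales (sold_date date) partition by range (sold_date);
-- 'today' and 'tomorrow' are resolved while creating the partition;
-- \d later shows the concrete dates computed at create table time
create table sales_today partition of sales
  for values from ('today') to ('tomorrow');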

> We also need to decide what of this we can backpatch to PG10 to fix
> [2].  Ideally what goes into PG10 and PG11 would be the same, so
> perhaps that's another reason to keep it more simple.

Backpatch all of it?  The newly introduced syntax and evaluation semantics
do not break any inputs that PG 10 allows.  But I may be missing something.

Thanks,
Amit




Re: vacuum_cost_limit doc description patch

2018-04-10 Thread David Rowley
On 11 April 2018 at 09:13, Martín Marqués  wrote:
> This is a patch to add some further description, plus the upper and
> lower limits it has.

Hi,

+ for vacuum_cost_delay. The parameter can take a value
between 1 and 10000.

vacuum_cost_delay should be in  tags.

+1 to mentioning that we sleep for vacuum_cost_delay, but I just don't
see many other GUCs with a mention of their supported range.

effective_io_concurrency mentions the range it supports, but that range
happens to depend on USE_POSIX_FADVISE; if that's undefined, the maximum
setting is 0, which means the docs are wrong in some cases there.

vacuum_cost_limit's range seems fairly fixed at 1-10000 with no compile-time
conditions, so perhaps it's okay, provided we remember to update the
docs if that ever changes.
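
For reference, the range the server actually enforces can always be read
back from pg_settings, so a quick way to double-check the docs is something
like:

SELECT name, min_val, max_val, boot_val
FROM pg_settings
WHERE name IN ('vacuum_cost_limit', 'vacuum_cost_delay');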

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



RE: power() function in Windows: "value out of range: underflow"

2018-04-10 Thread Huong Dangminh
> > I updated the patch as David Rowley mentioned.
> 
> Looks fine to me. Please add to the next commitfest.

Thanks. Added.

> --
>  David Rowley   http://www.2ndQuadrant.com/
>  PostgreSQL Development, 24x7 Support, Training & Services
> 

---
Thanks and best regards,
Dang Minh Huong
NEC Solution Innovators, Ltd.
http://www.nec-solutioninnovators.co.jp/en/



Re: Boolean partitions syntax

2018-04-10 Thread Amit Langote
Horiguchi-san,

Thanks for working on this.

On 2018/04/11 13:20, Kyotaro HORIGUCHI wrote:
> At Wed, 11 Apr 2018 11:27:17 +0900, Amit Langote wrote:
>> On 2018/04/11 10:44, Tom Lane wrote:
>>> Kyotaro HORIGUCHI  writes:
 At least partition bound *must* be a constant. Any expression
 that can be reduced to a constant at parse time ought to be
 accepted but must not be accepted if not.
>>>
>>> My point is that *any* expression can be reduced to a constant,
>>> we just have to do so.
> 
> Agreed in that sense. What I had in mind was something like a
> column reference; random() falls into the reducible category, of
> course.
> 
> # I regard the change in gram.y as acceptable as a direction,
> # so I'll continue working on this.

I haven't reviewed the grammar changes in detail yet...

>> Currently transformPartitionBoundValue() applies eval_const_expressions()
>> by way of calling expression_planner().  However, passing it, say, an
>> expression representing random() fails to reduce it to a Const, because
>> simplify_function/evaluate_function won't compute a mutable function like
>> random().  So, that currently results in an error like this (note that
>> this is after applying Horiguchi-san's latest patch that enables
>> specifying random() as a partition bound in the first place):
>>
>> create table foo_part partition of foo for values in ((random())::int);
>> ERROR:  specified value cannot be cast to type integer for column "a"
>> LINE 1: ...table foo_random partition of foo for values in ((random()):...
>>  ^
>> DETAIL:  The cast requires a non-immutable conversion.
>> HINT:  Try putting the literal value in single quotes.
>>
>> The error is output after the following if check in
>> transformPartitionBoundValue fails:
>>
>> /* Fail if we don't have a constant (i.e., non-immutable coercion) */
>> if (!IsA(value, Const))
>>
>> I think what Tom is proposing here, instead of bailing out upon
>> eval_const_expressions() failing to simplify the input expression to a
>> Const, is to *invoke the executor* to evaluate the expression, like the
>> optimizer does in evaluate_expr, and cook up a Const with whatever comes
>> out to store it into the catalog (that is, in relpartbound).
> 
> Yes. In the attached I used evaluate_expr by making it a non-static
> function. a_expr, used instead of partbound_datum, is changed to
> u_expr, which is the same as what the range bounds use.
> 
>> =# create table c1 partition of p for values in (random() * 100);
>> CREATE TABLE
>> =# \d c1
> ...
>> Partition of: p FOR VALUES IN (97)

I looked at the non-gram.y portions of the patch for now as I was also
playing with this.  Some comments on your patch:

* You missed adding a break here for the EXPR_KIND_PARTITION_EXPRESSION case

 case EXPR_KIND_PARTITION_EXPRESSION:
 err = _("window functions are not allowed in partition key
expressions");
+case EXPR_KIND_PARTITION_BOUNDS:
+err = _("window functions are not allowed in partition bounds");
 break;

So, the following is the wrong error message that you probably failed to
notice:

--- a/src/test/regress/expected/create_table.out
+++ b/src/test/regress/expected/create_table.out
@@ -308,7 +308,7 @@ CREATE TABLE partitioned (
 a int,
 b int
 ) PARTITION BY RANGE ((avg(a) OVER (PARTITION BY b)));
-ERROR:  window functions are not allowed in partition key expressions
+ERROR:  window functions are not allowed in partition bounds

* I think the new ParseExprKind you added should omit the last "S", that
is, name it EXPR_KIND_PARTITION_BOUND, because these are expressions to
represent individual bound values.  And so adjust the comment to say "bound".

 +EXPR_KIND_PARTITION_BOUNDS, /* partition bounds value */

* When looking at the changes to transformPartitionBoundValue, I noticed
that there is no new argument Oid colTypColl

 static Const *
-transformPartitionBoundValue(ParseState *pstate, A_Const *con,
+transformPartitionBoundValue(ParseState *pstate, Node *val,
  const char *colName, Oid colType, int32 colTypmod)

that's because you seem to pass the expression's own type, typmod, and typcoll
to the newly added call to evaluate_expr.  I wonder if we should really
be passing the partition key's specified values here.  We already receive the
first two from the caller.

* In the changes to create_table.out

@@ -450,13 +450,9 @@ CREATE TABLE part_1 PARTITION OF list_parted FOR
VALUES IN ('1');
 CREATE TABLE part_2 PARTITION OF list_parted FOR VALUES IN (2);
 CREATE TABLE part_null PARTITION OF list_parted FOR VALUES IN (null);
 CREATE TABLE fail_part PARTITION OF list_parted FOR VALUES IN (int '1');
-ERROR:  syntax error at or near "int"
-LINE 1: ... fail_part PARTITION OF list_parted FOR VALUES IN (int '1');
-  ^
+ERROR:  partition 

Re: power() function in Windows: "value out of range: underflow"

2018-04-10 Thread David Rowley
On 11 April 2018 at 16:42, Huong Dangminh  wrote:
> I updated the patch as David Rowley mentioned.

Looks fine to me. Please add to the next commitfest.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



RE: power() function in Windows: "value out of range: underflow"

2018-04-10 Thread Huong Dangminh
Thanks for confirming.
 
> 2018-04-11 0:13 GMT-03:00 David Rowley :
> > I can recreate this when building with MSVC 2012. I confirm that I see
> > the same as you. Microsoft are setting errno to EDOM in the above 3
> > cases, where in Linux the result is still NaN, just the errno is not
> > set.
> >
> FWIW, I tested in MSVC 2017 (15.6.4) and it worked as expected.
> Looking at [1], it seems there could be nasty bugs when using math functions
> in MSVC < 2013 (or 2015). I don't have an older MSVC here to try
> /fp:OPTIONHERE [2] to see if it makes any difference.

Thanks.
Anyway, current PostgreSQL installers do not seem to be compiled with MSVC 2017,
so it should be fixed for current users?
I updated the patch as David Rowley mentioned.


---
Thanks and best regards,
Dang Minh Huong
NEC Solution Innovators, Ltd.
http://www.nec-solutioninnovators.co.jp/en/



dpow_NaN_V2.patch
Description: dpow_NaN_V2.patch


Re: Boolean partitions syntax

2018-04-10 Thread David Rowley
On 11 April 2018 at 05:22, Tom Lane  wrote:
> David Rowley  writes:
>> On 11 April 2018 at 03:34, Tom Lane  wrote:
>>> Well, that just begs the question: why do these expressions need to
>>> be immutable?  What we really want, I think, is to evaluate them
>>> and reduce them to constants.  After that, it hardly matters whether
>>> the original expression was volatile.  I see absolutely no moral
>>> difference between "for values in (random())" and cases like
>>> this, which works today:
>
>> I'd personally be pretty surprised if this worked.
>
> Well, my point is that we're *already* behaving that way in some cases,
> simply because of the existence of macro-like inputs for some of these
> datatypes.  I'm not sure that users are going to perceive a big difference
> between 'now'::timestamptz and now(), for example.  If we take one but
> not the other, I don't think anybody will see that as a feature.

To me, it seems a bit inconsistent to treat 'now'::timestamp and now()
the same way.

I found this in the 7.4 release notes [1]:

"String literals specifying time-varying date/time values, such as
'now' or 'today' will no longer work as expected in column default
expressions; they now cause the time of the table creation to be the
default, not the time of the insertion. Functions such as now(),
current_timestamp, or current_date should be used instead.

In previous releases, there was special code so that strings such as
'now' were interpreted at INSERT time and not at table creation time,
but this work around didn't cover all cases. Release 7.4 now requires
that defaults be defined properly using functions such as now() or
current_timestamp. These will work in all situations."

So isn't 'now' being different from now() in DDL something users
should be quite used to by now?
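
For illustration, the long-standing difference in the DEFAULT case looks
like this (made-up table names):

-- 'now' is reduced to a constant when the table is created ...
create table t_created (ts timestamptz default 'now');
-- ... while now() is evaluated afresh at each INSERT
create table t_inserted (ts timestamptz default now());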

I've got to admit, I'm somewhat less concerned about evaluating
volatile functions in this case because you don't seem that concerned,
but if you happen to be wrong, then it's going to be a much harder
thing to fix.  Really, is anyone going to complain if we don't
evaluate these and reject them with an error instead? It seems like a
safer option to me, also less work, and we're probably less likely to
regret it.

We also need to decide what of this we can backpatch to PG10 to fix
[2].  Ideally what goes into PG10 and PG11 would be the same, so
perhaps that's another reason to keep it more simple.

[1] https://www.postgresql.org/docs/current/static/release-7-4.html
[2] 
https://www.postgresql.org/message-id/CAKJS1f-BL%2Br5FXSejDu%3D%2BMAvzRARaawRnQ_ZFtbv_o6tha9NJw%40mail.gmail.com

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Boolean partitions syntax

2018-04-10 Thread Kyotaro HORIGUCHI
At Wed, 11 Apr 2018 11:27:17 +0900, Amit Langote 
 wrote in 
<1810b14f-3cd7-aff5-8358-c225c0231...@lab.ntt.co.jp>
> On 2018/04/11 10:44, Tom Lane wrote:
> > Kyotaro HORIGUCHI  writes:
> >> At least partition bound *must* be a constant. Any expression
> >> that can be reduced to a constant at parse time ought to be
> >> accepted but must not be accepted if not.
> > 
> > My point is that *any* expression can be reduced to a constant,
> > we just have to do so.

Agreed in that sense. What I had in mind was something like a
column reference; random() falls into the reducible category, of
course.

# I regard the change in gram.y as acceptable as a direction,
# so I'll continue working on this.

> Currently transformPartitionBoundValue() applies eval_const_expressions()
> by way of calling expression_planner().  However, passing it, say, an
> expression representing random() fails to reduce it to a Const, because
> simplify_function/evaluate_function won't compute a mutable function like
> random().  So, that currently results in an error like this (note that
> this is after applying Horiguchi-san's latest patch that enables
> specifying random() as a partition bound in the first place):
> 
> create table foo_part partition of foo for values in ((random())::int);
> ERROR:  specified value cannot be cast to type integer for column "a"
> LINE 1: ...table foo_random partition of foo for values in ((random()):...
>  ^
> DETAIL:  The cast requires a non-immutable conversion.
> HINT:  Try putting the literal value in single quotes.
> 
> The error is output after the following if check in
> transformPartitionBoundValue fails:
> 
> /* Fail if we don't have a constant (i.e., non-immutable coercion) */
> if (!IsA(value, Const))
> 
> I think what Tom is proposing here, instead of bailing out upon
> eval_const_expressions() failing to simplify the input expression to a
> Const, is to *invoke the executor* to evaluate the expression, like the
> optimizer does in evaluate_expr, and cook up a Const with whatever comes
> out to store it into the catalog (that is, in relpartbound).

Yes. In the attached I used evaluate_expr by making it a non-static
function. a_expr, used instead of partbound_datum, is changed to
u_expr, which is the same as what the range bounds use.

> =# create table c1 partition of p for values in (random() * 100);
> CREATE TABLE
> =# \d c1
...
> Partition of: p FOR VALUES IN (97)

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/optimizer/util/clauses.c b/src/backend/optimizer/util/clauses.c
index ed6b680ed8..cfb0984100 100644
--- a/src/backend/optimizer/util/clauses.c
+++ b/src/backend/optimizer/util/clauses.c
@@ -152,8 +152,6 @@ static Node *substitute_actual_parameters(Node *expr, int nargs, List *args,
 static Node *substitute_actual_parameters_mutator(Node *node,
 	 substitute_actual_parameters_context *context);
 static void sql_inline_error_callback(void *arg);
-static Expr *evaluate_expr(Expr *expr, Oid result_type, int32 result_typmod,
-			  Oid result_collation);
 static Query *substitute_actual_srf_parameters(Query *expr,
  int nargs, List *args);
 static Node *substitute_actual_srf_parameters_mutator(Node *node,
@@ -4842,7 +4840,7 @@ sql_inline_error_callback(void *arg)
  * We use the executor's routine ExecEvalExpr() to avoid duplication of
  * code and ensure we get the same result as the executor would get.
  */
-static Expr *
+Expr *
 evaluate_expr(Expr *expr, Oid result_type, int32 result_typmod,
 			  Oid result_collation)
 {
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index dd0c26c11b..02979a3533 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -180,6 +180,8 @@ static Node *makeXmlExpr(XmlExprOp op, char *name, List *named_args,
 static List *mergeTableFuncParameters(List *func_args, List *columns);
 static TypeName *TableFuncTypeName(List *columns);
 static RangeVar *makeRangeVarFromAnyName(List *names, int position, core_yyscan_t yyscanner);
+static Node *makePartRangeDatum(PartitionRangeDatumKind kind, Node *value,
+int location);
 static void SplitColQualList(List *qualList,
 			 List **constraintList, CollateClause **collClause,
 			 core_yyscan_t yyscanner);
@@ -472,7 +474,7 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 %type 	columnDef columnOptions
 %type 	def_elem reloption_elem old_aggr_elem operator_def_elem
 %type 	def_arg columnElem where_clause where_or_current_clause
-a_expr b_expr c_expr AexprConst indirection_el opt_slice_bound
+a_expr u_expr b_expr b0_expr c_expr c0_expr AexprConst indirection_el opt_slice_bound
 columnref in_expr having_clause func_table xmltable array_expr
 ExclusionWhereClause operator_def_arg
 %type 	rowsfrom_item rowsfrom_list opt_col_def_list
@@ 

Re: power() function in Windows: "value out of range: underflow"

2018-04-10 Thread Euler Taveira
2018-04-11 0:13 GMT-03:00 David Rowley :
> I can recreate this when building with MSVC 2012. I confirm that I see
> the same as you. Microsoft are setting errno to EDOM in the above 3
> cases, where in Linux the result is still NaN, just the errno is not
> set.
>
FWIW, I tested in MSVC 2017 (15.6.4) and it worked as expected.
Looking at [1], it seems there could be nasty bugs when using math
functions in MSVC < 2013 (or 2015). I don't have an older MSVC here
to try /fp:OPTIONHERE [2] to see if it makes any difference.


[1] https://docs.microsoft.com/en-us/cpp/porting/floating-point-migration-issues
[2] https://msdn.microsoft.com/pt-br/library/e7s85ffb.aspx


-- 
   Euler Taveira                  Timbira - http://www.timbira.com.br/
   PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento



Re: power() function in Windows: "value out of range: underflow"

2018-04-10 Thread David Rowley
On 10 April 2018 at 20:30, Huong Dangminh  wrote:
> Hi,
>
> There are some cases where the power() function does not work
> correctly with 'NaN' arguments in a Windows environment.
> Something like,
>
> postgres=# select power('NaN',11);
> ERROR:  value out of range: underflow
> postgres=# select power('NaN','NaN');
> ERROR:  value out of range: underflow
> postgres=# select power(11,'NaN');
> ERROR:  value out of range: underflow
>
> In Linux environment, instead of ERROR it returns 'NaN'.
>
> The reason here is:
> when pow() in float.c:dpow() is called with 'NaN' arguments,
> pow() returns 'NaN', but in the Windows environment errno is set to
> EDOM (a domain error).
> So PostgreSQL updates "result" and causes an ERROR in the CHECKFLOATVAL macro.

I can recreate this when building with MSVC 2012. I confirm that I see
the same as you. Microsoft are setting errno to EDOM in the above 3
cases, where in Linux the result is still NaN, just the errno is not
set.

Looking at [1], it says:

* If x or y is a NaN, a NaN shall be returned (unless specified
elsewhere in this description).

* For any value of y (including NaN), if x is +1, 1.0 shall be returned.

* For any value of x (including NaN), if y is ±0, 1.0 shall be returned.

Microsoft don't seem to have broken any of those; they're just also setting
errno to EDOM in the first case, which seems a little strange, but the
standard does not seem to say that they shouldn't do that.

Your patch fixes the problem for me, and importantly the two following
cases still work:

postgres=# select power(1,'NaN');
 power
---
 1
(1 row)


postgres=# select power('NaN', 0);
 power
---
 1
(1 row)


There's no mention in the SQL standard about NaN handling in the above
two cases, but I wonder if it's worth also adding a couple of tests
for the above two cases to ensure all platforms are doing the same
thing... ?
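
For what it's worth, such additions might look something like this (whether
float8.sql is the right home for them is just a guess):

-- make sure all platforms agree on the POSIX special cases
SELECT power(float8 '1', float8 'NaN');    -- expected result: 1
SELECT power(float8 'NaN', float8 '0');    -- expected result: 1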

I'd also rather see this line:

if (errno == EDOM && isnan(result) && !(isnan(arg1) || isnan(arg2)))

written as:

if (errno == EDOM && isnan(result) && !isnan(arg1) && !isnan(arg2))

[1] http://pubs.opengroup.org/onlinepubs/9699919799/functions/pow.html

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Boolean partitions syntax

2018-04-10 Thread Amit Langote
On 2018/04/10 23:37, David Rowley wrote:
> On 10 April 2018 at 23:13, Kyotaro HORIGUCHI
>  wrote:
>> Note: This is not intended to be committed this time but just for
>> information.
>>
>> At Tue, 10 Apr 2018 10:34:27 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
>> wrote:
>>> Just adding negation would work as a_expr is doing.
>>>
 | '-' a_expr%prec UMINUS
 { $$ = doNegate($2, @1); }
>>
>> a_expr fits partbound_datum_list as is but it cannot sit
>> side-by-side with MAX/MINVALUE at all. However c_expr can if
>> columnref is not excluded. The attached patch does that and
>> partition bound accepts the following syntax. (I didn't see the
>> transform side at all)
>>
>> create table p2c1 partition of p2 for values from (log(1000),0+1,0/1) to 
>> (10, 10, ('1'||'0')::int);
> 
> I imagined this would have had a check for volatile functions and some
> user-friendly error message to say partition bounds must be immutable,
> but instead, it does:
> 
> postgres=# create table d_p1 partition of d for values in (Random());
> ERROR:  specified value cannot be cast to type double precision for column "d"
> LINE 1: create table d_p1 partition of d for values in (Random());
> ^
> DETAIL:  The cast requires a non-immutable conversion.
> HINT:  Try putting the literal value in single quotes.
> 
> For inspiration, maybe you could follow the lead of CREATE INDEX:
> 
> postgres=# create index on d ((random()));
> ERROR:  functions in index expression must be marked IMMUTABLE

Hmm, I think we do already check that the *partition key* expression is
IMMUTABLE.

create table foo (a int) partition by list (random());
ERROR:  functions in partition key expression must be marked IMMUTABLE

That check doesn't sound applicable to what we'd want to do with a partition
bound expression, though.

Thanks,
Amit




Re: Boolean partitions syntax

2018-04-10 Thread Amit Langote
On 2018/04/11 10:44, Tom Lane wrote:
> Kyotaro HORIGUCHI  writes:
>> At least partition bound *must* be a constant. Any expression
>> that can be reduced to a constant at parse time ought to be
>> accepted but must not be accepted if not.
> 
> My point is that *any* expression can be reduced to a constant,
> we just have to do so.

Currently transformPartitionBoundValue() applies eval_const_expressions()
by way of calling expression_planner().  However, passing it, say, an
expression representing random() fails to reduce it to a Const, because
simplify_function/evaluate_function won't compute a mutable function like
random().  So, that currently results in an error like this (note that
this is after applying Horiguchi-san's latest patch that enables
specifying random() as a partition bound in the first place):

create table foo_part partition of foo for values in ((random())::int);
ERROR:  specified value cannot be cast to type integer for column "a"
LINE 1: ...table foo_random partition of foo for values in ((random()):...
 ^
DETAIL:  The cast requires a non-immutable conversion.
HINT:  Try putting the literal value in single quotes.

The error is output after the following if check in
transformPartitionBoundValue fails:

/* Fail if we don't have a constant (i.e., non-immutable coercion) */
if (!IsA(value, Const))

I think what Tom is proposing here, instead of bailing out upon
eval_const_expressions() failing to simplify the input expression to a
Const, is to *invoke the executor* to evaluate the expression, like the
optimizer does in evaluate_expr, and cook up a Const with whatever comes
out to store it into the catalog (that is, in relpartbound).

Thanks,
Amit




Re: lazy detoasting

2018-04-10 Thread Jan Wieck
Maybe I'm missing something here, but let me put my $0.02 in anyway.

TOAST reuses entries. If a toasted value is unchanged on UPDATE (i.e. the
toast pointer didn't get replaced by a new value), the new tuple points to
the same toast slices as the old. If it is changed, the current transaction
DELETEs the old toast slices and INSERTs new ones (if needed).

If your session has a transaction snapshot that protects the old toast
slices from being vacuumed away, you are fine, even under READ COMMITTED
(that is why TOAST uses that other visibility rule: the slices don't go
away when the concurrent UPDATE commits, and instead behave more like
REPEATABLE READ).

At least that is what the code was intended to do originally.


Regards, Jan



On Tue, Apr 10, 2018 at 6:24 PM, Chapman Flack 
wrote:

> On 04/10/2018 05:04 PM, Chapman Flack wrote:
> > ... I wonder if
> > there's at least a cheap way to check a particular snapshot
> > for suitability wrt a given toast pointer. Check a couple usual
> > suspects, find one most of the time, fall back to eager detoasting
> > otherwise?
> >
> > Guess I need to go back for a deeper dive on just what constitutes
> > a toast pointer. I was skimming last time
>
> I see all I have in a toast pointer is a relation id and a value
> oid, so probably no way to check that a given snapshot 'works' to
> find it, at least no more cheaply than by opening the toast relation
> and trying to find it.
>
> Welp, what tuptoaster does to construct its SnapshotToast is to
> grab GetOldestSnapshot():
>
> *  since we don't know which one to use, just use the oldest one.
> *  This is safe: at worst, we will get a "snapshot too old" error
> *  that might have been avoided otherwise.
>
> ... which means, I take it, that a more recent snapshot
> might work, but this conservative choice would only fail
> if the oldest snapshot has really been around for many days,
> under typical old_snapshot_threshold settings?
>
> ... and if I'm getting a value from a portal that happens to have
> a non-null holdSnapshot, then that would be the one to use, in
> preference to just conservatively grabbing the oldest?
>
> -Chap
>
>
> ... am I right that the init_toast_snapshot construction is really
> making only one change to the snapshot data, namely changing the
> 'satisfies' condition to HeapTupleSatisfiesToast ?
>
> The lsn and whenTaken seem to be fetched from the original data
> and stored right back without change.
>
>


-- 
Jan Wieck
Senior Postgres Architect
http://pgblog.wi3ck.info


Re: Boolean partitions syntax

2018-04-10 Thread Tom Lane
Kyotaro HORIGUCHI  writes:
> At least partition bound *must* be a constant. Any expression
> that can be reduced to a constant at parse time ought to be
> accepted but must not be accepted if not.

My point is that *any* expression can be reduced to a constant,
we just have to do so.

> Maybe we could explicitly control that by having pseudo functions
> like eval().
> ... where sold_date = eval_on_parse('today');
> ... where sold_date = eval_on_exec('today');

This strikes me as far outside the immediate requirements.  In fact,
if we had such functions, they'd be irrelevant to this requirement.

regards, tom lane



Re: Boolean partitions syntax

2018-04-10 Thread Kyotaro HORIGUCHI
At Wed, 11 Apr 2018 02:33:58 +1200, David Rowley  
wrote in 
> On 3 February 2018 at 12:04, Tom Lane  wrote:
> > Perhaps more useful to discuss: would that truly be the semantics we want,
> > or should we just evaluate the expression and have done?  It's certainly
> > arguable that "IN (random())" ought to draw an error, not compute some
> > random value and use that.  But if you are insistent on partition bounds
> > being immutable in any strong sense, you've already got problems, because
> > e.g. a timestamptz literal's interpretation isn't necessarily fixed.
> > It's only after we've reduced the original input to Datum form that we
> > can make any real promises about the value not moving.  So I'm not seeing
> > where is the bright line between "IN ('today')" and "IN (random())".
> 
> I see there's been some progress on this thread that's probably gone a
> bit beyond here without the discussion about the desired semantics.
> 
> To kick that off, I'm wondering, in regards to the comment about
> 'today' vs random(); how does this differ from something like:
> 
> CREATE VIEW ... AS SELECT ... FROM ... WHERE datecol = 'today'; ?
> 
> In this case 'today' is going to be evaluated during the parse
> analysis that's done during CREATE VIEW. Why would partitioning need
> to be treated differently?

At least, a partition bound *must* be a constant. Any expression
that can be reduced to a constant at parse time ought to be
accepted, but must not be accepted otherwise. random() is not
immutable, but it can be reduced to a constant at parse time, so it
can freely take part in a partbound expression. I don't think there's
a serious problem on this side, apart from documentation.

On the other hand, a view can take either, but that is not explicitly
specifiable by its creator. The following two work in different
ways for PostgreSQL-internal reasons, and we cannot see the
difference until dumping the definition.

create view vconstdate as select * from sales where sold_date = 'today';
create view vvardate   as select * from sales where sold_date = now()::date;

Maybe we could explicitly control that by having pseudo functions
like eval().

... where sold_date = eval_on_parse('today');
... where sold_date = eval_on_exec('today');


regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: User defined data types in Logical Replication

2018-04-10 Thread Masahiko Sawada
On Mon, Mar 19, 2018 at 7:57 PM, Masahiko Sawada  wrote:
> On Mon, Mar 19, 2018 at 12:50 PM, Masahiko Sawada  
> wrote:
>> On Fri, Mar 16, 2018 at 10:24 AM, Alvaro Herrera
>>  wrote:
>>> Masahiko Sawada wrote:
 On Thu, Mar 15, 2018 at 9:41 AM, Alvaro Herrera  
 wrote:
>>>
 > I think this is a worthwhile test, but IMO it should be improved a bit
 > before we include it.  Also, we can come up with a better name for the
 > test surely, not just refer to this particular bug. Maybe "typemap".

 It might be useful if we have the facility of TAP test to check the
 log message and regexp-match the message to the expected string.
>>>
>>> Something similar to PostgresNode::issues_sql_like() perhaps?
>>>
>>
>> Yeah, I didn't know that, but I think it's a good idea. Unlike
>> issues_sql_like(), we don't issue anything to the subscriber. So maybe
>> we need to truncate the logfile before the insertion and verify the
>> logfile for that particular period. The test code would look like the
>> following.
>>
>> $node_subscriber->safe_psql('postgres', 'CREATE SUBSCRIPTION...");
>> truncate $node_subscriber->logfile, 0;
>> $node_publisher->safe_psql('postgres', 'INSERT .. ')
>> my $log = TestLib::slurp_file($node_subscriber->logfile);
>>
>> # Verify logs
>> like($log, qr/processing remote data for replication target relation
>> "public.test" column "b", remote type dummyint, local type dummyint/,
>> 'callback function of datatype conversion1');
>> like($log, qr/processing remote data for replication target relation
>> "public.test" column "a", remote type dummytext, local type
>> dummytext/, 'callback function of datatype conversion2');
>>
>> Thoughts?
>>
>
> Attached an updated test patch that adds the above test (0002 patch). Since
> for this test case it's enough to use the existing test functions, I didn't
> create new ones. Also, I found that the local data type name in the log
> message for data type conversion isn't qualified, whereas the remote
> data type always is. The attached 0001 patch fixes that.
>

The original issue has been fixed and this CommitFest entry has been
marked as "Committed", but there is still work left to improve the
testing. Perhaps I should register a new entry with the remaining patches
in the next CommitFest.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

2018-04-10 Thread Masahiko Sawada
On Wed, Apr 11, 2018 at 1:40 AM, Robert Haas  wrote:
> On Tue, Apr 10, 2018 at 5:40 AM, Masahiko Sawada  
> wrote:
>> The probability of performance degradation can be reduced by
>> increasing N_RELEXTLOCK_ENTS. But as Robert mentioned, while keeping
>> the implementation fast and simple (acquiring the lock with a few atomic
>> operations), it's hard to improve, or even just keep, the current
>> performance in all cases. I was thinking that this patch is needed
>> for parallel DML operations and parallel vacuum, but if the community
>> cannot accept this approach it might be better to mark it as "Rejected",
>> and then I should reconsider the design of parallel vacuum.
>
> I'm sorry that I didn't get time to work further on this during the
> CommitFest.

Never mind. There were a lot of items, especially in the last CommitFest.

> In terms of moving forward, I'd still like to hear what
> Andres has to say about the comments I made on March 1st.

Yeah, agreed.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center



Re: [HACKERS] kqueue

2018-04-10 Thread Thomas Munro
On Wed, Dec 6, 2017 at 12:53 AM, Thomas Munro
 wrote:
> On Thu, Jun 22, 2017 at 7:19 PM, Thomas Munro
>  wrote:
>> I don't plan to resubmit this patch myself, but I was doing some
>> spring cleaning and rebasing today and I figured it might be worth
>> quietly leaving a working patch here just in case anyone from the
>> various BSD communities is interested in taking the idea further.

I heard through the grapevine of some people currently investigating
performance problems on busy FreeBSD systems, possibly related to the
postmaster pipe.  I suspect this patch might be a part of the solution
(other patches probably needed to get maximum value out of this patch:
reuse WaitEventSet objects in some key places, and get rid of high
frequency PostmasterIsAlive() read() calls).  The autoconf-fu in the
last version bit-rotted so it seemed like a good time to post a
rebased patch.

-- 
Thomas Munro
http://www.enterprisedb.com


kqueue-v9.patch
Description: Binary data


Re: Partitioned tables and covering indexes

2018-04-10 Thread Amit Langote
Hi.

On 2018/04/11 0:36, Teodor Sigaev wrote:
>>     Does the attached fix look correct?  Haven't checked the fix with
>> ATTACH
>>     PARTITION though.
>>
>>
>> The attached patch seems to fix the problem.  However, I would rather get
>> rid of modifying stmt->indexParams.  That seems more logical
>> to me.  Also, it would be good to check some covering indexes on
>> partitioned tables.  See the attached patch.
>
> Seems like the right way: do not modify the incoming object and do not copy
> the rather large and deeply nested structure, as suggested by Amit.

Yeah, Alexander's suggested way of using a separate variable for
indexParams is better.

> But it would be better to have an ATTACH PARTITION test too.

I have added tests.  Actually, instead of modifying the existing tests, I
think it might be better to have a separate section at the end of
indexing.sql to test the covering indexes feature for partitioned tables.
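
The new section exercises something along these lines (a rough sketch here;
the table and index names in the actual patch may differ):

-- covering (INCLUDE) index on a partitioned table, plus ATTACH PARTITION
create table covidxpart (a int, b int) partition by list (a);
create unique index on covidxpart (a) include (b);
create table covidxpart1 partition of covidxpart for values in (1);
create table covidxpart2 (a int, b int);
alter table covidxpart attach partition covidxpart2 for values in (2);
insert into covidxpart values (1, 1), (2, 2);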

Attached find updated patch.

Thanks,
Amit
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 860a60d109..ad819177d7 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -342,6 +342,7 @@ DefineIndex(Oid relationId,
Oid tablespaceId;
Oid createdConstraintId = InvalidOid;
List   *indexColNames;
+   List   *allIndexParams;
Relationrel;
RelationindexRelation;
HeapTuple   tuple;
@@ -378,16 +379,16 @@ DefineIndex(Oid relationId,
numberOfKeyAttributes = list_length(stmt->indexParams);
 
/*
-* We append any INCLUDE columns onto the indexParams list so that we 
have
-* one list with all columns.  Later we can determine which of these are
-* key columns, and which are just part of the INCLUDE list by checking
-* the list position.  A list item in a position less than
-* ii_NumIndexKeyAttrs is part of the key columns, and anything equal to
-* and over is part of the INCLUDE columns.
+* Calculate the new list of index columns including both key columns 
and
+* INCLUDE columns.  Later we can determine which of these are key 
columns,
+* and which are just part of the INCLUDE list by checking the list
+* position.  A list item in a position less than ii_NumIndexKeyAttrs is
+* part of the key columns, and anything equal to and over is part of 
the
+* INCLUDE columns.
 */
-   stmt->indexParams = list_concat(stmt->indexParams,
-   
stmt->indexIncludingParams);
-   numberOfAttributes = list_length(stmt->indexParams);
+   allIndexParams = list_concat(list_copy(stmt->indexParams),
+
list_copy(stmt->indexIncludingParams));
+   numberOfAttributes = list_length(allIndexParams);
 
if (numberOfAttributes <= 0)
ereport(ERROR,
@@ -544,7 +545,7 @@ DefineIndex(Oid relationId,
/*
 * Choose the index column names.
 */
-   indexColNames = ChooseIndexColumnNames(stmt->indexParams);
+   indexColNames = ChooseIndexColumnNames(allIndexParams);
 
/*
 * Select name for index if caller didn't specify
@@ -658,7 +659,7 @@ DefineIndex(Oid relationId,
coloptions = (int16 *) palloc(numberOfAttributes * sizeof(int16));
ComputeIndexAttrs(indexInfo,
  typeObjectId, collationObjectId, 
classObjectId,
- coloptions, stmt->indexParams,
+ coloptions, allIndexParams,
  stmt->excludeOpNames, relationId,
  accessMethodName, accessMethodId,
  amcanorder, stmt->isconstraint);
@@ -886,8 +887,8 @@ DefineIndex(Oid relationId,
memcpy(part_oids, partdesc->oids, sizeof(Oid) * nparts);
 
parentDesc = CreateTupleDescCopy(RelationGetDescr(rel));
-   opfamOids = palloc(sizeof(Oid) * numberOfAttributes);
-   for (i = 0; i < numberOfAttributes; i++)
+   opfamOids = palloc(sizeof(Oid) * numberOfKeyAttributes);
+   for (i = 0; i < numberOfKeyAttributes; i++)
opfamOids[i] = 
get_opclass_family(classObjectId[i]);
 
heap_close(rel, NoLock);
diff --git a/src/test/regress/expected/indexing.out 
b/src/test/regress/expected/indexing.out
index 33f68aab71..2c2bf44aa8 100644
--- a/src/test/regress/expected/indexing.out
+++ b/src/test/regress/expected/indexing.out
@@ -1313,3 +1313,27 @@ alter index idxpart2_a_idx attach partition 
idxpart22_a_idx;
 create index on idxpart (a);
 create table idxpart_another (a int, b int, primary key (a, b)) 

Re: Gotchas about pg_verify_checksums

2018-04-10 Thread Peter Geoghegan
On Tue, Apr 10, 2018 at 4:44 PM, Michael Paquier  wrote:
> Peter, the code does the right thing as it requires the instance's
> control file state to be either DB_SHUTDOWNED_IN_RECOVERY or
> DB_SHUTDOWNED.  The documentation, on the contrary, implies that
> the instance just needs to be offline, which can be anything as long as
> the postmaster is stopped.  That's how I understand the current
> wording.

I see. The problem is clearly the documentation, then.

-- 
Peter Geoghegan



Re: [HACKERS] PATCH: Keep one postmaster monitoring pipe per process

2018-04-10 Thread Thomas Munro
On Wed, Apr 11, 2018 at 12:26 PM, Andres Freund  wrote:
> On 2018-04-11 12:17:14 +1200, Thomas Munro wrote:
>> I arrived at this idea via the realisation that the closest thing to
>> prctl(PR_SET_PDEATHSIG) on BSD-family systems today is
>> please-tell-my-kqueue-if-this-process-dies.  It so happens that my
>> kqueue patch already uses that instead of the pipe to detect
>> postmaster death.  The only problem is that you have to ask it, by
>> calling it kevent().  In a busy loop like those two, where there is no
>> other kind of waiting going on, you could do that by calling kevent()
>> with timeout = 0 to check the queue.
>
> Which is not cheap.

Certainly, but I am reacting to reports via you of performance
problems coming from read(heavily_contended_fd) by saying: although we
can't avoid a syscall like Linux can in this case, we *can* avoid the
reportedly contended pipe mutex completely by reading only our own
process's kqueue.  Combined with Heikki's proposal not to check every
single time through busy loops to reduce the syscall frequency, that
might be the optimal solution until some hypothetical
prctl(PR_SET_PDEATHSIG)-feature-match patch arrives.  Just a thought.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: WIP: Covering + unique indexes.

2018-04-10 Thread Peter Geoghegan
On Tue, Apr 10, 2018 at 1:37 PM, Peter Geoghegan  wrote:
> _bt_mark_page_halfdead() looked like it had a problem, but it now
> looks like I was wrong.

I did find another problem, though. Looks like the idea of representing
the number of attributes explicitly has paid off already:

pg@~[3711]=# create table covering_bug (f1 int, f2 int, f3 text);
create unique index cov_idx on covering_bug (f1) include(f2);
insert into covering_bug select i, i * random() * 1000, i * random() *
10 from generate_series(0,10) i;
DEBUG:  building index "pg_toast_16451_index" on table "pg_toast_16451" serially
CREATE TABLE
DEBUG:  building index "cov_idx" on table "covering_bug" serially
CREATE INDEX
ERROR:  tuple has wrong number of attributes in index "cov_idx"

Note that amcheck can detect the issue with the index after the fact, too:

pg@~[3711]=# select bt_index_check('cov_idx');
ERROR:  wrong number of index tuple attributes for index "cov_idx"
DETAIL:  Index tid=(3,2) natts=2 points to index tid=(2,92) page lsn=0/170DC88.

I don't think that the issue is complicated. Looks like we missed a
place where we have to truncate within _bt_split(), located directly
after this comment block:

/*
 * If the page we're splitting is not the rightmost page at its level in
 * the tree, then the first entry on the page is the high key for the
 * page.  We need to copy that to the right half.  Otherwise (meaning the
 * rightmost page case), all the items on the right half will be user
 * data.
 */

I believe that the reason that we didn't find this bug prior to commit
is that we only have a single index tuple with the wrong number of
attributes after an initial root page split through insertions, but
the next root page split masks the problems. Not 100% sure that that's
why we missed it just yet, though.

This bug shouldn't be hard to fix. I'll take care of it as part of
that post-commit review patch I'm working on.

-- 
Peter Geoghegan



Re: [HACKERS] PATCH: Keep one postmaster monitoring pipe per process

2018-04-10 Thread Andres Freund
Hi,

On 2018-04-11 12:17:14 +1200, Thomas Munro wrote:
> I arrived at this idea via the realisation that the closest thing to
> prctl(PR_SET_PDEATHSIG) on BSD-family systems today is
> please-tell-my-kqueue-if-this-process-dies.  It so happens that my
> kqueue patch already uses that instead of the pipe to detect
> postmaster death.  The only problem is that you have to ask it, by
> calling kevent().  In a busy loop like those two, where there is no
> other kind of waiting going on, you could do that by calling kevent()
> with timeout = 0 to check the queue.

Which is not cheap.


> You could probably figure out a way to hide the
> prctl(PR_SET_PDEATHSIG)-based approach inside the WaitEventSet code,
> with a fast path that doesn't make any system calls if the only event
> registered is postmaster death (you can just check the global variable
> set by your signal handler).  But I guess you wouldn't like the extra
> function call so I guess you'd prefer to check the global variable
> directly in the busy loop, in builds that have
> prctl(PR_SET_PDEATHSIG).

Yea, I'd really want this to be a inline function of the style

static inline bool
PostmasterIsAlive(void)
{
if (!postmaster_possibly_dead)
return true;
return PostmasterIsAliveInternal();
}

I do not like the WES alternative because a special cased "just
postmaster death" WES seems to have absolutely no advantage over a
faster PostmasterIsAlive().  And using WES seems to make a cheap check
like the above significantly more complicated.

Greetings,

Andres Freund



Re: Reopen logfile on SIGHUP

2018-04-10 Thread Grigory Smolkin


On 04/11/2018 12:00 AM, Tom Lane wrote:

> Alexander Kuzmenkov  writes:
>> Syslogger does already rotate logs properly on SIGHUP under some
>> conditions, so we can just change this to unconditional rotation.
>> Probably some people wouldn't want their logs to be rotated on SIGHUP,
>> so we could also add a GUC to control this. Please see the attached patch.
>
> I don't believe this meets the "not break other use-cases" requirement.
>
> Point 1: I do not like a solution that presumes that some background
> daemon is going to SIGHUP the postmaster whenever it feels like it.
> That will break scenarios in which the DBA is in the midst of a set
> of related configuration changes (either ALTER SYSTEM commands or
> manual postgresql.conf edits) and doesn't want those changes applied
> till she's done.  So we need a mechanism that's narrowly targeted
> to reopening the logfile, without SIGHUP'ing the entire database.

If the logging collector can reopen its file on SIGUSR1, then maybe there
should be a logging_collector.pid file in PGDATA, so external rotation
tools can get the PID without much trouble?

> Point 2: Depending on how you've got the log filenames configured,
> setting rotation_requested may result in a change in log filename, which
> will be the wrong thing in some use-cases, particularly that of an
> external logrotate daemon that only wishes you'd close and reopen your
> file descriptor.  This is a pre-existing issue with the SIGUSR1 code path,
> which I think hasn't come up only because hardly anybody is using that.
> If we're going to make it mainstream, we need to think harder about how
> that ought to work.

External tools usually rely on the logfile name staying the same.  The PGDG
distribution does it that way, for sure.

> Anastasia's original patch avoided the point-2 pitfall, but didn't
> do anything about point 1.
>
> BTW, another thing that needs to be considered is the interaction with
> rotation_disabled.  Right now we automatically drop that on SIGHUP, but
> I'm unclear on whether it should be different for logrotate requests.
>
> regards, tom lane




--
Grigory Smolkin
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: [HACKERS] PATCH: Keep one postmaster monitoring pipe per process

2018-04-10 Thread Thomas Munro
On Wed, Apr 11, 2018 at 12:03 PM, Andres Freund  wrote:
> On 2018-04-11 11:57:20 +1200, Thomas Munro wrote:
>> Then if pgarch_ArchiverCopyLoop() and HandleStartupProcInterrupts()
>> (ie loops without waiting) adopt a prctl(PR_SET_PDEATHSIG)-based
>> approach where available as suggested by Andres[2] or fall back to
>> polling a reusable WaitEventSet (timeout = 0), then there'd be no more
>> calls to PostmasterIsAlive() outside latch.c.
>
> I'm still unconvinced by this. There's good reasons why code might be
> busy-looping without checking the latch, and we shouldn't force code to
> add more latch checks if unnecessary. Resetting the latch frequently can
> actually increase the amount of context switches considerably.

What latch?  There wouldn't be a latch in a WaitEventSet that is used
only to detect postmaster death.

I arrived at this idea via the realisation that the closest thing to
prctl(PR_SET_PDEATHSIG) on BSD-family systems today is
please-tell-my-kqueue-if-this-process-dies.  It so happens that my
kqueue patch already uses that instead of the pipe to detect
postmaster death.  The only problem is that you have to ask it, by
calling kevent().  In a busy loop like those two, where there is no
other kind of waiting going on, you could do that by calling kevent()
with timeout = 0 to check the queue.

You could probably figure out a way to hide the
prctl(PR_SET_PDEATHSIG)-based approach inside the WaitEventSet code,
with a fast path that doesn't make any system calls if the only event
registered is postmaster death (you can just check the global variable
set by your signal handler).  But I guess you wouldn't like the extra
function call so I guess you'd prefer to check the global variable
directly in the busy loop, in builds that have
prctl(PR_SET_PDEATHSIG).

-- 
Thomas Munro
http://www.enterprisedb.com



Re: pgsql: Validate page level checksums in base backups

2018-04-10 Thread Michael Paquier
On Tue, Apr 10, 2018 at 11:35:21PM +0200, Tomas Vondra wrote:
> BTW pg_verify_checksums needs the same fix.

Yes, you are right here.  Just for the record: what needs to be done is
to check for PageIsNew() in scan_file() after reading each individual
block and before calling pg_checksum_page(), which applies
even to what HEAD has in pg_verify_checksums.c.

I am adding an open item for this one.
--
Michael


signature.asc
Description: PGP signature


Re: [HACKERS] PATCH: Keep one postmaster monitoring pipe per process

2018-04-10 Thread Andres Freund
Hi,

On 2018-04-11 11:57:20 +1200, Thomas Munro wrote:
> Rebased, but I don't actually like this patch any more.  Over in
> another thread[1] I proposed that we should just make exit(1) the
> default behaviour built into latch.c for those cases that don't want
> to do something special (eg SyncRepWaitForLSN()), so we don't finish
> up duplicating the ugly exit(1) code everywhere (or worse, forgetting
> to).  Tom Lane seemed to think that was an idea worth pursuing.
> 
> I think what we need for PG12 is a patch that does that, and also
> introduces a reused WaitEventSet object in several of these places.
> Then eg SyncRepWaitForLSN() won't be interacting with contended kernel
> objects every time (assuming an implementation like epoll or
> eventually kqueue, completion ports etc is available).
> 
> Then if pgarch_ArchiverCopyLoop() and HandleStartupProcInterrupts()
> (ie loops without waiting) adopt a prctl(PR_SET_PDEATHSIG)-based
> approach where available as suggested by Andres[2] or fall back to
> polling a reusable WaitEventSet (timeout = 0), then there'd be no more
> calls to PostmasterIsAlive() outside latch.c.

I'm still unconvinced by this. There's good reasons why code might be
busy-looping without checking the latch, and we shouldn't force code to
add more latch checks if unnecessary. Resetting the latch frequently can
actually increase the amount of context switches considerably.

- Andres



Re: 2018-03 CF Cleanup

2018-04-10 Thread Michael Paquier
On Tue, Apr 10, 2018 at 03:18:53PM -0400, Tom Lane wrote:
> David Steele  writes:
>> OK, I did it that way and closed the CF.
> 
> Yay!  And many thanks for your hard work on this.

Thanks for the cleanup, David!  That's a hard task.
--
Michael


signature.asc
Description: PGP signature


Re: [HACKERS] PATCH: Keep one postmaster monitoring pipe per process

2018-04-10 Thread Thomas Munro
On Tue, Sep 20, 2016 at 11:26 AM, Andres Freund  wrote:
> On 2016-09-20 11:07:03 +1200, Thomas Munro wrote:
>> Yeah, I wondered why that was different than the pattern established
>> elsewhere when I was hacking on replication code.  There are actually
>> several places where we call PostmasterIsAlive() unconditionally in a
>> loop that waits for WL_POSTMASTER_DEATH but ignores the return code:
>
> Note that just changing this implies a behavioural change in at least
> some of those: If the loop is busy with work, the WaitLatch might never
> be reached.  I think that might be ok in most of those, but it does
> require examination.

Rebased, but I don't actually like this patch any more.  Over in
another thread[1] I proposed that we should just make exit(1) the
default behaviour built into latch.c for those cases that don't want
to do something special (eg SyncRepWaitForLSN()), so we don't finish
up duplicating the ugly exit(1) code everywhere (or worse, forgetting
to).  Tom Lane seemed to think that was an idea worth pursuing.

I think what we need for PG12 is a patch that does that, and also
introduces a reused WaitEventSet object in several of these places.
Then eg SyncRepWaitForLSN() won't be interacting with contended kernel
objects every time (assuming an implementation like epoll or
eventually kqueue, completion ports etc is available).

Then if pgarch_ArchiverCopyLoop() and HandleStartupProcInterrupts()
(ie loops without waiting) adopt a prctl(PR_SET_PDEATHSIG)-based
approach where available as suggested by Andres[2] or fall back to
polling a reusable WaitEventSet (timeout = 0), then there'd be no more
calls to PostmasterIsAlive() outside latch.c.

[1] 
https://www.postgresql.org/message-id/CAEepm%3D2LqHzizbe7muD7-2yHUbTOoF7Q%2BqkSD5Q41kuhttRTwA%40mail.gmail.com
[2] 
https://www.postgresql.org/message-id/flat/7261eb39-0369-f2f4-1bb5-62f3b6083b5e%40iki.fi

-- 
Thomas Munro
http://www.enterprisedb.com


0001-Replace-PostmasterIsAlive-calls-with-WL_POSTMASTER_D.patch
Description: Binary data


Re: Gotchas about pg_verify_checksums

2018-04-10 Thread Michael Paquier
On Tue, Apr 10, 2018 at 10:27:19PM +0200, Daniel Gustafsson wrote:
>> On 10 Apr 2018, at 06:21, Michael Paquier  wrote:
> Does it really imply that?  Either way, the tool could potentially be useful
> for debugging a broken cluster so I’m not sure that stating it requires a
> cleanly shut down server is useful.

Torn pages could lead to false positives.  So I think that the tool's
assumptions are right.

> Thinking more on this, I don’t think the -f option should be in the tool until
> we have the ability to turn checksums on/off.  Since checksums are always on,
> or always off, -f is at best confusing IMO.  The attached patch removes -f;
> when we can turn checksums on/off we can rethink how -f should behave.

That's my impression as well, thanks for confirming.  Your patch looks
fine to me.

I am wondering as well if we should not actually rename the tool.  Why
not name it pg_checksums instead of pg_verify_checksums, and add an
--action switch to it which can be used to work on checksums?  The
obvious option to support in v11 is a "verify" mode, but I would imagine
that "disable" and "enable" modes would be useful as well, and all the
APIs are already here to do an in-place update of the
checksums and then switch the control file properly.  We have no idea
at this stage if a patch to enable checksums while the cluster is online
will be able to make it; still, a way to switch checksums while the
cluster is offline is both reliable and easy to implement.  I'm not saying
to do that for v11 of course, but I would like to keep the door open for
v12.
--
Michael


signature.asc
Description: PGP signature


RE: power() function in Windows: "value out of range: underflow"

2018-04-10 Thread Huong Dangminh
Hi,
Thanks for response

# I will add this thread to current CF soon.

> 2018-04-10 5:30 GMT-03:00 Huong Dangminh :
> > There are some cases that power() function does not work correctly
> > with 'NaN' arguments in Windows environment.
> > Something like,
> >
> What is your exact OS version? What is your postgres version? I tested with
> Windows 10 (10.0.16299) x64 and couldn't reproduce the error.

I think this does not depend on the OS version but on the Visual C++ runtime
library version
(which also seems to be included in the installers).

My environment:


- PostgreSQL: 9.6.8
  # I am using the EDB PostgreSQL Installer
  # It also can reproduce in PostgreSQL 10.3 or 9.5.12

  >postgres -V
  postgres (PostgreSQL) 9.6.8

  >psql -c "select version()"
                              version
  ---------------------------------------------------------------
   PostgreSQL 9.6.8, compiled by Visual C++ build 1800, 64-bit
  (1 row)

  >psql -c "select 'NaN'^11"
  ERROR:  value out of range: underflow

- OS: Microsoft Windows 10 Pro 10.0.14393 build 14393

=

> > postgres=# select power('NaN',11);
> > ERROR:  value out of range: underflow
> > postgres=# select power('NaN','NaN');
> > ERROR:  value out of range: underflow
> > postgres=# select power(11,'NaN');
> > ERROR:  value out of range: underflow
> >
> > In Linux environment, instead of ERROR it returns 'NaN'.
> >
> Could you show us a simple test case? Print arguments, result and errno.

I only confirmed this while debugging PostgreSQL in a Windows environment.
# I built PostgreSQL with VC++ 2012 in debug mode and confirmed it there.
I don't know how to print those values.
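
For what it's worth, the Linux result matches POSIX: pow(x, 0) and
pow(1, y) are 1.0 even for NaN inputs, and any other NaN input yields
NaN.  As an illustration only (not a posted patch), handling the NaN
cases explicitly before consulting errno could look like this:

#include <math.h>

/*
 * Illustrative sketch: make the NaN cases explicit so the result does
 * not depend on what errno the platform's pow() happens to set.
 */
double
pow_handling_nan(double x, double y)
{
	if (isnan(x))
		return (y == 0.0) ? 1.0 : NAN;
	if (isnan(y))
		return (x == 1.0) ? 1.0 : NAN;
	return pow(x, y);
}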


---
Thanks and best regards,
Dang Minh Huong
NEC Solution Innovators, Ltd.
http://www.nec-solutioninnovators.co.jp/en/


Re: Gotchas about pg_verify_checksums

2018-04-10 Thread Michael Paquier
On Tue, Apr 10, 2018 at 02:40:58PM -0700, Peter Geoghegan wrote:
> I agree with Michael -- shutting down the server using immediate mode
> could lead to torn pages, that crash recovery will need to repair at a
> later stage. I think that some strong caveats around this are required
> in the pg_verify_checksums docs, at a minimum.

Peter, the code does the right thing as it requires the instance's
control file state to be either DB_SHUTDOWNED_IN_RECOVERY or
DB_SHUTDOWNED.  The documentation, on the contrary, implies that
the instance just needs to be offline, which can be anything as long as
the postmaster is stopped.  That's how I understand the current
wording.
--
Michael


signature.asc
Description: PGP signature


Re: lazy detoasting

2018-04-10 Thread Chapman Flack
On 04/10/2018 05:04 PM, Chapman Flack wrote:
> ... I wonder if
> there's at least a cheap way to check a particular snapshot
> for suitability wrt a given toast pointer. Check a couple usual
> suspects, find one most of the time, fall back to eager detoasting
> otherwise?
> 
> Guess I need to go back for a deeper dive on just what constitutes
> a toast pointer. I was skimming last time

I see all I have in a toast pointer is a relation id and a value
oid, so probably no way to check that a given snapshot 'works' to
find it, at least no more cheaply than by opening the toast relation
and trying to find it.

Welp, what tuptoaster does to construct its SnapshotToast is to
grab GetOldestSnapshot():

*  since we don't know which one to use, just use the oldest one.
*  This is safe: at worst, we will get a "snapshot too old" error
*  that might have been avoided otherwise.

... which means, I take it, that a more recent snapshot
might work, but this conservative choice would only fail
if the oldest snapshot has really been around for many days,
under typical old_snapshot_threshold settings?

... and if I'm getting a value from a portal that happens to have
a non-null holdSnapshot, then that would be the one to use, in
preference to just conservatively grabbing the oldest?

-Chap


... am I right that the init_toast_snapshot construction is really
making only one change to the snapshot data, namely changing the
'satisfies' condition to HeapTupleSatisfiesToast ?

The lsn and whenTaken seem to be fetched from the original data
and stored right back without change.
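

For reference, the construction in question is roughly the following
(a paraphrase of tuptoaster.c's init_toast_snapshot(), assuming the
InitToastSnapshot macro from tqual.h; not a verbatim copy):

#include "postgres.h"
#include "utils/snapmgr.h"
#include "utils/tqual.h"

static void
init_toast_snapshot_sketch(Snapshot toast_snapshot)
{
	/* No way to know which snapshot to use, so take the oldest one. */
	Snapshot	snapshot = GetOldestSnapshot();

	if (snapshot == NULL)
		elog(ERROR, "no known snapshots");

	/*
	 * Only 'satisfies' really changes (to HeapTupleSatisfiesToast);
	 * lsn and whenTaken are carried over from the source snapshot.
	 */
	InitToastSnapshot(*toast_snapshot, snapshot->lsn, snapshot->whenTaken);
}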



Re: pgsql: Validate page level checksums in base backups

2018-04-10 Thread David Steele

On 4/10/18 5:24 PM, Tomas Vondra wrote:


I think there's a bug in sendFile(). We do check checksums on all pages
that pass this LSN check:

 /*
  * Only check pages which have not been modified since the
  * start of the base backup. Otherwise, they might have been
  * written only halfway and the checksum would not be valid.
  * However, replaying WAL would reinstate the correct page in
  * this case.
  */
 if (PageGetLSN(page) < startptr)
 {
 ...
 }

Now, imagine the page is new, i.e. all-zeroes. That means the LSN is 0/0
too, and we'll try to verify the checksum - but we actually do not set
checksums on empty pages.

So I think it should be something like this:

 if ((!PageIsNew(page)) && (PageGetLSN(page) < startptr))
 {
 ...
 }

It might be worth verifying that the page is actually all-zeroes (and
not just has a corrupted pd_upper value).  Not sure it's worth it.

I've found this by fairly trivial stress testing - running pgbench and
pg_basebackup in a loop. It was failing pretty reliably (~75% of runs).
With the proposed change I see no further failures.


Good catch, Tomas!

I should have seen this since I had the same issue when I implemented 
this feature in pgBackRest.


Anyway, I agree that your fix looks correct.

Thanks,
--
-David
da...@pgmasters.net



Re: Gotchas about pg_verify_checksums

2018-04-10 Thread Peter Geoghegan
On Tue, Apr 10, 2018 at 1:27 PM, Daniel Gustafsson  wrote:
>> On 10 Apr 2018, at 06:21, Michael Paquier  wrote:
>> 1) The documentation states that the cluster needs to be offline.
>> Doesn't this imply that the cluster can also be forcibly killed?  It
>> seems to me that the documentation ought to say that the cluster needs
>> to be shut down cleanly instead.  Mentioning that only in the notes of
>> the documentation would be enough in my opinion.
>
> Does it really imply that?  Either way, the tool could potentially be useful
> for debugging a broken cluster so I’m not sure that stating it requires a
> cleanly shut down server is useful.  That being said, you’re absolutely right
> that the current wording isn’t great, I think “The cluster must be shut down
> before running..” would be better.

I agree with Michael -- shutting down the server using immediate mode
could lead to torn pages, that crash recovery will need to repair at a
later stage. I think that some strong caveats around this are required
in the pg_verify_checksums docs, at a minimum.

-- 
Peter Geoghegan



Re: pgsql: Validate page level checksums in base backups

2018-04-10 Thread Tomas Vondra


On 04/10/2018 11:24 PM, Tomas Vondra wrote:
> Hi,
> 
> I think there's a bug in sendFile(). We do check checksums on all pages
> that pass this LSN check:
> 
> /*
>  * Only check pages which have not been modified since the
>  * start of the base backup. Otherwise, they might have been
>  * written only halfway and the checksum would not be valid.
>  * However, replaying WAL would reinstate the correct page in
>  * this case.
>  */
> if (PageGetLSN(page) < startptr)
> {
> ...
> }
> 
> Now, imagine the page is new, i.e. all-zeroes. That means the LSN is 0/0
> too, and we'll try to verify the checksum - but we actually do not set
> checksums on empty pages.
> 
> So I think it should be something like this:
> 
> if ((!PageIsNew(page)) && (PageGetLSN(page) < startptr))
> {
> ...
> }
> 
> It might be worth verifying that the page is actually all-zeroes (and
> not just has a corrupted pd_upper value).  Not sure it's worth it.
> 
> I've found this by fairly trivial stress testing - running pgbench and
> pg_basebackup in a loop. It was failing pretty reliably (~75% of runs).
> With the proposed change I see no further failures.
> 

BTW pg_verify_checksums needs the same fix.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Runtime Partition Pruning

2018-04-10 Thread Alvaro Herrera
Robert Haas wrote:
> On Mon, Apr 9, 2018 at 2:28 PM, Alvaro Herrera  
> wrote:

> >> I don't get this.  The executor surely had to (and did) open all of
> >> the relations somewhere even before this patch.

> > I was worried that this coding could be seen as breaking modularity, or
> > trying to do excessive work.  However, after looking closer at it, it
> > doesn't really look like it's the case.  So, nevermind.
> 
> Well what I'm saying is that it shouldn't be necessary.  If the
> relations are being opened already and the pointers to the relcache
> entries are being saved someplace, you shouldn't need to re-open them
> elsewhere to get pointers to the relcache entries.

I looked a bit more into this.  It turns out that we have indeed opened
the relation before -- first in parserOpenTable (for addRangeTableEntry),
then in expandRTE, then in QueryRewrite, then in subquery_planner, then
in get_relation_info.

So, frankly, since each module thinks it's okay to open it every once in
a while, I'm not sure we should be terribly stressed about doing it once
more for partition pruning.  Particularly since communicating the
pointer seems to be quite troublesome.

To figure out, I used the attached patch (not intended for application)
to add a backtrace to each log message, plus a couple of accusatory
elog() calls in relation_open and ExecSetupPartitionPruneState.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>From 5d682c8cb31f79c79c2ba4cafeb876f10b072cfc Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Tue, 10 Apr 2018 18:30:29 -0300
Subject: [PATCH] errbacktrace

---
 src/backend/utils/error/elog.c  | 59 +
 src/include/postgres_ext.h  |  1 +
 src/include/utils/elog.h|  3 ++
 src/interfaces/libpq/fe-protocol3.c |  3 ++
 4 files changed, 66 insertions(+)

diff --git a/src/backend/utils/error/elog.c b/src/backend/utils/error/elog.c
index 16531f7a0f..e531d0009e 100644
--- a/src/backend/utils/error/elog.c
+++ b/src/backend/utils/error/elog.c
@@ -62,6 +62,7 @@
 #ifdef HAVE_SYSLOG
 #include <syslog.h>
 #endif
+#include <execinfo.h>
 
 #include "access/transam.h"
 #include "access/xact.h"
@@ -493,6 +494,8 @@ errfinish(int dummy,...)
pfree(edata->hint);
if (edata->context)
pfree(edata->context);
+   if (edata->backtrace)
+   pfree(edata->backtrace);
if (edata->schema_name)
pfree(edata->schema_name);
if (edata->table_name)
@@ -811,6 +814,41 @@ errmsg(const char *fmt,...)
	return 0;			/* return value does not matter */
 }
 
+#define BACKTRACE_FRAMES 100
+int
+errbacktrace(void)
+{
+	ErrorData  *edata = &errordata[errordata_stack_depth];
+   MemoryContext oldcontext;
+   void   *buf[BACKTRACE_FRAMES];
+   int nframes;
+   char  **strfrms;
+   StringInfoData errtrace;
+   int i;
+
+   recursion_depth++;
+   CHECK_STACK_DEPTH();
+   oldcontext = MemoryContextSwitchTo(edata->assoc_context);
+
+   nframes = backtrace(buf, BACKTRACE_FRAMES);
+   strfrms = backtrace_symbols(buf, nframes);
+   if (strfrms == NULL)
+   return 0;
+
+	initStringInfo(&errtrace);
+
+   /* the first frame is always errbacktrace itself, so skip it */
+   for (i = 1; i < nframes; i++)
+		appendStringInfo(&errtrace, "%s\n", strfrms[i]);
+   free(strfrms);
+
+   edata->backtrace = errtrace.data;
+
+   MemoryContextSwitchTo(oldcontext);
+   recursion_depth--;
+
+   return 0;
+}
 
 /*
  * errmsg_internal --- add a primary error message text to the current error
@@ -1522,6 +1560,8 @@ CopyErrorData(void)
newedata->hint = pstrdup(newedata->hint);
if (newedata->context)
newedata->context = pstrdup(newedata->context);
+   if (newedata->backtrace)
+   newedata->backtrace = pstrdup(newedata->backtrace);
if (newedata->schema_name)
newedata->schema_name = pstrdup(newedata->schema_name);
if (newedata->table_name)
@@ -1560,6 +1600,8 @@ FreeErrorData(ErrorData *edata)
pfree(edata->hint);
if (edata->context)
pfree(edata->context);
+   if (edata->backtrace)
+   pfree(edata->backtrace);
if (edata->schema_name)
pfree(edata->schema_name);
if (edata->table_name)
@@ -1635,6 +1677,8 @@ ThrowErrorData(ErrorData *edata)
newedata->hint = pstrdup(edata->hint);
if (edata->context)
newedata->context = pstrdup(edata->context);
+   if (edata->backtrace)
+		newedata->backtrace = pstrdup(edata->backtrace);
/* assume message_id is not available */
if (edata->schema_name)

Re: pgsql: Validate page level checksums in base backups

2018-04-10 Thread Tomas Vondra
Hi,

I think there's a bug in sendFile(). We do check checksums on all pages
that pass this LSN check:

/*
 * Only check pages which have not been modified since the
 * start of the base backup. Otherwise, they might have been
 * written only halfway and the checksum would not be valid.
 * However, replaying WAL would reinstate the correct page in
 * this case.
 */
if (PageGetLSN(page) < startptr)
{
...
}

Now, imagine the page is new, i.e. all-zeroes. That means the LSN is 0/0
too, and we'll try to verify the checksum - but we actually do not set
checksums on empty pages.

So I think it should be something like this:

if ((!PageIsNew(page)) && (PageGetLSN(page) < startptr))
{
...
}

It might be worth verifying that the page is actually all-zeroes (and
not just has a corrupted pd_upper value).  Not sure it's worth it.
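
As an illustration only (not part of the proposed fix), such a check
could be a straight scan of the BLCKSZ-sized page buffer:

#include "postgres.h"
#include "storage/bufpage.h"

/*
 * Confirm that a page reported as new by PageIsNew() really is
 * all-zeroes, rather than trusting pd_upper == 0 alone.
 */
static bool
page_is_all_zeroes(const char *page)
{
	const size_t *p = (const size_t *) page;
	int			nwords = BLCKSZ / sizeof(size_t);
	int			i;

	for (i = 0; i < nwords; i++)
	{
		if (p[i] != 0)
			return false;
	}
	return true;
}

A page for which PageIsNew() is true but the buffer is not all-zeroes
could then be reported as corrupted instead of being silently skipped.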

I've found this by fairly trivial stress testing - running pgbench and
pg_basebackup in a loop. It was failing pretty reliably (~75% of runs).
With the proposed change I see no further failures.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: lazy detoasting

2018-04-10 Thread Chapman Flack
On 04/10/2018 05:04 PM, Chapman Flack wrote:

> If I'm a function, and ... I found [the datum] myself, say by an
> SPI query within the function, usually that's at a level of abstraction
> somewhere above what-snapshot-was-used-in-the-scan.

It looks like for that case (since the commit 08e261cbc that Tom
mentioned earlier), there can sometimes be a portal->holdSnapshot
that will be the right one.

-Chap



vacuum_cost_limit doc description patch

2018-04-10 Thread Martín Marqués
Hi,

Today looking for information on hard limits for
autovacuum_vacuum_cost_limit I found myself with a very short
description in the docs.

This is a patch to add some further description, plus the upper and
lower limits it has.

-- 
Martín Marqués
select 'martin.marques' || '@' || 'gmail.com'
DBA, Programador, Administrador
From 3a1aa5de682ee6f6cbd1086845ce845409156186 Mon Sep 17 00:00:00 2001
From: Martín Marqués 
Date: Tue, 10 Apr 2018 18:03:52 -0300
Subject: [PATCH] Describe better vacuum_cost_limit, mentioning the lower and
 upper values it can take.

---
 doc/src/sgml/config.sgml | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 5d5f2d23c4..fe7a6d91df 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1860,8 +1860,9 @@ include_dir 'conf.d'


 
- The accumulated cost that will cause the vacuuming process to sleep.
- The default value is 200.
+ This is the accumulated cost that will cause the vacuuming process to sleep
+ for vacuum_cost_delay. The parameter can take a value between 1 and 10000.
+ The default is 200.
 

   
-- 
2.13.6



Re: lazy detoasting

2018-04-10 Thread Chapman Flack
On 04/10/2018 04:03 PM, Robert Haas wrote:

> I suspect you want, or maybe need, to use the same snapshot as the
> scan that retrieved the tuple containing the toasted datum.

I'm sure it's worth more than that, but I don't know if it's
implementable.

If I'm a function, and the datum came to me as a parameter, I may
have no way to determine what snapshot the enclosing query used to
obtain the thing passed to me. Or, if I found it myself, say by an
SPI query within the function, usually that's at a level of abstraction
somewhere above what-snapshot-was-used-in-the-scan.

But in both cases, it's expected that I could successfully detoast
either datum if I did so right there on the spot, as that's the usual
convention, right? So at that moment, something in the set of
registered || active snapshots is protecting the tuples I need.

If it's impractical to determine which snapshot is needed (or just
enough work to obviate any benefit of lazy detoasting), I wonder if
there's at least a cheap way to check a particular snapshot
for suitability wrt a given toast pointer. Check a couple usual
suspects, find one most of the time, fall back to eager detoasting
otherwise?

Guess I need to go back for a deeper dive on just what constitutes
a toast pointer. I was skimming last time

-Chap



Re: Reopen logfile on SIGHUP

2018-04-10 Thread Tom Lane
Alexander Kuzmenkov  writes:
> Syslogger does already rotate logs properly on SIGHUP under some 
> conditions, so we can just change this to unconditional rotation. 
> Probably some people wouldn't want their logs to be rotated on SIGHUP, 
> so we could also add a GUC to control this. Please see the attached patch.

I don't believe this meets the "not break other use-cases" requirement.

Point 1: I do not like a solution that presumes that some background
daemon is going to SIGHUP the postmaster whenever it feels like it.
That will break scenarios in which the DBA is in the midst of a set
of related configuration changes (either ALTER SYSTEM commands or
manual postgresql.conf edits) and doesn't want those changes applied
till she's done.  So we need a mechanism that's narrowly targeted
to reopening the logfile, without SIGHUP'ing the entire database.

Point 2: Depending on how you've got the log filenames configured,
setting rotation_requested may result in a change in log filename, which
will be the wrong thing in some use-cases, particularly that of an
external logrotate daemon that only wishes you'd close and reopen your
file descriptor.  This is a pre-existing issue with the SIGUSR1 code path,
which I think hasn't come up only because hardly anybody is using that.
If we're going to make it mainstream, we need to think harder about how
that ought to work.

Anastasia's original patch avoided the point-2 pitfall, but didn't
do anything about point 1.

BTW, another thing that needs to be considered is the interaction with
rotation_disabled.  Right now we automatically drop that on SIGHUP, but
I'm unclear on whether it should be different for logrotate requests.

regards, tom lane



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-10 Thread Joshua D. Drake

On 04/10/2018 12:51 PM, Joshua D. Drake wrote:

-hackers,

The thread is picking up over on the ext4 list. They don't update their 
archives as often as we do, so I can't link to the discussion. What 
would be the preferred method of sharing the info?


Thanks to Anthony for this link:

http://lists.openwall.net/linux-ext4/2018/04/10/33

It isn't quite real time but it keeps things close enough.

jD




Thanks,

JD





--
Command Prompt, Inc. || http://the.postgres.company/ || @cmdpromptinc
***  A fault and talent of mine is to tell it exactly how it is.  ***
PostgreSQL centered full stack support, consulting and development.
Advocate: @amplifypostgres || Learn: https://postgresconf.org
* Unless otherwise stated, opinions are my own.   *



Re: Partitioned tables and covering indexes

2018-04-10 Thread Jaime Casanova
On 10 April 2018 at 10:36, Teodor Sigaev  wrote:
>> Does the attached fix look correct?  Haven't checked the fix with
>> ATTACH
>> PARTITION though.
>>
>>
>> Attached patch seems to fix the problem.  However, I would rather get
>> rid of modifying stmt->indexParams.  That seems to be more logical
>> for me.  Also, it would be good to check some covering indexes on
>> partitioned tables.  See the attached patch.
>
> Seems right way, do not modify incoming object and do not copy rather large
> and deep nested structure as suggested by Amit.
>
> But it will  be better to have a ATTACH PARTITION test too.
>

the patch worked for me, i also tried some combinations using ATTACH
PARTITION and found no problems

-- 
Jaime Casanova  www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: WIP: Covering + unique indexes.

2018-04-10 Thread Peter Geoghegan
On Tue, Apr 10, 2018 at 9:03 AM, Teodor Sigaev  wrote:
>> * Not sure that all calls to BTreeInnerTupleGetDownLink() are limited
>> to inner tuples, which might be worth doing something about (perhaps
>> just renaming the macro).
>
> What is suspicious place for you opinion?

_bt_mark_page_halfdead() looked like it had a problem, but it now
looks like I was wrong. I also verified every other
BTreeInnerTupleGetDownLink() caller. It now looks like everything is
good here.

-- 
Peter Geoghegan



Re: Reopen logfile on SIGHUP

2018-04-10 Thread Alexander Kuzmenkov

El 10/04/18 a las 22:40, Robert Haas escribió:



Having said that, I'm not averse to providing a solution if it's robust,
not too invasive and doesn't break other use-cases.  So far we've not
seen a patch that meets those conditions.

Fair enough.



Syslogger does already rotate logs properly on SIGHUP under some 
conditions, so we can just change this to unconditional rotation. 
Probably some people wouldn't want their logs to be rotated on SIGHUP, 
so we could also add a GUC to control this. Please see the attached patch.


--
Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 5d5f2d2..b49711f 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -4912,6 +4912,23 @@ local0.*/var/log/postgresql
   
  
 
+ 
+  log_rotate_on_sighup (boolean)
+  
+   log_rotate_on_sighup configuration parameter
+  
+  
+  
+   
+When logging_collector is enabled,
+this parameter will cause the log file to be rotated when the 
+postgres server receives SIGHUP.
+This parameter can only be set in the postgresql.conf
+file or on the server command line.
+   
+  
+ 
+
  
   log_truncate_on_rotation (boolean)
   
diff --git a/src/backend/postmaster/syslogger.c b/src/backend/postmaster/syslogger.c
index 58b759f..70091cd 100644
--- a/src/backend/postmaster/syslogger.c
+++ b/src/backend/postmaster/syslogger.c
@@ -64,6 +64,7 @@
 bool		Logging_collector = false;
 int			Log_RotationAge = HOURS_PER_DAY * MINS_PER_HOUR;
 int			Log_RotationSize = 10 * 1024;
+bool		Log_rotate_on_sighup;
 char	   *Log_directory = NULL;
 char	   *Log_filename = NULL;
 bool		Log_truncate_on_rotation = false;
@@ -353,6 +354,12 @@ SysLoggerMain(int argc, char *argv[])
 			}
 
 			/*
+			 * Force rotation on SIGHUP if configured to do so.
+			 */
+			if (Log_rotate_on_sighup)
+rotation_requested = true;
+
+			/*
 			 * Force rewriting last log filename when reloading configuration.
 			 * Even if rotation_requested is false, log_destination may have
 			 * been changed and we don't want to wait the next file rotation.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index fa92ce2..924dc0b 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1494,6 +1494,15 @@ static struct config_bool ConfigureNamesBool[] =
 		NULL, NULL, NULL
 	},
 	{
+		{"log_rotate_on_sighup", PGC_SIGHUP, LOGGING_WHERE,
+			gettext_noop("Automatic log file rotation will occur on SIGHUP."),
+			NULL
+		},
+		&Log_rotate_on_sighup,
+		false,
+		NULL, NULL, NULL
+	},
+	{
 		{"log_truncate_on_rotation", PGC_SIGHUP, LOGGING_WHERE,
 			gettext_noop("Truncate existing log files of same name during log rotation."),
 			NULL
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 66d0938..0a8b88c 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -389,6 +389,8 @@
 #log_rotation_size = 10MB		# Automatic rotation of logfiles will
 	# happen after that much log output.
 	# 0 disables.
+#log_rotate_on_sighup = off		# If on, automatic rotation of logfiles
+	# will happen on SIGHUP.
 
 # These are relevant when logging to syslog:
 #syslog_facility = 'LOCAL0'
diff --git a/src/include/postmaster/syslogger.h b/src/include/postmaster/syslogger.h
index b35fadc..3b45dd5 100644
--- a/src/include/postmaster/syslogger.h
+++ b/src/include/postmaster/syslogger.h
@@ -65,6 +65,7 @@ typedef union
 extern bool Logging_collector;
 extern int	Log_RotationAge;
 extern int	Log_RotationSize;
+extern bool Log_rotate_on_sighup;
 extern PGDLLIMPORT char *Log_directory;
 extern PGDLLIMPORT char *Log_filename;
 extern bool Log_truncate_on_rotation;


Re: Gotchas about pg_verify_checksums

2018-04-10 Thread Daniel Gustafsson
> On 10 Apr 2018, at 06:21, Michael Paquier  wrote:

> 1) The documentation states that the cluster needs to be offline.
> Doesn't this imply that the cluster can also be forcibly killed?  It
> seems to me that the documentation ought to say that the cluster needs
> to be shut down cleanly instead.  Mentioning that only in the notes of
> the documentation would be enough in my opinion.

Does it really imply that?  Either way, the tool could potentially be useful
for debugging a broken cluster so I’m not sure that stating it requires a
cleanly shut down server is useful.  That being said, you’re absolutely right
that the current wording isn’t great, I think “The cluster must be shut down
before running..” would be better.

> 2) On a cluster where checksums are disabled, aka the control file says
> so, then using --force does not result in incorrect blocks to be
> reported.  This is caused by the presence of the check on
> PG_DATA_CHECKSUM_VERSION which does not match for a cluster where
> checksums have been disabled.  Wouldn't one want to know this
> information as well to know what are the portions of the data folder not
> treated yet by checksum updates?
> 3) Is the force option actually useful?  I assume that --force is useful
> if one wants to check how checksums are computed if a switch off -> on
> is used to see the progress of the operation or to see how much progress
> has been done after doing an online switch, which is also what a228cc13
> outlines.  Still this requires the cluster to be offline…

Thinking more on this, I don’t think the -f option should be in the tool until
we have the ability to turn on/off checksums.  Since checksums are always on,
or always off, -f is at best confusing IMO.  The attached patch removes -f,
when we can turn checksums on/off we can rethink how -f should behave.

cheers ./daniel



remove_force.patch
Description: Binary data


Re: submake-errcodes

2018-04-10 Thread Tom Lane
Alvaro Herrera  writes:
> Tom Lane wrote:
>> $(OBJS): | submake-generated-headers
>> but I took it out thinking it was no longer needed.
>> 
>> The short-term solution seems to be to put that back, but that's sort
>> of annoying because it means this isn't a bulletproof solution.

> Isn't it possible to put it in some common.mk file rather than each
> individual Makefile?

I thought about putting the above into Makefile.global, but it'd only
work in cases where $(OBJS) gets set before including Makefile.global,
which turns out to be nearly noplace.  So we'd still end up touching
an awful lot of makefiles to make it work, for relatively little
practical benefit.

regards, tom lane



Re: submake-errcodes

2018-04-10 Thread Alvaro Herrera
Tom Lane wrote:

> $(OBJS): | submake-generated-headers
> 
> but I took it out thinking it was no longer needed.
> 
> The short-term solution seems to be to put that back, but that's sort
> of annoying because it means this isn't a bulletproof solution.  It
> will only work for builds started in one of the directories that we
> take the trouble to put this defense into, and I can't see doing that
> everywhere.  Still, such things didn't work reliably before either
> except in these few directories, so maybe it won't matter.

Isn't it possible to put it in some common.mk file rather than each
individual Makefile?

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: lazy detoasting

2018-04-10 Thread Robert Haas
On Tue, Apr 10, 2018 at 11:26 AM, Chapman Flack  wrote:
> Out of the six GetFooSnapshot()s, would I want to squirrel away
> Active? Oldest? Transaction?

I suspect you want, or maybe need, to use the same snapshot as the
scan that retrieved the tuple containing the toasted datum.

(This advice may be worth what you paid for it.)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-10 Thread Joshua D. Drake

-hackers,

The thread is picking up over on the ext4 list. They don't update their 
archives as often as we do, so I can't link to the discussion. What 
would be the preferred method of sharing the info?


Thanks,

JD


--
Command Prompt, Inc. || http://the.postgres.company/ || @cmdpromptinc
***  A fault and talent of mine is to tell it exactly how it is.  ***
PostgreSQL centered full stack support, consulting and development.
Advocate: @amplifypostgres || Learn: https://postgresconf.org
* Unless otherwise stated, opinions are my own.   *



Re: Reopen logfile on SIGHUP

2018-04-10 Thread Robert Haas
On Tue, Apr 10, 2018 at 3:17 PM, Tom Lane  wrote:
> We, as in the core project, are not shipping it.

+1 for what JD said on that subject.

> I'm also unclear
> on why you want to exclude "fix the RPM packaging" as a reasonable
> solution.

Mostly because the complaint was about the *Debian* packaging.  Other
than that, it's possible that that's the way forward.

> It seems likely that some change in that packaging would
> be necessary anyway, as it wouldn't know today about any signaling
> method we might choose to adopt.
>
> Having said that, I'm not averse to providing a solution if it's robust,
> not too invasive and doesn't break other use-cases.  So far we've not
> seen a patch that meets those conditions.

Fair enough.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Reopen logfile on SIGHUP

2018-04-10 Thread Joshua D. Drake

On 04/10/2018 12:17 PM, Tom Lane wrote:

Robert Haas  writes:

On Tue, Feb 27, 2018 at 6:12 PM, Tom Lane  wrote:

IOW, I think a fair response to this is "if you're using logrotate with
Postgres, you're doing it wrong".



Well, the original post says that this is how the PGDG RPMs are doing
it on Debian/Ubuntu.  I wonder if that's due to some Debian/Ubuntu
policy or just a preference on the part of whoever did the packaging
work.  Anyway it's a little hard to argue that the configuration is
insane when we're shipping it.


We, as in the core project, are not shipping it.


Well, yes we are, at least in terms of external perception. The name 
says it all, PGDG RPMs. They are either the official PostgreSQL.Org RPMs 
or they aren't. If they aren't, they shouldn't be called PGDG RPMs, nor 
should they be available from yum.postgresql.org and apt.postgresql.org 
respectively.


Note: I am not advocating the removal of those packages. I am pointing 
out that the core project of PostgreSQL.Org does in fact ship those 
packages, and that is how people see it outside of our email silo.


JD



--
Command Prompt, Inc. || http://the.postgres.company/ || @cmdpromptinc
***  A fault and talent of mine is to tell it exactly how it is.  ***
PostgreSQL centered full stack support, consulting and development.
Advocate: @amplifypostgres || Learn: https://postgresconf.org
* Unless otherwise stated, opinions are my own.   *



Re: 2018-03 CF Cleanup

2018-04-10 Thread Tom Lane
David Steele  writes:
> OK, I did it that way and closed the CF.

Yay!  And many thanks for your hard work on this.

regards, tom lane



Re: Reopen logfile on SIGHUP

2018-04-10 Thread Tom Lane
Robert Haas  writes:
> On Tue, Feb 27, 2018 at 6:12 PM, Tom Lane  wrote:
>> IOW, I think a fair response to this is "if you're using logrotate with
>> Postgres, you're doing it wrong".

> Well, the original post says that this is how the PGDG RPMs are doing
> it on Debian/Ubuntu.  I wonder if that's due to some Debian/Ubuntu
> policy or just a preference on the part of whoever did the packaging
> work.  Anyway it's a little hard to argue that the configuration is
> insane when we're shipping it.

We, as in the core project, are not shipping it.  I'm also unclear
on why you want to exclude "fix the RPM packaging" as a reasonable
solution.  It seems likely that some change in that packaging would
be necessary anyway, as it wouldn't know today about any signaling
method we might choose to adopt.

Having said that, I'm not averse to providing a solution if it's robust,
not too invasive and doesn't break other use-cases.  So far we've not
seen a patch that meets those conditions.

regards, tom lane



Re: submake-errcodes

2018-04-10 Thread Tom Lane
I wrote:
> Hm ... you're cd'ing into src/pl/plpython and issuing "make all"?
> That works for me.
> ... or, wait ... with -j it doesn't.  That's strange, will look.

So after a bit of digging, it seems that the locution

all: submake-generated-headers

doesn't result in ensuring that submake-generated-headers is complete
before we go to build the other targets required by "all"; it only
says that submake-generated-headers must be complete before we execute
the (empty) list of commands attached to the "all" target.

I'd tested high-j runs pretty carefully at top level, but it turns
out that that works because in both the toplevel GNUmakefile and
src/Makefile, all the interesting work happens in recursive sub-makes,
and we force the ordering of those properly with the dependencies on
the recursive make rules:

$(1)-$(2)-recurse: $(if $(filter all install, $(3)), submake-generated-headers) 
$(if $(filter check, $(3)), temp-install)
$$(MAKE) -C $(2) $(3)

If you go to, eg, src/pl/plpython and issue "make -j", there's nothing
to prevent the builds of object files from happening before the header
build finishes.  There *was* something there before:

$(OBJS): | submake-generated-headers

but I took it out thinking it was no longer needed.

The short-term solution seems to be to put that back, but that's sort
of annoying because it means this isn't a bulletproof solution.  It
will only work for builds started in one of the directories that we
take the trouble to put this defense into, and I can't see doing that
everywhere.  Still, such things didn't work reliably before either
except in these few directories, so maybe it won't matter.

regards, tom lane



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-10 Thread Joshua D. Drake

-hackers,

I reached out to the Linux ext4 devs; here is ty...@mit.edu's response:

"""
Hi Joshua,

This isn't actually an ext4 issue, but a long-standing VFS/MM issue.

There are going to be multiple opinions about what the right thing to
do.  I'll try to give as unbiased a description as possible, but
certainly some of this is going to be filtered by my own biases no
matter how careful I can be.

First of all, what storage devices will do when they hit an exception
condition is quite non-deterministic.  For example, the vast majority
of SSD's are not power fail certified.  What this means is that if
they suffer a power drop while they are doing a GC, it is quite
possible for data written six months ago to be lost as a result.  The
LBA could potentialy be far, far away from any LBA's that were
recently written, and there could have been multiple CACHE FLUSH
operations in the since the LBA in question was last written six
months ago.  No matter; for a consumer-grade SSD, it's possible for
that LBA to be trashed after an unexpected power drop.

Which is why after a while, one can get quite paranoid and assume that
the only way you can guarantee data robustness is to store multiple
copies and/or use erasure encoding, with some of the copies or shards
written to geographically diverse data centers.

Secondly, I think it's fair to say that the vast majority of the
companies who require data robustness, and are either willing to pay
$$$ to an enterprise distro company like Red Hat, or command a large
enough paying customer base that they can afford to dictate terms to
an enterprise distro, or hire a consultant such as Christoph, or have
their own staffed Linux kernel teams, have tended to use O_DIRECT.  So
for better or for worse, there has not been as much investment in
buffered I/O and data robustness in the face of exception handling of
storage devices.

Next, the reason why fsync() has the behaviour that it does is that one
of the most common cases of I/O storage errors in buffered use cases,
certainly as seen by the community distros, is the user who pulls out a
USB stick while it is in use.  In that case, if there are
dirtied pages in the page cache, the question is what can you do?
Sooner or later the writes will time out, and if you leave the pages
dirty, then it effectively becomes a permanent memory leak.  You can't
unmount the file system --- that requires writing out all of the pages
such that the dirty bit is turned off.  And if you don't clear the
dirty bit on an I/O error, then they can never be cleaned.  You can't
even re-insert the USB stick; the re-inserted USB stick will get a new
block device.  Worse, when the USB stick was pulled, it will have
suffered a power drop, and see above about what could happen after a
power drop for non-power fail certified flash devices --- it goes
double for the cheap sh*t USB sticks found in the checkout aisle of
Micro Center.

So this is the explanation for why Linux handles I/O errors by
clearing the dirty bit after reporting the error up to user space.
And why there is not eagerness to solve the problem simply by "don't
clear the dirty bit".  For every one Postgres installation that might
have a better recovery after an I/O error, there's probably a thousand 
clueless Fedora and Ubuntu users who will have a much worse user
experience after a USB stick pull happens.

I can think of things that could be done --- for example, it could be
switchable on a per-block device basis (or maybe a per-mount basis)
whether or not the dirty bit gets cleared after the error is reported
to userspace.  And perhaps there could be a new unmount flag that
causes all dirty pages to be wiped out, which could be used to recover
after a permanent loss of the block device.  But the question is who
is going to invest the time to make these changes?  If there is a
company which is willing to pay to commission this work, it's almost
certainly soluble.  Or if a company which has a kernel team on staff is
willing to direct an engineer to work on it, it certainly could be
solved.  But again, of the companies who have client code where we
care about robustness and proper handling of failed disk drives, and
which have a kernel team on staff, pretty much all of the ones I can
think of (e.g., Oracle, Google, etc.) use O_DIRECT and they don't try
to make buffered writes and error reporting via fsync(2) work well.

In general these companies want low-level control over buffer cache
eviction algorithms, which drives them towards the design decision of
effectively implementing the page cache in userspace, and using
O_DIRECT reads/writes.

If you are aware of a company who is willing to pay to have a new
kernel feature implemented to meet your needs, we might be able to
refer you to a company or a consultant who might be able to do that
work.  Let me know off-line if that's the case...

- Ted
"""

--
Command Prompt, Inc. || http://the.postgres.company/ || @cmdpromptinc
***  

Re: 2018-03 CF Cleanup

2018-04-10 Thread David Steele
On 4/10/18 11:24 AM, Alvaro Herrera wrote:
> David Steele wrote:
> 
>> For the moment I have left all bugs in the 2018-03 CF.  I can can add
>> them to the "Older Bugs" section of the PG 11 Open Items but I'm not
>> convinced that is the best way to track them.  If they are added to
>> "Older Bugs", does that mean they get closed?  Or should we just add a
>> link to the CF entry in "Older Bugs"?
> 
> OpenItem's "Older Bugs" section is generally a way to ignore items
> forever (ie., they don't represent a true Open Item for this release.)
> I think it's better to move these patches to the next commitfest
> instead.

OK, I did it that way and closed the CF.

Thanks,
-- 
-David
da...@pgmasters.net



Re: Reopen logfile on SIGHUP

2018-04-10 Thread Robert Haas
On Tue, Feb 27, 2018 at 6:12 PM, Tom Lane  wrote:
> IOW, I think a fair response to this is "if you're using logrotate with
> Postgres, you're doing it wrong".

Well, the original post says that this is how the PGDG RPMs are doing
it on Debian/Ubuntu.  I wonder if that's due to some Debian/Ubuntu
policy or just a preference on the part of whoever did the packaging
work.  Anyway it's a little hard to argue that the configuration is
insane when we're shipping it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [sqlsmith] Segfault in expand_tuple

2018-04-10 Thread Andrew Dunstan


On 04/07/2018 03:28 PM, Andreas Seltenreich wrote:
> Hi,
>
> the following query triggers a segfault for me when run against the
> regression database.  Testing was done with master at 039eb6e92f.
> Backtrace below.
>
[large query]

> Core was generated by `postgres: smith regression [local] SELECT  '.
> Program terminated with signal SIGSEGV, Segmentation fault.
> (gdb) bt
> #0  0x556c14759cb8 in expand_tuple 
> (targetHeapTuple=targetHeapTuple@entry=0x0, 
> targetMinimalTuple=targetMinimalTuple@entry=0x7ffe8088a118, 
> sourceTuple=, tupleDesc=)
> at heaptuple.c:984
> #1  0x556c1475bb46 in minimal_expand_tuple (sourceTuple=, 
> tupleDesc=) at heaptuple.c:1015
> #2  0x556c14917177 in ExecCopySlotMinimalTuple (slot=) at 
> execTuples.c:631
> #3  0x556c14ba8ada in copytup_heap (state=0x556c16c4f5e8, 
> stup=0x7ffe8088a180, tup=) at tuplesort.c:3585
> #4  0x556c14baf8e6 in tuplesort_puttupleslot 
> (state=state@entry=0x556c16c4f5e8, slot=) at tuplesort.c:1444
> #5  0x556c14937791 in ExecSort (pstate=0x556c16c3ac50) at nodeSort.c:112
> #6  0x556c1493c6f4 in ExecProcNode (node=0x556c16c3ac50) at 
> ../../../src/include/executor/executor.h:239
> #7  begin_partition (winstate=winstate@entry=0x556c16c3a6b8) at 
> nodeWindowAgg.c:1110
> #8  0x556c149403aa in ExecWindowAgg (pstate=0x556c16c3a6b8) at 
> nodeWindowAgg.c:2094



Yeah, thanks for the report. An obvious fix is this where we treat any
case where attrmiss is entirely missing as meaning NULL for every
attribute we need to supply:

diff --git a/src/backend/access/common/heaptuple.c
b/src/backend/access/common/heaptuple.c
index c646456..b9802b9 100644
--- a/src/backend/access/common/heaptuple.c
+++ b/src/backend/access/common/heaptuple.c
@@ -981,7 +981,7 @@ expand_tuple(HeapTuple *targetHeapTuple,
 
    Form_pg_attribute attr = TupleDescAttr(tupleDesc, attnum);
 
-   if (attrmiss[attnum].ammissingPresent)
+   if (attrmiss && attrmiss[attnum].ammissingPresent)
    {
    fill_val(attr,
 nullBits ?  : NULL,


although it might be a very small optimization to move the additional
test outside the loop.
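
As an illustration of that micro-optimization, with simplified stand-in
types rather than the real heaptuple.c structures:

#include <stdbool.h>
#include <stddef.h>

typedef struct
{
	bool		ammissingPresent;
} MissingStandIn;

/*
 * Test attrmiss once, outside the per-attribute loop: a NULL array
 * means every added attribute is simply NULL.
 */
static void
fill_missing(const MissingStandIn *attrmiss, int firstmissing, int natts,
			 bool *isnull)
{
	int			attnum;

	if (attrmiss == NULL)
	{
		for (attnum = firstmissing; attnum < natts; attnum++)
			isnull[attnum] = true;
		return;
	}

	for (attnum = firstmissing; attnum < natts; attnum++)
		isnull[attnum] = !attrmiss[attnum].ammissingPresent;
}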


I have cut down the query that generates the failure to this:

select
    ref_0.a as c2,
    pg_catalog.stddev(
  cast((select pg_catalog.sum(float4col) from public.brintest)
as float4))
   over (partition by ref_0.a,ref_0.b,ref_0.c order by ref_0.b)
    as c5
from
  public.mlparted3 as ref_0;

Not sure if we want to produce something more self-contained for the
regression set. Clearly we should add something, as Tom has reminded me,
to make sure the code path is exercised.

cheers

andrew

-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Boolean partitions syntax

2018-04-10 Thread Jonathan S. Katz

> On Apr 10, 2018, at 1:22 PM, Tom Lane  wrote:
> 
> David Rowley  writes:
>> On 11 April 2018 at 03:34, Tom Lane  wrote:
>>> Well, that just begs the question: why do these expressions need to
>>> be immutable?  What we really want, I think, is to evaluate them
>>> and reduce them to constants.  After that, it hardly matters whether
>>> the original expression was volatile.  I see absolutely no moral
>>> difference between "for values in (random())" and cases like
>>> this, which works today:
> 
>> I'd personally be pretty surprised if this worked.
> 
> Well, my point is that we're *already* behaving that way in some cases,
> simply because of the existence of macro-like inputs for some of these
> datatypes.  I'm not sure that users are going to perceive a big difference
> between 'now'::timestamptz and now(), for example.  If we take one but
> not the other, I don't think anybody will see that as a feature.

+1.  Also, one hopes that a user tests their code prior to rolling it out
into a production environment, so a case like the “random()” example
has already been vetted as either not something for their partition, or
they want a one-time randomly generated number to determine how
things are partitioned.

>> What other DDL will execute a volatile function?
> 
> This might be a valid concern, not sure.  It's certainly true that
> most other DDL doesn't result in acquiring a transaction snapshot;
> but it's not *universally* true.  Certainly there's DDL that can
> execute nigh-arbitrary user code, such as CREATE INDEX.
> 
>> What if the volatile function has side
>> effects?
> 
> Can't say that that bothers me.  If the user has thought about what
> they're doing, the results won't surprise them; if they haven't,
> they're likely to be surprised in any case.

+1.  I’m all for protecting users from themselves, but there’s only so
much you can do.  This is where we can make up any knowledge
gap with documentation.

> We might be well advised to do a CCI after evaluating the expressions,
> but that could still be before anything interesting happens.
> 
>> What if the user didn't want the function evaluated and
>> somehow thought they wanted the evaluation to take place on INSERT?
> 
> You could object to awfully large chunks of SQL on the grounds that
> it might confuse somebody.

That's truer than you may think.  

At the end of the day, a user wants to be able to create a partition
with the syntax that they expect from working with other parts of the
database.  If we have clear documentation, e.g. “If you use a volatile
function for a partition, it will only be executed once,” that should be
enough to educate, or at least to say we provided notice about the behavior.

Jonathan


Re: Function to track shmem reinit time

2018-04-10 Thread Robert Haas
On Tue, Apr 10, 2018 at 9:07 AM, David Steele  wrote:
> On 3/29/18 9:40 AM, Tomas Vondra wrote:
>> On 03/28/2018 08:55 PM, David Steele wrote:
>>> I'm setting this entry to Waiting on Author, but based on the discussion
>>> I think it should be Returned with Feedback.
>>
>> Fine with me.
>
> This entry has been marked Returned with Feedback.

I guess I'm in the minority here, but I don't see why it isn't useful
to have something like this. It's a lot easier for a monitoring
process to poll for this kind of thing than it is to monitor the logs.
Not that log monitoring isn't a good idea, but this is pretty
lightweight.

OTOH, it also seems like this could live as an out-of-core extension,
so I guess if nobody else likes the idea Anastasia could always do it
that way.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: submake-errcodes

2018-04-10 Thread Tom Lane
Devrim Gündüz  writes:
> On Tue, 2018-04-10 at 10:01 -0400, Tom Lane wrote:
>> You could replace it with submake-generated-headers, since that's more
>> general, but in principle you shouldn't need anything because that
>> target is invoked automatically as of yesterday.  What's the larger
>> context here --- why do you need any of this?

> However, as Christoph wrote, builds against git master fail:

Hm ... you're cd'ing into src/pl/plpython and issuing "make all"?
That works for me.

... or, wait ... with -j it doesn't.  That's strange, will look.

regards, tom lane



Re: pgsql: Merge catalog/pg_foo_fn.h headers back into pg_foo.h headers.

2018-04-10 Thread Tom Lane
Julien Rouhaud  writes:
> On Tue, Apr 10, 2018 at 6:58 PM, Tom Lane  wrote:
>> none of the three if-guards in the temp-install rule itself should
>> prevent this, so what is preventing it?  I don't see it.

> I just checked, and for the record the second rule (ifneq
> ($(abs_top_builddir),) is actually preventing it.

Ah-hah.  I'd misread that maze of twisty little ifdefs that
sets up abs_top_builddir et al.  Thanks for clarifying!

regards, tom lane



Re: [sqlsmith] Failed assertion in create_gather_path

2018-04-10 Thread Robert Haas
On Tue, Apr 10, 2018 at 2:59 AM, Jeevan Chalke
 wrote:
> I actually wanted to have rel->consider_parallel in the condition (yes, for
> additional safety) as we are adding a partial path into rel. But then
> observed that it is same as that of final_rel->consider_parallel and thus
> used it along with other condition.
>
> I have observed at many places that we do check consider_parallel flag
> before adding a partial path to it. Thus for consistency added here too, but
> yes, it just adds an additional safety here.

Thanks to Andreas for reporting this issue and to Jeevan for fixing
it.  My commit 0927d2f46ddd4cf7d6bf2cc84b3be923e0aedc52 seems clearly
to blame.

The change to set_subquery_pathlist() looks correct, but can we add a
simple test case?  Right now if I keep the new Assert() in
add_partial_path() and leave out the rest of the changes, the
regression tests still pass.  That's not so good.  Also, I think I
would be inclined to wrap the if-statement around the rest of the
function instead of returning early.

The new Assert() in add_partial_path() is an excellent idea.  I had
the same thought before, but I didn't do anything about it.  That was
a bad idea; your plan is better. In fact, I suggest an additional
Assert() that any relation to which we're adding a partial path is
marked consider_parallel, like this:

+/* Path to be added must be parallel safe. */
+Assert(new_path->parallel_safe);
+
+/* Relation should be OK for parallelism, too. */
+Assert(parent_rel->consider_parallel);

Regarding recurse_set_operations, since the rel->consider_parallel is
always the same as final_rel->consider_parallel there's really no
value in testing it.  If it were possible for rel->consider_parallel
to be false even when final_rel->consider_parallel were true then the
test would be necessary for correctness.  That might or might not
happen in the future, so I guess it wouldn't hurt to include this for
future-proofing, but it's not actually a bug fix, so it also wouldn't
hurt to leave it out.  I could go either way, I guess.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: submake-errcodes

2018-04-10 Thread Devrim Gündüz

Hi,

On Tue, 2018-04-10 at 10:01 -0400, Tom Lane wrote:
> You could replace it with submake-generated-headers, since that's more
> general, but in principle you shouldn't need anything because that
> target is invoked automatically as of yesterday.  What's the larger
> context here --- why do you need any of this?

Good question -- IIRC we used it to build PL/Python. Just confirmed that
removing from v10 spec file does not break anything.

However, as Christoph wrote, builds against git master fail:

==
gcc -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement 
-Wendif-labels -Wmissing-format-attribute -Wformat-security 
-fno-strict-aliasing -fwrapv -fexcess-precision=standard -g -O2 -g -pipe -Wall 
-Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -fexceptions 
-fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches 
-specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -m64 -mtune=generic 
-fasynchronous-unwind-tables -fPIC -I. -I. -I/usr/include/python3.6m 
-I../../../src/include  -D_GNU_SOURCE -I/usr/include/libxml2  -I/usr/include  
-c -o plpy_resultobject.o plpy_resultobject.c
In file included from ../../../src/include/postgres.h:47:0,
 from plpy_cursorobject.c:7:
../../../src/include/utils/elog.h:71:10: fatal error: utils/errcodes.h: No such 
file or directory
 #include "utils/errcodes.h"
  ^~
In file included from ../../../src/include/postgres.h:47:0,
 from plpy_procedure.c:7:
../../../src/include/utils/elog.h:71:10: fatal error: utils/errcodes.h: No such 
file or directory
 #include "utils/errcodes.h"
  ^~
compilation terminated.
compilation terminated.
make[1]: *** [: plpy_procedure.o] Error 1
make[1]: *** Waiting for unfinished jobs
make[1]: *** [: plpy_cursorobject.o] Error 1
==

Regards,

-- 
Devrim Gündüz
EnterpriseDB: https://www.enterprisedb.com
PostgreSQL Consultant, Red Hat Certified Engineer
Twitter: @DevrimGunduz , @DevrimGunduzTR

signature.asc
Description: This is a digitally signed message part


Re: pgsql: Merge catalog/pg_foo_fn.h headers back into pg_foo.h headers.

2018-04-10 Thread Julien Rouhaud
On Tue, Apr 10, 2018 at 6:58 PM, Tom Lane  wrote:
>
> Well, the question that's bothering me is how come "make check" in
> an external build doesn't try to execute the temp-install rule before
> printing that error message.  Experimentation shows that it doesn't,
> but it sure looks to me like it should: there's nothing conditional
> about the "check: temp-install" dependency in Makefile.global.  And
> none of the three if-guards in the temp-install rule itself should
> prevent this, so what is preventing it?  I don't see it.

I just checked, and for the record the second rule (ifneq
($(abs_top_builddir),) is actually preventing it.  On top of
Makefile.global:

# Set top_srcdir, srcdir, and VPATH.
ifdef PGXS
top_srcdir = $(top_builddir)
[...]
else # not PGXS
[...]
abs_top_builddir = @abs_top_builddir@
[...]
endif # not PGXS

>
> For this reason, I thought that adding NO_TEMP_INSTALL would be a good
> safety factor in case somebody changes/breaks whatever unobvious
> thing is keeping this from failing, and so I went ahead and did it.

It makes sense.  Thanks for correcting and pushing!



Re: Boolean partitions syntax

2018-04-10 Thread Tom Lane
David Rowley  writes:
> On 11 April 2018 at 03:34, Tom Lane  wrote:
>> Well, that just begs the question: why do these expressions need to
>> be immutable?  What we really want, I think, is to evaluate them
>> and reduce them to constants.  After that, it hardly matters whether
>> the original expression was volatile.  I see absolutely no moral
>> difference between "for values in (random())" and cases like
>> this, which works today:

> I'd personally be pretty surprised if this worked.

Well, my point is that we're *already* behaving that way in some cases,
simply because of the existence of macro-like inputs for some of these
datatypes.  I'm not sure that users are going to perceive a big difference
between 'now'::timestamptz and now(), for example.  If we take one but
not the other, I don't think anybody will see that as a feature.

> What other DDL will execute a volatile function?

This might be a valid concern, not sure.  It's certainly true that
most other DDL doesn't result in acquiring a transaction snapshot;
but it's not *universally* true.  Certainly there's DDL that can
execute nigh-arbitrary user code, such as CREATE INDEX.

> What if the volatile function has side
> effects?

Can't say that that bothers me.  If the user has thought about what
they're doing, the results won't surprise them; if they haven't,
they're likely to be surprised in any case.

We might be well advised to do a CCI after evaluating the expressions,
but that could still be before anything interesting happens.

> What if the user didn't want the function evaluated and
> somehow thought they wanted the evaluation to take place on INSERT?

You could object to awfully large chunks of SQL on the grounds that
it might confuse somebody.

regards, tom lane



Re: pgsql: Merge catalog/pg_foo_fn.h headers back into pg_foo.h headers.

2018-04-10 Thread Tom Lane
Julien Rouhaud  writes:
> On Tue, Apr 10, 2018 at 3:46 PM, Tom Lane  wrote:
>> Hm.  I wonder if we don't also want NO_TEMP_INSTALL, like the doc/src/sgml
>> makefile.  I don't know whether "make check" could be useful in a PGXS
>> build, but certainly that recipe for making a temp install isn't gonna
>> work.

> If I understand correctly, PGXS.mk already forbids a "make check" if
> PGXS is defined.  And it seems that postgres' contribs rely on
> including PGXS.mk, setting NO_PGXS and doing a "make check", so
> NO_TEMP_INSTALL shouldn't be needed.

Well, the question that's bothering me is how come "make check" in
an external build doesn't try to execute the temp-install rule before
printing that error message.  Experimentation shows that it doesn't,
but it sure looks to me like it should: there's nothing conditional
about the "check: temp-install" dependency in Makefile.global.  And
none of the three if-guards in the temp-install rule itself should
prevent this, so what is preventing it?  I don't see it.

For this reason, I thought that adding NO_TEMP_INSTALL would be a good
safety factor in case somebody changes/breaks whatever unobvious
thing is keeping this from failing, and so I went ahead and did it.

regards, tom lane



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-10 Thread Greg Stark
On 10 April 2018 at 02:59, Craig Ringer  wrote:

> Nitpick: In most cases the kernel reserves disk space immediately,
> before returning from write(). NFS seems to be the main exception
> here.

I'm kind of puzzled by this. Surely NFS servers store the data in the
filesystem using write(2) or the in-kernel equivalent? So if the
server is backed by a filesystem where write(2) preallocates space
surely the NFS server must behave as if it's preallocating as well? I
would expect NFS to provide basically the same set of possible
failures as the underlying filesystem (as long as you don't enable
nosync of course).

-- 
greg



Re: pg_dump should use current_database() instead of PQdb()

2018-04-10 Thread Tom Lane
Peter Eisentraut  writes:
> A report from a pgbouncer user revealed that running pg_dump -C/--create
> does not work through a connection proxy if the virtual database name on
> the proxy does not match the real database name on the database server.
> That's because pg_dump looks up the database to be dumped using the
> information from PQdb().  It should be using current_database() instead.
>  (The code was quite likely written before current_database() was
> available (PG 7.3)).

> See attached patch.

Looks reasonable in a quick once-over, but I've not tested it.

regards, tom lane



pg_dump should use current_database() instead of PQdb()

2018-04-10 Thread Peter Eisentraut
A report from a pgbouncer user revealed that running pg_dump -C/--create
does not work through a connection proxy if the virtual database name on
the proxy does not match the real database name on the database server.
That's because pg_dump looks up the database to be dumped using the
information from PQdb().  It should be using current_database() instead.
 (The code was quite likely written before current_database() was
available (PG 7.3)).

See attached patch.

There are a few other uses of PQdb() in pg_dump, but I think those are
OK because they relate to connection information.
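
For illustration, the lookup pattern the patch switches to looks roughly
like this (a simplified sketch of the query dumpDatabase() builds, showing
only a few of the real columns):

-- Ask the server which database the session is actually connected to,
-- instead of trusting the name libpq was handed (a proxy may rewrite it).
SELECT oid, datname,
       pg_encoding_to_char(encoding) AS encoding,
       datcollate, datctype
FROM pg_database
WHERE datname = current_database();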

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From eaccc63b2f9943633f957f80d67acf54ad70098d Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Tue, 10 Apr 2018 12:32:05 -0400
Subject: [PATCH] pg_dump: Use current_database() instead of PQdb()

For querying pg_database about information about the database being
dumped, look up by using current_database() instead of the value
obtained from PQdb().  When using a connection proxy, the value from
PQdb() might not be the real name of the database.
---
 src/bin/pg_dump/pg_dump.c | 52 +--
 1 file changed, 22 insertions(+), 30 deletions(-)

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index d066f4f00b..52698cb3b6 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -251,7 +251,7 @@ static char *convertRegProcReference(Archive *fout,
const char *proc);
 static char *getFormattedOperatorName(Archive *fout, const char *oproid);
 static char *convertTSFunction(Archive *fout, Oid funcOid);
-static Oid findLastBuiltinOid_V71(Archive *fout, const char *);
+static Oid findLastBuiltinOid_V71(Archive *fout);
 static char *getFormattedTypeName(Archive *fout, Oid oid, OidOptions opts);
 static void getBlobs(Archive *fout);
 static void dumpBlob(Archive *fout, BlobInfo *binfo);
@@ -735,8 +735,7 @@ main(int argc, char **argv)
 * With 8.1 and above, we can just use FirstNormalObjectId - 1.
 */
if (fout->remoteVersion < 80100)
-   g_last_builtin_oid = findLastBuiltinOid_V71(fout,
-   
PQdb(GetConnection(fout)));
+   g_last_builtin_oid = findLastBuiltinOid_V71(fout);
else
g_last_builtin_oid = FirstNormalObjectId - 1;
 
@@ -2538,6 +2537,7 @@ dumpDatabase(Archive *fout)
PGresult   *res;
int i_tableoid,
i_oid,
+   i_datname,
i_dba,
i_encoding,
i_collate,
@@ -2565,16 +2565,13 @@ dumpDatabase(Archive *fout)
minmxid;
char   *qdatname;
 
-   datname = PQdb(conn);
-   qdatname = pg_strdup(fmtId(datname));
-
if (g_verbose)
write_msg(NULL, "saving database definition\n");
 
/* Fetch the database-level properties for this database */
if (fout->remoteVersion >= 90600)
{
-   appendPQExpBuffer(dbQry, "SELECT tableoid, oid, "
+   appendPQExpBuffer(dbQry, "SELECT tableoid, oid, datname, "
  "(%s datdba) AS dba, "
  
"pg_encoding_to_char(encoding) AS encoding, "
  "datcollate, datctype, 
datfrozenxid, datminmxid, "
@@ -2591,13 +2588,12 @@ dumpDatabase(Archive *fout)
  "shobj_description(oid, 
'pg_database') AS description "
 
  "FROM pg_database "
- "WHERE datname = ",
+ "WHERE datname = 
current_database()",
  username_subquery);
-   appendStringLiteralAH(dbQry, datname, fout);
}
else if (fout->remoteVersion >= 90300)
{
-   appendPQExpBuffer(dbQry, "SELECT tableoid, oid, "
+   appendPQExpBuffer(dbQry, "SELECT tableoid, oid, datname, "
  "(%s datdba) AS dba, "
  
"pg_encoding_to_char(encoding) AS encoding, "
  "datcollate, datctype, 
datfrozenxid, datminmxid, "
@@ -2606,13 +2602,12 @@ dumpDatabase(Archive *fout)
  "shobj_description(oid, 
'pg_database') AS description "
 
  "FROM pg_database "
-

Re: [HACKERS] Moving relation extension locks out of heavyweight lock manager

2018-04-10 Thread Robert Haas
On Tue, Apr 10, 2018 at 5:40 AM, Masahiko Sawada  wrote:
> The probability of performance degradation can be reduced by
> increasing N_RELEXTLOCK_ENTS. But as Robert mentioned, while keeping
> fast and simple implementation like acquiring lock by a few atomic
> operation it's hard to improve or at least keep the current
> performance on all cases. I was thinking that this patch is necessary
> by parallel DML operations and vacuum but if the community cannot
> accept this approach it might be better to mark it as "Rejected" and
> then I should reconsider the design of parallel vacuum.

I'm sorry that I didn't get time to work further on this during the
CommitFest.  In terms of moving forward, I'd still like to hear what
Andres has to say about the comments I made on March 1st.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Optimization of range queries

2018-04-10 Thread Konstantin Knizhnik



On 09.04.2018 20:05, Teodor Sigaev wrote:

Hi!

12 years ago I proposed a patch which could "union" OR clauses into 
one range clause where possible. At that time pgsql could not use IS 
NULL as an index clause, so the patch doesn't support that


https://www.postgresql.org/message-id/flat/45742C51.9020602%40sigaev.ru

option number 4), all other are already committed.


It seems to be a slightly different optimization.
Attached please find a small patch which extends simplify_and_arguments in 
clauses.c to eliminate redundant checks.
It doesn't perform complete constraint propagation and doesn't use 
predicate_implied_by/predicate_refuted_by, because they seem to be too 
expensive and would essentially increase query optimization time.
Instead, it just does a strict match comparison of predicates, with some 
extra logic for handling negators.


With this patch constructed query plans are optimal:

postgres=# create table foo(x integer primary key, y integer);
CREATE TABLE
postgres=# insert into foo (x) values (generate_series(1,10));
INSERT 0 10
postgres=# insert into foo (x,y) values (generate_series(11,20), 1);
INSERT 0 10
postgres=# vacuum analyze foo;
VACUUM
postgres=# explain select * from foo where not (x < 9 and y is not 
null) and (x <= 11 and y is not null);

 QUERY PLAN

 Index Scan using foo_pkey on foo  (cost=0.42..8.48 rows=2 width=8)
   Index Cond: ((x <= 11) AND (x >= 9))
   Filter: (y IS NOT NULL)
(3 rows)

postgres=# explain select * from foo where  x <= 11 and y is not 
null and not (x < 9 and y is not null);

 QUERY PLAN

 Index Scan using foo_pkey on foo  (cost=0.42..8.48 rows=2 width=8)
   Index Cond: ((x <= 11) AND (x >= 9))
   Filter: (y IS NOT NULL)
(3 rows)






Konstantin Knizhnik wrote:

Hi hackers,

Postgres optimizer is not able to build efficient execution plan for 
the following query:


explain select * from  people_raw where not ("ID"<2068113880 AND 
"INN" is not null) and "ID"<=2068629726 AND "INN" is not null;

  QUERY PLAN
 

  Bitmap Heap Scan on people_raw (cost=74937803.72..210449640.49 
rows=121521030 width=336)

    Recheck Cond: ("ID" <= 2068629726)
    Filter: (("INN" IS NOT NULL) AND (("ID" >= 2068113880) OR ("INN" 
IS NULL)))
    ->  Bitmap Index Scan on "People_pkey" (cost=0.00..74907423.47 
rows=2077021718 width=0)

  Index Cond: ("ID" <= 2068629726)
(5 rows)


Here the table is very large, but query effects only relatively small 
number of rows located in the range: [2068113880,2068629726]

But unfortunately optimizer doesn't take it into the account.
Moreover, using "is not null" and "not null" is both operands of AND 
is not smart:
  (("INN" IS NOT NULL) AND (("ID" >= 2068113880) OR ("INN" IS 
NULL)))


If I remove "is not null" condition, then plan is perfect:

explain select * from  people_raw where not ("ID"<2068113880) and 
"ID"<=2068629726;

  QUERY PLAN
 

  Index Scan using "People_pkey" on people_raw (cost=0.58..196745.57 
rows=586160 width=336)

    Index Cond: (("ID" >= 2068113880) AND ("ID" <= 2068629726))
(2 rows)

Before starting  investigation of the problem, I will like to know 
opinion and may be some advise of people familiar with optimizer:

how difficult will be to handle this case and where to look.

Thanks in advance,





--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/src/backend/optimizer/util/clauses.c b/src/backend/optimizer/util/clauses.c
index ed6b680..488d0e6 100644
--- a/src/backend/optimizer/util/clauses.c
+++ b/src/backend/optimizer/util/clauses.c
@@ -322,6 +322,133 @@ and_clause(Node *clause)
 }
 
 /*
+ * Returns t iff its argument is an 'true' constant
+ */
+static bool
+is_true(Node *clause)
+{
+	return IsA(clause, Const) &&
+		!((Const *) clause)->constisnull &&
+		DatumGetBool(((Const *) clause)->constvalue);
+}
+
+/*
+ * Returns t iff its argument is an 'not' clause: (NOT { expr }).
+ */
+static bool
+is_not(Node *clause)
+{
+	return IsA(clause, BoolExpr) &&
+		((BoolExpr *) clause)->boolop == NOT_EXPR;
+}
+
+static bool is_negator(Node *c1, Node *c2);
+
+
+/*
+ * Returns t iff two expressions are the same, or one of them is a 'not' clause whose argument is the negator of the other expression
+ */
+static bool
+is_equal(Node *c1, Node *c2)
+{
+	if (equal(c1, c2))
+		return true;
+
+	if ((is_not(c1) && is_negator(linitial(((BoolExpr*)c1)->args), c2)) ||
+		(is_not(c2) && is_negator(linitial(((BoolExpr*)c2)->args), c1)))
+		return true;
+
+	return false;
+}
+
+/*
+ 

Re: pgsql: Support partition pruning at execution time

2018-04-10 Thread Robert Haas
On Tue, Apr 10, 2018 at 12:29 PM, Alvaro Herrera
 wrote:
> In contrast, in an indexonly scan you have a single counter and it
> doesn't really matter the distribution of fetches done by workers, so it
> seems okay to aggregate them all in a single counter.  And it being so
> simple, it seems reasonable to me to put it in Instrumentation rather
> than have a dedicated struct.

I don't have a strong opinion on that.  Since we know how many tuples
were processed by each worker, knowing how many heap fetches we have
on a per-worker basis seems like a good thing to have, too.  On the
other hand, maybe EXPLAIN (ANALYZE, VERBOSE) would give us that
anyway, since it knows about displaying per-worker instrumentation
(look for /* Show worker detail */).  If it doesn't, then that's
probably bad, because I'm pretty sure that when I installed that code
the stuff that got displayed for worker instrumentation pretty much
matched the stuff that got displayed for overall instrumentation.
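
For what it's worth, a quick way to check that for any given node type is
to compare the plain and VERBOSE output directly -- just a sketch, with a
placeholder table name:

-- With VERBOSE, parallel-aware nodes print a per-process "Worker N: ..."
-- block; if "Heap Fetches" only ever appears in the plan-wide totals, the
-- per-worker numbers are not being reported back to the leader.
EXPLAIN (ANALYZE, VERBOSE)
SELECT count(*) FROM some_partitioned_table;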

In any case part of the point is that Instrumentation is supposed to
be a generic structure that contains things that are for the most part
common to all node types.  So what MERGE did to that structure looks
like an inadvisable kludge to me.  I'd get rid of that and do it a
different way rather than propagate the idea that nfilteredX is
scratch space that can mean something different in every separate
node.

>> I was pretty much oblivious to this problem during the initial
>> parallel query development and mistakenly assumed that bringing over
>> struct Instrumentation was good enough.  It emerged late in the game
>> that this wasn't really the case, but holding up the whole feature
>> because some nodes might have details not reported on a per-worker
>> basis didn't really seem to make sense.  Whether that was the right
>> call is obviously arguable.
>
> I certainly don't blame you for that.

Thanks.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [PATCH] pg_hba.conf : new auth option : clientcert=verify-full

2018-04-10 Thread Magnus Hagander
On Tue, Apr 10, 2018 at 2:10 PM, Julian Markwort <
julian.markw...@uni-muenster.de> wrote:

> On Fri, 2018-04-06 at 20:31 +0200, Magnus Hagander wrote:
>
> I've been through this one again.
>
> Thanks for taking the time!
>
> There is one big omission from it -- it fails to work with the view
> pg_hba_file_rules. When fixing that, things started to look kind of ugly
> with the "two booleans to indicate one thing", so I went ahead and changed
> it to instead be an enum of 3 values.
>
> Oh, I missed that view. To be honest, it wasn't even on my mind that there
> could be a view depending on pg_hba
>
> Also, now when using verify-full or verify-ca, I figured its a lot easier
> to misspell the value, and we don't want that to mean "off". Instead, I
> made it give an error in this case instead of silently treating it as 0.
>
> Good catch!
>
> The option "2" for clientcert was never actually documented, and I think
> we should get rid of that. We need to keep "1" for backwards compatibility
> (which arguably was a bad choice originally, but that was many years ago),
> but let's not add another one.
> I also made a couple of very small changes to the docs.
>
> Attached is an updated patch with these changes. I'd appreciate it if you
> can run it through your tests to confirm that it didn't break any of those
> usecases.
>
> I've tested a couple of things with this and it seems to work as expected.
> Unforunately, there are no tests for libpq, afaik. But testing such
> features would become complicated quite quickly, with the need to generate
> certificates and such...
>

As Peter mentioned, there are some in src/test/ssl. I forgot about those, but
yes, it would be useful to have that.



I've noticed that the error message for mismatching CN is now printed twice
> when using password prompts, as all authentication details are checked upon
> initiation of a connection and again later, after sending the password.
>

That depends on your authentication method. Specifically for passwords,
what happens is there are actually two separate connections -- the first
one with no password supplied, and the second one with a password in it.

We could directly reject the connection after the first one and not ask for
a password. I'm not sure if that's better though -- that would leak the
information on *why* we rejected the connection.


2018-04-10 13:54:19.882 CEST [8735] LOG:  provided user name (testuser) and
> authenticated user name (nottestuser) do not match
> 2018-04-10 13:54:19.882 CEST [8735] LOG:  certificate validation failed
> for user "testuser": common name in client certificate does not match user
> name or mapping, but clientcert=verify-full is enabled.
> 2018-04-10 13:54:21.515 CEST [8736] LOG:  provided user name (testuser)
> and authenticated user name (nottestuser) do not match
> 2018-04-10 13:54:21.515 CEST [8736] LOG:  certificate validation failed
> for user "testuser": common name in client certificate does not match user
> name or mapping, but clientcert=verify-full is enabled.
> 2018-04-10 13:54:21.515 CEST [8736] FATAL:  password authentication failed
> for user "testuser"
> 2018-04-10 13:54:21.515 CEST [8736] DETAIL:  Connection matched
> pg_hba.conf line 94: "hostssl all testuser 127.0.0.1/32 password
> clientcert=verify-full"
>
>
> Also, as you can see, there are two LOG messages indicating the mismatch
> -- "provided user name (testuser) and authenticated user name (nottestuser)
> do not match" comes from hba.c:check_usermap() and "certificate validation
> failed for user "testuser": common name in client certificate does not
> match user name or mapping, but clientcert=verify-full is enabled." is the
> message added in auth.c:CheckCertAuth() by the patch.
> Maybe we should shorten the latter LOG message?
>


It might definitely be worth shortening it, yeah. For one, we can just use
"cn" :)


-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: Boolean partitions syntax

2018-04-10 Thread David Rowley
On 11 April 2018 at 03:34, Tom Lane  wrote:
> David Rowley  writes:
>> I imagined this would have had a check for volatile functions and some
>> user-friendly error message to say partition bounds must be immutable,
>> but instead, it does:
>
>> postgres=# create table d_p1 partition of d for values in (Random());
>> ERROR:  specified value cannot be cast to type double precision for column 
>> "d"
>> LINE 1: create table d_p1 partition of d for values in (Random());
>> ^
>> DETAIL:  The cast requires a non-immutable conversion.
>> HINT:  Try putting the literal value in single quotes.
>
>> For inspiration, maybe you could follow the lead of CREATE INDEX:
>
>> postgres=# create index on d ((random()));
>> ERROR:  functions in index expression must be marked IMMUTABLE
>
> Well, that just begs the question: why do these expressions need to
> be immutable?  What we really want, I think, is to evaluate them
> and reduce them to constants.  After that, it hardly matters whether
> the original expression was volatile.  I see absolutely no moral
> difference between "for values in (random())" and cases like
> this, which works today:

I'd personally be pretty surprised if this worked. What other DDL will
execute a volatile function? What if the volatile function has side
effects? What if the user didn't want the function evaluated and
somehow thought they wanted the evaluation to take place on INSERT?

I imagine if someone does this then they're probably doing something
wrong, and we should tell them, rather than perhaps silently doing
something they don't want. Perhaps someone might think they can
randomly distribute records into a set of partitions with something
like: for values in(((random() * 1000)::int between 0 and 100)), they
might be surprised when all their records end up in the same (random)
partition.

If we did this, then it seems like we're more likely to live to regret
doing it, rather than regret not doing it.  If someone comes along and
gives us some valid use case in the future, then maybe it can be
considered then. I just can't imagine what that use case would be...

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: pgsql: Support partition pruning at execution time

2018-04-10 Thread Alvaro Herrera
Robert Haas wrote:
> On Tue, Apr 10, 2018 at 11:14 AM, Alvaro Herrera
>  wrote:
> > Questions:
> > 1. Do we want to back-patch this to 10?  I suppose (without checking)
> > that EXPLAIN ANALYZE is already reporting bogus numbers for parallel
> > index-only scans, so I think we should do that.
> 
> I haven't looked at this closely, but have you considered adding
> bespoke code for IndexOnlyScan that works like
> ExecSortRetrieveInstrumentation and ExecHashRetrieveInstrumentation
> already do rather than jamming this into struct Instrumentation?

Thanks for pointing these out -- I hadn't come across these.

My initial impression is that those two are about transferring
instrumentation state that's quite a bit more complicated than what my
patch proposes for indexonly scan, which is just a single integer
counter.  For example, in the case of Sort, for each worker we display
separately the method and type and amount of memory.  In the hash case,
we aggregate all of them together for some reason (though I'm not clear
about this one:
/*
 * In a parallel-aware hash join, for now we report the
 * maximum peak memory reported by any worker.
 */
hinstrument.space_peak =
Max(hinstrument.space_peak, worker_hi->space_peak);
 -- why shouldn't we sum these values?)

In contrast, in an indexonly scan you have a single counter and it
doesn't really matter the distribution of fetches done by workers, so it
seems okay to aggregate them all in a single counter.  And it being so
simple, it seems reasonable to me to put it in Instrumentation rather
than have a dedicated struct.

> I'm inclined to view any node-specific instrumentation that's not
> being pulled back to the leader as a rough edge to be filed down when
> it bothers somebody more than an outright bug, but perhaps that is an
> unduly lenient view.  Still, if we take the view that it's an outright
> bug, I suspect we find that there may be at least a few more of those.

OK.  As Tom indicated, it's not possible to backpatch this anyway.
Given that nobody has complained to date, it seems okay to be lenient
about this.  Seeking a backpatchable solution seems more trouble than it's
worth.

> I was pretty much oblivious to this problem during the initial
> parallel query development and mistakenly assumed that bringing over
> struct Instrumentation was good enough.  It emerged late in the game
> that this wasn't really the case, but holding up the whole feature
> because some nodes might have details not reported on a per-worker
> basis didn't really seem to make sense.  Whether that was the right
> call is obviously arguable.

I certainly don't blame you for that.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Online enabling of checksums

2018-04-10 Thread Robert Haas
On Fri, Apr 6, 2018 at 8:59 PM, Andres Freund  wrote:
> This is PROPARALLEL_RESTRICTED. That doesn't strike me right, shouldn't
> they be PROPARALLEL_UNSAFE? It might be fine, but I'd not want to rely
> on it.

Just a fine-grained note on this particular point:

It's totally fine for parallel-restricted operations to write WAL,
write to the filesystem, or launch nukes at ${ENEMY_NATION}.  Well, I
mean, the last one might be a bad idea for geopolitical reasons, but
it's not a problem for parallel query.  It is a problem to insert or
update heap tuples because it might extend the relation; mutual
exclusion doesn't work properly there yet (there was a patch to fix
that, but you had some concerns and it didn't go in).  It is a problem
to update or delete heap tuples which might create new combo CIDs; not
all workers will have the same view (there's no patch for this yet
AFAIK, but the fix probably doesn't look that different from
cc5f81366c36b3dd8f02bd9be1cf75b2cc8482bd and could probably use most
of the same infrastructure).

TL;DR: Writing pages (e.g. to set a checksum) doesn't make something
non-parallel-safe.  Writing heap tuples makes it parallel-unsafe.
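
As a minimal sketch of what those labels look like at the SQL level (the
table and function names are made up for illustration, and the bodies are
placeholders rather than anything to do with checksums):

-- Writes heap tuples, so it must be PARALLEL UNSAFE:
CREATE TABLE demo_log (msg text);
CREATE FUNCTION demo_insert(m text) RETURNS text
    LANGUAGE sql PARALLEL UNSAFE
    AS 'INSERT INTO demo_log VALUES (m) RETURNING msg';

-- An operation that writes files or WAL but no heap tuples could instead
-- be declared PARALLEL RESTRICTED, i.e. allowed in the leader only
-- (the trivial body here is only a stand-in):
CREATE FUNCTION demo_leader_only() RETURNS text
    LANGUAGE sql PARALLEL RESTRICTED
    AS 'SELECT pg_current_wal_lsn()::text';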

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: pgsql: Merge catalog/pg_foo_fn.h headers back into pg_foo.h headers.

2018-04-10 Thread Julien Rouhaud
On Tue, Apr 10, 2018 at 3:46 PM, Tom Lane  wrote:
>
> Julien Rouhaud  writes:
>> I think the best fix if to define NO_GENERATED_HEADERS in pgxs.mk,
>> patch attached.
>
> Hm.  I wonder if we don't also want NO_TEMP_INSTALL, like the doc/src/sgml
> makefile.  I don't know whether "make check" could be useful in a PGXS
> build, but certainly that recipe for making a temp install isn't gonna
> work.

If I understand correctly, PGXS.mk already forbids a "make check" if
PGXS is defined.  And it seems that postgres' contribs rely on
including PGXS.mk, setting NO_PGXS and doing a "make check", so
NO_TEMP_INSTALL shouldn't be needed.



Re: WIP: Covering + unique indexes.

2018-04-10 Thread Teodor Sigaev

* There is no pfree() within _bt_buildadd() for truncated tuples, even
though that's a context where it's clearly not okay.

Agree



* It might be a good idea to also pfree() the truncated tuple for most
other _bt_buildadd() callers. Even though it's arguably okay in other
cases, it seems worth being consistent about it (consistent with old
nbtree code).

Seems, I don't see other calls to pfree after.


* There should probably be some documentation around why it's okay
that we call index_truncate_tuple() with an exclusive buffer lock held
(during a page split). For example, there should probably be a comment
on the VARATT_IS_EXTERNAL() situation.

I have no objection to improving the docs/comments.



* Not sure that all calls to BTreeInnerTupleGetDownLink() are limited
to inner tuples, which might be worth doing something about (perhaps
just renaming the macro).

Which place looks suspicious, in your opinion?



I do not have the time to write a patch right away, but I should be
able to post one in a few days. I want to avoid sending several small
patches.

no problem, we can wait


--
Teodor Sigaev   E-mail: teo...@sigaev.ru
   WWW: http://www.sigaev.ru/



Re: pgsql: Support partition pruning at execution time

2018-04-10 Thread Robert Haas
On Tue, Apr 10, 2018 at 11:14 AM, Alvaro Herrera
 wrote:
> Questions:
> 1. Do we want to back-patch this to 10?  I suppose (without checking)
> that EXPLAIN ANALYZE is already reporting bogus numbers for parallel
> index-only scans, so I think we should do that.

I haven't looked at this closely, but have you considered adding
bespoke code for IndexOnlyScan that works like
ExecSortRetrieveInstrumentation and ExecHashRetrieveInstrumentation
already do rather than jamming this into struct Instrumentation?

I'm inclined to view any node-specific instrumentation that's not
being pulled back to the leader as a rough edge to be filed down when
it bothers somebody more than an outright bug, but perhaps that is an
unduly lenient view.  Still, if we take the view that it's an outright
bug, I suspect we find that there may be at least a few more of those.
I was pretty much oblivious to this problem during the initial
parallel query development and mistakenly assumed that bringing over
struct Instrumentation was good enough.  It emerged late in the game
that this wasn't really the case, but holding up the whole feature
because some nodes might have details not reported on a per-worker
basis didn't really seem to make sense.  Whether that was the right
call is obviously arguable.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: pgsql: Support partition pruning at execution time

2018-04-10 Thread David Rowley
On 11 April 2018 at 03:42, Tom Lane  wrote:
> Alvaro Herrera  writes:
>> 2. Do we want to revert Andrew's test stabilization patch?  If I
>> understand correctly, the problem is the inverse of what was diagnosed:
>> "any running transaction at the time of the test could prevent pages
>> from being set as all-visible".  That's correct, but the test doesn't
>> depend on pages being all-visible -- quite the contrary, it depends on
>> the pages NOT being all-visible (which is why the HeapFetches counts are
>> all non-zero).  Since the pages contain very few tuples, autovacuum
>> should never process the tables anyway.
>
> I did not especially like the original test output, because even without
> the bug at hand, it seemed to me the number of heap fetches might vary
> depending on BLCKSZ.  Given that the point of the test is just to check
> partition pruning, seems like IOS vs regular indexscan isn't a critical
> difference.  I'd keep Andrew's change but fix the comment.

hmm, I don't feel strongly about reverting or not, but if
[auto-]vacuum has not visited the table, then I don't see why BLCKSZ
would have an effect here.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: pgsql: Support partition pruning at execution time

2018-04-10 Thread David Rowley
On 11 April 2018 at 03:14, Alvaro Herrera  wrote:
> 2. Do we want to revert Andrew's test stabilization patch?  If I
> understand correctly, the problem is the inverse of what was diagnosed:
> "any running transaction at the time of the test could prevent pages
> from being set as all-visible".  That's correct, but the test doesn't
> depend on pages being all-visible -- quite the contrary, it depends on
> the pages NOT being all-visible (which is why the HeapFetches counts are
> all non-zero).  Since the pages contain very few tuples, autovacuum
> should never process the tables anyway.

I think it's probably a good idea to revert it once the
instrumentation is working correctly. It appears this found a bug in
that code, so is probably useful to keep just in case something else
breaks it in the future.

I don't think there is too much risk of instability from other
sources. There's no reason an auto-vacuum would trigger and cause a
change in heap fetches. We only delete one row from lprt_a, that's not
going to trigger an auto-vacuum.
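
One cheap way to double-check that, just as a sketch against the cumulative
stats view:

-- autovacuum is driven by n_dead_tup versus the threshold/scale factor, so
-- a single deleted row in a tiny table should never trip it, and
-- last_autovacuum should stay NULL for the duration of the test.
SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
WHERE relname = 'lprt_a';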

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: pgsql: Support partition pruning at execution time

2018-04-10 Thread Tom Lane
Alvaro Herrera  writes:
> Questions:
> 1. Do we want to back-patch this to 10?  I suppose (without checking)
> that EXPLAIN ANALYZE is already reporting bogus numbers for parallel
> index-only scans, so I think we should do that.

You can't back-patch a change in struct Instrumentation; that'd be
a ABI break.  Maybe there are no third parties directly touching that
struct, but I wouldn't bet on it.

> 2. Do we want to revert Andrew's test stabilization patch?  If I
> understand correctly, the problem is the inverse of what was diagnosed:
> "any running transaction at the time of the test could prevent pages
> from being set as all-visible".  That's correct, but the test doesn't
> depend on pages being all-visible -- quite the contrary, it depends on
> the pages NOT being all-visible (which is why the HeapFetches counts are
> all non-zero).  Since the pages contain very few tuples, autovacuum
> should never process the tables anyway.

I did not especially like the original test output, because even without
the bug at hand, it seemed to me the number of heap fetches might vary
depending on BLCKSZ.  Given that the point of the test is just to check
partition pruning, seems like IOS vs regular indexscan isn't a critical
difference.  I'd keep Andrew's change but fix the comment.
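
If the heap-fetch numbers ever need to be kept out of the expected output
again, one blunt, BLCKSZ-independent option -- offered only as a sketch,
not necessarily what Andrew's patch does -- is to pin the plan shape in the
test itself:

SET enable_indexonlyscan = off;  -- a plain index scan prints no "Heap Fetches"
EXPLAIN (ANALYZE, COSTS OFF, SUMMARY OFF, TIMING OFF)
  SELECT * FROM some_partitioned_table WHERE a = 1;
RESET enable_indexonlyscan;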

regards, tom lane



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-10 Thread Anthony Iliopoulos
Hi Robert,

On Tue, Apr 10, 2018 at 11:15:46AM -0400, Robert Haas wrote:
> On Mon, Apr 9, 2018 at 3:13 PM, Andres Freund  wrote:
> > Let's lower the pitchforks a bit here.  Obviously a grand rewrite is
> > absurd, as is some of the proposed ways this is all supposed to
> > work. But I think the case we're discussing is much closer to a near
> > irresolvable corner case than anything else.
> 
> Well, I admit that I wasn't entirely serious about that email, but I
> wasn't entirely not-serious either.  If you can't reliably find
> out whether the contents of the file on disk are the same as the
> contents that the kernel is giving you when you call read(), then you
> are going to have a heck of a time building a reliable system.  If the
> kernel developers are determined to insist on these semantics (and,
> admittedly, I don't know whether that's the case - I've only read
> Anthony's remarks), then I don't really see what we can do except give
> up on buffered I/O (or on Linux).

I think it would be interesting to get in touch with some of the
respective linux kernel maintainers and open up this topic for
more detailed discussions. LSF/MM'18 is upcoming and it would
have been the perfect opportunity but it's past the CFP deadline.
It may still be worth contacting the organizers to bring forward
the issue, and see if there is a chance to have someone from
Pg invited for further discussions.

Best regards,
Anthony



Re: Partitioned tables and covering indexes

2018-04-10 Thread Teodor Sigaev

Does the attached fix look correct?  Haven't checked the fix with ATTACH
PARTITION though.


Attached patch seems to fix the problem.  However, I would rather get
rid of modifying stmt->indexParams.  That seems to be more logical
for me.  Also, it would be good to check some covering indexes on
partitioned tables.  See the attached patch.
Seems the right way: do not modify the incoming object and do not copy a 
rather large and deeply nested structure, as suggested by Amit.


But it will be better to have an ATTACH PARTITION test too.

--
Teodor Sigaev   E-mail: teo...@sigaev.ru
   WWW: http://www.sigaev.ru/



Re: Boolean partitions syntax

2018-04-10 Thread Tom Lane
David Rowley  writes:
> I imagined this would have had a check for volatile functions and some
> user-friendly error message to say partition bounds must be immutable,
> but instead, it does:

> postgres=# create table d_p1 partition of d for values in (Random());
> ERROR:  specified value cannot be cast to type double precision for column "d"
> LINE 1: create table d_p1 partition of d for values in (Random());
> ^
> DETAIL:  The cast requires a non-immutable conversion.
> HINT:  Try putting the literal value in single quotes.

> For inspiration, maybe you could follow the lead of CREATE INDEX:

> postgres=# create index on d ((random()));
> ERROR:  functions in index expression must be marked IMMUTABLE

Well, that just begs the question: why do these expressions need to
be immutable?  What we really want, I think, is to evaluate them
and reduce them to constants.  After that, it hardly matters whether
the original expression was volatile.  I see absolutely no moral
difference between "for values in (random())" and cases like
this, which works today:

regression=# create table pp(d1 date) partition by range(d1);
CREATE TABLE
regression=# create table cc partition of pp for values from ('today') to 
('tomorrow');
CREATE TABLE
regression=# \d+ cc
   Table "public.cc"
 Column | Type | Collation | Nullable | Default | Storage | Stats target | Descr
iption 
+--+---+--+-+-+--+--
---
 d1 | date |   |  | | plain   |  | 
Partition of: pp FOR VALUES FROM ('2018-04-10') TO ('2018-04-11')
Partition constraint: ((d1 IS NOT NULL) AND (d1 >= '2018-04-10'::date) AND (d1 <
 '2018-04-11'::date))

If we're willing to reduce 'today'::date to a fixed constant,
why not random()?

regards, tom lane



Including SQL files in extension scripts

2018-04-10 Thread Jeremy Finzel
In writing extension update scripts, I find it to be really difficult to
grok diffs for example in changed view or function definitions when a new
extension script has to include the whole definition in a new file.  I want
to rather use separate files for these objects, then use something like
psql's \i to include the definitions so that version control (and
management of these files) becomes easier, but this does not appear to be
supported.  Does anyone have any recommendations here?

Currently, I often simply opt to create a script that builds out the
extension SQL files from separated sql files so that I can more easily
maintain them.  But I was hoping there is a better way.  Any ideas would be
much appreciated.

Thanks,
Jeremy


Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-10 Thread Robert Haas
On Tue, Apr 10, 2018 at 1:37 AM, Craig Ringer  wrote:
> ... but *only if they hit an I/O error* or they're on a FS that
> doesn't reserve space and hit ENOSPC.
>
> It still does 99% of the job. It still flushes all buffers to
> persistent storage and maintains write ordering. It may not detect and
> report failures to the user how we'd expect it to, yes, and that's not
> great. But it's hardly throw up our hands and give up territory
> either. Also, at least for initdb, we can make initdb fsync() its own
> files before close(). Annoying but hardly the end of the world.

I think we'd need every child postgres process started by initdb to do
that individually, which I suspect would slow down initdb quite a lot.
Now admittedly for anybody other than a PostgreSQL developer that's
only a minor issue, and our regression tests mostly run with fsync=off
anyway.  But I have a strong suspicion that our assumptions about how
fsync() reports errors are baked into an awful lot of parts of the
system, and by the time we get unbaking them I think it's going to be
really surprising if we haven't done real harm to overall system
performance.

BTW, I took a look at the MariaDB source code to see whether they've
got this problem too and it sure looks like they do.
os_file_fsync_posix() retries the fsync in a loop with an 0.2 second
sleep after each retry.  It warns after 100 failures and fails an
assertion after 1000 failures.  It is hard to understand why they
would have written the code this way unless they expect errors
reported by fsync() to continue being reported until the underlying
condition is corrected.  But, it looks like they wouldn't have the
problem that we do with trying to reopen files to fsync() them later
-- I spot checked a few places where this code is invoked and in all
of those it looks like the file is already expected to be open.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: crash with sql language partition support function

2018-04-10 Thread Tom Lane
Alvaro Herrera  writes:
> Ashutosh Bapat wrote:
>> On Tue, Apr 10, 2018 at 1:44 PM, Amit Langote
>>  wrote:
>>> Attached fixes it.  It teaches RelationBuildPartitionKey() to use
>>> fmgr_info_cxt and pass rd_partkeycxt to it.

>> The patch is using partkeycxt and not rd_partkeycxt. Probably a typo
>> in the mail.

No, it's correct as written, because rd_partkeycxt hasn't been set yet.

> Good point.  Yeah, it looks like it should cause problems.  I think it
> would be better to have RelationGetPartitionKey() return a copy in the
> caller's context of the data of the relcache data, rather than the
> relcache data itself, no?  Seems like this would not fail if there never
> is a CCI between the RelationGetPartitionKey call and the last access of
> the returned key struct ... but if this is the explanation, then it's a
> pretty brittle setup and we should stop relying on that.

Yeah, all of the callers of RelationGetPartitionKey seem to assume that
the pointer they get is guaranteed valid and will stay so forever.
This is pretty dangerous, independently of the bug Amit mentions.

However, I'm not sure that copy-on-read is a good solution here, because
of exactly the point at stake that the FmgrInfos may have infrastructure.
We don't have a way to copy that, and even if we did, copying on every
reference would be really expensive.

We might try to make this work like the relcache infrastructure for
indexes, which also contains FmgrInfos.  However, in the index case
we may safely assume that that stuff never changes for the life of the
index.  I'm afraid that's not true for the partitioning data is it?

How much does it actually buy us to store FmgrInfos here rather than,
say, function OIDs?  If we backed off to that, then the data structure
might be simple enough that copy-on-read would work.

Otherwise we might need some kind of refcount mechanism.  We built one
for domain constraints in the typcache, and it's not horrid, but it is a
fair amount of code.

> Finally: I don't quite understand why we need two memory contexts there,
> one for the partkey and another for the partdesc.  Surely one context
> for both is sufficient.

It'd only matter if there were a reason to delete/rebuild one but not the
other within a particular relcache entry, which probably there isn't.
So using one context for both would likely be a bit simpler and more
efficient.

BTW, another thing in the same general area that I was planning to get
on the warpath about sooner or later is that the code managing
rd_partcheck is really cruddy.  It spews a lot of random data structure
into the CacheMemoryContext with no way to release it at relcache inval,
resulting in a session-lifespan memory leak.  (pfree'ing just the List
header, as RelationDestroyRelation does, is laughably inadequate.)
Perhaps that could be fixed by likewise storing it in a sub-context
used for partition information.

regards, tom lane



Re: [PATCH][PROPOSAL] Refuse setting toast.* reloptions when TOAST table does not exist

2018-04-10 Thread David Steele
On 4/10/18 9:17 AM, Alvaro Herrera wrote:
> Nikolay Shaplov wrote:
> 
>> But I need some confirmation, in order not to write patch in vain again :-)
> 
> Don't worry, rest assured that you will still write *many* patches in
> vain, not just this one.

Despite the rather dubious pep talk, Álvaro is right.  You will get more
response with a new patch and a new thread.

But everyone is pretty fatigued right now, so I would recommend waiting
a while.  If you would like to pursue this, plan to have a new patch
ready in August for the next CF.  It sounds like you have an idea about
what needs to be done.

Regards,
-- 
-David
da...@pgmasters.net



Re: lazy detoasting

2018-04-10 Thread Chapman Flack
On 04/10/2018 10:06 AM, Tom Lane wrote:
> Chapman Flack  writes:

>> Am I right in thinking that, for my original purpose of
>> detoasting something later in a transaction, all that matters
>> is that I registered a snapshot from the time at which I copied
>> the toasted datum, and the resource owner I registered it to
>> has not been released yet, ...
> 
> I believe so.

Ok.

I see I have about half a dozen choices for the snapshot that
I choose to squirrel away at the same time as the copied datum.
(That's not counting SnapshotToast, which I didn't know was a thing
until Pavan's thread this morning, but it's not a thing I can get,
just something constructed on the fly in the tuptoaster.)

Out of the six GetFooSnapshot()s, would I want to squirrel away
Active? Oldest? Transaction? This should be happening in a PL
function not doing anything arcane; the datum in question might
be passed to it as a parameter or retrieved from a query done
within the function.

GetOldestSnapshot() is what the tuptoaster will eventually use
to construct SnapshotToast when looking for the data, but it's
probably overkill for me to save the oldest one in sight at the
time I save the datum. Or is it? Should I be confident that, at
the time I'm handed this datum, its toasted content must be
retrievable through the (Active? Transaction?) snapshot at that
moment, and it's enough to register that, while to register the
Oldest would be to pin something unnecessarily far back?

> Wouldn't be a bad idea to test this, of course ;-)

Mm-hmm. (Thunderbird has just corrected my spelling of mm-hmm.)
I'm still not sure I know enough to construct a meaningful test...

I assume that, while I'm doing this for a backend PL at the
moment, some of the same considerations will apply if a future
wire protocol is to support Craig's "v4 protocol TODO item -
Lazy fetch/stream of TOASTed valued" suggestion in
https://www.postgresql.org/message-id/53ff0ef8@2ndquadrant.com

-Chap



Re: 2018-03 CF Cleanup

2018-04-10 Thread Alvaro Herrera
David Steele wrote:
> Hackers,
> 
> I have gone through all the remaining non-bug entries in CF 2018-03 and
> pushed them or closed them as appropriate.  The majority of the patches
> were in Needs Review or Ready for Committer status and I did a brief
> review of each to be sure the state was reasonable.

Thanks!

> For the moment I have left all bugs in the 2018-03 CF.  I can add
> them to the "Older Bugs" section of the PG 11 Open Items but I'm not
> convinced that is the best way to track them.  If they are added to
> "Older Bugs", does that mean they get closed?  Or should we just add a
> link to the CF entry in "Older Bugs"?

OpenItem's "Older Bugs" section is generally a way to ignore items
forever (ie., they don't represent a true Open Item for this release.)
I think it's better to move these patches to the next commitfest
instead.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



2018-03 CF Cleanup

2018-04-10 Thread David Steele
Hackers,

I have gone through all the remaining non-bug entries in CF 2018-03 and
pushed them or closed them as appropriate.  The majority of the patches
were in Needs Review or Ready for Committer status and I did a brief
review of each to be sure the state was reasonable.

I moved a few Waiting on Author patches where it looked like the patch
was being actively developed even if there was no new patch.

For the moment I have left all bugs in the 2018-03 CF.  I can add
them to the "Older Bugs" section of the PG 11 Open Items but I'm not
convinced that is the best way to track them.  If they are added to
"Older Bugs", does that mean they get closed?  Or should we just add a
link to the CF entry in "Older Bugs"?

Thanks,
-- 
-David
da...@pgmasters.net



Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

2018-04-10 Thread Robert Haas
On Mon, Apr 9, 2018 at 3:13 PM, Andres Freund  wrote:
> Let's lower the pitchforks a bit here.  Obviously a grand rewrite is
> absurd, as is some of the proposed ways this is all supposed to
> work. But I think the case we're discussing is much closer to a near
> irresolvable corner case than anything else.

Well, I admit that I wasn't entirely serious about that email, but I
wasn't entirely not-serious either.  If you can't reliably find
out whether the contents of the file on disk are the same as the
contents that the kernel is giving you when you call read(), then you
are going to have a heck of a time building a reliable system.  If the
kernel developers are determined to insist on these semantics (and,
admittedly, I don't know whether that's the case - I've only read
Anthony's remarks), then I don't really see what we can do except give
up on buffered I/O (or on Linux).

> We're talking about the storage layer returning an irresolvable
> error. You're hosed even if we report it properly.  Yes, it'd be nice if
> we could report it reliably.  But that doesn't change the fact that what
> we're doing is ensuring that data is safely fsynced unless storage
> fails, in which case it's not safely fsynced anyway.

I think that reliable error reporting is more than "nice" -- I think
it's essential.  The only argument for the current Linux behavior that
has been so far advanced on this thread, at least as far as I can see,
is that if it kept retrying the buffers forever, it would be pointless
and might run the machine out of memory, so we might as well discard
them.  But previous comments have already illustrated that the kernel
is not really up against a wall there -- it could put individual
inodes into a permanent failure state when it discards their dirty
data, as you suggested, or it could do what others have suggested, and
what I think is better, which is to put the whole filesystem into a
permanent failure state that can be cleared by remounting the FS.
That could be done on an as-needed basis -- if the number of dirty
buffers you're holding onto for some filesystem becomes too large, put
the filesystem into infinite-fail mode and discard them all.  That
behavior would be pretty easy for administrators to understand and
would resolve the entire problem here provided that no PostgreSQL
processes survived the eventual remount.

I also don't really know what we mean by an "unresolvable" error.  If
the drive is beyond all hope, then it doesn't really make sense to
talk about whether the database stored on it is corrupt.  In general
we can't be sure that we'll even get an error - e.g. the system could
be idle and the drive could be on fire.  Maybe this is the case you
meant by "it'd be nice if we could report it reliably".  But at least
in my experience, that's typically not what's going on.  You get some
I/O errors and so you remount the filesystem, or reboot, or rebuild
the array, or ... something.  And then the errors go away and, at that
point, you want to run recovery and continue using your database.  In
this scenario, it matters *quite a bit* what the error reporting was
like during the period when failures were occurring.  In particular,
if the database was allowed to think that it had successfully
checkpointed when it didn't, you're going to start recovery from the
wrong place.

I'm going to shut up now because I'm telling you things that you
obviously already know, but this doesn't sound like a "near
irresolvable corner case".  When the storage goes bonkers, either
PostgreSQL and the kernel can interact in such a way that a checkpoint
can succeed without all of the relevant data getting persisted, or
they don't.  It sounds like right now they do, and I'm not really
clear that we have a reasonable idea how to fix that.  It does not
sound like a PANIC is sufficient.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: pgsql: Support partition pruning at execution time

2018-04-10 Thread Alvaro Herrera
Changed CC to pgsql-hackers.

Tom Lane wrote:
> Alvaro Herrera  writes:

> > Frankly, I don't like this.  I would rather have an instrument->ntuples2
> > rather than these "divide this by nloops, sometimes" schizoid counters.
> > This is already being misused by ON CONFLICT (see "other_path" in
> > show_modifytable_info).  But it seems like a correct fix would require
> > more code.
> 
> The question then becomes whether to put back nfiltered3, or to do
> something more to your liking.  Think I'd vote for the latter.

Doing it properly is not a lot of code actually.  Patch attached.  ON
CONFLICT is not changed by this patch, but that's a straightforward
change.

Questions:
1. Do we want to back-patch this to 10?  I suppose (without checking)
that EXPLAIN ANALYZE is already reporting bogus numbers for parallel
index-only scans, so I think we should do that.

2. Do we want to revert Andrew's test stabilization patch?  If I
understand correctly, the problem is the inverse of what was diagnosed:
"any running transaction at the time of the test could prevent pages
from being set as all-visible".  That's correct, but the test doesn't
depend on pages being all-visible -- quite the contrary, it depends on
the pages NOT being all-visible (which is why the HeapFetches counts are
all non-zero).  Since the pages contain very few tuples, autovacuum
should never process the tables anyway.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 989b6aad67..fe0419311b 100644
*** a/src/backend/commands/explain.c
--- b/src/backend/commands/explain.c
***
*** 1459,1470  ExplainNode(PlanState *planstate, List *ancestors,
show_instrumentation_count("Rows Removed by 
Filter", 1,

   planstate, es);
if (es->analyze)
!   {
!   longheapFetches =
!   ((IndexOnlyScanState *) 
planstate)->ioss_HeapFetches;
! 
!   ExplainPropertyInteger("Heap Fetches", NULL, 
heapFetches, es);
!   }
break;
case T_BitmapIndexScan:
show_scan_qual(((BitmapIndexScan *) 
plan)->indexqualorig,
--- 1459,1466 
show_instrumentation_count("Rows Removed by 
Filter", 1,

   planstate, es);
if (es->analyze)
!   ExplainPropertyInteger("Heap Fetches", NULL,
!  
planstate->instrument->ntuples2, es);
break;
case T_BitmapIndexScan:
show_scan_qual(((BitmapIndexScan *) 
plan)->indexqualorig,
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 86252cee1f..fe5d55904d 100644
*** a/src/backend/executor/instrument.c
--- b/src/backend/executor/instrument.c
***
*** 156,161  InstrAggNode(Instrumentation *dst, Instrumentation *add)
--- 156,162 
dst->startup += add->startup;
dst->total += add->total;
dst->ntuples += add->ntuples;
+   dst->ntuples2 += add->ntuples2;
dst->nloops += add->nloops;
dst->nfiltered1 += add->nfiltered1;
dst->nfiltered2 += add->nfiltered2;
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index ddc0ae9061..3a02a99621 100644
*** a/src/backend/executor/nodeIndexonlyscan.c
--- b/src/backend/executor/nodeIndexonlyscan.c
***
*** 162,168  IndexOnlyNext(IndexOnlyScanState *node)
/*
 * Rats, we have to visit the heap to check visibility.
 */
!   node->ioss_HeapFetches++;
tuple = index_fetch_heap(scandesc);
if (tuple == NULL)
continue;   /* no visible tuple, 
try next index entry */
--- 162,168 
/*
 * Rats, we have to visit the heap to check visibility.
 */
!   InstrCountTuples2(node, 1);
tuple = index_fetch_heap(scandesc);
if (tuple == NULL)
continue;   /* no visible tuple, 
try next index entry */
***
*** 509,515  ExecInitIndexOnlyScan(IndexOnlyScan *node, EState *estate, int 
eflags)
indexstate->ss.ps.plan = (Plan *) node;
indexstate->ss.ps.state = estate;
indexstate->ss.ps.ExecProcNode = ExecIndexOnlyScan;
-   indexstate->ioss_HeapFetches = 0;
  
   

Re: [HACKERS] [PATCH] Incremental sort

2018-04-10 Thread Alexander Korotkov
On Tue, Apr 10, 2018 at 4:15 PM, David Steele  wrote:

> On 4/9/18 11:56 AM, Alexander Korotkov wrote:
> >
> > Thank you very much for your efforts on reviewing this patch.
> > I'm looking forward work with you on this feature for PG12.
>
> Since there's a new patch I have changed the status to Needs Review and
> moved the entry to the next CF.
>

Right, thank you.

> Still, it seems to me that discussion and new patches will be required
> to address Tom's concerns.
>

Sounds correct for me.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: Partitioned tables and covering indexes

2018-04-10 Thread Alexander Korotkov
On Tue, Apr 10, 2018 at 12:47 PM, Amit Langote <
langote_amit...@lab.ntt.co.jp> wrote:

> On 2018/04/10 16:07, Jaime Casanova wrote:
> > Hi,
> >
> > Trying covering indexes on partitioned tables i get this error
> > """
> > postgres=# create index on t1_part (i) include (t);
> > ERROR:  cache lookup failed for opclass 0
> > """
> >
> > To reproduce:
> >
> > create table t1_part (i int, t text) partition by hash (i);
> > create table t1_part_0 partition of t1_part for values with (modulus
> > 2, remainder 0);
> > create table t1_part_1 partition of t1_part for values with (modulus
> > 2, remainder 1);
> > insert into t1_part values (1, repeat('abcdefquerty', 20));
> >
> > create index on t1_part (i) include (t);
>
> It seems that the bug is caused due to the original IndexStmt that
> DefineIndex receives being overwritten when processing the INCLUDE
> columns.  Also, the original copy of it should be used when recursing for
> defining the index in partitions.
>
> Does the attached fix look correct?  Haven't checked the fix with ATTACH
> PARTITION though.
>

Attached patch seems to fix the problem.  However, I would rather get
rid of modifying stmt->indexParams.  That seems to be more logical
for me.  Also, it would be good to check some covering indexes on
partitioned tables.  See the attached patch.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


DefineIndex-fix-covering-index-partitioned-2.patch
Description: Binary data


  1   2   >