Re: [HACKERS] Foreign join pushdown vs EvalPlanQual
> -----Original Message-----
> From: Etsuro Fujita [mailto:fujita.ets...@lab.ntt.co.jp]
> Sent: Tuesday, October 20, 2015 1:11 PM
> To: Robert Haas
> Cc: Tom Lane; Kaigai Kouhei(海外 浩平); Kyotaro HORIGUCHI; pgsql-hackers@postgresql.org; Shigeru Hanada
> Subject: Re: [HACKERS] Foreign join pushdown vs EvalPlanQual
>
> On 2015/10/20 5:34, Robert Haas wrote:
> > On Mon, Oct 19, 2015 at 3:45 AM, Etsuro Fujita wrote:
> >> As Tom mentioned, just recomputing the original join tuple is not good
> >> enough. We would need to rejoin the test tuples for the baserels even
> >> if ROW_MARK_COPY is in use. Consider:
> >>
> >> A=# BEGIN;
> >> A=# UPDATE t SET a = a + 1 WHERE b = 1;
> >> B=# SELECT * from t, ft1, ft2
> >>     WHERE t.a = ft1.a AND t.b = ft2.b AND ft1.c = ft2.c FOR UPDATE;
> >> A=# COMMIT;
> >>
> >> where the plan for the SELECT FOR UPDATE is
> >>
> >> LockRows
> >> -> Nested Loop
> >>    -> Seq Scan on t
> >>    -> Foreign Scan on
> >>       Remote SQL: SELECT * FROM ft1 JOIN ft2 WHERE ft1.c = ft2.c
> >>       AND ft1.a = $1 AND ft2.b = $2
> >>
> >> If an EPQ recheck is invoked by A's UPDATE, just recomputing the
> >> original join tuple from the whole-row image that you proposed would
> >> output an incorrect result in the EPQ recheck, since the value a in the
> >> updated version of a to-be-joined tuple in t would no longer match the
> >> value ft1.a extracted from the whole-row image if A's UPDATE has
> >> committed successfully. So I think we would need to rejoin the tuples
> >> populated from the whole-row images for the baserels ft1 and ft2, by
> >> executing the secondary plan with the new parameter values for a and b.
> >
> > No. You just need to populate fdw_recheck_quals correctly, same as
> > for the scan case.
>
> Yeah, I think we can probably do that for the case where a pushed-down
> join clause is an inner-join one, but I'm not sure that we can do that
> for the case where that clause is an outer-join one. Maybe I'm missing
> something, though.
>
Please check my message yesterday. The non-nullable side of an outer join is always visible regardless of the join clauses pushed down, as long as it satisfies the scan quals pushed down.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei
Re: [HACKERS] SuperUser check in pg_stat_statements
Jim, I already tried to create a view on top of pg_stat_statements, but no luck.
Re: [HACKERS] Dangling Client Backend Process
On 19 October 2015 21:37, Robert Haas [mailto:robertmh...@gmail.com] wrote:

> On Sat, Oct 17, 2015 at 4:52 PM, Alvaro Herrera wrote:
> > Andres Freund wrote:
> > > On 2015-10-14 17:33:01 +0900, Kyotaro HORIGUCHI wrote:
> > > > If I recall correctly, he was concerned about killing backends
> > > > running transactions which could be saved. I have some sympathy
> > > > with that opinion.
> > >
> > > I still don't. Leaving backends alive after postmaster has died
> > > prevents the auto-restart mechanism from working from there on.
> > > Which means that we'll potentially continue happily after another
> > > backend has PANICed and potentially corrupted shared memory. Which
> > > isn't all that unlikely if postmaster isn't around anymore.
> >
> > I agree. When postmaster terminates without waiting for all backends
> > to go away, things are going horribly wrong -- either a DBA has done
> > something stupid, or the system is misbehaving. As Andres says, if
> > another backend dies at that point, things are even worse -- the dying
> > backend could have been holding a critical lwlock, for instance, or it
> > could have corrupted shared buffers on its way out. It is not
> > sensible to leave the rest of the backends in the system still trying
> > to run just because there is no one there to kill them.
>
> Yep. +1 for changing this.

Seems many people are in favor of this change. I have made changes to handle backend exit on postmaster death (after the backend has finished its work and is waiting for a new command). The changes follow the approach explained in my earlier mail, i.e.:

1. WaitLatchOrSocket, called from the secure_read and secure_write functions, now waits on an additional event, WL_POSTMASTER_DEATH.

2. There is a possibility that a command is read without waiting on the latch. This case is handled by checking postmaster status after the command has been read (i.e. after ReadCommand).

Attached is the patch.

Thanks and Regards,
Kumar Rajeev Rastogi

Attachment: dangling_backend_process.patch
Re: [HACKERS] Foreign join pushdown vs EvalPlanQual
On 2015/10/20 5:34, Robert Haas wrote:
> On Mon, Oct 19, 2015 at 3:45 AM, Etsuro Fujita wrote:
>> As Tom mentioned, just recomputing the original join tuple is not good
>> enough. We would need to rejoin the test tuples for the baserels even
>> if ROW_MARK_COPY is in use. Consider:
>>
>> A=# BEGIN;
>> A=# UPDATE t SET a = a + 1 WHERE b = 1;
>> B=# SELECT * from t, ft1, ft2
>>     WHERE t.a = ft1.a AND t.b = ft2.b AND ft1.c = ft2.c FOR UPDATE;
>> A=# COMMIT;
>>
>> where the plan for the SELECT FOR UPDATE is
>>
>> LockRows
>> -> Nested Loop
>>    -> Seq Scan on t
>>    -> Foreign Scan on
>>       Remote SQL: SELECT * FROM ft1 JOIN ft2 WHERE ft1.c = ft2.c
>>       AND ft1.a = $1 AND ft2.b = $2
>>
>> If an EPQ recheck is invoked by A's UPDATE, just recomputing the
>> original join tuple from the whole-row image that you proposed would
>> output an incorrect result in the EPQ recheck, since the value a in the
>> updated version of a to-be-joined tuple in t would no longer match the
>> value ft1.a extracted from the whole-row image if A's UPDATE has
>> committed successfully. So I think we would need to rejoin the tuples
>> populated from the whole-row images for the baserels ft1 and ft2, by
>> executing the secondary plan with the new parameter values for a and b.
>
> No. You just need to populate fdw_recheck_quals correctly, same as
> for the scan case.

Yeah, I think we can probably do that for the case where a pushed-down join clause is an inner-join one, but I'm not sure that we can do that for the case where that clause is an outer-join one. Maybe I'm missing something, though.

Best regards,
Etsuro Fujita
Re: [HACKERS] Foreign join pushdown vs EvalPlanQual
> On Mon, Oct 19, 2015 at 12:17 AM, Kouhei Kaigai wrote:
> > 1. Fetch every EPQ slot of base relations involved in this join.
> >    In case of ForeignScan, all the required tuples of base relations
> >    should be filled because they are preliminarily fetched by whole-row
> >    var if early row-locking, or by RefetchForeignRow if late row-locking.
> >    In case of CustomScan, it can call ExecProcNode() to generate the
> >    first tuple even if it does not exist.
> >    Anyway, I assume all the component tuples of this join can be fetched
> >    using the existing EPQ slots because they are owned by base relations.
> >
> > 2. The recheck callback fills up ss_ScanTupleSlot according to the
> >    fdw_scan_tlist or custom_scan_tlist. The callback knows the best way
> >    to reconstruct the joined tuple from the base relations' tuples
> >    fetched in step 1.
> >    For example, if the joined tuple consists of (t1.a, t1.b, t2.x, t3.s),
> >    the callback picks up 't1.a' and 't1.b' from the tuple fetched from
> >    the EPQ slot of t1, then puts these values in the 1st and 2nd slots.
> >    Also, it picks up 't2.x' from the tuple fetched from the EPQ slot of
> >    t2, then puts this value in the 3rd slot. Same as above for 't3'.
> >    At this point, ss_ScanTupleSlot gets filled up with the expected
> >    fields as if the join clauses were satisfied.
> >
> > 3. The recheck callback also checks the qualifiers of base relations
> >    that are pushed down. Because expression nodes kept in fdw_exprs or
> >    custom_exprs are initialized to reference ss_ScanTupleSlot in
> >    setrefs.c, it is more reasonable to run ExecQual after step 2.
> >    If one of the qualifiers of a base relation is evaluated as false,
> >    the recheck callback returns an empty slot.
> >
> > 4. The recheck callback also checks the join clauses joining the
> >    underlying base relations. For the same reason as in step 3, it is
> >    more reasonable to execute ExecQual after step 2.
> >    If one of the join clauses is evaluated as false, the recheck returns
> >    an empty slot.
> >    Otherwise, it returns ss_ScanTupleSlot, and ExecScan will process
> >    any further jobs.
>
> Hmm, I guess this would work. But it still feels unnatural to me. It
> feels like we haven't really pushed down the join. It's pushed down
> except when there's an EPQ check, and then it's not. So we need a
> whole alternate plan tree. With my proposal, we don't need that.
>
Even if we fetch whole rows from both sides, join pushdown still works, because we can receive fewer rows than with a local join plus two foreign scans. (If the planner works well, we can expect join paths that increase the number of rows to be dropped.) One downside of my proposal is the growth in width of individual rows; it is a trade-off.

The above approach requires no changes to the existing EPQ infrastructure, so its implementation design is clear. On the other hand, your approach will reduce traffic over the network; however, it is still unclear how we integrate scanrelid == 0 with the EPQ infrastructure.

In the case of a custom scan that takes underlying local scan nodes, any kind of ROW_MARK_* other than ROW_MARK_COPY can happen. I think the width of the joined tuples is a relatively minor issue there compared with the FDW case. However, we cannot expect the fetched rows to be protected by an early row-locking mechanism, so the probability of re-fetching rows and reconstructing the joined tuple has relatively higher priority.

> There is also some possible loss of efficiency with this approach.
> Suppose that we have two tables ft1 and ft2 which are being joined,
> and we push down the join. They are being joined on an integer
> column, and the join needs to select several other columns as well.
> However, ft1 and ft2 are very wide tables that also contain some text
> columns. The query is like this:
>
> SELECT localtab.a, ft1.p, ft2.p FROM localtab LEFT JOIN (ft1 JOIN ft2
> ON ft1.x = ft2.x AND ft1.huge ~ 'stuff' AND ft2.huge2 ~ 'nonsense') ON
> localtab.q = ft1.q;
>
> If we refetch each row individually, we will need a wholerow image of
> ft1 and ft2 that includes all columns, or at least ft1.huge and
> ft2.huge2. If we just fetch a wholerow image of the join output, we
> can exclude those. The only thing we need to recheck is that it's
> still the case that localtab.q = ft1.q (because the value of
> localtab.q might have changed).
>
Isn't it possible to distinguish the whole-row references required by the locking mechanism from the ones required by users? (Does resjunk=true give us a hint?) When a whole-row reference is required only internally, it seems to me harmless to put dummy NULLs in the unreferenced columns. Is that a feasible idea?

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei
Re: [HACKERS] Multi-column distinctness.
Hello Kyotaro-san,

On 09/11/2015 06:58 PM, Tomas Vondra wrote:
> Maybe the best solution is to abandon the ALTER TABLE approach
> entirely, and instead invent a new set of commands:
>
>     CREATE STATISTICS
>     DROP STATISTICS
>
> (ALTER STATISTICS seems a bit excessive at this point.)

Another thing is that perhaps we should add names for statistics, just like we do for constraints, for example. Otherwise the DROP STATISTICS handling is rather awkward - for example, if the user creates stats twice by mistake, he's unable to drop just one of them.

Do you think this modified syntax makes sense? I'll have time to hack on this over the next few days.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
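[For illustration, one possible shape of those commands, with constraint-like names - a sketch only, since the exact syntax and the kinds of statistics collected are precisely what is under discussion here:

    CREATE STATISTICS s1 ON (a, b) FROM t1;
    CREATE STATISTICS s2 ON (a, b) FROM t1;  -- duplicate created by mistake
    DROP STATISTICS s2;                      -- names make dropping just one possible
]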
Re: [HACKERS] Why no CONSTANT for row variables in plpgsql?
Jim Nasby writes:
> What did seem odd is that while processing the DECLARE section there
> were plpgsql datums for tt.a and tt.b. I would have expected the
> assignment to produce a row datum of type tt.

Yeah, that's the thing that's weird about plpgsql's ROW datums. What the row datum mechanism is actually good for IMO is representing multiple targets for FOR and INTO constructs, ie

    SELECT ... INTO a,b,c;

If you look at the representation of INTO, it only allows one target datum, and the reason that's OK is it uses a row datum for cases like this. The row member datums are just the scalar variables a,b,c, which can also be accessed directly.

IMO, we ought to get rid of the use of that representation for composite-type variables and use the RECORD code paths for them, whether they are declared as type record or as named composite types. That would probably make it easier to handle this case, and it'd definitely make it easier to deal with some other weak spots like ALTER TYPE changes to composite types. However, last time I proposed that, it was shot down on the grounds that it might hurt performance in some cases. (Which is likely true, although that argument ignores the fact that other cases might get better.)

			regards, tom lane
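[For reference, a minimal example of the multi-target INTO case described above - the three scalar variables are collected into a single row datum internally, yet remain directly accessible:

    DO $$
    DECLARE
      a int; b int; c int;
    BEGIN
      SELECT 1, 2, 3 INTO a, b, c;  -- one row-datum target built over a, b, c
      RAISE NOTICE 'a=% b=% c=%', a, b, c;
    END $$;
]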
Re: [HACKERS] ROWS FROM(): A Foolish (In)Consistency?
On 10/19/15 1:07 PM, David Fetter wrote:
> What I'd like to do is lift the restriction on ROWS FROM(), which
> currently requires that the stuff inside the parentheses be
> set-returning functions, so that constructs like the following would
> actually work:
>
> SELECT *
> FROM ROWS FROM (
>     (VALUES (...), ..., (...)),
>     (SELECT ... ),
>     (INSERT ... RETURNING ... ),
>     my_srf()
> )
> AS t(...)

There have been a few places where I would have found that handy.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
Re: [HACKERS] [PATCH] SQL function to report log message
On 10/19/15 1:09 AM, Pavel Stehule wrote:
>> What I was trying to say is that if the argument to a USING option is
>> NULL then RAISE should skip over it, as if it hadn't been applied at
>> all. Similar to how the code currently tests for \0.
>
> I understand, but I don't prefer this behavior. NULL is a strange value
> and should be signaled.

So instead of raising the message we wanted, we throw a completely different exception? How does that make sense?

> It is partially wrong because we handle all fields the same. It makes
> sense for the "message" field, and does not make sense for other fields.
> In this case the text "NULL" would be better.

I fail to see how doing HINT: NULL is much better than just not raising a HINT at all... More to the point, if RAISE operated this way then it would be trivial to create a fully functional plpgsql wrapper around it.

> I have a different opinion - better to have the proposed function in
> core. As far as I know, NULL is not used in Postgres as an "ignore
> value", and I think that is a good thing.

Normally I'd agree, but in this case I think it's in line with what the C code is already doing by testing for \0. I suppose if we get the function it's not that bad since at least we get the functionality, so I'll stop arguing it.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
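[For context, the behavior under discussion: as of current plpgsql, passing NULL to a USING option aborts the RAISE itself rather than skipping the option -

    DO $$
    BEGIN
      RAISE NOTICE 'something happened' USING HINT = NULL;
    END $$;
    -- ERROR:  RAISE statement option cannot be null
]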
Re: [HACKERS] Why no CONSTANT for row variables in plpgsql?
On 10/19/15 5:16 PM, Jim Nasby wrote:
> Yeah, was hoping someone knew offhand why this was a problem. Guess
> I'll rip the checks out and see what explodes.

... and what blows up is exec_eval_datum():

        case PLPGSQL_DTYPE_ROW:
            {
                PLpgSQL_row *row = (PLpgSQL_row *) datum;
                HeapTuple   tup;

                if (!row->rowtupdesc)   /* should not happen */
                    elog(ERROR, "row variable has no tupdesc");
                /* Make sure we have a valid type/typmod setting */
                BlessTupleDesc(row->rowtupdesc);
                oldcontext = MemoryContextSwitchTo(estate->eval_econtext->ecxt_per_tuple_memory);
                tup = make_tuple_from_row(estate, row, row->rowtupdesc);
                if (tup == NULL)        /* should not happen */
                    elog(ERROR, "row not compatible with its own tupdesc");

running this:

    create type tt as (a int, b int);
    do $$
    declare
      c CONSTANT tt := '(1,2)'::tt;
    begin
      raise notice 'c = %', c;
    end $$;

which produces, in the server log:

    ERROR:  row not compatible with its own tupdesc
    CONTEXT:  PL/pgSQL function inline_code_block line 5 at RAISE
    STATEMENT:  do $$ declare c CONSTANT tt := '(1,2)'::tt; begin raise notice 'c = %', c; end $$;

and in the client:

    ERROR:  row not compatible with its own tupdesc
    CONTEXT:  PL/pgSQL function inline_code_block line 5 at RAISE

The row's tupdesc looked normal to me. What did seem odd is that while processing the DECLARE section there were plpgsql datums for tt.a and tt.b. I would have expected the assignment to produce a row datum of type tt. When exec_stmt_block is finally called, the initialization for loop initializes tt.a and tt.b, but does nothing with c.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com

diff --git a/src/pl/plpgsql/src/pl_gram.y b/src/pl/plpgsql/src/pl_gram.y
index 841a8d6..01b3c10 100644
--- a/src/pl/plpgsql/src/pl_gram.y
+++ b/src/pl/plpgsql/src/pl_gram.y
@@ -500,34 +500,16 @@ decl_statement: decl_varname decl_const decl_datatype decl_collate decl_notnull
 										$3, true);
 						if ($2)
 						{
-							if (var->dtype == PLPGSQL_DTYPE_VAR)
 								((PLpgSQL_var *) var)->isconst = $2;
-							else
-								ereport(ERROR,
-										(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-										 errmsg("row or record variable cannot be CONSTANT"),
-										 parser_errposition(@2)));
 						}
 						if ($5)
 						{
-							if (var->dtype == PLPGSQL_DTYPE_VAR)
 								((PLpgSQL_var *) var)->notnull = $5;
-							else
-								ereport(ERROR,
-										(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-										 errmsg("row or record variable cannot be NOT NULL"),
-										 parser_errposition(@4)));
 						}
 						if ($6 != NULL)
 						{
-							if (var->dtype == PLPGSQL_DTYPE_VAR)
 								((PLpgSQL_var *) var)->default_val = $6;
-							else
-								ereport(ERROR,
-										(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-										 errmsg("default value for row or record variable is not supported"),
-
Re: [HACKERS] SuperUser check in pg_stat_statements
On Mon, Oct 19, 2015 at 3:12 PM, Jim Nasby wrote:
> On 10/19/15 3:48 PM, rajan wrote:
>> Thanks Stephen and Shulgin for your response.
>>
>> Will go through the patch and will try to solve my problem using that.
>>
>> My scenario is that I need to have a user who cannot be a superuser but
>> a monitoring user, who will be able to see all the queries executed by
>> all users.
>
> You can set that up today by defining a view on top of pg_stat_statements
> (or maybe it needs a SECDEF SRF... been a while since I've done it).

You can solve this using a security definer function created by a superuser; see https://gist.github.com/lfittl/9ee78ac200e4e7ebe33d for a full example.

--
Lukas Fittl
Skype: lfittl
Phone: +1 415 321 0630
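[A minimal sketch of such a wrapper - names are illustrative, the objects must be created by a superuser after the pg_stat_statements extension is installed, and "monitor" stands in for the monitoring role:

    CREATE FUNCTION pg_stat_statements_allusers()
    RETURNS SETOF pg_stat_statements
    LANGUAGE sql SECURITY DEFINER AS
    $$ SELECT * FROM pg_stat_statements $$;

    CREATE VIEW pg_stat_statements_all AS
      SELECT * FROM pg_stat_statements_allusers();

    GRANT SELECT ON pg_stat_statements_all TO monitor;

Because the function runs with the privileges of its superuser owner, the superuser check inside pg_stat_statements passes, and unprivileged callers of the view see all rows.]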
Re: [HACKERS] Why no CONSTANT for row variables in plpgsql?
On 10/18/15 10:16 PM, Tom Lane wrote:
> Jim Nasby writes:
>> Is there a particular reason why row and record variables can't be
>> CONSTANT in plpgsql?
>
> Well, you can't usefully do anything with such a variable unless it can
> be initialized, which isn't currently supported either:
>
> regression=# do $$ declare x int8_tbl := row(1,2); begin end $$;
> ERROR:  default value for row or record variable is not supported
> LINE 1: do $$ declare x int8_tbl := row(1,2); begin end $$;

Yeah, I assumed the two were related. We also don't allow NOT NULL. This is all checked in the production in pl_gram.y, but there's nothing indicating why this is the case. :/

> I have a vague recollection of having looked at this a few years ago and
> realizing it wasn't quite as trivial as one could wish. Don't remember
> why, though. In any case, I'm sure it's fixable if someone wants to put
> in enough effort.

Yeah, was hoping someone knew offhand why this was a problem. Guess I'll rip the checks out and see what explodes. :)

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
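[For contrast, the same declaration with a scalar type works fine today - the restriction is specific to row and record variables:

    DO $$
    DECLARE
      c CONSTANT int := 42;
    BEGIN
      RAISE NOTICE 'c = %', c;  -- NOTICE:  c = 42
    END $$;
]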
Re: [HACKERS] SuperUser check in pg_stat_statements
On 10/19/15 3:48 PM, rajan wrote:
> Thanks Stephen and Shulgin for your response.
>
> Will go through the patch and will try to solve my problem using that.
>
> My scenario is that I need to have a user who cannot be a superuser but
> a monitoring user, who will be able to see all the queries executed by
> all users.

You can set that up today by defining a view on top of pg_stat_statements (or maybe it needs a SECDEF SRF... been a while since I've done it).

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
Re: [HACKERS] Foreign join pushdown vs EvalPlanQual
On Mon, Oct 19, 2015 at 12:17 AM, Kouhei Kaigai wrote:
> 1. Fetch every EPQ slot of base relations involved in this join.
>    In case of ForeignScan, all the required tuples of base relations
>    should be filled because they are preliminarily fetched by whole-row
>    var if early row-locking, or by RefetchForeignRow if late row-locking.
>    In case of CustomScan, it can call ExecProcNode() to generate the
>    first tuple even if it does not exist.
>    Anyway, I assume all the component tuples of this join can be fetched
>    using the existing EPQ slots because they are owned by base relations.
>
> 2. The recheck callback fills up ss_ScanTupleSlot according to the
>    fdw_scan_tlist or custom_scan_tlist. The callback knows the best way
>    to reconstruct the joined tuple from the base relations' tuples
>    fetched in step 1.
>    For example, if the joined tuple consists of (t1.a, t1.b, t2.x, t3.s),
>    the callback picks up 't1.a' and 't1.b' from the tuple fetched from
>    the EPQ slot of t1, then puts these values in the 1st and 2nd slots.
>    Also, it picks up 't2.x' from the tuple fetched from the EPQ slot of
>    t2, then puts this value in the 3rd slot. Same as above for 't3'.
>    At this point, ss_ScanTupleSlot gets filled up with the expected
>    fields as if the join clauses were satisfied.
>
> 3. The recheck callback also checks the qualifiers of base relations
>    that are pushed down. Because expression nodes kept in fdw_exprs or
>    custom_exprs are initialized to reference ss_ScanTupleSlot in
>    setrefs.c, it is more reasonable to run ExecQual after step 2.
>    If one of the qualifiers of a base relation is evaluated as false,
>    the recheck callback returns an empty slot.
>
> 4. The recheck callback also checks the join clauses joining the
>    underlying base relations. For the same reason as in step 3, it is
>    more reasonable to execute ExecQual after step 2.
>    If one of the join clauses is evaluated as false, the recheck returns
>    an empty slot.
>    Otherwise, it returns ss_ScanTupleSlot, and ExecScan will process
>    any further jobs.

Hmm, I guess this would work. But it still feels unnatural to me. It feels like we haven't really pushed down the join. It's pushed down except when there's an EPQ check, and then it's not. So we need a whole alternate plan tree. With my proposal, we don't need that.

There is also some possible loss of efficiency with this approach. Suppose that we have two tables ft1 and ft2 which are being joined, and we push down the join. They are being joined on an integer column, and the join needs to select several other columns as well. However, ft1 and ft2 are very wide tables that also contain some text columns. The query is like this:

SELECT localtab.a, ft1.p, ft2.p FROM localtab LEFT JOIN (ft1 JOIN ft2
ON ft1.x = ft2.x AND ft1.huge ~ 'stuff' AND ft2.huge2 ~ 'nonsense') ON
localtab.q = ft1.q;

If we refetch each row individually, we will need a wholerow image of ft1 and ft2 that includes all columns, or at least ft1.huge and ft2.huge2. If we just fetch a wholerow image of the join output, we can exclude those. The only thing we need to recheck is that it's still the case that localtab.q = ft1.q (because the value of localtab.q might have changed).

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] SuperUser check in pg_stat_statements
Thanks Stephen and Shulgin for your response.

Will go through the patch and will try to solve my problem using that.

My scenario is that I need to have a user who cannot be a superuser but a monitoring user, who will be able to see all the queries executed by all users.
Re: [HACKERS] Foreign join pushdown vs EvalPlanQual
On Mon, Oct 19, 2015 at 3:45 AM, Etsuro Fujita wrote:
> As Tom mentioned, just recomputing the original join tuple is not good
> enough. We would need to rejoin the test tuples for the baserels even
> if ROW_MARK_COPY is in use. Consider:
>
> A=# BEGIN;
> A=# UPDATE t SET a = a + 1 WHERE b = 1;
> B=# SELECT * from t, ft1, ft2
>     WHERE t.a = ft1.a AND t.b = ft2.b AND ft1.c = ft2.c FOR UPDATE;
> A=# COMMIT;
>
> where the plan for the SELECT FOR UPDATE is
>
> LockRows
> -> Nested Loop
>    -> Seq Scan on t
>    -> Foreign Scan on
>       Remote SQL: SELECT * FROM ft1 JOIN ft2 WHERE ft1.c = ft2.c
>       AND ft1.a = $1 AND ft2.b = $2
>
> If an EPQ recheck is invoked by A's UPDATE, just recomputing the
> original join tuple from the whole-row image that you proposed would
> output an incorrect result in the EPQ recheck, since the value a in the
> updated version of a to-be-joined tuple in t would no longer match the
> value ft1.a extracted from the whole-row image if A's UPDATE has
> committed successfully. So I think we would need to rejoin the tuples
> populated from the whole-row images for the baserels ft1 and ft2, by
> executing the secondary plan with the new parameter values for a and b.

No. You just need to populate fdw_recheck_quals correctly, same as for the scan case.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Checkpoint throttling issues
Fabien COELHO wrote:

>> 4) It's a bit dubious to only pgstat_send_bgwriter() when on schedule.
>
> No opinion!

My guess here, without looking, is that this was based on the idea of "oops, we're late for the checkpoint, let's do as little work as possible to avoid getting even later", and thus skip some "unnecessary" tasks.

I would vote for fixing this (and the SIGHUP issue too). I'm not sure how bad a behavioral change this is; IMO these fixes should be backpatched all the way back.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Checkpoint throttling issues
> preexisting issues:
>
> 1) The progress passed to CheckpointWriteDelay() will often be wrong -
> it's calculated as num_written / num_to_write, but num_written is only
> incremented if the buffer hasn't since independently been written out.
> That's bad because it means we'll think we're further and further behind
> if there's independent writeout activity. Simple enough to fix, we gotta
> split num_written into num_written (for stats purposes) and num_processed
> (for progress). This is pretty much a bug, but I'm slightly worried about
> backpatching a fix because it can have a rather noticeable behavioural
> impact.

I noticed this discrepancy, but decided against doing anything about it, because it meant more arguing and explaining. In the pgbench run I used, the case does not seem to arise often because most I/O is performed through the checkpointer, so it does not matter much.

Another discrepancy is that there are two progress computations, one against time and one against WAL buffers, and when one is used and why it is better than the other is not too clear to me.

> 2) CheckpointWriteDelay()'s handling of config file changes et al is
> pretty weird: the config is reloaded during a checkpoint iff it's not an
> immediate checkpoint and we're on schedule. I see very little
> justification for having the got_SIGHUP block inside that if block.

No opinion.

> 3) The static pg_usleep(100000L); in CheckpointWriteDelay() is a bad
> idea. On a system with low write activity (< 10 writes/sec) this will
> lead to checkpoints finishing too early as there'll always be around ~10
> writes a second. On slow IO, say an sdcard, that's bad. On a system with
> a very high write throughput (say 1k+ buffers/sec) the unconditional
> 100ms sleep while on schedule will mean we're far behind after the
> sleep. This leads to rather bursty IO, and makes it more likely to run
> into dirty buffer limits and such. Just reducing the sleep from 100ms to
> 10ms leads to significant latency improvements. On pgbench rate limited
> to 5k tps:
>
> before:
> number of transactions skipped: 70352 (1.408 %)
> number of transactions above the 100.0 ms latency limit: 2817 (0.056 %)
> latency average: 5.266 ms
> latency stddev: 11.723 ms
> rate limit schedule lag: avg 3.138 (max 0.000) ms
>
> after:
> number of transactions skipped: 41696 (0.834 %)
> number of transactions above the 100.0 ms latency limit: 1736 (0.035 %)
> latency average: 4.929 ms
> latency stddev: 8.929 ms
> rate limit schedule lag: avg 2.835 (max 0.000) ms

Do these figures include sorting & flushing?

> I think the sleep time should be computed adaptively based on the number
> of buffers remaining and the remaining time. There's probably better
> formulations, but that seems like an easy enough improvement and
> considerably better than now.

On this run. If there are few I/Os then the 10ms will result in more random/discontinuous I/Os, because there would be only a few writes in each bucket, so I'm not sure whether this would be an unconditional improvement. I would be interested to have figures *with* sorting and flushing on, and several kinds of runs.

> 4) It's a bit dubious to only pgstat_send_bgwriter() when on schedule.

No opinion!

--
Fabien.
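[One possible shape of such an adaptive computation - a sketch only, with illustrative variable names, keeping the current 100ms as an upper bound:

    sleep_ms = clamp(time_remaining_in_schedule / buffers_remaining, 1, 100)

i.e. spread the remaining schedule time evenly over the remaining buffers instead of sleeping a fixed 100ms per round.]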
Re: [HACKERS] checkpointer continuous flushing
Hello Andres,

>> Here is a v13, which is just a rebase after 1aba62ec.
>
> I'm working on this patch, to get it into a state I think it'd be
> committable.

I'll review it carefully. Also, if you can include some performance figures it would help, even if I'll do some more runs.

> In my performance testing it showed that calling PerformFileFlush() only
> at segment boundaries and in CheckpointWriteDelay() can lead to rather
> spikey IO - not that surprisingly. The sync in CheckpointWriteDelay() is
> problematic because it only is triggered while on schedule, and not when
> behind.

When behind, PerformFileFlush should be called on segment boundaries. The idea was not to go to sleep without flushing, and to do it as little as possible.

> My testing seems to show that just adding a limit of 32 buffers to
> FileAsynchronousFlush() leads to markedly better results.

Hmmm. 32 buffers means 256 KB, which is quite small. Not sure what a good "limit" would be. It could depend on whether the pages are close together or not.

> I wonder if mmap() && msync(MS_ASYNC) isn't a better replacement for
> sync_file_range(SYNC_FILE_RANGE_WRITE) than posix_fadvise(DONTNEED). It
> might even be possible to later approximate that on windows using
> FlushViewOfFile().

I'm not sure that mmap/msync can be used for this purpose, because it seems there is no real control over where the file is mmapped.

> As far as I can see the while (nb_spaces != 0)/NextBufferToWrite() logic
> doesn't work correctly if tablespaces aren't actually sorted. I'm
> actually inclined to fix this by simply removing the flag to
> enable/disable sorting.

I do not think that there is a significant downside to having sorting always on, but showing that requires being able to test, hence the GUC. The point of the GUC is to demonstrate that the feature is harmless :-)

> Having defined(HAVE_SYNC_FILE_RANGE) || defined(HAVE_POSIX_FADVISE) in so
> many places looks ugly, I want to push that into the underlying
> functions. If we add a different flushing approach we shouldn't have to
> touch several places that don't actually really care.

I agree that it is pretty ugly, but I do not think that you can remove them all. You need at least one for checking the GUC and one for enabling the feature. Maybe their number could be reduced if the functions were switched to do-nothing stubs which are called nevertheless, but I was not keen on leaving unused code around when there is neither sync_file_range nor posix_fadvise.

> I've replaced the NextBufferToWrite() logic with a binaryheap.h heap -
> seems to work well, with a bit less code actually.

Hmmm. I'll check. I'm still unconvinced that using a tree for a 2-3 element set in most cases is an improvement.

> I'll post this after some more cleanup & testing.

I'll have a look when it is ready.

> I've also noticed that the sleeping logic in CheckpointWriteDelay() isn't
> particularly good. In high-throughput workloads the 100ms sleep is too
> long, leading to bursty IO behaviour. If 1k+ buffers are written out per
> second, 100ms is a rather long sleep. For another, that we only sleep
> 100ms when the write rate is low makes the checkpoint finish rather
> quickly - on a slow disk (say a microsd) that can cause unnecessary
> slowdowns for concurrent activity. ISTM we should calculate the sleep
> time in a better way.

I also noted this point, but I'm not sure how to find a better approach, so I left it as it is. I tried 50 ms & 200 ms on some runs, without significant effect on performance for the tests I ran then. The point of having a not-too-small value is that it provides some significant work to the IO subsystem without overflowing it. On average it does not matter. I'm unsure how it would interact with flushing. So I decided not to do anything about it. Maybe it should be a GUC, but I would not know how to choose it.

> The SIGHUP behaviour is also weird. Anyway, this probably belongs on a
> new thread.

Probably. I did not try to look at that.

--
Fabien.
Re: [HACKERS] [PATCH] SQL function to report log message
Hi

2015-10-13 22:01 GMT+02:00 Robert Haas:

> This patch is marked as Ready for Committer in the CommitFest
> application. Here is my attempt to summarize the votes upthread:
>
> Tom Lane: plpgsql RAISE is sufficient; we don't need this.
> Pavel Stehule: could be replaced by custom function, but not against it.
> Robert Haas: plpgsql RAISE is sufficient; we don't need this.
> Jim Nasby: A generic wrapper around RAISE is not trivial, so +1.
> Andres Freund: I've written this function many times, so let's have it in core.
> David Fetter: I've written this function like 20 times, we should have it.
>
> I'm only -0 on this patch, so I won't yell and scream if some other
> committer is prepared to step up and get this committed, but I'm not
> excited about doing it myself on the strength of a weak consensus in
> which I'm not a participant. Any takers? I recommend that we allow
> this patch 30 days to attract an interested committer, and, if nobody
> volunteers in that time, that we mark it Rejected for lack of
> interest.

I am changing my opinion a little bit - the RAISE statement in plpgsql is really static for some purposes. The proposed function is much more dynamic due to its use of VARIADIC parameters. That isn't possible with the RAISE statement without some potentially difficult modifications (on which a good consensus would be hard to find).

A customized constructor for SPIError could do this work too, but installing and using plpythonu for one piece of functionality is not efficient.

Moreover, the proposed function is pretty simple and has no side effects, so its maintenance cost is not high.

Regards

Pavel

> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
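[To illustrate the static nature Pavel refers to: the severity level of RAISE must be a syntactic keyword, so a plpgsql wrapper has to dispatch on it explicitly rather than passing it through - a minimal sketch, with the function name and the set of levels handled purely illustrative:

    CREATE FUNCTION report_log(level text, msg text) RETURNS void
    LANGUAGE plpgsql AS $$
    BEGIN
      CASE lower(level)
        WHEN 'debug'   THEN RAISE DEBUG '%', msg;
        WHEN 'notice'  THEN RAISE NOTICE '%', msg;
        WHEN 'warning' THEN RAISE WARNING '%', msg;
        ELSE RAISE EXCEPTION '%', msg;
      END CASE;
    END $$;
]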
Re: [HACKERS] pg_restore cancel TODO
On 10/19/2015 09:47 AM, Jeff Janes wrote:
> On Mon, Oct 19, 2015 at 9:37 AM, Bruce Momjian wrote:
>
> Sorry, I don't know how I managed to screw this up so much. pg_restore,
> not pg_upgrade. I've never looked into pg_restore much until recently,
> so my fingers just autocomplete to the one I'm more used to.
>
> Let me go fix the TODO page.
>
> (I'm pretty sure pg_upgrade terminates itself very early on upon missing
> extensions)

Yes it does.

JD

> Jeff.

--
Command Prompt, Inc. - http://www.commandprompt.com/  503-667-4564
PostgreSQL Centered full stack support, consulting and development.
New rule for social situations: "If you think to yourself not even JD
would say this..." Stop and shut your mouth. It's going to be bad.
Re: [HACKERS] ROWS FROM(): A Foolish (In)Consistency?
On Mon, Oct 19, 2015 at 10:24:37AM -0700, David Fetter wrote:
> Folks,
>
> As I was learning how best to add native weighted statistics, coming
> soon, I noticed that our ROWS FROM() constructor takes only
> set-returning functions, gluing the outputs together side by side
> without a join condition of any kind. This is a handy capability,
> which I don't find elsewhere in our code, modulo horrible things
> involving WITH ORDINALITY and FULL JOIN.
>
> Did I miss something?
>
> If not, would it make sense to allow every set-returning construct
> inside ROWS FROM(), rather than just some of them?

Based on off-list feedback, this proposal is not as clear as I would have liked. Here's what we get with SRFs:

SELECT * FROM ROWS FROM (generate_series(1,5,2), generate_series(4,1,-1)) AS t(i,j);
 i | j
---+---
 1 | 4
 3 | 3
 5 | 2
   | 1
(4 rows)

Here's a similar construct, but not really, from LATERAL:

SELECT * FROM generate_series(1,5,2) AS s(i)
FULL JOIN LATERAL generate_series(4,1,-1) AS t(j) ON s.i=t.j;
 i | j
---+---
 1 | 1
   | 2
 3 | 3
   | 4
 5 |
(5 rows)

What I'd like to do is lift the restriction on ROWS FROM(), which currently requires that the stuff inside the parentheses be set-returning functions, so that constructs like the following would actually work:

SELECT *
FROM ROWS FROM (
    (VALUES (...), ..., (...)),
    (SELECT ... ),
    (INSERT ... RETURNING ... ),
    my_srf()
)
AS t(...)

What say?

Cheers,
David.
--
David Fetter  http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david.fet...@gmail.com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
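[For comparison, the first ROWS FROM() result above can be emulated today with the "horrible" WITH ORDINALITY / FULL JOIN construct David mentions, pairing the two sets by row number rather than by value - a sketch:

    SELECT i, j
    FROM generate_series(1,5,2) WITH ORDINALITY AS s(i, ord)
    FULL JOIN generate_series(4,1,-1) WITH ORDINALITY AS t(j, ord)
      USING (ord)
    ORDER BY ord;
]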
Re: [HACKERS] [PATCH v3] GSSAPI encryption support
Stephen Frost writes:
> As for this patch, the reason I've not been as involved (beyond being
> ridiculously busy) is that Michael's environment, which at least appears
> perfectly reasonable (and works with PG unpatched), isn't working. If we
> can get that working (and I've not looked at what's happening, so I have
> no idea how easy or hard that would be), then I'd be a lot more excited
> to spend time doing review of the patch.

I would also really like to see this fixed. Unfortunately, I haven't been able to replicate; I followed Michael's (very nicely written) writeup and didn't see the issues that Michael did. The only thing I know about the problem is that it logs:

> psql: lost synchronization with server: got message type "S", length 22

which unfortunately could be a great many things. I've said this a couple times now, but I really do need more information - a traffic dump, a list of commands that were run, etc.; unfortunately, the surface here is pretty large, and while I totally am willing to believe there are bugs in the code I've written, I do not yet see them.

Attachment: signature.asc (PGP signature)
[HACKERS] ROWS FROM(): A Foolish (In)Consistency?
Folks,

As I was learning how best to add native weighted statistics, coming soon, I noticed that our ROWS FROM() constructor takes only set-returning functions, gluing the outputs together side by side without a join condition of any kind. This is a handy capability, which I don't find elsewhere in our code, modulo horrible things involving WITH ORDINALITY and FULL JOIN.

Did I miss something?

If not, would it make sense to allow every set-returning construct inside ROWS FROM(), rather than just some of them?

Cheers,
David.
--
David Fetter  http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david.fet...@gmail.com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
Re: [HACKERS] pg_restore cancel TODO
On Mon, Oct 19, 2015 at 9:37 AM, Bruce Momjian wrote:
> On Mon, Oct 19, 2015 at 09:26:21AM -0700, Jeff Janes wrote:
>> It seems like gdb eats signals that you send a process while it is
>> being debugged, so it is hard to figure out what is going on. From
>> strace, it looks like the children do receive a signal but either
>> ignore it, or set a flag and then ignore that.
>>
>> They don't continue to load the entire dump file; they exit once they
>> complete the current assignment and ask the parent for more work.
>>
>> Could just be a matter of adding the local equivalent of
>> CHECK_FOR_INTERRUPTS in the part of the code that spools COPY data to
>> the backends? I'm not sure what would happen if it were in the
>> index/constraint building phase, I've never let it get that far when
>> it reported errors early on.
>>
>> (This is linux, sorry for not making that clear)
>
> Well, we are not running COPY in pg_upgrade, just the DDL commands.
> Index creation is on empty tables, so it should be very quick. What
> should basically happen is that the pg_restore child processes should
> exit as forked children, and then the backends for these pg_restore
> processes should then exit. My guess is that this problem is not
> pg_upgrade-specific as there is no signal control in pg_upgrade --- you
> are just getting the defaults.
>
> (Updated TODO to mention Linux.)

Sorry, I don't know how I managed to screw this up so much. pg_restore, not pg_upgrade. I've never looked into pg_restore much until recently, so my fingers just autocomplete to the one I'm more used to.

Let me go fix the TODO page.

(I'm pretty sure pg_upgrade terminates itself very early on upon missing extensions)

Jeff.
Re: [HACKERS] pg_restore cancel TODO
On Mon, Oct 19, 2015 at 09:26:21AM -0700, Jeff Janes wrote:
> It seems like gdb eats signals that you send a process while it is being
> debugged, so it is hard to figure out what is going on. From strace, it
> looks like the children do receive a signal but either ignore it, or set
> a flag and then ignore that.
>
> They don't continue to load the entire dump file; they exit once they
> complete the current assignment and ask the parent for more work.
>
> Could just be a matter of adding the local equivalent of
> CHECK_FOR_INTERRUPTS in the part of the code that spools COPY data to
> the backends? I'm not sure what would happen if it were in the
> index/constraint building phase, I've never let it get that far when it
> reported errors early on.
>
> (This is linux, sorry for not making that clear)

Well, we are not running COPY in pg_upgrade, just the DDL commands. Index creation is on empty tables, so it should be very quick. What should basically happen is that the pg_restore child processes should exit as forked children, and then the backends for these pg_restore processes should then exit. My guess is that this problem is not pg_upgrade-specific as there is no signal control in pg_upgrade --- you are just getting the defaults.

(Updated TODO to mention Linux.)

--
Bruce Momjian  http://momjian.us
EnterpriseDB   http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Roman grave inscription +
Re: [HACKERS] pg_restore cancel TODO
On Mon, Oct 19, 2015 at 8:28 AM, Tom Lane wrote:
> Alvaro Herrera writes:
>> Jeff Janes wrote:
>>> I've added the TODO item:
>>>
>>> When pg_upgrade -j ... is interrupted (for example, ctrl-C from the
>>> keyboard) make it cancel the children processes.
>>>
>>> The context where this arises is that I want to populate data into a
>>> new installation compiled with a patch under review, but immediately
>>> get error messages indicating I forgot to install a required
>>> extension. I hit ctrl-C so I can fix the problem, but it keeps
>>> running anyway.
>
>> This looks more like a bug to me than a To-do item.
>
> Why doesn't the control-C kill all the child processes automatically?
> I'd have expected it to ...

It seems like gdb eats signals that you send a process while it is being debugged, so it is hard to figure out what is going on. From strace, it looks like the children do receive a signal but either ignore it, or set a flag and then ignore that.

They don't continue to load the entire dump file; they exit once they complete the current assignment and ask the parent for more work.

Could just be a matter of adding the local equivalent of CHECK_FOR_INTERRUPTS in the part of the code that spools COPY data to the backends? I'm not sure what would happen if it were in the index/constraint building phase, I've never let it get that far when it reported errors early on.

(This is linux, sorry for not making that clear)

Cheers,

Jeff
Re: [HACKERS] SuperUser check in pg_stat_statements
Rajan,

* rajan (vgmon...@gmail.com) wrote:
> When monitoring using pg_stat_statements I see that postgres by default
> conceals queries executed by other users from the user who is selecting
> the pg_stat_statements view.
>
> I have edited pg_stat_statements.c by disabling the superuser check
> function so that all queries will be visible to all users.
>
> Can this be posted as a patch to postgresql?

We don't want that to be generally viewable but rather something where an administrator can control who can see it. The current proposal for that is to have a set of default roles, one of which will have this ability.

The thread on that topic starts here:

http://www.postgresql.org/message-id/flat/20150508042928.gp30...@tamriel.snowman.net#20150508042928.gp30...@tamriel.snowman.net

With the latest patch here:

http://www.postgresql.org/message-id/2015093020.gm3...@tamriel.snowman.net

Thanks!

Stephen
Re: [HACKERS] a raft of parallelism-related bug fixes
On Sat, Oct 17, 2015 at 9:16 PM, Andrew Dunstan wrote:
> If all that is required is a #define, like CLOBBER_CACHE_ALWAYS, then no
> special buildfarm support is required - you would just add that to the
> animal's config file, more or less like this:
>
>     config_env =>
>     {
>         CPPFLAGS => '-DGRATUITOUSLY_PARALLEL',
>     },
>
> I try to make things easy :-)

Wow, that's great. So, I'll try to rework the test code I posted previously into something less hacky, and eventually add a #define like this so we can run it on the buildfarm. There are a few other things that need to get done before that really makes sense - like getting the rest of the bug fix patches committed - otherwise any buildfarm critters we add will just be permanently red.

Thanks to Noah and Stephen for your replies also - it is good to hear that if I spend the time to make this committable, somebody will use it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Dangling Client Backend Process
On Sat, Oct 17, 2015 at 4:52 PM, Alvaro Herrera wrote:
> Andres Freund wrote:
>> On 2015-10-14 17:33:01 +0900, Kyotaro HORIGUCHI wrote:
>>> If I recall correctly, he was concerned about killing backends
>>> running transactions which could be saved. I have some sympathy with
>>> that opinion.
>>
>> I still don't. Leaving backends alive after postmaster has died
>> prevents the auto-restart mechanism from working from there on. Which
>> means that we'll potentially continue happily after another backend has
>> PANICed and potentially corrupted shared memory. Which isn't all that
>> unlikely if postmaster isn't around anymore.
>
> I agree. When postmaster terminates without waiting for all backends to
> go away, things are going horribly wrong -- either a DBA has done
> something stupid, or the system is misbehaving. As Andres says, if
> another backend dies at that point, things are even worse -- the dying
> backend could have been holding a critical lwlock, for instance, or it
> could have corrupted shared buffers on its way out. It is not sensible
> to leave the rest of the backends in the system still trying to run just
> because there is no one there to kill them.

Yep. +1 for changing this.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] pg_restore cancel TODO
On Mon, Oct 19, 2015 at 08:28:34AM -0700, Tom Lane wrote:
> Alvaro Herrera writes:
>> Jeff Janes wrote:
>>> I've added the TODO item:
>>>
>>> When pg_upgrade -j ... is interrupted (for example, ctrl-C from the
>>> keyboard) make it cancel the children processes.
>>>
>>> The context where this arises is that I want to populate data into a
>>> new installation compiled with a patch under review, but immediately
>>> get error messages indicating I forgot to install a required
>>> extension. I hit ctrl-C so I can fix the problem, but it keeps
>>> running anyway.
>
>> This looks more like a bug to me than a To-do item.
>
> Why doesn't the control-C kill all the child processes automatically?
> I'd have expected it to ...

I don't know. On Unix we use fork() and on Windows we use threads. It is not clear in the TODO list which platform this is for. I don't see any signal control in the pg_upgrade source code.

--
Bruce Momjian  http://momjian.us
EnterpriseDB   http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Roman grave inscription +
Re: [HACKERS] pg_restore cancel TODO
Alvaro Herrera writes:
> Jeff Janes wrote:
>> I've added the TODO item:
>>
>> When pg_upgrade -j ... is interrupted (for example, ctrl-C from the
>> keyboard) make it cancel the children processes.
>>
>> The context where this arises is that I want to populate data into a
>> new installation compiled with a patch under review, but immediately
>> get error messages indicating I forgot to install a required extension.
>> I hit ctrl-C so I can fix the problem, but it keeps running anyway.
>
> This looks more like a bug to me than a To-do item.

Why doesn't the control-C kill all the child processes automatically? I'd have expected it to ...

			regards, tom lane
Re: [HACKERS] SuperUser check in pg_stat_statements
On Mon, Oct 19, 2015 at 5:15 PM, rajan wrote:
> Hello,
>
> When monitoring using pg_stat_statements I see that postgres by default
> conceals queries executed by other users from the user who is selecting
> the pg_stat_statements view.
>
> I have edited pg_stat_statements.c by disabling the superuser check
> function so that all queries will be visible to all users.

Well, you could see that's by design. What problem are you trying to solve with that?

> Can this be posted as a patch to postgresql?

This is not going to be accepted.

--
Alex
Re: [HACKERS] Support for N synchronous standby servers - take 2
On Wed, Oct 14, 2015 at 3:16 AM, Josh Berkus wrote:
> On 10/13/2015 11:02 AM, Masahiko Sawada wrote:
>> I thought that this feature for postgresql should be simple at the
>> first implementation. It would be good even if there are some
>> restrictions, such as on the nesting level or the group setting.
>> The other new approach that I came up with is:
>>
>> * Add a new parameter synchronous_replication_method (say s_r_method)
>>   which can have two values: 'priority' or 'quorum'.
>> * If s_r_method = 'priority', the value of s_s_names (e.g. 'n1,n2,n3')
>>   is handled using priority. It's the same as '[n1,n2,n3]' in the
>>   dedicated language.
>> * If s_r_method = 'quorum', the value of s_s_names is handled using
>>   quorum commit. It's the same as '(n1,n2,n3)' in the dedicated
>>   language.
>
> Well, the first question is: can you implement both of these things for
> 9.6, realistically? If you can implement them, then we can argue about
> configuration format later. It's even possible that the nature of your
> implementation will enforce a particular syntax.

Hi,

The attached patch is a rough patch which supports multiple synchronous replication using the other approach I sent before.

The new GUC parameters are:

* synchronous_standby_num, which specifies the number of standby servers using sync rep. (default is 0)
* synchronous_replication_method, which specifies the replication method: priority or quorum. (default is priority)

The behaviour of 'priority' and 'quorum' is the same as what we've been discussing, but I'll write an overview of them again here.

[Priority Method]
Each standby server has a different priority, and the active standby servers having the top N priorities become sync standby servers. If synchronous_standby_names = '*', all active standby servers would be sync standby servers. If you want to set up standbys as in 9.5 or before, you can set synchronous_standby_num = 1.

[Quorum Method]
The standby servers all have the same priority 1, and all the active standby servers will be sync standby servers. The master server has to wait for ACKs from at least N sync standby servers before COMMIT. If synchronous_standby_names = '*', all active standby servers would be sync standby servers.

[Use case]
This patch can handle the main use case Michael described: there are 2 data centers, the first one has the master and a sync standby, and the second one has a set of standbys. We need to be sure that the standby in DC1 acknowledges all the time, and we would only need to wait for one or more of them in DC2. In order to handle this use case, you set up these standbys and the GUC parameters as follows.

* synchronous_standby_names = 'DC1, DC2'
* synchronous_standby_num = 2
* synchronous_replication_method = quorum
* The name of the standby server in DC1 is 'DC1', and the names of the two standby servers in DC2 are 'DC2'.

[Extensible]
By setting the same application_name on different standbys, we can set up sync replication with grouped standbys. If we want to set up replication more complexly and flexibly, we could add new syntax for s_s_names (e.g., a JSON format or a dedicated language) and more kinds of values for synchronous_replication_method, e.g. s_r_method = 'complex'. And this patch doesn't need a new parser for the GUC parameter.

Regards,

--
Masahiko Sawada

Attachment: synchronous_replication_method_v1.patch
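[For the DC1/DC2 use case above, the settings would look something like this - a sketch against the proposed GUCs, which exist in no released version; host names are illustrative:

    # postgresql.conf on the master
    synchronous_standby_names = 'DC1, DC2'
    synchronous_standby_num = 2
    synchronous_replication_method = quorum

    # recovery.conf on each DC2 standby - grouping is done by giving
    # all members of the group the same application_name
    primary_conninfo = 'host=master port=5432 application_name=DC2'
]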
[HACKERS] SuperUser check in pg_stat_statements
Hello,

When monitoring using pg_stat_statements I see that postgres by default conceals queries executed by other users from the user who is selecting the pg_stat_statements view.

I have edited pg_stat_statements.c by disabling the superuser check function so that all queries will be visible to all users.

Can this be posted as a patch to postgresql?

--
View this message in context: http://postgresql.nabble.com/SuperUser-check-in-pg-stat-statements-tp5870589.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pg_restore cancel TODO
On Mon, Oct 19, 2015 at 11:51:00AM -0300, Alvaro Herrera wrote:
> Jeff Janes wrote:
> > I've added the TODO item:
> >
> > When pg_upgrade -j ... is interrupted (for example, ctrl-C from the
> > keyboard) make it cancel the children processes.
> >
> > The context where this arises is that I want to populate data into a new
> > installation compiled with a patch under review, but immediately get error
> > messages indicating I forgot to install a required extension. I hit ctrl-C
> > so I can fix the problem, but it keeps running anyway.
>
> This looks more like a bug to me than a To-do item.

Uh, many TODO items are bugs.

--
Bruce Momjian http://momjian.us
EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Roman grave inscription +

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Re: [COMMITTERS] pgsql: Define integer limits independently from the system definitions.
Hi Andres,

On Fri, Apr 3, 2015 at 12:45 AM, Andres Freund wrote:
> Define integer limits independently from the system definitions.
>
> In 83ff1618 we defined integer limits iff they're not provided by the
> system. That turns out not to be the greatest idea because there are
> different ways some datatypes can be represented. E.g. on OSX PG's 64bit
> datatype will be a 'long int', but OSX unconditionally uses 'long
> long'. That disparity then can lead to warnings, e.g. around printf
> formats.

This commit added PG_INT16_MIN/MAX; however, we still rely on SHRT_MIN and SHRT_MAX in some code paths. On all the platforms I looked at (OpenBSD, FreeBSD, MinGW, MSVC, OSX, Linux, Solaris) the SHRT_MIN and SHRT_MAX definitions are always the same. Still, shouldn't we replace SHRT_* with their PG_INT16_* equivalents to be completely independent? I am just wondering if you considered that.

Regards,
--
Michael
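The kind of change being suggested would look something like this at a hypothetical call site (illustrative only; the limits themselves are the ones defined in c.h):

    /* Before: depends on the platform's <limits.h> */
    if (value < SHRT_MIN || value > SHRT_MAX)
        ereport(ERROR,
                (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
                 errmsg("value out of range for int16")));

    /* After: uses PostgreSQL's own, representation-independent limits */
    if (value < PG_INT16_MIN || value > PG_INT16_MAX)
        ereport(ERROR,
                (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
                 errmsg("value out of range for int16")));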
Re: [HACKERS] pg_restore cancel TODO
Jeff Janes wrote:
> I've added the TODO item:
>
> When pg_upgrade -j ... is interrupted (for example, ctrl-C from the
> keyboard) make it cancel the children processes.
>
> The context where this arises is that I want to populate data into a new
> installation compiled with a patch under review, but immediately get error
> messages indicating I forgot to install a required extension. I hit ctrl-C
> so I can fix the problem, but it keeps running anyway.

This looks more like a bug to me than a To-do item.

--
Álvaro Herrera  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Improve the concurrency of vacuum full table and select statement on the same relation
The lock upgrade for VACUUM FULL tends to cause deadlocks with other lock-upgrading transactions, i.e. transactions that go from AccessShareLock to some lockmode > AccessShareLock. Tom Lane's concern is that this would make VACUUM FULL fail after it has already done a lot of work. But if we can always arrange for the other transaction to be the one that fails to break the deadlock, instead of the VACUUM FULL, how about this lock-upgrade solution? For example, we could enlarge 'DeadlockTimeout' for the VACUUM FULL transaction so that it avoids running the deadlock check. A sketch of the deadlock in question follows this message.

Jinyu Zhang

regards

At 2015-10-16 23:04:51, "Robert Haas" wrote:
>On Thu, Oct 15, 2015 at 8:28 PM, Jim Nasby wrote:
>> It's just how the authors of pg_repack decided to handle it. It seems pretty
>> reasonable, since you probably don't want an errant DDL statement to cause
>> the rollback of hours or days of pg_repack work.
>>
>> Ultimately, I don't think you'll find many people interested in working on
>> this, because the whole goal is to never need VACUUM FULL or pg_repack. If
>> you're clustering just for the sake of clustering, that has its own set of
>> difficulties that should be addressed.
>
>I think the topic of online table reorganization is a pretty important
>one, actually. That is a need that we have had for a long time,
>creates serious operational problems for users, and it's also a need
>that is not going away. I think the chances of eliminating that need
>completely, even if we rearchitected our heap storage, are nil.
>
>I think the bigger issue is that it's a very hard problem to solve.
>pg_repack is one approach, but I've heard more than one person say
>that, as C-3PO said about the asteroid, it may not be entirely stable.
>
>--
>Robert Haas
>EnterpriseDB: http://www.enterprisedb.com
>The Enterprise PostgreSQL Company
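For reference, the classic lock-upgrade deadlock looks like this with two sessions (table name illustrative):

    -- Session 1
    BEGIN;
    SELECT 1 FROM t LIMIT 1;                  -- takes AccessShareLock on t

    -- Session 2
    BEGIN;
    SELECT 1 FROM t LIMIT 1;                  -- takes AccessShareLock on t

    -- Session 1
    LOCK TABLE t IN ACCESS EXCLUSIVE MODE;    -- waits for session 2

    -- Session 2
    LOCK TABLE t IN ACCESS EXCLUSIVE MODE;    -- waits for session 1: deadlock

Whichever session's deadlock_timeout fires first runs the deadlock check and is the one cancelled, which is the idea behind enlarging it for the VACUUM FULL side.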
Re: [HACKERS] PATCH: 9.5 replication origins fix for logical decoding
Hi all

Patch revision 3 attached. It's a one-liner, with just the fix, and an explanatory note in the patch header. I'll leave the test changes for now, per Andres's feedback.

I'll write up a proposed addition to the replication origin docs that explains the existence of separate origins for each change and for the tx as a whole, and explains how replication origins interact with transaction commit timestamps. That'll come as a followup; I think it's worth getting the bug fix in now.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From 8c4285402706329a64a3f180c66a04357a932b3d Mon Sep 17 00:00:00 2001
From: Craig Ringer
Date: Mon, 19 Oct 2015 21:37:19 +0800
Subject: [PATCH] Send replication origin on commit

The transaction replication origin is not set when decoding a commit.
Origins for individual replication changes are passed correctly, as is
the origin when calling the transaction filter. Fix that, supplying the
transaction origin to logical decoding callbacks.

Prior to this change the ReorderBufferTXN->origin_id (usually:
txn->origin_id) is always InvalidRepOriginId in decoding callbacks other
than the origin filter.

See the thread
http://www.postgresql.org/message-id/CAMsr+YFhBJLp=qfsz3-j+0p1zlke8znxm2otycn20qrmx38...@mail.gmail.com
for details.
---
 src/backend/replication/logical/decode.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index c629da3..9f60687 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -450,7 +450,7 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 {
 	XLogRecPtr	origin_lsn = InvalidXLogRecPtr;
 	XLogRecPtr	commit_time = InvalidXLogRecPtr;
-	XLogRecPtr	origin_id = InvalidRepOriginId;
+	XLogRecPtr	origin_id = XLogRecGetOrigin(buf->record);
 	int			i;
 
 	if (parsed->xinfo & XACT_XINFO_HAS_ORIGIN)
--
2.1.0

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
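For context, what the fix enables is that an output plugin's commit callback can now rely on txn->origin_id. A minimal sketch of such a callback (the name is my own; the signature is the standard logical decoding commit callback):

    /*
     * Sketch: emit the replication origin with each commit.  Before the
     * fix, txn->origin_id seen here was always the invalid origin.
     */
    static void
    my_commit_txn(LogicalDecodingContext *ctx, ReorderBufferTXN *txn,
                  XLogRecPtr commit_lsn)
    {
        OutputPluginPrepareWrite(ctx, true);
        appendStringInfo(ctx->out, "COMMIT (origin=%u)",
                         (unsigned) txn->origin_id);
        OutputPluginWrite(ctx, true);
    }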
Re: [HACKERS] pg_restore cancel TODO
On Wed, Oct 14, 2015 at 09:34:04AM -0700, Jeff Janes wrote:
> I've added the TODO item:
>
> When pg_upgrade -j ... is interrupted (for example, ctrl-C from the keyboard)
> make it cancel the children processes.
>
> The context where this arises is that I want to populate data into a new
> installation compiled with a patch under review, but immediately get error
> messages indicating I forgot to install a required extension. I hit ctrl-C so
> I can fix the problem, but it keeps running anyway.

OK, we will need to look at that at some point. I am not sure we have any mechanism now to close those parallel processes.

--
Bruce Momjian http://momjian.us
EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Roman grave inscription +

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
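Should someone pick this up, one conceivable shape for such a mechanism is a SIGINT handler in the parent that forwards the signal to the forked children before exiting — a sketch only, with illustrative bookkeeping, not pg_upgrade's actual code:

    #include <signal.h>
    #include <unistd.h>

    #define MAX_CHILDREN 64

    static pid_t child_pids[MAX_CHILDREN];  /* recorded as workers are forked */
    static int   n_children = 0;

    static void
    cancel_children(int signum)
    {
        /* forward the cancel to every live child, then exit ourselves */
        for (int i = 0; i < n_children; i++)
            kill(child_pids[i], SIGTERM);
        _exit(1);
    }

    /* in main(), before forking workers: signal(SIGINT, cancel_children); */

On Windows the parallel children are threads rather than processes, so the equivalent cleanup would have to look different there.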
Re: [HACKERS] fs issues on software raid0 (PG_VERSION does not contain valid data)
Hi,

On 10/18/2015 08:37 PM, Andres Freund wrote:
> Hi,
>
> On 2015-10-18 20:25:29 +0200, Tomas Vondra wrote:
>> I've been doing a lot of filesystem testing / benchmarking recently, and
>> today I've run into a really strange issue with ext4 on two SSD devices
>> in a RAID-0 configuration (Linux software raid).
>>
>> All this happens on an ext4 filesystem, created on a sw raid0 managed by
>> kernel 4.0.4. The filesystem is created like this:
>>
>> mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sda1 /dev/sdb1
>> mkfs.ext4 -E stride=128,stripe-width=256 /dev/md0
>
> Sounds like
> http://git.neil.brown.name/?p=md.git;a=commitdiff;h=a81157768a00e8cf8a7b43b5ea5cac931262374f
>
> Try reproducing with 4.0.5 to verify, that should contain the fix.

Yep, that seems to have fixed the issue. I had actually looked at that fix before, but concluded that it's unrelated, for some reason ...

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Foreign join pushdown vs EvalPlanQual
> -Original Message-
> From: Etsuro Fujita [mailto:fujita.ets...@lab.ntt.co.jp]
> Sent: Monday, October 19, 2015 8:52 PM
> To: Kaigai Kouhei(海外 浩平); Kyotaro HORIGUCHI
> Cc: pgsql-hackers@postgresql.org; shigeru.han...@gmail.com;
> robertmh...@gmail.com
> Subject: Re: [HACKERS] Foreign join pushdown vs EvalPlanQual
>
> I wrote:
> >> As Robert mentioned in [1], I think that if we're inside EPQ,
> >> pushed-down quals and/or pushed-down joins should be locally rechecked
> >> in the same way as other cases such as IndexRecheck. So, I'll propose
> >> the updated version of the patch.
>
> On 2015/10/16 18:48, Kouhei Kaigai wrote:
> > You have never answered my question for two months.
> >
> > I never denied executing the pushed-down qualifiers locally.
> > It is likely the best tactic in most cases.
> > But why do you try to force a particular manner on everyone?
> >
> > There are various kinds of FDW drivers. How do you guarantee it is
> > the best solution for all of them? It is basically impossible.
> > (Please google "Probatio diabolica")
> >
> > You try to add two special-purpose fields to ForeignScan:
> > fdw_recheck_plan and fdw_recheck_quals.
> > That requires FDW drivers to keep the pushed-down qualifiers in a
> > particular data format, and also requires FDW drivers to process the EPQ
> > recheck via an alternative local plan, even if some FDW drivers could
> > handle these jobs better with their own implementation.
> >
> > I've repeatedly pointed out this issue, but have never got a reasonable
> > answer from you.
> >
> > Again, I also admit an alternative plan may be a reasonable tactic for
> > most FDW drivers. However, only the FDW author can "decide" whether it is
> > the best tactic for handling the task in their module, not us.
> >
> > I don't think it is a good interface design to force people to
> > follow a particular implementation manner. It should be at the
> > discretion of the extension.
>
> I think that if you think so, you should give at least one concrete
> example for that. Ideally accompanied by a demo of how that works well.
>
I previously showed an example situation:
http://www.postgresql.org/message-id/9a28c8860f777e439aa12e8aea7694f801138...@bpxm15gp.gisp.nec.co.jp

Then, your response was below:

| Thanks for the answer, but I'm still not convinced.
| I think the EPQ testing shown in that use-case would probably not be
| efficient, compared to the core's.

What I'm repeatedly talking about is the flexibility of the interface, not efficiency. If the core backend provides a good-enough EPQ recheck routine, an extension can call it - but that is a decision for its author to make. Why do you want to prohibit an extension from choosing its own implementation?

Also, I introduced the case of PG-Strom in the face-to-face meeting with you. PG-Strom has its own CPU-fallback routine to recover from GPU errors, so I would prefer to reuse that routine for EPQ rechecks rather than adding alternative-local-plan support here.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Foreign join pushdown vs EvalPlanQual
I wrote:
>> As Robert mentioned in [1], I think that if we're inside EPQ,
>> pushed-down quals and/or pushed-down joins should be locally rechecked
>> in the same way as other cases such as IndexRecheck. So, I'll propose
>> the updated version of the patch.

On 2015/10/16 18:48, Kouhei Kaigai wrote:
> You have never answered my question for two months.
>
> I never denied executing the pushed-down qualifiers locally.
> It is likely the best tactic in most cases.
> But why do you try to force a particular manner on everyone?
>
> There are various kinds of FDW drivers. How do you guarantee it is
> the best solution for all of them? It is basically impossible.
> (Please google "Probatio diabolica")
>
> You try to add two special-purpose fields to ForeignScan:
> fdw_recheck_plan and fdw_recheck_quals.
> That requires FDW drivers to keep the pushed-down qualifiers in a
> particular data format, and also requires FDW drivers to process the EPQ
> recheck via an alternative local plan, even if some FDW drivers could
> handle these jobs better with their own implementation.
>
> I've repeatedly pointed out this issue, but have never got a reasonable
> answer from you.
>
> Again, I also admit an alternative plan may be a reasonable tactic for
> most FDW drivers. However, only the FDW author can "decide" whether it is
> the best tactic for handling the task in their module, not us.
>
> I don't think it is a good interface design to force people to
> follow a particular implementation manner. It should be at the
> discretion of the extension.

I think that if you think so, you should give at least one concrete example for that. Ideally accompanied by a demo of how that works well.

Best regards,
Etsuro Fujita

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Foreign join pushdown vs EvalPlanQual
On 2015/10/16 19:03, Kouhei Kaigai wrote: *** 48,59 ExecScanFetch(ScanState *node, + /* +* Execute recheck plan and get the next tuple if foreign join. +*/ + if (scanrelid == 0) + { + (*recheckMtd) (node, slot); + return slot; + } Ensure the slot is empty if recheckMtd returned false, as base relation case doing so. Fixed. *** 347,352 ExecScanReScan(ScanState *node) { Index scanrelid = ((Scan *) node->ps.plan)->scanrelid; + if (scanrelid == 0) + return; /* nothing to do */ + Assert(scanrelid > 0); estate->es_epqScanDone[scanrelid - 1] = false; Why nothing to do? Base relations managed by ForeignScan are tracked in fs_relids bitmap. I think the estate->es_epqScanDone flag should be initialized when we do ExecScanReSacn for each of the component ForeignScanState nodes in the local join execution plan state tree. As you introduced a few days before, if ForeignScan has parametalized remote join, EPQ slot contains invalid tuples based on old outer tuple. Maybe my explanation was not enough, but I haven't said such a thing. The problem in that case is that just returning the previously-returned foeign-join tuple would produce an incorrect result if an outer tuple to be joined has changed due to a concurrent transaction, as explained upthread. (I think that the EPQ slots would contain valid tuples.) Attached is an updated version of the patch. Other changes: * remove unnecessary memory-context handling for the foreign-join case in ForeignRecheck * revise code a bit and add a bit more comments Thanks for the comments! Best regards, Etsuro Fujita *** a/contrib/file_fdw/file_fdw.c --- b/contrib/file_fdw/file_fdw.c *** *** 525,530 fileGetForeignPaths(PlannerInfo *root, --- 525,531 total_cost, NIL, /* no pathkeys */ NULL, /* no outer rel either */ + NULL, /* no alternative path */ coptions)); /* *** a/contrib/postgres_fdw/postgres_fdw.c --- b/contrib/postgres_fdw/postgres_fdw.c *** *** 560,565 postgresGetForeignPaths(PlannerInfo *root, --- 560,566 fpinfo->total_cost, NIL, /* no pathkeys */ NULL, /* no outer rel either */ + NULL, /* no alternative path */ NIL); /* no fdw_private list */ add_path(baserel, (Path *) path); *** *** 727,732 postgresGetForeignPaths(PlannerInfo *root, --- 728,734 total_cost, NIL, /* no pathkeys */ param_info->ppi_req_outer, + NULL, /* no alternative path */ NIL); /* no fdw_private list */ add_path(baserel, (Path *) path); } *** a/src/backend/executor/execScan.c --- b/src/backend/executor/execScan.c *** *** 48,59 ExecScanFetch(ScanState *node, * conditions. */ Index scanrelid = ((Scan *) node->ps.plan)->scanrelid; Assert(scanrelid > 0); if (estate->es_epqTupleSet[scanrelid - 1]) { - TupleTableSlot *slot = node->ss_ScanTupleSlot; - /* Return empty slot if we already returned a tuple */ if (estate->es_epqScanDone[scanrelid - 1]) return ExecClearTuple(slot); --- 48,67 * conditions. 
*/ Index scanrelid = ((Scan *) node->ps.plan)->scanrelid; + TupleTableSlot *slot = node->ss_ScanTupleSlot; + + if (scanrelid == 0) + { + /* Execute recheck plan and store result in the slot */ + if (!(*recheckMtd) (node, slot)) + ExecClearTuple(slot); /* would not be returned by scan */ + + return slot; + } Assert(scanrelid > 0); if (estate->es_epqTupleSet[scanrelid - 1]) { /* Return empty slot if we already returned a tuple */ if (estate->es_epqScanDone[scanrelid - 1]) return ExecClearTuple(slot); *** *** 347,352 ExecScanReScan(ScanState *node) --- 355,363 { Index scanrelid = ((Scan *) node->ps.plan)->scanrelid; + if (scanrelid == 0) + return;/* nothing to do */ + Assert(scanrelid > 0); estate->es_epqScanDone[scanrelid - 1] = false; *** a/src/backend/executor/nodeForeignscan.c --- b/src/backend/executor/nodeForeignscan.c *** *** 24,29 --- 24,30 #include "executor/executor.h" #include "executor/nodeForeignscan.h" + #include "executor/tuptable.h" #include "foreign/fdwapi.h" #include "utils/memutils.h" #include "utils/rel.h" *** *** 73,80 ForeignNext(ForeignScanState *node) --- 74,99 static bool ForeignRecheck(ForeignScanState *node, TupleTableSlot *slot) { + Index scanrelid = ((Scan *) node->ss.ps.plan)->scanrelid; ExprContext *econtext; + if (scanrelid == 0) + { + TupleTableSlot *resul
[HACKERS] Checkpoint throttling issues
Hi,

While working on the checkpoint sorting/flushing patch I noticed a number of preexisting issues:

1) The progress passed to CheckpointWriteDelay() will often be wrong - it's calculated as num_written / num_to_write, but num_written is only incremented if the buffer hasn't since independently been written out. That's bad because it means we'll think we're further and further behind if there's independent writeout activity. Simple enough to fix: we gotta split num_written into num_written (for stats purposes) and num_processed (for progress).

This is pretty much a bug, but I'm slightly worried about backpatching a fix because it can have a rather noticeable behavioural impact.

2) CheckpointWriteDelay()'s handling of config file changes et al is pretty weird: the config is reloaded during a checkpoint iff it's not an immediate checkpoint and we're on schedule. I see very little justification for having the got_SIGHUP block inside that if block.

3) The static pg_usleep(100000L); in CheckpointWriteDelay() is a bad idea. On a system with low write activity (< 10 writes/sec) this will lead to checkpoints finishing too early, as there'll always be ~10 writes a second. On slow IO, say an sdcard, that's bad. On systems with a very high write throughput (say 1k+ buffers/sec) the unconditional 100ms sleep while on schedule will mean we're far behind after the sleep. This leads to rather bursty IO, and makes it more likely to run into dirty buffer limits and such. Just reducing the sleep from 100ms to 10ms leads to significant latency improvements. On pgbench rate limited to 5k tps:

before:
number of transactions skipped: 70352 (1.408 %)
number of transactions above the 100.0 ms latency limit: 2817 (0.056 %)
latency average: 5.266 ms
latency stddev: 11.723 ms
rate limit schedule lag: avg 3.138 (max 0.000) ms

after:
number of transactions skipped: 41696 (0.834 %)
number of transactions above the 100.0 ms latency limit: 1736 (0.035 %)
latency average: 4.929 ms
latency stddev: 8.929 ms
rate limit schedule lag: avg 2.835 (max 0.000) ms

I think the sleep time should be computed adaptively based on the number of buffers remaining and the remaining time (see the sketch after this message). There are probably better formulations, but that seems like an easy enough improvement and considerably better than now.

4) It's a bit dubious to only pgstat_send_bgwriter() when on schedule.

Greetings,

Andres Freund

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
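For point 3, one adaptive formulation might look like this (a sketch only; the variable names are illustrative, not the actual checkpointer code):

    /*
     * Spread the remaining time budget over the remaining buffers instead
     * of sleeping a fixed 100ms, clamping to the old value as an upper
     * bound so low-activity systems don't sleep longer than before.
     */
    double  secs_left = target_completion_time - now;
    int     bufs_left = num_to_write - num_processed;
    long    sleep_ms  = 100;            /* previous fixed sleep, as a cap */

    if (bufs_left > 0 && secs_left > 0)
        sleep_ms = Min(sleep_ms, (long) (1000.0 * secs_left / bufs_left));

    pg_usleep(Max(sleep_ms, 1L) * 1000L);   /* pg_usleep takes microseconds */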
Re: [HACKERS] Some questions about the array.
Hello again.

I attached a simple patch for omitted boundaries in the slice. This will simplify the writing of SQL. Instead of:

select arr[2:array_upper(arr, 1)];

you can write:

select arr[2:];

simple and elegant. Omitted boundaries are prohibited in UPDATE. A few usage examples follow the patch.

Thanks.

--
YUriy Zhuravlev
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/doc/src/sgml/array.sgml b/doc/src/sgml/array.sgml
index 4385a09..57614b7 100644
--- a/doc/src/sgml/array.sgml
+++ b/doc/src/sgml/array.sgml
@@ -257,6 +257,25 @@ SELECT schedule[1:2][1:1] FROM sal_emp WHERE name = 'Bill';
 (1 row)
 
+ You can skip the lower-bound or upper-bound
+ to get the first or last element in a slice.
+
+
+SELECT schedule[:][:] FROM sal_emp WHERE name = 'Bill';
+
+ schedule
+
+ {{meeting,lunch},{training,presentation}}
+(1 row)
+
+SELECT schedule[:2][1:] FROM sal_emp WHERE name = 'Bill';
+
+ schedule
+
+ {{meeting,lunch},{training,presentation}}
+(1 row)
+
+
 If any dimension is written as a slice, i.e., contains a colon, then all
 dimensions are treated as slices. Any dimension that has only a single
 number (no colon) is treated as being from 1
diff --git a/src/backend/executor/execQual.c b/src/backend/executor/execQual.c
index 29f058c..6643714 100644
--- a/src/backend/executor/execQual.c
+++ b/src/backend/executor/execQual.c
@@ -268,10 +268,12 @@ ExecEvalArrayRef(ArrayRefExprState *astate,
 	bool		eisnull;
 	ListCell   *l;
 	int			i = 0,
-				j = 0;
+				j = 0,
+				indexexpr;
 	IntArray	upper,
 				lower;
 	int		   *lIndex;
+	AnyArrayType *arrays;
 
 	array_source = ExecEvalExpr(astate->refexpr,
 								econtext,
@@ -293,6 +295,7 @@ ExecEvalArrayRef(ArrayRefExprState *astate,
 	foreach(l, astate->refupperindexpr)
 	{
 		ExprState  *eltstate = (ExprState *) lfirst(l);
+		eisnull = false;
 
 		if (i >= MAXDIM)
 			ereport(ERROR,
@@ -300,10 +303,23 @@ ExecEvalArrayRef(ArrayRefExprState *astate,
 					 errmsg("number of array dimensions (%d) exceeds the maximum allowed (%d)",
 							i + 1, MAXDIM)));
 
-		upper.indx[i++] = DatumGetInt32(ExecEvalExpr(eltstate,
-													 econtext,
-													 &eisnull,
-													 NULL));
+		if (eltstate == NULL && astate->refattrlength <= 0)
+		{
+			if (isAssignment)
+				ereport(ERROR,
+						(errcode(ERRCODE_ARRAY_SUBSCRIPT_ERROR),
+						 errmsg("cannot determine upper index for empty array")));
+			arrays = (AnyArrayType *) DatumGetArrayTypeP(array_source);
+			indexexpr = AARR_LBOUND(arrays)[i] + AARR_DIMS(arrays)[i] - 1;
+		}
+		else
+			indexexpr = DatumGetInt32(ExecEvalExpr(eltstate,
+												   econtext,
+												   &eisnull,
+												   NULL));
+
+		upper.indx[i++] = indexexpr;
+
 		/* If any index expr yields NULL, result is NULL or error */
 		if (eisnull)
 		{
@@ -321,6 +337,7 @@ ExecEvalArrayRef(ArrayRefExprState *astate,
 	foreach(l, astate->reflowerindexpr)
 	{
 		ExprState  *eltstate = (ExprState *) lfirst(l);
+		eisnull = false;
 
 		if (j >= MAXDIM)
 			ereport(ERROR,
@@ -328,10 +345,20 @@ ExecEvalArrayRef(ArrayRefExprState *astate,
 					 errmsg("number of array dimensions (%d) exceeds the maximum allowed (%d)",
 							j + 1, MAXDIM)));
 
-		lower.indx[j++] = DatumGetInt32(ExecEvalExpr(eltstate,
-													 econtext,
-													 &eisnull,
-													 NULL));
+		if (eltstate == NULL)
+		{
+			arrays = (AnyArrayType *) DatumGetArrayTypeP(array_source);
+			indexexpr = AARR_LBOUND(arrays)[j];
+		}
+		else
+			indexexpr = DatumGetInt32(ExecEvalExpr(eltstate,
+												   econtext,
+												   &eisnull,
+												   NULL));
+
+		lower.indx[j++] = indexexpr;
+
+		/* If any index expr yields NULL, result is NULL or error */
 		if (eisnull)
 		{
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 0b4ab23..6d9cad4 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -2415,6 +2415,8 @@ _copyAIndices(const A_Indices *from)
 	COPY_NODE_FIELD(lidx);
 	COPY_NODE_FIELD(uidx);
+	COPY_SCALAR_FIELD(lidx_default);
+	COPY_SCALAR_FIELD(uidx_default);
 
 	return newnode;
 }
diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index aa6e102..e75b448 100644
--- a/src/backend/nodes/equalfuncs.c
+++ b/src/backend/nodes/equalfuncs.c
@@ -2162,6 +2162,8 @@ _equalAIndices(const A_Indices *a, const A_Indices *b)
 {
 	COMPARE_NODE_FIELD(lidx);
 	COMPARE_NODE_FIELD(uidx);
+	COMPARE_SCALAR_FIELD(lidx_default);
+	COMPARE_SCALAR_FIELD(uidx_default);
 
 	return true;
 }
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index df7f6e1..6769740 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -2756,6 +2756,8 @@ _outA_Indices(StringInfo str, const A_Indices *node)
 	WRITE_NODE_FIELD(lidx);
 	WRITE_NODE_FIELD(uidx);
+	WRITE_BOOL_FIELD(lidx_default);
+	WRITE_BOOL_FIELD(uidx_default);
 }
 
 static void
diff --git a/src/backend/parser/gr
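With the patch applied, the shorthand forms would behave roughly like this (a sketch, assuming a column arr int[] containing '{1,2,3,4,5}'; tbl is an illustrative table name):

    SELECT arr[2:] FROM tbl;   -- same as arr[2:array_upper(arr, 1)]  => {2,3,4,5}
    SELECT arr[:3] FROM tbl;   -- same as arr[array_lower(arr, 1):3]  => {1,2,3}
    SELECT arr[:]  FROM tbl;   -- the whole array                     => {1,2,3,4,5}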
[HACKERS] plpgsql - DECLARE - cannot to use %TYPE or %ROWTYPE for composite types
Hi,

We cannot declare a variable whose type references another composite variable. This limit is probably artificial, because any composite type is a type too in PostgreSQL.

The issue: referencing composite variables doesn't work:

do $$ declare x int; y x%type; begin end; $$; -- ok
do $$ declare x pg_class; y x%type; begin end; $$; -- invalid type name "x%type"
do $$ declare x pg_class; y x%rowtype; begin end; $$; -- relation "x" does not exist

%ROWTYPE needs a record in pg_class. Probably we should not change that; the change could bring compatibility issues. So there are two possibilities:

1. %TYPE can be used for any kind of variable. This behaviour will be consistent with polymorphic parameters - we have "anyelement", and we do not have "anyrow".

2. Introduce a new keyword - %RECTYPE .. it can work, but there will be a gap relative to the polymorphic parameters.

Comments, notices?

Regards

Pavel
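Under option 1, the second failing example above would simply start working; a sketch of the intended behaviour (not something that works today):

    do $$
    declare
      x pg_class;
      y x%type;   -- y would be declared with x's composite type, pg_class
    begin
      y := x;
    end;
    $$;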
Re: [HACKERS] Foreign join pushdown vs EvalPlanQual
On 2015/10/17 9:58, Robert Haas wrote:
> But with Etsuro Fujita's patch, and I think what you have proposed has
> been similar, how are you going to do it? The proposal is to call the
> recheck method and hope for the best, but what is the recheck method
> going to do? Where is it going to get the previously-returned tuple?
> As I explained in a previous email, just returning the
> previously-returned tuple is not good enough. How will it know if it
> has already returned it during the lifetime of this EPQ check?
> Offhand, it looks to me like, at least in some circumstances, you're
> probably going to return whatever tuple you returned most recently
> (which has a good chance of being the right one, but not necessarily)
> over and over again. That's not going to fly.

No. Since the local join execution plan is created so that the scan slot for each foreign table involved in the pushed-down join looks at its EPQ slot, I think the plan can return at most one tuple.

Best regards,
Etsuro Fujita

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Foreign join pushdown vs EvalPlanQual
> On Fri, Oct 16, 2015 at 9:51 PM, Tom Lane wrote:
> > Robert Haas writes:
> >> Both you and Etsuro Fujita are proposing to fix this problem by
> >> somehow making it the FDW's problem to reconstruct the tuple
> >> previously produced by the join from whole-row images of the baserels.
> >> But that's not looking back far enough: why are we asking for
> >> whole-row images of the baserels when what we really want is a
> >> whole-row image of the output of the join? The output of the join is
> >> what we need to re-return.
> >
> > There are multiple components to the requirement though:
> >
> > 1. Recheck the rows that were in the baserels and possibly fetch updated
> > versions of them. (Once we're in EPQ, we want the most recent row
> > versions, not what the query snapshot can see.)
>
> Check. But postgres_fdw, and probably quite a few other FDWs, use
> early row locking. So ROW_MARK_COPY is in use and we need not worry
> about refetching rows.
>
> > 2. Apply relevant restriction clauses and see if the updated row versions
> > still pass the clauses.
>
> Check.
>
> > 3. If so, form a join row and return that. Else return NULL.
>
> Not check.
>
> Suppose we've got two foreign tables ft1 and ft2, using postgres_fdw.
> There is a local table t. The user does something like UPDATE t SET
> ... FROM ft1, ft2 WHERE t = ft1.a AND ft1.b = ft2.b AND ... The
> query planner generates something like:
>
> Update
> -> Join
>    -> Scan on t
>    -> Foreign Scan on
>
> If an EPQ recheck occurs, the only thing that matters is that the
> Foreign Scan return the right output row (or possibly no rows, if the
> row it would have formed no longer matches the quals). It doesn't
> matter how it does this. Let's say the columns actually needed by the
> query from the ft1-ft2 join are ft1.a, ft1.b, ft2.a, and ft2.b.
> Currently, the output of the foreign scan is something like: ft1.a,
> ft1.b, ft2.a, ft2.b, ft1.*, ft2.*. The EPQ recheck has access to ft1.*
> and ft2.*, but it's not straightforward for postgres_fdw to regenerate
> the join tuple from that. Maybe the pushed-down join was a left join,
> maybe it was a right join, maybe it was a full join. So some of the
> columns could have gone to NULL. To figure it out, you need to build
> a secondary plan tree that mimics the structure of the join you pushed
> down, which is kinda hairy.
>
In case of outer join, do we need to care about the join clause, unlike the scan qualifiers? Rows filled up with NULLs appear when there is no matching tuple on the other side. That means any row on the non-nullable side of the join is visible regardless of the join clause, as long as it satisfies the pushed-down scan qualifiers, whether or not it matches the latest rows refetched based on the latest values.

Example)

remote table: ft1

 id |  val
----+-------
  1 | 'aaa'
  2 | 'bbb'
  3 | 'ccc'

remote table: ft2

 id |  val
----+-------
  2 | 'xxx'
  3 | 'yyy'
  4 | 'zzz'

If the remote join query is:

SELECT *, ft1.*, ft2.* FROM ft1 LEFT JOIN ft2 ON ft1.id = ft2.id WHERE ft1.id < 3;

its expected result is:

 ft1.id | ft1.val | ft2.id | ft2.val |   ft1.*   |   ft2.*
--------+---------+--------+---------+-----------+-----------
      1 | 'aaa'   |  null  |  null   | (1,'aaa') |  null
      2 | 'bbb'   |     2  | 'xxx'   | (2,'bbb') | (2,'xxx')

The non-nullable side (ft1 in this case) is visible regardless of the join clause, as long as its tuples satisfy the scan qualifier (ft1.id < 3). The FDW/CSP knows which type of join it is responsible for, so it can skip evaluation of the join clauses and apply only the scan qualifiers to the base relations' tuples.

> Think how much easier your life would be if you hadn't bothered
> fetching ft1.* and ft2.*, which aren't so handy in this case, and had
> instead made the output of the foreign scan ft1.a, ft1.b, ft2.a,
> ft2.b, ROW(ft1.a, ft1.b, ft2.a, ft2.b) -- and that the output of that
> ROW() operation was stored in an EPQ slot. Now, you don't need the
> secondary plan tree any more. You've got all the data you need right
> in your hand. The values inside the ROW() constructor were evaluated
> after accounting for the goes-to-NULL effects of any pushed-down
> joins.
>
> This example is of the early row locking case, but I think the story
> is about the same if the FDW wants to do late row locking instead. If
> there's an EPQ recheck, it could issue individual row re-fetches
> against every base table and then re-do all the joins that it pushed
> down locally. But it would be faster and cleaner, I think, to send
> one query to the remote side that re-fetches all the rows at once, and
> whose target list is exactly what we need, rather than whole row
> targetlists for each baserel that then have to be rejiggered on our
> side.
>
Which approach is more reasonable? In case of early row locking, the FDW ensures all the rows involved in the join are protected from concurrent accesses, so there is no need to be concerned about refetching from the remote side. On the other hand, in case of late row locking, we need to pay a
Re: [HACKERS] Foreign join pushdown vs EvalPlanQual
On 2015/10/17 12:22, Robert Haas wrote:
> On Fri, Oct 16, 2015 at 9:51 PM, Tom Lane wrote:
>> Robert Haas writes:
>>> Both you and Etsuro Fujita are proposing to fix this problem by
>>> somehow making it the FDW's problem to reconstruct the tuple
>>> previously produced by the join from whole-row images of the baserels.
>>> But that's not looking back far enough: why are we asking for
>>> whole-row images of the baserels when what we really want is a
>>> whole-row image of the output of the join? The output of the join is
>>> what we need to re-return.
>>
>> There are multiple components to the requirement though:
>>
>> 3. If so, form a join row and return that. Else return NULL.
>
> Not check.
>
> Suppose we've got two foreign tables ft1 and ft2, using postgres_fdw.
> There is a local table t. The user does something like UPDATE t SET
> ... FROM ft1, ft2 WHERE t = ft1.a AND ft1.b = ft2.b AND ... The
> query planner generates something like:
>
> Update
> -> Join
>    -> Scan on t
>    -> Foreign Scan on
>
> If an EPQ recheck occurs, the only thing that matters is that the
> Foreign Scan return the right output row (or possibly no rows, if the
> row it would have formed no longer matches the quals). It doesn't
> matter how it does this. Let's say the columns actually needed by the
> query from the ft1-ft2 join are ft1.a, ft1.b, ft2.a, and ft2.b.
> Currently, the output of the foreign scan is something like: ft1.a,
> ft1.b, ft2.a, ft2.b, ft1.*, ft2.*. The EPQ recheck has access to ft1.*
> and ft2.*, but it's not straightforward for postgres_fdw to regenerate
> the join tuple from that. Maybe the pushed-down join was a left join,
> maybe it was a right join, maybe it was a full join. So some of the
> columns could have gone to NULL. To figure it out, you need to build
> a secondary plan tree that mimics the structure of the join you pushed
> down, which is kinda hairy.

As Tom mentioned, just recomputing the original join tuple is not good enough. We would need to rejoin the test tuples for the baserels even if ROW_MARK_COPY is in use. Consider:

A=# BEGIN;
A=# UPDATE t SET a = a + 1 WHERE b = 1;
B=# SELECT * from t, ft1, ft2
      WHERE t.a = ft1.a AND t.b = ft2.b AND ft1.c = ft2.c FOR UPDATE;
A=# COMMIT;

where the plan for the SELECT FOR UPDATE is

LockRows
-> Nested Loop
   -> Seq Scan on t
   -> Foreign Scan on
        Remote SQL: SELECT * FROM ft1 JOIN ft2 WHERE ft1.c = ft2.c AND ft1.a = $1 AND ft2.b = $2

If an EPQ recheck is invoked by A's UPDATE, just recomputing the original join tuple from the whole-row image that you proposed would output an incorrect result in the EPQ recheck, since the value of a in the updated version of the to-be-joined tuple in t would no longer match the value of ft1.a extracted from the whole-row image if A's UPDATE has committed successfully. So I think we would need to rejoin the tuples populated from the whole-row images for the baserels ft1 and ft2, by executing the secondary plan with the new parameter values for a and b.

As for the secondary plan, I think we could create the corresponding local join execution path during GetForeignJoinPaths, (1) by looking at the pathlist of the joinrel RelOptInfo, which would already contain some local join execution paths, as the patch does, or (2) by calling a helper function that creates a local join execution path from given outer/inner paths selected from the pathlists of the outerrel/innerrel RelOptInfos, as proposed by KaiGai-san before. ISTM that the latter would be better, so I plan to propose such a function as part of the postgres_fdw join pushdown patch for 9.6.
> This example is of the early row locking case, but I think the story
> is about the same if the FDW wants to do late row locking instead. If
> there's an EPQ recheck, it could issue individual row re-fetches
> against every base table and then re-do all the joins that it pushed
> down locally. But it would be faster and cleaner, I think, to send
> one query to the remote side that re-fetches all the rows at once, and
> whose target list is exactly what we need, rather than whole row
> targetlists for each baserel that then have to be rejiggered on our
> side.

I agree with you on that point. (In fact, I thought that too!) But considering that many FDWs, including postgres_fdw, currently use early row locking (i.e., ROW_MARK_COPY), I'd like to leave that for future work.

>> I think what Kaigai-san and Etsuro-san are after is trying to find a
>> way to reuse some of the existing EPQ machinery to help with that.
>> This may not be practical, or it may end up being messier than a
>> standalone implementation; but it's not silly on its face to want to
>> reuse some of that code.
>
> Yeah, I think we're all in agreement that reusing as much of the EPQ
> machinery as is sensible is something we should do. We are not in
> agreement on which parts of it need to be changed or extended.

Agreed.

Best regards,
Etsuro Fujita

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers