Re: [HACKERS] Partition-wise aggregation/grouping

2017-10-09 Thread Ashutosh Bapat
On Tue, Oct 10, 2017 at 3:15 AM, David Rowley
 wrote:
> On 10 October 2017 at 01:10, Jeevan Chalke
>  wrote:
>> Attached new patch set having HEAD at 84ad4b0 with all these review points
>> fixed. Let me know if I missed any. Thanks.
>
> I've only really skimmed over this thread and only opened the code
> enough to extract the following:
>
> + /* Multiply the costs by partition_wise_agg_cost_factor. */
> + apath->startup_cost *= partition_wise_agg_cost_factor;
> + apath->total_cost *= partition_wise_agg_cost_factor;
>
> I've not studied how all the path plumbing is done, but I think
> instead of doing this costing magic we should really stop pretending
> that Append/MergeAppend nodes are cost-free. I think something like
> charging cpu_tuple_cost per row expected through Append/MergeAppend
> would be a better approach to this.
>
> If you perform grouping or partial grouping before the Append, then in
> most cases the Append will receive fewer rows, so it will come out cheaper
> than if you perform the grouping after it. I've not learned the
> partition-wise join code enough to know if this is going to affect
> that too, but for everything else, there should be no plan change,
> since there's normally no alternative paths. I see there's even a
> comment in create_append_path() which claims the zero cost is a bit
> optimistic.
>

+1. Partition-wise join will also benefit from costing Append
processing. Comparing (number of rows * width) of the join result with the
sum of that measure over the joining relations tells us whether the Append
node processes more data in the Append->Join case than in the Join->Append
case.

The Append node just returns the result of ExecProcNode(), so charging
cpu_tuple_cost may make it too expensive. In other places where we charge
cpu_tuple_cost there is some processing done to the tuple, such as
ExecStoreTuple() in SeqNext(). Maybe we need some other measure for
Append's processing of the tuple.

Maybe we should try to measure the actual time spent in the Append node as
a fraction of, say, the time spent in the child seq scans. That might give
us a clue as to how Append processing should be charged in costing terms.
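
For illustration only (not from any posted patch), charging a per-row cost
in create_append_path() might look roughly like the sketch below; the
helper, the multiplier name and its value are all made up:

/* Hypothetical per-row charge for Append, as a fraction of cpu_tuple_cost. */
#define APPEND_CPU_COST_MULTIPLIER 0.5

static void
cost_append_sketch(Path *apath, List *subpaths)
{
	ListCell   *l;

	apath->rows = 0;
	apath->startup_cost = 0;
	apath->total_cost = 0;

	foreach(l, subpaths)
	{
		Path	   *subpath = (Path *) lfirst(l);

		/* startup cost is that of the first subpath to be scanned */
		if (l == list_head(subpaths))
			apath->startup_cost = subpath->startup_cost;

		apath->rows += subpath->rows;
		apath->total_cost += subpath->total_cost;

		/* the new bit: charge something for pulling each row through Append */
		apath->total_cost += APPEND_CPU_COST_MULTIPLIER *
			cpu_tuple_cost * subpath->rows;
	}
}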

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company




[HACKERS] Extended statistics is not working on Vars hidden under a RelabelType

2017-10-09 Thread David Rowley
Basically, $subject is causing us not to properly find matching
extended stats in this case.

The attached patch fixes it.
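
As a hedged guess at the shape of such a fix (the attached patch is what
actually matters and may differ), the clause-compatibility test presumably
needs to look through a RelabelType before insisting on a plain Var, along
these lines; the helper name here is hypothetical:

static bool
expression_is_simple_column(Node *node)
{
	/* look through binary-compatible casts such as varchar -> text */
	if (IsA(node, RelabelType))
		node = (Node *) ((RelabelType *) node)->arg;

	return IsA(node, Var);
}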

The following test case is an example of the misbehaviour. Note
rows=1 vs rows=98 in the Gather node.

create table ab (a varchar, b varchar);
insert into ab select (x%1000)::varchar, (x%1)::Varchar from
generate_Series(1,100)x;
create statistics ab_a_b_stats (dependencies) on a,b from ab;
analyze ab;

-- Unpatched
explain analyze select * from ab where a = '1' and b = '1';
   QUERY PLAN
-
 Gather  (cost=1000.00..12466.10 rows=1 width=7) (actual
time=0.441..90.515 rows=100 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on ab  (cost=0.00..11466.00 rows=1 width=7)
(actual time=1.081..74.944 rows=33 loops=3)
 Filter: (((a)::text = '1'::text) AND ((b)::text = '1'::text))
 Rows Removed by Filter: 00
 Planning time: 0.184 ms
 Execution time: 105.878 ms
(8 rows)

-- Patched
explain analyze select * from ab where a = '1' and b = '1';
QUERY PLAN
--
 Gather  (cost=1000.00..12475.80 rows=98 width=7) (actual
time=1.076..92.595 rows=100 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on ab  (cost=0.00..11466.00 rows=41 width=7)
(actual time=0.491..77.833 rows=33 loops=3)
 Filter: (((a)::text = '1'::text) AND ((b)::text = '1'::text))
 Rows Removed by Filter: 00
 Planning time: 2.175 ms
 Execution time: 106.326 ms
(8 rows)

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


allow_relabelled_vars_in_dependency_stats.patch
Description: Binary data



[HACKERS] Fix a typo in execReplication.c

2017-10-09 Thread Masahiko Sawada
Hi,

Attached is a patch for $subject. ISTM these are copy-and-paste mistakes.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


fix_comment_in_execReplication_c.patch
Description: Binary data



[HACKERS] Re: [BUGS] 10.0: Logical replication doesn't execute BEFORE UPDATE OF trigger

2017-10-09 Thread Masahiko Sawada
On Mon, Oct 9, 2017 at 11:13 PM, Aleksander Alekseev
 wrote:
> Hi hackers,
>
> I've found something that looks like a bug.
>
> Steps to reproduce
> --
>
> There are 3 instances of PostgreSQL 10.0 - inst1, inst2 and inst3. There
> is a table `test` on every instance:
>
> ```
> CREATE TABLE test(k TEXT PRIMARY KEY, v TEXT);
> ```
>
> Both inst1 and inst2 have `allpub` publication:
>
> ```
> CREATE PUBLICATION allpub FOR ALL TABLES;
> ```
>
> ... and inst3 is subscribed for both publications:
>
> ```
> CREATE SUBSCRIPTION allsub1
>   CONNECTION 'host=10.128.0.16 user=eax dbname=eax'
>   PUBLICATION allpub;
>
> CREATE SUBSCRIPTION allsub2
>   CONNECTION 'host=10.128.0.26 user=eax dbname=eax'
>   PUBLICATION allpub;
> ```
>
> So basically it's two masters, one replica configuration. To resolve
> insert/update conflicts I've created the following triggers on inst3:
>
> ```
> CREATE OR REPLACE FUNCTION test_before_insert()
> RETURNS trigger AS $$
> BEGIN
>
> RAISE NOTICE 'test_before_insert trigger executed';
>
> IF EXISTS (SELECT 1 FROM test where k = new.k) THEN
>RAISE NOTICE 'test_before_insert trigger - merging data';
>UPDATE test SET v = v || ';' || new.v WHERE k = new.k;
>RETURN NULL;
> END IF;
>
> RETURN NEW;
>
> END
> $$ LANGUAGE plpgsql;
>
>
> CREATE OR REPLACE FUNCTION test_before_update()
> RETURNS trigger AS $$
> BEGIN
>
> RAISE NOTICE 'test_before_update trigger executed';
>
> IF EXISTS (SELECT 1 FROM test where k = new.k) THEN
>RAISE NOTICE 'test_before_update trigger - merging data';
>UPDATE test SET v = v || ';' || new.v WHERE k = new.k;
>DELETE FROM test where k = old.k;
>RETURN NULL;
> END IF;
>
> RETURN NEW;
>
> END
> $$ LANGUAGE plpgsql;
>
> create trigger test_before_insert_trigger
> before insert on test
> for each row execute procedure test_before_insert();
>
> create trigger test_before_update_trigger
> before update of k on test
> for each row execute procedure test_before_update();
>
> ALTER TABLE test ENABLE REPLICA TRIGGER test_before_insert_trigger;
> ALTER TABLE test ENABLE REPLICA TRIGGER test_before_update_trigger;
> ```
>
> The INSERT trigger works just as expected, however the UPDATE trigger
> doesn't. On inst1:
>
> ```
> insert into test values ('k1', 'v1');
> ```
>
> In inst2:
>
> ```
> insert into test values ('k4', 'v4');
> update test set k = 'k1' where k = 'k4';
> ```
>
> Now on inst3:
>
> ```
> select * from test;
> ```
>
> Expected result
> ---
>
> Rows are merged to:
>
> ```
>  k  |   v
> ----+-------
>  k1 | v1;v4
> ```
>
> This is what would happen if all insert/update queries had been
> executed on one instance.
>
> Actual result
> -
>
> Replication fails, log contains:
>
> ```
> [3227] ERROR:  duplicate key value violates unique constraint "test_pkey"
> [3227] DETAIL:  Key (k)=(k1) already exists.
> [3176] LOG:  worker process: logical replication worker for subscription 
> 16402 (PID 3227) exited with exit code 1
> ```
>
> What do you think?
>

I think the cause of this issue is that the apply worker doesn't set
updatedCols of the RangeTblEntry when applying updates, so TriggerEnabled
always ends up false. I'll make a patch and submit it.
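
As a rough, hypothetical sketch of that idea (the real patch may well look
different), the apply worker would need to record the columns replaced by
the remote update in the target RTE's updatedCols bitmap, so that
column-specific UPDATE triggers can match; the helper name is invented:

/* Hypothetical helper: mark remotely-updated columns in the target RTE. */
static void
mark_updated_columns(RangeTblEntry *rte, Relation rel,
					 LogicalRepTupleData *newtup)
{
	TupleDesc	desc = RelationGetDescr(rel);
	int			i;

	for (i = 0; i < desc->natts; i++)
	{
		if (newtup->changed[i])
			rte->updatedCols = bms_add_member(rte->updatedCols,
								i + 1 - FirstLowInvalidHeapAttributeNumber);
	}
}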

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center




Re: [HACKERS] On markers of changed data

2017-10-09 Thread Peter Geoghegan
On Sat, Oct 7, 2017 at 6:34 AM, Stephen Frost  wrote:
>> > That’s actually what pg_rman is doing for what it calls incremental
>> > backups (perhaps that would be differential backup in PG
>> > terminology?), and the performance is bad as you can imagine. We could
>> > have a dedicated LSN map to do such things with 4 bytes per page. I am
>> > still not convinced that this much facility and the potential bug
>> > risks are worth it though, Postgres already knows about differential
>> > backups if you shape it as a delta of WAL segments. I think that, in
>> > order to find a LSN map more convincing, we should find first other
>> > use cases where it could become useful. Some use cases may pop up with
>> > VACUUM, but I have not studied the question hard enough...
>>
>> The case I've discussed with barman developers is a large database
>> (couple dozen of TBs should be enough) where a large fraction (say 95%)
>> is read-only but there are many changes to the active part of the data,
>> so that WAL is more massive than size of active data.
>
> Yes, we've seen environments like that also.

I'm pretty sure that those cases are cases where there are many more
FPIs than might be expected, due to a lack of locality. (UUID PKs can
make the size of WAL balloon, for example.)

-- 
Peter Geoghegan




Re: [HACKERS] Issues with logical replication

2017-10-09 Thread Stas Kelvich

> On 2 Oct 2017, at 19:59, Petr Jelinek  wrote:
> 
>> 
>> Program terminated with signal SIGABRT, Aborted.
>> #0  0x7f3608f8ec37 in __GI_raise (sig=sig@entry=6) at
>> ../nptl/sysdeps/unix/sysv/linux/raise.c:56
>> 56../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or directory.
>> (gdb) bt
>> #0  0x7f3608f8ec37 in __GI_raise (sig=sig@entry=6) at
>> ../nptl/sysdeps/unix/sysv/linux/raise.c:56
>> #1  0x7f3608f92028 in __GI_abort () at abort.c:89
>> #2  0x009f5740 in ExceptionalCondition (conditionName=0xbf6b30
>> "!(((xid) != ((TransactionId) 0)))",
>> errorType=0xbf69af "FailedAssertion", fileName=0xbf69a8 "lmgr.c",
>> lineNumber=582) at assert.c:54
>> #3  0x0086ac1d in XactLockTableWait (xid=0, rel=0x0, ctid=0x0,
>> oper=XLTW_None) at lmgr.c:582
>> #4  0x0081c9f7 in SnapBuildWaitSnapshot (running=0x2749198,
>> cutoff=898498) at snapbuild.c:1400
>> #5  0x0081c7c7 in SnapBuildFindSnapshot (builder=0x2807910,
>> lsn=798477760, running=0x2749198) at snapbuild.c:1311
>> #6  0x0081c339 in SnapBuildProcessRunningXacts
>> (builder=0x2807910, lsn=798477760, running=0x2749198) at snapbuild.c:1106
> 
> 
> Hmm this would suggest that XLOG_RUNNING_XACTS record contains invalid
> transaction ids but we specifically skip those in
> GetRunningTransactionData(). Can you check pg_waldump output for that
> record (lsn is shown in the backtrace)?

I investigated this case and it seems that XactLockTableWait() in
SnapBuildWaitSnapshot() does not always work as expected.
XactLockTableWait() waits on LockAcquire() for the xid to be completed, and
if we finally get this lock but the transaction is still in progress, such
an xid is assumed to be a subxid. However, the LockAcquire/LockRelease
cycle can happen after the transaction has set its xid, but before
XactLockTableInsert().

Namely, the following history happened for xid 5225 and led to the crash:

[backend] LOG:  AssignTransactionId: XactTopTransactionId = 5225
   [walsender] LOG:  LogCurrentRunningXacts: Wrote RUNNING_XACTS xctn=1, xid[0]=5225
   [walsender] LOG:  XactLockTableWait: LockAcquire 5225
   [walsender] LOG:  XactLockTableWait: LockRelease 5225
[backend] LOG:  AssignTransactionId: LockAcquire ExclusiveLock 5225
   [walsender] LOG:  TransactionIdIsInProgress: SVC->latestCompletedXid=5224 < xid=5225 => true
[backend] LOG:  CommitTransaction: ProcArrayEndTransaction xid=5225, ipw=0
[backend] LOG:  CommitTransaction: ResourceOwnerRelease locks xid=5225

I’ve quickly checked other usages of XactLockTableWait() and it seems that
it is mostly used with xids coming from the heap, so the transaction has
definitely set its lock somewhere in the past.

I'm not sure what the best approach to handle that is. Maybe write out
running xacts only if they have already set their lock?

Also attaching a pgbench script that can reproduce the failure. I run it
over the usual pgbench database in 10 threads. It takes several minutes to
crash.



reload.pgb
Description: Binary data

Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple

2017-10-09 Thread Michael Paquier
On Mon, Oct 9, 2017 at 2:29 AM, Peter Geoghegan  wrote:
> On Sat, Oct 7, 2017 at 4:25 PM, Alvaro Herrera  
> wrote:
>> Hmm, I think I added a random sleep (max. 100ms) right after the
>> HeapTupleSatisfiesVacuum call in vacuumlazy.c (lazy_scan_heap), and that
>> makes the race easier to hit.
>
> I still cannot reproduce. Perhaps you can be more specific?

I have been trying to reproduce things for a total of 4 hours, testing
various variations of the proposed test cases (killed Postgres, changed
fillfactor, manual sleep calls), but I have been unable to see a regression
either. I would not think that the OS matters here; all my attempts were on
macOS with assertions and debugging enabled.

At least the code is now more stable, which is definitely a good thing.
-- 
Michael




Re: [HACKERS] search path security issue?

2017-10-09 Thread Fabrízio de Royes Mello
On Sun, Oct 8, 2017 at 3:34 PM, Joe Conway  wrote:
>
> On 10/06/2017 12:52 AM, Magnus Hagander wrote:
> > It would be a nice feature to have in general, like a "basic guc
> > permissions" thing. At least allowing a superuser to prevent exactly
> > this. You could argue the same thing for example for memory parameters
> > and such. We have no permissions at all when it comes to userset gucs
> > today -- and of course, if something should be added about this, it
> > should be done in a way that works for all the userset variables, not
> > just search_path.
>
> +1
>
> I have wished for exactly that more than once before.
>

+1

I have a multi-user, multi-schema database with some user restrictions
because it's a shared environment (e.g. statement_timeout, lock_timeout).
But some "smart" users change back to the GUC defaults and sometimes we
suffer from it. It would be nice to have some kind of GUC permissions.

Regards,

--
Fabrízio de Royes Mello
Consultoria/Coaching PostgreSQL
>> Timbira: http://www.timbira.com.br
>> Blog: http://fabriziomello.github.io
>> Linkedin: http://br.linkedin.com/in/fabriziomello
>> Twitter: http://twitter.com/fabriziomello
>> Github: http://github.com/fabriziomello


[HACKERS] heap/SLRU verification, relfrozenxid cut-off, and freeze-the-dead bug (Was: amcheck (B-Tree integrity checking tool))

2017-10-09 Thread Peter Geoghegan
On Sun, Oct 16, 2016 at 6:46 PM, Noah Misch  wrote:
>> Noah and I discussed possible future directions for amcheck in person
>> recently. I would like to get Noah's thoughts again here on how a tool
>> like amcheck might reasonably target other access methods for
>> verification. In particular, the heapam. MultiXacts were mentioned as
>> a structure that could receive verification in a future iteration of
>> this tool, but I lack expertise there.
>
> Yes, a heap checker could examine plenty of things.  Incomplete list:

The heap checking enhancement that is under review in the next CF [1],
which checks an index against the table it indexes, doesn't directly
test any of the things you outlined in this mail almost exactly a year
ago. (That said, some of the same things are checked indirectly,
simply by using IndexBuildHeapScan().)

I do still think that there is a very real need for a "direct
heap/SLRU checker" function, though. I only put off implementing one
because there was still some low hanging fruit. I would like to go
over your suggestions again now. My understanding of what we *can* do
has evolved in the past several months.

> - Detect impossible conditions in the hint bits.  A tuple should not have both
>   HEAP_XMAX_COMMITTED and HEAP_XMAX_INVALID.  Every tuple bearing
>   HEAP_ONLY_TUPLE should bear HEAP_UPDATED.  HEAP_HASVARWIDTH should be true
>   if and only if the tuple has a non-NULL value in a negative-typlen column,
>   possibly a dropped column.  A tuple should not have both HEAP_KEYS_UPDATED
>   and HEAP_XMAX_LOCK_ONLY.

It certainly makes sense to check for clearly contradictory hint bits like this.
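
For instance, sticking to the conditions listed above, a minimal sketch of
such checks could look like this (report_corruption() is a hypothetical
reporting helper, not amcheck code):

static void
check_tuple_hint_bits(HeapTupleHeader tup)
{
	uint16		infomask = tup->t_infomask;
	uint16		infomask2 = tup->t_infomask2;

	if ((infomask & HEAP_XMAX_COMMITTED) && (infomask & HEAP_XMAX_INVALID))
		report_corruption("xmax is marked both committed and invalid");

	if ((infomask2 & HEAP_ONLY_TUPLE) && !(infomask & HEAP_UPDATED))
		report_corruption("heap-only tuple is not marked HEAP_UPDATED");

	if ((infomask2 & HEAP_KEYS_UPDATED) && (infomask & HEAP_XMAX_LOCK_ONLY))
		report_corruption("tuple is marked both HEAP_KEYS_UPDATED and HEAP_XMAX_LOCK_ONLY");
}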

> - Verify agreement between CLOG, MULTIXACT, and hint bits.

This is where it gets complicated, I think. This is what I really want
to talk about.

The recent commit 20b65522 pretty clearly established that a tuple
could technically escape freezing (having HEAP_XMIN_FROZEN set)
despite actually being before a relfrozenxid cut-off. The idea was
that as long as we reliably set hint bits on heap-only tuples through
WAL-logging, it doesn't matter that their CLOG could be truncated,
because nobody will ever need to interrogate the CLOG anyway (to coin
a phrase, the heap-only tuple nevertheless still had its xmax
"logically frozen"). If nothing else, this leaves me with a very
squishy set of invariant conditions to work with when I go to verify
agreement with CLOG, MULTIXACT, and hint bits.

So, the question is: What is the exact set of invariant conditions
that I can check when I go to verify agreement between a heap relation
and the various SLRUs? In particular, at what precise XID-wise point
do CLOG and MULTIXACT stop reliably storing transaction status? And,
is there a different point for the xmax of tuples that happen to be
heap-only tuples?

Another important concern here following 20b65522 is: Why is it safe
to assume that nobody will ever call TransactionIdDidCommit() for
"logically frozen" heap-only tuples that are not at the end of their
HOT chain, and in so doing get a wrong answer? I can't find a rule
that implies that there is no dangerous ambiguity that's written down
anywhere. I *can* find a comment that suggests that it's wrong, though
-- the "N.B." comment at the top of heap_prepare_freeze_tuple().
(Granted, that comment doesn't seem to acknowledge the fact that the
caller does WAL-log as part of freezing; there has been some churn in
this area.)

I'm pretty sure that the classic rule for TransactionIdDidCommit()
when called from tqual.c was that the tuple cannot have
HEAP_XMIN_FROZEN set, which was straightforward enough back when a
frozen tuple was assumed to have necessarily committed (and to have a
"raw" xmin point that should logically be visible to everyone). This
business of "logically frozen" xmax values for dead heap-only tuples
makes things *way* more complicated, though. I'm concerned that we've
failed to account for that, and that TransactionIdDidCommit() calls
concerning pre-relfrozenxid XIDs can still happen.

I should point out that for whatever the reason, there is evidence
that we do in fact risk TransactionIdDidCommit() calls that give wrong
answers (independent of the  ~2014 MULTIXACT bugs where that was
clearly the case, and even before 20b65522): Jeff Janes showed [2]
that there is probably some unaccounted-for case where "premature"
truncation takes place. This may be related to the recent HOT
chain/freezing bugs, and our (only partial [3]) fixes may have fixed
that, too -- I just don't know.

[1] https://commitfest.postgresql.org/15/1263/
[2] 
https://postgr.es/m/CAMkU=1y-tnp_7jdfg_ubxqdb8esmo280acdgdwsa429hrte...@mail.gmail.com
[3] https://postgr.es/m/20171007232524.sdpi3jk2636fvjge@alvherre.pgsql
-- 
Peter Geoghegan




Re: [HACKERS] Support to COMMENT ON DATABASE CURRENT_DATABASE

2017-10-09 Thread Jing Wang
Hi

I don't know why the previous email isn't linked to the original email's
thread on the web archive. It is weird. So I'm supplementing the following
information for clarity:

The original email link:
https://www.postgresql.org/message-id/CAF3%2BxM%2BxSswcWQZMP1cjj12gPz8DXHcM9_fT1y-0fVzxi9pmOw%40mail.gmail.com

The recent email with the updated patch is the following:
https://www.postgresql.org/message-id/CAF3%2BxMKkKpd8oVRpN9i1BFMau2dFhVrt-Y0BE%3DDsoKtOm%3DC2AQ%40mail.gmail.com

-- 
Regards,
Jing Wang
Fujitsu Australia


Re: [HACKERS] Partition-wise aggregation/grouping

2017-10-09 Thread David Rowley
On 10 October 2017 at 01:10, Jeevan Chalke
 wrote:
> Attached new patch set having HEAD at 84ad4b0 with all these review points
> fixed. Let me know if I missed any. Thanks.

I've only really skimmed over this thread and only opened the code
enough to extract the following:

+ /* Multiply the costs by partition_wise_agg_cost_factor. */
+ apath->startup_cost *= partition_wise_agg_cost_factor;
+ apath->total_cost *= partition_wise_agg_cost_factor;

I've not studied how all the path plumbing is done, but I think
instead of doing this costing magic we should really stop pretending
that Append/MergeAppend nodes are cost-free. I think something like
charging cpu_tuple_cost per row expected through Append/MergeAppend
would be a better approach to this.

If you perform grouping or partial grouping before the Append, then in
most cases the Append will receive fewer rows, so it will come out cheaper
than if you perform the grouping after it. I've not learned the
partition-wise join code enough to know if this is going to affect
that too, but for everything else, there should be no plan change,
since there's normally no alternative paths. I see there's even a
comment in create_append_path() which claims the zero cost is a bit
optimistic.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] oversight in EphemeralNamedRelation support

2017-10-09 Thread Julien Rouhaud
On Mon, Oct 9, 2017 at 10:43 PM, Thomas Munro
 wrote:
>
> I suppose we could consider moving the schemaname check into
> getRTEForSpecialRelationType(), since otherwise both callers need to
> do that (and as you discovered, one forgot).

Thanks for the feedback.  That was my first idea, but I assumed there
could be a future use for this function on a qualified RangeVar if it
wasn't done this way.

I agree it'd be much safer, so v2 is attached, with the check moved into
getRTEForSpecialRelationTypes().
diff --git a/src/backend/parser/parse_clause.c b/src/backend/parser/parse_clause.c
index 9ff80b8b40..255f485494 100644
--- a/src/backend/parser/parse_clause.c
+++ b/src/backend/parser/parse_clause.c
@@ -184,8 +184,10 @@ setTargetTable(ParseState *pstate, RangeVar *relation,
RangeTblEntry *rte;
int rtindex;
 
-   /* So far special relations are immutable; so they cannot be targets. */
+   /* Check if it's a CTE or tuplestore reference */
rte = getRTEForSpecialRelationTypes(pstate, relation);
+
+   /* So far special relations are immutable; so they cannot be targets. */
if (rte != NULL)
ereport(ERROR,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
@@ -1072,6 +1074,12 @@ transformRangeTableSample(ParseState *pstate, RangeTableSample *rts)
 }
 
 
+/*
+ * getRTEForSpecialRelationTypes
+ *
+ * If the given RangeVar is a CTE reference or an EphemeralNamedRelation, return
+ * the transformed RangeTblEntry, otherwise return NULL
+ */
 static RangeTblEntry *
 getRTEForSpecialRelationTypes(ParseState *pstate, RangeVar *rv)
 {
@@ -1079,6 +1087,13 @@ getRTEForSpecialRelationTypes(ParseState *pstate, RangeVar *rv)
Index   levelsup;
RangeTblEntry *rte = NULL;
 
+   /*
+* if it is a qualified name, it can't be a CTE or tuplestore
+* reference
+*/
+   if (rv->schemaname)
+   return NULL;
+
cte = scanNameSpaceForCTE(pstate, rv->relname, &levelsup);
if (cte)
rte = transformCTEReference(pstate, rv, cte, levelsup);
@@ -1119,15 +1134,11 @@ transformFromClauseItem(ParseState *pstate, Node *n,
/* Plain relation reference, or perhaps a CTE reference */
RangeVar   *rv = (RangeVar *) n;
RangeTblRef *rtr;
-   RangeTblEntry *rte = NULL;
+   RangeTblEntry *rte;
int rtindex;
 
-   /*
-* if it is an unqualified name, it might be a CTE or tuplestore
-* reference
-*/
-   if (!rv->schemaname)
-   rte = getRTEForSpecialRelationTypes(pstate, rv);
+   /* Check if it's a CTE or tuplestore reference */
+   rte = getRTEForSpecialRelationTypes(pstate, rv);
 
/* if not found above, must be a table reference */
if (!rte)
diff --git a/src/test/regress/expected/with.out b/src/test/regress/expected/with.out
index c32a490580..53ea9991b2 100644
--- a/src/test/regress/expected/with.out
+++ b/src/test/regress/expected/with.out
@@ -2275,3 +2275,7 @@ with ordinality as (select 1 as x) select * from ordinality;
 -- check sane response to attempt to modify CTE relation
 WITH d AS (SELECT 42) INSERT INTO d VALUES (1);
 ERROR:  relation "d" cannot be the target of a modifying statement
+-- check qualified relation name doesn't conflict with CTE name
+CREATE TABLE public.self (id integer);
+WITH self AS (SELECT 42) INSERT INTO public.self SELECT * from self;
+DROP TABLE public.self;
diff --git a/src/test/regress/sql/with.sql b/src/test/regress/sql/with.sql
index 8ae5184d0f..17f32c3c87 100644
--- a/src/test/regress/sql/with.sql
+++ b/src/test/regress/sql/with.sql
@@ -1031,3 +1031,8 @@ with ordinality as (select 1 as x) select * from ordinality;
 
 -- check sane response to attempt to modify CTE relation
 WITH d AS (SELECT 42) INSERT INTO d VALUES (1);
+
+-- check qualified relation name doesn't conflict with CTE name
+CREATE TABLE public.self (id integer);
+WITH self AS (SELECT 42) INSERT INTO public.self SELECT * from self;
+DROP TABLE public.self;



Re: [HACKERS] oversight in EphemeralNamedRelation support

2017-10-09 Thread Thomas Munro
On Tue, Oct 10, 2017 at 2:35 AM, Julien Rouhaud  wrote:
> Hugo Mercier (in Cc) reported me an error in a query, failing since pg10.
>
> Simple test case to reproduce:
>
> CREATE TABLE public.test (id integer);
> WITH test AS (select 42) INSERT INTO public.test SELECT * FROM test;
>
> which will fail with "relation "test" cannot be the target of a
> modifying statement".
>
> IIUC, that's an oversight in 18ce3a4ab22, where setTargetTable()
> doesn't exclude qualified relation when searching for special
> relation.

I agree.

> PFA a simple patch to fix this issue, with updated regression test.

Thanks!

I suppose we could consider moving the schemaname check into
getRTEForSpecialRelationType(), since otherwise both callers need to
do that (and as you discovered, one forgot).

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [HACKERS] Columnar storage support

2017-10-09 Thread Jaime Casanova
On 9 October 2017 at 15:03, legrand legrand  wrote:
> Is there a chance that pluggable storage would permit the creation of a
> columnar RDBMS like MonetDB in PostgreSQL?

pluggable storage is one of the pieces for it, yes...
but it is not as simple as that, IIUC

-- 
Jaime Casanova  www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Columnar storage support

2017-10-09 Thread Joshua D. Drake

On 10/09/2017 01:03 PM, legrand legrand wrote:

Is there a chance that pluggable storage would permit the creation of a
columnar RDBMS like MonetDB in PostgreSQL?
Thanks in advance for the answer


The extension C-Store from Citus is probably what you are looking for.

jD










--
Command Prompt, Inc. || http://the.postgres.company/ || @cmdpromptinc

PostgreSQL Centered full stack support, consulting and development.
Advocate: @amplifypostgres || Learn: https://pgconf.us
* Unless otherwise stated, opinions are my own.   *




[HACKERS] Columnar storage support

2017-10-09 Thread legrand legrand
Is there a chance that pluggable storage would permit the creation of a
columnar RDBMS like MonetDB in PostgreSQL?
Thanks in advance for the answer







[HACKERS] Unnecessary complexity around CurrentResourceOwner changes?

2017-10-09 Thread Tom Lane
We have some dozens of places that transiently change CurrentResourceOwner
and then set it back to what it had been.  I happened to notice that,
while many of these places have a PG_TRY block to ensure that the restore
happens even if they lose control to an error, not all of them have one.
My initial reaction was that the latter were buggy, since resowner/README
says you should do it with a PG_TRY.  On further reflection, though, why
are we paying the rather high price of a PG_TRY block for that?  ISTM that
CurrentResourceOwner is much like CurrentMemoryContext, and we certainly
don't worry about protecting transient changes of CurrentMemoryContext
this way.  The reason it's safe not to is that the first step in
transaction or subtransaction abort is to reset CurrentMemoryContext to
something sane.  But the second step is to reset CurrentResourceOwner
to something sane, so it seems like we have that base covered.
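
To make the comparison concrete, here is a simplified sketch of the two
styles (names like myOwner are placeholders, not taken from any particular
caller):

/* Style currently recommended by resowner/README: */
ResourceOwner oldOwner = CurrentResourceOwner;

PG_TRY();
{
	CurrentResourceOwner = myOwner;
	/* ... work that might throw an error ... */
	CurrentResourceOwner = oldOwner;
}
PG_CATCH();
{
	CurrentResourceOwner = oldOwner;
	PG_RE_THROW();
}
PG_END_TRY();

/* Proposed style, relying on (sub)transaction abort to restore a sane owner: */
CurrentResourceOwner = myOwner;
/* ... work that might throw an error ... */
CurrentResourceOwner = oldOwner;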

So my inclination is to change the advice in resowner/README and rip out
all the PG_TRY blocks that are used only to force restore of
CurrentResourceOwner.  (We might as well leave the code alone in places
where the PG_TRY has to do other things too.)

Thoughts?

regards, tom lane




Re: [HACKERS] On markers of changed data

2017-10-09 Thread Andrey Borodin
Hi Michael!
> On 9 Oct 2017, at 17:28, Michael Paquier wrote:
>> VM is WAL-logged [0]
>> FSM is not [1]
> 
> If you are willing to go down this road, here are my takes on the matter:
> - Any LSN map should be in a different file than FSM and VM. The VM
> uses 2 bits per blocks now, and the FSM 8 bits. The FSM is wanted to
> remain small, so trying to plug into it more data is not welcome. The
> VM bits also are dedicated to purely visibility matters. We may not
> want to use that for the VM.
> - Those should not be logged, as we would end up with tracking down
> WAL records for things that themselves track the effects of other
> records.
I was asking about the FSM and VM not because I wanted to place something
there, but because I was looking for a way to back up the FSM and VM
efficiently. The VM can be copied page-incrementally, the FSM cannot.

But the design you are describing resembles PTRACK [0]: a fork for tracking
page changes, not WAL-logged itself, but crash-safe because it can be
rebuilt during recovery from the WAL of the records it tracks.

Best regards, Andrey Borodin.


[0] https://gist.github.com/stalkerg/ab833d94e2f64df241f1835651e06e4b 


Re: [HACKERS] Pluggable storage

2017-10-09 Thread Alexander Korotkov
On Mon, Oct 9, 2017 at 5:32 PM, Tom Lane  wrote:

> Alexander Korotkov  writes:
> > For me, it's crucial point that pluggable storages should be able to have
> > different MVCC implementation, and correspondingly have full control over
> > its interactions with indexes.
> > Thus, it would be good if we would get consensus on that point.  I'd like
> > other discussion participants to comment whether they agree/disagree and
> > why.
> > Any comments?
>
> TBH, I think that's a good way of ensuring that nothing will ever get
> committed.  You're trying to draw the storage layer boundary at a point
> that will take in most of the system.  If we did build it like that,
> what we'd end up with would be very reminiscent of mysql's storage
> engines, complete with inconsistent behaviors and varying feature sets
> across engines.  I don't much want to go there.
>

However, if we insist that pluggable storages should have the same MVCC
implementation, interact with indexes the same way, and also use TIDs as
tuple identifiers, then what useful implementations might we have?
Per-page heap compression and encryption?  A different heap page layout?
A different tuple format?  OK, but that doesn't justify as wide an API as
is implemented in the current version of the patch in this thread.  If we
really want to restrict the applicability of pluggable storages that way,
then we should probably give up on "pluggable storages" and make it
"pluggable heap page format", as I proposed upthread.

Implementing an alternative storage would be a hard and challenging task.
Yes, it would include reimplementing a significant part of the system.
But that seems inevitable if we're going to implement really alternative
storages (not just hacks over the existing storage).  And I don't think
that our pluggable storages would be reminiscent of MySQL's storage engines
as long as we keep two properties:
1) All the storages use the same WAL stream,
2) All the storages use the same transactions and snapshots.
If we keep these two properties, we would need neither 2PC to run
transactions across different storages, nor a separate log for replication.
Those two are major drawbacks of the MySQL model.
Varying feature sets across engines seem inevitable and natural.  We have
to invent alternative storages to get features which are hard to have in
our current storage.  So, no wonder that feature sets would vary...

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] Pluggable storage

2017-10-09 Thread Tom Lane
Alexander Korotkov  writes:
> For me, it's crucial point that pluggable storages should be able to have
> different MVCC implementation, and correspondingly have full control over
> its interactions with indexes.
> Thus, it would be good if we would get consensus on that point.  I'd like
> other discussion participants to comment whether they agree/disagree and
> why.
> Any comments?

TBH, I think that's a good way of ensuring that nothing will ever get
committed.  You're trying to draw the storage layer boundary at a point
that will take in most of the system.  If we did build it like that,
what we'd end up with would be very reminiscent of mysql's storage
engines, complete with inconsistent behaviors and varying feature sets
across engines.  I don't much want to go there.

regards, tom lane




Re: [HACKERS] Pluggable storage

2017-10-09 Thread Alexander Korotkov
On Wed, Sep 27, 2017 at 7:51 PM, Alexander Korotkov <
a.korot...@postgrespro.ru> wrote:

> I took a look on this patch.  I've following notes for now.
>
> tuple_insert_hook tuple_insert; /* heap_insert */
>> tuple_update_hook tuple_update; /* heap_update */
>> tuple_delete_hook tuple_delete; /* heap_delete */
>
>
> I don't think this set of functions provides a good enough level of
> abstraction for a storage AM.  These functions encapsulate only the
> low-level work of inserting/updating/deleting tuples in the heap itself.
> However, it is still assumed that indexes are managed outside of the
> storage AM.  I don't think this is right, given that one of the most
> demanded storage API usages would be a different MVCC implementation (for
> instance, UNDO-log-based as discussed upthread).  A different MVCC
> implementation is likely going to manage indexes in a different way.  For
> example, a storage AM utilizing UNDO would implement in-place update even
> when indexed columns are modified.  Therefore this piece of code in
> ExecUpdate() wouldn't be relevant anymore.
>
> /*
>> * insert index entries for tuple
>> *
>> * Note: heap_update returns the tid (location) of the new tuple in
>> * the t_self field.
>> *
>> * If it's a HOT update, we mustn't insert new index entries.
>> */
>> if ((resultRelInfo->ri_NumIndices > 0) && 
>> !storage_tuple_is_heaponly(resultRelationDesc,
>> tuple))
>> recheckIndexes = ExecInsertIndexTuples(slot, &(slot->tts_tid),
>>   estate, false, NULL, NIL);
>
>
> I'm firmly convinced that this logic should be encapsulated into the
> storage AM, together with inserting new index tuples on storage insert.
> Also, HOT should be completely encapsulated into heapam.  It's quite
> evident to me that a storage utilizing UNDO wouldn't have anything like
> our current HOT.  Therefore, I think there shouldn't be a
> hot_search_buffer() API function.  tuple_fetch() may cover
> hot_search_buffer(); that might require some signature change of
> tuple_fetch() (probably extra arguments).
>

For me, it's a crucial point that pluggable storages should be able to have
different MVCC implementations, and correspondingly have full control over
their interactions with indexes.
Thus, it would be good if we could reach consensus on that point.  I'd like
other discussion participants to comment on whether they agree/disagree and
why.
Any comments?
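
Purely as an illustration of what I mean (this is not the API from the
patch under discussion, just an invented sketch): with index maintenance
pushed down into the storage AM, the table-level callbacks might look
roughly like this, with no HOT-specific entry points exposed to the
executor:

typedef struct StorageAmRoutineSketch
{
	/* insert a tuple plus whatever index entries the AM decides are needed */
	void		(*tuple_insert) (Relation rel, TupleTableSlot *slot,
								 CommandId cid, EState *estate);

	/* update in place or out of place; the AM handles index maintenance */
	HTSU_Result (*tuple_update) (Relation rel, ItemPointer otid,
								 TupleTableSlot *newslot,
								 CommandId cid, Snapshot snapshot,
								 EState *estate);

	/* fetch a visible version by identifier, subsuming hot_search_buffer() */
	bool		(*tuple_fetch) (Relation rel, ItemPointer tid,
								Snapshot snapshot, TupleTableSlot *slot);
} StorageAmRoutineSketch;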

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


[HACKERS] 10.0: Logical replication doesn't execute BEFORE UPDATE OF trigger

2017-10-09 Thread Aleksander Alekseev
Hi hackers,

I've found something that looks like a bug.

Steps to reproduce
--

There are 3 instances of PostgreSQL 10.0 - inst1, inst2 and inst3. There
is a table `test` on every instance:

```
CREATE TABLE test(k TEXT PRIMARY KEY, v TEXT);
```

Both inst1 and inst2 have `allpub` publication:

```
CREATE PUBLICATION allpub FOR ALL TABLES;
```

... and inst3 is subscribed for both publications:

```
CREATE SUBSCRIPTION allsub1
  CONNECTION 'host=10.128.0.16 user=eax dbname=eax'
  PUBLICATION allpub;

CREATE SUBSCRIPTION allsub2
  CONNECTION 'host=10.128.0.26 user=eax dbname=eax'
  PUBLICATION allpub;
```

So basically it's two masters, one replica configuration. To resolve
insert/update conflicts I've created the following triggers on inst3:

```
CREATE OR REPLACE FUNCTION test_before_insert()
RETURNS trigger AS $$
BEGIN

RAISE NOTICE 'test_before_insert trigger executed';

IF EXISTS (SELECT 1 FROM test where k = new.k) THEN
   RAISE NOTICE 'test_before_insert trigger - merging data';
   UPDATE test SET v = v || ';' || new.v WHERE k = new.k;
   RETURN NULL;
END IF;

RETURN NEW;

END
$$ LANGUAGE plpgsql;


CREATE OR REPLACE FUNCTION test_before_update()
RETURNS trigger AS $$
BEGIN

RAISE NOTICE 'test_before_update trigger executed';

IF EXISTS (SELECT 1 FROM test where k = new.k) THEN
   RAISE NOTICE 'test_before_update trigger - merging data';
   UPDATE test SET v = v || ';' || new.v WHERE k = new.k;
   DELETE FROM test where k = old.k;
   RETURN NULL;
END IF;

RETURN NEW;

END
$$ LANGUAGE plpgsql;

create trigger test_before_insert_trigger
before insert on test
for each row execute procedure test_before_insert();

create trigger test_before_update_trigger
before update of k on test
for each row execute procedure test_before_update();

ALTER TABLE test ENABLE REPLICA TRIGGER test_before_insert_trigger;
ALTER TABLE test ENABLE REPLICA TRIGGER test_before_update_trigger;
```

The INSERT trigger works just as expected, however the UPDATE trigger
doesn't. On inst1:

```
insert into test values ('k1', 'v1');
```

In inst2:

```
insert into test values ('k4', 'v4');
update test set k = 'k1' where k = 'k4';
```

Now on inst3:

```
select * from test;
```

Expected result
---

Rows are merged to:

```
 k  |   v   
----+-------
 k1 | v1;v4
```

This is what would happen if all insert/update queries had been
executed on one instance.

Actual result
-

Replication fails, log contains:

```
[3227] ERROR:  duplicate key value violates unique constraint "test_pkey"
[3227] DETAIL:  Key (k)=(k1) already exists.
[3176] LOG:  worker process: logical replication worker for subscription 16402 
(PID 3227) exited with exit code 1
```

What do you think?

-- 
Best regards,
Aleksander Alekseev


signature.asc
Description: PGP signature


Re: [HACKERS] Discussion on missing optimizations

2017-10-09 Thread Ants Aasma
On Sun, Oct 8, 2017 at 7:24 PM, Tom Lane  wrote:
> Petr Jelinek  writes:
>> Okay, that makes sense, thanks for explanation. Your patch is the way to
>> go then.
>
> Hearing no further comment, pushed.  Thanks for reviewing it.

The tautological col = col comparison is occasionally used as a planner
"hint" to correct for row count overestimates. It's not a great solution,
but PostgreSQL doesn't really have a better way to guide the planner. Those
queries will now have to do something else, like col = col + 0, which still
works.

Regards,
Ants Aasma




Re: [HACKERS] Latches API on windows

2017-10-09 Thread Andres Freund


On October 9, 2017 6:56:10 AM PDT, Tom Lane  wrote:
>Craig Ringer  writes:
>> On 9 October 2017 at 21:26, Abbas Butt 
>wrote:
>>> In my case this is not true, I am calling InitSharedLatch in
>_PG_init
>>> which gets called at CREATE EXTENSION time.
>>> My question : Is there a way to get the latches API work on windows
>>> the way it is working on Linux?
>
>> I suspect you'd need to do it by having your extension load via
>> shared_preload_libraries, registering its latch in shmem_startup_hook
>
>Yeah.  That would also let you request your shared memory area
>honestly,
>instead of relying on there being some slop in the initial allocation.

Might be dsm-style memory.


I think the right approach here, regardless of the source of the memory, is
to not actually create a new latch, but instead to store a pointer to the
owning process's preexisting latch. Besides solving this issue, it also
avoids problems with various routines already waiting on the proc latch.
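
A rough sketch of that approach (struct and field names are invented here):
the waiting worker publishes a pointer to its own, already-initialized
process latch, and the other side just sets it:

typedef struct WorkerSyncState
{
	Latch	   *consumer_latch; /* set by the waiting worker at startup */
} WorkerSyncState;

/* In the waiting background worker, once at startup: */
sync_state->consumer_latch = MyLatch;	/* i.e. &MyProc->procLatch */

/* Waiting side: */
(void) WaitLatch(MyLatch,
				 WL_LATCH_SET | WL_POSTMASTER_DEATH,
				 -1L,
				 PG_WAIT_EXTENSION);
ResetLatch(MyLatch);

/* Signalling side (any other backend attached to the same shared memory): */
SetLatch(sync_state->consumer_latch);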

Andres

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.




Re: [HACKERS] Latches API on windows

2017-10-09 Thread Abbas Butt
Thanks for the suggestion.


On Mon, Oct 9, 2017 at 6:56 PM, Tom Lane  wrote:

> Craig Ringer  writes:
> > On 9 October 2017 at 21:26, Abbas Butt 
> wrote:
> >> In my case this is not true, I am calling InitSharedLatch in _PG_init
> >> which gets called at CREATE EXTENSION time.
> >> My question : Is there a way to get the latches API work on windows
> >> the way it is working on Linux?
>
> > I suspect you'd need to do it by having your extension load via
> > shared_preload_libraries, registering its latch in shmem_startup_hook
>
> Yeah.  That would also let you request your shared memory area honestly,
> instead of relying on there being some slop in the initial allocation.
>
> regards, tom lane
>



-- 
-- 
*Abbas*
Architect

Ph: 92.334.5100153
Skype ID: gabbasb
www.enterprisedb.com


*Follow us on Twitter*
@EnterpriseDB

Visit EnterpriseDB for tutorials, webinars, whitepapers
 and more



Re: [HACKERS] Latches API on windows

2017-10-09 Thread Tom Lane
Craig Ringer  writes:
> On 9 October 2017 at 21:26, Abbas Butt  wrote:
>> In my case this is not true, I am calling InitSharedLatch in _PG_init
>> which gets called at CREATE EXTENSION time.
>> My question : Is there a way to get the latches API work on windows
>> the way it is working on Linux?

> I suspect you'd need to do it by having your extension load via
> shared_preload_libraries, registering its latch in shmem_startup_hook

Yeah.  That would also let you request your shared memory area honestly,
instead of relying on there being some slop in the initial allocation.

regards, tom lane




Re: [HACKERS] Latches API on windows

2017-10-09 Thread Craig Ringer
On 9 October 2017 at 21:26, Abbas Butt  wrote:
> Hi,
> I am working on a contrib module that uses RegisterDynamicBackgroundWorker
> API
> to create a couple of worker processes. For synchronization between the
> background worker processes I am using InitSharedLatch, SetLatch, WaitLatch
> APIs.
> One of the processes is supposed to wait for the latch, the other is
> supposed to set it.
> The system works perfectly fine as long as it's run on Linux; however, when
> tried on Windows, it fails giving the error:
> ResetEvent failed: error code 6
> Error code 6 means invalid handle. Debugging reveals that the handle
> contains
> a valid value, however it seems that the handle is not accessible (was not
> created)
> in the process that is calling ResetEvent.
>
> Debugging the issue lead me to the following comment on top of
> InitSharedLatch:
>
>  * InitSharedLatch needs to be called in postmaster before forking child
>  * processes, usually right after allocating the shared memory block
>  * containing the latch with ShmemInitStruct. (The Unix implementation
>  * doesn't actually require that, but the Windows one does.)
>
> In my case this is not true, I am calling InitSharedLatch in _PG_init
> which gets called at CREATE EXTENSION time.
> My question : Is there a way to get the latches API work on windows
> the way it is working on Linux?

I suspect you'd need to do it by having your extension load via
shared_preload_libraries, registering its latch in shmem_startup_hook.

But ... that's an off-the-cuff guess.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




[HACKERS] oversight in EphemeralNamedRelation support

2017-10-09 Thread Julien Rouhaud
Hi,

Hugo Mercier (in Cc) reported to me an error in a query that has been failing since pg10.

Simple test case to reproduce:

CREATE TABLE public.test (id integer);
WITH test AS (select 42) INSERT INTO public.test SELECT * FROM test;

which will fail with "relation "test" cannot be the target of a
modifying statement".

IIUC, that's an oversight in 18ce3a4ab22, where setTargetTable()
doesn't exclude qualified relations when searching for special
relations.

PFA a simple patch to fix this issue, with an updated regression test.

Regards.
diff --git a/src/backend/parser/parse_clause.c b/src/backend/parser/parse_clause.c
index 9ff80b8b40..48a065e41c 100644
--- a/src/backend/parser/parse_clause.c
+++ b/src/backend/parser/parse_clause.c
@@ -181,11 +181,17 @@ int
 setTargetTable(ParseState *pstate, RangeVar *relation,
   bool inh, bool alsoSource, AclMode requiredPerms)
 {
-   RangeTblEntry *rte;
+   RangeTblEntry *rte = NULL;
int rtindex;
 
+   /*
+* if it is an unqualified name, it might be a CTE or tuplestore
+* reference
+*/
+   if (!relation->schemaname)
+   rte = getRTEForSpecialRelationTypes(pstate, relation);
+
/* So far special relations are immutable; so they cannot be targets. */
-   rte = getRTEForSpecialRelationTypes(pstate, relation);
if (rte != NULL)
ereport(ERROR,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
diff --git a/src/test/regress/expected/with.out b/src/test/regress/expected/with.out
index c32a490580..53ea9991b2 100644
--- a/src/test/regress/expected/with.out
+++ b/src/test/regress/expected/with.out
@@ -2275,3 +2275,7 @@ with ordinality as (select 1 as x) select * from ordinality;
 -- check sane response to attempt to modify CTE relation
 WITH d AS (SELECT 42) INSERT INTO d VALUES (1);
 ERROR:  relation "d" cannot be the target of a modifying statement
+-- check qualified relation name doesn't conflict with CTE name
+CREATE TABLE public.self (id integer);
+WITH self AS (SELECT 42) INSERT INTO public.self SELECT * from self;
+DROP TABLE public.self;
diff --git a/src/test/regress/sql/with.sql b/src/test/regress/sql/with.sql
index 8ae5184d0f..17f32c3c87 100644
--- a/src/test/regress/sql/with.sql
+++ b/src/test/regress/sql/with.sql
@@ -1031,3 +1031,8 @@ with ordinality as (select 1 as x) select * from ordinality;
 
 -- check sane response to attempt to modify CTE relation
 WITH d AS (SELECT 42) INSERT INTO d VALUES (1);
+
+-- check qualified relation name doesn't conflict with CTE name
+CREATE TABLE public.self (id integer);
+WITH self AS (SELECT 42) INSERT INTO public.self SELECT * from self;
+DROP TABLE public.self;



Re: [HACKERS] GSoC 2017 : Patch for predicate locking in Gist index

2017-10-09 Thread Alexander Korotkov
On Thu, Oct 5, 2017 at 9:48 PM, Shubham Barai 
wrote:

> On 3 October 2017 at 00:32, Alexander Korotkov 
> wrote:
>
>> On Mon, Oct 2, 2017 at 9:11 PM, Andrew Borodin 
>> wrote:
>>
>>> On Mon, Oct 2, 2017 at 8:00 PM, Alexander Korotkov
>>>  wrote:
>>> > What happen if exactly this "continue" fires?
>>> >
>>> >> if (GistFollowRight(stack->page))
>>> >> {
>>> >> if (!xlocked)
>>> >> {
>>> >> LockBuffer(stack->buffer, GIST_UNLOCK);
>>> >> LockBuffer(stack->buffer, GIST_EXCLUSIVE);
>>> >> xlocked = true;
>>> >> /* someone might've completed the split when we unlocked */
>>> >> if (!GistFollowRight(stack->page))
>>> >> continue;
>>> >
>>> >
>>> > In this case we might get xlocked == true without calling
>>> > CheckForSerializableConflictIn().
>>> Indeed! I've overlooked it. I'm remembering this issue, we were
>>> considering not fixing splits if in Serializable isolation, but
>>> dropped the idea.
>>>
>>
>> Yeah, current insert algorithm assumes that split must be fixed before we
>> can correctly traverse the tree downwards.
>>
>>
>>> CheckForSerializableConflictIn() must be after every exclusive lock.
>>>
>>
>> I'm not sure, that fixing split is the case to necessary call
>> CheckForSerializableConflictIn().  This lock on leaf page is not taken
>> to do modification of the page.  It's just taken to ensure that nobody else
>> is fixing this split the same this.  After fixing the split, it might
>> appear that insert would go to another page.
>>
>> > I think it would be rather safe and easy for understanding to more
>>> > CheckForSerializableConflictIn() directly before gistinserttuple().
>>> The difference is that after lock we have conditions to change page,
>>> and before gistinserttuple() we have actual intent to change page.
>>>
>>> From the point of future development first version is better (if some
>>> new calls occasionally spawn in), but it has 3 calls while your
>>> proposal have 2 calls.
>>> It seems to me that CheckForSerializableConflictIn() before
>>> gistinserttuple() is better as for now.
>>>
>>
>> Agree.
>>
>
> I have updated the location of  CheckForSerializableConflictIn()  and
> changed the status of the patch to "needs review".
>

Now, ISTM that predicate locks and conflict checks are placed right for now.
However, it would be good to add a couple of comments to gistdoinsert()
stating why we call CheckForSerializableConflictIn() in these particular
places.

I also took a look at the isolation tests.  You made two separate test
specs: one to verify that serialization failures do fire, and another to
check that there are no false positives.
I wonder if we could merge these two test specs into one, but use a wider
variety of statements with different keys for both inserts and selects.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


[HACKERS] Latches API on windows

2017-10-09 Thread Abbas Butt
Hi,
I am working on a contrib module that uses RegisterDynamicBackgroundWorker
API
to create a couple of worker processes. For synchronization between the
background worker processes I am using InitSharedLatch, SetLatch, WaitLatch
APIs.
One of the processes is supposed to wait for the latch, the other is
supposed to set it.
The system works perfectly fine as long as it's run on Linux; however, when
tried on Windows, it fails giving the error:
ResetEvent failed: error code 6
Error code 6 means an invalid handle. Debugging reveals that the handle
contains a valid value; however, it seems that the handle is not accessible
(was not created) in the process that is calling ResetEvent.

Debugging the issue lead me to the following comment on top of
InitSharedLatch:

 * InitSharedLatch needs to be called in postmaster before forking child
 * processes, usually right after allocating the shared memory block
 * containing the latch with ShmemInitStruct. (The Unix implementation
 * doesn't actually require that, but the Windows one does.)

In my case this is not true: I am calling InitSharedLatch in _PG_init,
which gets called at CREATE EXTENSION time.
My question: Is there a way to get the latches API to work on Windows the
way it works on Linux?

Best Regards

-- 
-- 
*Abbas*
Architect

Ph: 92.334.5100153
Skype ID: gabbasb
www.enterprisedb.com


*Follow us on Twitter*
@EnterpriseDB

Visit EnterpriseDB for tutorials, webinars, whitepapers
 and more



Re: [HACKERS] On markers of changed data

2017-10-09 Thread Michael Paquier
On Mon, Oct 9, 2017 at 2:23 PM, Andrey Borodin  wrote:
>> I haven't gone and audited it myself, but I would certainly expect you
>> to be able to use the LSN for everything which is WAL'd.  If you have
>> cases where that's not the case, it'd be useful to see them.
>
> Thanks, Stephen, this actually pointed what to look for
> VM is WAL-logged [0]
> FSM is not [1]

If you are willing to go down this road, here are my takes on the matter:
- Any LSN map should be in a different file than the FSM and VM. The VM
uses 2 bits per block now, and the FSM 8 bits. The FSM is meant to remain
small, so trying to plug more data into it is not welcome. The VM bits are
also dedicated to purely visibility matters, so we may not want to use the
VM for that either.
- Those should not be WAL-logged, as we would end up generating WAL
records for things that themselves track the effects of other records.
-- 
Michael




Re: [HACKERS] Predicate Locks for writes?

2017-10-09 Thread Alexander Korotkov
On Sat, Oct 7, 2017 at 2:26 PM, Simon Riggs  wrote:

> SERIALIZABLE looks for chains of rw cases.
>
> When we perform UPDATEs and DELETEs we search for rows and then modify
> them. The current implementation views that as a read followed by a
> write because we issue PredicateLockTuple() during the index fetch.
>
> Is it correct that a statement that only changes data will add
> predicate locks for the rows that it modifies?
>

I'm not very aware of how this SI code exactly works.  But it seems that
once we update a row, the SIReadLock on that tuple is released, because
we're already holding a stronger lock.  I made a simple experiment.

# begin;
BEGIN
Time: 0,456 ms
smagen@postgres=*# select * from test where id = 1;
 id │ val
┼─
  1 │ xyz
(1 row)

*# select locktype, database, relation, page, tuple, mode, granted from
pg_locks where pid = pg_backend_pid();
  locktype  │ database │ relation │ page │ tuple │  mode   │ granted
┼──┼──┼──┼───┼─┼─
 relation   │12558 │11577 │ NULL │  NULL │ AccessShareLock │ t
 relation   │12558 │65545 │ NULL │  NULL │ AccessShareLock │ t
 relation   │12558 │65538 │ NULL │  NULL │ AccessShareLock │ t
 virtualxid │ NULL │ NULL │ NULL │  NULL │ ExclusiveLock   │ t
 page   │12558 │65545 │1 │  NULL │ SIReadLock  │ t
 tuple  │12558 │65538 │0 │ 3 │ SIReadLock  │ t
(6 rows)

*# update test set val = 'xyz' where id = 1;

*# select locktype, database, relation, page, tuple, mode, granted from
pg_locks where pid = pg_backend_pid();
   locktype│ database │ relation │ page │ tuple │   mode   │
granted
───┼──┼──┼──┼───┼──┼─
 relation  │12558 │11577 │ NULL │  NULL │ AccessShareLock  │ t
 relation  │12558 │65545 │ NULL │  NULL │ AccessShareLock  │ t
 relation  │12558 │65545 │ NULL │  NULL │ RowExclusiveLock │ t
 relation  │12558 │65538 │ NULL │  NULL │ AccessShareLock  │ t
 relation  │12558 │65538 │ NULL │  NULL │ RowExclusiveLock │ t
 virtualxid│ NULL │ NULL │ NULL │  NULL │ ExclusiveLock│ t
 transactionid │ NULL │ NULL │ NULL │  NULL │ ExclusiveLock│ t
 page  │12558 │65545 │1 │  NULL │ SIReadLock   │ t
(8 rows)

Once we update the same tuple that we read before, the SIReadLock on that
tuple disappears.

PredicateLockTuple() specifically avoids adding an SIRead lock if the
> tuple already has a write lock on it, so surely it must also be
> correct to skip the SIRead lock if we are just about to update the
> row?
>
> I am suggesting that we ignore PredicateLockTuple() for cases where we
> are about to update or delete the found tuple.
>

If my thoughts above are correct, then it would be a reasonable
optimization.  We would skip the cycle of taking an SIReadLock on the tuple
and then releasing it, without any change in behavior.


> ISTM that a Before Row Trigger would need to add an SIRead lock since
> that is clearly a read.
>

Yes, because depending on the result of the BEFORE ROW trigger the update
might not actually happen, so that SIReadLock on the tuple is necessary.
However, ISTM that we could take the SIReadLock on the tuple after the
BEFORE ROW trigger, and only in the case where the trigger has cancelled
the update.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] [POC] hash partitioning

2017-10-09 Thread Ashutosh Bapat
On Mon, Oct 9, 2017 at 4:44 PM, amul sul  wrote:


> 0002 few changes in partition-wise join code to support
> hash-partitioned table as well & regression tests.

+    switch (key->strategy)
+    {
+        case PARTITION_STRATEGY_HASH:
+            /*
+             * Indexes array is same as the greatest modulus.
+             * See partition_bounds_equal() for more explanation.
+             */
+            num_indexes = DatumGetInt32(src->datums[ndatums - 1][0]);
+            break;
This logic is duplicated in multiple places.  I think it's time we consolidated
it into a function/macro and called that from the places where we have to
calculate the number of indexes based on the information in the partition
descriptor.  Refactoring the existing code might be a separate patch, with the
hash partitioning case then added by the hash partitioning patch.
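
Something like the sketch below, say (just an illustration; the function name
is made up, and it assumes the bound info itself records the partition
strategy):

static int
get_partition_bound_num_indexes(PartitionBoundInfo bound)
{
    switch (bound->strategy)
    {
        case PARTITION_STRATEGY_HASH:
            /* The indexes array is as long as the greatest modulus. */
            return DatumGetInt32(bound->datums[bound->ndatums - 1][0]);

        case PARTITION_STRATEGY_LIST:
            /* One entry per accepted list value. */
            return bound->ndatums;

        case PARTITION_STRATEGY_RANGE:
            /* One entry per gap between consecutive bounds, plus one. */
            return bound->ndatums + 1;
    }

    elog(ERROR, "unexpected partition strategy");
    return 0;                   /* keep compiler quiet */
}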

+    int            dim = hash_part ? 2 : partnatts;
How about calling the variable natts_per_datum, or just natts?

+                             hash_part ? true : key->parttypbyval[j],
+                             key->parttyplen[j]);
parttyplen is the length of the partition key attribute, whereas what you want
here is the length of the type of the modulus and remainder.  Is that correct?
We probably need some special handling wherever parttyplen and parttypbyval
are used, e.g. in the call to partition_bounds_equal() from
build_joinrel_partition_info().

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] Partition-wise aggregation/grouping

2017-10-09 Thread Jeevan Chalke
On Thu, Sep 28, 2017 at 3:12 PM, Ashutosh Bapat <
ashutosh.ba...@enterprisedb.com> wrote:

>
> Here are comments on 0004 from last patch set. But most the comments
> still apply.
>

Thank you, Ashutosh for reviewing.


>
> Patch 0001 adds functions create_hash_agg_path() and
> create_sort_agg_path().
> Patch 0004 adds a new argument to those functions for conditions in HAVING
> clause. We should move those changes to 0001 and pass parse->havingQual to
> these functions in 0001 itself. That will keep all changes to those
> functions
> together and also make 0003 small.
>

Done.


> The prologue of try_partition_wise_grouping() mentions a restriction of
> partition keys being leading group by clauses. This restriction is not
> specified in the prologue of have_grouping_by_partkey(), which actually
> checks
> for this restriction. The requirement per prologue of that function is
> just to
> have partition keys in group clause. I think have_grouping_by_partkey() is
> correct, and we should change prologue of try_partition_wise_grouping() to
> be
> in sync with have_grouping_by_partkey().


Done.

The prologue explains why
> partition-wise aggregation/grouping would be efficient with this
> restriction,
> but it doesn't explain why partial aggregation/grouping per partition
> would be
> efficient. May be it should do that as well. Similar is the case with
> partial
> aggregation/grouping discussion in README.
>

I have tried updating it. Please check.
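(For instance, if the table is partitioned on column a, a GROUP BY list that
includes a can be aggregated completely within each partition, since no group
can span partitions; with a GROUP BY on some other column b, each partition
can still produce partial aggregates locally, which typically reduces the
number of rows flowing through the Append before the final aggregation step.)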


> +/* Do not handle grouping sets for now. */
> Is this a real restriction or just restriction for first cut of this
> feature?
> Can you please add more explanation? May be update README as well?
>

A grouping sets plan does not work with an inheritance subtree (see the notes
in create_groupingsets_plan()). Thus grouping sets are not handled here.


>
> +grouped_rel->part_scheme = input_rel->part_scheme;
> Is this true even for partial aggregates? I think not. Since group by
> clause
> does not contain partition keys, the rows from multiple partitions
> participate
> in one group and thus the partition keys of input relation do not apply to
> the
> grouped relation. In this case, it seems that the grouped rel will have
> part_rels but will not be partitioned.
>

I have removed this, as your analysis is correct: grouped_rel is not
partitioned.


> +/*
> + * If there is no path for the child relation, then we cannot add
> + * aggregate path too.
> + */
> +if (input_child_rel->pathlist == NIL)
> +return;
> When can this happen? I think, similar to partition-wise join it happens
> when
> there is a dummy parent relation. See [1]. If that's the case, you may
> want to
> do things similar to what partition-wise join is doing. If there's some
> other
> reason for doing this, returing from here half-way is actually waste of
> memory
> and planning time. Instead, we may want to loop over the part_rels to find
> if
> any of those have empty pathlist and return from there before doing any
> actual
> work.
>

This is a can't-happen scenario, so I have converted it to an Assert().
And yes, I am now marking the grouped_rel as a dummy rel when the input rel
is dummy.


> +extra.pathTarget = child_target;
> +extra.inputRows = input_child_rel->cheapest_startup_path->rows;
> +extra.havingQual = (Node *) adjust_appendrel_attrs(root,
> +   (Node *)
> query->havingQual,
> +   nappinfos,
> +   appinfos);
> These lines are updating some fields of "extra" structure in every loop.
> The
> structure is passed to create_child_grouping_paths() in the loop and to
> add_paths_to_append_rel() outside the loop. Thus add_paths_to_append_rel()
> only
> gets some member values for the last child. Is that right?


No. The patch does update those fields before calling add_paths_to_append_rel().

Should we split
> extra into two structures one to be used within the loop and one outside?
> Or
> may be send the members being updated within the loop separately?
>

I don't see any point in splitting it. We need almost all the fields at child
path creation as well as at the finalization step. The patch is basically
just re-using the struct variable.


> +/*
> + * Forcibly apply scan/join target to all the Paths for the
> scan/join
> + * rel.
> + *
> [ lines clipped ]
> +if (subpath == input_child_rel->cheapest_total_path)
> +input_child_rel->cheapest_total_path = path;
> +}
> +}
> This code seems to be copied from grouping_planner() almost verbatim. Is
> there
> a way we can refactor it into a function and use it in both the places.
>

Done.
Moved this into
0003-Refactor-code-applying-scanjoin-target-to-paths-into.patch


> have_grouping_by_partkey() may use match_expr_to_partition_keys() to find
> 

Re: [HACKERS] [POC] hash partitioning

2017-10-09 Thread amul sul
On Sat, Oct 7, 2017 at 5:22 PM, amul sul  wrote:
> On Fri, Oct 6, 2017 at 5:35 PM, Jesper Pedersen
>  wrote:
>> Hi Amul,
>>
>> Could you rebase on latest master ?
>>
>
> Sure will post that soon, but before that, I need to test hash partitioning
> with recent partition-wise join commit (f49842d1ee), thanks.
>

Updated patch attached.

0001 is the rebased of the previous patch, no new change.
0002 few changes in partition-wise join code to support
hash-partitioned table as well & regression tests.

Thanks & Regards,
Amul


0001-hash-partitioning_another_design-v23.patch
Description: Binary data


0002-Enable-partition-wise-join-support-v1.patch
Description: Binary data



Re: [HACKERS] PoC: full merge join on comparison clause

2017-10-09 Thread Ashutosh Bapat
Hi Alexander,
Commit c7a9fa399 has added another test on mergeopfamilies. I think
your patch will need to take care of that test.

On Wed, Oct 4, 2017 at 6:38 PM, Alexander Kuzmenkov
 wrote:
> As discussed earlier, I changed the way we work with mergeopfamilies. I use
> the "is_equality" flag to indicate whether the clause is an equality one,
> and fill mergeopfamilies for both equality and inequality operators.
> The updated patch is attached (rebased to 20b6552242).
>
>
> --
> Alexander Kuzmenkov
> Postgres Professional: http://www.postgrespro.com
> The Russian Postgres Company
>



-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] separate serial_schedule useful?

2017-10-09 Thread Ashutosh Bapat
On Sat, Oct 7, 2017 at 10:19 PM, Tom Lane  wrote:
> I wrote:
>> Robert Haas  writes:

Sorry, my bad. I wasn't aware of this rule. I should have looked at
the beginning of the file for any rules.

>>> There's no reason why pg_regress couldn't have a
>>> --bail-if-group-size-exceeds=N argument, or why we couldn't have a
>>> separate Perl script to validate the schedule file as part of the
>>> build process.
>
>> I'd go for the former approach; seems like less new code and fewer cycles
>> used to enforce the rule.
>
> Concretely, how about the attached?  (Obviously we'd have to fix
> parallel_schedule before committing this.)
>

Thanks, this will help. Maybe we should set the default to 20 instead of unlimited.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] Proposal: Local indexes for partitioned table

2017-10-09 Thread Ashutosh Bapat
On Sat, Oct 7, 2017 at 7:04 PM, Robert Haas  wrote:
> On Fri, Oct 6, 2017 at 12:37 PM, Alvaro Herrera  
> wrote:
>> 2. create one index for each existing partition.  These would be
>>identical to what would happen if you created the index directly on
>>each partition, except that there is an additional dependency to the
>>parent's abstract index.
>
> One thing I'm a bit worried about is how to name these subordinate
> indexes.  They have to have names because that's how pg_class works,
> and those names can't all be the same, again because that's how
> pg_class works.  There's no problem right away when you first create
> the partitioned index, because you can just pick names out of a hat
> using whatever name-generating algorithm seems best.  However, when
> you dump-and-restore (including but not limited to the pg_upgrade
> case) you've got to preserve those names.  If you just generate a new
> name that may or may not be the same as the old one, then it may
> collide with a user-specified name that only occurs later in the dump.
> Also, you'll have trouble if the user has applied a COMMENT or a
> SECURITY LABEL to the index because that command works by name, or if
> the user has a reference to the index name inside a function or
> whatever.
>
> These are pretty annoying corner-case bugs because they're not likely
> to come up very often.  Most people won't notice or care if the index
> name changes.  But I don't think it's acceptable to just ignore the
> problem.  An idea I had was to treat the abstract index - to use your
> term - sort of the way we treat an extension.  Normally, when you
> create an index on a partitioned table, it cascades, but for dump and
> restore purpose, we tag on some syntax that says "well, don't actually
> create the subordinate indexes, i'll tell you about those later".
> Then for each subordinate index we issue a separate CREATE INDEX
> command followed by ALTER INDEX abstract_index ATTACH PARTITION
> concrete_index or something of that sort.  That means you can't
> absolutely count on the parent index to have all of the children it's
> supposed to have but maybe that's OK.

+1.

How about CREATE INDEX ... PARTITION OF ... FOR TABLE ... to create
the index and attach it in one step?

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] Parallel Append implementation

2017-10-09 Thread Amit Kapila
On Fri, Oct 6, 2017 at 12:03 PM, Amit Khandekar  wrote:
> On 6 October 2017 at 08:49, Amit Kapila  wrote:
>>
>> Okay, but why not cheapest partial path?
>
> I gave some thought on this point. Overall I feel it does not matter
> which partial path it should pick up. Right now the partial paths are
> not ordered. But for non-partial paths sake, we are just choosing the
> very last path. So in case of mixed paths, leader will get a partial
> path, but that partial path would not be the cheapest path. But if we
> also order the partial paths, the same logic would then pick up
> cheapest partial path. The question is, should we also order the
> partial paths for the leader ?
>
> The only scenario I see where leader choosing cheapest partial path
> *might* show some benefit, is if there are some partial paths that
> need to do some startup work using only one worker. I think currently,
> parallel hash join is one case where it builds the hash table, but I
> guess here also, we support parallel hash build, but not sure about
> the status.
>

You also need to consider how merge join currently works in parallel
(each worker needs to perform the whole of the work of the right side).  I
think there could be more scenarios where the startup cost is high
and each parallel worker needs to do that work independently.

 For such plan, if leader starts it, it would be slow, and
> no other worker would be able to help it, so its actual startup cost
> would be drastically increased. (Another path is parallel bitmap heap
> scan where the leader has to do something and the other workers wait.
> But here, I think it's not much work for the leader to do). So
> overall, to handle such cases, it's better for leader to choose a
> cheapest path, or may be, a path with cheapest startup cost. We can
> also consider sorting partial paths with decreasing startup cost.
>

Yeah, that sounds reasonable.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] parallelize queries containing initplans

2017-10-09 Thread Amit Kapila
On Sat, Oct 7, 2017 at 7:22 PM, Robert Haas  wrote:
> On Fri, Oct 6, 2017 at 7:08 AM, Amit Kapila  wrote:
>> I have fixed the other review comment related to using safe_param_list
>> in the attached patch.  I think I have fixed all comments given by
>> you, but let me know if I have missed anything or you have any other
>> comment.
>
> -Param   *param = (Param *) node;
> +if (list_member_int(context->safe_param_ids, ((Param *)
> node)->paramid))
> +return false;
>
> -if (param->paramkind != PARAM_EXEC ||
> -!list_member_int(context->safe_param_ids, param->paramid))
> -{
> -if (max_parallel_hazard_test(PROPARALLEL_RESTRICTED, context))
> -return true;
> -}
> -return false;/* nothing to recurse to */
> +if (max_parallel_hazard_test(PROPARALLEL_RESTRICTED, context))
> +return true;
>
> I think your revised logic is wrong here because it drops the
> paramkind test, and I don't see any justification for that.
>

I had dropped the check thinking that PARAM_EXTERN params will always be
safe, and that for PARAM_EXEC params we now have a list of safe params, so
the check would not be needed; thinking about it again, it seems I might
not be right.  OTOH, I am not sure the check as written is correct for all
cases either; however, for the purpose of this patch, I think we can leave
it as it is and discuss separately what the check should be (as suggested
by you).  I have reverted this change in the attached patch.

>
> But I'm also wondering if we're missing a trick here, because isn't a
> PARAM_EXTERN parameter always safe, given SerializeParamList?
>

Right.

>  If so,
> this || ought to be &&, but if that adjustment is needed, it seems
> like a separate patch.
>

How will it work if, during planning, we encounter a PARAM_EXTERN param in
any list?  Won't it return true in that case, because PARAM_EXTERN params
will not be present in safe_param_ids, so this check will be true and then
max_parallel_hazard_test() will also return true?

How about always returning false for PARAM_EXTERN?
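
In code form, what I have in mind is roughly the following (only a sketch of
the Param branch in max_parallel_hazard_walker(), not necessarily the final
shape):

    else if (IsA(node, Param))
    {
        Param      *param = (Param *) node;

        /* PARAM_EXTERN params can always be serialized, so treat as safe. */
        if (param->paramkind == PARAM_EXTERN)
            return false;

        if (param->paramkind != PARAM_EXEC ||
            !list_member_int(context->safe_param_ids, param->paramid))
        {
            if (max_parallel_hazard_test(PROPARALLEL_RESTRICTED, context))
                return true;
        }
        return false;           /* nothing to recurse to */
    }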


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


pq_pushdown_initplan_v12.patch
Description: Binary data



Re: [HACKERS] Slow synchronous logical replication

2017-10-09 Thread Konstantin Knizhnik

Thank you for the explanations.

On 08.10.2017 16:00, Craig Ringer wrote:

I think it'd be helpful if you provided reproduction instructions,
test programs, etc, making it very clear when things are / aren't
related to your changes.


It will not be easy to provide a reproducing scenario, because it involves
many components (postgres_fdw, pg_pathman, pg_shardman, logical
replication, ...) and requires a multi-node installation.
But let me try to explain what is going on:
We have implemented sharding, i.e. splitting data between several remote
tables using pg_pathman and postgres_fdw.
It means that an insert or update of the parent table causes inserts or
updates of some derived partitions, which are forwarded by postgres_fdw to
the corresponding node.
The number of shards is significantly larger than the number of nodes,
e.g. for 5 nodes we have 50 shards, which means that each node hosts 10
shards.
To provide fault tolerance, each shard is replicated using logical
replication to one or more other nodes. Right now we have considered only
redundancy level 1: each shard has exactly one replica.

So from each node we establish 10 logical replication channels.

We want commit to wait until the data is actually stored on all replicas,
so we are using synchronous replication: we set the synchronous_commit
option to "on" and include all 10 subscriptions in the
synchronous_standby_names list.


In this setup, commit latency is very large (about 100 msec, and most of
the time is actually spent in commit) and performance is very poor:
pgbench shows about 300 TPS at the optimal number of clients (about 10;
for larger numbers performance is almost the same). With the same setup
but without logical replication we get about 6000 TPS.


I have checked the syncrepl.c file, particularly the SyncRepGetSyncRecPtr
function. Each WAL sender independently calculates the minimal LSN among
all synchronous replicas and wakes up the backends waiting for this LSN.
It means that a transaction updating data in only one shard will actually
wait for confirmation from the replication channels of all shards.
If some shard is updated more rarely than the others, or is not updated at
all (for example because the communication channel to that node is broken),
then all backends will get stuck.
Also, all backends compete for the single SyncRepLock, which can be another
contention point.


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: [HACKERS] [COMMITTERS] pgsql: Fix freezing of a dead HOT-updated tuple

2017-10-09 Thread Wood, Dan
I’m unclear on what is being repro’d in 9.6.  Are you getting the duplicate 
rows problem or just the reindex problem?  Are you testing with asserts 
enabled (I’m not)?

If you are getting the duplicate rows, consider the code block in heapam.c that 
starts with the comment “replace multi by update xid”.
When I repro this I find that MultiXactIdGetUpdateXid() returns 0.  There is an 
updater in the multixact array; however, the status is 
MultiXactStatusForNoKeyUpdate and not MultiXactStatusNoKeyUpdate.  I assume 
this is a preliminary status set before the following row in the HOT chain has 
its multixact set to NoKeyUpdate.

Since 0 is returned, it does precede cutoff_xid, and TransactionIdDidCommit(0) 
will return false.  This ends up aborting the multixact on the row even though 
the real xid is committed.  That sets XMAX to 0, and the row becomes visible as 
one of the duplicates.  Interestingly, the real xid of the updater is 122944 
and the cutoff_xid is 122945.

I’m still debugging, but I started late, so I’m passing this incomplete info 
along for now.

On 10/7/17, 4:25 PM, "Alvaro Herrera"  wrote:

Peter Geoghegan wrote:
> On Sat, Oct 7, 2017 at 1:31 AM, Alvaro Herrera  
wrote:
> >> As you must have seen, Alvaro said he has a variant of Dan's original
> >> script that demonstrates that a problem remains, at least on 9.6+,
> >> even with today's fix. I think it's the stress-test that plays with
> >> fillfactor, many clients, etc [1].
> >
> > I just execute setup.sql once and then run this shell command,
> >
> > while :; do
> > psql -e -P pager=off -f ./repro.sql
> > for i in `seq 1 5`; do
> > psql -P pager=off -e --no-psqlrc -f ./lock.sql &
> > done
> > wait && psql -P pager=off -e --no-psqlrc -f ./reindex.sql
> > psql -P pager=off -e --no-psqlrc -f ./report.sql
> > echo "done"
> > done
> 
> I cannot reproduce the problem on my personal machine using this
> script/stress-test. I tried to do so on the master branch git tip.
> This reinforces the theory that there is some timing sensitivity,
> because the remaining race condition is very narrow.

Hmm, I think I added a random sleep (max. 100ms) right after the
HeapTupleSatisfiesVacuum call in vacuumlazy.c (lazy_scan_heap), and that
makes the race easier to hit.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-10-09 Thread Ashutosh Bapat
On Sat, Oct 7, 2017 at 1:04 AM, Robert Haas  wrote:
>
> Committed.  I hope that makes things less red rather than more,
> because I'm going to be AFK for a few hours anyway.
>

Here's the last patch, dealing with the dummy relations, rebased. With
this fix, every join order of a partitioned join can be considered
partitioned (this wasn't the case earlier when a dummy relation was
involved). So we can allocate the child-join RelOptInfo array in
build_joinrel_partition_info() instead of waiting for an appropriate
pair to arrive in try_partition_wise_join().
-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
From 4bf8ca38719aee730ed57a7f14522384b1ced7b0 Mon Sep 17 00:00:00 2001
From: Ashutosh Bapat 
Date: Tue, 29 Aug 2017 12:37:14 +0530
Subject: [PATCH] Support partition-wise join for dummy partitioned relation.

Current partition-wise join implementation treats dummy relations as
unpartitioned since the child relations may not be marked dummy and may not
have their pathlists set (See set_rel_size() and set_rel_pathlist()). Since
children do not have paths set, they can not be used to form a child-join. This
patch marks the children of dummy partitioned relations as dummy, thus allowing
partition-wise join when one of the joining relations is dummy.

If the dummy partitioned relation is a base relation, its children are base
relations as well and we will be marking base relation dummy during join
planning. But this shouldn't be a problem since populate_joinrel_with_paths()
already does that to an inner relation of left join.

Ashutosh Bapat.
---
 src/backend/optimizer/path/allpaths.c |7 +--
 src/backend/optimizer/path/joinrels.c |   24 
 src/backend/optimizer/util/relnode.c  |5 +
 3 files changed, 18 insertions(+), 18 deletions(-)

diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 5535b63..b46fb5b 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -3261,12 +3261,7 @@ generate_partition_wise_join_paths(PlannerInfo *root, RelOptInfo *rel)
 	if (IS_DUMMY_REL(rel))
 		return;
 
-	/*
-	 * Nothing to do if the relation is not partitioned. An outer join
-	 * relation which had empty inner relation in every pair will have rest of
-	 * the partitioning properties set except the child-join RelOptInfos. See
-	 * try_partition_wise_join() for more explanation.
-	 */
+	/* Nothing to do if the relation is not partitioned. */
 	if (rel->nparts <= 0 || rel->part_rels == NULL)
 		return;
 
diff --git a/src/backend/optimizer/path/joinrels.c b/src/backend/optimizer/path/joinrels.c
index 2b868c5..1578dea 100644
--- a/src/backend/optimizer/path/joinrels.c
+++ b/src/backend/optimizer/path/joinrels.c
@@ -1321,14 +1321,19 @@ try_partition_wise_join(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 
 	/*
 	 * set_rel_pathlist() may not create paths in children of an empty
-	 * partitioned table and so we can not add paths to child-joins. So, deem
-	 * such a join as unpartitioned. When a partitioned relation is deemed
-	 * empty because all its children are empty, dummy path will be set in
-	 * each of the children.  In such a case we could still consider the join
-	 * as partitioned, but it might not help much.
+	 * partitioned table. Mark such children as dummy so that we can add paths
+	 * to child-joins.
 	 */
-	if (IS_DUMMY_REL(rel1) || IS_DUMMY_REL(rel2))
-		return;
+	if (IS_DUMMY_REL(rel1))
+	{
+		for (cnt_parts = 0; cnt_parts < rel1->nparts; cnt_parts++)
+			mark_dummy_rel(rel1->part_rels[cnt_parts]);
+	}
+	if (IS_DUMMY_REL(rel2))
+	{
+		for (cnt_parts = 0; cnt_parts < rel2->nparts; cnt_parts++)
+			mark_dummy_rel(rel2->part_rels[cnt_parts]);
+	}
 
 	/*
 	 * Since this join relation is partitioned, all the base relations
@@ -1361,11 +1366,6 @@ try_partition_wise_join(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 
 	nparts = joinrel->nparts;
 
-	/* Allocate space to hold child-joins RelOptInfos, if not already done. */
-	if (!joinrel->part_rels)
-		joinrel->part_rels =
-			(RelOptInfo **) palloc0(sizeof(RelOptInfo *) * nparts);
-
 	/*
 	 * Create child-join relations for this partitioned join, if those don't
 	 * exist. Add paths to child-joins for a pair of child relations
diff --git a/src/backend/optimizer/util/relnode.c b/src/backend/optimizer/util/relnode.c
index 3bd1063..21fd29f 100644
--- a/src/backend/optimizer/util/relnode.c
+++ b/src/backend/optimizer/util/relnode.c
@@ -1746,4 +1746,9 @@ build_joinrel_partition_info(RelOptInfo *joinrel, RelOptInfo *outer_rel,
 		joinrel->partexprs[cnt] = partexpr;
 		joinrel->nullable_partexprs[cnt] = nullable_partexpr;
 	}
+
+	/* Allocate space to hold child-joins RelOptInfos. */
+	joinrel->part_rels =
+		(RelOptInfo **) palloc0(sizeof(RelOptInfo *) * joinrel->nparts);
+
 }
-- 
1.7.9.5

