"make installcheck" fails in src/test/recovery

2019-04-17 Thread Tom Lane
Up until quite recently, it worked to do "make installcheck" in
src/test/recovery, following the instructions in the README
file there:

NOTE: You must have given the --enable-tap-tests argument to configure.
Also, to use "make installcheck", you must have built and installed
contrib/test_decoding in addition to the core code.

Run
make check
or
make installcheck
You can use "make installcheck" if you previously did "make install".
In that case, the code in the installation tree is tested.  With
"make check", a temporary installation tree is built from the current
sources and then tested.

Now, however, the 016_min_consistency.pl test is falling over,
with symptoms indicating that it expects to have the pageinspect
extension installed as well:

error running SQL: 'psql::2: ERROR:  could not open extension control file
"/home/postgres/testversion/share/extension/pageinspect.control": No such file or directory'
while running 'psql -XAtq -d port=64106 host=/tmp/KaoBFubKfw dbname='postgres'
-f - -v ON_ERROR_STOP=1' with sql '
CREATE EXTENSION pageinspect;
...

Is this extra dependency actually essential?  I'm not really
happy about increasing the number of moving parts in this test.

regards, tom lane




Re: Runtime pruning problem

2019-04-17 Thread Amit Langote
On 2019/04/17 13:10, Tom Lane wrote:
> David Rowley  writes:
>> On Wed, 17 Apr 2019 at 15:54, Tom Lane  wrote:
>>> What I'm more worried about is whether this breaks any internal behavior
>>> of explain.c, as the comment David quoted upthread seems to think.
>>> If we need to have a tlist to reference, can we make that code look
>>> to the pre-pruning plan tree, rather than the planstate tree?
> 
>> I think most of the complexity is in what to do in
>> set_deparse_planstate() given that there might be no outer plan to
>> choose from for Append and MergeAppend. This controls what's done in
>> resolve_special_varno() as this descends the plan tree down the outer
>> side until it gets to the node that the outer var came from.
> 
>> We wouldn't need to do this if we just didn't show the targetlist in
>> EXPLAIN VERBOSE, but there's also MergeAppend sort keys to worry about
>> too.  Should we just skip on those as well?
> 
> No, the larger issue is that *any* plan node above the Append might
> be recursing down to/through the Append to find out what to print for
> a Var reference.  We have to be able to support that.

Hmm, yes.

I see that the targetlist of Append, MergeAppend, and ModifyTable nodes is
finalized using set_dummy_tlist_references(), wherein the Vars in the
nodes' (Plan struct's) targetlist are modified to be OUTER_VAR vars.  The
comments around set_dummy_tlist_references() say it's done so that
explain.c can intercept any accesses to variables in these nodes'
targetlist and return the corresponding variables from the node's first
child subplan's targetlist, which must contain all the variables in the
node's targetlist.  This arrangement makes it mandatory for these nodes to
have at least one subplan, hence the hack in the runtime pruning code.

I wonder why the original targetlist of these nodes, adjusted using just
fix_scan_list(), wouldn't have been better for EXPLAIN to use?  If I
replace the set_dummy_tlist_references() call by fix_scan_list() for
Append for starters, I see that the targetlist of any nodes on top of the
Append list the Append's output variables without a "refname" prefix.
That can be confusing if the same parent table (Append's parent relation)
is referenced multiple times.  The refname is empty, because
select_rtable_names_for_explain() thinks an Append hasn't got one.  Same
is true for MergeAppend.  ModifyTable, OTOH, has one because it has the
nominalRelation field.  Maybe it's not possible to have such a field for
Append and MergeAppend, because they don't *always* refer to a single
table (UNION ALL, partitionwise join come to mind).  Anyway, even if we do
manage to print a refname for Append/MergeAppend somehow, that wouldn't be
back-patchable to 11.

Another idea is to teach explain.c about this special case of run-time
pruning having pruned all child subplans even though appendplans contains
one element to cater for targetlist accesses.  That is, Append will be
displayed with "Subplans Removed: All" and no child subplans listed below
it, even though appendplans[] has one.  David already said he didn't do
that in the first place to avoid PartitionPruneInfo details creeping into
other modules, but maybe there's no other way?
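
For concreteness, the scenario looks something like this (just a sketch;
plan_cache_mode is the v12 knob for forcing a generic plan):

CREATE TABLE pt (a int) PARTITION BY LIST (a);
CREATE TABLE pt1 PARTITION OF pt FOR VALUES IN (1);
CREATE TABLE pt2 PARTITION OF pt FOR VALUES IN (2);
PREPARE q (int) AS SELECT * FROM pt WHERE a = $1;
SET plan_cache_mode = force_generic_plan;
-- No partition accepts 3, so run-time pruning discards every subplan;
-- how EXPLAIN should render this case is exactly what we're discussing.
EXPLAIN (COSTS OFF) EXECUTE q(3);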

Thanks,
Amit





proposal: psql PSQL_TABULAR_PAGER variable

2019-04-17 Thread Pavel Stehule
Hi

I wrote a pspg pager https://github.com/okbob/pspg

This pager is designed for tabular data.  It can work in fallback mode as a
classic pager, but it is not designed for that purpose (and I don't plan to
make it one).  Could we extend the set of psql environment variables with a
PSQL_TABULAR_PAGER variable?  This pager would be used whenever psql
displays tabular data.

Comments, notes?

Regards

Pavel


Re: bug in update tuple routing with foreign partitions

2019-04-17 Thread Amit Langote
On 2019/04/18 14:06, Amit Langote wrote:
> Fujita-san,
> 
> On 2019/04/17 21:49, Etsuro Fujita wrote:
>> (2019/04/11 20:31), Etsuro Fujita wrote:
>>> (2019/04/11 14:33), Amit Langote wrote:
 BTW, have you noticed that the RETURNING clause returns the same row
 twice?
>>>
>>> I noticed this, but I didn't think it hard. :(
>>>
>>>> +update utrtest set a = 3 from (values (2), (3)) s(x) where a = s.x
>>>> returning *;
>>>> + a | b | x
>>>> +---+---+---
>>>> + 3 | qux triggered ! | 2
>>>> + 3 | xyzzy triggered ! | 3
>>>> + 3 | qux triggered ! | 3
>>>> +(3 rows)
>>>>
>>>> You can see that the row that's moved into remp is returned twice (one
>>>> with "qux triggered!" in b column), whereas it should've been only once?
>>>
>>> Yeah, this is unexpected behavior, so will look into this.
> 
> Thanks for investigating.
> 
>> I think the reason for that is: the row routed to remp is incorrectly
>> fetched as a to-be-updated row when updating remp as a subplan targetrel.
> 
> Yeah.  In the fully-local case, that is, where both the source and the
> target partition of a row movement operation are local tables, heap AM
> ensures that tuples that are moved into a given relation in the same command
> (by way of row movement) are not returned as to-be-updated, because it
> deems such tuples invisible.  The "same command" part is crucial for
> that to work.
> 
> In the case where the target of a row movement operation is a foreign
> table partition, the INSERT used as part of row movement and subsequent
> UPDATE of the same foreign table are distinct commands for the remote
> server.  So, the rows inserted by the 1st command (as part of the row
> movement) are deemed visible by the 2nd command (UPDATE) even if both are
> operating within the same transaction.
> 
> I guess there's no easy way for postgres_fdw to make the remote server
> consider them as a single command.  IOW, no way to make the remote server
> not touch those "moved-in" rows during the UPDATE part of the local query.
>  
>> The right way to fix this would be to have some way in postgres_fdw in
>> which we don't fetch such rows when updating such subplan targetrels.  I
>> tried to figure out a (simple) way to do that, but I couldn't.
> 
> Yeah, that seems a bit hard to ensure with our current infrastructure.
> 
>> One
>> probably-simple solution I came up with is to sort subplan targetrels into
>> the order in which foreign-table subplan targetrels get processed first in
>> ExecModifyTable().  (Note: currently, rows can be moved from local
>> partitions to a foreign-table partition, but rows cannot be moved from
>> foreign-table partitions to another partition, so we wouldn't encounter
>> this situation once we sort like that.)  But I think that's ugly, and I
>> don't think it's a good idea to change the core, just for postgres_fdw.
> 
> Agreed that it seems like contorting the core code to accommodate
> limitations of postgres_fdw.
> 
>> So what I'm thinking is to throw an error for cases like this.  (Though, I
>> think we should keep to allow rows to be moved from local partitions to a
>> foreign-table subplan targetrel that has been updated already.)
> 
Hmm, how would you distinguish (totally inside postgres_fdw I presume) the
> two cases?

Forgot to say that since this is a separate issue from the original bug
report, maybe we can finish fixing the latter first.  What do you think?

Thanks,
Amit





Re: bug in update tuple routing with foreign partitions

2019-04-17 Thread Amit Langote
Fujita-san,

On 2019/04/17 21:49, Etsuro Fujita wrote:
> (2019/04/11 20:31), Etsuro Fujita wrote:
>> (2019/04/11 14:33), Amit Langote wrote:
>>> BTW, have you noticed that the RETURNING clause returns the same row
>>> twice?
>>
>> I noticed this, but I didn't think it hard. :(
>>
>>> +update utrtest set a = 3 from (values (2), (3)) s(x) where a = s.x
>>> returning *;
>>> + a | b | x
>>> +---+---+---
>>> + 3 | qux triggered ! | 2
>>> + 3 | xyzzy triggered ! | 3
>>> + 3 | qux triggered ! | 3
>>> +(3 rows)
>>>
>>> You can see that the row that's moved into remp is returned twice (one
>>> with "qux triggered!" in b column), whereas it should've been only once?
>>
>> Yeah, this is unexpected behavior, so will look into this.

Thanks for investigating.

> I think the reason for that is: the row routed to remp is incorrectly
> fetched as a to-be-updated row when updating remp as a subplan targetrel.

Yeah.  In the fully-local case, that is, where both the source and the
target partition of a row movement operation are local tables, heap AM
ensures that tuples that are moved into a given relation in the same command
(by way of row movement) are not returned as to-be-updated, because it
deems such tuples invisible.  The "same command" part is crucial for
that to work.

In the case where the target of a row movement operation is a foreign
table partition, the INSERT used as part of row movement and subsequent
UPDATE of the same foreign table are distinct commands for the remote
server.  So, the rows inserted by the 1st command (as part of the row
movement) are deemed visible by the 2nd command (UPDATE) even if both are
operating within the same transaction.

I guess there's no easy way for postgres_fdw to make the remote server
consider them as a single command.  IOW, no way to make the remote server
not touch those "moved-in" rows during the UPDATE part of the local query.
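
To make the "same command" rule concrete, here is a minimal local-only
sketch of the visibility behavior being relied on:

begin;
create table v (a int);
insert into v values (1);
-- A single command does not see rows it inserted itself; this
-- INSERT ... SELECT inserts exactly one row (2), not an endless chain:
insert into v select a + 1 from v;
-- A later command in the same transaction sees everything so far:
select * from v;
commit;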
 
> The right way to fix this would be to have some way in postgres_fdw in
> which we don't fetch such rows when updating such subplan targetrels.  I
> tried to figure out a (simple) way to do that, but I couldn't.

Yeah, that seems a bit hard to ensure with our current infrastructure.

> One
> probably-simple solution I came up with is to sort subplan targetrels into
> the order in which foreign-table subplan targetrels get processed first in
> ExecModifyTable().  (Note: currently, rows can be moved from local
> partitions to a foreign-table partition, but rows cannot be moved from
> foreign-table partitions to another partition, so we wouldn't encounter
> this situation once we sort like that.)  But I think that's ugly, and I
> don't think it's a good idea to change the core, just for postgres_fdw.

Agreed that it seems like contorting the core code to accommodate
limitations of postgres_fdw.

> So what I'm thinking is to throw an error for cases like this.  (Though, I
> think we should keep to allow rows to be moved from local partitions to a
> foreign-table subplan targetrel that has been updated already.)

Hmm, how would you distinguish (totally inside postgres_fdw I presume) the
two cases?

Thanks,
Amit





Re: [HACKERS] Weaker shmem interlock w/o postmaster.pid

2019-04-17 Thread Thomas Munro
On Thu, Apr 11, 2019 at 6:22 PM Noah Misch  wrote:
>  -   my $iaddr = inet_aton($test_localhost);
> +   my $iaddr = inet_aton('0.0.0.0');

This causes make check-world to deliver a flurry of pop-ups from
macOS's built-in Firewall asking if perl should be allowed to listen
to all interfaces (well I didn't catch the exact message, but that's
probably the drift).  Not sure if they'd go away permanently if I
managed to click OK before they disappear, but it's fun trying.  The
silly firewall facility is not actually enabled by default on this OS,
but unfortunately this company-issued machine has it forced to on.
This isn't really an objection to the code, it's more of a bemused
anecdote about a computer that can't decide whether it's a Unix
workstation or a Fisher Price My First Computer.

-- 
Thomas Munro
https://enterprisedb.com




Re: Cleanup/remove/update references to OID column

2019-04-17 Thread Justin Pryzby
On Wed, Apr 17, 2019 at 05:51:15PM -0700, Andres Freund wrote:
> Hi,
> 
> On 2019-04-17 19:42:19 -0500, Justin Pryzby wrote:
> > diff --git a/doc/src/sgml/information_schema.sgml 
> > b/doc/src/sgml/information_schema.sgml
> > index 234a3bb..9c618b1 100644
> > --- a/doc/src/sgml/information_schema.sgml
> > +++ b/doc/src/sgml/information_schema.sgml
> > @@ -1312,8 +1312,8 @@
> >
> > The view columns contains information about all
> > table columns (or view columns) in the database.  System columns
> > -   (ctid, etc.) are not included.  Only those columns 
> > are
> > -   shown that the current user has access to (by way of being the
> > +   (ctid, etc.) are not included.  The only columns 
> > shown
> > +   are those to which the current user has access (by way of being the
> > owner or having some privilege).
> >
> 
> I don't see the point of this change? Nor what it has to do with oids?

It doesn't have to do with oids, but seems more correct and cleaner...to my
eyes.

> > -   rows was exactly one and the target table was
> > +   count was exactly one and the target table 
> > was
> The rows reference is from your change
> :(.

Ouch, not sure how I did that..sorry for the noise (twice).

Thanks,
Justin




Re: ALTER TABLE with ADD COLUMN and ADD PRIMARY KEY USING INDEX throws spurious "column contains null values"

2019-04-17 Thread Tom Lane
"Zhang, Jie"  writes:
> Here's a patch to fix this bug.

I took a look at this patch, but I really dislike it: it adds a mighty
ad-hoc parameter to a whole bunch of functions that shouldn't really
have anything to do with the problem.  Moreover, I have little confidence
that it's really solving the problem and not introducing any new problems
(such as failure to apply the not-null check when we need to).

I think the real problem is exactly that we've got index_check_primary_key
doing its own mini ALTER TABLE in ignorance of what might be happening
in the outer ALTER TABLE.  That's just ripe for order-of-operations
problems, seeing that ALTER TABLE has a lot of very careful decisions
about which steps have to happen before which other ones.  Moreover,
as this old comment notes, it's a horridly inefficient approach if
the table is large:

/*
 * XXX: possible future improvement: when being called from ALTER TABLE,
 * it would be more efficient to merge this with the outer ALTER TABLE, so
 * as to avoid two scans.  But that seems to complicate DefineIndex's API
 * unduly.
 */

So I thought a bit about how to fix that, and realized that we could
easily adjust the parser to emit AT_SetNotNull subcommands as part of the
outer ALTER TABLE that has the ADD PRIMARY KEY subcommand.  Then,
index_check_primary_key doesn't need to do anything at all about adding
NOT NULL, although it seems like a good safety feature for it to check
that somebody else already added that.
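
In SQL terms, the transformation amounts to something like this (a sketch
of the idea, not the actual implementation):

CREATE TABLE t (a int);
-- A command such as
--     ALTER TABLE t ADD COLUMN k int DEFAULT 1, ADD PRIMARY KEY (k);
-- would be processed as though it had been written with an explicit
-- SET NOT NULL step:
ALTER TABLE t
    ADD COLUMN k int DEFAULT 1,
    ALTER COLUMN k SET NOT NULL,
    ADD PRIMARY KEY (k);
-- so the NOT NULL work is ordered by the regular ALTER TABLE pass logic
-- rather than being applied ad hoc inside index_check_primary_key().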

So, attached is a WIP patch that fixes it that way.  Some notes
for review:

* I initially thought that index_check_primary_key could be simplified
to the point where it *only* throws an error for missing NOT NULL.
This soon proved to be wrong, because the comments for the function
are lies, or at least obsolete: there are multiple scenarios in which
a CREATE TABLE with a PRIMARY KEY option does need this function to
perform ALTER TABLE SET NOT NULL.  Fortunately, that's not so expensive
in that case, since the table must be empty.  So as coded, it throws
an error if is_alter_table, and otherwise does what it did before.

* We need to fix the order of operations in ALTER TABLE phase 2 so that
any AT_SetNotNull subcommands happen after the AT_PASS_ADD_COL pass
(else the column might not be there yet) and before AT_PASS_ADD_INDEX
(else index_check_primary_key will complain).  I did this by putting
AT_SetNotNull into the AT_PASS_COL_ATTRS pass and moving that to after
AT_PASS_ADD_COL; it had been before AT_PASS_ADD_COL, but that seems at
best random and at worst broken.  (AT_SetIdentity is the only existing
subcommand using AT_PASS_COL_ATTRS, and it sure seems like it'd make more
sense to run it after AT_PASS_ADD_COL, so that it can work on a column
being added in the same ALTER.  Am I missing something?)

* Some existing regression tests for "ALTER TABLE ONLY partitioned_table
ADD PRIMARY KEY" failed.  That apparently is supposed to work if all
partitions already have a suitable unique index and NOT NULL constraint,
but it was failing because ATPrepSetNotNull wasn't expecting to be used
this way.  I thought that function was a pretty terrible kluge anyway,
so what I did was to refactor things so that in this scenario we just
apply checks to see if all the partitions already have suitable NOT NULL.
Note that this represents looser semantics than what was there before,
because previously you couldn't say "ALTER TABLE ONLY partitioned_table
SET NOT NULL" if there were any partitions; now you can, if the partitions
all have suitable NOT NULL already.  We probably ought to change the error
message to reflect that, but I didn't yet.
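
Concretely, the looser behavior looks about like this (sketch, not a
committed test case):

CREATE TABLE pt (a int) PARTITION BY LIST (a);
CREATE TABLE pt1 PARTITION OF pt FOR VALUES IN (1);
ALTER TABLE pt1 ALTER COLUMN a SET NOT NULL;
-- Previously this was rejected outright whenever partitions existed;
-- now it is accepted because every partition already has a suitable
-- NOT NULL constraint:
ALTER TABLE ONLY pt ALTER COLUMN a SET NOT NULL;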

* A couple of existing test cases change error messages, like so:

-ERROR:  column "test1" named in key does not exist
+ERROR:  column "test1" of relation "atacc1" does not exist

This is because the added AT_SetNotNull subcommand runs before
AT_AddIndex, so it's the one that notices that there's not really
any such column, and historically it's spelled its error message
differently.  This change seems all to the good to me, so I didn't
try to avoid it.

* I haven't yet added any test case(s) reflecting the bug fix nor
the looser semantics for adding NOT NULL to partitioned tables.
It does pass check-world as presented.

* I'm not sure whether we want to try to back-patch this, or how
far it should go.  The misbehavior has been there a long time
(at least back to 8.4, I didn't check further); so the lack of
previous reports means people seldom try to do it.  That may
indicate that it's not worth taking any risks of new bugs to
squash this one.  (Also, I suspect that it might take a lot of
work to port this to before v10, because there are comments
suggesting that this code worked a good bit differently before.)
I do think we should shove this into v12 though.

Comments?

regards, tom lane

diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index e9399be..0dbc410 

Re: pgsql: Fix plan created for inherited UPDATE/DELETE with all tables exc

2019-04-17 Thread Amit Langote
On 2019/02/23 2:23, Tom Lane wrote:
> Fix plan created for inherited UPDATE/DELETE with all tables excluded.
> 
> In the case where inheritance_planner() finds that every table has
> been excluded by constraints, it thought it could get away with
> making a plan consisting of just a dummy Result node.  While certainly
> there's no updating or deleting to be done, this had two user-visible
> problems: the plan did not report the correct set of output columns
> when a RETURNING clause was present, and if there were any
> statement-level triggers that should be fired, it didn't fire them.
> 
> Hence, rather than only generating the dummy Result, we need to
> stick a valid ModifyTable node on top, which requires a tad more
> effort here.
> 
> It's been broken this way for as long as inheritance_planner() has
> known about deleting excluded subplans at all (cf commit 635d42e9c),
> so back-patch to all supported branches.

I noticed that we may have forgotten to fix one more thing in this commit
-- nominalRelation may refer to a range table entry that's not referenced
elsewhere in the finished plan:

create table parent (a int);
create table child () inherits (parent);
explain verbose update parent set a = a where false;
QUERY PLAN
───
 Update on public.parent  (cost=0.00..0.00 rows=0 width=0)
   Update on public.parent
   ->  Result  (cost=0.00..0.00 rows=0 width=0)
 Output: a, ctid
 One-Time Filter: false
(5 rows)

I think the "Update on public.parent" shown in the 2nd row is unnecessary,
because it won't really be updated.  It's shown because nominalRelation
doesn't match resultRelation, which prompts explain.c to print the
child target relations separately per this code in show_modifytable_info():

/* Should we explicitly label target relations? */
labeltargets = (mtstate->mt_nplans > 1 ||
(mtstate->mt_nplans == 1 &&
 mtstate->resultRelInfo->ri_RangeTableIndex !=
node->nominalRelation));

Attached a patch to adjust nominalRelation in this case so that "parent"
won't be shown a second time, as follows:

explain verbose update parent set a = a where false;
QUERY PLAN
───
 Update on public.parent  (cost=0.00..0.00 rows=0 width=0)
   ->  Result  (cost=0.00..0.00 rows=0 width=0)
 Output: parent.a, parent.ctid
 One-Time Filter: false
(4 rows)

As you may notice, Result node's targetlist is shown differently than
before, that is, columns are shown prefixed with table name.  In the old
output, the prefix or "refname"  ends up NULL, because the Result node's
targetlist uses resultRelation (1) as varno, which is not referenced
anywhere in the plan tree (for ModifyTable, explain.c prefers to use
nominalRelation instead of resultRelation).  With the patch applied,
nominalRelation == resultRelation, so it is referenced and hence its
"refname" is non-NULL.  Maybe this change is fine though?

Thanks,
Amit
diff --git a/src/backend/optimizer/plan/planner.c 
b/src/backend/optimizer/plan/planner.c
index 5ed691c2e3..ac1cc51cb4 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -1728,6 +1728,7 @@ inheritance_planner(PlannerInfo *root)
subpaths = list_make1(dummy_path);
subroots = list_make1(root);
resultRelations = list_make1_int(parse->resultRelation);
+   nominalRelation = parse->resultRelation;
if (parse->withCheckOptions)
withCheckOptionLists = 
list_make1(parse->withCheckOptions);
if (parse->returningList)
diff --git a/src/test/regress/expected/inherit.out 
b/src/test/regress/expected/inherit.out
index c6abcfc3cb..1a8cbd4e6f 100644
--- a/src/test/regress/expected/inherit.out
+++ b/src/test/regress/expected/inherit.out
@@ -545,27 +545,25 @@ create table some_tab_child () inherits (some_tab);
 insert into some_tab_child values(1,2);
 explain (verbose, costs off)
 update some_tab set a = a + 1 where false;
-QUERY PLAN
---
+ QUERY PLAN  
+-
  Update on public.some_tab
-   Update on public.some_tab
->  Result
- Output: (a + 1), b, ctid
+ Output: (some_tab.a + 1), some_tab.b, some_tab.ctid
  One-Time Filter: false
-(5 rows)
+(4 rows)
 
 update some_tab set a = a + 1 where false;
 explain (verbose, costs off)
 update some_tab set a = a + 1 where false returning b, a;
-QUERY PLAN
---
+ QUERY PLAN  
+-
  Update on public.some_tab
-   Output: b, a
-   Update on public.some_tab
+   Output: 

Pathological performance when inserting many NULLs into a unique index

2019-04-17 Thread Peter Geoghegan
I thought of a case that results in pathological performance due to a
behavior of my nbtree patch series:

regression=# create table uniquenulls(nully int4, constraint pp unique(nully));
CREATE TABLE
Time: 10.694 ms
regression=# insert into uniquenulls select i from generate_series(1, 1e6) i;
INSERT 0 1000000
Time: 1356.025 ms (00:01.356)
regression=# insert into uniquenulls select null from generate_series(1, 1e6) i;
INSERT 0 1000000
Time: 270834.196 ms (04:30.834)

The issue here is that the duration of the second INSERT statement is
wildly excessive, because _bt_stepright() needs to step right many
many times for each tuple inserted. I would expect the second insert
to take approximately as long as the first, but it takes ~200x longer
here. It could take much longer still if we pushed this example
further. What we see here is a limited case in which the O(n ^ 2)
performance that "getting tired" used to prevent can occur. Clearly
that's totally unacceptable. Mea culpa.

Sure enough, the problem goes away when the index isn't a unique index
(i.e. in the UNIQUE_CHECK_NO case):

regression=# alter table uniquenulls drop constraint pp;
ALTER TABLE
Time: 28.968 ms
regression=# create index on uniquenulls (nully);
CREATE INDEX
Time: 1159.958 ms (00:01.160)
regression=# insert into uniquenulls select null from generate_series(1, 1e6) i;
INSERT 0 1000000
Time: 1155.708 ms (00:01.156)

Tentatively, I think that the fix here is to not "itup_key->scantid =
NULL" when a NULL value is involved (i.e. don't "itup_key->scantid =
NULL" when we see IndexTupleHasNulls(itup) within _bt_doinsert()). We
may also want to avoid calling _bt_check_unique() entirely in this
situation. That way, the performance should be the same as the
UNIQUE_CHECK_NO case: we descend to the appropriate leaf page
directly, and then we're done. We don't have to find the appropriate
leaf page by groveling through indefinitely-many existing leaf pages
that are full of NULL values, because we know that there cannot ever
be a unique violation for us to detect.
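
The premise is simply that NULLs never collide in a unique index, so
skipping the check loses nothing:

create table un (x int unique);
insert into un values (null), (null), (null);  -- fine, no violation possible
insert into un values (1), (1);                -- ERROR: duplicate key value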

I'll create an open item for this, and begin work on a patch tomorrow.
-- 
Peter Geoghegan




RE: Speeding up creating UPDATE/DELETE generic plan for partitioned table into a lot

2019-04-17 Thread Kato, Sho
Hello, hackers.

I would like advice on how to design the gradual creation of generic plans
for child tables.

Currently, a generic plan is created for all child tables by specifying
NULL in boundParams of pg_plan_queries().  Conversely, when a parameter is
specified in boundParams, a custom plan is created, so pg_plan_queries()
cannot create a generic plan for child tables gradually.

Therefore, before creating a plan for all child tables, I want to stop planning 
and resume planning using the parameters specified in EXECUTE.
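
To illustrate the difference (a rough sketch; plan_cache_mode is the v12
setting for forcing either kind of plan):

CREATE TABLE t (a int) PARTITION BY LIST (a);
CREATE TABLE t1 PARTITION OF t FOR VALUES IN (1);
CREATE TABLE t2 PARTITION OF t FOR VALUES IN (2);
PREPARE q (int) AS SELECT * FROM t WHERE a = $1;
SET plan_cache_mode = force_custom_plan;
EXPLAIN (COSTS OFF) EXECUTE q(1);   -- plans only the matching partition t1
SET plan_cache_mode = force_generic_plan;
EXPLAIN (COSTS OFF) EXECUTE q(1);   -- plans all partitions up front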

First, I would like opinions on the following points.

- When should the planner stop planning?
By comparing num_live_parts and partdesc->nparts in
expand_partitioned_rtentry(), it seems possible to judge whether the
planner should stop planning in the UPDATE or SELECT case.

- What information does the planner need in order to resume planning?
I think caching PlannerInfo would work for resuming planning, but I'm not
confident.  If there is another good way, please let me know.

- How should the generic plan for the target child table be created?
If the planner creates a custom plan and then converts Const nodes back to
Param nodes, it might be able to produce a generic plan for the target
child table, but I'm not confident about this either.  If there is another
good way, please let me know.

regards,
> -Original Message-
> From: Kato, Sho [mailto:kato-...@jp.fujitsu.com]
> Sent: Wednesday, February 20, 2019 3:11 PM
> To: Tsunakawa, Takayuki/綱川 貴之 ;
> 'David Rowley' ; Amit Langote
> 
> Cc: Pg Hackers 
> Subject: RE: Speeding up creating UPDATE/DELETE generic plan for
> partitioned table into a lot
> 
> Before addressing to speeding up creating generic plan of UPDATE/DELETE,
> I will begin with the speed up creating SELECT plan.
> 
> I will explain the background as time has passed.
> Since generic plan creates plans of all partitions and is cached, we can
> skip planning and expect performance improvements.
> But, When a table is partitioned into thousands, it takes time to execute
> a generic plan for the first time because planner creates plans for all
> child tables including a child table that may not be accessed.
> 
> Therefore, I would like to develop a method to gradually create a child
> table plan instead of creating and caching all child table plans at once
> at EXECUTE.
> I came up with a mechanism that caches the information like PlannerInfo
> -- necessary to create the plan and the plan and adds the access plan
> of the child table to the cached plan.
> 
> However, I'm not sure that this can be realized and this is right, so
> I want an opinion.
> Also, I'd like advice if it would be better to create a new path for
> partitioning like "Partition Scan Path" or "Partition Index Scan Path".
> 
> regards,
> Sho Kato
> 
> > -Original Message-
> > From: Kato, Sho [mailto:kato-...@jp.fujitsu.com]
> > Sent: Friday, February 1, 2019 5:16 PM
> > To: Tsunakawa, Takayuki/綱川 貴之 ;
> 'David
> > Rowley' ; Amit Langote
> > 
> > Cc: Pg Hackers 
> > Subject: Speeding up creating UPDATE/DELETE generic plan for
> > partitioned table into a lot
> >
> > Sorry, I lost previous mail[1].
> >
> > On Fri, 28 Dec 2018 at 20:36, Tsunakawa, Takayuki
> >  wrote:
> > > Although I may say the same thing as you, I think a natural idea
> > > would
> > be to create a generic plan gradually.  The starting simple question
> > is "why do we have to touch all partitions at first?"  That is, can
> we
> > behave like this:
> >
> > I also think creating a generic plan gradually is a better idea
> > because planner should create a plan when it is needed.
> > Any ideas?
> >
> > On 2018-12-31 08:57:04, David Rowley wrote
> > >I imagine the place to start looking would be around why planning is
> > so slow for that many partitions.
> >
> > As you may already know, FLATCOPY at range_table_mutator has a large
> > bottleneck.
> > Executing UPDATE, about npart squared RangeTblEntry is copied.
> > When I execute UPDATE to 100 partitioned table, FLATCOPY takes about
> > 100
> > * 0.067 ms while total planning time takes 12.689 ms.
> >
> > On 2018-12-31 08:57:04, David Rowley wrote
> > >Another possible interesting idea would be to, instead of creating
> > >large Append/MergeAppend plans for partition scanning, invent some
> > >"Partition Seq Scan" and "Partition Index Scan" nodes that are able
> > >to build plans more similar to scanning a normal table. Likely such
> > >nodes would need to be programmed with a list of Oids that they're
> to
> > >scan during their execution. They'd also need to take care of their
> > >own tuple mapping for when partitions had their columns in varying
> orders.
> >
> > Inventing some "Partition Seq Scan" and "Partition Index Scan" nodes
> > is interesting.
> > It seems easy to add Scan nodes to each partition gradually.
> >
> >
> [1]:CAKJS1f-y1HQK+VjG7=C==vGcLnzxjN8ysD5NmaN8Wh4=VsYipw@mail.gmail.c
> > om
> >
> > regards,
> >
> >
> 
> 






REINDEX INDEX results in a crash for an index of pg_class since 9.6

2019-04-17 Thread Michael Paquier
Hi all,

Fujii-san has sent me a report offline about REINDEX.  And since 9.6,
trying to REINDEX directly an index of pg_class fails lamentably on an
assertion failure (mbsync failure if bypassing the assert):
#2  0x55a9c5bfcc2c in ExceptionalCondition
 (conditionName=0x55a9c5ca9750
 "!(!ReindexIsProcessingIndex(((indexRelation)->rd_id)))",
 errorType=0x55a9c5ca969f "FailedAssertion",
 fileName=0x55a9c5ca9680 "indexam.c", lineNumber=204) at assert.c:54
#3  0x55a9c5686dcd in index_insert (indexRelation=0x7f58402fe5d8,
 values=0x7fff450c3270, isnull=0x7fff450c3250,
 heap_t_ctid=0x55a9c7e2c05c,
 heapRelation=0x7f584031eb68, checkUnique=UNIQUE_CHECK_YES,
 indexInfo=0x55a9c7e30520) at indexam.c:204
#4  0x55a9c5714a12 in CatalogIndexInsert
 (indstate=0x55a9c7e30408, heapTuple=0x55a9c7e2c058) at indexing.c:140
#5  0x55a9c5714b1d in CatalogTupleUpdate (heapRel=0x7f584031eb68,
 otid=0x55a9c7e2c05c, tup=0x55a9c7e2c058) at indexing.c:215
#6  0x55a9c5beda8a in RelationSetNewRelfilenode
 (relation=0x7f58402fe5d8, persistence=112 'p') at relcache.c:3531

Doing a REINDEX TABLE directly on pg_class proves to work correctly,
and CONCURRENTLY is not supported for catalog tables.
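
For reference, the failing and working cases look like this
(pg_class_oid_index is just one example of a pg_class index):

REINDEX INDEX pg_class_oid_index;   -- trips the assertion since 9.6
REINDEX TABLE pg_class;             -- works fine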

Bisecting my way through it, the first commit causing the breakage is
that:
commit: 01e386a325549b7755739f31308de4be8eea110d
author: Tom Lane 
date: Wed, 23 Dec 2015 20:09:01 -0500
Avoid VACUUM FULL altogether in initdb.

Commit ed7b3b3811c5836a purported to remove initdb's use of VACUUM
FULL,
as had been agreed to in a pghackers discussion back in Dec 2014.
But it missed this one ...

The reason why this does not work is that CatalogIndexInsert() tries
to do an index_insert directly on the index worked on.  And the reason
why this works at table level is that we have tweaks in
reindex_relation() to enforce the list of valid indexes in the
relation cache with RelationSetIndexList().  It seems to me that the
logic in reindex_index() is wrong from the start, and that all the
index list handling done in reindex_relation() should just be in
reindex_index(), so that REINDEX INDEX does the right thing.

Thoughts?
--
Michael




Re: jsonpath

2019-04-17 Thread John Naylor
On Thu, Apr 18, 2019 at 1:43 AM Tom Lane  wrote:
>
> John Naylor  writes:
> > Attached is a patch to fix some minor issues:
> > -misspelling of an error message
>
> Yeah, I'd noticed that one too :-(.  I think the whole jsonpath patch
> needs a sweep to bring its error messages into line with our style
> guidelines, but no harm in starting with the obvious bugs.
>
> > -Commit 550b9d26f80f failed to update the Makefile comment to reflect
> > how the build changed, and also removed the clean target, which we now
> > have use for since we later got rid of backtracking in the scanner.
>
> Right.  I'm not really sure why we're bothering with anti-backtracking
> here, or with using speed-rather-than-code-space lexer optimization
> options.  It's hard for me to credit that any practically-useful jsonpath
> pattern would be long enough for lexer speed to matter, and even harder to
> credit that the speed of the flex code itself would be an important factor
> in the overall processing cost of a long jsonpath.  Still, as long as we
> have the code it needs to be right.

I was wondering about that. I measured the current size of
yy_transition to be 36492 on my machine. With the flag -Cfe, which
gives the smallest representation without backtracking, yy_nxt is 6336
(there is no yy_transition). I'd say that's a large enough difference
that we'd want the smaller representation if it makes little
difference in performance.

-- 
John Naylor                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Cleanup/remove/update references to OID column

2019-04-17 Thread Andres Freund
Hi,

On 2019-04-17 19:42:19 -0500, Justin Pryzby wrote:
> diff --git a/doc/src/sgml/information_schema.sgml 
> b/doc/src/sgml/information_schema.sgml
> index 234a3bb..9c618b1 100644
> --- a/doc/src/sgml/information_schema.sgml
> +++ b/doc/src/sgml/information_schema.sgml
> @@ -1312,8 +1312,8 @@
>
> The view columns contains information about all
> table columns (or view columns) in the database.  System columns
> -   (ctid, etc.) are not included.  Only those columns are
> -   shown that the current user has access to (by way of being the
> +   (ctid, etc.) are not included.  The only columns shown
> +   are those to which the current user has access (by way of being the
> owner or having some privilege).
>

I don't see the point of this change? Nor what it has to do with oids?


> diff --git a/doc/src/sgml/ref/insert.sgml b/doc/src/sgml/ref/insert.sgml
> index 189ce2a..f995a76 100644
> --- a/doc/src/sgml/ref/insert.sgml
> +++ b/doc/src/sgml/ref/insert.sgml
> @@ -554,7 +554,7 @@ INSERT oid  class="parameter">count The count is the number of
> rows inserted or updated.  oid is always 0 (it
> used to be the OID assigned to the inserted row if
> -   rows was exactly one and the target table was
> +   count was exactly one and the target table was
> declared WITH OIDS and 0 otherwise, but creating a 
> table
> WITH OIDS is not supported anymore).
>

The rows reference is from your change
:(. I'll fold it into another upcoming change for other tableam comment
improvements (by Heikki).

Greetings,

Andres Freund




Re: Cleanup/remove/update references to OID column

2019-04-17 Thread Justin Pryzby
On Wed, Apr 17, 2019 at 05:23:47PM -0700, Andres Freund wrote:
> Thanks for the patch!

Thanks for fixing it up and commiting.

Would you consider the remaining two hunks, attached ?

Justin
>From fcd6279d0681093e8741cc5a05879afc95751c40 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Tue, 2 Apr 2019 19:13:55 -0500
Subject: [PATCH v3] Cleanup/remove/update references to OID column...

..in wake of 578b229718e8f.

See also
93507e67c9ca54026019ebec3026de35d30370f9
1464755fc490a9911214817fe83077a3689250ab

Author: Justin Pryzby 
Author: Daniel Verite 
---
 doc/src/sgml/information_schema.sgml | 4 ++--
 doc/src/sgml/ref/insert.sgml | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/doc/src/sgml/information_schema.sgml b/doc/src/sgml/information_schema.sgml
index 234a3bb..9c618b1 100644
--- a/doc/src/sgml/information_schema.sgml
+++ b/doc/src/sgml/information_schema.sgml
@@ -1312,8 +1312,8 @@
   
The view columns contains information about all
table columns (or view columns) in the database.  System columns
-   (ctid, etc.) are not included.  Only those columns are
-   shown that the current user has access to (by way of being the
+   (ctid, etc.) are not included.  The only columns shown
+   are those to which the current user has access (by way of being the
owner or having some privilege).
   
 
diff --git a/doc/src/sgml/ref/insert.sgml b/doc/src/sgml/ref/insert.sgml
index 189ce2a..f995a76 100644
--- a/doc/src/sgml/ref/insert.sgml
+++ b/doc/src/sgml/ref/insert.sgml
@@ -554,7 +554,7 @@ INSERT oid countcount is the number of
rows inserted or updated.  oid is always 0 (it
used to be the OID assigned to the inserted row if
-   rows was exactly one and the target table was
+   count was exactly one and the target table was
declared WITH OIDS and 0 otherwise, but creating a table
WITH OIDS is not supported anymore).
   
-- 
2.7.4



Re: pgsql: tableam: basic documentation.

2019-04-17 Thread Andres Freund
Hi,

On 2019-04-10 11:55:31 +0900, Michael Paquier wrote:
> Hi Andres,
> 
> On Thu, Apr 04, 2019 at 12:42:06AM +, Andres Freund wrote:
> > tableam: basic documentation.
> > 
> > This adds documentation about the user oriented parts of table access
> > methods (i.e. the default_table_access_method GUC and the USING clause
> > for CREATE TABLE etc), adds a basic chapter about the table access
> > method interface, and adds a note to storage.sgml that it's contents
> > don't necessarily apply for non-builtin AMs.
> > 
> > Author: Haribabu Kommi and Andres Freund
> > Discussion: 
> > https://postgr.es/m/20180703070645.wchpu5muyto5n...@alap3.anarazel.de
> 
> +  Any developer of a new table access method can refer to
> +  the existing heap implementation present in
> +  src/backend/heap/heapam_handler.c for more details of
> +  how it is implemented.
> 
> This path is incorrect, it should be that instead (missing "access"):
> src/backend/access/heap/heapam_handler.c 

Thanks, fix pushed.

Greetings,

Andres Freund




Re: Cleanup/remove/update references to OID column

2019-04-17 Thread Andres Freund
Hi,

On 2019-04-15 18:35:12 +0200, Daniel Verite wrote:
>   Andres Freund wrote:
> 
> > Yes, I was planning to commit that soon-ish. There still seemed
> > review / newer versions happening, though, so I was thinking of waiting
> > for a bit longer.
> 
> You might want to apply this trivial one in the same batch:
> 
> index 452f307..7cfb67f 100644
> --- a/src/bin/pg_dump/pg_dump.c
> +++ b/src/bin/pg_dump/pg_dump.c
> @@ -428,7 +428,7 @@ main(int argc, char **argv)
>  
>   InitDumpOptions();
>  
> - while ((c = getopt_long(argc, argv,
> "abBcCd:E:f:F:h:j:n:N:oOp:RsS:t:T:U:vwWxZ:",
> + while ((c = getopt_long(argc, argv,
> "abBcCd:E:f:F:h:j:n:N:Op:RsS:t:T:U:vwWxZ:",
> 			long_options, &optindex)) != -1)
>   {
>   switch (c)
> 
> "o" in the options list is a leftover. Leaving it in getopt_long() has the 
> effect that pg_dump -o fails (per the default case in the switch),
> but it's missing the expected error message (pg_dump: invalid option -- 'o')

Thanks for finding! Pushed.

Greetings,

Andres Freund




Re: Cleanup/remove/update references to OID column

2019-04-17 Thread Andres Freund
Hi,

On 2019-04-10 11:59:18 -0500, Justin Pryzby wrote:
> @@ -1202,8 +1202,7 @@ CREATE TABLE circles (
>ctid will change if it is
>updated or moved by VACUUM FULL.  Therefore
>ctid is useless as a long-term row
> -  identifier.  The OID, or even better a user-defined serial
> -  number, should be used to identify logical rows.
> +  identifier.  A primary key should be used to identify logical rows.
>   
>  
> 

That works for me.


> @@ -3672,11 +3671,10 @@ VALUES ('Albany', NULL, NULL, 'NY');
>
> Partitions cannot have columns that are not present in the parent.  It
> is not possible to specify columns when creating partitions with
> -   CREATE TABLE, nor is it possible to add columns to
> -   partitions after-the-fact using ALTER TABLE.  
> Tables may be
> -   added as a partition with ALTER TABLE ... ATTACH 
> PARTITION
> -   only if their columns exactly match the parent, including any
> -   oid column.
> +   CREATE TABLE, to add columns to
> +   partitions after-the-fact using ALTER TABLE, nor to
> +   add a partition with ALTER TABLE ... ATTACH 
> PARTITION
> +   if its columns would not exactly match those of the parent.
>
>   

This seems like a bigger change than necessary. I just chopped off the
"including any oid column".



> diff --git a/doc/src/sgml/ref/create_trigger.sgml 
> b/doc/src/sgml/ref/create_trigger.sgml
> index 6456105..3339a4b 100644
> --- a/doc/src/sgml/ref/create_trigger.sgml
> +++ b/doc/src/sgml/ref/create_trigger.sgml
> @@ -465,7 +465,7 @@ UPDATE OF column_name1 [, 
> column_name2 that the NEW row seen by the condition is the current 
> value,
> as possibly modified by earlier triggers.  Also, a 
> BEFORE
> trigger's WHEN condition is not allowed to examine the
> -   system columns of the NEW row (such as 
> oid),
> +   system columns of the NEW row (such as 
> ctid),
> because those won't have been set yet.
>

Hm. Not because of your change, but this sentence seems wrong. For one,
"is not allowed" isn't really true - one can very well write a trigger
doing so. The returned values just are bogus.

CREATE OR REPLACE FUNCTION scream_sysattrs() RETURNS TRIGGER LANGUAGE
plpgsql AS $$
BEGIN
RAISE NOTICE 'inserted row: self: % xmin: % cmin: %, xmax: %, cmax: % 
tableoid: %', NEW.ctid, NEW.xmin, NEW.cmin, NEW.xmax, NEW.cmax, NEW.tableoid;
RETURN NEW;
END;$$;
DROP TABLE IF EXISTS foo; CREATE TABLE foo(i int);CREATE TRIGGER foo BEFORE 
INSERT ON foo FOR EACH ROW EXECUTE FUNCTION scream_sysattrs();
postgres[5532][1]=# INSERT INTO foo values(1);
NOTICE:  0: inserted row: self: (0,0) xmin: 112 cmin: 2249, xmax: 
4294967295, cmax: 2249 tableoid: 0
LOCATION:  exec_stmt_raise, pl_exec.c:3778
INSERT 0 1

We probably should do better...



> diff --git a/doc/src/sgml/ref/insert.sgml b/doc/src/sgml/ref/insert.sgml
> index 62e142f..3e1be4c 100644
> --- a/doc/src/sgml/ref/insert.sgml
> +++ b/doc/src/sgml/ref/insert.sgml
> @@ -552,13 +552,11 @@ INSERT INTO  class="parameter">table_name [ AS   INSERT oid  class="parameter">count
>  
> The count is the
> -   number of rows inserted or updated.  If  -   class="parameter">count is exactly one, and the
> -   target table has OIDs, then  -   class="parameter">oid is the OID
> -   assigned to the inserted row.  The single row must have been
> -   inserted rather than updated.  Otherwise  -   class="parameter">oid is zero.
> +   number of rows inserted or updated.
> +   oid used to be the object ID of the inserted 
> row
> +   if rows was 1 and the target table had OIDs, 
> but
> +   OIDs system columns are not supported anymore; therefore
> +   oid is always 0.
>

I rephrased this a bit. Felt like the important bit came after
historical information:
+   The count is the number of
+   rows inserted or updated.  oid is always 0 (it
+   used to be the OID assigned to the inserted row if
+   rows was exactly one and the target table was
+   declared WITH OIDS and 0 otherwise, but creating a table
+   WITH OIDS is not supported anymore).


>
> diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml
> index 08f4bab..0e6e792 100644
> --- a/doc/src/sgml/ref/psql-ref.sgml
> +++ b/doc/src/sgml/ref/psql-ref.sgml
> @@ -3794,6 +3794,9 @@ bar
>  command. This variable is only guaranteed to be valid until
>  after the result of the next SQL command has
>  been displayed.
> +PostgreSQL servers since version 12 do not
> +support OID system columns in user tables, and LASTOID will always 
> be 0
> +following INSERT.
>  
>  
>

It's not just user tables but system tables as well (it's just an ordinary
table now).  I also thought it might be good to clarify that LASTOID still
works for older servers.

+PostgreSQL servers since version 12 do not
+support OID system columns anymore, thus LASTOID will always be 0

Re: Race conditions with checkpointer and shutdown

2019-04-17 Thread Thomas Munro
On Wed, Apr 17, 2019 at 10:45 AM Tom Lane  wrote:
> I think what we need to look for is reasons why (1) the postmaster
> never sends SIGUSR2 to the checkpointer, or (2) the checkpointer's
> main loop doesn't get to noticing shutdown_requested.
>
> A rather scary point for (2) is that said main loop seems to be
> assuming that MyLatch a/k/a MyProc->procLatch is not used for any
> other purposes in the checkpointer process.  If there were something,
> like say a condition variable wait, that would reset MyLatch at any
> time during a checkpoint, then we could very easily go to sleep at the
> bottom of the loop and not notice that there's a pending shutdown request.

Agreed on the non-composability of that coding, but if there actually
is anything in that loop that can reach ResetLatch(), it's well
hidden...

-- 
Thomas Munro
https://enterprisedb.com




Re: itemptr_encode/itemptr_decode

2019-04-17 Thread Tom Lane
Andres Freund  writes:
> On 2019-04-17 18:57:00 -0400, Tom Lane wrote:
>> What on God's green earth are these functions doing in
>> src/include/catalog/index.h?

> I'm happy to move them elsewhere, but I'm not sure there's really a good
> location. I guess we could move them to itemptr.h - but they're not
> really something particularly generally usable.

I don't have a better idea than that either, but I sure feel that they
don't belong in index.h.  Is it worth inventing a whole new header
for these?  If we stick 'em in itemptr.h, they'll be getting compiled
by a whole lot of files :-(

As for the general usability argument, I'm not sure --- as we start
to look at alternate AMs, we might have more use for them.  When I first
saw the functions, I thought maybe they were part of sort acceleration
for TIDs; evidently they're not (yet), but that seems like another
possible use-case.

regards, tom lane




Re: itemptr_encode/itemptr_decode

2019-04-17 Thread Andres Freund
Hi,

On 2019-04-17 18:57:00 -0400, Tom Lane wrote:
> What on God's green earth are these functions doing in
> src/include/catalog/index.h?

> They don't have any obvious connection to indexes, let alone
> catalog operations on indexes, which is what that file is for.

Well, they were previously declared & defined in
src/backend/catalog/index.c - that's where the location is coming from
(and where they still are defined). And they're currently only used to
implement the index validation scans, which requires the validation scan
to decode item pointers stored in the tuplesort presented to it.

I'm happy to move them elsewhere, but I'm not sure there's really a good
location. I guess we could move them to itemptr.h - but they're not
really something particularly generally usable.

Greetings,

Andres Freund




itemptr_encode/itemptr_decode

2019-04-17 Thread Tom Lane
What on God's green earth are these functions doing in
src/include/catalog/index.h?

They don't have any obvious connection to indexes, let alone
catalog operations on indexes, which is what that file is for.

They weren't there before 2a96909a4, either.

regards, tom lane




Re: ToDo: show size of partitioned table

2019-04-17 Thread Alvaro Herrera
On 2019-Apr-17, Justin Pryzby wrote:

> A reminder about these.
> 
> Also, I suggest renaming "On Table" to "Table", for consistency with \di.

Thanks!  Pushed all three.

Please feel free to add things like these to wiki.postgresql.org/wiki/Open_Items
to serve as reminders for the project as a whole.

-- 
Álvaro Herrera             https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: block-level incremental backup

2019-04-17 Thread Stephen Frost
Greetings,

* Robert Haas (robertmh...@gmail.com) wrote:
> On Mon, Apr 15, 2019 at 9:01 AM Stephen Frost  wrote:
> > I love the general idea of having additional facilities in core to
> > support block-level incremental backups.  I've long been unhappy that
> > any such approach ends up being limited to a subset of the files which
> > need to be included in the backup, meaning the rest of the files have to
> > be backed up in their entirety.  I don't think we have to solve for that
> > as part of this, but I'd like to see a discussion for how to deal with
> > the other files which are being backed up to avoid needing to just
> > wholesale copy them.
> 
> Ideas?  Generally, I don't think that anything other than the main
> forks of relations are worth worrying about, because the files are too
> small to really matter.  Even if they're big, the main forks of
> relations will be much bigger.  I think.

Sadly, I haven't got any great ideas today.  I do know that the WAL-G
folks have specifically mentioned issues with the visibility map being
large enough across enough of their systems that it kinda sucks to deal
with.  Perhaps we could do something like the rsync binary-diff protocol
for non-relation files?  This is clearly just hand-waving but maybe
there's something reasonable in that idea.

> > I'm quite concerned that trying to graft this on to pg_basebackup
> > (which, as you note later, is missing an awful lot of what users expect
> > from a real backup solution already- retention handling, parallel
> > capabilities, WAL archive management, and many more... but also is just
> > not nearly as developed a tool as the external solutions) is going to
> > make things unnecessairly difficult when what we really want here is
> > better support from core for block-level incremental backup for the
> > existing external tools to leverage.
> >
> > Perhaps there's something here which can be done with pg_basebackup to
> > have it work with the block-level approach, but I certainly don't see
> > it as a natural next step for it and really does seem like limiting the
> > way this is implemented to something that pg_basebackup can easily
> > digest might make it less useful for the more developed tools.
> 
> I agree that there are a bunch of things that pg_basebackup does not
> do, such as backup management.  I think a lot of users do not want
> PostgreSQL to do backup management for them.  They have an existing
> solution that they use to manage backups, and they want PostgreSQL to
> interoperate with it. I think it makes sense for pg_basebackup to be
> in charge of taking the backup, and then other tools can either use it
> as a building block or use the streaming replication protocol to send
> approximately the same commands to the server.  

There's something like 6 different backup tools, at least, for
PostgreSQL that provide backup management, so I have a really hard time
agreeing with this idea that users don't want a PG backup management
system.  Maybe that's not what you're suggesting here, but that's what
came across to me.

Yes, there are some users who have an existing backup solution and
they'd like a better way to integrate PostgreSQL into that solution,
but that's usually something like filesystem snapshots or an enterprise
backup tool which has a PostgreSQL agent or similar to do the start/stop
and collect up the WAL, not something that's just calling pg_basebackup.

Those are typically not things we have any visibility into though and
aren't open source either (and, at least as often as not, they don't
seem to be very well thought through, based on my experience with those
tools...).

Unless maybe I'm misunderstanding and what you're suggesting here is
that the "existing solution" is something like the external PG-specific
backup tools?  But then the rest doesn't seem to make sense, as only
maybe one or two of those tools use pg_basebackup internally.

> I certainly would not
> want to expose server capabilities that let you take an incremental
> backup and NOT teach pg_basebackup to use them -- then we'd be in a
> situation of saying that PostgreSQL has incremental backup, but you
> have to get external tool XYZ to use it.  That will be perceived as
> PostgreSQL does NOT have incremental backup and this external tool
> adds it.

... but this is exactly the situation we're in already with all of the
*other* features around backup (parallel backup, backup management, WAL
management, etc).  Users want those features, pg_basebackup/PG core
doesn't provide it, and therefore there's a bunch of other tools which
have been written that do.  In addition, saying that PG has incremental
backup but no built-in management of those full-vs-incremental backups
and telling users that they basically have to build that themselves
really feels a lot like we're trying to address a check-box requirement
rather than making something that our users are going to be happy with.

> > As an example, I believe all of the other 

Re: pg_dump is broken for partition tablespaces

2019-04-17 Thread Alvaro Herrera
On 2019-Apr-17, Tom Lane wrote:

> Alvaro Herrera  writes:
> > 1. pg_dump now uses regular CREATE TABLE followed by ALTER TABLE / ATTACH
> >PARTITION when creating partitions, rather than CREATE TABLE
> >PARTITION OF.  pg_dump --binary-upgrade was already doing that, so
> >this part mostly removes some code.  In order to make the partitions
> >reach the correct tablespace, the "default_tablespace" GUC is used.
> >No TABLESPACE clause is added to the dump.  This is David's patch
> >near the start of the thread.
> 
> This idea seems reasonable independently of all else, simply on the grounds
> of reducing code duplication.  It also has the advantage that if you try
> to do a selective restore of just a partition, and the parent partitioned
> table isn't around, you can still do it (with an ignorable error).

I'll get this part pushed, then.

> > 2. When creating a partition using the CREATE TABLE PARTITION OF syntax,
> >the TABLESPACE clause has highest precedence; if that is not given,
> >the partitioned table's tablespace is used; if that is set to 0 (the
> >default), default_tablespace is used; if that's set to empty or a
> >nonexistant tablespace, the database's default tablespace is used.
> >This is (I think) what Andres proposed in
> >https://postgr.es/m/20190306223741.lolaaimhkkp4k...@alap3.anarazel.de
> 
> Hmm.  The precedence order between the second and third options seems
> pretty arbitrary and hence unrememberable.  I don't say this choice is
> wrong, but it's not clear that it's right, either.

Well, I see it as the default_tablespace being a global setting whereas
the parent is "closer" to the partition definition, which is why it has
higher priority.  I don't have a strong opinion however (and I think the
patch would be shorter if default_tablespace had higher precedence.)

Maybe others care to comment?
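
To spell out the intended precedence with a sketch (hypothetical
tablespace names and paths):

CREATE TABLESPACE ts_parent LOCATION '/some/path/a';
CREATE TABLESPACE ts_other  LOCATION '/some/path/b';
CREATE TABLE parted (a int) PARTITION BY LIST (a) TABLESPACE ts_parent;
SET default_tablespace = ts_other;
-- No TABLESPACE clause: the parent's tablespace (ts_parent) wins over
-- default_tablespace under option 2 above.
CREATE TABLE part1 PARTITION OF parted FOR VALUES IN (1);
-- An explicit TABLESPACE clause always has highest precedence.
CREATE TABLE part2 PARTITION OF parted FOR VALUES IN (2) TABLESPACE ts_other;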

> > 3. Partitioned relations can have the database tablespace in
> >pg_class.reltablespace, as opposed to storage-bearing relations which
> >cannot.  This is useful to be able to put partitions in the database
> >tablespace even if the default_tablespace is set to something else.
> 
> I still feel that this is a darn bad idea.  It goes against the rule
> that's existed for pg_class.reltablespace since its beginning, and
> I think it's inevitable that that's going to break something.

Yes, this deviates from current practice, and while I tested this in as
many ways as I could think of, I cannot deny that it might break
something unexpectedly.


Here's a v4 btw, which is just some adjustments to the regress test
script and expected file.

-- 
Álvaro Herrera             https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>From c9217e46ff424d68b9b2fdbba009c949665045ef Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Fri, 12 Apr 2019 16:58:26 -0400
Subject: [PATCH v4 1/3] psql: display tablespace for partitioned indexes

Nothing was shown previously.
---
 src/bin/psql/describe.c   | 3 ++-
 src/test/regress/input/tablespace.source  | 1 +
 src/test/regress/output/tablespace.source | 8 
 3 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index 3a04b0673a3..8a269016333 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -3306,7 +3306,8 @@ add_tablespace_footer(printTableContent *const cont, char relkind,
 	if (relkind == RELKIND_RELATION ||
 		relkind == RELKIND_MATVIEW ||
 		relkind == RELKIND_INDEX ||
-		relkind == RELKIND_PARTITIONED_TABLE)
+		relkind == RELKIND_PARTITIONED_TABLE ||
+		relkind == RELKIND_PARTITIONED_INDEX)
 	{
 		/*
 		 * We ignore the database default tablespace so that users not using
diff --git a/src/test/regress/input/tablespace.source b/src/test/regress/input/tablespace.source
index 47ae73af95f..14ce0e7e04f 100644
--- a/src/test/regress/input/tablespace.source
+++ b/src/test/regress/input/tablespace.source
@@ -63,6 +63,7 @@ CREATE INDEX part_a_idx ON testschema.part (a) TABLESPACE regress_tblspace;
 CREATE TABLE testschema.part2 PARTITION OF testschema.part FOR VALUES IN (2);
 SELECT relname, spcname FROM pg_catalog.pg_tablespace t, pg_catalog.pg_class c
 where c.reltablespace = t.oid AND c.relname LIKE 'part%_idx';
+\d testschema.part_a_idx
 
 -- check that default_tablespace doesn't affect ALTER TABLE index rebuilds
 CREATE TABLE testschema.test_default_tab(id bigint) TABLESPACE regress_tblspace;
diff --git a/src/test/regress/output/tablespace.source b/src/test/regress/output/tablespace.source
index 29d6d2232be..8ebe08b9b28 100644
--- a/src/test/regress/output/tablespace.source
+++ b/src/test/regress/output/tablespace.source
@@ -94,6 +94,14 @@ SELECT relname, spcname FROM pg_catalog.pg_tablespace t, pg_catalog.pg_class c
  part_a_idx  | regress_tblspace
 (3 rows)
 
+\d testschema.part_a_idx
+Partitioned index "testschema.part_a_idx"
+ Column |  Type   | 

Re: pgsql: Fix unportable code in pgbench.

2019-04-17 Thread Tom Lane
Fabien COELHO  writes:
>> Fix unportable code in pgbench.

> Sorry for this unforeseen issue... portability is a pain:-(

I think it's my fault, actually --- I don't remember how much of
that patch was yours.

> Yep, but ISTM that it is down to 32 bits,

Only on 32-bit-long machines, which are a dwindling minority (except
for Windows, which I don't really care about).

> So the third short is now always 0. Hmmm. I'll propose another option over 
> the week-end.

I suppose we could put pg_strtouint64 somewhere where pgbench can use it,
but TBH I don't think it's worth the trouble.  The set of people using
the --random-seed=int option at all is darn near empty, I suspect,
and the documentation only says you can write an int there.

regards, tom lane




Re: pg_dump is broken for partition tablespaces

2019-04-17 Thread Tom Lane
Alvaro Herrera  writes:
> 1. pg_dump now uses regular CREATE TABLE followed by ALTER TABLE / ATTACH
>PARTITION when creating partitions, rather than CREATE TABLE
>PARTITION OF.  pg_dump --binary-upgrade was already doing that, so
>this part mostly removes some code.  In order to make the partitions
>reach the correct tablespace, the "default_tablespace" GUC is used.
>No TABLESPACE clause is added to the dump.  This is David's patch
>near the start of the thread.

This idea seems reasonable independently of all else, simply on the grounds
of reducing code duplication.  It also has the advantage that if you try
to do a selective restore of just a partition, and the parent partitioned
table isn't around, you can still do it (with an ignorable error).

> 2. When creating a partition using the CREATE TABLE PARTITION OF syntax,
>the TABLESPACE clause has highest precedence; if that is not given,
>the partitioned table's tablespace is used; if that is set to 0 (the
>default), default_tablespace is used; if that's set to empty or a
>nonexistent tablespace, the database's default tablespace is used.
>This is (I think) what Andres proposed in
>https://postgr.es/m/20190306223741.lolaaimhkkp4k...@alap3.anarazel.de

Hmm.  The precedence order between the second and third options seems
pretty arbitrary and hence unrememberable.  I don't say this choice is
wrong, but it's not clear that it's right, either.

> 3. Partitioned relations can have the database tablespace in
>pg_class.reltablespace, as opposed to storage-bearing relations which
>cannot.  This is useful to be able to put partitions in the database
>tablespace even if the default_tablespace is set to something else.

I still feel that this is a darn bad idea.  It goes against the rule
that's existed for pg_class.reltablespace since its beginning, and
I think it's inevitable that that's going to break something.

regards, tom lane




Re: pgsql: Fix unportable code in pgbench.

2019-04-17 Thread Fabien COELHO



Hello Tom,


Fix unportable code in pgbench.


Sorry for this unforeseen issue... portability is a pain:-(


The buildfarm points out that UINT64_FORMAT might not work with sscanf;
it's calibrated for our printf implementation, which might not agree
with the platform-supplied sscanf.  Fall back to just accepting an
unsigned long, which is already more than the documentation promises.


Yep, but ISTM that it is down to 32 bits, whereas the PRNG seed expects 48 
bits a few lines below:


base_random_sequence.xseed[0] = iseed & 0xFFFF;
base_random_sequence.xseed[1] = (iseed >> 16) & 0xFFFF;
base_random_sequence.xseed[2] = (iseed >> 32) & 0xFFFF;

So the third short is now always 0. Hmmm. I'll propose another option over 
the week-end.


--
Fabien.




Re: pg_dump is broken for partition tablespaces

2019-04-17 Thread Alvaro Herrera
On 2019-Apr-17, David Rowley wrote:

> On Mon, 15 Apr 2019 at 15:26, Alvaro Herrera  wrote:
> >
> > On 2019-Apr-15, David Rowley wrote:
> >
> > > To be honest, if I'd done a better job of thinking through the
> > > implications of this tablespace inheritance in ca4103025d, then I'd
> > > probably have not bothered submitting a patch for it.  We could easily
> > > revert that, but we'd still be left with the same behaviour in
> > > partitioned indexes, which is in PG11.
> >
> > Well, I suppose if we do decide to revert it for tables, we should do it
> > for both tables and indexes.  But as I said, I'm not yet convinced that
> > this is the best way forward.
> 
> Ok.  Any ideas or suggestions on how we move on from here?  It seems
> like a bit of a stalemate.

Well, here's my proposed patch.  I'm now fairly happy with how this
looks, concerning partitioned tables.

This is mostly what was already discussed:

1. pg_dump now uses regular CREATE TABLE followed by ALTER TABLE / ATTACH
   PARTITION when creating partitions, rather than CREATE TABLE
   PARTITION OF.  pg_dump --binary-upgrade was already doing that, so
   this part mostly removes some code.  In order to make the partitions
   reach the correct tablespace, the "default_tablespace" GUC is used.
   No TABLESPACE clause is added to the dump.  This is David's patch
   near the start of the thread.

2. When creating a partition using the CREATE TABLE PARTITION OF syntax,
   the TABLESPACE clause has highest precedence; if that is not given,
   the partitioned table's tablespace is used; if that is set to 0 (the
   default), default_tablespace is used; if that's set to empty or a
   nonexistent tablespace, the database's default tablespace is used.
   This is (I think) what Andres proposed in
   https://postgr.es/m/20190306223741.lolaaimhkkp4k...@alap3.anarazel.de
   (see the SQL sketch after this list for an illustration).

3. Partitioned relations can have the database tablespace in
   pg_class.reltablespace, as opposed to storage-bearing relations which
   cannot.  This is useful to be able to put partitions in the database
   tablespace even if the default_tablespace is set to something else.

4. For partitioned tables, ALTER TABLE .. SET TABLESPACE DEFAULT is
   available as suggested by David, which makes future partition
   creations target default_tablespace or the database's tablespace.  

5. Recreating indexes during table-rewriting ALTER TABLE resulted in
   broken indexes.  We already had some adhesive tape in place to make
   that work for regular indexes (commit bd673e8e864a); my approach to
   fix it for partitioned indexes is to temporarily reset
   default_tablespace to empty.
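
To make rule 2 concrete, here is a hedged SQL sketch of the intended
precedence under the proposed patch (all table and tablespace names are
made up for illustration):

    CREATE TABLE meas (logdate date) PARTITION BY RANGE (logdate)
        TABLESPACE ts_parent;

    -- An explicit TABLESPACE clause wins over everything else:
    CREATE TABLE meas_y2019 PARTITION OF meas
        FOR VALUES FROM ('2019-01-01') TO ('2020-01-01')
        TABLESPACE ts_explicit;

    -- No TABLESPACE clause: the parent's tablespace (ts_parent) is used,
    -- even though default_tablespace points elsewhere:
    SET default_tablespace = ts_other;
    CREATE TABLE meas_y2020 PARTITION OF meas
        FOR VALUES FROM ('2020-01-01') TO ('2021-01-01');

    -- If the parent's reltablespace were 0, default_tablespace would apply;
    -- if that were empty or nonexistent, the database default would be used.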


As for Tom's question in https://postgr.es/m/12678.1555252...@sss.pgh.pa.us :

> It's possible that Alvaro's patch is also broken, but I haven't had time
> to review it.  The immediate question is what happens when somebody makes
> a partitioned table in template1 and then does CREATE DATABASE with a
> tablespace option.  Does the partitioned table end up in the same
> tablespace as ordinary tables do?

Note that partitioned tables don't have any files, so they don't end up
anywhere; when a partition is created, the target tablespace is
determined using four rules instead of three (see #2 above), so yes, they
do end up in the same places as ordinary tables.  Note that even if you
do put a partitioned table in some tablespace, you will not later run
afoul of this check:
ereport(ERROR,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 errmsg("cannot assign new default tablespace \"%s\"",
tablespacename),
 errdetail("There is a conflict because database \"%s\" 
already has some tables in this tablespace.",
   dbtemplate)));
src/backend/commands/dbcommands.c:435

because that check uses ReadDir() and raises an error if any entry is
found; but partitioned tables don't have files in the directory, so nothing
happens.  (Of course, it will hit if you have a partition in that
tablespace, but that's also the case for regular tables.)

(I propose to commit both 0002 and 0003 as a single unit that fixes the
whole problem, rather than attacking backend and pg_dump separately.
0001 appears logically separate and I would push on its own.)

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>From c9217e46ff424d68b9b2fdbba009c949665045ef Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Fri, 12 Apr 2019 16:58:26 -0400
Subject: [PATCH v3 1/3] psql: display tablespace for partitioned indexes

Nothing was shown previously.
---
 src/bin/psql/describe.c   | 3 ++-
 src/test/regress/input/tablespace.source  | 1 +
 src/test/regress/output/tablespace.source | 8 
 3 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index 3a04b0673a3..8a269016333 

Re: block-level incremental backup

2019-04-17 Thread Stephen Frost
Greetings,

* Robert Haas (robertmh...@gmail.com) wrote:
> On Tue, Apr 16, 2019 at 5:44 PM Stephen Frost  wrote:
> > > > I love the general idea of having additional facilities in core to
> > > > support block-level incremental backups.  I've long been unhappy that
> > > > any such approach ends up being limited to a subset of the files which
> > > > need to be included in the backup, meaning the rest of the files have to
> > > > be backed up in their entirety.  I don't think we have to solve for that
> > > > as part of this, but I'd like to see a discussion for how to deal with
> > > > the other files which are being backed up to avoid needing to just
> > > > wholesale copy them.
> > >
> > > I assume you are talking about non-heap/index files.  Which of those are
> > > large enough to benefit from incremental backup?
> >
> > Based on discussions I had with Andrey, specifically the visibility map
> > is an issue for them with WAL-G.  I haven't spent a lot of time thinking
> > about it, but I can understand how that could be an issue.
> 
> If I understand correctly, the VM contains 1 byte per 4 heap pages and
> the FSM contains 1 byte per heap page (plus some overhead for higher
> levels of the tree).  Since the FSM is not WAL-logged, I'm not sure
> there's a whole lot we can do to avoid having to back it up, although
> maybe there's some clever idea I'm not quite seeing.  The VM is
> WAL-logged, albeit with some strange warts that I have the honor of
> inventing, so there's more possibilities there.
> 
> Before worrying about it too much, it would be useful to hear more
> about the concerns related to these forks, so that we make sure we're
> solving the right problem.  It seems difficult for a single relation
> to be big enough for these to be much of an issue.  For example, on a
> 1TB relation, we have 2^40 bytes = 2^27 pages = ~2^25 bytes of VM fork
> = 32MB.  Not nothing, but 32MB of useless overhead every time you back
> up a 1TB database probably isn't going to break the bank.  It might be
> more of a concern for users with many small tables.  For example, if
> somebody has got a million tables with 1 page in each one, they'll
> have a million data pages, a million VM pages, and 3 million FSM pages
> (unless the new don't-create-the-FSM-for-small-tables stuff in v12
> kicks in).  I don't know if it's worth going to a lot of trouble to
> optimize that case.  Creating a million tables with 100 tuples (or
> whatever) in each one sounds like terrible database design to me.

As I understand it, the problem is not with backing up an individual
database or cluster, but rather dealing with backing up thousands of
individual clusters with thousands of tables in each, leading to an
awful lot of tables with lots of FSMs/VMs, all of which end up having to
get copied and stored wholesale.  I'll point this thread out to him and
hopefully he'll have a chance to share more specific information.

> > > > I'm quite concerned that trying to graft this on to pg_basebackup
> > > > (which, as you note later, is missing an awful lot of what users expect
> > > > from a real backup solution already- retention handling, parallel
> > > > capabilities, WAL archive management, and many more... but also is just
> > > > not nearly as developed a tool as the external solutions) is going to
> > > > make things unnecessairly difficult when what we really want here is
> > > > better support from core for block-level incremental backup for the
> > > > existing external tools to leverage.
> > >
> > > I think there is some interesting complexity brought up in this thread.
> > > Which options are going to minimize storage I/O, network I/O, have only
> > > background overhead, allow parallel operation, integrate with
> > > pg_basebackup.  Eventually we will need to evaluate the incremental
> > > backup options against these criteria.
> >
> > This presumes that we're going to have multiple competeing incremental
> > backup options presented, doesn't it?  Are you aware of another effort
> > going on which aims for inclusion in core?  There's been past attempts
> > made, but I don't believe there's anyone else currently planning to or
> > working on something for inclusion in core.
> 
> Yeah, I really hope we don't end up with dueling patches.  I want to
> come up with an approach that can be widely-endorsed and then have
> everybody rowing in the same direction.  On the other hand, I do think
> that we may support multiple options in certain places which may have
> the kinds of trade-offs that Bruce mentions.  For instance,
> identifying changed blocks by scanning the whole cluster and checking
> the LSN of each block has an advantage in that it requires no prior
> setup or extra configuration.  Like a sequential scan, it always
> works, and that is an advantage.  Of course, for many people, the
> competing advantage of a WAL-scanning approach that can save a lot of
> I/O will appear compelling, but maybe not for everyone.  I think
> there's 

Re: Status of the table access method work

2019-04-17 Thread Dmitry Dolgov
> On Wed, Apr 17, 2019 at 10:25 PM Andres Freund  wrote:
>
> I assume you're aware, but it's probably not going to be applied for 12...

Sure, the patch was mostly to express more clearly what I was thinking about :)

> I think most of the read-only stuff just needs to be non-optional, and most
> of the DML stuff needs to be optional.

> On the executor side it'd probably be good to make the sample scan optional
> too, but then we also need to check for that during parse-analysis. In
> contrast to bitmap scans there's no alternative way to execute them.

Yeah, makes sense.

> bulk insert already is optional...

Oh, haven't noticed.




Re: Status of the table access method work

2019-04-17 Thread Andres Freund
Hi!

On 2019-04-17 22:02:24 +0200, Dmitry Dolgov wrote:
> > On Fri, Apr 5, 2019 at 10:25 PM Andres Freund  wrote:
> >
> > A second set of limitations is around making more of tableam
> > optional. Right now it e.g. is not possible to have an AM that doesn't
> > implement insert/update/delete. Obviously an AM can just throw an error
> > in the relevant callbacks, but I think it'd be better if we made those
> > callbacks optional, and threw errors at parse-analysis time (both to
> > make the errors consistent, and to ensure it's consistently thrown,
> > rather than only when e.g. an UPDATE actually finds a row to update).
> 
> Agree, but I guess some of tableam still should be mandatory, and then I 
> wonder
> where to draw the line between those that are optional and those that are not.
> E.g. looks like it can be relatively straightforward (ignoring `create table 
> as`
> and some other stuff) to make insert/update/delete optional with messages at
> analysis time, but for others like parallel scan related it's probably not.

Thanks for the patch! I assume you're aware, but it's probably not going
to be applied for 12...

I think most of the read-only stuff just needs to be non-optional, and
most of the DML stuff needs to be optional.

On the executor side it'd probably be good to make the sample scan
optional too, but then we also need to check for that during
parse-analysis. In contast to bitmap scans there's no alternative way to
execute them.


>   Assert(routine->relation_set_new_filenode != NULL);
> diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
> index c39218f8db..36e2dbf1b8 100644
> --- a/src/backend/commands/copy.c
> +++ b/src/backend/commands/copy.c
> @@ -41,6 +41,7 @@
>  #include "miscadmin.h"
>  #include "optimizer/optimizer.h"
>  #include "nodes/makefuncs.h"
> +#include "nodes/nodeFuncs.h"
>  #include "parser/parse_coerce.h"
>  #include "parser/parse_collate.h"
>  #include "parser/parse_expr.h"
> @@ -901,6 +902,13 @@ DoCopy(ParseState *pstate, const CopyStmt *stmt,
>   
> NULL, false, false);
>   rte->requiredPerms = (is_from ? ACL_INSERT : ACL_SELECT);
>  
> + if (is_from && !table_support_multi_insert(rel))
> + ereport(ERROR,
> + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> +  errmsg("Table access method doesn't support the operation"),
> +  parser_errposition(pstate,
> + exprLocation((Node *) stmt))));

Probably should fall-back to plain inserts if multi-insert isn't
supported.

And if insert isn't supported either, we should probably talk about
that specifically? I.e. a message like
"access method \"%s\" of table \"%s\" does not support %s"
?

Without knowing at least that much about the operation, it might sometimes
be very hard to figure out what's not supported.
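
For illustration, that more specific error could be raised roughly like
this (table_support_insert() is a hypothetical helper along the lines of
the ones in the patch; get_am_name() comes from commands/defrem.h):

    if (is_from && !table_support_insert(rel))
        ereport(ERROR,
                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
                 errmsg("access method \"%s\" of table \"%s\" does not support %s",
                        get_am_name(rel->rd_rel->relam),
                        RelationGetRelationName(rel),
                        "COPY FROM"),
                 parser_errposition(pstate, exprLocation((Node *) stmt))));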



> +static inline bool
> +table_support_speculative(Relation rel)
> +{
> + return rel->rd_tableam == NULL ||
> +(rel->rd_tableam->tuple_insert_speculative != NULL &&
> + rel->rd_tableam->tuple_complete_speculative != NULL);
> +}

In GetTableAmRoutine() I'd assert that either both or none are defined.
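
For what it's worth, that assertion might look something like this (just a
sketch against the callback names in the patch, not committed code):

    /* Speculative-insert callbacks must be provided together or not at all. */
    Assert((routine->tuple_insert_speculative == NULL) ==
           (routine->tuple_complete_speculative == NULL));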


> +static inline bool
> +table_support_multi_insert(Relation rel)
> +{
> + return rel->rd_tableam == NULL ||
> +(rel->rd_tableam->multi_insert != NULL &&
> + rel->rd_tableam->finish_bulk_insert != NULL);
> +}

bulk insert already is optional...


I think there's more places that need checks like these - consider
e.g. replication and such that don't go through the full blown executor.

Greetings,

Andres Freund




Re: jsonpath

2019-04-17 Thread Alexander Korotkov
On Wed, Apr 17, 2019 at 8:43 PM Tom Lane  wrote:
> John Naylor  writes:
> > Attached is a patch to fix some minor issues:
> > -misspelling of an error message
>
> Yeah, I'd noticed that one too :-(.  I think the whole jsonpath patch
> needs a sweep to bring its error messages into line with our style
> guidelines, but no harm in starting with the obvious bugs.

I'll go trough the jsonpath error messages and post a patch for fixing them.

> > -Commit 550b9d26f80f failed to update the Makefile comment to reflect
> > how the build changed, and also removed the clean target, which we now
> > have use for since we later got rid of backtracking in the scanner.
>
> Right.  I'm not really sure why we're bothering with anti-backtracking
> here, or with using speed-rather-than-code-space lexer optimization
> options.  It's hard for me to credit that any practically-useful jsonpath
> pattern would be long enough for lexer speed to matter, and even harder to
> credit that the speed of the flex code itself would be an important factor
> in the overall processing cost of a long jsonpath.  Still, as long as we
> have the code it needs to be right.

Actually, I found that none of the in-core lexers are backtracking.  So I
understood no backtracking to be kind of a standard and didn't want to
break that :)

Nevertheless, I could imagine a use case involving parsing a lot of
jsonpaths.  For example, we may construct a jsonpath based on table
data and check it against just a few jsonb values.  For sure, that
wouldn't be a common use case, but still.

> > Also, while I have the thought in my head, for v13 we should consider
> > replacing the keyword binary search with the perfect hash technique
> > added in c64d0cd5ce2 -- it might give a small performance boost to the
> > scanner.
>
> I doubt it's worth the trouble, per above.
>
> Patch LGTM, pushed.

Thank you for pushing this!

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: [patch] pg_test_timing does not prompt illegal option

2019-04-17 Thread Fabien COELHO



Hello Tom,


We've generally felt that deferring to the behavior of the platform's
getopt() or getopt_long() is a better idea than trying to enforce some
lowest-common-denominator version of switch parsing, on the theory that
users of a given platform will be used to whatever its getopt does.
This does mean that we have undocumented behaviors on particular
platforms.


Interesting.

I'd say that accepting "--" is one of them.  Another example is that 
glibc's getopt is willing to reorder the arguments, so that for example 
this works for me:


$ psql template1 -E
psql (12devel)


Yep, I noticed that one by accident once.


On other platforms that would not work, so we don't document it.


People might get surprised anyway, because the very same command may or 
may not work depending on the platform. Does not matter much, though.


--
Fabien.




Re: Status of the table access method work

2019-04-17 Thread Dmitry Dolgov
> On Fri, Apr 5, 2019 at 10:25 PM Andres Freund  wrote:
>
> A second set of limitations is around making more of tableam
> optional. Right now it e.g. is not possible to have an AM that doesn't
> implement insert/update/delete. Obviously an AM can just throw an error
> in the relevant callbacks, but I think it'd be better if we made those
> callbacks optional, and threw errors at parse-analysis time (both to
> make the errors consistent, and to ensure it's consistently thrown,
> rather than only when e.g. an UPDATE actually finds a row to update).

Agree, but I guess some of tableam still should be mandatory, and then I wonder
where to draw the line between those that are optional and those that are not.
E.g. looks like it can be relatively straightforward (ignoring `create table as`
and some other stuff) to make insert/update/delete optional with messages at
analysis time, but for others like parallel scan related it's probably not.


0001-Optional-tableam.patch
Description: Binary data


0002-Toytable-with-missing-am.patch
Description: Binary data


Re: New vacuum option to do only freezing

2019-04-17 Thread Peter Geoghegan
On Wed, Apr 17, 2019 at 10:46 AM Tom Lane  wrote:
> Yeah, if we wanted to expose these complications more directly, we
> could think about adding or changing the main counters.  I was wondering
> about whether we should have it report "x bytes and y line pointers
> freed", rather than counting tuples per se.

I like that idea, but I'm pretty sure that there are very few users
that are aware of these distinctions at all. It would be a good idea
to clearly document them.

-- 
Peter Geoghegan




Re: New vacuum option to do only freezing

2019-04-17 Thread Tom Lane
Masahiko Sawada  writes:
> On Wed, Apr 17, 2019 at 5:00 AM Tom Lane  wrote:
>>> I'm thinking that we really need to upgrade vacuum's reporting totals
>>> so that it accounts in some more-honest way for pre-existing dead
>>> line pointers.  The patch as it stands has made the reporting even more
>>> confusing, rather than less so.

> Or, how about reporting the vacuumed tuples and line pointers
> separately? It might be too detailed but since with this patch we
> delete either only tuples or only line pointers, it's more accurate.

Yeah, if we wanted to expose these complications more directly, we
could think about adding or changing the main counters.  I was wondering
about whether we should have it report "x bytes and y line pointers
freed", rather than counting tuples per se.

regards, tom lane




Re: jsonpath

2019-04-17 Thread Tom Lane
John Naylor  writes:
> Attached is a patch to fix some minor issues:
> -misspelling of an error message

Yeah, I'd noticed that one too :-(.  I think the whole jsonpath patch
needs a sweep to bring its error messages into line with our style
guidelines, but no harm in starting with the obvious bugs.

> -Commit 550b9d26f80f failed to update the Makefile comment to reflect
> how the build changed, and also removed the clean target, which we now
> have use for since we later got rid of backtracking in the scanner.

Right.  I'm not really sure why we're bothering with anti-backtracking
here, or with using speed-rather-than-code-space lexer optimization
options.  It's hard for me to credit that any practically-useful jsonpath
pattern would be long enough for lexer speed to matter, and even harder to
credit that the speed of the flex code itself would be an important factor
in the overall processing cost of a long jsonpath.  Still, as long as we
have the code it needs to be right.

> Also, while I have the thought in my head, for v13 we should consider
> replacing the keyword binary search with the perfect hash technique
> added in c64d0cd5ce2 -- it might give a small performance boost to the
> scanner.

I doubt it's worth the trouble, per above.

Patch LGTM, pushed.

regards, tom lane




Re: [patch] pg_test_timing does not prompt illegal option

2019-04-17 Thread Tom Lane
Fabien COELHO  writes:
>> I think it might be an actively bad idea.  There's a pretty
>> widespread convention that "--" is a no-op switch indicating
>> the end of switches.  At least some of our tools appear to
>> honor that behavior (probably because glibc's getopt_long
>> does; I do not think we are implementing it ourselves).

> "src/port/getopt_long.c" checks for "--" as the end of options.

Ah.  But I was checking this on a Linux build that's using glibc's
implementation, not our own.  It's pretty easy to prove that psql,
for one, acts that way when using the glibc subroutine:

$ psql -- -E
psql: error: could not connect to server: FATAL:  database "-E" does not exist


We've generally felt that deferring to the behavior of the platform's
getopt() or getopt_long() is a better idea than trying to enforce some
lowest-common-denominator version of switch parsing, on the theory that
users of a given platform will be used to whatever its getopt does.
This does mean that we have undocumented behaviors on particular
platforms.  I'd say that accepting "--" is one of them.  Another example
is that glibc's getopt is willing to reorder the arguments, so that
for example this works for me:

$ psql template1 -E
psql (12devel)
Type "help" for help.

template1=# \set
...
ECHO_HIDDEN = 'on'
...

On other platforms that would not work, so we don't document it.
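
For anyone curious about the mechanics, here is a minimal standalone sketch
(not PostgreSQL code) of how getopt_long() treats "--":

    #include <getopt.h>
    #include <stdio.h>

    int
    main(int argc, char **argv)
    {
        int         c;
        static struct option long_options[] = {
            {"echo-hidden", no_argument, NULL, 'E'},
            {NULL, 0, NULL, 0}
        };

        /* "--" ends option parsing; whatever follows is a plain argument. */
        while ((c = getopt_long(argc, argv, "E", long_options, NULL)) != -1)
            printf("option: %c\n", c);

        for (; optind < argc; optind++)
            printf("argument: %s\n", argv[optind]);

        return 0;
    }

    $ ./demo -- -E
    argument: -E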

regards, tom lane




Re: Unhappy about API changes in the no-fsm-for-small-rels patch

2019-04-17 Thread Tom Lane
Andres Freund  writes:
> On 2019-04-17 15:49:29 +0530, Amit Kapila wrote:
>> OTOH, if we want to extend it later for whatever reason to a relation
>> level cache, it shouldn't be that difficult as the implementation is
>> mostly contained in freespace.c (fsm* functions) and I think the
>> relation is accessible in most places.  We might need to rip out some
>> calls to clearlocalmap.

> But it really isn't contained to freespace.c. That's my primary
> concern. You added new parameters (undocumented ones!),
> FSMClearLocalMap() needs to be called by callers and xlog, etc.

Given where we are in the release cycle, and the major architectural
concerns that have been raised about this patch, should we just
revert it and try again in v13, rather than trying to fix it under
time pressure?  It's not like there's not anything else on our
plates to fix before beta.

regards, tom lane




Re: Unhappy about API changes in the no-fsm-for-small-rels patch

2019-04-17 Thread Andres Freund
Hi,

On 2019-04-17 15:49:29 +0530, Amit Kapila wrote:
> > *and* stash the bitmap of
> > pages that we think are used/free as a bitmap somewhere below the
> > relcache.
> 
> I think maintaining at relcache level will be tricky when there are
> insertions and deletions happening in the small relation.  We have
> considered such a case during development wherein we don't want the
> FSM to be created if there are insertions and deletions in small
> relation.  The current mechanism addresses both this and normal case
> where there are not many deletions.  Sure there is some overhead of
> building the map again and rechecking each page.  The first one is a
> memory operation and takes a few cycles

Yea, I think creating / resetting the map is basically free.

I'm not sure I buy the concurrency issue - ISTM it'd be perfectly
reasonable to cache the local map (in the relcache) and use it for local
FSM queries, and rebuild it from scratch once no space is found. That'd
avoid a lot of repeated checking of relation size for small, but
commonly changed relations.  Add a pre-check of smgr_fsm_nblocks (if >
0, there has to be an fsm), and there should be fewer syscalls.


> and for the second we optimized by checking the pages alternatively
> which means we won't check more than two pages at-a-time.  This cost
> is paid by not checking FSM and it could be somewhat better in some
> cases [1].

Well, it's also paid by potentially higher bloat, because the
intermediate pages aren't tested.


> >  If we cleared that variable at truncations, I think we should
> > be able to make that work reasonably well?
> 
> Not only that, I think it needs to be cleared whenever we create the
> FSM as well which could be tricky as it can be created by the vacuum.

ISTM normal invalidation logic should just take care of that kind of thing.


> OTOH, if we want to extend it later for whatever reason to a relation
> level cache, it shouldn't be that difficult as the implementation is
> mostly contained in freespace.c (fsm* functions) and I think the
> relation is accessible in most places.  We might need to rip out some
> calls to clearlocalmap.

But it really isn't contained to freespace.c. That's my primary
concern. You added new parameters (undocumented ones!),
FSMClearLocalMap() needs to be called by callers and xlog, etc.


Greetings,

Andres Freund




Re: [patch] pg_test_timing does not prompt illegal option

2019-04-17 Thread Fabien COELHO




This is not the problem only for pg_test_timing. If you want to
address this, the patch needs to cover all the client commands
like psql, createuser. I'm not sure if it's worth doing that.


I think it might be an actively bad idea.  There's a pretty
widespread convention that "--" is a no-op switch indicating
the end of switches.  At least some of our tools appear to
honor that behavior (probably because glibc's getopt_long
does; I do not think we are implementing it ourselves).


"src/port/getopt_long.c" checks for "--" as the end of options.

--
Fabien.




Re: [patch] pg_test_timing does not prompt illegal option

2019-04-17 Thread Bruce Momjian
On Wed, Apr 17, 2019 at 10:24:17AM -0400, Tom Lane wrote:
> Fujii Masao  writes:
> > On Wed, Apr 17, 2019 at 6:21 PM Zhang, Jie  wrote:
> >> I think "--" is an illegal option; an error should be reported.
> 
> > This is not the problem only for pg_test_timing. If you want to
> > address this, the patch needs to cover all the client commands
> > like psql, createuser. I'm not sure if it's worth doing that.
> 
> I think it might be an actively bad idea.  There's a pretty
> widespread convention that "--" is a no-op switch indicating
> the end of switches.  At least some of our tools appear to
> honor that behavior (probably because glibc's getopt_long
> does; I do not think we are implementing it ourselves).

Yep, a simple 'ls' on Debian stretch shows it is a common convention:

$ ls --
file1 file2

FYI, 'gcc --' (using Debian 6.3.0-18+deb9u1) does throw an error, so it
is inconsistent.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: Unhappy about API changes in the no-fsm-for-small-rels patch

2019-04-17 Thread Andres Freund
Hi,

On 2019-04-17 13:09:05 +0800, John Naylor wrote:
> On Wed, Apr 17, 2019 at 2:04 AM Andres Freund  wrote:
> >
> > Hi,
> >
> > I'm somewhat unhappy in how much the no-fsm-for-small-rels exposed
> > complexity that looks like it should be purely in freespacemap.c to
> > callers.
> >
> >
> >  extern Size GetRecordedFreeSpace(Relation rel, BlockNumber heapBlk);
> > -extern BlockNumber GetPageWithFreeSpace(Relation rel, Size spaceNeeded);
> > +extern BlockNumber GetPageWithFreeSpace(Relation rel, Size spaceNeeded,
> > +bool check_fsm_only);
> >
> > So now freespace.c has an argument that says we should only check the
> > fsm. That's confusing. And it's not explained to callers what that
> > argument means, and when it should be set.
> 
> When first looking for free space, it's "false": Within
> GetPageWithFreeSpace(), we call RelationGetNumberOfBlocks() if the FSM
> returns invalid.
> 
> If we have to extend, after acquiring the lock to extend the relation,
> we call GetPageWithFreeSpace() again to see if another backend already
> extended while waiting on the lock. If there's no FSM, the thinking
> is, it's not worth it to get the number of blocks again.

I can get that (after reading through the code, grepping through all
callers, etc), but it means that every callsite needs to understand
that. That's making the API more complicated than needed, especially
when we're going to grow more callers.



> > +/* Only create the FSM if the heap has greater than this many blocks */
> > +#define HEAP_FSM_CREATION_THRESHOLD 4
> >
> > Hm, this seems to be tying freespace.c closer to heap than I think is
> > great - think of new AMs like zheap, that also want to use it.
> 
> Amit and I kept zheap in mind when working on the patch. You'd have to
> work around the metapage, but everything else should work the same.

My complaint is basically that it's apparently AM specific (we don't use
the logic for e.g. indexes), and that the name suggests it's specific to
heap. And it's not controllable from the outside, which means it can't be
tuned for the specific use case.


> > I think this is mostly fallout about the prime issue I'm unhappy
> > about. There's now some global variable in freespacemap.c that code
> > using freespace.c has to know about and maintain.
> >
> >
> > +static void
> > +fsm_local_set(Relation rel, BlockNumber cur_nblocks)
> > +{
> > +   BlockNumber blkno,
> > +   cached_target_block;
> > +
> > +   /* The local map must not be set already. */
> > +   Assert(!FSM_LOCAL_MAP_EXISTS);
> > +
> > +   /*
> > +* Starting at the current last block in the relation and working
> > +* backwards, mark alternating blocks as available.
> > +*/
> > +   blkno = cur_nblocks - 1;
> >
> > That comment explains very little about why this is done, and why it's a
> > good idea.
> 
> Short answer: performance -- it's too expensive to try every block.
> The explanation is in storage/freespace/README -- maybe that should be
> referenced here?

Yes. Even just adding "for performance reasons, only try every second
block. See also the README" would be good.
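
Roughly, the marking loop being described presumably continues along these
lines (a sketch using the patch's map array and FSM_LOCAL_AVAIL flag, not
the exact committed code):

    /*
     * For performance reasons, only mark every other block, starting from
     * the last block and working backwards; see storage/freespace/README.
     */
    blkno = cur_nblocks - 1;
    for (;;)
    {
        fsm_local_map.map[blkno] = FSM_LOCAL_AVAIL;
        if (blkno >= 2)
            blkno -= 2;
        else
            break;
    }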

But I'll note that the need to do this - and potentially waste space -
counters your claim that there are no performance considerations with this
patch.


> > +/* Status codes for the local map. */
> > +
> > +/* Either already tried, or beyond the end of the relation */
> > +#define FSM_LOCAL_NOT_AVAIL 0x00
> > +
> > +/* Available to try */
> > +#define FSM_LOCAL_AVAIL0x01
> >
> > +/* Local map of block numbers for small heaps with no FSM. */
> > +typedef struct
> > +{
> > +   BlockNumber nblocks;
> > +   uint8   map[HEAP_FSM_CREATION_THRESHOLD];
> > +}  FSMLocalMap;
> > +
> >
> > Hm, given realistic HEAP_FSM_CREATION_THRESHOLD, and the fact that we
> > really only need one bit per relation, it seems like map should really
> > be just a uint32 with one bit per page.
> 
> I fail to see the advantage of that.

It'd allow different AMs to have different don't-create-FSM
thresholds without needing additional memory (up to 32 blocks).
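
Something like the following is presumably what's meant (a sketch, not a
worked-out patch):

    /* One bit per block; supports thresholds of up to 32 blocks. */
    typedef struct
    {
        BlockNumber nblocks;
        uint32      map;
    } FSMLocalMap;

    #define FSM_LOCAL_MAP_AVAIL(m, blk) (((m)->map & ((uint32) 1 << (blk))) != 0)
    #define FSM_LOCAL_MAP_SET(m, blk)   ((m)->map |= ((uint32) 1 << (blk)))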


> > I'm kinda thinking that this is the wrong architecture.
> >
> > 1) Unless I miss something, this will trigger a
> >RelationGetNumberOfBlocks(), which in turn ends up doing an lseek(),
> >once for each page we add to the relation.
> 
> That was true previously anyway if the FSM returned InvalidBlockNumber.

True. That was already pretty annoying though.


> >That strikes me as pretty
> >suboptimal. I think it's even worse if multiple backends do the
> >insertion, because the RelationGetTargetBlock(relation) logic will
> >succeed less often.
> 
> Could you explain why it would succeed less often?

Two aspects: 1) If more backends access a table, there'll be a higher
chance the target page is full. 2) There are more backends that don't have
a target page.


> > 2) We'll repeatedly re-encounter the first few pages, because we clear
> >the local map after each 

Re: block-level incremental backup

2019-04-17 Thread Bruce Momjian
On Tue, Apr 16, 2019 at 06:40:44PM -0400, Robert Haas wrote:
> Yeah, I really hope we don't end up with dueling patches.  I want to
> come up with an approach that can be widely-endorsed and then have
> everybody rowing in the same direction.  On the other hand, I do think
> that we may support multiple options in certain places which may have
> the kinds of trade-offs that Bruce mentions.  For instance,
> identifying changed blocks by scanning the whole cluster and checking
> the LSN of each block has an advantage in that it requires no prior
> setup or extra configuration.  Like a sequential scan, it always
> works, and that is an advantage.  Of course, for many people, the
> competing advantage of a WAL-scanning approach that can save a lot of
> I/O will appear compelling, but maybe not for everyone.  I think
> there's room for two or three approaches there -- not in the sense of
> competing patches, but in the sense of giving users a choice based on
> their needs.

Well, by having a separate modblock file for each WAL file, you can keep
both WAL and modblock files and use the modblock list to pull pages from
each WAL file, or from the heap/index files, and it can be done in
parallel.  Having WAL and modblock files in the same directory makes
retention simpler.

In fact, you can do an incremental backup just using the modblock files
and the heap/index files, so you don't even need the WAL.

Also, instead of storing the file name and block number in the modblock
file, using the database oid, relfilenode, and block number (3 int32
values) should be sufficient.
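
As a rough illustration of that last point (a purely hypothetical layout,
using PostgreSQL's uint32 typedef), each modblock entry could then be as
small as:

    typedef struct ModBlockEntry
    {
        uint32      dboid;          /* database OID */
        uint32      relfilenode;    /* relation's relfilenode */
        uint32      blocknum;       /* modified block number */
    } ModBlockEntry;                /* 12 bytes per modified block */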

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Idea for fixing parallel pg_dump's lock acquisition problem

2019-04-17 Thread Tom Lane
We just had another complaint (bug #15767) about parallel dump's
inability to cope with concurrent lock requests.  The problem is
well described by the comments for lockTableForWorker():

 * Acquire lock on a table to be dumped by a worker process.
 *
 * The master process is already holding an ACCESS SHARE lock.  Ordinarily
 * it's no problem for a worker to get one too, but if anything else besides
 * pg_dump is running, there's a possible deadlock:
 *
 * 1) Master dumps the schema and locks all tables in ACCESS SHARE mode.
 * 2) Another process requests an ACCESS EXCLUSIVE lock (which is not granted
 *because the master holds a conflicting ACCESS SHARE lock).
 * 3) A worker process also requests an ACCESS SHARE lock to read the table.
 *The worker is enqueued behind the ACCESS EXCLUSIVE lock request.
 * 4) Now we have a deadlock, since the master is effectively waiting for
 *the worker.  The server cannot detect that, however.
 *
 * To prevent an infinite wait, prior to touching a table in a worker, request
 * a lock in ACCESS SHARE mode but with NOWAIT.  If we don't get the lock,
 * then we know that somebody else has requested an ACCESS EXCLUSIVE lock and
 * so we have a deadlock.  We must fail the backup in that case.

Failing the whole backup is, of course, not a nice outcome.

While thinking about that, it occurred to me that we could close the
gap if the server somehow understood that the master was waiting for
the worker.  And it's actually not that hard to make that happen:
we could use advisory locks.  Consider a dance like the following:

1. Master has A.S. lock on table t, and dispatches a dump job for t
to worker.

2. Worker chooses some key k, does pg_advisory_lock(k), and sends k
back to master.

3. Worker attempts to get A.S. lock on t.  This might block, if some
other session has a pending lock request on t.

4. Upon receipt of message from worker, master also does
pg_advisory_lock(k).  This blocks, but now the server can see that a
deadlock exists, and after deadlock_timeout elapses it will fix the
deadlock by letting the worker bypass the other pending lock request.

5. Once worker has the lock on table t, it does pg_advisory_unlock(k)
to release the master.

6. Master also does pg_advisory_unlock(k), and goes on about its business.
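
Expressed as plain SQL, the dance above might look roughly like this (key
12345 is chosen arbitrarily; table name is hypothetical):

    -- Worker session:
    SELECT pg_advisory_lock(12345);            -- step 2
    LOCK TABLE t IN ACCESS SHARE MODE;         -- step 3, may queue behind a
                                               -- pending ACCESS EXCLUSIVE request
    SELECT pg_advisory_unlock(12345);          -- step 5

    -- Master session (already holding ACCESS SHARE on t):
    SELECT pg_advisory_lock(12345);            -- step 4, blocks; per the above,
                                               -- the deadlock detector then lets
                                               -- the worker's request jump the queue
    SELECT pg_advisory_unlock(12345);          -- step 6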


I've tested that the server side of this works, for either order of steps
3 and 4.  It seems like mostly just a small matter of programming to teach
pg_dump to do this, although there are some questions to resolve, mainly
how we choose the advisory lock keys.  If there are user applications
running that also use advisory locks, there could be unwanted
interference.  One easy improvement is to use pg_try_advisory_lock(k) in
step 2, and just choose a different k if the lock's in use.  Perhaps,
since we don't expect that the locks would be held long, that's
sufficient --- but I suspect that users might wish for some pg_dump
options to restrict the set of keys it could use.

Another point is that the whole dance is unnecessary in the normal
case, so maybe we should only do this if an initial attempt to get
the lock on table t fails.  However, LOCK TABLE NOWAIT throws an
error if it can't get the lock, so this would require using a
subtransaction or starting a whole new transaction in the worker,
so maybe that's more trouble than it's worth.  Some performance
testing might be called for.  There's also the point that the code
path would go almost entirely untested if it's not exercised always.

Lastly, we'd probably want both the master and worker processes to
run with small values of deadlock_timeout, so as to reduce the
time wasted before the server breaks the deadlock.  Are there any
downsides to that?  (If so, again maybe it's not worth the trouble,
since the typical case is that no wait is needed.)

Thoughts?  I'm not volunteering to write this right now, but maybe
somebody else will take up the project.

regards, tom lane




Re: [patch] pg_test_timing does not prompt illegal option

2019-04-17 Thread Tom Lane
Fujii Masao  writes:
> On Wed, Apr 17, 2019 at 6:21 PM Zhang, Jie  wrote:
>> I think "--" is an illegal option; an error should be reported.

> This is not the problem only for pg_test_timing. If you want to
> address this, the patch needs to cover all the client commands
> like psql, createuser. I'm not sure if it's worth doing that.

I think it might be an actively bad idea.  There's a pretty
widespread convention that "--" is a no-op switch indicating
the end of switches.  At least some of our tools appear to
honor that behavior (probably because glibc's getopt_long
does; I do not think we are implementing it ourselves).

regards, tom lane




Re: [patch] pg_test_timing does not prompt illegal option

2019-04-17 Thread Fujii Masao
On Wed, Apr 17, 2019 at 6:21 PM Zhang, Jie  wrote:
>
> Hi all,
>
> pg_test_timing accepts the following command-line options:
> -d duration
> --duration=duration
>
> Specifies the test duration, in seconds. Longer durations give slightly 
> better accuracy, and are more likely to discover problems with the system 
> clock moving backwards. The default test duration is 3 seconds.
> -V
> --version
>
> Print the pg_test_timing version and exit.
> -?
> --help
>
> Show help about pg_test_timing command line arguments, and exit.
>
> [https://www.postgresql.org/docs/11/pgtesttiming.html]
>
> However, when I use the following command, there is no error message; the
> command still runs.
> pg_test_timing --
>
> I think "--" is an illegal option; an error should be reported.
>
> Here is a patch to report the illegal option.

This is not the problem only for pg_test_timing. If you want to
address this, the patch needs to cover all the client commands
like psql, createuser. I'm not sure if it's worth doing that.

Regards,

-- 
Fujii Masao




Re: ToDo: show size of partitioned table

2019-04-17 Thread Justin Pryzby
A reminder about these.

Also, I suggest renaming "On Table" to "Table", for consistency with \di.

Justin
>From e275a0958f0f2cd826e9683fb24b6f757d0fe6c7 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Sun, 7 Apr 2019 18:24:22 -0500
Subject: [PATCH v2] not necessary to join pg_class, we use oid::regclass not
 relname

---
 src/bin/psql/describe.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index 3a04b06..3940570 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -3860,7 +3860,7 @@ listPartitionedTables(const char *reltypes, const char *pattern, bool verbose)
 
 	if (showNested || pattern)
 		appendPQExpBuffer(&buf,
-		  ",\n  c3.oid::regclass as \"%s\"",
+		  ",\n  inh.inhparent::regclass as \"%s\"",
 		  gettext_noop("Parent name"));
 
 	if (showIndexes)
@@ -3901,8 +3901,7 @@ listPartitionedTables(const char *reltypes, const char *pattern, bool verbose)
 
 	if (showNested || pattern)
 		appendPQExpBufferStr(&buf,
-			 "\n LEFT JOIN pg_catalog.pg_inherits inh ON c.oid = inh.inhrelid"
-			 "\n LEFT JOIN pg_catalog.pg_class c3 ON c3.oid = inh.inhparent");
+			 "\n LEFT JOIN pg_catalog.pg_inherits inh ON c.oid = inh.inhrelid");
 
 	if (verbose)
 	{
-- 
2.7.4

>From aa499764bd11172785ac9595168e427354f3cd95 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Sun, 7 Apr 2019 18:54:04 -0500
Subject: [PATCH v2] Refer to ptn "relations" not tables.  Remove incorrect
 statement "(including that of their indexes)"

---
 doc/src/sgml/ref/psql-ref.sgml | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml
index 464c196..08f4bab 100644
--- a/doc/src/sgml/ref/psql-ref.sgml
+++ b/doc/src/sgml/ref/psql-ref.sgml
@@ -1675,15 +1675,15 @@ testdb=
 
 
 If the modifier n (nested) is used,
-or a pattern is specified, then non-root partitioned tables are
+or a pattern is specified, then non-root partitioned relations are
 included, and a column is shown displaying the parent of each
 partitioned relation.
 
 
 
-If + is appended to the command, the sum of sizes of
-table's partitions (including that of their indexes) is also displayed,
-along with the associated description.
+If + is appended to the command name, the sum of the
+sizes of each relation's partitions is also displayed, along with the
+relation's description.
 If n is combined with +, two
 sizes are shown: one including the total size of directly-attached
 leaf partitions, and another showing the total size of all partitions,
-- 
2.7.4

>From eb78f32356c32dbb5b65e69a55177a7a06e863e9 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Wed, 17 Apr 2019 08:59:19 -0500
Subject: [PATCH v1] For consistency with \di, rename column to "Table"

---
 src/bin/psql/describe.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index 30d1078..0b28d4d 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -3885,7 +3885,7 @@ listPartitionedTables(const char *reltypes, const char *pattern, bool verbose)
 	if (showIndexes)
 		appendPQExpBuffer(&buf,
 		  ",\n c2.oid::regclass as \"%s\"",
-		  gettext_noop("On table"));
+		  gettext_noop("Table"));
 
 	if (verbose)
 	{
-- 
2.7.4



Re: Checksum errors in pg_stat_database

2019-04-17 Thread Robert Treat
On Wed, Apr 17, 2019 at 9:07 AM Julien Rouhaud  wrote:
>
> On Wed, Apr 17, 2019 at 1:55 PM Magnus Hagander  wrote:
> >
> > On Tue, Apr 16, 2019 at 5:39 PM Robert Treat  wrote:
> >>
> >> On Mon, Apr 15, 2019 at 3:32 PM Julien Rouhaud  wrote:
> >> >
> >> > I don't know if that counts as an open item, but I attach a patch for
> >> > all points discussed here.
> >>
> >> ISTM we should mention shared objects in both places in the docs, and
> >> want "NULL if data checksums" rather than "NULL is data checksums".
> >> Attaching slightly modified patch with those changes, but otherwise
> >> LGTM.
>
> Thanks, those are indeed embarrassing typos.  And agreed for mentioning
> shared objects in both places.
>
> >
> >  Interestingly enough, that patch comes out as corrupt. I have no idea why 
> > though :) v1 is fine.
> >
> > So I tried merging back your changes into it, and then pushing. Please 
> > doublecheck I didn't miss something :)
>
> Thanks!  I double checked and it all looks fine.

+1

Robert Treat
https://xzilla.net




Re: New vacuum option to do only freezing

2019-04-17 Thread Masahiko Sawada
On Wed, Apr 17, 2019 at 5:00 AM Tom Lane  wrote:
>
> I wrote:
> > I'm thinking that we really need to upgrade vacuum's reporting totals
> > so that it accounts in some more-honest way for pre-existing dead
> > line pointers.  The patch as it stands has made the reporting even more
> > confusing, rather than less so.
>
> Here's a couple of ideas about that:
>
> 1. Ignore heap_page_prune's activity altogether, on the grounds that
> it's just random chance that any cleanup done there was done during
> VACUUM and not some preceding DML operation.  Make tups_vacuumed
> be the count of dead line pointers removed.  The advantage of this
> way is that tups_vacuumed would become independent of previous
> non-VACUUM pruning activity, as well as preceding index-cleanup-disabled
> VACUUMs.  But maybe it's hiding too much information.
>
> 2. Have heap_page_prune count, and add to tups_vacuumed, only HOT
> tuples that it deleted entirely.  The action of replacing a DEAD
> root tuple with a dead line pointer doesn't count for anything.
> Then also add the count of dead line pointers removed to tups_vacuumed.
>
> Approach #2 is closer to the way we've defined tups_vacuumed up to
> now, but I think it'd be more realistic in cases where previous
> pruning or index-cleanup-disabled vacuums have left lots of dead
> line pointers.

On top of approach #2, how about reporting the number of line
pointers we left behind, so that users can notice that there are many dead
line pointers in the table?

>
> I'm not especially wedded to either of these --- any other ideas?

Or, how about reporting the vacuumed tuples and line pointers
separately? It might be too detailed but since with this patch we
delete either only tuples or only line pointers, it's more accurate.



Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center




Re: log_planner_stats and prepared statements

2019-04-17 Thread Bruce Momjian
On Wed, Apr 17, 2019 at 12:04:35AM -0400, Tom Lane wrote:
> Bruce Momjian  writes:
> > I have found that log_planner_stats only outputs stats until the generic
> > plan is chosen.  For example, if you run the following commands:
> 
> Uh, well, the planner doesn't get run after that point ...

Yes, that was my analysis too, but EXPLAIN still prints a planner line,
with a duration:

 Planning Time: 0.674 ms
 Planning Time: 0.240 ms
 Planning Time: 0.186 ms
 Planning Time: 0.158 ms
 Planning Time: 0.159 ms
 Planning Time: 0.169 ms
-->  Planning Time: 0.012 ms

Is that consistent?  I just don't know.  What is that line measuring?

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: Checksum errors in pg_stat_database

2019-04-17 Thread Julien Rouhaud
On Wed, Apr 17, 2019 at 1:55 PM Magnus Hagander  wrote:
>
> On Tue, Apr 16, 2019 at 5:39 PM Robert Treat  wrote:
>>
>> On Mon, Apr 15, 2019 at 3:32 PM Julien Rouhaud  wrote:
>> >
>> > I don't know if that counts as an open item, but I attach a patch for
>> > all points discussed here.
>>
>> ISTM we should mention shared objects in both places in the docs, and
>> want "NULL if data checksums" rather than "NULL is data checksums".
>> Attaching slightly modified patch with those changes, but otherwise
>> LGTM.

Thanks, those are indeed embarrassing typos.  And agreed for mentioning
shared objects in both places.

>
>  Interestingly enough, that patch comes out as corrupt. I have no idea why 
> though :) v1 is fine.
>
> So I tried merging back your changes into it, and then pushing. Please 
> doublecheck I didn't miss something :)

Thanks!  I double checked and it all looks fine.




Re: bug in update tuple routing with foreign partitions

2019-04-17 Thread Etsuro Fujita

(2019/04/11 20:31), Etsuro Fujita wrote:

(2019/04/11 14:33), Amit Langote wrote:

BTW, have you noticed that the RETURNING clause returns the same row
twice?


I noticed this, but I didn't think hard about it. :(


+update utrtest set a = 3 from (values (2), (3)) s(x) where a = s.x
returning *;
+ a | b | x
+---+---+---
+ 3 | qux triggered ! | 2
+ 3 | xyzzy triggered ! | 3
+ 3 | qux triggered ! | 3
+(3 rows)

You can see that the row that's moved into remp is returned twice (one
with "qux triggered!" in b column), whereas it should've been only once?


Yeah, this is unexpected behavior, so will look into this.


I think the reason for that is: the row routed to remp is incorrectly 
fetched as a to-be-updated row when updating remp as a subplan 
targetrel.  The right way to fix this would be to have some way in 
postgres_fdw in which we don't fetch such rows when updating such 
subplan targetrels.  I tried to figure out a (simple) way to do that, 
but I couldn't.  One probably-simple solution I came up with is to sort
the subplan targetrels so that foreign-table subplan targetrels get
processed first in ExecModifyTable().  (Note: currently,
rows can be moved from local partitions to a foreign-table partition, 
but rows cannot be moved from foreign-table partitions to another 
partition, so we wouldn't encounter this situation once we sort like 
that.)  But I think that's ugly, and I don't think it's a good idea to 
change the core, just for postgres_fdw.  So what I'm thinking is to 
throw an error for cases like this.  (Though, I think we should keep
allowing rows to be moved from local partitions to a foreign-table subplan
targetrel that has been updated already.)


What do you think about that?

Best regards,
Etsuro Fujita





Re: Checksum errors in pg_stat_database

2019-04-17 Thread Magnus Hagander
On Tue, Apr 16, 2019 at 5:39 PM Robert Treat  wrote:

> On Mon, Apr 15, 2019 at 3:32 PM Julien Rouhaud  wrote:
> >
> > Sorry for late reply,
> >
> > On Sun, Apr 14, 2019 at 7:12 PM Magnus Hagander 
> wrote:
> > >
> > > On Sat, Apr 13, 2019 at 8:46 PM Robert Treat  wrote:
> > >>
> > >> On Fri, Apr 12, 2019 at 8:18 AM Magnus Hagander 
> wrote:
> > >> ISTM the argument here is go with zero since you have zero connections
> > >> vs go with null since you can't actually connect, so it doesn't make
> > >> sense. (There is a third argument about making it -1 since you can't
> > >> connect, but that breaks sum(numbackends) so it's easily dismissed.) I
> > >> think I would have gone for 0 personally, but what ended up surprising
> > >> me was that a bunch of other stuff like xact_commit show zero when
> > >> AFAICT the above reasoning would apply the same to those columns.
> > >> (unless there is a way to commit a transaction in the global objects
> > >> that I don't know about).
> > >
> > >
> > > That's a good point. I mean, you can commit a transaction that
> involves changes of global objects, but it counts in the database that you
> were connected to.
> > >
> > > We should probably at least make it consistent and make it NULL in all
> or 0 in all.
> > >
> > > I'm -1 for using -1 (!), for the very reason that you mention. But
> either changing the numbackends to 0, or the others to NULL would work for
> consistency. I'm leaning towards the 0 as well.
> >
> > +1 for 0 :)  Especially since it's less code in the view.
> >
>
> +1 for 0
>
> > >> What originally got me looking at this was the idea of returning -1
> > >> (or maybe null) for checksum failures for cases when checksums are not
> > >> enabled. This seems a little more complicated to set up, but seems
> > >> like it might ward off people thinking they are safe due to no
> > >> checksum error reports when they actually aren't.
> > >
> > >
> > > NULL seems like the reasonable thing to return there. I'm not sure
> what you're referring to with a little more complicated to set up, though?
> Do you mean somehow for the end user?
> > >
> > > Code-wise it seems it should be simple -- just do an "if checksums
> disabled then return null"  in the two functions.
> >
> > That's indeed a good point!  Lack of checksum error is distinct from
> > checksums not activated and we should make it obvious.
> >
> > I don't know if that counts as an open item, but I attach a patch for
> > all points discussed here.
>
> ISTM we should mention shared objects in both places in the docs, and
> want "NULL if data checksums" rather than "NULL is data checksums".
> Attaching slightly modified patch with those changes, but otherwise
> LGTM.
>

 Interestingly enough, that patch comes out as corrupt. I have no idea why
though :) v1 is fine.

So I tried merging back your changes into it, and then pushing. Please
doublecheck I didn't miss something :)
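
As an illustration of how this would be consumed, a monitoring query could
then look something like this (a sketch only; NULL meaning checksums are
disabled, 0 meaning they are enabled and no failures have been seen):

SELECT datname,
       checksum_failures,
       CASE
           WHEN checksum_failures IS NULL THEN 'checksums disabled'
           WHEN checksum_failures = 0     THEN 'no failures'
           ELSE 'failures detected'
       END AS checksum_status
FROM pg_stat_database;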

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: jsonpath

2019-04-17 Thread John Naylor
Attached is a patch to fix some minor issues:

-misspelling of an error message
-Commit 550b9d26f80f failed to update the Makefile comment to reflect
how the build changed, and also removed the clean target, which we now
have use for since we later got rid of backtracking in the scanner.

Also, while I have the thought in my head, for v13 we should consider
replacing the keyword binary search with the perfect hash technique
added in c64d0cd5ce2 -- it might give a small performance boost to the
scanner.

-- 
John Naylorhttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


json-path-minor-fixes.patch
Description: Binary data


Re: Unhappy about API changes in the no-fsm-for-small-rels patch

2019-04-17 Thread Amit Kapila
On Wed, Apr 17, 2019 at 10:39 AM John Naylor
 wrote:
> On Wed, Apr 17, 2019 at 2:04 AM Andres Freund  wrote:
>
> > +/* Only create the FSM if the heap has greater than this many blocks */
> > +#define HEAP_FSM_CREATION_THRESHOLD 4
> >
> > Hm, this seems to be tying freespace.c closer to heap than I think is
> > great - think of new AMs like zheap, that also want to use it.
>
> Amit and I kept zheap in mind when working on the patch. You'd have to
> work around the metapage, but everything else should work the same.
>

I think we also need to take care of TPD pages along with the meta page.
This might be less effective if we encounter TPD pages as well in a
small relation, which shouldn't be a common scenario, but it won't
hurt otherwise.  Those pages are anyway temporary and will be
removed.

BTW, there is one other thing in the FSM code which is tied to heap and
for which we might want some handling:
#define MaxFSMRequestSize MaxHeapTupleSize

In general, it would be good if we could find some pluggable way for both
of these defines, but even otherwise it shouldn't cause a big problem.

>
> > I'm kinda thinking that this is the wrong architecture.
> >
> > 1) Unless I miss something, this will trigger a
> >RelationGetNumberOfBlocks(), which in turn ends up doing an lseek(),
> >once for each page we add to the relation.
>
> That was true previously anyway if the FSM returned InvalidBlockNumber.
>
> >That strikes me as pretty
> >suboptimal. I think it's even worse if multiple backends do the
> >insertion, because the RelationGetTargetBlock(relation) logic will
> >succeed less often.
>
> Could you explain why it would succeed less often?
>
> > 2) We'll repeatedly re-encounter the first few pages, because we clear
> >the local map after each successful RelationGetBufferForTuple().
>
> Not exactly sure what you mean? We only set the map if
> RelationGetTargetBlock() returns InvalidBlockNumber, or if it returned
> a valid block, and inserting there already failed. So, not terribly
> often, I imagine.
>
> > 3) The global variable based approach means that we cannot easily do
> >better. Even if we had a cache that lives across
> >RelationGetBufferForTuple() calls.
> >
> > 4) fsm_allow_writes() does a smgrexists() for the FSM in some
> >cases. That's pretty darn expensive if it's already open.
>
> (from earlier)
> > Isn't this like really expensive? mdexists() closes the relations and
> > reopens it from scratch. Shouldn't we at the very least check
> > smgr_fsm_nblocks beforehand, so this is only done once?
>
> Hmm, I can look into that.
>
>
> I think if we want to keep something like this feature, we'd need to
> cache the relation size in a variable similar to how we cache the FSM
> size (like SMgrRelationData.smgr_fsm_nblocks)
>

Makes sense.  I think we should do this unless we face any problem with it.

> *and* stash the bitmap of
> pages that we think are used/free as a bitmap somewhere below the
> relcache.

I think maintaining it at the relcache level will be tricky when there
are insertions and deletions happening in the small relation.  We
considered such a case during development, wherein we don't want the
FSM to be created if there are insertions and deletions in a small
relation.  The current mechanism addresses both this and the normal
case where there are not many deletions.  Sure, there is some overhead
of building the map again and rechecking each page.  The first is a
memory operation and takes a few cycles, and for the second we
optimized by checking the pages alternately, which means we won't
check more than two pages at a time.  This cost is offset by not
checking the FSM, and it could be somewhat better in some cases [1].
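
To make that concrete, here is an illustrative sketch of the local-map
idea (this is not the code from the patch; names and the exact probing
order are simplified):

/*
 * Illustrative sketch only -- not the code from the patch.  For a small,
 * FSM-less heap, keep a tiny backend-local map of blocks that were already
 * tried, so a page we just found to be full is not visited again before
 * the map is rebuilt.
 */
#include <stdbool.h>

#define HEAP_FSM_CREATION_THRESHOLD 4
#define INVALID_BLOCK (-1)

typedef struct
{
    int     nblocks;                            /* relation size when the map was built */
    bool    tried[HEAP_FSM_CREATION_THRESHOLD]; /* pages already found full */
} LocalFreePageMap;

/* Return the next block worth trying, or INVALID_BLOCK if none is left. */
static int
local_map_next(LocalFreePageMap *map)
{
    for (int blk = 0; blk < map->nblocks; blk++)
    {
        if (!map->tried[blk])
        {
            map->tried[blk] = true;     /* don't revisit it on the next call */
            return blk;
        }
    }
    return INVALID_BLOCK;
}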


>  If we cleared that variable at truncations, I think we should
> be able to make that work reasonably well?

Not only that, I think it needs to be cleared whenever we create the
FSM as well, which could be tricky as it can be created by vacuum.

OTOH, if we want to extend it later for whatever reason to a
relation-level cache, it shouldn't be that difficult, as the
implementation is mostly contained in freespace.c (the fsm* functions)
and I think the relation is accessible in most places.  We might need
to rip out some calls to clearlocalmap.

>
> On Wed, Apr 17, 2019 at 3:16 AM Andres Freund  wrote:
> >
> > Hi,
> >
> > On 2019-04-16 14:31:25 -0400, Tom Lane wrote:
>   
> > > and in any case it seems pretty inefficient for workloads that insert
> > > into multiple tables.
> >
> > As is, it's inefficient for insertions into a *single* relation. The
> > RelationGetTargetBlock() makes it not crazily expensive, but it's still
> > plenty expensive.
>

During development, we were also worried about the performance
regression that could happen due to this patch, and we did many
rounds of performance tests where such a cache could be accessed
pretty frequently.  In the original version, we did see a small
regression, as a result of which we came up with an alternate strategy
of not 

Re: hyrax vs. RelationBuildPartitionDesc

2019-04-17 Thread Amit Langote
On 2019/04/17 18:58, Amit Langote wrote:
> where index on queries
> column

Oops, I meant "with an index on the queried column".

Thanks,
Amit





Re: hyrax vs. RelationBuildPartitionDesc

2019-04-17 Thread Amit Langote
On 2019/04/15 4:29, Tom Lane wrote:
> I think that what we ought to do for v12 is have PartitionDirectory
> copy the data, and then in v13 work on creating real reference-count
> infrastructure that would allow eliminating the copy steps with full
> safety.  The $64 question is whether that really would cause unacceptable
> performance problems.  To look into that, I made the attached WIP patches.
> (These are functionally complete, but I didn't bother for instance with
> removing the hunk that 898e5e329 added to relcache.c, and the comments
> need work, etc.)  The first one just changes the PartitionDirectory
> code to do that, and then the second one micro-optimizes
> partition_bounds_copy() to make it somewhat less expensive, mostly by
> collapsing lots of small palloc's into one big one.

Thanks for the patches.  The partition_bounds_copy() micro-optimization
one looks good in any case.

> What I get for test cases like [1] is
> 
> single-partition SELECT, hash partitioning:
> 
> N     tps, HEAD       tps, patch
> 2     11426.243754    11448.615193
> 8     11254.833267    11374.278861
> 32    11288.329114    11371.942425
> 128   11222.329256    11185.845258
> 512   11001.177137    10572.917288
> 1024  10612.456470    9834.172965
> 4096  8819.110195     7021.864625
> 8192  7372.611355     5276.130161
> 
> single-partition SELECT, range partitioning:
> 
> N     tps, HEAD       tps, patch
> 2     11037.855338    11153.595860
> 8     11085.218022    11019.132341
> 32    10994.348207    10935.719951
> 128   10884.417324    10532.685237
> 512   10635.583411    9578.108915
> 1024  10407.286414    8689.585136
> 4096  8361.463829     5139.084405
> 8192  7075.880701     3442.542768
> 
> Now certainly these numbers suggest that avoiding the copy could be worth
> our trouble, but these results are still several orders of magnitude
> better than where we were two weeks ago [2].  Plus, this is an extreme
> case that's not really representative of real-world usage, since the test
> tables have neither indexes nor any data.

I tested the copyPartitionDesc() patch and here are the results for
single-partition SELECT using hash partitioning, where index on queries
column, and N * 1000 rows inserted into the parent table before the test.
I've confirmed that the plan is always an Index Scan on selected partition
(in PG 11, it's under Append, but in HEAD there's no Append due to 8edd0e794)

N       tps, HEAD       tps, patch      tps, PG 11
2       3093.443043     3039.804101     2928.777570
8       3024.545820     3064.333027     2372.738622
32      3029.580531     3032.755266     1417.706212
128     3019.359793     3032.726006     567.099745
512     2948.639216     2986.987862     98.710664
1024    2971.629939     2882.233026     41.720955
4096    2680.703000     1937.988908     7.035816
8192    2599.120308     2069.271274     3.635512

So, the TPS degrades by 14% when going from 128 partitions to 8192
partitions on HEAD, whereas it degrades by 31% with the patch.

Here are the numbers with no indexes defined on the tables, and no data
inserted.

N       tps, HEAD       tps, patch      tps, PG 11
2       3498.247862     3463.695950     3110.314290
8       3524.430780     3445.165206     2741.340770
32      3476.427781     3427.400879     1645.602269
128     3427.121901     3430.385433     651.586373
512     3394.907072     3335.842183     182.201349
1024    3454.050819     3274.266762     67.942075
4096    3201.266380     2845.974556     12.320716
8192    2955.850804     2413.723443     6.151703

Here, the TPS degrades by 13% when going from 128 partitions to 8192
partitions on HEAD, whereas it degrades by 29% with the patch.

So, the degradation caused by copying the bounds is almost the same in
both cases.  Actually, even in the more realistic test with indexes and
data, executing the plan becomes relatively faster than planning it as
the partition count grows, because the PartitionBoundInfo that the
planner now copies grows bigger.
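
For reference, the kind of setup used for the above numbers looks roughly
like the sketch below (table and column names here are made up; the actual
scripts are the ones Tom referenced as [1]):

create table parent (id int, val text) partition by hash (id);

do $$
declare
  n int := 128;   -- the partition count N being varied
begin
  for i in 0 .. n - 1 loop
    execute format('create table parent_%s partition of parent
                    for values with (modulus %s, remainder %s)', i, n, i);
    execute format('create index on parent_%s (id)', i);
  end loop;
  execute format('insert into parent select g, g::text from generate_series(1, %s) g',
                 n * 1000);
end $$;

-- pgbench custom script (single-partition SELECT), e.g.:
-- \set param random(1, 128000)
-- select * from parent where id = :param;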

Thanks,
Amit





[patch] pg_test_timing does not prompt illegal option

2019-04-17 Thread Zhang, Jie
Hi all,

pg_test_timing accepts the following command-line options:
-d duration
--duration=duration

Specifies the test duration, in seconds. Longer durations give slightly 
better accuracy, and are more likely to discover problems with the system clock 
moving backwards. The default test duration is 3 seconds.
-V
--version

Print the pg_test_timing version and exit.
-?
--help

Show help about pg_test_timing command line arguments, and exit.

[https://www.postgresql.org/docs/11/pgtesttiming.html]

However, when I run the following command, no error is reported and the
command runs:
pg_test_timing --

I think "--" is an illegal option, so an error should be reported.

Here is a patch to report the illegal option.
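
To illustrate the idea (this is not the attached patch, just a sketch of
one way to do it): getopt_long() silently consumes a bare "--", so the
check has to look at argv after option parsing, for example:

#include <getopt.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int
main(int argc, char *argv[])
{
    static const struct option long_options[] = {
        {"duration", required_argument, NULL, 'd'},
        {NULL, 0, NULL, 0}
    };
    int         c;
    int         duration = 3;

    while ((c = getopt_long(argc, argv, "d:", long_options, NULL)) != -1)
    {
        if (c == 'd')
            duration = atoi(optarg);
        else
        {
            fprintf(stderr, "Try \"%s --help\" for more information.\n", argv[0]);
            exit(1);
        }
    }

    /* reject leftover arguments, and a trailing bare "--" as well */
    if (optind < argc ||
        (argc > 1 && strcmp(argv[argc - 1], "--") == 0))
    {
        fprintf(stderr, "%s: too many command-line arguments\n", argv[0]);
        exit(1);
    }

    printf("testing for %d seconds\n", duration);
    return 0;
}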

Best Regards!





pg_test_timing.patch
Description: pg_test_timing.patch


standby recovery fails (tablespace related) (tentative patch and discussion)

2019-04-17 Thread Paul Guo
Hello postgres hackers,

Recently my colleagues and I encountered an issue: a standby cannot
recover after an unclean shutdown, and it's related to tablespaces.
The issue is that the standby re-replays some xlog records that need
tablespace directories (e.g. creating a database with a tablespace),
but those tablespace directories have already been removed in the
previous replay.

In detail, the standby normally finishes replaying the operations below,
but due to the unclean shutdown the redo LSN in pg_control is not updated
and still holds a value from before the 'create db with tablespace' xlog
record; since the tablespace directories were removed during the earlier
replay, replaying the database-create WAL record then reports an error.

create db with tablespace
drop database
drop tablespace.

Here is the log on the standby.
2019-04-17 14:52:14.926 CST [23029] LOG:  starting PostgreSQL 12devel on
x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat
4.8.5-4), 64-bit
2019-04-17 14:52:14.927 CST [23029] LOG:  listening on IPv4 address
"192.168.35.130", port 5432
2019-04-17 14:52:14.929 CST [23029] LOG:  listening on Unix socket
"/tmp/.s.PGSQL.5432"
2019-04-17 14:52:14.943 CST [23030] LOG:  database system was interrupted
while in recovery at log time 2019-04-17 14:48:27 CST
2019-04-17 14:52:14.943 CST [23030] HINT:  If this has occurred more than
once some data might be corrupted and you might need to choose an earlier
recovery target.
2019-04-17 14:52:14.949 CST [23030] LOG:  entering standby mode

2019-04-17 14:52:14.950 CST [23030] LOG:  redo starts at 0/30105B8

2019-04-17 14:52:14.951 CST [23030] FATAL:  could not create directory
"pg_tblspc/65546/PG_12_201904072/65547": No such file or directory
2019-04-17 14:52:14.951 CST [23030] CONTEXT:  WAL redo at 0/3011650 for
Database/CREATE: copy dir 1663/1 to 65546/65547
2019-04-17 14:52:14.951 CST [23029] LOG:  startup process (PID 23030)
exited with exit code 1
2019-04-17 14:52:14.951 CST [23029] LOG:  terminating any other active
server processes
2019-04-17 14:52:14.953 CST [23029] LOG:  database system is shut down


Steps to reproduce:

1. Set up a master and a standby.
2. On both sides, run: mkdir /tmp/some_isolation2_pg_basebackup_tablespace

3. Run these SQL statements:
drop tablespace if exists some_isolation2_pg_basebackup_tablespace;
create tablespace some_isolation2_pg_basebackup_tablespace location
'/tmp/some_isolation2_pg_basebackup_tablespace';

4. Cleanly shut down and restart both postgres instances.

5. Run the following SQL statements:

drop database if exists some_database_with_tablespace;
create database some_database_with_tablespace tablespace
some_isolation2_pg_basebackup_tablespace;
drop database some_database_with_tablespace;
drop tablespace some_isolation2_pg_basebackup_tablespace;
\! pkill -9 postgres; ssh host70 pkill -9 postgres

Note that an immediate shutdown via pg_ctl should also be able to reproduce
this, and the above steps probably do not reproduce it 100% of the time.

I created an initial patch for this issue (see the attachment). The idea is
to re-create those directories recursively. The issue above shows up in
dbase_redo(), but TablespaceCreateDbspace() (used for relation-file-create
redo) is probably buggy as well, so I modified that function too. Even if
there is no bug in that function, using a simple pg_mkdir_p() seems cleaner.
Reading TablespaceCreateDbspace(), it looks like this issue has already been
thought about, though not sufficiently; frankly, this solution (directory
re-creation) is not perfect, given that this should really have been the
responsibility of tablespace creation (which also does more, e.g. symlink
creation). Also, I'm not sure whether we need to use the invalid-page
mechanism (see xlogutils.c).
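
To sketch the first idea in context (this is not the attached patch; error
handling and details are simplified, and the functions referenced are
existing backend/port helpers), dbase_redo() could make sure the
destination's parent directories exist before copying:

    char       *dst_path = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id);
    char        parent[MAXPGPATH];

    /* Re-create the tablespace version directory if replay removed it. */
    strlcpy(parent, dst_path, MAXPGPATH);
    get_parent_directory(parent);
    if (pg_mkdir_p(parent, pg_dir_create_mode) != 0)
        ereport(ERROR,
                (errcode_for_file_access(),
                 errmsg("could not create directory \"%s\": %m", parent)));

    copydir(src_path, dst_path, false);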

Another solution: we already create a checkpoint for
createdb/movedb/dropdb/droptablespace, so maybe we should force the standby
to create a restartpoint for such special checkpoint WAL records - that
means setting a flag in the checkpoint WAL record and letting the checkpoint
redo code create a restartpoint if that flag is set. This solution seems
safer.

Thanks,
Paul


0001-Recursively-create-tablespace-directories-if-those-a.patch
Description: Binary data


Re: extensions are hitting the ceiling

2019-04-17 Thread Jiří Fejfar
Hi all!

I am sending our comments on the issues mentioned. I tried to send them
a month ago
(https://www.postgresql.org/message-id/CA%2B8wVNUOt2Bh4x7YQEVoq5BfP%3DjM-F6cDYKxJiTODG_VCGhUVQ%40mail.gmail.com),
but they somehow didn't appear in the thread (sorry, I am new to
mailing list practice...).

My colleague already posted a report to the bugs mailing list
(https://www.postgresql.org/message-id/15616-260dc9cb3bec7...@postgresql.org),
but got no response.

On Tue, 19 Mar 2019 at 02:38, Eric Hanson  wrote:
>
> Hi folks,
>
> After months and years of really trying to make EXTENSIONs meet the 
> requirements of my machinations, I have come to the conclusion that either a) 
> I am missing something or b) they are architecturally flawed.  Or possibly 
> both.
>
> Admittedly, I might be trying to push extensions beyond what the great 
> elephant in the sky ever intended. The general bent here is to try to achieve 
> a level of modular reusable components similar to those in "traditional" 
> programming environments like pip, gem, npm, cpan, etc. Personally, I am 
> trying to migrate as much of my dev stack as possible away from the 
> filesystem and into the database. Files, especially code, configuration, 
> templates, permissions, manifests and other development files, would be much 
> happier in a database where they have constraints and an information model 
> and can be queried!
>
> Regardless, it would be really great to be able to install an extension, and 
> have it cascade down to multiple other extensions, which in turn cascade down 
> to more, and have everything just work. Clearly, this was considered in the 
> extension architecture, but I'm running into some problems making it a 
> reality.  So here they are.
>
>
> #1: Dependencies
>
> Let's say we have two extensions, A and B, both of which depend on a third 
> extension C, let's just say C is hstore.  A and B are written by different 
> developers, and both contain in their .control file the line
>
> requires = 'hstore'
>
> When A is installed, if A creates a schema, it puts hstore in that schema. If 
> not, hstore is already installed, it uses it in that location.  How does the 
> extension know where to reference hstore?
>
> Then, when B is installed, it checks to see if extension hstore is installed, 
> sees that it is, and moves on.  What if it expects it in a different place 
> than A does? The hstore extension can only be installed once, in a single 
> schema, but if multiple extensions depend on it and look for it in different 
> places, they are incompatible.
>
> I have heard talk of a way to write extensions so that they dynamically 
> reference the schema of their dependencies, but sure don't know how that 
> would work if it's possible.  The @extschema@ variable references the 
> *current* extension's schema, but not there is no dynamic variable to 
> reference the schema of a dependency.
>
> Also it is possible in theory to dynamically set search_path to contain every 
> schema of every dependency in play and then just not specify a schema when 
> you use something in a dependency.  But this ANDs together all the scopes of 
> all the dependencies of an extension, introducing potential for collisions, 
> and is generally kind of clunky.
>

It is not possible to specify the version of an extension we depend on
in the .control file.
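
For reference, here is a minimal sketch of such a .control file (the
extension name and most settings are made up; the point is the requires
line, which can only name a dependency, not pin its version):

# my_extension.control -- illustrative only
comment = 'example extension that depends on hstore'
default_version = '1.0'
relocatable = false
schema = my_extension
requires = 'hstore'        # no way to say e.g. hstore >= 1.5 here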

> #2:  Data in Extensions
>
> Extensions that are just a collection of functions and types seem to be the 
> norm.  Extensions can contain what the docs call "configuration" data, but 
> rows are really second class citizens:  They aren't tracked with 
> pg_catalog.pg_depend, they aren't deleted when the extension is dropped, etc.
>
> Sometimes it would make sense for an extension to contain *only* data, or 
> insert some rows in a table that the extension doesn't "own", but has as a 
> dependency.  For example, a "webserver" extension might contain a "resource" 
> table that serves up the content of resources in the table at a specified 
> path. But then, another extension, say an app, might want to just list the 
> webserver extension as a dependency, and insert a few resource rows into it.  
> This is really from what I can tell beyond the scope of what extensions are 
> capable of.
>

I am not sure about the name "configuration" tables. From my point of
view, extensions can hold two sorts of data:
1) "static" data: delivered with the extension and inserted by its update
scripts; the same "static" data are present across multiple
installations of the extension in the same version. These data are not
supposed to be dumped.
2) "dynamic" data: inserted by users, must be included in dumps,
are marked with pg_extension_config_dump() and are called
"configuration" tables/data ... but why "configuration"?

>
> #3 pg_dump and Extensions
>
> Tables created by extensions are skipped by pg_dump unless they are flagged 
> at create time with:
>
> pg_catalog.pg_extension_config_dump('my_table', 'where id < 20')
>
>