Re: reloption to prevent VACUUM from truncating empty pages at the end of relation

2019-02-27 Thread Sergei Kornilov
Hi

> The default should always be to shrink, unless either the VACUUM
> option or the reloption turn that off. (So it doesn't make sense to set
> either the VACUUM option or the reloption to "on").

I think the VACUUM option can be set to "on" by hand in order to override the
reloption for just that VACUUM call.

regards, Sergei



Re: BUG #15623: Inconsistent use of default for updatable view

2019-02-27 Thread Amit Langote
On 2019/02/27 18:37, Dean Rasheed wrote:
> On Tue, 12 Feb 2019 at 10:33, Dean Rasheed  wrote:
>> Here's an updated patch ...
> 
> So I pushed that. However, ...
> 
> Playing around with it some more, I realised that whilst this appeared
> to fix the reported problem, it exposes another issue which is down to
> the interaction between rewriteTargetListIU() and rewriteValuesRTE()
> --- for an INSERT with a VALUES RTE, rewriteTargetListIU() computes a
> list of target attribute numbers (attrno_list) from the targetlist in
> its original order, which rewriteValuesRTE() then relies on being the
> same length (and in the same order) as each of the lists in the VALUES
> RTE. That's OK for the initial invocation of those functions, but it
> breaks down when they're recursively invoked for auto-updatable views.
>> So actually, I think the right way to fix this is to give up trying to
> compute attrno_list in rewriteTargetListIU(), and have
> rewriteValuesRTE() work out the attribute assignments itself for each
> column of the VALUES RTE by examining the rewritten targetlist. That
> looks to be quite straightforward, and results in a cleaner separation
> of logic between the 2 functions, per the attached patch.

+1.  Only rewriteValuesRTE needs to use that information, so it's better
to teach it to figure that out by itself.

> I think that rewriteValuesRTE() should only ever see DEFAULT items for
> columns that are simple assignments to columns of the target relation,
> so it only needs to work out the target attribute numbers for TLEs
> that contain simple Vars referring to the VALUES RTE. Any DEFAULT seen
> in a column not matching that would be an error, but actually I think
> such a thing ought to be a "can't happen" error because of the prior
> checks during parse analysis, so I've used elog() to report if this
> does happen.

+   if (attrno == 0)
+       elog(ERROR, "Cannot set value in column %d to DEFAULT", i);

Maybe: s/Cannot/cannot/g

+Assert(list_length(sublist) == numattrs);

Isn't this Assert useless, given that we're setting numattrs to
list_length() and transformInsertStmt ensures that all
sublists are the same length?


Thanks,
Amit




Re: Drop type "smgr"?

2019-02-27 Thread Haribabu Kommi
On Thu, Feb 28, 2019 at 5:37 PM Tom Lane  wrote:

> Thomas Munro  writes:
> > On Thu, Feb 28, 2019 at 7:08 PM Tom Lane  wrote:
> >> I agree that smgrtype as it stands is pretty pointless, but what
> >> will we be using instead to get to those other implementations?
>
> > Our current thinking is that smgropen() should know how to map a small
> > number of special database OIDs to different smgr implementations
>
> Hmm.  Maybe mapping based on tablespaces would be a better idea?
>

Thanks for bringing up this idea of multiple smgr implementations.  I have
also thought of implementing our own smgr implementation to support
transparent data encryption on disk, based on tablespace mapping.

Regards,
Haribabu Kommi
Fujitsu Australia


Re: Drop type "smgr"?

2019-02-27 Thread Thomas Munro
On Thu, Feb 28, 2019 at 7:37 PM Tom Lane  wrote:
> Thomas Munro  writes:
> > On Thu, Feb 28, 2019 at 7:08 PM Tom Lane  wrote:
> >> I agree that smgrtype as it stands is pretty pointless, but what
> >> will we be using instead to get to those other implementations?
>
> > Our current thinking is that smgropen() should know how to map a small
> > number of special database OIDs to different smgr implementations
>
> Hmm.  Maybe mapping based on tablespaces would be a better idea?

In the undo log proposal (about which more soon) we are using
tablespaces for their real purpose, so we need that OID.  If you SET
undo_tablespaces = foo then future undo data created by your session
will be written there, which might be useful for putting that IO on
different storage.  We also use the relation OID to chop up the undo
address space into separate numbered undo logs, so that different
sessions get their own space to insert data without trampling on each
other's buffer locks.  That leaves only the database OID to mess with
(unless we widen buffer tags and the smgropen() interface).

> > Unlike the ancestral code, it wouldn't need to appear in
> > catalogs or ever be seen or typed in by users so there still wouldn't
> > be a use for this type.
>
> Yeah, the $64 problem at this level is that you don't really want
> to depend on catalog contents, because you cannot do a lookup to
> find out what to do.

Right.  It would be a circular dependency if you had to read a catalog
before you could consult low level SLRUs like (say) the clog* via
shared buffers, since you need the clog to read the catalog, or (to
peer a long way down the road and around several corners) if the
catalogs themselves could optionally be stored in an undo-aware table
access manager like zheap, which might require reading undo.  I think
this block storage stuff exists at a lower level than relations and
transactions and therefore also catalogs, and there will be a small
fixed number of them and it makes sense to hard-code the knowledge of
them.

*I'm at least a little bit aware of the history here: your 2001 commit
2589735d moved clog out of shared buffers!  That enabled you to
develop the system of segment files truncated from the front.  That's
sort of what this new smgr work is about; putting things in shared
buffers, but just mapping the blocks to paths differently than md.c,
as appropriate.

-- 
Thomas Munro
https://enterprisedb.com



Re: reloption to prevent VACUUM from truncating empty pages at the end of relation

2019-02-27 Thread Laurenz Albe
Alvaro Herrera wrote:
> I think we should have a VACUUM option and a reloption, but no
> GUC.  The default should always be to shrink, unless either the VACUUM
> option or the reloption turn that off.  (So it doesn't make sense to set
> either the VACUUM option or the reloption to "on").

+1

Yours,
Laurenz Albe




Re: bgwriter_lru_maxpages limits in PG 10 sample conf

2019-02-27 Thread Sergei Kornilov
Hello

postgresql.conf.sample was changed recently in REL_10_STABLE (commit 
ab1d9f066aee4f9b81abde6136771debe0191ae8)
So the sample config will be changed in the next minor release anyway.  Do we
have any other reason not to fix the bgwriter_lru_maxpages comment?

regards, Sergei



Re: Drop type "smgr"?

2019-02-27 Thread Tom Lane
Thomas Munro  writes:
> On Thu, Feb 28, 2019 at 7:08 PM Tom Lane  wrote:
>> I agree that smgrtype as it stands is pretty pointless, but what
>> will we be using instead to get to those other implementations?

> Our current thinking is that smgropen() should know how to map a small
> number of special database OIDs to different smgr implementations

Hmm.  Maybe mapping based on tablespaces would be a better idea?

> Unlike the ancestral code, it wouldn't need to appear in
> catalogs or ever be seen or typed in by users so there still wouldn't
> be a use for this type.

Yeah, the $64 problem at this level is that you don't really want
to depend on catalog contents, because you cannot do a lookup to
find out what to do.  So I agree that we're pretty unlikely to
resurrect an smgr type per se.  But I'd been expecting an answer
mentioning pg_am OIDs, and was wondering how that'd work exactly.
Probably, it would still be down to some C code having hard-wired
knowledge about some OIDs ...

regards, tom lane



Re: POC: converting Lists into arrays

2019-02-27 Thread Tom Lane
Andrew Gierth  writes:
> To get a reliable measurement of timing changes less than around 3%,
> what you have to do is this: pick some irrelevant function and add
> something like an asm directive that inserts a variable number of NOPs,
> and do a series of test runs with different values.

Good point.  If you're looking at a microbenchmark that only exercises
a small amount of code, it can be way worse than that.  I was reminded
of this the other day while fooling with the problem discussed in
https://www.postgresql.org/message-id/flat/6970.1545327...@sss.pgh.pa.us
in which we were spending huge amounts of time in a tight loop in
match_eclasses_to_foreign_key_col.  I normally run with --enable-cassert
unless I'm trying to collect performance data; so I rebuilt with
--disable-cassert, and was bemused to find out that that test case ran
circa 20% *slower* in the non-debug build.  This is silly on its face,
and even more so when you notice that match_eclasses_to_foreign_key_col
itself contains no Asserts and so its machine code is unchanged by the
switch.  (I went to the extent of comparing .s files to verify this.)
So that had to have been down to alignment/cacheline issues triggered
by moving said function around.  I doubt the case would be exactly
reproducible on different hardware or toolchain, but another platform
would likely show similar issues on some case or other.

tl;dr: even a 20% difference might be nothing more than an artifact.

regards, tom lane



Re: Drop type "smgr"?

2019-02-27 Thread Thomas Munro
On Thu, Feb 28, 2019 at 7:08 PM Tom Lane  wrote:
> Thomas Munro  writes:
> > Motivation: A couple of projects propose to add new smgr
> > implementations alongside md.c in order to use bufmgr.c for more kinds
> > of files, but it seems entirely bogus to extend the unused smgr type
> > to cover those.
>
> I agree that smgrtype as it stands is pretty pointless, but what
> will we be using instead to get to those other implementations?

Our current thinking is that smgropen() should know how to map a small
number of special database OIDs to different smgr implementations
(where currently it hard-codes smgr_which = 0).  For example there
would be a pseudo-database for undo logs, and another for the SLRUs
that Shawn Debnath and others have been proposing to move into shared
buffers.  Another idea would be to widen the buffer tag to include the
selector.  Unlike the ancestral code, it wouldn't need to appear in
catalogs or ever be seen or typed in by users so there still wouldn't
be a use for this type.
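
To make that concrete, a selector along these lines is what I have in
mind -- purely a sketch, with the pseudo-database OIDs invented for
illustration (today smgropen() just hard-codes smgr_which = 0):

/* Illustrative sketch only; the OIDs below are made up. */
#define UNDO_PSEUDO_DB_OID	9001	/* hypothetical: undo logs */
#define SLRU_PSEUDO_DB_OID	9002	/* hypothetical: SLRUs in shared buffers */

static int
smgr_selector_for(RelFileNode rnode)
{
	switch (rnode.dbNode)
	{
		case UNDO_PSEUDO_DB_OID:
			return 1;			/* hypothetical undo-log smgr slot */
		case SLRU_PSEUDO_DB_OID:
			return 2;			/* hypothetical SLRU smgr slot */
		default:
			return 0;			/* md.c, as today */
	}
}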

-- 
Thomas Munro
https://enterprisedb.com



RE: Prevent extension creation in temporary schemas

2019-02-27 Thread Kuroda, Hayato
Dear Michael, Chris and Tom,

> Adding special cases to extensions strikes me as adding more
> funny corners to the behavior of the db in this regard.

I understand your arguments and their utility.

> For most of extensions, this can randomly finish with strange error
> messages, say that:
> =# create extension file_fdw with schema pg_temp_3;
> ERROR:  42883: function file_fdw_handler() does not exist
> LOCATION:  LookupFuncName, parse_func.c:2088

I found that this strange error appears after creating a
temporary table.

test=> CREATE TEMPORARY TABLE temp (id int);
CREATE TABLE
test=> CREATE EXTENSION file_fdw WITH SCHEMA pg_temp_3;
ERROR:  function file_fdw_handler() does not exist

I will try to understand this problem, both for the community and
for my own experience.

Best Regards,
Hayato Kuroda
Fujitsu LIMITED






Re: Drop type "smgr"?

2019-02-27 Thread Tom Lane
Thomas Munro  writes:
> Motivation: A couple of projects propose to add new smgr
> implementations alongside md.c in order to use bufmgr.c for more kinds
> of files, but it seems entirely bogus to extend the unused smgr type
> to cover those.

I agree that smgrtype as it stands is pretty pointless, but what
will we be using instead to get to those other implementations?

regards, tom lane



Drop type "smgr"?

2019-02-27 Thread Thomas Munro
Hello hackers,

The type smgr has only one value 'magnetic disk'.  ~15 years ago it
also had a value 'main memory', and in Berkeley POSTGRES 4.2 there was
a third value 'sony jukebox'.  Back then, all tables had an associated
block storage manager, and it was recorded as an attribute relsmgr of
pg_class (or pg_relation as it was known further back).  This was the
type of that attribute, removed by Bruce in 3fa2bb31 (1997).

Nothing seems to break if you remove it (except for some tests using
it in an incidental way).  See attached.

Motivation: A couple of projects propose to add new smgr
implementations alongside md.c in order to use bufmgr.c for more kinds
of files, but it seems entirely bogus to extend the unused smgr type
to cover those.

-- 
Thomas Munro
https://enterprisedb.com


0001-Remove-the-vestigial-smgr-type.patch
Description: Binary data


Re: reloption to prevent VACUUM from truncating empty pages at the end of relation

2019-02-27 Thread Tom Lane
Alvaro Herrera  writes:
> Hopefully we'll get Tom's patch that addresses the failure-to-truncate
> issues in pg12.

Hm, are you speaking of the handwaving I did in
https://www.postgresql.org/message-id/2348.1544474...@sss.pgh.pa.us
?

I wasn't really working on that for v12 --- I figured it was way
too late in the cycle to be starting on such a significant change.
Still, if we did manage to make that work, it would remove the need
for user-visible kluges like the one discussed in this thread.

regards, tom lane



Re: POC: converting Lists into arrays

2019-02-27 Thread Andrew Gierth
> "David" == David Rowley  writes:

 David> I went and had a few adventures with this patch to see if I
 David> could figure out why the small ~1% regression exists.

Just changing the number of instructions (even in a completely unrelated
place that's not called during the test) can generate performance
variations of this size, even when there's no real difference.

To get a reliable measurement of timing changes less than around 3%,
what you have to do is this: pick some irrelevant function and add
something like an asm directive that inserts a variable number of NOPs,
and do a series of test runs with different values.
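
For illustration, the padding can be injected roughly like this (GCC/Clang
with the GNU assembler; PAD_NOPS and the function name are made up, not
taken from any real test harness):

/* Pad an otherwise-uncalled function with a configurable number of
 * one-byte NOPs; rebuild and re-run the benchmark for several values,
 * e.g. -DPAD_NOPS=0, 8, 16, 24, ... */
#ifndef PAD_NOPS
#define PAD_NOPS 0
#endif

#define PAD_STR_(x) #x
#define PAD_STR(x) PAD_STR_(x)

void
irrelevant_padding_function(void)	/* never called during the test */
{
	__asm__ volatile (".rept " PAD_STR(PAD_NOPS) "\n\tnop\n\t.endr");
}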

See http://tinyurl.com/op9qg8a for an example of the kind of variation
that one can get; this plot records timing runs where each different
padding size was tested 3 times (non-consecutively, so you can see how
repeatable the test result is for each size), each timing is actually
the average of the last 10 of 11 consecutive runs of the test.

To establish a 1% performance benefit or regression you need to show
that there's still a difference _AFTER_ taking this kind of
spooky-action-at-a-distance into account. For example, in the test shown
at the link, if a substantive change to the code moved the upper and
lower bounds of the output from (6091,6289) to (6030,6236) then one
would be justified in claiming it as a 1% improvement.

Such is the reality of modern CPUs.

-- 
Andrew (irc:RhodiumToad)



Re: POC: converting Lists into arrays

2019-02-27 Thread Tom Lane
David Rowley  writes:
> I went and had a few adventures with this patch to see if I could
> figure out why the small ~1% regression exists.

Thanks for poking around!

> ... I had
> suspected it was the lcons() calls being expensive because then need
> to push the elements up one place each time, not something that'll
> scale well with larger lists.

I just did some looking around at lcons() calls, and it's hard to identify
any that seem like they would be performance-critical.  I did notice a
number of places that think that lcons'ing an item onto a list, and later
stripping it off with list_delete_first, is efficient.  With the new
implementation it's far cheaper to lappend and then list_truncate instead,
at least if the list is long.  If the list order matters then that's not
an option, but I found some places where it doesn't matter so we could get
an easy win.  Still, it wasn't obvious that this would move the needle at
all.
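
To sketch the two idioms with the existing List API (variable names here
are purely illustrative):

	/* "item" stands for whatever node is being pushed. */
	List	   *stack = NIL;
	void	   *item = NULL;
	void	   *top;

	/* Head as a stack: with the array representation every lcons() has
	 * to shift the existing elements up, and list_delete_first() shifts
	 * them back down. */
	stack = lcons(item, stack);					/* push */
	top = linitial(stack);						/* peek */
	stack = list_delete_first(stack);			/* pop */

	/* Tail as a stack: cheap with arrays, usable when the order of the
	 * remaining cells doesn't matter to other readers of the list. */
	stack = lappend(stack, item);				/* push */
	top = llast(stack);							/* peek */
	stack = list_truncate(stack, list_length(stack) - 1);	/* pop */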

> I then tried hacking at the foreach() macro after wondering if the
> lnext() call was somehow making things difficult for the compiler to
> predict what cell would come next.

Yeah, my gut feeling right now is that foreach() is producing crummy
code, though it's not obvious why it would need to.  Maybe some
micro-optimization is called for.  But I've not had time to pursue it.

> The only thing that I did to manage to speed the patch up was to ditch
> the additional NULL test in lnext().  I don't see why that's required
> since lnext(NULL) would have crashed with the old implementation.

Hmmm ...

> Perhaps if we're not going to see gains from the patch alone then
> we'll need to tag on some of the additional stuff that will take
> advantage of list_nth() being fast and test the performance of it all
> again.

Yeah, evaluating this patch in complete isolation is a bit unfair.
Still, it'd be nice to hold the line in advance of any follow-on
improvements.

regards, tom lane



Re: partitioned tables referenced by FKs

2019-02-27 Thread Amit Langote
Hi Alvaro,

I looked at the latest patch and most of the issues/bugs that I was going
to report based on the late January version of the patch seem to have been
taken care of; sorry that I couldn't post sooner, which would've saved you
some time.  The patch needs to be rebased on top of ff11e7f4b9, which
touched ri_triggers.c.

In the following case:

create table pk (a int primary key) partition by list (a);
create table pk1 (a int primary key);
create table fk (a int references pk1);

-- adds FK referencing pk1 via ATAddForeignKeyConstraint recursion
alter table pk attach partition pk1 for values in (1);
alter table fk add foreign key (a) references pk;

or:

-- adds FK referencing pk1 via CloneForeignKeyConstraints
alter table fk add foreign key (a) references pk;
alter table pk attach partition pk1 for values in (1);

\d fk
 Table "public.fk"
 Column │  Type   │ Collation │ Nullable │ Default
┼─┼───┼──┼─
 a  │ integer │   │  │
Foreign-key constraints:
"fk_a_fkey" FOREIGN KEY (a) REFERENCES pk1(a)
"fk_a_fkey1" FOREIGN KEY (a) REFERENCES pk(a)
"fk_a_fkey2" FOREIGN KEY (a) REFERENCES pk1(a)

Could we have avoided creating fk_a_fkey2, which must be the same as fk_a_fkey
as far as the user-visible behavior is concerned?


* Regarding 0003

I'd like to hear your thoughts on some suggestions to alter the structure
of the reorganized code around foreign key addition/cloning.  With this
patch adding support for foreign keys to reference partitioned tables, the
code now has to consider various cases due to the possibility of having
partitioned tables on both sides of a foreign key, which is reflected in
the complexity of the new code.  Following functions have been introduced
in the foreign key DDL code in the course of adding support for partitioning:

addFkRecurseReferenced
addFkRecurseReferencing
CloneForeignKeyConstraints
CloneFkReferencing
CloneFkReferenced
tryAttachPartitionForeignKey

We have addFkRecurseReferencing and CloneForeignKeyConstraints calling
CloneFkReferencing to recursively add a foreign key constraint being
defined on (or being cloned to) a partitioned table to its partitions.
The latter invokes tryAttachPartitionForeignKey to see if an existing
foreign key constraint is functionally same as one being cloned to avoid
creating a duplicate constraint.

However, we have CloneFkReferenced() calling addFkRecurseReferenced, not
the other way around.  Neither of these functions attempts to reuse an
existing, functionally equivalent, foreign key constraint referencing a
given partition that's being recursed to.  So we get the behavior in the
above example, which again, I'm not sure is intentional.

Also, it seems a bit confusing that there is a CreateConstraintEntry call
in addFkRecurseReferenced() which is creating a constraint on the
*referencing* relation, which I think should be in
ATAddForeignKeyConstraint, the caller.  I know we need to create a copy of
the constraint to reference the partitions of the referenced table, but we
might as well put it in CloneFkReferenced and reverse who calls who --
make addFkRecurseReferenced() call CloneFkReferenced and have the code to
create the cloned constraint and action triggers in the latter.  That will
make the code to handle different sides of foreign key look similar, and
imho, easier to follow.


+/*
+ * addFkRecurseReferenced
+ *  recursive subroutine for ATAddForeignKeyConstraint, referenced side

How about:

Subroutine for ATAddForeignKeyConstraint to add action trigger on
referenced relation, recursing if partitioned, in which case, both the
constraint referencing a given partition and the action trigger are cloned
in all partitions

+/*
+ * addFkRecurseReferencing
+ *  recursive subroutine for ATAddForeignKeyConstraint, referencing side

How about:

Subroutine for ATAddForeignKeyConstraint to add check trigger on
referencing relation, recursing if partitioned, in which case, both the
constraint and the check trigger are cloned in all partitions

I noticed however that this function itself is not recursively called;
CloneFkReferencing handles the recursion.

/*
 * CloneForeignKeyConstraints
 *  Clone foreign keys from a partitioned table to a newly acquired
 *  partition.

Maybe:

Clone foreign key of (or referencing) a partitioned table to a newly
acquired partition


* In 0002,

 /*
+ * Return the OID of the index, in the given partition, that is a child
of the
+ * given index or InvalidOid if there isn't one.
+ */
+Oid
+index_get_partition(Relation partition, Oid indexId)

Maybe add a comma between "...given index" and "or InvalidOid", or maybe
rewrite the sentence as:

Return the OID of the index of the given partition that is a child of the
given index, or InvalidOid if there isn't one.


* Unrelated to this thread, but while reviewing, I noticed this code in
ATExecAttachPartitionIdx:

/* no deadlock risk: RangeVarGetRelidExtended already acquire

Re: reloption to prevent VACUUM from truncating empty pages at the end of relation

2019-02-27 Thread Alvaro Herrera
On 2019-Feb-28, Tsunakawa, Takayuki wrote:

> From: Michael Paquier [mailto:mich...@paquier.xyz]
> > So could you consider adding an option for the VACUUM command as well
> > as vacuumdb?  The interaction with the current patch is that you need to
> > define the behavior at the beginning of vacuum for a given heap, instead
> > of reading the parameter at the time the truncation happens, and give
> 
> I'm not confident whether this is the same as the above, I imagined this:
> 
> * Add a new USERSET GUC vacuum_shrink_table = {on | off}, on by default.
> This follows the naming style "verb_object" like log_connections and 
> enable_xxx.  We may want to use enable_vacuum_shrink or something like that, 
> but enable_xxx seems to be used solely for planner control.  Plus, 
> vacuum-related parameters seem to start with vacuum_.

Robert used the phrase "attractive nuisance", which maybe sounds like a
good thing to have to a non native speaker, but it actually isn't -- he
was saying we should avoid a GUC at all, and I can see the reason for
that.  I think we should have a VACUUM option and a reloption, but no
GUC.  The default should always be to shrink, unless either the VACUUM
option or the reloption turn that off.  (So it doesn't make sense to set
either the VACUUM option or the reloption to "on").

Disclaimer: I did write roughly the same patch using both a GUC and a
VACUUM option, though I named my GUC truncate_on_vacuum and the VACUUM
option "truncate_anyway" (so you can turn truncation off globally, then
enable it selectively at manual vacuum execution time, but not
autovacuum).  However, the reason for doing this was concerns about
robustness caused by data corruption induced by failing to truncate
pages containing removed tuples ... not performance improvement, as in your
patch.  So they wanted to turn off truncation for *all* tables, not just
a select few.

Hopefully we'll get Tom's patch that addresses the failure-to-truncate
issues in pg12.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: POC: converting Lists into arrays

2019-02-27 Thread David Rowley
On Tue, 26 Feb 2019 at 18:34, Tom Lane  wrote:
>
> David Rowley  writes:
> > Using the attached patch (as text file so as not to upset the CFbot),
> > which basically just measures and logs the time taken to run
> > pg_plan_query. ...
> > Surprisingly it took 1.13% longer.  I did these tests on an AWS
> > md5.large instance.
>
> Interesting.  Seems to suggest that maybe the cases I discounted
> as being infrequent aren't so infrequent?  Another possibility
> is that the new coding adds more cycles to foreach() loops than
> I'd hoped for.

I went and had a few adventures with this patch to see if I could
figure out why the small ~1% regression exists.  Profiling did not
prove very useful as I saw none of the list functions come up.  I had
suspected it was the lcons() calls being expensive because they need
to push the elements up one place each time, not something that'll
scale well with larger lists.  After changing things so that a new
"first" element index in the List would allow new_head_cell() to just
move everything to the end of the list and mark the start of the
actual data...  I discovered that slowed things down further... Likely
due to all the additional arithmetic work required to find the first
element.

I then tried hacking at the foreach() macro after wondering if the
lnext() call was somehow making things difficult for the compiler to
predict what cell would come next. I experimented with the following
monstrosity:

for ((cell) = list_head(l);
     ((cell) && (cell) < &((List *) l)->elements[((List *) l)->first + ((List *) l)->length]) ||
     (cell = NULL) != NULL;
     cell++)

it made things worse again...  It ended up much uglier than I
thought it would, as I had to account for an empty list being NIL
and the fact that we need to make cell NULL after the loop is over.

I tried a few other things... I didn't agree with your memmove() in
list_concat(). I think memcpy() is fine, even when the list pointers
are the same since we never overwrite any live cell values. Strangely
I found memcpy slower than memmove... ?

The only thing that I did to manage to speed the patch up was to ditch
the additional NULL test in lnext().  I don't see why that's required
since lnext(NULL) would have crashed with the old implementation.
Removing this changed the 1.13% regression into a ~0.8% regression,
which at least does show that the foreach() implementation can have an
effect on performance.

> Anyway, it's just a POC; the main point at this stage is to be
> able to make such comparisons at all.  If it turns out that we
> *can't* make this into a win, then all that bellyaching about
> how inefficient Lists are was misinformed ...

My primary concern is how much we bend over backwards because
list_nth() performance is not O(1). I know from my work on
partitioning that ExecInitRangeTable()'s building of
es_range_table_array has a huge impact for PREPAREd plans for simple
PK lookup SELECT queries to partitioned tables with a large number of
partitions, of which only 1 will survive run-time pruning.  I
could get the execution speed of such a query with 300 partitions to
within 94% of the non-partitioned version if the rtable could be
looked up in O(1) in the executor natively (plus some changes to
ExecCheckRTPerms() to have it skip rtable entries that don't require
permission checks).

Perhaps if we're not going to see gains from the patch alone then
we'll need to tag on some of the additional stuff that will take
advantage of list_nth() being fast and test the performance of it all
again.

Attached is the (mostly worthless) series of hacks I made to your
patch.  It might save someone some time if they happened to wonder the
same thing as I did.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


list_hacks.patch
Description: Binary data


Re: pgsql: Avoid creation of the free space map for small heap relations, t

2019-02-27 Thread John Naylor
On Thu, Feb 28, 2019 at 10:25 AM Amit Kapila  wrote:
>
> Here's an updated patch based on comments by you.  I will proceed with
> this unless you have any more comments.

Looks good to me. I would just adjust the grammar in the comment, from
"This prevents us to use the map", to "This prevents us from using the
map". Perhaps also a comma after "first".

-- 
John Naylorhttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: POC: converting Lists into arrays

2019-02-27 Thread Tom Lane
David Rowley  writes:
> On Thu, 28 Feb 2019 at 09:26, Tom Lane  wrote:
>> 0002 below does this.  I'm having a hard time deciding whether this
>> part is a good idea or just code churn.  It might be more readable
>> (especially to newbies) but I can't evaluate that very objectively.

> I'm less decided on this. Having this now means you'll need to break
> the signature of the macro the same way as you'll need to break
> lnext(). It's perhaps easier to explain in the release notes about
> lnext() having changed so that extension authors can go fix their code
> (probably they'll know already from compile failures, but ). On
> the other hand, if the list_cell_is_last() is new, then there will be
> no calls to that in extensions anyway.  Maybe it's better to do it at
> the same time as the List reimplementation to ensure nobody needs to
> change anything twice?

Yeah, I was considering the idea of setting up the macro as
"list_cell_is_last(list, cell)" from the get-go, with the first
argument just going unused for the moment.  That would be a good
way to back-patch it if we go through with this.  On the other hand,
if we end up not pulling the trigger on the main patch, that'd
look pretty silly ...

regards, tom lane



Re: POC: converting Lists into arrays

2019-02-27 Thread David Rowley
On Thu, 28 Feb 2019 at 09:26, Tom Lane  wrote:
>
> I wrote:
> > I did find a number of places where getting rid of explicit lnext()
> > calls led to just plain cleaner code.  Most of these were places that
> > could be using forboth() or forthree() and just weren't.  There's
> > also several places that are crying out for a forfour() macro, so
> > I'm not sure why we've stubbornly refused to provide it.  I'm a bit
> > inclined to just fix those things in the name of code readability,
> > independent of this patch.
>
> 0001 below does this.  I found a couple of places that could use
> forfive(), as well.  I think this is a clear legibility and
> error-proofing win, and we should just push it.

I've looked over this and I agree that it's a good idea.  Reducing the
number of lnext() usages seems like a good idea in order to reduce the
footprint of the main patch.

The only thing of interest that I saw during the review was the fact
that you've chosen to assign colexpr and coldefexpr before the
continue in get_tablefunc().  We may not end up using those values if
we find an ordinality column. I'm pretty sure it's not worth breaking
the mould for that case though, but just noting it anyway.

> > I also noticed that there's quite a lot of places that are testing
> > lnext(cell) for being NULL or not.  What that really means is whether
> > this cell is last in the list or not, so maybe readability would be
> > improved by defining a macro along the lines of list_cell_is_last().
> > Any thoughts about that?
>
> 0002 below does this.  I'm having a hard time deciding whether this
> part is a good idea or just code churn.  It might be more readable
> (especially to newbies) but I can't evaluate that very objectively.
> I'm particularly unsure about whether we need two macros; though the
> way I initially tried it with just list_cell_is_last() seemed kind of
> double-negatively confusing in the places where the test needs to be
> not-last.  Also, are these macro names too long, and if so what would
> be better?

I'm less decided on this. Having this now means you'll need to break
the signature of the macro the same way as you'll need to break
lnext(). It's perhaps easier to explain in the release notes about
lnext() having changed so that extension authors can go fix their code
(probably they'll know already from compile failures, but ). On
the other hand, if the list_cell_is_last() is new, then there will be
no calls to that in extensions anyway.  Maybe it's better to do it at
the same time as the List reimplementation to ensure nobody needs to
change anything twice?

> Also: if we accept either or both of these, should we back-patch the
> macro additions, so that these new macros will be available for use
> in back-patched code?  I'm not sure that forfour/forfive have enough
> use-cases to worry about that; but the is-last macros would have a
> better case for that, I think.

I see no reason not to put forfour() and forfive() in the back branches.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Why don't we have a small reserved OID range for patch revisions?

2019-02-27 Thread Tom Lane
Peter Eisentraut  writes:
> On 2019-02-27 22:27, Tom Lane wrote:
>>> OID collision doesn't seem to be a significant problem (for me).

>> Um, I beg to differ.  It's not at all unusual for pending patches to
>> bit-rot for no reason other than suddenly getting an OID conflict.
>> I don't have to look far for a current example:

> I'm not saying it doesn't happen, but that it's not a significant
> problem overall.

I do not think you are correct.  It may not be a big problem across
all our incoming patches, but that's only because most of them don't
have anything to do with hand-assigned OIDs.  Among those that do,
I think there is a significant problem.

To try to quantify this a bit, I looked through v12-cycle and pending
patches that touch the catalog data.

We've only committed 12 patches adding new hand-assigned OIDs since v11
was branched off.  (I suspect that's lower than in a typical cycle,
but have not attempted to quantify things further back.)  Of those,
only two seem to have needed OID adjustments after initial posting,
but that's mostly because most of them were committer-originated patches
that got pushed within a week or two.  That's certainly not the typical
wait time for a patch submitted by anybody else.  Also, a lot of these
patches recycled OIDs that'd recently been freed by patches such as the
abstime-ectomy, which means that the amount of OID conflict created for
pending patches is probably *really* low in this cycle-so-far, compared
to our historical norms.

Of what's in the queue to be reviewed right now, there are just
20 (out of 150-plus) patches that touch the catalog/*.dat files.
I got this number by groveling through the cfbot's reports of
patch applications, to see which patches touched those files.
It might omit some patches that the cfbot failed to make sense of.
Also, I'm pretty sure that a few of these patches don't actually
assign any new OIDs but just change existing entries, or create
only entries with auto-assigned OIDs.  I did not try to separate
those out, however, since the point here is to estimate for how
many patches a committer would even need to think about this.

Of those twenty patches, three have unresolved OID conflicts
right now:

multivariate MCV lists and histograms
commontype polymorphics
log10/hyper functions

Another one has recently had to resolve an OID conflict:

SortSupport implementation on inet/cdir 

which is notable considering that that thread is less than three weeks
old.  (The log10 and commontype threads aren't really ancient either.)

I spent some extra effort looking at the patches that both create more
than a few new OIDs and have been around for awhile:

Generic type subscripting
KNN for btree
Custom compression methods
SQL/JSON: functions
SQL/JSON: jsonpath
Generated columns
BRIN bloom indexes

The first four of those have all had to reassign OIDs during their
lifetime.  jsonpath has avoided doing so by choosing fairly high-numbered
OIDs (6K range) to begin with; which I trust you will agree is a solution
that doesn't scale for long.  I'm not entirely sure that the last two
haven't had to renumber OIDs; I ran out of energy before poking through
their history in detail.

In short, this situation may look fine from the perspective of a committer
with a relatively short timeline to commit, but it's pretty darn awful for
everybody else.  The only way to avoid a ~ 50% failure rate is to choose
OIDs above 6K, and once everybody starts doing it like that, things are
going to get very unpleasant very quickly.

regards, tom lane



RE: reloption to prevent VACUUM from truncating empty pages at the end of relation

2019-02-27 Thread Tsunakawa, Takayuki
From: Michael Paquier [mailto:mich...@paquier.xyz]
> So could you consider adding an option for the VACUUM command as well
> as vacuumdb?  The interaction with the current patch is that you need to
> define the behavior at the beginning of vacuum for a given heap, instead
> of reading the parameter at the time the truncation happens, and give

I'm not sure whether this is the same as the above; I imagined this:

* Add a new USERSET GUC vacuum_shrink_table = {on | off}, on by default.
This follows the naming style "verb_object" like log_connections and 
enable_xxx.  We may want to use enable_vacuum_shrink or something like that, 
but enable_xxx seems to be used solely for planner control.  Plus, 
vacuum-related parameters seem to start with vacuum_.

* Give priority to the reloption, because it's targeted at a particular table.  
If the reloption is not set, the GUC takes effect.

* As a consequence, the user can change the behavior of the VACUUM command by 
SETting the GUC in the same session in advance, when the reloption is not set.  
If the reloption is set, the user can ALTER TABLE SET, VACUUM, and ALTER TABLE 
again to restore the table's setting.  But I don't think this use case (changing 
whether to shrink per VACUUM command execution) is necessary; it is merely 
possible.

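For concreteness, a hypothetical usage sketch (only the GUC name 
vacuum_shrink_table comes from the proposal above; the reloption name and the 
table are invented):

SET vacuum_shrink_table = off;     -- session-wide default, ALTER DATABASE/USER SET also possible

-- The reloption (name invented here) would take priority over the GUC:
ALTER TABLE hot_queue SET (vacuum_shrink_table = off);
VACUUM hot_queue;                  -- no truncation attempted
ALTER TABLE hot_queue RESET (vacuum_shrink_table);
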

Regards
Takayuki Tsunakawa






Re: psql display of foreign keys

2019-02-27 Thread Michael Paquier
On Wed, Feb 27, 2019 at 03:37:23PM -0300, Alvaro Herrera wrote:
> It should have listed t2 too, but it doesn't.  Since these functions
> aren't supposed to work on legacy inheritance anyway, I think the right
> action is to return the empty set.  In the current version I just do
> what pg_partition_tree does, but I think we should adjust that behavior.
> I'll start a new thread about that.

Yes, that's not good.  The internal wrapper for ancestors should be
reworked.  The results of pg_partition_tree are what I would expect
them to be though?  Taking your example, t111 gets listed if listing
the trees from t1 or t2.  This seems natural to me.  I am wondering
about the amount of work it would take to actually have the function
return both relations in this case...

> +pg_partition_ancestors(PG_FUNCTION_ARGS)
> +{
> + Oid relid = PG_GETARG_OID(0);
> + FuncCallContext *funcctx;
> + ListCell  **next;
> +
> + if (!check_rel_can_be_partition(relid))
> + PG_RETURN_NULL();

Not returning an empty set here? ;)

I would have added tests with pg_partition_ancestors(NULL) and
pg_partition_ancestors(0) for consistency with the rest.
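That is, something like the following, with the expected output depending
on whichever behavior we settle on:

SELECT * FROM pg_partition_ancestors(NULL);
SELECT * FROM pg_partition_ancestors(0);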

Except that and the ancestor tracking for inheritance, the shape of
the patch looks good to me.
--
Michael


signature.asc
Description: PGP signature


Re: pgsql: Avoid creation of the free space map for small heap relations, t

2019-02-27 Thread Amit Kapila
On Thu, Feb 28, 2019 at 8:10 AM Amit Kapila  wrote:
>
> On Thu, Feb 28, 2019 at 7:45 AM John Naylor  
> wrote:
> >
> > On Thu, Feb 28, 2019 at 7:24 AM Amit Kapila  wrote:
> > > > The flaw in my thinking was treating extension too similarly to
> > > > finding an existing block. With your patch clearing the local map in
> > > > the correct place, it seems the call at hio.c:682 is now superfluous?
> > >
> > > What if get some valid block from the first call to
> > > GetPageWithFreeSpace via local map which has required space?  I think
> > > in that case we will need the call at hio.c:682.  Am I missing
> > > something?
> >
> > Are you referring to the call at line 393? Then the map will be
> > cleared on line 507 before we return that buffer.
> >
>
> That's correct; I hadn't looked at the line number very carefully and
> assumed that you were saying to remove the call at line 507.  I also think
> that the call at line 682 doesn't seem to be required, but I would
> prefer to keep it for the sake of consistency in this function and for
> safety.  However, I think we should add a comment along the lines
> suggested by you.  I will send the patch after making these
> modifications.
>

Here's an updated patch based on comments by you.  I will proceed with
this unless you have any more comments.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


fix_loc_map_clear_3.patch
Description: Binary data


Re: A note about recent ecpg buildfarm failures

2019-02-27 Thread Mark Wong
On Tue, Feb 26, 2019 at 01:25:29PM -0500, Tom Lane wrote:
> Since my commits 9e138a401 et al on Saturday, buildfarm members
> blobfish, brotula, and wunderpus have been showing core dumps
> in the ecpg preprocessor.  This seemed inexplicable given what
> the commits changed, and even more so seeing that only HEAD is
> failing, while the change was back-patched into all branches.
> 
> Mark Wong and I poked into this off-list, and what we find is that
> this seems to be a compiler bug.  Those animals are all running
> nearly the same version of clang (3.8.x / ppc64le).  Looking into
> the assembly code for preproc.y, the crash is occurring at a branch
> that is supposed to jump forward exactly 32768 bytes, but according
> to gdb's disassembler it's jumping backwards exactly -32768 bytes,
> into invalid memory.  It will come as no surprise to hear that the
> branch displacement field in PPC conditional branches is 16 bits
> wide, so that positive 32768 doesn't fit but negative 32768 does.
> Evidently what is happening is that either the compiler or the
> assembler is failing to detect the edge-case field overflow and
> switch to different coding.  So the apparent dependency on 9e138a401
> is because that happened to insert exactly the right number of
> instructions in-between to trigger this scenario.  It's pure luck we
> didn't trip over it before, although none of those buildfarm animals
> have been around for all that long.
> 
> Moral: don't use clang 3.8.x on ppc64.  I think Mark is going
> to upgrade those animals to some more recent compiler version.

I've tried clang 3.9 and 4.0 by hand and they seem to be ok.  These were
the other two readily available versions on Debian stretch.

I'll stop those other clang-3.8 animals...

Regards,
Mark

--
Mark Wong
2ndQuadrant - PostgreSQL Solutions for the Enterprise
https://www.2ndQuadrant.com/



RE: Timeout parameters

2019-02-27 Thread Nagaura, Ryohei
Hi,

I rewrote the two TCP_USER_TIMEOUT patches.
I changed the third argument of setsockopt() from the literal 18 to
TCP_USER_TIMEOUT.

This revision has the following two merits:
* It improves the readability of the source.
* Even if the definition of TCP_USER_TIMEOUT changes, the code is not affected.
  E.g., on current Linux TCP_USER_TIMEOUT is defined as 18, but even if it were
  changed to 19, the code would keep working.

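As a minimal illustration of the point, independent of the patches themselves
(fd is assumed to be an already-connected TCP socket on Linux; the timeout
value is arbitrary):

#include <netinet/in.h>
#include <netinet/tcp.h>		/* defines TCP_USER_TIMEOUT (currently 18) */
#include <sys/socket.h>

static int
set_user_timeout(int fd)
{
	unsigned int timeout_ms = 30 * 1000;	/* example: 30 seconds */

	/* Using the symbolic name keeps working even if the numeric value
	 * of the option were ever changed. */
	return setsockopt(fd, IPPROTO_TCP, TCP_USER_TIMEOUT,
					  &timeout_ms, sizeof(timeout_ms));
}
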
Best regards,
-
Ryohei Nagaura



TCP_backend_v8.patch
Description: TCP_backend_v8.patch


TCP_interface_v8.patch
Description: TCP_interface_v8.patch


socket_timeout_v7.patch
Description: socket_timeout_v7.patch


RE: proposal: pg_restore --convert-to-text

2019-02-27 Thread Imai, Yoshikazu
Hi,

On Tue, Feb 19, 2019 at 8:20 PM, Euler Taveira wrote:
> Em seg, 18 de fev de 2019 às 19:21, Tom Lane  escreveu:
> >
> > Euler Taveira  writes:
> > > Since no one has stepped up, I took a stab at it. It will prohibit
> > > standard output unless '-f -' be specified. -l option also has the
> > > same restriction.
> >
> > Hm, don't really see the need to break -l usage here.
> >
> After thinking about it, revert it.
> 
> > Pls add to next CF, if you didn't already.
> >
> Done.

I saw the patch.

Is there no need to rewrite the Description in the docs to state that we should 
specify either the -d or the -f option?
(It might also be better to note that if the -l option is given, neither -d nor 
-f is necessarily needed.)


I also have a simple question about the code.

I thought the below if-else condition

+   if (filename && strcmp(filename, "-") == 0)
+       fn = fileno(stdout);
+   else if (filename)
        fn = -1;
    else if (AH->FH)

can also be written by the form below.

if (filename)
{
    if (strcmp(filename, "-") == 0)
        fn = fileno(stdout);
    else
        fn = -1;
}
else if (AH->FH)

I think the former one looks prettier, but which one is preferred?

--
Yoshikazu Imai



Re: pgsql: Avoid creation of the free space map for small heap relations, t

2019-02-27 Thread Amit Kapila
On Thu, Feb 28, 2019 at 7:45 AM John Naylor  wrote:
>
> On Thu, Feb 28, 2019 at 7:24 AM Amit Kapila  wrote:
> > > The flaw in my thinking was treating extension too similarly to
> > > finding an existing block. With your patch clearing the local map in
> > > the correct place, it seems the call at hio.c:682 is now superfluous?
> >
> > What if get some valid block from the first call to
> > GetPageWithFreeSpace via local map which has required space?  I think
> > in that case we will need the call at hio.c:682.  Am I missing
> > something?
>
> Are you referring to the call at line 393? Then the map will be
> cleared on line 507 before we return that buffer.
>

That's correct; I hadn't looked at the line number very carefully and
assumed that you were saying to remove the call at line 507.  I also think
that the call at line 682 doesn't seem to be required, but I would
prefer to keep it for the sake of consistency in this function and for
safety.  However, I think we should add a comment along the lines
suggested by you.  I will send the patch after making these
modifications.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Allowing extensions to supply operator-/function-specific info

2019-02-27 Thread Tom Lane
Paul Ramsey  writes:
> I added three indexes to my test table:
>   CREATE INDEX foo_g_gist_x ON foo USING GIST (g);
>   CREATE INDEX foo_g_gist_nd_x ON foo USING GIST (g gist_geometry_ops);
>   CREATE INDEX foo_g_spgist_x ON foo USING SPGIST (g);
> They all support the overlaps (&&) operator.

> So, SupportRequestIndexCondition happens three times, and each time I say 
> “yep, sure, you can construct an index condition by putting the && operator 
> between left_arg and right_arg”.

Sounds right.

> How does the planner end up deciding on which index to *actually* use?

It's whichever has the cheapest cost estimate.  In case of an exact tie,
I believe it'll choose the index with lowest OID (or maybe highest OID,
not sure).

> The selectivity is the same, the operator is the same. I found that I got the 
> ND GIST one first, then the SPGIST and finally the 2d GIST, which is 
> unfortunate, because the 2D and SPGIST are almost certainly faster than the 
> ND GIST.

Given that it'll be the same selectivity, the cost preference is likely to
go to whichever index is physically smallest, at least for indexes of the
same type.  When they're not the same type there might be an issue with
the index AM cost estimators not being lined up very well as to what they
account for and how.

I don't doubt that there's plenty of work to be done in making the cost
estimates better in cases like this --- in particular, I don't think we
have any way of accounting for the idea that one index opclass might be
smarter than another one for the same query, unless that shakes out as a
smaller index.  But you'd have had the same issues with the old approach.

regards, tom lane



Re: reloption to prevent VACUUM from truncating empty pages at the end of relation

2019-02-27 Thread Michael Paquier
On Thu, Feb 28, 2019 at 01:05:07AM +, Tsunakawa, Takayuki wrote:
> From: Robert Haas [mailto:robertmh...@gmail.com]
>> I don't think that a VACUUM option would be out of place, but a GUC
>> sounds like an attractive nuisance to me.  It will encourage people to
>> just flip it blindly instead of considering the particular cases where
>> they need that behavior, and I think chances are good that most people
>> who do that will end up being sad.

I won't disagree with you on that.  I hear enough about people
disappointed that VACUUM does not clean up their garbage enough and
that tables are bloated...  And making autovacuum too aggressive is no
good either.

> Ouch, I sent my previous mail before reading this.  I can understand
> it may be cumbersome to identify and specify each table, so I tend
> to agree with a parameter in postgresql.conf, which would be USERSET to allow
> ALTER DATABASE/USER SET to tune specific databases and applications.
> But should the vacuuming of system catalogs also follow this
> setting?

So could you consider adding an option for the VACUUM command as
well as vacuumdb?  The interaction with the current patch is that you
need to define the behavior at the beginning of vacuum for a given
heap, instead of reading the parameter at the time the truncation
happens, and give priority to the command-level option.
--
Michael


signature.asc
Description: PGP signature


Re: pg_partition_tree crashes for a non-defined relation

2019-02-27 Thread Amit Langote
Hi,

On 2019/02/28 10:45, Michael Paquier wrote:
> On Wed, Feb 27, 2019 at 03:48:08PM -0300, Alvaro Herrera wrote:
>> I just happened to come across the result of this rationale in
>> pg_partition_tree() (an SRF) while developing a new related function,
>> pg_partition_ancestors(), and find the resulting behavior rather absurd
>> -- it returns one row with all NULL columns, rather than no rows.  I
>> think the sensible behavior would be to do SRF_RETURN_DONE() before
>> stashing any rows to the output, so that we get an empty result set
>> instead.
> 
> Hmm.  Going through the thread again NULL was decided to make the
> whole experience consistent, now by returning nothing we would get
> a behavior as consistent as when NULL is used in input, so point taken
> to tune the behavior for unsupported relkinds and undefined objects.
> 
> Does the attached look fine to you?

Reading the discussion, we just don't want to throw an "ERROR: unsupported
relkind" error, to avoid, for example, aborting a query that's iteratively
calling pg_partition_tree on all relations in a given set.  Returning no
rows at all seems like a better way of ignoring such relations.

with rels as (select relname from pg_class where relnamespace =
'public'::regnamespace) select pg_partition_tree(relname::regclass) from
rels;
   pg_partition_tree
------------------------
 (pk1,pk,t,0)
 (pk,,f,0)
 (pk1,pk,t,1)
 (pk1_pkey,pk_pkey,t,0)
 (pk_pkey,,f,0)
 (pk1_pkey,pk_pkey,t,1)

 (another_pk1,another_pk,t,0)
 (another_pk,,f,0)
 (another_pk1,another_pk,t,1)
 (another_pk1_pkey,another_pk_pkey,t,0)
 (another_pk_pkey,,f,0)
 (another_pk1_pkey,another_pk_pkey,t,1)
(13 rows)

Now that Alvaro mentions it, I too think that returning no rows at all is
better than 1 row filled with all NULLs, so +1.

Thanks,
Amit




Re: pgsql: Avoid creation of the free space map for small heap relations, t

2019-02-27 Thread John Naylor
On Thu, Feb 28, 2019 at 7:24 AM Amit Kapila  wrote:
> > The flaw in my thinking was treating extension too similarly to
> > finding an existing block. With your patch clearing the local map in
> > the correct place, it seems the call at hio.c:682 is now superfluous?
>
> What if get some valid block from the first call to
> GetPageWithFreeSpace via local map which has required space?  I think
> in that case we will need the call at hio.c:682.  Am I missing
> something?

Are you referring to the call at line 393? Then the map will be
cleared on line 507 before we return that buffer.

-- 
John Naylorhttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: readdir is incorrectly implemented at Windows

2019-02-27 Thread Michael Paquier
On Mon, Feb 25, 2019 at 06:38:16PM +0300, Konstantin Knizhnik wrote:
> Small issue with the readdir implementation for Windows.
> Right now it returns ENOENT in case of any error returned by FindFirstFile.
> So all places in Postgres where opendir/listdir are used will assume that
> the directory is empty and
> do nothing without reporting any error.
> It is not so good if the directory is actually not empty but there are not
> enough permissions for accessing the directory and FindFirstFile
> returns ERROR_ACCESS_DENIED:

Yeah, I think that you are right here.  We should have a clean error
if permissions are incorrect, and your patch does what is mentioned in
the Windows docs:
https://docs.microsoft.com/en-us/windows/desktop/api/fileapi/nf-fileapi-findfirstfilea

Could you add it to the next commit fest as a bug fix please?  I think
that I will be able to look at that in detail soon, but if not it
would be better to not lose track of your fix.
--
Michael


signature.asc
Description: PGP signature


RE: Protect syscache from bloating with negative cache entries

2019-02-27 Thread Tsunakawa, Takayuki
From: Ideriha, Takeshi [mailto:ideriha.take...@jp.fujitsu.com]
> I measured the memory context accounting overhead using Tomas's tool
> palloc_bench,
> which he made a while ago in a similar discussion.
> https://www.postgresql.org/message-id/53f7e83c.3020...@fuzzy.cz
> 
> This tool is a little bit outdated so I fixed it but basically I followed
> him.
> Things I did:
> - make one MemoryContext
> - run both palloc() and pfree() for 32kB area 1,000,000 times.
> - And measure this time
> 
> The result shows that master is 30 times faster than the patched one.
> So as Andres mentioned in upper thread it seems it has overhead.
> 
> [master (without v15 patch)]
> 61.52 ms
> 60.96 ms
> 61.40 ms
> 61.42 ms
> 61.14 ms
> 
> [with v15 patch]
> 1838.02 ms
> 1754.84 ms
> 1755.83 ms
> 1789.69 ms
> 1789.44 ms
> 

I'm afraid the measurement is not correct.  First, the older discussion below 
shows that the accounting overhead is much, much smaller, even with a more 
complex accounting.

9.5: Better memory accounting, towards memory-bounded HashAg
https://www.postgresql.org/message-id/flat/1407012053.15301.53.camel%40jeff-desktop

Second, allocation/freeing of memory > 8 KB calls malloc()/free().  I guess the 
accounting overhead is more likely to be hidden under the overhead of 
malloc() and free().  What we'd like to know is the overhead when malloc() and 
free() are not called.

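A rough sketch of the comparison I have in mind, using the ordinary
MemoryContext API (this is not Tomas's palloc_bench; the sizes and loop
counts are arbitrary):

	MemoryContext cxt = AllocSetContextCreate(CurrentMemoryContext,
											  "accounting bench",
											  ALLOCSET_DEFAULT_SIZES);
	MemoryContext oldcxt = MemoryContextSwitchTo(cxt);
	int			i;

	/* Small chunks are served from the context's own blocks, so any
	 * per-context accounting cost shows up directly here ... */
	for (i = 0; i < 1000000; i++)
		pfree(palloc(256));

	/* ... while chunks above the ~8 kB limit each get a dedicated
	 * malloc()'d block, which largely hides the accounting cost. */
	for (i = 0; i < 1000000; i++)
		pfree(palloc(32 * 1024));

	MemoryContextSwitchTo(oldcxt);
	MemoryContextDelete(cxt);
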
And are you sure you didn't enable assert checking?


Regards
Takayuki Tsunakawa




Re: Segfault when restoring -Fd dump on current HEAD

2019-02-27 Thread Michael Paquier
On Wed, Feb 27, 2019 at 12:02:43PM -0500, Tom Lane wrote:
> Alvaro Herrera  writes:
>> I think we should save such a patch for whenever we next update the
>> archive version number, which could take a couple of years given past
>> history.  I'm inclined to add a comment near K_VERS_SELF to remind
>> whoever next patches it.
> 
> +1.  This isn't an unreasonable cleanup idea, but being only a cleanup
> idea, it doesn't seem worth creating compatibility issues for.  Let's
> wait till there is some more-pressing reason to change the archive format,
> and then fix this in the same release cycle.

+1.  Having a comment as reminder would be really nice.
--
Michael


signature.asc
Description: PGP signature


RE: Libpq support to connect to standby server as priority

2019-02-27 Thread Tsunakawa, Takayuki
From: Haribabu Kommi [mailto:kommi.harib...@gmail.com]
> Attached are the updated patches.

Thanks, all look fixed.


> The target_server_type option yet to be implemented.

Please let me review once more and proceed to testing when the above is added, 
to make sure the final code looks good.  I'd like to see how complex the if 
conditions in multiple places would be after adding target_server_type, and 
consider whether we can simplify them together with you.  Even now, the if 
conditions seem complicated to me... that's probably due to the existence of 
read_write_host_index.


Regards
Takayuki Tsunakawa







Re: pg_partition_tree crashes for a non-defined relation

2019-02-27 Thread Michael Paquier
On Wed, Feb 27, 2019 at 03:48:08PM -0300, Alvaro Herrera wrote:
> I just happened to come across the result of this rationale in
> pg_partition_tree() (an SRF) while developing a new related function,
> pg_partition_ancestors(), and find the resulting behavior rather absurd
> -- it returns one row with all NULL columns, rather than no rows.  I
> think the sensible behavior would be to do SRF_RETURN_DONE() before
> stashing any rows to the output, so that we get an empty result set
> instead.

Hmm.  Going through the thread again, NULL was chosen to make the
whole experience consistent; returning nothing would give a behavior
just as consistent as when NULL is used as input, so point taken
to tune the behavior for unsupported relkinds and undefined objects.

Does the attached look fine to you?
--
Michael
diff --git a/src/backend/utils/adt/partitionfuncs.c b/src/backend/utils/adt/partitionfuncs.c
index ffd66b6439..36d9f69cbc 100644
--- a/src/backend/utils/adt/partitionfuncs.c
+++ b/src/backend/utils/adt/partitionfuncs.c
@@ -69,9 +69,6 @@ pg_partition_tree(PG_FUNCTION_ARGS)
 	FuncCallContext *funcctx;
 	ListCell  **next;
 
-	if (!check_rel_can_be_partition(rootrelid))
-		PG_RETURN_NULL();
-
 	/* stuff done only on the first call of the function */
 	if (SRF_IS_FIRSTCALL())
 	{
@@ -82,6 +79,9 @@ pg_partition_tree(PG_FUNCTION_ARGS)
 		/* create a function context for cross-call persistence */
 		funcctx = SRF_FIRSTCALL_INIT();
 
+		if (!check_rel_can_be_partition(rootrelid))
+			SRF_RETURN_DONE(funcctx);
+
 		/* switch to memory context appropriate for multiple function calls */
 		oldcxt = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
 
diff --git a/src/test/regress/expected/partition_info.out b/src/test/regress/expected/partition_info.out
index a884df976f..73269ffd09 100644
--- a/src/test/regress/expected/partition_info.out
+++ b/src/test/regress/expected/partition_info.out
@@ -9,8 +9,7 @@ SELECT * FROM pg_partition_tree(NULL);
 SELECT * FROM pg_partition_tree(0);
  relid | parentrelid | isleaf | level 
 ---+-++---
-   | ||  
-(1 row)
+(0 rows)
 
 SELECT pg_partition_root(NULL);
  pg_partition_root 
@@ -163,14 +162,12 @@ CREATE MATERIALIZED VIEW ptif_test_matview AS SELECT 1;
 SELECT * FROM pg_partition_tree('ptif_test_view');
  relid | parentrelid | isleaf | level 
 ---+-++---
-   | ||  
-(1 row)
+(0 rows)
 
 SELECT * FROM pg_partition_tree('ptif_test_matview');
  relid | parentrelid | isleaf | level 
 ---+-++---
-   | ||  
-(1 row)
+(0 rows)
 
 SELECT pg_partition_root('ptif_test_view');
  pg_partition_root 




RE: reloption to prevent VACUUM from truncating empty pages at the end of relation

2019-02-27 Thread Tsunakawa, Takayuki
From: Robert Haas [mailto:robertmh...@gmail.com]
> I don't think that a VACUUM option would be out of place, but a GUC
> sounds like an attractive nuisance to me.  It will encourage people to
> just flip it blindly instead of considering the particular cases where
> they need that behavior, and I think chances are good that most people
> who do that will end up being sad.

Ouch, I sent my previous mail before reading this.  I can understand it may be 
cumbersome to identify and specify each table, so I tend to agree with a 
parameter in postgresql.conf, made USERSET so that ALTER DATABASE/USER SET can 
tune specific databases and applications.  But should the vacuuming of system 
catalogs also follow this setting?


Regards
Takayuki Tsunakawa




Re: BUG #15293: Stored Procedure Triggered by Logical Replication is Unable to use Notification Events

2019-02-27 Thread Marc Dean
If you are trying to get around the issue for now, what my team did was
cron an insert statement on the database server. We have a queue table that
has some of these triggers set up, so it was easy to write a no-op row to the
queue. This had the side effect of flushing the notification queue.

We haven't had any issues with this approach.

I can also volunteer to help test out patches.

Thanks,

-mdj

On Fri, Feb 22, 2019 at 11:37 AM Robert Welin  wrote:

> I have reproduced this bug on PostgreSQL version 10.7 and 11.2 using the
> steps described in Michael Powers' original report. The issue also still
> seems to be present even with the patch provided by Sergei Kornilov.
>
> Are there plans to address this issue any time soon or is there some way
> I can assist in fixing it? It would be great to have notifications from
> logical replication.
>
> Regards,
> Robert Welin
>
>


Re: get_controlfile() can leak fds in the backend

2019-02-27 Thread Michael Paquier
On Wed, Feb 27, 2019 at 07:45:11PM -0500, Joe Conway wrote:
> It seems to me that OpenTransientFile() is more appropriate. Patch done
> that way attached.

Works for me, thanks for sending a patch!  While on it, could you
clean up the comment on top of get_controlfile()?  "char" is mentioned
instead of "const char" for DataDir which is incorrect.  I would
remove the complete set of arguments from the description and just
keep the routine name.
--
Michael




Re: get_controlfile() can leak fds in the backend

2019-02-27 Thread Michael Paquier
On Wed, Feb 27, 2019 at 11:50:17AM +0100, Fabien COELHO wrote:
>> Shouldn't be necessary - the control file fits into a single page, and
>> writes of that size ought to always be atomic. And I also think
>> introducing flock usage for this would be quite disproportional.

There are static assertions to make sure that the size of the control file
data never gets larger than 512 bytes for this purpose.

> Note that my concern is not about the page size, but rather that as more
> commands may change the cluster status by editing the control file, it would
> be better that a postmaster does not start while a pg_rewind or enable
> checksum or whatever is in progress, and currently there is a possible race
> condition between the read and write that can induce an issue, at least
> theoretically.

Something I think we could live with instead is a special flag in the
control file to mark the postmaster as being in maintenance mode.  This
would be useful to prevent the postmaster from starting if it sees this flag
in the control file, as well as to find out that a host has crashed in
the middle of a maintenance operation.  We don't give this insurance
now when running pg_rewind, which is bad.  That's also separate from
the checksum-related patches and pg_rewind.

flock() can be hard to live with for cross-platform compatibility, for
example on Windows (LockFileEx) or other fancy platforms.  And note
that we don't use it in the tree yet.  Also, flock() would help in the
first case I am mentioning, but not in the second.
--
Michael




Re: get_controlfile() can leak fds in the backend

2019-02-27 Thread Andres Freund
Hi,

On 2019-02-27 11:50:17 +0100, Fabien COELHO wrote:
> Note that my concern is not about the page size, but rather that as more
> commands may change the cluster status by editing the control file, it would
> be better that a postmaster does not start while a pg_rewind or enable
> checksum or whatever is in progress, and currently there is a possible race
> condition between the read and write that can induce an issue, at least
> theoretically.

Seems odd to bring this up in this thread, it really has nothing to do
with the topic.  If we were to want to do more here, ISTM the right
approach would use the postmaster pid file, not the control file.

Greetings,

Andres Freund



RE: reloption to prevent VACUUM from truncating empty pages at the end of relation

2019-02-27 Thread Tsunakawa, Takayuki
From: Michael Paquier [mailto:mich...@paquier.xyz]
> This makes the test page-size sensitive.  While we don't ensure that tests
> can be run with different page sizes, we should make a maximum effort to
> keep the tests compatible if that's easy enough.  In this case you could
> just use > 0 as base comparison.  I can fix that by myself, so no need to
> send a new version.

Good idea.  Done.


> Should we also document that the parameter is effective for autovacuum?
> The name can lead to confusion regarding that.

I'm not sure about the need, because autovacuum is just an automatic execution 
of vacuum, and existing vacuum_xxx parameters also apply to autovacuum.  But 
being specific is good anyway, so I added a reference to autovacuum in the 
description.


> Also, shouldn't the relopt check happen in should_attempt_truncation()?
> It seems to me that if we use this routine somewhere else then it should
> be controlled by the option.

That makes sense.  Done.


> At the same time, we also have REL_TRUNCATE_FRACTION and
> REL_TRUNCATE_MINIMUM which could be made equally user-tunnable.
> That's more difficult to control, still why don't we also consider this
> part?

I thought of it, too.  But I didn't have a good idea on how to explain those 
parameters.  I'd like to separate it.


> Another thing that seems worth thinking about is a system-level GUC, and
> an option in the VACUUM command to control if truncation should happen or
> not.  We have a lot of infrastructure to control such options between vacuum
> and autovacuum, so it could be a waste to not consider potential synergies.

Being able to specify this parameter in postgresql.conf and SET (especially 
ALTER DATABASE/USER to target specific databases/applications) might be useful, 
but I'm not sure...  I'm less confident about whether the VACUUM command should 
be able to specify this, because this is a property of a table under a specific 
workload, not a changeable property of each VACUUM action.  Anyway, I expect it 
won't be difficult to add that configurability later without contradicting the 
design, so I'm inclined to separate it.


From: Masahiko Sawada [mailto:sawada.m...@gmail.com]
> Yeah, that would work. Or it's kind of hackie but the rolling back the
> insertion instead of INSERT and DELETE might also work.

That's good, because it allows us to keep running the reloptions test in 
parallel with other tests.  Done.


Regards
Takayuki Tsunakawa




disable-vacuum-truncation_v4.patch
Description: disable-vacuum-truncation_v4.patch


Re: get_controlfile() can leak fds in the backend

2019-02-27 Thread Joe Conway
On 2/27/19 10:26 AM, Joe Conway wrote:
> On 2/27/19 2:47 AM, Michael Paquier wrote:
>> Hi all,
>> (CC-ing Joe as of dc7d70e)

> According to that comment BasicOpenFile does not seem to solve the issue
> you are pointing out (leaking of file descriptor on ERROR). Perhaps
> OpenTransientFile() is more appropriate? I am on the road at the moment
> so have not looked very deeply at this yet though.


It seems to me that OpenTransientFile() is more appropriate. Patch done
that way attached.

Joe

-- 
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development
diff --git a/src/common/controldata_utils.c b/src/common/controldata_utils.c
index abfe706..c3b1934 100644
*** a/src/common/controldata_utils.c
--- b/src/common/controldata_utils.c
***
*** 27,32 
--- 27,35 
  #include "catalog/pg_control.h"
  #include "common/controldata_utils.h"
  #include "port/pg_crc32c.h"
+ #ifndef FRONTEND
+ #include "storage/fd.h"
+ #endif
  
  /*
   * get_controlfile(char *DataDir, const char *progname, bool *crc_ok_p)
*** get_controlfile(const char *DataDir, con
*** 51,63 
  	ControlFile = palloc(sizeof(ControlFileData));
  	snprintf(ControlFilePath, MAXPGPATH, "%s/global/pg_control", DataDir);
  
- 	if ((fd = open(ControlFilePath, O_RDONLY | PG_BINARY, 0)) == -1)
  #ifndef FRONTEND
  		ereport(ERROR,
  (errcode_for_file_access(),
   errmsg("could not open file \"%s\" for reading: %m",
  		ControlFilePath)));
  #else
  	{
  		fprintf(stderr, _("%s: could not open file \"%s\" for reading: %s\n"),
  progname, ControlFilePath, strerror(errno));
--- 54,67 
  	ControlFile = palloc(sizeof(ControlFileData));
  	snprintf(ControlFilePath, MAXPGPATH, "%s/global/pg_control", DataDir);
  
  #ifndef FRONTEND
+ 	if ((fd = OpenTransientFile(ControlFilePath, O_RDONLY | PG_BINARY)) == -1)
  		ereport(ERROR,
  (errcode_for_file_access(),
   errmsg("could not open file \"%s\" for reading: %m",
  		ControlFilePath)));
  #else
+ 	if ((fd = open(ControlFilePath, O_RDONLY | PG_BINARY, 0)) == -1)
  	{
  		fprintf(stderr, _("%s: could not open file \"%s\" for reading: %s\n"),
  progname, ControlFilePath, strerror(errno));
*** get_controlfile(const char *DataDir, con
*** 95,101 
--- 99,109 
  #endif
  	}
  
+ #ifndef FRONTEND
+ 	CloseTransientFile(fd);
+ #else
  	close(fd);
+ #endif
  
  	/* Check the CRC. */
  	INIT_CRC32C(crc);




Re: pgsql: Avoid creation of the free space map for small heap relations, t

2019-02-27 Thread Amit Kapila
On Wed, Feb 27, 2019 at 11:07 AM John Naylor
 wrote:
>
> On Wed, Feb 27, 2019 at 5:06 AM Amit Kapila  wrote:
> > I have tried this test many times (more than 1000 times) by varying
> > thread count, but couldn't reproduce it.  My colleague, Kuntal has
> > tried a similar test overnight, but the issue didn't reproduce which
> > is not surprising to me seeing the nature of the problem.  As I could
> > reproduce it via the debugger, I think we can go ahead with the fix.
> > I have improved the comments in the attached patch and I have also
> > tried to address Tom's concern w.r.t comments by adding additional
> > comments atop of data-structure used to maintain the local map.
>
> The flaw in my thinking was treating extension too similarly to
> finding an existing block. With your patch clearing the local map in
> the correct place, it seems the call at hio.c:682 is now superfluous?
>

What if we get some valid block from the first call to
GetPageWithFreeSpace via the local map which has the required space?  I
think in that case we will need the call at hio.c:682.  Am I missing
something?

> It wasn't sufficient before, so should this call be removed so as not
> to mislead? Or maybe keep it to be safe, with a comment like "This
> should already be cleared by now, but make sure it is."
>
> + * To maintain good performance, we only mark every other page as available
> + * in this map.
>
> I think this implementation detail is not needed to comment on the
> struct. I think the pointer to the README is sufficient.
>

Fair enough, I will remove that part of the comment once we agree on the
above point.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Allowing extensions to supply operator-/function-specific info

2019-02-27 Thread Paul Ramsey


> On Feb 27, 2019, at 3:40 PM, Tom Lane  wrote:
> 
>> Variable SupportRequestCost is very exciting, but given that variable cost 
>> is usually driven by the complexity of arguments, what kind of argument is 
>> the SupportRequestCost call fed during the planning stage? Constant 
>> arguments are pretty straight forward, but what gets sent in when a column 
>> is one (or all) of the arguments? 
> 
> You'll see whatever is in the post-constant-folding parse tree.  If it's a
> Const, you can look at the value.  If it's a Var, you could perhaps look
> at the pg_statistic info for that column, though whether that would give
> you much of a leg up for cost estimation is hard to say.  For any sort of
> expression, you're probably going to be reduced to using a default
> estimate.  The core code generally doesn't try to be intelligent about
> anything beyond the Const and Var cases.

Actually, this is interesting, maybe there’s something to be done looking at 
the vertex density of the area under consideration… would require gathering 
extra stats, but could be useful (maybe, at some point feeding costs into plans 
has to degenerate into wankery…)

Another question:

I added three indexes to my test table:

  CREATE INDEX foo_g_gist_x ON foo USING GIST (g);
  CREATE INDEX foo_g_gist_nd_x ON foo USING GIST (g gist_geometry_ops);
  CREATE INDEX foo_g_spgist_x ON foo USING SPGIST (g);

They all support the overlaps (&&) operator.

So, SupportRequestIndexCondition happens three times, and each time I say “yep, 
sure, you can construct an index condition by putting the && operator between 
left_arg and right_arg”.

How does the planner end up deciding on which index to *actually* use? The 
selectivity is the same, the operator is the same. I found that I got the ND 
GIST one first, then the SPGIST and finally the 2d GIST, which is unfortunate, 
because the 2D and SPGIST are almost certainly faster than the ND GIST.

In practice, most people will just have one spatial index at a time, but I 
still wonder?

P


Re: Allowing extensions to supply operator-/function-specific info

2019-02-27 Thread Tom Lane
Paul Ramsey  writes:
> The documentation says that a support function should have a signature 
> "supportfn(internal) returns internal”, but doesn’t say which (if any) 
> annotations should be provided. IMMUTABLE? PARALLEL SAFE? STRICT? None? All?

It doesn't matter much given that these things aren't callable from SQL.
The builtin ones are marked immutable/safe/strict, but that's mostly
because that's the default state for builtin functions.  The only one
I'd get excited about is marking it strict if you're not going to check
for a null argument ... and even that is neatnik-ism, not something that
will have any practical effect.

> Variable SupportRequestCost is very exciting, but given that variable cost is 
> usually driven by the complexity of arguments, what kind of argument is the 
> SupportRequestCost call fed during the planning stage? Constant arguments are 
> pretty straight forward, but what gets sent in when a column is one (or all) 
> of the arguments? 

You'll see whatever is in the post-constant-folding parse tree.  If it's a
Const, you can look at the value.  If it's a Var, you could perhaps look
at the pg_statistic info for that column, though whether that would give
you much of a leg up for cost estimation is hard to say.  For any sort of
expression, you're probably going to be reduced to using a default
estimate.  The core code generally doesn't try to be intelligent about
anything beyond the Const and Var cases.
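
For illustration, a support function handling SupportRequestCost could
branch on the node type roughly like this (only a sketch against the API
being discussed in this thread; the cost numbers are placeholders, not
recommendations):

Datum
my_support(PG_FUNCTION_ARGS)
{
    Node   *rawreq = (Node *) PG_GETARG_POINTER(0);
    Node   *ret = NULL;

    if (IsA(rawreq, SupportRequestCost))
    {
        SupportRequestCost *req = (SupportRequestCost *) rawreq;

        if (req->node && IsA(req->node, FuncExpr))
        {
            Node   *arg = (Node *) linitial(((FuncExpr *) req->node)->args);

            /* Placeholder numbers; a real extension would compute these. */
            req->startup = 0;
            req->per_tuple = IsA(arg, Const) ? 10 * cpu_operator_cost :
                100 * cpu_operator_cost;
            ret = (Node *) req;
        }
    }
    PG_RETURN_POINTER(ret);
}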

regards, tom lane



Re: some hints to understand the plsql cursor.

2019-02-27 Thread Andy Fan
Thanks Kumar.  Actually I was asking what the cursor does in the
server.  By looking at the code, it looks like it caches the previous Portal
under the cursor name; whenever we run a fetch from the portal, it
will restore the previous Portal and run it.
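
(In backend terms, FETCH boils down to looking up the named Portal and
running it again -- roughly the following, sketched after
PerformPortalFetch(), so treat it as an approximation rather than the
exact code:)

Portal  portal = GetPortalByName(stmt->portalname);

if (!PortalIsValid(portal))
    ereport(ERROR,
            (errcode(ERRCODE_UNDEFINED_CURSOR),
             errmsg("cursor \"%s\" does not exist", stmt->portalname)));

(void) PortalRunFetch(portal, stmt->direction, stmt->howMany, dest);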

But your minimized and interactive example definitely makes debugging much
quicker.  Thank you very much!


On Wed, Feb 27, 2019 at 11:35 PM Dilip Kumar  wrote:

> On Wed, Feb 27, 2019 at 4:42 PM Andy Fan  wrote:
> >
> > actually I'm hacking pg for a function like :
> > 1. define a select query.
> > 2. client ask for some data. and server reply some data.  server will do
> NOTHING if client doesn't ask any more..
> > 3.  client ask some data more data with a batch and SERVER reply some
> data then. then do NOTHING.
> >
> > currently the simple "select * from t",  the server will try to send the
> data to client at one time which is not something I want.
> >
> > by looking into the plsql,  looks it has some api like:
> >
> > fetch 10 from cursor_1;
> > fetch 10 from cursor_1;
> >
> > I'm lacking of the experience to hack plsql.   so my question are:
> > 1.   Does pg has some codes which act like the "ask -> reply -> ask
> again -> reply again" on the server code?currently I'm not sure if the
> above "fetch" really work like this.
> > 2.  any resources or hint or suggestion to understand the "fetch"
> statement?
>
> I guess you are looking for these syntax?
>
> postgres=# BEGIN;
> BEGIN
> postgres=# DECLARE cur CURSOR FOR SELECT * FROM t;
> DECLARE CURSOR
> postgres=# FETCH NEXT cur;
>  a
> ---
>  1
> (1 row)
>
> postgres=# FETCH 10 cur;
>  a
> ---
>  2
>  3
>  4
>  5
>  1
>  2
>  3
>  4
>  5
>  6
> (10 rows)
>
> postgres=# FETCH NEXT cur;
>  a
> ---
>  7
> (1 row)
>
> postgres=# CLOSE cur;
> CLOSE CURSOR
>
>
> --
> Regards,
> Dilip Kumar
> EnterpriseDB: http://www.enterprisedb.com
>


Re: patch to allow disable of WAL recycling

2019-02-27 Thread Alvaro Herrera
On 2019-Feb-05, Jerry Jelinek wrote:

> First, since last fall, we have found another performance problem related
> to initializing WAL files. I've described this issue in more detail below,
> but in order to handle this new problem, I decided to generalize the patch
> so the tunable refers to running on a Copy-On-Write filesystem instead of
> just being specific to WAL recycling. Specifically, I renamed the GUC
> tunable from 'wal_recycle' to 'wal_cow_fs'. Hopefully this will make it
> more obvious what is being tuned and will also be more flexible if there
> are other problems in the future which are related to running on a COW
> filesystem. I'm happy to choose a different name for the tunable if people
> don't like 'wal_cow_fs'.

I think the idea of it being a generic tunable for assorted behavior
changes, rather than specific to WAL recycling, is a good one.  I'm
unsure about your proposed name -- maybe "wal_cow_filesystem" is better?

I'm rewording your doc addition a little bit.  Here's my proposal:

   
This parameter should only be set to on when the WAL
resides on a Copy-On-Write 
(COW)
filesystem.
Enabling this option adjusts behavior to take advantage of the
filesystem characteristics (for example, recycling WAL files and
zero-filling new WAL files are disabled).

This part sounds good enough to me -- further suggestions welcome.

I'm less sure about this phrase:

This setting is only appropriate for filesystems which
allocate new disk blocks on every write.

Is "... which allocate new disk blocks on every write" a technique
distinct from CoW itself?  I'm confused as to what it means, or how can
the user tell whether they are on such a filesystem.

Obviously you're thinking that ZFS is such a filesystem and everybody
who has pg_wal on ZFS should enable this option.  What about, say, Btrfs
-- should they turn this option on?  Browsing the wikipedia, I find that
Windows has this ReFS thing that apparently is also CoW, but NTFS isn't.
I don't think either Btrfs or ReFS are realistic options to put pg_wal
on, so let's just list the common filesystems for which users are
supposed to enable this option ... which I think nowadays is just ZFS.
All in all, I would replace this phrase with something like: "This
setting should be enabled when pg_wal resides on a ZFS filesystem or
similar." That should be weasely enough that it's clear that we expect
users to do the homework when on unusual systems, while actively pointing
out the most common use case.

> Finally, the patch now includes bypassing the zero-fill for new WAL files
> when wal_cow_fs is true.

That makes sense.  I think all these benchmarks Tomas Vondra ran are not
valid anymore ...

The attached v2 has assorted cosmetic cleanups.  If you can validate it,
I would appreciate it.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>From a4ce02f5cc8ad983c34712083f9cba7fda6d5b38 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Wed, 27 Feb 2019 19:41:05 -0300
Subject: [PATCH v2] pg_wal on COW fs

---
 doc/src/sgml/config.sgml  |  20 
 src/backend/access/transam/xlog.c | 101 --
 src/backend/utils/misc/guc.c  |  13 +++
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/include/access/xlog.h |   1 +
 5 files changed, 102 insertions(+), 34 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 8bd57f376b2..60a873273aa 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2959,6 +2959,26 @@ include_dir 'conf.d'
   
  
 
+ 
+  wal_cow_filesystem (boolean)
+  
+   wal_cow_filesystem configuration parameter
+  
+  
+  
+   
+This parameter should only be set to on when the WAL
+resides on a Copy-On-Write (COW)
+filesystem.
+Enabling this option adjusts some behavior to take advantage of the
+filesystem characteristics (for example, recycling WAL files and
+zero-filling new WAL files are disabled).
+This setting is only appropriate for filesystems which
+allocate new disk blocks on every write.
+   
+  
+ 
+
  
  
  
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index ecd12fc53ae..1acce1c70d6 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -94,6 +94,7 @@ bool		wal_log_hints = false;
 bool		wal_compression = false;
 char	   *wal_consistency_checking_string = NULL;
 bool	   *wal_consistency_checking = NULL;
+bool		wal_cow_filesystem = false;
 bool		log_checkpoints = false;
 int			sync_method = DEFAULT_SYNC_METHOD;
 int			wal_level = WAL_LEVEL_MINIMAL;
@@ -3216,6 +3217,7 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
 	XLogSegNo	max_segno;
 

Re: Allowing extensions to supply operator-/function-specific info

2019-02-27 Thread Paul Ramsey
A few more questions…

The documentation says that a support function should have a signature 
"supportfn(internal) returns internal”, but doesn’t say which (if any) 
annotations should be provided. IMMUTABLE? PARALLEL SAFE? STRICT? None? All?

Variable SupportRequestCost is very exciting, but given that variable cost is 
usually driven by the complexity of arguments, what kind of argument is the 
SupportRequestCost call fed during the planning stage? Constant arguments are 
pretty straight forward, but what gets sent in when a column is one (or all) of 
the arguments? 

Thanks,
P




Re: Row Level Security − leakproof-ness and performance implications

2019-02-27 Thread Joe Conway
On 2/20/19 11:24 AM, Tom Lane wrote:
> Pierre Ducroquet  writes:
>> For simple functions like enum_eq/enum_ne, marking them leakproof is an 
>> obvious fix (patch attached to this email, including also textin/textout).
> 
> This is not nearly as "obvious" as you think.  See prior discussions,
> notably
> https://www.postgresql.org/message-id/flat/31042.1546194242%40sss.pgh.pa.us
> 
> Up to now we've taken a very strict definition of what leakproofness
> means; as Noah stated, if a function can throw errors for some inputs,
> it's not considered leakproof, even if those inputs should never be
> encountered in practice.  Most of the things we've been willing to
> mark leakproof are straight-line code that demonstrably can't throw
> any errors at all.
> 
> The previous thread seemed to have consensus that the hazards in
> text_cmp and friends were narrow enough that nobody had a practical
> problem with marking them leakproof --- but we couldn't define an
> objective policy statement that would allow making such decisions,
> so nothing's been changed as yet.  I think it is important to have
> an articulable policy about this, not just a seat-of-the-pants
> conclusion about the risk level in a particular function.

What if we provided an option to redact all client messages (leaving
logged messages as-is)?  Separately, we could provide a GUC to force all
functions to be treated as leakproof. Depending on your requirements,
having both options turned on could be perfectly acceptable.

Patch for discussion attached.

Joe

-- 
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development
diff --git a/src/backend/utils/cache/lsyscache.c b/src/backend/utils/cache/lsyscache.c
index e88c45d..8c32d6c 100644
*** a/src/backend/utils/cache/lsyscache.c
--- b/src/backend/utils/cache/lsyscache.c
***
*** 46,51 
--- 46,54 
  #include "utils/syscache.h"
  #include "utils/typcache.h"
  
+ /* GUC to force functions to be treated as leakproof */
+ extern bool		force_leakproof;
+ 
  /* Hook for plugins to get control in get_attavgwidth() */
  get_attavgwidth_hook_type get_attavgwidth_hook = NULL;
  
*** get_func_leakproof(Oid funcid)
*** 1595,1600 
--- 1598,1610 
  	HeapTuple	tp;
  	bool		result;
  
+ 	/*
+ 	 * If client message suppression was requested,
+ 	 * treat all functions as leakproof
+ 	 */
+ 	if (force_leakproof)
+ 		return true;
+ 
  	tp = SearchSysCache1(PROCOID, ObjectIdGetDatum(funcid));
  	if (!HeapTupleIsValid(tp))
  		elog(ERROR, "cache lookup failed for function %u", funcid);
diff --git a/src/backend/utils/error/elog.c b/src/backend/utils/error/elog.c
index 8b4720e..9941756 100644
*** a/src/backend/utils/error/elog.c
--- b/src/backend/utils/error/elog.c
*** int			Log_destination = LOG_DESTINATION_
*** 107,112 
--- 107,113 
  char	   *Log_destination_string = NULL;
  bool		syslog_sequence_numbers = true;
  bool		syslog_split_messages = true;
+ bool		suppress_client_messages = false;
  
  #ifdef HAVE_SYSLOG
  
*** send_message_to_frontend(ErrorData *edat
*** 3166,3189 
  
  		/* M field is required per protocol, so always send something */
  		pq_sendbyte(&msgbuf, PG_DIAG_MESSAGE_PRIMARY);
- 		if (edata->message)
- 			err_sendstring(&msgbuf, edata->message);
- 		else
- 			err_sendstring(&msgbuf, _("missing error text"));
  
! 		if (edata->detail)
  		{
! 			pq_sendbyte(&msgbuf, PG_DIAG_MESSAGE_DETAIL);
! 			err_sendstring(&msgbuf, edata->detail);
! 		}
  
! 		/* detail_log is intentionally not used here */
  
! 		if (edata->hint)
! 		{
! 			pq_sendbyte(&msgbuf, PG_DIAG_MESSAGE_HINT);
! 			err_sendstring(&msgbuf, edata->hint);
  		}
  
  		if (edata->context)
  		{
--- 3167,3197 
  
  		/* M field is required per protocol, so always send something */
  		pq_sendbyte(&msgbuf, PG_DIAG_MESSAGE_PRIMARY);
  
! 		/* skip sending if requested */
! 		if (!suppress_client_messages)
  		{
! 			if (edata->message)
! err_sendstring(&msgbuf, edata->message);
! 			else
! err_sendstring(&msgbuf, _("missing error text"));
  
! 			if (edata->detail)
! 			{
! pq_sendbyte(&msgbuf, PG_DIAG_MESSAGE_DETAIL);
! err_sendstring(&msgbuf, edata->detail);
! 			}
  
! 			/* detail_log is intentionally not used here */
! 
! 			if (edata->hint)
! 			{
! pq_sendbyte(&msgbuf, PG_DIAG_MESSAGE_HINT);
! err_sendstring(&msgbuf, edata->hint);
! 			}
  		}
+ 		else
+ 			err_sendstring(&msgbuf, _("redacted message text"));
  
  		if (edata->context)
  		{
*** send_message_to_frontend(ErrorData *edat
*** 3274,3283 
  		if (edata->show_funcname && edata->funcname)
  			appendStringInfo(&buf, "%s: ", edata->funcname);
  
! 		if (edata->message)
! 			appendStringInfoString(&buf, edata->message);
  		else
! 			appendStringInfoString(&buf, _("missing error text"));
  
  		if (edata->cursorpos > 0)
  			appendStringInfo(&buf, _(" at character %d"),
--- 3282,3297 
  		if (edat

Re: Why don't we have a small reserved OID range for patch revisions?

2019-02-27 Thread Tom Lane
Peter Geoghegan  writes:
> On Wed, Feb 27, 2019 at 2:44 PM Peter Eisentraut
>  wrote:
>> If this is the problem (although I think we'd find that OID collisions
>> are rather rare compared to other gratuitous cfbot failures), why not
>> have the cfbot build with a flag that ignores OID collisions?

> How would that work?

It could work for conflicting OIDs in different system catalogs (so that
the "conflict" is an artifact of our assignment rules rather than an
intrinsic problem).  But I think the majority of new hand-assigned OIDs
are in pg_proc, so that this kind of hack would not help as much as one
might wish.

regards, tom lane



Re: Why don't we have a small reserved OID range for patch revisions?

2019-02-27 Thread Peter Geoghegan
On Wed, Feb 27, 2019 at 2:44 PM Peter Eisentraut
 wrote:
> If this is the problem (although I think we'd find that OID collisions
> are rather rare compared to other gratuitous cfbot failures), why not
> have the cfbot build with a flag that ignores OID collisions?

How would that work?

-- 
Peter Geoghegan



Re: Why don't we have a small reserved OID range for patch revisions?

2019-02-27 Thread Peter Eisentraut
On 2019-02-27 22:50, Peter Geoghegan wrote:
> However, the continuous
> integration stuff has created an expectation that your patch shouldn't
> be left to bitrot for long. Silly mechanical bitrot now seems like a
> much bigger problem than it was before these developments. It unfairly
> puts reviewers off engaging.

If this is the problem (although I think we'd find that OID collisions
are rather rare compared to other gratuitous cfbot failures), why not
have the cfbot build with a flag that ignores OID collisions?

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Why don't we have a small reserved OID range for patch revisions?

2019-02-27 Thread Peter Geoghegan
On Wed, Feb 27, 2019 at 2:39 PM Peter Eisentraut
 wrote:
> The chances of a patch (a) allocating a new OID, (b) a second patch
> allocating a new OID, (c) both being in flight at the same time, (d)
> actually picking the same OID, are small.

But...they are. Most patches don't create new system catalog entries
at all. Of those that do, the conventions around assigning new OIDs
make it fairly likely that problems will emerge.

> I guess the overall time lost
> to this issue is perhaps 2 hours per year.  On the other hand, with
> about 2000 commits to master per year, if this renumbering business only
> adds 2 seconds of overhead to committing, we're coming out behind.

The time spent on the final commit is not the cost we're concerned
about, though. It isn't necessary to do that more than once, whereas
all but the most trivial of patches receive multiple rounds of review
and revision.

-- 
Peter Geoghegan



Re: Why don't we have a small reserved OID range for patch revisions?

2019-02-27 Thread Peter Eisentraut
On 2019-02-27 22:27, Tom Lane wrote:
>> OID collision doesn't seem to be a significant problem (for me).
> 
> Um, I beg to differ.  It's not at all unusual for pending patches to
> bit-rot for no reason other than suddenly getting an OID conflict.
> I don't have to look far for a current example:

I'm not saying it doesn't happen, but that it's not a significant
problem overall.

The chances of a patch (a) allocating a new OID, (b) a second patch
allocating a new OID, (c) both being in flight at the same time, (d)
actually picking the same OID, are small.  I guess the overall time lost
to this issue is perhaps 2 hours per year.  On the other hand, with
about 2000 commits to master per year, if this renumbering business only
adds 2 seconds of overhead to committing, we're coming out behind.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Why don't we have a small reserved OID range for patch revisions?

2019-02-27 Thread Tom Lane
Peter Geoghegan  writes:
> On Wed, Feb 27, 2019 at 1:27 PM Tom Lane  wrote:
>>> OID collision doesn't seem to be a significant problem (for me).

>> Um, I beg to differ.  It's not at all unusual for pending patches to
>> bit-rot for no reason other than suddenly getting an OID conflict.
>> I don't have to look far for a current example:
>> https://travis-ci.org/postgresql-cfbot/postgresql/builds/498955351

> Patch authors shouldn't be left with any excuse for leaving their
> patch to bitrot for long. And, more casual patch reviewers shouldn't
> have any excuse for not downloading a patch and applying it locally,
> so that they can spend a spare 10 minutes kicking the tires.

Yeah, that latter point is really the killer argument.  We don't want
to make people spend valuable review time on fixing uninteresting OID
conflicts.  It's even more annoying that several people might have to
duplicate the same work, if they're testing a patch independently.

Given a convention that under-development patches use OIDs in the 9K
range, the only time anybody would have to resolve OID conflicts for
testing would be if they were trying to test the combination of two
or more patches.  Even then, an OID-renumbering script would make it
pretty painless: apply patch 1, renumber its OIDs to someplace else,
apply patch 2, repeat as needed.

> Why not have unused_oids reference the convention as a "tip"?

Hmm, could be helpful.

regards, tom lane



Re: Why don't we have a small reserved OID range for patch revisions?

2019-02-27 Thread Tom Lane
I wrote:
> We do need a couple of pieces of new infrastructure to make this idea
> conveniently workable.  One is a tool to allow automatic OID renumbering
> instead of having to do it by hand; Naylor has a draft for that upthread.

Oh: arguably, something else we'd need to do to ensure that OID
renumbering is trouble-free is to institute a strict rule that OID
references in the *.dat files must be symbolic.  We had not bothered
to convert every single reference type before, reasoning that some
of them were too little-used to be worth the trouble; but someday
that'll rise up to bite us, if semi-automated renumbering becomes
a thing.

It looks to me like the following OID columns remain unconverted:

pg_class.reltype
pg_database.dattablespace
pg_ts_config.cfgparser
pg_ts_config_map.mapcfg, mapdict
pg_ts_dict.dicttemplate
pg_type.typcollation
pg_type.typrelid

regards, tom lane



Re: Why don't we have a small reserved OID range for patch revisions?

2019-02-27 Thread Peter Geoghegan
On Wed, Feb 27, 2019 at 1:27 PM Tom Lane  wrote:
> > OID collision doesn't seem to be a significant problem (for me).
>
> Um, I beg to differ.  It's not at all unusual for pending patches to
> bit-rot for no reason other than suddenly getting an OID conflict.
> I don't have to look far for a current example:
> https://travis-ci.org/postgresql-cfbot/postgresql/builds/498955351

OID conflicts are not that big of a deal when building a patch
locally, because almost everyone knows what the exact problem is
immediately, and because you probably have more than a passing
interest in the patch to even do that much. However, the continuous
integration stuff has created an expectation that your patch shouldn't
be left to bitrot for long. Silly mechanical bitrot now seems like a
much bigger problem than it was before these developments. It unfairly
puts reviewers off engaging.

Patch authors shouldn't be left with any excuse for leaving their
patch to bitrot for long. And, more casual patch reviewers shouldn't
have any excuse for not downloading a patch and applying it locally,
so that they can spend a spare 10 minutes kicking the tires.

> Perhaps it'd be useful for genbki.pl to spit out a warning (NOT an
> error) if it sees OIDs in the reserved range.  I'm not sure that that'd
> really be worth the trouble though, since one could easily forget
> about it while reviewing/testing just before commit, and it'd just be
> useless noise up until it was time to commit.

My sense is that we should err on the side of being informative.

> Another issue, as Robert pointed out, is that this does need to be
> a formal convention not something undocumented.  Naylor's patch adds
> a mention of it in bki.sgml, but I wonder if anyplace else should
> talk about it.

Why not have unused_oids reference the convention as a "tip"?

> I concede your point that a prudent committer would do a rebuild and
> retest rather than just trusting the tool.  But really, how much
> extra work is that?  If you've spent any time optimizing your workflow,
> a full rebuild and check-world should be under five minutes on any
> hardware anyone would be using for development today.

If you use the "check-world parallel" recipe on the committing
checklist Wiki page, and if you use ccache, ~2 minutes is attainable
for optimized builds (though the recipe doesn't work on all release
branches). I don't think that a committer should be committing
anything if they're not willing to do this much. It's not just prudent
-- it is the *bare minimum* when committing a patch that creates
system catalog entries.

-- 
Peter Geoghegan



Re: Refactoring the checkpointer's fsync request queue

2019-02-27 Thread Thomas Munro
On Thu, Feb 28, 2019 at 10:27 AM Shawn Debnath  wrote:
> We had a quick offline discussion to get on the same page and we agreed
> to move forward with Andres' approach above. Attached is patch v10.
> Here's the overview of the patch:

Thanks.  I will review, and try to rebase my undo patches on top of
this and see what problems I crash into.

> Ran make check-world and repeated the tests described in [1]. The
> numbers show a 12% drop in total time for single run of 1000 clients and
> ~62% drop in total time for 10 parallel runs with 200 clients:

Hmm, good but unexpected.  Will poke at this here.


--
Thomas Munro
https://enterprisedb.com



Re: POC: converting Lists into arrays

2019-02-27 Thread Alvaro Herrera
On 2019-Feb-27, Tom Lane wrote:

> I'm particularly unsure about whether we need two macros; though the
> way I initially tried it with just list_cell_is_last() seemed kind of
> double-negatively confusing in the places where the test needs to be
> not-last.  Also, are these macro names too long, and if so what would
> be better?

I think "!list_cell_is_last()" is just as readable, if not more, than
the "is_not_last" locution:

appendStringInfoChar(&buf, '\'');
if (!list_cell_is_last(l))
appendStringInfoString(&buf, ", ");

I'd go with a single macro.
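
Presumably the single macro would boil down to something like this (a
guess at the definition, since the 0002 patch itself isn't quoted here):

/* true if 'cell' is the last cell of its List (cell must not be NULL) */
#define list_cell_is_last(cell)  (lnext(cell) == NULL)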


+1 for backpatching the new macros, too.  I suspect extension authors
are going to need to provide compatibility versions anyway, to be
compilable against older minors.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Why don't we have a small reserved OID range for patch revisions?

2019-02-27 Thread Tom Lane
Peter Eisentraut  writes:
> On 2019-02-08 19:14, Tom Lane wrote:
>> Quite a few people have used OIDs up around 8000 or 9000 for this purpose;
>> I doubt we need a formally reserved range for it.  The main problem with
>> doing it is the hazard that the patch'll get committed just like that,
>> suddenly breaking things for everyone else doing likewise.

> For that reason, I'm not in favor of this.  Forgetting to update the
> catversion is already common enough (for me).  Adding another step
> between having a seemingly final patch and being able to actually commit
> it doesn't seem attractive.  Moreover, these "final adjustments" would
> tend to require a full rebuild and retest, adding even more overhead.

> OID collision doesn't seem to be a significant problem (for me).

Um, I beg to differ.  It's not at all unusual for pending patches to
bit-rot for no reason other than suddenly getting an OID conflict.
I don't have to look far for a current example:
https://travis-ci.org/postgresql-cfbot/postgresql/builds/498955351

We do need a couple of pieces of new infrastructure to make this idea
conveniently workable.  One is a tool to allow automatic OID renumbering
instead of having to do it by hand; Naylor has a draft for that upthread.

Perhaps it'd be useful for genbki.pl to spit out a warning (NOT an
error) if it sees OIDs in the reserved range.  I'm not sure that that'd
really be worth the trouble though, since one could easily forget
about it while reviewing/testing just before commit, and it'd just be
useless noise up until it was time to commit.

Another issue, as Robert pointed out, is that this does need to be
a formal convention not something undocumented.  Naylor's patch adds
a mention of it in bki.sgml, but I wonder if anyplace else should
talk about it.

I concede your point that a prudent committer would do a rebuild and
retest rather than just trusting the tool.  But really, how much
extra work is that?  If you've spent any time optimizing your workflow,
a full rebuild and check-world should be under five minutes on any
hardware anyone would be using for development today.

And, yeah, we probably will make mistakes like this, just like we
sometimes forget the catversion bump.  As long as we have a tool
for OID renumbering, I don't think that's the end of the world.
Fixing it after the fact isn't going to be a big deal, any more
than it is for catversion.

regards, tom lane



Re: POC: converting Lists into arrays

2019-02-27 Thread Tom Lane
Robert Haas  writes:
> On Wed, Feb 27, 2019 at 3:27 PM Tom Lane  wrote:
>> 0001 below does this.  I found a couple of places that could use
>> forfive(), as well.  I think this is a clear legibility and
>> error-proofing win, and we should just push it.

> It sounds like some of these places might need a bigger restructuring
> - i.e. to iterate over a list/vector of structs with 5 members instead
> of iterating over five lists in parallel.

Meh.  Most of them are iterating over parsetree substructure, eg the
components of a RowCompareExpr.  So we could not change them without
pretty extensive infrastructure changes including a catversion bump.
Also, while the separated substructure is indeed a pain in the rear
in some places, it's actually better for other uses.  Two examples
related to RowCompareExpr:

* match_rowcompare_to_indexcol can check whether all the left-hand
or right-hand expressions are nonvolatile with one easy call to
contain_volatile_functions on the respective list.  To do the
same with a single list of sub-structs, it'd need bespoke code
for each case to run through the list and consider only the correct
subexpression of each sub-struct.

* expand_indexqual_rowcompare can deal with commuted-clause cases just
by swapping the list pointers at the start, it doesn't have to think
about it over again for each pair of elements.

So I'm not that excited about refactoring the data representation
for these.  I'm content (for now) with getting these places in line
with the coding convention we use elsewhere for similar cases.

regards, tom lane



Re: POC: converting Lists into arrays

2019-02-27 Thread Robert Haas
On Wed, Feb 27, 2019 at 3:27 PM Tom Lane  wrote:
> 0001 below does this.  I found a couple of places that could use
> forfive(), as well.  I think this is a clear legibility and
> error-proofing win, and we should just push it.

It sounds like some of these places might need a bigger restructuring
- i.e. to iterate over a list/vector of structs with 5 members instead
of iterating over five lists in parallel.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] EvalPlanQual behaves oddly for FDW queries involving system columns

2019-02-27 Thread Andres Freund
Hi,

On 2015-05-12 14:24:34 -0400, Tom Lane wrote:
> I did a very basic update of your postgres_fdw patch to test this with,
> and attach that so that you don't have to repeat the effort.  I'm not sure
> whether we want to try to convert that into something committable.  I'm
> afraid that the extra round trips involved in doing row locking this way
> will be so expensive that no one really wants it for postgres_fdw.  It's
> more credible that FDWs operating against local storage would have use
> for it.

Fujita-san, do you know of any FDWs that use this? I'm currently
converting the EPQ machinery to slots, and in the course of that I (with
Horiguchi-san's help) converted RefetchForeignRow to return a slot. But
there's currently no in-core user of this facility...  I guess I can
rebase the preliminary postgres_fdw patch here, but it bitrotted
significantly. I also feel like there should be some test coverage for
an API in a nontrivial part of the code...

Greetings,

Andres Freund



Re: POC: converting Lists into arrays

2019-02-27 Thread Tom Lane
I wrote:
> I did find a number of places where getting rid of explicit lnext()
> calls led to just plain cleaner code.  Most of these were places that
> could be using forboth() or forthree() and just weren't.  There's
> also several places that are crying out for a forfour() macro, so
> I'm not sure why we've stubbornly refused to provide it.  I'm a bit
> inclined to just fix those things in the name of code readability,
> independent of this patch.

0001 below does this.  I found a couple of places that could use
forfive(), as well.  I think this is a clear legibility and
error-proofing win, and we should just push it.

> I also noticed that there's quite a lot of places that are testing
> lnext(cell) for being NULL or not.  What that really means is whether
> this cell is last in the list or not, so maybe readability would be
> improved by defining a macro along the lines of list_cell_is_last().
> Any thoughts about that?

0002 below does this.  I'm having a hard time deciding whether this
part is a good idea or just code churn.  It might be more readable
(especially to newbies) but I can't evaluate that very objectively.
I'm particularly unsure about whether we need two macros; though the
way I initially tried it with just list_cell_is_last() seemed kind of
double-negatively confusing in the places where the test needs to be
not-last.  Also, are these macro names too long, and if so what would
be better?

Also: if we accept either or both of these, should we back-patch the
macro additions, so that these new macros will be available for use
in back-patched code?  I'm not sure that forfour/forfive have enough
use-cases to worry about that; but the is-last macros would have a
better case for that, I think.

regards, tom lane

diff --git a/src/backend/access/common/tupdesc.c b/src/backend/access/common/tupdesc.c
index 47e80ae..832c3e9 100644
--- a/src/backend/access/common/tupdesc.c
+++ b/src/backend/access/common/tupdesc.c
@@ -902,23 +902,12 @@ BuildDescFromLists(List *names, List *types, List *typmods, List *collations)
 	desc = CreateTemplateTupleDesc(natts);
 
 	attnum = 0;
-
-	l2 = list_head(types);
-	l3 = list_head(typmods);
-	l4 = list_head(collations);
-	foreach(l1, names)
+	forfour(l1, names, l2, types, l3, typmods, l4, collations)
 	{
 		char	   *attname = strVal(lfirst(l1));
-		Oid			atttypid;
-		int32		atttypmod;
-		Oid			attcollation;
-
-		atttypid = lfirst_oid(l2);
-		l2 = lnext(l2);
-		atttypmod = lfirst_int(l3);
-		l3 = lnext(l3);
-		attcollation = lfirst_oid(l4);
-		l4 = lnext(l4);
+		Oid			atttypid = lfirst_oid(l2);
+		int32		atttypmod = lfirst_int(l3);
+		Oid			attcollation = lfirst_oid(l4);
 
 		attnum++;
 
diff --git a/src/backend/executor/execExpr.c b/src/backend/executor/execExpr.c
index db3777d..7cbf9d3 100644
--- a/src/backend/executor/execExpr.c
+++ b/src/backend/executor/execExpr.c
@@ -1683,7 +1683,6 @@ ExecInitExprRec(Expr *node, ExprState *state,
 		   *l_opfamily,
 		   *l_inputcollid;
 ListCell   *lc;
-int			off;
 
 /*
  * Iterate over each field, prepare comparisons.  To handle
@@ -1695,20 +1694,11 @@ ExecInitExprRec(Expr *node, ExprState *state,
 Assert(list_length(rcexpr->opfamilies) == nopers);
 Assert(list_length(rcexpr->inputcollids) == nopers);
 
-off = 0;
-for (off = 0,
-	 l_left_expr = list_head(rcexpr->largs),
-	 l_right_expr = list_head(rcexpr->rargs),
-	 l_opno = list_head(rcexpr->opnos),
-	 l_opfamily = list_head(rcexpr->opfamilies),
-	 l_inputcollid = list_head(rcexpr->inputcollids);
-	 off < nopers;
-	 off++,
-	 l_left_expr = lnext(l_left_expr),
-	 l_right_expr = lnext(l_right_expr),
-	 l_opno = lnext(l_opno),
-	 l_opfamily = lnext(l_opfamily),
-	 l_inputcollid = lnext(l_inputcollid))
+forfive(l_left_expr, rcexpr->largs,
+		l_right_expr, rcexpr->rargs,
+		l_opno, rcexpr->opnos,
+		l_opfamily, rcexpr->opfamilies,
+		l_inputcollid, rcexpr->inputcollids)
 {
 	Expr	   *left_expr = (Expr *) lfirst(l_left_expr);
 	Expr	   *right_expr = (Expr *) lfirst(l_right_expr);
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index 324356e..8b29437 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -1332,12 +1332,12 @@ ExecIndexBuildScanKeys(PlanState *planstate, Relation index,
 		{
 			/* (indexkey, indexkey, ...) op (expression, expression, ...) */
 			RowCompareExpr *rc = (RowCompareExpr *) clause;
-			ListCell   *largs_cell = list_head(rc->largs);
-			ListCell   *rargs_cell = list_head(rc->rargs);
-			ListCell   *opnos_cell = list_head(rc->opnos);
-			ListCell   *collids_cell = list_head(rc->inputcollids);
 			ScanKey		first_sub_key;
 			int			n_sub_key;
+			ListCell   *largs_cell;
+			ListCell   *rargs_cell;
+			ListCell   *opnos_cell;
+			ListCell   *collids_cell;
 
 			Assert(!isorderby);
 
@@ -1346,19 +1346,22 @@ ExecIndexBuildScanKeys(PlanState

Re: Index INCLUDE vs. Bitmap Index Scan

2019-02-27 Thread Tom Lane
Tomas Vondra  writes:
> On 2/27/19 6:36 AM, Markus Winand wrote:
> On 27.02.2019, at 00:22, Tom Lane  wrote:
>>> For example, given a filter condition like "1.0/c > 0.1", people
>>> would complain if it still got zero-divide failures even after they'd
>>> deleted all rows with c=0 from their table.

>> Ok, but I don’t see how this case different for key columns vs. INCLUDE 
>> columns.

> Yeah, I'm a bit puzzled by this difference too - why would it be safe
> for keys and not the other included columns?

It's not about the column, it's about the operator.  We assume that
operators appearing in opclasses are safe to execute even for
index entries that correspond to dead rows.  INCLUDE columns don't
have any associated opclass, hence no assumed-usable operators.

> I do recall a discussion about executing expressions on index tuples
> during IOS (before/without inspecting the heap tuple). I'm too lazy to
> search for the thread now, but I recall it was somehow related to
> leak-proof-ness. And AFAICS none of the "div" procs are marked as
> leak-proof, so perhaps that's one of the reasons?

Leak-proof-ness is kind of related, perhaps, but it's not quite the property
we're after here --- at least not to my mind.  It might be an interesting
discussion exactly what the relationship is.  Meanwhile, we don't have any
separate concept of functions that are safe to apply to any index entry;
opclass membership is it.

You could probably argue that any clause containing only indexed variables
and operators that are members of *some* opclass could be used as a filter
in advance of heap liveness checking.  But we lack any infrastructure for
that, in either the planner or executor.

regards, tom lane



Re: psql display of foreign keys

2019-02-27 Thread Alvaro Herrera
On 2019-Feb-27, Michael Paquier wrote:

> On Tue, Feb 26, 2019 at 07:27:57PM -0300, Alvaro Herrera wrote:
> > Thanks for committing pg_partition_root ... but it turns out to be
> > useless for this purpose.
> 
> Well, what's done is done.  The thing is useful by itself in my
> opinion.

Eh, of course -- note that the psql query I added does use
pg_partition_root, it's just that it is not useful *all by itself*.

> In the second patch, pg_partition_ancestors always sets include_self
> to true.  What's the use case you have in mind to set it to false?  In
> the other existing functions we always include the argument itself, so
> we may want to keep things consistent.

Hmm, true.

> I think that you should make the function return a record of regclass
> elements instead of OIDs to be consistent.  This could save casts for
> the callers.

Yeah, done.

> Adding the self-member at the beginning of the record set is more
> consistent with the order of the results returned by
> get_partition_ancestors().

You're right, done.

> It would be nice to add some tests in partition_info.sql for tables
> and indexes (both work).

Well.  I tried this scenario:
create table t1 (a int);
create table t11 () inherits (t1);
create table t2 (b int);
create table t111() inherits (t1, t2);

and the result I get from my new function is not good:
alvherre=# select * from pg_partition_ancestors('t111');
 relid 
---
 t111
 t1
(2 filas)

it should have listed t2 too, but it doesn't.  Since these functions
aren't supposed to work on legacy inheritance anyway, I think the right
action is to return the empty set.  In the current version I just do
what pg_partition_tree does, but I think we should adjust that behavior.
I'll start a new thread about that.

> For the meaning of using pg_partition_ancestors, I see...  Not only do
> you want to show the foreign keys defined in the top-most parent, but
> also these defined in intermediate layers.  That makes sense.  Using
> only pg_partition_root would have been enough to show FKs in the
> top-most parent, but the intermediate ones would be missed (using only
> pg_partition_root() would miss the FKs fk_partitioned_fk_5_a_fkey1 and
> fk_partitioned_fk_5_a_fkey when doing "\d fk_partitioned_fk_5_1" based
> on the test set).

Exactly -- that's the whole point.  We need to list all FKs that are
applicable to the partition, indicating which relation is the one where
the FK originates, and without polluting the output with countless
"internal" pg_constraint rows.  The output psql presents for the PK-side
relation when it's partitioned, with my patch to support that, is quite
ugly when there are many partitions.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>From 9e5cc595ea952763fdcd879ec19f3cbff82edae6 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Tue, 26 Feb 2019 17:05:24 -0300
Subject: [PATCH v4] pg_partition_ancestors

---
 doc/src/sgml/func.sgml   | 10 +++
 src/backend/utils/adt/partitionfuncs.c   | 49 +
 src/include/catalog/pg_proc.dat  |  5 ++
 src/test/regress/expected/partition_info.out | 72 +++-
 src/test/regress/sql/partition_info.sql  | 18 -
 5 files changed, 151 insertions(+), 3 deletions(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 86ff4e5c9e2..ee3962a6c92 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -20274,6 +20274,16 @@ postgres=# SELECT * FROM pg_walfile_name_offset(pg_stop_backup());
 their partitions, and so on.

   
+  
+   
+pg_partition_ancestors
+pg_partition_ancestors(regclass)
+   
+   setof regclass
+   
+List the ancestor relations of the given partition.
+   
+  
   

 pg_partition_root
diff --git a/src/backend/utils/adt/partitionfuncs.c b/src/backend/utils/adt/partitionfuncs.c
index ffd66b64394..2e7bf01106b 100644
--- a/src/backend/utils/adt/partitionfuncs.c
+++ b/src/backend/utils/adt/partitionfuncs.c
@@ -208,3 +208,52 @@ pg_partition_root(PG_FUNCTION_ARGS)
 	Assert(OidIsValid(rootrelid));
 	PG_RETURN_OID(rootrelid);
 }
+
+/*
+ * pg_partition_ancestors
+ *
+ * Produces a view with one row per ancestor of the given partition,
+ * including the input relation itself.
+ */
+Datum
+pg_partition_ancestors(PG_FUNCTION_ARGS)
+{
+	Oid			relid = PG_GETARG_OID(0);
+	FuncCallContext *funcctx;
+	ListCell  **next;
+
+	if (!check_rel_can_be_partition(relid))
+		PG_RETURN_NULL();
+
+	if (SRF_IS_FIRSTCALL())
+	{
+		MemoryContext oldcxt;
+		List	   *ancestors;
+
+		funcctx = SRF_FIRSTCALL_INIT();
+
+		oldcxt = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+
+		ancestors = get_partition_ancestors(relid);
+		ancestors = lcons_oid(relid, ancestors);
+
+		next = (ListCell **) palloc(sizeof(ListCell *));
+		*next = list_head(ancestors);
+		funcctx->user_fctx = (void *) next;
+
+		

Re: pg_partition_tree crashes for a non-defined relation

2019-02-27 Thread Alvaro Herrera
On 2018-Dec-09, Tom Lane wrote:

> Stephen Frost  writes:
> > * Tom Lane (t...@sss.pgh.pa.us) wrote:
> >> ... especially in code that's highly unlikely to break once written.
> 
> > I don't entirely buy off on the argument that it's code that's 'highly
> > unlikely to break once written' though- we do add new relkinds from time
> > to time, for example.  Perhaps we could have these functions run just
> > once per relkind.
> 
> Well, the relevant code is likely to be "if relkind is not x, y, or z,
> then PG_RETURN_NULL".  If we add a new relkind and forget to consider the
> function, the outcome is a NULL result that perhaps should not have been
> NULL ... but a test like this won't help us notice that.

I just happened to come across the result of this rationale in
pg_partition_tree() (an SRF) while developing a new related function,
pg_partition_ancestors(), and find the resulting behavior rather absurd
-- it returns one row with all NULL columns, rather than no rows.  I
think the sensible behavior would be to do SRF_RETURN_DONE() before
stashing any rows to the output, so that we get an empty result set
instead.

alvherre=# select * from pg_partition_tree('information_schema.sequences');
 relid | parentrelid | isleaf | level 
---+-++---
   | ||  
(1 fila)

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Remove Deprecated Exclusive Backup Mode

2019-02-27 Thread Christophe Pettus



> On Feb 26, 2019, at 11:38, Magnus Hagander  wrote:
> That should not be a wiki page, really, that should be part of the main 
> documentation. 

I was just suggesting using a wiki page to draft it before we drop it into the 
main documentation.  I'm open to other suggestions, of course!

--
-- Christophe Pettus
   x...@thebuild.com




Re: query logging of prepared statements

2019-02-27 Thread Justin Pryzby
On Fri, Feb 15, 2019 at 08:57:04AM -0600, Justin Pryzby wrote:
> I propose that the prepared statement associated with an EXECUTE should be
> included in log "DETAIL" only when log_error_verbosity=VERBOSE, for both SQL
> EXECUTE and PQexecPrepared (if at all).  I'd like to be able to continue
> logging DETAIL (including bind parameters), so I'd like to avoid setting
> "TERSE" just to avoid redundant messages.  (It occurs to me that the GUC 
> should
> probably stick to its existing documented behavior rather than be extended,
> which suggests that the duplicative portions of the logs should simply be
> removed, rather than conditionalized.  Let me know what you think).

I'm attaching a v2 patch which avoids repeated logging of PREPARE, rather than
making such logs conditional on log_error_verbosity=VERBOSE, since
log_error_verbosity is documented to control whether these are output:
DETAIL/HINT/QUERY/CONTEXT/SQLSTATE.

For SQL EXECUTE, excluding "detail" seems reasonable (perhaps depending on
log_error_verbosity); the documentation for that GUC says:
https://www.postgresql.org/docs/current/runtime-config-logging.html
|Controls the amount of detail written in the server log for each message that
|is logged. Valid values are TERSE, DEFAULT, and VERBOSE, each adding more
|fields to displayed messages. TERSE excludes the logging of DETAIL, HINT,
|QUERY, and CONTEXT error information. VERBOSE output includes the SQLSTATE
|error code (see also Appendix A) and the source code file name, function name,
|and line number that generated the error. Only superusers can change this
|setting.

As I mentioned in my original message, it seems odd that for SQL EXECUTE, the
PREPARE is shown in "detail", but for the library call, it's shown in
"message".  This patch resolves that inconsistency by showing it in neither.
>From d60957d9cd1108f389dde0125fadee71a96b4229 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Sat, 9 Feb 2019 19:20:43 -0500
Subject: [PATCH v2] Avoid repetitive log of PREPARE during EXECUTE of prepared
 statements

---
 src/backend/tcop/postgres.c | 13 +
 1 file changed, 5 insertions(+), 8 deletions(-)

diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 8b4d94c9a1..dac5362c81 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -1041,8 +1041,7 @@ exec_simple_query(const char *query_string)
 	{
 		ereport(LOG,
 (errmsg("statement: %s", query_string),
- errhidestmt(true),
- errdetail_execute(parsetree_list)));
+ errhidestmt(true)));
 		was_logged = true;
 	}
 
@@ -2062,14 +2061,13 @@ exec_execute_message(const char *portal_name, long max_rows)
 	if (check_log_statement(portal->stmts))
 	{
 		ereport(LOG,
-(errmsg("%s %s%s%s: %s",
+(errmsg("%s %s%s%s",
 		execute_is_fetch ?
 		_("execute fetch from") :
 		_("execute"),
 		prepStmtName,
 		*portal_name ? "/" : "",
-		*portal_name ? portal_name : "",
-		sourceText),
+		*portal_name ? portal_name : ""),
  errhidestmt(true),
  errdetail_params(portalParams)));
 		was_logged = true;
@@ -2150,15 +2148,14 @@ exec_execute_message(const char *portal_name, long max_rows)
 			break;
 		case 2:
 			ereport(LOG,
-	(errmsg("duration: %s ms  %s %s%s%s: %s",
+	(errmsg("duration: %s ms  %s %s%s%s",
 			msec_str,
 			execute_is_fetch ?
 			_("execute fetch from") :
 			_("execute"),
 			prepStmtName,
 			*portal_name ? "/" : "",
-			*portal_name ? portal_name : "",
-			sourceText),
+			*portal_name ? portal_name : ""),
 	 errhidestmt(true),
 	 errdetail_params(portalParams)));
 			break;
-- 
2.12.2



Re: New vacuum option to do only freezing

2019-02-27 Thread Bossart, Nathan
On 2/27/19, 2:08 AM, "Masahiko Sawada"  wrote:
>> +   if (skip_index_vacuum)
>> +   appendStringInfo(&buf, ngettext("%.0f tuple is left as 
>> dead.\n",
>> +
>>"%.0f tuples are left as dead.\n",
>> +
>>nleft),
>> +nleft);
>>
>> I think we could emit this metric for all cases, not only when
>> DISABLE_INDEX_CLEANUP is used.
>
> I think that tups_vacuumed shows the total number of vacuumed tuples and
> is already shown in the log message. The 'nleft' counts the total
> number of recorded dead tuples but does not count tuples removed during
> HOT-pruning. Is this valuable for users in the non-DISABLE_INDEX_CLEANUP
> case?

I think it is valuable.  When DISABLE_INDEX_CLEANUP is not used or it
is used for a relation with no indexes, it makes it clear that no
tuples were left marked as dead.  Also, it looks like all of the other
information here is provided regardless of the options used.  IMO it
is good to list all of the stats so that users have the full picture
of what VACUUM did.

Nathan



Re: [HACKERS] Block level parallel vacuum

2019-02-27 Thread Robert Haas
On Thu, Feb 14, 2019 at 5:17 AM Masahiko Sawada  wrote:
> Thank you. Attached the rebased patch.

Here are some review comments.

+ started by a single utility command.  Currently, the parallel
+ utility commands that support the use of parallel workers are
+ CREATE INDEX and VACUUM
+ without FULL option, and only when building
+ a B-tree index.  Parallel workers are taken from the pool of

That sentence is garbled.  The end part about b-tree indexes applies
only to CREATE INDEX, not to VACUUM, since VACUUM does not build indexes.

+  Vacuum index and cleanup index in parallel
+  N background
workers (for the detail
+  of each vacuum phases, please refer to . If the

I have two problems with this.  One is that I can't understand the
English very well. I think you mean something like: "Perform the
'vacuum index' and 'cleanup index' phases of VACUUM in parallel using
N background workers," but I'm not entirely sure.  The other is that
if that is what you mean, I don't think it's a sufficient description.
Users need to understand whether, for example, only one worker can be
used per index, or whether the work for a single index can be split
across workers.

+  parallel degree N
is omitted,
+  then VACUUM decides the number of workers based on
+  number of indexes on the relation which further limited by
+  . Also if
this option

Now this makes it sound like it's one worker per index, but you could
be more explicit about it.

+  is specified multile times, the last parallel degree
+  N is considered
into the account.

Typo, but I'd just delete this sentence altogether; the behavior if
the option is multiply specified seems like a triviality that need not
be documented.

+Setting a value for parallel_workers via
+ also controls how many parallel
+worker processes will be requested by a VACUUM
+against the table. This setting is overwritten by setting
+N of
PARALLEL
+option.

I wonder if we really want this behavior.  Should a setting that
controls the degree of parallelism when scanning the table also affect
VACUUM?  I tend to think that we probably don't ever want VACUUM of a
table to be parallel by default, but rather something that the user
must explicitly request.  Happy to hear other opinions.  If we do want
this behavior, I think this should be written differently, something
like this: The PARALLEL N option to VACUUM takes precedence over this
option.

+ * parallel mode nor destories the parallel context. For updating the index

Spelling.

+/* DSM keys for parallel lazy vacuum */
+#define PARALLEL_VACUUM_KEY_SHARED UINT64CONST(0xFFF1)
+#define PARALLEL_VACUUM_KEY_DEAD_TUPLES UINT64CONST(0xFFF2)
+#define PARALLEL_VACUUM_KEY_QUERY_TEXT UINT64CONST(0xFFF3)

Any special reason not to use just 1, 2, 3 here?  The general
infrastructure stuff uses high numbers to avoid conflicting with
plan_node_id values, but end clients of the parallel infrastructure
can generally just use small integers.

+ bool updated; /* is the stats updated? */

is -> are

+ * LVDeadTuples controls the dead tuple TIDs collected during heap scan.

what do you mean by "controls", exactly? stores?

+ * This is allocated in a dynamic shared memory segment when parallel
+ * lazy vacuum mode, or allocated in a local memory.

If this is in DSM, then max_tuples is a wart, I think.  We can't grow
the segment at that point.  I'm suspicious that we need a better
design here.  It looks like you gather all of the dead tuples in
backend-local memory and then allocate an equal amount of DSM to copy
them.  But that means that we are using twice as much memory, which
seems pretty bad.  You'd have to do that at least momentarily no
matter what, but it's not obvious that the backend-local copy is ever
freed.  There's another patch kicking around to allocate memory for
vacuum in chunks rather than preallocating the whole slab of memory at
once; we might want to think about getting that committed first and
then having this build on top of it.  At least we need something
smarter than this.

-heap_vacuum_rel(Relation onerel, int options, VacuumParams *params,
+heap_vacuum_rel(Relation onerel, VacuumOptions options, VacuumParams *params,

We generally avoid passing a struct by value; copying the struct can
be expensive and having multiple shallow copies of the same data
sometimes leads to surprising results.  I think it might be a good
idea to propose a preliminary refactoring patch that invents
VacuumOptions and gives it just a single 'int' member and refactors
everything to use it, and then that can be committed first.  It should
pass a pointer, though, not the actual struct.

+ LVState*lvstate;

It's not clear to me why we need this new LVState thing.  What's the
motivation for that?  If it's a good idea, could it be done as a
separate, preparatory patch?  It seems to be responsible for a lot of
code churn in this patch.   It

Re: Index INCLUDE vs. Bitmap Index Scan

2019-02-27 Thread Tomas Vondra


On 2/27/19 6:36 AM, Markus Winand wrote:
> 
> 
>> On 27.02.2019, at 00:22, Tom Lane  wrote:
>>
>> Markus Winand  writes:
>>> I think Bitmap Index Scan should take advantage of B-tree INCLUDE columns, 
>>> which it doesn’t at the moment (tested on master as of yesterday).
>>
>> Regular index scans don't do what you're imagining either (i.e., check
>> filter conditions in advance of visiting the heap).  There's a roadblock
>> to implementing such behavior, which is that we might end up applying
>> filter expressions to dead rows.  That could make users unhappy.
>> For example, given a filter condition like "1.0/c > 0.1", people
>> would complain if it still got zero-divide failures even after they'd
>> deleted all rows with c=0 from their table.
> 
> Ok, but I don’t see how this case is different for key columns vs. INCLUDE 
> columns.
> 

Yeah, I'm a bit puzzled by this difference too - why would it be safe
for keys and not the other included columns?

> When I test this with the (a, b, c) index (no INCLUDE), different
> plans are produced for "c=1" (my original example) vs. "1.0/c > 0.1”.

I do recall a discussion about executing expressions on index tuples
during IOS (before/without inspecting the heap tuple). I'm too lazy to
search for the thread now, but I recall it was somehow related to
leak-proof-ness. And AFAICS none of the "div" procs are marked as
leak-proof, so perhaps that's one of the reasons?
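
A quick catalog check illustrates that (just a sketch):

select o.oprname, p.proname, p.proleakproof
  from pg_operator o
  join pg_proc p on p.oid = o.oprcode
 where o.oprleft = 'integer'::regtype
   and o.oprright = 'integer'::regtype
   and o.oprname in ('=', '/');
-- int4eq is marked leakproof while int4div is not, which matches the
-- behavior above.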

Of course, this does not explain why equality conditions and such (which
are generally leak-proof) couldn't be moved to the bitmap index scan.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [RFC] [PATCH] Flexible "partition pruning" hook

2019-02-27 Thread Peter Eisentraut
On 2019-02-26 19:06, Mike Palmiotto wrote:
> The desired effect would be to have `SELECT * from test.partpar;`
> check only the partitions where the user can see any row in the
> table based on column b. This is applicable, for instance, when a
> partition of test.partpar (say test.partpar_b2) is given a label with
> `SECURITY LABEL on TABLE test.partpar_b2 IS 'foo';` which is exactly
> the same as the b column for every row in said partition. Using this
> hook, we can simply check the table label and kick the entire
> partition out early on. This should greatly improve performance for
> the case where you can enforce that the partition SECURITY LABEL and
> the b column are the same.

To rephrase this: You have a partitioned table, and you have a RLS
policy that hides certain rows, and you know based on your business
logic that under certain circumstances entire partitions will be hidden,
so they don't need to be scanned.  So you want a planner hook that would
allow you to prune those partitions manually.

That sounds pretty hackish to me.  We should give the planner and
executor the appropriate information to make these decisions, like an
additional partition constraint.  If this information is hidden in
user-defined functions in a way that cannot be reasoned about, who is
enforcing these constraints and how do we know they are actually correct?

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Pluggable Storage - Andres's take

2019-02-27 Thread Andres Freund
Hi,

On 2019-02-27 18:00:12 +0800, Heikki Linnakangas wrote:
> I haven't been following this thread closely, but I looked briefly at some
> of the patches posted here:

Thanks!


> On 21/01/2019 11:01, Andres Freund wrote:
> > The patchset is now pretty granularly split into individual pieces.
> 
> Wow, 42 patches, very granular indeed! That's nice for reviewing, but are
> you planning to squash them before committing? Seems a bit excessive for the
> git history.

I've pushed a number of the preliminary patches since you replied. We're
down to 23 in my local count...

I do plan / did squash some, but not actually that many. I find that
patches after a certain size are just too hard to do the necessary final
polish to, especially if they do several independent things. Keeping
things granular also allows to push incrementally, even when later
patches aren't quite ready - imo pretty important for a project this
size.


> Patches 1-4:
> 
> * v12-0001-WIP-Introduce-access-table.h-access-relation.h.patch
> * v12-0002-Replace-heapam.h-includes-with-relation.h-table..patch
> * v12-0003-Replace-uses-of-heap_open-et-al-with-table_open-.patch
> * v12-0004-Remove-superfluous-tqual.h-includes.patch
> 
> These look good to me. I think it would make sense to squash these together,
> and commit now.

I've pushed these a while ago.


> Patches 20 and 21:
> * v12-0020-WIP-Slotified-triggers.patch
> * v12-0021-WIP-Slotify-EPQ.patch
> 
> I like this slotification of trigger and EPQ code. It seems like a nice
> thing to do, independently of the other patches. You said you wanted to
> polish that up to committable state, and commit separately: +1 on
> that.

I pushed the trigger patch yesterday evening. Working to finalize the
EPQ patch now - I've polished it a fair bit since the version posted on
the list, but it still needs a bit more.

Once the EPQ patch (and two other simple preliminary ones) is pushed, I
plan to post a new rebased version to this thread. That's then really
only the core table AM work.


> > --- a/src/include/commands/trigger.h
> > +++ b/src/include/commands/trigger.h
> > @@ -35,8 +35,8 @@ typedef struct TriggerData
> > HeapTuple   tg_trigtuple;
> > HeapTuple   tg_newtuple;
> > Trigger*tg_trigger;
> > -   Buffer  tg_trigtuplebuf;
> > -   Buffer  tg_newtuplebuf;
> > +   TupleTableSlot *tg_trigslot;
> > +   TupleTableSlot *tg_newslot;
> > Tuplestorestate *tg_oldtable;
> > Tuplestorestate *tg_newtable;
> >  } TriggerData;
> 
> Do we still need tg_trigtuple and tg_newtuple? Can't we always use the
> corresponding slots instead? Is it for backwards-compatibility with
> user-defined triggers written in C?

Yes, the external trigger interface currently relies on those being
there. I think we probably ought to revise that, at the very least
because it'll otherwise be noticably less efficient to have triggers on
!heap tables, but also because it's just cleaner.  But I feel like I
don't want more significantly sized patches on my plate right now, so my
current goal is to just put that on the todo after the pluggable storage
work.  Kinda wonder if we don't want to do that earlier in a release
cycle too, so we can do other breaking changes to the trigger interface
without breaking external code multiple times.  There's probably also an
argument for just not breaking the interface.


> (Documentation also needs to be updated for the changes in this
> struct)

Ah, nice catch, will do that next.

Greetings,

Andres Freund



Re: Segfault when restoring -Fd dump on current HEAD

2019-02-27 Thread Dmitry Dolgov
> On Wed, Feb 27, 2019 at 1:32 PM Alvaro Herrera  
> wrote:
>
> > > I think it would be better to just put back the .defn = "" (etc) to the
> > > ArchiveEntry calls.
> >
> > Then we should do this not only for defn, but for owner and dropStmt too.
>
> Yeah, absolutely.

Done, please find the attached patch.

> > I can
> > update the fix patch I've sent before, if it's preferrable approach in this
> > particular situation.
>
> I'm not sure we need those changes, since we're forced to update all
> callsites anyway.

I guess we can keep the part about removing the null checks before using strlen,
since those checks are going to be useless.

> On Wed, Feb 27, 2019 at 10:36 AM Dmitry Dolgov <9erthali...@gmail.com> wrote:
>
> > On Tue, Feb 26, 2019 at 11:53 PM Alvaro Herrera  
> > wrote:
> >
> > On 2019-Feb-26, Dmitry Dolgov wrote:
> >
> > > Yes, it should be rather simple, we can e.g. return to the old less 
> > > consistent
> > > NULL handling approach something (like in the attached patch), or replace 
> > > a NULL
> > > value with an empty string in WriteToc. Give me a moment, I'll check it 
> > > out. At
> > > the same time I would suggest to keep replace_line_endings -> 
> > > sanitize_line,
> > > since it doesn't break compatibility.
> >
> > Hmm, shouldn't we modify sanitize_line so that it returns strdup(hyphen)
> > when input is empty and want_hyphen, too?
>
> Yes, you're right.

I've looked closer, and it looks like I was mistaken. In the only place where it
matters, we pass NULL anyway after checking noOwner:

sanitized_owner = sanitize_line(ropt->noOwner ? NULL : te->owner, true);

So I haven't changed sanitize_line yet, but can update it if there is a strong
opinion about this function.


0001-ArchiveEntry-null-handling.patch
Description: Binary data


Re: Segfault when restoring -Fd dump on current HEAD

2019-02-27 Thread Tom Lane
Alvaro Herrera  writes:
> On 2019-Feb-27, Dmitry Dolgov wrote:
>> But I hope there are no objections if I'll then submit the original
>> changes with more consistent null handling separately to make decision
>> about them more consciously.

> I think we should save such a patch for whenever we next update the
> archive version number, which could take a couple of years given past
> history.  I'm inclined to add a comment near K_VERS_SELF to remind
> whoever next patches it.

+1.  This isn't an unreasonable cleanup idea, but being only a cleanup
idea, it doesn't seem worth creating compatibility issues for.  Let's
wait till there is some more-pressing reason to change the archive format,
and then fix this in the same release cycle.

I'd also note that given what we've seen so far, there are going to be
some slow-to-flush-out null pointer dereferencing bugs from this.
I'm not really eager to introduce that towards the tail end of a devel
cycle.

regards, tom lane



Re: Index INCLUDE vs. Bitmap Index Scan

2019-02-27 Thread Tom Lane
Markus Winand  writes:
>> On 27.02.2019, at 02:00, Justin Pryzby  wrote:
>> On Tue, Feb 26, 2019 at 09:07:01PM +0100, Markus Winand wrote:
>>> (As a side note: I also dislike how Bitmap Index Scan mixes search 
>>> conditions and filters in “Index Cond”)

>> I don't think it's mixing them;  it's using index scan on leading *and*
>> nonleading column.  That's possible even if frequently not efficient.

> The distinction leading / non-leading is very important for performance. 
> Other database products use different names in the execution plan so that it 
> is immediately visible (without knowing the index definition).

Other database products don't have the wide range of index types that
we do.  The concepts you propose using are pretty much meaningless
for non-btree indexes.  EXPLAIN doesn't really know which of the index
conditions will usefully cut down the index search space for the
particular type, so it just lists everything that has the right form
to be passed to the index AM.

Note that passing a condition to the AM, rather than executing it as
a filter, is generally a win when possible even if it fails to cut
the portion of the index searched at all.  That's because it can save
visits to the heap (tying back to the original point in this thread,
that we test index conditions, then heap liveness check, then filter
conditions).  So the planner is aggressive about pushing things into
that category when it can.

It might help to point out that to be an index condition, a WHERE
clause has to meet tighter conditions than just whether it mentions
an index column.  It generally has to be of the form "index_column
indexable_operator pseudo_constant" (though some index types support
some other cases like "index_column IS NULL" as index conditions too).
Clauses mentioning INCLUDE columns fail this test a priori, because
there are no indexable operators associated with an INCLUDE column.
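
To illustrate with a minimal sketch (made-up names, mirroring the earlier
example):

create table t (a int, b int, c int);
create index t_a_b_idx on t (a, b) include (c);
select * from t where a = 1 and b = 2 and c = 3;
-- "a = 1" and "b = 2" have the form "index_column indexable_operator
-- pseudo_constant", so they can be handed to the index AM as index
-- conditions; "c = 3" references an INCLUDE column, which has no
-- associated operator class, so it can only be applied as a filter.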

regards, tom lane



Re: [HACKERS] Can ICU be used for a database's default sort order?

2019-02-27 Thread Marius Timmer
Hello Andrey,

we would like to see ICU collations become the default for entire
databases as well, so we would like to review the patch.
Unfortunately, your patch from late October does not apply on the current
master.

Besides that, I noticed that the patch applies on the October master but
results in errors when compiling without "--with-icu" and running
"make check-world":
> libpq.so: Warning: undefined reference to »get_collprovider_name«
> libpq.so: Warning: undefined reference to
»is_valid_nondefault_collprovider«
> libpq.so: Warning: undefined reference to »get_collprovider«

This may be caused by your last modification:
> I've added LDFLAGS_INTERNAL += $(ICU_LIBS) in libpq, but I'm not
> entirely sure this is correct way to deal with complaints on ICU
> functions from libpq linking.


Best regards,

Marius Timmer




-- 
Westfälische Wilhelms-Universität Münster (WWU)
Zentrum für Informationsverarbeitung (ZIV)
Röntgenstraße 7-13
48149 Münster
+49 251 83 31158
marius.tim...@uni-muenster.de
https://www.uni-muenster.de/ZIV





Re: reloption to prevent VACUUM from truncating empty pages at the end of relation

2019-02-27 Thread Robert Haas
On Mon, Feb 25, 2019 at 4:25 AM Michael Paquier  wrote:
> Another thing that seems worth thinking about is a system-level GUC,
> and an option in the VACUUM command to control if truncation should
> happen or not.  We have a lot of infrastructure to control such
> options between vacuum and autovacuum, so it could be a waste to not
> consider potential synergies.

I don't think that a VACUUM option would be out of place, but a GUC
sounds like an attractive nuisance to me.  It will encourage people to
just flip it blindly instead of considering the particular cases where
they need that behavior, and I think chances are good that most people
who do that will end up being sad.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Oddity with parallel safety test for scan/join target in grouping_planner

2019-02-27 Thread Robert Haas
On Tue, Feb 26, 2019 at 7:26 AM Etsuro Fujita
 wrote:
> The parallel safety of the final scan/join target is determined from the
> grouping target, not that target, which [ is wrong ]

OOPS.  That's pretty embarrassing.

Your patch looks right to me.  I will now go look for a bag to put over my head.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Unneeded parallel safety tests in grouping_planner

2019-02-27 Thread Robert Haas
On Wed, Feb 27, 2019 at 7:46 AM Etsuro Fujita
 wrote:
> Yet another thing I noticed while working on [1] is this in
> grouping_planner:
>
> /*
>  * If the input rel is marked consider_parallel and there's nothing that's
>  * not parallel-safe in the LIMIT clause, then the final_rel can be marked
>  * consider_parallel as well.  Note that if the query has rowMarks or is
>  * not a SELECT, consider_parallel will be false for every relation in the
>  * query.
>  */
> if (current_rel->consider_parallel &&
>     is_parallel_safe(root, parse->limitOffset) &&
>     is_parallel_safe(root, parse->limitCount))
>     final_rel->consider_parallel = true;
>
> If there is a need to add a LIMIT node, we don't consider generating
> partial paths for the final relation below (see commit
> 0927d2f46ddd4cf7d6bf2cc84b3be923e0aedc52), so it seems unnecessary
> anymore to assess the parallel-safety of the LIMIT and OFFSET clauses.
> To save cycles, why not remove those tests from that function like the
> attached?

Because in the future we might want to consider generating
partial_paths in cases where we don't do so today.

I repeatedly made the mistake of believing that I could not bother
setting consider_parallel entirely correctly for one reason or
another, and I've gone through multiple iterations of fixing cases
where I did that and it turned out to cause problems.  I now believe
that we should try to get it right in every case, whether or not we
currently think it's possible for it to matter.  Sometimes it matters
in ways that aren't obvious, and it complicates further development.

I don't think we'd save much by changing this test anyway.  Those
is_parallel_safe() tests aren't entirely free, of course, but they
should be very cheap.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Parallel query vs smart shutdown and Postmaster death

2019-02-27 Thread Robert Haas
On Tue, Feb 26, 2019 at 5:44 PM Thomas Munro  wrote:
> Then perhaps we could do some more involved surgery on master that
> achieves smart shutdown's stated goal here, and lets parallel queries
> actually run?  Better ideas welcome.

I have noticed before that the smart shutdown code does not disclose
to the rest of the system that a shutdown is in progress: no signals
are sent, and no shared memory state is updated.  That makes it a bit
challenging for any other part of the system to respond to the smart
shutdown in a way that is, well, smart.  But I guess that's not really
the problem in this case.

It seems to me that we could fix pmdie() to skip parallel workers; I
think that the postmaster could notice that they are flagged as
BGWORKER_CLASS_PARALLEL.  But we'd also have to fix things so that new
parallel workers could be launched during a smart shutdown, which
looks not quite so simple.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: some hints to understand the plsql cursor.

2019-02-27 Thread Dilip Kumar
On Wed, Feb 27, 2019 at 4:42 PM Andy Fan  wrote:
>
> Actually I'm hacking pg for a feature like this:
> 1. Define a select query.
> 2. The client asks for some data and the server replies with some data;
> the server does NOTHING if the client doesn't ask for more.
> 3. The client asks for more data in a batch and the server replies with
> some data, then does NOTHING again.
>
> Currently, with a simple "select * from t", the server will try to send
> all the data to the client at one time, which is not what I want.
>
> By looking into plsql, it looks like there is an API like:
>
> fetch 10 from cursor_1;
> fetch 10 from cursor_1;
>
> I lack the experience to hack plsql, so my questions are:
> 1. Does pg have some code which acts like "ask -> reply -> ask again ->
> reply again" on the server side? Currently I'm not sure if the above
> "fetch" really works like this.
> 2. Any resources, hints, or suggestions to understand the "fetch" statement?

I guess you are looking for this syntax?

postgres=# BEGIN;
BEGIN
postgres=# DECLARE cur CURSOR FOR SELECT * FROM t;
DECLARE CURSOR
postgres=# FETCH NEXT cur;
 a
---
 1
(1 row)

postgres=# FETCH 10 cur;
 a
---
 2
 3
 4
 5
 1
 2
 3
 4
 5
 6
(10 rows)

postgres=# FETCH NEXT cur;
 a
---
 7
(1 row)

postgres=# CLOSE cur;
CLOSE CURSOR


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: get_controlfile() can leak fds in the backend

2019-02-27 Thread Joe Conway
On 2/27/19 2:47 AM, Michael Paquier wrote:
> Hi all,
> (CC-ing Joe as of dc7d70e)
> 
> I was just looking at the offline checksum patch, and noticed some
> sloppy coding in controldata_utils.c.  The control file gets opened in
> get_controlfile(), and if it generates an error then the file
> descriptor remains open.  As basic code rule in the backend we should
> only use BasicOpenFile() when opening files, so I think that the issue
> should be fixed as attached, even if this requires including fd.h for
> the backend compilation which is kind of ugly.
> 
> Another possibility would be to just close the file descriptor before
> any error, saving appropriately errno before issuing any %m portion,
> but that still does not respect the backend policy regarding open().

In fd.c I see:
8<
* AllocateFile, AllocateDir, OpenPipeStream and OpenTransientFile are
* wrappers around fopen(3), opendir(3), popen(3) and open(2),
* respectively. They behave like the corresponding native functions,
* except that the handle is registered with the current subtransaction,
* and will be automatically closed at abort. These are intended mainly
* for short operations like reading a configuration file; there is a
* limit on the number of files that can be opened using these functions
* at any one time.
*
* Finally, BasicOpenFile is just a thin wrapper around open() that can
* release file descriptors in use by the virtual file descriptors if
* necessary. There is no automatic cleanup of file descriptors returned
* by BasicOpenFile, it is solely the caller's responsibility to close
* the file descriptor by calling close(2).
8<

According to that comment BasicOpenFile does not seem to solve the issue
you are pointing out (leaking of file descriptor on ERROR). Perhaps
OpenTransientFile() is more appropriate? I am on the road at the moment
so have not looked very deeply at this yet though.

Joe
-- 
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development





Re: [RFC] [PATCH] Flexible "partition pruning" hook

2019-02-27 Thread Mike Palmiotto
On Tue, Feb 26, 2019 at 1:06 PM Mike Palmiotto
 wrote:
>
> On Tue, Feb 26, 2019 at 1:55 AM Tsunakawa, Takayuki
>  wrote:
> >
> > From: Mike Palmiotto [mailto:mike.palmio...@crunchydata.com]
> > > Attached is a patch which attempts to solve a few problems:

Updated patch attached.

> > > 
> > What concrete problems would you expect this patch to solve?  What kind of 
> > extensions do you imagine?  I'd like to hear about the examples.  For 
> > example, "PostgreSQL 12 will not be able to filter out enough partitions 
> > when planning/executing SELECT ... WHERE ... statement.  But an extension 
> > like this can extract just one partition early."
>
> My only application of the patch thus far has been to apply an RLS
> policy driven by the extension's results. For example:
>
> CREATE TABLE test.partpar
> (
>   a int,
>   b text DEFAULT (extension_default_bfield('test.partpar'::regclass::oid))
> )  PARTITION BY LIST (extension_translate_bfield(b));
>
> CREATE POLICY filter_select on test.partpar for SELECT
> USING (extension_filter_by_bfield(b));
>
> CREATE POLICY filter_select on test.partpar for INSERT
> WITH CHECK (extension_generate_insert_bfield('test.partpar'::regclass::oid)
> = b);
>
> CREATE POLICY filter_update on test.partpar for UPDATE
> USING (extension_filter_by_bfield(b))
> WITH CHECK (extension_filter_by_bfield(b));
>
> CREATE POLICY filter_delete on test.partpar for DELETE
> USING (extension_filter_by_bfield(b));
>
> The function would filter based on some external criteria relating to
> the username and the contents of the b column.
>
> The desired effect would be to have `SELECT * from test.partpar;`
> check only the partitions where the user can see any row in the
> table based on column b. This is applicable, for instance, when a
> partition of test.partpar (say test.partpar_b2) is given a label with
> `SECURITY LABEL on TABLE test.partpar_b2 IS 'foo';` which is exactly
> the same as the b column for every row in said partition. Using this
> hook, we can simply check the table label and kick the entire
> partition out early on. This should greatly improve performance for
> the case where you can enforce that the partition SECURITY LABEL and
> the b column are the same.

Is this explanation suitable, or is more information required?

Thanks,
-- 
Mike Palmiotto
Software Engineer
Crunchy Data Solutions
https://crunchydata.com
diff --git a/src/backend/catalog/pg_inherits.c b/src/backend/catalog/pg_inherits.c
index 00f7957b3d..45c5de990f 100644
--- a/src/backend/catalog/pg_inherits.c
+++ b/src/backend/catalog/pg_inherits.c
@@ -24,7 +24,9 @@
 #include "access/table.h"
 #include "catalog/indexing.h"
 #include "catalog/pg_inherits.h"
+#include "optimizer/cost.h"
 #include "parser/parse_type.h"
+#include "partitioning/partprune.h"
 #include "storage/lmgr.h"
 #include "utils/builtins.h"
 #include "utils/fmgroids.h"
@@ -92,11 +94,16 @@ find_inheritance_children(Oid parentrelId, LOCKMODE lockmode)
 	while ((inheritsTuple = systable_getnext(scan)) != NULL)
 	{
 		inhrelid = ((Form_pg_inherits) GETSTRUCT(inheritsTuple))->inhrelid;
+
+		if (enable_partition_pruning && inhrelid && !partitionChildAccess_hook(inhrelid))
+			continue;
+
 		if (numoids >= maxoids)
 		{
 			maxoids *= 2;
 			oidarr = (Oid *) repalloc(oidarr, maxoids * sizeof(Oid));
 		}
+
 		oidarr[numoids++] = inhrelid;
 	}
 
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 0debac75c6..185bfc10eb 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -1067,6 +1067,13 @@ set_append_rel_size(PlannerInfo *root, RelOptInfo *rel,
 			continue;
 		}
 
+		if (enable_partition_pruning && !partitionChildAccess_hook(childRTE->relid))
+		{
+			/* Implement custom partition pruning filter*/
+			set_dummy_rel_pathlist(childrel);
+			continue;
+		}
+
 		/*
 		 * CE failed, so finish copying/modifying targetlist and join quals.
 		 *
diff --git a/src/include/partitioning/partprune.h b/src/include/partitioning/partprune.h
index 2f75717ffb..968b0ccbe0 100644
--- a/src/include/partitioning/partprune.h
+++ b/src/include/partitioning/partprune.h
@@ -74,6 +74,12 @@ typedef struct PartitionPruneContext
 #define PruneCxtStateIdx(partnatts, step_id, keyno) \
 	((partnatts) * (step_id) + (keyno))
 
+/* Custom partition child access hook. Provides further partition pruning given
+ * child OID.
+ */
+typedef bool (*partitionChildAccess_hook_type) (Oid childOID);
+PGDLLIMPORT partitionChildAccess_hook_type partitionChildAccess_hook;
+
 extern PartitionPruneInfo *make_partition_pruneinfo(struct PlannerInfo *root,
 		 struct RelOptInfo *parentrel,
 		 List *subpaths,


Re: readdir is incorrectly implemented at Windows

2019-02-27 Thread Grigory Smolkin

Originally bug was reported by Yuri Kurenkov:
https://github.com/postgrespro/pg_probackup/issues/48

As pg_probackup relies on readdir() for listing files to back up, wrong
permissions could lead to a broken backup.


On 02/25/2019 06:38 PM, Konstantin Knizhnik wrote:

Hi hackers,

Small issue with the readdir() implementation for Windows.
Right now it returns ENOENT in case of any error returned by
FindFirstFile, so all places in Postgres where opendir/readdir are used
will assume that the directory is empty and do nothing without reporting
any error.  That is not so good if the directory is actually not empty but
there are not enough permissions for accessing it and FindFirstFile
returns ERROR_ACCESS_DENIED:

struct dirent *
readdir(DIR *d)
{
WIN32_FIND_DATA fd;

if (d->handle == INVALID_HANDLE_VALUE)
{
d->handle = FindFirstFile(d->dirname, &fd);
if (d->handle == INVALID_HANDLE_VALUE)
{
errno = ENOENT;
return NULL;
}
}


Attached please find small patch fixing the problem.



--
Grigory Smolkin
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




pg_dump/pg_restore fail for TAR_DUMP and CUSTOM_DUMP from v94/v95/v96 to v11/master.

2019-02-27 Thread Prabhat Sahu
Hi,

I got a failure in pg_dump/pg_restore as below:
pg_dump/pg_restore fails with 'ERROR: schema "public" already exists' for
TAR_DUMP and CUSTOM_DUMP from v94/v95/v96 to v11/master.

-- Take pg_dump in v94/v95/v96:
[prabhat@localhost bin]$ ./pg_dump -f /tmp/tar_dump_PG94.tar -Ft postgres
-p 9000
[prabhat@localhost bin]$ ./pg_dump -f /tmp/custom_dump_PG94.sql -Fc
postgres -p 9000

-- Try to restore the above dump into v11/master:
[prabhat@localhost bin]$ ./pg_restore -F t -U prabhat -d db3 -p 9001 /tmp/
tar_dump_PG94.tar
pg_restore: [archiver (db)] Error while PROCESSING TOC:
pg_restore: [archiver (db)] Error from TOC entry 6; 2615 2200 SCHEMA public
prabhat
pg_restore: [archiver (db)] could not execute query: ERROR:  schema
"public" already exists
Command was: CREATE SCHEMA public;



WARNING: errors ignored on restore: 1

[prabhat@localhost bin]$ ./pg_restore -F c -U prabhat -d db4 -p 9001 /tmp/
custom_dump_PG94.sql
pg_restore: [archiver (db)] Error while PROCESSING TOC:
pg_restore: [archiver (db)] Error from TOC entry 6; 2615 2200 SCHEMA public
prabhat
pg_restore: [archiver (db)] could not execute query: ERROR:  schema
"public" already exists
Command was: CREATE SCHEMA public;



WARNING: errors ignored on restore: 1

Note: I am able to perform "Plain dump/restore" across the branches.


-- 


With Regards,

Prabhat Kumar Sahu
Skype ID: prabhat.sahu1984
EnterpriseDB Corporation

The Postgres Database Company


Unneeded parallel safety tests in grouping_planner

2019-02-27 Thread Etsuro Fujita
Hi,

Yet another thing I noticed while working on [1] is this in
grouping_planner:

/*
 * If the input rel is marked consider_parallel and there's nothing that's
 * not parallel-safe in the LIMIT clause, then the final_rel can be marked
 * consider_parallel as well.  Note that if the query has rowMarks or is
 * not a SELECT, consider_parallel will be false for every relation in the
 * query.
 */
if (current_rel->consider_parallel &&
    is_parallel_safe(root, parse->limitOffset) &&
    is_parallel_safe(root, parse->limitCount))
    final_rel->consider_parallel = true;

If there is a need to add a LIMIT node, we don't consider generating
partial paths for the final relation below (see commit
0927d2f46ddd4cf7d6bf2cc84b3be923e0aedc52), so it seems unnecessary
anymore to assess the parallel-safety of the LIMIT and OFFSET clauses.
To save cycles, why not remove those tests from that function like the
attached?

Best regards,
Etsuro Fujita

[1] https://commitfest.postgresql.org/22/1950/
*** a/src/backend/optimizer/plan/planner.c
--- b/src/backend/optimizer/plan/planner.c
***
*** 2141,2155  grouping_planner(PlannerInfo *root, bool inheritance_update,
  	final_rel = fetch_upper_rel(root, UPPERREL_FINAL, NULL);
  
  	/*
! 	 * If the input rel is marked consider_parallel and there's nothing that's
! 	 * not parallel-safe in the LIMIT clause, then the final_rel can be marked
! 	 * consider_parallel as well.  Note that if the query has rowMarks or is
! 	 * not a SELECT, consider_parallel will be false for every relation in the
! 	 * query.
  	 */
! 	if (current_rel->consider_parallel &&
! 		is_parallel_safe(root, parse->limitOffset) &&
! 		is_parallel_safe(root, parse->limitCount))
  		final_rel->consider_parallel = true;
  
  	/*
--- 2141,2152 
  	final_rel = fetch_upper_rel(root, UPPERREL_FINAL, NULL);
  
  	/*
! 	 * If the input rel is marked consider_parallel and there's no need to add
! 	 * a LIMIT node, then the final_rel can be marked consider_parallel as
! 	 * well.  Note that if the query has rowMarks or is not a SELECT,
! 	 * consider_parallel will be false for every relation in the query.
  	 */
! 	if (current_rel->consider_parallel && !limit_needed(parse))
  		final_rel->consider_parallel = true;
  
  	/*
***
*** 2263,2272  grouping_planner(PlannerInfo *root, bool inheritance_update,
  	 * Generate partial paths for final_rel, too, if outer query levels might
  	 * be able to make use of them.
  	 */
! 	if (final_rel->consider_parallel && root->query_level > 1 &&
! 		!limit_needed(parse))
  	{
! 		Assert(!parse->rowMarks && parse->commandType == CMD_SELECT);
  		foreach(lc, current_rel->partial_pathlist)
  		{
  			Path	   *partial_path = (Path *) lfirst(lc);
--- 2260,2269 
  	 * Generate partial paths for final_rel, too, if outer query levels might
  	 * be able to make use of them.
  	 */
! 	if (final_rel->consider_parallel && root->query_level > 1)
  	{
! 		Assert(!parse->rowMarks && !limit_needed(parse) &&
! 			   parse->commandType == CMD_SELECT);
  		foreach(lc, current_rel->partial_pathlist)
  		{
  			Path	   *partial_path = (Path *) lfirst(lc);


Re: Segfault when restoring -Fd dump on current HEAD

2019-02-27 Thread Alvaro Herrera
On 2019-Feb-27, Dmitry Dolgov wrote:

> > On Tue, Feb 26, 2019 at 11:53 PM Alvaro Herrera  
> > wrote:
> >
> > I think it would be better to just put back the .defn = "" (etc) to the
> > ArchiveEntry calls.
> 
> Then we should do this not only for defn, but for owner and dropStmt too.

Yeah, absolutely.

> I can
> update the fix patch I've sent before, if it's preferrable approach in this
> particular situation.

I'm not sure we need those changes, since we're forced to update all
callsites anyway.

> But I hope there are no objections if I'll then submit the original
> changes with more consistent null handling separately to make decision
> about them more consciously.

I think we should save such a patch for whenever we next update the
archive version number, which could take a couple of years given past
history.  I'm inclined to add a comment near K_VERS_SELF to remind
whoever next patches it.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



some hints to understand the plsql cursor.

2019-02-27 Thread Andy Fan
Actually I'm hacking pg for a feature like this:
1. Define a select query.
2. The client asks for some data and the server replies with some data;
the server does NOTHING if the client doesn't ask for more.
3. The client asks for more data in a batch and the server replies with
some data, then does NOTHING again.

Currently, with a simple "select * from t", the server will try to send
all the data to the client at one time, which is not what I want.

By looking into plsql, it looks like there is an API like:

fetch 10 from cursor_1;
fetch 10 from cursor_1;

I lack the experience to hack plsql, so my questions are:
1. Does pg have some code which acts like "ask -> reply -> ask again
-> reply again" on the server side? Currently I'm not sure if the above
"fetch" really works like this.
2. Any resources, hints, or suggestions to understand the "fetch"
statement?

Thanks


Re: When is the MessageContext released?

2019-02-27 Thread Andy Fan
Thank you Andres for your time!  This context is freed with AllocSetReset
rather than AllocSetDelete, which is why my breakpoint didn't catch it.

On Wed, Feb 27, 2019 at 2:15 PM Andres Freund  wrote:

> On 2019-02-27 14:08:47 +0800, Andy Fan wrote:
> > Hi :
> >   I run a query like "select * from t" and set the break like this:
> >
> > break exec_simple_query
> > break MemoryContextDelete
> >   commands
> >p context->name
> >c
> >   end
> >
> > I can see most of the MemoryContexts are released, but never MessageContext,
> > when will it be released?
>
> It's released above exec_simple_query, as it actually contains the data
> that lead us to call exec_simple_query().  See the main for loop in
> PostgresMain():
>
> /*
>  * Release storage left over from prior query cycle, and
> create a new
>  * query input buffer in the cleared MessageContext.
>  */
> MemoryContextSwitchTo(MessageContext);
> MemoryContextResetAndDeleteChildren(MessageContext);
>
> Greetings,
>
> Andres Freund
>


Re: get_controlfile() can leak fds in the backend

2019-02-27 Thread Fabien COELHO




However, while at it, there is also the question of whether the control file
should be locked when updated, eg with flock(2) to avoid race conditions
between concurrent commands. ISTM that there is currently not such thing in
the code, but that it would be desirable.


Shouldn't be necessary - the control file fits into a single page, and
writes of that size ought to always be atomic. And I also think
introducing flock usage for this would be quite disproportional.


Ok, fine.

Note that my concern is not about the page size, but rather that as more 
commands may change the cluster status by editing the control file, it 
would be better that a postmaster does not start while a pg_rewind or 
enable checksum or whatever is in progress, and currently there is a 
possible race condition between the read and write that can induce an 
issue, at least theoretically.


--
Fabien.



Re: New vacuum option to do only freezing

2019-02-27 Thread Masahiko Sawada
On Wed, Feb 27, 2019 at 10:02 AM Bossart, Nathan  wrote:
>
> Sorry for the delay.  I finally got a chance to look through the
> latest patches.
>
> On 2/3/19, 1:48 PM, "Masahiko Sawada"  wrote:
> > On Fri, Feb 1, 2019 at 11:43 PM Bossart, Nathan  wrote:
> >>
> >> +   if (skip_index_vacuum)
> >> +   ereport(elevel,
> >> +   (errmsg("\"%s\": marked %.0f row 
> >> versions as dead in %u pages",
> >> +   
> >> RelationGetRelationName(onerel),
> >> +   tups_vacuumed, 
> >> vacuumed_pages)));
> >>
> >> IIUC tups_vacuumed will include tuples removed during HOT pruning, so
> >> it could be inaccurate to say all of these tuples have only been
> >> marked "dead."  Perhaps we could keep a separate count of tuples
> >> removed via HOT pruning in case we are using DISABLE_INDEX_CLEANUP.
> >> There might be similar problems with the values stored in vacrelstats
> >> that are used at the end of heap_vacuum_rel() (e.g. tuples_deleted).
> >
> > Yeah, tups_vacuumed includes such tuples so the message is inaccurate.
> > So I think that we should not change the message but we can report the
> > dead item pointers we left in errdetail. That's more accurate and
> > would help users.
> >
> > postgres(1:1130)=# vacuum (verbose, disable_index_cleanup) test;
> > INFO:  vacuuming "public.test"
> > INFO:  "test": removed 49 row versions in 1 pages
> > INFO:  "test": found 49 removable, 51 nonremovable row versions in 1
> > out of 1 pages
> > DETAIL:  0 dead row versions cannot be removed yet, oldest xmin: 509
> > There were 0 unused item pointers.
> > Skipped 0 pages due to buffer pins, 0 frozen pages.
> > 0 pages are entirely empty.
> > 49 tuples are left as dead.
> > CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
> > VACUUM
>
> This seems reasonable to me.
>
> The current version of the patches builds cleanly, passes 'make
> check-world', and seems to work well in my own manual tests.  I have a
> number of small suggestions, but I think this will be ready-for-
> committer soon.

Thank you for reviewing the patch!

>
> +   Assert(!skip_index_vacuum);
>
> There are two places in lazy_scan_heap() that I see this without any
> comment.  Can we add a short comment explaining why this should be
> true at these points?

Agreed. Will add a comment.

>
> +   if (skip_index_vacuum)
> +   appendStringInfo(&buf, ngettext("%.0f tuple is left as 
> dead.\n",
> + 
>   "%.0f tuples are left as dead.\n",
> + 
>   nleft),
> +nleft);
>
> I think we could emit this metric for all cases, not only when
> DISABLE_INDEX_CLEANUP is used.

I think that tups_vacuumed shows the total number of vacuumed tuples and
is already shown in the log message. The 'nleft' counts the total
number of recorded dead tuples but does not count tuples removed during
HOT-pruning. Is this valuable for users in the non-DISABLE_INDEX_CLEANUP
case?

>
> +   /*
> +* Remove tuples from heap if the table has no index. 
>  If the table
> +* has index but index vacuum is disabled, we don't 
> vacuum and forget
> +* them. The vacrelstats->dead_tuples could have 
> tuples which became
> +* dead after checked at HOT-pruning time which are 
> handled by
> +* lazy_vacuum_page() but we don't worry about 
> handling those because
> +* it's a very rare condition and these would not be 
> a large number.
> +*/
>
> Based on this, it sounds like nleft could be inaccurate.  Do you think
> it is worth adjusting the log message to reflect that, or is this
> discrepancy something we should just live with?  I think adding
> something like "at least N tuples left marked dead" is arguably a bit
> misleading, since the only time the number is actually higher is when
> this "very rare condition" occurs.

Hmm, I think it's true that we leave 'nleft' dead tuples because it
includes both tuples marked as dead and tuples not marked yet. So I
think the log message works. Maybe the comment leads to misreading. Will
fix it.

>
> +   /*
> +* Skip index vacuum if it's requested for table with indexes. In this
> +* case we use the one-pass strategy and don't remove tuple storage.
> +*/
> +   skip_index_vacuum =
> +   (options & VACOPT_DISABLE_INDEX_CLEANUP) != 0 && 
> vacrelstats->hasindex;
>
> AFAICT we don't actually need to adjust this based on
> vacrelstats->hasindex because we are already checking for indexes
> everywhere we check for this option.  What do you think about leaving
> that part out?
>

Re: Set fallback_application_name for a walreceiver to cluster_name

2019-02-27 Thread Peter Eisentraut
On 2019-02-21 01:36, Euler Taveira wrote:
>> By default, the fallback_application_name for a physical walreceiver is
>> "walreceiver".  This means that multiple standbys cannot be
>> distinguished easily on a primary, for example in pg_stat_activity or
>> synchronous_standby_names.
>>
> Although standby identification could be done via client_addr in
> pg_stat_activity, this feature could be useful with multiple clusters on
> the same host. Since it is a fallback application name, it can be
> overridden by an application_name in the primary_conninfo parameter.

Yeah, plus that doesn't help with synchronous_standby_names.
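
As a concrete sketch (the standby name is made up): on the standby,

ALTER SYSTEM SET cluster_name = 'standby1';  -- needs a restart

and on the primary the walreceiver's connection then shows up with
application_name = 'standby1' in pg_stat_activity (backend_type
'walsender') and in pg_stat_replication, so that name can be used in
synchronous_standby_names instead of the generic 'walreceiver'.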

>> I propose, if cluster_name is set, use that for
>> fallback_application_name in the walreceiver.  (If it's not set, it
>> remains "walreceiver".)  If someone set cluster_name to identify their
>> instance, we might as well use that by default to identify the node
>> remotely as well.  It's still possible to specify another
>> application_name in primary_conninfo explicitly.
>>
> I tested it and both patches work as described. Passes all tests. Doc
> describes the proposed feature. Doc builds without errors.

Committed, thanks!

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Pluggable Storage - Andres's take

2019-02-27 Thread Heikki Linnakangas
I haven't been following this thread closely, but I looked briefly at 
some of the patches posted here:


On 21/01/2019 11:01, Andres Freund wrote:

The patchset is now pretty granularly split into individual pieces.


Wow, 42 patches, very granular indeed! That's nice for reviewing, but 
are you planning to squash them before committing? Seems a bit excessive 
for the git history.


Patches 1-4:

* v12-0001-WIP-Introduce-access-table.h-access-relation.h.patch
* v12-0002-Replace-heapam.h-includes-with-relation.h-table..patch
* v12-0003-Replace-uses-of-heap_open-et-al-with-table_open-.patch
* v12-0004-Remove-superfluous-tqual.h-includes.patch

These look good to me. I think it would make sense to squash these 
together, and commit now.



Patches 20 and 21:
* v12-0020-WIP-Slotified-triggers.patch
* v12-0021-WIP-Slotify-EPQ.patch

I like this slotification of trigger and EPQ code. It seems like a nice 
thing to do, independently of the other patches. You said you wanted to 
polish that up to committable state, and commit separately: +1 on that. 
Perhaps do that even before patches 1-4.



--- a/src/include/commands/trigger.h
+++ b/src/include/commands/trigger.h
@@ -35,8 +35,8 @@ typedef struct TriggerData
HeapTuple   tg_trigtuple;
HeapTuple   tg_newtuple;
Trigger*tg_trigger;
-   Buffer  tg_trigtuplebuf;
-   Buffer  tg_newtuplebuf;
+   TupleTableSlot *tg_trigslot;
+   TupleTableSlot *tg_newslot;
Tuplestorestate *tg_oldtable;
Tuplestorestate *tg_newtable;
 } TriggerData;


Do we still need tg_trigtuple and tg_newtuple? Can't we always use the 
corresponding slots instead? Is it for backwards-compatibility with 
user-defined triggers written in C? (Documentation also needs to be 
updated for the changes in this struct)



I didn't look a the rest of the patches yet...

- Heikki



Re: Why don't we have a small reserved OID range for patch revisions?

2019-02-27 Thread Peter Eisentraut
On 2019-02-08 19:14, Tom Lane wrote:
> Quite a few people have used OIDs up around 8000 or 9000 for this purpose;
> I doubt we need a formally reserved range for it.  The main problem with
> doing it is the hazard that the patch'll get committed just like that,
> suddenly breaking things for everyone else doing likewise.

For that reason, I'm not in favor of this.  Forgetting to update the
catversion is already common enough (for me).  Adding another step
between having a seemingly final patch and being able to actually commit
it doesn't seem attractive.  Moreover, these "final adjustments" would
tend to require a full rebuild and retest, adding even more overhead.

OID collision doesn't seem to be a significant problem (for me).

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Oddity with parallel safety test for scan/join target in grouping_planner

2019-02-27 Thread Etsuro Fujita
(2019/02/26 21:25), Etsuro Fujita wrote:
> While working on [1], I noticed $Subject, that is:
> 
>  /*
>   * If we have grouping or aggregation to do, the topmost scan/join
>   * plan node must emit what the grouping step wants; otherwise, it
>   * should emit grouping_target.
>   */
>  have_grouping = (parse->groupClause || parse->groupingSets ||
>   parse->hasAggs || root->hasHavingQual);
>  if (have_grouping)
>  {
>  scanjoin_target = make_group_input_target(root, final_target);
> -->  scanjoin_target_parallel_safe =
>  is_parallel_safe(root, (Node *) grouping_target->exprs);
>  }
>  else
>  {
>  scanjoin_target = grouping_target;
>  scanjoin_target_parallel_safe = grouping_target_parallel_safe;
>  }
> 
> The parallel safety of the final scan/join target is determined from the
> grouping target, not that target, which would cause inferior parallel
> plans to be generated, as shown below:
> 
> pgbench=# explain verbose select aid+bid, sum(abalance), random() from
> pgbench_accounts group by aid+bid;
>   QUERY PLAN
> 
>   GroupAggregate  (cost=137403.01..159903.01 rows=100 width=20)
> Output: ((aid + bid)), sum(abalance), random()
> Group Key: ((pgbench_accounts.aid + pgbench_accounts.bid))
> ->   Sort  (cost=137403.01..139903.01 rows=100 width=8)
>   Output: ((aid + bid)), abalance
>   Sort Key: ((pgbench_accounts.aid + pgbench_accounts.bid))
>   ->   Gather  (cost=10.00..24070.67 rows=100 width=8)
> Output: (aid + bid), abalance
> Workers Planned: 2
> ->   Parallel Seq Scan on public.pgbench_accounts
> (cost=0.00..20560.67 rows=416667 width=12)
>   Output: aid, bid, abalance
> (11 rows)
> 
> The final scan/join target {(aid + bid), abalance} is definitely
> parallel safe, but the target evaluation isn't parallelized across
> workers, which is not good.  Attached is a patch for fixing this.

I added this to the upcoming commitfest.

Best regards,
Etsuro Fujita



