RE: libpq debug log

2019-02-20 Thread Iwata, Aya
Hi Ramanarayana,

Thank you for your review and suggestion. Please see the attached updated patch.

> Issues found while testing
>1) If invalid value is given to PGLOGMINLEVEL, empty log file is created which 
>should not happen.
Thank you for your test. However, in my dev environment, an empty log file is
not created.
Could you explain 1) in more detail? I will check it again.

>2) If log file size exceeds the value configured in PGLOGSIZE, new log file is 
>not getting created.
About 2) (and maybe 1) as well), perhaps the cause is something like this:

I made a mistake in the first-line information of the created log file,
"The maximum size of this log is %s *Bytes*, the parameter 'logminlevel' is set
to %s\n".
- The maximum size is not in bytes but in megabytes.
- That line shows the logminlevel set by the user; internally, an invalid value
is never assigned to logminlevel.

So if you trusted the first line of the created log file and set PGLOGSIZE=1000
with the meaning "set the maximum log size to 1000 bytes", a new file would not
be created even after the log exceeds 1000 bytes.
If that is what happened, I have fixed the message so that it outputs both the
internally used maximum log size and the value set by the user.

And if you set PGLOGMINLEVEL to an invalid value (e.g. "aaa"), it is not
assigned to the parameter; the default value (level1) is used internally.
I fixed the first-line message to output the notification "if invalid value,
level1 (default) is set".
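
To make the intended usage concrete, here is a minimal sketch (the
PGLOGSIZE/PGLOGMINLEVEL names and semantics are taken from this thread's
proposed patch, not from released libpq; the values are only examples):

    #include <stdlib.h>
    #include "libpq-fe.h"

    int
    main(void)
    {
        /* Proposed trace-log controls: PGLOGSIZE is interpreted in megabytes,
         * and an unrecognised PGLOGMINLEVEL falls back to level1. */
        setenv("PGLOGSIZE", "10", 1);
        setenv("PGLOGMINLEVEL", "level1", 1);

        PGconn *conn = PQconnectdb("dbname=postgres");
        /* ... run queries; the trace log is written per the settings above ... */
        PQfinish(conn);
        return 0;
    }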

>3) If PGLOGSIZE is greater than 2048 bytes, log file is not created. Is this 
>expected behavior?
Yes. I limit the log file size.

>4) In the log file, an extra new line is present whenever the query is 
>printed. Is this intentional?
Thank you, I have fixed it.

>5)Documentation for this feature is having grammatical errors and some 
>spelling errors which can be looked into.
Thank you. I am checking my documentation now. I will fix it.

> Feedback in the code
Thank you. I fixed my code issue.

> Suggestions
I'll consider that...
Could you explain more about the idea?

Regards,
Aya Iwata


v7-0001-libpq-trace-log.patch
Description: v7-0001-libpq-trace-log.patch


RE: Timeout parameters

2019-02-20 Thread Tsunakawa, Takayuki
From: Nagaura, Ryohei [mailto:nagaura.ryo...@jp.fujitsu.com]
> > Maybe.  Could you suggest a good description?
> Clients wait until the socket becomes readable when they try to get the
> results of their query.
> Once the socket becomes readable, clients read the results.
> (See src/interfaces/libpq/fe-exec.c, fe-misc.c)
> The current PostgreSQL code uses poll() to monitor socket states.
> "socket_timeout" is used as the timeout argument of poll().
> (See [1] about poll() and its arguments)
> 
> Because "tcp_user_timeout" handles the timeout before and after the above
> procedure, "socket_timeout" and "tcp_user_timeout" monitor different
> targets.
> When the server is saturated, "statement_timeout" doesn't work while
> "socket_timeout" does work.
> In addition to this, I think the purpose of "statement_timeout" is to
> reduce the server's burden.

OK, I understand your intent.  What I asked Kirk-san for was a description of
the socket_timeout parameter that the user would see in the PostgreSQL
documentation.


> My proposal is to enable clients to release the resources used for
> communication with the saturated server.

I thought the primary purpose is to prevent applications from hanging too long, 
so that the user does not have to wait infinitely.  Releasing resources is a 
secondary purpose.


Regards
Takayuki Tsunakawa




Re: libpq host/hostaddr/conninfo inconsistencies

2019-02-20 Thread Michael Paquier
On Wed, Feb 20, 2019 at 08:12:55PM -0500, Tom Lane wrote:
> The second para explains the cases in which you actually do need to
> provide "host", but I'm afraid that the first sentence will have
> misled people enough that they won't get the point.
>
> I don't think there's anything very wrong with the existing wording
> of this sentence.

I am not seeing anything bad with the first sentence either.  Still, if
people are willing to tweak its wording, that may indicate that something in
it is confusing.  Would a formulation like the following be an improvement?
Say, cutting the apple in half like that:
"Numeric IP address that can be used in replacement of host."

> Robert's second and third changes seem fine, though.

Agreed.
--
Michael




RE: Timeout parameters

2019-02-20 Thread Nagaura, Ryohei
Hi, Tsunakawa-san

Thank you for your comments.

On Wed, Feb 20, 2019 at 5:56 PM, Tsunakawa, Takayuki wrote:
> > Perhaps you could also clarify a bit more through documentation on how
> > socket_timeout handles the timeout differently from statement_timeout
> > and tcp_user_timeout.
> Maybe.  Could you suggest a good description?
Clients wait until the socket becomes readable when they try to get the
results of their query.
Once the socket becomes readable, clients read the results.
(See src/interfaces/libpq/fe-exec.c, fe-misc.c)
The current PostgreSQL code uses poll() to monitor socket states.
"socket_timeout" is used as the timeout argument of poll().
(See [1] about poll() and its arguments)

Because "tcp_user_timeout" handles the timeout before and after the above
procedure, "socket_timeout" and "tcp_user_timeout" monitor different targets.
When the server is saturated, "statement_timeout" doesn't work while
"socket_timeout" does work.
In addition to this, I think the purpose of "statement_timeout" is to reduce
the server's burden.
My proposal is to enable clients to release the resources used for
communication with the saturated server.
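
As a rough sketch of the mechanism described above (this is not the patch
itself; socket_timeout_ms stands in for the proposed parameter, converted to
milliseconds):

    #include <poll.h>
    #include "libpq-fe.h"

    static int
    wait_for_result(PGconn *conn, int socket_timeout_ms)
    {
        struct pollfd pfd;
        int         rc;

        pfd.fd = PQsocket(conn);
        pfd.events = POLLIN;

        /* Passing -1 would mean "wait forever"; the proposal bounds the wait. */
        rc = poll(&pfd, 1, socket_timeout_ms);

        if (rc == 0)
            return -1;          /* timed out: give up and drop the connection */

        return rc;              /* readable (or error): read results as usual */
    }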

On Wed, Feb 20, 2019 at 7:10 PM, Tsunakawa, Takayuki wrote:
> a) is not always accurate, because libpq is also used in the server.  For 
> example,
> postgres_fdw and WAL receiver in streaming replication.
I didn't take that possibility into consideration.
Certainly, a) is bad.

> I'm OK with either the current naming or b).  Frankly, I felt a bit strange 
> when I
> first saw the keepalive parameters, wondering why the same names were not
> chosen.
I will take that into account and wait for the reviewers' opinions.

[1] http://man7.org/linux/man-pages/man2/poll.2.html
Best regards,
-
Ryohei Nagaura




Re: Cache lookup errors with functions manipulation object addresses

2019-02-20 Thread Michael Paquier
On Thu, Jan 31, 2019 at 04:08:18PM +0900, Michael Paquier wrote:
> End-of-commit-fest rebase and moved to next CF.

Rebased version fixing some conflicts with HEAD.
--
Michael
From 5844b0ba07fdd8373a7ef15a1676178cae53c40d Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Tue, 25 Dec 2018 12:21:24 +0900
Subject: [PATCH 1/3] Add flag to format_type_extended to enforce NULL-ness

If a scanned type is undefined, the type format routines have two behaviors
depending on whether FORMAT_TYPE_ALLOW_INVALID is passed by the caller:
- Generate an error
- Return the undefined type name "???" or "-".

The current interface is unhelpful for callers that want to format a type
name properly but still make sure that the type is defined, as there could be
actual types whose names match the undefined strings.  In order to counter
that, add a new flag called FORMAT_TYPE_FORCE_NULL which returns a NULL
result instead of "???" or "-", without generating an error.  This will be
used by future patches to improve the SQL interface for object addresses.
---
 src/backend/utils/adt/format_type.c | 20 
 src/include/utils/builtins.h|  1 +
 2 files changed, 17 insertions(+), 4 deletions(-)

diff --git a/src/backend/utils/adt/format_type.c b/src/backend/utils/adt/format_type.c
index 6ae2a31345..8ee0e10c29 100644
--- a/src/backend/utils/adt/format_type.c
+++ b/src/backend/utils/adt/format_type.c
@@ -96,6 +96,9 @@ format_type(PG_FUNCTION_ARGS)
  * - FORMAT_TYPE_ALLOW_INVALID
  *			if the type OID is invalid or unknown, return ??? or such instead
  *			of failing
+ * - FORMAT_TYPE_FORCE_NULL
+ *			if the type OID is invalid or unknown, return NULL instead of ???
+ *			or such
  * - FORMAT_TYPE_FORCE_QUALIFY
  *			always schema-qualify type names, regardless of search_path
  *
@@ -114,13 +117,20 @@ format_type_extended(Oid type_oid, int32 typemod, bits16 flags)
 	char	   *buf;
 	bool		with_typemod;
 
-	if (type_oid == InvalidOid && (flags & FORMAT_TYPE_ALLOW_INVALID) != 0)
-		return pstrdup("-");
+	if (type_oid == InvalidOid)
+	{
+		if ((flags & FORMAT_TYPE_FORCE_NULL) != 0)
+			return NULL;
+		else if ((flags & FORMAT_TYPE_ALLOW_INVALID) != 0)
+			return pstrdup("-");
+	}
 
 	tuple = SearchSysCache1(TYPEOID, ObjectIdGetDatum(type_oid));
 	if (!HeapTupleIsValid(tuple))
 	{
-		if ((flags & FORMAT_TYPE_ALLOW_INVALID) != 0)
+		if ((flags & FORMAT_TYPE_FORCE_NULL) != 0)
+			return NULL;
+		else if ((flags & FORMAT_TYPE_ALLOW_INVALID) != 0)
 			return pstrdup("???");
 		else
 			elog(ERROR, "cache lookup failed for type %u", type_oid);
@@ -143,7 +153,9 @@ format_type_extended(Oid type_oid, int32 typemod, bits16 flags)
 		tuple = SearchSysCache1(TYPEOID, ObjectIdGetDatum(array_base_type));
 		if (!HeapTupleIsValid(tuple))
 		{
-			if ((flags & FORMAT_TYPE_ALLOW_INVALID) != 0)
+			if ((flags & FORMAT_TYPE_FORCE_NULL) != 0)
+return NULL;
+			else if ((flags & FORMAT_TYPE_ALLOW_INVALID) != 0)
 return pstrdup("???[]");
 			else
 elog(ERROR, "cache lookup failed for type %u", type_oid);
diff --git a/src/include/utils/builtins.h b/src/include/utils/builtins.h
index 03d0ee2766..43e7ef471c 100644
--- a/src/include/utils/builtins.h
+++ b/src/include/utils/builtins.h
@@ -109,6 +109,7 @@ extern Datum numeric_float8_no_overflow(PG_FUNCTION_ARGS);
 #define FORMAT_TYPE_TYPEMOD_GIVEN	0x01	/* typemod defined by caller */
 #define FORMAT_TYPE_ALLOW_INVALID	0x02	/* allow invalid types */
 #define FORMAT_TYPE_FORCE_QUALIFY	0x04	/* force qualification of type */
+#define FORMAT_TYPE_FORCE_NULL		0x08	/* force NULL if undefined */
 extern char *format_type_extended(Oid type_oid, int32 typemod, bits16 flags);
 
 extern char *format_type_be(Oid type_oid);
-- 
2.20.1
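
As a usage sketch (not taken from the follow-up patches), a SQL-callable
function built on top of this could do something like the following inside
its fmgr V1 body, where typeOid is whatever OID the caller passed in:

    char       *typname;

    /* -1 means "no typemod"; FORMAT_TYPE_FORCE_NULL asks for NULL on failure */
    typname = format_type_extended(typeOid, -1, FORMAT_TYPE_FORCE_NULL);

    if (typname == NULL)
        PG_RETURN_NULL();       /* OID does not refer to a defined type */

    PG_RETURN_TEXT_P(cstring_to_text(typname));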

From 2590b4ba96851f5ffea5f438e2e82383fdb74dae Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Tue, 25 Dec 2018 13:13:28 +0900
Subject: [PATCH 2/3] Refactor format procedure and operator APIs to be more
 modular

This introduces a new set of extended routines for procedure and operator
lookups, with a flags bitmask argument that can modify the default
behavior:
- Force schema qualification
- Force NULL as result instead of a numeric OID for an undefined
object.
---
 src/backend/utils/adt/regproc.c | 61 +++--
 src/include/utils/regproc.h | 10 ++
 2 files changed, 52 insertions(+), 19 deletions(-)

diff --git a/src/backend/utils/adt/regproc.c b/src/backend/utils/adt/regproc.c
index 09cf0e1abc..f85b47821a 100644
--- a/src/backend/utils/adt/regproc.c
+++ b/src/backend/utils/adt/regproc.c
@@ -40,8 +40,6 @@
 #include "utils/regproc.h"
 #include "utils/varlena.h"
 
-static char *format_operator_internal(Oid operator_oid, bool force_qualify);
-static char *format_procedure_internal(Oid procedure_oid, bool force_qualify);
 static void parseNameAndArgTypes(const char *string, bool allowNone,
 	 List **names, int *nargs, Oid *argtypes);
 
@@ -322,24 +320,27 @@ to_regprocedure(PG_FUNCTION_ARGS)
 char *
 format_procedure(Oid procedure_oid)
 {
-	return 

Re: WIP: Avoid creation of the free space map for small tables

2019-02-20 Thread John Naylor
On Thu, Feb 21, 2019 at 7:58 AM Amit Kapila  wrote:
> So here you are inserting a 4-byte integer and a 1024-byte variable-length
> record.  So the tuple length will be tuple_header (24 bytes) + 4 bytes for
> the integer + a 4-byte header for the variable-length data + 1024 bytes of
> actual data.  So the length will be 1056, which is already MAXALIGNed.  I
> took the new comments added in your latest version of the patch and added
> them to the previous version of the patch.  Kindly see if I have not missed
> anything while merging the patch versions?

OK, that makes sense. Looks fine to me.

-- 
John Naylor   https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



RE: Timeout parameters

2019-02-20 Thread Tsunakawa, Takayuki
From: Nagaura, Ryohei [mailto:nagaura.ryo...@jp.fujitsu.com]
> BTW, the tcp_user_timeout parameters of the server and the client have the
> same name in my current implementation.
> I think different names would be better than the same name.
> I'll name them as in a) or b) below:
>   a) server_tcp_user_timeout and client_tcp_user_timeout
>   b) tcp_user_timeout and user_timeout
> b) follows the naming convention of the keepalive parameters, but it is not
> user-friendly.
> Can you come up with a better name, or do you have another opinion?

a) is not always accurate, because libpq is also used in the server.  For 
example, postgres_fdw and WAL receiver in streaming replication.

I'm OK with either the current naming or b).  Frankly, I felt a bit strange 
when I first saw the keepalive parameters, wondering why the same names were 
not chosen.


Regards
Takayuki Tsunakawa







RE: Timeout parameters

2019-02-20 Thread Nagaura, Ryohei
Hi, Kirk-san.

Thank you for summarizing this thread.

On Tue, Feb 19, 2019 at 6:05 PM, Jamison, Kirk wrote:
> 1) tcp_user_timeout parameter
> As for user_timeout param, there seems to be a common agreement with regards
> to its need.
> I think this can be "committed" separately when it's finalized.
I think so too.
BTW, the tcp_user_timeout parameters of the server and the client have the
same name in my current implementation.
I think different names would be better than the same name.
I'll name them as in a) or b) below:
a) server_tcp_user_timeout and client_tcp_user_timeout
b) tcp_user_timeout and user_timeout
b) follows the naming convention of the keepalive parameters, but it is not
user-friendly.
Can you come up with a better name, or do you have another opinion?

> (2) tcp_socket_timeout (or suggested client_statement_timeout,
> send_timeout/recv_timeout)
> Perhaps you could also clarify a bit more through documentation on how
> socket_timeout handles the timeout differently from statement_timeout and
> tcp_user_timeout.
> Then we can decide on the which parameter name is better once the
> implementation becomes clearer.
* The following use case is somewhat different from those listed first. Sorry.
Use case:
1) A client sends a query to the server and the statement is delivered
correctly.
2) The server becomes saturated.
3) But the network layer is alive.
Because of 1), tcp_user_timeout doesn't work.
Because of 2), statement_timeout doesn't work.
Because of 3), keepalives don't work.
As a result, clients can't release their resources even though they want to.

My suggestion is a solution to this.

To limit the user's waiting time in pqWait(), I implemented logic similar to
that of pqWaitTimed().

Best regards,
-
Ryohei Nagaura




RE: Protect syscache from bloating with negative cache entries

2019-02-20 Thread Tsunakawa, Takayuki
From: Robert Haas [mailto:robertmh...@gmail.com]
> That might be enough to justify having the parameter.  But I'm not 
> quite sure how high the value would need to be set to actually get the 
> benefit in a case like that, or what happens if you set it to a value 
> that's not quite high enough.  I think it might be good to play around 
> some more with cases like this, just to get a feeling for how much 
> time you can save in exchange for how much memory.

Why don't we consider this just like the database cache and other DBMSs'
dictionary caches?  That is,

* If you want to avoid infinite memory bloat, set an upper limit on the size.

* To find a better limit, check the hit ratio with the statistics view (based
on Horiguchi-san's original 0004 patch, although that seems to need some
modification anyway).


Why do people try to get away from a familiar idea...  Am I missing something?

Ideriha-san,
Could you try simplifying the v15 patch set to see how simple the code would 
look or not?  That is:

* 0001: add dlist_push_tail() ... as is
* 0002: memory accounting, with correction based on feedback
* 0003: merge the original 0003 and 0005, with correction based on feedback


Regards
Takayuki Tsunakawa



Re: Vectors instead of lists, specialised qsort(), binary_search(), unique()

2019-02-20 Thread David Rowley
On Thu, 21 Feb 2019 at 18:02, Thomas Munro  wrote:
> David Rowley posted some interesting stuff about an ArrayList[1] that
> would replace List in many places.  (I probably would have replied to
> that thread if I hadn't changed email account, sorry.)  I had similar
> thoughts when I needed to maintain a sorted vector for my refactoring
> proposal for the fsync queue, as part of my undo log work, except I
> came at the problem with some ideas from C++ in mind.  However, now
> this stuff seems to be redundant for now, given some decisions we've
> arrived at in the thread about how to track the fsync queue, and I'm
> more focused on getting the undo work done.

Hi Thomas,

Glad to see further work being done in this area. I do agree that our
linked list implementation is very limited and of course, does perform
very poorly for Nth lookups and checking if the list contains
something. Seems what you have here solves (at least) these two
issues.  I also think it's tragic that in many places we take the
List and turn it into an array because we know list_nth() performs
terribly (e.g. ExecInitRangeTable(), setup_simple_rel_arrays() and
setup_append_rel_array(); it would be great to one day see those
fixed).

As for the ArrayList patch that I'd been working on, I was a bit
blocked on it due to Tom's comment in [1], after having quickly looked
over your patch I see there's no solution for that complaint.

(I'd been trying to think of a solution to this as a sort of idle
background task and so far I'd only come up with a sort of List
indexing system, where we could build additional data structures atop
of the list and have functions like list_nth() check for such an index
and use it if available.   I don't particularly like the idea, but it
was the only way I could think of to work around Tom's complaint.  I
felt there was just too much list API code that relies on the list
being a linked list.  e.g checking for empty lists with list == NIL.
lnext() etc.)

So in short, I'm in favour of not just having braindead linked lists
all over the place to store things. I will rejoice the day we move
away from that, but also Tom's comment kinda stuck with me.  He's
often right too.   Likely backpatching pain that Tom talks about would
be much less if we used C++, but... we don't.

I'm happy to support the cause here, just not quite sure yet how I can
best do that.

[1] https://www.postgresql.org/message-id/21592.1509632225%40sss.pgh.pa.us

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



RE: Protect syscache from bloating with negative cache entries

2019-02-20 Thread Tsunakawa, Takayuki
From: Ideriha, Takeshi/出利葉 健
> I checked it with perf record -avg and perf report.
> The following shows the top 20 symbols during the benchmark, including
> kernel space.
> The main difference between master (unpatched) and the patched one seems to
> be that the patched one spends CPU time in catcache-evict-and-refill
> functions, including SearchCatCacheMiss(), CatalogCacheCreateEntry(), and
> CatCacheCleanupOldEntries().
> So it seems to me that these functions need further inspection to suppress
> the performance decline as much as possible.

Thank you.  It's good to see the expected functions there, rather than strange
behavior.  The performance drop is natural, just as it is when the database
cache's hit ratio is low.  The user's remedy for performance is also the same
as for the database cache -- increase the catalog cache size.


Regards
Takayuki Tsunakawa






Re: [HACKERS] COPY FREEZE and PD_ALL_VISIBLE

2019-02-20 Thread Pavan Deolasee
On Tue, Jan 15, 2019 at 8:48 PM Darafei "Komяpa" Praliaskouski <
m...@komzpa.net> wrote:

> Hello,
>
> Today I bumped into need to limit first VACUUM time on data import.
> I'm using utility called osmium together with COPY FREEZE to import
> openstreetmap data into database.
>
> osmium export -c osmium.config -f pg belarus-latest.osm.pbf  -v --progress
> | psql -1 -c 'create table byosm(geom geometry, osm_type text, osm_id
> bigint, tags jsonb);copy byosm from stdin freeze;'
>
> However, the first pass of VACUUM rewrites the whole table. Here are two
> logs of VACUUM VERBOSE in a row:
>
> https://gist.github.com/Komzpa/e765c1c5e04623d83a6263d4833cf3a5
>
> In Russian Postgres Telegram group I've been recommended this thread.
> Can the patch be revived? What is needed to get it up for 12?
>

I posted a new patch [1] for consideration to include in PG12. I started a
new thread because the patch is completely new and this thread was a bit
too old.

Thanks,
Pavan

[1]
https://www.postgresql.org/message-id/CABOikdN-ptGv0mZntrK2Q8OtfUuAjqaYMGmkdU1dCKFtUxVLrg%40mail.gmail.com

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


COPY FREEZE and setting PD_ALL_VISIBLE/visibility map bits

2019-02-20 Thread Pavan Deolasee
Hi,

Jeff Janes raised an issue [1] about PD_ALL_VISIBLE not being set correctly
while loading data via COPY FREEZE and had also posted a draft patch.

I now have what I think is a more complete patch. I took a slightly
different approach: instead of setting the PD_ALL_VISIBLE bit initially and
then not clearing it during insertion, we now recheck the page for
all-frozen, all-visible tuples just before switching to a new page. This
allows us to also set the visibility map bit, like we do in vacuumlazy.c.

Some special treatment is required to handle the last page before the bulk
insert is shut down. We could have chosen not to do anything special for the
last page and let it remain unfrozen, but I thought it makes sense to take
that extra effort so that we can completely freeze the table and set all VM
bits at the end of COPY FREEZE.

Let me know what you think.

Thanks,
Pavan

[1]
https://www.postgresql.org/message-id/CAMkU%3D1w3osJJ2FneELhhNRLxfZitDgp9FPHee08NT2FQFmz_pQ%40mail.gmail.com

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


copy_freeze_v3.patch
Description: Binary data


RE: Timeout parameters

2019-02-20 Thread Tsunakawa, Takayuki
From: Jamison, Kirk [mailto:k.jami...@jp.fujitsu.com]
> 1) tcp_user_timeout parameter
> I think this can be "committed" separately when it's finalized.

Do you mean you've reviewed and tested the patch by simulating a communication 
failure in the way Nagaura-san suggested?


> 2) tcp_socket_timeout parameter
> - whether (a) it should abort the connection from pqWait() or
>   other means, or 
> (b) cancel the statement similar to how psql
>   does it as suggested by Fabien

We have no choice but to terminate the connection, because we can't tell 
whether we can recover from the problem and continue to use the connection 
(e.g. long-running query) or not (permanent server or network failure).

Regarding the place, pqWait() is the best (and possibly only) place.  The 
purpose of this feature is to avoid waiting for response from the server 
forever (or too long) in any case, as a last resort.

Oracle has similar parameters called SQLNET.RECV_TIMEOUT and
SQLNET.SEND_TIMEOUT.  From those names, I guess they use the SO_RCVTIMEO and
SO_SNDTIMEO socket options.  However, we can't use those, because libpq uses
non-blocking sockets and poll(), while SO_RCVTIMEO/SO_SNDTIMEO have no effect
on poll():

[excerpt from "man 7 socket"]
--
   SO_RCVTIMEO and SO_SNDTIMEO
  Specify the receiving or sending  timeouts  until  reporting  an
  error.  The argument is a struct timeval.  If an input or output
  function blocks for this period of time, and data has been  sent
  or  received,  the  return  value  of  that function will be the
  amount of data transferred; if no data has been transferred  and
  the  timeout has been reached then -1 is returned with errno set
  to EAGAIN or EWOULDBLOCK just as if the socket was specified  to
  be  non-blocking.   If  the timeout is set to zero (the default)
  then the operation  will  never  timeout.   Timeouts  only  have
  effect  for system calls that perform socket I/O (e.g., read(2),
  recvmsg(2), send(2), sendmsg(2)); timeouts have  no  effect  for
  select(2), poll(2), epoll_wait(2), etc.
--
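
For illustration only (not part of any patch here), this is the kind of call
those Oracle parameters presumably map to; as the excerpt above notes, it
bounds blocking reads but has no effect on poll(), which is why libpq cannot
take this route:

    #include <sys/socket.h>
    #include <sys/time.h>

    static int
    set_recv_timeout(int sock, int seconds)
    {
        struct timeval tv;

        tv.tv_sec = seconds;
        tv.tv_usec = 0;

        /* Affects blocking recv()/read(), but not poll()/select(). */
        return setsockopt(sock, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));
    }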



> - proper parameter name
> 
> Based on your description below, I agree with Fabien that it's somehow
> the application/client-side query timeout

I think the name is good because it indicates the socket-level timeout.  That's 
just like PgJDBC and Oracle, and I didn't feel strange when I read their 
manuals.


> Perhaps you could also clarify a bit more through documentation on how
> socket_timeout handles the timeout differently from statement_timeout
> and tcp_user_timeout.

Maybe.  Could you suggest a good description?


Regards
Takayuki Tsunakawa




Re: Prepared transaction releasing locks before deregistering its GID

2019-02-20 Thread Michael Paquier
On Wed, Feb 20, 2019 at 11:50:42AM +0100, Oleksii Kliukin wrote:
> RecoverPreparedTransactions calls ProcessRecords with
> twophase_recover_callbacks right after releasing the TwoPhaseStateLock, so I
> think lock_held should be false here (matching the similar call of
> TwoPhaseGetDummyProc at lock_twophase_recover).

Indeed.  That would be a bad idea.  We don't actually stress this
routine in 009_twophase.pl or the assertion I added would have blown
up immediately.  So I propose to close the gap, and the updated patch
attached adds dedicated tests causing the problem you spotted to be
triggered (soft and hard restarts with 2PC transactions including
DDLs).  The previous version of the patch and those tests cause the
assertion to blow up, failing the tests.

> Since the patch touches TwoPhaseGetDummyBackendId, let’s fix the comment
> header to mention it instead of TwoPhaseGetDummyProc

Not sure I understand the issue you are pointing out here.  The patch
adds an extra argument to both TwoPhaseGetDummyProc() and
TwoPhaseGetDummyBackendId(), and the headers of both functions
document the new argument.

One extra trick I have been using for testing this patch is the
following:
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -816,13 +816,6 @@ TwoPhaseGetGXact(TransactionId xid, bool lock_held)

Assert(!lock_held || LWLockHeldByMe(TwoPhaseStateLock));

-   /*
-* During a recovery, COMMIT PREPARED, or ABORT PREPARED, we'll be called
-* repeatedly for the same XID.  We can save work with a simple cache.
-*/
-   if (xid == cached_xid)
-   return cached_gxact;

This has the advantage to check more aggressively for lock conflicts,
causing the tests to deadlock if the flag is not correctly set in the
worst case scenario.
--
Michael
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 07ae7a31db..c399871940 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1713,7 +1713,7 @@ PostPrepare_MultiXact(TransactionId xid)
 	myOldestMember = OldestMemberMXactId[MyBackendId];
 	if (MultiXactIdIsValid(myOldestMember))
 	{
-		BackendId	dummyBackendId = TwoPhaseGetDummyBackendId(xid);
+		BackendId	dummyBackendId = TwoPhaseGetDummyBackendId(xid, false);
 
 		/*
 		 * Even though storing MultiXactId is atomic, acquire lock to make
@@ -1755,7 +1755,7 @@ void
 multixact_twophase_recover(TransactionId xid, uint16 info,
 		   void *recdata, uint32 len)
 {
-	BackendId	dummyBackendId = TwoPhaseGetDummyBackendId(xid);
+	BackendId	dummyBackendId = TwoPhaseGetDummyBackendId(xid, false);
 	MultiXactId oldestMember;
 
 	/*
@@ -1776,7 +1776,7 @@ void
 multixact_twophase_postcommit(TransactionId xid, uint16 info,
 			  void *recdata, uint32 len)
 {
-	BackendId	dummyBackendId = TwoPhaseGetDummyBackendId(xid);
+	BackendId	dummyBackendId = TwoPhaseGetDummyBackendId(xid, true);
 
 	Assert(len == sizeof(MultiXactId));
 
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 9a8a6bb119..ace71230e4 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -801,9 +801,12 @@ pg_prepared_xact(PG_FUNCTION_ARGS)
  * TwoPhaseGetGXact
  *		Get the GlobalTransaction struct for a prepared transaction
  *		specified by XID
+ *
+ * If lock_held is set to true, TwoPhaseStateLock will not be taken, so the
+ * caller had better hold it.
  */
 static GlobalTransaction
-TwoPhaseGetGXact(TransactionId xid)
+TwoPhaseGetGXact(TransactionId xid, bool lock_held)
 {
 	GlobalTransaction result = NULL;
 	int			i;
@@ -811,6 +814,8 @@ TwoPhaseGetGXact(TransactionId xid)
 	static TransactionId cached_xid = InvalidTransactionId;
 	static GlobalTransaction cached_gxact = NULL;
 
+	Assert(!lock_held || LWLockHeldByMe(TwoPhaseStateLock));
+
 	/*
 	 * During a recovery, COMMIT PREPARED, or ABORT PREPARED, we'll be called
 	 * repeatedly for the same XID.  We can save work with a simple cache.
@@ -818,7 +823,8 @@ TwoPhaseGetGXact(TransactionId xid)
 	if (xid == cached_xid)
 		return cached_gxact;
 
-	LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+	if (!lock_held)
+		LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
 
 	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
 	{
@@ -832,7 +838,8 @@ TwoPhaseGetGXact(TransactionId xid)
 		}
 	}
 
-	LWLockRelease(TwoPhaseStateLock);
+	if (!lock_held)
+		LWLockRelease(TwoPhaseStateLock);
 
 	if (result == NULL)			/* should not happen */
 		elog(ERROR, "failed to find GlobalTransaction for xid %u", xid);
@@ -849,12 +856,13 @@ TwoPhaseGetGXact(TransactionId xid)
  *
  * Dummy backend IDs are similar to real backend IDs of real backends.
  * They start at MaxBackends + 1, and are unique across all currently active
- * real backends and prepared transactions.
+ * real backends and prepared transactions.  If lock_held is set to true,
+ * TwoPhaseStateLock will not be taken, so the caller had better hold it.
  

RE: SQL statement PREPARE does not work in ECPG

2019-02-20 Thread Takahashi, Ryohei
Hi Meskes-san,


> Ah right, my bad. The workaround should have been:

Thank you. It works.


> As for the PREPARE statement itself, could you try the attached small
> patch please.

It works well for my statement
"EXEC SQL PREPARE test_prep (int) AS SELECT id from test_table where id = $1;".

However, since the data type information is not used, it does not work well
for PREPARE statements that need data type information, such as
"EXEC SQL PREPARE test_prep (int, int) AS SELECT $1 + $2;".

It fails with "PostgreSQL error : -400[operator is not unique: unknown + 
unknown on line 20]".

(Of course, "EXEC SQL PREPARE test_prep AS SELECT $1::int + $2::int;" works 
well.)
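
For what it's worth, at the libpq level passing the declared parameter types
down to the server looks like this (a sketch of the underlying mechanism, not
of what the ECPG preprocessor actually emits; 23 is the OID of int4):

    const Oid   paramTypes[2] = {23, 23};   /* int4, int4 */
    PGresult   *res;

    res = PQprepare(conn, "test_prep",
                    "SELECT $1 + $2",
                    2, paramTypes);

With the parameter types supplied, the server can resolve "$1 + $2" as
integer addition without explicit casts.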


> Could you please also verify for me if this works correctly if you use
> a variable instead of the const? As in:

> EXEC SQL BEGIN DECLARE SECTION;
> int i=2;
> EXEC SQL END DECLARE SECTION;
> ...
> EXEC SQL EXECUTE test_prep (:i);

It also works.
(Actually, I wrote "EXEC SQL EXECUTE test_prep (:i) INTO :ID;".)

ECPG produced the following code:


ECPGdo(__LINE__, 0, 1, NULL, 0, ECPGst_execute, "test_prep",
ECPGt_int,&(i),(long)1,(long)1,sizeof(int),
ECPGt_NO_INDICATOR, NULL , 0L, 0L, 0L, ECPGt_EOIT,
ECPGt_int,&(ID),(long)1,(long)1,sizeof(int),
ECPGt_NO_INDICATOR, NULL , 0L, 0L, 0L, ECPGt_EORT);



Regards,
Ryohei Takahashi



RE: Protect syscache from bloating with negative cache entries

2019-02-20 Thread Ideriha, Takeshi
>From: Tsunakawa, Takayuki
>>From: Ideriha, Takeshi [mailto:ideriha.take...@jp.fujitsu.com]
>> number of tables | 100         | 1000        | 10,000
>> -----------------+-------------+-------------+-------------
>> TPS (master)     | 10966       | 10654       | 9099
>> TPS (patch)      | 11137 (+1%) | 10710 (+0%) | 772 (-91%)
>>
>> It seems that before the cache exceeds the limit (no pruning at 100 and
>> 1000 tables), the results are almost the same as master, but after
>> exceeding the limit (at 10,000) the decline happens.
>
>How many concurrent clients?
One client (default setting). 

>Can you show the perf's call graph sampling profiles of both the unpatched and
>patched version, to confirm that the bottleneck is around catcache eviction 
>and refill?

I checked it with perf record -avg and perf report.
The following shows the top 20 symbols during the benchmark, including kernel
space.
The main difference between master (unpatched) and the patched one seems to be
that the patched one spends CPU time in catcache-evict-and-refill functions,
including SearchCatCacheMiss(), CatalogCacheCreateEntry(), and
CatCacheCleanupOldEntries().
So it seems to me that these functions need further inspection to suppress the
performance decline as much as possible.

Master(%) master|patch (%)  patch
51.25%  cpu_startup_entry   |   51.45%  cpu_startup_entry
51.13%  arch_cpu_idle   |   51.19%  arch_cpu_idle
51.13%  default_idle|   51.19%  default_idle
51.13%  native_safe_halt|   50.95%  native_safe_halt
36.27%  PostmasterMain  |   46.98%  PostmasterMain
36.27%  main|   46.98%  main
36.27%  __libc_start_main   |   46.98%  __libc_start_main
36.07%  ServerLoop  |   46.93%  ServerLoop
35.75%  PostgresMain|   46.89%  PostgresMain
26.03%  exec_simple_query   |   45.99%  exec_simple_query
26.00%  rest_init   |   43.40%  SearchCatCacheMiss
26.00%  start_kernel|   42.80%  CatalogCacheCreateEntry
26.00%  x86_64_start_reservations   |   42.75%  CatCacheCleanupOldEntries
26.00%  x86_64_start_kernel |   27.04%  rest_init
25.26%  start_secondary |   27.04%  start_kernel
10.25%  pg_plan_queries |   27.04%  x86_64_start_reservations
10.17%  pg_plan_query   |   27.04%  x86_64_start_kernel
10.16%  main|   24.42%  start_secondary
10.16%  __libc_start_main   |   22.35%  pg_analyze_and_rewrite
10.03%  standard_planner|   22.35%  parse_analyze

Regards,
Takeshi Ideriha



Vectors instead of lists, specialised qsort(), binary_search(), unique()

2019-02-20 Thread Thomas Munro
Hello hackers,

David Rowley posted some interesting stuff about an ArrayList[1] that
would replace List in many places.  (I probably would have replied to
that thread if I hadn't changed email account, sorry.)  I had similar
thoughts when I needed to maintain a sorted vector for my refactoring
proposal for the fsync queue, as part of my undo log work, except I
came at the problem with some ideas from C++ in mind.  However, now
this stuff seems to be redundant for now, given some decisions we've
arrived at in the thread about how to track the fsync queue, and I'm
more focused on getting the undo work done.

I wanted to share what I had anyway, since it seems to be related to
David's proposal but make some different trade-offs.  Perhaps it will
be useful for ideas, code, or examples of what not to do :-)  Or
perhaps I'll come back to it later.

Specialising the comparator and element size of functions like
qsort(), binary_search(), unique() is a well known technique for going
faster, and the resulting programming environment has better type
safety.  It's easy to measure a speedup for specialised qsort of small
objects (compared to pg_qsort() with function pointer and object size
as arguments).  Doing that in a way that works well with automatically
managed vectors (and also any array, including in shmem etc) seemed
like a good plan to me, hence this prototyping work.

The basic idea is to use simplehash-style generic programming, like a
kind of poor man's C++ standard library.  The vector type is
instantiated for a given type like Oid etc, and then you can
instantiate specialised qsort() etc.  The vector has a 'small vector
optimisation' where you can hold very small lists without allocating
any extra memory until you get past (say) 3 members.  I was planning
to extend the qsort implementation a bit further so that it could
replace both pg_qsort() and the Perl-based tuple qsort generator, and
then we'd have only one copy of the algorithm in the tree (the dream
of generic programmers).  I wanted to do the same with unique(), since
I'd already noticed that we have many open coded examples of that
algorithm scattered throughout the tree.
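
To make the idea concrete, here is a toy sketch of the macro-instantiation
style (the names are made up and do not reflect the simplevector.h API, and a
simple insertion sort stands in for the real qsort algorithm); the point is
that the element type and comparator are compiled in, so there is no
per-comparison function-pointer call as with pg_qsort():

    #include <stddef.h>

    #define SPECIALIZED_SORT_DECLARE(name, type, cmp)          \
        static void name(type *a, size_t n)                    \
        {                                                      \
            for (size_t i = 1; i < n; i++)                     \
            {                                                  \
                type    key = a[i];                            \
                size_t  j = i;                                 \
                                                               \
                while (j > 0 && cmp(a[j - 1], key) > 0)        \
                {                                              \
                    a[j] = a[j - 1];                           \
                    j--;                                       \
                }                                              \
                a[j] = key;                                    \
            }                                                  \
        }

    #define oid_cmp(x, y) (((x) > (y)) - ((x) < (y)))

    /* Instantiate a sort specialised for Oid-sized unsigned ints. */
    SPECIALIZED_SORT_DECLARE(oid_sort, unsigned int, oid_cmp)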

Some of the gratuitous C++isms should be removed including some
nonsense about const qualifiers, use of begin and end rather than data
and size (more common in C code), and some other details, and I was
planning to fix some of that before I reposted the patch set as part
of the larger undo patch set, but now that I'm not going to do that...
here is a snapshot of the patch set as-is, with toy examples showing
several examples of List-of-Oid relationOids being replaced with a
specialised vector (with a missed opportunity for it to be sorted and
use binary_search() instead of find()), and the List-of-Node
p_joinexprs being replaced with a specialised vector.  I think those
last things are the sort of thing that David was targeting.

[1] 
https://www.postgresql.org/message-id/flat/CAKJS1f_2SnXhPVa6eWjzy2O9A%3Docwgd0Cj-LQeWpGtrWqbUSDA%40mail.gmail.com

-- 
Thomas Munro
https://enterprisedb.com


0001-Add-parameterized-vectors-and-sorting-searching-supp.patch
Description: Binary data


0002-A-few-extra-tricks-for-simplevector.h.patch
Description: Binary data


0003-Convert-type-of-various-relationOids-members-List-oi.patch
Description: Binary data


0004-More-simplevector-improvements.patch
Description: Binary data


0005-Convert-p_joinexprs-from-a-List-to-a-nodep_vector.patch
Description: Binary data


Re: WIP: Avoid creation of the free space map for small tables

2019-02-20 Thread Amit Kapila
On Wed, Feb 20, 2019 at 8:08 PM Alvaro Herrera  wrote:
>
> Please remember to keep serial_schedule in sync.
>

I don't understand what you mean by this?  It is already present in
serial_schedule.  In parallel_schedule, we are just moving this test
to one of the parallel groups.  Do we need to take care of something
else?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Another way to fix inherited UPDATE/DELETE

2019-02-20 Thread Etsuro Fujita

(2019/02/21 0:14), Tom Lane wrote:
> Etsuro Fujita  writes:
>> (2019/02/20 6:48), Tom Lane wrote:
>>> In the case of a standard inheritance or partition tree, this seems to
>>> go through really easily, since all the children could share the same
>>> returned CTID column (I guess you'd also need a TABLEOID column so you
>>> could figure out which table to direct the update back into).  It gets
>>> a bit harder if the tree contains some foreign tables, because they might
>>> have different concepts of row identity, but I'd think in most cases you
>>> could still combine those into a small number of output columns.
>>
>> If this is the direction we go in, we should work on the row ID issue
>> [1] before this?
>
> Certainly, the more foreign tables can use a standardized concept of row
> identity, the better this would work.  What I'm loosely envisioning is
> that we have one junk row-identity column for each distinct row-identity
> datatype needed --- so, for instance, all ordinary tables could share
> a TID column.  Different FDWs might need different things, though one
> would hope for no more than one datatype per FDW-type involved in the
> inheritance tree.  Where things could break down is if we have a lot
> of tables that need a whole-row-variable for row identity, and they're
> all different rowtypes --- eventually we'd run out of available columns.
> So we'd definitely wish to encourage FDWs to have some more efficient
> row-identity scheme than that one.

Seems reasonable.  I guess that that can address not only the issue [1] but
also our restriction on FDW row locking, whose APIs currently only allow TID
for row identity, in a uniform way.

> I don't see that as being something that constrains those two projects
> to be done in a particular order; it'd just be a nice-to-have improvement.

OK, thanks for the explanation!

Best regards,
Etsuro Fujita




Re: Refactoring the checkpointer's fsync request queue

2019-02-20 Thread Thomas Munro
On Thu, Feb 21, 2019 at 12:50 PM Andres Freund  wrote:
> Thanks for the update!

+1, thanks Shawn.

It's interesting that you measure performance improvements that IIUC
can come only from dropping the bitmapset stuff (or I guess bugs).  I
don't understand the mechanism for that improvement yet.

The idea of just including the segment number (in some representation,
possibly opaque to this code) in the hash table key instead of
carrying a segment list seems pretty good from here (and I withdraw
the sorted vector machinery I mentioned earlier as it's redundant for
this project)... except for one detail.  In your patch, you still have
FORGET_RELATION_FSYNC and FORGET_DATABASE_FSYNC.  That relies on this
sync manager code being able to understand which stuff to drop from
the hash table, which means that is still has knowledge of the key
hierarchy, so that it can match all entries for the relation or
database.  That's one of the things that Andres is arguing against.

I can see how to fix that for the relation case: md.c could simply
enqueue a FORGET_REQUEST message for every segment and fork in the
relation, so they can be removed by O(1) hash table lookup.  I can't
immediately see how to deal with the database case though.  Perhaps in
a tag scheme, you'd have to make the tag mostly opaque, except for a
DB OID field, so you can handle this case?  (Or some kind of predicate
callback, so you can say "does this tag cancel this other tag?"; seems
over the top).

> On 2019-02-20 15:27:40 -0800, Shawn Debnath wrote:
> > As promised, here's a patch that addresses the points discussed by
> > Andres and Thomas at FOSDEM. As a result of how we want checkpointer to
> > track what files to fsync, the pending ops table now integrates the
> > forknum and segno as part of the hash key eliminating the need for the
> > bitmapsets, or vectors from the previous iterations. We re-construct the
> > pathnames from the RelFileNode, ForkNumber and SegmentNumber and use
> > PathNameOpenFile to get the file descriptor to use for fsync.
>
> I still object to exposing segment numbers to smgr and above. I think
> that's an implementation detail of the various storage managers, and we
> shouldn't expose smgr.c further than it already is.

Ok by me.  The solution to this is probably either raw paths (as
Andres has suggested, but which seem problematic to me -- see below),
or some kind of small fixed size tag type that is morally equivalent
to a path and can be converted to a path but is more convenient for
shoving through pipes and into hash tables.

> > Apart from that, this patch moves the system for requesting and
> > processing fsyncs out of md.c into smgr.c, allowing us to call on smgr
> > component specific callbacks to retrieve metadata like relation and
> > segment paths. This allows smgr components to maintain how relfilenodes,
> > forks and segments map to specific files without exposing this knowledge
> > to smgr.  It redefines smgrsync() behavior to be closer to that of
> > smgrimmedsysnc(), i.e., if a regular sync is required for a particular
> > file, enqueue it in locally or forward it to checkpointer.
> > smgrimmedsync() retains the existing behavior and fsyncs the file right
> > away. The processing of fsync requests has been moved from mdsync() to a
> > new ProcessFsyncRequests() function.
>
> I think that's also wrong, imo fsyncs are storage detail, and should be
> relegated to *below* md.c.  That's not to say the code should live in
> md.c, but the issuing of such calls shouldn't be part of smgr.c -
> consider e.g. developing a storage engine living in non volatile ram.

How about we take all that sync-related stuff, that Shawn has moved
from md.c into smgr.c, and my earlier patch had moved out of md.c into
smgrsync.c, and we put it into a new place
src/backend/storage/file/sync.c?  Or somewhere else, but not under
smgr.  It doesn't know anything about smgr concepts, and it can be
used to schedule file sync for any kind of file, not just the smgr
implementations' files.  Though they'd be the main customers, I guess.
md.c and undofile.c etc would call it to register stuff, and
checkpointer.c would call it to actually perform all the fsync calls.

If we do that, the next question is how to represent filenames.  One
idea is to use tags that can be converted back to a path.  I suppose
there could be a table of function pointers somewhere, and the tag
could be a discriminated union?  Or just a discriminator + a small
array of dumb uint32_t of a size big enough for all users, a bit like
lock tags.
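
For what it's worth, one possible shape for such a tag, loosely modelled on
LOCKTAG (the names and fields below are illustrative, not from any posted
patch): the handler discriminator selects a callback that knows how to turn
the remaining fields back into a pathname, so the sync code itself needs no
knowledge of md.c's or undofile.c's file layout.

    #include <stddef.h>
    #include <stdint.h>

    typedef struct SyncFileTag
    {
        uint16_t    handler;    /* which component owns this tag */
        uint16_t    forknum;    /* fork number, where applicable */
        uint32_t    data[4];    /* e.g. tablespace / db / relfilenode / segno */
    } SyncFileTag;

    /* Callback provided by each handler to rebuild a pathname from its tag. */
    typedef void (*SyncFilePathCB) (const SyncFileTag *tag,
                                    char *path, size_t pathlen);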

One reason to want to use small fixed-sized tags is to avoid atomicity
problems in the future when it comes to the fd-passing work, as
mentioned up-thread.  Here are some other ideas, to avoid having to
use tags:

* send the paths through a shm queue, but the fds through the Unix
domain socket; the messages carrying fds somehow point to the pathname
in the shm queue (and deal with slight disorder...)
* send the paths through the 

Re: speeding up planning with partitions

2019-02-20 Thread Amit Langote
On 2019/02/21 0:50, Tom Lane wrote:
> However, there are other reasons why I'm not really happy with the
> approach proposed in this patch.
> 
> The main problem is that cloning the PlannerInfo while still sharing a lot
> of infrastructure between the clones is a horrid hack that I think will be
> very buggy and unmaintainable.  We've gotten away with it so far in
> inheritance_planner because (a) the complexity is all local to that
> function and (b) the cloning happens very early in the planning process,
> so that there's not much shared subsidiary data to worry about; mostly
> just the parse tree, which in fact isn't shared because the first thing
> we do is push it through adjust_appendrel_attrs.  This patch proposes
> to clone much later, and both the actual cloning and the consequences
> of that are spread all over, and I don't think we're nearly done with
> the consequences :-(.  I found the parameterized-path problem while
> wondering why it was okay to not worry about adjusting attrs in data
> structures used during path construction for other baserels ... turns
> out it isn't.  There's a lot of other stuff in PlannerInfo that you're
> not touching, for instance pathkeys and placeholders; and I'm afraid
> much of that represents either current bugs or future problems.
> 
> So what I feel we should do is set this aside for now and see if we
> can make something of the other idea I proposed.  If we could get
> rid of expand-inheritance-at-the-top altogether, and plan inherited
> UPDATE/DELETE the same as inherited SELECT, that would be a large
> reduction in planner complexity, hence much to be preferred over this
> approach which is a large increase in planner complexity.  If that
> approach crashes and burns, we can come back to this.

OK, I agree that the other approach might be a better way forward.  It'll
not just improve performance in an elegant manner, but will also make
other projects more feasible, such as MERGE, which Fujita-san mentioned on
the other thread.

> There might be parts of this work we can salvage, though.  It seems
> like the idea of postponing expand_inherited_tables() might be
> something we could use anyway.

+1.  So, let's try to do things in this order:

1. Make inheritance-expansion-at-bottom case perform better now,
addressing at least SELECT performance in PG 12, provided we manage to get
the patches in order in time (I'll try to post the updated
lazy-inheritance-expansion patch later this week.)

2. Overhaul inherited UPDATE/DELETE planning to use
inheritance-expansion-at-bottom (PG 13)

It's unfortunate that UPDATE/DELETE won't perform as well as SELECTs even a
couple of releases after declarative partitioning was introduced, but I
agree that we should solve the underlying issues in an elegant way.

Thanks,
Amit




Re: proposal: variadic argument support for least, greatest function

2019-02-20 Thread Chapman Flack
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   not tested
Documentation:not tested

Latest patch passes installcheck-world as of d26a810 and does what it sets out 
to do.

I am not sure I have an answer to the objections being raised on grounds of 
taste. To me, it's persuasive that GREATEST and LEAST are described in the 
docco as functions, they are used much like variadic functions, and this patch 
allows them to be used in the ways you would expect variadic functions to be 
usable.

But as to technical readiness, this builds and passes installcheck-world. The 
functions behave as expected (and return null if passed an empty array, or an 
array containing only nulls, or variadic null::foo[]).

Still no corresponding regression tests or doc.

The new status of this patch is: Waiting on Author


Re: insensitive collations

2019-02-20 Thread Peter Geoghegan
On Tue, Feb 19, 2019 at 6:47 AM Peter Eisentraut
 wrote:
> Another patch to address merge conflicts.

Some remarks on this:

* Your draft commit message says:

> This patch makes changes in three areas:
>
> - CREATE COLLATION DDL changes and system catalog changes to support
>   this new flag.
>
> - Many executor nodes and auxiliary code are extended to track
>   collations.  Previously, this code would just throw away collation
>   information, because the eventually-called user-defined functions
>   didn't use it since they only cared about equality, which didn't
>   need collation information.
>
> - String data type functions that do equality comparisons and hashing
>   are changed to take the (non-)deterministic flag into account.  For
>   comparison, this just means skipping various shortcuts and tie
>   breakers that use byte-wise comparison.  For hashing, we first need
>   to convert the input string to a canonical "sort key" using the ICU
>   analogue of strxfrm().

I wonder if it would be better to break this into distinct commits?

* Why is support for non-deterministic comparisons limited to the ICU
provider? If that's the only case that could possibly be affected,
then why did we ever add the varstrcmp() tie-breaker call to strcmp()
in the first place? The behavior described in the commit message of
bugfix commit 656beff5 describes a case where Hungarian text caused
index corruption by being strcoll()-wise equal but not bitwise equal.
Besides, I think that you can vendor your own case insensitive
collation with glibc, since it's based on UCA.

This restriction feels like it's actually a kludge to work around the
fact that database-wide collations have this foreign key related
restriction in your patch:

> @@ -2901,11 +2921,20 @@ ri_AttributesEqual(Oid eq_opr, Oid typeid,
> }
>
> /*
> -* Apply the comparison operator.  We assume it doesn't care about
> -* collations.
> -*/
> -   return DatumGetBool(FunctionCall2(&entry->eq_opr_finfo,
> - oldvalue, newvalue));
> +* Apply the comparison operator.
> +*
> +* Note: This function is part of a call stack that determines whether an
> +* update to a row is significant enough that it needs checking or action
> +* on the other side of a foreign-key constraint.  Therefore, the
> +* comparison here would need to be done with the collation of the *other*
> +* table.  For simplicity (e.g., we might not even have the other table
> +* open), we'll just use the default collation here, which could lead to
> +* some false negatives.  All this would break if we ever allow
> +* database-wide collations to be nondeterministic.
> +*/
> +   return DatumGetBool(FunctionCall2Coll(&entry->eq_opr_finfo,
> + DEFAULT_COLLATION_OID,
> + oldvalue, newvalue));
>  }

The existing restriction on ICU collations (that they cannot be
database-wide) side-steps the issue.

* Can said restriction somehow be lifted? That seems like it would be
a lot cleaner.

* Why have you disabled this optimization?:

> /* Fast pre-check for equality, as discussed in varstr_cmp() */
> -   if (len1 == len2 && memcmp(a1p, a2p, len1) == 0)
> +   if ((!sss->locale || sss->locale->deterministic) &&
> +   len1 == len2 && memcmp(a1p, a2p, len1) == 0)

I don't see why this is necessary. A non-deterministic collation
cannot indicate that bitwise identical strings are non-equal.

* Perhaps you should add a "Tip" referencing the feature to the
contrib/citext documentation.
--
Peter Geoghegan



Re: NOT IN subquery optimization

2019-02-20 Thread Tom Lane
Jim Finnerty  writes:
> re: The idea that's been kicked around in the past is to detect whether the
> subselect's output column(s) can be proved NOT NULL, and if so, convert
> to an antijoin just like NOT EXISTS

> basically, yes.  this will handle nullability of both the outer and inner
> correlated expression(s), multiple expressions, presence or absence of
> predicates in the WHERE clause, and whether the correlated expressions are
> on the null-padded side of an outer join.  If it is judged to be more
> efficient, then it transforms the NOT IN sublink into an anti-join.

Hmm, that seems overcomplicated ...

> some complications enter into the decision to transform NOT IN to anti-join
> based on whether a bitmap plan will/not be used, or whether it will/not be
> eligible for PQ.

... and that even more so, considering that this decision really needs
to be taken long before cost estimates would be available.

As far as I can see, there should be no situation where we'd not want
to transform to antijoin if we can prove it's semantically valid to
do so.  If there are cases where that comes out as a worse plan,
that indicates a costing error that would be something to address
separately (because it'd also be a problem for other antijoin cases).
Also, as long as it nearly always wins, I'm not going to cry too hard
if there are corner cases where it makes the wrong choice.  That's not
something that's possible to avoid completely.

regards, tom lane



Re: Performance issue in foreign-key-aware join estimation

2019-02-20 Thread David Rowley
On Thu, 21 Feb 2019 at 15:00, Tom Lane  wrote:
> Pushed that one.  I'm interested by the "POC" patch, but I agree
> that it'd take some research to show that it isn't a net negative
> for simple queries.  It sounds like you're not really interested
> in pursuing that right now?

Thanks for pushing it.

I'm still interested in the POC patch. I just knew it wasn't something
for the back branches and thought something that was would be more
important... Things having gone quiet here wasn't a good source of
inspiration to do any further work on it. It's good to hear you're
interested.

> Anyway, I rebased the POC patch up to HEAD, just in case anyone
> still wants to play with it.

Cool. Thanks.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Performance issue in foreign-key-aware join estimation

2019-02-20 Thread Tom Lane
David Rowley  writes:
> Attaching the original patch again so the commitfest bot gets off my
> back about the other one not applying.

Pushed that one.  I'm interested by the "POC" patch, but I agree
that it'd take some research to show that it isn't a net negative
for simple queries.  It sounds like you're not really interested
in pursuing that right now?

Anyway, I rebased the POC patch up to HEAD, just in case anyone
still wants to play with it.

regards, tom lane

diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 65302fe..2ffa00c 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -2351,6 +2351,7 @@ _outEquivalenceClass(StringInfo str, const EquivalenceClass *node)
 	WRITE_NODE_FIELD(ec_sources);
 	WRITE_NODE_FIELD(ec_derives);
 	WRITE_BITMAPSET_FIELD(ec_relids);
+	WRITE_BITMAPSET_FIELD(ec_allrelids);
 	WRITE_BOOL_FIELD(ec_has_const);
 	WRITE_BOOL_FIELD(ec_has_volatile);
 	WRITE_BOOL_FIELD(ec_below_outer_join);
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 0debac7..c5888b8 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -216,6 +216,8 @@ make_one_rel(PlannerInfo *root, List *joinlist)
 	}
 	root->total_table_pages = total_pages;
 
+	root->eq_index = create_eclass_index(root, EQUIVALANCE_IDX_MULTI_MEMBER);
+
 	/*
 	 * Generate access paths for each base rel.
 	 */
@@ -226,6 +228,9 @@ make_one_rel(PlannerInfo *root, List *joinlist)
 	 */
 	rel = make_rel_from_joinlist(root, joinlist);
 
+	free_eclass_index(root->eq_index);
+	root->eq_index = NULL;
+
 	/*
 	 * The result should join all and only the query's base rels.
 	 */
diff --git a/src/backend/optimizer/path/equivclass.c b/src/backend/optimizer/path/equivclass.c
index 61b5b11..570b42f 100644
--- a/src/backend/optimizer/path/equivclass.c
+++ b/src/backend/optimizer/path/equivclass.c
@@ -358,6 +358,7 @@ process_equivalence(PlannerInfo *root,
 		ec1->ec_sources = list_concat(ec1->ec_sources, ec2->ec_sources);
 		ec1->ec_derives = list_concat(ec1->ec_derives, ec2->ec_derives);
 		ec1->ec_relids = bms_join(ec1->ec_relids, ec2->ec_relids);
+		ec1->ec_allrelids = bms_join(ec1->ec_allrelids, ec2->ec_allrelids);
 		ec1->ec_has_const |= ec2->ec_has_const;
 		/* can't need to set has_volatile */
 		ec1->ec_below_outer_join |= ec2->ec_below_outer_join;
@@ -372,6 +373,7 @@ process_equivalence(PlannerInfo *root,
 		ec2->ec_sources = NIL;
 		ec2->ec_derives = NIL;
 		ec2->ec_relids = NULL;
+		ec2->ec_allrelids = NULL;
 		ec1->ec_sources = lappend(ec1->ec_sources, restrictinfo);
 		ec1->ec_below_outer_join |= below_outer_join;
 		ec1->ec_min_security = Min(ec1->ec_min_security,
@@ -432,6 +434,7 @@ process_equivalence(PlannerInfo *root,
 		ec->ec_sources = list_make1(restrictinfo);
 		ec->ec_derives = NIL;
 		ec->ec_relids = NULL;
+		ec->ec_allrelids = NULL;
 		ec->ec_has_const = false;
 		ec->ec_has_volatile = false;
 		ec->ec_below_outer_join = below_outer_join;
@@ -572,6 +575,7 @@ add_eq_member(EquivalenceClass *ec, Expr *expr, Relids relids,
 	{
 		ec->ec_relids = bms_add_members(ec->ec_relids, relids);
 	}
+	ec->ec_allrelids = bms_add_members(ec->ec_allrelids, relids);
 	ec->ec_members = lappend(ec->ec_members, em);
 
 	return em;
@@ -708,6 +712,7 @@ get_eclass_for_sort_expr(PlannerInfo *root,
 	newec->ec_sources = NIL;
 	newec->ec_derives = NIL;
 	newec->ec_relids = NULL;
+	newec->ec_allrelids = NULL;
 	newec->ec_has_const = false;
 	newec->ec_has_volatile = contain_volatile_functions((Node *) expr);
 	newec->ec_below_outer_join = false;
@@ -1114,6 +1119,60 @@ generate_join_implied_equalities_for_ecs(PlannerInfo *root,
 		nominal_join_relids = join_relids;
 	}
 
+	if (root->eq_index != NULL)
+	{
+		Bitmapset  *inner_ecs = NULL;
+		Bitmapset  *join_ecs = NULL;
+		Bitmapset  *matching_ecs;
+		int			i;
+
+		Assert((root->eq_index->ei_flags & EQUIVALANCE_IDX_MULTI_MEMBER));
+
+		i = -1;
+		while ((i = bms_next_member(inner_relids, i)) > 0)
+			inner_ecs = bms_add_members(inner_ecs, root->eq_index->ei_index[i]);
+
+		i = -1;
+		while ((i = bms_next_member(join_relids, i)) > 0)
+			join_ecs = bms_add_members(join_ecs, root->eq_index->ei_index[i]);
+
+		/* Determine all eclasses in common between inner rels and join rels */
+		matching_ecs = bms_int_members(inner_ecs, join_ecs);
+
+		i = -1;
+		while ((i = bms_next_member(matching_ecs, i)) >= 0)
+		{
+			EquivalenceClass *ec = root->eq_index->ei_classes[i];
+			List	   *sublist = NIL;
+
+			/* ECs containing consts do not need any further enforcement */
+			if (ec->ec_has_const)
+				continue;
+
+			/* Ensure the class contains members for any of nominal_join_relids */
+			Assert(bms_overlap(ec->ec_relids, nominal_join_relids));
+
+			if (!ec->ec_broken)
+				sublist = generate_join_implied_equalities_normal(root,
+																  ec,
+																  join_relids,
+																  outer_relids,
+																  inner_relids);
+
+			

Re: Delay locking partitions during INSERT and UPDATE

2019-02-20 Thread Tom Lane
Robert Haas  writes:
> On Wed, Feb 20, 2019 at 3:57 PM Tom Lane  wrote:
>> Looking at the patch itself, I agree that a bit more attention to comments
>> is needed, and I wonder whether David has found all the places where
>> it's now necessary to s/NoLock/RowExclusiveLock/.  I don't have any
>> other objections.

> I spent some time thinking about that exact issue this morning and
> studying the code to try to figure that out.  I wasn't able to find
> any other places that seemed to need updating, but it could be that I
> missed something that David also missed.

Actually, in the wake of b04aeb0a0, we probably need not be too stressed
about the possibility that we missed something.  Those assertions wouldn't
detect doing an unwanted lock upgrade, but I think that's unlikely to be
an issue here (or if it is, we have the problem anyway).

regards, tom lane



Re: libpq host/hostaddr/conninfo inconsistencies

2019-02-20 Thread Tom Lane
Robert Haas  writes:
>> Robert, any chance you could opine on the doc patch, given that's your
>> suggested direction?

> I find it to be more of a change than we really need, and I'm not sure
> how much it helps to clarify the issue at hand. Here is a simpler
> change which seems like it might do the trick (or maybe not, let's see
> what others think).

My only complaint about this is that it makes it sound like you *must*
provide "host", even when giving "hostaddr":

-Numeric IP address of host to connect to.  This should be in the
+Numeric IP address that should be used for the server specified
+by host.  This should be in the

The second para explains the cases in which you actually do need to
provide "host", but I'm afraid that the first sentence will have
misled people enough that they won't get the point.

I don't think there's anything very wrong with the existing wording
of this sentence.

Robert's second and third changes seem fine, though.

regards, tom lane



Re: amcheck verification for GiST

2019-02-20 Thread Peter Geoghegan
On Sun, Feb 17, 2019 at 12:55 AM Andrey Borodin  wrote:
> That's true, we cannot avoid locking parent and child page simultaneously to 
> check correctness of tuples.

Right.

Some further questions/comments:

* I think that this should be an error:

> +   if (GistPageIsLeaf(page))
> +   {
> +   /* should never happen unless it is root */
> +   Assert(stack->blkno == GIST_ROOT_BLKNO);
> +   }

I use assertions in the verify_nbtree.c, but only for things that are
programming errors, and state that amcheck "owns". If they fail, then
it's a bug in amcheck specifically (they're really obvious assertions
about local state). Whereas this case could in theory happen with a
corrupt index, for example due to a page written in the wrong place.
I'm sure that the root block looks very similar to a leaf, so we won't
detect this any other way.

It's good to be paranoid, and to even think adversarially, provided it
doesn't make the design more difficult and has low runtime overhead.
We could debate whether or not this corruption is realistic, but it's
easy to make it an error and the cost is low, so you should just do
it.
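
For example, the assertion could simply become a corruption report,
something like this (a sketch only, reusing the variable names from the
quoted hunk):

    if (GistPageIsLeaf(page) && stack->blkno != GIST_ROOT_BLKNO)
        ereport(ERROR,
                (errcode(ERRCODE_INDEX_CORRUPTED),
                 errmsg("index \"%s\" has leaf page %u outside the leaf level",
                        RelationGetRelationName(rel), stack->blkno)));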

* Couldn't leaf-like root pages also get some of the testing that we
do for other pages within check_index_tuple()? Ideally it wouldn't be
too much of a special case, though.

* I think that you need to add an
errcode(ERRCODE_FEATURE_NOT_SUPPORTED) to this:

> +   /*
> +* Check that it's not a leftover invalid tuple from pre-9.1
> +* See also gistdoinsert() and gistbulkdelete() handling of such tuples.
> +* We do not consider it error here, but warn operator.
> +*/
> +   if (GistTupleIsInvalid(idxtuple))
> +   ereport(ERROR,
> +   (errmsg("index \"%s\" contains an inner tuple marked as 
> invalid",
> +   RelationGetRelationName(rel)),
> +errdetail("This is caused by an incomplete page split at 
> crash recovery before upgrading to PostgreSQL 9.1."),
> +errhint("Please REINDEX it.")));
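
That is, roughly this (the quoted code with the errcode added):

    if (GistTupleIsInvalid(idxtuple))
        ereport(ERROR,
                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
                 errmsg("index \"%s\" contains an inner tuple marked as invalid",
                        RelationGetRelationName(rel)),
                 errdetail("This is caused by an incomplete page split at crash recovery before upgrading to PostgreSQL 9.1."),
                 errhint("Please REINDEX it.")));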

> Currently, we do not check index tuples against heap. Should we do this or 
> leave this for another patch?

To address this question: I would leave this for now. It can probably
work based on the same principle as nbtree's heapallindexed
verification, with few or no changes, but I don't think that we need
to make that happen in this release.

* You still hold multiple buffer locks at once, starting with the
parent and moving to the child. Only 2 buffer locks. Why is this
necessary, given that you now hold a ShareLock on both heap and index
relations? Couldn't you just copy the parent into local memory, in the
style of verify_nbtree.c?
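
Something along these lines, say (a rough sketch, the function name is
made up, not from your patch; compare palloc_btree_page() in
verify_nbtree.c):

    static Page
    gist_palloc_page_copy(Relation rel, BlockNumber blkno)
    {
        Buffer      buffer;
        Page        copy = (Page) palloc(BLCKSZ);

        buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL, NULL);
        LockBuffer(buffer, BUFFER_LOCK_SHARE);
        memcpy(copy, BufferGetPage(buffer), BLCKSZ);
        UnlockReleaseBuffer(buffer);

        /* caller examines the copy without holding any buffer lock */
        return copy;
    }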

* Note that this "parent then child" lock order seems to not be
consistent with the general rule for holding concurrent buffer locks
that is described in the GiST README:

"""
Concurrency control
---
As a rule of thumb, if you need to hold a lock on multiple pages at the
same time, the locks should be acquired in the following order: child page
before parent, and left-to-right at the same level. Always acquiring the
locks in the same order avoids deadlocks.
"""

* The main point of entry for GiST verification is
gist_check_parent_keys_consistency(). It would be nice to have
comments that describe what it does, and in what order. I understand
that English is not your first language, which makes it harder, but I
would appreciate it if you made the effort to explain the theory. I
don't want to assume that I understand your intent -- I could be
wrong.

* Suggest that you use function prototypes consistently, even for
static functions. That is the style that we prefer. Also, please make
the comments consistent with project guidelines on indentation and
style.

* It would be nice to know if the code in
gist_check_parent_keys_consistency() finds problems in the parent, or
in the child. The nbtree check has an idea of a "target" page: every
page gets to be the target exactly once, and we only ever find
problems in the current target. Maybe that is arbitrary in some cases,
because the relationship between the two is where the problem actually
is. I still think that it is a good idea to make it work that way if
possible. It makes it easier to describe complicated relationships in
comments.

* Why does it make sense to use range_gist_picksplit()/operator class
"union" support function to verify parent/child relationships? Does it
handle NULL values correctly, given the special rules?

* Why not use the "consistent" support function instead of the "union"
support function? Could you write new code that was based on
gistindex_keytest() to do this?

* The GiST docs ("63.3. Extensibility") say of the "distance" support
function: "For an internal tree node, the distance returned must not
be greater than the distance to any of the child nodes". Is there an
opportunity to test this relationship, too, by making sure that
distance is sane? Perhaps this is a very naive question 

Re: NOT IN subquery optimization

2019-02-20 Thread Tom Lane
Jim Finnerty  writes:
> We have been working on improved optimization of NOT IN, and we would like
> to share this optimizaton with the community.

The idea that's been kicked around in the past is to detect whether the
subselect's output column(s) can be proved NOT NULL, and if so, convert
to an antijoin just like NOT EXISTS.  Is that what you're doing, or
something different?

> We would like to include a patch for this change in the current commitfest. 

Generally, people just send patches, they don't ask for permission first
;-)

Having said that, we have a general policy that we don't like complex
patches that first show up for the last commitfest of a dev cycle.
So unless this is a pretty small patch, it's probably going to get
delayed to v13.  Still, we'd like to have it in the queue, so submit
away ...

regards, tom lane



Re: Set fallback_application_name for a walreceiver to cluster_name

2019-02-20 Thread Euler Taveira
Em sex, 8 de fev de 2019 às 05:16, Peter Eisentraut
 escreveu:
>
> By default, the fallback_application_name for a physical walreceiver is
> "walreceiver".  This means that multiple standbys cannot be
> distinguished easily on a primary, for example in pg_stat_activity or
> synchronous_standby_names.
>
Although a standby could be identified by client_addr in
pg_stat_activity, this feature is useful when multiple clusters run on
the same host. Since it is a fallback application name, it can be
overridden by an application_name in the primary_conninfo parameter.

> I propose, if cluster_name is set, use that for
> fallback_application_name in the walreceiver.  (If it's not set, it
> remains "walreceiver".)  If someone set cluster_name to identify their
> instance, we might as well use that by default to identify the node
> remotely as well.  It's still possible to specify another
> application_name in primary_conninfo explicitly.
>
I tested it and both patches work as described. Passes all tests. Doc
describes the proposed feature. Doc builds without errors.


-- 
   Euler Taveira   Timbira -
http://www.timbira.com.br/
   PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento



Re: propagating replica identity to partitions

2019-02-20 Thread Simon Riggs
On Wed, 20 Feb 2019 at 18:51, Robert Haas  wrote:

I don't buy Simon's argument that we should treat TABLESPACE
> differently because the tables might be really big and take a long
> time to move.  I agree that this could well be true, but nobody is
> proposing to remove the ability to move tables individually or to use
> ONLY here.  Allowing TABLESPACE to recurse just gives people one
> additional choice that they don't have today: to move everything at
> once. We don't lose any functionality by enabling that.
>

Doing that would add the single largest footgun in Postgres, given that
command's current behavior and the size of partitioned tables.

If it moved partitions concurrently I'd feel differently.

-- 
Simon Riggs    http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: WAL insert delay settings

2019-02-20 Thread Stephen Frost
Greetings,

* Andres Freund (and...@anarazel.de) wrote:
> On 2019-02-20 18:46:09 -0500, Stephen Frost wrote:
> > * Tomas Vondra (tomas.von...@2ndquadrant.com) wrote:
> > > On 2/20/19 10:43 PM, Stephen Frost wrote:
> > > > Just to share a few additional thoughts after pondering this for a
> > > > while, but the comment Andres made up-thread really struck a chord- we
> > > > don't necessairly want to throttle anything, what we'd really rather do
> > > > is *prioritize* things, whereby foreground work (regular queries and
> > > > such) have a higher priority than background/bulk work (VACUUM, REINDEX,
> > > > etc) but otherwise we use the system to its full capacity.  We don't
> > > > actually want to throttle a VACUUM run any more than a CREATE INDEX, we
> > > > just don't want those to hurt the performance of regular queries that
> > > > are happening.
> > > 
> > > I think you're forgetting the motivation of this very patch was to
> > > prevent replication lag caused by a command generating large amounts of
> > > WAL (like CREATE INDEX / ALTER TABLE etc.). That has almost nothing to
> > > do with prioritization or foreground/background split.
> > > 
> > > I'm not arguing against ability to prioritize stuff, but I disagree it
> > > somehow replaces throttling.
> > 
> > Why is replication lag an issue though?  I would contend it's an issue
> > because with sync replication, it makes foreground processes wait, and
> > with async replication, it makes the actions of foreground processes
> > show up late on the replicas.
> 
> I think reaching the bandwidth limit of either the replication stream,
> or of the startup process is actually more common than these. And for
> that prioritization doesn't help, unless it somehow reduces the total
> amount of WAL.

The issue with hitting those bandwidth limits is that you end up with
queues outside of your control and therefore are unable to prioritize
the data going through them.  I agree, that's an issue and it might be
necessary to ask the admin to provide what the bandwidth limit is, so
that we could then avoid running into issues with downstream queues that
are outside of our control causing unexpected/unacceptable lag.

> > If the actual WAL records for the foreground processes got priority and
> > were pushed out earlier than the background ones, that would eliminate
> > both of those issues with replication lag.  Perhaps there's other issues
> > that replication lag cause but which aren't solved by prioritizing the
> > actual WAL records that you care about getting to the replicas faster,
> > but if so, I'd like to hear what those are.
> 
> Wait, what? Are you actually suggesting that different sources of WAL
> records should be streamed out in different order? You're blowing a
> somewhat reasonably doable project up into "let's rewrite a large chunk
> of all of the durability layer in postgres".
> 
> Stephen, we gotta stop blowing up projects into something that can't
> ever realistically be finished.

I started this sub-thread specifically the way I did because I was
trying to make it clear that these were just ideas for possible
discussion- I'm *not* suggesting, nor saying, that we have to go
implement this right now instead of implementing the throttling that
started this thread.  I'm also, to be clear, not objecting to
implementing the throttling discussed (though, as mentioned but
seemingly ignored, I'd see it maybe configurable in different ways than
originally suggested).

If there's a way I can get that across more clearly than saying "Just to
share a few additional thoughts", I'm happy to try and do so, but I
don't agree that I should be required to simply keep such thoughts to
myself; indeed, I'll admit that I don't know how large a project this
would actually be and while I figured it'd be *huge*, I wanted to share
the thought in case someone might see a way that we could implement it
with much less work and have a better solution as a result.

Thanks!

Stephen




Re: Ryu floating point output patch

2019-02-20 Thread Andrew Gierth
> "Tom" == Tom Lane  writes:

 Tom> typedef char bool;
 Tom> #endif

 Tom> I believe that we could suppress these warnings by changing that
 Tom> last to be

 Tom> typedef unsigned char bool;

After testing this locally with a hand-tweaked configure result (since I
don't have hardware that triggers it) it all looked ok, so I've pushed
it and at least locust seems to be happy with the result.

-- 
Andrew (irc:RhodiumToad)



Re: ToDo: show size of partitioned table

2019-02-20 Thread Justin Pryzby
On Sat, Feb 16, 2019 at 10:52:35PM +0100, Pavel Stehule wrote:
> I like your changes. I merged all - updated patch is attached

I applied and tested your v10 patch.

Find attached some light modifications.

Thanks,
Justin
diff --git a/doc/src/sgml/ref/psql-ref.sgml b/doc/src/sgml/ref/psql-ref.sgml
index 9fb632b0bd..4b94c9261a 100644
--- a/doc/src/sgml/ref/psql-ref.sgml
+++ b/doc/src/sgml/ref/psql-ref.sgml
@@ -1670,17 +1670,18 @@ testdb=
 partitioned tables are listed; supply a pattern to also include
 partitioned indexes.  If the form \dP+
 is used, the sum of sizes of table's partitions (including their
-indexes) and associated description are also displayed.
+indexes) is displayed, as is the relation's description.
 
 
 
- If modifier n (which stands for
- nested) is used, then non-root partitioned tables are
- displayed too.  The displayed size is divided into two columns in
- this case: one that shows the total size of only the directly
- attached leaf partitions and another that shows total size of all
- partitions, also considering other sub-partitioned partitions, for
- each partitioned tables that's displayed.
+If the modifier n (nested) is used,
+then non-root partitioned tables are included, and a column is shown
+displaying the parent of each partitioned relation.
+
+If n is combined with +, two
+sizes are shown: one including the total size of directly-attached
+leaf partitions, and another showing the total size of all partitions,
+including indirectly attached sub-partitions.
 
 
   
@@ -1694,11 +1695,11 @@ testdb=
 class="parameter">pattern is specified, only entries
 whose name matches the pattern are listed. If the form
 \dPi+ is used, the sum of sizes of index's
-partitions and associated description are also displayed.
+partitions is also displayed, along with the associated description.
 
 
 
- If the modifier n is used, non-root partitioned
+ If the n modifier is used, non-root partitioned
  indexes are displayed too.
 
 
@@ -1713,11 +1714,11 @@ testdb=
 class="parameter">pattern is specified, only entries
 whose name matches the pattern are listed. If the form
 \dPt+ is used, the sum of sizes of table's
-partitions and associated description are also displayed.
+partitions is also displayed, along with the associated description.
 
 
 
- If the modifier n is used, non-root partitioned
+ If the n modifier is used, non-root partitioned
  tables are displayed too.
 
 
diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index 2b8628f2ff..6997baedf6 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -3768,16 +3768,16 @@ listPartitions(const char *pattern, bool verbose, bool show_indexes,
 		{
 			appendPQExpBuffer(&buf,
 			  ",\n  s.dps as \"%s\"",
-			  gettext_noop("Direct partitions size"));
+			  gettext_noop("Size: leaves"));
 			appendPQExpBuffer(&buf,
 			  ",\n  s.tps as \"%s\"",
-			  gettext_noop("Total partitions size"));
+			  gettext_noop("Size: total"));
 		}
 		else
 			/* Sizes of all partitions are considered in this case. */
 			appendPQExpBuffer(&buf,
 			  ",\n  s.tps as \"%s\"",
-			  gettext_noop("Partitions size"));
+			  gettext_noop("Size: total"));
 
 		appendPQExpBuffer(&buf,
 		  ",\n  pg_catalog.obj_description(c.oid, 'pg_class') as \"%s\"",
-- 
2.16.4



Re: WAL insert delay settings

2019-02-20 Thread Andres Freund
Hi,

On 2019-02-20 18:46:09 -0500, Stephen Frost wrote:
> * Tomas Vondra (tomas.von...@2ndquadrant.com) wrote:
> > On 2/20/19 10:43 PM, Stephen Frost wrote:
> > > Just to share a few additional thoughts after pondering this for a
> > > while, but the comment Andres made up-thread really struck a chord- we
> > > don't necessairly want to throttle anything, what we'd really rather do
> > > is *prioritize* things, whereby foreground work (regular queries and
> > > such) have a higher priority than background/bulk work (VACUUM, REINDEX,
> > > etc) but otherwise we use the system to its full capacity.  We don't
> > > actually want to throttle a VACUUM run any more than a CREATE INDEX, we
> > > just don't want those to hurt the performance of regular queries that
> > > are happening.
> > 
> > I think you're forgetting the motivation of this very patch was to
> > prevent replication lag caused by a command generating large amounts of
> > WAL (like CREATE INDEX / ALTER TABLE etc.). That has almost nothing to
> > do with prioritization or foreground/background split.
> > 
> > I'm not arguing against ability to prioritize stuff, but I disagree it
> > somehow replaces throttling.
> 
> Why is replication lag an issue though?  I would contend it's an issue
> because with sync replication, it makes foreground processes wait, and
> with async replication, it makes the actions of foreground processes
> show up late on the replicas.

I think reaching the bandwidth limit of either the replication stream,
or of the startup process is actually more common than these. And for
that prioritization doesn't help, unless it somehow reduces the total
amount of WAL.


> If the actual WAL records for the foreground processes got priority and
> were pushed out earlier than the background ones, that would eliminate
> both of those issues with replication lag.  Perhaps there's other issues
> that replication lag cause but which aren't solved by prioritizing the
> actual WAL records that you care about getting to the replicas faster,
> but if so, I'd like to hear what those are.

Wait, what? Are you actually suggesting that different sources of WAL
records should be streamed out in different order? You're blowing a
somewhat reasonably doable project up into "let's rewrite a large chunk
of all of the durability layer in postgres".


Stephen, we gotta stop blowing up projects into something that can't
ever realistically be finished.

Greetings,

Andres Freund



Re: Refactoring the checkpointer's fsync request queue

2019-02-20 Thread Andres Freund
Hi,

Thanks for the update!

On 2019-02-20 15:27:40 -0800, Shawn Debnath wrote:
> As promised, here's a patch that addresses the points discussed by 
> Andres and Thomas at FOSDEM. As a result of how we want checkpointer to 
> track what files to fsync, the pending ops table now integrates the 
> forknum and segno as part of the hash key eliminating the need for the 
> bitmapsets, or vectors from the previous iterations. We re-construct the 
> pathnames from the RelFileNode, ForkNumber and SegmentNumber and use 
> PathNameOpenFile to get the file descriptor to use for fsync.

I still object to exposing segment numbers to smgr and above. I think
that's an implementation detail of the various storage managers, and we
shouldn't expose smgr.c further than it already is.


> Apart from that, this patch moves the system for requesting and 
> processing fsyncs out of md.c into smgr.c, allowing us to call on smgr 
> component specific callbacks to retrieve metadata like relation and 
> segment paths. This allows smgr components to maintain how relfilenodes, 
> forks and segments map to specific files without exposing this knowledge 
> to smgr.  It redefines smgrsync() behavior to be closer to that of 
> smgrimmedsysnc(), i.e., if a regular sync is required for a particular 
> file, enqueue it in locally or forward it to checkpointer.  
> smgrimmedsync() retains the existing behavior and fsyncs the file right 
> away. The processing of fsync requests has been moved from mdsync() to a 
> new ProcessFsyncRequests() function.

I think that's also wrong, imo fsyncs are storage detail, and should be
relegated to *below* md.c.  That's not to say the code should live in
md.c, but the issuing of such calls shouldn't be part of smgr.c -
consider e.g. developing a storage engine living in non volatile ram.

Greetings,

Andres Freund



Re: WAL insert delay settings

2019-02-20 Thread Stephen Frost
Greetings,

* Tomas Vondra (tomas.von...@2ndquadrant.com) wrote:
> On 2/20/19 10:43 PM, Stephen Frost wrote:
> > Just to share a few additional thoughts after pondering this for a
> > while, but the comment Andres made up-thread really struck a chord- we
> > don't necessairly want to throttle anything, what we'd really rather do
> > is *prioritize* things, whereby foreground work (regular queries and
> > such) have a higher priority than background/bulk work (VACUUM, REINDEX,
> > etc) but otherwise we use the system to its full capacity.  We don't
> > actually want to throttle a VACUUM run any more than a CREATE INDEX, we
> > just don't want those to hurt the performance of regular queries that
> > are happening.
> 
> I think you're forgetting the motivation of this very patch was to
> prevent replication lag caused by a command generating large amounts of
> WAL (like CREATE INDEX / ALTER TABLE etc.). That has almost nothing to
> do with prioritization or foreground/background split.
> 
> I'm not arguing against ability to prioritize stuff, but I disagree it
> somehow replaces throttling.

Why is replication lag an issue though?  I would contend it's an issue
because with sync replication, it makes foreground processes wait, and
with async replication, it makes the actions of foreground processes
show up late on the replicas.

If the actual WAL records for the foreground processes got priority and
were pushed out earlier than the background ones, that would eliminate
both of those issues with replication lag.  Perhaps there's other issues
that replication lag cause but which aren't solved by prioritizing the
actual WAL records that you care about getting to the replicas faster,
but if so, I'd like to hear what those are.

> > The other thought I had was that doing things on a per-table basis, in
> > particular, isn't really addressing the resource question appropriately.
> > WAL is relatively straight-forward and independent of a resource from
> > the IO for the heap/indexes, so getting an idea from the admin of how
> > much capacity they have for WAL makes sense.  When it comes to the
> > capacity for the heap/indexes, in terms of IO, that really goes to the
> > underlying storage system/channel, which would actually be a tablespace
> > in properly set up environments (imv anyway).
> > 
> > Wrapping this up- it seems unlikely that we're going to get a
> > priority-based system in place any time particularly soon but I do think
> > it's worthy of some serious consideration and discussion about how we
> > might be able to get there.  On the other hand, if we can provide a way
> > for the admin to say "these are my IO channels (per-tablespace values,
> > plus a value for WAL), here's what their capacity is, and here's how
> > much buffer for foreground work I want to have (again, per IO channel),
> > so, PG, please arrange to not use more than 'capacity-buffer' amount of
> > resources for background/bulk tasks (per IO channel)" then we can at
> > least help them address the issue that foreground tasks are being
> > stalled or delayed due to background/bulk work.  This approach means
> > that they won't be utilizing the system to its full capacity, but
> > they'll know that and they'll know that it's because, for them, it's
> > more important that they have that low latency for foreground tasks.
> 
> I think it's mostly orthogonal feature to throttling.

I'm... not sure that what I was getting at above really got across.

What I was saying above, in a nutshell, is that if we're going to
provide throttling then we should give users a way to configure the
throttling on a per-IO-channel basis, which means at the tablespace
level, plus an independent configuration option for WAL since we allow
that to be placed elsewhere too.

Ideally, the configuration parameter would be in the same units as the
actual resource is too- which would probably be IOPS+bandwidth, really.
Just doing it in terms of bandwidth ends up being a bit of a mismatch
as compared to reality, and would mean that users would have to tune it
down farther than they might otherwise and therefore give up that much
more in terms of system capability.

Thanks!

Stephen




Re: WAL insert delay settings

2019-02-20 Thread Tomas Vondra
On 2/20/19 10:43 PM, Stephen Frost wrote:
> Greetings,
> ...
>>
>> So maybe doing it for WAL first makes sense. But I still think we need
>> to have at least a rough idea how it interacts with the existing
>> throttling and how it will work in the end.
> 
> Well, it seems like Andres explained how it'll work with the existing
> throttling, no?  As for how all of this will work in the end, that's a
> good question but also a rather difficult one to answer, I suspect.
> 

Well ... he explained how to do WAL throttling, and I agree what he
proposed seems entirely sane to me.

But when it comes to interactions with current vacuum cost-based
throttling, he claims it does not get meaningfully more confusing due to
interactions with WAL throttling. I don't quite agree with that, but I'm
not going to beat this horse any longer ...

> Just to share a few additional thoughts after pondering this for a
> while, but the comment Andres made up-thread really struck a chord- we
> don't necessairly want to throttle anything, what we'd really rather do
> is *prioritize* things, whereby foreground work (regular queries and
> such) have a higher priority than background/bulk work (VACUUM, REINDEX,
> etc) but otherwise we use the system to its full capacity.  We don't
> actually want to throttle a VACUUM run any more than a CREATE INDEX, we
> just don't want those to hurt the performance of regular queries that
> are happening.
> 

I think you're forgetting the motivation of this very patch was to
prevent replication lag caused by a command generating large amounts of
WAL (like CREATE INDEX / ALTER TABLE etc.). That has almost nothing to
do with prioritization or foreground/background split.

I'm not arguing against ability to prioritize stuff, but I disagree it
somehow replaces throttling.

> The other thought I had was that doing things on a per-table basis, in
> particular, isn't really addressing the resource question appropriately.
> WAL is relatively straight-forward and independent of a resource from
> the IO for the heap/indexes, so getting an idea from the admin of how
> much capacity they have for WAL makes sense.  When it comes to the
> capacity for the heap/indexes, in terms of IO, that really goes to the
> underlying storage system/channel, which would actually be a tablespace
> in properly set up environments (imv anyway).
> 
> Wrapping this up- it seems unlikely that we're going to get a
> priority-based system in place any time particularly soon but I do think
> it's worthy of some serious consideration and discussion about how we
> might be able to get there.  On the other hand, if we can provide a way
> for the admin to say "these are my IO channels (per-tablespace values,
> plus a value for WAL), here's what their capacity is, and here's how
> much buffer for foreground work I want to have (again, per IO channel),
> so, PG, please arrange to not use more than 'capacity-buffer' amount of
> resources for background/bulk tasks (per IO channel)" then we can at
> least help them address the issue that foreground tasks are being
> stalled or delayed due to background/bulk work.  This approach means
> that they won't be utilizing the system to its full capacity, but
> they'll know that and they'll know that it's because, for them, it's
> more important that they have that low latency for foreground tasks.
> 

I think it's mostly orthogonal feature to throttling.


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Pluggable Storage - Andres's take

2019-02-20 Thread Robert Haas
On Fri, Feb 8, 2019 at 5:18 AM Amit Khandekar  wrote:
> In the attached v1 patch, the prefetch_distance is calculated as
> effective_io_concurrency + 10. Also it has some cosmetic changes.

I did a little brief review of this patch and noticed the following things.

+} PrefetchState;

That name seems too generic.

+/*
+ * An arbitrary way to come up with a pre-fetch distance that grows with io
+ * concurrency, but is at least 10 and not more than the max effective io
+ * concurrency.
+ */

This comment is kinda useless, because it only tells us what the code
does (which is obvious anyway) and not why it does that.  Saying that
your formula is arbitrary may not be the best way to attract support
for it.

+ for (i = prefetch_state->next_item; i < nitems && count < prefetch_count; i++)

It looks strange to me that next_item is stored in prefetch_state and
nitems is passed around as an argument.  Is there some reason why it's
like that?

+ /* prefetch a fixed number of pages beforehand. */

Not accurate -- the number of pages we prefetch isn't fixed.  It
depends on effective_io_concurrency.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



[patch] Add schema total size to psql \dn+

2019-02-20 Thread Gilles Darold
Hi all,


When we want to get the total size of all relations in a schema, we have to
execute one of our favorite DBA queries. It is quite simple, but what
about displaying the schema size when using \dn+ in psql?


gilles=# \dn+
                                 List of schemas
  Name  |  Owner   |  Access privileges   |  Size   |      Description        
--------+----------+----------------------+---------+-------------------------
 public | postgres | postgres=UC/postgres+| 608 kB  | standard public schema
        |          | =UC/postgres         |         | 
 test   | gilles   |                      | 57 MB   | 
 empty  | gilles   |                      | 0 bytes | 
(3 rows)

The attached simple patch adds this feature. Are there any cons to adding
this information? The patch tries to be compatible with all PostgreSQL
versions. Let me know if I have missed something.


Best regards,

-- 
Gilles Darold
Consultant PostgreSQL
http://dalibo.com - http://dalibo.org

diff --git a/src/bin/psql/describe.c b/src/bin/psql/describe.c
index 4da6719ce7..8702e52e4d 100644
--- a/src/bin/psql/describe.c
+++ b/src/bin/psql/describe.c
@@ -4188,6 +4188,10 @@ listSchemas(const char *pattern, bool verbose, bool showSystem)
 	{
 		appendPQExpBufferStr(&buf, ",\n  ");
 		printACLColumn(&buf, "n.nspacl");
+		appendPQExpBuffer(&buf,
+		  ",\n  ((SELECT pg_catalog.pg_size_pretty((sum(pg_catalog.pg_relation_size(c.oid)))::bigint) FROM pg_catalog.pg_class c\nLEFT JOIN pg_catalog.pg_namespace s ON s.oid = c.relnamespace WHERE s.nspname = n.nspname)) as \"%s\"",
+		  gettext_noop("Size"));
+
 		appendPQExpBuffer(&buf,
 		  ",\n  pg_catalog.obj_description(n.oid, 'pg_namespace') AS \"%s\"",
 		  gettext_noop("Description"));


Re: Delay locking partitions during INSERT and UPDATE

2019-02-20 Thread David Rowley
On Thu, 21 Feb 2019 at 10:32, Robert Haas  wrote:
> I spent some time thinking about that exact issue this morning and
> studying the code to try to figure that out.  I wasn't able to find
> any other places that seemed to need updating, but it could be that I
> missed something that David also missed.

It looks like the comment that claimed the table was already locked
crept back in during a (seemingly) sloppy rebase after the
relation_open() -> table_open() change.

I've made a pass over this again and updated the header comments in
functions that now obtain a lock to mention that fact.

I've also slightly updated the commit message in the patch.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


v3-0001-On-demand-locking-of-partitions-during-INSERT-and.patch
Description: Binary data


Re: WAL insert delay settings

2019-02-20 Thread Stephen Frost
Greetings,

* Tomas Vondra (tomas.von...@2ndquadrant.com) wrote:
> On 2/19/19 8:40 PM, Andres Freund wrote:
> > On 2019-02-19 20:34:25 +0100, Tomas Vondra wrote:
> >> On 2/19/19 8:22 PM, Andres Freund wrote:
> >>> And my main point is that even if you implement a proper bytes/sec limit
> >>> ONLY for WAL, the behaviour of VACUUM rate limiting doesn't get
> >>> meaningfully more confusing than right now.
> >>
> >> So, why not to modify autovacuum to also use this approach? I wonder if
> >> the situation there is more complicated because of multiple workers
> >> sharing the same budget ...
> > 
> > I think the main reason is that implementing a scheme like this for WAL
> > rate limiting isn't a small task, but it'd be aided by the fact that
> > it'd probably not on by default, and that there'd not be any regressions
> > because the behaviour didn't exist before. I contrast, people are
> > extremely sensitive to autovacuum behaviour changes, even if it's to
> > improve autovacuum. I think it makes more sense to build the logic in a
> > lower profile case first, and then migrate autovacuum over it. Even
> > leaving the maturity issue aside, reducing the scope of the project into
> > more bite sized chunks seems to increase the likelihood of getting
> > anything substantially.
> 
> Maybe.

I concur with that 'maybe'. :)

> I guess the main thing I'm advocating for here is to aim for a unified
> throttling approach, not multiple disparate approaches interacting in
> ways that are hard to understand/predict.

Yes, agreed.

> The time-based approach you described looks fine, an it's kinda what I
> was imagining (and not unlike the checkpoint throttling). I don't think
> it'd be that hard to tweak autovacuum to use it too, but I admit I have
> not thought about it particularly hard and there's stuff like per-table
> settings which might make it more complex.

When reading Andres' proposal, I was heavily reminded of how checkpoint
throttling is handled and wondered if there might be some way to reuse
or generalize that existing code/technique/etc and make it available to
be used for WAL, and more-or-less any/every other bulk operation (CREATE
INDEX, REINDEX, CLUSTER, VACUUM...).
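
(For reference, the cost-based throttling we keep coming back to is
essentially just a budget-and-sleep loop; a toy restatement with
hypothetical WAL-side GUC names, not from any posted patch, would be
roughly:

    /* Hypothetical GUCs, named here only for illustration */
    static int  wal_cost_limit = 8 * 1024 * 1024;   /* bytes per sleep cycle */
    static int  wal_cost_delay = 10;                /* sleep time in ms */

    static int  wal_cost_balance = 0;

    static void
    wal_throttle(int bytes_written)
    {
        wal_cost_balance += bytes_written;
        if (wal_cost_limit > 0 && wal_cost_balance >= wal_cost_limit)
        {
            pg_usleep((long) wal_cost_delay * 1000L);   /* ms -> usec */
            wal_cost_balance = 0;
        }
    }

which is more or less what vacuum_delay_point() does for page costs
today.)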

> So maybe doing it for WAL first makes sense. But I still think we need
> to have at least a rough idea how it interacts with the existing
> throttling and how it will work in the end.

Well, it seems like Andres explained how it'll work with the existing
throttling, no?  As for how all of this will work in the end, that's a
good question but also a rather difficult one to answer, I suspect.

Just to share a few additional thoughts after pondering this for a
while, but the comment Andres made up-thread really struck a chord- we
don't necessairly want to throttle anything, what we'd really rather do
is *prioritize* things, whereby foreground work (regular queries and
such) have a higher priority than background/bulk work (VACUUM, REINDEX,
etc) but otherwise we use the system to its full capacity.  We don't
actually want to throttle a VACUUM run any more than a CREATE INDEX, we
just don't want those to hurt the performance of regular queries that
are happening.

The other thought I had was that doing things on a per-table basis, in
particular, isn't really addressing the resource question appropriately.
WAL is relatively straight-forward and independent of a resource from
the IO for the heap/indexes, so getting an idea from the admin of how
much capacity they have for WAL makes sense.  When it comes to the
capacity for the heap/indexes, in terms of IO, that really goes to the
underlying storage system/channel, which would actually be a tablespace
in properly set up environments (imv anyway).

Wrapping this up- it seems unlikely that we're going to get a
priority-based system in place any time particularly soon but I do think
it's worthy of some serious consideration and discussion about how we
might be able to get there.  On the other hand, if we can provide a way
for the admin to say "these are my IO channels (per-tablespace values,
plus a value for WAL), here's what their capacity is, and here's how
much buffer for foreground work I want to have (again, per IO channel),
so, PG, please arrange to not use more than 'capacity-buffer' amount of
resources for background/bulk tasks (per IO channel)" then we can at
least help them address the issue that foreground tasks are being
stalled or delayed due to background/bulk work.  This approach means
that they won't be utilizing the system to its full capacity, but
they'll know that and they'll know that it's because, for them, it's
more important that they have that low latency for foreground tasks.

Thanks!

Stephen




Re: Delay locking partitions during INSERT and UPDATE

2019-02-20 Thread Robert Haas
On Wed, Feb 20, 2019 at 3:57 PM Tom Lane  wrote:
> I agree that any deadlock would have to involve somebody doing something
> quite odd --- not just one partition-oriented operation, but something
> taking multiple strong locks without regard to the partition structure.
> So I don't see a problem with taking that risk; people doing that sort
> of thing are probably at risk of deadlocks no matter what we do here.

OK.

> Looking at the patch itself, I agree that a bit more attention to comments
> is needed, and I wonder whether David has found all the places where
> it's now necessary to s/NoLock/RowExclusiveLock/.  I don't have any
> other objections.

I spent some time thinking about that exact issue this morning and
studying the code to try to figure that out.  I wasn't able to find
any other places that seemed to need updating, but it could be that I
missed something that David also missed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: libpq host/hostaddr/conninfo inconsistencies

2019-02-20 Thread Robert Haas
On Thu, Feb 14, 2019 at 1:10 PM Andres Freund  wrote:
> > (2) you are not against improving the documentation, although you find it
> > clear enough already, but you agree that some people could get confused.
> >
> > The attached patch v4 only improves the documentation so that it reflects
> > what the implementation really does. I think it too bad to leave out the
> > user-friendly aspects of the patch, though.
>
> Robert, any chance you could opine on the doc patch, given that's your
> suggested direction?

I find it to be more of a change than we really need, and I'm not sure
how much it helps to clarify the issue at hand. Here is a simpler
change which seems like it might do the trick (or maybe not, let's see
what others think).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


clarify-hostaddr-rmh.patch
Description: Binary data


Re: Problems with plan estimates in postgres_fdw

2019-02-20 Thread Jeff Janes
On Wed, Jan 30, 2019 at 6:12 AM Etsuro Fujita 
wrote:

> (2018/12/28 15:50), Etsuro Fujita wrote:
> > Attached is a new version of the
> > patch.
>
> Here is an updated version of the patch set.  Changes are:
>
> * In the previous version, LIMIT without OFFSET was not performed
> remotely as the costs was calculated the same as the costs of performing
> it locally.  (Actually, such LIMIT was performed remotely in a case in
> the postgres_fdw regression test, but that was due to a bug :(.)  I
> think we should prefer performing such LIMIT remotely as that avoids
> extra row fetches from the remote that performing it locally might
> cause, so I tweaked the costs estimated in estimate_path_cost_size(), to
> ensure that we'll prefer performing such LIMIT remotely.
>

With your tweaks, I'm still not seeing the ORDER-less LIMIT get pushed down
when using use_remote_estimate in a simple test case, either with this set
of patches or with the V4 set.  However, without use_remote_estimate, the
LIMIT is now getting pushed down with these patches when it is not in HEAD.

See attached test case, to be run in new database named 'foo'.

Cheers,

Jeff
create extension postgres_fdw ;
create server foo foreign data wrapper postgres_fdw options 
(use_remote_estimate  'true', fetch_size '10');
\! pgbench -i -s20 foo
create user MAPPING FOR public SERVER foo ;
create schema schema2;
import FOREIGN SCHEMA public from server foo into schema2;
explain (verbose,analyze) select * from schema2.pgbench_accounts limit 1;


Re: BRIN summarize autovac_report_workitem passes datname as relname

2019-02-20 Thread Euler Taveira
Em qua, 20 de fev de 2019 às 15:56, Justin Pryzby
 escreveu:
>
> I guess it should be database.namespace.relname ?
>
Yup. It is an oversight in 7526e10224f0792201e99631567bbe44492bbde4.


-- 
   Euler Taveira   Timbira -
http://www.timbira.com.br/
   PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento



Re: Compressed TOAST Slicing

2019-02-20 Thread Stephen Frost
Greetings,

* Paul Ramsey (pram...@cleverelephant.ca) wrote:
> On Wed, Feb 20, 2019 at 10:50 AM Daniel Verite  
> wrote:
> >
> > Paul Ramsey wrote:
> >
> > > Oddly enough, I couldn't find many/any things that were sensitive to
> > > left-end decompression. The only exception is "LIKE this%" which
> > > clearly would be helped, but unfortunately wouldn't be a quick
> > > drop-in, but a rather major reorganization of the regex handling.
> >
> > What about starts_with(string, prefix)?
> >
> > text_starts_with(arg1,arg2) in varlena.c does a full decompression
> > of  arg1 when it could limit itself to the length of the smaller arg2:
> 
> Nice catch, I didn't find that one as it's not user visible, seems to
> be only called in spgist (!!)
> ./backend/access/spgist/spgtextproc.c:
> DatumGetBool(DirectFunctionCall2(text_starts_with
> 
> Thanks, I'll add that.

That sounds good to me, I look forward to an updated patch.  As Andres
mentioned, he and I chatted a bit about this approach vs the iterator
approach at FOSDEM and convinced me that this is worthwhile to do even
if we might add an iterator approach later- which also seems to be the
consensus of this thread, so once the patch has been updated to catch
this case, I'll review it (again) with an eye on committing it soon.

Thanks!

Stephen




Re: Delay locking partitions during INSERT and UPDATE

2019-02-20 Thread Tom Lane
Robert Haas  writes:
> On Wed, Feb 20, 2019 at 11:03 AM Tom Lane  wrote:
>> What I was wondering about was the possibility of the set of
>> tables-that-need-to-be-locked-at-all changing.  Maybe that won't
>> create an issue either, but I'm not quite sure.

> That's pretty much what I was thinking, too.  I think it might be fair
> to say, however, that if it does give rise to deadlock situations,
> they will be corner cases.  For instance, suppose you lock are busy
> locking top-down and, meanwhile, somebody detaches a partition you
> haven't gotten around to locking yet and tries to attach it someplace
> higher up in the partition hierarchy.  I think that there's a
> more-or-less unavoidable deadlock there.  And there may be other cases
> where it is practically avoidable but we will fail to avoid it.  But I
> don't think it's such a common scenario that we have two concurrent
> DDL commands on the same partitioning hierarchy that we should stress
> about it too much.  If the common cases work OK, a theoretical
> deadlock risk in some more obscure case seems acceptable to me, if it
> means we get a significant performance boost.

I agree that any deadlock would have to involve somebody doing something
quite odd --- not just one partition-oriented operation, but something
taking multiple strong locks without regard to the partition structure.
So I don't see a problem with taking that risk; people doing that sort
of thing are probably at risk of deadlocks no matter what we do here.

Looking at the patch itself, I agree that a bit more attention to comments
is needed, and I wonder whether David has found all the places where
it's now necessary to s/NoLock/RowExclusiveLock/.  I don't have any
other objections.
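
(To be concrete, the substitutions in question are of this shape, shown
here only for illustration with made-up variable names:

    -	partrel = table_open(partrelid, NoLock);
    +	partrel = table_open(partrelid, RowExclusiveLock);

i.e. call sites that used to assume the partition was already locked now
have to take the lock themselves.)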

regards, tom lane



Re: bgwriter_lru_maxpages limits in PG 10 sample conf

2019-02-20 Thread Sergei Kornilov
Hello

> I'm a bit reluctant to whack postgresql.conf around in back-branches
> because sometimes that makes funny things happen when somebody
> upgrades, e.g. via RPM.

If I remember correctly, both the deb and rpm packages will ask the user
about config differences.
But good point, a comment change is too small a difference. I am also a bit
late; a good time for such a change would have been before the last minor
release (we added data_sync_retry and the config was changed anyway).

regards, Sergei



Re: Compressed TOAST Slicing

2019-02-20 Thread Tomas Vondra


On 2/20/19 7:50 PM, Robert Haas wrote:
> On Wed, Feb 20, 2019 at 1:45 PM Paul Ramsey  wrote:
>> What this does not support: any function that probably wants 
>> less-than-everything, but doesn’t know how big a slice to look
>> for. Stephen thinks I should put an iterator on decompression,
>> which would be an interesting piece of work. Having looked at
>> the json code a little, doing partial searches would require a lot
>> of re-work that is above my paygrade, but if there was an iterator
>> in place, at least that next stop would then be open.
>>
>> Note that adding an iterator isn’t adding two ways to do the same 
>> thing, since the iterator would slot nicely underneath the existing
>> slicing API, and just iterate to the requested slice size. So this
>> is easily just “another step” along the train line to providing
>> streaming access to compressed and TOASTed data.
> 
> Yeah.  Plus, I'm not sure the iterator thing is even the right design
> for the JSONB case.  It might be better to think, for that case, about
> whether there's someway to operate directly on the compressed data.
> If you could somehow jigger the format and the chunking so that you
> could jump directly to the right chunk and decompress from there,
> rather than having to walk over all of the earlier chunks to figure
> out where the data you want is, you could probably obtain a large
> performance benefit.  But figuring out how to design such a scheme
> seems pretty far afield from the topic at hand.
> 

I doubt working directly on compressed data is doable / efficient unless
the compression was designed with that use case in mind. Which pglz
almost certainly was not. Furthermore, I think there's an issue with
layering - the compression currently happens in the TOAST infrastructure
(and Paul's patch does not change this), but operating on compressed
data is inherently specific to a given data type.

> I'd actually be inclined not to add an iterator until we have a real 
> user for it, for exactly the reason that we don't know that it is
> the right thing.  But there is certain value in decompressing
> partially, to a known byte position, as your patch does, no matter
> what we decide to do about that stuff.
> 

Well, I think Simon's suggestion was that we should also use the
iterator from JSONB code, so that would be the use of it. And if Paul
implemented the current slicing on top of the iterator, that would also
be an user (even without the JSONB stuff).

But I think Andres is right this might increase the complexity of the
patch too much, possibly pushing it from PG12. I don't see anything
wrong with doing the simple approach now and then extending it to handle
JSONB later, if someone wants to invest their time into it.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Compressed TOAST Slicing

2019-02-20 Thread Tom Lane
Paul Ramsey  writes:
>> On Feb 20, 2019, at 10:37 AM, Simon Riggs  wrote:
>> If we add one set of code now and need to add another different one later, 
>> we will have 2 sets of code that do similar things.

> Note that adding an iterator isn’t adding two ways to do the same thing,
> since the iterator would slot nicely underneath the existing slicing
> API, and just iterate to the requested slice size. So this is easily
> just “another step” along the train line to providing streaming access
> to compressed and TOASTed data.

Yeah, I find Paul's argument fairly convincing there.  There wouldn't be
much code duplication, and for the places that can use it, a "fetch the
first N bytes" API is probably going to be more natural and easier to
use than an iteration-based API.  So we'd likely want to keep it, even
if it ultimately becomes just a thin wrapper over the iterator.

I've not reviewed the patch, but as far as the proposed functionality
goes, it seems fine to accept this rather than waiting for something
much more difficult to happen.

regards, tom lane



Re: Proving IS NOT NULL inference for ScalarArrayOpExpr's

2019-02-20 Thread Tom Lane
BTW, I just pushed a fix (e04a3905e) that adds a little more code
to clause_is_strict_for().  This shouldn't cause more than minor
rebasing pain for you, I hope.

regards, tom lane



Re: bgwriter_lru_maxpages limits in PG 10 sample conf

2019-02-20 Thread Robert Haas
On Wed, Feb 20, 2019 at 5:53 AM Sergei Kornilov  wrote:
> We increased bgwriter_lru_maxpages limit in 10 release [1]. Docs now are 
> changed correctly but in REL_10_STABLE postgresql.conf.sample we still have 
> comment "0-1000 max buffers written/round".
> Master (and REL_11_STABLE) was updated later in 
> 611fe7d4793ba6516e839dc50b5319b990283f4f, but not REL_10. I think we need 
> backpatch this line.

I'm a bit reluctant to whack postgresql.conf around in back-branches
because sometimes that makes funny things happen when somebody
upgrades, e.g. via RPM.  I don't remember exactly what happens but I
think typically either the new file overwrites the existing file which
gets moved to something like postgresql.conf.rpmsave or the new file
is written into postgresql.conf.rpmnew instead of the original
location.  I don't think it's worth making stuff like that happen for
the sake of a comment.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Compressed TOAST Slicing

2019-02-20 Thread Daniel Verite
Paul Ramsey wrote:

> > text_starts_with(arg1,arg2) in varlena.c does a full decompression
> > of  arg1 when it could limit itself to the length of the smaller arg2:
> 
> Nice catch, I didn't find that one as it's not user visible, seems to
> be only called in spgist (!!)

It's also exposed in SQL since v11, as 
  starts_with(string,prefix) returns bool
and as an operator:
  text ^@ text
I guess it's meant to be more efficient than (string LIKE prefix||'%')
or strpos(string,prefix)=1, and it will be even more so if we can
avoid some amount of decompression :)
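
For illustration, a sliced version could look something like this (just
a sketch of the idea, not the actual patch):

    Datum
    text_starts_with(PG_FUNCTION_ARGS)
    {
        Datum       arg1 = PG_GETARG_DATUM(0);
        Datum       arg2 = PG_GETARG_DATUM(1);
        Size        len1 = toast_raw_datum_size(arg1) - VARHDRSZ;
        Size        len2 = toast_raw_datum_size(arg2) - VARHDRSZ;
        bool        result;

        if (len2 > len1)
            result = false;
        else
        {
            /* detoast/decompress only the first len2 bytes of arg1 */
            text       *targ1 = (text *) DatumGetPointer(
                                    PG_DETOAST_DATUM_SLICE(arg1, 0, len2));
            text       *targ2 = PG_GETARG_TEXT_PP(arg2);

            result = (memcmp(VARDATA_ANY(targ1), VARDATA_ANY(targ2),
                             len2) == 0);
        }

        PG_RETURN_BOOL(result);
    }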


Best regards,
-- 
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite



BRIN summarize autovac_report_workitem passes datname as relname

2019-02-20 Thread Justin Pryzby
src/backend/postmaster/autovacuum.c declares:

|static void
|autovac_report_workitem(AutoVacuumWorkItem *workitem,
|const char *nspname, const char *relname)

But calls it like:

|cur_relname = get_rel_name(workitem->avw_relation);
|cur_nspname = 
get_namespace_name(get_rel_namespace(workitem->avw_relation));
|cur_datname = get_database_name(MyDatabaseId);
|if (!cur_relname || !cur_nspname || !cur_datname)
|goto deleted2;
|
|autovac_report_workitem(workitem, cur_nspname, cur_datname);

So I see stuff like:

|check_pg - txn_time POSTGRES_TXN_TIME OK: DB main longest txn: 164s PID:10697 
database:main username: query:autovacuum: BRIN summarize public.main 1028223

I guess it should be database.namespace.relname ?
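
Presumably the minimal fix is just to pass the relation name there
(sketch only; the real fix may want all three parts in the reported
string):

|autovac_report_workitem(workitem, cur_nspname, cur_relname);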



Re: Compressed TOAST Slicing

2019-02-20 Thread Paul Ramsey
On Wed, Feb 20, 2019 at 10:50 AM Daniel Verite  wrote:
>
> Paul Ramsey wrote:
>
> > Oddly enough, I couldn't find many/any things that were sensitive to
> > left-end decompression. The only exception is "LIKE this%" which
> > clearly would be helped, but unfortunately wouldn't be a quick
> > drop-in, but a rather major reorganization of the regex handling.
>
> What about starts_with(string, prefix)?
>
> text_starts_with(arg1,arg2) in varlena.c does a full decompression
> of  arg1 when it could limit itself to the length of the smaller arg2:

Nice catch, I didn't find that one as it's not user visible, seems to
be only called in spgist (!!)
./backend/access/spgist/spgtextproc.c:
DatumGetBool(DirectFunctionCall2(text_starts_with

Thanks, I'll add that.

P



Re: propagating replica identity to partitions

2019-02-20 Thread Robert Haas
On Wed, Feb 20, 2019 at 12:38 PM Alvaro Herrera
 wrote:
> > I don't see that in the NOTES section for ALTER TABLE.  Are you
> > looking at some patch, or at master?
>
> I was looking here:
> https://www.postgresql.org/docs/11/sql-altertable.html

OK, I was looking at the wrong thing.  Not enough caffeine, apparently.

> Maybe the ALL IN TABLESPACE and OWNED BY sub-forms should be split to a
> separate para.  I suggest:
>
> :   This form changes the table's tablespace to the specified tablespace
> :   and moves the data file(s) associated with the table to the new
> :   tablespace. Indexes on the table, if any, are not moved; but they
> :   can be moved separately with additional SET TABLESPACE commands.
> :   When applied to a partitioned table, nothing is moved, but any
> :   partitions created afterwards with CREATE TABLE PARTITION OF
> :   will use that tablespace.
> :
> :   All
> :   tables in the current database in a tablespace can be moved by using
> :   the ALL IN TABLESPACE form, which will lock all tables to be moved
> :   first and then move each one. This form also supports OWNED BY,
> :   which will only move tables owned by the roles specified. If the
> :   NOWAIT option is specified then the command will fail if it is
> :   unable to acquire all of the locks required immediately. Note that
> :   system catalogs are not moved by this command, use ALTER DATABASE or
> :   explicit ALTER TABLE invocations instead if desired. The
> :   information_schema relations are not considered part of the system
> :   catalogs and will be moved. See also CREATE TABLESPACE.

Seems reasonable.

> I think the reason tablespace was made not to recurse was to avoid
> acquiring access exclusive lock on too many tables at once, but I'm not
> sure.  This is very old behavior.
>
> > Obviously any property where the
> > parents and children have to match, like adding a column or changing a
> > data type, has to always recurse; nothing else is sensible.  For
> > others, it seems we have a choice.  It's unclear to me why we'd choose
> > to make REPLICA IDENTITY recurse by default and, say, OWNER not
> > recurse.
>
> Ah.  I did argue that OWNER should recurse:
> https://postgr.es/m/20171017163203.uw7hmlqonidlfeqj@alvherre.pgsql
> but it was too late then to change it for pg10.  We can change it now,
> surely.

Yeah, we could.  I wonder, though, whether we should just make
everything recurse.  I think that's what people are commonly going to
want, at least for partitioned tables, and it doesn't seem to me that
it would hurt anything to make the inheritance case work that way,
too.  Right now it looks like we have this list of exceptions:

- actions for identity columns (ADD GENERATED, SET etc., DROP IDENTITY)
- TRIGGER
- CLUSTER
- OWNER
- TABLESPACE never recurse to descendant tables
- Adding a constraint recurses only for CHECK constraints that are not
marked NO INHERIT.

I have a feeling we eventually want this list to be empty, right?  We
want a partitioned table to work as much like a non-partitioned table
as possible, unless the user asks for some other behavior.  Going from
six exceptions to four and maybe having some of them depend on whether
it's partitioning or inheritance doesn't seem like it's as good and
clear as trying to adopt a really uniform policy.

I don't buy Simon's argument that we should treat TABLESPACE
differently because the tables might be really big and take a long
time to move.  I agree that this could well be true, but nobody is
proposing to remove the ability to move tables individually or to use
ONLY here.  Allowing TABLESPACE to recurse just gives people one
additional choice that they don't have today: to move everything at
once. We don't lose any functionality by enabling that.
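
To make the choice concrete, here is a sketch with made-up names (the
recursive last form is the behavior being proposed, not what every
released version does today):

  -- move one big partition on its own schedule, as today:
  ALTER TABLE measurement_y2017 SET TABLESPACE archive_ts;
  -- touch only the parent, so future partitions default to the new tablespace:
  ALTER TABLE ONLY measurement SET TABLESPACE archive_ts;
  -- with recursion, move the whole hierarchy in one command:
  ALTER TABLE measurement SET TABLESPACE archive_ts;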

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: BUG #15646: Inconsistent behavior for current_setting/set_config

2019-02-20 Thread Joe Conway
On 2/20/19 12:11 PM, Tom Lane wrote:
> Joe Conway  writes:
>> On 2/20/19 11:10 AM, PG Bug reporting form wrote:
>>> But current behavior returns empty string instead of NULL (the initial
>>> value) after transaction is rolled back. When I restart session, NULL is
>>> returned again as it is expected.
> 
>> This has been discussed before and dismissed:
>> https://www.postgresql.org/message-id/flat/56842412.505%40joeconway.com
>> Personally I agree it is a bug, but I am not sure you will get much
>> support for that position.
> 
> The fact that we allow undeclared user-defined GUCs at all is a bug IMO.
> We need to find a way to replace that behavior with something whereby
> the name and type of a parameter are declared up-front before you can
> set it.

(moving to hackers)

Perhaps we could do something like:

1. If the user-defined GUC is defined in postgresql.conf, et al, same
   behavior as now
2. Backward compatibility concerns would be an issue, so create another
   new GUC declare_custom_settings which initially defaults to false.
3. If declare_custom_settings is true, and the user-defined GUC is not
   defined in postgresql.conf, then in order to create it dynamically
   via SET or similar methods, you must do something like:

  CREATE SETTING name TYPE guctype [LIST];
  SET name TO value;
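
(For reference, the inconsistency that started this thread reproduces
today with any undeclared custom GUC; the name below is arbitrary:)

  SELECT current_setting('myapp.setting', true);   -- NULL in a fresh session
  BEGIN;
  SELECT set_config('myapp.setting', 'on', false);
  ROLLBACK;
  SELECT current_setting('myapp.setting', true);   -- now '' instead of NULL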


Joe

-- 
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development





Re: Compressed TOAST Slicing

2019-02-20 Thread Robert Haas
On Wed, Feb 20, 2019 at 1:45 PM Paul Ramsey  wrote:
> What this does not support: any function that probably wants 
> less-than-everything, but doesn’t know how big a slice to look for. Stephen 
> thinks I should put an iterator on decompression, which would be an 
> interesting piece of work. Having looked at the json code a little doing 
> partial searches would require a lot of re-work that is above my paygrade, 
> but if there was an iterator in place, at least that next stop would then be 
> open.
>
> Note that adding an iterator isn’t adding two ways to do the same thing, 
> since the iterator would slot nicely underneath the existing slicing API, and 
> just iterate to the requested slice size. So this is easily just “another 
> step” along the train line to providing streaming access to compressed and 
> TOASTed data.

Yeah.  Plus, I'm not sure the iterator thing is even the right design
for the JSONB case.  It might be better to think, for that case, about
whether there's some way to operate directly on the compressed data.
If you could somehow jigger the format and the chunking so that you
could jump directly to the right chunk and decompress from there,
rather than having to walk over all of the earlier chunks to figure
out where the data you want is, you could probably obtain a large
performance benefit.  But figuring out how to design such a scheme
seems pretty far afield from the topic at hand.

I'd actually be inclined not to add an iterator until we have a real
user for it, for exactly the reason that we don't know that it is the
right thing.  But there is certain value in decompressing partially,
to a known byte position, as your patch does, no matter what we decide
to do about that stuff.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Compressed TOAST Slicing

2019-02-20 Thread Daniel Verite
Paul Ramsey wrote:

> Oddly enough, I couldn't find many/any things that were sensitive to
> left-end decompression. The only exception is "LIKE this%" which
> clearly would be helped, but unfortunately wouldn't be a quick
> drop-in, but a rather major reorganization of the regex handling.

What about starts_with(string, prefix)?

text_starts_with(arg1,arg2) in varlena.c does a full decompression
of  arg1 when it could limit itself to the length of the smaller arg2:

Datum
text_starts_with(PG_FUNCTION_ARGS)

  len1 = toast_raw_datum_size(arg1);
  len2 = toast_raw_datum_size(arg2);
  if (len2 > len1)
  result = false;
  else
  {
  text   *targ1 = DatumGetTextPP(arg1);
  text   *targ2 = DatumGetTextPP(arg2);

  result = (memcmp(VARDATA_ANY(targ1), VARDATA_ANY(targ2),
   VARSIZE_ANY_EXHDR(targ2)) == 0);
...



Best regards,
-- 
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite



Re: Compressed TOAST Slicing

2019-02-20 Thread Paul Ramsey

> On Feb 20, 2019, at 10:37 AM, Simon Riggs  wrote:
> 
> -1, I think this is blowing up the complexity of a already useful patch,
> even though there's no increase in complexity due to the patch proposed
> here.  I totally get wanting incremental decompression for jsonb, but I
> don't see why Paul should be held hostage for that.
> 
> Not sure I agree with your emotive language. Review comments != holding 
> hostages.
> 
> If we add one set of code now and need to add another different one later, we 
> will have 2 sets of code that do similar things.

So, the current state is: when asked for a datum slice, we can now decompress
just the parts we need to get that slice. This allows us to speed up anything
that knows in advance how big a slice it is going to want. At this moment all
I've found is left() and substr() for the start-at-front case.
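
(Concretely, the queries that benefit are prefix extractions over
potentially large compressed values, for example the following; the table
and column names are made up:)

  SELECT left(body, 100)  FROM docs;
  SELECT substr(body, 1, 100) FROM docs;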

What this does not support: any function that probably wants 
less-than-everything, but doesn’t know how big a slice to look for. Stephen 
thinks I should put an iterator on decompression, which would be an interesting 
piece of work. Having looked at the json code a little doing partial searches 
would require a lot of re-work that is above my paygrade, but if there was an 
iterator in place, at least that next stop would then be open. 

Note that adding an iterator isn’t adding two ways to do the same thing, since 
the iterator would slot nicely underneath the existing slicing API, and just 
iterate to the requested slice size. So this is easily just “another step” 
along the train line to providing streaming access to compressed and TOASTed 
data.

I’d hate for the simple slice ability to get stuck behind the other work, since 
it’s both (a) useful and (b) exists. If you are concerned the iterator will 
never get done, I can only offer my word that, since it seems important to 
multiple people on this list, I will do it. (Just not, maybe, very well :)

P.

Re: Compressed TOAST Slicing

2019-02-20 Thread Simon Riggs
On Wed, 20 Feb 2019 at 16:27, Andres Freund  wrote:

>
> > Sure, but we have the choice between something that benefits just a few
> > cases or one that benefits more widely.
> >
> > If we all only work on the narrow use cases that are right in front of us
> > at the present moment then we would not have come this far. I'm sure many
> > GIS applications also store JSONB data, so you would be helping the
> > performance of the whole app, even if there isn't much JSON in PostGIS.
>
> -1, I think this is blowing up the complexity of a already useful patch,
> even though there's no increase in complexity due to the patch proposed
> here.  I totally get wanting incremental decompression for jsonb, but I
> don't see why Paul should be held hostage for that.
>

Not sure I agree with your emotive language. Review comments != holding
hostages.

If we add one set of code now and need to add another different one later,
we will have 2 sets of code that do similar things.

I'm surprised to hear you think that is a good thing.

-- 
Simon Riggs    http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: propagating replica identity to partitions

2019-02-20 Thread Simon Riggs
On Wed, 20 Feb 2019 at 17:38, Alvaro Herrera 
wrote:


> > Fair enough.  I don't think it's entirely useful to think about this
> > in terms of which operations do and don't recurse; the question in my
> > mind is WHY some things recurse and other things don't, and what will
> > be easiest for users to understand.
>
> I think the reason tablespace was made not to recurse was to avoid
> acquiring access exclusive lock on too many tables at once, but I'm not
> sure.  This is very old behavior.
>

-1 for changing that; here's why:

Partitioning is designed to support very large tables.

Since the table is very large there is a valid use case for moving older
partitions to other tablespaces, one at a time.

If we moved all partitions at once, while holding AccessExclusiveLock, it
would a) likely fail with out of space errors, b) if you are unlucky enough
for it to succeed it would lock the table for a ridiculously long time. The
main use case for changing the tablespace is to define where new partitions
should be created. So in this specific case only, recursion makes no sense.


> > Obviously any property where the
> > parents and children have to match, like adding a column or changing a
> > data type, has to always recurse; nothing else is sensible.  For
> > others, it seems we have a choice.  It's unclear to me why we'd choose
> > to make REPLICA IDENTITY recurse by default and, say, OWNER not
> > recurse.
>
> Ah.  I did argue that OWNER should recurse:
> https://postgr.es/m/20171017163203.uw7hmlqonidlfeqj@alvherre.pgsql
> but it was too late then to change it for pg10.  We can change it now,
> surely.
>
> > In both cases, the default assumption is presumably that the
> > whole partitioning hierarchy should match, but in a particular case a
> > user could want to do something else.  Consequently, I don't
> > understand why a user would want one of those operations to descend to
> > children by default and the other not.
>
> I agree that OWNER TO should recurse for partitioned tables.


+1

That was clearly just missed. It's a strange thought to have a partitioned
table with different pieces owned by different users. So strange that the
lack of complaints about the recursion is surely because nobody ever tried
it in a real situation. I would prefer to disallow differences in ownership
across partitions, since that almost certainly prevents running sensible
DDL and it's just a huge footgun.

Recursion should be the default for partitioned tables.

> I'm -0 on
> changing it for legacy inheritance, but if the majority is to change it
> for both, let's do that.
>

-1 for changing legacy inheritance. Two separate features. Inheritance has
been around for many years and its feature set is stable. No need to touch
it.

-- 
Simon Riggs    http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Compressed TOAST Slicing

2019-02-20 Thread Robert Haas
On Wed, Feb 20, 2019 at 11:27 AM Andres Freund  wrote:
> -1, I think this is blowing up the complexity of a already useful patch,
> even though there's no increase in complexity due to the patch proposed
> here.  I totally get wanting incremental decompression for jsonb, but I
> don't see why Paul should be held hostage for that.

I concur.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Protect syscache from bloating with negative cache entries

2019-02-20 Thread Robert Haas
On Tue, Feb 19, 2019 at 11:15 PM Kyotaro HORIGUCHI
 wrote:
> Difference from v15:
>
>   Removed AllocSet accounting stuff. We use approximate memory
>   size for catcache.
>
>   Removed prune-by-number(or size) stuff.
>
>   Addressing comments from Tsunakawa-san and Ideriha-san.
>
>   Separated catcache monitoring feature. (Removed from this set)
> (But it is crucial to check this feature...)
>
> Is this small enough ?

The commit message in 0002 says 'This also can put a hard limit on the
number of catcache entries.' but neither of the GUCs that you've
documented has that effect.  Is that a leftover from a previous
version?

I'd like to see some evidence that catalog_cache_memory_target has any
value, vs. just always setting it to zero.  I came up with the
following somewhat artificial example that shows that it might have
value.

rhaas=# create table foo (a int primary key, b text) partition by hash (a);
[rhaas pgsql]$ perl -e 'for (0..) { print "CREATE TABLE foo$_
PARTITION OF foo FOR VALUES WITH (MODULUS 1, REMAINDER $_);\n"; }'
| psql
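
(An equivalent setup in plain SQL; the 1000-partition count here is an
arbitrary stand-in:)

  CREATE TABLE foo (a int PRIMARY KEY, b text) PARTITION BY HASH (a);
  DO $$
  BEGIN
    FOR i IN 0..999 LOOP
      EXECUTE format(
        'CREATE TABLE foo%s PARTITION OF foo FOR VALUES WITH (MODULUS 1000, REMAINDER %s)',
        i, i);
    END LOOP;
  END
  $$;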

First execution of 'select * from foo' in a brand new session takes
about 1.9 seconds; subsequent executions take about 0.7 seconds.  So,
if catalog_cache_memory_target were set to a high enough value to
allow all of that stuff to remain in cache, we could possibly save
about 1.2 seconds coming off the blocks after a long idle period.
That might be enough to justify having the parameter.  But I'm not
quite sure how high the value would need to be set to actually get the
benefit in a case like that, or what happens if you set it to a value
that's not quite high enough.  I think it might be good to play around
some more with cases like this, just to get a feeling for how much
time you can save in exchange for how much memory.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: propagating replica identity to partitions

2019-02-20 Thread Alvaro Herrera
On 2019-Feb-20, Robert Haas wrote:

> On Tue, Feb 19, 2019 at 5:41 PM Alvaro Herrera  
> wrote:
> > OK, let me concede that point -- it's not rewriting that makes things
> > act differently, but rather TABLESPACE (as well as some other things)
> > behave that way.  ALTER TABLE ... SET DATA TYPE is the obvious
> > counterexample.
> >
> > The Notes section of ALTER TABLE says:
> >
> > : The actions for identity columns (ADD GENERATED, SET etc., DROP 
> > IDENTITY), as
> > : well as the actions TRIGGER, CLUSTER, OWNER, and TABLESPACE never recurse 
> > to
> > : descendant tables; that is, they always act as though ONLY were specified.
> > : Adding a constraint recurses only for CHECK constraints that are not 
> > marked NO
> > : INHERIT.
> 
> I don't see that in the NOTES section for ALTER TABLE.  Are you
> looking at some patch, or at master?

I was looking here:
https://www.postgresql.org/docs/11/sql-altertable.html

> More generally, our documentation in this area seems to be somewhat
> lacking.  For instance, the TABLESPACE section of the ALTER TABLE
> documentation appears to make no mention of what exactly the behavior
> is when applied to a partition table.  It doesn't seem sufficient to
> simply say "well, it doesn't recurse," because that would imply that
> it doesn't do anything at all, which isn't the case.

That's true.  Maybe we can add a short blurb in the stanza for the SET
TABLESPACE form in the Description section.  It currently says:

:   This form changes the table's tablespace to the specified tablespace
:   and moves the data file(s) associated with the table to the new
:   tablespace. Indexes on the table, if any, are not moved; but they
:   can be moved separately with additional SET TABLESPACE commands. All
:   tables in the current database in a tablespace can be moved by using
:   the ALL IN TABLESPACE form, which will lock all tables to be moved
:   first and then move each one. This form also supports OWNED BY,
:   which will only move tables owned by the roles specified. If the
:   NOWAIT option is specified then the command will fail if it is
:   unable to acquire all of the locks required immediately. Note that
:   system catalogs are not moved by this command, use ALTER DATABASE or
:   explicit ALTER TABLE invocations instead if desired. The
:   information_schema relations are not considered part of the system
:   catalogs and will be moved. See also CREATE TABLESPACE.

Maybe the ALL IN TABLESPACE and OWNED BY sub-forms should be split to a
separate para.  I suggest:

:   This form changes the table's tablespace to the specified tablespace
:   and moves the data file(s) associated with the table to the new
:   tablespace. Indexes on the table, if any, are not moved; but they
:   can be moved separately with additional SET TABLESPACE commands.
:   When applied to a partitioned table, nothing is moved, but any
:   partitions created afterwards with CREATE TABLE PARTITION OF
:   will use that tablespace.
:
:   All
:   tables in the current database in a tablespace can be moved by using
:   the ALL IN TABLESPACE form, which will lock all tables to be moved
:   first and then move each one. This form also supports OWNED BY,
:   which will only move tables owned by the roles specified. If the
:   NOWAIT option is specified then the command will fail if it is
:   unable to acquire all of the locks required immediately. Note that
:   system catalogs are not moved by this command, use ALTER DATABASE or
:   explicit ALTER TABLE invocations instead if desired. The
:   information_schema relations are not considered part of the system
:   catalogs and will be moved. See also CREATE TABLESPACE.


> > Since REPLICA IDENTITY does not appear in this list, the documented
> > behavior is to recurse, per the description of the "name" parameter:
> >
> > : The name (optionally schema-qualified) of an existing table to
> > : alter. If ONLY is specified before the table name, only that table
> > : is altered. If ONLY is not specified, the table and all its
> > : descendant tables (if any) are altered. Optionally, * can be
> > : specified after the table name to explicitly indicate that
> > : descendant tables are included.
> >
> > I didn't come up with this on my own, as you imply.
> 
> Fair enough.  I don't think it's entirely useful to think about this
> in terms of which operations do and don't recurse; the question in my
> mind is WHY some things recurse and other things don't, and what will
> be easiest for users to understand.

I think the reason tablespace was made not to recurse was to avoid
acquiring access exclusive lock on too many tables at once, but I'm not
sure.  This is very old behavior.

> Obviously any property where the
> parents and children have to match, like adding a column or changing a
> data type, has to always recurse; nothing else is sensible.  For
> others, it seems we have a choice.  It's unclear to me why we'd choose
> to make REPLICA IDENTITY recurse by default and, say, OWNER not recurse.

Re: list append syntax for postgresql.conf

2019-02-20 Thread Tom Lane
Peter Eisentraut  writes:
> Nowadays there are a number of methods for composing multiple
> postgresql.conf files for modularity.  But if you have a bunch of things
> you want to load via shared_preload_libraries, you have to put them all
> in one setting.  How about some kind of syntax for appending something
> to a list, like
> shared_preload_libraries += 'pg_stat_statements'

Seems potentially useful, but you'd need to figure out how to represent
this in the pg_settings and pg_file_settings views, which currently
suppose that any given GUC's value is determined by exactly one place in
the config file(s).

regards, tom lane



Re: Index Skip Scan

2019-02-20 Thread Dmitry Dolgov
> On Fri, Feb 1, 2019 at 8:24 PM Jesper Pedersen  
> wrote:
>
> Dmitry and I will look at this and take it into account for the next
> version.

In the meantime, just so it doesn't get lost, I'm going to post another
version with a fix for cursor fetch backwards, which was crashing before.
While on this topic, I wanted to ask for clarification on a few points,
since it looks like I'm missing something:

One of the not-yet-addressed points in this patch is amcanbackward. From the
historical thread mentioned in the first email:

> On 2016-11-25 at 01:33 Robert Haas  wrote:
> +if (ScanDirectionIsForward(dir))
> +{
> +so->currPos.moreLeft = false;
> +so->currPos.moreRight = true;
> +}
> +else
> +{
> +so->currPos.moreLeft = true;
> +so->currPos.moreRight = false;
> +}
>
>
>
> The lack of comments makes it hard for me to understand what the
> motivation for this is, but I bet it's wrong.  Suppose there's a
> cursor involved here and the user tries to back up. Instead of having
> a separate amskip operation, maybe there should be a flag attached to
> a scan indicating whether it should return only distinct results.
> Otherwise, you're allowing for the possibility that the same scan
> might sometimes skip and other times not skip, but then it seems hard
> for the scan to survive cursor operations.  Anyway, why would that be
> useful?

I assume that "sometimes skip and other times not skip" refers to the
situation where we fetch forward and jump over something, and then right
away fetch backwards, where we don't actually need to skip anything and
can return the result right away. Is that right? If so, I can't find any
explanation of why that should be a problem for cursor operations.


v9-0001-Index-skip-scan.patch
Description: Binary data


Re: Compressed TOAST Slicing

2019-02-20 Thread Andres Freund
On 2019-02-20 08:39:38 +, Simon Riggs wrote:
> On Tue, 19 Feb 2019 at 23:09, Paul Ramsey  wrote:
> 
> > On Sat, Feb 16, 2019 at 7:25 AM Simon Riggs  wrote:
> >
> > > Could we get an similarly optimized implementation of -> operator for
> > JSONB as well?
> > > Are there any other potential uses? Best to fix em all up at once and
> > then move on to other things. Thanks.
> >
> > Oddly enough, I couldn't find many/any things that were sensitive to
> > left-end decompression. The only exception is "LIKE this%" which
> > clearly would be helped, but unfortunately wouldn't be a quick
> > drop-in, but a rather major reorganization of the regex handling.
> >
> > I had a look at "->" and I couldn't see how a slice could be used to
> > make it faster? We don't a priori know how big a slice would give us
> > what we want. This again makes Stephen's case for an iterator, but of
> > course all the iterator benefits only come when the actual function at
> > the top (in this case the json parser) are also updated to be
> > iterative.
> >
> > Committing this little change doesn't preclude an iterator, or even
> > make doing one more complicated... :)
> >
> 
> Sure, but we have the choice between something that benefits just a few
> cases or one that benefits more widely.
> 
> If we all only work on the narrow use cases that are right in front of us
> at the present moment then we would not have come this far. I'm sure many
> GIS applications also store JSONB data, so you would be helping the
> performance of the whole app, even if there isn't much JSON in PostGIS.

-1, I think this is blowing up the complexity of a already useful patch,
even though there's no increase in complexity due to the patch proposed
here.  I totally get wanting incremental decompression for jsonb, but I
don't see why Paul should be held hostage for that.

Greetings,

Andres Freund



Re: Row Level Security − leakproof-ness and performance implications

2019-02-20 Thread Tom Lane
Pierre Ducroquet  writes:
> For simple functions like enum_eq/enum_ne, marking them leakproof is an 
> obvious fix (patch attached to this email, including also textin/textout).

This is not nearly as "obvious" as you think.  See prior discussions,
notably
https://www.postgresql.org/message-id/flat/31042.1546194242%40sss.pgh.pa.us

Up to now we've taken a very strict definition of what leakproofness
means; as Noah stated, if a function can throw errors for some inputs,
it's not considered leakproof, even if those inputs should never be
encountered in practice.  Most of the things we've been willing to
mark leakproof are straight-line code that demonstrably can't throw
any errors at all.

The previous thread seemed to have consensus that the hazards in
text_cmp and friends were narrow enough that nobody had a practical
problem with marking them leakproof --- but we couldn't define an
objective policy statement that would allow making such decisions,
so nothing's been changed as yet.  I think it is important to have
an articulable policy about this, not just a seat-of-the-pants
conclusion about the risk level in a particular function.
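
(For anyone wanting to experiment locally in the meantime: since the
marking is just a pg_proc flag, the effect of such a patch can be
approximated without recompiling, at one's own risk, given that assessing
that risk is exactly what is under discussion.)

  SELECT proname, proleakproof FROM pg_proc
   WHERE proname IN ('enum_eq', 'enum_ne');
  ALTER FUNCTION enum_eq(anyenum, anyenum) LEAKPROOF;   -- superuser only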

regards, tom lane



Re: Delay locking partitions during INSERT and UPDATE

2019-02-20 Thread Robert Haas
On Wed, Feb 20, 2019 at 11:03 AM Tom Lane  wrote:
> Right, that's the same thing I was trying to say.

OK, thanks.

> > ...  So my question is - what do
> > you mean by the parenthetical note about the partitioning info not
> > changing?  Regardless of whether it does or does not, I think the same
> > property holds.
>
> What I was wondering about was the possibility of the set of
> tables-that-need-to-be-locked-at-all changing.  Maybe that won't
> create an issue either, but I'm not quite sure.

That's pretty much what I was thinking, too.  I think it might be fair
to say, however, that if it does give rise to deadlock situations,
they will be corner cases.  For instance, suppose you are busy
locking top-down and, meanwhile, somebody detaches a partition you
haven't gotten around to locking yet and tries to attach it someplace
higher up in the partition hierarchy.  I think that there's a
more-or-less unavoidable deadlock there.  And there may be other cases
where it is practically avoidable but we will fail to avoid it.  But I
don't think it's such a common scenario that we have two concurrent
DDL commands on the same partitioning hierarchy that we should stress
about it too much.  If the common cases work OK, a theoretical
deadlock risk in some more obscure case seems acceptable to me, if it
means we get a significant performance boost.

> I do have to say
> that I find the idea of somebody changing the partitioning data
> concurrently with queries on that partitioned table to be stone
> cold petrifying.  I don't think I believe you can make that work.

I guess you won't be surprised to hear that I believe otherwise, but I
agree it's a pretty thorny issue.  I would welcome your comments on
the patches and issues discussed on the ATTACH/DETACH CONCURRENTLY
thread.  To get things to (apparently) work, I have had to separately
handle: (1) the case where things change as we are building the
PartitionDesc, (2) the case where the same PartitionDesc changes
during a planning cycle, (3) the case where the PartitionDesc changes
between planning and execution, and (4) the case where the
PartitionDesc changes during execution.  And even then I'm only able
so far to handle cases where a partition is added, just by ignoring
it, not cases where a partition is removed.  But it does seem to work.
There isn't that much code yet that depends on the PartitionDesc not
changing, so it seems to be an easier problem than, say, allowing a
column to be added concurrently.  As David says, we probably shouldn't
hijack this thread to talk about that in detail, but I would
definitely like it if you would comment on what I've done over there.
I think it's good work but it is very easy to be wrong about such
things -- and even if the basic approach is sound, there may be more
cases of which I have not thought.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: list append syntax for postgresql.conf

2019-02-20 Thread Chapman Flack
On 2/20/19 10:54 AM, Joe Conway wrote:
>> shared_preload_libraries += 'pg_stat_statements'
> 
> I like the idea, but presume it would apply to any GUC list, not just
> shared_preload_libraries?

It would be nice if there were some way for extensions to declare
GUC lists (the very thing that recently became explicitly unsupported).

The difficulty seems to be that a GUC may be assigned before the
extension has been loaded to determine whether list syntax should apply.

Could a change like this improve that situation too, perhaps by
deciding that += syntax /implies/ that an as-yet-undeclared GUC is
to be of list form (which could then be checked when the extension
declares the GUC)?

-Chap



Re: propagating replica identity to partitions

2019-02-20 Thread Robert Haas
On Tue, Feb 19, 2019 at 5:41 PM Alvaro Herrera  wrote:
> OK, let me concede that point -- it's not rewriting that makes things
> act differently, but rather TABLESPACE (as well as some other things)
> behave that way.  ALTER TABLE ... SET DATA TYPE is the obvious
> counterexample.
>
> The Notes section of ALTER TABLE says:
>
> : The actions for identity columns (ADD GENERATED, SET etc., DROP IDENTITY), 
> as
> : well as the actions TRIGGER, CLUSTER, OWNER, and TABLESPACE never recurse to
> : descendant tables; that is, they always act as though ONLY were specified.
> : Adding a constraint recurses only for CHECK constraints that are not marked 
> NO
> : INHERIT.

I don't see that in the NOTES section for ALTER TABLE.  Are you
looking at some patch, or at master?

More generally, our documentation in this area seems to be somewhat
lacking.  For instance, the TABLESPACE section of the ALTER TABLE
documentation appears to make no mention of what exactly the behavior
is when applied to a partition table.  It doesn't seem sufficient to
simply say "well, it doesn't recurse," because that would imply that
it doesn't do anything at all, which isn't the case.

> Since REPLICA IDENTITY does not appear in this list, the documented
> behavior is to recurse, per the description of the "name" parameter:
>
> : The name (optionally schema-qualified) of an existing table to
> : alter. If ONLY is specified before the table name, only that table
> : is altered. If ONLY is not specified, the table and all its
> : descendant tables (if any) are altered. Optionally, * can be
> : specified after the table name to explicitly indicate that
> : descendant tables are included.
>
> I didn't come up with this on my own, as you imply.

Fair enough.  I don't think it's entirely useful to think about this
in terms of which operations do and don't recurse; the question in my
mind is WHY some things recurse and other things don't, and what will
be easiest for users to understand.  Obviously any property where the
parents and children have to match, like adding a column or changing a
data type, has to always recurse; nothing else is sensible.  For
others, it seems we have a choice.  It's unclear to me why we'd choose
to make REPLICA IDENTITY recurse by default and, say, OWNER not
recurse.  In both cases, the default assumption is presumably that the
whole partitioning hierarchy should match, but in a particular case a
user could want to do something else.  Consequently, I don't
understand why a user would want one of those operations to descend to
children by default and the other not.  It feels like we're choosing
the behavior in individual cases, as Tom would say, with the aid of a
dartboard.  Maybe there's some coherent principle here that I'm just
missing.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Conflict handling for COPY FROM

2019-02-20 Thread Andres Freund



On February 20, 2019 6:05:53 AM PST, Andrew Dunstan 
 wrote:
>
>On 2/20/19 8:01 AM, Surafel Temesgen wrote:
>>
>>
>> On Tue, Feb 19, 2019 at 3:47 PM Andres Freund > > wrote:
>>
>>
>>
>> Err, what? Again, that requires super user permissions (in
>> contrast to copy from/to stdin/out). Backends run as the user
>> postgres runs under
>>
>>
>>  
>> okay i see it now and modified the patch similarly 
>>
>>
>
>
>Why log to a file at all? We do have, you know, a database handy, where
>we might more usefully log errors. You could usefully log the offending
>row as an array of text, possibly.

Or even just return it as a row. CopyBoth is relatively widely supported these 
days.

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: Delay locking partitions during INSERT and UPDATE

2019-02-20 Thread Tom Lane
Robert Haas  writes:
> On Mon, Feb 18, 2019 at 6:15 PM Tom Lane  wrote:
>> I'm inclined to think that if we already have lock on the parent
>> partitioned table (thereby, IIUC, guaranteeing that its partitioning
>> info can't change) that the order in which we acquire the same lock
>> level on its partition(s) isn't very important.

> But that being said, I don't think I quite see how the two things you
> mention here are connected to each other.  If operation A locks
> parents before children, and operation B also locks parents before
> children, they generally won't deadlock against each other, even if
> each locks the children in a different order.

Right, that's the same thing I was trying to say.

> ...  So my question is - what do
> you mean by the parenthetical note about the partitioning info not
> changing?  Regardless of whether it does or does not, I think the same
> property holds.

What I was wondering about was the possibility of the set of
tables-that-need-to-be-locked-at-all changing.  Maybe that won't
create an issue either, but I'm not quite sure.  I do have to say
that I find the idea of somebody changing the partitioning data
concurrently with queries on that partitioned table to be stone
cold petrifying.  I don't think I believe you can make that work.

regards, tom lane



Re: list append syntax for postgresql.conf

2019-02-20 Thread David G. Johnston
On Wednesday, February 20, 2019, Peter Eisentraut <
peter.eisentr...@2ndquadrant.com> wrote:

> Nowadays there are a number of methods for composing multiple
> postgresql.conf files for modularity.  But if you have a bunch of things
> you want to load via shared_preload_libraries, you have to put them all
> in one setting.  How about some kind of syntax for appending something
> to a list, like
>
> shared_preload_libraries += 'pg_stat_statements'
>

I would rather just have the behavior for that variable “append mode”,
period.  Maybe do it generally for all multi-value variables.  It would be
like “add only” permissions - if you don’t want something loaded it cannot
be specified ever, overwrite is not allowed.  Get rid of any
order-of-operations concerns.

David J.


Re: list append syntax for postgresql.conf

2019-02-20 Thread Joe Conway
On 2/20/19 10:15 AM, Peter Eisentraut wrote:
> Nowadays there are a number of methods for composing multiple
> postgresql.conf files for modularity.  But if you have a bunch of things
> you want to load via shared_preload_libraries, you have to put them all
> in one setting.  How about some kind of syntax for appending something
> to a list, like
> 
> shared_preload_libraries += 'pg_stat_statements'
> 
> Thoughts?


I like the idea, but presume it would apply to any GUC list, not just
shared_preload_libraries?

Joe

-- 
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development





Re: speeding up planning with partitions

2019-02-20 Thread Tom Lane
Amit Langote  writes:
> On 2019/02/20 5:57, Tom Lane wrote:
>> but that is not
>> going to fix the fundamental problem: the cost estimate for such a
>> Path should vary depending on how large we think the outer rel is,
>> and we don't have a reasonable way to set that if we're trying to
>> make a one-size-fits-all Path for something that's being joined to
>> an inheritance tree with a widely varying set of relation sizes.

> What if we set the parent target relation's rows to an average of child
> target relation's rows, that is, instead of setting it to dummy 1 that
> previous versions of the patches were doing?

Well, if somebody were holding a gun to our collective heads and saying
you must do inherited UPDATE/DELETE this way, we could probably limp along
with that; or maybe it'd be better to use the sum of the children's row
counts.  That depends on how many of the per-child join plans end up using
the parameterized path, which is something we couldn't hope to guess so
early.  (Arguably, the way the code is now, it's overestimating the true
costs of such paths, since it doesn't account for different child plans
possibly using the same indexscan and thereby getting caching benefits.)
In any case there'd be side-effects on code that currently expects
appendrels to have size zero, eg the total_table_pages calculation in
make_one_rel.

However, there are other reasons why I'm not really happy with the
approach proposed in this patch.

The main problem is that cloning the PlannerInfo while still sharing a lot
of infrastructure between the clones is a horrid hack that I think will be
very buggy and unmaintainable.  We've gotten away with it so far in
inheritance_planner because (a) the complexity is all local to that
function and (b) the cloning happens very early in the planning process,
so that there's not much shared subsidiary data to worry about; mostly
just the parse tree, which in fact isn't shared because the first thing
we do is push it through adjust_appendrel_attrs.  This patch proposes
to clone much later, and both the actual cloning and the consequences
of that are spread all over, and I don't think we're nearly done with
the consequences :-(.  I found the parameterized-path problem while
wondering why it was okay to not worry about adjusting attrs in data
structures used during path construction for other baserels ... turns
out it isn't.  There's a lot of other stuff in PlannerInfo that you're
not touching, for instance pathkeys and placeholders; and I'm afraid
much of that represents either current bugs or future problems.

So what I feel we should do is set this aside for now and see if we
can make something of the other idea I proposed.  If we could get
rid of expand-inheritance-at-the-top altogether, and plan inherited
UPDATE/DELETE the same as inherited SELECT, that would be a large
reduction in planner complexity, hence much to be preferred over this
approach which is a large increase in planner complexity.  If that
approach crashes and burns, we can come back to this.

There might be parts of this work we can salvage, though.  It seems
like the idea of postponing expand_inherited_tables() might be
something we could use anyway.

regards, tom lane



Re: Delay locking partitions during INSERT and UPDATE

2019-02-20 Thread Robert Haas
On Tue, Feb 19, 2019 at 4:07 PM David Rowley
 wrote:
> I'd say that here we should only discuss what this patch is doing, ...

On that note, I spent some more time looking at what the patch is doing today.

 /*
  * We locked all the partitions in ExecSetupPartitionTupleRouting
  * including the leaf partitions.
  */
-partrel = table_open(dispatch->partdesc->oids[partidx], NoLock);
+partrel = table_open(dispatch->partdesc->oids[partidx], RowExclusiveLock);

It seems to me that the reason for this change is precisely that the
comment is now false, and therefore the comment needs to be updated.

Does that sound right?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: list append syntax for postgresql.conf

2019-02-20 Thread Christoph Berg
Re: Peter Eisentraut 2019-02-20 
<74af1f60-34af-633e-ee8a-310d40c74...@2ndquadrant.com>
> Nowadays there are a number of methods for composing multiple
> postgresql.conf files for modularity.  But if you have a bunch of things
> you want to load via shared_preload_libraries, you have to put them all
> in one setting.  How about some kind of syntax for appending something
> to a list, like
> 
> shared_preload_libraries += 'pg_stat_statements'

I've thought about that syntax myself in the past. It would be very
handy to be able to things like

/etc/postgresql/11/main/conf.d/pg_stat_statements.conf:
shared_preload_libraries += 'pg_stat_statements'
pg_stat_statements.track = all

(The current Debian packages already create and support conf.d/ in the
default configuration.)

+1.

Christoph



Re: list append syntax for postgresql.conf

2019-02-20 Thread Euler Taveira
On Wed, Feb 20, 2019 at 12:15, Peter Eisentraut wrote:
>
> Nowadays there are a number of methods for composing multiple
> postgresql.conf files for modularity.  But if you have a bunch of things
> you want to load via shared_preload_libraries, you have to put them all
> in one setting.  How about some kind of syntax for appending something
> to a list, like
>
Partial settings could confuse users, no? I see the usefulness of such a
feature, but I would prefer to implement it via ALTER SYSTEM. Instead of
+= I would prefer to add another option to ALTER SYSTEM that appends new
values, such as:

ALTER SYSTEM APPEND shared_preload_libraries TO 'pg_stat_statements,
pg_prewarm';

and it will expand to:

shared_preload_libraries = 'foo, bar, pg_stat_statements, pg_prewarm'


> shared_preload_libraries += 'pg_stat_statements'
>
What happens if you have:

shared_preload_libraries = 'foo'

then

shared_preload_libraries += 'bar'

and then

shared_preload_libraries = 'pg_stat_statements'

You have only 'pg_stat_statements' or 'foo, bar, pg_stat_statements'?

Suppose you mistype 'bar' as 'baz'; do you then have only 'pg_stat_statements'?


-- 
   Euler Taveira   Timbira - http://www.timbira.com.br/
   PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento



list append syntax for postgresql.conf

2019-02-20 Thread Peter Eisentraut
Nowadays there are a number of methods for composing multiple
postgresql.conf files for modularity.  But if you have a bunch of things
you want to load via shared_preload_libraries, you have to put them all
in one setting.  How about some kind of syntax for appending something
to a list, like

shared_preload_libraries += 'pg_stat_statements'

Thoughts?

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Another way to fix inherited UPDATE/DELETE

2019-02-20 Thread Tom Lane
Etsuro Fujita  writes:
> (2019/02/20 6:48), Tom Lane wrote:
>> In the case of a standard inheritance or partition tree, this seems to
>> go through really easily, since all the children could share the same
>> returned CTID column (I guess you'd also need a TABLEOID column so you
>> could figure out which table to direct the update back into).  It gets
>> a bit harder if the tree contains some foreign tables, because they might
>> have different concepts of row identity, but I'd think in most cases you
>> could still combine those into a small number of output columns.

> If this is the direction we go in, we should work on the row ID issue 
> [1] before this?

Certainly, the more foreign tables can use a standardized concept of row
identity, the better this would work.  What I'm loosely envisioning is
that we have one junk row-identity column for each distinct row-identity
datatype needed --- so, for instance, all ordinary tables could share
a TID column.  Different FDWs might need different things, though one
would hope for no more than one datatype per FDW-type involved in the
inheritance tree.  Where things could break down is if we have a lot
of tables that need a whole-row-variable for row identity, and they're
all different rowtypes --- eventually we'd run out of available columns.
So we'd definitely wish to encourage FDWs to have some more efficient
row-identity scheme than that one.

I don't see that as being something that constrains those two projects
to be done in a particular order; it'd just be a nice-to-have improvement.

regards, tom lane



Re: Another way to fix inherited UPDATE/DELETE

2019-02-20 Thread Tom Lane
Amit Langote  writes:
> On 2019/02/20 13:54, Tom Lane wrote:
>> That's something we'd need to think about.  Obviously, anything
>> along this line breaks the existing FDW update APIs, but let's assume
>> that's acceptable.  Is it impossible, or even hard, for an FDW to
>> support this definition of UPDATE rather than the existing one?
>> I don't think so --- it seems like it's just different --- but
>> I might well be missing something.

> IIUC, in the new approach, only the root of the inheritance tree (target
> table specified in the query) will appear in the query's join tree, not
> the child target tables, so pushing updates with joins to the remote side
> seems a bit hard, because we're not going to consider child joins.  Maybe
> I'm missing something though.

Hm.  Even if that's true (I'm not convinced), I don't think it's such a
significant use-case as to be considered a blocker.

regards, tom lane



Re: Row Level Security − leakproof-ness and performance implications

2019-02-20 Thread Joe Conway
On 2/19/19 6:43 PM, Laurenz Albe wrote:
> Pierre Ducroquet wrote:
>> if we had a 'RLS-enabled' context on functions, a way to make a lot of built-
>> in functions leakproof would simply be to prevent them from logging errors 
>> containing values.
>> 
>> For others, like arraycontains, it's much trickier :[...]
> 
> I feel a little out of my depth here, so I won't comment.

I had more-or-less the same idea, and even did a PoC patch (not
preventing the log entries but rather redacting the variable data), but
after discussing offlist with some other hackers I got the impression
that it would be a really hard sell.

It does seem to me though that such a feature would be pretty useful.
Something like: use a GUC to turn it on, and while it is on, log messages
get redacted. If you really needed to see the details, you could either
duplicate your issue on another installation with the redaction turned
off, or maybe turn it off on production in a controlled manner just long
enough to capture the full error message.

Joe

-- 
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development





Re: [PROPOSAL] Shared Ispell dictionaries

2019-02-20 Thread Arthur Zakirov

Hello,

I've created a new commitfest entry, since the previous one was closed
with status "Returned with feedback":


https://commitfest.postgresql.org/22/2007/

I've attached a new version of the patch. Only
0003-Retrieve-shared-location-for-dict-v18.patch has changed.


I added a reference counter to the shared hash table's dictionary entries.
It is needed to avoid memory bloat: shared hash table entries have to be
deleted when there are a lot of ALTER and DROP TEXT SEARCH DICTIONARY
commands.


The previous version of the patch released unused DSM segments but left
the shared hash table entries untouched.


There was a refcnt before:

https://www.postgresql.org/message-id/20180403115720.GA7450%40zakirov.localdomain

But I didn't fully understand how on_dsm_detach() works.

On 22.01.2019 22:17, Tomas Vondra wrote:

I think there are essentially two ways:

(a) Define max amount of memory available for shared dictionarires, and
come up with an eviction algorithm. This will be tricky, because when
the frequently-used dictionaries need a bit more memory than the limit,
this will result in trashing (evict+load over and over).

(b) Define what "unused" means for dictionaries, and unload dictionaries
that become unused. For example, we could track timestamp of the last
time each dict was used, and decide that dictionaries unused for 5 or
more minutes are unused. And evict those.

The advantage of (b) is that it adopts automatically, more or less. When
you have a bunch of frequently used dictionaries, the amount of shared
memory increases. If you stop using them, it decreases after a while.
And rarely used dicts won't force eviction of the frequently used ones.

I'm working on the (b) approach. I thought about a priority queue
structure. There is no such ready-made structure in the PostgreSQL sources
except binaryheap.c, and that one isn't designed for concurrent use.


--
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
>From 3e220e259eebc6b9730c9500176015b04e588cae Mon Sep 17 00:00:00 2001
From: Arthur Zakirov 
Date: Thu, 17 Jan 2019 14:27:32 +0300
Subject: [PATCH 1/4] Fix-ispell-memory-handling

---
 src/backend/tsearch/spell.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/src/backend/tsearch/spell.c b/src/backend/tsearch/spell.c
index eb39466b22..eb8416ce7f 100644
--- a/src/backend/tsearch/spell.c
+++ b/src/backend/tsearch/spell.c
@@ -78,6 +78,8 @@
 #define tmpalloc(sz)  MemoryContextAlloc(Conf->buildCxt, (sz))
 #define tmpalloc0(sz)  MemoryContextAllocZero(Conf->buildCxt, (sz))
 
+#define tmpstrdup(str)	MemoryContextStrdup(Conf->buildCxt, (str))
+
 /*
  * Prepare for constructing an ISpell dictionary.
  *
@@ -498,7 +500,7 @@ NIAddSpell(IspellDict *Conf, const char *word, const char *flag)
 	Conf->Spell[Conf->nspell] = (SPELL *) tmpalloc(SPELLHDRSZ + strlen(word) + 1);
 	strcpy(Conf->Spell[Conf->nspell]->word, word);
 	Conf->Spell[Conf->nspell]->p.flag = (*flag != '\0')
-		? cpstrdup(Conf, flag) : VoidString;
+		? tmpstrdup(flag) : VoidString;
 	Conf->nspell++;
 }
 
@@ -1040,7 +1042,7 @@ setCompoundAffixFlagValue(IspellDict *Conf, CompoundAffixFlag *entry,
 		entry->flag.i = i;
 	}
 	else
-		entry->flag.s = cpstrdup(Conf, s);
+		entry->flag.s = tmpstrdup(s);
 
 	entry->flagMode = Conf->flagMode;
 	entry->value = val;
@@ -1541,6 +1543,9 @@ nextline:
 	return;
 
 isnewformat:
+	pfree(recoded);
+	pfree(pstr);
+
 	if (oldformat)
 		ereport(ERROR,
 (errcode(ERRCODE_CONFIG_FILE_ERROR),
-- 
2.20.1

>From 291667b579641176ca74eaa343521dd5c258a744 Mon Sep 17 00:00:00 2001
From: Arthur Zakirov 
Date: Thu, 17 Jan 2019 15:05:44 +0300
Subject: [PATCH 2/4] Change-tmplinit-argument

---
 contrib/dict_int/dict_int.c  |  4 +-
 contrib/dict_xsyn/dict_xsyn.c|  4 +-
 contrib/unaccent/unaccent.c  |  4 +-
 src/backend/commands/tsearchcmds.c   | 10 -
 src/backend/snowball/dict_snowball.c |  4 +-
 src/backend/tsearch/dict_ispell.c|  4 +-
 src/backend/tsearch/dict_simple.c|  4 +-
 src/backend/tsearch/dict_synonym.c   |  4 +-
 src/backend/tsearch/dict_thesaurus.c |  4 +-
 src/backend/utils/cache/ts_cache.c   | 13 +-
 src/include/tsearch/ts_cache.h   |  4 ++
 src/include/tsearch/ts_public.h  | 67 ++--
 12 files changed, 105 insertions(+), 21 deletions(-)

diff --git a/contrib/dict_int/dict_int.c b/contrib/dict_int/dict_int.c
index 628b9769c3..ddde55eee4 100644
--- a/contrib/dict_int/dict_int.c
+++ b/contrib/dict_int/dict_int.c
@@ -30,7 +30,7 @@ PG_FUNCTION_INFO_V1(dintdict_lexize);
 Datum
 dintdict_init(PG_FUNCTION_ARGS)
 {
-	List	   *dictoptions = (List *) PG_GETARG_POINTER(0);
+	DictInitData *init_data = (DictInitData *) PG_GETARG_POINTER(0);
 	DictInt*d;
 	ListCell   *l;
 
@@ -38,7 +38,7 @@ dintdict_init(PG_FUNCTION_ARGS)
 	d->maxlen = 6;
 	d->rejectlong = false;
 
-	foreach(l, dictoptions)
+	foreach(l, init_data->dict_options)
 	{
 		DefElem*defel = (DefElem *) lfirst(l);
 
diff --git 

Remove unnecessary use of PROCEDURAL

2019-02-20 Thread Peter Eisentraut
I propose the attached patch to remove some unnecessary, legacy-looking
use of the PROCEDURAL keyword before LANGUAGE.  We mostly don't use this
anymore, so some of these look a bit old.

There is still some use in pg_dump, which is harder to remove because
it's baked into the archive format, so I'm not touching that.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From 3e6826d39a77a93ae13a8745fa7888e09d6ce84f Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Wed, 20 Feb 2019 13:02:22 +0100
Subject: [PATCH] Remove unnecessary use of PROCEDURAL

---
 doc/src/sgml/xplang.sgml| 4 ++--
 src/backend/commands/proclang.c | 7 +++
 src/include/nodes/parsenodes.h  | 3 +--
 src/pl/plperl/plperl--1.0.sql   | 4 ++--
 src/pl/plperl/plperl--unpackaged--1.0.sql   | 2 +-
 src/pl/plperl/plperlu--1.0.sql  | 4 ++--
 src/pl/plperl/plperlu--unpackaged--1.0.sql  | 2 +-
 src/pl/plpgsql/src/plpgsql--1.0.sql | 4 ++--
 src/pl/plpgsql/src/plpgsql--unpackaged--1.0.sql | 2 +-
 src/pl/plpython/plpython2u--1.0.sql | 4 ++--
 src/pl/plpython/plpython2u--unpackaged--1.0.sql | 2 +-
 src/pl/plpython/plpython3u--1.0.sql | 4 ++--
 src/pl/plpython/plpython3u--unpackaged--1.0.sql | 2 +-
 src/pl/plpython/plpythonu--1.0.sql  | 4 ++--
 src/pl/plpython/plpythonu--unpackaged--1.0.sql  | 2 +-
 src/pl/tcl/pltcl--1.0.sql   | 4 ++--
 src/pl/tcl/pltcl--unpackaged--1.0.sql   | 2 +-
 src/pl/tcl/pltclu--1.0.sql  | 4 ++--
 src/pl/tcl/pltclu--unpackaged--1.0.sql  | 2 +-
 19 files changed, 30 insertions(+), 32 deletions(-)

diff --git a/doc/src/sgml/xplang.sgml b/doc/src/sgml/xplang.sgml
index db765b4644..d215ce82d0 100644
--- a/doc/src/sgml/xplang.sgml
+++ b/doc/src/sgml/xplang.sgml
@@ -137,7 +137,7 @@ Installing Procedural Languages
  
   Finally, the PL must be declared with the command
 
-CREATE TRUSTED PROCEDURAL LANGUAGE 
language-name
+CREATE TRUSTED LANGUAGE 
language-name
 HANDLER handler_function_name
 INLINE inline_function_name
 VALIDATOR 
validator_function_name ;
@@ -200,7 +200,7 @@ Manual Installation of PL/Perl
  
   The command:
 
-CREATE TRUSTED PROCEDURAL LANGUAGE plperl
+CREATE TRUSTED LANGUAGE plperl
 HANDLER plperl_call_handler
 INLINE plperl_inline_handler
 VALIDATOR plperl_validator;
diff --git a/src/backend/commands/proclang.c b/src/backend/commands/proclang.c
index 59c4e8dfd0..b7917618bf 100644
--- a/src/backend/commands/proclang.c
+++ b/src/backend/commands/proclang.c
@@ -1,7 +1,7 @@
 /*-
  *
  * proclang.c
- *   PostgreSQL PROCEDURAL LANGUAGE support code.
+ *   PostgreSQL LANGUAGE support code.
  *
  * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
@@ -55,9 +55,8 @@ static ObjectAddress create_proc_lang(const char *languageName, bool replace,
 Oid valOid, bool trusted);
 static PLTemplate *find_language_template(const char *languageName);
 
-/* -
- * CREATE PROCEDURAL LANGUAGE
- * -
+/*
+ * CREATE LANGUAGE
  */
 ObjectAddress
 CreateProceduralLanguage(CreatePLangStmt *stmt)
diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h
index 6d035a072e..a7e859dc90 100644
--- a/src/include/nodes/parsenodes.h
+++ b/src/include/nodes/parsenodes.h
@@ -2433,8 +2433,7 @@ typedef struct AlterEventTrigStmt
 } AlterEventTrigStmt;
 
 /* --
- * Create/Drop PROCEDURAL LANGUAGE Statements
- * Create PROCEDURAL LANGUAGE Statements
+ * Create LANGUAGE Statements
  * --
  */
 typedef struct CreatePLangStmt
diff --git a/src/pl/plperl/plperl--1.0.sql b/src/pl/plperl/plperl--1.0.sql
index 801900f023..f716ba1c56 100644
--- a/src/pl/plperl/plperl--1.0.sql
+++ b/src/pl/plperl/plperl--1.0.sql
@@ -6,6 +6,6 @@
  * knowledge into this script.
  */
 
-CREATE PROCEDURAL LANGUAGE plperl;
+CREATE LANGUAGE plperl;
 
-COMMENT ON PROCEDURAL LANGUAGE plperl IS 'PL/Perl procedural language';
+COMMENT ON LANGUAGE plperl IS 'PL/Perl procedural language';
diff --git a/src/pl/plperl/plperl--unpackaged--1.0.sql b/src/pl/plperl/plperl--unpackaged--1.0.sql
index b062bd5d9b..5e097c443d 100644
--- a/src/pl/plperl/plperl--unpackaged--1.0.sql
+++ b/src/pl/plperl/plperl--unpackaged--1.0.sql
@@ -1,6 +1,6 @@
 /* src/pl/plperl/plperl--unpackaged--1.0.sql */
 
-ALTER EXTENSION plperl ADD PROCEDURAL LANGUAGE plperl;
+ALTER EXTENSION plperl ADD LANGUAGE plperl;
 -- ALTER ADD LANGUAGE doesn't pick up the support functions, so we 

Re: WIP: Avoid creation of the free space map for small tables

2019-02-20 Thread Alvaro Herrera
Please remember to keep serial_schedule in sync.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: crash in pg_identify_object_as_address

2019-02-20 Thread Alvaro Herrera
Pushed, thanks.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Unnecessary checks for new rows by some RI trigger functions?

2019-02-20 Thread Antonin Houska
When reviewing a related patch [1], I spent some time thinking about the
"detectNewRows" argument of ri_PerformCheck(). My understanding is that it
should (via the es_crosscheck_snapshot field of EState) prevent the executor
from accidentally deleting a PK value that another transaction (whose data
changes the trigger's transaction does not see) might still need.

However, I find it confusing that the trigger functions pass detectNewRows=true
even if they only execute a SELECT statement. I think that in these cases the
trigger functions 1) detect by themselves that another transaction has inserted
row(s) referencing the PK whose deletion is being checked, and 2) do not try to
delete the PK anyway. Furthermore (AFAICS), only the heap_update() and
heap_delete() functions receive the crosscheck snapshot, so ri_PerformCheck()
does not need to pass the crosscheck snapshot to SPI_execute_snapshot() for
SELECT queries.
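
For concreteness, a minimal SQL sketch of the concurrent situation these
checks are about (object names are made up, purely for illustration):

-- Illustrative only.
CREATE TABLE pk_tab (id int PRIMARY KEY);
CREATE TABLE fk_tab (ref int REFERENCES pk_tab (id));
INSERT INTO pk_tab VALUES (1);

-- Session A (still uncommitted):
--     BEGIN;
--     INSERT INTO fk_tab VALUES (1);
--
-- Session B, concurrently:
--     DELETE FROM pk_tab WHERE id = 1;
--
-- Session B's RI trigger runs a SELECT against fk_tab to decide whether the
-- deletion is allowed.  The question above is whether that SELECT really
-- needs the crosscheck snapshot (detectNewRows=true) to notice session A's
-- row, or whether the trigger logic handles that case on its own anyway.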

Following is a patch proposal. Is there anything wrong with it?

diff --git a/src/backend/utils/adt/ri_triggers.c b/src/backend/utils/adt/ri_triggers.c
index e1aa3d0044..e235ad9cd0 100644
--- a/src/backend/utils/adt/ri_triggers.c
+++ b/src/backend/utils/adt/ri_triggers.c
@@ -574,7 +574,7 @@ ri_Check_Pk_Match(Relation pk_rel, Relation fk_rel,
 	result = ri_PerformCheck(riinfo, &qkey, qplan,
 							 fk_rel, pk_rel,
 							 old_row, NULL,
-							 true,	/* treat like update */
+							 false,
 							 SPI_OK_SELECT);
 
if (SPI_finish() != SPI_OK_FINISH)
@@ -802,7 +802,7 @@ ri_restrict(TriggerData *trigdata, bool is_no_action)
 		ri_PerformCheck(riinfo, &qkey, qplan,
 						fk_rel, pk_rel,
 						old_row, NULL,
-						true,	/* must detect new rows */
+						false,
 						SPI_OK_SELECT);
 
if (SPI_finish() != SPI_OK_FINISH)


[1] https://commitfest.postgresql.org/22/1975/

-- 
Antonin Houska
https://www.cybertec-postgresql.com



Re: Conflict handling for COPY FROM

2019-02-20 Thread Andrew Dunstan


On 2/20/19 8:01 AM, Surafel Temesgen wrote:
>
>
> On Tue, Feb 19, 2019 at 3:47 PM Andres Freund wrote:
>
>
>
> Err, what? Again, that requires super user permissions (in
> contrast to copy from/to stdin/out). Backends run as the user
> postgres runs under
>
>
>  
> okay i see it now and modified the patch similarly 
>
>


Why log to a file at all? We do have, you know, a database handy, where
we could log errors far more usefully. You could log the offending row
as an array of text, for example.
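
A rough sketch of what such an in-database error log could look like (purely
illustrative, names invented):

CREATE TABLE copy_errors (
    relid     regclass,                    -- table being loaded
    occurred  timestamptz DEFAULT now(),
    message   text,                        -- error reported for the row
    rawfields text[]                       -- the offending row, field by field
);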


cheers


andrew

-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: 2019-03 CF Summary / Review - Tranche #2

2019-02-20 Thread Alexey Kondratov

On 16.02.2019 8:45, Andres Freund wrote:

- pg_rewind: options to use restore_command from recovery.conf or
   command line

   WOA: Was previously marked as RFC, but I don't see how it is. Possibly
   can be finished, but does require a good bit more work.


I just sent a new version of the patch to the thread [1], which removes all
the unnecessary complexity. I am willing to address any new issues that come
up during the 2019-03 CF.


[1] 
https://www.postgresql.org/message-id/c9cfabce-8fb6-493f-68ec-e0a72d957bf4%40postgrespro.ru



Thanks

--
Alexey Kondratov





Re: Conflict handling for COPY FROM

2019-02-20 Thread Surafel Temesgen
On Tue, Feb 19, 2019 at 3:47 PM Andres Freund  wrote:

>
>
> Err, what? Again, that requires super user permissions (in contrast to
> copy from/to stdin/out). Backends run as the user postgres runs under
>


Okay, I see it now and have modified the patch accordingly.

regards
Surafel
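
(For reference: judging from the documentation hunk below, a COPY invocation
with this patch would look roughly like the following. This is only a sketch;
the option spelling is taken from the docs and the exact grammar accepted by
the patch may differ.)

COPY my_table FROM '/tmp/data.csv'
    WITH (FORMAT csv,
          on_conflict_log 100,                       -- log at most 100 failing rows
          log_file_name '/tmp/copy_conflicts.log');  -- must be an absolute path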
diff --git a/doc/src/sgml/ref/copy.sgml b/doc/src/sgml/ref/copy.sgml
index 254d3ab8eb..5ee70d62bf 100644
--- a/doc/src/sgml/ref/copy.sgml
+++ b/doc/src/sgml/ref/copy.sgml
@@ -380,6 +380,28 @@ WHERE condition
 

 
+   
+on_conflict_log
+
+ 
+  Specifies the maximum number of failing records to log.
+  Instead of raising an error, the record is written to the
+  log file and COPY proceeds to the next record.
+ 
+
+   
+
+   
+log_file_name
+
+ 
+  The path name of the log file.  It must be an absolute
+  path.  Windows users might need to use an E'' string and
+  double any backslashes used in the path name.
+ 
+
+   
+
   
  
 
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index dbb06397e6..2a2c3d98b4 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -46,6 +46,7 @@
 #include "port/pg_bswap.h"
 #include "rewrite/rewriteHandler.h"
 #include "storage/fd.h"
+#include "storage/lmgr.h"
 #include "tcop/tcopprot.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
@@ -123,6 +124,7 @@ typedef struct CopyStateData
 	int			file_encoding;	/* file or remote side's character encoding */
 	bool		need_transcoding;	/* file encoding diff from server? */
 	bool		encoding_embeds_ascii;	/* ASCII can be non-first byte? */
+	FILE	   *failed_rec_file;		/* used if ignore_conflict is true */
 
 	/* parameters from the COPY command */
 	Relation	rel;			/* relation to copy to or from */
@@ -152,6 +154,9 @@ typedef struct CopyStateData
 	List	   *convert_select; /* list of column names (can be NIL) */
 	bool	   *convert_select_flags;	/* per-column CSV/TEXT CS flags */
 	Node	   *whereClause;	/* WHERE condition (or NULL) */
+	char	   *failed_rec_filename;	/* failed record filename */
+	bool	   ignore_conflict;
+	int	   error_limit;			/* total # of errors to log */
 
 	/* these are just for error messages, see CopyFromErrorCallback */
 	const char *cur_relname;	/* table name for error messages */
@@ -773,6 +778,21 @@ CopyLoadRawBuf(CopyState cstate)
 	return (inbytes > 0);
 }
 
+/*
+ * LogCopyError - write a failed record to the failed-record file
+ */
+static void
+LogCopyError(CopyState cstate, const char *str)
+{
+	appendBinaryStringInfo(&cstate->line_buf, str, strlen(str));
+#ifndef WIN32
+	appendStringInfoCharMacro(&cstate->line_buf, '\n');
+#else
+	appendBinaryStringInfo(&cstate->line_buf, "\r\n", strlen("\r\n"));
+#endif
+	fwrite(cstate->line_buf.data, 1, cstate->line_buf.len, cstate->failed_rec_file);
+	cstate->error_limit--;
+}
 
 /*
  *	 DoCopy executes the SQL COPY statement
@@ -836,6 +856,7 @@ DoCopy(ParseState *pstate, const CopyStmt *stmt,
 		 errmsg("must be superuser or a member of the pg_write_server_files role to COPY to a file"),
 		 errhint("Anyone can COPY to stdout or from stdin. "
  "psql's \\copy command also works for anyone.")));
+
 		}
 	}
 
@@ -1249,6 +1270,36 @@ ProcessCopyOptions(ParseState *pstate,
 defel->defname),
 		 parser_errposition(pstate, defel->location)));
 		}
+		else if (strcmp(defel->defname, "on_conflict_log") == 0)
+		{
+			if (cstate->ignore_conflict)
+ereport(ERROR,
+		(errcode(ERRCODE_SYNTAX_ERROR),
+		 errmsg("conflicting or redundant options"),
+		 parser_errposition(pstate, defel->location)));
+
+			cstate->ignore_conflict = true;
+			cstate->error_limit = defGetInt64(defel);
+			if (cstate->error_limit < 0)
+ereport(ERROR,
+		(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+		 errmsg("argument to option \"%s\" must be positive number",
+defel->defname),
+		 parser_errposition(pstate, defel->location)));
+		}
+		else if (strcmp(defel->defname, "log_file_name") == 0)
+		{
+			if (cstate->failed_rec_filename)
+ereport(ERROR,
+		(errcode(ERRCODE_SYNTAX_ERROR),
+		 errmsg("conflicting or redundant options"),
+		 parser_errposition(pstate, defel->location)));
+			if (!is_member_of_role(GetUserId(), DEFAULT_ROLE_WRITE_SERVER_FILES))
+ereport(ERROR,
+		(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+		 errmsg("must be superuser or a member of the pg_write_server_files role to log error")));
+			cstate->failed_rec_filename = defGetString(defel);
+		}
 		else
 			ereport(ERROR,
 	(errcode(ERRCODE_SYNTAX_ERROR),
@@ -1271,6 +1322,21 @@ ProcessCopyOptions(ParseState *pstate,
 (errcode(ERRCODE_SYNTAX_ERROR),
  errmsg("cannot specify NULL in BINARY mode")));
 
+	if (!cstate->error_limit && cstate->failed_rec_filename)
+		ereport(ERROR,
+(errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("cannot specify log file name without on conflict log option")));
+
+	if (cstate->error_limit && !cstate->failed_rec_filename)
+		ereport(ERROR,
+(errcode(ERRCODE_SYNTAX_ERROR),
+ 

Re: WIP: Avoid creation of the free space map for small tables

2019-02-20 Thread Amit Kapila
On Mon, Feb 11, 2019 at 10:48 PM John Naylor
 wrote:
>
> On 2/9/19, Amit Kapila  wrote:
> > On Tue, Feb 5, 2019 at 3:25 PM John Naylor 
> > wrote:
> >>
> >> On Tue, Feb 5, 2019 at 4:04 AM Amit Kapila 
> >> wrote:
> > This is certainly a good test w.r.t code coverage of new code, but I
> > have few comments:
> > 1. The size of records in test still depends on alignment (MAXALIGN).
> > Though it doesn't seem to be a problematic case, I still suggest we
> > can avoid using records whose size depends on alignment.  If you
> > change the schema as CREATE TABLE fsm_check_size (num1 int, num2 int,
> > str text);, then you can avoid alignment related issues for the
> > records being used in test.
>
> Done.
>
> > 2.
> > +-- Fill most of the last block
> > ..
> > +-- Make sure records can go into any block but the last one
> > ..
> > +-- Insert large record and make sure it does not cause the relation to
> > extend
> >
> > The comments in some part of the test seems too focussed towards the
> > algorithm used for in-memory map.  I think we can keep these if we
> > want, but it is required to write a more generic comment stating what
> > is the actual motive of additional tests (basically we are testing the
> > functionality of in-memory map (LSM) for the heap, so we should write
> > about it.).
>
> Done.
>

Thanks, the modification looks good.  I have slightly changed the
commit message in the attached patch.  I will spend some more time
tomorrow morning on this and will commit unless I see any new problem.
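
(For concreteness, the kind of alignment-insensitive test data discussed
above might look like this; a sketch only, not the actual regression test:)

CREATE TABLE fsm_check_size (num1 int, num2 int, str text);
-- Row size is controlled by the text column rather than by MAXALIGN padding.
INSERT INTO fsm_check_size
    SELECT i, i, repeat('a', 1024) FROM generate_series(1, 4) AS g(i);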

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


v3-0001-Add-more-tests-for-FSM.patch
Description: Binary data


Re: 2019-03 CF Summary / Review - Tranche #2

2019-02-20 Thread Etsuro Fujita

(2019/02/18 12:40), Etsuro Fujita wrote:

(2019/02/16 14:45), Andres Freund wrote:

- postgres_fdw: Perform UPPERREL_ORDERED and UPPERREL_FINAL steps
remotely

WOA: This is a nontrivial change, and the design and review only
started in late December. It's probably not realistic to target v12.

Andres: punt to v13


I also think this needs more reviews, but I don't think it's unrealistic
to target v12, because 1) the patch is actually not that large (at least
in the latest version, most of the changes are in regression tests), and
2) IMO the patch is rather straightforward.


There seems to be no objections, so I marked this as targeting v12.

Best regards,
Etsuro Fujita




Re: SQL statement PREPARE does not work in ECPG

2019-02-20 Thread Michael Meskes
Takahashi-san,

>   EXEC SQL EXECUTE test_prep (2);

Could you please also verify for me if this works correctly if you use
a variable instead of the const? As in:

EXEC SQL BEGIN DECLARE SECTION;
int i=2;
EXEC SQL END DECLARE SECTION;
...
EXEC SQL EXECUTE test_prep (:i);

I guess the problem is that constants in this subtree are moved to the
output literally instead of being treated as parameters.

Thanks.

Michael
-- 
Michael Meskes
Michael at Fam-Meskes dot De, Michael at Meskes dot (De|Com|Net|Org)
Meskes at (Debian|Postgresql) dot Org
Jabber: michael at xmpp dot meskes dot org
VfL Borussia! Força Barça! SF 49ers! Use Debian GNU/Linux, PostgreSQL




Re: [Patch] pg_rewind: options to use restore_command from recovery.conf or command line

2019-02-20 Thread Alexey Kondratov

Hi,



I will work out this one with postgres -C and come back before the next 
commitfest. I found that something similar is already used in pg_ctl, 
and there is a mechanism for finding valid executables in exec.c. So it 
does not seem to be a big deal at first sight.




I have reworked the patch; please find the new version attached. It is 
a third the size of the previous one and now touches only pg_rewind's 
code. The tests are also slightly refactored to remove duplicated code. 
As suggested, postgres -C is now used to retrieve restore_command (if 
-r is passed). Otherwise everything works as before.


Andres, Alvaro, does it make sense now?



Regards

--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company

>From 4c8f5c228e089e7e72835ae5c409a5bc8425ab15 Mon Sep 17 00:00:00 2001
From: Alexey Kondratov 
Date: Tue, 19 Feb 2019 19:14:53 +0300
Subject: [PATCH v4] pg_rewind: options to use restore_command from command
 line or cluster config

---
 doc/src/sgml/ref/pg_rewind.sgml   |  30 -
 src/bin/pg_rewind/parsexlog.c | 161 +-
 src/bin/pg_rewind/pg_rewind.c |  98 +++-
 src/bin/pg_rewind/pg_rewind.h |   7 +-
 src/bin/pg_rewind/t/001_basic.pl  |   4 +-
 src/bin/pg_rewind/t/002_databases.pl  |   4 +-
 src/bin/pg_rewind/t/003_extrafiles.pl |   4 +-
 src/bin/pg_rewind/t/RewindTest.pm |  84 +-
 8 files changed, 372 insertions(+), 20 deletions(-)

diff --git a/doc/src/sgml/ref/pg_rewind.sgml b/doc/src/sgml/ref/pg_rewind.sgml
index 53a64ee29e..0c2441afa7 100644
--- a/doc/src/sgml/ref/pg_rewind.sgml
+++ b/doc/src/sgml/ref/pg_rewind.sgml
@@ -67,8 +67,10 @@ PostgreSQL documentation
ancestor. In the typical failover scenario where the target cluster was
shut down soon after the divergence, this is not a problem, but if the
target cluster ran for a long time after the divergence, the old WAL
-   files might no longer be present. In that case, they can be manually
-   copied from the WAL archive to the pg_wal directory, or
+   files might no longer be present. In that case, they can be automatically
+   copied by pg_rewind from the WAL archive to the
+   pg_wal directory if either the -r or
+   -R option is specified, or
fetched on startup by configuring  or
.  The use of
pg_rewind is not limited to failover, e.g.  a standby
@@ -200,6 +202,30 @@ PostgreSQL documentation
   
  
 
+ 
+  -r
+  --use-postgresql-conf
+  
+   
+Use the restore_command defined in postgresql.conf to
+retrieve WAL files that are missing from the target
+pg_wal directory from the WAL archive.
+   
+  
+ 
+
+ 
+  -R restore_command
+  --restore-command=restore_command
+  
+   
+Specifies the restore_command to use to retrieve WAL files
+that are missing from the target pg_wal directory
+from the WAL archive.
+   
+  
+ 
+
  
   --debug
   
diff --git a/src/bin/pg_rewind/parsexlog.c b/src/bin/pg_rewind/parsexlog.c
index e19c265cbb..5978ec9b99 100644
--- a/src/bin/pg_rewind/parsexlog.c
+++ b/src/bin/pg_rewind/parsexlog.c
@@ -12,6 +12,7 @@
 #include "postgres_fe.h"
 
 #include 
+#include 
 
 #include "pg_rewind.h"
 #include "filemap.h"
@@ -45,6 +46,7 @@ static char xlogfpath[MAXPGPATH];
 typedef struct XLogPageReadPrivate
 {
 	const char *datadir;
+	const char *restoreCommand;
 	int			tliIndex;
 } XLogPageReadPrivate;
 
@@ -53,6 +55,9 @@ static int SimpleXLogPageRead(XLogReaderState *xlogreader,
    int reqLen, XLogRecPtr targetRecPtr, char *readBuf,
    TimeLineID *pageTLI);
 
+static int RestoreArchivedWAL(const char *path, const char *xlogfname,
+   off_t expectedSize, const char *restoreCommand);
+
 /*
  * Read WAL from the datadir/pg_wal, starting from 'startpoint' on timeline
  * index 'tliIndex' in target timeline history, until 'endpoint'. Make note of
@@ -60,7 +65,7 @@ static int SimpleXLogPageRead(XLogReaderState *xlogreader,
  */
 void
 extractPageMap(const char *datadir, XLogRecPtr startpoint, int tliIndex,
-			   XLogRecPtr endpoint)
+			   XLogRecPtr endpoint, const char *restore_command)
 {
 	XLogRecord *record;
 	XLogReaderState *xlogreader;
@@ -69,6 +74,7 @@ extractPageMap(const char *datadir, XLogRecPtr startpoint, int tliIndex,
 
 	private.datadir = datadir;
 	private.tliIndex = tliIndex;
+	private.restoreCommand = restore_command;
 	xlogreader = XLogReaderAllocate(WalSegSz, &SimpleXLogPageRead,
 									&private);
 	if (xlogreader == NULL)
@@ -156,7 +162,7 @@ readOneRecord(const char *datadir, XLogRecPtr ptr, int tliIndex)
 void
 findLastCheckpoint(const char *datadir, XLogRecPtr forkptr, int tliIndex,
    XLogRecPtr *lastchkptrec, TimeLineID *lastchkpttli,
-   XLogRecPtr *lastchkptredo)
+   XLogRecPtr *lastchkptredo, const char *restoreCommand)
 {
 	/* Walk backwards, starting from the given record */
 	XLogRecord *record;
@@ -181,6 +187,7 @@ 

Re: SQL statement PREPARE does not work in ECPG

2019-02-20 Thread Michael Meskes
Takahashi-san,

> I tried as follows.
> ...
> Unfortunately, this does not work.
> ECPGst_execute seems good, but prepare statement is the same as my
> first post.

Ah right, my bad. The workaround should have been:

EXEC SQL PREPARE test_prep from "SELECT id from test_table where id =
$1";
EXEC SQL EXECUTE test_prep using 2;
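
(For comparison, the plain SQL those two embedded statements are meant to
boil down to on the server side is roughly the following; parameter types
are inferred:)

PREPARE test_prep AS SELECT id FROM test_table WHERE id = $1;
EXECUTE test_prep (2);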

> It fails with "PostgreSQL error : -202[too few arguments on line
> 16]".
> 
> This error is caused by following source code.
> ...
> I think next_insert() should ignore "$n" in the case of SQL statement
> PREPARE.

Actually I am not so sure.

> In addition, we should fix following, right?
> 
> (1)
> As Matsumura-san wrote, ECPG should not produce '"' for SQL statement
> PREPARE.

Why's that?

> (2)
> ECPG should produce argument for execute statement such as "EXEC SQL
> EXECUTE test_prep (2);"

Correct.

As for the PREPARE statement itself, could you try the attached small
patch please.

This seems to create the correct ECPGPrepare call, but I have not yet
tested it for many use cases.

Thanks.

Michael
-- 
Michael Meskes
Michael at Fam-Meskes dot De, Michael at Meskes dot (De|Com|Net|Org)
Meskes at (Debian|Postgresql) dot Org
Jabber: michael at xmpp dot meskes dot org
VfL Borussia! Força Barça! SF 49ers! Use Debian GNU/Linux, PostgreSQL
diff --git a/src/interfaces/ecpg/preproc/ecpg.addons b/src/interfaces/ecpg/preproc/ecpg.addons
index e8052819b0..bfd76960a2 100644
--- a/src/interfaces/ecpg/preproc/ecpg.addons
+++ b/src/interfaces/ecpg/preproc/ecpg.addons
@@ -22,12 +22,7 @@ ECPG: stmtUpdateStmt block
 ECPG: stmtExecuteStmt block
 	{ output_statement($1, 1, ECPGst_execute); }
 ECPG: stmtPrepareStmt block
-	{
-		if ($1.type == NULL || strlen($1.type) == 0)
-			output_prepare_statement($1.name, $1.stmt);
-		else
-			output_statement(cat_str(5, mm_strdup("prepare"), $1.name, $1.type, mm_strdup("as"), $1.stmt), 0, ECPGst_normal);
-	}
+	{ output_prepare_statement($1.name, $1.stmt); }
 ECPG: stmtTransactionStmt block
 	{
 		fprintf(base_yyout, "{ ECPGtrans(__LINE__, %s, \"%s\");", connection ? connection : "NULL", $1);

