Re: Fix some error handling for read() and errno

2018-07-13 Thread Alvaro Herrera
On 2018-Jul-14, Michael Paquier wrote:

> On Fri, Jul 13, 2018 at 11:31:31AM -0400, Alvaro Herrera wrote:
> >> Mr. Robot has been complaining about this patch set, so attached is a
> >> rebased version.  Thinking about it, I would tend to just merge 0001 and
> >> give up on 0002 as that may not justify future backpatch pain.  Thoughts
> >> are welcome.
> > 
> > I vote to push both.
> 
> Thanks!  Did you look at the code?  The first patch is just some
> cleanup, while the second could have adjustments?  For the second I went
> with the minimal amount of work, and actually there is no need to make
> ReadTransientFile() return a status per my study of ReadTwoPhaseFile()
> in https://postgr.es/m/20180709050309.gm1...@paquier.xyz which must fail
> when reading the file.  So patch 0002 depends on the other 2PC patch.

I did read them, though not in minute detail.  0002 seems to result in
code easier to read.  If there are particular places that deviate from
the obvious patterns, I didn't notice them.

In 0001 one thing I wasn't terribly in love with was random deviations
in sprintf format specifiers for things that should be identical, ie.
%lu in some places and %zu in others, for "read only %d of %d".  It
seems you should pick the more general one (%zu) and use casts to Size
(or is it size_t?) in the places that have other types.  That way you
*really* decrease translator effort to the bare minimum :-)
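For illustration only, here's roughly the shape I mean -- just a sketch using
the usual ereport() conventions, where path, fd, buf and len stand in for
whatever the call site actually has:

    int         nread;

    nread = read(fd, buf, len);
    if (nread < 0)
        ereport(ERROR,
                (errcode_for_file_access(),
                 errmsg("could not read file \"%s\": %m", path)));
    if ((Size) nread != len)
        ereport(ERROR,
                (errcode(ERRCODE_DATA_CORRUPTED),
                 errmsg("could not read file \"%s\": read %zu of %zu",
                        path, (Size) nread, (Size) len)));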

Ah, in 0001 you have one case of "could not read _from_" (in
SimpleXLogPageRead).  The "from" is not present in the other places.
Looks odd.

I'm not sure about putting the wait event stuff inside the new
functions.  It looks odd, though I understand why you did it.

No opinion on the 2PC stuff -- didn't look at that.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Finding database for pg_upgrade missing library

2018-07-13 Thread Justin T Pryzby
On Fri, Jul 13, 2018 at 12:28:15PM -0400, Bruce Momjian wrote:
> I received a private pg_upgrade feature request to report the database
> name for missing loadable libraries.  Currently we report "could not
> load library" and the library file name, e.g. $libdir/pgpool-regclass.
> 
> The request is that we report the _database_ name that contained the
> loadable library reference.  However, that isn't easy to do because we
> gather all loadable library file names, sort them, and remove
> duplicates, for reasons of efficiency and so we check libraries in a
> predictable alphabetical order.
> 
> Is it worth modifying pg_upgrade to report the first or all databases
> that contain references to missing loadable libraries?  I don't think
> so, but I wanted to ask here.

Yes please, with a preference for the "all databases" option.

We typically have only 4 DBs, including postgres and template1,2.  It's
annoying enough when an upgrade process breaks because pg_repack or
pg_stat_buffercache was installed into the postgres DB.  But it's a veritable
pain when you discover in the middle of an upgrade that postgis had somehow
been loaded into template1, needs to be uninstalled (or upgraded from 2.2 to
2.3 to allow the upgrade), and the old postgis package was already removed...
Maybe you find that one library was installed in one place, fix it, and restart
the upgrade process.  Then it fails because the old library was also installed
in some other place...

When I've had to figure this out in the past, I ended up grepping the dumps to
figure out what old library was where.

I have these comments to myself from the last time I had to figure out what
[(database, [missing library,...]), ...] were involved, probably last
September.

# time /usr/pgsql-9.6/bin/pg_dump --schema-only --quote-all-identifiers --binary-upgrade --format=custom dbname=ts |grep -a postgis- |grep -wv postgis-2.2

# [pryzbyj@database ~]$ time sudo sh -c 'for f in /var/lib/pgsql/pg_upgrade_dump_*.custom; do x=`pg_restore "$f" |grep "postgis-2.[^4]"` && echo "$f $x"; done'

Thanks for considering,
Justin



Re: [PATCH] LLVM tuple deforming improvements

2018-07-13 Thread Pierre Ducroquet
On Friday, July 13, 2018 11:08:45 PM CEST Andres Freund wrote:
> Hi,
> 
> Thanks for looking at this!
> 
> On 2018-07-13 10:20:42 +0200, Pierre Ducroquet wrote:
> > 2) improve the LLVM IR code
> > 
> > The code generator in llvmjit-deform.c currently relies on the LLVM
> > optimizer to do the right thing. For instance, it can generate a lot of
> > empty blocks with only a jump. If we don't want to enable the LLVM
> > optimizer for every piece of code, we have to get rid of this kind of pattern. The
> > attached patch does that. When the optimizer is not used, this gets a few
> > cycles boost, nothing impressive. I have tried to go closer to the
> > optimized bitcode, but it requires building phi nodes manually instead of
> > using alloca, and this isn't enough to bring us to the performance level
> > of -O1.
> 
> Building phi blocks manually is too painful imo. But there's plenty
> blocks we could easily skip entirely, even without creating phi nodes.
> 
> > From 4da278ee49b91d34120747c6763c248ad52da7b7 Mon Sep 17 00:00:00 2001
> > From: Pierre Ducroquet 
> > Date: Mon, 2 Jul 2018 13:44:10 +0200
> > Subject: [PATCH] Introduce opt1 in LLVM/JIT, and force it with deforming
> 
> I think I'd rather go for more explicit pipelines than defaulting to OX
> pipelines.  This is too late for v11, and I suspect quite strongly that
> we'll end up not relying on any of the default pipelines going forward -
> they're just not targeted at our usecase.  I just didn't go there for
> v11, because I was running out of steam / time.

Hi

After looking at the optimization passes, I noticed that, at least in that 
case, most benefits do not come from any of the bitcode level passes.
Using a -O3 pipeline only on the bitcode gives at best a 20% performance 
boost.
Using a -O0 on the bitcode, but changing the codegen opt level from None (O0) 
to Less (O1) yields even better performance than a complete O1.
The attached patch alone gives a query time of 650 ms, vs 725ms for 'full' O1 
and 770ms for O3.

As far as I know (and from quickly looking at LLVM code, it doesn't look like 
this changed recently), the CodeGen part doesn't expose a pass manager, thus 
making it impossible to have our own optimization pipeline there, and we do 
not control the code generation directly since we rely on the ORC execution 
engine.


Regards

 Pierre
>From 8b66f60869e285b6f45f3cb900f8c1df44df15ee Mon Sep 17 00:00:00 2001
From: Pierre Ducroquet 
Date: Sat, 14 Jul 2018 01:51:31 +0200
Subject: [PATCH] LLVM - Use the O1 CodeGen level

---
 src/backend/jit/llvm/llvmjit.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/backend/jit/llvm/llvmjit.c b/src/backend/jit/llvm/llvmjit.c
index 5d0cdab1fc..a73619ae2f 100644
--- a/src/backend/jit/llvm/llvmjit.c
+++ b/src/backend/jit/llvm/llvmjit.c
@@ -643,7 +643,7 @@ llvm_session_initialize(void)
 
 	llvm_opt0_targetmachine =
 		LLVMCreateTargetMachine(llvm_targetref, llvm_triple, cpu, features,
-LLVMCodeGenLevelNone,
+LLVMCodeGenLevelDefault,
 LLVMRelocDefault,
 LLVMCodeModelJITDefault);
 	llvm_opt3_targetmachine =
-- 
2.18.0



Re: Fix some error handling for read() and errno

2018-07-13 Thread Michael Paquier
On Fri, Jul 13, 2018 at 11:31:31AM -0400, Alvaro Herrera wrote:
>> Mr. Robot has been complaining about this patch set, so attached is a
>> rebased version.  Thinking about it, I would tend to just merge 0001 and
>> give up on 0002 as that may not justify future backpatch pain.  Thoughts
>> are welcome.
> 
> I vote to push both.

Thanks!  Did you look at the code?  The first patch is just some
cleanup, while the second could have adjustments?  For the second I went
with the minimal amount of work, and actually there is no need to make
ReadTransientFile() return a status per my study of ReadTwoPhaseFile()
in https://postgr.es/m/20180709050309.gm1...@paquier.xyz which must fail
when reading the file.  So patch 0002 depends on the other 2PC patch.
--
Michael




Re: Generating partitioning tuple conversion maps faster

2018-07-13 Thread David Rowley
On 14 July 2018 at 04:57, Heikki Linnakangas  wrote:
> Pushed, thanks!

Thanks for pushing, and thanks again for reviewing it, Alexander.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: function lca('{}'::ltree[]) caused DB Instance crash

2018-07-13 Thread Tom Lane
I wrote:
> However, I don't understand why this code is returning NULL, rather than
> a zero-length ltree, in the case that there's no common prefix.  That
> doesn't seem consistent to me.

After looking more closely, I see that what lca() returns is the longest
common *ancestor* of the input paths, not the longest common *prefix*
... at least, by my understanding of what a prefix is.  If the longest
prefix is empty, then there's no common ancestor, so returning null in
that case isn't so insane after all.  However, the documentation seems
very misleading on this point.  I changed it around along with pushing
the crash fix.

I wonder whether there isn't a need for an lcp() that would return the
common prefix as that's usually understood.

regards, tom lane



Re: cannot restore schema with is not distinct from on hstore since PG 9.6.8

2018-07-13 Thread Andrew Dunstan




On 07/13/2018 05:23 PM, Tom Lane wrote:

"David G. Johnston"  writes:

I think serious consideration needs to be given to ways to allow the user
of pg_dump/pg_restore to choose the prior, less secure, mode of operation​.
IMO the risk surface presented to support back-patching the behavioral
changes was not severe enough to do so in the first place.  I'm presuming
undoing the back-patch will be shot down without mercy but at least
consider an escape hatch for unafflicted secure systems that just happen to
depend on search_path more than a super-hardened system would.

FWIW, in the security team's discussions of CVE-2018-1058, I argued
strenuously in favor of providing a way to run pg_dump/pg_restore with
the system's default search_path as before.  I lost the argument;
but maybe the need for features like this shows that we are not really
ready to insist on unconditional security there.




I don't remember that, TBH.

Certainly this problem seems nasty enough that we should possibly 
revisit the issue.


cheers

andrew

--
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Finding database for pg_upgrade missing library

2018-07-13 Thread Daniel Gustafsson
> On 13 Jul 2018, at 18:28, Bruce Momjian  wrote:
> 
> I received a private pg_upgrade feature request to report the database
> name for missing loadable libraries.  Currently we report "could not
> load library" and the library file name, e.g. $libdir/pgpool-regclass.
> 
> The request is that we report the _database_ name that contained the
> loadable library reference.  However, that isn't easy to do because we
> gather all loadable library file names, sort them, and remove
> duplicates, for reasons of efficiency and so we check libraries in a
> predictable alphabetical order.
> 
> Is it worth modifying pg_upgrade to report the first or all databases
> that contain references to missing loadable libraries?  I don't think
> so, but I wanted to ask here.

I ran into that very problem when upgrading 50+ clusters during a long night a
while back, and patched pg_upgrade to report the database which helped a lot
(or at least helped me be a bit lazy).  So, +1 on the feature from me.  If
there is interest I can see if I can dig out the patch and polish it up.

cheers ./daniel


Re: GSOC 2018 Project - A New Sorting Routine

2018-07-13 Thread Peter Geoghegan
On Fri, Jul 13, 2018 at 3:04 PM, Kefan Yang  wrote:
> 1. Slow on CREATE INDEX cases.
>
> I am still trying to figure out where the bottleneck is. Is the data pattern
> in index creation very different from other cases? Also, pg_qsort has
> 10%-20% advantage at creating index even on sorted data (faster CPU, N =
> 100). This is very strange to me since the two sorting routines execute
> exactly the same code when the input data is sorted.

Yes. CREATE INDEX uses heap TID as a tie-breaker, so it's impossible
for any two index tuples to compare as equal within tuplesort.c, even
though they may be equal in other contexts. This is likely to defeat
things like the Bentley-McIlroy optimization where equal keys are
swapped, which is very effective in the event of many equal keys.
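To make that concrete, a sketch (not the actual tuplesort.c comparator;
compare_key_columns() here is an invented stand-in for the per-attribute
comparisons):

    static int
    comparetup_index_sketch(IndexTuple a, IndexTuple b)
    {
        int         cmp = compare_key_columns(a, b);    /* invented helper */

        if (cmp != 0)
            return cmp;

        /* heap TID as tie-breaker: distinct index tuples never compare equal */
        return ItemPointerCompare(&a->t_tid, &b->t_tid);
    }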

(Could also be parallelism, though I suppose you probably accounted for that.)

-- 
Peter Geoghegan



Fwd: GSOC 2018 Project - A New Sorting Routine

2018-07-13 Thread Kefan Yang
-- Forwarded message --
From: Kefan Yang 
Date: 2018-07-13 15:02 GMT-07:00
Subject: Re: GSOC 2018 Project - A New Sorting Routine
To: Tomas Vondra 


Hey Tomas,



Thanks for your reply!



First I’d like to make some clarification about my test result.



> First of all, testing this on t2.micro is *insane* considering that this
> instance type is subject to throttling (depending on CPU credits). I
> don't know if this happened to be an issue during your tests, of course.
> Furthermore, the instance only has 1 virtual core, so there's likely a
> lot of noise due to other tasks (kernel or whatever needs to run).


I agree that testing this on a t2.micro instance might not be a very good
idea. But I have tried 100 rounds for each test case to minimize the
influence of other processes. In fact, I think the result is
consistent with yours on i5-2500k (the older CPU): intro sort with only one
preordered check is slightly better on random cases. There's no big
difference if we also take the other cases into consideration. (I would say
intro sort is still better by simply counting the number of improved test
cases)

Secondly, I see the PDF includes results for various data set types
> (random, reversed, mostly random, ...) but the archive you provided only
> includes the random + killer cases.


I tried data of all patterns in the standalone test, but only the random and
killer cases in pg_bench.  This is a problem, but I will try the other cases in
pg_bench once I write a shell script to do it automatically.

And finally, I see the PDF reports "CPU clocks" but I'm not sure what
> that actually is? Is that elapsed time in milliseconds or something else?


Sorry for the confusion, but "CPU clocks" actually means CPU clock ticks,
which are units of time of a constant but system-specific length.

After reading your test results, I think the main problems of intro sort are

1. Slow on CREATE INDEX cases.

I am still trying to figure out where the bottleneck is. Is the data
pattern in index creation very different from other cases? Also, pg_qsort
has 10%-20% advantage at creating index even on sorted data (faster CPU, N
= 100). This is very strange to me since the two sorting routines
execute exactly the same code when the input data is sorted.

2. Slow on faster CPU and large data set.

The main difference is still on CREATE INDEX.  But there are several SELECT
cases (faster CPU, lines 179, 206, 377) where pg_qsort can have more than 60%
advantage, which is crazy. All these cases are dealing with mostly sorted
data.

Personally, I would say intro sort is good as long as it has nearly the
same performance as pg_qsort on average cases because of the better
worst-case complexity. In fact, we can make the two sorting routines as
similar as we want by increasing the threshold that intro sort switches to
heap sort. Therefore, I think by no means could pg_qsort be 10%-20% faster
than a well-optimized intro sort because they execute the same code most of
the time. There must be something special about CREATE INDEX test cases,
and I am trying to figure it out. Also, I am considering performing
the preordered check on every recursion, like pg_qsort does, and seeing
how it goes.

Finally, you've mentioned

But even if it was, I guess the easiest way to deal with it would be to
> randomize the selection of pivots.


I am wondering if pg_qsort with random pivot selection could be faster than
intro sort. I've just done some simple tests, which show that intro sort
is faster in all cases. But I guess it depends on how we would implement
the random number generation.

In conclusion, I still believe intro sort is good if we optimize the
implementation and select the parameters more carefully.

Thanks again for your time. I am hoping to see your thoughts on this:)

Regards,

Kefan

2018-07-12 17:50 GMT-07:00 Tomas Vondra :

> Hi Kefan,
>
> On 07/10/2018 11:02 PM, Kefan Yang wrote:
> > Hello, Hackers!
> >
> > I am working on my project in Google Summer of Code 2018
> >  ms_benchmark_and_implementation_.282018.29>.
> > In this project, I am trying to improve the in-memory sorting routine in
> > PostgreSQL. Now I am very excited to share my progress with you guys.
> >
> > Originally, PostgreSQL is using the QuickSort implemented by J. L.
> > Bentley and M. D. McIlroy in "Engineering a sort function" with some
> > modifications. This sorting routine is very fast, yet may fall to O(n^2)
> > time complexity in the worst case scenario. We are trying to find faster
> > sorting algorithms with guaranteed O(nlogn) time complexity.
> >
>
> Time complexity is nice, but it merely estimates the number of
> comparisons needed by the sort algorithm. It entirely ignores other
> factors that are quite important - behavior with caches, for example.
> And quicksort works really well in this regard, I think.
>
> The worst-case complexity may be an issue, but we're already dealing

Re: pgsql: Fix parallel index and index-only scans to fall back to serial.

2018-07-13 Thread Robert Haas
On Fri, Jul 13, 2018 at 2:22 PM, Heikki Linnakangas  wrote:
> I just bumped into this comment, from commit 09529a70bb5, and I can't make
> sense of it:
>
>> +   /*
>> +* We reach here if the index only scan is not parallel,
>> or if we're
>> +* executing a index only scan that was intended to be
>> parallel
>> +* serially.
>> +*/
>
>
> What was that intended to say?

There are two ways that you can reach that code.  One is that you have
the thing that shows up in EXPLAIN output as "Index-Only Scan".  The
other is that you have the thing that shows up in EXPLAIN output as
"Parallel Index-Only Scan", but you didn't get any workers, so now
you're falling back to running what was intended to be a parallel plan
without parallelism i.e. serially.  The comment is intended to alert
you to the fact that an intended-as-parallel scan can end up here in
corner cases where the plan doesn't end up being parallel.  We've had
some difficulty in consistently getting that case correct.

If you decide to rephrase the comment for clarity, note that there are
three other near-copies of it cf. git grep -C4 'We reach here if'

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] [PATCH] kNN for SP-GiST

2018-07-13 Thread Nikita Glukhov

Attached is the 6th version of the patches.

On 09.07.2018 20:47, Andrey Borodin wrote:


On 4 July 2018, at 3:21, Nikita Glukhov wrote:
Attached is the 5th version of the patches, where minor refactoring of distance
handling was done (see below).

I'm reviewing this patch. Currently I'm trying to understand the SP-GiST scan
more deeply, but for now I have some small notes.


Thank you for your review.



Following directives can be omitted:
#include "lib/rbtree.h"
#include "storage/relfilenode.h"

This message is not only GiST related, is it?
elog(ERROR, "GiST operator family's FOR ORDER BY operator must return float8 or 
float4 if the distance function is lossy");

Some small typos:
usefull -> useful
nearest-neighbour -> nearest-neighbor (or do we use e.g. "colour"?)


Fixed.

On 10.07.2018 20:09, Andrey Borodin wrote:


I've passed through the code one more time. Here are a few more questions:
1. The logic behind the division of the patch into steps was last described on
2017-01-30, but ISTM the actual steps have changed since then? May I ask you
to write a bit about the steps of the patchset?


A brief description of the current patch set:
1. Exported two box functions that are used in patch 5.
2. Extracted subroutine index_store_float8_orderby_distances() from GiST code
   that is common for GiST, SP-GiST, and possibly other kNN implementations
   (used in patch 4).
3. Extracted two subroutines from GiST code common for gistproperty() and
   spgistproperty() (used in patch 4).
4. The main patch: added kNN support to SP-GiST.
   This patch in theory can be split into two patches: refactoring and kNN
   support, but it will require some additional effort.
   Refactoring basically consists in the extraction of spgInnerTest(),
   spgTestLeafTuple(), spgMakeInnerItem() subroutines.
   A more detailed commit history can be found in my github [1].
5. Added ordering operator point <-> point to kd_point_ops and quad_point_ops.
6. Added ordering operator polygon <-> point to poly_ops.


2. The patch leaves contrib intact. Do extensions with SP-GiST opclasses
need to update their behavior somehow to be used as-is? Or to support the new
functionality?


There are no SP-GiST opclasses in contrib/, so there is nothing to change in
contrib/.  Moreover, existing extensions need to be updated only to support the
new distance-ordered searches -- they need to implement the distances[][] array
calculation in their spgInnerConsistent() and spgLeafConsistent() support
functions.
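
For illustration, a sketch of what that could look like inside an opclass's
spgInnerConsistent() -- not code from the patch; distance_to_node() and the
loop variables are invented here, and the distances field is the one added in
patch 4:

    /* one distance array per output node, one entry per ORDER BY key */
    out->distances = (double **) palloc(sizeof(double *) * out->nNodes);
    for (i = 0; i < out->nNodes; i++)
    {
        double     *dist = (double *) palloc(sizeof(double) * in->norderbys);

        for (j = 0; j < in->norderbys; j++)
            dist[j] = distance_to_node(i, &in->orderbys[j]);    /* invented */

        out->distances[i] = dist;
    }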


3. There is a patch about predicate locking in SP-GiST [2]. Is this kNN patch
theoretically compatible with predicate locking? Seems like it is, I just
want to note that this functionality may exist.


I think kNN is compatible with predicate locking.  Patch [2] does not apply
cleanly on kNN but the conflict resolution is trivial.


4. The scan state now has scanStack and queue. Maybe it's better to name them
scanStack and scanQueue, or stack and queue?


"queue" was renamed to "scanQueue".

[1] https://github.com/glukhovn/postgres/commits/knn_spgist
[2] https://commitfest.postgresql.org/14/1215/

--
Nikita Glukhov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/src/backend/utils/adt/geo_ops.c b/src/backend/utils/adt/geo_ops.c
index f57380a..a4345ef 100644
--- a/src/backend/utils/adt/geo_ops.c
+++ b/src/backend/utils/adt/geo_ops.c
@@ -41,7 +41,6 @@ enum path_delim
 static int	point_inside(Point *p, int npts, Point *plist);
 static int	lseg_crossing(double x, double y, double px, double py);
 static BOX *box_construct(double x1, double x2, double y1, double y2);
-static BOX *box_fill(BOX *result, double x1, double x2, double y1, double y2);
 static bool box_ov(BOX *box1, BOX *box2);
 static double box_ht(BOX *box);
 static double box_wd(BOX *box);
@@ -451,7 +450,7 @@ box_construct(double x1, double x2, double y1, double y2)
 
 /*		box_fill		-		fill in a given box struct
  */
-static BOX *
+BOX *
 box_fill(BOX *result, double x1, double x2, double y1, double y2)
 {
 	if (x1 > x2)
@@ -482,7 +481,7 @@ box_fill(BOX *result, double x1, double x2, double y1, double y2)
 /*		box_copy		-		copy a box
  */
 BOX *
-box_copy(BOX *box)
+box_copy(const BOX *box)
 {
 	BOX		   *result = (BOX *) palloc(sizeof(BOX));
 
diff --git a/src/include/utils/geo_decls.h b/src/include/utils/geo_decls.h
index a589e42..94806c2 100644
--- a/src/include/utils/geo_decls.h
+++ b/src/include/utils/geo_decls.h
@@ -182,6 +182,7 @@ typedef struct
 extern double point_dt(Point *pt1, Point *pt2);
 extern double point_sl(Point *pt1, Point *pt2);
 extern double pg_hypot(double x, double y);
-extern BOX *box_copy(BOX *box);
+extern BOX *box_copy(const BOX *box);
+extern BOX *box_fill(BOX *box, double xlo, double xhi, double ylo, double yhi);
 
 #endif			/* GEO_DECLS_H */
diff --git a/src/backend/access/gist/gistget.c b/src/backend/access/gist/gistget.c
index c4e8a3b..9279756 100644
--- a/src/backend/access/gist/gistget.c
+++ b/src/backend/access/gist/gistget.c
@@ -14,9 +14,9 @@
  */
 

Re: cannot restore schema with is not distinct from on hstore since PG 9.6.8

2018-07-13 Thread Tom Lane
"David G. Johnston"  writes:
> I think serious consideration needs to be given to ways to allow the user
> of pg_dump/pg_restore to choose the prior, less secure, mode of operation​.
> IMO the risk surface presented to support back-patching the behavioral
> changes was not severe enough to do so in the first place.  I'm presuming
> undoing the back-patch will be shot down without mercy but at least
> consider an escape hatch for unafflicted secure systems that just happen to
> depend on search_path more than a super-hardened system would.

FWIW, in the security team's discussions of CVE-2018-1058, I argued
strenuously in favor of providing a way to run pg_dump/pg_restore with
the system's default search_path as before.  I lost the argument;
but maybe the need for features like this shows that we are not really
ready to insist on unconditional security there.

regards, tom lane



Re: pgsql: Fix parallel index and index-only scans to fall back to serial.

2018-07-13 Thread David G. Johnston
On Fri, Jul 13, 2018 at 12:22 PM, Heikki Linnakangas 
wrote:

> Hi!
>
> I just bumped into this comment, from commit 09529a70bb5, and I can't make
> sense of it:
>
> +   /*
>> +* We reach here if the index only scan is not parallel,
>> or if we're
>> +* executing a index only scan that was intended to be
>> parallel
>> +* serially.
>> +*/
>>
>
> What was that intended to say?
>
>
​I think...​

​...or if we're serially executing a[n] index only scan that was intended
to be [executed in] parallel​

s/​intended​/planned/ ?

David J.


Re: [WIP PATCH] Index scan offset optimisation using visibility map

2018-07-13 Thread Heikki Linnakangas

On 21/05/18 18:43, Michail Nikolaev wrote:

Hello everyone.
This letter is related to “Extended support for index-only-scan” from my
previous message in the thread.

The WIP version of the patch has been ready for a while now, and I think it is
time to resume work on the feature. BTW, I found a small article about Oracle
vs Postgres focusing on this issue -
https://blog.dbi-services.com/postgres-vs-oracle-access-paths-viii/

Current WIP version of the patch is located here -
https://github.com/postgres/postgres/compare/88ba0ae2aa4aaba8ea0d85c0ff81cc46912d9308...michail-nikolaev:index_only_fetch,
passing all checks. In addition, patch includes small optimization for
caching of amcostestimate results.


Please submit an actual patch, extracted e.g. with "git format-patch 
-n1", rather than a link to an external site. That is a requirement for 
archival purposes, so that people reading the email archives later on 
can see what was being discussed. (And that link doesn't return a proper 
diff, anyway.)



For now, I decide to name the plan as “Index Only Fetch Scan”. Therefore:
* In case of “Index Scan” – we touch the index and heap for EVERY tuple we
need to test
* For “Index Only Scan” – we touch the index for every tuple and NEVER
touch the heap
* For “Index Only Fetch Scan” – we touch the index for every tuple and
touch the heap for those tuples we need to RETURN ONLY.


Hmm. IIRC there was some discussion on doing that, when index-only scans 
were implemented. It's not generally OK to evaluate expressions based on 
data that has already been deleted from the table. As an example of the 
kind of problems you might get:


Imagine that a user does "DELETE FROM table WHERE div = 0; SELECT * 
FROM table WHERE 100 / div < 10". Would you expect the query to throw a 
"division by zero error"? If there was an index on 'div', you might 
evaluate the "100 / div" expression based on values from the index, 
which still includes entries for the just-deleted tuples with div = 0. 
They would be filtered out later, after performing the visibility 
checks, but it's too late if you already threw an error.


Now, maybe there's some way around that, but I don't know what. Without 
some kind of a solution, this won't work.


- Heikki



Re: cannot restore schema with is not distinct from on hstore since PG 9.6.8

2018-07-13 Thread David G. Johnston
On Fri, Jul 13, 2018 at 1:54 PM, Tom Lane  wrote:

> So this is all pretty messy, but on the bright side, fixing it would allow
> cleaning up some ancient squishy coding in ruleutils.c.  It wouldn't be
> controversial as just a v12 addition, perhaps ... but do we have a choice
> about back-patching?  Dump/restore failures are not good.
>

I think serious consideration needs to be given to ways to allow the user
of pg_dump/pg_restore to choose the prior, less secure, mode of operation​.

IMO the risk surface presented to support back-patching the behavioral
changes was not severe enough to do so in the first place.  I'm presuming
undoing the back-patch will be shot down without mercy but at least
consider an escape hatch for unafflicted secure systems that just happen to
depend on search_path more than a super-hardened system would.

David J.


Re: partition pruning doesn't work with IS NULL clause in multikey range partition case

2018-07-13 Thread Alvaro Herrera
On 2018-Jul-13, Ashutosh Bapat wrote:

> On Fri, Jul 13, 2018 at 1:15 PM, Amit Langote
>  wrote:
> >>
> >> I don't think this is true. When equality conditions and IS NULL clauses 
> >> cover
> >> all partition keys of a hash partitioned table and do not have 
> >> contradictory
> >> clauses, we should be able to find the partition which will remain 
> >> unpruned.
> >
> > I was trying to say the same thing, but maybe the comment doesn't like it.
> >  How about this:
> >
> > + * For hash partitioning, if we found IS NULL clauses for some keys and
> > + * OpExpr's for others, gen_prune_steps_from_opexps() will generate the
> > + * necessary pruning steps if all partition keys are taken care of.
> > + * If we found only IS NULL clauses, then we can generate the pruning
> > + * step here but only if all partition keys are covered.
> >
> 
> It's still confusing a bit. I think "taken care of" doesn't convey the
> same meaning as "covered". I don't think the second sentence is
> necessary, it talks about one special case of the first sentence. But
> this is better than the prior version.

Hmm, let me reword this comment completely.  How about the attached?

I also changed the order of the tests, putting the 'generate_opsteps'
one in the middle instead of at the end (so the not-null one doesn't
test that boolean again).  AFAICS it's logically the same, and it makes
more sense to me that way.

I also changed the tests so that they apply to the existing mc2p table,
which is identical to the one Amit was adding.

> > How about if we explicitly spell out the strategies like this:
> >
> > +if (!bms_is_empty(nullkeys) &&
> > +(part_scheme->strategy == PARTITION_STRATEGY_LIST ||
> > + part_scheme->strategy == PARTITION_STRATEGY_RANGE ||
> > + (part_scheme->strategy == PARTITION_STRATEGY_HASH &&
> > +  bms_num_members(nullkeys) == part_scheme->partnatts)))
> 
> I still do not understand why do we need (part_scheme->strategy ==
> PARTITION_STRATEGY_HASH && bms_num_members(nullkeys) ==
> part_scheme->partnatts) there. I thought that this will be handled by
> code, which deals with null keys and op_exprs together, somewhere
> else.

I'm not sure I understand this question.  This strategy applies to hash
partitioning only if we have null tests for all keys, so there are no
op_exprs.  If there are any, there's no point in checking them.

Now this case is a fun one:
select * from mc2p where a in (1, 2) and b is null;
Note that no partitions are pruned in that case; yet if you do this:
select * from mc2p where a in (1) and b is null;
then it does prune.  I think the reason for this is that the
PARTCLAUSE_MATCH_STEPS is ... pretty special.

> > Actually, there is only one case where the pruning step generated by that
> > block of code is useful -- to prune a list partition that accepts only
> > nulls.  List partitioning only allows one partition key, so this works,
> > but let's say only accidentally.  I modified the condition as follows:
> >
> > +else if (!generate_opsteps && !bms_is_empty(notnullkeys) &&
> > + bms_num_members(notnullkeys) == part_scheme->partnatts)
> >
> 
> Hmm. Ok. May be it's better to explicitly say that it's only useful in
> case of list partitioned table.

Yeah, maybe.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
diff --git a/src/backend/partitioning/partprune.c 
b/src/backend/partitioning/partprune.c
index cdc61a8997..e44df8ac94 100644
--- a/src/backend/partitioning/partprune.c
+++ b/src/backend/partitioning/partprune.c
@@ -853,54 +853,62 @@ gen_partprune_steps_internal(GeneratePruningStepsContext 
*context,
}
}
 
-   /*
-* If generate_opsteps is set to false it means no OpExprs were directly
-* present in the input list.
+   /*---
+* Now generate some (more) pruning steps.
+*
+* We may be able to (1) generate pruning steps based on IS NULL 
clauses,
+* if there are enough of them:
+* a) For list partitioning, null partition keys can only be found in 
the
+*designated partition, so if there are IS NULL clauses containing
+*partition keys we should generate a pruning step that gets rid of
+*all partitions but the special null-accepting partitions.
+* b) For range partitioning, the same applies.  Doing this check first
+*means we'll disregard OpExprs that may have been found for other
+*keys, but that's fine, because it is not possible to prune range
+*partitions with a combination of null and non-null keys.
+* c) For hash partitioning, we only apply this strategy if we have
+*IS NULL clauses for all the keys.  Strategy 2 will take care of 
the
+*case where some keys have OpExprs and others have IS NULL clauses.
+*
+* If not, 

Re: [PATCH] LLVM tuple deforming improvements

2018-07-13 Thread Andres Freund
Hi,

Thanks for looking at this!

On 2018-07-13 10:20:42 +0200, Pierre Ducroquet wrote:
> 2) improve the LLVM IR code
> 
> The code generator in llvmjit-deform.c currently relies on the LLVM optimizer to
> do the right thing. For instance, it can generate a lot of empty blocks with
> only a jump. If we don't want to enable the LLVM optimizer for every piece of
> code, we
> have to get rid of this kind of pattern. The attached patch does that. When 
> the optimizer is not used, this gets a few cycles boost, nothing impressive.
> I have tried to go closer to the optimized bitcode, but it requires building 
> phi nodes manually instead of using alloca, and this isn't enough to bring us 
> to the performance level of -O1.

Building phi blocks manually is too painful imo. But there's plenty
blocks we could easily skip entirely, even without creating phi nodes.



> From 4da278ee49b91d34120747c6763c248ad52da7b7 Mon Sep 17 00:00:00 2001
> From: Pierre Ducroquet 
> Date: Mon, 2 Jul 2018 13:44:10 +0200
> Subject: [PATCH] Introduce opt1 in LLVM/JIT, and force it with deforming

I think I'd rather go for more explicit pipelines than defaulting to OX
pipelines.  This is too late for v11, and I suspect quite strongly that
we'll end up not relying on any of the default pipelines going forward -
they're just not targeted at our usecase.  I just didn't go there for
v11, because I was running out of steam / time.

Greetings,

Andres Freund



Re: cannot restore schema with is not distinct from on hstore since PG 9.6.8

2018-07-13 Thread Tom Lane
Andrew Dunstan  writes:
> On 07/09/2018 11:34 AM, Tom Lane wrote:
>> I think the most practical way to deal with this probably is to change
>> the parser so that the lookup works by finding a default btree or hash
>> opclass rather than by looking for "=" by name.  We've made similar
>> changes in the past to get rid of implicit dependencies on operator
>> names, but those efforts never reached IS [NOT] DISTINCT FROM.

> I agree with your approach, including backpatching. I guess we'll have
> to try to think of some scenario that backpatching would break. Maybe to
> minimize any effect it should fall back on a default opclass if the
> search for "=" fails? Dunno how practical that is.

I poked into this area for awhile, and it turns out to be even a
worse can of worms than I thought.  I looked through gram.y and
parse_expr.c, and identified several distinct classes of issue.
(I'm not promising that I found everything.)

1. LIKE, ILIKE, SIMILAR TO are converted to unqualified operator names
"~~", "~~*", etc.  One might reflexively propose forcing these expansions
to be qualified with pg_catalog, but I'm afraid that would break use-cases
involving non-core data types, where the desired operator might well not
be in pg_catalog.  However, I think that there's not actually any new
dump/restore hazard here, because of the fact that ruleutils.c simply
prints the results as the internal operator names without trying to
convert back to LIKE et al.  If the operator isn't in the search path
it will get schema-qualified, and all's well.  So my inclination in this
category is to do nothing.  There's certainly a hazard for naive writers
of SQL, who may not realize that LIKE entails a search-path-dependent
lookup; but that hazard is no worse than it was before we decided to
lock down dump/restore search paths.

2. BETWEEN and related constructs are converted to boolean expressions
involving unqualified operator names ">=" etc.  As above, we are saved
by the fact that what will be dumped out is the expanded expression,
providing room to schema-qualify the selected operators if needed.
So again I'm inclined to leave this alone.  (This will be something
to think about if we ever get around to re-implementing BETWEEN as a
unitary construct, though.)

3. IN and NOT IN are converted to "= ANY" or "<> ALL" with unqualified
operator names.  This is almost like the above cases, except that
ruleutils.c sometimes chooses to print the result using "IN" rather than
with the actual operator name.  We might have to back off that readability
hack in some places (though it seems like it'd be OK if what would be
printed is exactly unqualified "= ANY" or "<> ALL").

4. As per original report, IS [NOT] DISTINCT FROM, as well as NULLIF,
are interpreted by doing a binary operator lookup using unqualified "="
as the operator name.  My original thought of replacing that by using
default operator classes to identify the comparison semantics doesn't
really work, because there is no requirement that the two input values be
of the same datatype.  For example, "int_var IS DISTINCT FROM numeric_var"
is handled perfectly well at present, but I see no principled way to
figure out what to do with that just by reference to operator classes.
We might have little choice but to invent a schema-qualifiable variant
syntax so we can represent these constructs unambiguously in dumps.
(Ugly as that is, it does have the advantage that we'd not be risking
subtle changes in the semantics of these operations.)

5. Row comparison constructs, for example ROW(a,b) < ROW(c,d), are handled
by doing column-by-column lookups using the specified operator name; this
example devolves to what "a < c" and "b < d" would be interpreted as.
Although we support OPERATOR() syntax in these constructs, that doesn't
get us far: if the original input was not a schema-qualified operator
then it's entirely possible that the per-column operators were found in
different schemas, which we can't handle by writing OPERATOR() in the
middle.  (There's already a comment in ruleutils.c's handling of
RowCompareExpr complaining about this.)

One idea for resolving that is to extend the OPERATOR syntax to allow
multiple operator names for row comparisons, along the lines of
ROW(a,b) OPERATOR(pg_catalog.<, public.<) ROW(c,d)
This seems probably to be not terribly hard, although back-patching it
wouldn't be appetizing.

6. The row comparison problem also manifests for multiple-column cases
of ANY_SUBLINK and ALL_SUBLINK, for instance
WHERE (a, b) IN (SELECT x, y FROM ...)
Again, ruleutils.c contributes to the issue by printing "IN" even
though it knows that that isn't 100% accurate.  But I think that here
again, a possible fix is to allow writing
WHERE (a, b) OPERATOR(pg_catalog.=, public.=) ANY (SELECT x, y FROM ...)
so that the operators to use can be locked down exactly.

7. The implicit-comparison form of CASE also resolves comparisons by
doing lookups with an assumed operator name 

Re: Vacuum: allow usage of more than 1GB of work mem

2018-07-13 Thread Andrew Dunstan




On 07/13/2018 09:44 AM, Heikki Linnakangas wrote:

On 13/07/18 01:39, Andrew Dunstan wrote:

On 07/12/2018 06:34 PM, Alvaro Herrera wrote:

On 2018-Jul-12, Andrew Dunstan wrote:

I fully understand. I think this needs to go back to "Waiting on 
Author".

Why?  Heikki's patch applies fine and passes the regression tests.


Well, I understood Claudio was going to do some more work (see
upthread).


Claudio raised a good point, that doing small pallocs leads to 
fragmentation, and in particular, it might mean that we can't give 
back the memory to the OS. The default glibc malloc() implementation 
has a threshold of 4 or 32 MB or something like that - allocations 
larger than the threshold are mmap()'d, and can always be returned to 
the OS. I think a simple solution to that is to allocate larger 
chunks, something like 32-64 MB at a time, and carve out the 
allocations for the nodes from those chunks. That's pretty 
straightforward, because we don't need to worry about freeing the 
nodes in retail. Keep track of the current half-filled chunk, and 
allocate a new one when it fills up.
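Roughly something like this -- only a sketch, with an invented struct and a 
chunk size picked arbitrarily for illustration:

    #define DT_CHUNK_SIZE   (64 * 1024 * 1024)      /* 64 MB slabs, say */

    typedef struct DtChunk
    {
        struct DtChunk *next;
        Size            used;
        char            data[FLEXIBLE_ARRAY_MEMBER];
    } DtChunk;

    static DtChunk *current_chunk = NULL;

    static void *
    dt_alloc(Size size)
    {
        void       *p;

        size = MAXALIGN(size);
        if (current_chunk == NULL ||
            current_chunk->used + size > DT_CHUNK_SIZE)
        {
            /* carve a fresh slab; slabs are only ever freed all at once */
            DtChunk    *chunk = palloc(offsetof(DtChunk, data) + DT_CHUNK_SIZE);

            chunk->next = current_chunk;
            chunk->used = 0;
            current_chunk = chunk;
        }
        p = current_chunk->data + current_chunk->used;
        current_chunk->used += size;
        return p;
    }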



Google seems to suggest the default threshold is much lower, like 128K. 
Still, making larger allocations seems sensible. Are you going to work 
on that?





He also wanted to refactor the iterator API, to return one ItemPointer 
at a time. I don't think that's necessary, the current iterator API is 
more convenient for the callers, but I don't feel strongly about that.


Anything else?


If we're going to go with Heikki's patch then do we need to
change the author, or add him as an author?


Let's list both of us. At least in the commit message, doesn't matter 
much what the commitfest app says.





I added you as an author in the CF App

cheers

andrew

--
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: GiST VACUUM

2018-07-13 Thread Heikki Linnakangas

On 13/07/18 21:28, Andrey Borodin wrote:

On 13 July 2018, at 18:25, Heikki Linnakangas wrote:

Looking at the second patch, to scan the GiST index in physical
order, that seems totally unsafe, if there are any concurrent page
splits. In the logical scan, pushStackIfSplited() deals with that,
by comparing the page's NSN with the parent's LSN. But I don't see
anything like that in the physical scan code.


A leaf page can be pointed to by an internal page and a rightlink
simultaneously. The purpose of the NSN is to visit this page exactly once,
by following only one of the two links in a scan. This is achieved
naturally if we read everything from the beginning to the end. (That
is how I understand it; I can be wrong.)


The scenario where this fails goes like this:

1. Vacuum scans physical pages 1-10
2. A concurrent insertion splits page 15. The new left half stays on 
page 15, but the new right half goes to page 5

3. Vacuum scans pages 11-20

Now, if there were any dead tuples on the right half of the split, moved 
to page 5, the vacuum would miss them.


The way this is handled in B-tree is that when a page is split, the page 
is stamped with current "vacuum cycle id". When the vacuum scan 
encounters a page with the current cycle id, whose right-link points to 
a lower-numbered page, it immediately follows the right link, and 
re-scans it. I.e. in the above example, if it was a B-tree, in step 3 
when vacuum scans page 15, it would see that it was concurrently split. 
It would immediately vacuum page 5 again, before continuing the scan in 
physical order.
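
In rough pseudo-C, the protocol looks something like this (the helpers are 
invented stand-ins, not the real btvacuumscan()/btvacuumpage() code):

    /* assumed helpers, declared only to make the sketch hang together */
    extern Page read_and_lock(BlockNumber blkno);
    extern void release(Page page);
    extern void vacuum_one_page(Page page);
    extern BlockNumber page_rightlink(Page page);
    extern uint16 page_cycleid(Page page);

    static void
    vacuum_scan_sketch(BlockNumber nblocks, uint16 current_cycleid)
    {
        BlockNumber blkno;

        for (blkno = 0; blkno < nblocks; blkno++)
        {
            Page        page = read_and_lock(blkno);

            vacuum_one_page(page);

            /*
             * If the page was split after this vacuum began (same cycle id)
             * and the right half went to a lower-numbered block we already
             * passed, follow the right link and vacuum it now, before
             * resuming the sequential scan.
             */
            while (page_cycleid(page) == current_cycleid &&
                   page_rightlink(page) < blkno)
            {
                BlockNumber rblkno = page_rightlink(page);

                release(page);
                page = read_and_lock(rblkno);
                vacuum_one_page(page);
            }
            release(page);
        }
    }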


We'll need to do something similar in GiST.

- Heikki



Re: [HACKERS] Client Connection redirection support for PostgreSQL

2018-07-13 Thread Dave Cramer
>
>
>
> One thing where I can see a feature like this being quite helpful is
> planned failovers, reducing the time to reconnect (for existing
> connections) and rediscover (for new connections, which need to
> write). But that'd require that the redirect needs to be able to be sent
> in an established connection too.
>
>
Somewhat related to this feature is the notion of giving a hint as to
whether a connection is read only.

Currently we can create a read-only transaction, which pgpool (through some
machinations that are less than favourable, IMO) can use to connect to a
secondary. This even works with the JDBC driver, which has a setReadOnly
facility on connections.

However, it would be far better to have a startup parameter indicating that we
want to connect to a read-only database. At that point pools could redirect to
a secondary. Given the proliferation of cloud-based implementations, I can see
this being a useful feature.


Regards,

Dave Cramer


Re: [HACKERS] Client Connection redirection support for PostgreSQL

2018-07-13 Thread Andres Freund
On 2018-07-13 23:00:04 +0300, Heikki Linnakangas wrote:
> On 05/03/18 22:18, Satyanarayana Narlapuram wrote:
> > Please see the attached patch with the comments.
> > 
> > Changes in the patch:
> > A client-side PGREDIRECTLIMIT parameter has been introduced to control 
> > the maximum number of retries.
> > BE_v3.1 sends a ProtocolNegotiation message. FE_v3.1 downgrades to v3.0 
> > upon receipt of this message.
> > FE falls back to v3.0 if 3.1 is not supported by the server.
> > 
> > 
> > >> I hadn't really thought deeply about whether redirection should 
> > happen before or after authentication.  For the most part, before seems 
> > better, because it seems a bit silly to force people to authenticate just 
> > so that you can tell them to go someplace else.  Also, that would lead to 
> > double authentication,which might for example result in multiple 
> > password prompts, which users might either dislike or find confusing.
> > 
> > Yes, redirection before authentication would avoid multiple password 
> > prompts.

FWIW, I think it's quite dangerous to do the redirect before
authentication, and more importantly, certificate validation / channel
binding.


> FWIW, if we were to do this, I think pg_hba.conf would be a fine place for
> this. That's where you currently have configuration for what happens when a
> client with certain host/username/database tries to connect. In addition to
> "accept" or "reject", it seems logical to add "redirect" as an outcome,
> instead of e.g. adding a whole new configuration file for this.

I'd personally argue that it'd also make sense to have this as actual
database level option.


One thing where I can see a feature like this being quite helpful is
planned failovers, reducing the time to reconnect (for existing
connections) and rediscover (for new connections, which need to
write). But that'd require that the redirect needs to be able to be sent
in an established connection too.

Greetings,

Andres Freund



Re: [HACKERS] Client Connection redirection support for PostgreSQL

2018-07-13 Thread Heikki Linnakangas

On 05/03/18 22:18, Satyanarayana Narlapuram wrote:

Please see the attached patch with the comments.

Changes in the patch:
A client-side PGREDIRECTLIMIT parameter has been introduced to control 
the maximum number of retries.
BE_v3.1 sends a ProtocolNegotiation message. FE_v3.1 downgrades to v3.0 
upon receipt of this message.
FE falls back to v3.0 if 3.1 is not supported by the server.


>> I hadn't really thought deeply about whether redirection should 
happen before or after authentication.  For the most part, before seems better, 
because it seems a bit silly to force people to authenticate just so that you can 
tell them to go someplace else.  Also, that would lead to double authentication,  
which might for example result in multiple password prompts, which users might 
either dislike or find confusing.

Yes, redirection before authentication would avoid multiple password 
prompts.


I think we should have this feature. I can see a redirect being useful 
in some similar cases like HTTP redirects are useful, but a database 
server is not a web server. There are no redirects in IMAP or most other 
protocols, either.


This would also require modifying every client library to honor the 
redirect.


How would the redirect behave with TLS certificate verification? If you 
are redirected from "foo-server" to "bar-server", but the original 
connection string was "host=foo-server sslmode=verify-full", would the 
connection be allowed?


FWIW, if we were to do this, I think pg_hba.conf would be a fine place 
for this. That's where you currently have configuration for what happens 
when a client with certain host/username/database tries to connect. In 
addition to "accept" or "reject", it seems logical to add "redirect" as 
an outcome, instead of e.g. adding a whole new configuration file for this.


But overall, IMHO we should mark this patch "rejected".

- Heikki



Logical decoding from promoted standby with same replication slot

2018-07-13 Thread Jeremy Finzel
Hello -

We are working on several DR scenarios with logical decoding.  Although we
are using pglogical the question we have I think is generally applicable to
logical replication.

Say we need to drop a logical replication slot for some emergency
reason on the master, but we don't want to lose the data permanently.  We
can make a point-in-time-recovery snapshot of the master to use in order to
recover the lost data in the slot we are about to drop.  Then we drop the
slot on master.

We can then point our logical subscription to pull from the snapshot to get
the lost data, once we promote it.

The question is that after promotion, logical decoding is looking for a
timeline 2 file whereas the file is still at timeline 1.

The WAL file is 000108FD003C, for example.  After promotion, it
is still 000108FD003C in pg_wal.  But logical decoding says
ERROR: segment 000208FD003C has already been removed (it is
looking for a timeline 2 WAL file).  Simply renaming the file actually
allows us to stream from the replication slot accurately and recover the
data.

But all of this begs the question of an easier way to do this - why doesn't
logical decoding know to look for a timeline 1 file?  It is really helpful
to have this ability to easily recover logical replicated data from a
snapshot of a replication slot, in case of disaster.

All thoughts very welcome!

Thanks,
Jeremy


Re: pgsql: Fix parallel index and index-only scans to fall back to serial.

2018-07-13 Thread Heikki Linnakangas

Hi!

I just bumped into this comment, from commit 09529a70bb5, and I can't 
make sense of it:



+   /*
+* We reach here if the index only scan is not parallel, or if 
we're
+* executing a index only scan that was intended to be parallel
+* serially.
+*/


What was that intended to say?

- Heikki



Re: GiST VACUUM

2018-07-13 Thread Andrey Borodin


> On 13 July 2018, at 18:10, Heikki Linnakangas wrote:
> 
 But the situation in gistdoinsert(), where you encounter a deleted leaf 
 page, could happen during normal operation, if vacuum runs concurrently 
 with an insert. Insertion locks only one page at a time, as it descends 
 the tree, so after it has released the lock on the parent, but before it 
 has locked the child, vacuum might have deleted the page. In the latest 
 patch, you're checking for that just before swapping the shared lock for 
 an exclusive one, but I think that's wrong; you need to check for that 
 after swapping the lock, because otherwise vacuum might delete the page 
 while you're not holding the lock.
>>> Looks like a valid concern, I'll move that code again.
>> Done.
> 
> Ok, the comment now says:
> 
>> +/*
>> + * Leaf pages can be left deleted but still referenced 
>> in case of
>> + * crash during VACUUM's gistbulkdelete()
>> + */
> 
> But that's not accurate, right? You should never see deleted pages after a 
> crash, because the parent is updated in the same WAL record as the child 
> page, right?
Fixed the comment.
> 
> I'm still a bit scared about using pd_prune_xid to store the XID that 
> prevents recycling the page too early. Can we use some field in 
> GISTPageOpaqueData for that, similar to how the B-tree stores it in 
> BTPageOpaqueData?
There is no room in the opaque data, but technically the whole page is just a
tombstone until it is reused, so we can pick an arbitrary place. PFA v7 where the
xid after delete is stored in the opaque data, but we can use other places, like
the line pointer array or opaque-1.

> On 13 July 2018, at 18:25, Heikki Linnakangas wrote:
> 
> Looking at the second patch, to scan the GiST index in physical order, that 
> seems totally unsafe, if there are any concurrent page splits. In the logical 
> scan, pushStackIfSplited() deals with that, by comparing the page's NSN with 
> the parent's LSN. But I don't see anything like that in the physical scan 
> code.

A leaf page can be pointed to by an internal page and a rightlink simultaneously.
The purpose of the NSN is to visit this page exactly once, by following only one
of the two links in a scan. This is achieved naturally if we read everything from
the beginning to the end. (That is how I understand it; I can be wrong.)

> I think we can do something similar in the physical scan: remember the 
> current LSN position at the beginning of the vacuum, and compare with that. 
> The B-tree code uses the "cycle ID" for similar purposes.
> 
> Do we still need the separate gistvacuumcleanup() pass, if we scan the index 
> in physical order in the bulkdelete pass already?

We do not need to gather stats there, but we are doing RecordFreeIndexPage() 
and IndexFreeSpaceMapVacuum(). Is it correct to move this to the first scan?

Best regards, Andrey Borodin.



0002-Physical-GiST-scan-during-VACUUM-v7.patch
Description: Binary data


0001-Delete-pages-during-GiST-VACUUM-v7.patch
Description: Binary data


Re: [PATCH] Include application_name in "connection authorized" log message

2018-07-13 Thread Don Seiler
On Fri, Jul 13, 2018 at 10:13 AM, Don Seiler  wrote:

> On Fri, Jul 13, 2018 at 9:37 AM, Stephen Frost  wrote:
>
>>
>> Don, do you want to update the patch accordingly?  If not, I'm happy to
>> handle it when I go to commit it, which I'm thinking of doing sometime
>> this weekend as it seems to be pretty uncontroversial at this point.
>
>
> Whatever is easiest for you. I'll try to get a new patch put together later
> today just for the practice, but if it's simpler for you to just update the
> diff file, please feel free.
> 
>

See attached for latest revision.

Thanks!

Don.

-- 
Don Seiler
www.seiler.us


0001-Changes-to-add-application_name-to-Port-struct-so-we.patch
Description: Binary data


Re: requested timeline ... does not contain minimum recovery point ...

2018-07-13 Thread Christophe Pettus


> On Jul 12, 2018, at 19:54, Andres Freund  wrote:
> Do you see a "checkpoint complete: wrote ..." message
> before the rewind started?

Checking, but I suspect that's exactly the problem.

This raises a question: Would it make sense for pg_rewind to either force a 
checkpoint or have a --checkpoint option along the lines of pg_basebackup?  
This scenario (pg_rewind being run very quickly after secondary promotion) is 
not uncommon when there's scripting around the switch-over process.

--
-- Christophe Pettus
   x...@thebuild.com




Re: [HACKERS] PATCH: Keep one postmaster monitoring pipe per process

2018-07-13 Thread Heikki Linnakangas

On 18/04/18 09:55, Thomas Munro wrote:

Here's a draft patch that does that.  One contentious question is:
should you have to opt *in* to auto-exit-on-postmaster death?  Andres
opined that you should.  I actually think it's not so bad if you don't
have to do that, and instead have to opt out.  I think of it as a kind
of 'process cancellation point' or a quiet PANIC that you can opt out
of.  It's nice to remove the old boilerplate code without having to
add a new boilerplate event that you have to remember every time.  Any
other opinions?


Hmm. Exiting on postmaster death by default does feel a bit too magical 
to me. But as your patch points out, there are currently no places where 
you *don't* want to exit on postmaster death, some callers just prefer 
to handle it themselves. And I like having a default or something to 
make sure that all call sites in the future will also exit quickly.


I'd suggest that we add a new WL_EXIT_ON_POSTMASTER_DEATH flag, making 
it opt-in. But add an assertion in WaitLatchOrSocket:


Assert((wakeEvents & (WL_EXIT_ON_POSTMASTER_DEATH | WL_POSTMASTER_DEATH)) != 0);


That ensures that all callers either use WL_EXIT_ON_POSTMASTER_DEATH, or 
WL_POSTMASTER_DEATH to handle it in the caller. Having to specify 
WL_EXIT_ON_POSTMASTER_DEATH reminds you that the call might exit(), so 
if that's not what you want, you need to do something else. But the 
assertion makes sure that all callers deal with postmaster death in some 
way.
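
For illustration, a typical call site under that scheme might look roughly 
like this (sketch only: WL_EXIT_ON_POSTMASTER_DEATH is the flag being 
proposed here, and the timeout, wait event and ProcessWorkToDo() are just 
placeholders):

	int			rc;

	rc = WaitLatch(MyLatch,
				   WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_POSTMASTER_DEATH,
				   1000L /* ms */ ,
				   PG_WAIT_EXTENSION);
	ResetLatch(MyLatch);

	/* no explicit WL_POSTMASTER_DEATH handling: the call exits for us */
	if (rc & WL_LATCH_SET)
		ProcessWorkToDo();		/* placeholder */

A caller that wants to handle postmaster death itself would instead pass 
WL_POSTMASTER_DEATH and check the returned event mask; the assertion keeps 
it from passing neither.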


- Heikki



Re: Cannot dump foreign key constraints on partitioned table

2018-07-13 Thread Alvaro Herrera
On 2018-Jul-13, amul sul wrote:

> Thanks for the prompt fix, patch [1] works for me.
> 
> 1]  https://postgr.es/m/20180712184537.5vjwgxlbuiomomqd@alvherre.pgsql

Thanks for checking, pushed.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] PATCH: multivariate histograms and MCV lists

2018-07-13 Thread Tomas Vondra



On 07/13/2018 01:19 PM, Dean Rasheed wrote:
> On 24 June 2018 at 20:45, Tomas Vondra  wrote:
>> Attached is a rebased version of this patch series, mostly just fixing
>> the breakage caused by reworked format of initial catalog data.
>>
>> Aside from that, the MCV building now adopts the logic introduced by
>> commit b5db1d93d2 for single-column MCV lists. The new algorithm seems
>> pretty good and I don't see why multi-column MCV lists should use
>> something special.
> 
> Agreed.
> 
> 
>> I'm sure there are plenty of open questions to discuss, particularly
>> stuff related to combining the various types of statistics to the final
>> estimate (a lot of that was already improved based on Dean's reviews).
> 
> Yes, that's definitely one of the trickier parts of this. I don't
> think that the current algorithm is ideal as it stands. In particular,
> the way that it attempts to handle complex combinations of clauses
> doesn't look right. I think mcv_clauselist_selectivity() and
> histogram_clauselist_selectivity() are plausibly correct, but the way
> that the resulting selectivities are combined in
> statext_clauselist_selectivity() doesn't seem right. In particular,
> the use of estimate_equality_groups() to count "nmatches" and
> "fullmatch" only takes into account top-level equality clauses, so it
> will fail to recognise other cases like (a=1 AND (b=1 OR b=2)) which
> might be fully covered by, say, the MCV stats. Consider, for example,
> the following simple test case:
> 
> 
> create table foo(a int, b int);
> insert into foo select 1,1 from generate_series(1,5);
> insert into foo select 1,2 from generate_series(1,4);
> insert into foo select 1,x/10 from generate_series(30,25) g(x);
> insert into foo select 2,1 from generate_series(1,3);
> insert into foo select 2,2 from generate_series(1,2);
> insert into foo select 2,x/10 from generate_series(30,50) g(x);
> insert into foo select 3,1 from generate_series(1,1);
> insert into foo select 3,2 from generate_series(1,5000);
> insert into foo select 3,x from generate_series(3,60) g(x);
> insert into foo select x,x/10 from generate_series(4,75) g(x);
> 
> create statistics foo_mcv_ab (mcv) on a,b from foo;
> analyse foo;
> 
> explain analyse select * from foo where a=1 and b=1;
>  -- Actual rows: 5, Estimated: 52690 (14149 without MV-stats)
> 
> explain analyse select * from foo where a=1 and b=2;
>  -- Actual rows: 4, Estimated: 41115 (10534 without MV-stats)
> 
> explain analyse select * from foo where a=1 and (b=1 or b=2);
>  -- Actual rows: 9, Estimated: 181091 (24253 without MV-stats)
> 
> explain analyse select * from foo where (a=1 or a=2) and (b=1 or b=2);
>  -- Actual rows: 14, Estimated: 276425 (56716 without MV-stats)
> 
> 
> In the first 2 queries the multivariate MCV stats help a lot and give
> good estimates, but in the last 2 queries the estimates are around
> twice as large as they should be, even though good MCV stats are
> available on those specific values.
> 

I'm not so sure. The issue is that a lot of the MCV deductions depend
on whether we can answer questions like "Is there a single match?" or
"If we got a match in the MCV, do we need to look at the non-MCV part?" This
is not very different from the single-column estimates, except that of course
here we need to look at multiple columns.

The top-level clauses allow us to make such deductions; with deeper
clauses it's much more difficult (perhaps impossible). For
example, with (a=1 AND b=1) there can be just a single match, so if we
find it in the MCV we're done. With clauses like ((a=1 OR a=2) AND (b=1 OR
b=2)) it's not that simple, because there may be multiple combinations,
and so a match in the MCV does not guarantee anything.

I don't think there's a way around this. The single-dimensional case
does not have this issue, of course.

> The tricky thing is to work out how to correctly combine the various
> stats that are available. In the above examples, the selectivity
> returned by mcv_clauselist_selectivity() would have been basically
> correct, but since it will have not been identified as a fullmatch and
> some non-equality clauses will have been seen at the top-level (the OR
> clauses), it ends up adding on additional selectivities from
> clauselist_selectivity().
> 
> I think perhaps it might be better not to attempt to combine the
> *overall* selectivity returned by mcv_clauselist_selectivity() with
> that returned by clauselist_selectivity(), but rather merge the logic
> of these two functions together into a single traversal of the
> clause-tree. That way, the various individual selectivities can be
> combined on a clause-by-clause basis to give the best running total
> based on the available information. Even when the multivariate stats
> are incomplete, they may still provide a useful lower bound on the
> selectivity.

I don't follow. The example you presented above showed multivariate
stats producing over-estimates, so how 

Re: Problem with tupdesc in jsonb_to_recordset

2018-07-13 Thread Tom Lane
Andrew Gierth  writes:
> What's happening in the original case is this: the SRF call protocol
> says that it's the executor's responsibility to free rsi.setDesc if it's
> not refcounted, under the assumption that in such a case it's in
> per-query memory (and not in either a shorter-lived or longer-lived
> context).
> The change to update_cached_tupdesc broke the protocol by causing
> populate_recordset_worker to set rsi.setDesc to a non-refcounted tupdesc
> allocated in a context that's not the per-query context.

Right, and more to the point, a non-refcounted tupdesc that the function
will use again if it's called again.

> What's not clear here is why populate_recordset_record thinks it needs
> to update the tupdesc for every record in a single result set. The
> entire result set will ultimately be treated as having the same tupdesc
> regardless, so any per-record changes will break things later anyway.

It's just convenient to check it there; it's not really expecting the
tupdesc to change intra-query.

The core of the problem here is that populate_record() is recursive,
in cases with nested composite types.  So we might be dealing either
with the outer composite type that is also going to be the returned
tuple type, or with some inner composite type that won't be.  It's
the desire to avoid repeated tupdesc lookups for the latter case that
motivates wanting to keep a tupdesc inside the fn_extra cache structure.
Otherwise we could do one lookup_rowtype_tupdesc per SRF call cycle and
pass back the result as rsi.setDesc, without caching it.

> My first approach - assuming that update_cached_tupdesc has good reason
> to make a copy, which I'm not convinced is the case - would be to simply
> make a per-query-context copy of the tupdesc to assign to rsi.setDesc in
> order to conform to the call protocol.

The alternative to making a copy would be finding a way to guarantee that
we decrement the refcount of the result of lookup_rowtype_tupdesc at end
of query (or whenever the fn_extra cache is discarded), rather than
immediately.  That seems like a mess.

I agree that making a copy to return as rsi.setDesc is the simplest and
safest fix.  If somebody produces a test case showing that that adds
unacceptable overhead, we could think about ways to improve it later;
but it's going to be more complicated and probably more fragile.
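
For concreteness, a minimal sketch of that copy (variable names are 
illustrative, not the actual jsonfuncs.c code) would allocate the returned 
descriptor in per-query memory:

	MemoryContext oldcontext;

	/* hand the executor a per-query-lifetime copy, not the cached tupdesc */
	oldcontext = MemoryContextSwitchTo(rsi->econtext->ecxt_per_query_memory);
	rsi->setDesc = CreateTupleDescCopy(cached_tupdesc);
	MemoryContextSwitchTo(oldcontext);

Here cached_tupdesc stands for the tupdesc kept in the fn_extra cache; the 
executor can then free rsi->setDesc without touching the cache.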

regards, tom lane



Re: Generating partitioning tuple conversion maps faster

2018-07-13 Thread Heikki Linnakangas

On 09/07/18 22:58, David Rowley wrote:

On 9 July 2018 at 23:28, Alexander Kuzmenkov  wrote:

On 07/09/2018 10:13 AM, David Rowley wrote:

I've attached v5.


v5 looks good to me, I've changed the status to ready.


Many thanks for reviewing this.


Pushed, thanks!

- Heikki



Finding database for pg_upgrade missing library

2018-07-13 Thread Bruce Momjian
I received a private pg_upgrade feature request to report the database
name for missing loadable libraries.  Currently we report "could not
load library" and the library file name, e.g. $libdir/pgpool-regclass.

The request is that we report the _database_ name that contained the
loadable library reference.  However, that isn't easy to do because we
gather all loadable library file names, sort them, and remove
duplicates, for reasons of efficiency and so we check libraries in a
predictable alphabetical order.

Is it worth modifying pg_upgrade to report the first or all databases
that contain references to missing loadable libraries?  I don't think
so, but I wanted to ask here.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +



Re: function lca('{}'::ltree[]) caused DB Instance crash

2018-07-13 Thread Tom Lane
Pierre Ducroquet  writes:
> On Friday, July 13, 2018 12:09:20 PM CEST 李海龙 wrote:
>> contrib_regression=# select lca('{}'::ltree[]);
>> server closed the connection unexpectedly

> There is indeed a bug. The _lca function in _ltree_op.c tries to allocate 0 
> bytes of memory, doesn't initialize it and dereference it in lca_inner.
> The attached basic patch fixes it.

I think the defense ought to be in lca_inner not there.

However, I don't understand why this code is returning NULL, rather than
a zero-length ltree, in the case that there's no common prefix.  That
doesn't seem consistent to me.

regards, tom lane



Re: GiST VACUUM

2018-07-13 Thread Heikki Linnakangas

On 13/07/18 16:41, Andrey Borodin wrote:
On 12 July 2018, at 21:07, Andrey Borodin wrote:
On 12 July 2018, at 20:40, Heikki Linnakangas wrote:
Actually, now that I think about it more, I'm not happy with leaving orphaned 
pages like that behind. Let's WAL-log the removal of the downlink and the 
marking of the leaf pages as deleted in one WAL record, to avoid that.


OK, will do this. But this will complicate WAL replay seriously, and I do not 
know a proper way to test that (BTW there is GiST amcheck in progress, but I 
decided to leave it for a while).

Done. Now the WAL record for a deleted page also removes the downlink from the internal page.
I had to use PageIndexTupleDelete() instead of PageIndexMultiDelete(), but
I do not think it will have any impact on performance.


Yeah, I think that's fine; this isn't that performance-critical.

But the situation in gistdoinsert(), where you encounter a deleted leaf page, 
could happen during normal operation, if vacuum runs concurrently with an 
insert. Insertion locks only one page at a time, as it descends the tree, so 
after it has released the lock on the parent, but before it has locked the 
child, vacuum might have deleted the page. In the latest patch, you're 
checking for that just before swapping the shared lock for an exclusive one, 
but I think that's wrong; you need to check for that after swapping the lock, 
because otherwise vacuum might delete the page while you're not holding the lock.

Looks like a valid concern, I'll move that code again.

Done.


Ok, the comment now says:


+	/*
+	 * Leaf pages can be left deleted but still referenced in case of
+	 * crash during VACUUM's gistbulkdelete()
+	 */


But that's not accurate, right? You should never see deleted pages after 
a crash, because the parent is updated in the same WAL record as the 
child page, right?


I'm still a bit scared about using pd_prune_xid to store the XID that 
prevents recycling the page too early. Can we use some field in 
GISTPageOpaqueData for that, similar to how the B-tree stores it in 
BTPageOpaqueData?
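
To illustrate, that alternative might look roughly like this (the 
GistPageSetDeleteXid/GistPageGetDeleteXid accessors are hypothetical, not 
committed code):

	/* in gistbulkdelete(), when marking the empty leaf as deleted */
	GistPageSetDeleteXid(page, ReadNewTransactionId());	/* hypothetical */
	GistPageSetDeleted(page);

	/* later, before recycling the page via the FSM */
	if (GistPageIsDeleted(page) &&
		TransactionIdPrecedes(GistPageGetDeleteXid(page),	/* hypothetical */
							  RecentGlobalXmin))
		RecordFreeIndexPage(rel, blkno);

That keeps the XID out of the generic page header, the same way B-tree keeps 
it in BTPageOpaqueData.btpo.xact.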


- Heikki



Re: Expression errors with "FOR UPDATE" and postgres_fdw with partition wise join enabled.

2018-07-13 Thread Ashutosh Bapat
On Thu, Jul 12, 2018 at 4:32 PM, Etsuro Fujita
 wrote:
>
> In this example, the value of the whole-row reference to the child table
> ptp1 for that record is ('foo',1), and that of the index expression for that
> record is (1,'foo').  Those have different column orders, but the latter
> could be mapped to the former by a technique like do_convert_tuple.  So,
> what I suspect is: by producing the value of the whole-row reference from
> that of the index expression like that,

The expression in this case would look like ptp1::pt::ptp1, which won't
match the targetlist expression ptp1. I am also doubtful that the planner
will be able to deduce that it needs to apply an inverse function of
::pt, or what exactly such an inverse function is. So an index-only scan
won't be picked.

> we could support index-only scans
> with such an index in the case where we have the whole-row reference in the
> targetlist, not the index expression itself.

Can you please show an index only scan path being created in this case?


>
>> If that time is long
>> enough, that's ok. In this case, I agree that if we haven't heard from
>> users till now that they need to create indexes on whole-row
>> expressions, there's chance that we will never implement this feature.
>
>
> I don't object to adding that feature to future versions of PG if required.
>
>So, I think it is safe to have whole-row
> Vars instead of ConvertRowtypeExprs in the targetlist.



 when it's looking for an expression, it finds a whole-row expression
 so it think it needs to add a ConvertRowtypeExpr on that. But when the
 plan is created, there is ConvertRowtypeExpr already, but there is no
 way to know that a new ConvertRowtypeExpr is not needed anymore. So,
 we may have two ConvertRowtypeExprs giving wrong results.
>>>
>>>
>>>
>>> Maybe I'm missing something, but I don't think that we need to worry
>>> about
>>> that, because in the approach I proposed, we only add CREs above
>>> whole-row
>>> Vars in the targetlists for subplans of an Append/MergeAppend for a
>>> partitioned relation at plan creation time.
>>
>>
>> There's a patch in an adjacent thread started by David Rowley to rip
>> out Append/MergeAppend when there is only one subplan. So, your
>> solution won't work there.
>
>
> Thanks for sharing that information!  I skimmed the thread.  I haven't yet
> caught up with all the discussions there, so I think I'm missing something,
> but it looks like that we haven't yet reached any consensus on the way to
> go.  In my opinion, I like the approach mentioned in [1].  And if we go that
> way, my patch seems to fit into that, because in that approach the
> Append/MergeAppend could be removed after adjusting the targetlists for its
> subplans in create_append_plan/create_merge_append_plan.  Anyway, I'd like
> to join in that work for PG12.

Whatever the outcome of that work may be, I think what we fix here
shouldn't need to be reverted a few months from now just so that
that patch works.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: BUG #15212: Default values in partition tables don't work as expected and allow NOT NULL violation

2018-07-13 Thread Ashutosh Bapat
On Thu, Jul 12, 2018 at 2:29 PM, Amit Langote
 wrote:
> Thanks Ashutosh.
>
> On 2018/07/10 22:50, Ashutosh Bapat wrote:
>> I didn't see any hackers thread linked to this CF entry. Hence sending this 
>> mail through CF app.
>
> Hmm, yes.  I hadn't posted the patch to -hackers.
>
>> The patch looks good to me. It applies cleanly, compiles cleanly and make 
>> check passes.
>>
>> I think you could avoid condition
>> +   /* Do not override parent's NOT NULL constraint. */
>> +   if (restdef->is_not_null)
>> +   coldef->is_not_null = restdef->is_not_null;
>>
>> by rewriting this line as
>> coldef->is_not_null = coldef->is_not_null || restdef->is_not_null;
>
> Agreed, done like that.
>
>> The comment looks a bit odd either way since we are changing 
>> coldef->is_not_null based on restdef->is_not_null. That's because of the 
>> nature of boolean variables. May be a bit of explanation is needed.
>
> I have modified the comments around this code in the updated patch.

+/*
+ * Each member in 'saved_schema' contains a ColumnDef containing
+ * partition-specific options for the column.  Below, we merge the
+ * information from each into the ColumnDef of the same name found in
+ * the original 'schema' list before deleting it from the list.  So,
+ * once we've finished processing all the columns from the original
+ * 'schema' list, there shouldn't be any ColumnDefs left that came
+ * from the 'saved_schema' list.
+ */

This is conveyed by the prologue of the function.


+/*
+ * Combine using OR so that the NOT NULL constraint
+ * in the parent's definition doesn't get lost.
+ */
This too is specified in prologue as
 * Constraints (including NOT NULL constraints) for the child table
 * are the union of all relevant constraints, from both the child schema
 * and parent tables.

So, I don't think we need any special mention of OR, "union" conveys
the intention.

>
>> On a side note, I see
>> coldef->constraints = restdef->constraints;
>> Shouldn't we merge the constraints instead of just overwriting those?
>
> Actually, I checked that coldef->constraints is always NIL in this case.
> coldef (ColumnDef) is constructed from a member in the parent's TupleDesc
> earlier in that function and any constraints that may be present in the
> parent's definition of the column are saved in a separate variable that's
> returned to the caller as one list containing "old"/inherited constraints.
>  So, no constraints are getting lost here.

In case of multiple inheritance, coldef->constraints is the "union" of
constraints from all the parents, as described in the prologue. But in
case of partitioning there is only one parent, so
coldef->constraints is NULL and just overwriting it works. I
think we need comments mentioning this special case.

Also, I am not sure whether we really need all conditions related to
raw_default and cooked_default. Do you have any testcase showing the
need for those changes?

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: Vacuum: allow usage of more than 1GB of work mem

2018-07-13 Thread Heikki Linnakangas

On 13/07/18 01:39, Andrew Dunstan wrote:

On 07/12/2018 06:34 PM, Alvaro Herrera wrote:

On 2018-Jul-12, Andrew Dunstan wrote:


I fully understand. I think this needs to go back to "Waiting on Author".

Why?  Heikki's patch applies fine and passes the regression tests.


Well, I understood Claudio was going to do some more work (see
upthread).


Claudio raised a good point, that doing small pallocs leads to 
fragmentation, and in particular, it might mean that we can't give back 
the memory to the OS. The default glibc malloc() implementation has a 
threshold of 4 or 32 MB or something like that - allocations larger than 
the threshold are mmap()'d, and can always be returned to the OS. I 
think a simple solution to that is to allocate larger chunks, something 
like 32-64 MB at a time, and carve out the allocations for the nodes 
from those chunks. That's pretty straightforward, because we don't need 
to worry about freeing the nodes in retail. Keep track of the current 
half-filled chunk, and allocate a new one when it fills up.
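
A minimal sketch of that kind of chunked allocation (the sizes and names 
here are made up, assuming the usual backend palloc/MAXALIGN utilities) 
could look like:

#define NODE_CHUNK_SIZE		(64 * 1024 * 1024)	/* 64 MB carved up per chunk */

typedef struct NodeChunk
{
	struct NodeChunk *next;		/* previously filled chunks */
	Size		used;			/* bytes already handed out */
	char		data[FLEXIBLE_ARRAY_MEMBER];
} NodeChunk;

static NodeChunk *cur_chunk = NULL;

static void *
chunk_alloc_node(Size size)
{
	void	   *ptr;

	size = MAXALIGN(size);
	if (cur_chunk == NULL || cur_chunk->used + size > NODE_CHUNK_SIZE)
	{
		/* start a new chunk; old ones stay around until the whole tree dies */
		NodeChunk  *chunk = palloc(offsetof(NodeChunk, data) +
								   NODE_CHUNK_SIZE);

		chunk->next = cur_chunk;
		chunk->used = 0;
		cur_chunk = chunk;
	}
	ptr = (void *) (cur_chunk->data + cur_chunk->used);
	cur_chunk->used += size;
	return ptr;
}

Since the nodes are never freed individually, releasing the memory is just a 
matter of dropping the chunks (or the containing memory context) in one go, 
which malloc can then hand back to the OS.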


He also wanted to refactor the iterator API, to return one ItemPointer 
at a time. I don't think that's necessary, the current iterator API is 
more convenient for the callers, but I don't feel strongly about that.


Anything else?


If we're going to go with Heikki's patch then do we need to
change the author, or add him as an author?


Let's list both of us. At least in the commit message, doesn't matter 
much what the commitfest app says.


- Heikki



Re: file cloning in pg_upgrade and CREATE DATABASE

2018-07-13 Thread Michael Paquier
On Fri, Jul 13, 2018 at 10:22:21AM +0200, Peter Eisentraut wrote:
> On 13.07.18 07:09, Thomas Munro wrote:
>> TIL that Solaris 11.4 (closed) ZFS supports reflink() too.  Sadly,
>> it's not in OpenZFS though I see numerous requests and discussions...
> 
> I look forward to your FreeBSD patch then. ;-)

+1.
--
Michael


signature.asc
Description: PGP signature


Re: Fix some error handling for read() and errno

2018-07-13 Thread Michael Paquier
On Mon, Jun 25, 2018 at 04:18:18PM +0900, Michael Paquier wrote:
> As this one is done, I have been looking at that this thread again.
> Peter Eisentraut has pushed as e5d11b9 something which does not need to
> worry about pluralization of error messages.  So I have moved to this
> message style for all messages.  All of this is done as 0001.

Mr. Robot has been complaining about this patch set, so attached is a
rebased version.  Thinking about it, I would tend to just merge 0001 and
give up on 0002 as that may not justify future backpatch pain.  Thoughts
are welcome.
--
Michael
From c1bf79e30eae60ae3d57f28b8ba28361c5447468 Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Mon, 25 Jun 2018 14:37:18 +0900
Subject: [PATCH 1/2] Rework error messages around file handling

Some error messages related to file handling are using the code path
context to define their state.  For example, 2PC-related errors are
referring to "two-phase status files", or "relation mapping file" is
used for catalog-to-filenode mapping, however those prove to be
difficult to translate, and are not more helpful than just referring to
the path of the file being manipulated.  So simplify all those error
messages by just referring to files with their path used.  In some
cases, like the manipulation of WAL segments, the context is helpful so
those are kept.

Calls to the system function read() have also been rather inconsistent
in their error handling, sometimes not reporting the number of bytes
read, and some other code paths trying to use an errno which has not
been set.  With this patch the in-core functions use a more consistent
pattern, which reports errno when it is set and otherwise reports an
incomplete read with the number of bytes actually read.

So as to care about pluralization when reading an unexpected number of
bytes, "could not read: read %d of %d" is used as error message.

Author: Michael Paquier
Discussion: https://postgr.es/m/20180622234803.ga1...@paquier.xyz
---
 src/backend/access/transam/twophase.c   | 40 ++--
 src/backend/access/transam/xlog.c   | 40 
 src/backend/replication/logical/origin.c| 13 +++-
 src/backend/replication/logical/snapbuild.c | 68 +++--
 src/backend/replication/slot.c  | 26 +---
 src/backend/replication/walsender.c | 16 -
 src/backend/utils/cache/relmapper.c | 28 ++---
 src/bin/pg_basebackup/pg_receivewal.c   | 12 +++-
 src/bin/pg_rewind/file_ops.c| 14 -
 src/bin/pg_rewind/parsexlog.c   | 14 -
 src/bin/pg_waldump/pg_waldump.c | 17 --
 11 files changed, 200 insertions(+), 88 deletions(-)

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index e8d4e37fe3..0c99b33664 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1219,6 +1219,7 @@ ReadTwoPhaseFile(TransactionId xid, bool give_warnings)
 	uint32		crc_offset;
 	pg_crc32c	calc_crc,
 file_crc;
+	int			r;
 
 	TwoPhaseFilePath(path, xid);
 
@@ -1228,8 +1229,7 @@ ReadTwoPhaseFile(TransactionId xid, bool give_warnings)
 		if (give_warnings)
 			ereport(WARNING,
 	(errcode_for_file_access(),
-	 errmsg("could not open two-phase state file \"%s\": %m",
-			path)));
+	 errmsg("could not open file \"%s\": %m", path)));
 		return NULL;
 	}
 
@@ -1249,8 +1249,7 @@ ReadTwoPhaseFile(TransactionId xid, bool give_warnings)
 			errno = save_errno;
 			ereport(WARNING,
 	(errcode_for_file_access(),
-	 errmsg("could not stat two-phase state file \"%s\": %m",
-			path)));
+	 errmsg("could not stat file \"%s\": %m", path)));
 		}
 		return NULL;
 	}
@@ -1277,7 +1276,8 @@ ReadTwoPhaseFile(TransactionId xid, bool give_warnings)
 	buf = (char *) palloc(stat.st_size);
 
 	pgstat_report_wait_start(WAIT_EVENT_TWOPHASE_FILE_READ);
-	if (read(fd, buf, stat.st_size) != stat.st_size)
+	r = read(fd, buf, stat.st_size);
+	if (r != stat.st_size)
 	{
 		int			save_errno = errno;
 
@@ -1285,11 +1285,17 @@ ReadTwoPhaseFile(TransactionId xid, bool give_warnings)
 		CloseTransientFile(fd);
 		if (give_warnings)
 		{
-			errno = save_errno;
-			ereport(WARNING,
-	(errcode_for_file_access(),
-	 errmsg("could not read two-phase state file \"%s\": %m",
-			path)));
+			if (r < 0)
+			{
+errno = save_errno;
+ereport(WARNING,
+		(errcode_for_file_access(),
+		 errmsg("could not read file \"%s\": %m", path)));
+			}
+			else
+ereport(WARNING,
+		(errmsg("could not read file \"%s\": read %d of %zu",
+path, r, stat.st_size)));
 		}
 		pfree(buf);
 		return NULL;
@@ -1632,8 +1638,7 @@ RemoveTwoPhaseFile(TransactionId xid, bool giveWarning)
 		if (errno != ENOENT || giveWarning)
 			ereport(WARNING,
 	(errcode_for_file_access(),
-	 errmsg("could not remove two-phase state file \"%s\": %m",
-			path)));
+	 errmsg("could not remove file \"%s\": %m", path)));
 }
 
 /*
@@ -1661,8 +1666,7 @@ 

Re: partition pruning doesn't work with IS NULL clause in multikey range partition case

2018-07-13 Thread Ashutosh Bapat
On Fri, Jul 13, 2018 at 1:15 PM, Amit Langote
 wrote:
>>
>> I don't think this is true. When equality conditions and IS NULL clauses 
>> cover
>> all partition keys of a hash partitioned table and do not have contradictory
>> clauses, we should be able to find the partition which will remain unpruned.
>
> I was trying to say the same thing, but maybe the comment doesn't make that clear.
>  How about this:
>
> + * For hash partitioning, if we found IS NULL clauses for some keys and
> + * OpExpr's for others, gen_prune_steps_from_opexps() will generate the
> + * necessary pruning steps if all partition keys are taken care of.
> + * If we found only IS NULL clauses, then we can generate the pruning
> + * step here but only if all partition keys are covered.
>

It's still a bit confusing. I think "taken care of" doesn't convey the
same meaning as "covered". I don't think the second sentence is
necessary; it talks about one special case of the first sentence. But
this is better than the prior version.

>> I
>> see that we already have this supported in get_matching_hash_bounds()
>> /*
>>  * For hash partitioning we can only perform pruning based on equality
>>  * clauses to the partition key or IS NULL clauses.  We also can only
>>  * prune if we got values for all keys.
>>  */
>> if (nvalues + bms_num_members(nullkeys) == partnatts)
>> {
>>
>>   */
>> -if (!generate_opsteps)
>> +if (!bms_is_empty(nullkeys) &&
>> +(part_scheme->strategy != PARTITION_STRATEGY_HASH ||
>> + bms_num_members(nullkeys) == part_scheme->partnatts))
>>
>> So, it looks like we don't need bms_num_members(nullkeys) ==
>> part_scheme->partnatts there.
>
> Yes, we can perform pruning in all three cases for hash partitioning:
>
> * all keys are covered by OpExprs providing non-null keys
>
> * some keys are covered by IS NULL clauses, others by OpExprs (all keys
>   covered)
>
> * all keys are covered by IS NULL clauses
>
> ... as long as we generate a pruning step at all.  The issue here was that
> we skipped generating the pruning step due to poorly coded condition in
> gen_partprune_steps_internal in some cases.
>
>> Also, I think, we don't know how some new partition strategy will treat NULL
>> values so above condition looks wrong to me. Instead it should explicitly 
>> check
>> the strategies for which we know that the NULL values go to a single 
>> partition.
>
> How about if we explicitly spell out the strategies like this:
>
> +if (!bms_is_empty(nullkeys) &&
> +(part_scheme->strategy == PARTITION_STRATEGY_LIST ||
> + part_scheme->strategy == PARTITION_STRATEGY_RANGE ||
> + (part_scheme->strategy == PARTITION_STRATEGY_HASH &&
> +  bms_num_members(nullkeys) == part_scheme->partnatts)))

I still do not understand why we need (part_scheme->strategy ==
PARTITION_STRATEGY_HASH && bms_num_members(nullkeys) ==
part_scheme->partnatts) there. I thought that this would be handled by
the code which deals with null keys and op_exprs together, somewhere
else.

>
> The proposed comment also describes why the condition looks like that.
>
>>  /*
>> - * Note that for IS NOT NULL clauses, simply having step suffices;
>> - * there is no need to propagate the exact details of which keys are
>> - * required to be NOT NULL.  Hash partitioning expects to see actual
>> - * values to perform any pruning.
>> + * There are no OpExpr's, but there are IS NOT NULL clauses, which
>> + * can be used to eliminate the null-partition-key-only partition.
>>
>> I don't understand this. When there are IS NOT NULL clauses for all the
>> partition keys, it's only then that we could eliminate the partition 
>> containing
>> NULL values, not otherwise.
>
> Actually, there is only one case where the pruning step generated by that
> block of code is useful -- to prune a list partition that accepts only
> nulls.  List partitioning only allows one partition key, so this works,
> but let's say only accidentally.  I modified the condition as follows:
>
> +else if (!generate_opsteps && !bms_is_empty(notnullkeys) &&
> + bms_num_members(notnullkeys) == part_scheme->partnatts)
>

Hmm, OK. Maybe it's better to explicitly say that it's only useful in
the case of a list-partitioned table.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: How to make partitioning scale better for larger numbers of partitions

2018-07-13 Thread Ashutosh Bapat
On Fri, Jul 13, 2018 at 9:23 AM, Kato, Sho  wrote:
>>I wondered if you compared to PG10 or to inheritence-partitioning (parent 
>>with relkind='r' and either trigger or rule or >INSERT/UPDATE directly into 
>>child) ?
>
> Thank you for your reply.
>
> I compared to PG11beta2 with non-partitioned table.
>
> Non-partitioned table has 1100 records in one table.
> Partitioned table has one record on each leaf partitions.
>

I don't think partitioning should be employed this way, even for the 
sake of comparison. Depending upon the size of each tuple, if 1100 tuples 
are inserted into a single table they will probably occupy a few 
hundred pages. In a partitioned table with one tuple per partition 
they will occupy at least 1100 pages. There are also other space and locking 
overheads to maintain 1100 tables. I think the right way to compare is 
to have data large enough to really require 1100 partitions, and 
then compare performance by putting that data in 1100 partitions and 
in an unpartitioned table. Even with that kind of data you will see 
some difference in performance, but it won't be as dramatic as you 
report.

I might be missing something though.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



Re: patch to allow disable of WAL recycling

2018-07-13 Thread Jerry Jelinek
Thanks to everyone who has taken the time to look at this patch and provide
all of the feedback.

I'm going to wait another day to see if there are any more comments. If
not, then first thing next week, I will send out a revised patch with
improvements to the man page change as requested. If anyone has specific
things they want to be sure are covered, please just let me know.

Thanks again,
Jerry


On Thu, Jul 12, 2018 at 7:06 PM, Thomas Munro  wrote:

> On Thu, Jul 12, 2018 at 10:52 PM, Tomas Vondra
>  wrote:
> > I don't follow Alvaro's reasoning, TBH. There's a couple of things that
> > confuse me ...
> >
> > I don't quite see how reusing WAL segments actually protects against full
> > filesystem? On "traditional" filesystems I would not expect any
> difference
> > between "unlink+create" and reusing an existing file. On CoW filesystems
> > (like ZFS or btrfs) the space management works very differently and
> reusing
> > an existing file is unlikely to save anything.
>
> Yeah, I had the same thoughts.
>
> > But even if it reduces the likelihood of ENOSPC, it does not eliminate it
> > entirely. max_wal_size is not a hard limit, and the disk may be filled by
> > something else (when WAL is not on a separate device, when there is think
> > provisioning, etc.). So it's not a protection against data corruption we
> > could rely on. (And as was discussed in the recent fsync thread, ENOSPC
> is a
> > likely source of past data corruption issues on NFS and possibly other
> > filesystems.)
>
> Right.  That ENOSPC discussion was about checkpointing though, not
> WAL.  IIUC the hypothesis was that there may be stacks (possibly
> involving NFS or thin provisioning, or perhaps historical versions of
> certain local filesystems that had reservation accounting bugs, on a
> certain kernel) that could let you write() a buffer, and then later
> when the checkpointer calls fsync() the filesystem says ENOSPC, the
> kernel reports that and throws away the dirty page, and then at next
> checkpoint fsync() succeeds but the checkpoint is a lie and the data
> is smoke.
>
> We already PANIC on any errno except EINTR in XLogWrite(), as seen
> in Jerry's nearby stack trace, so that failure mode seems to be
> covered already for WAL, no?
>
> > AFAICS the original reason for reusing WAL segments was the belief that
> > overwriting an existing file is faster than writing a new file. That
> might
> > have been true in the past, but the question is if it's still true on
> > current filesystems. The results posted here suggest it's not true on
> ZFS,
> > at least.
>
> Yeah.
>
> The wal_recycle=on|off patch seems reasonable to me (modulo Andres's
> comments about the documentation; we should make sure that the 'off'
> setting isn't accidentally recommended to the wrong audience) and I
> vote we take it.
>
> Just by the way, if I'm not mistaken ZFS does avoid faulting when
> overwriting whole blocks, just like other filesystems:
>
> https://github.com/freebsd/freebsd/blob/master/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c#L1034
>
> So then where are those faults coming from?  Perhaps the tree page
> that holds the block pointer, of which there must be many when the
> recordsize is small?
>
> --
> Thomas Munro
> http://www.enterprisedb.com
>


Add RESPECT/IGNORE NULLS and FROM FIRST/LAST options

2018-07-13 Thread Oliver Ford
Adds the options RESPECT/IGNORE NULLS (null treatment clause) and FROM
FIRST/LAST to the non-aggregate window functions.

A previous patch
(https://www.postgresql.org/message-id/CA+=vxna5_n1q5q5okxc0aqnndbo2ru6gvw+86wk+onsunjd...@mail.gmail.com)
partially implemented this feature. However, that patch worked by
adding the null treatment clause to the window frame's frameOptions
variable, and consequently had the limitation that it wasn't possible
to reuse a window frame definition in a single query where two
functions were called that had different null treatment options. This
meant that the patch was never committed. The attached path takes a
different approach which gets around this limitation.

For example, the following query would not work correctly with the
implementation in the old patch but does with the attached patch:

WITH cte (x) AS (
select null union select 1 union select 2)
SELECT x,
first_value(x) over w as with_default,
first_value(x) respect nulls over w as with_respect,
first_value(x) ignore nulls over w as with_ignore
from cte WINDOW w as (order by x nulls first rows between unbounded
preceding and unbounded following);
 x | with_default | with_respect | with_ignore
---+--------------+--------------+-------------
   |              |              |           1
 1 |              |              |           1
 2 |              |              |           1
(3 rows)


== Implementation ==

The patch adds two types to the pg_type catalog: "ignorenulls" and
"fromlast". These types are of the Boolean category, and work as
wrappers around the bool type. They are used as function arguments to
extra versions of the window functions that take additional boolean
arguments. RESPECT NULLS and FROM FIRST are ignored by the parser, but
IGNORE NULLS and FROM LAST lead to the extra versions being called
with arguments to ignore nulls and order from last.

== Testing ==

Updated documentation and added regression tests. All existing tests
pass. This change will need a catversion bump.
Thanks to Krasiyan Andreev for initially testing this patch.


0001-respect.patch
Description: Binary data


Re: automatic restore point

2018-07-13 Thread Michael Paquier
On Fri, Jul 13, 2018 at 08:16:00AM +, Yotsunaga, Naoki wrote:
> Do you feel it is too complicated?

In short, yes.
--
Michael


signature.asc
Description: PGP signature


Re: [HACKERS] PATCH: multivariate histograms and MCV lists

2018-07-13 Thread Dean Rasheed
On 24 June 2018 at 20:45, Tomas Vondra  wrote:
> Attached is a rebased version of this patch series, mostly just fixing
> the breakage caused by reworked format of initial catalog data.
>
> Aside from that, the MCV building now adopts the logic introduced by
> commit b5db1d93d2 for single-column MCV lists. The new algorithm seems
> pretty good and I don't see why multi-column MCV lists should use
> something special.

Agreed.


> I'm sure there are plenty of open questions to discuss, particularly
> stuff related to combining the various types of statistics to the final
> estimate (a lot of that was already improved based on Dean's reviews).

Yes, that's definitely one of the trickier parts of this. I don't
think that the current algorithm is ideal as it stands. In particular,
the way that it attempts to handle complex combinations of clauses
doesn't look right. I think mcv_clauselist_selectivity() and
histogram_clauselist_selectivity() are plausibly correct, but the way
that the resulting selectivities are combined in
statext_clauselist_selectivity() doesn't seem right. In particular,
the use of estimate_equality_groups() to count "nmatches" and
"fullmatch" only takes into account top-level equality clauses, so it
will fail to recognise other cases like (a=1 AND (b=1 OR b=2)) which
might be fully covered by, say, the MCV stats. Consider, for example,
the following simple test case:


create table foo(a int, b int);
insert into foo select 1,1 from generate_series(1,5);
insert into foo select 1,2 from generate_series(1,4);
insert into foo select 1,x/10 from generate_series(30,25) g(x);
insert into foo select 2,1 from generate_series(1,3);
insert into foo select 2,2 from generate_series(1,2);
insert into foo select 2,x/10 from generate_series(30,50) g(x);
insert into foo select 3,1 from generate_series(1,1);
insert into foo select 3,2 from generate_series(1,5000);
insert into foo select 3,x from generate_series(3,60) g(x);
insert into foo select x,x/10 from generate_series(4,75) g(x);

create statistics foo_mcv_ab (mcv) on a,b from foo;
analyse foo;

explain analyse select * from foo where a=1 and b=1;
 -- Actual rows: 5, Estimated: 52690 (14149 without MV-stats)

explain analyse select * from foo where a=1 and b=2;
 -- Actual rows: 4, Estimated: 41115 (10534 without MV-stats)

explain analyse select * from foo where a=1 and (b=1 or b=2);
 -- Actual rows: 9, Estimated: 181091 (24253 without MV-stats)

explain analyse select * from foo where (a=1 or a=2) and (b=1 or b=2);
 -- Actual rows: 14, Estimated: 276425 (56716 without MV-stats)


In the first 2 queries the multivariate MCV stats help a lot and give
good estimates, but in the last 2 queries the estimates are around
twice as large as they should be, even though good MCV stats are
available on those specific values.

The tricky thing is to work out how to correctly combine the various
stats that are available. In the above examples, the selectivity
returned by mcv_clauselist_selectivity() would have been basically
correct, but since it will have not been identified as a fullmatch and
some non-equality clauses will have been seen at the top-level (the OR
clauses), it ends up adding on additional selectivities from
clauselist_selectivity().

I think perhaps it might be better not to attempt to combine the
*overall* selectivity returned by mcv_clauselist_selectivity() with
that returned by clauselist_selectivity(), but rather merge the logic
of these two functions together into a single traversal of the
clause-tree. That way, the various individual selectivities can be
combined on a clause-by-clause basis to give the best running total
based on the available information. Even when the multivariate stats
are incomplete, they may still provide a useful lower bound on the
selectivity. If/when all MCV columns have been matched exactly, that
lower bound might turn into the appropriate overall result, if there
is a matching MCV entry. For example, suppose that there are MCV stats
on 3 columns a,b,c and a WHERE clause like (a=1 AND b=2 AND c=3). You
might process that something like:

* Get sel(a=1) using the normal univariate stats
* Update the multivariate MCV stats match bitmap based on a=1
* Get sel(b=2) using the normal univariate stats
* Compute the total selectivity assuming independence:
total_sel = sel(a=1)*sel(b=2)
* Update the multivariate MCV stats match bitmap based on b=2
* Compute the multivariate MCV selectivity so far:
mcv_sel = Sum of MCV frequencies that match so far
* Use that as a lower bound:
total_sel = Max(total_sel, mcv_sel)
* Get sel(c=3) using the normal univariate stats
* Compute the new total selectivity assuming independence:
total_sel *= sel(c=3)
* Update the multivariate MCV stats match bitmap based on c=3
* Compute the new multivariate MCV selectivity:
mcv_sel = Sum of MCV frequencies that now match
* Use that as a new lower bound:
total_sel = 
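
A rough sketch of that clause-by-clause combination in C (the MCV list 
layout and mcv_item_matches() are assumed for illustration; this is not the 
patch's actual API):

static Selectivity
clauselist_sel_with_mcv(PlannerInfo *root, List *clauses,
						MCVList *mcv, bool *match)
{
	Selectivity total_sel = 1.0;
	ListCell   *lc;

	foreach(lc, clauses)
	{
		Node	   *clause = (Node *) lfirst(lc);
		Selectivity mcv_sel = 0.0;
		int			i;

		/* univariate estimate, combined under the independence assumption */
		total_sel *= clause_selectivity(root, clause, 0, JOIN_INNER, NULL);

		/* drop MCV items that no longer match all clauses seen so far */
		for (i = 0; i < mcv->nitems; i++)
		{
			if (match[i] && !mcv_item_matches(mcv, i, clause))	/* assumed */
				match[i] = false;
			if (match[i])
				mcv_sel += mcv->items[i].frequency;
		}

		/* the mass of still-matching MCV items is a lower bound */
		total_sel = Max(total_sel, mcv_sel);
	}

	return total_sel;
}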

Re: [HACKERS] WIP: Data at rest encryption

2018-07-13 Thread Toshi Harada
Hi.

I am interested in the "WIP: Data at rest encryption" patch.
I applied this patch ("data-at-rest-encryption-wip-2018.06.27.patch") to
PostgreSQL 11 beta2 and it is running.

Since the patch description mentions "data stored during logical decoding",
we tried logical decoding with the test_decoding module, but the following
error occurs when creating a slot.


pgbench_db=# SELECT * FROM pg_create_logical_replication_slot('my_slot', 
'test_decoding');
ERROR:  invalid magic number B419 in log segment 00020010, 
offset 0
pgbench_db=#


(Also, if you run "CREATE SUBSCRIPTION" for logical replication from another 
server, a similar error will occur.)

Question.
In "data-at-rest-encryption-wip-2018.06.27.patch", is logical decoding still 
not implemented?
Or is there a need for another logical decoding plug-in for "WIP: Data at rest 
encryption"?

Regards.

Antonin Houska  wrote:
> Ants Aasma  wrote:
> 
> > Attached to this mail is a work in progress patch that adds an
> > extensible encryption mechanism. There are some loose ends left to tie
> > up, but the general concept and architecture is at a point where it's
> > ready for some feedback, fresh ideas and bikeshedding.
> 
> Rebased patch is attached here, in case it helps to achieve (some of) the
> goals mentioned in the related thread [1].
> 
> Besides encrypting table and WAL pages, it encrypts the temporary files
> (buffile.c), data stored during logical decoding (reorderbuffer.c) and
> statistics temporary files (pgstat.c). Unlike the previous version, SLRU files
> (e.g. CLOG) are not encrypted (it does not seem critical and the encryption
> makes torn page write quite difficult to handle).
> 
> Another difference is that we use the OpenSSL of the (tweaked) AES XTS cipher
> now.
> 
> Binary upgrade from unencrypted to encrypted cluster is not implemented yet.
> 
> 
> [1] 
> https://www.postgresql.org/message-id/031401d3f41d$5c70ed90$1552c8b0$@lab.ntt.co.jp
> 
> -- 
> Antonin Houska
> Cybertec Schönig & Schönig GmbH
> Gröhrmühlgasse 26, A-2700 Wiener Neustadt
> Web: https://www.cybertec-postgresql.com
> 





Re: function lca('{}'::ltree[]) caused DB Instance crash

2018-07-13 Thread Pierre Ducroquet
On Friday, July 13, 2018 12:09:20 PM CEST 李海龙 wrote:
> HI,Oleg && pgsql-hackers
> 
> Plese help me to check this is a bug of ltree?
> 

Hi

There is indeed a bug. The _lca function in _ltree_op.c tries to allocate 0 
bytes of memory, doesn't initialize it and dereference it in lca_inner.
The attached basic patch fixes it.

Regards

 Pierre
>From 4e59747cea428d39c80974c408e95ba86bf63ecc Mon Sep 17 00:00:00 2001
From: Pierre Ducroquet 
Date: Fri, 13 Jul 2018 12:47:36 +0200
Subject: [PATCH] Fix segfault with lca('{}'::ltree[])

---
 contrib/ltree/_ltree_op.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/contrib/ltree/_ltree_op.c b/contrib/ltree/_ltree_op.c
index 9bb6bcaeff..6afcd95cd1 100644
--- a/contrib/ltree/_ltree_op.c
+++ b/contrib/ltree/_ltree_op.c
@@ -297,6 +297,9 @@ _lca(PG_FUNCTION_ARGS)
 	ltree	  **a,
 			   *res;
 
+	if (num == 0)
+		PG_RETURN_NULL();
+
 	if (ARR_NDIM(la) > 1)
 		ereport(ERROR,
 (errcode(ERRCODE_ARRAY_SUBSCRIPT_ERROR),
-- 
2.18.0



function lca('{}'::ltree[]) caused DB Instance crash

2018-07-13 Thread 李海龙


HI,Oleg && pgsql-hackers

Plese help me to check this is a bug of ltree?


thxs!




lhl@localhost:~$ cat /etc/issue
Ubuntu 14.04.5 LTS \n \l

lhl@localhost:~$ uname -av
Linux localhost 3.13.0-107-generic #154-Ubuntu SMP Tue Dec 20 09:57:27 UTC 2016 
x86_64 x86_64 x86_64 GNU/Linux


lhl@localhost:~$ psql mydb
psql (10.3)
Type "help" for help.

mydb=# select version();
version
---
 PostgreSQL 10.3 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 
4.8.4-2ubuntu1~14.04.3) 4.8.4, 64-bit
(1 row)

mydb=# \dx ltree
List of installed extensions
 Name  | Version | Schema |   Description
---+-++-
 ltree | 1.1 | public | data type for hierarchical tree-like structures
(1 row)

mydb=# select lca('{}'::ltree[]);
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.




in /var/log/syslog

Jul 13 18:02:39 localhost kernel: [25703.907912] postgres[8869]: segfault at 4 
ip 7f6bad8c2ca6 sp 7fffd8f64560 error 4 in ltree.so[7f6bad8be000+d000]


in log of PostgreSQL Datadir

[lhl mydb psql [local] 2018-07-13 18:02:36.962 CST 8869 5b4878b6.22a5 1 3/2 
0]LOG:  statement: SELECT e.extname AS "Name", e.extversion AS "Version", 
n.nspname AS "Schema", c.description AS "Description"
FROM pg_catalog.pg_extension e LEFT JOIN pg_catalog.pg_namespace n ON n.oid 
= e.extnamespace LEFT JOIN pg_catalog.pg_description c ON c.objoid = e.oid AND 
c.classoid = 'pg_catalog.pg_extension'::pg_catalog.regclass
WHERE e.extname OPERATOR(pg_catalog.~) '^(ltree)$'
ORDER BY 1;
[lhl mydb psql [local] 2018-07-13 18:02:36.972 CST 8869 5b4878b6.22a5 2 3/0 
0]LOG:  duration: 10.269 ms
[lhl mydb psql [local] 2018-07-13 18:02:39.313 CST 8869 5b4878b6.22a5 3 3/3 
0]LOG:  statement: select lca('{}'::ltree[]);
[2018-07-13 18:02:39.444 CST 1638 5b481476.666 102  0]LOG:  server process 
(PID 8869) was terminated by signal 11: Segmentation fault
[2018-07-13 18:02:39.444 CST 1638 5b481476.666 103  0]DETAIL:  Failed 
process was running: select lca('{}'::ltree[]);
[2018-07-13 18:02:39.444 CST 1638 5b481476.666 104  0]LOG:  terminating any 
other active server processes
[lhl mydb [unknown] [local] 2018-07-13 18:02:39.445 CST 8873 5b4878bf.22a9 1  
0]FATAL:  the database system is in recovery mode
[2018-07-13 18:02:39.446 CST 8635 5b4876a7.21bb 1 1/0 0]WARNING:  
terminating connection because of crash of another server process
[2018-07-13 18:02:39.446 CST 8635 5b4876a7.21bb 2 1/0 0]DETAIL:  The 
postmaster has commanded this server process to roll back the current 
transaction and exit, because another server process exited abnormally and 
possibly corrupted shared memory.
[2018-07-13 18:02:39.446 CST 8635 5b4876a7.21bb 3 1/0 0]HINT:  In a moment 
you should be able to reconnect to the database and repeat your command.
[2018-07-13 18:02:39.447 CST 1638 5b481476.666 105  0]LOG:  archiver 
process (PID 8636) exited with exit code 1
[2018-07-13 18:02:39.447 CST 1638 5b481476.666 106  0]LOG:  all server 
processes terminated; reinitializing
[2018-07-13 18:02:39.520 CST 8874 5b4878bf.22aa 1  0]LOG:  database system 
was interrupted; last known up at 2018-07-13 17:53:42 CST
[2018-07-13 18:02:39.798 CST 8874 5b4878bf.22aa 2  0]LOG:  database system 
was not properly shut down; automatic recovery in progress
[2018-07-13 18:02:39.838 CST 8874 5b4878bf.22aa 3  0]LOG:  redo starts at 
1/9815F478
[2018-07-13 18:02:39.838 CST 8874 5b4878bf.22aa 4  0]LOG:  invalid record 
length at 1/9815F4B0: wanted 24, got 0
[2018-07-13 18:02:39.839 CST 8874 5b4878bf.22aa 5  0]LOG:  redo done at 
1/9815F478
[2018-07-13 18:02:40.036 CST 1638 5b481476.666 107  0]LOG:  database system 
is ready to accept connections



--
PostgreSQL DBA hailong.li


Re: Temporary WAL segments files not cleaned up after an instance crash

2018-07-13 Thread Yugo Nagata
On Thu, 12 Jul 2018 16:44:45 +0900
Michael Paquier  wrote:

> On Thu, Jul 12, 2018 at 03:35:53PM +0900, Yugo Nagata wrote:
> > I think it makes sense to remove unnecessary temporary WAL files although
> > I'm not sure how high the risk of ENOSPC is.
> 
> It depends on how close to the partition size limit max_wal_size is set,
> and how unstable a system is.  Switching on/off a VM where Postgres
> is located can participate in that, as well as VM snapshots taken
> without memory (I work a lot on those as you can guess :D).  Setting it
> to 70% of the partition size is what I imagine is the base, but I can
> imagine as well people setting it at 90% or more.

Thank you for explaining this. I now understand the problem
you are concerned about.

Thanks,

-- 
Yugo Nagata 



Re: Problem on pg_dump RANGE partition with expressions

2018-07-13 Thread Yugo Nagata
On Thu, 12 Jul 2018 17:44:48 +0900
Amit Langote  wrote:

> > 1) Allow to appear more than once in range partition key
> > 
> > I don't understand why there is this restriction. If we have no clear 
> > reason, 
> > can we rip out this restriction?
> 
> I can't recall exactly, but back when I wrote this code, I might have been
> thinking that such a feature is useless and would actually just end up
> being a foot gun for users.
> 
> Although, I'm open to ripping it out if it's seen as being overly restrictive.

I think this is a good way to resolve it, because allowing columns to appear
more than once would be harmless, at least with the current partitioning
methods (range and hash), even though it is useless.

Actually, we currently allow the same expression to appear more than once
in the partition key, like

 create table t (i int) partition by range((i+1),(i+1));

In the past, I proposed a patch to forbid this, but it was rejected
since this is harmless and there is no need to restrict it.

Attached is a patch to get rid of the "appears more than once" restriction.

Regards,

-- 
Yugo Nagata 
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 22e81e7..d7e8633 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -13623,20 +13623,6 @@ transformPartitionSpec(Relation rel, PartitionSpec *partspec, char *strategy)
 		PartitionElem *pelem = castNode(PartitionElem, lfirst(l));
 		ListCell   *lc;
 
-		/* Check for PARTITION BY ... (foo, foo) */
-		foreach(lc, newspec->partParams)
-		{
-			PartitionElem *pparam = castNode(PartitionElem, lfirst(lc));
-
-			if (pelem->name && pparam->name &&
-strcmp(pelem->name, pparam->name) == 0)
-ereport(ERROR,
-		(errcode(ERRCODE_DUPLICATE_COLUMN),
-		 errmsg("column \"%s\" appears more than once in partition key",
-pelem->name),
-		 parser_errposition(pstate, pelem->location)));
-		}
-
 		if (pelem->expr)
 		{
 			/* Copy, to avoid scribbling on the input */


RE: How to make partitioning scale better for larger numbers of partitions

2018-07-13 Thread Kato, Sho
Tsunakawa-san

>Kato-san, could you try pgbench -M prepared?

I did pgbench -M prepared and perf record.

UPDATE latency in prepared mode is 95% shorter than in simple mode. 
SELECT latency in prepared mode is 54% shorter than in simple mode.
INSERT latency in prepared mode is 8% shorter than in simple mode.

In the perf reports, AllocSetAlloc, hash_search_with_hash_value and hash_any 
appeared for all SQL statements.

Details are as follows.

pgbench results
--

transaction type: update.sql
scaling factor: 1
query mode: prepared
number of clients: 1
number of threads: 1
duration: 180 s
number of transactions actually processed: 12295
latency average = 14.641 ms
tps = 68.302806 (including connections establishing)
tps = 68.303430 (excluding connections establishing)
statement latencies in milliseconds:
 0.009  \set aid random(1, 1100 * 1)
 0.004  \set delta random(-5000, 5000)
 0.026  BEGIN;
14.089  UPDATE test.accounts SET abalance = abalance + :delta WHERE aid 
= :aid;
 0.513  END;

transaction type: select.sql
scaling factor: 1
query mode: prepared
number of clients: 1
number of threads: 1
duration: 180 s
number of transactions actually processed: 45145
latency average = 3.996 ms
tps = 250.272094 (including connections establishing)
tps = 250.274404 (excluding connections establishing)
statement latencies in milliseconds:
 0.750  \set aid random(1, 1100 * 1)
 0.714  \set delta random(-5000, 5000)
 0.023  BEGIN;
 2.262  SELECT abalance FROM test.accounts WHERE aid = :aid;
 0.247  END;

transaction type: insert.sql
scaling factor: 1
query mode: prepared
number of clients: 1
number of threads: 1
duration: 180 s
number of transactions actually processed: 34894
latency average = 5.158 ms
tps = 193.855216 (including connections establishing)
tps = 193.857020 (excluding connections establishing)
statement latencies in milliseconds:
 1.007  \set aid random(1, 1100 * 1)
 0.802  \set delta random(-5000, 5000)
 0.025  BEGIN;
 2.963  INSERT INTO test.accounts_history (aid, delta, mtime) VALUES 
(:aid, :delta, CURRENT_TIMESTAMP);
 0.360  END;

Top 10 of perf report
-
UPDATE:
 11.86%  postgres  postgres   [.] list_nth
 10.23%  postgres  postgres   [.] ExecOpenScanRelation
  7.47%  postgres  postgres   [.] list_member_int
  4.78%  postgres  postgres   [.] AllocSetAlloc
  3.61%  postgres  postgres   [.] palloc0
  3.09%  postgres  postgres   [.] hash_search_with_hash_value
  2.33%  postgres  postgres   [.] ResourceArrayAdd
  1.99%  postgres  postgres   [.] hash_any
  1.83%  postgres  postgres   [.] copyObjectImpl
  1.64%  postgres  postgres   [.] SearchCatCache1

SELECT:
 33.60%  postgres  postgres   [.] hash_search_with_hash_value
 13.02%  postgres  postgres   [.] hash_any
  5.30%  postgres  postgres   [.] LockAcquireExtended
  5.16%  postgres  postgres   [.] LockReleaseAll
  4.00%  postgres  postgres   [.] hash_seq_search
  3.84%  postgres  postgres   [.] LWLockRelease
  3.81%  postgres  postgres   [.] AllocSetAlloc
  3.23%  postgres  postgres   [.] LWLockAcquire
  2.55%  postgres  postgres   [.] SetupLockInTable
  1.90%  postgres  postgres   [.] AcquireExecutorLocks

INSERT:
 21.86%  postgres  postgres   [.] hash_search_with_hash_value
  6.36%  postgres  postgres   [.] hash_any
  4.95%  postgres  postgres   [.] AllocSetAlloc
  4.08%  postgres  postgres   [.] LWLockRelease
  3.83%  postgres  postgres   [.] LWLockAcquire
  3.26%  postgres  postgres   [.] SearchCatCache
  3.15%  postgres  postgres   [.] oid_cmp
  2.78%  postgres  postgres   [.] LockReleaseAll
  2.76%  postgres  postgres   [.] pg_qsort
  2.41%  postgres  postgres   [.] SearchCatCache1


-Original Message-
From: Tsunakawa, Takayuki/綱川 貴之 
Sent: Friday, July 13, 2018 2:49 PM
To: 'Amit Langote' ; Kato, Sho/加藤 翔 
; PostgreSQL mailing lists 

Subject: RE: How to make partitioning scale better for larger numbers of 
partitions

From: Amit Langote [mailto:langote_amit...@lab.ntt.co.jp]
> For SELECT/UPDATE/DELETE, overhead of partitioning in the planning phase
> is pretty significant and gets worse as the number of partitions grows.
> I
> had intended to fix that in PG 11, but we could only manage to get part
> of
> that work into PG 11, with significant help from David and others.  So,
> while PG 11's overhead of partitioning during planning is less than PG
> 10's due to improved pruning algorithm, it's still pretty far from ideal,
> because it isn't just the pruning algorithm that had overheads.  In fact,
> PG 11 only removes the pruning overhead for SELECT, so UPDATE/DELETE still
> carry the overhead that was in PG 10.

David has submitted multiple patches for PG 12, one of which 

Re: ALTER TABLE on system catalogs

2018-07-13 Thread Peter Eisentraut
On 28.06.18 10:14, Peter Eisentraut wrote:
> On 6/28/18 01:10, Michael Paquier wrote:
>> On Wed, Jun 27, 2018 at 01:37:33PM -0700, Andres Freund wrote:
>>> On 2018-06-27 22:31:30 +0200, Peter Eisentraut wrote:
 I propose that we instead silently ignore attempts to add TOAST tables
 to shared and system catalogs after bootstrapping.
>>>
>>> That seems like an extremely bad idea to me.  I'd rather go ahead and
>>> just add the necessary toast tables than silently violate preconditions
>>> that code assumes are fulfilled.
>>
>> Agreed.  Joe Conway was working on a patch to do exactly that.  I was
>> personally looking for the possibility of having one with pg_authid in
>> v12 :)
> 
> OK, that would change things a bit, in that the silent addition of a
> TOAST table would no longer be a problem, but it wouldn't fix the other
> scenarios that end up in an error.  If such a patch is forthcoming, we
> can revisit this again afterwards.

After reviewing that thread, I think my patch would still be relevant.
Because the pending proposal is to not add TOAST tables to some catalogs
such as pg_attribute, so my original use case of allowing ALTER TABLE /
SET on system catalogs would still be broken for those tables.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: pgbench's expression parsing & negative numbers

2018-07-13 Thread Fabien COELHO


Hello Andres,


I'll come up with a patch for that sometime soon.


ISTM that you have not sent any patch on the subject, otherwise I would 
have reviewed it. Maybe I could do one some time later, unless you think 
that I should not.


Here is a patch which detects pgbench overflows on int & double constants, 
and on integer operators.


--
Fabien.

diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index 88cf8b3933..8c464c9d42 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -989,6 +989,13 @@ pgbench  options 
 d
   are FALSE.
  
 
+ 
+  Too large or small integer and double constants, as well as
+  integer arithmetic operators (+,
+  -, * and /)
+  raise errors on overflows.
+ 
+
  
   When no final ELSE clause is provided to a
   CASE, the default value is NULL.
diff --git a/src/bin/pgbench/exprscan.l b/src/bin/pgbench/exprscan.l
index 5c1bd88128..3e82437bc8 100644
--- a/src/bin/pgbench/exprscan.l
+++ b/src/bin/pgbench/exprscan.l
@@ -192,15 +192,21 @@ notnull   [Nn][Oo][Tt][Nn][Uu][Ll][Ll]
 						return BOOLEAN_CONST;
 					}
 {digit}+			{
-					yylval->ival = strtoint64(yytext);
+					if (!strtoint64(yytext, true, &yylval->ival))
+						expr_yyerror_more(yyscanner, "bigint constant overflow",
+										  strdup(yytext));
 					return INTEGER_CONST;
 				}
 {digit}+(\.{digit}*)?([eE][-+]?{digit}+)?	{
-					yylval->dval = atof(yytext);
+					if (!strtodouble(yytext, true, &yylval->dval))
+						expr_yyerror_more(yyscanner, "double constant overflow",
+										  strdup(yytext));
 					return DOUBLE_CONST;
 				}
 \.{digit}+([eE][-+]?{digit}+)?	{
-					yylval->dval = atof(yytext);
+					if (!strtodouble(yytext, true, &yylval->dval))
+						expr_yyerror_more(yyscanner, "double constant overflow",
+										  strdup(yytext));
 					return DOUBLE_CONST;
 				}
 {alpha}{alnum}*		{
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index f0c5149523..3f1392fc03 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -38,6 +38,8 @@
 #include "libpq-fe.h"
 #include "portability/instr_time.h"
 
+#include "common/int.h" /* integer overflow checks */
+
 #include 
 #include 
 #include 
@@ -627,19 +629,24 @@ is_an_int(const char *str)
 /*
  * strtoint64 -- convert a string to 64-bit integer
  *
- * This function is a modified version of scanint8() from
+ * This function is a slightly modified version of scanint8() from
  * src/backend/utils/adt/int8.c.
+ *
+ * If not errorOK, an error message is printed out.
  */
-int64
-strtoint64(const char *str)
+bool
+strtoint64(const char *str, bool errorOK, int64 *result)
 {
const char *ptr = str;
-   int64   result = 0;
-   int sign = 1;
+   int64   tmp = 0;
+   boolneg = false;
 
/*
 * Do our own scan, rather than relying on sscanf which might be broken
 * for long long.
+*
+* As INT64_MIN can't be stored as a positive 64 bit integer, accumulate
+* value as a negative number.
 */
 
/* skip leading spaces */
@@ -650,46 +657,80 @@ strtoint64(const char *str)
if (*ptr == '-')
{
ptr++;
-
-	/*
-	 * Do an explicit check for INT64_MIN.  Ugly though this is, it's
-	 * cleaner than trying to get the loop below to handle it portably.
-	 */
-   if (strncmp(ptr, "9223372036854775808", 19) == 0)
-   {
-   result = PG_INT64_MIN;
-   ptr += 19;
-   goto gotdigits;
-   }
-   sign = -1;
+   neg = true;
}
else if (*ptr == '+')
ptr++;
 
/* require at least one digit */
-   if (!isdigit((unsigned char) *ptr))
-		fprintf(stderr, "invalid input syntax for integer: \"%s\"\n", str);
+   if (unlikely(!isdigit((unsigned char) *ptr)))
+   goto invalid_syntax;
 
/* process digits */
while (*ptr && isdigit((unsigned char) *ptr))
{
-   
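
For reference, the digit-processing part of such an overflow-aware conversion is
typically written with the pg_mul_s64_overflow()/pg_sub_s64_overflow() helpers from
common/int.h (included above). A minimal sketch, not the exact patch text, with the
out_of_range label assumed:

	while (*ptr && isdigit((unsigned char) *ptr))
	{
		int8	digit = (*ptr++ - '0');

		/* accumulate as a negative number so INT64_MIN needs no special case */
		if (unlikely(pg_mul_s64_overflow(tmp, 10, &tmp)) ||
			unlikely(pg_sub_s64_overflow(tmp, digit, &tmp)))
			goto out_of_range;
	}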

Re: [HACKERS] Restricting maximum keep segments by repslots

2018-07-13 Thread Kyotaro HORIGUCHI
Hello.

At Wed, 11 Jul 2018 15:09:23 +0900, Masahiko Sawada  
wrote in 
> On Mon, Jul 9, 2018 at 2:47 PM, Kyotaro HORIGUCHI
>  wrote:
..
> Here is review comments of v4 patches.
> 
> +   if (minKeepLSN)
> +   {
> +   XLogRecPtr slotPtr = XLogGetReplicationSlotMinimumLSN();
> +   Assert(!XLogRecPtrIsInvalid(slotPtr));
> +
> +   tailSeg = GetOldestKeepSegment(currpos, slotPtr);
> +
> +   XLogSegNoOffsetToRecPtr(tailSeg, 0, *minKeepLSN,
> wal_segment_size);
> +   }
> 
> The usage of XLogSegNoOffsetToRecPtr is wrong. Since we specify the
> destination at 4th argument the wal_segment_size will be changed in
> the above expression. The regression tests by PostgreSQL Patch Tester

I'm not sure I get this correctly; the definition of the macro is
as follows.

| #define XLogSegNoOffsetToRecPtr(segno, offset, dest, wal_segsz_bytes) \
|   (dest) = (segno) * (wal_segsz_bytes) + (offset)

The destination is the *3rd* parameter, and the fourth is the segment
size, which is only read, not written to.
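
Concretely, the call quoted above expands like this (illustration only, taken from
the macro definition):

	XLogSegNoOffsetToRecPtr(tailSeg, 0, *minKeepLSN, wal_segment_size);
	/* expands to:  *minKeepLSN = tailSeg * wal_segment_size + 0;
	 * the 4th argument, wal_segment_size, is only read, never assigned */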

> seem passed but I got the following assertion failure in
> recovery/t/010_logical_decoding_timelines.pl
> 
> TRAP: FailedAssertion("!(XLogRecPtrToBytePos(*StartPos) ==
> startbytepos)", File: "xlog.c", Line: 1277)

Hmm. I don't see a relation with this patch, but how did you
cause the failure? The failure means inconsistency between
existing XLogBytePosToRecPtr and XLogRecPtrToBytePos, which
doesn't seem to happen without modifying the two functions.

> 
> +   XLByteToSeg(restartLSN, restartSeg, wal_segment_size);
> +
> +
> +   if (minKeepLSN)
> There is an extra empty line.
> 
> 
> +/* but, keep larger than wal_segment_size if any*/
> +if (wal_keep_segments > 0 && keepSegs < wal_keep_segments)
> +keepSegs = wal_keep_segments;
> 
> You meant wal_keep_segments in the above comment rather than
> wal_segment_size? Also, the above comment need a whitespace just after
> "any".

Ouch! You're right. Fixed.

> 
> +bool
> +IsLsnStillAvaiable(XLogRecPtr restartLSN, XLogRecPtr *minKeepLSN)
> +{
> 
> I think restartLSN is a word used for replication slots. Since this
> function is defined in xlog.c it would be better to change the
> argument name to more generic name, for example recptr.

Agreed. I used "target" instead.

> 
> +   /*
> +* Calcualte keep segments by slots first. The second term of the
> +* condition is just a sanity check.
> +*/
> 
> s/calcualte/calculate/

Fixed.

> 
> +   /* get minimum segment ignorig timeline ID */
> 
> s/ignorig/ignoring/

Fixed.

# One of my fingers is literally fatter with bandaid than usual..

> 
> min_keep_lsn in pg_replication_slots currently shows the same value in
> every slots but I felt that the value seems not easy to understand
> intuitively for users because users will have to confirm that value
> and to compare the current LSN in order to check if replication slots
> will become the "lost" status. So how about showing values that
> indicate how far away from the point where we become "lost" for
> individual slots?

Yeah, that is what I did in the first cut of this patch, from the
same line of thought. pg_replication_slots had two additional columns,
"live" and "distance".

https://www.postgresql.org/message-id/20171031.184310.182012625.horiguchi.kyot...@lab.ntt.co.jp

The current design was changed following a comment.

https://www.postgresql.org/message-id/20171108.131431.170534842.horiguchi.kyotaro%40lab.ntt.co.jp

> > I don't think 'distance' is a good metric - that's going to continually
> > change. Why not store the LSN that's available and provide a function
> > that computes this? Or just rely on the lsn - lsn operator?
> 
> It seems reasonable.  The 'secured minimum LSN' is common among
> all slots so showing it in the view may look a bit stupid, but I
> don't find another suitable place for it.  distance = 0 meant the
> state that the slot is living but insecure in the previous patch,
> and that information is lost by changing 'distance' to
> 'min_secure_lsn'.

As I reconsidered this, I noticed that "lsn - lsn" doesn't make
sense here. The correct formula for the value is
"max_slot_wal_keep_size * 1024 * 1024 - ((oldest LSN to keep) -
restart_lsn)". It is not a simple formula to write by hand, and it
doesn't seem general enough. I re-changed my mind to show the
"distance" there again.

pg_replication_slots now has the column "remain" instead of
"min_keep_lsn", which shows an LSN when wal_status is "streaming"
and otherwise "0/0". In a special case, "remain" can be "0/0"
while "wal_status" is "streaming". It is the reason for the
tristate return value of IsLsnStillAvaialbe().

wal_status | remain 
streaming  | 0/19E3C0  -- WAL is reserved
streaming  | 0/0   -- Still reserved but on the boundary
keeping| 0/0   -- About to be lost.
lost   | 0/0   -- Lost.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From 

Re: Constraint documentation

2018-07-13 Thread Fabien COELHO


Hello Peter,


I'm not sure what is the suggestion wrt to the documentation text. Is the
issue only with the first introductory sentence? Would removing it be
enough?


Yes.  But it would be even better to fix pg_dump.


Sure. The purpose of Lætitia's patch is simply to document the consequences
of the current behavior. Fixing pg_dump is another issue :-)


I guess that this would involve postponing all non trivial CHECK 
declarations to after all table and function creations. While waiting for 
such a significant change, ISTM that documenting the issue is a reasonable 
option.


--
Fabien.

Re: Speeding up INSERTs and UPDATEs to partitioned tables

2018-07-13 Thread Amit Langote
Hi David.

On 2018/06/22 15:28, David Rowley wrote:
> Hi,
> 
> As part of my efforts to make partitioning scale better for larger
> numbers of partitions, I've been looking at primarily INSERT VALUES
> performance.  Here the overheads are almost completely in the
> executor. Planning of this type of statement is very simple since
> there is no FROM clause to process.

Thanks for this effort.

> My benchmarks have been around a RANGE partitioned table with 10k leaf
> partitions and no sub-partitioned tables. The partition key is a
> timestamp column.
> 
> I've found that ExecSetupPartitionTupleRouting() is very slow indeed
> and there are a number of things slow about it.  The biggest culprit
> for the slowness is the locking of each partition inside of
> find_all_inheritors().

Yes. :-(

> For now, this needs to remain as we must hold
> locks on each partition while performing RelationBuildPartitionDesc(),
> otherwise, one of the partitions may get dropped out from under us.

We lock all partitions using find_all_inheritors to keep locking order
consistent with other sites that may want to lock tables in the same
partition tree but with a possibly conflicting lock mode.  If we remove
the find_all_inheritors call in ExecSetupPartitionPruneState (like your
0002 does), we may end up locking partitions in arbitrary order in a given
transaction, because input tuples will be routed to various partitions in
an order that's not predetermined.
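
(For reference, a simplified sketch of the up-front locking being discussed,
roughly what ExecSetupPartitionTupleRouting does today:

	/* lock every partition, in the deterministic order find_all_inheritors() returns */
	(void) find_all_inheritors(RelationGetRelid(rel), RowExclusiveLock, NULL);

so every backend routing tuples takes the partition locks in the same order.)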

But, maybe it's not necessary to be that paranoid.  If we've locked on the
parent, any concurrent lockers would have to wait for the lock on the
parent anyway, so it doesn't matter which order tuple routing locks the
partitions.

> The locking is not the only slow thing. I found the following to also be slow:
> 
> 1. RelationGetPartitionDispatchInfo uses a List and lappend() must
> perform a palloc() each time a partition is added to the list.
> 2. A foreach loop is performed over leaf_parts to search for subplans
> belonging to this partition. This seems pointless to do for INSERTs as
> there's never any to find.
> 3. ExecCleanupTupleRouting() loops through the entire partitions
> array. If a single tuple was inserted then all but one of the elements
> will be NULL.
> 4. Tuple conversion map allocates an empty array thinking there might
> be something to put into it. This is costly when the array is large
> and pointless when there are no maps to store.
> 5. During get_partition_dispatch_recurse(), get_rel_relkind() is
> called to determine if the partition is a partitioned table or a leaf
> partition. This results in a slow relcache hashtable lookup.
> 6. get_partition_dispatch_recurse() also ends up just building the
> indexes array with a sequence of numbers from 0 to nparts - 1 if there
> are no sub-partitioned tables. Doing this is slow when there are many
> partitions.
> 
> Besides the locking, the only thing that remains slow now is the
> palloc0() for the 'partitions' array. In my test, it takes 0.6% of
> execution time. I don't see any pretty ways to fix that.
> 
> I've written fixes for items 1-6 above.
> 
> I did:
> 
> 1. Use an array instead of a List.
> 2. Don't do this loop. palloc0() the partitions array instead. Let
> UPDATE add whatever subplans exist to the zeroed array.
> 3. Track what we initialize in a gapless array and cleanup just those
> ones. Make this array small and increase it only when we need more
> space.
> 4. Only allocate the map array when we need to store a map.
> 5. Work that out in relcache beforehand.
> 6. ditto

The issues you list seem all legitimate to me and also your proposed fixes
for each, except I think we could go a bit further.

Why don't we abandon the notion altogether that
ExecSetupPartitionTupleRouting *has to* process the whole partition tree?
ISTM, there is no need to determine the exact number of leaf partitions
and partitioned tables in the partition tree and allocate the arrays in
PartitionTupleRouting based on that.  I know that the indexes array in
PartitionDispatchData contains mapping from local partition indexes (0 to
partdesc->nparts - 1) to those that span *all* leaf partitions and *all*
partitioned tables (0 to proute->num_partitions or proute->num_dispatch)
in a partition tree, but we can change that.

The idea I had was inspired by looking at partitions_init stuff in your
patch.  We could allocate proute->partition_dispatch_info and
proute->partitions arrays to be of a predetermined size, which doesn't
require us to calculate the exact number of leaf partitions and
partitioned tables beforehand.  So, RelationGetPartitionDispatchInfo need
not recursively go over all of the partition tree.  Instead we create just
one PartitionDispatch object of the root parent table, whose indexes array
is initialized with -1, meaning none of the partitions has been
encountered yet.  In ExecFindPartition, once tuple routing chooses a
partition, we create either a ResultRelInfo (if leaf) or a
PartitionDispatch for it and 

[PATCH] LLVM tuple deforming improvements

2018-07-13 Thread Pierre Ducroquet
Hi

As reported in the «effect of JIT tuple deform?» thread, there are slowdowns
in some cases when using JIT tuple deforming.
I've played with the generated code and with the LLVM optimizer trying to fix
that issue; here are the results of my experiments, with the corresponding
patches.

All performance measurements are done following the test from 
https://www.postgresql.org/message-id/CAFj8pRAOcSXNnykfH=M6mNaHo
+g=FaUs=dldzsohdjbkujr...@mail.gmail.com

Base measurements : 

No JIT : 850ms 
JIT without tuple deforming : 820 ms (0.2ms optimizing)
JIT with tuple deforming, no opt : 1650 ms (1.5ms)
JIT with tuple deforming, -O3 : 770 ms (105ms)

1) force a -O1 when deforming

This is by far the best I managed to get. With -O1, the queries are even
faster than with -O3 since the optimizer is faster, while still generating
efficient code.
I have tried adding the right passes to the passmanager, but it looks like the 
interesting ones are not available unless you enable -O1.

JIT with tuple deforming, -O1 : 725 ms (54ms)
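
For illustration, forcing the -O1 pipeline through the LLVM C API could look roughly
like this; a sketch assuming "mod" is the module holding the deform function, not the
actual patch:

	LLVMPassManagerBuilderRef pmb = LLVMPassManagerBuilderCreate();
	LLVMPassManagerRef pm = LLVMCreatePassManager();

	LLVMPassManagerBuilderSetOptLevel(pmb, 1);	/* -O1 instead of -O0 or -O3 */
	LLVMPassManagerBuilderPopulateModulePassManager(pmb, pm);
	LLVMRunPassManager(pm, mod);			/* run the cheaper pipeline */

	LLVMDisposePassManager(pm);
	LLVMPassManagerBuilderDispose(pmb);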

2) improve the LLVM IR code

The code generator in llvmjit_deform.c currently relies on the LLVM optimizer to
do the right thing. For instance, it can generate a lot of empty blocks with
only a jump. If we don't want to enable the LLVM optimizer for all generated
code, we have to get rid of this kind of pattern. The attached patch does that.
When the optimizer is not used, this gains a few cycles, nothing impressive.
I have tried to go closer to the optimized bitcode, but it requires building 
phi nodes manually instead of using alloca, and this isn't enough to bring us 
to the performance level of -O1.

JIT with tuple deforming, no opt : 1560 ms (1.5ms)

3) *experimental* : faster non-NULL handling

Currently, the generated code always looks at the tuple header bitfield to
check each field's null-ness, and afterwards ANDs that with the hasnulls bit.
Checking only hasnulls first improves performance when tuples mostly contain
no nulls, but taxes performance when nulls are found.
I have not yet succeeded in implementing it, but I think that, using the
statistics collected for a given table, we could enable this only when we know
we may benefit from it.

JIT with tuple deforming, no opt : 1520 ms (1.5ms)
JIT with tuple deforming, -O1 : 690 ms (54ms)
>From 14a226107f845454676a2e14ae0fb843a5b4f668 Mon Sep 17 00:00:00 2001
From: Pierre Ducroquet 
Date: Wed, 11 Jul 2018 23:41:59 +0200
Subject: [PATCH 1/2] Check for the hasnulls attribute before checking
 individual fields

---
 src/backend/jit/llvm/llvmjit_deform.c | 26 +++---
 1 file changed, 19 insertions(+), 7 deletions(-)

diff --git a/src/backend/jit/llvm/llvmjit_deform.c b/src/backend/jit/llvm/llvmjit_deform.c
index 795f67114e..c53855eb63 100644
--- a/src/backend/jit/llvm/llvmjit_deform.c
+++ b/src/backend/jit/llvm/llvmjit_deform.c
@@ -48,6 +48,7 @@ slot_compile_deform(LLVMJitContext *context, TupleDesc desc, int natts)
 	LLVMBasicBlockRef b_out;
 	LLVMBasicBlockRef b_dead;
 	LLVMBasicBlockRef *attcheckattnoblocks;
+	LLVMBasicBlockRef *attfaststartblocks;
 	LLVMBasicBlockRef *attstartblocks;
 	LLVMBasicBlockRef *attisnullblocks;
 	LLVMBasicBlockRef *attcheckalignblocks;
@@ -145,6 +146,7 @@ slot_compile_deform(LLVMJitContext *context, TupleDesc desc, int natts)
 	b = LLVMCreateBuilder();
 
 	attcheckattnoblocks = palloc(sizeof(LLVMBasicBlockRef) * natts);
+	attfaststartblocks = palloc(sizeof(LLVMBasicBlockRef) * natts);
 	attstartblocks = palloc(sizeof(LLVMBasicBlockRef) * natts);
 	attisnullblocks = palloc(sizeof(LLVMBasicBlockRef) * natts);
 	attcheckalignblocks = palloc(sizeof(LLVMBasicBlockRef) * natts);
@@ -239,6 +241,8 @@ slot_compile_deform(LLVMJitContext *context, TupleDesc desc, int natts)
 	{
 		attcheckattnoblocks[attnum] =
 			l_bb_append_v(v_deform_fn, "block.attr.%d.attcheckattno", attnum);
+		attfaststartblocks[attnum] =
+			l_bb_append_v(v_deform_fn, "block.attr.%d.faststart", attnum);
 		attstartblocks[attnum] =
 			l_bb_append_v(v_deform_fn, "block.attr.%d.start", attnum);
 		attisnullblocks[attnum] =
@@ -337,7 +341,7 @@ slot_compile_deform(LLVMJitContext *context, TupleDesc desc, int natts)
 
 		/*
 		 * If this is the first attribute, slot->tts_nvalid was 0. Therefore
-		 * reset offset to 0 to, it be from a previous execution.
+		 * reset offset to 0 too, it could be from a previous execution.
 		 */
 		if (attnum == 0)
 		{
@@ -351,7 +355,7 @@ slot_compile_deform(LLVMJitContext *context, TupleDesc desc, int natts)
 		 */
 		if (attnum <= guaranteed_column_number)
 		{
-			LLVMBuildBr(b, attstartblocks[attnum]);
+			LLVMBuildBr(b, attfaststartblocks[attnum]);
 		}
 		else
 		{
@@ -361,8 +365,19 @@ slot_compile_deform(LLVMJitContext *context, TupleDesc desc, int natts)
 	 l_attno,
 	 v_maxatt,
 	 "heap_natts");
-			LLVMBuildCondBr(b, v_islast, b_out, attstartblocks[attnum]);
+			LLVMBuildCondBr(b, v_islast, b_out, attfaststartblocks[attnum]);
 		}
+
+		

RE: automatic restore point

2018-07-13 Thread Yotsunaga, Naoki
>-Original Message-
>From: Michael Paquier [mailto:mich...@paquier.xyz] 
>Sent: Wednesday, July 11, 2018 3:34 PM

>Well, if you put in place correct measures from the start you would not have 
>problems.  
>It seems to me that there is no point in implementing something which is a 
>solution for a very narrow case, where the user has shot his own foot to begin 
>with. 
>Having backups anyway is mandatory by the way, standby replicas are not 
>backups.

I think that the Undo function of AWS and Oracle's Flashback function exist to
save such users; they are features meant to prevent human error.
So, how about Postgres implementing such a function?
 
Also, as an approach to achieving the goal, I thought about outputting the LSN to
the server log when a specific command is executed.
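
A rough sketch of that idea (the command name and message wording are assumptions
for illustration only):

	/* hypothetical: log the current WAL insert location before a destructive command */
	XLogRecPtr	lsn = GetXLogInsertRecPtr();

	ereport(LOG,
			(errmsg("LSN before DROP TABLE: %X/%X",
					(uint32) (lsn >> 32), (uint32) lsn)));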
 
I do not think the Postgres source code will become complicated by
implementing this function.
Do you feel it would be too complicated?

---
Naoki Yotsunaga


Re: Constraint documentation

2018-07-13 Thread Peter Eisentraut
On 07.07.18 10:23, Fabien COELHO wrote:
> I'm not sure what is the suggestion wrt to the documentation text. Is the 
> issue only with the first introductory sentence? Would removing it be 
> enough?

Yes.  But it would be even better to fix pg_dump.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [PATCH] Include application_name in "connection authorized" log message

2018-07-13 Thread Peter Eisentraut
On 02.07.18 15:12, Don Seiler wrote:
> On Mon, Jul 2, 2018 at 2:13 AM, Peter Eisentraut
>  > wrote:
> 
> On 21.06.18 16:21, Don Seiler wrote:
> > -                                               (errmsg("connection
> > authorized: user=%s database=%s",
> > -                                                             
> >  port->user_name, port->database_name)));
> > +                                               (errmsg("connection
> > authorized: user=%s database=%s application=%s",
> > +                                                             
> >  port->user_name, port->database_name, port->application_name)));
> 
> Why is it "application" and not "application_name"?
> 
> 
> I was trying to be consistent since we don't use "user_name" or
> "database_name" as labels even though those are the variable names.

"user" and "database" are the keys used in the startup packet.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: How to make partitioning scale better for larger numbers of partitions

2018-07-13 Thread Amit Langote
On 2018/07/13 16:29, Kato, Sho wrote:
> I also benchmark PG10.
> Actually, SELECT latency on PG11beta2 + patch1 is faster than PG10.
> 
> SELECT latency with 800 leaf partition
> --
> PG10 5.62 ms
> PG11  3.869 ms  
> 
> But even on PG11, a SELECT statement takes 21.102 ms in a benchmark with 1600 leaf 
> partitions.
> It takes a long time even though PG11's partition pruning algorithm is a binary 
> search.

Yeah, the pruning algorithm change evidently removed only a small percentage
of the overhead.

>> The overheads I mention stem from the fact that for partitioning we still 
>> rely on the old planner code that's used to perform inheritance planning, 
>> which requires to lock and open *all* partitions.
> 
> I debugged UPDATE statement execution on a partitioned table.
> range_table_mutator seems to process all leaf partitions.

That's one of the symptoms of it, yes.

With the existing code for UPDATE/DELETE planning of partitioned tables,
the whole Query structure is modified to translate the parent's attribute
numbers to partition attribute numbers and planned freshly *for every
partition*.  range_table_mutator() is invoked sometime during that
translation process and itself loops over all partitions, so I'd think it
is susceptible to being prominently visible in perf profiles.

Thanks,
Amit




Re: Preferring index-only-scan when the cost is equal

2018-07-13 Thread Yugo Nagata
On Thu, 12 Jul 2018 12:59:15 +0200
Tomas Vondra  wrote:

> 
> 
> On 07/12/2018 03:44 AM, Yugo Nagata wrote:
> > On Wed, 11 Jul 2018 14:37:46 +0200
> > Tomas Vondra  wrote:
> > 
> >>
> >> On 07/11/2018 01:28 PM, Ashutosh Bapat wrote:
> > 
> >>> I don't think we should change add_path() for this. We will
> >>> unnecessarily check that condition even for the cases where we do not
> >>> create index paths. I think we should fix the caller of add_path()
> >>> instead to add index only path before any index paths. For that the
> >>> index list needs to be sorted by the possibility of using index only
> >>> scan.
> >>>
> >>> But I think in your case, it might be better to first check whether
> >>> there is any costing error because of which index only scan's path has
> >>> the same cost as index scan path. Also I don't see any testcase which
> >>> will show why index only scan would be more efficient in your case.
> >>> May be provide output of EXPLAIN ANALYZE.
> >>>
> >>
> >> I suspect this only happens due to testing on empty tables. Not only is
> >> testing of indexes on small tables rather pointless in general, but more
> >> importantly there will be no statistics. So we fall back to some default
> >> estimates, but we also don't have relallvisible etc which is crucial for
> >> estimating index-only scans. I'd bet that's why the cost estimates for
> >> index scans and index-only scans are the same here.
> > 
> > You are right. When the table have rows and this is vacuumed, index only
> > scan's cost is cheaper and chosen properly. Sorry, I have jumped to the
> > conclusion before confirming this.
> > 
> 
> I'm very experienced in this. I've done this mistake a million times ;-)

Thank you. It is really encouraging for me.

> 
> regards
> 
> -- 
> Tomas Vondra  http://www.2ndQuadrant.com
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Yugo Nagata 



Re: partition pruning doesn't work with IS NULL clause in multikey range partition case

2018-07-13 Thread Amit Langote
Thanks for the review.

On 2018/07/12 22:01, Ashutosh Bapat wrote:
> On Thu, Jul 12, 2018 at 11:10 AM, Amit Langote
>  wrote:
>>>
>>> I think your fix is correct.  I slightly modified it along with updating
>>> nearby comments and added regression tests.
>>
>> I updated regression tests to reduce lines.  There is no point in
>> repeating tests like v2 patch did.
> 
> + *
> + * For hash partitioning however, it is possible to combine null and non-
> + * null keys in a pruning step, so do this only if *all* partition keys
> + * are involved in IS NULL clauses.
> 
> I don't think this is true. When equality conditions and IS NULL clauses cover
> all partition keys of a hash partitioned table and do not have contradictory
> clauses, we should be able to find the partition which will remain unpruned.

I was trying to say the same thing, but maybe the comment doesn't read that
way.  How about this:

+ * For hash partitioning, if we found IS NULL clauses for some keys and
+ * OpExpr's for others, gen_prune_steps_from_opexps() will generate the
+ * necessary pruning steps if all partition keys are taken care of.
+ * If we found only IS NULL clauses, then we can generate the pruning
+ * step here but only if all partition keys are covered.

> I
> see that we already have this supported in get_matching_hash_bounds()
> /*
>  * For hash partitioning we can only perform pruning based on equality
>  * clauses to the partition key or IS NULL clauses.  We also can only
>  * prune if we got values for all keys.
>  */
> if (nvalues + bms_num_members(nullkeys) == partnatts)
> {
> 
>   */
> -if (!generate_opsteps)
> +if (!bms_is_empty(nullkeys) &&
> +(part_scheme->strategy != PARTITION_STRATEGY_HASH ||
> + bms_num_members(nullkeys) == part_scheme->partnatts))
> 
> So, it looks like we don't need bms_num_members(nullkeys) ==
> part_scheme->partnatts there.

Yes, we can perform pruning in all three cases for hash partitioning:

* all keys are covered by OpExprs providing non-null keys

* some keys are covered by IS NULL clauses, others by OpExprs (all keys
  covered)

* all keys are covered by IS NULL clauses

... as long as we generate a pruning step at all.  The issue here was that
we skipped generating the pruning step in some cases due to a poorly coded
condition in gen_partprune_steps_internal.

> Also, I think, we don't know how some new partition strategy will treat NULL
> values so above condition looks wrong to me. Instead it should explicitly 
> check
> the strategies for which we know that the NULL values go to a single 
> partition.

How about if we explicitly spell out the strategies like this:

+if (!bms_is_empty(nullkeys) &&
+(part_scheme->strategy == PARTITION_STRATEGY_LIST ||
+ part_scheme->strategy == PARTITION_STRATEGY_RANGE ||
+ (part_scheme->strategy == PARTITION_STRATEGY_HASH &&
+  bms_num_members(nullkeys) == part_scheme->partnatts)))

The proposed comment also describes why the condition looks like that.

>  /*
> - * Note that for IS NOT NULL clauses, simply having step suffices;
> - * there is no need to propagate the exact details of which keys are
> - * required to be NOT NULL.  Hash partitioning expects to see actual
> - * values to perform any pruning.
> + * There are no OpExpr's, but there are IS NOT NULL clauses, which
> + * can be used to eliminate the null-partition-key-only partition.
> 
> I don't understand this. When there are IS NOT NULL clauses for all the
> partition keys, it's only then that we could eliminate the partition 
> containing
> NULL values, not otherwise.

Actually, there is only one case where the pruning step generated by that
block of code is useful -- to prune a list partition that accepts only
nulls.  List partitioning only allows one partition key, so this works,
but let's say only accidentally.  I modified the condition as follows:

+else if (!generate_opsteps && !bms_is_empty(notnullkeys) &&
+ bms_num_members(notnullkeys) == part_scheme->partnatts)

Attached updated patch.

Thanks,
Amit
diff --git a/src/backend/partitioning/partprune.c b/src/backend/partitioning/partprune.c
index cdc61a8997..a89cec0759 100644
--- a/src/backend/partitioning/partprune.c
+++ b/src/backend/partitioning/partprune.c
@@ -854,44 +854,47 @@ gen_partprune_steps_internal(GeneratePruningStepsContext *context,
}
 
/*
-* If generate_opsteps is set to false it means no OpExprs were directly
-* present in the input list.
+* For list and range partitioning, null partition keys can only be found
+* in one designated partition, so if there are IS NULL clauses containing
+* partition keys we should generate a pruning step that gets rid of all
+* partitions but the special null-accepting partitions.  For range
+* partitioning, that means we 

RE: How to make partitioning scale better for larger numbers of partitions

2018-07-13 Thread Tsunakawa, Takayuki
From: Amit Langote [mailto:langote_amit...@lab.ntt.co.jp]
> The immediate one I think is to refactor the planner such that the new
> pruning code, that we were able to utilize for SELECT in PG 11, can also
> be used for UPDATE/DELETE.  Refactoring needed to replace the pruning
> algorithm was minimal for SELECT, but not so much for UPDATE/DELETE.
> 
> Once we're able to utilize the new pruning algorithm for all the cases,
> we
> can move on to refactoring to avoid locking and opening of all partitions.
>  As long as we're relying on constraint exclusion for partition pruning,
> which we still do for UPDATE/DELETE, we cannot do that because constraint
> exclusion has to look at each partition individually.
> 
> The UPDATE/DELETE planning for partitioning using huge memory and CPU is
> a
> pretty big issue and refactoring planner to avoid that may be what's
> hardest of all the problems to be solved here.

Thank you.  There seem to be many challenges to address...  As a user and PG 
developer, I'd be happy to see some wiki page that describes the current 
performance characteristics in terms of # of partitions, the ideal and 
reachable performance, and the challenges to overcome to reach that ideal goal.


> If the benchmark contains queries that will need to access just one
> partition, then yes, the planning part is the biggest overhead.
> 
> Execution-time overhead is limited to having an extra, possibly needless,
> Append node, but I know David has a patch for that too.

That's good news, thanks.  Our user will perform around a hundred single-row 
INSERT/SELECT/UPDATE/DELETE statements in each transaction, and those are 
PREPAREd.  I hope PG 11 (with David's patches) will work well for that 
workload.  I'll wait for Kato-san's pgbench -M prepared result.

Regards
Takayuki Tsunakawa







Re: How to make partitioning scale better for larger numbers of partitions

2018-07-13 Thread David Rowley
On 13 July 2018 at 18:53, Tsunakawa, Takayuki
 wrote:
> By the way, what do you think is the "ideal and should-be-feasible" goal and 
> the "realistic" goal we can reach in the near future (e.g. PG 12)?  Say,

Depends. Patches don't move that fast without review, and nothing gets
committed without a committer.

> * Planning and execution time is O(log n), where n is the number of partitions
> * Planning time is O(log n), execution time is O(1)
> * Planning and execution time is O(1), where n is the number of partitions

It's going to depend on how many partitions are pruned. We still need
to generate paths for all non-pruned partitions which is going to be
slow when there are many partitions.

I think we can get pretty close to the non-partitioned planning
performance with SELECT/UPDATE/DELETE when all but 1 partition
survives pruning. There is always going to be some additional
overhead we can't get rid of, but hopefully it will be small.

Please feel free to review what I have in the July 'fest.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: How can we submit code patches that implement our (pending) patents?

2018-07-13 Thread Chris Travers
On Sat, Jul 7, 2018 at 9:01 PM Andres Freund  wrote:

> Hi,
>
> On 2018-07-07 20:51:56 +0200, David Fetter wrote:
> > As to "dual license," that's another legal thicket in which we've been
> > wise not to involve ourselves. "Dual licensing" is generally used to
> > assert proprietary rights followed immediately by a demand for
> > payment. This is a thing we don't want to do, and it's not a thing we
> > should be enabling others to do as part of our project.  If they wish
> > to do that, they're welcome to do it without our imprimatur.
>
> This is pure FUD.  Obviously potential results of dual licensing depends
> on the license chosen. None of what you describe has anything to do with
> potential pieces of dual PG License / Apache 2.0 licensed code in PG, or
> anything similar. You could at any time choose to only use /
> redistribute postgres, including derivatives, under the rights either
> license permits.
>
> I think there's fair arguments to be made that we do not want to go fo
> for dual licensing with apache 2.0. Biggest among them that the current
> situation is the established practice. But let's have the arguments be
> real, not FUD.
>

First, generally, dual licensing is used to assert proprietary rights, and
that's actually the issue here (the scope of the patent-holder's rights)
and the fact that it would change future use of the code in some not very nice
ways.  The major exception I know of is the Perl situation, where you have
the GPL v2/Artistic License dual release, but I am not entirely sure of the
historical reasons for this.

The problem is that the scope of a patent right is different, and this
would effectively mean the Apache License is the real license of the
product.

I assume we agree that the PostgreSQL license plus a patent encumbrance
would be the same as the scope of the patent license, not the scope of the
copyright license.

>
> Andres
>
>

-- 
Best Regards,
Chris Travers
Database Administrator

Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com
Saarbrücker Straße 37a, 10405 Berlin


RE: Global shared meta cache

2018-07-13 Thread Ideriha, Takeshi
Hi, Konstantin

>Hi,
>I really think that we need to move to global caches (and especially catalog caches) in
>Postgres.
>Modern NUMA servers may have hundreds of cores and to be able to utilize all of them,
>we may need to start a large number (hundreds) of backends.
>Memory overhead of local cache multiplied by 1000 can be quite significant.

Yeah, thank you for the comment.


>I am quite skeptical concerning performance results you have provided.
>Once dataset completely fits in memory (which is true in your case), 
>select-only
>pgbench with prepared statements should be about two times faster, than without
>prepared statements. And in your case performance with prepared statements is 
>even
>worser.
>
>I wonder if you have repeated each measurement multiple time, to make sure 
>that it
>is not just a fluctuation.
>Also which postgresql configuration you have used. If it is default 
>postgresql.conf with
>128Mb shared buffers size, then you are measuring time of disk access and 
>catalog
>cache is not relevant for performance in this case.
>
>Below are result I got with pgbench scale 100 (with scale 10 results are 
>slightly better)
>at my desktop with just 16Gb of RAM and 4 ccore.:
>
>                                | master branch | prototype | proto/master (%)
>--------------------------------+---------------+-----------+-----------------
>pgbench -c10 -T60 -Msimple -S   | 187189        | 182123    | 97%
>pgbench -c10 -T60 -Msimple      | 15495         | 15112     | 97%
>pgbench -c10 -T60 -Mprepared -S | 98273         | 92810     | 94%
>pgbench -c10 -T60 -Mprepared    | 25796         | 25169     | 97%
>
>As you see there are no surprises here: negative effect of shared cache is the largest
>for the case of non-prepared selects (because selects themselves are much faster
>than updates and during compilation we have to access relations multiple times).
>

As you pointed out, my shared_buffers and scaling factor settings were too small.
I did the benchmark again with new settings, and my results seem to reproduce
your results.

On a machine with 128GB of memory and 16 cores, shared_buffers was set to 32GB and
the database was initialized with -s100.

TPS results follow (mean of 10 measurements, rounded to integers):

                                        | master branch | proto  | proto/master (%)
  --------------------------------------+---------------+--------+-----------------
  pgbench -c48 -T60 -j16 -Msimple -S    | 122140        | 114103 | 93
  pgbench -c48 -T60 -j16 -Msimple       | 7858          | 7822   | 100
  pgbench -c48 -T60 -j16 -Mprepared -S  | 221740        | 210778 | 95
  pgbench -c48 -T60 -j16 -Mprepared     | 9257          | 8998   | 97
  
As you mentioned, SELECT-only queries have more overhead.

(By the way, I think in a later email you mentioned the results when the
number of concurrent clients is larger.
 I'll also try to check that.)


Takeshi Ideriha
Fujitsu Limited




RE: How to make partitioning scale better for larger numbers of partitions

2018-07-13 Thread Tsunakawa, Takayuki
From: David Rowley [mailto:david.row...@2ndquadrant.com]
> > David has submitted multiple patches for PG 12, one of which speeds up
> pruning of UPDATE/DELETE (I couldn't find it in the current CF, though.)
> What challenges are there for future versions, and which of them are being
> addressed by patches in progress for PG 12, and which issues are untouched?
> 
> I've not submitted that for PG12 yet. I had other ideas about just
> getting rid of the inheritance planner altogether, but so far don't
> have a patch for that. Still uncertain if there are any huge blockers
> to that either.

Sorry, I seem to have misunderstood something.

By the way, what do you think is the "ideal and should-be-feasible" goal and 
the "realistic" goal we can reach in the near future (e.g. PG 12)?  Say,

* Planning and execution time is O(log n), where n is the number of partitions
* Planning time is O(log n), execution time is O(1)
* Planning and execution time is O(1), where n is the number of partitions

Regards
Takayuki Tsunakawa




Re: How to make partitioning scale better for larger numbers of partitions

2018-07-13 Thread Amit Langote
On 2018/07/13 14:49, Tsunakawa, Takayuki wrote:
> From: Amit Langote [mailto:langote_amit...@lab.ntt.co.jp]
>> For SELECT/UPDATE/DELETE, overhead of partitioning in the planning phase
>> is pretty significant and gets worse as the number of partitions grows.  I
>> had intended to fix that in PG 11, but we could only manage to get part of
>> that work into PG 11, with significant help from David and others.  So,
>> while PG 11's overhead of partitioning during planning is less than PG
>> 10's due to improved pruning algorithm, it's still pretty far from ideal,
>> because it isn't just the pruning algorithm that had overheads.  In fact,
>> PG 11 only removes the pruning overhead for SELECT, so UPDATE/DELETE still
>> carry the overhead that was in PG 10.
> 
> David has submitted multiple patches for PG 12, one of which speeds up 
> pruning of UPDATE/DELETE (I couldn't find it in the current CF, though.)

I don't think he has posted a new patch for improving the speed for
UPDATE/DELETE planning yet, although I remember he had posted a PoC patch
back in February or March.

> What challenges are there for future versions, and which of them are being 
> addressed by patches in progress for PG 12, and which issues are untouched?

The immediate one I think is to refactor the planner such that the new
pruning code, that we were able to utilize for SELECT in PG 11, can also
be used for UPDATE/DELETE.  Refactoring needed to replace the pruning
algorithm was minimal for SELECT, but not so much for UPDATE/DELETE.

Once we're able to utilize the new pruning algorithm for all the cases, we
can move on to refactoring to avoid locking and opening of all partitions.
 As long as we're relying on constraint exclusion for partition pruning,
which we still do for UPDATE/DELETE, we cannot do that because constraint
exclusion has to look at each partition individually.

The UPDATE/DELETE planning for partitioning using huge memory and CPU is a
pretty big issue and refactoring planner to avoid that may be what's
hardest of all the problems to be solved here.

>> The overheads I mention stem from
>> the fact that for partitioning we still rely on the old planner code
>> that's used to perform inheritance planning, which requires to lock and
>> open *all* partitions.  We have so far been able to refactor just enough
>> to use the new code for partition pruning, but there is much refactoring
>> work left to avoid needlessly locking and opening all partitions.
> 
> I remember the issue of opening and locking all partitions was discussed 
> before.

Quite a few times I suppose.

> Does this relate only to the planning phase?

If the benchmark contains queries that will need to access just one
partition, then yes, the planning part is the biggest overhead.

Execution-time overhead is limited to having an extra, possibly needless,
Append node, but I know David has a patch for that too.

Thanks,
Amit




Re: How to make partitioning scale better for larger numbers of partitions

2018-07-13 Thread David Rowley
On 13 July 2018 at 14:58, Kato, Sho  wrote:
> Of course I'm sure table partitioning work well with up to a hundred
> partitions as written on the postgresql document.
>
> But, my customer will use partitioned table with 1.1k leaf partitions.
>
> So, we need to improve performance.
>
> Any ideas?

It would be really good if you could review and test my partitioning
patches in the current commitfest. These are the ones authored by me
with the word "partition" in the title.

These 4 patches don't solve all the problems, but they do make some
good gains in some areas. I've still more work to do, so the earlier I
can be finished with the 4 that are there now, the more I can focus on
the rest.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: How to make partitioning scale better for larger numbers of partitions

2018-07-13 Thread David Rowley
On 13 July 2018 at 17:49, Tsunakawa, Takayuki
 wrote:
> David has submitted multiple patches for PG 12, one of which speeds up 
> pruning of UPDATE/DELETE (I couldn't find it in the current CF, though.)  
> What challenges are there for future versions, and which of them are being 
> addressed by patches in progress for PG 12, and which issues are untouched?

I've not submitted that for PG12 yet. I had other ideas about just
getting rid of the inheritance planner altogether, but so far don't
have a patch for that. Still uncertain if there are any huge blockers
to that either. Join search needs to be performed multiple times, but a good
advantage will be that we can take advantage of partition pruning to
only join search the non-pruned partitions.

The reason I changed my mind about the patch you're talking about is
that it meant that RangeTblEntries needed to keep the same relation id
in the inheritance planner as they did in the grouping planner, and
another patch I have semi-baked delays building both RelOptInfo and
RangeTblEntry records for partitions until after pruning. Without the
RangeTblEntry it was difficult to ensure the relids were in lock-step
between the two planners.  There are ways to do it, it just didn't
look pretty.

Hoping to have something later in the cycle.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: How to make partitioning scale better for larger numbers of partitions

2018-07-13 Thread Justin Pryzby
On Fri, Jul 13, 2018 at 05:49:20AM +, Tsunakawa, Takayuki wrote:
> David has submitted multiple patches for PG 12, one of which speeds up 
> pruning of UPDATE/DELETE (I couldn't find it in the current CF, though.)  
> What challenges are there for future versions, and which of them are being 
> addressed by patches in progress for PG 12, and which issues are untouched?

Here's an known issue which I'm not sure is on anybody's list:
https://www.postgresql.org/message-id/flat/4F989CD8.8020804%40strategicdata.com.au#4f989cd8.8020...@strategicdata.com.au
=> planning of UPDATE/DELETE (but not SELECT) takes a huge amount of RAM when run
on a parent with many children.

We don't typically have UPDATEs, and those we do have are against the child tables;
but we ran into this last month with a manual query on the parent causing an OOM.  I
worked around it, but keep meaning to revisit it to see if this changed at all in
v11 (very brief testing suggests it has not).

Justin



Re: Cannot dump foreign key constraints on partitioned table

2018-07-13 Thread amul sul
Thanks for the prompt fix, patch [1] works for me.

1]  https://postgr.es/m/20180712184537.5vjwgxlbuiomomqd@alvherre.pgsql

Regards,
Amul