Re: != should give error?

2017-12-20 Thread Mikael Kjellström

On 2017-12-21 08:16, Craig Ringer wrote:


postgres=# create table tbl (col_a int, col_b int);
CREATE TABLE
postgres=# insert into tbl values (1,2);
INSERT 0 1
postgres=# insert into tbl values (2,1);
INSERT 0 1
postgres=# select * from tbl where col_a ! = 1;
 col_a | col_b
-------+-------
     1 |     2
(1 row)

One-factorial is one, so it's correct.


Yes, but I think he was expecting the behavior of "different from /
not equal".  So he meant to write:


select * from tbl where col_a != 1;

I.e. col_a different from 1.
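
With the sample rows inserted above, the corrected query filters out
col_a = 1 as intended:

postgres=# select * from tbl where col_a != 1;
 col_a | col_b
-------+-------
     2 |     1
(1 row)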

BTW isn't the "preferred" way of writing different from / not equal in 
SQL to use the <> operator?


/Mikael



Re: Using ProcSignal to get memory context stats from a running backend

2017-12-20 Thread Craig Ringer
On 21 December 2017 at 15:24, Andres Freund  wrote:

> Hi,
>
> On 2017-12-21 15:13:13 +0800, Craig Ringer wrote:
> > There tons of callers to  enlargeStringInfo, so a 'noerror' parameter
> would
> > be viable.
>
> Not sure what you mean with that sentence?
>

Mangled in editing and sent prematurely, disregard.

There are NOT tons of callers to  enlargeStringInfo, so adding a parameter
that allowed it to return a failure result rather than ERROR on OOM seems
to be a reasonable option. But it relies on repalloc(), which will ERROR on
OOM. AFAICS we don't have "no error" variants of the memory allocation
routines and I'm not eager to add them. Especially since we can't trust
that we're not on misconfigured Linux where the OOM killer will go wild
instead of giving us a nice NULL result.
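
To be concrete, the sort of thing a no-error growth path would need is
roughly the following - a sketch only, assuming palloc_extended() with
MCXT_ALLOC_NO_OOM as the allocation primitive and glossing over which
memory context the buffer actually lives in, not something I'm proposing:

#include "postgres.h"
#include "lib/stringinfo.h"
#include "utils/memutils.h"

/*
 * Sketch: grow "str" so it can take at least "needed" more bytes,
 * returning false instead of raising ERROR on allocation failure.
 * repalloc() has no no-error variant, so copy into a fresh buffer.
 */
static bool
enlargeStringInfoNoError(StringInfo str, int needed)
{
	Size		newlen = str->maxlen;
	char	   *newdata;

	/* Nothing to do if there is already enough room. */
	if (needed <= str->maxlen - str->len - 1)
		return true;

	while (newlen < (Size) (str->len + needed + 1))
		newlen *= 2;
	if (newlen > MaxAllocSize)
		newlen = MaxAllocSize;
	if (newlen < (Size) (str->len + needed + 1))
		return false;			/* can't fit within MaxAllocSize */

	newdata = palloc_extended(newlen, MCXT_ALLOC_NO_OOM);
	if (newdata == NULL)
		return false;			/* caller emits a static "out of memory" note */

	memcpy(newdata, str->data, str->len + 1);
	pfree(str->data);
	str->data = newdata;
	str->maxlen = (int) newlen;
	return true;
}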

So I guess that means we should probably just do individual elog(...)s and
live with the ugliness of scraping the resulting mess out of the logs.
After all, with a log destination that could copy and modify the string
being logged a couple of times, it's not a good idea to try to drop the
whole thing into the logs in one blob. And we can't trust things like
syslog with large messages.

I'll follow up with a revised patch that uses individual elog()s soon.

BTW, I also pgindented it in my tree, so it'll have formatting fixed up.
pgindent's handling of

static void printwrapper_stringinfo(void *extra, const char *fmt,...)
pg_attribute_printf(2, 3);

upsets me a little, since if I break the line it mangles it into

static void
printwrapper_stringinfo(void *extra, const char *fmt,...)
pg_attribute_printf(2, 3);

so it'll break line-length rules a little, but otherwise conform.

> > But I'm not convinced it's worth it personally. If we OOM in response to a
> > ProcSignal request for memory context output, we're having pretty bad luck.
> > The output is 8k in my test. But even if it were a couple of hundred kb,
> > happening to hit OOM just then isn't great luck on modern systems with many
> > gigabytes of RAM.
>
> I've seen plenty of memory dumps in the dozens to hundreds of
> megabytes. And imo such cases are more likely to invite use of this
> facility.
>

OK, that's more concerning then. Also impressively huge.

As written it's using MemoryContextStatsDetail(TopMemoryContext, 100) so it
shouldn't really be *getting* multimegabyte dumps, but I'm not thrilled by
hardcoding that. It doesn't merit a GUC though, so I'm not sure what to do
about it and figured I'd make it a "later" problem.


> > If that *does* happen, repalloc(...) will call
> > MemoryContextStats(TopMemoryContext) before returning NULL. So we'll get
> > our memory context dump anyway, albeit to stderr.
>
> That would still abort the query that might otherwise continue to work,
> so that seems no excuse.


Eh. It'd probably die soon enough anyway. But yeah, I agree it's not safe
to hope we can sling around up to a couple of hundred MB under presumed
memory pressure.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Using ProcSignal to get memory context stats from a running backend

2017-12-20 Thread Andres Freund
Hi,

On 2017-12-21 15:13:13 +0800, Craig Ringer wrote:
> There tons of callers to  enlargeStringInfo, so a 'noerror' parameter would
> be viable.

Not sure what you mean with that sentence?


> But I'm not convinced it's worth it personally. If we OOM in response to a
> ProcSignal request for memory context output, we're having pretty bad luck.
> The output is 8k in my test. But even if it were a couple of hundred kb,
> happening to hit OOM just then isn't great luck on modern systems with many
> gigabytes of RAM.

I've seen plenty of memory dumps in the dozens to hundreds of
megabytes. And imo such cases are more likely to invite use of this
facility.


> If that *does* happen, repalloc(...) will call
> MemoryContextStats(TopMemoryContext) before returning NULL. So we'll get
> our memory context dump anyway, albeit to stderr.

That would still abort the query that might otherwise continue to work,
so that seems no excuse.

Greetings,

Andres Freund



Re: != should give error?

2017-12-20 Thread Craig Ringer
On 21 December 2017 at 14:43, Rajkumar Raghuwanshi <
rajkumar.raghuwan...@enterprisedb.com> wrote:

> Hi,
>
> With the below query I am getting unexpected output: here != is
> behaving as =.
> Is this expected behaviour?
>
> postgres=# create table tbl (col_a int, col_b int);
> CREATE TABLE
> postgres=# insert into tbl values (1,2);
> INSERT 0 1
> postgres=# insert into tbl values (2,1);
> INSERT 0 1
>
>
>
>
> postgres=# select * from tbl where col_a ! = 1;
>  col_a | col_b
> -------+-------
>      1 |     2
> (1 row)
>
>
One-factorial is one, so it's correct.

test=> SELECT 1!, 2!, 3!;
 ?column? | ?column? | ?column?
----------+----------+----------
        1 |        2 |        6
(1 row)

test=> SELECT (1 !) = 1;
 ?column?
--
 t
(1 row)

test=> \do+ !
                                      List of operators
   Schema   | Name | Left arg type | Right arg type | Result type |  Function   | Description
------------+------+---------------+----------------+-------------+-------------+-------------
 pg_catalog | !    | bigint        |                | numeric     | numeric_fac | factorial
(1 row)




-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Using ProcSignal to get memory context stats from a running backend

2017-12-20 Thread Craig Ringer
On 21 December 2017 at 14:58, Andres Freund  wrote:

> Hi,
>
> On 2017-12-21 14:49:28 +0800, Craig Ringer wrote:
> > +/*
> > + * Accumulate writes into the buffer in diag_request_buf,
> > + * for use with functions that expect a printf-like callback.
> > + */
> > +static void
> > +printwrapper_stringinfo(void *extra, const char * fmt, ...)
> > +{
> > + StringInfo out = extra;
> > + for (;;)
> > + {
> > + va_list args;
> > + int needed;
> > + va_start(args, fmt);
> > + needed = appendStringInfoVA(out, fmt, args);
> > + va_end(args);
> > + if (needed == 0)
> > + break;
> > + enlargeStringInfo(out, needed);
> > + }
> >  }
>
> Hm, so I'm not entirely sure it's ok to use something that ERRORs on
> OOM. There's plenty of scenarios with thousands of memory contexts,
> making this output fairly large. If we want to make this usable in
> production, I'm not sure it's ok to introduce additional ERRORs.  I
> wonder if we can change this to emit a static message if collecting the
> output exhausted memory.


There tons of callers to  enlargeStringInfo, so a 'noerror' parameter would
be viable.

But I'm not convinced it's worth it personally. If we OOM in response to a
ProcSignal request for memory context output, we're having pretty bad luck.
The output is 8k in my test. But even if it were a couple of hundred kb,
happening to hit OOM just then isn't great luck on modern systems with many
gigabytes of RAM.

If that *does* happen, repalloc(...) will call
MemoryContextStats(TopMemoryContext) before returning NULL. So we'll get
our memory context dump anyway, albeit to stderr.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Using ProcSignal to get memory context stats from a running backend

2017-12-20 Thread Andres Freund
Hi,

On 2017-12-21 14:49:28 +0800, Craig Ringer wrote:
> +/*
> + * Accumulate writes into the buffer in diag_request_buf,
> + * for use with functions that expect a printf-like callback.
> + */
> +static void
> +printwrapper_stringinfo(void *extra, const char * fmt, ...)
> +{
> + StringInfo out = extra;
> + for (;;)
> + {
> + va_list args;
> + int needed;
> + va_start(args, fmt);
> + needed = appendStringInfoVA(out, fmt, args);
> + va_end(args);
> + if (needed == 0)
> + break;
> + enlargeStringInfo(out, needed);
> + }
>  }

Hm, so I'm not entirely sure it's ok to use something that ERRORs on
OOM. There's plenty of scenarios with thousands of memory contexts,
making this output fairly large. If we want to make this usable in
production, I'm not sure it's ok to introduce additional ERRORs.  I
wonder if we can change this to emit a static message if collecting the
output exhausted memory.

Greetings,

Andres Freund



Re: Using ProcSignal to get memory context stats from a running backend

2017-12-20 Thread Craig Ringer
On 20 December 2017 at 08:46, Craig Ringer  wrote:

> On 20 December 2017 at 02:35, Andres Freund  wrote:
>
>
>> > Yeah.  But please don't mess with MemoryContextStats per se ---
>> > I dunno about you guys but "call MemoryContextStats(TopMemoryContext)"
>> > is kinda wired into my gdb reflexes.  I think what'd make sense
>> > is a new function "MemoryContextStatsTo(context, function_pointer)".
>> > It's okay to redefine the APIs of the per-context-type functions
>> > these would call, though, because nobody calls those functions directly.
>>
>> We already have MemoryContextStatsDetail() - it seems to make sense to
>> expand that API and leave MemoryContextStats() alone. I don't think
>> there's a need for a third variant?
>>
>
> Cool, can do.
>
> I'll have to expose a typedef for the printf-wrapper callback in
> memnodes.h and add it to the stats method, which I thought would be more
> likely to get complaints than the global hook. I'm actually happier to do
> it with a passed callback.
>
> Will revise when I get a chance in the next couple of days.
>

Done.

It uses vfprintf unconditionally, even on Windows where we'd usually use
write_stderr, so it doesn't change the current MemoryContextStats behaviour.

2017-12-21 14:39:20.045 AWST [10588] pg_regress/misc_functions LOG:
 diagnostic dump requested; memory context info: TopMemoryContext: 67440
total in 5 blocks; 13376 free (6 chunks); 54064 used
  pgstat TabStatusArray lookup hash table: 8192 total in 1 blocks;
1376 free (0 chunks); 6816 used
  TopTransactionContext: 8192 total in 1 blocks; 7728 free (1
chunks); 464 used
  ...

To verify that stderr output still works properly I ran:


SELECT
  repeat('spamspamspam', 2000),
  repeat('spamspamspam', 2000),
  repeat('spamspamspam', 2000),
  repeat('spamspamspam', 2000),
  ... lots ...;


and got the expected

+ERROR:  out of memory
+DETAIL:  Failed on request of size 24004.

and memory context dump to stderr. I didn't add a TAP test for that because
I'd rather avoid exercising OOM in tests where we don't know what the
host's swap config is like, how its ulimit behaves, whether it has
craziness like the OOM killer, etc. But it might make sense to add a TAP
test to set a low ulimit and exercise OOM recovery at some point.

I've added a separate vfprintf wrapper for memory context use that doesn't
try to use the windows error log like write_stderr(...) does. If we want to
change OOM behaviour on Windows it can be done in a followup.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
From db5caa16cd3c78544f7a6e336c3d8b9118f7cdaf Mon Sep 17 00:00:00 2001
From: Craig Ringer 
Date: Tue, 19 Dec 2017 20:47:36 +0800
Subject: [PATCH v2] Add pg_diag_backend to print memory context info

The new pg_diag_backend(pid) function signals a backend to dump
a summary of its memory context state when it next calls
CHECK_FOR_INTERRUPTS(). The memory context information is
output to the server log.

This works on any backend that uses the standard
procsignal_sigusr1_handler or explicitly handles PROCSIG_DIAG_REQUEST in
its own handler. Currently this includes normal user backends, the
walsender and autovacuum workers but not the other system processes.
Background workers must explicitly handle SIGUSR1 with
procsignal_sigusr1_handler.
---
 doc/src/sgml/bgworker.sgml   | 10 -
 doc/src/sgml/func.sgml   | 13 ++
 src/backend/storage/ipc/procsignal.c |  3 ++
 src/backend/tcop/postgres.c  | 66 
 src/backend/utils/adt/misc.c | 16 +++
 src/backend/utils/init/globals.c |  1 +
 src/backend/utils/mmgr/aset.c| 10 +++--
 src/backend/utils/mmgr/mcxt.c| 50 -
 src/include/catalog/pg_proc.h|  2 +
 src/include/miscadmin.h  |  4 ++
 src/include/nodes/memnodes.h | 10 -
 src/include/storage/procsignal.h |  5 ++-
 src/include/utils/memutils.h |  4 +-
 src/test/regress/expected/misc_functions.out | 12 +
 src/test/regress/sql/misc_functions.sql  |  9 
 15 files changed, 194 insertions(+), 21 deletions(-)

diff --git a/doc/src/sgml/bgworker.sgml b/doc/src/sgml/bgworker.sgml
index 4bc2b69..50bb54f 100644
--- a/doc/src/sgml/bgworker.sgml
+++ b/doc/src/sgml/bgworker.sgml
@@ -185,7 +185,8 @@ typedef struct BackgroundWorker
when the process is started or exits.  It should be 0 for workers registered
at postmaster startup time, or when the backend registering the worker does
not wish to wait for the worker to start up.  Otherwise, it should be
-   initialized to MyProcPid.
+   initialized to MyProcPid. Typically this sets the
+   receiving process's latch via its signal handler.
   
 
   Once running, the process can connect to a database by calling
@@

!= should give error?

2017-12-20 Thread Rajkumar Raghuwanshi
Hi,

With the below query I am getting unexpected output: here != is behaving
as =.
Is this expected behaviour?

postgres=# create table tbl (col_a int, col_b int);
CREATE TABLE
postgres=# insert into tbl values (1,2);
INSERT 0 1
postgres=# insert into tbl values (2,1);
INSERT 0 1




postgres=# select * from tbl where col_a ! = 1;
 col_a | col_b
-------+-------
     1 |     2
(1 row)

postgres=# select * from tbl where col_a = 1;
 col_a | col_b
-------+-------
     1 |     2
(1 row)

Thanks & Regards,
Rajkumar Raghuwanshi
QMG, EnterpriseDB Corporation


Re: Bitmap table scan cost per page formula

2017-12-20 Thread Haisheng Yuan
Hi Jeff,

The issue happens in our customer's production environment, and I don't have
access to their hardware. But I agree, the default value of 100 is indeed a
poor value. After I change the default value to 30 or less, the query
starts generating a plan with a bitmap scan as expected.

~ ~ ~
Haisheng Yuan

On Wed, Dec 20, 2017 at 9:43 PM, Jeff Janes  wrote:

> On Tue, Dec 19, 2017 at 11:55 AM, Haisheng Yuan  wrote:
>
>> Hi hackers,
>>
>> This is Haisheng Yuan from Greenplum Database.
>>
>> We had some query in production showing that planner favors seqscan over
>> bitmapscan, and the execution of seqscan is 5x slower than using
>> bitmapscan, but the cost of bitmapscan is 2x the cost of seqscan. The
>> statistics were updated and quite accurate.
>>
>> Bitmap table scan uses a formula to interpolate between random_page_cost
>> and seq_page_cost to determine the cost per page. In Greenplum Database,
>> the default value of random_page_cost is 100, the default value of
>> seq_page_cost is 1. With the original cost formula, random_page_cost
>> dominates in the final cost result, even the formula is declared to be
>> non-linear.
>>
>
> My first inclination would be take this as evidence that 100 is a poor
> default for random_page_cost, rather than as evidence that the bitmap heap
> scan IO cost model is wrong.
>
> Could you try the low level benchmark I posted elsewhere in the thread on
> your hardware for reading 1/3 or 1/2 of the pages, in order?  Maybe your
> kernel/system does  a better job of predicting read ahead.
>
> Cheers,
>
> Jeff
>


Re: domain cast in parameterized vs. non-parameterized query

2017-12-20 Thread Pavel Stehule
Hi

2017-12-20 23:41 GMT+01:00 Tom Lane :

> I wrote:
> > You might consider whether you can write 'spa-000'::uid explicitly in
> your
> > query; that results in immediate application of the domain coercion, so
> > that the planner no longer sees that as a run-time operation it has to
> > avoid.
>
> Hm, scratch that --- experimentation shows that the parser still produces
> a CoerceToDomain node in that case, not a literal of the domain type.
>
> regression=# create domain foo as text;
> CREATE DOMAIN
> regression=# explain verbose select 'x'::foo;
> QUERY PLAN
> ---
>  Result  (cost=0.00..0.01 rows=1 width=32)
>Output: ('x'::text)::foo
> (2 rows)
>
> You could force the issue with an immutable function:
>

Why doesn't the rewrite reduce it? Or why does the parser do it?

Regards

Pavel


>
> regression=# create function forcefoo(text) returns foo as
> regression-# 'begin return $1::foo; end' language plpgsql immutable;
> CREATE FUNCTION
> regression=# explain verbose select forcefoo('x');
> QUERY PLAN
> ---
>  Result  (cost=0.00..0.01 rows=1 width=32)
>Output: 'x'::foo
> (2 rows)
>
> Marking this function as immutable is sort of a lie, because it
> is effectively telling the planner that you don't expect any
> failure from pre-evaluation of the function.  But it'd get the
> job done, and in most situations there's no practical difference
> because any failure would have happened anyway at runtime.
>

> regards, tom lane
>
>


Re: Bitmap table scan cost per page formula

2017-12-20 Thread Jeff Janes
On Wed, Dec 20, 2017 at 5:03 PM, Tom Lane  wrote:

>
> The parabola is probably wrong in detail --- its behavior as we approach
> reading all of the pages ought to be more asymptotic, seems like.
> I suppose that the reason it appears to go below the seqscan cost at the
> right is that even the rightmost end of the curve doesn't reflect reading
> all of the *tuples*, just one tuple per page, so that there's some CPU
> savings from not inspecting all the tuples.


I think that the graph is only about the IO costs of the heap scan, and the
CPU costs are an independent issue.

The reason it drops below the seqscan on the far right edge is much
simpler.  10,000 is the point where 100% of the pages are being read, so
the two scans converge to the same value.  The graph continues above 10,000
but it is meaningless as a bitmap heap scan will never read more than 100%
of the table pages, so the math no longer represents reality.
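
For reference, the current per-page interpolation (as I recall it from
cost_bitmap_heap_scan in costsize.c; a simplified sketch, so treat the exact
form as illustrative) behaves like this:

#include <math.h>
#include <stdio.h>

/* Rough model of the bitmap heap scan per-page cost interpolation. */
static double
bitmap_cost_per_page(double pages_fetched, double total_pages,
					 double random_page_cost, double seq_page_cost)
{
	if (pages_fetched >= 2.0)
		return random_page_cost -
			(random_page_cost - seq_page_cost) *
			sqrt(pages_fetched / total_pages);
	return random_page_cost;
}

int
main(void)
{
	double		T = 10000.0;	/* table size in pages, as in the graph */
	int			i;

	/* With the default random_page_cost = 4 the curve runs from ~4 down to 1. */
	for (i = 1; i <= 10; i++)
	{
		double		frac = i / 10.0;

		printf("frac=%.1f  cost/page=%.2f\n",
			   frac, bitmap_cost_per_page(frac * T, T, 4.0, 1.0));
	}

	/* With a Greenplum-style random_page_cost = 100 the random term dominates. */
	printf("frac=0.5  cost/page=%.2f (random_page_cost=100)\n",
		   bitmap_cost_per_page(0.5 * T, T, 100.0, 1.0));
	return 0;
}

At 100% of the pages the per-page cost falls to seq_page_cost, which is
exactly where the bitmap and sequential scan curves meet.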

Cheers,

Jeff


Re: Bitmap table scan cost per page formula

2017-12-20 Thread Jeff Janes
On Tue, Dec 19, 2017 at 7:25 PM, Justin Pryzby  wrote:

>
> I started playing with this weeks ago (probably during Vitaliy's problem
> report).  Is there any reason cost_bitmap_heap_scan shouldn't interpolate
> based
> on correlation from seq_page_cost to rand_page_cost, same as cost_index ?
>

I think that doing something like that is a good idea in general, but
someone has to implement the code, and so far no one seems enthused to do
so.  You seem pretty interested in the topic, so

It is not obvious to me how to pass the correlation from the cost_index up
to the bitmap heap scan, especially not if it has to go through a BitmapAnd
or a BitmapOr to get there.  Maybe the optimization would only be used in
the case where there are no BitmapAnd or BitmapOr, at least for a proof of
concept?

I haven't been able to reproduce your test case, but I have not had the
hardware or the time to try very hard.  I think next year I'll have more
time, but I don't know about the hardware.

Cheers,

Jeff


Re: Bitmap table scan cost per page formula

2017-12-20 Thread Jeff Janes
On Tue, Dec 19, 2017 at 11:55 AM, Haisheng Yuan  wrote:

> Hi hackers,
>
> This is Haisheng Yuan from Greenplum Database.
>
> We had some query in production showing that planner favors seqscan over
> bitmapscan, and the execution of seqscan is 5x slower than using
> bitmapscan, but the cost of bitmapscan is 2x the cost of seqscan. The
> statistics were updated and quite accurate.
>
> Bitmap table scan uses a formula to interpolate between random_page_cost
> and seq_page_cost to determine the cost per page. In Greenplum Database,
> the default value of random_page_cost is 100, the default value of
> seq_page_cost is 1. With the original cost formula, random_page_cost
> dominates in the final cost result, even the formula is declared to be
> non-linear.
>

My first inclination would be take this as evidence that 100 is a poor
default for random_page_cost, rather than as evidence that the bitmap heap
scan IO cost model is wrong.

Could you try the low level benchmark I posted elsewhere in the thread on
your hardware for reading 1/3 or 1/2 of the pages, in order?  Maybe your
kernel/system does  a better job of predicting read ahead.

Cheers,

Jeff


Re: GSoC 2018

2017-12-20 Thread Andrey Borodin
Hi, Stefan!
> On 15 Dec 2017, at 15:03, Stefan Keller wrote:
> 
> What about adding "Learned Index" as project task [*]?
> This type of index looks promising for certain properties.
> 
> [*] "The Case for Learned Index Structures" Kraska et al. (Dec 2017)
> https://arxiv.org/abs/1712.01208

Let's fill in task description here 
https://wiki.postgresql.org/index.php?title=GSoC_2018

At least we need:
Project title
Project Description
Skills needed
Difficulty Level
Potential Mentors
Expected Outcomes

I propose title: Learned index prototype
Expected outcome: Design for integration of learned indexes into PostgreSQL 
(possibly, using index-as-extension technology), prototype implementation and 
benchmarks.

Best regards, Andrey Borodin.


Re: Bitmap table scan cost per page formula

2017-12-20 Thread Jeff Janes
On Wed, Dec 20, 2017 at 2:18 PM, Robert Haas  wrote:

> On Wed, Dec 20, 2017 at 4:20 PM, Jeff Janes  wrote:
>>
>> It is not obvious to me that the parabola is wrong.  I've certainly seen
>> cases where reading every 2nd or 3rd block (either stochastically, or
>> modulus) actually does take longer than reading every block, because it
>> defeats read-ahead.  But it depends on a lot on your kernel version and
>> your kernel settings and your file system and probably other things as well.
>>
>
> Well, that's an interesting point, too.  Maybe we need another graph that
> also shows the actual runtime of a bitmap scan and a sequential scan.
>

I did some low-level IO benchmarking, and it is actually 13 times slower
to read every 3rd block than every block using CentOS 6.9 with ext4 and the
setting:
blockdev --setra 8192 /dev/sdb1
This is on some virtualized storage whose details I don't know, but it
behaves as if it were RAID/JBOD with around 6 independent spindles.

If I pick the 1/3 of the blocks to read stochastically rather than by
modulus, it is only 2 times slower than reading all of them, I assume
because having occasional reads which are adjacent to each other does make
read-ahead kick in, while evenly spaced never-adjacent reads does not.
This is probably a better model of how bitmap table scans actually work, as
there is no general reason to think they would be evenly spaced and
non-adjacent.  So this result is in reasonable agreement with how the
current cost estimation works, the parabola peaks at about twice the cost
as the sequence scan.

I used a file of about 10GB, because I happened to have one laying around.

## read every block ($_%3>5 is never true)
sudo sh -c "echo 3 > /proc/sys/vm/drop_caches"
time perl -le 'open my $fh, "rand" or die; foreach (1..130) {$x="";
next if $_%3>5; sysseek $fh,$_*8*1024,0 or die $!; sysread $fh, $x,8*1024;
print length $x} '|uniq -c

1295683 8192
   4317 0
real    0m38.329s

## read every 3rd block
sudo sh -c "echo 3 > /proc/sys/vm/drop_caches"
time perl -le 'open my $fh, "rand" or die; foreach (1..130) {$x="";
next if $_%3>0; sysseek $fh,$_*8*1024,0 or die $!; sysread $fh, $x,8*1024;
print length $x} '|uniq -c
 431894 8192
   1439 0
real    8m54.698s

time perl -wle 'open my $fh, "rand" or die; foreach (1..130) {$x="";
next if rand()>0.; sysseek $fh,$_*8*1024,0 or die $!; sysread $fh,
$x,8*1024; print length $x} '|uniq -c
 431710 8192
   1434 0
real    1m23.130s

Dropping the caches is a reasonable way to do this type of benchmark,
because it simulates what would happen if your data set was much larger
than RAM, without needing to actually use a data set much larger than RAM.

It would be interesting to see what other people get for similar low level
tests, as well actual bitmap scans.

Cheers,

Jeff


Re: Add hint about replication slots when nearing wraparound

2017-12-20 Thread Michael Paquier
On Wed, Dec 20, 2017 at 10:00 PM, Feike Steenbergen
 wrote:
> As far as I know the issue only occurs for stale replication slots for
> logical decoding but not for physical replication, is that correct?

Yeah, I recall something similar.

@@ -255,7 +255,9 @@ $ pg_recvlogical -d postgres --slot test --drop-slot
   even when there is no connection using them. This consumes storage
   because neither required WAL nor required rows from the system catalogs
   can be removed by VACUUM as long as they are
required by a replication
-  slot.  So if a slot is no longer required it should be dropped.
+  slot.  In extreme cases this could cause the database to shut
down to prevent
+  transaction ID wraparound (see ).
+  So if a slot is no longer required it should be dropped.
  

Don't you want to put that in its own  block? That's rather
important not to miss for administrators.
-- 
Michael



Re: Letting plpgsql in on the fun with the new expression eval stuff

2017-12-20 Thread Robert Haas
On Wed, Dec 20, 2017 at 6:06 PM, Tom Lane  wrote:
> Anyway, I left it as-is, but I'm willing to make the change if
> people feel the other way is better.

I feel the other way -- let's not add more pointer indirections if it
isn't really necessary.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] REINDEX CONCURRENTLY 2.0

2017-12-20 Thread Craig Ringer
On 21 December 2017 at 11:31, Michael Paquier 
wrote:

> On Thu, Dec 21, 2017 at 11:46 AM, Alvaro Herrera
>  wrote:
> > Michael Paquier wrote:
> >> Well, the idea is really to get rid of that as there are already
> >> facilities of this kind for CREATE TABLE LIKE in the parser and ALTER
> >> TABLE when rewriting a relation. It is not really attractive to have a
> >> 3rd method in the backend code to do the same kind of things, for a
> >> method that is even harder to maintain than the other two.
> >
> > I dislike the backend code that uses SPI and manufactures nodes to
> > re-create indexes.  IMO we should get rid of it.  Let's not call it
> > "facilities", but rather "grotty hacks".
>
> Aha. You are making my day here ;)
>
> > I think before suggesting to add even more code to perpetuate that idea,
> > we should think about going in the other direction.  I have not tried to
> > write the code, but it should be possible to have an intermediate
> > function called by ProcessUtility* which transforms the IndexStmt into
> > an internal representation, then calls DefineIndex.   This way, all this
> > code that wants to create indexes for backend-internal reasons can
> > create the internal representation directly then call DefineIndex,
> > instead of the horrible hacks they use today creating parse nodes by
> > hand.
>
> Yeah, that would be likely possible. I am not volunteering for that in
> the short term though..
>

It sounds like that'd make some of ALTER TABLE a bit less ... upsetting ...
too.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] REINDEX CONCURRENTLY 2.0

2017-12-20 Thread Michael Paquier
On Thu, Dec 21, 2017 at 11:46 AM, Alvaro Herrera
 wrote:
> Michael Paquier wrote:
>> Well, the idea is really to get rid of that as there are already
>> facilities of this kind for CREATE TABLE LIKE in the parser and ALTER
>> TABLE when rewriting a relation. It is not really attractive to have a
>> 3rd method in the backend code to do the same kind of things, for a
>> method that is even harder to maintain than the other two.
>
> I dislike the backend code that uses SPI and manufactures nodes to
> re-create indexes.  IMO we should get rid of it.  Let's not call it
> "facilities", but rather "grotty hacks".

Aha. You are making my day here ;)

> I think before suggesting to add even more code to perpetuate that idea,
> we should think about going in the other direction.  I have not tried to
> write the code, but it should be possible to have an intermediate
> function called by ProcessUtility* which transforms the IndexStmt into
> an internal representation, then calls DefineIndex.   This way, all this
> code that wants to create indexes for backend-internal reasons can
> create the internal representation directly then call DefineIndex,
> instead of the horrible hacks they use today creating parse nodes by
> hand.

Yeah, that would be likely possible. I am not volunteering for that in
the short term though..
-- 
Michael



Re: [HACKERS] replace GrantObjectType with ObjectType

2017-12-20 Thread Stephen Frost
Michael, Peter, all,

* Michael Paquier (michael.paqu...@gmail.com) wrote:
> On Thu, Dec 21, 2017 at 1:19 AM, Peter Eisentraut
>  wrote:
> > On 12/20/17 10:37, Alvaro Herrera wrote:
> >> I think Michael's point is that instead of a "default:" clause, this
> >> switch should list all the known values of the enum and throw an
> >> "unsupported object type" error for them.  So whenever somebody adds a
> >> new object type, the compiler will complain that this switch doesn't
> >> handle it and the developer will have to think about this function.
> 
> Thanks Álvaro, that's exactly the point I am coming at. The previous
> version of the patch was breaking an existing flow.
> 
> > Right.  I was actually looking at a later patch that I had not sent in
> > yet that had already addressed that.  So here it is.
> 
> Thanks for the new version. I have looked at 0001, and this looks
> acceptable for me in this shape.
> 
> In the set of things that could be improved, but I am of course not
> asking about those being addressed in this patch... Things could be
> made more consistent for ExecGrantStmt_oids, objectNamesToOids,
> objectsInSchemaToOids, SetDefaultACL and
> ExecAlterDefaultPrivilegesStmt for the switch/case handlings.

I looked into various bits while considering this change.

One concern I have here is that it's introducing OBJECT_RELATION as a
new object type when we already have OBJECT_TABLE, OBJECT_VIEW and
OBJECT_SEQUENCE.  In some ways it makes sense- the GRANT command allows
the user to be ambiguous when it comes to the details of the object
type:

GRANT SELECT ON foo TO bar;

In this case, foo could be a table, a view, or a sequence, so we have
the grammar code it as OBJECT_RELATION and then use RangeVarGetRelOids
to get the OIDs for it, since that function doesn't much care what kind
of relation it is, and off we go.

There's some downsides to this approach though: we do an initial set of
checks in ExecGrantStmt, but we can't do all of them because we don't
know if it's a sequence or not, so we end up with some additional
special checks to see if the GRANT is valid down in ExecGrant_Relation
after we figure out what kind of relation it is.

Further, anything which handles an OBJECT_TABLE or OBJECT_VIEW or
OBJECT_SEQUENCE today might be expected to now be able to handle an
OBJECT_RELATION instead, though it doesn't look like this patch makes
any attempt to address that.

Lastly, and I'm not sure if we'd actually want to change it, but:

GRANT SELECT ON TABLE sequence1 TO role1;

works just fine even though it's not really correct.  The other way:

GRANT SELECT ON SEQUENCE table1 TO role1;

doesn't work though, and neither does GRANT SELECT ON VIEW (at all).

In the end, I'd personally prefer refactoring ExecGrantStmt and friends
to move the GRANT-type checks down into the individual functions and,
ideally, avoid having to have OBJECT_RELATION at all, though that seems
like it'd be a good bit of work.

My second preference would be to at least add some commentary about
OBJECT_RELATION vs. OBJECT_(TABLE|VIEW|SEQUENCE), when which should be
used and why, and review functions that accept objects of those types
to see if they should be extended to also accept OBJECT_RELATION.

> I have not looked at 0002 in details.

Neither have I.

Thanks!

Stephen




Re: Reproducible builds: genbki.pl and Gen_fmgrtab.pl

2017-12-20 Thread Tom Lane
Christoph Berg  writes:
> Re: Tom Lane 2017-12-16 <417.1513438...@sss.pgh.pa.us>
>> I think we're talking at cross-purposes.  I'm not saying we should not fix
>> this problem.  I'm saying that the proposed fix appears incomplete ...

> Grepping through the source, there are three places where $0 printed
> to files in regular operation (as opposed to being used in --help):

I poked around and found a few more.

> I believe the reason why we've only been seeing half of the problem
> yet is that the generated files are shipped with the tarballs, so it
> might be a timestamping issue determining if the scripts are
> re-executed.

Right; some parts of this problem would only materialize for you if you
needed to rebuild the generated files that are included in the tarball,
which should basically not be happening in normal packager builds.
Rather the risk is at our end: if we ever switched the tarball creation
process to be a VPATH build, then there'd be path dependencies in the
created tarballs.  That would be bad.

More generally, my concern here is not just that we fix this problem
but that it stays fixed.  If some individual scripts print $0 into
their output and it happens to not affect any built distribution files
today, it's still bad, because tomorrow somebody might copy that coding
pattern into someplace else where it matters more.  I think we need a
project policy that thou shalt not print $0 into generated files, period.

Also, experimenting with a VPATH build, I verified that such "helpful"
practices as printing $infile or @ARGV into the output file will also
create path dependencies.  So I think we need to lose those too.
It's not like they're adding any info you can't find out from the
Makefiles.

On the other hand, there is something we can do that will improve
matters: rather than just printing the base name of the script,
let's print its full relative path within the PG sources, eg instead
of Gen_fmgrtab.pl let's print src/backend/utils/Gen_fmgrtab.pl.
My thought here is that if you're not already intimately familiar
with a script you might not remember where it lives, the more so
if you're looking at a file that's been put into an installation
tree far away from where it was generated.  I see that this policy
was already followed in some places, just not in the ones that were
using the $0 shortcut.

In short, I propose the attached more-extensive patch.

Some of the files generated by these scripts, particularly the map
files generated by the src/backend/utils/mb/Unicode/ scripts, are
not just present in tarballs but are actually in our git repo.
So changing those scripts won't affect anything until/unless someone
updates the repo's generated files, which I've not done here and
don't feel much need to do.  I just want to establish a principle
that we don't print path-dependent info into generated files.

regards, tom lane

diff --git a/src/backend/catalog/genbki.pl b/src/backend/catalog/genbki.pl
index 256a9c9..e4a0b8b 100644
--- a/src/backend/catalog/genbki.pl
+++ b/src/backend/catalog/genbki.pl
@@ -340,7 +340,7 @@ print $schemapg < EUC_JIS_2004 table
 
diff --git a/src/backend/utils/mb/Unicode/UCS_to_EUC_JP.pl b/src/backend/utils/mb/Unicode/UCS_to_EUC_JP.pl
index 5db663f..77ce273 100755
--- a/src/backend/utils/mb/Unicode/UCS_to_EUC_JP.pl
+++ b/src/backend/utils/mb/Unicode/UCS_to_EUC_JP.pl
@@ -14,7 +14,7 @@
 use strict;
 use convutils;
 
-my $this_script = $0;
+my $this_script = 'src/backend/utils/mb/Unicode/UCS_to_EUC_JP.pl';
 
 # Load JIS0212.TXT
 my $jis0212 = &read_source("JIS0212.TXT");
diff --git a/src/backend/utils/mb/Unicode/UCS_to_EUC_KR.pl b/src/backend/utils/mb/Unicode/UCS_to_EUC_KR.pl
index f0d978d..8e0126c 100755
--- a/src/backend/utils/mb/Unicode/UCS_to_EUC_KR.pl
+++ b/src/backend/utils/mb/Unicode/UCS_to_EUC_KR.pl
@@ -19,7 +19,7 @@
 use strict;
 use convutils;
 
-my $this_script = $0;
+my $this_script = 'src/backend/utils/mb/Unicode/UCS_to_EUC_KR.pl';
 
 # Load the source file.
 
diff --git a/src/backend/utils/mb/Unicode/UCS_to_EUC_TW.pl b/src/backend/utils/mb/Unicode/UCS_to_EUC_TW.pl
index 0bfcbd5..61013e3 100755
--- a/src/backend/utils/mb/Unicode/UCS_to_EUC_TW.pl
+++ b/src/backend/utils/mb/Unicode/UCS_to_EUC_TW.pl
@@ -20,7 +20,7 @@
 use strict;
 use convutils;
 
-my $this_script = $0;
+my $this_script = 'src/backend/utils/mb/Unicode/UCS_to_EUC_TW.pl';
 
 my $mapping = &read_source("CNS11643.TXT");
 
diff --git a/src/backend/utils/mb/Unicode/UCS_to_GB18030.pl b/src/backend/utils/mb/Unicode/UCS_to_GB18030.pl
index 4469cc7..e9f816c 100755
--- a/src/backend/utils/mb/Unicode/UCS_to_GB18030.pl
+++ b/src/backend/utils/mb/Unicode/UCS_to_GB18030.pl
@@ -16,7 +16,7 @@
 use strict;
 use convutils;
 
-my $this_script = $0;
+my $this_script = 'src/backend/utils/mb/Unicode/UCS_to_GB18030.pl';
 
 # Read the input
 
diff --git a/src/backend/utils/mb/Unicode/UCS_to_JOHAB.pl b/src/backend/utils/mb/Unicode/UCS_to_JOHAB.pl
index 7c6f526..be10d52 100755
--- a/src/backend/utils

Re: [HACKERS] REINDEX CONCURRENTLY 2.0

2017-12-20 Thread Alvaro Herrera
Michael Paquier wrote:

> Well, the idea is really to get rid of that as there are already
> facilities of this kind for CREATE TABLE LIKE in the parser and ALTER
> TABLE when rewriting a relation. It is not really attractive to have a
> 3rd method in the backend code to do the same kind of things, for a
> method that is even harder to maintain than the other two.

I dislike the backend code that uses SPI and manufactures nodes to
re-create indexes.  IMO we should get rid of it.  Let's not call it
"facilities", but rather "grotty hacks".

I think before suggesting to add even more code to perpetuate that idea,
we should think about going in the other direction.  I have not tried to
write the code, but it should be possible to have an intermediate
function called by ProcessUtility* which transforms the IndexStmt into
an internal representation, then calls DefineIndex.   This way, all this
code that wants to create indexes for backend-internal reasons can
create the internal representation directly then call DefineIndex,
instead of the horrible hacks they use today creating parse nodes by
hand.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: AS OF queries

2017-12-20 Thread Craig Ringer
On 21 December 2017 at 00:17, Peter Eisentraut <
peter.eisentr...@2ndquadrant.com> wrote:

> On 12/20/17 10:29, Tom Lane wrote:
> > Please say that's just an Oracle-ism and not SQL standard, because it's
> > formally ambiguous.  This is required to work by spec:
> >
> > regression=# select x as of from (values(1)) t(x);
> >  of
> > 
> >   1
> > (1 row)
> >
> > so it's not possible for us ever to support an expression that includes
> > top-level "AS OF" (or, pretty much, "AS anything") without some rather
> > enormous pushups.
>
> The SQL standard syntax appears to be something like
>
> "tablename" [ AS OF SYSTEM TIME 'something' ] [ [ AS ] "alias" ]
>
> That's not going to be fun to parse.


Well, the SQL committee seems to specialise in parser torture.

Window functions, anybody?

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] REINDEX CONCURRENTLY 2.0

2017-12-20 Thread Alvaro Herrera
Andreas Karlsson wrote:
> Here is a rebased version of the patch.

Is anybody working on rebasing this patch?

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] replace GrantObjectType with ObjectType

2017-12-20 Thread Michael Paquier
On Thu, Dec 21, 2017 at 1:19 AM, Peter Eisentraut
 wrote:
> On 12/20/17 10:37, Alvaro Herrera wrote:
>> I think Michael's point is that instead of a "default:" clause, this
>> switch should list all the known values of the enum and throw an
>> "unsupported object type" error for them.  So whenever somebody adds a
>> new object type, the compiler will complain that this switch doesn't
>> handle it and the developer will have to think about this function.

Thanks Álvaro, that's exactly the point I am coming at. The previous
version of the patch was breaking an existing flow.

> Right.  I was actually looking at a later patch that I had not sent in
> yet that had already addressed that.  So here it is.

Thanks for the new version. I have looked at 0001, and this looks
acceptable for me in this shape.

In the set of things that could be improved, but I am of course not
asking about those being addressed in this patch... Things could be
made more consistent for ExecGrantStmt_oids, objectNamesToOids,
objectsInSchemaToOids, SetDefaultACL and
ExecAlterDefaultPrivilegesStmt for the switch/case handlings.

I have not looked at 0002 in details.
-- 
Michael



Re: User defined data types in Logical Replication

2017-12-20 Thread Masahiko Sawada
On Wed, Dec 20, 2017 at 5:39 PM, Huong Dangminh
 wrote:
> Hi Sawada-san,
>
>> Thank you for quick response. The changes look good to me. But I wonder
>> if the following changes needs some comments to describe what each checks
>> does for.
>>
>> -if (errarg->attnum < 0)
>> +if (errarg->local_attnum <0)
>> +return;
>> +
>> +rel = errarg->rel;
>> +remote_attnum = rel->attrmap[errarg->local_attnum];
>> +
>> +if (remote_attnum < 0)
>>  return;
>
> Thanks, I have added some comments as you mentioned.
>

Thank you for updating the patch.

-   if (errarg->attnum < 0)
+   /* Check case of slot_store_error_callback() is called before
+* errarg.local_attnum is set. */
+   if (errarg->local_attnum <0)

This comment style isn't preferred by PostgreSQL code. Please refer to
https://www.postgresql.org/docs/current/static/source-format.html

--
$ git diff --check
src/backend/replication/logical/worker.c:291: trailing whitespace.
+   /* Check case of slot_store_error_callback() is called before

There is an extra white space in the patch.

I'm inclined to think SlotErrCallbackArg can have remote_attnum and
pass it to the callback function. That way, we don't need to handle the
case where local_attnum is not set yet.
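
For illustration, the shape I have in mind is something like this (the type
and field names here are assumptions for the sketch, not the actual patch):

/*
 * Sketch: carry the already-resolved remote attribute number in the
 * callback argument, so slot_store_error_callback() never has to look
 * at local_attnum or worry about whether it has been set.
 */
typedef struct SlotErrCallbackArg
{
	LogicalRepRelMapEntry *rel;		/* remote/local relation mapping */
	int			remote_attnum;		/* attnum on the publisher, -1 if unset */
} SlotErrCallbackArg;
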
Attached is a new version of the patch incorporating the above. Please review it.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


fix_slot_store_error_callback_v11.patch
Description: Binary data


Re: Bitmap table scan cost per page formula

2017-12-20 Thread Tom Lane
Robert Haas  writes:
> On Wed, Dec 20, 2017 at 4:20 PM, Jeff Janes  wrote:
>> It is not obvious to me that the parabola is wrong.  I've certainly seen
>> cases where reading every 2nd or 3rd block (either stochastically, or
>> modulus) actually does take longer than reading every block, because it
>> defeats read-ahead.  But it depends on a lot on your kernel version and
>> your kernel settings and your file system and probably other things as well.

> Well, that's an interesting point, too.

Yes.  I think the proposed new curve is flat wrong: for areas near the
middle of the graph, the estimated I/O cost for a bitmap scan should
*substantially* exceed the cost of a seqscan, because in that regime
you're reading a lot of pages but you are probably failing to activate
the kernel's readahead heuristics.  This is especially true if you
consider that random_page_cost >> seq_page_cost as the Greenplum folk
seem to want.

The parabola is probably wrong in detail --- its behavior as we approach
reading all of the pages ought to be more asymptotic, seems like.
I suppose that the reason it appears to go below the seqscan cost at the
right is that even the rightmost end of the curve doesn't reflect reading
all of the *tuples*, just one tuple per page, so that there's some CPU
savings from not inspecting all the tuples.  Once you approach needing to
read all the tuples, the curve ought to go back higher than seqscan cost.

The point upthread about perhaps wanting to consider index correlation is
interesting too.  I think that correlation as currently defined is the
wrong stat for that purpose: what we are interested in for a bitmap scan
is clumping, not ordering.  If reading 10% of the index is likely to
result in accessing a tightly packed 10% of the table, rather than
roughly-one-page-in-10, then we ideally ought to be costing that as
sequential access not random access.  It's true that perfect correlation
would guarantee good clumping as well as good ordering, but it seems like
that's too strong an assumption.  An index with almost zero correlation
might still have pretty good clumping behavior.

regards, tom lane



Re: Tracking of page changes for backup purposes. PTRACK [POC]

2017-12-20 Thread Michael Paquier
On Thu, Dec 21, 2017 at 7:35 AM, Robert Haas  wrote:
> On Wed, Dec 20, 2017 at 3:45 PM, Tomas Vondra
>  wrote:
>>> Isn't more effective hold this info in Postgres than in backup sw?
>>> Then any backup sw can use this implementation.
>>
>> I don't think it means it can't be implemented in Postgres, but does it
>> need to be done in backend?
>>
>> For example, it might be a command-line tool similar to pg_waldump,
>> which processes WAL segments and outputs list of modified blocks,
>> possibly with the matching LSN. Or perhaps something like pg_receivewal,
>> doing that in streaming mode.
>>
>> This part of the solution can still be part of PostgreSQL codebase, and
>> the rest has to be part of backup solution anyway.
>
> I agree with all of that.

+1. This summarizes a bunch of concerns about all kinds of backend
implementations proposed. Scanning for a list of blocks modified via
streaming gives more availability, but knowing that you will need to
switch to a new segment anyway when finishing a backup, does it really
matter? Doing it once a segment has finished would be cheap enough,
and you can even do it in parallel with a range of segments.

Also, since 9.4 and the introduction of the new WAL API to track
modified blocks, you don't need to know about the record types to know
which blocks are being changed. Here is an example of a tool I hacked up
in a couple of hours that actually does what you are looking for, aka
a scanner of the blocks modified per record for a given WAL segment
using xlogreader.c:
https://github.com/michaelpq/pg_plugins/tree/master/pg_wal_blocks

You could just use that and shape the data in the way you want and you
would be good to go.
-- 
Michael



Re: Basebackups reported as idle

2017-12-20 Thread Michael Paquier
On Wed, Dec 20, 2017 at 9:02 PM, Magnus Hagander  wrote:
> On Wed, Dec 20, 2017 at 12:50 PM, Michael Paquier
>  wrote:
> Yes. Of course. I can't read. That's the same as the notice below about it
> not returning false -- I managed to miss the extra if() there, and thought
> it always exited with ERROR.

I think that the call to pgstat_report_activity in WalSndLoop() should
be kept as well. There is a small gap between the moment the process
is started and the first replication command is run.

>> +   /* Report to pgstat that this process is running */
>> +   pgstat_report_activity(STATE_RUNNING, NULL);
>> Bonus points if cmd_string is used instead of string? This way, you
>> can know what is the replication command running ;)
>
> Do we want that though? That would be a compat break at least, wouldn't it?

Perhaps not, I found the idea funky but you actually don't want to
show a string in exec_replication_command for a T_SQLCmd node. That's
not complicated to check either. So let's discard such a thing for
now.
-- 
Michael



Re: Bitmap table scan cost per page formula

2017-12-20 Thread Haisheng Yuan
Robert, you are right. The new formula serves Greenplum better than the
original formula, because our default random page cost is much higher than
in Postgres. We don't want the random cost to always dominate the final
cost per page.

~ ~ ~
Haisheng Yuan

On Wed, Dec 20, 2017 at 12:25 PM, Robert Haas  wrote:

> On Tue, Dec 19, 2017 at 10:25 PM, Justin Pryzby 
> wrote:
> > In this old thread: https://www.postgresql.org/
> message-id/CAGTBQpZ%2BauG%2BKhcLghvTecm4-cGGgL8vZb5uA3%
> 3D47K7kf9RgJw%40mail.gmail.com
> > ..Claudio Freire  wrote:
> >> Correct me if I'm wrong, but this looks like the planner not
> >> accounting for correlation when using bitmap heap scans.
> >>
> >> Checking the source, it really doesn't.
> >
> > ..which I think is basically right: the formula does distinguish between
> the
> > cases of small or large fraction of pages, but doesn't use correlation.
> Our
> > issue in that case seems to be mostly a failure of cost_index to account
> for
> > fine-scale deviations from large-scale correlation; but, if cost_bitmap
> > accounted for our high correlation metric (>0.99), it might've helped
> our case.
>
> I think this is a different and much harder problem than the one
> Haisheng Yuan is attempting to fix.  His data shows that the cost
> curve has a nonsensical shape even when the assumption that pages are
> spread uniformly is correct.  That should surely be fixed.  Now, being
> able to figure out whether the assumption of uniform spread is correct
> on a particular table would be nice too, but it seems like a much
> harder problem.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>


Re: Letting plpgsql in on the fun with the new expression eval stuff

2017-12-20 Thread Tom Lane
I wrote:
> * Redesign the API for the ParamListInfo paramFetch hook so that the
> ParamExternData array can be entirely virtual.  Typical access to
> the info about a PARAM_EXTERN Param now looks like
> 
> if (paramInfo->paramFetch != NULL)
> prm = paramInfo->paramFetch(paramInfo, paramId, ...);
> else
> prm = &paramInfo->params[paramId - 1];
> 
> so that if paramFetch is defined, the core code no longer touches
> the params[] array at all, and it doesn't have to be there.

I forgot to mention that I dithered about changing the params field
from a variable-length array to a pointer, ie,

-   ParamExternData params[FLEXIBLE_ARRAY_MEMBER];
+   ParamExternData *params;

Then we could set the pointer to NULL in cases where no physical
array is provided, which would be a good thing in terms of helping
to catch code that hasn't been updated to the new convention.
However, this would force less-than-trivial changes in every place
that creates a ParamListInfo.  For instance, copyParamList might
change from

size = offsetof(ParamListInfoData, params) +
from->numParams * sizeof(ParamExternData);

retval = (ParamListInfo) palloc(size);
... fill retval ...

to

size = MAXALIGN(sizeof(ParamListInfoData)) +
from->numParams * sizeof(ParamExternData);

retval = (ParamListInfo) palloc(size);
retval->params = (ParamExternData *)
((char *) retval + MAXALIGN(sizeof(ParamListInfoData)));
... fill rest of retval ...

That seemed like a pain in the rear, and easy to get wrong
(although it could be a lot simpler if you didn't mind doing
two pallocs instead of one).
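
For illustration, the simpler two-palloc variant would look roughly like
this (a sketch only, assuming params has become a plain pointer member):

retval = (ParamListInfo) palloc(sizeof(ParamListInfoData));
retval->params = (ParamExternData *)
    palloc(from->numParams * sizeof(ParamExternData));
... fill rest of retval ...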

Now there aren't that many places in the core code that do this,
so it wouldn't be very hard to fix them, but I'm concerned about
the possible impact on extension modules.  Creating param lists
seems like something that a lot of things would have code for.

Anyway, I left it as-is, but I'm willing to make the change if
people feel the other way is better.

regards, tom lane



Re: vacuum vs heap_update_tuple() and multixactids

2017-12-20 Thread Robert Haas
On Wed, Dec 20, 2017 at 9:05 AM, Andres Freund  wrote:
> Indeed. I kinda wonder whether we hackishly solve this by simply
> returning true for all pids if it's waiting for a cleanup lock. That's
> not actually that far from the truth... The big problem with that I see
> is very short waits that resolve themselves causing issues - but given
> that waiting for cleanup locks is really rare, that might not be too
> bad.

I'm wondering if that might cause random, spurious failures.  We go on
to the next step if we detect that the current step is waiting... and
the decisions we make about whether to do that cause differences in
the output.

>> 
>>
>> What if we add a hook to vacuum that lets you invoke some arbitrary C
>> code after each block, and then write a test module that uses that
>> hook to acquire and release an advisory lock at that point?  Then you
>> could do:
>
> I guess that'd work, albeit a bit invasive and specific to this
> test. Trying to come up with a concept that'll be generizable longer
> term would be neat.

True, but I'm not sure there's much hope of that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: domain cast in parameterized vs. non-parameterized query

2017-12-20 Thread Tom Lane
I wrote:
> You might consider whether you can write 'spa-000'::uid explicitly in your
> query; that results in immediate application of the domain coercion, so
> that the planner no longer sees that as a run-time operation it has to
> avoid.

Hm, scratch that --- experimentation shows that the parser still produces
a CoerceToDomain node in that case, not a literal of the domain type.

regression=# create domain foo as text;
CREATE DOMAIN
regression=# explain verbose select 'x'::foo;   
QUERY PLAN 
---
 Result  (cost=0.00..0.01 rows=1 width=32)
   Output: ('x'::text)::foo
(2 rows)

You could force the issue with an immutable function:

regression=# create function forcefoo(text) returns foo as
regression-# 'begin return $1::foo; end' language plpgsql immutable;
CREATE FUNCTION
regression=# explain verbose select forcefoo('x');
QUERY PLAN 
---
 Result  (cost=0.00..0.01 rows=1 width=32)
   Output: 'x'::foo
(2 rows)

Marking this function as immutable is sort of a lie, because it
is effectively telling the planner that you don't expect any
failure from pre-evaluation of the function.  But it'd get the
job done, and in most situations there's no practical difference
because any failure would have happened anyway at runtime.

regards, tom lane



Re: Tracking of page changes for backup purposes. PTRACK [POC]

2017-12-20 Thread Robert Haas
On Wed, Dec 20, 2017 at 3:45 PM, Tomas Vondra
 wrote:
>> Isn't more effective hold this info in Postgres than in backup sw?
>> Then any backup sw can use this implementation.
>
> I don't think it means it can't be implemented in Postgres, but does it
> need to be done in backend?
>
> For example, it might be a command-line tool similar to pg_waldump,
> which processes WAL segments and outputs list of modified blocks,
> possibly with the matching LSN. Or perhaps something like pg_receivewal,
> doing that in streaming mode.
>
> This part of the solution can still be part of PostgreSQL codebase, and
> the rest has to be part of backup solution anyway.

I agree with all of that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Shouldn't execParallel.c null-terminate query_string in the parallel DSM?

2017-12-20 Thread Robert Haas
On Wed, Dec 20, 2017 at 2:03 AM, Rafia Sabih
 wrote:
> On Wed, Dec 20, 2017 at 7:58 AM, Thomas Munro
>  wrote:
>> I just saw some trailing garbage in my log file emanating from a
>> parallel worker when my query happened to be a BUFFERALIGNed length
>> (in this case 64 characters).  Did commit 4c728f382970 forget to make
>> sure the null terminator is copied?  See attached proposed fix.
>>
> Yes, I missed that.
> Your patch looks good to me, thanks.

Sigh.  I missed that too.  Committed and back-patched to v10.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: domain cast in parameterized vs. non-parameterized query

2017-12-20 Thread David Kamholz
>
> You might consider whether you can write 'spa-000'::uid explicitly in your
> query; that results in immediate application of the domain coercion, so
> that the planner no longer sees that as a run-time operation it has to
> avoid.
>

I should have mentioned that I tried an explicit cast and found that
'spa-000' and 'spa-000'::uid produce identical results. As far as I can
tell, there is *no* way to get the planner to constant-fold in this case
without using prepared statements.


> It's tempting to wonder whether we could somehow constant-fold
> CoerceToDomain, at least in the estimation case, but I'm afraid that
> would lead to domain constraint failures that would not necessarily occur
> at runtime.  Or we could skip the domain check for estimation purposes,
> but then we're possibly feeding a value that fails the domain check to a
> function that might not work right under such conditions.  So on the
> whole I'm afraid to monkey with that decision.


OK, I can see how that makes sense. But shouldn't an explicit cast still
work?


Re: domain cast in parameterized vs. non-parameterized query

2017-12-20 Thread Tom Lane
Robert Haas  writes:
> On Wed, Dec 20, 2017 at 1:03 AM, David Kamholz  wrote:
>> Note that in the above plan, 'spa-000' is cast to text before it's cast to 
>> uid. This
>> is apparently connected to why postgresql can't choose the better plan.

> It's slightly hard for me to follow what's going on with the
> auto_explain output you provided because you didn't specify what SQL
> you ran to produce that output, but I suspect that's not the case.

I think what's happening in the first case is that $1 is deemed to be
already of type uid, and there's a value available for it (viz,
'spa-000'::uid), so the planner constant-folds $1 to 'spa-000'::uid,
which is why we see the latter not the former in the plan output.
But then, since uid_langvar() is marked stable, we're allowed to
pre-evaluate it to produce an estimated value of the function result,
and the selectivity estimate for (langvar = uid_langvar('spa-000'::uid))
is done using that estimated value.  This evidently leads us to conclude,
correctly, that very few rows will fail that filter condition; so we end
up picking a plan that relies primarily on the condition on id.

In the second case, what we're dealing with is evidently
('spa-000'::text)::uid), which has to be read carefully: that's a constant
of type text with a run-time cast to the domain.  The planner is unwilling
to try to constant-fold CoerceToDomain, even speculatively, so it's left
with no way to get an estimated value for the uid_langvar() result,
leading to a default estimate for the selectivity of the langvar =
uid_langvar() condition.  That default estimate is way off --- far too
optimistic --- so we make a poor choice of plan as a result.

You might consider whether you can write 'spa-000'::uid explicitly in your
query; that results in immediate application of the domain coercion, so
that the planner no longer sees that as a run-time operation it has to
avoid.

It's tempting to wonder whether we could somehow constant-fold
CoerceToDomain, at least in the estimation case, but I'm afraid that
would lead to domain constraint failures that would not necessarily occur
at runtime.  Or we could skip the domain check for estimation purposes,
but then we're possibly feeding a value that fails the domain check to a
function that might not work right under such conditions.  So on the
whole I'm afraid to monkey with that decision.

regards, tom lane



Re: Bitmap table scan cost per page formula

2017-12-20 Thread Robert Haas
On Wed, Dec 20, 2017 at 4:20 PM, Jeff Janes  wrote:
>
> It is not obvious to me that the parabola is wrong.  I've certainly seen
> cases where reading every 2nd or 3rd block (either stochastically, or
> modulus) actually does take longer than reading every block, because it
> defeats read-ahead.  But it depends a lot on your kernel version and
> your kernel settings and your file system and probably other things as well.
>

Well, that's an interesting point, too.  Maybe we need another graph that
also shows the actual runtime of a bitmap scan and a sequential scan.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: domain cast in parameterized vs. non-parameterized query

2017-12-20 Thread David Kamholz
>
> That's not too surprising.  PostgreSQL can't choose a plan based on
> the parameter value when it doesn't know the parameter value


I thought that since 9.2, postgresql could "generate plans based on the
parameter value even when using prepared statements" (paraphrase of 9.2
release notes). I'm running version 10. That's why I was surprised to find
the different plans. If you're right that taking the value into account
causes the planner to guess wrong, I agree that's a separate issue -- but
is that really what's going on?


> > Note that in the above plan, 'spa-000' is cast to text before it's cast
> to uid. This
> > is apparently connected to why postgresql can't choose the better plan.
>
> It's slightly hard for me to follow what's going on with the
> auto_explain output you provided because you didn't specify what SQL
> you ran to produce that output, but I suspect that's not the case.


The queries are included in the output (after "Query Text:"), which is why I
didn't include them separately.


> I think the deeper problem here may be that the planner has no idea
> what value uid_langvar() will return, so its selectivity estimates are
> probably fairly bogus.  If you looked up that id first and then
> searched for the resulting value, it might do better.
>

I was under the impression, possibly incorrect, that the planner would
sometimes or always call a stable/immutable function in the planning stage
in order to consider its return value for planning. RhodiumToad on
#postgresql mentioned that functions returning constant values will be
folded in. He thought the planner should call uid_langvar() even though it
wasn't constant. Changing it from stable to immutable makes no difference,
and neither does reducing the cost to 10. Looking up the id first is an
obvious option but I thought there was a way to do this within a single
query. I guess not?
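
For what it's worth, the look-up-the-id-first approach would be something like
the following sketch (names assumed from this thread; the returned value is
just a placeholder):

SELECT uid_langvar('spa-000'::uid);      -- suppose this returns 42
SELECT * FROM expr WHERE langvar = 42;   -- the planner now sees a plain constant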

In any case, I still don't understand why prepared vs. not makes a
difference.

Dave


Re: domain cast in parameterized vs. non-parameterized query

2017-12-20 Thread Robert Haas
On Wed, Dec 20, 2017 at 1:03 AM, David Kamholz  wrote:
> I've recently come across a query that produces different plans depending on
> whether it's parameterized or not.

That's not too surprising.  PostgreSQL can't choose a plan based on
the parameter value when it doesn't know the parameter value, so the
only way it could get the same plan in both cases is if it ignored the
parameter value when it does know it, which would result in a lot of
really terrible plans.  Knowing the parameter value tends to improve
the plan considerably, although apparently not in this case.  Planning
is an inexact science, and estimates and actual numbers can vary,
so it can happen that the generic plan contains no bad estimate and
the parameter-specific plan does have a bad estimate.

> Note that in the above plan, 'spa-000' is cast to text before it's cast to 
> uid. This
> is apparently connected to why postgresql can't choose the better plan.

It's slightly hard for me to follow what's going on with the
auto_explain output you provided because you didn't specify what SQL
you ran to produce that output, but I suspect that's not the case.  I
think the planner just has to guess whether it should scan the index
on exprx, taking advantage of the fact that the ordering of that index
matches the desired output ordering of the query, and hoping that
the join to expr will produce output rows fairly quickly so that the
whole nested loop will not have to be executed; or whether it should
instead use the index on expr, which lets it throw away all of the
rows where langvar doesn't have the right value to be interesting.  In
the first strategy, we've got to probe expr for every value found in
exprx and some of the rows we find will have a langvar that causes us
to ignore them; the second strategy lets us immediately focus in on
the rows with the right langvar but requires a sort afterward.

I think the deeper problem here may be that the planner has no idea
what value uid_langvar() will return, so its selectivity estimates are
probably fairly bogus.  If you looked up that id first and then
searched for the resulting value, it might do better.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Bitmap table scan cost per page formula

2017-12-20 Thread Jeff Janes
On Wed, Dec 20, 2017 at 12:29 PM, Robert Haas  wrote:

> On Tue, Dec 19, 2017 at 2:55 PM, Haisheng Yuan  wrote:
>>
>> Below is the graph (credit to Heikki) that plots the total estimated cost
>> of a bitmap heap scan, where table size is 1 pages, and
>> random_page_cost=10 and seq_page_cost=1. X axis is the number of pages
>> fetche. I.e. on the left, no pages are fetched, and on the right end, at
>> 1, all pages are fetched. The original formula is in black, the new
>> formula in red, and the horizontal line, in blue, shows the cost of a Seq
>> Scan.
>> [image: Inline image 3]
>>
>>
>> Thoughts? Any better ideas?
>>
>
> The parabola-shape curve we're getting at present is clearly wrong;
> approaching a horizontal line as an asymptote seems much better.  However,
> shouldn't the red line level off at some level *above* the blue line rather
> than *at* the blue line? Reading the index pages isn't free, so a
> sequential scan should be preferred when we're going to read the whole
> table anyway.
>

It is not obvious to me that the parabola is wrong.  I've certainly seen
cases where reading every 2nd or 3rd block (either stochastically, or
modulus) actually does take longer than reading every block, because it
defeats read-ahead.  But it depends a lot on your kernel version and
your kernel settings and your file system and probably other things as well.

Cheers,

Jeff


Re: [HACKERS] parallel.c oblivion of worker-startup failures

2017-12-20 Thread Robert Haas
On Tue, Dec 19, 2017 at 11:28 PM, Amit Kapila  wrote:
> That is not the main point I am bothered about.  I am concerned that
> the patch proposed by you can lead to hang if there is a crash (abrupt
> failure like proc_exit(1)) after attaching to the error queue. This is
> explained in my email up thread [1].  I think we can fix it by trying
> to read the queues as mentioned by you, but was not sure if that is
> the best way to deal with it, so proposed an alternate solution.

OK.  My core point is that I really, really don't want Gather or
Gather Merge or parallel CREATE INDEX to have to worry about these
cases; I believe strongly that it is best if these cases cause an
error at the ParallelContext layer.   I read your email as suggesting
the opposite, but maybe I misunderstood you.

One thing I thought about is introducing a new state
BGWH_STARTUP_FAILED, as distinguished from BGWH_STOPPED.  However, it
seems problematic.  Once the worker is unregistered we wouldn't know
which one to return, especially if the slot has been reused.
Furthermore, it doesn't help in the case where the worker starts and
immediately exits without attaching to the DSM.  So it doesn't seem to
me that we can fix the problem this way.

It seems to me that we instead have to solve the problem in the
ParallelContext layer, which can understand the state of the worker by
comparing the state of the background worker to what we can find out
via the DSM.  The background worker layer knows whether or not the
worker is running and can signal us when it changes state, but it
can't know whether the worker exited cleanly or uncleanly.  To know
that, we have to look at things like whether it sent us a Terminate
message before it stopped.

While I think that the clean-vs-unclean distinction has to be made at
the ParallelContext layer, it doesn't necessarily have to happen by
inserting an error queue.  I think that may be a convenient way; for
example, we could plug the hole you mentioned before by attaching an
on_dsm_detach callback that signals the leader with
SendProcSignal(..., PROCSIG_PARALLEL_MESSAGE, ...).  At that point,
all of the information required for the leader to know that the
failure has occurred is present in shared memory; it is only that the
leader may not have anything to trigger it to look at the relevant
bits of information, and this would fix that.  But there are probably
other ways the information could also be communicated as well.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: AS OF queries

2017-12-20 Thread Alvaro Hernandez



On 20/12/17 14:48, Konstantin Knizhnik wrote:



On 20.12.2017 16:12, Laurenz Albe wrote:

Konstantin Knizhnik wrote:

I wonder if Postgres community is interested in supporting time travel
queries in PostgreSQL (something like AS OF queries in Oracle:
https://docs.oracle.com/cd/B14117_01/appdev.101/b10795/adfns_fl.htm).
As far as I know something similar is now developed for MariaDB.

I think that would be a good thing to have that could make
the DBA's work easier - all the requests to restore a table
to the state from an hour ago.


Please notice that it is necessary to configure Postgres in the proper way
in order to be able to perform time travel.


    This makes sense. BTW, I believe this feature would be an amazing 
addition to PostgreSQL.



If you do not disable autovacuum, then old versions will just be
cleaned up.
If transaction commit timestamps are not tracked, then it is not
possible to locate the required timeline.


So the DBA should decide in advance whether this feature is
needed or not.
It is not a proper instrument for restoring/auditing an existing database
which was not configured to keep all versions.


Maybe it is better to add a special configuration parameter for this
feature which would implicitly toggle the autovacuum and
track_commit_timestamp parameters.


    Downthread a "moving xid horizon" is proposed. I believe this is 
not too user friendly. I'd rather use a timestamp horizon (e.g. "up to 2 
days ago"). Given that the commit timestamp is tracked, I don't think 
this is an issue. This is the same as the undo_retention in Oracle, 
which is expressed in seconds.
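
To make that concrete, the kind of setup being described would look roughly
like this. The first two settings are existing GUCs; the retention knob and
the query syntax are purely hypothetical and shown only for illustration:

ALTER SYSTEM SET track_commit_timestamp = on;  -- existing GUC, needs a restart
ALTER SYSTEM SET autovacuum = off;             -- existing GUC, keeps old versions (bloat!)
-- hypothetical retention setting along the lines discussed here:
--   time_travel_retention = '2 days'
-- and a hypothetical query against it:
--   SELECT * FROM orders AS OF TIMESTAMP now() - interval '1 hour';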





The obvious drawbacks of keeping all versions are
1. Increased database size.
2. Decreased query execution speed, because queries need to traverse a lot
of versions that are no longer visible.


    In other words, what is nowadays called "bloat". I have seen a lot of
it in the field; not everybody tunes vacuum to keep up to date. So I
don't expect this feature to be too expensive for many, while at the
same time it would be an awesome addition: no need to fire up a separate
server, exercise PITR, and then find ways to move the old data around.



    Regards,

    Álvaro


--

Alvaro Hernandez


---
OnGres



Re: Tracking of page changes for backup purposes. PTRACK [POC]

2017-12-20 Thread Tomas Vondra

On 12/20/2017 09:29 PM, Pavel Stehule wrote:
> 
> 
> 2017-12-20 21:18 GMT+01:00 Robert Haas  >:
> 
> On Wed, Dec 20, 2017 at 3:15 PM, Pavel Stehule
> mailto:pavel.steh...@gmail.com>> wrote:
> >> > So I'm somewhat hesitant to proclaim option 5 as the clear winner, 
> here.
> >>
> >> I agree.  I think (4) is better.
> >
> > Can it depend on the load? For smaller, intensively updated databases
> > (5) can be optimal; for large, less frequently updated databases (4) can
> > be better.
> 
> It seems to me that the difference is that (4) tracks which pages have
> changed in the background, and (5) does it in the foreground.  Why
> would we want the latter?
> 
> 
> Isn't it more effective to hold this info in Postgres than in the backup
> software? Then any backup software can use this implementation.
> 

I don't think it means it can't be implemented in Postgres, but does it
need to be done in backend?

For example, it might be a command-line tool similar to pg_waldump,
which processes WAL segments and outputs list of modified blocks,
possibly with the matching LSN. Or perhaps something like pg_receivewal,
doing that in streaming mode.

This part of the solution can still be part of PostgreSQL codebase, and
the rest has to be part of backup solution anyway.


regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Letting plpgsql in on the fun with the new expression eval stuff

2017-12-20 Thread Tom Lane
I wrote:
> Will send a patch in a bit.  I need to write an explanation of what all
> I changed.

OK then.  What the attached patch does is:

* Create a new step type EEOP_PARAM_CALLBACK (if anyone has a better
naming idea, I'm receptive) and add the infrastructure needed for
add-on modules to generate that.  As discussed, the best control
mechanism for that, at least for the immediate use, seems to be to
add another hook function to ParamListInfo, which will be called by
ExecInitExpr if it's supplied and a PARAM_EXTERN Param is found.
For stand-alone expressions, we add a new entry point to allow the
ParamListInfo to be specified directly rather than found in the
parent plan node's EState.

* In passing, remove the "parent" pointer from the arguments to
ExecInitExprRec and related functions, instead storing that pointer
in a transient field in ExprState.  There are now several transient
fields there.  We discussed splitting them out to a different struct,
but that seems like material for a different patch.  I had to do
something here, though, since otherwise I'd have had to pass down
the stand-alone ParamListInfo as another parameter :-(.

* Redesign the API for the ParamListInfo paramFetch hook so that the
ParamExternData array can be entirely virtual.  Typical access to
the info about a PARAM_EXTERN Param now looks like

if (paramInfo->paramFetch != NULL)
prm = paramInfo->paramFetch(paramInfo, paramId, ...);
else
prm = &paramInfo->params[paramId - 1];

so that if paramFetch is defined, the core code no longer touches
the params[] array at all, and it doesn't have to be there.
(It still can be there, if paramFetch wants to use it; but the sole
existing user of the hook, plpgsql, doesn't want to.)

* This also lets us get rid of ParamListInfo.paramMask, instead
leaving it to the paramFetch hook to decide which param IDs should
be accessible or not.  plpgsql_param_fetch was already doing the
identical masking check, so having callers do it too seemed quite
redundant and overcomplex.

* While I was at it, I added a "speculative" flag to paramFetch
that the planner can specify as TRUE to avoid unwanted failures.
This solves an ancient problem for plpgsql that it couldn't provide
values of non-DTYPE_VAR variables to the planner for fear of
triggering premature "record not assigned yet" or "field not found"
errors during planning.

* Rework plpgsql to get rid of the need for "unshared" parameter lists,
by dint of turning the single ParamListInfo per estate into a nearly
read-only data structure that doesn't instantiate any per-variable data.
Instead, the paramFetch hook controls access to per-variable data and can
make the right decisions on the fly, replacing the cases that we used to
need multiple ParamListInfos for.  This might perhaps have been a
performance loss on its own, but by using a paramCompile hook we can
bypass plpgsql_param_fetch entirely during normal query execution.
(It's now only called when, eg, we copy the ParamListInfo into a cursor
portal.  copyParamList() or SerializeParamList() effectively instantiate
the virtual parameter array as a simple physical array without a
paramFetch hook, which is what we want in those cases.)  This allows
reverting most of commit 6c82d8d1f, though I kept the cosmetic
code-consolidation aspects of that (eg the assign_simple_var function).

* During compilation of a PARAM_EXTERN expression node, predetermine
as much as we can to select one of several execution routines.


I've done light performance testing on this, mainly comparing the
runtimes for the test functions shown in the second attachment.
I see overall gains that are modest but reproducible (in the range
of a couple percent) for the "row" (named composite type) cases,
and more significant -- around 10% -- for the "record" cases.
I attribute the difference to the fact that the "row" cases use DTYPE_VAR
variables for the fields, which were already pretty well optimized,
whereas the "record" cases use DTYPE_RECFIELD variables which invoked
all that overhead we discussed.  The fact that this isn't losing
on DTYPE_VAR convinces me that it should be a win in all cases.

It'd theoretically be possible to split this into three patches,
one to change the stuff around ExecInitExpr, one to rejigger the
ParamListInfo API, and one to get rid of unshared param lists in
plpgsql.  However, that would require writing some throwaway code
in plpgsql, so I don't especially want to bother.

I have another patch in the pipeline that interacts with this,
so I'd kind of like to get this committed sooner rather than later.

regards, tom lane

diff --git a/src/backend/commands/prepare.c b/src/backend/commands/prepare.c
index be7222f..6e90912 100644
*** a/src/backend/commands/prepare.c
--- b/src/backend/commands/prepare.c
*** EvaluateParams(PreparedStatement *pstmt,
*** 399,408 
  	/* we have static list of params, so no hooks needed */
  	paramLI->par

Re: Tracking of page changes for backup purposes. PTRACK [POC]

2017-12-20 Thread Pavel Stehule
2017-12-20 21:18 GMT+01:00 Robert Haas :

> On Wed, Dec 20, 2017 at 3:15 PM, Pavel Stehule 
> wrote:
> >> > So I'm somewhat hesitant to proclaim option 5 as the clear winner,
> here.
> >>
> >> I agree.  I think (4) is better.
> >
> > Can it depend on the load? For smaller, intensively updated databases
> > (5) can be optimal; for large, less frequently updated databases (4) can
> > be better.
>
> It seems to me that the difference is that (4) tracks which pages have
> changed in the background, and (5) does it in the foreground.  Why
> would we want the latter?
>

Isn't it more effective to hold this info in Postgres than in the backup
software? Then any backup software can use this implementation.


> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>


Re: Bitmap table scan cost per page formula

2017-12-20 Thread Robert Haas
On Tue, Dec 19, 2017 at 2:55 PM, Haisheng Yuan  wrote:
>
> Below is the graph (credit to Heikki) that plots the total estimated cost
> of a bitmap heap scan, where table size is 1 pages, and
> random_page_cost=10 and seq_page_cost=1. X axis is the number of pages
> fetche. I.e. on the left, no pages are fetched, and on the right end, at
> 1, all pages are fetched. The original formula is in black, the new
> formula in red, and the horizontal line, in blue, shows the cost of a Seq
> Scan.
> [image: Inline image 3]
>
>
> Thoughts? Any better ideas?
>

The parabola-shape curve we're getting at present is clearly wrong;
approaching a horizontal line as an asymptote seems much better.  However,
shouldn't the red line level off at some level *above* the blue line rather
than *at* the blue line? Reading the index pages isn't free, so a
sequential scan should be preferred when we're going to read the whole
table anyway.
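
For reference, the per-page interpolation under discussion is, as far as I can
tell from costsize.c, approximately

    cost_per_page = random_page_cost
                    - (random_page_cost - seq_page_cost) * sqrt(pages_fetched / T)

where T is the total number of pages in the table (with the cost falling back
to plain random_page_cost when only a page or so is fetched). Multiplying that
by pages_fetched is what yields the parabola-like total-cost curve in the graph.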

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Bitmap table scan cost per page formula

2017-12-20 Thread Robert Haas
On Tue, Dec 19, 2017 at 10:25 PM, Justin Pryzby  wrote:
> In this old thread: 
> https://www.postgresql.org/message-id/CAGTBQpZ%2BauG%2BKhcLghvTecm4-cGGgL8vZb5uA3%3D47K7kf9RgJw%40mail.gmail.com
> ..Claudio Freire  wrote:
>> Correct me if I'm wrong, but this looks like the planner not
>> accounting for correlation when using bitmap heap scans.
>>
>> Checking the source, it really doesn't.
>
> ..which I think is basically right: the formula does distinguish between the
> cases of small or large fraction of pages, but doesn't use correlation.  Our
> issue in that case seems to be mostly a failure of cost_index to account for
> fine-scale deviations from large-scale correlation; but, if cost_bitmap
> accounted for our high correlation metric (>0.99), it might've helped our 
> case.

I think this is a different and much harder problem than the one
Haisheng Yuan is attempting to fix.  His data shows that the cost
curve has a nonsensical shape even when the assumption that pages are
spread uniformly is correct.  That should surely be fixed.  Now, being
able to figure out whether the assumption of uniform spread is correct
on a particular table would be nice too, but it seems like a much
harder problem.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Tracking of page changes for backup purposes. PTRACK [POC]

2017-12-20 Thread Robert Haas
On Wed, Dec 20, 2017 at 3:15 PM, Pavel Stehule  wrote:
>> > So I'm somewhat hesitant to proclaim option 5 as the clear winner, here.
>>
>> I agree.  I think (4) is better.
>
> Can it depend on the load? For smaller, intensively updated databases
> (5) can be optimal; for large, less frequently updated databases (4) can
> be better.

It seems to me that the difference is that (4) tracks which pages have
changed in the background, and (5) does it in the foreground.  Why
would we want the latter?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Tracking of page changes for backup purposes. PTRACK [POC]

2017-12-20 Thread Pavel Stehule
2017-12-20 21:11 GMT+01:00 Robert Haas :

> On Tue, Dec 19, 2017 at 5:37 PM, Tomas Vondra
>  wrote:
> > On 12/18/2017 11:18 AM, Anastasia Lubennikova wrote:
> >> 1. Use file modification time as a marker that the file has changed.
> >> 2. Compute file checksums and compare them.
> >> 3. LSN-based mechanisms. Backup pages with LSN >= last backup LSN.
> >> 4. Scan all WAL files in the archive since the previous backup and
> >> collect information about changed pages.
> >> 5. Track page changes on the fly. (ptrack)
> >
> > I share the opinion that options 1 and 2 are not particularly
> > attractive, due to either unreliability, or not really saving that much
> > CPU and I/O.
> >
> > I'm not quite sure about 3, because it doesn't really explain how would
> > it be done - it seems to assume we'd have to reread the files. I'll get
> > back to this.
> >
> > Option 4 has some very interesting features. Firstly, relies on WAL and
> > so should not require any new code (and it could, in theory, support
> > even older PostgreSQL releases, for example). Secondly, this can be
> > offloaded to a different machine. And it does even support additional
> > workflows - e.g. "given these two full backups and the WAL, generate an
> > incremental backup between them".
> >
> > So I'm somewhat hesitant to proclaim option 5 as the clear winner, here.
>
> I agree.  I think (4) is better.
>

Can it depend on the load? For smaller, intensively updated databases
(5) can be optimal; for large, less frequently updated databases (4) can
be better.

Regards

Pavel


> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>
>


Re: Tracking of page changes for backup purposes. PTRACK [POC]

2017-12-20 Thread Robert Haas
On Tue, Dec 19, 2017 at 5:37 PM, Tomas Vondra
 wrote:
> On 12/18/2017 11:18 AM, Anastasia Lubennikova wrote:
>> 1. Use file modification time as a marker that the file has changed.
>> 2. Compute file checksums and compare them.
>> 3. LSN-based mechanisms. Backup pages with LSN >= last backup LSN.
>> 4. Scan all WAL files in the archive since the previous backup and
>> collect information about changed pages.
>> 5. Track page changes on the fly. (ptrack)
>
> I share the opinion that options 1 and 2 are not particularly
> attractive, due to either unreliability, or not really saving that much
> CPU and I/O.
>
> I'm not quite sure about 3, because it doesn't really explain how would
> it be done - it seems to assume we'd have to reread the files. I'll get
> back to this.
>
> Option 4 has some very interesting features. Firstly, relies on WAL and
> so should not require any new code (and it could, in theory, support
> even older PostgreSQL releases, for example). Secondly, this can be
> offloaded to a different machine. And it does even support additional
> workflows - e.g. "given these two full backups and the WAL, generate an
> incremental backup between them".
>
> So I'm somewhat hesitant to proclaim option 5 as the clear winner, here.

I agree.  I think (4) is better.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Missed parallelism option in plpgsql?

2017-12-20 Thread Robert Haas
On Wed, Dec 20, 2017 at 2:26 PM, Tom Lane  wrote:
> I happened to notice that while writing this in plpgsql
> will consider a parallel plan:
>
> select count(*) into s from tenk1 where ten = x;
>
> writing this will not:
>
> s := count(*) from tenk1 where ten = x;
>
> Is that intentional?  Seems to me these cases ought to be
> treated the same.
>
> The reason for it is that exec_assign_expr's call of
> exec_prepare_plan does not specify CURSOR_OPT_PARALLEL_OK.

Hmm, I think I didn't realize that you could include a "from" clause
in an assignment statement.  I assumed it would just be a simple
expression getting evaluated, which wouldn't make sense to do in
parallel.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Proposal: Local indexes for partitioned table

2017-12-20 Thread Alvaro Herrera
Just to show what I'm talking about, here are a few prototype patches.
Beware, these are all very lightly tested/reviewed.  Input is welcome,
particularly if it comes in the guise of patches to regression tests
showing cases that misbehave.

0001 is a fixup for the v6 patch I posted upthread; it's needed so the
0002 patch doesn't end up with indexes marked invalid after pg_dump.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>From bd26504debb14567de780da10bf859b205326b46 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Wed, 20 Dec 2017 16:31:05 -0300
Subject: [PATCH v7 1/4] Fixup! Local partitioned indexes v6

Don't actually mark an index invalid if it's on a plain table,
rather than a partition, even with ONLY specified.
---
 src/backend/commands/indexcmds.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index a757f05dd5..e925351056 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -723,7 +723,7 @@ DefineIndex(Oid relationId,
flags |= INDEX_CREATE_PARTITIONED;
if (stmt->primary)
flags |= INDEX_CREATE_IS_PRIMARY;
-   if (stmt->relation && !stmt->relation->inh)
+   if (partitioned && stmt->relation && !stmt->relation->inh)
flags |= INDEX_CREATE_INVALID;
 
if (stmt->deferrable)
-- 
2.11.0

>From b10f9afbe4b1c46c7a2706c70a398307732ac9fc Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Mon, 6 Nov 2017 17:04:55 +0100
Subject: [PATCH v7 2/4] allow indexes on partitioned tables to be unique

---
 src/backend/commands/indexcmds.c   |  71 +--
 src/backend/parser/parse_utilcmd.c |  24 -
 src/test/regress/expected/alter_table.out  |   8 --
 src/test/regress/expected/create_table.out |  12 ---
 src/test/regress/expected/indexing.out | 142 -
 src/test/regress/sql/alter_table.sql   |   2 -
 src/test/regress/sql/create_table.sql  |   8 --
 src/test/regress/sql/indexing.sql  |  73 ++-
 8 files changed, 274 insertions(+), 66 deletions(-)

diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index e925351056..86dcab6136 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -426,20 +426,11 @@ DefineIndex(Oid relationId,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 errmsg("cannot create index on 
partitioned table \"%s\" concurrently",

RelationGetRelationName(rel;
-   if (stmt->unique)
-   ereport(ERROR,
-   (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-errmsg("cannot create unique index on 
partitioned table \"%s\"",
-   
RelationGetRelationName(rel;
if (stmt->excludeOpNames)
ereport(ERROR,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 errmsg("cannot create exclusion 
constraints on partitioned table \"%s\"",

RelationGetRelationName(rel;
-   if (stmt->primary || stmt->isconstraint)
-   ereport(ERROR,
-   (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-errmsg("cannot create constraints on 
partitioned tables")));
}
 
/*
@@ -637,6 +628,68 @@ DefineIndex(Oid relationId,
index_check_primary_key(rel, indexInfo, is_alter_table);
 
/*
+* If this table is partitioned and we're creating a unique index or a
+* primary key, make sure that the indexed columns are part of the
+* partition key.  Otherwise it would be possible to violate uniqueness 
by
+* putting values that ought to be unique in different partitions.
+*
+* We could lift this limitation if we had global indexes, but those 
have
+* their own problems, so this is a useful feature combination.
+*/
+   if (partitioned && (stmt->unique || stmt->primary))
+   {
+   PartitionKey key = rel->rd_partkey;
+   int i;
+
+   /*
+* A partitioned table can have unique indexes, as long as all 
the
+* columns in the partition key appear in the unique key.  A
+* partition-local index can enforce global uniqueness iff the 
PK
+* value completely determines the partition that a row is in.
+*
+* Thus, verify that all the columns i

Re: Cost Model

2017-12-20 Thread neto brpr
2017-12-20 17:34 GMT-02:00 Andres Freund :

> On 2017-12-20 17:13:31 -0200, neto brpr wrote:
> > Just to explain it better. The idea of ​​differentiating read and write
> > parameters (sequential and random) is exactly so that the access plans
> can
> > be better chosen by the optimizer. But for this, the Hash join, merge
> join,
> > sorting and other algorithms should also be changed to consider these new
> > parameters.
>
> I'm doubtful that there's that much benefit. Mergejoin doesn't write,
> hashjoins commonly don't write , and usually if so there's not that many
> alternatives to batched hashjoins. Similar-ish with sorts, although
> sometimes that can instead be done using ordered index scans.
>

Dear Andres
According to some scientific papers, the hash join and sort-merge join
algorithms perform better than a nested loop when running on an HDD, since
the costs of reads and writes are practically the same (symmetrical).
However, on an SSD, where a write costs roughly double a read, this is not
necessarily true, since both algorithms (sort-merge and hash join) require
some write operations on disk when the data does not fit in RAM. The nested
loop, by contrast, only reads, and if there is an index on the inner table
it can be a good alternative, since reads on SSDs are many times faster
than on HDDs. This is an example of a situation in which the difference
between read and write costs could make the optimizer choose the nested
loop rather than the hash join.

Regards


> What are the cases you forsee where costing reads/writes differently
> will lead to better plans?
>
> Greetings,
>
> Andres Freund
>





PGCon 2018 call for papers

2017-12-20 Thread Dan Langille
PGCon 2018 will be on 29 May - 1 June 2018 at University of Ottawa.

* 29-30 May (Tue-Wed) tutorials
* 30 May (Wed) The Unconference
* 31 May - 1 June (Thu-Fri) talks - the main part of the conference

See http://www.pgcon.org/2018/

We are now accepting proposals for the main part of the conference (31 May - 1 
June).
Proposals can be quite simple. We do not require academic-style papers.

If you are doing something interesting with PostgreSQL, please submit
a proposal.  You might be one of the backend hackers or work on a
PostgreSQL related project and want to share your know-how with
others. You might be developing an interesting system using PostgreSQL
as the foundation. Perhaps you migrated from another database to
PostgreSQL and would like to share details.  These, and other stories
are welcome. Both users and developers are encouraged to share their
experiences.

Here are some ideas to jump-start your proposal process:

- novel ways in which PostgreSQL is used
- migration of production systems from another database
- data warehousing
- tuning PostgreSQL for different work loads
- replication and clustering
- hacking the PostgreSQL code
- PostgreSQL derivatives and forks
- applications built around PostgreSQL
- benchmarking and performance engineering
- case studies
- location-aware and mapping software with PostGIS
- The latest PostgreSQL features and features in development
- research and teaching with PostgreSQL
- things the PostgreSQL project could do better
- integrating PostgreSQL with 3rd-party software

Both users and developers are encouraged to share their experiences.

The schedule is:

1 Dec 2017 Proposal acceptance begins
19 Jan 2018 Proposal acceptance ends
19 Feb 2018 Confirmation of accepted proposals

NOTE: the call for lightning talks will go out very close to the conference.
Do not submit lightning talks proposals until then.

See also http://www.pgcon.org/2018/papers.php

Instructions for submitting a proposal to PGCon 2018 are available
from: http://www.pgcon.org/2018/submissions.php

-- 
Dan Langille - BSDCan / PGCon
d...@langille.org




Re: Cost Model

2017-12-20 Thread Andres Freund
On 2017-12-20 17:13:31 -0200, neto brpr wrote:
> Just to explain it better. The idea of ​​differentiating read and write
> parameters (sequential and random) is exactly so that the access plans can
> be better chosen by the optimizer. But for this, the Hash join, merge join,
> sorting and other algorithms should also be changed to consider these new
> parameters.

I'm doubtful that there's that much benefit. Mergejoin doesn't write,
hashjoins commonly don't write , and usually if so there's not that many
alternatives to batched hashjoins. Similar-ish with sorts, although
sometimes that can instead be done using ordered index scans.

What are the cases you forsee where costing reads/writes differently
will lead to better plans?

Greetings,

Andres Freund



Re: Cost Model

2017-12-20 Thread David G. Johnston
On Wed, Dec 20, 2017 at 12:26 PM, neto brpr  wrote:

>
> About what you said, that some settings can be configured per tablespace:
> I have already seen this in IBM DB2, but in PostgreSQL, as far as I know,
> for example the random_page_cost and seq_page_cost parameters are
> configured for the entire database and not per tablespace, right?
> Can you or someone tell me whether cost parameters can be configured per
> tablespace? This would be useful for me, thinking of a server with a
> hybrid storage environment (HDD and SSD): I could put some data in a
> tablespace with HDD-adapted settings, and likewise for an SSD disk.
>

ALTER TABLESPACE

https://www.postgresql.org/docs/10/static/sql-altertablespace.html
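
For example (the tablespace names and values here are purely illustrative):

ALTER TABLESPACE fast_ssd SET (seq_page_cost = 1.0, random_page_cost = 1.1);
ALTER TABLESPACE slow_hdd SET (random_page_cost = 4.0);

Tables and indexes placed in each tablespace are then costed with those
overrides instead of the server-wide settings.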

​David J.​


Re: Cost Model

2017-12-20 Thread neto brpr
2017-12-20 16:37 GMT-02:00 David G. Johnston :

> On Wed, Dec 20, 2017 at 11:26 AM, neto brpr  wrote:
>
>> Dear David
>> I have read the documentation that you sent, but it only has sequential page
>> cost and random page cost parameters. What I need would be a cost model that
>> differentiates reads and writes (sequential and random), because on SSDs
>> reads and writes have different costs. If you or someone knows of a patch
>> or other solution that allows configuring individual parameters for:
>>
>> - sequential read page cost
>> - random read page cost
>> - sequential write page cost
>> - random write page cost
>>
>>
> ​Please don't top-post.
>

OK, sorry!

>
> OK...reading more closely I don't see how "recording/writing" costs are
> important in decision making.  Either you have to write something, or you
> don't.  If you do, you update pages in a buffer and generate WAL, and then
> the system puts the data onto disk where it belongs - the database itself
> doesn't care about that part, and knowing how fast or slow it might happen
> wouldn't impact its behavior.  So PostgreSQL provides read settings to be
> tuned so it can decide between index and table scans on the table in question.
>
> The system doesn't understand SSD or HDD but does understand tablespaces
> and I believe that many of these settings are able to be configured on a
> per-tablespace (or table?) basis.
>
>
About what you said, that some settings can be configured per tablespace:
I have already seen this in IBM DB2, but in PostgreSQL, as far as I know,
for example the random_page_cost and seq_page_cost parameters are
configured for the entire database and not per tablespace, right?
Can you or someone tell me whether cost parameters can be configured per
tablespace? This would be useful for me, thinking of a server with a
hybrid storage environment (HDD and SSD): I could put some data in a
tablespace with HDD-adapted settings, and likewise for an SSD disk.

Regards
Neto

David J.
>
>




Missed parallelism option in plpgsql?

2017-12-20 Thread Tom Lane
I happened to notice that while writing this in plpgsql
will consider a parallel plan:

select count(*) into s from tenk1 where ten = x;

writing this will not:

s := count(*) from tenk1 where ten = x;

Is that intentional?  Seems to me these cases ought to be
treated the same.

The reason for it is that exec_assign_expr's call of
exec_prepare_plan does not specify CURSOR_OPT_PARALLEL_OK.
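
A self-contained way to see the difference, using the regression-test table
mentioned above (the function name is made up, and this is only a sketch):

create function paratest(x int) returns bigint as $$
declare
    s bigint;
begin
    select count(*) into s from tenk1 where ten = x;  -- will consider a parallel plan
    s := count(*) from tenk1 where ten = x;           -- currently will not
    return s;
end;
$$ language plpgsql;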

regards, tom lane



Re: Cost Model

2017-12-20 Thread Alvaro Herrera
neto brpr wrote:

> Anyway, It seems that there has not yet been any initiative related to this
> in the postgresql community, am I right?

Yes.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Proposal: Local indexes for partitioned table

2017-12-20 Thread Alvaro Herrera
Robert Haas wrote:

> Sounds great!  I made the comment during my talk at PGCONF.EU that
> partitioned tables in PostgreSQL are really just a bunch of tables
> glued together, but that over time "we'll make the glue better", and I
> think the improvements on which you are working will go a long way in
> that direction.

Thanks -- yes, that's my intention, to improve our partitioning story.
Many other projects related to improving the glue are already underway
by others, so I think we're making great progress on the overall
partitioning front.

> With regard to indexes, presumably we can only have a unique index if
> the index definition matches the partition key definition.  Otherwise,
> we can only guarantee per-partition uniqueness, unless (1) we
> implement "global" indexes, which sounds like a patch that would take
> a fair amount of work, or (2) every index insertion goes and checks
> for conflicting values in all of the "sibling" indexes.  This latter
> approach was suggested to me by Andres, and certainly solves a bunch
> of problems with vacuuming and dropping indexes, but seems like it
> would probably have lousy insert performance unless the number of
> partitions is very small.

Actually, we discussed unique constraints before.  What I proposed is to
implement a useful subset: only allow a unique constraint if the
partition columns are in it, which means that enforcing the constraint
on each partition independently is enough to satisfy the constraint
globally.  This doesn't solve all cases where you want UNIQUE on
partitioned tables, but the reason to implement this subset is precisely
that it doesn't have those drawbacks, so it's quick and efficient.
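
As a sketch of what that subset means in practice (the table and column names
here are invented):

create table measurement (city_id int, logdate date, peak int)
    partition by range (logdate);

-- allowed: the partition key column is part of the unique key, so enforcing
-- uniqueness within each partition is enough to make it globally unique
create unique index on measurement (logdate, city_id);

-- not allowed under this proposal: rows with the same city_id could land in
-- different partitions, so per-partition indexes cannot enforce it
create unique index on measurement (city_id);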

Another point is that global indexes are not only a lot of work, but
they also come with additional drawbacks such as the necessity to
clean up when partitions are dropped/detached, which is a huge pain.

> With regard to foreign keys, it seems like there are two directions to
> worry about.  An "outbound" foreign key needs to cascade, I think,
> more or less like an index does, with every child inheriting the
> foreign key from the partition root.  An "inbound" foreign key just
> needs a unique index to point at and depend on.  Given the work you
> are doing on unique indexes, I'm guessing that you are talking about
> supporting the latter, not the former, but the way you phrased it
> sounds like the opposite.

Actually, I intend to do both.  I have a prototype patch which seems to
work, though I haven't hammered it heavily yet.  Some of the cascading
options are not implemented -- as I recall I implemented it for the
CREATE TABLE PARTITION OF case but not yet the ALTER TABLE ATTACH
PARTITION one.

> I haven't thought much about the possibility of supporting FOR EACH
> ROW triggers, but it sounds like a good idea if we can do it.  I'm not
> sure how that will work, though, if users directly access the leaf
> partitions.

Yeah, I'm not sure yet either.  Needs more thought than I've given it so
far.  I have a prototype patch for this also, and the basic case seems
to give reasonable results, but I haven't tried to find any weird corner
cases either.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Cost Model

2017-12-20 Thread neto brpr
2017-12-20 16:35 GMT-02:00 Alvaro Herrera :

> neto brpr wrote:
> > Dear David
> > I have read the documentation that you sent, but it only has sequential page
> > cost and random page cost parameters. What I need would be a cost model that
> > differentiates reads and writes (sequential and random), because on SSDs
> > reads and writes have different costs.
>
> I don't think it matters.  I mean, differentiating seq/random read
> speeds can make the planner choose one plan type over another depending
> on how much each plan intends to read randomly or sequentially.  But why
> does it matter if one write is 360x as expensive as one read?  No plan
> is going to change usefully because of that, because you can't turn one
> write into 360 reads or even 1 reads.
>
>
Just to explain it better. The idea of ​​differentiating read and write
parameters (sequential and random) is exactly so that the access plans can
be better chosen by the optimizer. But for this, the Hash join, merge join,
sorting and other algorithms should also be changed to consider these new
parameters.
Because postgresql uses a cost-based optimizer, I believe that
differentiating these costs can have a positive impact on the process of
choosing access methods... This is just an opinion, I'm not sure.

If you said "writes of type X are 100 times as fast as writes of type
> Y", then some useful cost model could perhaps be developed.  But that's
> not what you're saying.
>
>
Anyway, It seems that there has not yet been any initiative related to this
in the postgresql community, am I right?

Best Regards
Neto


> --
> Álvaro Herrera                https://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>






Re: [HACKERS] Proposal: Local indexes for partitioned table

2017-12-20 Thread Robert Haas
On Wed, Dec 20, 2017 at 12:01 PM, Alvaro Herrera
 wrote:
> I have two patches to rebase on top of this, which I hope to post later
> today:
>
> 1) let these indexes be unique (i.e. allow unique and PK constraints)
> 2) allow foreign keys on partitioned tables
>
> I have a further patch to allow FOR EACH ROW triggers on partitioned
> tables, but I think it's largely unrelated to this (and, as I was
> surprised to discover, it's also unrelated to the FKs one).

Sounds great!  I made the comment during my talk at PGCONF.EU that
partitioned tables in PostgreSQL are really just a bunch of tables
glued together, but that over time "we'll make the glue better", and I
think the improvements on which you are working will go a long way in
that direction.

With regard to indexes, presumably we can only have a unique index if
the index definition matches the partition key definition.  Otherwise,
we can only guarantee per-partition uniqueness, unless (1) we
implement "global" indexes, which sounds like a patch that would take
a fair amount of work, or (2) every index insertion goes and checks
for conflicting values in all of the "sibling" indexes.  This latter
approach was suggested to me by Andres, and certainly solves a bunch
of problems with vacuuming and dropping indexes, but seems like it
would probably have lousy insert performance unless the number of
partitions is very small.

With regard to foreign keys, it seems like there are two directions to
worry about.  An "outbound" foreign key needs to cascade, I think,
more or less like an index does, with every child inheriting the
foreign key from the partition root.  An "inbound" foreign key just
needs a unique index to point at and depend on.  Given the work you
are doing on unique indexes, I'm guessing that you are talking about
supporting the latter, not the former, but the way you phrased it
sounds like the opposite.

I haven't thought much about the possibility of supporting FOR EACH
ROW triggers, but it sounds like a good idea if we can do it.  I'm not
sure how that will work, though, if users directly access the leaf
partitions.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: AS OF queries

2017-12-20 Thread Pantelis Theodosiou
On Wed, Dec 20, 2017 at 4:26 PM, Tom Lane  wrote:

> Peter Eisentraut  writes:
> > On 12/20/17 10:29, Tom Lane wrote:
> >> Please say that's just an Oracle-ism and not SQL standard, because it's
> >> formally ambiguous.
>
> > The SQL standard syntax appears to be something like
>
> > "tablename" [ AS OF SYSTEM TIME 'something' ] [ [ AS ] "alias" ]
>
> > That's not going to be fun to parse.
>

Examples from DB2 documentation (which may be closer to the standard):

SELECT coverage_amt
FROM policy FOR SYSTEM_TIME AS OF '2010-12-01'
WHERE id = ;


SELECT count(*)
FROM policy FOR SYSTEM_TIME FROM '2011-11-30'
  TO '-12-30'
WHERE vin = 'A';


So besides AS .. AS , it could also be  FROM .. FROM


> Bleah.  In principle we could look two tokens ahead so as to recognize
> "AS OF SYSTEM", but base_yylex is already a horrid mess with one-token
> lookahead; I don't much want to try to extend it to that.
>
> Possibly the most workable compromise is to use lookahead to convert
> "AS OF" to "AS_LA OF", and then we could either just break using OF
> as an alias, or add an extra production that allows "AS_LA OF" to
> be treated as "AS alias" if it's not followed by the appropriate
> stuff.
>
> It's a shame that the SQL committee appears to be so ignorant of
> standard parsing technology.
>
> regards, tom lane
>
>


Re: Basebackups reported as idle

2017-12-20 Thread Alvaro Herrera
Magnus Hagander wrote:

> PFA a patch that fixes this. I think this is bugfix-for-backpatch, I don't
> think it has a large risk of breaking things. Thoughts?

Agreed.  As long as it doesn't show up as idle-in-transaction afterwards
or something odd like that, it should be okay to backpatch.  (I suppose
there's no way for it to end as being in-transaction anyway.)

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Cost Model

2017-12-20 Thread David G. Johnston
On Wed, Dec 20, 2017 at 11:26 AM, neto brpr  wrote:

> Dear David
> I have read the documentation that you sent, but it only has sequential page
> cost and random page cost parameters. What I need would be a cost model that
> differentiates reads and writes (sequential and random), because on SSDs
> reads and writes have different costs. If you or someone knows of a patch
> or other solution that allows configuring individual parameters for:
>
> - sequential read page cost
> - random read page cost
> - sequential write page cost
> - random write page cost
>
>
​Please don't top-post.

OK...reading more closely I don't see how "recording/writing" costs are
important in decision making.  Either you have to write something, or you
don't.  If you do, you update pages in a buffer and generate WAL, and then
the system puts the data onto disk where it belongs - the database itself
doesn't care about that part, and knowing how fast or slow it might happen
wouldn't impact its behavior.  So PostgreSQL provides read settings to be
tuned so it can decide between index and table scans on the table in question.

The system doesn't understand SSD or HDD but does understand tablespaces
and I believe that many of these settings are able to be configured on a
per-tablespace (or table?) basis.

David J.


Re: Cost Model

2017-12-20 Thread Alvaro Herrera
neto brpr wrote:
> Dear David
> I have read the documentation that you sent, but it only has sequential page
> cost and random page cost parameters. What I need would be a cost model that
> differentiates reads and writes (sequential and random), because on SSDs
> reads and writes have different costs.

I don't think it matters.  I mean, differentiating seq/random read
speeds can make the planner choose one plan type over another depending
on how much each plan intends to read randomly or sequentially.  But why
does it matter if one write is 360x as expensive as one read?  No plan
is going to change usefully because of that, because you can't turn one
write into 360 reads or even 1 reads.

If you said "writes of type X are 100 times as fast as writes of type
Y", then some useful cost model could perhaps be developed.  But that's
not what you're saying.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Cost Model

2017-12-20 Thread neto brpr
Dear David
I have read the documentation that you sent, but it only has sequential page
cost and random page cost parameters. What I need would be a cost model that
differentiates reads and writes (sequential and random), because on SSDs
reads and writes have different costs. If you or someone knows of a patch
or other solution that allows configuring individual parameters for:

- sequential read page cost
- random read page cost
- sequential write page cost
- random write page cost

Best Regards
Neto



2017-12-20 15:35 GMT-02:00 David G. Johnston :

> On Wed, Dec 20, 2017 at 10:29 AM, neto brpr  wrote:
>
>> Any comment, hint about it or something, please inform me.
>>
>
> The docs contain this - its seem to cover what you describe:
>
> ​https://www.postgresql.org/docs/10/static/runtime-config-
> query.html#RUNTIME-CONFIG-QUERY-CONSTANTS
>
> David J.
>
>


Re: Letting plpgsql in on the fun with the new expression eval stuff

2017-12-20 Thread Tom Lane
Andres Freund  writes:
> On 2017-12-20 12:12:48 -0500, Tom Lane wrote:
>> I'm using several different test cases, but one that shows up the problem
>> is [...]

> Which certainly seems interesting. The outer ExecInterpExpr() indeed
> doesn't do that much, it's the inner call that's the most relevant
> piece.  That so much time is spent in SPI_fnumber() and the functions it
> calls, namely strcmp(), certainly doesn't seem right.  I suspect that
> without addressing that cost, a lot of the other potential improvements
> aren't going to mean much.

Right, that's mostly coming from the fact that exec_eval_datum on
a RECFIELD does SPI_fnumber() every time.  I have a patch in the
pipeline to address that, but this business with the expression eval
API is a separable issue (and it stands out a lot more with that
patch in place than it does in HEAD ;-)).

>> Um, you left out something here?  I don't mind changing the callback
>> signature, but it seems like it generally ought to look the same as
>> the other out-of-line eval functions.

> Yes, I did. Intercontinental travel does wonders.

> I was thinking that it might be better not to expose the exact details
> of the expression evaluation step struct to plpgsql etc. I'm kinda
> forseeing a bit of further churn there that I don't think other parts of
> the code necessarily need to be affected by. We could have the callback
> have a signature roughly like
> Datum callback(void *private_data, ExprContext econtext, bool *isnull);

I don't especially like that, because it forces the callback to use a
separately allocated private_data area even when the available space
in the op step struct would serve perfectly well.  In my draft patch
I have

--- 338,352 
Oid paramtype;  /* OID of parameter's datatype */
}   param;
  
+   /* for EEOP_PARAM_CALLBACK */
+   struct
+   {
+   ExecEvalSubroutine paramfunc;   /* add-on evaluation subroutine */
+   void   *paramarg;   /* private data for same */
+   int paramid;/* numeric ID for parameter */
+   Oid paramtype;  /* OID of parameter's datatype */
+   }   cparam;
+ 
/* for EEOP_CASE_TESTVAL/DOMAIN_TESTVAL */
struct
{

and it turns out that plpgsql isn't bothering with paramarg because
paramid and paramtype are all it needs.  I doubt that whatever you
have in mind to do to struct ExprEvalStep is likely to be so major
that it changes what we can keep in these fields.

>> I'm not seeing anything that needs to be reset?

> I was kind of thinking of the params_dirty business in
> plpgsql_param_fetch(), setup_param_list() etc. But I'm not quite sure
> how you're thinking of representing parameters on the plpgsql side after
> changing the callbacks style.

Turns out we can just get rid of that junk altogether.  I've redesigned
the ParamListInfo API a bit in service of this, and there no longer seems
to be a need for plpgsql to use a mutable ParamListInfo at all.

Will send a patch in a bit.  I need to write an explanation of what all
I changed.

regards, tom lane



Re: Letting plpgsql in on the fun with the new expression eval stuff

2017-12-20 Thread Andres Freund
Hi,

On 2017-12-20 12:12:48 -0500, Tom Lane wrote:
> Andres Freund  writes:
> > What's the workload you're testing? I'm mildly surprised to see
> > ExecEvalParamExtern() show up, rather than just plpgsql_param_fetch() &
> > exec_eval_datum(). Or were you just listing that to specify the
> > callpath?
>
> I'm using several different test cases, but one that shows up the problem
> is [...]

Thanks.


> The outer do-block is just to get a run long enough for reliable perf
> statistics.  I find these relevant entries in the report:
>
>   Children  Self   Samples  Shared Object  Symbol
> +   98.29% 4.44% 10827  postgres   [.] ExecInterpExpr
> +   61.82% 3.23%  7878  plpgsql.so [.] exec_eval_expr
> +   34.64% 3.91%  9521  postgres   [.] ExecEvalParamExtern
> +   29.78% 4.88% 11892  plpgsql.so [.] plpgsql_param_fetch
> +   23.06% 6.90% 16831  plpgsql.so [.] exec_eval_datum
> +6.28% 2.89%  7049  plpgsql.so [.] setup_param_list
> +4.02% 4.01%  9774  postgres   [.] bms_next_member

> I think the ridiculous "children" percentage for ExecInterpExpr can be
> discounted, because that's a result of essentially everything happening
> inside the outer call of ptest4_rec.

Right.

Unfolding this gives:
-   79.73% 5.66%  postgres  postgres[.] ExecInterpExpr
   - 92.90% ExecInterpExpr
plpgsql_call_handler
plpgsql_exec_function
exec_stmt_block
  - exec_stmts
 - 100.00% exec_for_query
- 68.11% exec_stmts
   - exec_assign_expr
  - 60.96% ExecInterpExpr
 - 99.10% ExecEvalParamExtern
- plpgsql_param_fetch
   + 61.82% SPI_fnumber
 15.87% SPI_getbinval
 14.29% nocachegetattr
 4.18% bms_is_member
 3.84% SPI_gettypeid
   0.90% int4pl
  + 18.95% SPI_plan_get_cached_plan
11.48% bms_next_member
  + 5.13% exec_assign_value
  + 3.13% ReleaseCachedPlan
+ 16.70% SPI_cursor_fetch
+ 7.19% CreateTupleDescCopy
+ 5.49% heap_copytuple
  1.26% AllocSetFree
   + 6.13% 0x
   + 0.71% 0x5624318c8d6f

Which certainly seems interesting. The outer ExecInterpExpr() indeed
doesn't do that much, it's the inner call that's the most relevant
piece.  That so much time is spent in SPI_fnumber() and the functions it
calls, namely strcmp(), certainly doesn't seem right.  I suspect that
without addressing that cost, a lot of the other potential improvements
aren't going to mean much.

Looking at the function cost excluding children:
+7.79%  postgres  plpgsql.so  [.] exec_assign_expr
+7.39%  postgres  plpgsql.so  [.] plpgsql_param_fetch
-6.71%  postgres  libc-2.25.so[.] __strncmp_sse42
   - __strncmp_sse42
  + 99.97% SPI_fnumber
+5.66%  postgres  postgres[.] ExecInterpExpr
-4.60%  postgres  postgres[.] bms_next_member
   - bms_next_member
  - 99.98% exec_assign_expr
-4.59%  postgres  postgres[.] CreateTupleDescCopy
   - CreateTupleDescCopy
  - 92.93% exec_for_query
   exec_stmts
   exec_stmt_block
   plpgsql_exec_function
   plpgsql_call_handler
+4.40%  postgres  postgres[.] AllocSetAlloc
-3.77%  postgres  postgres[.] SPI_fnumber
   + SPI_fnumber
+3.49%  postgres  plpgsql.so  [.] exec_for_query
+2.93%  postgres  postgres[.] GetCachedPlan
+2.90%  postgres  postgres[.] nocachegetattr
+2.85%  postgres  postgres[.] ExecEvalParamExtern
+2.68%  postgres  postgres[.] heap_getnext
+2.64%  postgres  postgres[.] SPI_getbinval
+2.39%  postgres  plpgsql.so  [.] exec_assign_value
+2.22%  postgres  postgres[.] heap_copytuple
+2.21%  postgres  plpgsql.so  [.] exec_stmts

The time in exec_assign_expr() is largely spent doing bms_next_member()
via the inlined setup_param_list().

Certainly shows that there's some expression eval related overhead. But
it seems the lowest hanging fruits aren't quite there, and wouldn't
necessarily be addressed by the type of change we're discussing here.

> >> So the execution would look more or less like
> >>op->eval_param(state, op, econtext);
> >> and there'd need to be some extra fields (at least a void*) in the op
> >> struct where plpgsql could keep private data.
>
> > I think I'd redo the parameters to the callback slightly, but generally
> > that sounds sane. Was thinking of something more like
>
> Um, you left out something here?  I don't mind changing the callback
> signature, but it seems like it gene

Re: Cost Model

2017-12-20 Thread David G. Johnston
On Wed, Dec 20, 2017 at 10:29 AM, neto brpr  wrote:

> Any comment, hint about it or something, please inform me.
>

The docs contain this - it seems to cover what you describe:

https://www.postgresql.org/docs/10/static/runtime-config-query.html#RUNTIME-CONFIG-QUERY-CONSTANTS

David J.


Re: Schema-qualify function calls in information_schema

2017-12-20 Thread Tom Lane
David Fetter  writes:
> Please find attached a patch against master to do $Subject, which
> tones down the hilarity, at least in information_schema.

The views do not need this sort of change, because they're parsed
only once during initdb.

The bodies of functions in information_schema do need qualification,
but I think they've already got it, or at least I remember having
looked through them for the issue in the past.

> Another way to fix this, which I know will be controversial, is simply
> to mandate that pg_catalog (and possibly information_schema) be
> non-changeably first in the search_path.

I think that ship sailed long ago.  It might be workable to attach
"SET search_path" clauses to the functions, if you want to make them
more bulletproof.
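
(I.e. something along these lines for each of the helper functions; just a
sketch, not tested:)

alter function information_schema._pg_interval_type(oid, int4)
    set search_path = pg_catalog, information_schema;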

regards, tom lane



Cost Model

2017-12-20 Thread neto brpr
Hello All

Does anyone know if there is a patch that changes the PostgreSQL cost model
in some way?
I'm working on a tuning technique based on hybrid storage environments
(HDDs and SSDs).

I need to know if there are any patches that allow PostgreSQL to
differentiate, for example, the I/O costs of reads and writes, and also the
costs of sequential and random reads.

This is because the read and write speeds of SSDs are asymmetric: on SSDs
the read speed can be 340 times faster than the write speed.
On hard disks the sequential read and write speeds are almost the same,
while random reads are much slower than sequential reads.
On SSDs, sequential and random reads perform practically the same, because
SSDs are purely electronic devices.
I would like to know if any solution related to this already exists, so
that I do not need to develop it myself.

It would be interesting if there were a patch that adds new parameters to
postgresql.conf for:

- sequential read page cost
- random read page cost
- sequential write page cost
- random write page cost

Any comment, hint about it or something, please inform me.
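
(For comparison, the closest existing knobs can already be overridden per
tablespace, which helps in hybrid HDD/SSD setups; the tablespace names
below are made up, and the write-side parameters requested above do not
exist today, so the commented lines are purely hypothetical.)

alter tablespace ssd_space set (seq_page_cost = 1.0, random_page_cost = 1.1);
alter tablespace hdd_space set (seq_page_cost = 1.0, random_page_cost = 4.0);
-- hypothetical, not implemented:
-- alter tablespace ssd_space set (seq_write_page_cost = ..., random_write_page_cost = ...);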

Regards,
Neto.





Schema-qualify function calls in information_schema

2017-12-20 Thread David Fetter
Folks,

It's possible to arrange for schemas to precede pg_catalog and
information_schema in a search_path setting, and when that's done,
hilarity can ensue, especially when someone has created functions with
identical signatures but non-identical behavior.  People who do that
should probably be presumed to be attackers, but it's conceivable that
such hilarity could merely be poor judgement combined with buggy code.
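
(A minimal illustration of the sort of shadowing meant here; the schema and
function are of course made up:)

create schema evil;
create function evil.lower(text) returns text
    language sql as $$ select 'gotcha' $$;
set search_path = evil, pg_catalog;
select lower('ABC');   -- resolves to evil.lower(), returns 'gotcha'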

Please find attached a patch against master to do $Subject, which
tones down the hilarity, at least in information_schema.  I did not
attempt to go through and make sure that function calls are
schema-qualified all through the back-end, but that seems like a
worthwhile project on grounds of reducing the search_path attack
surface.

Another way to fix this, which I know will be controversial, is simply
to mandate that pg_catalog (and possibly information_schema) be
non-changeably first in the search_path.

What say?

Best,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
>From 41e10d687817f12eb0eadd5d4b967696eccd1c1f Mon Sep 17 00:00:00 2001
From: David Fetter 
Date: Mon, 18 Dec 2017 16:35:28 -0500
Subject: [PATCH] Schema-qualified function calls

---
 src/backend/catalog/information_schema.sql | 156 ++---
 1 file changed, 78 insertions(+), 78 deletions(-)

diff --git a/src/backend/catalog/information_schema.sql b/src/backend/catalog/information_schema.sql
index 360725d59a..de4dcac3c3 100644
--- a/src/backend/catalog/information_schema.sql
+++ b/src/backend/catalog/information_schema.sql
@@ -186,7 +186,7 @@ CREATE FUNCTION _pg_interval_type(typid oid, mod int4) RETURNS text
 AS
 $$SELECT
   CASE WHEN $1 IN (1186) /* interval */
-   THEN upper(substring(format_type($1, $2) from 'interval[()0-9]* #"%#"' for '#'))
+   THEN pg_catalog.upper(substring(format_type($1, $2) from 'interval[()0-9]* #"%#"' for '#'))
ELSE null
   END$$;
 
@@ -262,7 +262,7 @@ CREATE VIEW applicable_roles AS
 FROM pg_auth_members m
  JOIN pg_authid a ON (m.member = a.oid)
  JOIN pg_authid b ON (m.roleid = b.oid)
-WHERE pg_has_role(a.oid, 'USAGE');
+WHERE pg_catalog.pg_has_role(a.oid, 'USAGE');
 
 GRANT SELECT ON applicable_roles TO PUBLIC;
 
@@ -312,12 +312,12 @@ CREATE VIEW attributes AS
  AS data_type,
 
CAST(
- _pg_char_max_length(_pg_truetypid(a, t), _pg_truetypmod(a, t))
+ information_schema._pg_char_max_length(_pg_truetypid(a, t), information_schema._pg_truetypmod(a, t))
  AS cardinal_number)
  AS character_maximum_length,
 
CAST(
- _pg_char_octet_length(_pg_truetypid(a, t), _pg_truetypmod(a, t))
+ information_schema._pg_char_octet_length(_pg_truetypid(a, t), information_schema._pg_truetypmod(a, t))
  AS cardinal_number)
  AS character_octet_length,
 
@@ -330,27 +330,27 @@ CREATE VIEW attributes AS
CAST(co.collname AS sql_identifier) AS collation_name,
 
CAST(
- _pg_numeric_precision(_pg_truetypid(a, t), _pg_truetypmod(a, t))
+ information_schema._pg_numeric_precision(_pg_truetypid(a, t), information_schema._pg_truetypmod(a, t))
  AS cardinal_number)
  AS numeric_precision,
 
CAST(
- _pg_numeric_precision_radix(_pg_truetypid(a, t), _pg_truetypmod(a, t))
+ information_schema._pg_numeric_precision_radix(_pg_truetypid(a, t), information_schema._pg_truetypmod(a, t))
  AS cardinal_number)
  AS numeric_precision_radix,
 
CAST(
- _pg_numeric_scale(_pg_truetypid(a, t), _pg_truetypmod(a, t))
+ information_schema._pg_numeric_scale(_pg_truetypid(a, t), information_schema._pg_truetypmod(a, t))
  AS cardinal_number)
  AS numeric_scale,
 
CAST(
- _pg_datetime_precision(_pg_truetypid(a, t), _pg_truetypmod(a, t))
+ information_schema._pg_datetime_precision(_pg_truetypid(a, t), information_schema._pg_truetypmod(a, t))
  AS cardinal_number)
  AS datetime_precision,
 
CAST(
- _pg_interval_type(_pg_truetypid(a, t), _pg_truetypmod(a, t))
+ information_schema._pg_interval_type(_pg_truetypid(a, t), information_schema._pg_truetypmod(a, t))
  AS character_data)
  AS interval_type,
CAST(null AS cardinal_number) AS interval_precision,
@@ -425,7 +425,7 @@ CREATE VIEW check_constraint_routine_usage AS
   AND d.refobjid = p.oid
   AND d.refclassid = 'pg_catalog.pg_proc'::regclass
   AND p.pronamespace = np.oid
-  AND pg_has_role(p.proowner, 'USAGE');
+  AND pg_catalog.pg_has_role(p.proowner, 'USAGE');
 
 GRANT SELECT ON check_constraint_routine_usage TO PUBLIC;
 
@@ -445,7 +445,7 @@ CREATE VIEW ch

Re: Letting plpgsql in on the fun with the new expression eval stuff

2017-12-20 Thread Tom Lane
Andres Freund  writes:
> On 2017-12-19 13:00:41 -0500, Tom Lane wrote:
>> I'm looking at ways to get plpgsql expression evaluation to go faster,
>> and one thing I'm noticing is the rather large overhead of going through
>> ExecEvalParamExtern and plpgsql_param_fetch to get to the useful work
>> (exec_eval_datum).

> What's the workload you're testing? I'm mildly surprised to see
> ExecEvalParamExtern() show up, rather than just plpgsql_param_fetch() &
> exec_eval_datum(). Or were you just listing that to specify the
> callpath?

I'm using several different test cases, but one that shows up the problem
is

create table foo(a int, b int, c int);
insert into foo select 1,2,3 from generate_series(1,100);
vacuum foo;

create or replace function ptest4_rec() returns void as $$
declare r record; t int;
begin
  for r in select * from foo loop
t := r.a + r.b + r.c;
  end loop;
end$$
language plpgsql stable;

-- then trace this:
do $$
begin
for i in 1..200 loop
  perform ptest4_rec();
end loop;
end; $$ language plpgsql;

The outer do-block is just to get a run long enough for reliable perf
statistics.  I find these relevant entries in the report:

  Children  Self   Samples  Shared Object  Symbol
+   98.29% 4.44% 10827  postgres   [.] ExecInterpExpr
+   61.82% 3.23%  7878  plpgsql.so [.] exec_eval_expr
+   34.64% 3.91%  9521  postgres   [.] ExecEvalParamExtern
+   29.78% 4.88% 11892  plpgsql.so [.] plpgsql_param_fetch
+   23.06% 6.90% 16831  plpgsql.so [.] exec_eval_datum
+6.28% 2.89%  7049  plpgsql.so [.] setup_param_list
+4.02% 4.01%  9774  postgres   [.] bms_next_member

I think the ridiculous "children" percentage for ExecInterpExpr can be
discounted, because that's a result of essentially everything happening
inside the outer call of ptest4_rec.  But the rest of it can be taken
at face value, and it says that ExecEvalParamExtern and
plpgsql_param_fetch are each pretty expensive; and both of them are just
interface routines doing little useful work.

>> We've ameliorated that for DTYPE_VAR variables
>> by keeping a pre-set-up copy of their values in a ParamListInfo struct,
>> but that's pretty ugly and carries a bunch of costs of its own.

> Just to make sure I understand correctly, you're talking about
> setup_unshared_param_list() / setup_param_list()?

Right, and the stuff in e.g. assign_simple_var() to keep the paramlist
up to date.

>> So the execution would look more or less like
>>  op->eval_param(state, op, econtext);
>> and there'd need to be some extra fields (at least a void*) in the op
>> struct where plpgsql could keep private data.

> I think I'd redo the parameters to the callback slightly, but generally
> that sounds sane. Was thinking of something more like

Um, you left out something here?  I don't mind changing the callback
signature, but it seems like it generally ought to look the same as
the other out-of-line eval functions.

> One question I have is how we will re-initialize the relevant state
> between exec_simple_expr() calls. I guess the most realistic one is that
> the op will have a pointer into an array managed by exec_simple_expr()
> that can get reset?

I'm not seeing anything that needs to be reset?

> Hm. We could have a version of ExecInitExpr() / ExecInitExprRec that
> first gives a callback chance to handle an operation. That'd make
> overwriting parts like this quite easy, because we know we want to
> handle Param nodes anywhere differently. So we'd not have a Param
> specific routine, but the ability to intercept everything, and defer to
> ExecInitExprRec() if undesired.

Yeah, I thought of that idea too, but at least for what I'm trying to
do here, it doesn't seem all that helpful.  The problem is that plpgsql
needs to get its hooks into not only its own direct calls of ExecInitExpr
for "simple" expressions, but also calls that occur within arbitrary plan
trees that are parsed/planned/executed through SPI.  And then things need
to happen entirely differently in parallel child workers, which will have
"flat" copies of the ParamListInfo.  So I really want a hook that's tied
to ParamListInfo, and that only makes much sense for PARAM_EXTERN Params.
There may be use-cases for the more general hook you're talking about,
but I'm not very clear where that would be set up.

>> Another way we could do it is to invent ExecInitExprWithParams() that
>> takes an additional ParamListInfo pointer, and use that.  Rather than
>> adding yet one more parameter that has to be passed down through
>> ExecInitExprRec, I suggest that we could waste a bit of space in
>> struct ExprState and store that value there.  Maybe do the same with
>> the parent pointer so as to reduce the number of recursive parameters.

> I was thinking something slightly different. Namely that we should have
> two structs ExprState and ExprBuildState. We can stuff random additi

Re: [HACKERS] Proposal: Local indexes for partitioned table

2017-12-20 Thread Alvaro Herrera
Great, thanks for the input.

pg_dump behaves as described upthread -- thanks David and Robert for the
input.  I did this by injecting a fake "INDEX ATTACH" object in
pg_dump's model, which depends on the index-on-parent-table; which in
turn depends on the index-on-partition.  Because of the dependencies,
the attach is dumped last, and the index-on-parent is dumped after all
the indexes-on-partitions have been dumped.  I needed to apply a little
bit of surgery so that each table object would contain links to its
indexes, which previously weren't there.  A followup step to obtaining
index info creates the INDEX ATTACH objects.

In the backend code: There is now a difference in initial setting of
indisvalid and indisready when creating an index.  Previously we always
set them the same (!concurrent) but now indisready depends only on
"concurrent" while indisvalid additionally depends on whether a a
non-inherited index on a partitioned table is being built.  Failing to
set indisready to true initially was causing the index to be invisible
to pg_dump.
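
(For reference, the two flags in question are visible directly in the
catalog; the table name below is made up:)

select indexrelid::regclass, indisvalid, indisready
from pg_index
where indrelid = 'some_partitioned_table'::regclass;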

There is no ALTER INDEX / DETACH, as discussed.  Also, trying to attach
an index for a partition that already contains an attached index raises
an error -- which means that determining whether attaching an index as a
partition makes the parent index valid is a simple matter of counting
tuples.
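
(In SQL terms the workflow looks roughly like this, using the syntax from
the attached doc changes; object names are made up:)

create table parted (a int) partition by list (a);
create table parted_1 partition of parted for values in (1);

create index parted_idx on only parted (a);  -- invalid until all partition indexes are attached
create index parted_1_idx on parted_1 (a);
alter index parted_idx attach partition parted_1_idx;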

All these behaviors are immortalized in the regression tests.

I added tab-complete support too (thanks Jesper for the reminder).  It
is probably not exhaustive, but should be good enough.

I have two patches to rebase on top of this, which I hope to post later
today:

1) let these indexes be unique (i.e. allow unique and PK constraints)
2) allow foreign keys on partitioned tables

I have a further patch to allow FOR EACH ROW triggers on partitioned
tables, but I think it's largely unrelated to this (and, as I was
surprised to discover, it's also unrelated to the FKs one).

Thanks

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
diff --git a/doc/src/sgml/ref/alter_index.sgml 
b/doc/src/sgml/ref/alter_index.sgml
index e54237272c..3984686d67 100644
--- a/doc/src/sgml/ref/alter_index.sgml
+++ b/doc/src/sgml/ref/alter_index.sgml
@@ -23,6 +23,7 @@ PostgreSQL documentation
 
 ALTER INDEX [ IF EXISTS ] name 
RENAME TO new_name
 ALTER INDEX [ IF EXISTS ] name 
SET TABLESPACE tablespace_name
+ALTER INDEX name ATTACH PARTITION 
index_name
 ALTER INDEX name DEPENDS ON 
EXTENSION extension_name
 ALTER INDEX [ IF EXISTS ] name 
SET ( storage_parameter = 
value [, ... ] )
 ALTER INDEX [ IF EXISTS ] name 
RESET ( storage_parameter [, ... ] 
)
@@ -76,6 +77,19 @@ ALTER INDEX ALL IN TABLESPACE name

 

+ATTACH
+
+ 
+  Causes the named index to become attached to the altered index.
+  The named index must be on a partition of the table containing the
+  index being altered, and have an equivalent definition.  An attached
+  index cannot be dropped by itself, and will automatically be dropped
+  if its parent index is dropped.
+ 
+
+   
+
+   
 DEPENDS ON EXTENSION
 
  
diff --git a/doc/src/sgml/ref/alter_table.sgml 
b/doc/src/sgml/ref/alter_table.sgml
index 7bcf242846..0a2f3e3646 100644
--- a/doc/src/sgml/ref/alter_table.sgml
+++ b/doc/src/sgml/ref/alter_table.sgml
@@ -783,7 +783,10 @@ ALTER TABLE [ IF EXISTS ] name
   as a partition of the target table. The table can be attached
   as a partition for specific values using FOR VALUES
or as a default partition by using DEFAULT
-  .
+  .  For each index in the target table, a corresponding
+  one will be created in the attached table; or, if an equivalent
+  index already exists, will be attached to the target table's index,
+  as if ALTER INDEX ATTACH had been executed.
  
 
  
@@ -844,7 +847,8 @@ ALTER TABLE [ IF EXISTS ] name
  
   This form detaches specified partition of the target table.  The detached
   partition continues to exist as a standalone table, but no longer has any
-  ties to the table from which it was detached.
+  ties to the table from which it was detached.  Any indexes that were
+  attached to the target table's indexes are detached.
  
 

diff --git a/doc/src/sgml/ref/create_index.sgml 
b/doc/src/sgml/ref/create_index.sgml
index 025537575b..b7b0506d6d 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -21,7 +21,7 @@ PostgreSQL documentation
 
  
 
-CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] name ] ON table_name [ USING method ]
+CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] name ] ON [ ONLY ] table_name [ USING method ]
 ( { column_name | ( 
expression ) } [ COLLATE 
collation ] [ opclass ] [ ASC | DESC ] [ NULLS { FIRST | LAST 
} ] [, ...] )
 [ WITH ( storage_parameter = 
value [, ... ] ) ]
 [ TABLESPACE t

Re: AS OF queries

2017-12-20 Thread Tom Lane
Peter Eisentraut  writes:
> On 12/20/17 10:29, Tom Lane wrote:
>> Please say that's just an Oracle-ism and not SQL standard, because it's
>> formally ambiguous.

> The SQL standard syntax appears to be something like

> "tablename" [ AS OF SYSTEM TIME 'something' ] [ [ AS ] "alias" ]

> That's not going to be fun to parse.

Bleah.  In principle we could look two tokens ahead so as to recognize
"AS OF SYSTEM", but base_yylex is already a horrid mess with one-token
lookahead; I don't much want to try to extend it to that.

Possibly the most workable compromise is to use lookahead to convert
"AS OF" to "AS_LA OF", and then we could either just break using OF
as an alias, or add an extra production that allows "AS_LA OF" to
be treated as "AS alias" if it's not followed by the appropriate
stuff.

It's a shame that the SQL committee appears to be so ignorant of
standard parsing technology.

regards, tom lane



Re: AS OF queries

2017-12-20 Thread Magnus Hagander
On Wed, Dec 20, 2017 at 5:17 PM, Peter Eisentraut <
peter.eisentr...@2ndquadrant.com> wrote:

> On 12/20/17 10:29, Tom Lane wrote:
> > Please say that's just an Oracle-ism and not SQL standard, because it's
> > formally ambiguous.  This is required to work by spec:
> >
> > regression=# select x as of from (values(1)) t(x);
> >  of
> > 
> >   1
> > (1 row)
> >
> > so it's not possible for us ever to support an expression that includes
> > top-level "AS OF" (or, pretty much, "AS anything") without some rather
> > enormous pushups.
>
> The SQL standard syntax appears to be something like
>
> "tablename" [ AS OF SYSTEM TIME 'something' ] [ [ AS ] "alias" ]
>
> That's not going to be fun to parse.
>

There was a presentation about this given at FOSDEM PGDay a couple of years
back. Slides at
https://wiki.postgresql.org/images/6/64/Fosdem20150130PostgresqlTemporal.pdf
.


-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: AS OF queries

2017-12-20 Thread Peter Eisentraut
On 12/20/17 10:29, Tom Lane wrote:
> Please say that's just an Oracle-ism and not SQL standard, because it's
> formally ambiguous.  This is required to work by spec:
> 
> regression=# select x as of from (values(1)) t(x);
>  of 
> 
>   1
> (1 row)
> 
> so it's not possible for us ever to support an expression that includes
> top-level "AS OF" (or, pretty much, "AS anything") without some rather
> enormous pushups.

The SQL standard syntax appears to be something like

"tablename" [ AS OF SYSTEM TIME 'something' ] [ [ AS ] "alias" ]

That's not going to be fun to parse.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: AS OF queries

2017-12-20 Thread David Fetter
On Wed, Dec 20, 2017 at 03:03:50PM +0100, Laurenz Albe wrote:
> Konstantin Knizhnik wrote:
> > Please notice that it is necessary to configure postgres in proper
> > way in order to be able to perform time travels.  If you do not
> > disable autovacuum, then old versions will be just cleaned-up.  If
> > transaction commit timestamps are not tracked, then it is not
> > possible to locate required timeline. 
> > 
> > So DBA should make a decision in advance whether this feature is
> > needed or not.  It is not a proper instrument for
> > restoring/auditing existed database which was not configured to
> > keep all versions.
> 
> Of course; you'd have to anticipate the need to travel in time, and
> you have to pay the price for it.  Anybody who has read science
> fiction stories knows that time travel does not come free.

A few extra terabytes' worth of storage space is a pretty small price
to pay, at least on the scale of time travel penalties.

> > May be it is better to add special configuration parameter for
> > this feature which should implicitly toggle autovacuum and
> > track_commit_timestamp parameters).
> 
> The feature would be most useful with some kind of "moving xid
> horizon" that guarantees that only dead tuples whose xmax lies more
> than a certain time interval in the past can be vacuumed.

+1 for this horizon.  It would be very nice, but maybe not strictly
necessary, for this to be adjustable downward without a restart.

It's not clear that adjusting it upward should work at all, but if it
did, the state of dead tuples would have to be known, and they'd have
to be vacuumed in a way that was able to establish a guarantee of
gaplessness at least back to the new horizon.  Maybe there could be
some kind of "high water mark" for it.  Would that impose overhead or
design constraints on vacuum that we don't want?

Also nice but not strictly necessary, making it tunable per relation,
or at least per table.  I'm up in the air as to whether queries with
an AS OF older than the horizon[1] should error out or merely throw
warnings.

Best,
David.

[1] If we allow setting this at granularities coarser than DB
instance, this means going as far back as the relationship with the
newest "last" tuple among the relations involved in the query.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate



Re: [HACKERS] taking stdbool.h into use

2017-12-20 Thread Peter Eisentraut
On 11/15/17 15:13, Peter Eisentraut wrote:
> I'm going to put this patch set as Returned With Feedback for now.  The
> GinNullCategory issues look like they will need quite a bit of work.
> But it will be worth picking this up some time.

I think the issue with GinNullCategory is practically unfixable.  This
is on-disk data that needs to be castable to an array of bool.  So
tolerating a bool of size other than 1 would either require a disk
format change or extensive code changes, neither of which seem
worthwhile at this point.

So here is a minimal patch set to perhaps wrap this up for the time
being.  I have added static assertions that check the sizes of
GinNullCategory and GinTernaryValue, which I think are the two critical
places that require compatibility with bool.  And then we include
<stdbool.h> only if its bool has size 1.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From c9b5eb5024fcfc48bee93d621778dcacba8dd575 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Fri, 15 Dec 2017 17:01:03 -0500
Subject: [PATCH v4 1/3] Add static assertions about size of GinNullCategory vs
 bool

We need these to be the same because some code uses an array of one as
the other.  If bool were a different size (e.g., because stdbool.h is
used, which could make sizeof(bool) == 4 in some configurations), then
this would silently break.  Since GinNullCategory is signed char and
bool is a char of unspecified signedness, there need to be explicit
casts between the two, otherwise the compiler would complain.  But those
casts also silently hide any size mismatches.
---
 src/backend/access/gin/ginscan.c | 6 +-
 src/backend/access/gin/ginutil.c | 7 ++-
 2 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/gin/ginscan.c b/src/backend/access/gin/ginscan.c
index 7ceea7a741..eaf4ae6fdf 100644
--- a/src/backend/access/gin/ginscan.c
+++ b/src/backend/access/gin/ginscan.c
@@ -367,8 +367,12 @@ ginNewScanKey(IndexScanDesc scan)
}
}
}
-   /* now we can use the nullFlags as category codes */
 
+   /*
+* now we can use the nullFlags as category codes
+*/
+   StaticAssertStmt(sizeof(GinNullCategory) == sizeof(bool),
+"sizes of GinNullCategory and 
bool are not equal");
ginFillScanKey(so, skey->sk_attno,
   skey->sk_strategy, searchMode,
   skey->sk_argument, nQueryValues,
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index d9c6483437..28912e416a 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -540,7 +540,12 @@ ginExtractEntries(GinState *ginstate, OffsetNumber attnum,
for (i = 0; i < *nentries; i++)
nullFlags[i] = (nullFlags[i] ? true : false);
}
-   /* now we can use the nullFlags as category codes */
+
+   /*
+* now we can use the nullFlags as category codes
+*/
+   StaticAssertStmt(sizeof(GinNullCategory) == sizeof(bool),
+"sizes of GinNullCategory and bool are 
not equal");
*categories = (GinNullCategory *) nullFlags;
 
/*

base-commit: 7d3583ad9ae54b44119973a9d6d731c9cc74c86e
-- 
2.15.1

From b1b89da6ccae2778c4d2561ffd3ba522c56586a4 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Tue, 19 Dec 2017 13:54:05 -0500
Subject: [PATCH v4 2/3] Add static assertions about size of GinTernaryValue vs
 bool

We need these to be the same because some code casts between pointers to
them.
---
 src/backend/utils/adt/tsginidx.c | 4 +++-
 src/include/access/gin.h | 4 ++--
 2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/src/backend/utils/adt/tsginidx.c b/src/backend/utils/adt/tsginidx.c
index aba456ed88..23b98a3d50 100644
--- a/src/backend/utils/adt/tsginidx.c
+++ b/src/backend/utils/adt/tsginidx.c
@@ -309,7 +309,9 @@ gin_tsquery_consistent(PG_FUNCTION_ARGS)
 * query.
 */
gcv.first_item = GETQUERY(query);
-   gcv.check = check;
+   StaticAssertStmt(sizeof(GinTernaryValue) == sizeof(bool),
+"sizes of GinTernaryValue and 
bool are not equal");
+   gcv.check = (GinTernaryValue *) check;
gcv.map_item_operand = (int *) (extra_data[0]);
gcv.need_recheck = recheck;
 
diff --git a/src/include/access/gin.h b/src/include/access/gin.h
index ec83058095..19fdf132b4 100644
--- a/src/include/access/gin.h
+++ b/src/include/access/gin.h
@@ -51,8 +51,8 @@ typedef struct GinStatsData
 /*
  * A ternary value used by tri-consistent functions.
  *
- * For convenience, this is compatible 

Re: [HACKERS] static assertions in C++

2017-12-20 Thread Tom Lane
Peter Eisentraut  writes:
> On 12/20/17 00:57, Tom Lane wrote:
>> I do not have a well-informed opinion on whether
>> #if defined(__cpp_static_assert) && __cpp_static_assert >= 200410
>> is an appropriate test for static_assert() being available, but I'm
>> pretty suspicious of it because none of my C++ compilers seem to
>> take that path, not even recent stuff like clang 9.0.0.

> For clang, you apparently need to pass -std=c++11 or higher.  With g++
> >=6, that's the default.

Ah.  I did try -std=c++0x on one machine, but forgot to do so with
the newer clang.  That does seem to make it take the static_assert
path on recent macOS.

regards, tom lane



Re: [HACKERS] replace GrantObjectType with ObjectType

2017-12-20 Thread Alvaro Herrera
Peter Eisentraut wrote:
> On 12/19/17 19:56, Michael Paquier wrote:
> > -stringify_adefprivs_objtype(GrantObjectType objtype)
> > +stringify_adefprivs_objtype(ObjectType objtype)
> > [...]
> > +default:
> > +elog(ERROR, "unrecognized grant object type: %d", (int) 
> > objtype);
> > +return "???";/* keep compiler quiet */
> >  }
> > -
> > -elog(ERROR, "unrecognized grant object type: %d", (int) objtype);
> > -return "???";/* keep compiler quiet */
> > Still this portion in 0001 is something that we try to avoid as much
> > as possible, no? I would have thought that all object types should be
> > listed directly so as nothing is missed in the future.
> 
> But we don't really know what such future GRANT commands would actually
> look like.  What would the GRANT syntax corresponding to OBJECT_CAST or
> OBJECT_STATISTIC_EXT be?  I think that's best filled in when we know.

I think Michael's point is that instead of a "default:" clause, this
switch should list all the known values of the enum and throw an
"unsupported object type" error for them.  So whenever somebody adds a
new object type, the compiler will complain that this switch doesn't
handle it and the developer will have to think about this function.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: pltcl valgrind output

2017-12-20 Thread Tom Lane
Andrew Dunstan  writes:
> The following appears to keep valgrind quiet. AFAICT the problem isn't
> in our code.

I'm a little worried about whether this could suppress reports that
we do need to hear about.  I don't know what TclNRRunCallbacks is
or does exactly, but just going by the name, it seems like it could
invoke non-Tcl code (ie, callbacks supplied by us).  It'd be better
to constrain the call stack more tightly, particularly what the top
function is.

I agree that the traces you showed before don't look like our fault.

regards, tom lane



Re: AS OF queries

2017-12-20 Thread Tom Lane
Laurenz Albe  writes:
> Konstantin Knizhnik wrote:
>> I failed to support AS OF clause (as in Oracle) because of shift-reduce 
>> conflicts with aliases,
>> so I have to introduce new ASOF keyword. May be yacc experts can propose 
>> how to solve this conflict without introducing new keyword...

> I think it would be highly desirable to have AS OF, because that's
> the way the SQL standard has it.

Please say that's just an Oracle-ism and not SQL standard, because it's
formally ambiguous.  This is required to work by spec:

regression=# select x as of from (values(1)) t(x);
 of 

  1
(1 row)

so it's not possible for us ever to support an expression that includes
top-level "AS OF" (or, pretty much, "AS anything") without some rather
enormous pushups.

If we absolutely had to do it, the path to a solution would involve some
lexer-level lookahead, cf base_yylex() --- but that's messy and tends to
introduce its own set of corner case misbehaviors.  I'd much rather use a
syntax that wasn't chosen with blind disregard for SQL's existing
syntactic constraints.

regards, tom lane



Re: [HACKERS] replace GrantObjectType with ObjectType

2017-12-20 Thread Peter Eisentraut
On 12/19/17 19:56, Michael Paquier wrote:
> -stringify_adefprivs_objtype(GrantObjectType objtype)
> +stringify_adefprivs_objtype(ObjectType objtype)
> [...]
> +default:
> +elog(ERROR, "unrecognized grant object type: %d", (int) objtype);
> +return "???";/* keep compiler quiet */
>  }
> -
> -elog(ERROR, "unrecognized grant object type: %d", (int) objtype);
> -return "???";/* keep compiler quiet */
> Still this portion in 0001 is something that we try to avoid as much
> as possible, no? I would have thought that all object types should be
> listed directly so as nothing is missed in the future.

But we don't really know what such future GRANT commands would actually
look like.  What would the GRANT syntax corresponding to OBJECT_CAST or
OBJECT_STATISTIC_EXT be?  I think that's best filled in when we know.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] static assertions in C++

2017-12-20 Thread Peter Eisentraut
On 12/20/17 00:57, Tom Lane wrote:
> I do not have a well-informed opinion on whether
> 
> #if defined(__cpp_static_assert) && __cpp_static_assert >= 200410
> 
> is an appropriate test for static_assert() being available, but I'm
> pretty suspicious of it because none of my C++ compilers seem to
> take that path, not even recent stuff like clang 9.0.0.

For clang, you apparently need to pass -std=c++11 or higher.  With g++
>=6, that's the default.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Transaction control in procedures

2017-12-20 Thread Peter Eisentraut
Updated patch attached.

I have addressed the most recent review comments I believe.

The question about what happens to cursor loops in PL/Perl and PL/Python
would be addressed by the separate thread "portal pinning".  The test
cases in this patch are currently marked by FIXMEs.

I have changed the SPI API a bit.  I got rid of SPI_set_nonatomic() and
instead introduced SPI_connect_ext() that you can pass flags to.  The
advantage of that is that in the normal case we can continue to use the
existing memory contexts, so nothing changes for existing uses, which
seems desirable.  (This also appears to address some sporadic test
failures in PL/Perl.)

I have also cleaned up the changes in portalmem.c further, so the
changes are now even smaller.

The commit message in this patch contains more details about some of
these changes.
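
For illustration, a minimal sketch of what this allows at the SQL level,
using the PL/pgSQL COMMIT/ROLLBACK added by the patch (table and procedure
names made up):

create table test1 (a int);

create procedure transaction_test1()
language plpgsql
as $$
begin
  for i in 0..9 loop
    insert into test1 (a) values (i);
    if i % 2 = 0 then
      commit;
    else
      rollback;
    end if;
  end loop;
end
$$;

call transaction_test1();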

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From 984d2974311ce6d8760ec3d59af66363cc53fc6d Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Wed, 20 Dec 2017 09:25:45 -0500
Subject: [PATCH v5] Transaction control in PL procedures

In each of the supplied procedural languages (PL/pgSQL, PL/Perl,
PL/Python, PL/Tcl), add language-specific commit and rollback
functions/commands to control transactions in procedures in that
language.  Add similar underlying functions to SPI.  Some additional
cleanup so that transaction commit or abort doesn't blow away data
structures still used by the procedure call.  Add execution context
tracking to CALL and DO statements so that transaction control commands
can only be issued in top-level procedure and block calls, not function
calls or other procedure or block calls.

- SPI

Add a new function SPI_connect_ext() that is like SPI_connect() but
allows passing option flags.  The only option flag right now is
SPI_OPT_NONATOMIC.  A nonatomic SPI connection can execute transaction
control commands, otherwise it's not allowed.  This is meant to be
passed down from CALL and DO statements which themselves know in which
context they are called.  A nonatomic SPI connection uses different
memory management.  A normal SPI connection allocates its memory in
TopTransactionContext.  For nonatomic connections we use PortalContext
instead.  As the comment in SPI_connect_ext() (previously SPI_connect())
indicates, one could potentially use PortalContext in all cases, but it
seems safest to leave the existing uses alone, because this stuff is
complicated enough already.

SPI also gets new functions SPI_start_transaction(), SPI_commit(), and
SPI_rollback(), which can be used by PLs to implement their transaction
control logic.

- portalmem.c

Some adjustments were made in the code that cleans up portals at
transaction abort.  The portal code could already handle a command
*committing* a transaction and continuing (e.g., VACUUM), but it was not
quite prepared for a command *aborting* a transaction and continuing.

In AtAbort_Portals(), remove the code that marks an active portal as
failed.  As the comment there already predicted, this doesn't work if
the running command wants to keep running after transaction abort.  And
it's actually not necessary, because pquery.c is careful to run all
portal code in a PG_TRY block and explicitly runs MarkPortalFailed() if
there is an exception.  So the code in AtAbort_Portals() is never used
anyway.

In AtAbort_Portals() and AtCleanup_Portals(), we need to be careful not
to clean up active portals too much.  This mirrors similar code in
PreCommit_Portals().

- PL/Perl

Gets new functions spi_commit() and spi_rollback()

- PL/pgSQL

Gets new commands COMMIT and ROLLBACK.

Update the PL/SQL porting example in the documentation to reflect that
transactions are now possible in procedures.

- PL/Python

Gets new functions plpy.commit and plpy.rollback.

- PL/Tcl

Gets new commands commit and rollback.
---
 doc/src/sgml/plperl.sgml   |  49 +
 doc/src/sgml/plpgsql.sgml  |  91 
 doc/src/sgml/plpython.sgml |  39 
 doc/src/sgml/pltcl.sgml|  41 
 doc/src/sgml/ref/call.sgml |   7 +
 doc/src/sgml/ref/create_procedure.sgml |   7 +
 doc/src/sgml/ref/do.sgml   |   7 +
 doc/src/sgml/spi.sgml  | 177 +++
 src/backend/commands/functioncmds.c|  45 +++-
 src/backend/executor/spi.c |  99 -
 src/backend/tcop/utility.c |   6 +-
 src/backend/utils/mmgr/portalmem.c |  49 +++--
 src/include/commands/defrem.h  |   4 +-
 src/include/executor/spi.h |   7 +
 src/include/executor/spi_priv.h|   4 +
 src/include/nodes/nodes.h  |   3 +-
 src/include/nodes/parsenodes.h |   7 

Re: pltcl valgrind output

2017-12-20 Thread Andrew Dunstan


On 12/18/2017 10:12 AM, Andrew Dunstan wrote:
> I've been adding support for valgrind to the buildfarm client (code will
> hit the git repo shortly). Mostly the results have been pretty clean,
> but the pltcl tests generated the attached output. Perhaps someone with
> more valgrind-fu than I have so far would like to use the information to
> extend our supp file appropriately (or fix what it's complaining about).


The following appears to keep valgrind quiet. AFAICT the problem isn't
in our code.

diff --git a/src/tools/valgrind.supp b/src/tools/valgrind.supp
index af03051..24c6f5b 100644
--- a/src/tools/valgrind.supp
+++ b/src/tools/valgrind.supp
@@ -212,3 +212,11 @@
    Memcheck:Cond
    fun:PyObject_Realloc
 }
+
+ issue with TclNRRunCallbacks
+{
+   tcl_callback
+   Memcheck:Cond
+   ...
+   fun:TclNRRunCallbacks
+}


cheers

andrew


-- 

Andrew Dunstanhttps://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Letting plpgsql in on the fun with the new expression eval stuff

2017-12-20 Thread Andres Freund
Hi,

Cool to see you looking at that, I think there's quite some optimization
potential around.  I've to reread a bunch of plpgsql code, it's not
exactly an area of the code I'm intimately familiar with.


On 2017-12-19 13:00:41 -0500, Tom Lane wrote:
> I'm looking at ways to get plpgsql expression evaluation to go faster,
> and one thing I'm noticing is the rather large overhead of going through
> ExecEvalParamExtern and plpgsql_param_fetch to get to the useful work
> (exec_eval_datum).

What's the workload you're testing? I'm mildly surprised to see
ExecEvalParamExtern() show up, rather than just plpgsql_param_fetch() &
exec_eval_datum(). Or were you just listing that to specify the
callpath?


> We've ameliorated that for DTYPE_VAR variables
> by keeping a pre-set-up copy of their values in a ParamListInfo struct,
> but that's pretty ugly and carries a bunch of costs of its own.

Just to make sure I understand correctly, you're talking about
setup_unshared_param_list() / setup_param_list()?


> What I'm wondering about, given the v10 redesign of expression evaluation,
> is whether we couldn't be smarter about this by allowing plpgsql (or other
> external users) to skip the ParamListInfo representation altogether, and
> instead compile Param references into calls to evaluation functions that
> are better tailored to the problem of fetching the desired value.

Yea, that seems to make sense.


> In the existing execution infrastructure, what seems to make sense is to
> have an ExprEvalStep type that has functionality like EEOP_PARAM_EXTERN,
> but includes a function pointer to a plpgsql-supplied function having the
> same signature as ExecEvalParamExtern.  So the execution would look more
> or less like
> 
>   EEO_CASE(EEOP_PARAM_CALLBACK)
>   {
>   op->eval_param(state, op, econtext);
>   EEO_NEXT();
>   }
> 
> and there'd need to be some extra fields (at least a void*) in the op
> struct where plpgsql could keep private data.

I think I'd redo the parameters to the callback slightly, but generally
that sounds sane. Was thinking of something more like

One question I have is how we will re-initialize the relevant state
between exec_simple_expr() calls. I guess the most realistic one is that
the op will have a pointer into an array managed by exec_simple_expr()
that can get reset?


> The JIT stuff you're working on could just compile an equivalent of the
> above, although in the very long term maybe there would be some advantage
> to letting add-on modules compile specialized code for such steps.

Yea, that's reasonable too. I'm not yet 100% sure whether it'll not be
more reasonable to add a few more generic opcodes to the central JITing
that external code can emit and core can handle, or a hook like you
describe is better.


> The immediate problem is how can ExecInitExpr generate such a step?
> It can't itself know what to put into the function ptr or the additional
> fields.  There has to be a way for it to call a plpgsql-supplied
> support routine that can construct the eval step.  (And we have to
> export ExprEvalPushStep, though that doesn't seem like a problem.)

Hm. We could have a version of ExecInitExpr() / ExecInitExprRec that
first gives a callback chance to handle an operation. That'd make
overwriting parts like this quite easy, because we know we want to
handle Param nodes anywhere differently. So we'd not have a Param
specific routine, but the ability to intercept everything, and defer to
ExecInitExprRec() if undesired.


> For compiling full-fledged query trees, what I think we could do is
> add a method (function pointer) to ParamListInfo and have ExecInitExpr
> invoke plan->state->es_param_list_info->compile_param if that's set.
> However, that solution doesn't immediately work for compiling simple
> expressions because we pass a null "parent" pointer when building
> those.
> 
> I thought about instantiating a dummy PlanState and EState to use
> just for carrying this info, but that seems pretty ugly.

Yea, not a fan.


> Another way we could do it is to invent ExecInitExprWithParams() that
> takes an additional ParamListInfo pointer, and use that.  Rather than
> adding yet one more parameter that has to be passed down through
> ExecInitExprRec, I suggest that we could waste a bit of space in
> struct ExprState and store that value there.  Maybe do the same with
> the parent pointer so as to reduce the number of recursive parameters.

I was thinking something slightly different. Namely that we should have
two structs ExprState and ExprBuildState. We can stuff random additions
to ExprBuildState without concerns about increasing ExprState's state.


> I've not written any code around this idea yet, and am not sure
> if it conflicts with what you're trying to do for JIT or further out.
> Comments, better ideas?

I so far see little reason for concern WRT JIT. Implementing support for
a expression step like you descr

Re: vacuum vs heap_update_tuple() and multixactids

2017-12-20 Thread Andres Freund
Hi,

On 2017-12-19 15:01:03 -0500, Robert Haas wrote:
> On Tue, Dec 19, 2017 at 1:31 PM, Andres Freund  wrote:
> > Could I perhaps convince somebody to add that as a feature to
> > isolationtester? I'm willing to work on a bugfix for the bug itself, but
> > I've already spent tremendous amounts of time, energy and pain on
> > multixact bugs, and I'm at the moment feeling a bit unenthusiastic about
> > also working on test infrastructure for it...  If somebody else is
> > willing to work on both infrastructure *and* a bugfix, that's obviously
> > even better ;)
> 
> Hmm.  The problem with trying to make the isolation tester do this is
> that pg_isolation_test_session_is_blocked(X, A) is documented to tell
> us whether X is waiting for one the PIDs listed in A.  It's easy
> enough to tell whether X is blocked waiting for a cleanup lock by
> looking at BackendPidGetProc(pid)->wait_event_info, but that gives us
> no information about which processes are holding the buffer pins that
> prevent us from acquiring that lock.  That's a hard problem to solve
> because that data is not recorded in shared memory and doing so would
> probably be prohibitively expensive.

Indeed. I kinda wonder whether we hackishly solve this by simply
returning true fore all pids if it's waiting for a cleanup lock. That's
not actually that far from the truth... The big problem with that I see
is very short waits that resolve themselves causing issues - but given
that waiting for cleanup locks is really rare, that might not be too
bad.


> 
> 
> What if we add a hook to vacuum that lets you invoke some arbitrary C
> code after each block, and then write a test module that uses that
> hook to acquire and release an advisory lock at that point?  Then you
> could do:

I guess that'd work, albeit a bit invasive and specific to this
test. Trying to come up with a concept that'll be generalizable longer
term would be neat.

Greetings,

Andres Freund



Re: GSoC 2018

2017-12-20 Thread Simone Gotti
On Mon, Dec 18, 2017 at 10:53 AM, Aleksander Alekseev
 wrote:
> Hello hackers,
>
> Thanks you a lot for your feedback. I modified the project description
> to make it more clear that it implies augmenting an existing HA
> solution, particularly Stolon, and doesn't imply solving existing
> limitations of logical replication like lack of DDL replication.
>
> Removing some of these limitations or adding logical replication support
> to other existing HA applications like RepMgr or Patroni or corosync /
> Pacemaker or maybe Postgres-XL sounds like good ideas for other GSoC
> projects. I strongly encourage to add these ideas to wiki anyone who is
> willing to be a mentor.
>
> Unfortunately I don't have any experience of using these applications
> and have no idea how they behave in different corner cases like
> netsplit, slow network, etc. Also I don't have much motivation to figure
> this out because I'm pretty happy with Stolon. For these reasons I
> personally can't be a mentor for corresponding projects.
>
> Also today I'm going to contact Simone Gotti, the main developer of
> Stolon, to let him know about this thread. I wonder what are his
> thoughts on all this.

Hi,

here at SorintLab we'll be very happy to provide help and mentor this project.

I read and agree with your concerns since some missing features in
logical replication cannot provide, at the current stage, a complete
seamless HA experience like with streaming replication.
But I believe that enhancing stolon to use logical replication will be
a really useful scenario where we can test the features of logical
replication for HA and use it also for future enhancements (like major
upgrades with near zero downtime).
At the same time this could also push me and other contributors to
implement the missing logical replication features needed to achieve
seamless HA.

Cheers,

Simone.

>
> --
> Best regards,
> Aleksander Alekseev



Re: AS OF queries

2017-12-20 Thread Laurenz Albe
Konstantin Knizhnik wrote:
> Please notice that it is necessary to configure postgres in proper way in 
> order to be able to perform time travels.
> If you do not disable autovacuum, then old versions will be just cleaned-up.
> If transaction commit timestamps are not tracked, then it is not possible to 
> locate required timeline. 
> 
> So DBA should make a decision in advance whether this feature is needed or 
> not.
> It is not a proper instrument for restoring/auditing existed database which 
> was not configured to keep all versions.

Of course; you'd have to anticipate the need to travel in time,
and you have to pay the price for it.
Anybody who has read science fiction stories knows that time travel
does not come free.

> May be it is better to add special configuration parameter for this feature 
> which should implicitly toggle 
> autovacuum and track_commit_timestamp parameters).

The feature would be most useful with some kind of "moving xid
horizon" that guarantees that only dead tuples whose xmax lies
more than a certain time interval in the past can be vacuumed.

Yours,
Laurenz Albe



Re: AS OF queries

2017-12-20 Thread Joe Wildish
On 20 Dec 2017, at 13:48, Konstantin Knizhnik  wrote:
> 
> On 20.12.2017 16:12, Laurenz Albe wrote:
>> Konstantin Knizhnik wrote:
>>> I wonder if Postgres community is interested in supporting time travel 
>>> queries in PostgreSQL (something like AS OF queries in Oracle: 
>>> https://docs.oracle.com/cd/B14117_01/appdev.101/b10795/adfns_fl.htm 
>>> ).
>>> As far as I know something similar is now developed for MariaDB.
>> I think that would be a good thing to have that could make
>> the DBA's work easier - all the requests to restore a table
>> to the state from an hour ago.
> 
> Please notice that it is necessary to configure postgres in proper way in 
> order to be able to perform time travels.
> If you do not disable autovacuum, then old versions will be just cleaned-up.
> If transaction commit timestamps are not tracked, then it is not possible to 
> locate required timeline. 
> 
> So DBA should make a decision in advance whether this feature is needed or 
> not.
> It is not a proper instrument for restoring/auditing existed database which 
> was not configured to keep all versions.
> 
> May be it is better to add special configuration parameter for this feature 
> which should implicitly toggle 
> autovacuum and track_commit_timestamp parameters).
> 


I seem to recall that Oracle handles this by requiring tables that want the 
capability to live within a tablespace that supports flashback. That tablespace 
is obviously configured to retain redo/undo logs. It would be nice if the 
vacuuming process could be configured in a similar manner. I have no idea if it 
would make sense on a tablespace basis or not, though — I’m not entirely sure 
how analogous they are between Postgres & Oracle as I’ve never used tablespaces 
in Postgres.

-Joe

Re: AS OF queries

2017-12-20 Thread Konstantin Knizhnik



On 20.12.2017 16:12, Laurenz Albe wrote:

> Konstantin Knizhnik wrote:
>
>> I wonder if Postgres community is interested in supporting time travel
>> queries in PostgreSQL (something like AS OF queries in Oracle:
>> https://docs.oracle.com/cd/B14117_01/appdev.101/b10795/adfns_fl.htm).
>> As far as I know something similar is now developed for MariaDB.
>
> I think that would be a good thing to have that could make
> the DBA's work easier - all the requests to restore a table
> to the state from an hour ago.


Please notice that it is necessary to configure postgres in a proper way
in order to be able to perform time travel.

If you do not disable autovacuum, then old versions will be just cleaned-up.
If transaction commit timestamps are not tracked, then it is not 
possible to locate the required timeline.


So the DBA should make a decision in advance whether this feature is needed
or not.
It is not a proper instrument for restoring/auditing an existing database
which was not configured to keep all versions.


Maybe it is better to add a special configuration parameter for this
feature which should implicitly toggle the
autovacuum and track_commit_timestamp parameters.

The obvious drawbacks of keeping all versions are
1. Increased database size.
2. Decreased query execution speed because queries need to traverse a lot
of invisible versions.


So maybe in practice it will be useful to limit the lifetime of versions.

>> I failed to support AS OF clause (as in Oracle) because of shift-reduce
>> conflicts with aliases,
>> so I have to introduce new ASOF keyword. May be yacc experts can propose
>> how to solve this conflict without introducing new keyword...
>
> I think it would be highly desirable to have AS OF, because that's
> the way the SQL standard has it.
Completely agree with you: I just gave up after a few hours of attempts
to make bison resolve these conflicts.




Yours,
Laurenz Albe


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: AS OF queries

2017-12-20 Thread Laurenz Albe
Konstantin Knizhnik wrote:
> I wonder if Postgres community is interested in supporting time travel 
> queries in PostgreSQL (something like AS OF queries in Oracle: 
> https://docs.oracle.com/cd/B14117_01/appdev.101/b10795/adfns_fl.htm).
> As far as I know something similar is now developed for MariaDB.

I think that would be a good thing to have that could make
the DBA's work easier - all the requests to restore a table
to the state from an hour ago.

> I failed to support an AS OF clause (as in Oracle) because of shift-reduce 
> conflicts with aliases,
> so I had to introduce a new ASOF keyword. Maybe yacc experts can propose 
> how to solve this conflict without introducing a new keyword...

I think it would be highly desirable to have AS OF, because that's
the way the SQL standard has it.

Yours,
Laurenz Albe



Re: Add hint about replication slots when nearing wraparound

2017-12-20 Thread Feike Steenbergen
On 20 December 2017 at 06:22, Michael Paquier 
wrote:
> prepare_transaction.sgml has a "Caution" block mentioning that it is
> unwise to keep 2PC transactions unfinished for too long a time as it
> interferes with VACUUM. In doc/src/sgml/logicaldecoding.sgml, it would
> be nice to add a similar caution notice about replication slots so
> that users can be warned before a wraparound shows up.

Added.

As far as I know, the issue only occurs for stale replication slots for
logical decoding,
but not for physical replication - is that correct?
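
For reference, slots and the xmin / catalog_xmin horizons they pin (catalog_xmin 
being the relevant one for logical decoding) can be checked with something like 
this sketch:

SELECT slot_name, slot_type, active, xmin, catalog_xmin
  FROM pg_replication_slots;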


0002-Add-hint-about-replication-slots-for-wraparound.patch
Description: Binary data


AS OF queries

2017-12-20 Thread Konstantin Knizhnik
I wonder if the Postgres community is interested in supporting time travel 
queries in PostgreSQL (something like AS OF queries in Oracle: 
https://docs.oracle.com/cd/B14117_01/appdev.101/b10795/adfns_fl.htm).

As far as I know, something similar is now being developed for MariaDB.

It seems to me that it would not be so difficult to implement them in 
Postgres - we already have tuple versions.

Looks like we only need to do three things:
1. Disable autovacuum (autovacuum = off)
2. Enable commit timestamp (track_commit_timestamp = on)
3. Add asofTimestamp to the snapshot and patch XidInMVCCSnapshot to compare 
commit timestamps when it is specified in the snapshot.
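
To try the prototype, the first two items boil down to something like this 
(just a sketch; note that track_commit_timestamp only takes effect after a 
server restart):

ALTER SYSTEM SET track_commit_timestamp = on;
ALTER SYSTEM SET autovacuum = off;
-- restart the server, then check:
SHOW track_commit_timestamp;
SHOW autovacuum;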



Attached please find my prototype implementation of it.
Most of the effort is needed to support the asof timestamp in the grammar and 
add it to the query plan.
I failed to support an AS OF clause (as in Oracle) because of shift-reduce 
conflicts with aliases,
so I had to introduce a new ASOF keyword. Maybe yacc experts can propose 
how to solve this conflict without introducing a new keyword...


Please notice that for now the ASOF timestamp is used only for the data snapshot, 
not for the catalog snapshot.
I am not sure that it is possible (or useful) to travel through 
the database schema history...


Below is an example of how it works:

postgres=# create table foo(pk serial primary key, ts timestamp default 
now(), val text);

CREATE TABLE
postgres=# insert into foo (val) values ('insert');
INSERT 0 1
postgres=# insert into foo (val) values ('insert');
INSERT 0 1
postgres=# insert into foo (val) values ('insert');
INSERT 0 1
postgres=# select * from foo;
 pk |             ts             |  val
----+----------------------------+--------
  1 | 2017-12-20 14:59:17.715453 | insert
  2 | 2017-12-20 14:59:22.933753 | insert
  3 | 2017-12-20 14:59:27.87712  | insert
(3 rows)

postgres=# select * from foo asof timestamp '2017-12-20 14:59:25';
 pk |             ts             |  val
----+----------------------------+--------
  1 | 2017-12-20 14:59:17.715453 | insert
  2 | 2017-12-20 14:59:22.933753 | insert
(2 rows)

postgres=# select * from foo asof timestamp '2017-12-20 14:59:20';
 pk |             ts             |  val
----+----------------------------+--------
  1 | 2017-12-20 14:59:17.715453 | insert
(1 row)

postgres=# update foo set val='upd',ts=now() where pk=1;
UPDATE 1
postgres=# select * from foo asof timestamp '2017-12-20 14:59:20';
 pk |             ts             |  val
----+----------------------------+--------
  1 | 2017-12-20 14:59:17.715453 | insert
(1 row)

postgres=# select * from foo;
 pk |             ts             |  val
----+----------------------------+--------
  2 | 2017-12-20 14:59:22.933753 | insert
  3 | 2017-12-20 14:59:27.87712  | insert
  1 | 2017-12-20 15:09:17.046047 | upd
(3 rows)

postgres=# update foo set val='upd2',ts=now() where pk=1;
UPDATE 1
postgres=# select * from foo asof timestamp '2017-12-20 15:10';
 pk |             ts             |  val
----+----------------------------+--------
  2 | 2017-12-20 14:59:22.933753 | insert
  3 | 2017-12-20 14:59:27.87712  | insert
  1 | 2017-12-20 15:09:17.046047 | upd
(3 rows)


Comments and feedback are welcome:)

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 3de8333..2126847 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -2353,6 +2353,7 @@ JumbleQuery(pgssJumbleState *jstate, Query *query)
 	JumbleExpr(jstate, (Node *) query->sortClause);
 	JumbleExpr(jstate, query->limitOffset);
 	JumbleExpr(jstate, query->limitCount);
+	JumbleExpr(jstate, query->asofTimestamp);
 	/* we ignore rowMarks */
 	JumbleExpr(jstate, query->setOperations);
 }
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index cc09895..d2e0799 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -29,6 +29,6 @@ OBJS = execAmi.o execCurrent.o execExpr.o execExprInterp.o \
nodeCtescan.o nodeNamedtuplestorescan.o nodeWorktablescan.o \
nodeGroup.o nodeSubplan.o nodeSubqueryscan.o nodeTidscan.o \
nodeForeignscan.o nodeWindowAgg.o tstoreReceiver.o tqueue.o spi.o \
-   nodeTableFuncscan.o
+   nodeTableFuncscan.o nodeAsof.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index f1636a5..38c79b8 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -285,6 +285,10 @@ ExecReScan(PlanState *node)
 			ExecReScanLimit((LimitState *) node);
 			break;
 
+		case T_AsofState:
+			ExecReScanAsof((AsofState *) node);
+			break;
+
 		default:
 			elog(ERROR, "unrecognized node type: %d", (int) nodeTag(node));
 			break;
diff --git a/src/backend/executor/execCurrent.c b/src/backend/executor/execCurrent.c
index a3e962e..1912ae4 100644
--- a/src/

Re: Basebackups reported as idle

2017-12-20 Thread Magnus Hagander
On Wed, Dec 20, 2017 at 12:50 PM, Michael Paquier  wrote:

> On Wed, Dec 20, 2017 at 6:07 PM, Magnus Hagander 
> wrote:
> > What about the attached?
>
> The new positions look good to me, still aren't you missing the case
> where a SQL command is found and exec_replication_command returns
> false? This should be switched to idle as well.
>

Yes. Of course. I can't read. That's the same as the notice below about it
not returning false -- I managed to miss the extra if() there, and thought
it always exited with ERROR.

:O



> +   /* Report to pgstat that this process is running */
> +   pgstat_report_activity(STATE_RUNNING, NULL);
> Bonus points if cmd_string is used instead of string? This way, you
> can know what replication command is running ;)
>

Do we want that though? That would be a compat break at least, wouldn't it?


> > It's still quite a bit weird that we call this process "walsender" when it
> > does other things as well. But the ship sailed on that many years ago,
> > changing that completely now would not be worth the breakage.
>
> ps shows walsender as well, that's a reason why "walsender" has been
> decided.
>

Right. It's just a weird term for what it is. But it's the term that we've
always used, so we can't go change it now.
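
Coming back to the patch itself: once it's in, the change should be easy to 
verify from another session while a base backup is running, with something 
like this sketch (walsenders show up in pg_stat_activity as of v10):

SELECT pid, state, backend_type
  FROM pg_stat_activity
 WHERE backend_type = 'walsender';

The base backup should then report state = 'active' instead of 'idle'.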

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-12-20 Thread Ashutosh Bapat
On Tue, Dec 5, 2017 at 1:24 PM, Rajkumar Raghuwanshi
 wrote:
> On Tue, Dec 5, 2017 at 11:04 AM, Rajkumar Raghuwanshi
>  wrote:
>> On Mon, Dec 4, 2017 at 7:34 AM, Ashutosh Bapat
>>  wrote:
>>> I agree, the patch looks longer than expected. I think it's important
>>> to have some testcases to test partition-wise join with default
>>> partitions. I think we need at least one test for range default
>>> partitions, one test for list partitioning, one for multi-level
>>> partitioning and one negative testcase with the default partition missing
>>> from one side of the join.
>>>
>>> Maybe we could reduce the number of SQL commands and queries in the
>>> patch by adding a default partition to every table that participates in
>>> partition-wise join (leave the tables participating in negative tests
>>> aside). But that's going to increase the size of EXPLAIN outputs and
>>> query results. The negative test may simply drop the default partition
>>> from one of the tables.
>>>
>>> For every table being tested, the patch adds two ALTER TABLE commands,
>>> one for detaching an existing partition and then attaching the same as a
>>> default partition. An alternative to that is to just add a new default
>>> partition without detaching an existing partition. But then the
>>> default partition needs to be populated with some data, which requires
>>> at least 1 INSERT statement. That doesn't reduce the size of the patch, but
>>> increases the output of the query and EXPLAIN plan.
>>>
>>> Maybe in the case of the multi-level partitioning test, we don't need to add
>>> DEFAULT in every partitioned relation; adding it to one of them would be
>>> enough. Maybe add it to the parent, but that too can be avoided. That
>>> would reduce the size of the patch a bit.
>>
>> Thanks Ashutosh for suggestions.
>>
>> I have reduced test cases as suggested. Attaching updated patch.
>>
> Sorry Attached wrong patch.
>
> attaching correct patch now.

Thanks. Here are some comments

+-- test default partition behavior for range
+ALTER TABLE prt1 DETACH PARTITION prt1_p3;
+ALTER TABLE prt1 ATTACH PARTITION prt1_p3 DEFAULT;
+ALTER TABLE prt2 DETACH PARTITION prt2_p3;
+ALTER TABLE prt2 ATTACH PARTITION prt2_p3 DEFAULT;

I think we need an ANALYZE here in case the statistics get updated while
DETACH and ATTACH are going on. Other testcases also need to be updated with
ANALYZE, including the negative one; something like the sketch below.
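
Just a sketch of the intended test change:

ALTER TABLE prt1 DETACH PARTITION prt1_p3;
ALTER TABLE prt1 ATTACH PARTITION prt1_p3 DEFAULT;
ANALYZE prt1;
ALTER TABLE prt2 DETACH PARTITION prt2_p3;
ALTER TABLE prt2 ATTACH PARTITION prt2_p3 DEFAULT;
ANALYZE prt2;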

+-- partition-wise join can not be applied if the only one of joining table have

Correction: ... if only one of the joining tables has ...

Please add the patch to the next commitfest so that it's not
forgotten. I think we can get rid of the multi-level partition-wise
testcase as well. Also, since we are re-attaching existing partition
tables as default partitions, we don't need to check the output as
well; just the plan should be enough.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


