Re: pie-in-sky idea: 'sensitive' function parameters

2018-02-02 Thread Craig Ringer
On 3 February 2018 at 11:46, Tom Lane  wrote:

> Chapman Flack  writes:
> > ... which led me to the idea of a function parameter
> > declaration, putting the function definer in control of what
> > bits should get redacted.
>
> +1 for thinking outside the box, but ...
>
> > Would anyone else see some value in this capability? Could it
> > (or some suitable restriction of it) seem implementable, or would
> > the complications be overwhelming?
>
> ... the problem with this idea is that knowledge that the item ought to be
> hidden would be obtained only very late in the parsing process.  So for
> example if you fat-fingered something just to the left of the function
> call in the query text, or the name of the function itself, your password
> would still get exposed in the log.
>
> This indeed is the core problem with every proposal I've seen for
> semantics-based log filtering.  Error logging needs to be considered
> as a very low-level operation, because reports may come out when
> little if anything is known about the real semantics of the query.
>
>
About the only case where I think this is really viable to pursue is if it's
restricted to bind parameters. We only log those later and more selectively
as it is, so it seems much more reasonable to say "I never want  to appear in the logs".

That said, I'm not sure it can be done at the function-interface level, or
if it'd have to be done in the Bind message to make it reliable and
available early enough.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] proposal: schema variables

2018-02-02 Thread Pavel Stehule
Hi

2018-02-03 1:48 GMT+01:00 David G. Johnston :

> I've done a non-compilation documentation review, the diff from the poc
> patch and the diff from master are attached.
>
> Comments are inter-twined in the patch in xml comment format; though I
> reiterate (some of?) them below.
>
> On Fri, Feb 2, 2018 at 3:06 PM, Pavel Stehule 
> wrote:
>
>> Hi
>>
>> I wrote a proof of concept of schema variables. The patch is not nice, but the
>> functionality is almost complete (for scalars only) and can be good enough
>> for playing with this concept.
>>
>> I recap the goals (the order is random):
>>
>> 1. feature like PL/SQL package variables (with similar content life cycle)
>> 2. available from any PL used by PostgreSQL, data can be shared between
>> different PL
>> 3. possibility to store short life data in fast secured storage
>>
>
> The generic use of the word secure here bothers me.  I'm taking it to be
> "protected by grant/revoke"-based privileges; plus session-locality.
>

I have no problem with any other formulation.


>
> 4. possibility to pass parameters and results to/from anonymous blocks
>> 5. session variables with possibility to process static code check
>>
>
> What does "process static code check" mean here?
>

It means the possibility of checking the validity of code without executing it.
You can use plpgsql_check, for example.
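For illustration, here is a minimal sketch of the kind of code such a static check could validate. The CREATE VARIABLE / LET syntax is an assumption loosely based on this proof-of-concept proposal, not settled syntax:

```sql
-- Hypothetical syntax, following the proof-of-concept patch
CREATE VARIABLE public.max_attempts AS integer;

LET public.max_attempts = 3;

-- A tool like plpgsql_check could verify, without executing this
-- function, that the referenced variable exists and has a compatible
-- type -- something an untyped GUC-based approach cannot offer.
CREATE FUNCTION attempts_ok(n integer) RETURNS boolean AS $$
BEGIN
    RETURN n <= public.max_attempts;
END;
$$ LANGUAGE plpgsql;
```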


>
>
>> 6. multiple API available from different environments - SQL commands, SQL
>> functions, internal functions
>>
>
> I made the public aspect of this explicit in the CREATE VARIABLE doc
> (though as noted below it probably belongs in section II)
> ​
>
>> 7. data are stored in binary form
>>
>
> Thoughts during my review:
>
> There is, for me, a cognitive dissonance between "schema variable" and
> "variable value" - I'm partial to the latter.  Since we use "setting" for
> GUCs the term variable here hopefully wouldn't cause ambiguity...
>

The "schema" is important in this case: 1) it is an analogy to "package
variables", and 2) not necessarily, but probably often, it will be used
together with PL/pgSQL, which has variables too. "Session variables" doesn't
specify the implementation well: session variables can be GUCs, psql client
variables, some custom implementation in Postgres, or package variables in
Oracle.


> I've noticed that we don't seem to have or enforce any policy on how to
> communicate "SQL standards compatibility" to the user...
>
> We are missing the ability to alter ownership (or at least it's
> undocumented), and if that brings into existence ALTER VARIABLE we should
> probably add ALTER TYPE TO new_type USING (cast) for completeness.
>

Good note. I didn't test it, and I am not sure which variants of ALTER
should be supported. The type of a variable is an interface. We could
probably allow adding a new field, but changing a type or removing a field
can break other objects, so it could be prohibited, like we don't support
ALTER on views. ALTERing is another, pretty complex topic, and I don't
think it is necessary to solve it now. This feature can be valuable without
ALTER support, and nothing blocks a later ALTER VARIABLE implementation.

This design allows a lot of interesting features (that can be implemented
step by step):

1. support for default expressions
2. support for constraints and maybe triggers
3. reset at transaction end
4. initialization at session start - via a default expression or trigger,
this could be a way to run code at session start.


>
> It's left for the reader to presume that because these are schema
> "relations" that namespace resolution via search_path works the same as any
> other relation.
>
> I think I've answered my own question regarding DISCARD in that
> "variables" discards values while if TEMP is in effect all temp variables
> are dropped.
>

DISCARD should remove TEMP variables and should reset the content of all
variables.


>
> Examples abound though it doesn't feel like too much: but saying "The
> usage is very simple:" before giving the example in the function section
> seems to be outside of our general style.  A better preamble than "An
> example:" would be nice but the example is so simple I could not think of
> anything worth writing.
>

This doc is just a design frame. I welcome any enhancement, because this
feature can be difficult for some people: it mixes a persistent object with
temporary/session content, and the term "variable" is used with different
semantics in relational algebra. It is natural for people with stored
procedure experience - mainly with Oracle - but for anyone else it can be a
little bit difficult. I believe there should be more practical examples -
related to RLS, for example.
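As one possibility, a hedged sketch of an RLS-flavored example (the variable syntax, schema name, and table are hypothetical, based on this proposal):

```sql
-- Hypothetical sketch: a session variable feeding a row security policy.
-- A trusted login handler sets the variable once per session; the policy
-- then reads it on every row check without re-parsing a GUC string.
CREATE SCHEMA app;
CREATE VARIABLE app.current_tenant AS integer;

ALTER TABLE documents ENABLE ROW LEVEL SECURITY;
CREATE POLICY tenant_isolation ON documents
    USING (tenant_id = app.current_tenant);

-- At session start:
LET app.current_tenant = 42;
```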



>
> It's worth considering how both:
>
> https://www.postgresql.org/docs/10/static/ddl.html
> and
> https://www.postgresql.org/docs/10/static/queries.html
>
> could be updated to incorporate the broad picture of schema variables,
> with examples, and leave the reference (SQL and functions) sections mainly
> 

Re: pie-in-sky idea: 'sensitive' function parameters

2018-02-02 Thread Tom Lane
Chapman Flack  writes:
> ... which led me to the idea of a function parameter
> declaration, putting the function definer in control of what
> bits should get redacted.

+1 for thinking outside the box, but ...

> Would anyone else see some value in this capability? Could it
> (or some suitable restriction of it) seem implementable, or would
> the complications be overwhelming?

... the problem with this idea is that knowledge that the item ought to be
hidden would be obtained only very late in the parsing process.  So for
example if you fat-fingered something just to the left of the function
call in the query text, or the name of the function itself, your password
would still get exposed in the log.

This indeed is the core problem with every proposal I've seen for
semantics-based log filtering.  Error logging needs to be considered
as a very low-level operation, because reports may come out when
little if anything is known about the real semantics of the query.

regards, tom lane



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-02-02 Thread Peter Geoghegan
On Fri, Feb 2, 2018 at 4:31 PM, Peter Geoghegan  wrote:
> On Fri, Feb 2, 2018 at 1:58 PM, Andres Freund  wrote:
>> Not saying you're wrong, but you should include a comment on why this is
>> a benign warning. Presumably it's some padding memory somewhere, but
>> it's not obvious from the above bleat.
>
> Sure. This looks slightly more complicated than first anticipated, but
> I'll keep everyone posted.

I couldn't make up my mind whether it was best to prevent the uninitialized
write() or to instead just add a suppression. I eventually decided
upon the suppression -- see attached patch. My proposed commit message
has a full explanation of the Valgrind issue, which I won't repeat
here. Go read it before reading the rest of this e-mail.

It might seem like my suppression is overly broad, or not broad
enough, since it essentially targets LogicalTapeFreeze(). I don't
think it is, though, because this can occur in two places within
LogicalTapeFreeze() -- it can occur in the place we actually saw the
issue on lousyjack, from the ltsReadBlock() call within
LogicalTapeFreeze(), as well as a second place -- when
BufFileExportShared() is called. I found that you have to tweak code
to prevent it happening in the first place before you'll see it happen
in the second place. I see no point in actually playing whack-a-mole
for a totally benign issue like this, though, which made me finally
decide upon the suppression approach.

Bear in mind that a third way of fixing this would be to allocate
logtape.c buffers using palloc0() rather than palloc() (though I don't
like that idea at all). For serial external sorts, the logtape.c
buffers are guaranteed to have been written to/initialized at least
once as part of spilling a sort to disk. Parallel external sorts don't
quite guarantee that, which is why we run into this Valgrind issue.

-- 
Peter Geoghegan
From e51b9d411a859072e6a3f8b2f77e32e0f47d409a Mon Sep 17 00:00:00 2001
From: Peter Geoghegan 
Date: Fri, 2 Feb 2018 17:47:13 -0800
Subject: [PATCH] Add logtape.c Valgrind suppression.

LogicalTapeFreeze() may write out its first block when it is dirty but
not full, and then immediately read the first block back in from its
BufFile as a BLCKSZ-width block.  This can only occur in rare cases
where next to no tuples were written out, which is only possible with
parallel external tuplesorts.  While this is pointless, it's also
harmless.

However, this issue causes Valgrind to complain about a write() of
uninitialized bytes, so a suppression is needed.  This is because
BufFile block reading will flush out its whole dirty stdio-style temp
buffer, writing bytes that include values originating from
poisoned/uninitialized logtape.c buffer bytes.

Author: Peter Geoghegan
---
 src/tools/valgrind.supp | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/src/tools/valgrind.supp b/src/tools/valgrind.supp
index af03051..22039c0 100644
--- a/src/tools/valgrind.supp
+++ b/src/tools/valgrind.supp
@@ -112,6 +112,16 @@
 	fun:BootStrapXLOG
 }
 
+# LogicalTapeFreeze() needs this for calls from parallel worker with few tuples
+{
+	padding_buffile_dump_buffer
+	Memcheck:Param
+	write(buf)
+
+	...
+	fun:LogicalTapeFreeze
+}
+
 {
 	bootstrap_write_relmap_overlap
 	Memcheck:Overlap
-- 
2.7.4



Re: JIT compiling with LLVM v9.1

2018-02-02 Thread Jeff Davis
On Mon, Jan 29, 2018 at 1:53 AM, Andres Freund  wrote:
>>   https://git.postgresql.org/git/users/andresfreund/postgres.git

There's a patch in there to change the scan order. I suggest that you
rename the GUC "synchronize_seqscans" to something more generic like
"optimize_scan_order", and use it to control your feature as well
(after all, it's the same trade-off: weird scan order vs.
performance). Then, go ahead and commit it. FWIW I see about a 7%
boost on my laptop[1] from that patch on master, without JIT or
anything else.

I also see you dropped "7ae518bf Centralize slot deforming logic a
bit.". Was that intentional? Do we want it? I think I saw about a 2%
gain here over master, but when I applied it on top of the fast scans
it did not seem to add anything on top of fast scans. Seems
reproducible, but I don't have an explanation.

And you are probably already working on this, but it would be helpful
to get the following two patches in also:
* 3c22065f Do execGrouping via expression eval
* a9dde4aa Allow tupleslots to have a fixed tupledesc

I took a brief look at those two, but will review them in more detail.

Regards,
 Jeff Davis

[1] Simple scan with simple predicate on 50M tuples, after pg_prewarm.



Re: [HACKERS] Re: Improve OR conditions on joined columns (common star schema problem)

2018-02-02 Thread Tom Lane
Tomas Vondra  writes:
> BTW wouldn't it be possible to derive "traditional" proof in relational
> algebra, similarly to other transforms?

Perhaps.  The patch depends on CTID, but you could probably model that
as a primary key in a proof.

regards, tom lane



Re: [HACKERS] Re: Improve OR conditions on joined columns (common star schema problem)

2018-02-02 Thread Tomas Vondra


On 02/02/2018 03:26 PM, Tom Lane wrote:
> Tomas Vondra  writes:
>> ISTM this patch got somewhat stuck as we're not quite sure the
>> transformation is correct in all cases. Is my impression correct?
> 
> Yeah, that's the core issue.
> 
>> If yes, how do we convince ourselves? Would some sort of automated testing
>> (generating data and queries) help? I'm willing to spend some cycles on
>> that, if considered helpful.
> 
> I'm not sure if that would be enough to convince doubters.  On the other
> hand, if it found problems, that would definitely be useful.
> 

OK, I'll take a stab at this, then. I wasn't really suggesting it might
definitively prove the transformation is correct - clearly, it can
only find counter-examples. But I think it's worth it anyway.

BTW wouldn't it be possible to derive "traditional" proof in relational
algebra, similarly to other transforms?

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



pie-in-sky idea: 'sensitive' function parameters

2018-02-02 Thread Chapman Flack
CREATE FUNCTION commence_primary_ignition(target text, password text)
RETURNS void AS ...;

SELECT commence_primary_ignition(target=>'Alderaan', password=>'1234');

/* 1234 could appear in logs, activity stats ... disturbing */



CREATE OR REPLACE FUNCTION commence_primary_ignition(
target text,
password text SENSITIVE) RETURNS void AS ...;

SELECT commence_primary_ignition(target=>'Alderaan', password=>'1234');

/* logs, stats show something like ... password=>[redacted]); */


I had this idea while thinking about how to send a query with
some sensitive data in a way that would protect the data from
log exposure, and realizing I couldn't think of any way to do so. Docs for
log_statement even promise that extended query protocol Bind
parameters do get included. (But docs also suggested that some
minimal parsing occurs before log_statement happens, so there
could be an opportunity to detect that a sensitive parameter
is being mentioned and the extents of the assigned expression.)

I also appreciate that one doesn't want to empower a user to
conceal queries at will (thus track_activities and log_statement
are SUSET), which led me to the idea of a function parameter
declaration, putting the function definer in control of what
bits should get redacted.

In full generality, I'm sure this would present lots of
implementation challenges. But various restrictions might
possibly simplify it:

- recognize only in calls using named notation for the
  sensitive parameter(s)? (nothing to check for function
  calls using only positional notation)

- introduce a special named-notation variant?
  target=>'Alderaan', password=!>'1234'
  (no function lookup needed during parse, can identify what
  to redact immediately, and defer a simple check that 'password'
  really is declared sensitive to some later stage when function is
  looked up)

- recognize only in extended-protocol queries where the parameter
  is supplied by a Bind? (know what to redact without having to
  parse arbitrarily hairy expressions)


Would anyone else see some value in this capability? Could it
(or some suitable restriction of it) seem implementable, or would
the complications be overwhelming?

-Chap



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-02-02 Thread Peter Geoghegan
On Fri, Feb 2, 2018 at 1:58 PM, Andres Freund  wrote:
> Not saying you're wrong, but you should include a comment on why this is
> a benign warning. Presumably it's some padding memory somewhere, but
> it's not obvious from the above bleat.

Sure. This looks slightly more complicated than first anticipated, but
I'll keep everyone posted.

Valgrind suppression aside, this raises another question. The stack
trace shows that the error happens during the creation of a new TOAST
table (CheckAndCreateToastTable()). I wonder if I should also pass
down a flag that makes sure that parallelism is never even attempted
from that path, to match TRUNCATE's suppression of parallel index
builds during its reindexing. It really shouldn't be a problem as
things stand, but maybe it's better to be consistent about "useless"
parallel CREATE INDEX attempts, and suppress them here too.

-- 
Peter Geoghegan



Re: RelOptInfo -> Relation

2018-02-02 Thread Tom Lane
Robert Haas  writes:
> Is there some good reason why get_relation_info() copies all of the
> data that will be needed later in planning into the RelOptInfo instead
> of just storing a pointer to the Relation directly into the
> RelOptInfo?  Of course, if we did that, then it would have to leave
> the relation open instead of closing it, and similarly for the
> indexes, but so what?  As far as I understand it, we're keeping a lock
> on the relation, so none of this stuff should be able to change under
> it, and pointing to data is faster than copying it.

We are not keeping a lock strong enough to prevent the relcache entry
from changing altogether, e.g. an ANALYZE could commit midway through
and change, say, the relpages/reltuples info.  The consequences of
that for consistent planner behavior are not clear to me, but they're
probably not something to dismiss out of hand.

Also, there's a big difference between the entry's contents being
logically unchanging and being physically unchanging: we could get a
relcache inval event that forces a rebuild, and only very limited parts of
the entry are guaranteed not to move around if that happens.  There's a
general rule that code outside the relcache shouldn't keep pointers to
most parts of a relcache entry's infrastructure, and for those parts where
we do provide a guarantee (like the tupdesc), it doesn't come for free.
Maybe the planner's uses of the info wouldn't be any more dangerous than
anything else's, but again it's not clear.

I'm disinclined to monkey with the way this works without someone
presenting hard evidence that it creates enough of a performance problem
to be worth spending a significant amount of time changing it.  Up to
now I don't think I've ever noticed plancat.c being a large fraction
of planner runtime, though of course that might depend on the test case.

regards, tom lane



Re: Query running for very long time (server hanged) with parallel append

2018-02-02 Thread David Kohn
Please forgive my inexperience with the codebase, but as the guy who
reported this bugger:
https://www.postgresql.org/message-id/flat/151724453314.1238.409882538067070269%40wrigleys.postgresql.org#151724453314.1238.409882538067070...@wrigleys.postgresql.org,
I thought I'd follow your hints, as it's causing some major issues for me.
So some notes on what is happening for me and some (possibly silly)
thoughts on why:

On Fri, Feb 2, 2018 at 10:16 AM Robert Haas  wrote:

> Is it getting stuck here?
>
> /*
>  * We can't finish transaction commit or abort until all of the workers
>  * have exited.  This means, in particular, that we can't respond to
>  * interrupts at this stage.
>  */
> HOLD_INTERRUPTS();
> WaitForParallelWorkersToExit(pcxt);
> RESUME_INTERRUPTS();
>
I am seeing unkillable queries with the client backend in the
IPC-BgWorkerShutdown wait event. As far as I can tell, that can only happen
inside bgworker.c at WaitForBackgroundWorkerShutdown, which is called by
parallel.c at WaitForParallelWorkersToExit inside DestroyParallelContext.
That, in turn, seems like it should be called when there is a statement
timeout (which I think is happening in at least some of my cases), so it
would make sense that this is where the problem is.

My background workers are in the IPC-MessageQueuePutMessage wait event,
which appears to be possible only from pqmq.c at mq_putmessage. Directly
following the WaitLatch there is a CHECK_FOR_INTERRUPTS(); so if a worker
is waiting on that latch and never reaches the interrupt check, that would
explain things. It also appears that the worker sends a signal to the
leader process a few lines before starting to wait, which is supposed to
tell the leader to come read messages off the queue. If the leader reaches
WaitForParallelWorkersToExit at the wrong time and ends up waiting on that
event, I could see how they would both end up waiting for the other and
never finishing.

The thing is that DestroyParallelContext seems to be detaching from the
queues, but if the worker hits the wait step before the leader detaches
from the queue, does it have any way of knowing that?

Anyway, I'm entirely unsure of my analysis here, but thought I'd offer
something to help speed this along.

Best,
David Kohn


Re: Boolean partitions syntax

2018-02-02 Thread Tom Lane
Robert Haas  writes:
> On Fri, Feb 2, 2018 at 4:40 PM, Peter Eisentraut
>  wrote:
>> There might be other options, but one way to solve this would be to
>> treat partition bounds as a general expression in the grammar and then
>> check in post-parse analysis that it's a constant.

> Yeah -- isn't the usual way of handling this to run the user's input
> through eval_const_expressions and see if the result is constant?

Not sure we want to go quite that far: at minimum it would imply invoking
arbitrary stuff during a utility statement, which we generally try to
avoid.  Still, copy-from-query does that, so we can certainly make it
work if we wish.

Perhaps more useful to discuss: would that truly be the semantics we want,
or should we just evaluate the expression and have done?  It's certainly
arguable that "IN (random())" ought to draw an error, not compute some
random value and use that.  But if you are insistent on partition bounds
being immutable in any strong sense, you've already got problems, because
e.g. a timestamptz literal's interpretation isn't necessarily fixed.
It's only after we've reduced the original input to Datum form that we
can make any real promises about the value not moving.  So I'm not seeing
where is the bright line between "IN ('today')" and "IN (random())".
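To make the two cases concrete, a sketch of the statements under discussion (the tables are hypothetical, and the second statement is currently rejected by the grammar rather than with the smarter error being argued for):

```sql
-- 'today' is input whose meaning depends on the current date; it only
-- becomes immutable once reduced to a Datum when the bound is parsed.
CREATE TABLE events_today PARTITION OF events
    FOR VALUES IN ('today');

-- Arguably this ought to draw something like "partition bound must be
-- a constant" -- but where exactly is the line between it and the
-- statement above?
CREATE TABLE nums_rand PARTITION OF nums
    FOR VALUES IN (random());
```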

regards, tom lane



Re: [HACKERS] Vacuum: allow usage of more than 1GB of work mem

2018-02-02 Thread Claudio Freire
On Thu, Jan 25, 2018 at 6:21 PM, Thomas Munro
 wrote:
> On Fri, Jan 26, 2018 at 9:38 AM, Claudio Freire  
> wrote:
>> I had the tests running in a loop all day long, and I cannot reproduce
>> that variance.
>>
>> Can you share your steps to reproduce it, including configure flags?
>
> Here are two build logs where it failed:
>
> https://travis-ci.org/postgresql-cfbot/postgresql/builds/332968819
> https://travis-ci.org/postgresql-cfbot/postgresql/builds/332592511
>
> Here's one where it succeeded:
>
> https://travis-ci.org/postgresql-cfbot/postgresql/builds/333139855
>
> The full build script used is:
>
> ./configure --enable-debug --enable-cassert --enable-coverage
> --enable-tap-tests --with-tcl --with-python --with-perl --with-ldap
> --with-icu && make -j4 all contrib docs && make -Otarget -j3
> check-world
>
> This is a virtualised 4 core system.  I wonder if "make -Otarget -j3
> check-world" creates enough load on it to produce some weird timing
> effect that you don't see on your development system.

I can't reproduce it, not even with the same build script.

It's starting to look like a timing effect indeed.

I get a similar effect if there's an active snapshot in another
session while vacuum runs. I don't know how the test suite ends up in
that situation, but it seems to be the case.

How do you suggest we go about fixing this? The test in question is
important, I've caught actual bugs in the implementation with it,
because it checks that vacuum effectively frees up space.

I'm thinking this vacuum test could perhaps be put in its own parallel
group? Since I can't reproduce it, I can't know whether that will
fix it, but it seems sensible.



Re: [HACKERS] Restrict concurrent update/delete with UPDATE of partition key

2018-02-02 Thread Robert Haas
On Fri, Jan 26, 2018 at 1:28 AM, Amit Kapila  wrote:
> So, this means that in case of logical replication, it won't generate
> the error this patch is trying to introduce.  I think if we want to
> handle this we need some changes in WAL and logical decoding as well.
>
> Robert, others, what do you think?  I am not very comfortable leaving
> this unaddressed, if we don't want to do anything about it, at least
> we should document it.

As I said on the other thread, I'm not sure how reasonable it really
is to try to do anything about this.  For both this issue and the one you
raised there, I think we'd need to introduce a new WAL record type that
represents a delete from one table and an insert to another that
should be considered as a single operation. I'm not keen on that idea,
but you can make an argument that it's the Right Thing To Do.  I would
be more inclined, at least for v11, to just document that the
delete+insert will be replayed separately on replicas.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Boolean partitions syntax

2018-02-02 Thread Robert Haas
On Fri, Feb 2, 2018 at 4:40 PM, Peter Eisentraut
 wrote:
> There might be other options, but one way to solve this would be to
> treat partition bounds as a general expression in the grammar and then
> check in post-parse analysis that it's a constant.

Yeah -- isn't the usual way of handling this to run the user's input
through eval_const_expressions and see if the result is constant?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Draft release notes for 10.2 et al

2018-02-02 Thread Tom Lane
... are posted at
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=bf641d3376018c40860f26167a60febff5bc1f51

They should appear in the website's development docs in a couple of hours,
too.

Please comment before Sunday if you see anything that could be improved.

regards, tom lane



Re: Boolean partitions syntax

2018-02-02 Thread Tom Lane
Peter Eisentraut  writes:
> There might be other options, but one way to solve this would be to
> treat partition bounds as a general expression in the grammar and then
> check in post-parse analysis that it's a constant.

That's pretty much what I said upthread.  What I basically don't like
about the current setup is that it's assuming that the bound item is
a bare literal.  Even disregarding future-extension issues, that's bad
because it can't result in an error message smarter than "syntax error"
when someone tries the rather natural thing of writing a more complicated
expression.

regards, tom lane



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-02-02 Thread Andres Freund
On 2018-02-02 13:35:59 -0800, Peter Geoghegan wrote:
> On Fri, Feb 2, 2018 at 10:38 AM, Peter Geoghegan  wrote:
> > On Fri, Feb 2, 2018 at 10:37 AM, Robert Haas  wrote:
> >> If you could keep an eye on the buildfarm and investigate anything
> >> that breaks, I would appreciate it.
> 
> > I can keep an eye on it throughout the day.
> 
> There is a benign Valgrind error that causes the lousyjack animal to
> report failure. It looks like this:
> 
> ==6850== Syscall param write(buf) points to uninitialised byte(s)
> ==6850==at 0x4E4D534: write (in /usr/lib64/libpthread-2.26.so)
> ==6850==by 0x82328F: FileWrite (fd.c:2017)
> ==6850==by 0x8261AD: BufFileDumpBuffer (buffile.c:513)
> ==6850==by 0x826569: BufFileFlush (buffile.c:657)
> ==6850==by 0x8262FB: BufFileRead (buffile.c:561)
> ==6850==by 0x9F6C79: ltsReadBlock (logtape.c:273)
> ==6850==by 0x9F7ACF: LogicalTapeFreeze (logtape.c:906)
> ==6850==by 0xA05B0D: worker_freeze_result_tape (tuplesort.c:4477)
> ==6850==by 0xA05BC6: worker_nomergeruns (tuplesort.c:4499)
> ==6850==by 0x9FCA1E: tuplesort_performsort (tuplesort.c:1823)

Not saying you're wrong, but you should include a comment on why this is
a benign warning. Presumably it's some padding memory somewhere, but
it's not obvious from the above bleat.

Greetings,

Andres Freund



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-02-02 Thread Peter Geoghegan
On Fri, Feb 2, 2018 at 10:38 AM, Peter Geoghegan  wrote:
> On Fri, Feb 2, 2018 at 10:37 AM, Robert Haas  wrote:
>> If you could keep an eye on the buildfarm and investigate anything
>> that breaks, I would appreciate it.

> I can keep an eye on it throughout the day.

There is a benign Valgrind error that causes the lousyjack animal to
report failure. It looks like this:

==6850== Syscall param write(buf) points to uninitialised byte(s)
==6850==at 0x4E4D534: write (in /usr/lib64/libpthread-2.26.so)
==6850==by 0x82328F: FileWrite (fd.c:2017)
==6850==by 0x8261AD: BufFileDumpBuffer (buffile.c:513)
==6850==by 0x826569: BufFileFlush (buffile.c:657)
==6850==by 0x8262FB: BufFileRead (buffile.c:561)
==6850==by 0x9F6C79: ltsReadBlock (logtape.c:273)
==6850==by 0x9F7ACF: LogicalTapeFreeze (logtape.c:906)
==6850==by 0xA05B0D: worker_freeze_result_tape (tuplesort.c:4477)
==6850==by 0xA05BC6: worker_nomergeruns (tuplesort.c:4499)
==6850==by 0x9FCA1E: tuplesort_performsort (tuplesort.c:1823)

I'll need to go and write a Valgrind suppression for this. I'll get to
it later today.

-- 
Peter Geoghegan



Re: [HACKERS] path toward faster partition pruning

2018-02-02 Thread Robert Haas
On Fri, Feb 2, 2018 at 9:33 AM, Robert Haas  wrote:
>> Updated set of patches attached (patches 0002 onward mostly unchanged,
>> except I incorporated the delta patch posted by David upthread).
>
> Committed 0001.  Thanks.

Some preliminary thoughts...

Regarding 0002, I can't help noticing that this adds a LOT of new code
to partition.c.  With 0002 applied, it climbs into the top 2% of all
".c" files in terms of lines of code.  It seems to me, though, that
maybe it would make sense to instead add all of this code to some new
.c file, e.g. src/backend/optimizer/util/partprune.c.  I realize
that's a little awkward in this case because we're hoping to use this
code at runtime and not just in the optimizer, but I don't have a
better idea.  Using src/backend/catalog as a dumping-ground for code
that doesn't have a clear-cut place to live is not a superior
alternative, for sure.  And it seems to me that the code you're adding
here is really quite similar to what we've already got in that
directory -- for example, predtest.c, which currently does partition
pruning, lives there; so does clauses.c, whose evaluate_expr facility
this patch wants to use; so does relnode.c, which the other patches
modify; and in general this looks an awful lot like other optimizer
logic that decomposes clauses.  I'm open to other suggestions but I
don't think adding all of this directly into partition.c is a good
plan.

If we do add a new file for this code, the header comment for that
file would be a good place to write an overall explanation of this new
facility.  The individual bits and pieces seem to have good comments
but an overall explanation of what's going on here seems to be
lacking.

It doesn't seem good that get_partitions_from_clauses requires us to
reopen the relation.  I'm going to give my standard review feedback
any time someone injects additional relation_open or heap_open calls
into a patch: please look for a way to piggyback on one of the places
that already has the relation open instead of adding code to re-open
it elsewhere.  Reopening it is not entirely free, and, especially when
NoLock is used, makes it hard to tell whether we're doing the locking
correctly.  Note that we've already got things like
set_relation_partition_info (which copies the bounds) and
set_baserel_partition_key_exprs (which copies, with some preprocessing,
the partitioning expressions) and also find_partition_scheme, but
instead of using that existing data from the RelOptInfo, this patch is
digging it directly out of the relcache entry, which doesn't seem
great.

The changes to set_append_rel_pathlist probably make it slower in the
case where rte->relkind != RELKIND_PARTITIONED_TABLE.  We build a
whole new list that we don't really need.  How about just keeping the
(appinfo->parent_relid != parentRTindex) test in the loop and setting
rel_appinfos to either root->append_rel_list or
rel->live_part_appinfos as appropriate?

I understand why COLLATION_MATCH thinks that a collation OID match is
OK, but why is InvalidOid also OK?  Can you add a comment?  Maybe some
test cases, too?

How fast is this patch these days, compared with the current approach?
It would be good to test both when nearly all of the partitions are
pruned and when almost none of the partitions are pruned.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: ERROR: too many dynamic shared memory segments

2018-02-02 Thread Tom Lane
Robert Haas  writes:
> On Fri, Feb 2, 2018 at 3:01 PM, Tom Lane  wrote:
>> Do you mind if I change that Assert to a run-time test?

> Hrm, I guess I could have done something like
> shm_toc_lookup(pcxt->toc, PARALLEL_KEY_ERROR_QUEUE, (pcxt->nworkers ==
> 0)).

OK, that'd work too.  I'll make it so.

regards, tom lane



Re: [HACKERS] Bug in to_timestamp().

2018-02-02 Thread Dmitry Dolgov
> On 1 February 2018 at 11:24, Arthur Zakirov  wrote:
>
> I fixed those messages, but in a different manner. I think that an
> unexpected character is unknown, being neither a space nor a separator. And
> it is better to say that a space/separator character was expected.

Sounds good, thanks.

> Attached fixed patch.

For some reason I can't apply it cleanly to the latest master:

(Stripping trailing CRs from patch; use --binary to disable.)
patching file doc/src/sgml/func.sgml
(Stripping trailing CRs from patch; use --binary to disable.)
patching file src/backend/utils/adt/formatting.c
(Stripping trailing CRs from patch; use --binary to disable.)
patching file src/test/regress/expected/horology.out
Hunk #1 FAILED at 2769.
Hunk #2 FAILED at 2789.
Hunk #3 succeeded at 2810 with fuzz 2.
Hunk #4 succeeded at 2981 with fuzz 2.
Hunk #5 succeeded at 3011 with fuzz 2.
Hunk #6 FAILED at 3029.
3 out of 6 hunks FAILED -- saving rejects to file
src/test/regress/expected/horology.out.rej
(Stripping trailing CRs from patch; use --binary to disable.)
patching file src/test/regress/sql/horology.sql

> On 2 February 2018 at 16:40, Robert Haas  wrote:
>
> I'm not quite sure whether it's relevant here, but I think some of the
> ctype.h functions have locale-dependent behavior.  By implementing our
> own test we make sure that we don't accidentally inherit any such
> behavior.

Yes, you're right, `isalpha` is actually a locale-dependent function.

In some locales, there may be additional characters for which isalpha() is
true—letters which are neither uppercase nor lowercase.

So, if I understand the patch correctly, with `isalpha` the function
`is_separator_char` will return false for some locale-specific characters, and
without it those characters will be treated as separators. Is that the desired
behavior?



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-02-02 Thread Robert Haas
On Fri, Feb 2, 2018 at 3:35 PM, Peter Geoghegan  wrote:
> On Fri, Feb 2, 2018 at 12:30 PM, Robert Haas  wrote:
>> For the record, I typically construct the list of reviewers by reading
>> over the thread and adding all the people whose names I find there in
>> chronological order, excluding things that are clearly not review
>> (like "Bumped to next CF.") and opinions on narrow questions that
>> don't indicate that any code-reading or testing was done (like "+1 for
>> calling the GUC foo_bar_baz rather than quux_bletch".)  I saw that you
>> copied Corey on the original email, but I see no posts from him on the
>> thread, which is why he didn't get included in the commit message.
>
> I did credit him in my own proposed commit message. I know that it's
> not part of your workflow to preserve that, but I had assumed that
> that would at least be taken into account.

Ah.  Sorry, I didn't look at that.  I try to remember to look at
proposed commit messages, but not everyone includes them, which is
probably part of the reason I don't always remember to look for them.
Or maybe I just have failed to adequately develop that habit...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-02-02 Thread Peter Geoghegan
On Fri, Feb 2, 2018 at 12:30 PM, Robert Haas  wrote:
> For the record, I typically construct the list of reviewers by reading
> over the thread and adding all the people whose names I find there in
> chronological order, excluding things that are clearly not review
> (like "Bumped to next CF.") and opinions on narrow questions that
> don't indicate that any code-reading or testing was done (like "+1 for
> calling the GUC foo_bar_baz rather than quux_bletch".)  I saw that you
> copied Corey on the original email, but I see no posts from him on the
> thread, which is why he didn't get included in the commit message.

I did credit him in my own proposed commit message. I know that it's
not part of your workflow to preserve that, but I had assumed that
that would at least be taken into account.

Anyway, mistakes like this happen. I'm glad that we now have the
reviewer credit list, so that they can be corrected afterwards.

-- 
Peter Geoghegan



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-02-02 Thread Robert Haas
On Fri, Feb 2, 2018 at 3:23 PM, Peter Geoghegan  wrote:
> On Fri, Feb 2, 2018 at 10:38 AM, Peter Geoghegan  wrote:
>> Thanks everyone
>
> I would like to acknowledge the assistance of Corey Huinker with early
> testing of the patch (this took place in 2016, and much of it was not
> on-list). Even though he wasn't credited in the commit message, he
> should appear in the V11 release notes reviewer list IMV. His
> contribution certainly merits it.

For the record, I typically construct the list of reviewers by reading
over the thread and adding all the people whose names I find there in
chronological order, excluding things that are clearly not review
(like "Bumped to next CF.") and opinions on narrow questions that
don't indicate that any code-reading or testing was done (like "+1 for
calling the GUC foo_bar_baz rather than quux_bletch".)  I saw that you
copied Corey on the original email, but I see no posts from him on the
thread, which is why he didn't get included in the commit message.
While I have no problem with him being included in the release notes,
I obviously can't know about activity that happens entirely off-list.
If you mentioned somewhere in the 200+ messages on this topic that he
should be included, I missed that, too.  I think it's much harder to
give credit adequately when contributions are off-list; letting
everyone know what's going on is why we have a list.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Built-in connection pooling

2018-02-02 Thread Vladimir Sitnikov
Konstantin> I do not have an explanation of the performance degradation in
the case of this particular workload.

A) Mongo Java Client uses a connection-pool of 100 connections by default.
That is, it does not follow "connection per client" (in YCSB terms), but it
is capped at 100 connections. I think it can be adjusted by adding
?maxPoolSize=100500 or ?maxpoolsize=100500 to the Mongo URL

I wonder if you could try to vary that parameter and see if it changes
Mongo results.

B) There's a bug in JDBC client of YCSB (it might affect PostgreSQL
results, however I'm not sure if the impact would be noticeable). The
default configuration is readallfields=true, however Jdbc client just
discards the results instead of accessing the columns. I've filed
https://github.com/brianfrankcooper/YCSB/issues/1087 for that.


C) I might miss something, however my local (Macbook) benchmarks show that
PostgreSQL 9.6 somehow uses Limit->Sort->BitmapScan kind of plans.
I have picked a "bad" userid value via auto_explain.
Jdbc client uses prepared statements, so a single bind might spoil the
whole thing causing bad plans for all the values afterwards.
Does it make sense to disable bitmap scan somehow?

For instance:

explain (analyze, buffers) select * From usertable where
YCSB_KEY>='user884845140610037639' order by YCSB_KEY limit 100;
  QUERY PLAN
---
 Limit  (cost=320.99..321.24 rows=100 width=1033) (actual time=1.408..1.429
rows=100 loops=1)
   Buffers: shared hit=140
   ->  Sort  (cost=320.99..321.33 rows=135 width=1033) (actual
time=1.407..1.419 rows=100 loops=1)
 Sort Key: ycsb_key
 Sort Method: quicksort  Memory: 361kB
 Buffers: shared hit=140
 ->  Bitmap Heap Scan on usertable  (cost=9.33..316.22 rows=135
width=1033) (actual time=0.186..0.285 rows=167 loops=1)
   Recheck Cond: ((ycsb_key)::text >=
'user884845140610037639'::text)
   Heap Blocks: exact=137
   Buffers: shared hit=140
   ->  Bitmap Index Scan on usertable_pkey  (cost=0.00..9.29
rows=135 width=0) (actual time=0.172..0.172 rows=167 loops=1)
 Index Cond: ((ycsb_key)::text >=
'user884845140610037639'::text)
 Buffers: shared hit=3
 Planning time: 0.099 ms
 Execution time: 1.460 ms

vs

explain (analyze, buffers) select * From usertable where
YCSB_KEY>='user184845140610037639' order by YCSB_KEY limit 100;
QUERY PLAN
---
 Limit  (cost=0.28..89.12 rows=100 width=1033) (actual time=0.174..0.257
rows=100 loops=1)
   Buffers: shared hit=102
   ->  Index Scan using usertable_pkey on usertable  (cost=0.28..2154.59
rows=2425 width=1033) (actual time=0.173..0.246 rows=100 loops=1)
 Index Cond: ((ycsb_key)::text >= 'user184845140610037639'::text)
 Buffers: shared hit=102
 Planning time: 0.105 ms
 Execution time: 0.277 ms

Vladimir


Re: ERROR: too many dynamic shared memory segments

2018-02-02 Thread Robert Haas
On Fri, Feb 2, 2018 at 3:01 PM, Tom Lane  wrote:
> Robert Haas  writes:
>> That turned out to produce more than one problem.  I find that the
>> select_parallel test then fails like this:
>> ERROR:  could not find key 18446744073709486082 in shm TOC at 0x10be98040
>> The fix for that problem seems to be:
>
>>  /* Recreate error queues. */
>>  error_queue_space =
>> -shm_toc_lookup(pcxt->toc, PARALLEL_KEY_ERROR_QUEUE, false);
>> +shm_toc_lookup(pcxt->toc, PARALLEL_KEY_ERROR_QUEUE, true);
>> +Assert(pcxt->nworkers == 0 || error_queue_space != NULL);
>
> I happened across this patch while preparing the release notes, and
> I'm quite distressed by it, because it completely breaks the point
> of what I'd done in commit d46633506 (to wit, to not just blindly
> assume, nor just Assert, that shm_toc_lookup always succeeds).
> Do you mind if I change that Assert to a run-time test?

Hrm, I guess I could have done something like
shm_toc_lookup(pcxt->toc, PARALLEL_KEY_ERROR_QUEUE, (pcxt->nworkers ==
0)).

I don't mind much if you change it, but I will note that for the
record, before d46633506, we had a theoretical source of bugs, whereas
after that commit, we had a bug. 445dbd82a fixed that; if you change
this around again, please take care not to make it buggy again.
Otherwise, I'll be the one who is quite distressed.  :-)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: ERROR: too many dynamic shared memory segments

2018-02-02 Thread Tom Lane
Robert Haas  writes:
> That turned out to produce more than one problem.  I find that the
> select_parallel test then fails like this:
> ERROR:  could not find key 18446744073709486082 in shm TOC at 0x10be98040
> The fix for that problem seems to be:

>  /* Recreate error queues. */
>  error_queue_space =
> -shm_toc_lookup(pcxt->toc, PARALLEL_KEY_ERROR_QUEUE, false);
> +shm_toc_lookup(pcxt->toc, PARALLEL_KEY_ERROR_QUEUE, true);
> +Assert(pcxt->nworkers == 0 || error_queue_space != NULL);

I happened across this patch while preparing the release notes, and
I'm quite distressed by it, because it completely breaks the point
of what I'd done in commit d46633506 (to wit, to not just blindly
assume, nor just Assert, that shm_toc_lookup always succeeds).
Do you mind if I change that Assert to a run-time test?

regards, tom lane



Re: [HACKERS] make async slave to wait for lsn to be replayed

2018-02-02 Thread Simon Riggs
On 2 February 2018 at 18:46, Robert Haas  wrote:
> On Fri, Feb 2, 2018 at 3:46 AM, Simon Riggs  wrote:
>> In PG11, I propose the following command, sticking mostly to Ants'
>> syntax, and allowing to wait for multiple events before it returns. It
>> doesn't hold snapshot and will not get cancelled by Hot Standby.
>>
>> WAIT FOR event [, event ...] options
>>
>> event is
>> LSN value
>> TIMESTAMP value
>>
>> options
>> TIMEOUT delay
>> UNTIL TIMESTAMP timestamp
>> (we have both, so people don't need to do math, they can use whichever
>> they have)
>
> WAIT FOR TIMEOUT sounds a lot like SELECT pg_sleep_for(), and WAIT
> UNTIL TIMESTAMP sounds a lot like SELECT pg_sleep_until().

Yes, it sounds very similar. It's the behavior that differs; I read
and agreed with yours and Thomas' earlier comments on that point.

As pointed out upthread, the key difference is whether it gets
cancelled on Hot Standby and whether you can call it in a non-READ
COMMITTED transaction.

-- 
Simon Riggs    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] make async slave to wait for lsn to be replayed

2018-02-02 Thread Robert Haas
On Fri, Feb 2, 2018 at 3:46 AM, Simon Riggs  wrote:
> In PG11, I propose the following command, sticking mostly to Ants'
> syntax, and allowing to wait for multiple events before it returns. It
> doesn't hold snapshot and will not get cancelled by Hot Standby.
>
> WAIT FOR event [, event ...] options
>
> event is
> LSN value
> TIMESTAMP value
>
> options
> TIMEOUT delay
> UNTIL TIMESTAMP timestamp
> (we have both, so people don't need to do math, they can use whichever
> they have)

WAIT FOR TIMEOUT sounds a lot like SELECT pg_sleep_for(), and WAIT
UNTIL TIMESTAMP sounds a lot like SELECT pg_sleep_until().

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-02-02 Thread Peter Geoghegan
On Fri, Feb 2, 2018 at 10:37 AM, Robert Haas  wrote:
> And that patch you attached is also, now, committed.
>
> If you could keep an eye on the buildfarm and investigate anything
> that breaks, I would appreciate it.

Fantastic!

I can keep an eye on it throughout the day.

Thanks everyone
-- 
Peter Geoghegan



Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2018-02-02 Thread Robert Haas
On Fri, Feb 2, 2018 at 11:16 AM, Peter Geoghegan  wrote:
> Attached patch has these changes.

And that patch you attached is also, now, committed.

If you could keep an eye on the buildfarm and investigate anything
that breaks, I would appreciate it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Add RANGE with values and exclusions clauses to the Window Functions

2018-02-02 Thread David G. Johnston
On Fri, Feb 2, 2018 at 9:26 AM, Oliver Ford  wrote:

> On Thu, Feb 1, 2018 at 1:46 AM, David G. Johnston
>  wrote:
>
> > The three callers of WinGetFuncArgInFrame don't use the isout argument;
> they
> > probably need to read that and a new isexcluded argument.  Start at the
> > head, loop until isout = true || isexcluded = false.
>
> The patch takes a slightly different approach and puts the logic in
> WinGetFuncArgInFrame.
> The "row_is_in_frame" function now returns a specific return code for
> when an Exclude
> clause was matched.


I would suggest adding constants for the 4 possible results from
row_is_in_frame.

David J.


Re: Temporary tables prevent autovacuum, leading to XID wraparound

2018-02-02 Thread Robert Haas
On Fri, Feb 2, 2018 at 1:27 AM, Masahiko Sawada  wrote:
> Thank you for the suggestion. It sounds smarter. So would it be better
> if we vacuumed databases for anti-wraparound in ascending order
> of relfrozenxid?

Currently, we're doing it based on datfrozenxid.  I was looking for a
small, relatively safe change to improve things, which led to my
proposal: continue using datfrozenxid when the database isn't being
vacuumed, but when it is, substitute the datfrozenxid that we expect
the database *will have* after the tables currently being vacuumed are
finished for the actual datfrozenxid.

On further reflection, though, I see problems.  Suppose db1 has
age(datfrozenxid) = 600m and two tables with age(relfrozenxid) of 600m
and 400m.  db2 has age(datfrozenxid) = 500m.  Then suppose #1 starts
vacuuming the table with age 600m.  The AV launcher sees that, with
that table out of the way, we'll be able to bring datfrozenxid forward
to 400m and so sends worker #2 to db2.  But actually, when the worker
in db1 finishes vacuuming the table with age 600m, it's going to have
to vacuum the table with age 400m before performing any relfrozenxid
update.  So actually, the table with age 400m is holding the
relfrozenxid at 600m just as much as if it too had a relfrozenxid of
600m.  In fact, any tables in the same database that need routine
vacuuming are a problem, too: we're going to vacuum all of those
before updating relfrozenxid, too.

Maybe we should start by making the scheduling algorithm used by the
individual workers smarter:

1. Let's teach do_autovacuum() that, when there are any tables needing
vacuum for wraparound, either for relfrozenxid or relminmxid, it
should vacuum only those and forget about the rest.  This seems like
an obvious optimization to prevent us from delaying
datfrozenxid/datminmxid updates for the sake of vacuuming tables that
are "merely" bloated.

2. Let's have do_autovacuum() sort the tables to be vacuumed in
ascending order by relfrozenxid, so older tables are vacuumed first.
A zero-order idea is to put tables needing relfrozenxid vacuuming
before tables needing relminmxid vacuuming, and sort the latter by
ascending relminmxid, but maybe there's something smarter possible
there.  The idea here is to vacuum tables in order of priority rather
than in whatever order they happen to be physically mentioned in
pg_class.

3. Let's have do_autovacuum() make an extra call to
vac_update_datfrozenxid() whenever the next table to be vacuumed is at
least 10 million XIDs (or MXIDs) newer than the first one it vacuumed
either since the last call to vac_update_datfrozenxid() or, if none,
since it started.  That way, we don't have to wait for every single
table to get vacuumed before we can consider advancing
relfrozenxid/relminmxid.

4. When it's performing vacuuming for wraparound, let's have AV
workers advertise in shared memory the oldest relfrozenxid and
relminmxid that it might exist in the database.  Given #1 and #2, this
is pretty easy, since we start by moving through tables in increasing
relfrozenxid order and then shift to moving through them in increasing
relminmxid order.  When we're working through the relfrozenxid tables,
the oldest relminmxid doesn't move, and the oldest relfrozenxid is
that of the next table in the list.  When we're working through the
relminmxid tables, it's the reverse.  We need a little cleverness to
figure out what value to advertise when we're on the last table in
each list -- it should be the next-higher value, even though that will
be above the relevant threshold, not a sentinel value.

5. With those steps in place, I think we can now adopt my previous
idea to have the AV launcher use any advertised relfrozenxid (and, as
I now realize, relminmxid) instead of the ones found in pg_database,
because now we know that individual workers are definitely focused on
getting relfrozenxid (and/or relminmxid) as soon as possible, and
vacuuming unrelated tables won't help them do it any faster.

This gets us fairly close to vacuuming tables in decreasing order of
wraparound danger across the entire cluster.  It's not perfect.  It
prefers to keep vacuuming tables in the same database rather than
having a worker exit and maybe launching a new one in a different
database -- but the alternative is not very appealing.  If we didn't
do it that way, and if we had a million tables with XIDs that were
closely spaced spread across different databases, we'd have to
terminate and relaunch workers at a very high rate to get
everything sorted out, which would be inefficient and annoying to
program.  Also, it keeps the existing hard prioritization of
relfrozenxid over relminmxid, which could theoretically be wrong for
some installation.  But I think that might not be a big problem in
practice, and it seems like that could be separately improved at
another time.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Built-in connection pooling

2018-02-02 Thread Konstantin Knizhnik



On 01.02.2018 23:28, Vladimir Sitnikov wrote:

> config/pgjsonb-local.dat

Do you use standard "workload" configuration values? 
(e.g. recordcount=1000, maxscanlength=100)


Yes, I used the default values for the workload. For example, workload-A has
the following settings:


# Yahoo! Cloud System Benchmark
# Workload A: Update heavy workload
#   Application example: Session store recording recent actions
#
#   Read/update ratio: 50/50
#   Default data size: 1 KB records (10 fields, 100 bytes each, plus key)
#   Request distribution: zipfian

recordcount=1000
operationcount=1000
workload=com.yahoo.ycsb.workloads.CoreWorkload

readallfields=true

readproportion=0.5
updateproportion=0.5
scanproportion=0
insertproportion=0

requestdistribution=zipfian




Could you share ycsb output (e.g. for workload a)?
I mean lines like
[TOTAL_GC_TIME], Time(ms), xxx
[TOTAL_GC_TIME_%], Time(%), xxx


$ cat results/last/run_pgjsonb-local_workloada_70_bt.out
[OVERALL], RunTime(ms), 60099.0
[OVERALL], Throughput(ops/sec), 50444.83269272367
[TOTAL_GCS_PS_Scavenge], Count, 6.0
[TOTAL_GC_TIME_PS_Scavenge], Time(ms), 70.0
[TOTAL_GC_TIME_%_PS_Scavenge], Time(%), 0.11647448376844872
[TOTAL_GCS_PS_MarkSweep], Count, 0.0
[TOTAL_GC_TIME_PS_MarkSweep], Time(ms), 0.0
[TOTAL_GC_TIME_%_PS_MarkSweep], Time(%), 0.0
[TOTAL_GCs], Count, 6.0
[TOTAL_GC_TIME], Time(ms), 70.0
[TOTAL_GC_TIME_%], Time(%), 0.11647448376844872
[READ], Operations, 1516174.0
[READ], AverageLatency(us), 135.802076146933
[READ], MinLatency(us), 57.0
[READ], MaxLatency(us), 23327.0
[READ], 95thPercentileLatency(us), 382.0
[READ], 99thPercentileLatency(us), 828.0
[READ], Return=OK, 1516174
[CLEANUP], Operations, 70.0
[CLEANUP], AverageLatency(us), 134.21428571428572
[CLEANUP], MinLatency(us), 55.0
[CLEANUP], MaxLatency(us), 753.0
[CLEANUP], 95thPercentileLatency(us), 728.0
[CLEANUP], 99thPercentileLatency(us), 750.0
[UPDATE], Operations, 1515510.0
[UPDATE], AverageLatency(us), 2622.6653258639008
[UPDATE], MinLatency(us), 86.0
[UPDATE], MaxLatency(us), 1059839.0
[UPDATE], 95thPercentileLatency(us), 1261.0
[UPDATE], 99thPercentileLatency(us), 87039.0
[UPDATE], Return=OK, 1515510


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: [HACKERS] Can ICU be used for a database's default sort order?

2018-02-02 Thread Peter Geoghegan
On Fri, Feb 2, 2018 at 4:22 AM, Andrey Borodin  wrote:
> I can try to do this before next CF. But ISTM that EDB and Postgres Pro 
> already have various flavours of similar feature. Maybe they are planning to 
> publish that?

I would definitely review that patch.


-- 
Peter Geoghegan



Re: Wait for parallel workers to attach

2018-02-02 Thread Peter Geoghegan
On Fri, Feb 2, 2018 at 6:11 AM, Robert Haas  wrote:
>> Fair enough, you can proceed with the patch.
>
> Committed.  Now, on to the main event!

Thank you both.

-- 
Peter Geoghegan



Re: [HACKERS] Bug in to_timestamp().

2018-02-02 Thread Robert Haas
On Wed, Jan 31, 2018 at 11:53 AM, Dmitry Dolgov <9erthali...@gmail.com> wrote:
> Why not write `is_separator_char` using `isprint`, `isalpha`, `isdigit` from
> ctype.h? Something like:
>
> return isprint(*str) && !isalpha(*str) && !isdigit(*str)
>
> From what I see in the source code they do exactly the same and tests are
> successfully passing with this change.

I'm not quite sure whether it's relevant here, but I think some of the
ctype.h functions have locale-dependent behavior.  By implementing our
own test we make sure that we don't accidentally inherit any such
behavior.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Changing WAL Header to reduce contention during ReserveXLogInsertLocation()

2018-02-02 Thread Robert Haas
On Thu, Feb 1, 2018 at 7:21 PM, Simon Riggs  wrote:
> Yes, it would be about 99% of the time.
>
> But you have it backwards - we are not assuming that case. That is the
> only case that has risk - the one where an old WAL record starts at
> exactly the place the latest one stops. Otherwise the rest of the WAL
> record will certainly fail the CRC check, since it will effectively
> have random data in it, as you say.

OK, I get it now.  Thanks for explaining.  I think I understand now
why you think this problem can be solved just by controlling the way
we recycle segments, but I'm still not sure if that can really be made
fully reliable.  Michael seems concerned about what might happen after
multiple recyclings, and Tom has raised the issue of old data
reappearing after a crash.

I do agree that it would be nice if we could make it work - saving 8
bytes per WAL record would be significant.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Query running for very long time (server hanged) with parallel append

2018-02-02 Thread Robert Haas
On Fri, Feb 2, 2018 at 1:43 AM, Amit Khandekar  wrote:
> On 2 February 2018 at 03:50, Thomas Munro  
> wrote:
>> Whatever logic bug might be causing the query to hang, it's not good
>> that we're unable to SIGINT/SIGTERM our way out of this state.  See
>> also this other bug report for a known problem (already fixed but not
>> yet released), but which came with an extra complaint, as yet
>> unexplained, that the query couldn't be interrupted:
>>
>> https://www.postgresql.org/message-id/flat/151724453314.1238.409882538067070269%40wrigleys.postgresql.org
>
> Yeah, it is not good that there is no response to the SIGINT.
>
> The query is actually hanging because one of the workers is in a small
> loop where it iterates over the subplans searching for unfinished
> plans, and it never comes out of the loop (it's a bug which I am yet
> to fix). And it does not make sense to keep CHECK_FOR_INTERRUPTS in
> each iteration; it's a small loop that does not pass control to any
other functions.

Uh, sounds like we'd better fix that bug.

> But I am not sure about this : while the workers are at it, why the
> backend that is waiting for the workers does not come out of the wait
> state with a SIGINT. I guess the same issue has been discussed in the
> mail thread that you pointed.

Is it getting stuck here?

/*
 * We can't finish transaction commit or abort until all of the workers
 * have exited.  This means, in particular, that we can't respond to
 * interrupts at this stage.
 */
HOLD_INTERRUPTS();
WaitForParallelWorkersToExit(pcxt);
RESUME_INTERRUPTS();

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] [PATCH] Lockable views

2018-02-02 Thread Robert Haas
On Thu, Feb 1, 2018 at 8:09 PM, Tatsuo Ishii  wrote:
> Initially I thought all base tables including ones in a subquery also
> should be locked like you. But after some discussions with Yugo, I
> agree that locking table in a subquery is less valuable for users (and
> I am afraid it may introduce more deadlock chances). See upthread
> discussions.

I just reread those discussions but I don't see that they really make
any argument for the behavior the patch implements.  I see no
explanation on the thread for why locking a table inside of a subquery
is more or less likely to cause deadlock than locking one outside of a
subquery.

>> I think that if we
>> change the rules for which subqueries get flattened in a future
>> release, then the behavior will also change.  That seems bad.
>
> I doubt it could happen in the future but if that happened we should
> disallow locking on such views.

That doesn't make any sense to me.  When someone migrates from
PostgreSQL 11 to, say, PostgreSQL 14, the view definition is going to
be recreated from an SQL query.  Neither the user nor the database
will know whether the query was optimized the same way on both
databases, so how could we disallow locking only those views where
there was a difference on the two releases?  Even if we could, how
does that help anything?  Throwing an error is just as much a
backward-incompatibility in the command as silently changing what gets
locked.

But my complaint may have been a little off base all the same -- I
guess we're doing this based on the rewriter output, rather than the
optimizer output, which makes it a lot less likely that we would
decide to change anything here.

>> I also think that this is a bad idea for another reason, which is that
>> it leaves us with no syntax to say that you want to lock the view
>> itself, and pg_dump wants do that if only we had syntax for it.
>
> I agree with Yugo and Alvaro. It's better to have a separate syntax
> for locking views itself.
>
> https://www.postgresql.org/message-id/20171226143407.6wjzjn42pt54qskm@alvherre.pgsql

Hmm, Alvaro's argument makes sense, I guess.  I withdraw this complaint.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Prefix operator for text and spgist support

2018-02-02 Thread Ildus Kurbangaliev
Hi,

Attached patch introduces the prefix operator ^@ for the text type. For
'a ^@ b' it returns true if 'a' starts with 'b'. There is also spgist index
support for this operator.

It could be useful as an alternative to LIKE for 'something%' patterns, or
even be used in LIKE queries instead of ~>=~ and ~<~ in the
future. But that would require a new btree strategy.

-- 
---
Ildus Kurbangaliev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
diff --git a/src/backend/access/spgist/spgtextproc.c b/src/backend/access/spgist/spgtextproc.c
index f156b2166e..5beb9e33a1 100644
--- a/src/backend/access/spgist/spgtextproc.c
+++ b/src/backend/access/spgist/spgtextproc.c
@@ -496,7 +496,7 @@ spg_text_inner_consistent(PG_FUNCTION_ARGS)
 			 * well end with a partial multibyte character, so that applying
 			 * any encoding-sensitive test to it would be risky anyhow.)
 			 */
-			if (strategy > 10)
+			if (strategy > 10 && strategy != RTPrefixStrategyNumber)
 			{
 if (collate_is_c)
 	strategy -= 10;
@@ -526,6 +526,10 @@ spg_text_inner_consistent(PG_FUNCTION_ARGS)
 	if (r < 0)
 		res = false;
 	break;
+case RTPrefixStrategyNumber:
+	if (r != 0)
+		res = false;
+	break;
 default:
 	elog(ERROR, "unrecognized strategy number: %d",
 		 in->scankeys[j].sk_strategy);
@@ -601,10 +605,21 @@ spg_text_leaf_consistent(PG_FUNCTION_ARGS)
 	for (j = 0; j < in->nkeys; j++)
 	{
 		StrategyNumber strategy = in->scankeys[j].sk_strategy;
-		text	   *query = DatumGetTextPP(in->scankeys[j].sk_argument);
-		int			queryLen = VARSIZE_ANY_EXHDR(query);
+		text	   *query;
+		int			queryLen;
 		int			r;
 
+		/* for this strategy collation is not important */
+		if (strategy == RTPrefixStrategyNumber)
+		{
+			res = DatumGetBool(DirectFunctionCall2(text_startswith,
+out->leafValue, in->scankeys[j].sk_argument));
+			goto next;
+		}
+
+		query = DatumGetTextPP(in->scankeys[j].sk_argument);
+		queryLen = VARSIZE_ANY_EXHDR(query);
+
 		if (strategy > 10)
 		{
 			/* Collation-aware comparison */
@@ -655,6 +670,7 @@ spg_text_leaf_consistent(PG_FUNCTION_ARGS)
 break;
 		}
 
+next:
 		if (!res)
 			break;/* no need to consider remaining conditions */
 	}
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index fcc8323f62..be91082b56 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -1315,8 +1315,18 @@ patternsel(PG_FUNCTION_ARGS, Pattern_Type ptype, bool negate)
 	 * right cache key, so that they can be re-used at runtime.
 	 */
 	patt = (Const *) other;
-	pstatus = pattern_fixed_prefix(patt, ptype, collation,
-   , _selec);
+
+	if (ptype == Pattern_Type_Prefix)
+	{
+		char	*s = TextDatumGetCString(constval);
+		prefix = string_to_const(s, vartype);
+		pstatus = Pattern_Prefix_Partial;
+		rest_selec = 1.0;	/* all */
+		pfree(s);
+	}
+	else
+		pstatus = pattern_fixed_prefix(patt, ptype, collation,
+	   , _selec);
 
 	/*
 	 * If necessary, coerce the prefix constant to the right type.
@@ -1486,6 +1496,16 @@ likesel(PG_FUNCTION_ARGS)
 }
 
 /*
+ *		prefixsel			- selectivity of prefix operator
+ */
+Datum
+prefixsel(PG_FUNCTION_ARGS)
+{
+	PG_RETURN_FLOAT8(patternsel(fcinfo, Pattern_Type_Prefix, false));
+}
+
+/*
+ *
  *		iclikesel			- Selectivity of ILIKE pattern match.
  */
 Datum
@@ -2904,6 +2924,15 @@ likejoinsel(PG_FUNCTION_ARGS)
 	PG_RETURN_FLOAT8(patternjoinsel(fcinfo, Pattern_Type_Like, false));
 }
 
+/*
+ *		prefixjoinsel			- Join selectivity of prefix operator
+ */
+Datum
+prefixjoinsel(PG_FUNCTION_ARGS)
+{
+	PG_RETURN_FLOAT8(patternjoinsel(fcinfo, Pattern_Type_Prefix, false));
+}
+
 /*
  *		iclikejoinsel			- Join selectivity of ILIKE pattern match.
  */
diff --git a/src/backend/utils/adt/varlena.c b/src/backend/utils/adt/varlena.c
index 304cb26691..050875e0bb 100644
--- a/src/backend/utils/adt/varlena.c
+++ b/src/backend/utils/adt/varlena.c
@@ -1762,6 +1762,34 @@ text_ge(PG_FUNCTION_ARGS)
 	PG_RETURN_BOOL(result);
 }
 
+Datum
+text_startswith(PG_FUNCTION_ARGS)
+{
+	Datum		arg1 = PG_GETARG_DATUM(0);
+	Datum		arg2 = PG_GETARG_DATUM(1);
+	bool		result;
+	Size		len1,
+len2;
+
+	len1 = toast_raw_datum_size(arg1);
+	len2 = toast_raw_datum_size(arg2);
+	if (len2 > len1)
+		result = false;
+	else
+	{
+		text	   *targ1 = DatumGetTextPP(arg1);
+		text	   *targ2 = DatumGetTextPP(arg2);
+
+		result = (memcmp(VARDATA_ANY(targ1), VARDATA_ANY(targ2),
+		 len2 - VARHDRSZ) == 0);
+
+		PG_FREE_IF_COPY(targ1, 0);
+		PG_FREE_IF_COPY(targ2, 1);
+	}
+
+	PG_RETURN_BOOL(result);
+}
+
 Datum
 bttextcmp(PG_FUNCTION_ARGS)
 {
diff --git a/src/include/access/stratnum.h b/src/include/access/stratnum.h
index bddfac4c10..0db11a1117 100644
--- a/src/include/access/stratnum.h
+++ b/src/include/access/stratnum.h
@@ -68,8 +68,9 @@ typedef uint16 StrategyNumber;
 #define RTSubEqualStrategyNumber		25	/* for inet <<= */
 #define RTSuperStrategyNumber			26	/* for inet >> */
 #define RTSuperEqualStrategyNumber		27	/* for 

Re: [HACKERS] path toward faster partition pruning

2018-02-02 Thread Robert Haas
On Fri, Feb 2, 2018 at 1:04 AM, Amit Langote
 wrote:
> Your proposed cleanup sounds much better, so I implemented it in the
> attached updated 0001, while dropping the previously proposed
> PartitionBoundCmpArg approach.
>
> Updated set of patches attached (patches 0002 onward mostly unchanged,
> except I incorporated the delta patch posted by David upthread).

Committed 0001.  Thanks.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Re: Improve OR conditions on joined columns (common star schema problem)

2018-02-02 Thread Tom Lane
Tomas Vondra  writes:
> ISTM this patch got somewhat stuck as we're not quite sure the
> transformation is correct in all cases. Is my impression correct?

Yeah, that's the core issue.

> If yes, how do we convince ourselves? Would some sort of automated testing
> (generating data and queries) help? I'm willing to spend some cycles on
> that, if considered helpful.

I'm not sure if that would be enough to convince doubters.  On the other
hand, if it found problems, that would definitely be useful.

regards, tom lane



Re: [HACKERS] [PATCH] Vacuum: Update FSM more frequently

2018-02-02 Thread Claudio Freire
On Thu, Feb 1, 2018 at 9:34 PM, Masahiko Sawada  wrote:
> On Mon, Jan 29, 2018 at 11:31 PM, Claudio Freire  
> wrote:
>> On Mon, Jan 29, 2018 at 4:12 AM, Masahiko Sawada  
>> wrote:
>>> On Sat, Jul 29, 2017 at 9:42 AM, Claudio Freire  
>>> wrote:
 Introduce a tree pruning threshold to FreeSpaceMapVacuum that avoids
 recursing into branches that already contain enough free space, to
 avoid having to traverse the whole FSM and thus induce quadratic
 costs. Intermediate FSM vacuums are only supposed to make enough
 free space visible to avoid extension until the final (non-partial)
 FSM vacuum.
>>>
>>> Hmm, I think this resolves a part of the issue. How about calling
>>> AutoVacuumRequestWork() in PG_CATCH() if VACOPT_VACUUM is specified
>>> and give the relid that we were vacuuming but could not complete as a
>>> new autovacuum work-item? The new autovacuum work-item makes the
>>> worker vacuum FSMs of the given relation and its indices.
>>
>> Well, I tried that in fact, as I mentioned in the OP.
>>
>> I abandoned it due to the conjunction of the 2 main blockers I found
>> and mentioned there. In essence, those issues defeat the purpose of
>> the patch (to get the free space visible ASAP).
>>
>> Don't forget, this is aimed at cases where autovacuum of a single
>> relation takes a very long time. That is, very big relations. Maybe
>> days, like in my case. A whole autovacuum cycle can take weeks, so
>> delaying FSM vacuum that much is not good, and using work items still
>> cause those delays, not to mention the segfaults.
>
> Yeah, I agree with vacuuming the FSM more frequently, because it can
> prevent table bloat caused by concurrent modifications. But considering
> how to prevent table bloat caused by cancellation of autovacuum, I guess
> we need more. This proposal seems to give us an ability whereby we
> "might be" able to prevent table bloat due to cancellation of
> autovacuum. Since we can know that autovacuum was cancelled, I'd like to
> have a way to make sure the FSM is vacuumed even if the vacuum is
> cancelled. Thoughts?

After autovacuum gets cancelled, the next time it wakes up it will
retry vacuuming the cancelled relation. That's because a cancelled
autovacuum doesn't update the last-vacuumed stats.

So the timing between an autovacuum work item and the next retry for
that relation is more or less an autovacuum nap time, except perhaps
in the case where many vacuums get cancelled, and they have to be
queued.

That's why an initial FSM vacuum makes sense. It has a similar timing
to the autovacuum work item, it has the benefit that it can be
triggered manually (manual vacuum), and it's cheap and efficient.

> Also, the patch always vacuums the FSM at the beginning of the vacuum
> with a threshold, but that is useless if the table has already been
> properly vacuumed. I don't think it's a good idea to add an additional
> step that "might be" efficient, because the current vacuum is already
> heavy.

FSMs are several orders of magnitude smaller than the tables
themselves. A TB-sized table I have here has a 95M FSM. If you add
threshold skipping, that initial FSM vacuum *will* be efficient.

By definition, the FSM will be less than 1/4000th of the table, so
that initial FSM pass takes less than 1/4000th of the whole vacuum.
Considerably less considering the simplicity of the task.

> On detail of the patch,
>
> --- a/src/backend/storage/freespace/indexfsm.c
> +++ b/src/backend/storage/freespace/indexfsm.c
> @@ -70,5 +70,5 @@ RecordUsedIndexPage(Relation rel, BlockNumber usedBlock)
>  void
>  IndexFreeSpaceMapVacuum(Relation rel)
>  {
> -  FreeSpaceMapVacuum(rel);
> +  FreeSpaceMapVacuum(rel, 0);
>  }
>
> @@ -816,11 +820,19 @@ fsm_vacuum_page(Relation rel, FSMAddress addr,
> bool *eof_p)
>   {
>  int child_avail;
>
> +/* Tree pruning for partial vacuums */
> +if (threshold)
> +{
> +   child_avail = fsm_get_avail(page, slot);
> +   if (child_avail >= threshold)
> +  continue;
> +}
>
> Don't we skip all fsm pages if we set the threshold to 0?

No, that doesn't skip anything when the threshold is 0, since it never
reaches the continue, which is what skips nodes.



Re: Wait for parallel workers to attach

2018-02-02 Thread Robert Haas
On Thu, Feb 1, 2018 at 9:48 PM, Amit Kapila  wrote:
> On Thu, Feb 1, 2018 at 9:09 PM, Robert Haas  wrote:
>> On Wed, Jan 31, 2018 at 10:08 PM, Amit Kapila  
>> wrote:
>>> I think suggesting to use this API to wait "for a specific worker"
>>> doesn't seem like a good idea as it doesn't have any such provision.
>>
>> I see your point, but in the absence of a more specific API it could
>> be used that way, and it wouldn't be unreasonable.  Just might wait a
>> little longer than absolutely necessary.
>
> Fair enough, you can proceed with the patch.

Committed.  Now, on to the main event!

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Partition-wise aggregation/grouping

2018-02-02 Thread Robert Haas
On Fri, Feb 2, 2018 at 8:25 AM, Jeevan Chalke
 wrote:
>> The problem is that create_partition_agg_paths() is doing *exactly*
>> same thing that add_paths_to_grouping_rel() is already doing inside
>> the blocks that say if (grouped_rel->partial_pathlist).  We don't need
>> two copies of that code.  Both of those places expect to take a
>> partial path that has been partially aggregated and produce a
>> non-partial path that is fully aggregated.  We do not need or want two
>> copies of that code.
>
> OK. Got it.
>
> Will try to find a common place for them and will also check how it goes
> with your suggested design change.
>
>> Here's another way to look at it.  We have four kinds of things.
>>
>> 1. Partially aggregated partial paths
>> 2. Partially aggregated non-partial paths
>> 3. Fully aggregated partial paths
>> 4. Fully aggregated non-partial paths

So in the new scheme I'm proposing, you've got a partially_grouped_rel
and a grouped_rel.  So all paths of type #1 go into
partially_grouped_rel->partial_pathlist, paths of type #2 go into
partially_grouped_rel->pathlist, type #3 (if we have any) goes into
grouped_rel->partial_pathlist, and type #4 goes into
grouped_rel->pathlist.

> add_paths_to_append_rel() -> generate_mergeappend_paths() does not consider
> the partial_pathlist. Thus we will never see MergeAppend over a parallel
> scan given by the partial_pathlist. And thus a plan like:
> -> Gather Merge
>   -> MergeAppend
> is not possible with current HEAD.
>
> Are you suggesting we should implement that here? I think that itself is a
> separate task.

Oh, I didn't realize that wasn't working already.  I agree that it's a
separate task from this patch, but it's really too bad that it doesn't
already work.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] GnuTLS support

2018-02-02 Thread Robert Haas
On Thu, Feb 1, 2018 at 5:08 AM, Christoph Berg  wrote:
> Re: Peter Eisentraut 2018-01-03 
> <99680dba-cf63-8151-1de2-46ca93897...@2ndquadrant.com>
>> One scenario is that if GnuTLS goes in, it's quite plausible that the
>> PG11 packages for Debian and Ubuntu will use it by default.  But if it
>> doesn't support tls-server-endpoint, then a JDBC client (assuming
>> channel binding support is added) can't connect to such a server with
>> SCRAM authentication over SSL (which we hope will be a popular
>> configuration), unless they manually disable channel binding altogether
>> using the new scramchannelbinding connection option.  That would be a
>> very poor experience.
>
> GnuTLS support would mean that Debian could finally link psql against
> libreadline (instead of just LD_PRELOADing it at runtime) because
> there's no OpenSSL license conflict anymore. But I'm only going to do
> that switch if there's no visible incompatibilities for clients, and
> even any server-side GUC name changes would need a damn good
> justification because they make upgrades harder. The LD_PRELOAD hack
> in psql works, there's no pressing urgency to remove it.

Yeah.  The original justification at top-of-thread for supporting
GnuTLS was licensing.  I am not a lawyer and I don't have an opinion
on how much of a licensing problem there is around OpenSSL, but if
somebody thinks (as they evidently do, 'cuz this patch exists) that
there's a problem there and is willing to do the work to get it fixed,
I think that's great.  Even the perception of a legal problem can
hinder adoption, and our goal is to get people to use PostgreSQL.  Or
at least, I think that should be our goal.

I don't expect that to generate a lot of work for us because, just as
we insist that people who want obscure operating systems supported
should help by arranging for buildfarm members, so too we should
insist as part of supporting GnuTLS that someone provide buildfarm
members for it.  If those buildfarm members break and don't get fixed,
then we'll have to consider removing GnuTLS support.  Of course, this
arrangement doesn't guarantee that it's going to be zero work for us,
just as people who primarily work on UNIX-like systems still have to
worry somewhat about Microsoft Windows.  But it keeps the effort
pretty manageable.  I think that's likely to be even more true for
GnuTLS, because there's a huge amount of code that can depend on
pointer width and endian-ness, whereas it seems unlikely that anything
other than SSL patches will need to care about GnuTLS.  And there's
just not that many of those.  The biggest risk seems to be that new
SSL-related features someone wants to add in the future would need to
be made to work with every SSL implementation, and I think that could
indeed be an annoyance, but I don't think we'll really know how much
of an annoyance until we try it.  It's not like we can't rip support
out again if it proves to be a huge problem (in contrast to a feature
like multixacts or partitioning, which you can't remove without
breaking upgrades).

Also, I have to admit that my experiences with OpenSSL have been
mostly negative.  The basic stuff works and is clearly documented, but
more complicated things, at least at the time I looked at it, were not
clearly documented or not documented at all, and I ended up reading
uncommented code to try to figure out how it was supposed to work.  I
don't think it's a bad thing if we do our bit to contribute to a
little competition among implementations.  I'm not sure that gcc was
worrying too much about their error message quality until llvm came
along, but I bet they're thinking about it now.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Partition-wise aggregation/grouping

2018-02-02 Thread Jeevan Chalke
On Fri, Feb 2, 2018 at 1:41 AM, Robert Haas  wrote:

> On Thu, Feb 1, 2018 at 8:59 AM, Jeevan Chalke
>  wrote:
> > I wrote a patch for this (on current HEAD) and attached separately here.
> > Please have a look.
>
> Yes, this is approximately what I had in mind, though it needs more
> work (e.g. it doesn't remove the clearing of the grouped_rel's
> partial_pathlist, which should no longer be necessary; also, it needs
> substantial comment updates).
>

That was just a quick patch to make sure this is what you meant.
Yes, it needs some more work as you suggested, plus comment updates.


>
> >> 1. Parallel partition-wise aggregate, grouping key doesn't contain
> >> partition key.  This should just be a matter of adding additional
> >> Append paths to partially_grouped_rel->partial_pathlist.  The existing
> >> code already knows how to stick a Gather and FinalizeAggregate step on
> >> top of that, and I don't see why that logic would need any
> >> modification or addition.  An Append of child partial-grouping paths
> >> should be producing the same output as a partial grouping of an
> >> Append, except that the former case might produce more separate groups
> >> that need to be merged; but that should be OK: we can just throw all
> >> the paths into the same path list and let the cheapest one win.
> >
> > For any partial aggregation we need to add finalization step after we are
> > done with the APPEND i.e. post add_paths_to_append_rel(). Given that we
> need
> > to replicate the logic of sticking Gather and FinalizeAggregate step at
> > later stage. This is what exactly done in create_partition_agg_paths().
> > Am I missing something here?
>
> The problem is that create_partition_agg_paths() is doing *exactly*
> same thing that add_paths_to_grouping_rel() is already doing inside
> the blocks that say if (grouped_rel->partial_pathlist).  We don't need
> two copies of that code.  Both of those places expect to take a
> partial path that has been partially aggregated and produce a
> non-partial path that is fully aggregated.  We do not need or want two
> copies of that code.
>

OK. Got it.

Will try to find a common place for them and will also check how it goes
with your suggested design change.


>
> Here's another way to look at it.  We have four kinds of things.
>
> 1. Partially aggregated partial paths
> 2. Partially aggregated non-partial paths
> 3. Fully aggregated partial paths
> 4. Fully aggregated non-partial paths
>
> The current code only ever generates paths of type #1 and #4; this
> patch will add paths of type #2 as well, and maybe also type #3.  But
> the way you've got it, the existing paths of type #1 go into the
> grouping_rel's partial_pathlist, and the new paths of type #1 go into
> the OtherUpperPathExtraData's partial_paths list.  Maybe there's a
> good reason why we should keep them separate, but I'm inclined to
> think they should all be going into the same list.
>

The new paths are specific to partition-wise aggregates, so I thought it
better to keep them separate without interfering with the grouped_rel
pathlist/partial_pathlist. And as you said, I didn't find a better place
than its own structure.


> >> Overall, what I'm trying to argue for here is making this feature look
> >> less like its own separate thing and more like part of the general
> >> mechanism we've already got: partial paths would turn into regular
> >> paths via generate_gather_paths(), and partially aggregated paths
> >> would turn into fully aggregated paths by adding FinalizeAggregate.
> >> The existing special case that allows us to build a non-partial, fully
> >> aggregated path from a partial, partially-aggregated path would be
> >> preserved.
> >>
> >> I think this would probably eliminate some other problems I see in the
> >> existing design as well.  For example, create_partition_agg_paths
> >> doesn't consider using Gather Merge, but that might be a win.
> >
> > An Append path is always unsorted and has no pathkeys. Thus Gather Merge
> > over an Append path seems infeasible, doesn't it?
>
> We currently never generate an Append path with pathkeys, but we do
> generate MergeAppend paths with pathkeys, as in the following example:
>
> rhaas=# create table foo (a int, b text) partition by range (a);
> CREATE TABLE
> rhaas=# create index on foo (a);
> CREATE INDEX
> rhaas=# create table foo1 partition of foo for values from (0) to
> (100);
> CREATE TABLE
> rhaas=# create table foo2 partition of foo for values from (100)
> to (200);
> CREATE TABLE
> rhaas=# select * from foo foo order by a;
>  a | b
> ---+---
> (0 rows)
> rhaas=# explain select * from foo foo order by a;
>QUERY PLAN
> 
> 
>  Merge Append  (cost=0.32..145.47 rows=2540 width=36)
>Sort Key: foo.a
>->  Index Scan using foo1_a_idx on foo1 foo  

Re: [HACKERS] Re: Improve OR conditions on joined columns (common star schema problem)

2018-02-02 Thread Tomas Vondra
On 01/04/2018 11:50 PM, Tom Lane wrote:
> I wrote:
>> Jim Nasby  writes:
>>> I've verified that the patch still applies and make check-world is clean.
> 
>> Not any more :-(.  Here's a v3 rebased over HEAD.  No substantive
>> change from v2.
> 
> Again rebased up to HEAD; still no substantive changes.
> 

ISTM this patch got somewhat stuck as we're not quite sure the
transformation is correct in all cases. Is my impression correct? If
yes, how do we convince ourselves? Would some sort of automated testing
(generating data and queries) help? I'm willing to spend some cycles on
that, if considered helpful.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Can ICU be used for a database's default sort order?

2018-02-02 Thread Andrey Borodin
Hello!

> On 1 Feb 2018, at 19:09, Peter Eisentraut  wrote:
> 
> On 1/31/18 11:48, Vladimir Borodin wrote:
>> Will it work only for a particular database? Or for a whole cluster
>> during initdb also? Any chance to get this done in 11?
> 
> I'm currently not working on it.
> 
> It's basically just a lot of leg work, and you need to come up with a
> catalog representation.  Possible options have already been addressed in
> earlier threads.


I can try to do this before the next CF. But ISTM that EDB and Postgres Pro 
already have various flavours of a similar feature. Maybe they are planning 
to publish that?

Best regards, Andrey Borodin.


Re: [HACKERS] Add support for tuple routing to foreign partitions

2018-02-02 Thread Etsuro Fujita

(2018/01/25 23:33), Stephen Frost wrote:

I'm afraid a good bit of this patch is now failing to apply.  I don't
have much else to say except to echo the performance concern that Amit
Langote raised about expanding the inheritance tree twice.


To address that concern, I'm thinking to redesign the patch so that it 
wouldn't expand the tree at planning time anymore.  I don't have any 
clear solution for that yet, but what I have in mind now is to add new 
FDW APIs to the executor, instead, so that the FDW could 1) create stuff 
such as a query for remote INSERT as PlanForeignModify and 2) 
initialize/end the remote INSERT operation as BeginForeignModify and 
EndForeignModify, somewhere in the executor.  (For #1, I'm thinking to 
add an API for that to ExecSetupPartitionTupleRouting or 
ExecInitPartitionResultRelInfo proposed by the patch by Amit Langote 
[1].)  Anyway, I'll work on this after reviewing that patch, so I'll 
mark this as Returned with feedback.


Thanks for the comment!

Best regards,
Etsuro Fujita

[1] https://commitfest.postgresql.org/16/1415/



Re: [HACKERS] Adding column_constraint description in ALTER TABLE synopsis

2018-02-02 Thread Stephen Frost
Greetings,

* Amit Langote (langote_amit...@lab.ntt.co.jp) wrote:
> On 2018/01/23 8:57, Thomas Munro wrote:
> > On Tue, Jan 23, 2018 at 12:41 PM, Thomas Munro
> >  wrote:
> >> On Mon, Jan 15, 2018 at 2:32 PM, Stephen Frost  wrote:
> >>> If someone else would like to review it, that'd be great, otherwise I'll
> >>> probably get it committed soon.
> >>
> >> FYI the v2 doesn't build:
> >>
> >> ref/alter_table.sgml:135: parser error : Opening and ending tag
> >> mismatch: refentry line 6 and synopsis
> >> 
> > 
> > Here's an update to move the new stuff to the correct side of the
> > closing "".  Builds for me, and the typesetting looks OK.
> > I'm not sure why the preexisting bogo-productions have inconsistent
> > indentation levels (looking at table_constraint_using_index) but
> > that's not this patch's fault.
> 
> Thanks for fixing that.
> 
> I noticed that partition_bound_spec in the patch is missing hash partition
> bound syntax that was added after the original patch was written.  Fixed
> in the attached.

I've pushed this with just a bit of re-ordering to match the order in
which things show up in the synopsis.

Thanks!

Stephen




Re: [PoC PATCH] Parallel dump to /dev/null

2018-02-02 Thread Vladimir Borodin
Hi.

> On 2 Feb 2018, at 13:25, Andrey Borodin  wrote:
> 
>> Do folks here think that speedups would be worth it?
> I've seen backup verification via COPY TO /dev/null in production. These 
> copies were parallelized on the side of verification script.
> Not sure this counts for feature or against it.

This sounds like a point in favor of the feature, because the use of COPY 
there is involuntary. Having this in pg_dump would make our scripts for 
checking backups simpler.

--
May the force be with you…
https://simply.name



Re: [HACKERS] Add support for tuple routing to foreign partitions

2018-02-02 Thread Etsuro Fujita

(2018/02/02 19:33), Etsuro Fujita wrote:

(2018/01/25 23:33), Stephen Frost wrote:

I'm afraid a good bit of this patch is now failing to apply. I don't
have much else to say except to echo the performance concern that Amit
Langote raised about expanding the inheritance tree twice.


To address that concern, I'm thinking to redesign the patch so that it
wouldn't expand the tree at planning time anymore. I don't have any
clear solution for that yet, but what I have in mind now is to add new
FDW APIs to the executor, instead, so that the FDW could 1) create stuff
such as a query for remote INSERT as PlanForeignModify and 2)
initialize/end the remote INSERT operation as BeginForeignModify and
EndForeignModify, somewhere in the executor. (For #1, I'm thinking to
add an API for that to ExecSetupPartitionTupleRouting or
ExecInitPartitionResultRelInfo proposed by the patch by Amit Langote
[1].) Anyway, I'll work on this after reviewing that patch, so I'll mark
this as Returned with feedback.


CORRECTION: I'm planning to submit a new version to the March CF, so I 
set the status to "Moved to next CF".


Best regards,
Etsuro Fujita



Re: JIT compiling with LLVM v9.1

2018-02-02 Thread Pierre Ducroquet
On Friday, February 2, 2018 10:48:16 AM CET Pierre Ducroquet wrote:
> On Monday, January 29, 2018 10:53:50 AM CET Andres Freund wrote:
> > Hi,
> > 
> > On 2018-01-23 23:20:38 -0800, Andres Freund wrote:
> > > == Code ==
> > > 
> > > As the patchset is large (500kb) and I'm still quickly evolving it, I do
> > > not yet want to attach it. The git tree is at
> > > 
> > >   https://git.postgresql.org/git/users/andresfreund/postgres.git
> > > 
> > > in the jit branch
> > > 
> > >   https://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a
> > >   =s
> > >   hortlog;h=refs/heads/jit
> > 
> > I've just pushed an updated and rebased version of the tree:
> > - Split the large "jit infrastructure" commits into a number of smaller
> > 
> >   commits
> > 
> > - Split the C++ file
> > - Dropped some of the performance stuff done to heaptuple.c - that was
> > 
> >   mostly to make performance comparisons a bit more interesting, but
> >   doesn't seem important enough to deal with.
> > 
> > - Added a commit renaming datetime.h symbols so they don't conflict with
> > 
> >   LLVM variables anymore, removing ugly #undef PM/#define PM dance
> >   around includes. Will post separately.
> > 
> > - Reduced the number of pointer constants in the generated LLVM IR, by
> > 
> >   doing more getelementptr accesses (stem from before the time types
> >   were automatically synced)
> > 
> > - Increased number of comments a bit
> > 
> > There's a jit-before-rebase-2018-01-29 tag, for the state of the tree
> > before the rebase.
> > 
> > Regards,
> > 
> > Andres
> 
> Hi
> 
> I have successfully built the JIT branch against LLVM 4.0.1 on Debian
> testing. This is not enough for Debian stable (LLVM 3.9 is the latest
> available there), but it's a first step.
> I've split the patch in four files. The first three fix the build issues,
> the last one fixes a runtime issue.
> I think they are small enough to not be a burden for you in your
> developments. But if you don't want to carry these ifdefs right now, I
> maintain them in a branch on a personal git and rebase as frequently as I
> can.
> 
> LLVM 3.9 support isn't going to be hard, but I prefer splitting. I also hope
> this will help more people test this wonderful toy… :)
> 
> Regards
> 
>  Pierre

For LLVM 3.9, only small changes were needed.
I've attached the patches to this email.
I only did very basic, primitive testing, but it seems to work.
I'll do more testing in the next days.

 Pierre

From 5ca9594a7f52b7daab8562293010fe8c807107ee Mon Sep 17 00:00:00 2001
From: Pierre 
Date: Fri, 2 Feb 2018 11:29:45 +0100
Subject: [PATCH 1/2] Fix building with LLVM 3.9

---
 src/backend/lib/llvmjit_inline.cpp | 23 +--
 1 file changed, 21 insertions(+), 2 deletions(-)

diff --git a/src/backend/lib/llvmjit_inline.cpp b/src/backend/lib/llvmjit_inline.cpp
index 8a747cbfc0..a785261bea 100644
--- a/src/backend/lib/llvmjit_inline.cpp
+++ b/src/backend/lib/llvmjit_inline.cpp
@@ -37,7 +37,12 @@ extern "C"
 #include 
 #include 
 #include 
+#if LLVM_MAJOR_VERSION > 3
 #include 
+#else
+#include "llvm/Bitcode/ReaderWriter.h"
+#include "llvm/Support/Error.h"
+#endif
 #include 
 #include 
 #include 
@@ -100,7 +105,12 @@ llvm_inline(LLVMModuleRef M)
 	llvm_execute_inline_plan(mod, globalsToInline.get());
 }
 
-#if LLVM_VERSION_MAJOR < 5
+#if LLVM_VERSION_MAJOR < 4
+bool operator!(const llvm::ValueInfo ) {
+	return !(  (vi.Kind == llvm::ValueInfo::VI_GUID && vi.TheValue.Id)
+		|| (vi.Kind == llvm::ValueInfo::VI_Value && vi.TheValue.V));
+}
+#elif LLVM_VERSION_MAJOR < 5
 bool operator!(const llvm::ValueInfo ) {
 	return !(  (vi.Kind == llvm::ValueInfo::VI_GUID && vi.TheValue.Id)
 		|| (vi.Kind == llvm::ValueInfo::VI_Value && vi.TheValue.GV));
@@ -188,12 +198,15 @@ llvm_build_inline_plan(llvm::Module *mod)
  funcName.data(),
  modPath.data());
 
+// XXX Missing in LLVM < 4.0 ?
+#if LLVM_VERSION_MAJOR > 3
 			if (gvs->notEligibleToImport())
 			{
 elog(DEBUG1, "uneligible to import %s due to summary",
 	 funcName.data());
 continue;
 			}
+#endif
 
 			if ((int) fs->instCount() > threshold)
 			{
@@ -339,8 +352,10 @@ llvm_execute_inline_plan(llvm::Module *mod, ImportMapTy *globalsToInline)
 
 #if LLVM_VERSION_MAJOR > 4
 #define IRMOVE_PARAMS , /*IsPerformingImport=*/false
-#else
+#elif LLVM_VERSION_MAJOR > 3
 #define IRMOVE_PARAMS , /*LinkModuleInlineAsm=*/false, /*IsPerformingImport=*/false
+#else
+#define IRMOVE_PARAMS
 #endif
 		if (Mover.move(std::move(importMod), GlobalsToImport.getArrayRef(),
 	   [](llvm::GlobalValue &, llvm::IRMover::ValueAdder) {}
@@ -648,7 +663,11 @@ llvm_load_index(void)
 			if (e)
 elog(ERROR, "could not load summary at %s", subpath);
 #else
+#if LLVM_VERSION_MAJOR > 3
 			std::unique_ptr subindex = std::move(llvm::getModuleSummaryIndex(ref).get());
+#else
+			std::unique_ptr subindex = std::move(llvm::getModuleSummaryIndex(ref, [](const llvm::DiagnosticInfo &) {}).get());
+#endif
 			if 

Re: non-bulk inserts and tuple routing

2018-02-02 Thread Etsuro Fujita

(2018/01/30 18:52), Etsuro Fujita wrote:

(2018/01/30 18:39), Amit Langote wrote:

Will wait for your comments before sending a new version then.


Ok, I'll post my comments as soon as possible.


* ExecInitPartitionResultRelInfo is called from ExecFindPartition, but 
we could call it another way: in ExecInsert/CopyFrom, call it after 
ExecFindPartition when the partition chosen by ExecFindPartition has 
not been initialized yet.  Maybe either would be OK, but I prefer the 
latter, because I think it would not only divide the labor better but 
also fit better into the existing code in ExecInsert/CopyFrom IMO.


* In ExecInitPartitionResultRelInfo:
+   /*
+* Note that the entries in this list appear in no predetermined
+* order as result of initializing partition result rels as and when
+* they're needed.
+*/
+   estate->es_leaf_result_relations =
+       lappend(estate->es_leaf_result_relations,
+               leaf_part_rri);

Is it OK to put this within the "if (leaf_part_rri == NULL)" block?

* In the same function:
+   /*
+* Verify result relation is a valid target for INSERT.
+*/
+   CheckValidResultRel(leaf_part_rri, CMD_INSERT);

I think it would be better to leave the previous comments as-is here:

    /*
     * Verify result relation is a valid target for an INSERT.  An UPDATE
     * of a partition-key becomes a DELETE+INSERT operation, so this check
     * is still required when the operation is CMD_UPDATE.
     */

* ExecInitPartitionResultRelInfo does more work than initializing the 
ResultRelInfo for the chosen partition (eg, it creates a tuple 
conversion map to convert a tuple routed to the partition from the 
parent's row type to the partition's).  So I'd propose renaming that 
function to eg, ExecInitPartition.


* CopyFrom is modified so that it calls ExecSetupPartitionTupleRouting 
and ExecFindPartition with a mostly-dummy ModifyTableState node.  I'm 
not sure that is a good idea.  My concern is that it might become a 
headache in future development.


* Patches 0001 and 0002 are pretty small but can't be reviewed without 
patch 0003.  The total size of the three patches isn't that large, 
so I think it would be better to combine them into a single patch.


That's all for now.  I'll continue to review the patches!

Best regards,
Etsuro Fujita
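
For illustration, the initialize-on-first-use scheme being discussed (set up a
partition's ResultRelInfo only when a tuple is first routed to it, appending it
to es_leaf_result_relations in no predetermined order) is a generic lazy-cache
pattern. A minimal Python sketch of just that pattern -- all names here are
hypothetical stand-ins, not PostgreSQL's actual structures:

```python
class PartitionRouter:
    """Toy model of routing tuples to partitions with lazy initialization."""

    def __init__(self, partitions):
        self.partitions = partitions      # partition key -> partition name
        self._initialized = {}            # key -> per-partition state, built lazily
        self.leaf_result_rels = []        # filled in first-use order, like
                                          # estate->es_leaf_result_relations

    def _init_partition(self, key):
        # Analogue of ExecInitPartitionResultRelInfo: build the state once,
        # record it in the estate-wide list, and cache it for later routings.
        state = {"name": self.partitions[key]}
        self.leaf_result_rels.append(state)
        self._initialized[key] = state
        return state

    def route(self, key):
        # Analogue of ExecFindPartition followed by on-demand initialization.
        state = self._initialized.get(key)
        if state is None:                 # the "if (leaf_part_rri == NULL)" case
            state = self._init_partition(key)
        return state

router = PartitionRouter({1: "foo1", 2: "foo2"})
router.route(2)
router.route(2)                           # second routing reuses the cached state
router.route(1)
print([s["name"] for s in router.leaf_result_rels])  # ['foo2', 'foo1']
```

Whether the lazy step lives inside route() or is done by the caller after
route() returns is exactly the division-of-labor question raised above.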



Re: JIT compiling with LLVM v9.0

2018-02-02 Thread Andres Freund
Hi,

On 2018-02-02 18:22:34 +1300, Thomas Munro wrote:
> The clang that was used for bitcode was the system /usr/bin/clang,
> version 4.0.  Is it a problem that I used that for compiling the
> bitcode, but LLVM5 for JIT?  I actually tried
> CLANG=/usr/local/llvm50/bin/clang but ran into weird failures I
> haven't got to the bottom of at ThinLink time so I couldn't get as far
> as a running system.

You're using thinlto to compile pg? Could you provide what you pass to
configure for that? IIRC I tried that a while ago and ran into some
issues with us creating archives (libpgport, libpgcommon).

Greetings,

Andres Freund



Re: [HACKERS] path toward faster partition pruning

2018-02-02 Thread Amit Langote
Hi David.

On 2018/02/01 8:57, David Rowley wrote:
> On 31 January 2018 at 21:03, Amit Langote  
> wrote:
>> Update patch set attached.  Thanks again.
> 
> (My apologies for being slow to respond here. I've been on leave this
> week and I'm off again next week. I now have a little time to reply)

No worries.

> Thanks for incorporating my changes into the patchset. A while ago I
> was rebasing the run-time pruning patch on top of this but ran into a
> few problems which are all results of my changes.
> 
> 1. remove_redundant_clauses collapses the PartClause list into the
> most restrictive set of clauses. This disallows multiple evaluations
> of the PartitionClauseInfo during runtime pruning. I've written a
> proposed fix for this and attached it.

I've incorporated it in the latest patch I posted today.

> 2. PartitionClauseInfo->keyclauses is a list of PartClause which is
> not a node type. This will cause _copyPartitionClauseInfo() to fail.
> 
> I'm still not quite sure the best way to fix #2 since PartClause
> contains a FmgrInfo. I do have a local fix which moves PartClause to
> primnodes.h and makes it a proper node type. I also added a copy
> function which does not copy any of the cache fields in PartClause. It
> just sets valid_cache to false. I didn't particularly think this was
> the correct fix. I just couldn't think of how exactly this should be
> done at the time.
> 
> The attached patch also adds the missing nodetag from
> PartitionClauseInfo and also fixes up other code so as we don't memset
> the node memory to zero, as that would destroy the tag. I ended up
> just having extract_partition_key_clauses do the makeNode call. This
> also resulted in populate_partition_clauses being renamed to
> generate_partition_clauses

I started wondering whether it's a good idea to make
PartitionClauseInfo a Node at all.  I went back to your earlier message
[1] where you said that it's put into the Append node for run-time pruning
to use, but it doesn't sound nice that we'd be putting into the plan
something that looks more like a scratchpad for the partition.c code.  I
think we should try to keep PartitionClauseInfo in partition.h and put
only the list of matched bare clauses into Append.

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/CAKJS1f8T_efuAgPWtyGdfwD1kBLR-giFvoez7raYsQ4P1i2OYw%40mail.gmail.com
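
As an aside, the clause-collapsing step that point 1 refers to
(remove_redundant_clauses keeping only the most restrictive set of clauses) can
be illustrated abstractly. A simplified Python sketch, assuming integer
comparisons on a single partition key -- this mirrors the idea, not the actual
partition.c code:

```python
def remove_redundant_clauses(clauses):
    """Collapse (operator, constant) clauses on one key to the tightest set.

    clauses: list of (op, value) with op in {'>', '>=', '<', '<=', '='}.
    Contradiction detection (e.g. a = 3 AND a < 2) is deliberately omitted.
    """
    lower = upper = eq = None
    for op, val in clauses:
        if op in (">", ">="):
            # Keep the largest lower bound; on a tie, strict '>' is tighter.
            if lower is None or val > lower[1] or (val == lower[1] and op == ">"):
                lower = (op, val)
        elif op in ("<", "<="):
            # Keep the smallest upper bound; on a tie, strict '<' is tighter.
            if upper is None or val < upper[1] or (val == upper[1] and op == "<"):
                upper = (op, val)
        elif op == "=":
            eq = (op, val)
    if eq is not None:
        return [eq]                      # equality subsumes both range bounds
    return [c for c in (lower, upper) if c is not None]

print(remove_redundant_clauses([(">", 1), (">=", 5), ("<", 10), ("<", 7)]))
# [('>=', 5), ('<', 7)]
```

Because the collapsed result is a pure function of the input clauses, it can be
recomputed as often as needed, which is what repeated evaluation at run time
requires.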




Re: JIT compiling with LLVM v9.1

2018-02-02 Thread Pierre Ducroquet
On Monday, January 29, 2018 10:53:50 AM CET Andres Freund wrote:
> Hi,
> 
> On 2018-01-23 23:20:38 -0800, Andres Freund wrote:
> > == Code ==
> > 
> > As the patchset is large (500kb) and I'm still quickly evolving it, I do
> > not yet want to attach it. The git tree is at
> > 
> >   https://git.postgresql.org/git/users/andresfreund/postgres.git
> > 
> > in the jit branch
> > 
> >   https://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=s
> >   hortlog;h=refs/heads/jit
> I've just pushed an updated and rebased version of the tree:
> - Split the large "jit infrastructure" commits into a number of smaller
>   commits
> - Split the C++ file
> - Dropped some of the performance stuff done to heaptuple.c - that was
>   mostly to make performance comparisons a bit more interesting, but
>   doesn't seem important enough to deal with.
> - Added a commit renaming datetime.h symbols so they don't conflict with
>   LLVM variables anymore, removing ugly #undef PM/#define PM dance
>   around includes. Will post separately.
> - Reduced the number of pointer constants in the generated LLVM IR, by
>   doing more getelementptr accesses (stem from before the time types
>   were automatically synced)
> - Increased number of comments a bit
> 
> There's a jit-before-rebase-2018-01-29 tag, for the state of the tree
> before the rebase.
> 
> Regards,
> 
> Andres

Hi

I have successfully built the JIT branch against LLVM 4.0.1 on Debian testing. 
This is not enough for Debian stable (LLVM 3.9 is the latest available there), 
but it's a first step.
I've split the patch into four files. The first three fix the build issues; 
the last one fixes a runtime issue.
I think they are small enough not to be a burden for you in your development. 
But if you don't want to carry these ifdefs right now, I can maintain them in 
a branch in a personal git repository and rebase as frequently as I can.

LLVM 3.9 support isn't going to be hard, but I prefer splitting. I also hope 
this will help more people test this wonderful toy… :)

Regards

 Pierre


From 770104331a36a8d207053227b850396f1392939a Mon Sep 17 00:00:00 2001
From: Pierre 
Date: Fri, 2 Feb 2018 09:11:55 +0100
Subject: [PATCH 1/4] Add support for LLVM4 in llvmjit.c

---
 src/backend/lib/llvmjit.c | 24 +---
 1 file changed, 21 insertions(+), 3 deletions(-)

diff --git a/src/backend/lib/llvmjit.c b/src/backend/lib/llvmjit.c
index 8e5ba94c98..d0c5537610 100644
--- a/src/backend/lib/llvmjit.c
+++ b/src/backend/lib/llvmjit.c
@@ -230,12 +230,19 @@ llvm_get_function(LLVMJitContext *context, const char *funcname)
 
 		addr = 0;
 		if (LLVMOrcGetSymbolAddressIn(handle->stack, &addr, handle->orc_handle, mangled))
-			elog(ERROR, "failed to lookup symbol");
+			elog(ERROR, "failed to lookup symbol %s", mangled);
 		if (addr)
 			return (void *) addr;
 	}
 #endif
 
+#if LLVM_VERSION_MAJOR < 5
+	if ((addr = LLVMOrcGetSymbolAddress(llvm_opt0_orc, mangled)))
+		return (void *) addr;
+	if ((addr = LLVMOrcGetSymbolAddress(llvm_opt3_orc, mangled)))
+		return (void *) addr;
+	elog(ERROR, "failed to lookup symbol %s for %s", mangled, funcname);
+#else
 	if (LLVMOrcGetSymbolAddress(llvm_opt0_orc, &addr, mangled))
 		elog(ERROR, "failed to lookup symbol");
 	if (addr)
@@ -244,7 +251,7 @@ llvm_get_function(LLVMJitContext *context, const char *funcname)
 		elog(ERROR, "failed to lookup symbol");
 	if (addr)
 		return (void *) addr;
-
+#endif
 	elog(ERROR, "failed to JIT: %s", funcname);
 
 	return NULL;
@@ -380,11 +387,21 @@ llvm_compile_module(LLVMJitContext *context)
 	 * faster instruction selection mechanism is used.
 	 */
 	{
-		LLVMSharedModuleRef smod;
 		instr_time tb, ta;
 
 		/* emit the code */
 		INSTR_TIME_SET_CURRENT(ta);
+#if LLVM_VERSION_MAJOR < 5
+		orc_handle = LLVMOrcAddEagerlyCompiledIR(compile_orc, context->module,
+		llvm_resolve_symbol, NULL);
+		if (!orc_handle)
+		{
+			elog(ERROR, "failed to jit module");
+		}
+#else
+		LLVMSharedModuleRef smod;
+
 		smod = LLVMOrcMakeSharedModule(context->module);
 		if (LLVMOrcAddEagerlyCompiledIR(compile_orc, &orc_handle, smod,
 		llvm_resolve_symbol, NULL))
@@ -392,6 +409,7 @@ llvm_compile_module(LLVMJitContext *context)
 			elog(ERROR, "failed to jit module");
 		}
 		LLVMOrcDisposeSharedModuleRef(smod);
+#endif
 		INSTR_TIME_SET_CURRENT(tb);
 		INSTR_TIME_SUBTRACT(tb, ta);
 		ereport(DEBUG1, (errmsg("time to emit: %.3fs",
-- 
2.15.1

From 079ad7087e2ab106c0f04fa9056c93afa9a43b7c Mon Sep 17 00:00:00 2001
From: Pierre 
Date: Fri, 2 Feb 2018 09:13:40 +0100
Subject: [PATCH 2/4] Add LLVM4 support in llvmjit_error.cpp

---
 src/backend/lib/llvmjit_error.cpp | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/src/backend/lib/llvmjit_error.cpp b/src/backend/lib/llvmjit_error.cpp
index 70cecd114b..04e51b2a31 100644
--- a/src/backend/lib/llvmjit_error.cpp
+++ b/src/backend/lib/llvmjit_error.cpp
@@ -56,7 +56,9 @@ 

Re: [PoC PATCH] Parallel dump to /dev/null

2018-02-02 Thread Michael Banck
Hi,

Am Freitag, den 02.02.2018, 13:22 +0900 schrieb Michael Paquier:
> On Thu, Feb 01, 2018 at 02:24:46PM +0100, Michael Banck wrote:
> > dumping a database to /dev/null via pg_dump is (AFAIK) one recommended
> > way to check for corruption.  However, dumping to /dev/null is currently
> > not supported in directory mode which makes it not possible to dump to
> > /dev/null in parallel.
> 
> FWIW, I use this trick as well in some test deployments.  Now, those
> deployments happen in places where I want the checks to warn me, so the
> time it takes is not really an issue generally.  Do folks here think
> that speedups would be worth it?  Say to make the operation shorter on
> production systems for example.  A lengthy pg_dump bloats data for a
> longer time, so I can buy that shorter times may help, though I think
> that hearing from other folks is necessary as well.

Another use-case would be automatic restores of basebackups, where you
might want to dump to /dev/null afterwards to check for issues, as just
starting the basebackup wouldn't tell you. If you have lots of instances
and lots of basebackups to check, doing that in parallel might be
helpful.


Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.ba...@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer



Re: Boolean partitions syntax

2018-02-02 Thread Amit Langote
On 2018/01/29 14:57, Tom Lane wrote:
> Amit Langote  writes:
>> Partition bound literals as captured gram.y don't have any type
>> information attached.
> 
> Isn't that design broken by definition?  TRUE is not the same thing
> as 't', nor as 'true'.  Nor are 1 and '1' the same thing; it's true
> that in some contexts we'll let '1' convert to an integer 1, but the
> reverse is not true.

Hmm, I thought the following does convert an integer 1 to text value '1',
but maybe I'm missing your point.

create table foo (a text default 1);

> Moreover, this approach doesn't have any hope
> of ever extending to bound values that aren't bare literals.
>
> I think you are fixing this at the wrong level.  Ideally the bound values
> ought to be expressions that get coerced to the partition column type.
> It's fine to require them to be constants for now, but not to invent
> an off-the-cuff set of syntactic restrictions that substitute for the
> semantic notion of "must be a constant".

I do remember the partitioning patch used to use a_expr for what today is:

partbound_datum:
Sconst  { $$ = makeStringConst($1, @1); }
| NumericOnly   { $$ = makeAConst($1, @1); }
| NULL_P{ $$ = makeNullAConst(@1); }
;

but thought at the time that that allowed way too much stuff into the
partition bound syntax.

> That path will lead to nasty
> backwards compatibility issues whenever somebody tries to extend the
> feature.
> 
> A concrete example of that is that the code currently accepts:
> 
> regression=# create table textpart (a text) partition by list (a);
> CREATE TABLE
> regression=# create table textpart_t partition of textpart for values in (1);
> CREATE TABLE
> 
> Since there's no implicit conversion from int to text, this seems
> pretty broken to me: there's no way for this behavior to be upward
> compatible to an implementation that treats the partition bound
> values as anything but text strings.  We should fix that before the
> behavior gets completely set in concrete.

Most of the code does treat partition bound values as Node values: the input
value is coerced to the desired type before being accepted, and an error is
raised when the conversion is not possible for whatever reason.

create table b (a bool) partition by list (a);
create table bt partition of b for values in (1);
ERROR:  specified value cannot be cast to type boolean for column "a"
LINE 1: create table bt partition of b for values in (1);

Can you say a bit more about the compatibility issues if we extend the syntax?

Thanks,
Amit
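
To restate the coercion behavior in miniature: a bound literal is accepted only
if it can be converted to the partition column's type. A toy Python sketch,
where the cast table is a made-up stand-in for PostgreSQL's cast catalog:

```python
def coerce_bound(literal, source_type, column_type):
    """Accept a partition bound only if the literal converts to the column type."""
    casts = {
        ("int", "int"): lambda v: v,
        ("text", "text"): lambda v: v,
        ("int", "text"): str,   # like: create table foo (a text default 1)
        ("text", "int"): int,   # '1' converts to integer 1 when well-formed
        # no ("int", "bool") entry: no such cast exists, so the bound in
        # "create table bt partition of b for values in (1)" is rejected
    }
    cast = casts.get((source_type, column_type))
    if cast is None:
        raise TypeError(
            f"specified value cannot be cast to type {column_type} for column")
    return cast(literal)

print(repr(coerce_bound(1, "int", "text")))   # '1'
```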




Re: [HACKERS] make async slave to wait for lsn to be replayed

2018-02-02 Thread Stephen Frost
Greetings,

* Simon Riggs (si...@2ndquadrant.com) wrote:
> On 22 January 2018 at 23:21, Stephen Frost  wrote:
> 
> >> It sounds reasonable. I can offer the following version.
> >>
> >> WAIT  LSN  lsn_number;
> >> WAIT  LSN  lsn_number  TIMEOUT delay;
> >> WAIT  LSN  lsn_number  INFINITE;
> >> WAIT  LSN  lsn_number  NOWAIT;
> >>
> >>
> >> WAIT  [token]  wait_value [option];
> >>
> >> token - [LSN, TIME | TIMESTAMP | CSN | XID]
> >> option - [TIMEOUT | INFINITE | NOWAIT]
> >>
> >> Ready to listen to your suggestions.
> >
> > There were a few different suggestions made, but somehow this thread
> > ended up in Needs Review again while still having LSNs.  I've changed it
> > back to Waiting for Author since it seems pretty unlikely that using LSN
> > is going to be acceptable based on the feedback.
> 
> I agree with the need for a separate command rather than a function.
> 
> I agree that WAIT LSN is good syntax because this allows us to wait
> for something else in future.
> 
> Without having reviewed the patch, I think we want this feature in PG11.

I've also looked back through this and while I understand the up-thread
discussion about having something better than LSN, I don't see any
particular reason to not allow waiting on LSN, so I agree with Simon
that this makes sense to include.  There are definite cases it helps
with today and it doesn't block off future work.

As we're closing out the January commitfest, I've moved this to the next
one.

Thanks!

Stephen




Re: JIT compiling with LLVM v9.0

2018-02-02 Thread Andres Freund
On 2018-02-01 22:20:01 -0800, Jeff Davis wrote:
> Thanks! That worked, but I had to remove the "-stdlib=libc++" also,
> which was causing me problems.

That'll be gone as soon as I finish the shlib thing. Will hope to have
something over the weekend. Right now I'm at FOSDEM and need to prepare
a talk for tomorrow.

Greetings,

Andres Freund



Re: JIT compiling with LLVM v9.0

2018-02-02 Thread Andres Freund
Hi,

On 2018-02-02 18:22:34 +1300, Thomas Munro wrote:
> Is there something broken about my installation?  I see simple
> arithmetic expressions apparently compiling and working but I can
> easily find stuff that breaks... so far I think it's anything
> involving string literals:

That definitely should all work. Did you compile with LTO and force it
to internalize all symbols, or some such?


> postgres=# set jit_above_cost = 0;
> SET
> postgres=# select quote_ident('x');
> ERROR:  failed to resolve name MakeExpandedObjectReadOnlyInternal

...

> The clang that was used for bitcode was the system /usr/bin/clang,
> version 4.0.  Is it a problem that I used that for compiling the
> bitcode, but LLVM5 for JIT?

No, I did that locally without problems.


> I actually tried CLANG=/usr/local/llvm50/bin/clang but ran into weird
> failures I haven't got to the bottom of at ThinLink time so I couldn't
> get as far as a running system.

So you'd hit clang 5 level issues rather than issues with this patchset, do I
understand correctly?


> I installed llvm50 from a package.  I did need to make a tiny tweak by
> hand: in src/Makefile.global, llvm-config --system-libs had said
> -l/usr/lib/libexecinfo.so which wasn't linking and looks wrong to me
> so I changed it to -lexecinfo, noted that it worked and reported a bug
> upstream:  https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=225621

Yea, that seems outside of my / our hands.

- Andres



Re: [HACKERS] Surjective functional indexes

2018-02-02 Thread Simon Riggs
On 1 February 2018 at 09:32, Simon Riggs  wrote:

> OK, thanks. Just checking my understanding and will add to the
> comments in the patch.

I'm feeling good about ths now, but for personal reasons won't be
committing this until late Feb/early March.

-- 
Simon Riggs    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] make async slave to wait for lsn to be replayed

2018-02-02 Thread Simon Riggs
On 30 October 2017 at 17:25, Ivan Kartyshov  wrote:

> It sounds reasonable. I can offer the following version.
>
> WAIT  LSN  lsn_number;
> WAIT  LSN  lsn_number  TIMEOUT delay;
> WAIT  LSN  lsn_number  INFINITE;
> WAIT  LSN  lsn_number  NOWAIT;
>
>
> WAIT  [token]  wait_value [option];
>
> token - [LSN, TIME | TIMESTAMP | CSN | XID]
> option - [TIMEOUT | INFINITE | NOWAIT]
>
> Ready to listen to your suggestions.

OK, on review we want this feature in PG11

Many people think we will want to wait on a variety of things in the
future. Support for those things can come in the future when/if they
exist.

In PG11, I propose the following command, sticking mostly to Ants'
syntax, and allowing to wait for multiple events before it returns. It
doesn't hold snapshot and will not get cancelled by Hot Standby.

WAIT FOR event [, event ...] options

event is
LSN value
TIMESTAMP value

options
TIMEOUT delay
UNTIL TIMESTAMP timestamp
(we have both, so people don't need to do math, they can use whichever
they have)

We do not need "INFINITE" or "INFINITELY", obviously the default mode
for WAIT is to continue waiting until the thing you asked for happens.

I couldn't see the point of the NOWAIT option, was that a Zen joke?

WAIT can be issued on masters as well as standbys, no need to block that.

If you want this in PG11, please work on this now, including docs and
tap tests. Please submit before 1 March and I will shepherd this to
commit.

Thomas, I suggest we also do what Robert suggested elsewhere, which was
to have a connection option that returns the xid or lsn (or both) via the
protocol, so we can use that with the WAIT command and you can have
the overall causal-consistency feature in PG11. I'll be reviewer for
that feature also if you submit.

-- 
Simon Riggs    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
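
For context, WAIT LSN ultimately reduces to comparing the standby's
last-replayed LSN against the target. PostgreSQL prints LSNs as two hex halves
separated by a slash (e.g. 16/B374D848); a small client-side Python sketch of
parsing and comparing them (how the replayed value is obtained, e.g. by polling
pg_last_wal_replay_lsn(), is out of scope here):

```python
def parse_lsn(text):
    """Convert PostgreSQL's textual LSN 'hi/lo' into a single 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def lsn_reached(replayed, target):
    """True once the standby has replayed up to (or past) the target LSN."""
    return parse_lsn(replayed) >= parse_lsn(target)

# WAIT LSN '16/B374D848' would return as soon as replay reaches that point:
print(lsn_reached("16/B374D847", "16/B374D848"))  # False: still behind
print(lsn_reached("17/0", "16/B374D848"))         # True: already past
```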



Re: [HACKERS] Restrict concurrent update/delete with UPDATE of partition key

2018-02-02 Thread amul sul
Hi Amit,

Sorry for the delayed response.

On Fri, Jan 26, 2018 at 11:58 AM, Amit Kapila  wrote:
> On Wed, Jan 24, 2018 at 12:44 PM, amul sul  wrote:
>> On Tue, Jan 23, 2018 at 7:01 PM, Amit Kapila  wrote:
>>> On Fri, Jan 12, 2018 at 11:43 AM, amul sul  wrote:
[]
> I think you can manually (via debugger) hit this by using
> PUBLICATION/SUBSCRIPTION syntax for logical replication.  I think what
> you need to do is in node-1, create a partitioned table and subscribe
> it on node-2.  Now, perform an Update on node-1, then stop the logical
> replication worker before it calls heap_lock_tuple.  Now, in node-2,
> update the same row such that it moves the row.  Now, continue the
> logical replication worker.  I think it should hit your new code, if
> not then we need to think of some other way.
>

I am able to hit the change log using above steps. Thanks a lot for the
step by step guide, I really needed that.

One strange behavior I found in logical replication, which is reproducible
without the attached patch as well: when I updated on node2 while holding a
breakpoint before the heap_lock_tuple call in the replication worker, I could
see that a duplicate row was inserted on node2, see this:

== NODE 1 ==

postgres=# insert into foo values(1, 'initial insert');
INSERT 0 1

postgres=# select tableoid::regclass, * from foo;
 tableoid | a |       b
----------+---+----------------
 foo1     | 1 | initial insert
(1 row)


=== NODE 2 ==

postgres=# select tableoid::regclass, * from foo;
 tableoid | a |       b
----------+---+----------------
 foo1     | 1 | initial insert
(1 row)


== NODE 1 ==

postgres=# update foo set a=2, b='node1_update' where a=1;
UPDATE 1

<--- BREAK POINT BEFORE heap_lock_tuple IN replication worker --->


== NODE 2 ==

postgres=#  update foo set a=2, b='node2_update' where a=1;

<--- RELEASE BREAK POINT --->

postgres=# 2018-02-02 12:35:45.050 IST [91786] LOG:  tuple to be
locked was already moved to another partition due to concurrent
update, retrying

postgres=# select tableoid::regclass, * from foo;
 tableoid | a |      b
----------+---+--------------
 foo2     | 2 | node2_update
 foo2     | 2 | node1_update
(2 rows)


== NODE 1 ==

postgres=# select tableoid::regclass, * from foo;
 tableoid | a |      b
----------+---+--------------
 foo2     | 2 | node1_update
(1 row)

I am thinking of reporting this in a separate thread, but I am not sure
whether this is already known behaviour.

== schema to reproduce above case ==
-- node1
create table foo (a int2, b text) partition by list (a);
create table foo1 partition of foo for values IN (1);
create table foo2 partition of foo for values IN (2);
insert into foo values(1, 'initial insert');
CREATE PUBLICATION update_row_mov_pub FOR ALL TABLES;
ALTER TABLE foo REPLICA IDENTITY FULL;
ALTER TABLE foo1 REPLICA IDENTITY FULL;
ALTER TABLE foo2 REPLICA IDENTITY FULL;

-- node2
create table foo (a int2, b text) partition by list (a);
create table foo1 partition of foo for values IN (1);
create table foo2 partition of foo for values IN (2);
CREATE SUBSCRIPTION update_row_mov_sub CONNECTION 'host=localhost
dbname=postgres' PUBLICATION update_row_mov_pub;
== END==

Updated patch attached -- correct changes in execReplication.c.

Regards,
Amul Sul


0001-Invalidate-ip_blkid-v5-wip2.patch
Description: Binary data


0002-isolation-tests-v3.patch
Description: Binary data