Re: [HACKERS] tuplesort_gettuple_common() and *should_free argument
On 10/22/2016 02:45 AM, Peter Geoghegan wrote:
> I noticed that the routine tuplesort_gettuple_common() fails to set
> *should_free in all paths in the master branch (no bug in 9.6). Consider
> the TSS_SORTEDONTAPE case, where we can return false without also
> setting "*should_free = false" to instruct the caller not to pfree()
> tuple memory that tuplesort still owns. I suppose that this is okay
> because the caller should never pfree() a tuple when a NULL pointer was
> returned at a higher level (as "pointer-to-tuple"). Even still, it seems
> like a bad idea to rely on the caller testing things such that it
> doesn't go on to pfree() a NULL pointer when seemingly instructed to do
> so by a tuplesort "get tuple" interface routine (that is, some higher
> level routine that calls tuplesort_gettuple_common()).
>
> More importantly, there are no remaining cases where
> tuplesort_gettuple_common() sets "*should_free = true", because there is
> no remaining need for the caller to *ever* pfree() a tuple. Moreover, I
> don't anticipate any future need for this -- the scheme recently
> established around per-tape "slab slots" makes it seem unlikely to me
> that the need for "*should_free = true" will ever arise again. So, for
> Postgres 10, why not rip out the "*should_free" arguments that appear in
> a bunch of places? This lets us slightly simplify most tuplesort.c
> callers.

Yeah, I agree we should just remove the *should_free arguments. Should we be worried about breaking the API of the tuplesort_get* functions? They might be used by extensions. I think that's OK, but wanted to bring it up. This would be only for master, of course, and any breakage would be straightforward to fix.

> Now, it is still true that at least some callers of
> tuplesort_gettupleslot() (but not any other "get tuple" interface
> routine) are going to *require* ownership of memory for returned tuples,
> which means a call to heap_copy_minimal_tuple() remains necessary there
> (see recent commit d8589946ddd5c43e1ebd01c5e92d0e177cbfc198 for
> details).
> But, that's beside the point. In the master branch only, the
> tuplesort_gettupleslot() test "if (!should_free)" seems tautological,
> just as similar tests are for every other tuplesort_gettuple_common()
> caller. So, tuplesort_gettupleslot() can safely assume it should
> *always* heap_copy_minimal_tuple() in the master branch. (BTW, I'm going
> to teach tuplesort_gettuple_common() to avoid this
> heap_copy_minimal_tuple() call for callers that don't actually need it,
> but that's a discussion for another thread.)

Yep.

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Quorum commit for multiple synchronous replication.
On Thu, Dec 8, 2016 at 9:07 AM, Robert Haas wrote:
> You could do that, but first I would code up the simplest, cleanest
> algorithm you can think of and see if it even shows up in a 'perf'
> profile. Microbenchmarking is probably overkill here unless a problem
> is visible on macrobenchmarks.

This is what I would go for! The current code is doing a simple thing: select the Nth element using qsort() after scanning each WAL sender's values. And I think that Sawada-san got it right. Even for a pgbench run on my laptop with 10 sync standbys, using a data set that fits into memory, SyncRepGetOldestSyncRecPtr gets at most 0.04% of overhead according to perf top on a non-assert, non-debug build. Hash tables and allocations get a far larger share. With the patch, SyncRepGetSyncRecPtr is at the same level with a quorum set of 10 nodes. Let's keep it simple for now; an extra patch could make things better later on if that's worth it.
--
Michael
Re: [HACKERS] varlena beyond 1GB and matrix
On 8 December 2016 at 12:01, Kohei KaiGai wrote:
>> At a higher level, I don't understand exactly where such giant
>> ExpandedObjects would come from. (As you point out, there's certainly
>> no easy way for a client to ship over the data for one.) So this feels
>> like a very small part of a useful solution, if indeed it's part of a
>> useful solution at all, which is not obvious.
>>
> I expect an aggregate function that consumes millions of rows as the source
> of a large matrix larger than 1GB. Once it is formed into a variable, it is
> easy to deliver as an argument of PL functions.

You might be interested in how Java has historically dealt with similar issues.

For a long time the JVM had quite low limits on the maximum amount of RAM it could manage, in the single gigabytes, even for the 64-bit JVM. Once those limitations were lifted, the garbage collector algorithm placed a low practical limit on how much RAM it could cope with effectively. If you were doing scientific computing with Java, lots of big image/video work, using GPGPUs, doing large scale caching, etc, this rapidly became a major pain point. So people introduced external memory mappings to Java, where objects could reference and manage memory outside the main JVM heap. The most well known is probably BigMemory (https://www.terracotta.org/products/bigmemory), but there are many others. They exposed this via small opaque handle objects that you used to interact with the external memory store via library functions.

It might make a lot of sense to apply the same principle to PostgreSQL, since it's much less intrusive than true 64-bit VARLENA. Rather than extending all of PostgreSQL to handle special-case split-up VARLENA extended objects, have your interim representation be a simple opaque value that points to externally mapped memory. Your operators for the type, etc, know how to work with it.
You probably don't need a full suite of normal operators; you'll be interacting with the data in a limited set of ways. The main issue would presumably be one of resource management, since we currently assume we can just copy a Datum around without telling anybody about it or doing any special management. You'd need to know when to clobber your external segment, when to copy(!) it if necessary, etc. This probably makes sense for working with GPGPUs anyway, since they like dealing with big contiguous chunks of memory (or used to, may have improved?).

It sounds like only code specifically intended to work with the oversized type should be doing much with it except passing it around as an opaque handle, right?

Do you need to serialize this type to/from disk at all? Or just exchange it in chunks with a client? If you do need to, can you possibly do TOAST-like or pg_largeobject-like storage where you split it up for on-disk storage then reassemble for use?
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] varlena beyond 1GB and matrix
On 8 December 2016 at 07:36, Tom Lane wrote:
> Likewise, the need for clients to be able to transfer data in chunks
> gets pressing well before you get to 1GB. So there's a lot here that
> really should be worked on before we try to surmount that barrier.

Yeah. I tend to agree with Tom here. Allowing >1GB varlena-like objects, when we can barely cope with our existing ones in dump/restore, in clients, etc, doesn't strike me as quite the right direction to go in.

I understand it solves a specific, niche case you're dealing with when exchanging big blobs of data with a GPGPU. But since the client doesn't actually see that large blob (it's split up into objects that will work on the current protocol and interfaces), why is it necessary to have instances of a single data type with >1GB values, rather than take a TOAST-like / pg_largeobject-like approach and split it up for storage?

I'm concerned that this adds a special-case format that will create maintenance burden and pain down the track, and it won't help with pain points users face like errors dumping/restoring rows with big varlena objects, problems efficiently exchanging them on the wire protocol, etc.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Typmod associated with multi-row VALUES constructs
Mmm. I tried the same thing in select_common_type and it came out a bit messier.

At Wed, 07 Dec 2016 23:44:19 -0500, Tom Lane wrote in <15128.1481172...@sss.pgh.pa.us>
> Attached is a patch that fixes this by storing typmod info in the RTE.
> This turned out to be straightforward, and I think it's definitely
> what we should do in HEAD. I have mixed emotions about whether it's
> worth doing anything about it in the back branches.

With it, VALUES works the same as UNION, as the documentation says.

https://www.postgresql.org/docs/8.2/static/queries-values.html

Past versions have the same documentation, so back-patching the *behavior* seems to me worth doing. Instead of modifying the RTE, re-coercing the first row's values would be enough for back-patching (though I'm not sure offhand how to do that).

> I chose to redefine the existing coltypes/coltypmods/colcollations
> lists for CTE RTEs as also applying to VALUES RTEs. That saves a
> little space in the RangeTblEntry nodes and allows sharing code
> in a couple of places. It's tempting to consider making that apply
> to all RTE types, which would permit collapsing expandRTE() and
> get_rte_attribute_type() into a single case. But AFAICS there would
> be no benefit elsewhere, so I'm not sure the extra code churn is
> justified.

Agreed.

> BTW, I noticed that the CTE case of expandRTE() fails to assign the
> specified location to the generated Vars, which is clearly a bug
> though a very minor one; it would result in failing to display a
> parse error location in some cases where we would do so for Vars from
> other RTE types. That part might be worth back-patching, not sure.

If we do back-patch the VALUES patch, including that would be worthwhile, I think.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center
Re: [HACKERS] Typmod associated with multi-row VALUES constructs
2016-12-07 22:17 GMT+01:00 Alvaro Herrera :
> Tom Lane wrote:
>
> > In HEAD, we could change the RTE data structure so that
> > transformValuesClause could save the typmod information in the RTE,
> > keeping the lookups cheap.
>
> Hmm, I think this would be useful for the XMLTABLE patch too. I talked
> a bit about it at
> https://www.postgresql.org/message-id/20161122204730.dgipy6gxi25j4e6a@alvherre.pgsql
>
> The patch has evolved quite a bit since then, but the tupdesc continues
> to be a problem. Latest patch is at
> https://www.postgresql.org/message-id/CAFj8pRBsrhwR636-_3TPbqu%3DFo3_DDer6_yp_afzR7qzhW1T6Q%40mail.gmail.com

What do you dislike about the tupdesc usage there?

regards

Pavel

> --
> Álvaro Herrera                https://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Typmod associated with multi-row VALUES constructs
Attached is a patch that fixes this by storing typmod info in the RTE. This turned out to be straightforward, and I think it's definitely what we should do in HEAD. I have mixed emotions about whether it's worth doing anything about it in the back branches.

I chose to redefine the existing coltypes/coltypmods/colcollations lists for CTE RTEs as also applying to VALUES RTEs. That saves a little space in the RangeTblEntry nodes and allows sharing code in a couple of places. It's tempting to consider making that apply to all RTE types, which would permit collapsing expandRTE() and get_rte_attribute_type() into a single case. But AFAICS there would be no benefit elsewhere, so I'm not sure the extra code churn is justified.

BTW, I noticed that the CTE case of expandRTE() fails to assign the specified location to the generated Vars, which is clearly a bug though a very minor one; it would result in failing to display a parse error location in some cases where we would do so for Vars from other RTE types. That part might be worth back-patching, not sure.

regards, tom lane

diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index e30c57e..d973225 100644
*** a/src/backend/nodes/copyfuncs.c
--- b/src/backend/nodes/copyfuncs.c
*** _copyRangeTblEntry(const RangeTblEntry *
*** 2149,2161 ****
  	COPY_NODE_FIELD(functions);
  	COPY_SCALAR_FIELD(funcordinality);
  	COPY_NODE_FIELD(values_lists);
- 	COPY_NODE_FIELD(values_collations);
  	COPY_STRING_FIELD(ctename);
  	COPY_SCALAR_FIELD(ctelevelsup);
  	COPY_SCALAR_FIELD(self_reference);
! 	COPY_NODE_FIELD(ctecoltypes);
! 	COPY_NODE_FIELD(ctecoltypmods);
! 	COPY_NODE_FIELD(ctecolcollations);
  	COPY_NODE_FIELD(alias);
  	COPY_NODE_FIELD(eref);
  	COPY_SCALAR_FIELD(lateral);
--- 2149,2160 ----
  	COPY_NODE_FIELD(functions);
  	COPY_SCALAR_FIELD(funcordinality);
  	COPY_NODE_FIELD(values_lists);
  	COPY_STRING_FIELD(ctename);
  	COPY_SCALAR_FIELD(ctelevelsup);
  	COPY_SCALAR_FIELD(self_reference);
! 	COPY_NODE_FIELD(coltypes);
! 	COPY_NODE_FIELD(coltypmods);
! 	COPY_NODE_FIELD(colcollations);
  	COPY_NODE_FIELD(alias);
  	COPY_NODE_FIELD(eref);
  	COPY_SCALAR_FIELD(lateral);
diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index b7a109c..edc1797 100644
*** a/src/backend/nodes/equalfuncs.c
--- b/src/backend/nodes/equalfuncs.c
*** _equalRangeTblEntry(const RangeTblEntry
*** 2460,2472 ****
  	COMPARE_NODE_FIELD(functions);
  	COMPARE_SCALAR_FIELD(funcordinality);
  	COMPARE_NODE_FIELD(values_lists);
- 	COMPARE_NODE_FIELD(values_collations);
  	COMPARE_STRING_FIELD(ctename);
  	COMPARE_SCALAR_FIELD(ctelevelsup);
  	COMPARE_SCALAR_FIELD(self_reference);
! 	COMPARE_NODE_FIELD(ctecoltypes);
! 	COMPARE_NODE_FIELD(ctecoltypmods);
! 	COMPARE_NODE_FIELD(ctecolcollations);
  	COMPARE_NODE_FIELD(alias);
  	COMPARE_NODE_FIELD(eref);
  	COMPARE_SCALAR_FIELD(lateral);
--- 2460,2471 ----
  	COMPARE_NODE_FIELD(functions);
  	COMPARE_SCALAR_FIELD(funcordinality);
  	COMPARE_NODE_FIELD(values_lists);
  	COMPARE_STRING_FIELD(ctename);
  	COMPARE_SCALAR_FIELD(ctelevelsup);
  	COMPARE_SCALAR_FIELD(self_reference);
! 	COMPARE_NODE_FIELD(coltypes);
! 	COMPARE_NODE_FIELD(coltypmods);
! 	COMPARE_NODE_FIELD(colcollations);
  	COMPARE_NODE_FIELD(alias);
  	COMPARE_NODE_FIELD(eref);
  	COMPARE_SCALAR_FIELD(lateral);
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 0d858f5..7258c03 100644
*** a/src/backend/nodes/outfuncs.c
--- b/src/backend/nodes/outfuncs.c
*** _outRangeTblEntry(StringInfo str, const
*** 2841,2855 ****
  			break;
  		case RTE_VALUES:
  			WRITE_NODE_FIELD(values_lists);
! 			WRITE_NODE_FIELD(values_collations);
  			break;
  		case RTE_CTE:
  			WRITE_STRING_FIELD(ctename);
  			WRITE_UINT_FIELD(ctelevelsup);
  			WRITE_BOOL_FIELD(self_reference);
! 			WRITE_NODE_FIELD(ctecoltypes);
! 			WRITE_NODE_FIELD(ctecoltypmods);
! 			WRITE_NODE_FIELD(ctecolcollations);
  			break;
  		default:
  			elog(ERROR, "unrecognized RTE kind: %d", (int) node->rtekind);
--- 2841,2857 ----
  			break;
  		case RTE_VALUES:
  			WRITE_NODE_FIELD(values_lists);
! 			WRITE_NODE_FIELD(coltypes);
! 			WRITE_NODE_FIELD(coltypmods);
! 			WRITE_NODE_FIELD(colcollations);
  			break;
  		case RTE_CTE:
  			WRITE_STRING_FIELD(ctename);
  			WRITE_UINT_FIELD(ctelevelsup);
  			WRITE_BOOL_FIELD(self_reference);
! 			WRITE_NODE_FIELD(coltypes);
! 			WRITE_NODE_FIELD(coltypmods);
! 			WRITE_NODE_FIELD(colcollations);
  			break;
  		default:
  			elog(ERROR, "unrecognized RTE kind: %d", (int) node->rtekind);
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index c587d4e..d608530 100644
*** a/src/backend/nodes/readfuncs.c
--- b/src/backend/nodes/readfuncs.c
*** _readRangeTblEntry(void)
*** 1314,1328 ****
  			break;
  		case RTE_VALUES:
  			READ_NODE_FIELD(values_lists);
! 			READ_NODE_FIELD(values_collations);
  			break;
  		case RTE_
Re: [HACKERS] Declarative partitioning - another take
On Thu, Dec 8, 2016 at 1:39 PM, Andres Freund wrote:
> On 2016-12-07 13:20:04 -0500, Robert Haas wrote:
>> On Wed, Dec 7, 2016 at 11:53 AM, Erik Rijkers wrote:
>> >> My bad. The fix I sent last night for one of the cache flush issues
>> >> wasn't quite right. The attached seems to fix it.
>> > Yes, fixed here too. Thanks.
>>
>> Thanks for the report - that was a good catch.
>
> Congrats to everyone working on this! This is a large step forward.

Congratulations to all! It was a long way to this result.
--
Michael
Re: [HACKERS] Declarative partitioning - another take
On 2016-12-07 13:20:04 -0500, Robert Haas wrote:
> On Wed, Dec 7, 2016 at 11:53 AM, Erik Rijkers wrote:
> >> My bad. The fix I sent last night for one of the cache flush issues
> >> wasn't quite right. The attached seems to fix it.
> > Yes, fixed here too. Thanks.
>
> Thanks for the report - that was a good catch.

Congrats to everyone working on this! This is a large step forward.

- Andres
Re: [HACKERS] Unlogged tables cleanup
On Thu, Dec 8, 2016 at 6:03 AM, Robert Haas wrote:
> Michael, your greetings were passed on to me with a request that I
> look at this thread.

Thanks for showing up!

> On Fri, Nov 18, 2016 at 12:33 PM, Michael Paquier wrote:
>>>> More seriously, if there could be more details regarding that, I would
>>>> think that we could say something like "logging the init fork is
>>>> mandatory in any case to ensure its on-disk presence at recovery
>>>> replay, even on non-default tablespaces whose base location is deleted
>>>> and re-created from scratch if the WAL record in charge of creating
>>>> this tablespace gets replayed". The problem shows up because of
>>>> tablespaces being deleted at replay at the end... So perhaps this
>>>> makes sense. What do you think?
>>>
>>> Yes, that's about what I think we need to explain.
>>
>> OK, what do you think about the attached then?
>>
>> I came up with something like that for those code paths:
>> - /* Write the page. If archiving/streaming, XLOG it. */
>> + /*
>> +  * Write the page and log it unconditionally. This is important
>> +  * particularly for indexes created on tablespaces and databases
>> +  * whose creation happened after the last redo pointer as recovery
>> +  * removes any of their existing content when the corresponding
>> +  * create records are replayed.
>> +  */
>> I have not been able to use the word "create" less than that. The
>> patch is really repetitive, but I think that we had better mention the
>> need of logging the content in each code path and not touch
>> log_newpage() to keep it a maximum flexible.
>
> In blinsert.c, nbtree.c, etc. how about:
>
> Write the page and log it. It might seem that an immediate sync would
> be sufficient to guarantee that the file exists on disk, but recovery
> itself might remove it while replaying, for example, an
> XLOG_DBASE_CREATE record. Therefore, we need this even when
> wal_level=minimal.

OK, I rewrote the patch a bit, as attached. What do you think?
>>> Actually, I'm wondering if we ought to just switch this over to using
>>> the delayChkpt mechanism instead of going through this complicated
>>> song-and-dance. Incurring an immediate sync to avoid having to WAL-log
>>> this is a bit tenuous, but having to WAL-log it AND do the immediate
>>> sync just seems silly, and I'm actually a bit worried that even with
>>> your fix there might still be a latent bug somewhere. We couldn't
>>> switch mechanisms cleanly in the 9.2 branch (cf.
>>> f21bb9cfb5646e1793dcc9c0ea697bab99afa523) so perhaps we should
>>> back-patch it in the form you have it in already, but it's probably
>>> worth switching over at least in master.
>>
>> Thinking through it... Could we not just rip out the sync phase,
>> because there is delayChkpt? It seems to me that what counts is that
>> the commit of the transaction that creates the relation does not get
>> past the redo point. It would matter for read-uncommitted transactions,
>> but that concept does not exist in PG, and never will. On back branches
>> it is definitely safer to keep the sync; I am just thinking about a
>> HEAD-only optimization here, as you do.
>
> Right (I think). If we set and clear delayChkpt around this work, we
> don't need the immediate sync.

My point is a bit different from what you mean, I think: the transaction creating an unlogged relfilenode would not need to even set delayChkpt in the empty() routines, because other transactions would not refer to it until this transaction has committed. So I am arguing for just removing the sync phase.
--
Michael

diff --git a/contrib/bloom/blinsert.c b/contrib/bloom/blinsert.c
index 0946aa29ec..a31149f044 100644
--- a/contrib/bloom/blinsert.c
+++ b/contrib/bloom/blinsert.c
@@ -164,13 +164,18 @@ blbuildempty(Relation index)
 	metapage = (Page) palloc(BLCKSZ);
 	BloomFillMetapage(index, metapage);
 
-	/* Write the page. If archiving/streaming, XLOG it. */
+	/*
+	 * Write the page and log it.  It might seem that an immediate sync
+	 * would be sufficient to guarantee that the file exists on disk, but
+	 * recovery itself might remove it while replaying, for example, an
+	 * XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE record.  Therefore, we
+	 * need this even when wal_level=minimal.
+	 */
 	PageSetChecksumInplace(metapage, BLOOM_METAPAGE_BLKNO);
 	smgrwrite(index->rd_smgr, INIT_FORKNUM, BLOOM_METAPAGE_BLKNO,
 			  (char *) metapage, true);
-	if (XLogIsNeeded())
-		log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
-					BLOOM_METAPAGE_BLKNO, metapage, false);
+	log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
+				BLOOM_METAPAGE_BLKNO, metapage, false);
 
 	/*
 	 * An immediate sync is required even if we xlog'd the page, because the
diff --git a/src/backend/access/nbtree/nbtree.c b/src
Re: [HACKERS] Patch to implement pg_current_logfile() function
Hello Robert,

On Wed, 7 Dec 2016 18:52:24 -0500 Robert Haas wrote:
> On Tue, Dec 6, 2016 at 6:11 PM, Gilles Darold wrote:
> > It seems that all fixes have been included in this patch.

I read this and knew that I hadn't finished my review, but didn't immediately respond because I thought the patch had to be marked "ready for committer" on commitfest.postgresql.org before the committers would spend a lot of time on it. I don't have the time to fully respond, or I'd be able to attach new code, but I want to comment before anybody else spends a lot of time on this patch.

> Yikes. This patch has certainly had a lot of review, but it has lots
> of basic inconsistencies with PostgreSQL coding style, which one would
> think somebody would have noticed by now,

Yes and no. I don't really know what the coding style is, and rather than make multiple corrections to code that might get weeded out and discarded, I've been focusing on getting the logic right. Really, I've not put much thought into it except for trying to write in the style of what's already there, and sticking to < 80 chars per line. There has been lots of review.

The only truly effective way I've found to communicate regarding the pg_current_logfiles patch has been to write code and provide detailed test cases. I could be wrong, but unless I submit code I don't feel like I've been understood. I just haven't been interested in re-formatting somebody else's code before I think the code really works.

> I don't have any objection
> to the basic idea of this patch, but it's got to look and feel like
> the surrounding PostgreSQL code. And not be unnecessarily
> complicated.

I've been introducing some complication, but I hope it's necessary. To keep the review process simple my plan has been to submit multiple patches: one with the basics, and more on top of that which introduce complication that can be considered and accepted or rejected.
So I send emails with multiple patches, some that I think need to go into the core patch and others to be kept separate. But, although I try, this does not seem to have been communicated, because I keep getting emails back that contain only a single patch. Maybe something's wrong with my review process, but I don't know how to fix it. If you think a single patch is the way to go I can open up separate tickets at commitfest.postgresql.org. But this seems like overkill because all the patches touch the same code.

The extreme case is the attached "cleanup_rotate" patch. (Which applies to v14 of this patch.) It's nothing but a little bit of tidying up of the master branch, but it does touch code that's already being modified, so it seems like the committers would want to look at it at the same time as when reviewing the pg_current_logfile patch. Once you've looked at the pg_current_logfile patch you've already looked at and modified code in the function the "cleanup_rotate" patch touches.

But the "cleanup_rotate" patch got included in the v15 version of the patch and is also in v16. I'm expecting to submit it as a separate patch along with the pg_current_logfile patch and tried to be very, very clear about this. It really has nothing to do with the pg_current_logfile functionality. (And it is the only "separate" patch which really has nothing to do with the pg_current_logfile functionality.) More examples of separate patches are below. Any guidance would be appreciated.

> Detailed comments below:
> rm_log_metainfo() could be merged into logfile_writename(), since
> they're always called in conjunction and in the same pattern.

This would be true, if it weren't for the attached "retry_current_logfiles" patches that were applied in v15 of the patch and removed from v16 due to some mis-communication I don't understand. (But the attached patches apply on top of the v14 patch. I don't have time to refactor them now.) FYI.
The point of the "retry_current_logfiles" patch series is to handle the case where logfile_writename gets an ENFILE or EMFILE. When this happens, the current_logfiles file can "skip" and not contain the name(s) of the current log file for an entire log rotation. To miss, say, a month of logs because the logs only rotate monthly and your log processing is based on reading the current_logfiles file sounds like a problem.

What I was attempting to communicate in my email response to the v15 patch is that the (attached, but applying to the v14 patch again) "backoff" patch should be submitted as a separate patch. It seems over-complicated, but exists in case a committer is worried about re-trying writes on a system that's busy enough to generate ENFILE or EMFILE errors.

> +if (errno == ENFILE || errno == EMFILE)
> +ereport(LOG,
> +(errmsg("system is too busy to write logfile meta
> info, %s will be updated on next rotation (or use SIGHUP to retry)",
> LOG_METAINFO_DATAFILE)));
>
> This seems like a totally unprincipled reaction to a purely arbitrary
> subset of
Re: [HACKERS] varlena beyond 1GB and matrix
2016-12-08 8:36 GMT+09:00 Tom Lane :
> Robert Haas writes:
>> On Wed, Dec 7, 2016 at 8:50 AM, Kohei KaiGai wrote:
>>> I like to propose a new optional type handler 'typserialize' to
>>> serialize an in-memory varlena structure (that can have indirect
>>> references) to on-disk format.
>
>> I think it's probably a mistake to conflate objects with substructure
>> with objects > 1GB. Those are two somewhat orthogonal needs.
>
> Maybe. I think where KaiGai-san is trying to go with this is being
> able to turn an ExpandedObject (which could contain very large amounts
> of data) directly into a toast pointer or vice versa. There's nothing
> really preventing a TOAST OID from having more than 1GB of data
> attached, and if you had a side channel like this you could transfer
> the data without ever having to form a larger-than-1GB tuple.
>
> The hole in that approach, to my mind, is that there are too many places
> that assume that they can serialize an ExpandedObject into part of an
> in-memory tuple, which might well never be written to disk, or at least
> not written to disk in a table. (It might be intended to go into a sort
> or hash join, for instance.) This design can't really work for that case,
> and unfortunately I think it will be somewhere between hard and impossible
> to remove all the places where that assumption is made.
>
Regardless of the ExpandedObject, does the flattened format need to contain fully flattened data chunks? If a data type internally contains multiple toast pointers, as an array might, its flattened image is likely small enough to store using the existing varlena mechanism. One problem is that VARSIZE() will never tell us the exact total length of the data even if it references multiple GB-scale chunks.

> At a higher level, I don't understand exactly where such giant
> ExpandedObjects would come from. (As you point out, there's certainly
> no easy way for a client to ship over the data for one.)
> So this feels
> like a very small part of a useful solution, if indeed it's part of a
> useful solution at all, which is not obvious.
>
I expect an aggregate function that consumes millions of rows as the source of a large matrix larger than 1GB. Once it is formed into a variable, it is easy to deliver as an argument of PL functions.

> FWIW, ExpandedObjects themselves are far from a fully fleshed out
> concept, one of the main problems being that they don't have very long
> lifespans except in the case that they're the value of a plpgsql
> variable. I think we would need to move things along quite a bit in
> that area before it would get to be useful to think in terms of
> ExpandedObjects containing multiple GB of data. Otherwise, the
> inevitable flattenings and re-expansions are just going to kill you.
>
> Likewise, the need for clients to be able to transfer data in chunks
> gets pressing well before you get to 1GB. So there's a lot here that
> really should be worked on before we try to surmount that barrier.
>
You are pointing out the problem around the client<->server protocol, aren't you? Likely, we will eventually need this enhancement. I agree.

Thanks,
--
KaiGai Kohei
Re: [HACKERS] varlena beyond 1GB and matrix
2016-12-08 8:04 GMT+09:00 Robert Haas :
> On Wed, Dec 7, 2016 at 8:50 AM, Kohei KaiGai wrote:
>> I like to propose a new optional type handler 'typserialize' to
>> serialize an in-memory varlena structure (that can have indirect
>> references) to on-disk format.
>> If any, it shall be invoked at the head of toast_insert_or_update(),
>> so that indirect references are transformed to something else which
>> is safe to save. (My expectation is, the 'typserialize' handler
>> preliminarily saves the indirect chunks to the toast relation, then
>> puts toast pointers instead.)
>
> This might not work. The reason is that we have important bits of
> code that expect that they can figure out how to do some operation on
> a datum (find its length, copy it, serialize it) based only on typlen
> and typbyval. See src/backend/utils/adt/datum.c for assorted
> examples. Note also the lengthy comment near the top of the file,
> which explains that typlen > 0 indicates a fixed-length type, typlen
> == -1 indicates a varlena, and typlen == -2 indicates a cstring. I
> think there's probably room for other typlen values; for example, we
> could have typlen == -3 indicate some new kind of thing -- a
> super-varlena that has a higher length limit, or some other kind of
> thing altogether.
>
> Now, you can imagine trying to treat what you're talking about as a
> new type of TOAST pointer, but I think that's not going to help,
> because at some point the TOAST pointer gets de-toasted into a varlena
> ... which is still limited to 1GB. So that's not really going to
> work. And it brings out another point, which is that if you define a
> new typlen code, like -3, for super-big things, they won't be
> varlenas, which means they won't work with the existing TOAST
> interfaces. Still, you might be able to fix that. You would probably
> have to do some significant surgery on the wire protocol, per the
> commit message for fa2fa995528023b2e6ba1108f2f47558c6b66dcd.
>
Hmm...
The reason I didn't introduce the idea of a 64-bit varlena format is that the approach seems too invasive for the existing PostgreSQL core and extensions, because I assumed this "long" variable-length datum would utilize and enhance the existing varlena infrastructure. However, once we have infrastructure completely independent of the existing varlena, we would not run the risk of mixing the two code paths. That seems like an advantage to me. My concern about ExpandedObject is that its flat format still needs a 32-bit varlena header, which restricts the total length to 1GB, so it cannot represent a large data chunk. That is why I didn't think the current ExpandedObject is a solution for us. Regarding the protocol, I want to consider support for large records after the internal data format, because my expected primary usage is internal, for in-database analytics, where the user gets results from complicated logic described in a PL function. > I think it's probably a mistake to conflate objects with substructure > with objects > 1GB. Those are two somewhat orthogonal needs. As Jim > also points out, expanded objects serve the first need. Of course, > those assume that we're dealing with a varlena, so if we made a > super-varlena, we'd still need to create an equivalent. But perhaps > it would be a fairly simple adaptation of what we've already got. > Handling objects >1GB at all seems like the harder part of the > problem. > I think I mostly get your point. Does the last line above refer to the sheer amount of data in an object >1GB, even if the "super-varlena" format allows a 64-bit length? Thanks, -- KaiGai Kohei -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Back-patch use of unnamed POSIX semaphores for Linux?
On Wed, Dec 7, 2016 at 3:42 PM, Tom Lane wrote: > > > Hmm ... after further experimentation, I still can't get this version of > systemd (231) to do anything evil. It turns out that Fedora ships it with > KillUserProcesses turned off by default, and maybe having that on is a > prerequisite for the other behavior? But that doesn't make a lot of sense > because we'd never be seeing the reports of databases moaning about lost > semaphores if the processes got killed first. Anyway, I see nothing bad > happening if KillUserProcesses is off, while if it's on then the database > gets shut down reasonably politely via SIGTERM. > > Color me confused ... maybe systemd's behavior has changed? > Hrm, the following incantation seems to break for me on a fresh Fedora 25 system: 1) As root su to $USER and start postgres. 2) ssh in as $USER and then logout 3) # psql localhost FATAL: semctl(4980742, 3, SETVAL, 0) failed: Invalid argument LOG: server process (PID 14569) exited with exit code 1 ...
Re: [HACKERS] [sqlsmith] Short reads in hash indexes
On Thu, Dec 8, 2016 at 2:38 AM, Andreas Seltenreich wrote: > Andreas Seltenreich writes: > >> Amit Kapila writes: >> >>> On Sat, Dec 3, 2016 at 3:44 PM, Andreas Seltenreich >>> wrote: Amit Kapila writes: > [2. text/x-diff; fix_hash_bucketsplit_sqlsmith_v1.patch] Ok, I'll do testing with the patch applied. >> >> Good news: the assertion hasn't fired since the patch is in. > > Meh, it fired again today after being silent for 100e6 queries :-/ > I guess I need to add some confidence qualification on such statements. > Maybe sigmas as they do at CERN… > >> smith=# select * from state_report where sqlstate = 'XX001'; >> -[ RECORD 1 >> ]-- >> count| 10 >> sqlstate | XX001 >> sample | ERROR: could not read block 1173 in file "base/16384/17256": >> read only 0 of 8192 bytes >> hosts| {airbisquit,frell,gorgo,marbit,pillcrow,quakken} >> >>> Hmm, I am not sure if this is related to previous problem, but it >>> could be. Is it possible to get the operation and or callstack for >>> above failure? >> >> Ok, will turn the elog into an assertion to get at the backtraces. > > Doing so on top of 4212cb7, I caught the backtrace below. Query was: > Thanks for the report, I will look into it. I think this one is quite similar to what Jeff has reported [1]. [1] - https://www.postgresql.org/message-id/CAMkU%3D1ydfriLCOriJ%3DAxtF%3DhhBOUUcWtf172vquDrj%3D3T7yXmg%40mail.gmail.com -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Declarative partitioning - another take
Hi Robert, On 2016/12/08 3:20, Robert Haas wrote: > On Wed, Dec 7, 2016 at 11:53 AM, Erik Rijkers wrote: >>> My bad. The fix I sent last night for one of the cache flush issues >>> wasn't quite right. The attached seems to fix it. >> Yes, fixed here too. Thanks. > > Thanks for the report - that was a good catch. > > I've committed 0001 - 0006 with that correction and a few other > adjustments. There's plenty of room for improvement here, and almost > certainly some straight-up bugs too, but I think we're at a point > where it will be easier and less error-prone to commit follow on > changes incrementally rather than by continuously re-reviewing a very > large patch set for increasingly smaller changes. +1 and thanks a lot for your and everyone else's very patient support in reviewing the patches. > Some notes: > > * We should try to teach the executor never to scan the parent. > That's never necessary with this system, and it might add significant > overhead. We should also try to get rid of the idea of the parent > having storage (i.e. a relfilenode). Agreed, I will start investigating. > * The fact that, in some cases, locking requirements for partitioning > are stronger than those for inheritance is not good. We made those > decisions for good reasons -- namely, data integrity and not crashing > the server -- but it would certainly be good to revisit those things > and see whether there's any room for improvement. +1 > * I didn't commit 0007, which updates the documentation for this new > feature. That patch removes more lines than it adds, and I suspect > what is needed here > is an expansion of the documentation rather than a diminishing of it. Hmm, I had mixed feelings about what to do there as well. Right now we have the description of various new features buried in the "VI. Reference" section of the documentation, which is simply meant as a command reference. I agree that the new partitioning warrants more expansion in the DDL partitioning chapter.
Will see how that could be done. > * The fact that there's no implementation of row movement should be > documented as a limitation. We should also look at removing that > limitation. Yes, something to improve. By the way, since we currently mention INSERT tuple-routing directly in the description of partitioned tables in the CREATE TABLE command reference, is that also the place to list this particular limitation? Or is the UPDATE command reference the correct place? Thanks, Amit
Re: [HACKERS] Parallel execution and prepared statements
On Wed, Dec 7, 2016 at 12:30 AM, Robert Haas wrote: > On Tue, Dec 6, 2016 at 1:07 PM, Tom Lane wrote: >> Robert Haas writes: >>> Done. >> >> The comment seems quite confused now: >> >> * If a tuple count was supplied or data is being written to relation, we >> * must force the plan to run without parallelism, because we might exit >> * early. >> >> Exit early is exactly what we *won't* do if writing to an INTO rel, so >> I think this will confuse future readers. I think it should be more like >> >> * If a tuple count was supplied, we must force the plan to run without >> * parallelism, because we might exit early. Also disable parallelism >> * when writing into a relation, because [ uh, why exactly? ] >> >> Considering that the writing would happen at top level of the executor, >> and hence in the parent process, it's not actually clear to me why the >> second restriction is there at all: can't we write tuples to a rel even >> though they came from a parallel worker? In any case, the current wording >> of the comment is a complete fail at explaining this. > > Oops. You're right. [ uh, why exactly? ] -> no database changes > whatsoever are allowed while in parallel mode. (This restriction > might be lifted someday, but right now we're stuck with it.) > Attached patch changes the comment based on above suggestions. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com [Attachment: fix_comment_parallel_mode_v1.patch]
Re: [HACKERS] Typmod associated with multi-row VALUES constructs
Alvaro Herrera writes: > Tom Lane wrote: >> In HEAD, we could change the RTE data structure so that >> transformValuesClause could save the typmod information in the RTE, >> keeping the lookups cheap. > Hmm, I think this would be useful for the XMLTABLE patch too. I talked > a bit about it at > https://www.postgresql.org/message-id/20161122204730.dgipy6gxi25j4e6a@alvherre.pgsql I dunno. If your example there is correct that XMLTABLE can be called as a plain function in a SELECT list, then I doubt that we want to tie anything about it to the RTE data structure. If anything, the case where it appears in FROM seems to need to be treated as a generic RTE_FUNCTION case. I've been trying to avoid getting involved in the XMLTABLE patch, mainly because I know zip about XML, but maybe I need to take a look. regards, tom lane
Re: [HACKERS] [COMMITTERS] pgsql: Implement table partitioning.
On 2016/12/08 3:33, Robert Haas wrote: > On Wed, Dec 7, 2016 at 1:30 PM, Robert Haas wrote: >> -- partitioned table cannot partiticipate in regular inheritance >> CREATE TABLE partitioned2 ( >> a int >> --- 392,411 >> c text, >> d text >> ) PARTITION BY RANGE (a oid_ops, plusone(b), c collate "default", d >> collate "en_US"); >> + ERROR: collation "en_US" for encoding "SQL_ASCII" does not exist > ... >> No idea why yet, but I'll try to figure it out. > > And of course that'd be because relying on en_US isn't portable. Sigh. Should've thought about the non-portability of locales. Thanks for catching and fixing anyway! Thanks, Amit
Re: [HACKERS] Declarative partitioning - another take
On 2016/12/08 1:53, Erik Rijkers wrote: > On 2016-12-07 17:38, Robert Haas wrote: >> On Wed, Dec 7, 2016 at 11:34 AM, Amit Langote >> wrote: begin; create schema if not exists s; create table s.t (c text, d text, id serial) partition by list ((ascii(substring(coalesce(c, d, ''), 1, 1)))); create table s.t_part_ascii_065 partition of s.t for values in ( 65 ); it logs as follows: 2016-12-07 17:03:45.787 CET 6125 LOG: server process (PID 11503) was terminated by signal 11: Segmentation fault 2016-12-07 17:03:45.787 CET 6125 DETAIL: Failed process was running: create table s.t_part_ascii_065 partition of s.t for values in ( 65 ); >>> >>> Hm, will look into this in a few hours. >> >> My bad. The fix I sent last night for one of the cache flush issues >> wasn't quite right. The attached seems to fix it. > > Yes, fixed here too. Thanks. Thanks for reporting and the fix, Erik and Robert! Thanks, Amit
Re: [HACKERS] Separate connection handling from backends
On 7 December 2016 at 22:27, Kevin Grittner wrote: > I don't know how that execution model would compare to what we use > now in terms of performance, but its popularity makes it hard to > ignore as something to consider. Those engines also tend to be threaded. They can stash state in memory and hand it around between executors in ways we cannot really do. I'd love to see a full separation of executor from session in postgres, but I can't see how it could be at all practical. The use of globals for state and the assumption that session == backend is baked in way too deep. At least, I think it'd be a slow and difficult thing to change, and would need many steps. Something like what was proposed upthread would possibly make sense as a first step. But again, I don't see anyone who's likely to actually do it. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Test "tablespace" fails during `make installcheck` on master-replica setup
On Thu, Dec 8, 2016 at 12:06 AM, Tom Lane wrote: > Stephen Frost writes: >> It would be really nice if we would detect that some other postmaster is >> already using a given tablespace directory and to throw an error and >> complain rather than starting up thinking everything is fine. > > In principle, we could have the postmaster run through $PGDATA/pg_tblspc > and drop a lockfile into each referenced directory. But the devil is in > the details --- in particular, not sure how to get the right thing to > happen during a CREATE TABLESPACE. Also, I kinda doubt that this is going > to fix anything for the replica-on-same-machine problem. That's where having a node-based ID would become helpful, which is different from the global system ID. Ages ago when working on Postgres-XC, we took care of this problem by appending a node-name suffix to the tablespace folder name (the one prefixed with PGXX). When applying this concept to PG, we could have standbys set up this node ID each time recovery is done using a backup_label. This won't solve the problem of tablespaces already created, which should be handled by users when taking the base backup by remapping them. But it would address the problem for newly created ones. -- Michael
Re: [HACKERS] Back-patch use of unnamed POSIX semaphores for Linux?
Robert Haas writes: > On Wed, Dec 7, 2016 at 6:49 PM, Tom Lane wrote: >> This still doesn't address the real question, which is whether RemoveIPC >> does anything if KillUserProcesses is off, and whether that behavior >> has changed. I don't see anything about RemoveIPC in that thread. > http://www.dsm.fordham.edu/cgi-bin/man-cgi.pl?topic=logind.conf&sect=5 > suggests that KillUserProcesses and RemoveIPC are separate cleanup > behaviors triggered by the same underlying cause (termination of last > session). Yeah, I read that man page too, but ... The test case I was using was to ssh into the box, launch a postmaster using the old-school "nohup postmaster &" technique, and log out. What I saw was that the "/usr/lib/systemd/systemd --user" process Alex referred to would be launched when the ssh connection started, and would stick around as long as the postmaster was there, if KillUserProcesses was off. (If it was on, something SIGTERM'd the postmaster as soon as I disconnected.) So if they really are independent behaviors, I'd have expected the same something to have killed the semaphores as soon as I disconnected; but that did NOT happen. [ Yes, RemoveIPC is definitely on: I turned it on explicitly in logind.conf, just in case the comment claiming it's on by default is a lie. ] BTW, I also tried this from the console, but the results were confused by the fact that GNOME seems to launch approximately a metric buttload of "helper" processes, which don't disappear when I log out. If that's the behavior Lennart is trying to get rid of, I can see his point; but I tend to agree with the other comments in that thread that this should be fixed in GNOME not by breaking longstanding working assumptions. When I get a chance, I think I'll try F24 and see if it behaved differently. F23 might be interesting too if it's still downloadable.
regards, tom lane
Re: [HACKERS] Quorum commit for multiple synchronous replication.
Hello, my context switch was incomplete that time, sorry. There are multiple "target LSN"s, so we need the kth-largest LSN. At Wed, 7 Dec 2016 19:04:23 +0900, Michael Paquier wrote in > On Wed, Dec 7, 2016 at 5:17 PM, Masahiko Sawada wrote: > > Sorry, I could not understand this algorithm. Could you elaborate > > this? It takes only O(n) times? > > Nah, please forget that, that was a random useless thought. There is > no way to be able to select the k-th element without knowing the > hierarchy induced by the others, which is what the partial sort would > help with here. So, let's consider some cases:

- needing a 3-quorum among 5 standbys: no problem whichever kth-largest method we choose; of course qsort is fine.
- needing 10 quorums among 100 standbys: I'm not sure there's any difference from 3/5.
- needing 10 quorums among 1000 standbys: obviously qsort does too much work, and some kind of kth-largest algorithm should be used. For instance, quickselect, with its O(n) average and O(n^2) worst case, seems better than the O(n log n) average and O(n^2) worst case of qsort.
- needing 700 quorums among 1000 standbys: I don't think this case is worth considering, but kth-largest is better even here.

Even if we decide 700/1000 is out of the current scope, I still recommend using kth-largest selection. If we don't, the quorum calculation is triggered by every standby reply message, and in the worst case the frequency of the calculation seems close to the frequency of WAL insertion. (Right?) Even kth-largest takes too long with 1000 standbys. Maintaining the kth-largest value in shared memory needs less CPU but leads to worse contention on the shared memory. Conversely, we already have the waiting LSNs of backends in the procarray. If we kept another array ordered by waiting LSN, with a condition variable on the number of confirming walsenders, each walsender could individually look it up and count it. That might perform better, but I'm not sure.
regards, -- Kyotaro Horiguchi NTT Open Source Software Center
Re: [HACKERS] varlena beyond 1GB and matrix
I wrote: > Maybe. I think where KaiGai-san is trying to go with this is being > able to turn an ExpandedObject (which could contain very large amounts > of data) directly into a toast pointer or vice versa. There's nothing > really preventing a TOAST OID from having more than 1GB of data > attached, and if you had a side channel like this you could transfer > the data without ever having to form a larger-than-1GB tuple. BTW, you could certainly imagine attaching such infrastructure for direct-to-TOAST-table I/O to ExpandedObjects today, independently of any ambitions about larger-than-1GB values. I'm not entirely sure how often it would get exercised, which is the key subtext of what I wrote before, but it's clearly a possible optimization of what we do now. regards, tom lane
Re: [HACKERS] Quorum commit for multiple synchronous replication.
On Tue, Dec 6, 2016 at 11:26 PM, Michael Paquier wrote: > On Wed, Dec 7, 2016 at 12:32 PM, Fujii Masao wrote: >> So, isn't it better to compare the performance of some algorithms and >> confirm which is the best for quorum commit? Since this code is hot, i.e., >> can be very frequently executed, I'd like to avoid waste of cycle as much >> as possible. > > It seems to me that it would be simple enough to write a script to do > that to avoid any other noise: allocate an array with N random > elements, and fetch the M-th element from it after applying a sort > method. I highly doubt that you'd see much difference with a low > number of elements, now if you scale at a thousand standbys in a > quorum set you may surely see something :*) > Anybody willing to try out? You could do that, but first I would code up the simplest, cleanest algorithm you can think of and see if it even shows up in a 'perf' profile. Microbenchmarking is probably overkill here unless a problem is visible on macrobenchmarks. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Back-patch use of unnamed POSIX semaphores for Linux?
On Wed, Dec 7, 2016 at 6:49 PM, Tom Lane wrote: > Alvaro Herrera writes: >> Tom Lane wrote: >>> Color me confused ... maybe systemd's behavior has changed? > >> https://lists.fedoraproject.org/archives/list/de...@lists.fedoraproject.org/thread/ZNQW72UP36UAFMX53HPFFQTWTQDZVJ3M/ > > I see Lennart hasn't gotten any less convinced that he's always right > since I left Red Hat :-( This thread does seem to give that impression. It's nice to know there are engineers in the world even more arrogant than we are. :-) > But anyway, it's a demonstrable fact that Fedora 25 has KillUserProcesses > off by default, even though it contains systemd-231. I assume FESCO > brought down the hammer at some point. https://pagure.io/fesco/issue/1600 seems to suggest that it's merely in abeyance. (See the first two updates and the last one for the executive summary.) > This still doesn't address the real question, which is whether RemoveIPC > does anything if KillUserProcesses is off, and whether that behavior > has changed. I don't see anything about RemoveIPC in that thread. http://www.dsm.fordham.edu/cgi-bin/man-cgi.pl?topic=logind.conf&sect=5 suggests that KillUserProcesses and RemoveIPC are separate cleanup behaviors triggered by the same underlying cause (termination of last session). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Patch to implement pg_current_logfile() function
Robert Haas writes: > Yikes. This patch has certainly had a lot of review, but it has lots > of basic inconsistencies with PostgreSQL coding style, which one would > think somebody would have noticed by now, and there are multiple > places where the logic seems to do in a complicated way something that > could be done significantly more simply. I don't have any objection > to the basic idea of this patch, but it's got to look and feel like > the surrounding PostgreSQL code. And not be unnecessarily > complicated. A lot of the code-formatting issues could probably be fixed by running it through pgindent. Also, when you are starting from code that was written with little regard for Postgres layout conventions, it's a really good idea to see what pgindent will do to it, because it might not look very nice afterwards. regards, tom lane
Re: [HACKERS] Patch to implement pg_current_logfile() function
On Tue, Dec 6, 2016 at 6:11 PM, Gilles Darold wrote: > It seems that all fixes have been included in this patch. Yikes. This patch has certainly had a lot of review, but it has lots of basic inconsistencies with PostgreSQL coding style, which one would think somebody would have noticed by now, and there are multiple places where the logic seems to do in a complicated way something that could be done significantly more simply. I don't have any objection to the basic idea of this patch, but it's got to look and feel like the surrounding PostgreSQL code. And not be unnecessarily complicated. Detailed comments below:

The SGML doesn't match the surrounding style - for example, typically appears on a line by itself.

+if (!Logging_collector) {

Formatting.

rm_log_metainfo() could be merged into logfile_writename(), since they're always called in conjunction and in the same pattern. The function is poorly named; it should be something like update_metainfo_datafile().

+if (errno == ENFILE || errno == EMFILE)
+ereport(LOG,
+(errmsg("system is too busy to write logfile meta info, %s will be updated on next rotation (or use SIGHUP to retry)", LOG_METAINFO_DATAFILE)));

This seems like a totally unprincipled reaction to a purely arbitrary subset of errors. EMFILE or ENFILE doesn't represent a general "too busy" condition, and logfile_open() has already logged the real error.

+snprintf(tempfn, sizeof(tempfn), "%s.tmp", LOG_METAINFO_DATAFILE);

You don't really need to use snprintf() here. You could #define LOG_METAINFO_DATAFILE_TMP LOG_METAINFO_DATAFILE ".tmp" and compute this at compile time instead of runtime.

+if (PG_NARGS() == 1) {
+fmt = PG_ARGISNULL(0) ? NULL : PG_GETARG_TEXT_PP(0);
+if (fmt != NULL) {
+logfmt = text_to_cstring(fmt);
+if ( (strcmp(logfmt, "stderr") != 0) &&
+(strcmp(logfmt, "csvlog") != 0) ) {
+ereport(ERROR,
+(errmsg("log format %s not supported, possible values are stderr or csvlog", logfmt)));
+PG_RETURN_NULL();
+}
+}
+} else {
+logfmt = NULL;
+}

Formatting. This is unlike PostgreSQL style in multiple ways, including cuddled braces, extra parentheses and spaces, and use of braces around a single-line block. Also, this could be written in a much less contorted way, like:

if (PG_NARGS() == 0 || PG_ARGISNULL(0))
    logfmt = NULL;
else
{
    logfmt = text_to_cstring(PG_GETARG_TEXT_PP(0));
    if (strcmp(logfmt, "stderr") != 0 && strcmp(logfmt, "csvlog") != 0)
        ereport(...);
}

Also, the ereport() needs an errcode(), not just an errmsg(). Otherwise it defaults to ERRCODE_INTERNAL_ERROR, which is not correct.

+/* Check if current log file is present */
+if (stat(LOG_METAINFO_DATAFILE, &stat_buf) != 0)
+PG_RETURN_NULL();

Useless test. The code that follows catches this case anyway and handles it the same way. Which is good, because this has an inherent race condition. The previous if (!Logging_collector) test seems fairly useless too; unless there's a bug in your syslogger.c code, the file won't exist anyway, so we'll return NULL for that reason with no extra code needed here.

+/*
+* Read the file to gather current log filename(s) registered
+* by the syslogger.
+*/

Whitespace isn't right.

+while (fgets(lbuffer, sizeof(lbuffer), fd) != NULL) {
+char log_format[10];
+int i = 0, space_pos = 0;
+
+/* Check for a read error. */
+if (ferror(fd)) {

More coding style issues.

+ereport(ERROR,
+(errcode_for_file_access(),
+errmsg("could not read file \"%s\": %m", LOG_METAINFO_DATAFILE)));
+FreeFile(fd);
+break;

ereport(ERROR) doesn't return, so the code that follows is pointless.

+if (strchr(lbuffer, '\n') != NULL)
+*strchr(lbuffer, '\n') = '\0';

Probably worth adding a local variable instead of doing this twice. Local variables are cheap, and the code would be more readable.

+if ((space_pos == 0) && (isspace(lbuffer[i]) != 0))

Too many parentheses. Also, I think the whole loop in which this is contained could be eliminated entirely. Just search for the first ' ' character with strchr(); you don't need to care about isspace() because you know the code that writes this file uses ' ' specifically. Overwrite it with '\0'. And then use a pointer to lbuffer for the log format and a pointer to lbuffer + strchr_result + 1 for the pathname.

+if ((space_pos != (int)strlen("stderr")) &&
+(space_pos != (int)strlen("csvlog")))
+{
+ereport(ERROR,
+(errmsg("unexpected line format in file %s", LOG_METAINFO_DATAFILE)));
+break;
+}

I think this is pointless. Validating the length of the log format but not the content seems kind of silly, and why do either? T
Re: [HACKERS] Back-patch use of unnamed POSIX semaphores for Linux?
Alvaro Herrera writes: > Tom Lane wrote: >> Color me confused ... maybe systemd's behavior has changed? > https://lists.fedoraproject.org/archives/list/de...@lists.fedoraproject.org/thread/ZNQW72UP36UAFMX53HPFFQTWTQDZVJ3M/ I see Lennart hasn't gotten any less convinced that he's always right since I left Red Hat :-( But anyway, it's a demonstrable fact that Fedora 25 has KillUserProcesses off by default, even though it contains systemd-231. I assume FESCO brought down the hammer at some point. This still doesn't address the real question, which is whether RemoveIPC does anything if KillUserProcesses is off, and whether that behavior has changed. I don't see anything about RemoveIPC in that thread. regards, tom lane
Re: [HACKERS] varlena beyond 1GB and matrix
Robert Haas writes: > On Wed, Dec 7, 2016 at 8:50 AM, Kohei KaiGai wrote: >> I like to propose a new optional type handler 'typserialize' to >> serialize an in-memory varlena structure (that can have indirect >> references) to on-disk format. > I think it's probably a mistake to conflate objects with substructure > with objects > 1GB. Those are two somewhat orthogonal needs. Maybe. I think where KaiGai-san is trying to go with this is being able to turn an ExpandedObject (which could contain very large amounts of data) directly into a toast pointer or vice versa. There's nothing really preventing a TOAST OID from having more than 1GB of data attached, and if you had a side channel like this you could transfer the data without ever having to form a larger-than-1GB tuple. The hole in that approach, to my mind, is that there are too many places that assume that they can serialize an ExpandedObject into part of an in-memory tuple, which might well never be written to disk, or at least not written to disk in a table. (It might be intended to go into a sort or hash join, for instance.) This design can't really work for that case, and unfortunately I think it will be somewhere between hard and impossible to remove all the places where that assumption is made. At a higher level, I don't understand exactly where such giant ExpandedObjects would come from. (As you point out, there's certainly no easy way for a client to ship over the data for one.) So this feels like a very small part of a useful solution, if indeed it's part of a useful solution at all, which is not obvious. FWIW, ExpandedObjects themselves are far from a fully fleshed out concept, one of the main problems being that they don't have very long lifespans except in the case that they're the value of a plpgsql variable. I think we would need to move things along quite a bit in that area before it would get to be useful to think in terms of ExpandedObjects containing multiple GB of data. 
Otherwise, the inevitable flattenings and re-expansions are just going to kill you. Likewise, the need for clients to be able to transfer data in chunks gets pressing well before you get to 1GB. So there's a lot here that really should be worked on before we try to surmount that barrier. regards, tom lane
Re: [HACKERS] Back-patch use of unnamed POSIX semaphores for Linux?
Tom Lane wrote: > Hmm ... after further experimentation, I still can't get this version of > systemd (231) to do anything evil. It turns out that Fedora ships it with > KillUserProcesses turned off by default, and maybe having that on is a > prerequisite for the other behavior? But that doesn't make a lot of sense > because we'd never be seeing the reports of databases moaning about lost > semaphores if the processes got killed first. Anyway, I see nothing bad > happening if KillUserProcesses is off, while if it's on then the database > gets shut down reasonably politely via SIGTERM. > > Color me confused ... maybe systemd's behavior has changed? https://lists.fedoraproject.org/archives/list/de...@lists.fedoraproject.org/thread/ZNQW72UP36UAFMX53HPFFQTWTQDZVJ3M/ -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Partitionning: support for Truncate Table WHERE
Here is an example:

CREATE OR REPLACE FUNCTION truncate_table_where(v_table VARCHAR, v_where_condition VARCHAR)
RETURNS void AS $$
DECLARE
    v_stmt varchar;
    v_tableoid oid;
    v_part varchar;
    v_found_other integer;
BEGIN
    LOOP
        -- find one partition still containing rows that match the condition
        v_stmt := 'SELECT tableoid FROM '||v_table||' WHERE '||v_where_condition||' limit 1 ';
        EXECUTE v_stmt INTO v_tableoid;
        IF (v_tableoid is null) THEN
            EXIT;
        END IF;
        SELECT pg_namespace.nspname||'.'||pg_class.relname INTO v_part
          FROM pg_catalog.pg_class
          INNER JOIN pg_namespace ON pg_class.relnamespace = pg_namespace.oid
          WHERE pg_class.oid = v_tableoid;
        RAISE NOTICE 'Partition found: %', v_part;
        -- check if other data in part
        v_stmt := 'SELECT 1 FROM '||v_part||' WHERE NOT ('||v_where_condition||') limit 1 ';
        EXECUTE v_stmt INTO v_found_other;
        IF (v_found_other = 1) THEN
            v_stmt := 'DELETE FROM '||v_part||' WHERE '||v_where_condition;
            RAISE NOTICE 'Executing: %', v_stmt;
            EXECUTE v_stmt;
        ELSE
            v_stmt := 'TRUNCATE '||v_part;
            RAISE NOTICE 'Executing: %', v_stmt;
            EXECUTE v_stmt;
        END IF;
    END LOOP;
END
$$ LANGUAGE plpgsql;

From: Amit Langote
Sent: Wednesday, December 7, 2016 06:58:03
To: Craig Ringer; legrand legrand
Cc: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Partitionning: support for Truncate Table WHERE

On 2016/12/07 15:26, Craig Ringer wrote: > On 7 December 2016 at 07:29, legrand legrand > wrote: > >> Working in a DSS environment, we often need to truncate table partitions >> regarding a WHERE condition and have to >> [...] >> Would be pleased to hear your feedback regarding this. > > It sounds like something that'd be useful to do on top of declarative > partitioning, once that is merged. Perhaps you could start by reading > and testing the declarative partitioning patch. That'll give you a > better idea of the practicalities of doing what you propose on top of > it, and give you an opportunity to suggest changes to the declarative > partitioning scheme that might make conditional truncate easier later. Agreed. 
If I understand the request correctly, TRUNCATE on the parent table (a partitioned table), which currently recurses to *all* child tables (partitions), should have a restricting WHERE clause, right? It would become possible to implement something like that with the new declarative partitioned tables. As Craig mentioned, you can take a look at the discussion about declarative partitioning in the emails linked to at the following page: https://commitfest.postgresql.org/12/611/ Thanks, Amit
Re: [HACKERS] varlena beyond 1GB and matrix
On Wed, Dec 7, 2016 at 8:50 AM, Kohei KaiGai wrote: > I like to propose a new optional type handler 'typserialize' to > serialize an in-memory varlena structure (that can have indirect > references) to on-disk format. > If any, it shall be invoked at the head of toast_insert_or_update(), > then indirect references are transformed to something other which > is safe to save. (My expectation is, the 'typserialize' handler > preliminarily saves the indirect chunks to the toast relation, then > puts toast pointers instead.) This might not work. The reason is that we have important bits of code that expect that they can figure out how to do some operation on a datum (find its length, copy it, serialize it) based only on typlen and typbyval. See src/backend/utils/adt/datum.c for assorted examples. Note also the lengthy comment near the top of the file, which explains that typlen > 0 indicates a fixed-length type, typlen == -1 indicates a varlena, and typlen == -2 indicates a cstring. I think there's probably room for other typlen values; for example, we could have typlen == -3 indicate some new kind of thing -- a super-varlena that has a higher length limit, or some other kind of thing altogether. Now, you can imagine trying to treat what you're talking about as a new type of TOAST pointer, but I think that's not going to help, because at some point the TOAST pointer gets de-toasted into a varlena ... which is still limited to 1GB. So that's not really going to work. And it brings out another point, which is that if you define a new typlen code, like -3, for super-big things, they won't be varlenas, which means they won't work with the existing TOAST interfaces. Still, you might be able to fix that. You would probably have to do some significant surgery on the wire protocol, per the commit message for fa2fa995528023b2e6ba1108f2f47558c6b66dcd. I think it's probably a mistake to conflate objects with substructure with objects > 1GB. Those are two somewhat orthogonal needs. 
As Jim also points out, expanded objects serve the first need. Of course, those assume that we're dealing with a varlena, so if we made a super-varlena, we'd still need to create an equivalent. But perhaps it would be a fairly simple adaptation of what we've already got. Handling objects >1GB at all seems like the harder part of the problem. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] pg_dump vs. TRANSFORMs
Stephen Frost writes: > As pointed out by Peter E, this also impacts CASTs. Attached is a patch > which addresses both by simply also pulling any functions which are > referenced from pg_cast or pg_transform when they have OIDs at or after > FirstNormalObjectId. I also modified dumpCast() and dumpTransform() to > complain loudly if they're unable to dump out the cast or transform due > to not finding the function definition(s) necessary. Please do not hack the already-overcomplicated query in getFuncs without bothering to adjust the long comment that describes what it's doing. I have a vague feeling that the code for dumping casts and/or transforms may have some assumptions that the underlying function is also being dumped. Although maybe the assumption was really only what's fixed here, ie that there be a DumpableObject for the function. Anyway, take a close look for that. Looks reasonable otherwise. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Back-patch use of unnamed POSIX semaphores for Linux?
On 2016-12-06 23:54:43 -0500, Tom Lane wrote: > You're attacking a straw man. I didn't propose changing our behavior > anywhere but Linux. AFAIK, on that platform unnamed POSIX semaphores > are futexes, which have been a stable feature since 2003 according to > https://en.wikipedia.org/wiki/Futex#History. Anybody who did need > to compile PG for use with a pre-2.6 kernel could override the default, > anyway. Back then futexes weren't "robust", though (crash handling and such was unusable). They only started to be reliable in the ~2007-2008 timeframe, IIRC. That still should be OK, though. Regards, Andres
Re: [HACKERS] Back-patch use of unnamed POSIX semaphores for Linux?
Hi, On 2016-12-06 21:53:06 -0500, Tom Lane wrote: > Just saw another report of what's probably systemd killing off Postgres' > SysV semaphores, as we've discussed previously at, eg, > https://www.postgresql.org/message-id/flat/57828C31.5060409%40gmail.com > Since the systemd people are generally impervious to suggestions that > they might be mistaken, I do not expect this problem to go away. Would doing so actually solve the systemd issue? Doesn't systemd also remove SysV shared memory, which we still use a tiny bit of? > I think we should give serious consideration to back-patching commit > ecb0d20a9, which changed the default semaphore type to unnamed-POSIX > on Linux. We've seen no problems in the buildfarm in the two months > that that's been in HEAD. If we don't do this, we can expect to > continue seeing complaints of this sort until pre-v10 PG releases > fall out of use ... and I don't want to wait that long. I'm a bit uncomfortable back-patching this change before it has seen production usage. Both the POSIX and SysV semaphore implementations have evolved over the years, with changing performance characteristics. I don't think it's fair to users to swap a proven solution out for something that hasn't seen a lot of load. Greetings, Andres Freund
Re: [HACKERS] Back-patch use of unnamed POSIX semaphores for Linux?
Alex Hunsaker writes: > On Wed, Dec 7, 2016 at 1:12 PM, Tom Lane wrote: >> But this is all kind of moot if Peter is right that systemd will zap >> POSIX shmem along with SysV semaphores. I've been trying to reproduce >> the issue on a Fedora 25 installation, and so far I can't get it to >> zap anything, so I'm a bit at a loss how to prove things one way or >> the other. > After logon, you should see "/usr/lib/systemd/systemd --user" running for > that user. After logout out, said proc should exit. Hmm ... after further experimentation, I still can't get this version of systemd (231) to do anything evil. It turns out that Fedora ships it with KillUserProcesses turned off by default, and maybe having that on is a prerequisite for the other behavior? But that doesn't make a lot of sense because we'd never be seeing the reports of databases moaning about lost semaphores if the processes got killed first. Anyway, I see nothing bad happening if KillUserProcesses is off, while if it's on then the database gets shut down reasonably politely via SIGTERM. Color me confused ... maybe systemd's behavior has changed? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Indirect indexes
On Thu, Oct 20, 2016 at 11:30 AM, Pavan Deolasee wrote: > We have a design to convert WARM chains back to HOT and that will increase > the percentage of WARM updates much beyond 50%. I was waiting for feedback > on the basic patch before putting in more efforts, but it went unnoticed > last CF. While you did sign up to review one patch in the last CF, the amount of review you did for that patch is surely an order of magnitude less than what WARM will require. Maybe more than that. I don't mean to point the finger at you specifically -- there are lots of people slinging patches into the CommitFest who aren't doing as much review as their own patches will require. I'm putting a lot of time into reviewing patches this year, and basically none into writing my own, but I still can't review every major patch that somebody submits. I can't even do committer review of all of those patches, let alone first-round review. Perhaps I ought to rank the things I review by descending order of importance, in which case this arguably ought to be pretty high on the list. But I'd feel somewhat bad working on this instead of, say, multivariate statistics or unique joins, which have been pending for a lot longer. Anyway, the point, not just to you but to everybody, is that the review can't always be left to other people. Some people will review and not contribute any code, and that's great. Some people will contribute code but not review, and to the extent that we can support that, it's also great. But the giant backlog of unreviewed patches which has accumulated shows that we have too many people needing more review than they produce, and that is not great. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] varlena beyond 1GB and matrix
On 12/7/16 5:50 AM, Kohei KaiGai wrote: > If and when this structure is fetched from the tuple, its @ptr_block is > initialized to NULL. Once it is supplied to a function which references a > part of blocks, type specific code can load sub-matrix from the toast > relation, then update the @ptr_block not to load the sub-matrix from the > toast multiple times. I'm not certain whether it is acceptable > behavior/manner. I'm glad you're looking into this. The 1G limit is becoming a larger problem every day. Have you considered using ExpandedObjects to accomplish this? I don't think the API would work as-is, but I suspect there's other places where we'd like to be able to have this capability (arrays and JSONB come to mind). -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Experts in Analytics, Data Architecture and PostgreSQL Data in Trouble? Get it in Treble! http://BlueTreble.com 855-TREBLE2 (855-873-2532)
Re: [HACKERS] Back-patch use of unnamed POSIX semaphores for Linux?
On Wed, Dec 7, 2016 at 1:12 PM, Tom Lane wrote: > But this is all kind of moot if Peter is right that systemd will zap > POSIX shmem along with SysV semaphores. I've been trying to reproduce > the issue on a Fedora 25 installation, and so far I can't get it to > zap anything, so I'm a bit at a loss how to prove things one way or > the other. > Don't know precisely about Fedora 25, but I've had success in the past with:

1. ssh in as the user
2. start postgres under tmux/screen
3. logout
4. do another ssh login/logout cycle

After logon, you should see "/usr/lib/systemd/systemd --user" running for that user. After logging out, said proc should exit. If either of those is not true, either systemd is not set up to track sessions (probably via pam) or it thinks you still have an active logon. Another way to check if systemd thinks the user is logged in is if /run/user/$USER exists.
Re: [HACKERS] Typmod associated with multi-row VALUES constructs
Tom Lane wrote: > In HEAD, we could change the RTE data structure so that > transformValuesClause could save the typmod information in the RTE, > keeping the lookups cheap. Hmm, I think this would be useful for the XMLTABLE patch too. I talked a bit about it at https://www.postgresql.org/message-id/20161122204730.dgipy6gxi25j4e6a@alvherre.pgsql The patch has evolved quite a bit since then, but the tupdesc continues to be a problem. Latest patch is at https://www.postgresql.org/message-id/CAFj8pRBsrhwR636-_3TPbqu%3DFo3_DDer6_yp_afzR7qzhW1T6Q%40mail.gmail.com -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Indirect indexes
On Fri, Oct 21, 2016 at 7:04 PM, Alvaro Herrera wrote: > Robert Haas wrote: >> So, I think that this is a really promising direction, but also that >> you should try very hard to try to get out from under this 6-byte PK >> limitation. That seems really ugly, and in practice it probably means >> your PK is probably going to be limited to int4, which is kind of sad >> since it leaves people using int8 or text PKs out in the cold. > > I think we could just add a new type, unsigned 6 byte int, specifically > for this purpose. Little in the way of operators, as it's pointless to > try to do arithmetic with object identifiers. (It'd be similar to UUID > in spirit, but I wouldn't try to do anything too smart to generate them.) Sure, we could do that, but that's just band-aiding over the fact that the index page format isn't really what we want for a feature of this type. > Yes, recheck is always needed. > > As for vacuum, I was thinking this morning that perhaps the answer to > that is just to not vacuum the index at all and instead rely on the > killtuple interface (which removes tuples during scan). So we don't > need to spend precious maint_work_mem space on a large list of PK > values. I don't think that's going to fly. Even if it's the case that indirect indexes typically need less cleanup than regular indexes, the idea that there's no command to force a full cleanup short of REINDEX doesn't sit well with me. It's not difficult to construct realistic scenarios in which kill_prior_tuple is almost useless (e.g. values are all marching forward). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [sqlsmith] Short reads in hash indexes
Andreas Seltenreich writes: > Amit Kapila writes: > >> On Sat, Dec 3, 2016 at 3:44 PM, Andreas Seltenreich >> wrote: >>> Amit Kapila writes: >>> [2. text/x-diff; fix_hash_bucketsplit_sqlsmith_v1.patch] >>> Ok, I'll do testing with the patch applied. > > Good news: the assertion hasn't fired since the patch is in. Meh, it fired again today after being silent for 100e6 queries :-/ I guess I need to add some confidence qualification on such statements. Maybe sigmas as they do at CERN… > smith=# select * from state_report where sqlstate = 'XX001'; > -[ RECORD 1 > ]-- > count| 10 > sqlstate | XX001 > sample | ERROR: could not read block 1173 in file "base/16384/17256": read > only 0 of 8192 bytes > hosts| {airbisquit,frell,gorgo,marbit,pillcrow,quakken} > >> Hmm, I am not sure if this is related to previous problem, but it >> could be. Is it possible to get the operation and or callstack for >> above failure? > > Ok, will turn the elog into an assertion to get at the backtraces. Doing so on top of 4212cb7, I caught the backtrace below. 
Query was: --8<---cut here---start->8--- set max_parallel_workers_per_gather = 0; select count(1) from public.hash_name_heap as ref_2 join public.rtest_emplog as sample_1 on (ref_2.random = sample_1.who); --8<---cut here---end--->8--- I've put the data directory where it can be reproduced here: http://ansel.ydns.eu/~andreas/hash_index_short_read.tar.xz (12MB) regards, Andreas TRAP: FailedAssertion("!(!"short read of block")", File: "md.c", Line: 782) #2 0x007f7f11 in ExceptionalCondition (conditionName=conditionName@entry=0x9a1ae9 "!(!\"short read of block\")", errorType=errorType@entry=0x83db3d "FailedAssertion", fileName=fileName@entry=0x946a9a "md.c", lineNumber=lineNumber@entry=782) at assert.c:54 #3 0x006fb305 in mdread (reln=, forknum=, blocknum=4702, buffer=0x7fe97e7e1280 "\"") at md.c:782 #4 0x006d0ffa in ReadBuffer_common (smgr=0x2af7408, relpersistence=, forkNum=forkNum@entry=MAIN_FORKNUM, blockNum=blockNum@entry=4702, mode=RBM_NORMAL, strategy=, hit=0x7ffde9df11cf "") at bufmgr.c:890 #5 0x006d1a20 in ReadBufferExtended (reln=0x2fd10d8, forkNum=forkNum@entry=MAIN_FORKNUM, blockNum=4702, mode=mode@entry=RBM_NORMAL, strategy=strategy@entry=0x0) at bufmgr.c:664 #6 0x006d1b74 in ReadBuffer (blockNum=, reln=) at bufmgr.c:596 #7 ReleaseAndReadBuffer (buffer=buffer@entry=87109984, relation=, blockNum=) at bufmgr.c:1540 #8 0x004c047b in index_fetch_heap (scan=scan@entry=0x5313160) at indexam.c:469 #9 0x004c05ee in index_getnext (scan=scan@entry=0x5313160, direction=direction@entry=ForwardScanDirection) at indexam.c:565 #10 0x005f9b71 in IndexNext (node=node@entry=0x5311c48) at nodeIndexscan.c:105 #11 0x005ec492 in ExecScanFetch (recheckMtd=0x5f9af0 , accessMtd=0x5f9b30 , node=0x5311c48) at execScan.c:95 #12 ExecScan (node=0x5311c48, accessMtd=0x5f9b30 , recheckMtd=0x5f9af0 ) at execScan.c:145 #13 0x005e4da8 in ExecProcNode (node=node@entry=0x5311c48) at execProcnode.c:427 #14 0x006014f9 in ExecNestLoop (node=node@entry=0x53110a8) at nodeNestloop.c:174 #15 
0x005e4cf8 in ExecProcNode (node=node@entry=0x53110a8) at execProcnode.c:476 #16 0x00601436 in ExecNestLoop (node=node@entry=0x5310e00) at nodeNestloop.c:123 #17 0x005e4cf8 in ExecProcNode (node=node@entry=0x5310e00) at execProcnode.c:476 #18 0x00601436 in ExecNestLoop (node=node@entry=0x530f698) at nodeNestloop.c:123 #19 0x005e4cf8 in ExecProcNode (node=node@entry=0x530f698) at execProcnode.c:476 #20 0x005e0e9e in ExecutePlan (dest=0x603a4a8, direction=, numberTuples=0, sendTuples=, operation=CMD_SELECT, use_parallel_mode=, planstate=0x530f698, estate=0x46bc008) at execMain.c:1568 #21 standard_ExecutorRun (queryDesc=0x3475168, direction=, count=0) at execMain.c:338 #22 0x007029f8 in PortalRunSelect (portal=portal@entry=0x2561e18, forward=forward@entry=1 '\001', count=0, count@entry=9223372036854775807, dest=dest@entry=0x603a4a8) at pquery.c:946 #23 0x00703f3e in PortalRun (portal=portal@entry=0x2561e18, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=1 '\001', dest=dest@entry=0x603a4a8, altdest=altdest@entry=0x603a4a8, completionTag=completionTag@entry=0x7ffde9df18b0 "") at pquery.c:787 #24 0x00700d5b in exec_simple_query (query_string=0x4685258 ) at postgres.c:1094 #25 PostgresMain (argc=, argv=argv@entry=0x256f5a8, dbname=0x256f580 "regression", username=) at postgres.c:4069 #26 0x0046daf2 in BackendRun (port=0x25645a0) at postmaster.c:4274 #27 BackendStartup (port=0x25645a0) at postmaster.c:3946 #28 ServerLoop () at postmaster.c:1704 #29 0x00699d28 in PostmasterMain (argc=argc@entry=4,
Re: [HACKERS] Unlogged tables cleanup
Michael, your greetings were passed on to me with a request that I look at this thread. On Fri, Nov 18, 2016 at 12:33 PM, Michael Paquier wrote: >>> More seriously, if there could be more details regarding that, I would >>> think that we could say something like "logging the init fork is >>> mandatory in any case to ensure its on-disk presence when at recovery >>> replay, even on non-default tablespaces whose base location are >>> deleted and re-created from scratch if the WAL record in charge of >>> creating this tablespace gets replayed". The problem shows up because >>> of tablespaces being deleted at replay at the end... So perhaps this >>> makes sense. What do you think? >> >> Yes, that's about what I think we need to explain. > > OK, what do you think about the attached then? > > I came up with something like that for those code paths: > - /* Write the page. If archiving/streaming, XLOG it. */ > + /* > +* Write the page and log it unconditionally. This is important > +* particularly for indexes created on tablespaces and databases > +* whose creation happened after the last redo pointer as recovery > +* removes any of their existing content when the corresponding > +* create records are replayed. > +*/ > I have not been able to use the word "create" less than that. The > patch is really repetitive, but I think that we had better mention the > need of logging the content in each code path and not touch > log_newpage() to keep it a maximum flexible. In blinsert.c, nbtree.c, etc. how about: Write the page and log it. It might seem that an immediate sync would be sufficient to guarantee that the file exists on disk, but recovery itself might remove it while replaying, for example, an XLOG_DBASE_CREATE record. Therefore, we need this even when wal_level=minimal. >> Actually, I'm >> wondering if we ought to just switch this over to using the delayChkpt >> mechanism instead of going through this complicated song-and-dance. 
>> Incurring an immediate sync to avoid having to WAL-log this is a bit >> tenuous, but having to WAL-log it AND do the immediate sync just seems >> silly, and I'm actually a bit worried that even with your fix there >> might still be a latent bug somewhere. We couldn't switch mechanisms >> cleanly in the 9.2 branch (cf. >> f21bb9cfb5646e1793dcc9c0ea697bab99afa523) so perhaps we should >> back-patch it in the form you have it in already, but it's probably >> worth switching over at least in master. > > Thinking through it... Could we not just rip off the sync phase > because there is delayChkpt? It seems to me that what counts is that > the commit of the transaction that creates the relation does not get > past the redo point. It would matter for read uncommitted transactions > but that concept does not exist in PG, and never will. On > back-branches it is definitely safer to keep the sync, I am just > thinking about a HEAD-only optimization here as you do. Right (I think). If we set and clear delayChkpt around this work, we don't need the immediate sync. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Password identifiers, protocol aging and SCRAM protocol
On 12/07/2016 08:39 AM, Michael Paquier wrote: On Tue, Nov 29, 2016 at 1:36 PM, Michael Paquier wrote: Nothing more will likely happen in this CF, so I have moved it to 2017-01 with the same status of "Needs Review". Attached is a new set of patches using the new routines pg_backend_random() and pg_strong_random() to handle the randomness in SCRAM: - 0001 refactors the SHA2 routines. pgcrypto uses raw files from src/common when compiling with this patch. That works on any platform, and this is the simplified version of upthread. - 0002 adds base64 routines to src/common. - 0003 does some refactoring regarding the password encryption in ALTER/CREATE USER queries. - 0004 adds the clause PASSWORD (val USING method) in CREATE/ALTER USER. - 0005 is the code patch for SCRAM. Note that this switches pgcrypto to link to libpgcommon as SHA2 routines are used by the backend. - 0006 adds some regression tests for passwords. - 0007 adds some TAP tests for authentication. This is added to the upcoming CF. I spent a little time reading through this once again. Steady progress, did some small fixes: * Rewrote the nonce generation. In the server-side, it first generated a string of ascii-printable characters, then base64-encoded them, which is superfluous. Also, avoid calling pg_strong_random() one byte at a time, for performance reasons. * Added a more sophisticated fallback implementation in libpq, for the --disable-strong-random cases, similar to pg_backend_random(). * No need to disallow SCRAM with db_user_namespace. It doesn't include the username in the salt like MD5 does. Attached those here, as add-on patches to your latest patch set. I'll continue reviewing, but a couple of things caught my eye that you may want to jump on, in the meanwhile: On error messages, the spec says: o e: This attribute specifies an error that occurred during authentication exchange. 
It is sent by the server in its final message and can help diagnose the reason for the authentication exchange failure. On failed authentication, the entire server-final-message is OPTIONAL; specifically, a server implementation MAY conclude the SASL exchange with a failure without sending the server-final-message. This results in an application-level error response without an extra round-trip. If the server-final-message is sent on authentication failure, then the "e" attribute MUST be included. Note that it says that the server can send the error message with the e= attribute, in the *final message*. It's not a valid response in the earlier state, before sending server-first-message. I think we need to change the INIT state handling in pg_be_scram_exchange() to not send e= messages to the client. On an error at that state, it needs to just bail out without a message. The spec allows that. We can always log the detailed reason in the server log, anyway. As Peter E pointed out earlier, the documentation is lacking on how to configure MD5 and/or SCRAM. If you put "scram" as the authentication method in pg_hba.conf, what does it mean? If you have a line for both "scram" and "md5" in pg_hba.conf, with the same database/user/hostname combo, what does that mean? Answer: The first one takes effect, the second one has no effect. Yet the example in the docs now has that, which is nonsense :-). Hopefully we'll have some kind of a "both" option before the release, but in the meanwhile, we need to describe how this works now in the docs. - Heikki >From 4d3a59ae1cb5742499c71b0c1e048d30dcef6836 Mon Sep 17 00:00:00 2001 From: Heikki Linnakangas Date: Wed, 7 Dec 2016 15:24:55 +0200 Subject: [PATCH 08/11] Rewrite nonce generation. In the server, the nonce was generated using only ASCII-printable characters, and the result was base64-encoded. The base64 encoding is pointless, if we use only ASCII-printable chars to begin with. 
Calling pg_strong_random() can be somewhat expensive, as with the /dev/urandom implementation, it has to open the device, read the bytes, and close, on every call. So avoid calling it in a loop, generating only one byte in each call. I went back to using base64-encoding method of turning the raw bytes into the final nonce. That was more convenient than writing something that encodes to the whole ASCII-printable range. That means that we're not using the whole range of chars allowed in the nonce, but I believe that doesn't make any difference. (Both the frontend and backend will still accept the full range from the other side of the connection). --- src/backend/libpq/auth-scram.c | 52 --- src/include/common/scram-common.h| 6 +++- src/include/libpq/libpq-be.h | 2 -- src/interfaces/libpq/fe-auth-scram.c | 60 ++-- 4 files changed, 34 insertions(+), 86 deletions(-) diff --git a/src/backend/libpq/auth-scram.c b/src/backend/libpq/auth-scram.c index 55c5efa..cda663b
Re: [HACKERS] patch: function xmltable
2016-12-07 20:50 GMT+01:00 Alvaro Herrera : > Pavel Stehule wrote: > > 2016-12-07 18:34 GMT+01:00 Alvaro Herrera : > > > > Hmm. Now that I see how this works, by having the GetValue "guess" > what > > > is going on and have a special case for it, I actually don't like it > > > very much. It seems way too magical. I think we should do away with > > > the "if column is NULL" case in GetValue, and instead inject a column > > > during transformTableExpr if columns is NIL. This has implications on > > > ExecInitExpr too, which currently checks for an empty column list -- it > > > would no longer have to do so. > > > > I prefer this way against second described. The implementation should be > in > > table builder routines, not in executor. > > Well, given the way you have implemented it, I would prefer the original > too. But your v23 is not what I meant. Essentially what you do in v23 > is to communicate the lack of COLUMNS clause in a different way -- > previously it was "ncolumns = 0", now it's "is_auto_col=true". It's > still "magic". It's not an improvement. > is_auto_col is used primary for asserting. The table builder has information for decision in parameter path, when path is NULL. Hard to say, if this info should be assigned to column or to table. In both locations has sense. But somewhere should be some flag. > > What I want to happen is that there is no magic at all; it's up to > transformExpr to make sure that when COLUMNS is empty, one column > appears and it must not be a magic column that makes the xml.c code act > differently, but rather to xml.c it should appear that this is just a > normal column that happens to return the entire row. If I say "COLUMNS > foo PATH '/'" I should be able to obtain a similar behavior (except that > in the current code, if I ask for "COLUMNS foo XML PATH '/'" I don't get > XML at all but rather weird text where all tags have been stripped out, > which is very strange. 
I would expect the tags to be preserved if the > output type is XML. Maybe the tag-stripping behavior should occur if > the output type is some type of text.) I am doing that; I am just using NULL for PATH. > I still have to figure out how to fix the tupledesc thing. What we have > now is not good. Could it be moved to nodefunc? Pavel > -- > Álvaro Herrera https://www.2ndQuadrant.com/ > PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Back-patch use of unnamed POSIX semaphores for Linux?
On Wed, Dec 7, 2016 at 3:12 PM, Tom Lane wrote: > Robert Haas writes: >> On Tue, Dec 6, 2016 at 11:54 PM, Tom Lane wrote: >>> Robert Haas writes: On Tue, Dec 6, 2016 at 9:53 PM, Tom Lane wrote: > I think we should give serious consideration to back-patching commit > ecb0d20a9, which changed the default semaphore type to unnamed-POSIX > on Linux. > Urk. That sounds like a scary thing to back-patch. > >>> I don't deny that it's scary, but the alternative seems to be to be >>> rather badly broken on systemd-using distros for years to come. >>> That's pretty scary too. > >> Why can't this be configurable? > > It already is. Note that I said "default". > > As things stand, it's only a configure-time choice, but I've been > thinking that we might be well advised to make it run-time configurable. Sure. A configure-time choice only benefits people who are compiling from source, which as far as production is concerned is almost nobody. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [COMMITTERS] pgsql: Implement table partitioning.
On Wed, Dec 7, 2016 at 3:13 PM, Tom Lane wrote: > Robert Haas writes: >> And of course that'd be because relying on en_US isn't portable. Sigh. > > You can't rely on *any* collations other than C and POSIX. I get it; I just missed that during review, and then sent that message before I even looked at it carefully, so that you would know I was working on it. I think that it's fixed now; at any rate, the buildfarm seems happy enough. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Typmod associated with multi-row VALUES constructs
On Wed, Dec 7, 2016 at 1:03 PM, Tom Lane wrote: > Still, things have been like this since 8.2 when we implemented multi-row > VALUES, and nobody's noticed up to now. Maybe the right answer is to > change the data structure in HEAD and decree that we won't fix it in back > branches. I don't really like that answer though ... > The merit of "won't back-patch" is that at least you don't punish those who are being lazy (in a good sense) but generating values in subsequent lines that conform to the type specification of the first record. We already implicitly accept such behavior elsewhere - though probably with better validation - so keeping it here is defensible. We dislike changing query plans in back branches and this really isn't that different. The concern, especially since this can propagate to a CREATE TABLE AS, is whether there is some kind of fundamental storage risk being introduced that we do not want to have happen no matter how rare. /me feels deja-vu... David J.
Re: [HACKERS] [COMMITTERS] pgsql: Implement table partitioning.
Robert Haas writes: > And of course that'd be because relying on en_US isn't portable. Sigh. You can't rely on *any* collations other than C and POSIX. regards, tom lane
Re: [HACKERS] Back-patch use of unnamed POSIX semaphores for Linux?
Robert Haas writes: > On Tue, Dec 6, 2016 at 11:54 PM, Tom Lane wrote: >> Robert Haas writes: >>> On Tue, Dec 6, 2016 at 9:53 PM, Tom Lane wrote: I think we should give serious consideration to back-patching commit ecb0d20a9, which changed the default semaphore type to unnamed-POSIX on Linux. >>> Urk. That sounds like a scary thing to back-patch. >> I don't deny that it's scary, but the alternative seems to be >> rather badly broken on systemd-using distros for years to come. >> That's pretty scary too. > Why can't this be configurable? It already is. Note that I said "default". As things stand, it's only a configure-time choice, but I've been thinking that we might be well advised to make it run-time configurable. I do not believe that anyone's still using a Linux version wherein POSIX semas wouldn't work, but I am not convinced that the same is true for FreeBSD. And a configure-run-time test is not a pleasant option because it doesn't work for cross-compiles. So really, on platforms where we think POSIX semas might work, it'd be best to try a sem_init() during postmaster start and then fall back to SysV if it doesn't work. But this is all kind of moot if Peter is right that systemd will zap POSIX shmem along with SysV semaphores. I've been trying to reproduce the issue on a Fedora 25 installation, and so far I can't get it to zap anything, so I'm a bit at a loss how to prove things one way or the other. regards, tom lane
Re: [HACKERS] pg_dump vs. TRANSFORMs
All, * Stephen Frost (sfr...@snowman.net) wrote: > While testing pg_dump, I discovered that there seems to be an issue when > it comes to TRANSFORMs. [...] As pointed out by Peter E, this also impacts CASTs. Attached is a patch which addresses both by simply also pulling any functions which are referenced from pg_cast or pg_transform when they have OIDs at or after FirstNormalObjectId. I also modified dumpCast() and dumpTransform() to complain loudly if they're unable to dump out the cast or transform due to not finding the function definition(s) necessary. This'll need to be back-patched to 9.5 for the pg_transform look up and all the way for pg_cast, though I don't expect that to be too difficult. We don't do anything else with FirstNormalObjectId in SQL code in pg_dump, though we obviously use it all over the place in the actual code based on the OIDs returned from the database. Still, does anyone see an issue with using it in a query? Without it, we end grabbing the info for 100+ or so functions in a default install that we don't need, which isn't horrible, but there were concerns raised before about pg_dump performance for very small databases. This also adds in regression tests to pg_dump for casts and transforms and the pg_upgrade testing with the regression database will now actually test the dump/restore of transforms (which it didn't previously because the transforms weren't dumped). Thoughts? Thanks! Stephen From 3eaa8d32170b69e58b5780c0aa92b05d10869389 Mon Sep 17 00:00:00 2001 From: Stephen Frost Date: Wed, 7 Dec 2016 14:59:44 -0500 Subject: [PATCH] Fix dumping of casts and transforms using built-in functions In pg_dump.c dumpCast() and dumpTransform(), we would happily ignore the cast or transform if it happened to use a built-in function because we weren't including the information about built-in functions when querying pg_proc from getFuncs(). 
Modify the query in getFuncs() to also gather information about functions which are used by user-defined casts and transforms (where "user-defined" means "has an OID >= FirstNormalObjectId"). This also adds to the TAP regression tests for 9.6 and master to cover these types of objects.

Back-patch all the way for casts, back to 9.5 for transforms.

Discussion: https://www.postgresql.org/message-id/flat/20160504183952.GE10850%40tamriel.snowman.net
---
 src/bin/pg_dump/pg_dump.c        | 33 +---
 src/bin/pg_dump/t/002_pg_dump.pl | 85 +---
 2 files changed, 80 insertions(+), 38 deletions(-)

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 42873bb..530500c 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -4721,12 +4721,21 @@ getFuncs(Archive *fout, int *numFuncs)
 					  "\n  AND ("
 					  "\n  pronamespace != "
 					  "(SELECT oid FROM pg_namespace "
-					  "WHERE nspname = 'pg_catalog')",
+					  "WHERE nspname = 'pg_catalog')"
+					  "\n  OR EXISTS (SELECT 1 FROM pg_cast"
+					  "\n  WHERE pg_cast.oid >= %u "
+					  "\n  AND p.oid = pg_cast.castfunc)"
+					  "\n  OR EXISTS (SELECT 1 FROM pg_transform"
+					  "\n  WHERE pg_transform.oid >= %u AND "
+					  "\n  (p.oid = pg_transform.trffromsql"
+					  "\n  OR p.oid = pg_transform.trftosql))",
 					  acl_subquery->data,
 					  racl_subquery->data,
 					  initacl_subquery->data,
 					  initracl_subquery->data,
-					  username_subquery);
+					  username_subquery,
+					  FirstNormalObjectId,
+					  FirstNormalObjectId);
 
 	if (dopt->binary_upgrade)
 		appendPQExpBufferStr(query,
 							 "\n  OR EXISTS(SELECT 1 FROM pg_depend WHERE "
@@ -4764,7 +4773,16 @@ getFuncs(Archive *fout, int *numFuncs)
 					  "\n  AND ("
 					  "\n  pronamespace != "
 					  "(SELECT oid FROM pg_namespace "
-					  "WHERE nspname = 'pg_catalog')");
+					  "WHERE nspname = 'pg_catalog')"
+					  "\n  OR EXISTS (SELECT 1 FROM pg_cast"
+					  "\n  WHERE p.oid = pg_cast.castfunc)");
+
+	if (fout->remoteVersion >= 90500)
+		appendPQExpBufferStr(query,
+							 "\n  OR EXISTS (SELECT 1 FROM pg_transform"
+							 "\n  WHERE p.oid = pg_transform.trffromsql"
+							 "\n  OR p.oid = pg_transform.trftosql)");
+
 	if (dopt->binary_upgrade && fout->remoteVersion >= 90100)
 		appendPQExpBufferStr(query,
 							 "\n  OR EXISTS(SELECT 1 FROM pg_depend WHERE "
@@ -10828,7 +10846,8 @@ dumpCast(Archive *fout, CastInfo *cast)
 	{
 		funcInfo = findFuncByOid(cast->castfunc);
 		if (funcInfo == NULL)
-			return;
+			exit_horribly(NULL, "unable to find function definition for OID %u",
+						  cast->castfunc);
 	}
 
 	/*
@@ -10937,13 +10956,15 @@ dumpTransform(Archive *fout, TransformInfo *transform)
 	{
 		fromsqlFuncInfo = findFuncByOid(transform->trffromsql);
 		if (fromsqlFuncInfo == NULL)
-			return;
+			exit_horribly(NULL, "unable to find function definition for OID %u",
+						  transform->trffromsql);
 	}
 
 	if (OidIsValid(transform->trftosql))
 	{
 		tosqlFuncInfo = findFu
Re: [HACKERS] [GENERAL] Select works only when connected from login postgres
Joseph Brenner writes: > I thought I'd reproduced the behavior in an xterm, but I was just > trying again and I don't see it. It does seem that the dumbness of my > dumb terminal is a factor. Evidently. > If I understand the way this works, it could be an even more baffling > behavior if I were using an xterm: with a blank PAGER your output > would disappear only if the select exceeded a certain number of > lines... Yeah, that was exactly the behavior I was seeing before fixing it (the fix is pushed btw). regards, tom lane
Re: [HACKERS] Typmod associated with multi-row VALUES constructs
Kyotaro HORIGUCHI writes: > At Mon, 5 Dec 2016 21:54:08 -0700, "David G. Johnston" > wrote in > >> If every row passes you can retain the typemod. That is arguably the >> perfect solution. The concern is that "scan every row" could be very >> expensive - though in writing this I'm thinking that you'd quickly find a > I found that it already scans the whole list. select_common_type > does that. And it returns the type that is found to be applicable > on all values. Handling typmod there has a small > impact. (Though, I don't assert that we should do this.) The problem is that there is not where the impact is. It would be really easy for transformValuesClause to check for a common typmod while it's transforming the construct, but it has noplace to put the information. The place where the info is needed is in expandRTE and get_rte_attribute_type (which is called by make_var). So basically, parsing of each Var reference to a VALUES subquery would potentially have to scan all rows of the VALUES list to identify the right typmod to give to the Var. The repetitiveness of that is what's bothering me. In HEAD, we could change the RTE data structure so that transformValuesClause could save the typmod information in the RTE, keeping the lookups cheap. But that option is not available to us in released branches. On the other hand, we might be worrying over nothing. I think that in typical use, expandRTE() will be done just once per query, and it could amortize the cost by considering all the rows in one scan. get_rte_attribute_type would be a problem except that I believe it's rarely used for VALUES RTEs --- our code coverage report shows that that switch branch isn't reached at all in the standard regression tests. (The reason for the above statements is that we convert a bare VALUES to SELECT * FROM VALUES, and the * is expanded by expandRTE, not by retail applications of make_var.) > Ok, I think all of the #1 to #5 are the change of behavior in > this criteria. 
What we should resolve here is the inconsistent > table generated by create table as select from (values... Well, that's one symptom, but not the only one; consider

# select x::varchar(4) from (values ('z'::varchar(4)), ('0123456789')) v(x);
     x
------------
 z
 0123456789
(2 rows)

The Var representing v.x is (mis)labeled as already having the right typmod for varchar(4), so transformTypeCast concludes that no run-time work is needed, and you get clearly-wrong answers out. Still, things have been like this since 8.2 when we implemented multi-row VALUES, and nobody's noticed up to now. Maybe the right answer is to change the data structure in HEAD and decree that we won't fix it in back branches. I don't really like that answer though ... regards, tom lane
Re: [HACKERS] patch: function xmltable
Pavel Stehule wrote: > 2016-12-07 18:34 GMT+01:00 Alvaro Herrera : > > Hmm. Now that I see how this works, by having the GetValue "guess" what > > is going on and have a special case for it, I actually don't like it > > very much. It seems way too magical. I think we should do away with > > the "if column is NULL" case in GetValue, and instead inject a column > > during transformTableExpr if columns is NIL. This has implications on > > ExecInitExpr too, which currently checks for an empty column list -- it > > would no longer have to do so. > > I prefer this way against second described. The implementation should be in > table builder routines, not in executor. Well, given the way you have implemented it, I would prefer the original too. But your v23 is not what I meant. Essentially what you do in v23 is to communicate the lack of COLUMNS clause in a different way -- previously it was "ncolumns = 0", now it's "is_auto_col=true". It's still "magic". It's not an improvement. What I want to happen is that there is no magic at all; it's up to transformExpr to make sure that when COLUMNS is empty, one column appears and it must not be a magic column that makes the xml.c code act differently, but rather to xml.c it should appear that this is just a normal column that happens to return the entire row. If I say "COLUMNS foo PATH '/'" I should be able to obtain a similar behavior (except that in the current code, if I ask for "COLUMNS foo XML PATH '/'" I don't get XML at all but rather weird text where all tags have been stripped out, which is very strange. I would expect the tags to be preserved if the output type is XML. Maybe the tag-stripping behavior should occur if the output type is some type of text.) I still have to figure out how to fix the tupledesc thing. What we have now is not good. 
-- Álvaro Herrera  https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] [GENERAL] Select works only when connected from login postgres
Yes, I have a tendency to use emacs sub-shells (and occasionally M-x sql-postgres)-- I thought I'd reproduced the behavior in an xterm, but I was just trying again and I don't see it. It does seem that the dumbness of my dumb terminal is a factor. If I understand the way this works, it could be an even more baffling behavior if I were using an xterm: with a blank PAGER your output would disappear only if the select exceeded a certain number of lines... On Wed, Dec 7, 2016 at 2:31 AM, Daniel Verite wrote: > Tom Lane wrote: > >> BTW, I realized while testing this that there's still one gap in our >> understanding of what went wrong for you: cases like "SELECT 'hello'" >> should not have tried to use the pager, because that would've produced >> less than a screenful of data > > At some point emacs was mentioned as the terminal: > >>> And I guess I did that intentionally, my .bashrc has >>> >>> # I use emacs shells, I got a "pager" already: >>> export PAGER='' > > The M-x shell mode of emacs has a so-called "dumb" terminal > emulation (that's the value of $TERM) where the notion of a "page" > doesn't quite apply. > > For instance, when using emacs 24.3 with my default pager on an > Ubuntu desktop, this is what I get: > > test=> select 1; > WARNING: terminal is not fully functional > - (press RETURN) > ?column? > -- > 1 > (1 row) > > I suspect that psql is unable to determine the screen size > of the "dumb" terminal, and that it's the fault of the terminal > rather than psql. > The warning is displayed by "less" AFAICS. > > There are other psql features like tab-completion that don't work > in this mode because emacs interpret keystrokes first for > itself, in effect mixing emacs functionalities with these of the > application run in the terminal. 
It's awesome sometimes > and irritating at other times depending on what you expect :) > > OTOH it has also a M-x term command/mode that provides a > more sophisticated screen emulation into which paging seems > to work exactly like in a normal terminal and the emacs key bindings > are turned off. > > > Best regards, > -- > Daniel Vérité > PostgreSQL-powered mailer: http://www.manitou-mail.org > Twitter: @DanielVerite
Re: [HACKERS] Back-patch use of unnamed POSIX semaphores for Linux?
On Tue, Dec 6, 2016 at 11:54 PM, Tom Lane wrote: > Robert Haas writes: >> On Tue, Dec 6, 2016 at 9:53 PM, Tom Lane wrote: >>> I think we should give serious consideration to back-patching commit >>> ecb0d20a9, which changed the default semaphore type to unnamed-POSIX >>> on Linux. > >> Urk. That sounds like a scary thing to back-patch. > > I don't deny that it's scary, but the alternative seems to be > rather badly broken on systemd-using distros for years to come. > That's pretty scary too. Why can't this be configurable? >> ... Granted, that might not >> happen, because maybe unnamed POSIX semas are one of those really >> awesome operating system primitives that never have problems on any >> system anywhere ever. But I think it's pretty hard to be certain of >> that. > > You're attacking a straw man. I didn't propose changing our behavior > anywhere but Linux. AFAIK, on that platform unnamed POSIX semaphores > are futexes, which have been a stable feature since 2003 according to > https://en.wikipedia.org/wiki/Futex#History. Anybody who did need > to compile PG for use with a pre-2.6 kernel could override the default, > anyway. Changing the behavior even just on Linux leaves plenty of room for failure, even if the feature itself has been stable. For example, there are Linux machines where POSIX shared memory doesn't work, even though POSIX shared memory is in general a supported feature on Linux and has been for a long time. So, if we were to change from System V shared memory to POSIX shared memory in a minor release, anyone in that situation would break. It's hard to be sure the same thing wouldn't happen in this case. The fact that the feature's stable doesn't prove that it works on every system in every configuration.
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] [COMMITTERS] pgsql: Implement table partitioning.
On Wed, Dec 7, 2016 at 1:30 PM, Robert Haas wrote: > -- partitioned table cannot partiticipate in regular inheritance > CREATE TABLE partitioned2 ( > a int > --- 392,411 > c text, > d text > ) PARTITION BY RANGE (a oid_ops, plusone(b), c collate "default", d > collate "en_US"); > + ERROR: collation "en_US" for encoding "SQL_ASCII" does not exist ... > No idea why yet, but I'll try to figure it out. And of course that'd be because relying on en_US isn't portable. Sigh. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] [COMMITTERS] pgsql: Implement table partitioning.
On Wed, Dec 7, 2016 at 1:20 PM, Robert Haas wrote: > Implement table partitioning. Well, that didn't take long to cause problems. The very first buildfarm machine to report after this commit is longfin, which is unhappy: *** *** 392,419 c text, d text ) PARTITION BY RANGE (a oid_ops, plusone(b), c collate "default", d collate "en_US"); -- check relkind SELECT relkind FROM pg_class WHERE relname = 'partitioned'; relkind - ! P ! (1 row) -- check that range partition key columns are marked NOT NULL SELECT attname, attnotnull FROM pg_attribute WHERE attrelid = 'partitioned'::regclass AND attnum > 0; ! attname | attnotnull ! -+ ! a | t ! b | f ! c | t ! d | t ! (4 rows) ! -- prevent a function referenced in partition key from being dropped DROP FUNCTION plusone(int); - ERROR: cannot drop function plusone(integer) because other objects depend on it - DETAIL: table partitioned depends on function plusone(integer) - HINT: Use DROP ... CASCADE to drop the dependent objects too. -- partitioned table cannot partiticipate in regular inheritance CREATE TABLE partitioned2 ( a int --- 392,411 c text, d text ) PARTITION BY RANGE (a oid_ops, plusone(b), c collate "default", d collate "en_US"); + ERROR: collation "en_US" for encoding "SQL_ASCII" does not exist -- check relkind SELECT relkind FROM pg_class WHERE relname = 'partitioned'; relkind - ! (0 rows) -- check that range partition key columns are marked NOT NULL SELECT attname, attnotnull FROM pg_attribute WHERE attrelid = 'partitioned'::regclass AND attnum > 0; ! ERROR: relation "partitioned" does not exist ! LINE 1: ...me, attnotnull FROM pg_attribute WHERE attrelid = 'partition... ! ^ -- prevent a function referenced in partition key from being dropped DROP FUNCTION plusone(int); -- partitioned table cannot partiticipate in regular inheritance CREATE TABLE partitioned2 ( a int No idea why yet, but I'll try to figure it out. 
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Declarative partitioning - another take
On Wed, Dec 7, 2016 at 11:53 AM, Erik Rijkers wrote: >> My bad. The fix I sent last night for one of the cache flush issues >> wasn't quite right. The attached seems to fix it. > Yes, fixed here too. Thanks. Thanks for the report - that was a good catch. I've committed 0001 - 0006 with that correction and a few other adjustments. There's plenty of room for improvement here, and almost certainly some straight-up bugs too, but I think we're at a point where it will be easier and less error-prone to commit follow-on changes incrementally rather than by continuously re-reviewing a very large patch set for increasingly smaller changes. Some notes:

* We should try to teach the executor never to scan the parent. That's never necessary with this system, and it might add significant overhead. We should also try to get rid of the idea of the parent having storage (i.e. a relfilenode).

* The fact that, in some cases, locking requirements for partitioning are stronger than those for inheritance is not good. We made those decisions for good reasons -- namely, data integrity and not crashing the server -- but it would certainly be good to revisit those things and see whether there's any room for improvement.

* I didn't commit 0007, which updates the documentation for this new feature. That patch removes more lines than it adds, and I suspect what is needed here is an expansion of the documentation rather than a diminishing of it.

* The fact that there's no implementation of row movement should be documented as a limitation. We should also look at removing that limitation.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] patch: function xmltable
Pavel Stehule wrote: > I fixed two issues. > > 2. there was a reversed setting of the NOT NULL flag Ah-hah, that was silly, thanks. > 1. there was no column data when there was no explicit column - fixed Hmm. Now that I see how this works, by having the GetValue "guess" what is going on and have a special case for it, I actually don't like it very much. It seems way too magical. I think we should do away with the "if column is NULL" case in GetValue, and instead inject a column during transformTableExpr if columns is NIL. This has implications on ExecInitExpr too, which currently checks for an empty column list -- it would no longer have to do so. Maybe this means we need an additional method, which would request "the expr that returns the whole row", so that transformExpr can work for XmlTable (which I think would be something like "./") and the future JsonTable stuff (I don't know how that one would work, but I assume it's not necessarily the same thing). -- Álvaro Herrera  https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Declarative partitioning - another take
On 2016-12-07 17:38, Robert Haas wrote: On Wed, Dec 7, 2016 at 11:34 AM, Amit Langote wrote: begin; create schema if not exists s; create table s.t (c text, d text, id serial) partition by list ((ascii(substring(coalesce(c, d, ''), 1, 1)))); create table s.t_part_ascii_065 partition of s.t for values in ( 65 ); it logs as follows: 2016-12-07 17:03:45.787 CET 6125 LOG: server process (PID 11503) was terminated by signal 11: Segmentation fault 2016-12-07 17:03:45.787 CET 6125 DETAIL: Failed process was running: create table s.t_part_ascii_065 partition of s.t for values in ( 65 ); Hm, will look into this in a few hours. My bad. The fix I sent last night for one of the cache flush issues wasn't quite right. The attached seems to fix it. Yes, fixed here too. Thanks. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Declarative partitioning - another take
On Wed, Dec 7, 2016 at 11:34 AM, Amit Langote wrote: >> begin; >> create schema if not exists s; >> create table s.t (c text, d text, id serial) partition by list >> ((ascii(substring(coalesce(c, d, ''), 1, 1)))); >> create table s.t_part_ascii_065 partition of s.t for values in ( 65 ); >> >> it logs as follows: >> >> 2016-12-07 17:03:45.787 CET 6125 LOG: server process (PID 11503) was >> terminated by signal 11: Segmentation fault >> 2016-12-07 17:03:45.787 CET 6125 DETAIL: Failed process was running: create >> table s.t_part_ascii_065 partition of s.t for values in ( 65 ); > > Hm, will look into this in a few hours. My bad. The fix I sent last night for one of the cache flush issues wasn't quite right. The attached seems to fix it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company

diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index ae77d15..a230b20 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -2516,6 +2516,7 @@ RelationClearRelation(Relation relation, bool rebuild)
 		bool		keep_tupdesc;
 		bool		keep_rules;
 		bool		keep_policies;
+		bool		keep_partkey;
 		bool		keep_partdesc;
 
 		/* Build temporary entry, but don't link it into hashtable */
@@ -2547,6 +2548,7 @@ RelationClearRelation(Relation relation, bool rebuild)
 		keep_tupdesc = equalTupleDescs(relation->rd_att, newrel->rd_att);
 		keep_rules = equalRuleLocks(relation->rd_rules, newrel->rd_rules);
 		keep_policies = equalRSDesc(relation->rd_rsdesc, newrel->rd_rsdesc);
+		keep_partkey = (relation->rd_partkey != NULL);
 		keep_partdesc = equalPartitionDescs(relation->rd_partkey,
 											relation->rd_partdesc,
 											newrel->rd_partdesc);
@@ -2604,9 +2606,12 @@ RelationClearRelation(Relation relation, bool rebuild)
 		SWAPFIELD(Oid, rd_toastoid);
 		/* pgstat_info must be preserved */
 		SWAPFIELD(struct PgStat_TableStatus *, pgstat_info);
-		/* partition key must be preserved */
-		SWAPFIELD(PartitionKey, rd_partkey);
-		SWAPFIELD(MemoryContext, rd_partkeycxt);
+		/* partition key must be preserved, if we have one */
+		if (keep_partkey)
+		{
+			SWAPFIELD(PartitionKey, rd_partkey);
+			SWAPFIELD(MemoryContext, rd_partkeycxt);
+		}
 
 		/* preserve old partdesc if no logical change */
 		if (keep_partdesc)
 		{
Re: [HACKERS] Declarative partitioning - another take
Hi Erik, On Thu, Dec 8, 2016 at 1:19 AM, Erik Rijkers wrote: > On 2016-12-07 12:42, Amit Langote wrote: > >> 0001-Catalog-and-DDL-for-partitioned-tables-20.patch >> 0002-psql-and-pg_dump-support-for-partitioned-tables-20.patch >> 0003-Catalog-and-DDL-for-partitions-20.patch >> 0004-psql-and-pg_dump-support-for-partitions-20.patch >> 0005-Teach-a-few-places-to-use-partition-check-quals-20.patch >> 0006-Tuple-routing-for-partitioned-tables-20.patch >> 0007-Update-DDL-Partitioning-chapter-to-reflect-new-devel-20.patch > > Patches apply, compile, check OK. Thanks! > But this yields a segfault: > > begin; > create schema if not exists s; > create table s.t (c text, d text, id serial) partition by list > ((ascii(substring(coalesce(c, d, ''), 1, 1)))); > create table s.t_part_ascii_065 partition of s.t for values in ( 65 ); > > it logs as follows: > > 2016-12-07 17:03:45.787 CET 6125 LOG: server process (PID 11503) was > terminated by signal 11: Segmentation fault > 2016-12-07 17:03:45.787 CET 6125 DETAIL: Failed process was running: create > table s.t_part_ascii_065 partition of s.t for values in ( 65 ); Hm, will look into this in a few hours. Thanks, Amit
Re: [HACKERS] pg_background contrib module proposal
This is simply wonderful! Finally, I can implement my favorite sleepsort in postgres:

create table input as select round(random()*20) x from generate_series(1,5,1);
create table output(place int, value int);
create sequence s start 1;
create table handles as select pg_background_launch('select pg_sleep('||x||'); insert into output values (nextval(''s''),'||x||');') h from input;
select (select * from pg_background_result(h) as (x text) limit 1) from handles;
select * from output;

Works like a charm. It has perfect running time O(1) in n, the number of items to sort, but it cannot sort more than max_worker_processes-1 items. The next step in user concurrency mechanics is future indexes (indexes under construction are taken into plans, and the query is executed when the index is actually constructed) and promised tables (a query waits until there is actually data in the table). I like the idea and implementation and I'm planning to post a review by the end of December. Right now I have just one useful observation: I do not see a comfortable way to check the state of a task (finished/failed) without the possibility of hanging for a long time or catching fire. Best regards, Andrey Borodin.
Re: [HACKERS] Declarative partitioning - another take
On 2016-12-07 12:42, Amit Langote wrote: 0001-Catalog-and-DDL-for-partitioned-tables-20.patch 0002-psql-and-pg_dump-support-for-partitioned-tables-20.patch 0003-Catalog-and-DDL-for-partitions-20.patch 0004-psql-and-pg_dump-support-for-partitions-20.patch 0005-Teach-a-few-places-to-use-partition-check-quals-20.patch 0006-Tuple-routing-for-partitioned-tables-20.patch 0007-Update-DDL-Partitioning-chapter-to-reflect-new-devel-20.patch Patches apply, compile, check OK. But this yields a segfault: begin; create schema if not exists s; create table s.t (c text, d text, id serial) partition by list ((ascii(substring(coalesce(c, d, ''), 1, 1)))); create table s.t_part_ascii_065 partition of s.t for values in ( 65 ); it logs as follows: 2016-12-07 17:03:45.787 CET 6125 LOG: server process (PID 11503) was terminated by signal 11: Segmentation fault 2016-12-07 17:03:45.787 CET 6125 DETAIL: Failed process was running: create table s.t_part_ascii_065 partition of s.t for values in ( 65 ); 2016-12-07 17:03:45.787 CET 6125 LOG: terminating any other active server processes 2016-12-07 17:03:45.791 CET 6125 LOG: all server processes terminated; reinitializing 2016-12-07 17:03:45.999 CET 11655 LOG: database system was interrupted; last known up at 2016-12-07 17:00:38 CET 2016-12-07 17:03:48.040 CET 11655 LOG: database system was not properly shut down; automatic recovery in progress 2016-12-07 17:03:48.156 CET 11655 LOG: redo starts at 0/2897988 2016-12-07 17:03:48.172 CET 11655 LOG: invalid magic number in log segment 00010002, offset 9207808 2016-12-07 17:03:48.172 CET 11655 LOG: redo done at 0/28C72C0 2016-12-07 17:03:48.172 CET 11655 LOG: last completed transaction was at log time 2016-12-07 17:01:29.580562+01 2016-12-07 17:03:49.534 CET 11655 LOG: MultiXact member wraparound protections are now enabled 2016-12-07 17:03:49.622 CET 6125 LOG: database system is ready to accept connections Thanks, Erik Rijkers
Re: [HACKERS] Test "tablespace" fails during `make installcheck` on master-replica setup
Stephen Frost writes: > It would be really nice if we would detect that some other postmaster is > already using a given tablespace directory and to throw an error and > complain rather than starting up thinking everything is fine. In principle, we could have the postmaster run through $PGDATA/pg_tblspc and drop a lockfile into each referenced directory. But the devil is in the details --- in particular, not sure how to get the right thing to happen during a CREATE TABLESPACE. Also, I kinda doubt that this is going to fix anything for the replica-on-same-machine problem. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
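The lockfile idea Tom sketches is the classic pidfile dance: atomically create a marker inside each referenced tablespace directory and refuse to start if another live postmaster already claimed it. A rough Python model of that check (the lockfile name and layout are invented for illustration; this is not what the postmaster actually does, and a real version would also handle stale locks from dead processes):

```python
import os
import tempfile

LOCKNAME = "pg_tblspc.lock"  # hypothetical marker file name

def claim_tablespace(tblspc_dir, pid):
    """Try to claim a tablespace directory for exactly one postmaster.

    O_CREAT|O_EXCL makes creation atomic: a second postmaster pointing
    at the same directory sees the existing lockfile and bails out,
    instead of silently sharing the tablespace with the first one.
    """
    path = os.path.join(tblspc_dir, LOCKNAME)
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        owner = open(path).read().strip()
        raise RuntimeError(
            "tablespace %r already in use by postmaster %s" % (tblspc_dir, owner))
    with os.fdopen(fd, "w") as f:
        f.write(str(pid))

d = tempfile.mkdtemp()
claim_tablespace(d, 1234)      # first postmaster wins
try:
    claim_tablespace(d, 5678)  # second one must fail
except RuntimeError as e:
    print(e)
```

As Tom notes, the hard part in the real system is wiring this into CREATE TABLESPACE, and it would not help the replica-on-same-machine case by itself.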
Re: [HACKERS] [PROPOSAL] Temporal query processing with range types
Am 05.12.2016 um 06:11 schrieb Haribabu Kommi: On Tue, Oct 25, 2016 at 8:44 PM, Peter Moser mailto:pitiz...@gmail.com>> wrote: We decided to follow your recommendation and add the patch to the commitfest. Path is not applying properly to HEAD. Moved to next CF with "waiting on author" status. We updated our patch. We tested it with the latest commit dfe530a09226a9de80f2b4c3d5f667bf51481c49. Regards, Hari Babu Fujitsu Australia Best regards, Anton, Johann, Michael, Peter diff --git src/backend/commands/explain.c src/backend/commands/explain.c index 0a669d9..09406d4 100644 --- src/backend/commands/explain.c +++ src/backend/commands/explain.c @@ -875,6 +875,12 @@ ExplainNode(PlanState *planstate, List *ancestors, case T_SeqScan: pname = sname = "Seq Scan"; break; + case T_TemporalAdjustment: + if(((TemporalAdjustment *) plan)->temporalCl->temporalType == TEMPORAL_TYPE_ALIGNER) +pname = sname = "Adjustment(for ALIGN)"; + else +pname = sname = "Adjustment(for NORMALIZE)"; + break; case T_SampleScan: pname = sname = "Sample Scan"; break; diff --git src/backend/executor/Makefile src/backend/executor/Makefile index 51edd4c..42801d3 100644 --- src/backend/executor/Makefile +++ src/backend/executor/Makefile @@ -25,6 +25,8 @@ OBJS = execAmi.o execCurrent.o execGrouping.o execIndexing.o execJunk.o \ nodeSamplescan.o nodeSeqscan.o nodeSetOp.o nodeSort.o nodeUnique.o \ nodeValuesscan.o nodeCtescan.o nodeWorktablescan.o \ nodeGroup.o nodeSubplan.o nodeSubqueryscan.o nodeTidscan.o \ - nodeForeignscan.o nodeWindowAgg.o tstoreReceiver.o tqueue.o spi.o + nodeForeignscan.o nodeWindowAgg.o tstoreReceiver.o tqueue.o spi.o \ + nodeTemporalAdjustment.o + include $(top_srcdir)/src/backend/common.mk diff --git src/backend/executor/execProcnode.c src/backend/executor/execProcnode.c index 554244f..610d753 100644 --- src/backend/executor/execProcnode.c +++ src/backend/executor/execProcnode.c @@ -114,6 +114,7 @@ #include "executor/nodeValuesscan.h" #include "executor/nodeWindowAgg.h" #include 
"executor/nodeWorktablescan.h" +#include "executor/nodeTemporalAdjustment.h" #include "nodes/nodeFuncs.h" #include "miscadmin.h" @@ -334,6 +335,11 @@ ExecInitNode(Plan *node, EState *estate, int eflags) estate, eflags); break; + case T_TemporalAdjustment: + result = (PlanState *) ExecInitTemporalAdjustment((TemporalAdjustment *) node, + estate, eflags); + break; + default: elog(ERROR, "unrecognized node type: %d", (int) nodeTag(node)); result = NULL; /* keep compiler quiet */ @@ -531,6 +537,10 @@ ExecProcNode(PlanState *node) result = ExecLimit((LimitState *) node); break; + case T_TemporalAdjustmentState: + result = ExecTemporalAdjustment((TemporalAdjustmentState *) node); + break; + default: elog(ERROR, "unrecognized node type: %d", (int) nodeTag(node)); result = NULL; @@ -779,6 +789,10 @@ ExecEndNode(PlanState *node) ExecEndLimit((LimitState *) node); break; + case T_TemporalAdjustmentState: + ExecEndTemporalAdjustment((TemporalAdjustmentState *) node); + break; + default: elog(ERROR, "unrecognized node type: %d", (int) nodeTag(node)); break; @@ -812,3 +826,4 @@ ExecShutdownNode(PlanState *node) return planstate_tree_walker(node, ExecShutdownNode, NULL); } + diff --git src/backend/executor/nodeTemporalAdjustment.c src/backend/executor/nodeTemporalAdjustment.c new file mode 100644 index 000..e45ec03 --- /dev/null +++ src/backend/executor/nodeTemporalAdjustment.c @@ -0,0 +1,528 @@ +#include "postgres.h" +#include "executor/executor.h" +#include "executor/nodeTemporalAdjustment.h" +#include "utils/memutils.h" +#include "access/htup_details.h"/* for heap_getattr */ +#include "utils/lsyscache.h" +#include "nodes/print.h" /* for print_slot */ +#include "utils/datum.h" /* for datumCopy */ + +/* + * #define TEMPORAL_DEBUG + * XXX PEMOSER Maybe we should use execdebug.h stuff here? 
+ */ +#ifdef TEMPORAL_DEBUG +static char* +datumToString(Oid typeinfo, Datum attr) +{ + Oid typoutput; + bool typisvarlena; + getTypeOutputInfo(typeinfo, &typoutput, &typisvarlena); + return OidOutputFunctionCall(typoutput, attr); +} + +#define TPGdebug(...) { printf(__VA_ARGS__); printf("\n"); fflush(stdout); } +#define TPGdebugDatum(attr, typeinfo) TPGdebug("%s = %s %ld\n", #attr, datumToString(typeinfo, attr), attr) +#define TPGdebugSlot(slot) { printf("Printing Slot '%s'\n", #slot); print_slot(slot); fflush(stdout); } + +#else +#define datumToString(typeinfo, attr) +#define TPGdebug(...) +#define TPGdebugDatum(attr, typeinfo) +#define TPGdebugSlot(slot) +#endif + +/* + * isLessThan + * We must check if the sweepline is before a timepoint, or if a timepoint + * is smaller than another. We initialize the function call info during + * ExecInit phase. + */ +static bool +isLessThan(Datum a, Datum b, TemporalAdjustmentState* node
Re: [HACKERS] Declarative partitioning - another take
On Wed, Dec 7, 2016 at 6:42 AM, Amit Langote wrote: > On 2016/12/07 13:38, Robert Haas wrote: >> On Wed, Nov 30, 2016 at 10:56 AM, Amit Langote >> wrote: >>> The latest patch I posted earlier today has this implementation. >> >> I decided to try out these patches today with #define >> CLOBBER_CACHE_ALWAYS 1 in pg_config_manual.h, which found a couple of >> problems: >> >> 1. RelationClearRelation() wasn't preserving the rd_partkey, even >> though there's plenty of code that relies on it not changing while we >> hold a lock on the relation - in particular, transformPartitionBound. > > Oh, I thought an AccessExclusiveLock on the relation would prevent having > to worry about that, but guess I'm wrong. Perhaps, having a lock on a > table does not preclude RelationClearRelation() resetting the table's > relcache. No, it sure doesn't. The lock prevents the table from actually being changed, so a reload will find data equivalent to what it had before, but it doesn't prevent the backend's cache from being flushed. >> 2. partition_bounds_equal() was using the comparator and collation for >> partitioning column 0 to compare the datums for all partitioning >> columns. It's amazing this passed the regression tests. > > Oops, it seems that the regression tests where the above code might be > exercised consisted only of range partition key with columns all of the > same type: create table test(a int, b int) partition by range (a, (a+b)); It doesn't seem like it; you had this: create table part1 partition of range_parted for values from ('a', 1) to ('a', 10); >> I recommend that once you fix this, you run 'make check' with #define >> CLOBBER_CACHE_ALWAYS 1 and look for other hazards. Such mistakes are >> easy to make with this kind of patch. > > With the attached latest version of the patches, I couldn't see any > failures with a CLOBBER_CACHE_ALWAYS build. Cool. 
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
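The partition_bounds_equal() bug Robert describes is easy to reproduce in miniature: if every column is compared with column 0's comparator and collation, two bounds can come out "equal" as soon as the columns need different comparison semantics. A hedged Python sketch of the wrong and right shapes (the comparators are stand-ins, not PostgreSQL's btree support functions):

```python
# Stand-in per-column equality functions: column 0 uses a
# case-insensitive "collation", column 1 a case-sensitive one.
comparators = [
    lambda a, b: a.lower() == b.lower(),  # column 0
    lambda a, b: a == b,                  # column 1
]

def bounds_equal_buggy(b1, b2):
    # The bug: reuse column 0's comparator for every column.
    return all(comparators[0](x, y) for x, y in zip(b1, b2))

def bounds_equal_fixed(b1, b2):
    # The fix: compare each datum with its own column's comparator.
    return all(cmp(x, y) for cmp, (x, y) in zip(comparators, zip(b1, b2)))

b1 = ("a", "B")
b2 = ("A", "b")
print(bounds_equal_buggy(b1, b2))  # True  -- wrongly declared "equal"
print(bounds_equal_fixed(b1, b2))  # False -- column 1 actually differs
```

This also shows why the regression tests missed it: with all partition-key columns of the same type, the buggy and fixed versions behave identically.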
Re: [HACKERS] Test "tablespace" fails during `make installcheck` on master-replica setup
Michael, all, * Michael Paquier (michael.paqu...@gmail.com) wrote: > On Wed, Dec 07, 2016 at 03:42:53PM +0300, Aleksander Alekseev wrote: > > > In the same host, primary and standby will try to use the tablespace > > > in the same path. That's the origin of this breakage. > > > > Sorry, I don't follow. Don't master and replica use different > > directories to store _all_ data? Particularly in my case: > > > > ``` > > $ find path/to/postgresql-install/ -type d -name pg_tblspc > > /home/eax/work/postgrespro/postgresql-install/data-slave/pg_tblspc > > /home/eax/work/postgrespro/postgresql-install/data-master/pg_tblspc > > ``` > > > > Where exactly a collision happens? > > At the location of the tablespaces, pg_tblspc just stores symlinks to > the place data is stored, and both point to the same path, the same path > being streamed to the standby when replaying the create tablespace record. It would be really nice if we would detect that some other postmaster is already using a given tablespace directory and throw an error and complain rather than starting up thinking everything is fine. We do that already for $PGDATA, of course, but not tablespaces. Thanks! Stephen
Re: [HACKERS] Back-patch use of unnamed POSIX semaphores for Linux?
Peter Eisentraut writes: > On 12/6/16 9:53 PM, Tom Lane wrote: >> I think we should give serious consideration to back-patching commit >> ecb0d20a9, which changed the default semaphore type to unnamed-POSIX >> on Linux. > Even with that change, dynamic shared memory is still vulnerable to be > removed. Really? I thought we concluded that it is safe because it is detectably attached to running processes. The trouble with SysV semaphores is that they lack any such attachment, so systemd is left to guess whether they are still in use. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Back-patch use of unnamed POSIX semaphores for Linux?
All, * Peter Eisentraut (peter.eisentr...@2ndquadrant.com) wrote: > On 12/6/16 9:53 PM, Tom Lane wrote: > > I think we should give serious consideration to back-patching commit > > ecb0d20a9, which changed the default semaphore type to unnamed-POSIX > > on Linux. > > Even with that change, dynamic shared memory is still vulnerable to be > removed. So backpatching the semaphore change wouldn't achieve any new > level of safety for users so that we could tell them, "you're good now". I tend to agree with Peter, Magnus, and Craig on this. If you aren't running PG as a system user on a system where systemd feels happy to kill processes and remove shared memory segments and semaphores when you log out, no amount of back-patching of anything is going to make you 'safe'. As noted in the thread referenced, users who are trying to (mistakenly) do this are already having to modify their logind.conf file to not have PG outright killed when they log out, it's on them to make sure systemd doesn't do other stupid things when they log out too if they really want PG to be run as their user account. Thanks! Stephen signature.asc Description: Digital signature
Re: [HACKERS] Separate connection handling from backends
On Wed, Dec 7, 2016 at 12:36 AM, Jim Nasby wrote: > The way I'm picturing it backends would no longer be directly > tied to connections. The code that directly handles connections > would grab an available backend when a statement actually came in > (and certainly it'd need to worry about transactions and session > GUCs). If we're going to consider that, I think we should consider going all the way to the technique used by many (most?) database products, which is to have a configurable number of "engines" that pull work requests from queues. We might have one queue for disk writes, one for disk reads, one for network writes, etc. Traditionally, each engine spins over attempts to read from the queues until it finds a request to process; blocking only if several passes over all queues come up empty. It is often possible to bind each engine to a particular core. Current process-local state would be passed around, attached to queued requests, in a structure associated with the connection. I don't know how that execution model would compare to what we use now in terms of performance, but its popularity makes it hard to ignore as something to consider. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
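Kevin's engine model — a fixed pool of workers pulling requests off queues, with per-connection state carried in the request rather than owned by the process — can be sketched like this (a toy, single-queue Python version; a real system would have several typed queues, the spin-then-block polling he describes, and core pinning):

```python
import queue
import threading

class Request:
    """A work item: connection-local state travels WITH the request,
    so any engine can process work for any connection."""
    def __init__(self, conn_state, payload):
        self.conn_state = conn_state
        self.payload = payload

def engine(q, results):
    # Each engine blocks on the queue; a None request shuts it down.
    while True:
        req = q.get()
        if req is None:
            break
        # "Process" the request using state attached to it, not state
        # owned by this worker -- the key decoupling in the model.
        results.append((req.conn_state["conn_id"], req.payload * 2))

q = queue.Queue()
results = []
engines = [threading.Thread(target=engine, args=(q, results)) for _ in range(4)]
for e in engines:
    e.start()
for i in range(10):
    q.put(Request({"conn_id": i % 3}, i))   # 10 requests from 3 "connections"
for _ in engines:
    q.put(None)
for e in engines:
    e.join()
print(sorted(p for _, p in results))
```

Four engines here service ten requests from three connections; no engine is dedicated to a connection, which is exactly what decoupling backends from connections would buy.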
Re: [HACKERS] Back-patch use of unnamed POSIX semaphores for Linux?
On 12/6/16 9:53 PM, Tom Lane wrote: > I think we should give serious consideration to back-patching commit > ecb0d20a9, which changed the default semaphore type to unnamed-POSIX > on Linux. Even with that change, dynamic shared memory is still vulnerable to be removed. So backpatching the semaphore change wouldn't achieve any new level of safety for users so that we could tell them, "you're good now". -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] varlena beyond 1GB and matrix
This is a design proposal for a matrix data type which can be larger than 1GB. Beyond support for a new data type, it also needs a platform enhancement, because the existing varlena format has a hard limit (1GB). We had a discussion about this topic at the developer unconference at Tokyo/Akihabara, the day before PGconf.ASIA. The design proposal below stands on the overall consensus of that discussion. BACKGROUND The varlena format has either a short (1-byte) or long (4-byte) header. We use the long header for the in-memory structure which is referenced by VARSIZE()/VARDATA(), or the on-disk structure which is larger than 126b but less than the TOAST threshold. Elsewhere, the short format is used if the varlena is less than 126b or externally stored in the toast relation. No kind of varlena representation supports a data size larger than 1GB. On the other hand, some use cases which handle (relatively) big data in the database are interested in variable-length datums larger than 1GB. One example is what I've presented at PGconf.SV. A PL/CUDA function takes two arguments (2D-arrays of int4 instead of matrices), then returns the top-N combinations of the chemical compounds according to their similarity, like so: SELECT knn_gpu_similarity(5, Q.matrix, D.matrix) FROM (SELECT array_matrix(bitmap) matrix FROM finger_print_query WHERE tag LIKE '%abc%') Q, (SELECT array_matrix(bitmap) matrix FROM finger_print_10m) D; array_matrix() is an aggregate function to generate a 2D-array which contains the entire input relation stream. It works fine as long as the data size is less than 1GB. Once it exceeds that boundary, the user has to split the 2D-array by hand, although it is not uncommon for a recent GPU model to have more than 10GB of RAM. It is really problematic if we cannot split the mathematical problem into several portions appropriately. And it is not comfortable for users who cannot use the full capability of the GPU device.
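To make the 1GB ceiling concrete: a square int4 matrix hits the varlena limit somewhere above 16k x 16k elements. A quick back-of-the-envelope check (it ignores array headers and per-dimension bookkeeping, so real limits are slightly lower):

```python
VARLENA_MAX = 1 << 30  # 1GB payload ceiling for a single varlena datum

def fits_in_varlena(rows, cols, elem_size=4):
    """Rough test: does a rows x cols array of fixed-size elements fit
    in one varlena datum? Header overhead is deliberately ignored."""
    return rows * cols * elem_size < VARLENA_MAX

print(fits_in_varlena(10_000, 10_000))  # True:  ~400MB of int4
print(fits_in_varlena(20_000, 20_000))  # False: ~1.6GB, over the limit
```

So a 10M-row fingerprint matrix of modest width already flirts with the boundary, which is exactly the splitting-by-hand pain described above.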
= PROBLEM = Our problem is that the varlena format does not permit a variable-length datum larger than 1GB, even though our "matrix" type wants to move a bunch of data larger than 1GB. So, we need to solve the varlena format restriction prior to the matrix type implementation. At the developer unconference, people discussed three ideas for the problem. The overall consensus was the idea of a special data type which can contain multiple indirect references to other data chunks. Both the main part and the referenced data chunks are less than 1GB each, but the total amount of data we can represent is more than 1GB. For example, even if we have a large matrix around 8GB, its sub-matrices, separated into 9 portions (3x3), are individually less than 1GB. It is problematic when we try to save a matrix which contains indirect references to the sub-matrices, because toast_save_datum() writes out the flat portion just after the varlena header onto a tuple or a toast relation as is. If the main portion of the matrix contains pointers (indirect references), that is obviously problematic. We need an infrastructure to serialize the indirect references prior to saving. BTW, the other, less acknowledged ideas were a 64-bit varlena header and utilization of large objects. The former breaks the current data format, and thus will have unexpected side-effects on the existing code of PostgreSQL core and extensions. The latter requires users to construct a large object beforehand. That makes it impossible to use the interim result of a sub-query, and leads to unnecessary I/O for preparation. == SOLUTION == I'd like to propose a new optional type handler 'typserialize' to serialize an in-memory varlena structure (one that can have indirect references) to the on-disk format. If present, it shall be invoked at the head of toast_insert_or_update(), so that indirect references are transformed into something else which is safe to save.
(My expectation is that the 'typserialize' handler first saves the indirect chunks to the toast relation, then puts toast pointers in their place.) On the other hand, it is uncertain whether we need a 'typdeserialize' handler symmetrically. Because all the functions/operators which support the special data type should know its internal format, it is possible to load the indirectly referenced data chunks on demand. That will be beneficial from the performance perspective if functions/operators touch only a part of the large structure, because the remaining portions need not be loaded into memory. One thing I'm not certain about is whether we can update the datum supplied as an argument in functions/operators. Let me assume the data structure below: struct { int32 varlena_head; Oid element_id; int matrix_type; int blocksz_x; /* horizontal size of each block */ int blocksz_y; /* vertical size of each block */ int gridsz_x; /* horizon
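The overall shape being proposed — an in-memory header holding indirect references to sub-chunks, a serialize hook that flattens those references before storage, and on-demand loading on the read side — can be modeled in a few lines of Python. All names here are invented for illustration; the actual handler would live in C next to toast_insert_or_update():

```python
CHUNK_LIMIT = 8  # stand-in for the ~1GB per-chunk ceiling

def serialize(data, store):
    """'typserialize' stand-in: split oversized data into chunks, save
    each chunk separately (the toast relation, in the proposal), and
    keep only small chunk ids in the flat header that gets stored."""
    chunk_ids = []
    for i in range(0, len(data), CHUNK_LIMIT):
        cid = len(store)
        store[cid] = data[i:i + CHUNK_LIMIT]
        chunk_ids.append(cid)
    return {"nbytes": len(data), "chunks": chunk_ids}

def load_chunk(header, n, store):
    """Read side: fetch one referenced chunk on demand, so touching a
    single sub-matrix does not pull the whole >1GB datum into memory --
    which is why a symmetric 'typdeserialize' may be unnecessary."""
    return store[header["chunks"][n]]

store = {}
header = serialize(b"0123456789abcdefghij", store)
print(header["nbytes"], len(header["chunks"]))  # 20 3
print(load_chunk(header, 1, store))             # b'89abcdef'
```

The header stays small and pointer-free on disk, while the total represented size is bounded only by the number of chunks.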
[HACKERS] Re: [PATCH] PostgresNode.pm enhancements, pg_lsn helper, and some more recovery tests
Hello, Craig. I noticed that these patches have a "Needs review" status and decided to take a look. Surprisingly, I didn't manage to find many defects. * I like this idea in general; * Code is covered by tests and documented well; * All tests pass; * No warnings during compilation and/or test runs; * I didn't find any typos; * Code style seems to be OK; Though are you sure you want to name a class like_this instead of LikeThis? It's quite uncommon in Perl code and usually used only for packages like "strict" and "warnings". However, I prefer not to insist on changes like this since it's a matter of taste. The bottom line --- this code looks good to me. If there is nothing you would like to change at the last moment, I suggest changing the status to "Ready for committer". -- Best regards, Aleksander Alekseev
Re: [HACKERS] Test "tablespace" fails during `make installcheck` on master-replica setup
On Wed, Dec 07, 2016 at 03:42:53PM +0300, Aleksander Alekseev wrote: > > In the same host, primary and standby will try to use the tablespace > > in the same path. That's the origin of this breakage. > > Sorry, I don't follow. Don't master and replica use different > directories to store _all_ data? Particularly in my case: > > ``` > $ find path/to/postgresql-install/ -type d -name pg_tblspc > /home/eax/work/postgrespro/postgresql-install/data-slave/pg_tblspc > /home/eax/work/postgrespro/postgresql-install/data-master/pg_tblspc > ``` > > Where exactly a collision happens? At the location of the tablespaces, pg_tblspc just stores symlinks to the place data is stored, and both point to the same path, the same path being streamed to the standby when replaying the create tablespace record. -- Michael
Re: [HACKERS] Test "tablespace" fails during `make installcheck` on master-replica setup
> In the same host, primary and standby will try to use the tablespace > in the same path. That's the origin of this breakage. Sorry, I don't follow. Don't master and replica use different directories to store _all_ data? Particularly in my case: ``` $ find path/to/postgresql-install/ -type d -name pg_tblspc /home/eax/work/postgrespro/postgresql-install/data-slave/pg_tblspc /home/eax/work/postgrespro/postgresql-install/data-master/pg_tblspc ``` Where exactly a collision happens? On Wed, Dec 07, 2016 at 09:39:20PM +0900, Michael Paquier wrote: > On Wed, Dec 07, 2016 at 03:18:59PM +0300, Aleksander Alekseev wrote: > > I noticed, that `make installcheck` fails on my laptop with following > > errors: > > > > http://afiskon.ru/s/98/6f94ce2cfa_regression.out.txt > > http://afiskon.ru/s/b3/d0da05597e_regression.diffs.txt > > The interesting bit for the archives: > > *** > /home/eax/work/postgrespro/postgresql-src/src/test/regress/expected/tablespace.out > 2016-12-07 13:53:44.000728436 +0300 > --- > /home/eax/work/postgrespro/postgresql-src/src/test/regress/results/tablespace.out > 2016-12-07 13:53:46.150728558 +0300 > *** > *** 66,71 > --- 66,72 > INSERT INTO testschema.test_default_tab VALUES (1); > CREATE INDEX test_index1 on testschema.test_default_tab (id); > CREATE INDEX test_index2 on testschema.test_default_tab (id) TABLESPACE > regress_tblspace; > + ERROR: could not open file "pg_tblspc/16395/PG_10_201612061/16393/16407": > No such file or directory > \d testschema.test_index1 > > > Any ideas what can cause this issue? > > In the same host, primary and standby will try to use the tablespace > in the same path. That's the origin of this breakage. > -- > Michael -- Best regards, Aleksander Alekseev signature.asc Description: PGP signature
Re: [HACKERS] Test "tablespace" fails during `make installcheck` on master-replica setup
On Wed, Dec 07, 2016 at 03:18:59PM +0300, Aleksander Alekseev wrote: > I noticed, that `make installcheck` fails on my laptop with following > errors: > > http://afiskon.ru/s/98/6f94ce2cfa_regression.out.txt > http://afiskon.ru/s/b3/d0da05597e_regression.diffs.txt The interesting bit for the archives: *** /home/eax/work/postgrespro/postgresql-src/src/test/regress/expected/tablespace.out 2016-12-07 13:53:44.000728436 +0300 --- /home/eax/work/postgrespro/postgresql-src/src/test/regress/results/tablespace.out 2016-12-07 13:53:46.150728558 +0300 *** *** 66,71 --- 66,72 INSERT INTO testschema.test_default_tab VALUES (1); CREATE INDEX test_index1 on testschema.test_default_tab (id); CREATE INDEX test_index2 on testschema.test_default_tab (id) TABLESPACE regress_tblspace; + ERROR: could not open file "pg_tblspc/16395/PG_10_201612061/16393/16407": No such file or directory \d testschema.test_index1 > Any ideas what can cause this issue? In the same host, primary and standby will try to use the tablespace in the same path. That's the origin of this breakage. -- Michael signature.asc Description: PGP signature
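The collision Michael describes is easy to see in miniature: each cluster's pg_tblspc entry is just a symlink, and on one host the primary's link and the standby's replayed link resolve to the same real directory. A small Python sketch of the check one could run before reusing a tablespace path (directory names are invented for illustration):

```python
import os
import tempfile

root = tempfile.mkdtemp()
tblspc = os.path.join(root, "the_tablespace")        # actual storage path
master = os.path.join(root, "data-master", "pg_tblspc")
slave = os.path.join(root, "data-slave", "pg_tblspc")
for d in (tblspc, master, slave):
    os.makedirs(d)

# CREATE TABLESPACE on the primary makes a symlink; replaying the WAL
# record on a same-host standby makes a second symlink to the SAME path.
os.symlink(tblspc, os.path.join(master, "16395"))
os.symlink(tblspc, os.path.join(slave, "16395"))

a = os.path.realpath(os.path.join(master, "16395"))
b = os.path.realpath(os.path.join(slave, "16395"))
print(a == b)  # True: both clusters now share one tablespace directory
```

Even though data-master and data-slave are separate, realpath() exposes that both clusters write into one directory — the origin of the "could not open file" failures in the tablespace test.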
[HACKERS] Test "tablespace" fails during `make installcheck` on master-replica setup
Hello. I noticed that `make installcheck` fails on my laptop with the following errors: http://afiskon.ru/s/98/6f94ce2cfa_regression.out.txt http://afiskon.ru/s/b3/d0da05597e_regression.diffs.txt My first idea was to use `git bisect`. It turned out that this issue reproduces on commits from as far back as 2015 as well (older versions don't compile on my laptop). However, it reproduces rarely and with different errors: http://afiskon.ru/s/8e/1ad2c8ed2b_regression.diffs-8c48375.txt Here are the scripts I use to compile and test PostgreSQL: https://github.com/afiskon/pgscripts Exact steps to reproduce: ``` ./quick-build.sh && ./install.sh && make installcheck ``` Completely removing all `configure` flags doesn't make any difference. The issue reproduces only on a master-replica setup, i.e. if instead of install.sh you run ./single-install.sh, all tests pass. I'm using Arch Linux and GCC 6.2.1. Any ideas what can cause this issue? -- Best regards, Aleksander Alekseev
Re: [HACKERS] Push down more full joins in postgres_fdw
On Wed, Dec 7, 2016 at 4:51 PM, Etsuro Fujita wrote: > On 2016/12/05 20:20, Ashutosh Bapat wrote: >> >> On Mon, Nov 28, 2016 at 3:52 PM, Etsuro Fujita >> wrote: >>> >>> On 2016/11/24 21:45, Etsuro Fujita wrote: * I removed make_outerrel_subquery/make_innerrel_subquery from fpinfo and added a new member "is_subquery_rel" to fpinfo, as a flag on whether to deparse the relation as a subquery. > > >> Replacing make_outerrel_subquery/make_innerrel_subquery with >> is_subquery_rel >> seems to be a bad idea. Whether a relation needs to be deparsed as a >> subquery >> or not depends upon the relation which joins it with another relation. >> Take for >> example a case when a join ABC can pull up the conditions on AB, but ABD >> can >> not. In this case we will mark that AB in ABD needs to be deparsed as a >> subquery but will not mark so in ABC. So, if we choose to join ABCD as >> (ABC)D >> we don't need AB to be deparsed as a subquery. If we choose to join ABCD >> as >> (ABD)C, we will need AB to deparsed as a subquery. > > > This is not true; since (1) currently, a relation needs to be deparsed as a > subquery only when the relation is full-joined with any other relation, and > (2) the join ordering for full joins is preserved, we wouldn't have such an > example. (You might think we would have that when extending the patch to > handle PHVs, but the answer would be no, because the patch for PHVs would > determine whether a relation emitting PHVs needs to be deparsed as a > subquery, depending on the relation itself. See the patch for PHVs.) I like > is_subquery_rel more than make_outerrel_subquery/make_innerrel_subquery, > because it reduces the number of members in fpinfo. But the choice would be > arbitrary, so I wouldn't object to going back to > make_outerrel_subquery/make_innerrel_subquery. I think you are right. And even if we were to deparse a relation as subquery, when it shouldn't be, we will only loose a syntactic optimization. So, it shouldn't result in a bug. 
>> 3. Why is foreignrel variable changed to rel? >> -extern void deparseSelectStmtForRel(StringInfo buf, PlannerInfo *root, >> -RelOptInfo *foreignrel, List *tlist, >> -List *remote_conds, List *pathkeys, >> -List **retrieved_attrs, List **params_list); >> +extern void deparseSelectStmtForRel(StringInfo buf, PlannerInfo >> *root, RelOptInfo *rel, >> +List *tlist, List *remote_conds, List *pathkeys, >> +bool is_subquery, List **retrieved_attrs, >> +List **params_list); > > > I changed the name that way to match with the function definition in > deparse.c. > That's something good to do but it's unrelated to this patch. I have repeated this line several times in this thread :) -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Push down more full joins in postgres_fdw
On 2016/12/07 15:39, Ashutosh Bapat wrote: On 2016/11/22 18:28, Ashutosh Bapat wrote: I guess, the reason why you are doing it this way, is SELECT clause for the outermost query gets deparsed before FROM clause. For later we call deparseRangeTblRef(), which builds the tlist. So, while deparsing SELECT clause, we do not have tlist to build from. In that case, I guess, we have to build the tlist in get_subselect_alias_id() if it's not available and stick it in fpinfo. Subsequent calls to get_subselect_alias_id() should find tlist there. Then in deparseRangeTblRef() assert that there's a tlist in fpinfo and use it to build the SELECT clause of subquery. That way, we don't build tlist unless it's needed and also use the same tlist for all searches. Please use tlist_member() to search into the tlist. I wrote: This would probably work, but seems to me a bit complicated. Instead, I'd like to propose that we build the tlist for each relation being deparsed as a subquery in a given join tree, right before deparsing the SELECT clause in deparseSelectStmtForRel, if is_subquery is false and lower_subquery_rels isn't NULL, and store the tlist into the relation's fpinfo. That would allow us to build the tlist only when we need it, and to use tlist_member for the exact comparison. I think it would be much easier to implement that. IIRC, this is inline with my original proposal of creating tlists before deparsing anything. Along-with that I also proposed to bubble this array of tlist up the join hierarchy to avoid recursion [1] point #15, further elaborated in [2] [1] https://www.postgresql.org/message-id/ad449b25-66ee-4c06-568c-0eb2e1bef9f9%40lab.ntt.co.jp [2] https://www.postgresql.org/message-id/CAFjFpRcn7%3DDNOXm-PJ_jVDqAmghKVf6tApQC%2B_RgMZ8tOPExcA%40mail.gmail.com We seem to be moving towards that solution, but as discussed we have left it to committer's judgement. 
My proposal here would be a bit different from what you proposed; right before deparseSelectSql in deparseSelectStmtForRel, build the tlists for relations present in the given jointree that will be deparsed as subqueries, by (1) traversing the given jointree and (2) applying build_tlist_to_deparse to each relation to be deparsed as a subquery and saving the result in that relation's fpinfo. I think that would be implemented easily, and the overhead would be small. Best regards, Etsuro Fujita -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Push down more full joins in postgres_fdw
On 2016/12/05 20:20, Ashutosh Bapat wrote: On Mon, Nov 28, 2016 at 3:52 PM, Etsuro Fujita wrote: On 2016/11/24 21:45, Etsuro Fujita wrote: * I removed make_outerrel_subquery/make_innerrel_subquery from fpinfo and added a new member "is_subquery_rel" to fpinfo, as a flag on whether to deparse the relation as a subquery. Replacing make_outerrel_subquery/make_innerrel_subquery with is_subquery_rel seems to be a bad idea. Whether a relation needs to be deparsed as a subquery or not depends upon the relation which joins it with another relation. Take for example a case when a join ABC can pull up the conditions on AB, but ABD can not. In this case we will mark that AB in ABD needs to be deparsed as a subquery but will not mark so in ABC. So, if we choose to join ABCD as (ABC)D we don't need AB to be deparsed as a subquery. If we choose to join ABCD as (ABD)C, we will need AB to deparsed as a subquery. This is not true; since (1) currently, a relation needs to be deparsed as a subquery only when the relation is full-joined with any other relation, and (2) the join ordering for full joins is preserved, we wouldn't have such an example. (You might think we would have that when extending the patch to handle PHVs, but the answer would be no, because the patch for PHVs would determine whether a relation emitting PHVs needs to be deparsed as a subquery, depending on the relation itself. See the patch for PHVs.) I like is_subquery_rel more than make_outerrel_subquery/make_innerrel_subquery, because it reduces the number of members in fpinfo. But the choice would be arbitrary, so I wouldn't object to going back to make_outerrel_subquery/make_innerrel_subquery. Some more comments 1. The output of the following query +SELECT 1 FROM (SELECT c1 FROM ft4 WHERE c1 between 50 and 60) t1 FULL JOIN (SELECT c1 FROM ft5 WHERE c1 between 50 and 60) t2 ON (TRUE); produces 21 rows all containing a "1". 
That's not quite a good use of expected-output space, given that all the test checks is that empty targetlists are deparsed correctly. You might want to either just test EXPLAIN output or make the "between 50 and 60" condition more stringent to reduce the number of rows output. Will do.

2. Got lost here. I guess all you need to say is "test deparsing FOR UPDATE clause with subqueries.". Anyway, the sentence is very hard to read and needs simplification. +-- d. test deparsing the remote queries for simple foreign table scans in +-- an EPQ subplan of a foreign join in which the foreign tables are full +-- joined and in the remote join query the foreign tables are deparsed as +-- subqueries Will do.

3. Why is the foreignrel variable changed to rel? -extern void deparseSelectStmtForRel(StringInfo buf, PlannerInfo *root, -RelOptInfo *foreignrel, List *tlist, -List *remote_conds, List *pathkeys, -List **retrieved_attrs, List **params_list); +extern void deparseSelectStmtForRel(StringInfo buf, PlannerInfo *root, RelOptInfo *rel, +List *tlist, List *remote_conds, List *pathkeys, +bool is_subquery, List **retrieved_attrs, +List **params_list); I changed the name that way to match the function definition in deparse.c. Thanks for the comments! Best regards, Etsuro Fujita
Re: [HACKERS] [GENERAL] Select works only when connected from login postgres
Tom Lane wrote: > BTW, I realized while testing this that there's still one gap in our > understanding of what went wrong for you: cases like "SELECT 'hello'" > should not have tried to use the pager, because that would've produced > less than a screenful of data At some point emacs was mentioned as the terminal: >> And I guess I did that intentionally, my .bashrc has >> >> # I use emacs shells, I got a "pager" already: >> export PAGER='' The M-x shell mode of emacs has a so-called "dumb" terminal emulation (that's the value of $TERM) where the notion of a "page" doesn't quite apply. For instance, when using emacs 24.3 with my default pager on an Ubuntu desktop, this is what I get:

test=> select 1;
WARNING: terminal is not fully functional
-  (press RETURN)
 ?column?
----------
        1
(1 row)

I suspect that psql is unable to determine the screen size of the "dumb" terminal, and that it's the fault of the terminal rather than psql. The warning is displayed by "less" AFAICS. There are other psql features like tab-completion that don't work in this mode because emacs interprets keystrokes first for itself, in effect mixing emacs functionality with that of the application run in the terminal. It's awesome sometimes and irritating at other times depending on what you expect :) OTOH it also has a M-x term command/mode that provides a more sophisticated screen emulation in which paging seems to work exactly like in a normal terminal and the emacs key bindings are turned off. Best regards, -- Daniel Vérité PostgreSQL-powered mailer: http://www.manitou-mail.org Twitter: @DanielVerite
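The pager decision being discussed can be modeled roughly as follows. This is a hedged sketch of the usual heuristic pager-using clients apply, not psql's actual C logic: page only when stdout is an interactive, capable terminal and the result would not fit on one screen. Under emacs M-x shell, TERM is "dumb" and the screen size is unknown, so every branch argues against paging.

```python
# Illustrative model of a pager heuristic (not psql's real implementation).

def should_use_pager(n_lines, isatty, term, screen_rows):
    if not isatty:                  # output redirected: never page
        return False
    if term in ("", "dumb"):        # terminal can't do cursor addressing
        return False
    if screen_rows <= 0:            # screen size unknown (e.g. dumb term)
        return False
    return n_lines >= screen_rows   # page only past one screenful

print(should_use_pager(3, True, "xterm", 24))    # short result: no pager
print(should_use_pager(100, True, "xterm", 24))  # long result: pager
print(should_use_pager(100, True, "dumb", 0))    # emacs M-x shell: no pager
```

In the reported case the warning comes from "less" itself being handed a dumb terminal, i.e. the pager was invoked even though a model like the above would have skipped it.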
Re: [HACKERS] Quorum commit for multiple synchronous replication.
On Wed, Dec 7, 2016 at 5:17 PM, Masahiko Sawada wrote: > On Wed, Dec 7, 2016 at 4:05 PM, Michael Paquier > wrote: >> Indeed, I haven't thought about that, and that's a no-brainer. That >> would remove the need to allocate and sort each array; what is simply >> needed is to track the number of times a newest value has been found. >> So what this processing would do is updating the write/flush/apply >> values for the first k loops if the new value is *older* than the >> current one, where k is the quorum number, and between k+1 and N the >> value gets updated only if the value compared is newer. No need to >> take the mutex lock for a long time as well. > > Sorry, I could not understand this algorithm. Could you elaborate > on this? Does it take only O(n) time? Nah, please forget that; that was a random useless thought. There is no way to select the k-th element without knowing the hierarchy induced by the others, which is what the partial sort would help with here. -- Michael
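For reference, the partial-sort idea mentioned above can be sketched in a few lines. This is an illustrative Python model (LSNs as plain integers), not the syncrep.c code: with N standbys and quorum k, the LSN that is safe to acknowledge is the k-th newest reported position, since at least k standbys have reached it; a partial sort (heapq.nlargest, O(N log k)) avoids fully sorting the array.

```python
# Model of the quorum computation: find the k-th newest LSN among the
# standbys' reported positions without a full sort.
import heapq

def quorum_synced_lsn(standby_lsns, k):
    """Return the newest LSN already reached by at least k standbys."""
    assert 1 <= k <= len(standby_lsns)
    # nlargest keeps a k-element heap while scanning: O(N log k)
    return heapq.nlargest(k, standby_lsns)[-1]

lsns = [400, 150, 380, 320, 90]      # reported flush positions
print(quorum_synced_lsn(lsns, 2))    # 2 standbys are at >= 380
print(quorum_synced_lsn(lsns, 4))    # 4 standbys are at >= 150
```

This also illustrates why a single O(N) pass that only tracks "the newest value seen so far" cannot work: the answer depends on the ordering among the top k values, which is exactly what the partial sort supplies.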
[HACKERS] Hang in pldebugger after git commit : 98a64d0
Hi All, I have noticed that any SQL query executed immediately after attaching to a particular debugging target (pldebugger) either takes a longer time for execution or it hangs on Windows. I have checked it on the PostgreSQL-9.5 and PostgreSQL-9.6 versions and have found that the issue only exists on the 9.6 version. After doing 'git bisect' I found that the following git commit in the PostgreSQL-9.6 branch is the culprit.

commit 98a64d0bd713cb89e61bef6432befc4b7b5da59e Author: Andres Freund Date: Mon Mar 21 09:56:39 2016 +0100 Introduce WaitEventSet API. Commit ac1d794 ("Make idle backends exit if the postmaster dies.") introduced a regression on, at least, large linux systems. Constantly adding the same postmaster_alive_fds to the OS's internal datastructures for implementing poll/select can cause significant contention, leading to a performance regression of nearly 3x in one example. This can be avoided by using e.g. linux's epoll, which avoids having to add/remove file descriptors to the wait datastructures at a high rate. Unfortunately the current latch interface makes it hard to allocate any persistent per-backend resources.

Following are the steps to reproduce the issue:
1) Download pldebugger from the below url and copy it into the contrib directory: git clone git://git.postgresql.org/git/pldebugger.git
2) Start a new backend session (psql -d postgres)
3) Create a plpgsql function, say func1()
4) Get the oid of func1 and enable debugging of it using the pldbgapi function as shown below: select plpgsql_oid_debug(16487);
5) Execute function func1: select func1(); After executing the above query we will get the message below and the terminal will not respond as it will go into listen mode. NOTICE: PLDBGBREAK:2
6) Start another backend session.
7) Execute the below query: SELECT * FROM pldbg_attach_to_port(2) NOTE: We need to extract the port number from the step 5 NOTICE message after the 'PLDBGBREAK:' string and use it as input here.
8) Execute any SQL query now and the problem starts.
I have tried with the below queries: SELECT 1; OR SELECT pg_backend_pid(); OR SELECT * FROM pldbg_wait_for_breakpoint(1::INTEGER);

Problem Analysis: Although I am very new to Windows, I tried debugging the issue and could find that the backend is not receiving the query executed after "SELECT pldbg_attach_to_port(2)" and is infinitely waiting in "WaitEventSetWaitBlock()" at WaitForMultipleObjects() to read the input command. Below is the backtrace for the same.

postgres.exe!WaitEventSetWaitBlock(WaitEventSet * set, int cur_timeout, WaitEvent * occurred_events, int nevents) Line 1384 + 0x2b bytes C
postgres.exe!WaitEventSetWait(WaitEventSet * set, long timeout, WaitEvent * occurred_events, int nevents) Line 936 + 0x18 bytes C
postgres.exe!secure_read(Port * port, void * ptr, unsigned __int64 len) Line 168 C
postgres.exe!pq_recvbuf() Line 921 + 0x33 bytes C
postgres.exe!pq_getbyte() Line 963 + 0x5 bytes C
postgres.exe!SocketBackend(StringInfoData * inBuf) Line 334 + 0x5 bytes C
postgres.exe!ReadCommand(StringInfoData * inBuf) Line 507 + 0xa bytes C
postgres.exe!PostgresMain(int argc, char * * argv, const char * dbname, const char * username) Line 4004 + 0xd bytes C
postgres.exe!BackendRun(Port * port) Line 4259 C
postgres.exe!SubPostmasterMain(int argc, char * * argv) Line 4750 C
postgres.exe!main(int argc, char * * argv) Line 216 C
postgres.exe!__tmainCRTStartup() Line 555 + 0x19 bytes C
postgres.exe!mainCRTStartup() Line 371 C

With Regards, Ashutosh Sharma EnterpriseDB: http://www.enterprisedb.com
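For readers unfamiliar with the commit being bisected to: the WaitEventSet API exists because poll()/select() re-register the same file descriptors in the kernel on every wait, whereas epoll keeps a persistent registration. This Linux-only Python sketch (not the PostgreSQL latch code) models that idea with a pipe: the fd is added to the epoll set once, then waited on repeatedly.

```python
# Model of a persistent "wait event set": register the fd exactly once,
# then wait cheaply as many times as needed (Linux epoll).
import os
import select

r, w = os.pipe()
waitset = select.epoll()             # persistent wait event set
waitset.register(r, select.EPOLLIN)  # one-time registration

os.write(w, b"x")                    # simulate data arriving on the socket
events = waitset.poll(timeout=1)     # wait without re-registering anything
ready = r in [fd for fd, mask in events]
print(ready)                         # the read end is reported ready

os.read(r, 1)                        # drain; a further poll would time out
waitset.close()
os.close(r)
os.close(w)
```

The Windows implementation of the same API sits on WaitForMultipleObjects(), which is exactly where the backtrace above shows the backend blocked, so the hang is plausibly a Windows-specific interaction with this new wait path rather than a problem in pldebugger itself.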
Re: [HACKERS] Back-patch use of unnamed POSIX semaphores for Linux?
On Wed, Dec 7, 2016 at 7:18 AM, Craig Ringer wrote: > On 7 December 2016 at 10:53, Tom Lane wrote: > > Just saw another report of what's probably systemd killing off Postgres' > > SysV semaphores, as we've discussed previously at, eg, > > https://www.postgresql.org/message-id/flat/57828C31.5060409%40gmail.com > > Since the systemd people are generally impervious to suggestions that > > they might be mistaken, I do not expect this problem to go away. > > > > I think we should give serious consideration to back-patching commit > > ecb0d20a9, which changed the default semaphore type to unnamed-POSIX > > on Linux. We've seen no problems in the buildfarm in the two months > > that that's been in HEAD. If we don't do this, we can expect to > > continue seeing complaints of this sort until pre-v10 PG releases > > fall out of use ... and I don't want to wait that long. > > > > Commit ecb0d20a9 also changed the default for FreeBSD. I'm not convinced > > we should back-patch that part, because (a) unnamed-POSIX semas have > > only been there since FreeBSD 9.0, which isn't that long ago, and (b) > > the argument for switching is "it'll perform better" not "your server > > will fail randomly without this change". > > That's a huge change to make for something that doesn't risk data > corruption, and that won't happen when using postgres with distro > packages. > > As I understand it, it's only a problem if you're running postgres as > a normal user, not a "system user" which systemd defines at > compile-time as a user < 500 or < 1000 depending on the distro's > default login.conf . So it'll only affect people who're not using > their distro's packages and service mechanism, and are instead running > Pg under some other user, likely started manually with pg_ctl. > > I see quite a few people who compile their own Pg rather than using > packages, and some who even fail to use the init system and instead > use custom scripts to launch Pg. 
But pretty much everything I've seen > uses a 'postgres' system-user. Clearly there are exceptions out there > in the wild, but I don't think it makes sense to backpatch this to > satisfy people who are, IMO, doing it wrong in the first place. > > Especially since those people can reconfigure systemd not to do this > with the RemoveIPC and KillUserProcesses directives if they insist on > using a non-system user. > > If they defined a systemd service to start postgres they'd be fine... > and isn't it pretty much basic sysadmin 101 to use your init system to > start services? > > Don't get me wrong, I think systemd's behaviour is pretty stupid. > Mostly in terms of its magic definition of a "system user", which > shouldn't be something determined by a uid threshold at compile time. > But I don't think we should double down on it by backpatching a big > change that hasn't even seen in-the-wild loads from real world use > yet, just to make it easier on people who're doing things backwards in > the first place. > +1 (or several). I don't think we should backpatch something that carries risk for people who do things "the right way" to help those that don't. Even if the behavior is stupid. > > If it were possible to detect that systemd was about to clobber us and > log something informative, _that_ would be very nice to backpatch. I > don't see how that's possible though. Surely there must be some way to ask systemd about its configuration? And if we have that, then we could possibly teach pg_ctl about that and have it throw a big warning? -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
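For the record, the workaround Craig refers to is a two-line logind setting. A sketch of the relevant fragment (paths and defaults may differ per distro; this is illustrative, not an endorsement of running Postgres as a non-system user):

```ini
# /etc/systemd/logind.conf -- keep systemd-logind from reaping SysV
# semaphores (and other IPC) owned by a non-system user when that user's
# last session ends. Relevant when running Postgres under a regular user
# without a systemd unit file.
[Login]
RemoveIPC=no
KillUserProcesses=no
```

After editing, systemd-logind has to be restarted (e.g. `systemctl restart systemd-logind`) for the change to take effect.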
Re: [HACKERS] Typmod associated with multi-row VALUES constructs
Hello, At Mon, 5 Dec 2016 21:54:08 -0700, "David G. Johnston" wrote in > feel free to s/typemod/typmod/ ... my fingers don't want to drop the "e" > > On Mon, Dec 5, 2016 at 9:17 PM, Kyotaro HORIGUCHI < > horiguchi.kyot...@lab.ntt.co.jp> wrote: > > > Hello, > > > > At Mon, 5 Dec 2016 18:59:42 -0700, "David G. Johnston" < > > david.g.johns...@gmail.com> wrote in > zrvf9qia0wwt_qzmauy...@mail.gmail.com> > > > On Mon, Dec 5, 2016 at 6:36 PM, Kyotaro HORIGUCHI < > > > horiguchi.kyot...@lab.ntt.co.jp> wrote: > > > > > > > (It's not a typo but my poor wording... Sorry.) > > Mmm. I think the typemod of a row should not be applied onto > > other rows, and the same can be said for types. But the type of the > > first row is applied to all of the rest of the rows, though... Does it > > make sense? > > > > Yes. All rows in a given relation must end up with the exact same type > and it isn't friendly to fail when an implicit conversion from unknown is > possible. I'm not sure how clearly users are conscious of the type unknown; my feeling is opposed to implicitly applying the type of the first value in VALUES to all other values of the type "unknown". On the other hand, such behavior doesn't harm anything when we don't explicitly type the values in VALUES. (But it would be a new feature as you wrote below.) Anyway, the real behavior of the current parser for VALUES is scanning the whole list and extracting a common coercible type, then applying that type to the whole list. (transformValuesClause) > > But what I wanted to say was not that but something like the > > following. > > > > select pg_typeof('12345'::varchar(1)); > > pg_typeof > > --- > > character varying > > > > A value seemingly doesn't have a typmod. So it seems reasonable > > that VALUES cannot handle typmod. It seems reasonable regardless > > of how it is actually implemented. > > > > This is an artifact of functions - the typemod associated with the value > '12345' is lost when that value is passed into the function pg_typeof.
It seems to me that is what is occurring in VALUES. > Thus it is impossible to write a SQL query that reports the typemod of a > literal or column reference. Nonetheless it is there in reality. Just see > the original CREATE TABLE AS example for proof. The created table's column > is shown (using direct catalog queries) to contain typemod value of 20 - > which it could only have gotten from the first values rows which contained > a casted literal. OK, I found that I had read the original problem wrongly.

=# create table t2 (a) as select * from (values ('abcdee'::varchar(3)), ('defghij'::varchar(5))) x;
postgres=# \d t2;
          Table "public.t2"
 Column |         Type         | Modifiers
--------+----------------------+-----------
 a      | character varying(3) |
postgres=# select * from t2;
   a
-------
 abc
 defgh
(2 rows)

The problem to be resolved here seems to be that CREATE TABLE AS creates a broken, in a sense, table. Not a coercion of VALUES. #6 - raise an error if a subquery's result doesn't fit the newly created table. Or, create the new table so that all the values given from the subquery fit into it. (I haven't understood how the source typmod affects the new table and I don't see how to do that, though) > > But I understand what we want to solve here is how to change > > VALUES's behavior to perfectly coerce the row values to the types > > explicitly applied in the first row. Right? > > > > Actually, no, since it is not possible to coerce "perfectly". Since any > attempt at doing so could fail it is only possible to scan every row and > compare its typemod to the first row's typemod. Ideally (but maybe not in > reality) as soon as a discrepancy is found stop and discard the typemod. > If every row passes you can retain the typemod. That is arguably the > perfect solution. The concern is that "scan every row" could be very > expensive - though in writing this I'm thinking that you'd quickly find a I found that it already scans the whole list. select_common_type does that. And it returns the type that is found to be applicable to all values.
Handling typmod there would have a small impact. (Though I don't assert that we should do this.) Even though the value itself doesn't convey a typmod, VALUES can take expressions. (It is obvious because the first value was already a coercing expression.) They are scanned already. > non-match even in a large dataset - and so a less perfect but still valid > solution is to simply discard the typemod if there is more than one row. > My thought was that if you are going to discard the typemod in the n > 1 case > for consistency you should discard the typemod in the n = 1 case as well. > > That is, in a nutshell, options 1, 2, and 3 in order. > > The "fault" in #1 that #4 attempted to fix was that VALUES are often hand > entered and so the inexperienced would like to only type in a cast on the > first row and have it apply to all subsequent rows
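The "perfect" option being weighed above can be modeled in a few lines. This is a hedged Python sketch of the idea, not the actual transformValuesClause code: while scanning all VALUES rows to pick the common type, also keep each column's typmod only if every row agrees on it, and on the first mismatch fall back to -1, Postgres' "no typmod" marker.

```python
# Model of per-column typmod agreement across VALUES rows.

def common_typmod(row_typmods):
    """row_typmods: one typmod per row for a single VALUES column.
    Returns the shared typmod, or -1 on the first discrepancy."""
    result = row_typmods[0]
    for tm in row_typmods[1:]:
        if tm != result:
            return -1        # discrepancy found: discard the typmod
    return result

# varchar(3) in every row: the typmod survives
print(common_typmod([3, 3, 3]))      # -> 3
# varchar(3) then varchar(5): mixed, so no typmod
print(common_typmod([3, 5]))         # -> -1
# a single row trivially agrees with itself
print(common_typmod([20]))           # -> 20
```

As noted above, since select_common_type already visits every row, folding a check like this into the same pass would add little cost, and the early return on the first mismatch bounds the extra work in the common mixed-typmod case.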
Re: [HACKERS] Quorum commit for multiple synchronous replication.
On Wed, Dec 7, 2016 at 4:05 PM, Michael Paquier wrote: > On Wed, Dec 7, 2016 at 2:49 PM, Kyotaro HORIGUCHI > wrote: >> Aside from measurement of the two sorting methods, I'd like to >> point out that quorum commit basically doesn't need >> sorting. Counting conforming standbys while scanning the >> walsender(receiver) LSN list, comparing with the target LSN, is >> O(n). A small refactoring of SyncRepGetOldestSyncRecPtr would be >> enough to do that. What does the target LSN mean here? > Indeed, I haven't thought about that, and that's a no-brainer. That > would remove the need to allocate and sort each array; what is simply > needed is to track the number of times a newest value has been found. > So what this processing would do is updating the write/flush/apply > values for the first k loops if the new value is *older* than the > current one, where k is the quorum number, and between k+1 and N the > value gets updated only if the value compared is newer. No need to > take the mutex lock for a long time as well. Sorry, I could not understand this algorithm. Could you elaborate on this? Does it take only O(n) time? > By the way, the patch now > conflicts on HEAD, it needs a refresh. Thanks, I'll post the latest patch. Regards, -- Masahiko Sawada NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center