Re: [HACKERS] RFC: replace pg_stat_activity.waiting with something more descriptive

2015-08-04 Thread Amit Kapila
On Tue, Aug 4, 2015 at 5:45 PM, Robert Haas  wrote:
>
> On Tue, Jul 28, 2015 at 3:28 PM, Heikki Linnakangas wrote:
> > * The patch requires that the LWLOCK_INDIVIDUAL_NAMES array is kept in sync
> > with the list of individual locks in lwlock.h. Sooner or later someone will
> > add an LWLock and forget to update the names-array. That needs to be made
> > less error-prone, so that the names are maintained in the same place as the
> > #defines. Perhaps something like rmgrlist.h.
>
> This is a good idea, but it's not easy to do in the style of
> rmgrlist.h, because I don't believe there's any way to define a macro
> that expands to a preprocessor directive.  Attached is a patch that
> instead generates the list of macros from a text file, and also
> generates an array inside lwlock.c with the lock names that gets used
> by the Trace_lwlocks stuff where applicable.
>
> Any objections to this solution to the problem?  If not, I'd like to
> go ahead and push this much.  I can't test the Windows changes
> locally, though, so it would be helpful if someone could check that
> out.
>
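
For context, the rmgrlist.h technique mentioned above is an "X macro" list: a
single list of entries that each consumer expands with its own definition of
the macro. A minimal sketch with hypothetical names (not the committed
solution, just the pattern):

#define LWLOCK_LIST \
	PG_LWLOCK(ShmemIndexLock) \
	PG_LWLOCK(OidGenLock) \
	PG_LWLOCK(XidGenLock)

/* consumer 1: an enum of the individual lock IDs */
#define PG_LWLOCK(name) name,
enum IndividualLWLockId { LWLOCK_LIST NUM_INDIVIDUAL_LWLOCKS };
#undef PG_LWLOCK

/* consumer 2: a matching names array, e.g. for the Trace_lwlocks output */
#define PG_LWLOCK(name) #name,
static const char *const IndividualLWLockNames[] = { LWLOCK_LIST };
#undef PG_LWLOCK

This keeps the enum and the names array in sync automatically, but it cannot
produce one "#define FooLock ..." line per entry, because a macro expansion
cannot itself contain preprocessor directives; hence the generated-file
approach described above.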

Okay, I have checked it on Windows and found the issues below, which I
have fixed in the patch attached to this mail.

1. Got the following error while running mkvcbuild.pl:

Use of uninitialized value in numeric lt (<) at Solution.pm line 103.
Could not open src/backend/storage/lmgr/lwlocknames.h at Mkvcbuild.pm line 674.

+ if (IsNewer(
+     'src/backend/storage/lmgr/lwlocknames.txt',
+     'src/include/storage/lwlocknames.h'))

I think the usage of old and new filenames in IsNewer is reversed here.
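
If IsNewer() follows the convention of the other call sites in Mkvcbuild.pm,
taking the file to regenerate first and its prerequisite second, the call
presumably ought to read:

+ if (IsNewer(
+     'src/include/storage/lwlocknames.h',
+     'src/backend/storage/lmgr/lwlocknames.txt'))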

2.
+ print "Generating lwlocknames.c and fmgroids.h...\n";

Here it should be lwlocknames.h instead of fmgroids.h

3.
+ if (IsNewer(
+     'src/include/storage/lwlocknames.h',
+     'src/backend/storage/lmgr/fmgroids.h'))

Here, it seems lwlocknames.h needs to be compared instead of fmgroids.h.

4. Got the following error while running mkvcbuild.pl:
Generating lwlocknames.c and fmgroids.h...
/* autogenerated from src/backend/storage/lmgr/lwlocknames.txt, do not edit */
/* there is deliberately not an #ifndef LWLOCKNAMES_H here */
rename: lwlocknames.h.tmp3180: Permission denied at generate-lwlocknames.pl line 59, <$lwlocknames> line 47.

In generate-lwlocknames.pl, the line below is causing the problem:

rename($htmp, 'lwlocknames.h') || die "rename: $htmp: $!";

The reason is that the temp files are not closed before being renamed.
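
On Windows, rename() fails with "Permission denied" while a handle on the
file is still open, so the fix is presumably to close the handles explicitly
before renaming, along these lines (assuming $h is the handle the script
opened on $htmp):

close $h;
rename($htmp, 'lwlocknames.h') || die "rename: $htmp: $!";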


Some other general observations

5.
On running perl mkvcbuild.pl on Windows, the below message is generated.

Generating lwlocknames.c and lwlocknames.h...
/* autogenerated from src/backend/storage/lmgr/lwlocknames.txt, do not edit */
/* there is deliberately not an #ifndef LWLOCKNAMES_H here */

The above message gets displayed, which looks slightly odd: showing it
as C comments makes it look out of place, and other similar .pl files
don't generate such messages.  Having said that, I think it conveys useful
information, so it might be okay to retain it.

6.
+
+maintainer-clean: clean
+ rm -f lwlocknames.h

Here, don't we need to clean lwlocknames.c?
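
Presumably the rule should remove both generated files, i.e. something like:

maintainer-clean: clean
	rm -f lwlocknames.h lwlocknames.c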



With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


lwlocknames-v2.patch
Description: Binary data



Re: [HACKERS] tablecmds.c and lock hierarchy

2015-08-04 Thread Michael Paquier
On Wed, Aug 5, 2015 at 11:47 AM, Noah Misch  wrote:
> On Tue, Aug 04, 2015 at 07:35:43AM +0100, Simon Riggs wrote:
>> On 4 August 2015 at 05:56, Michael Paquier  wrote:
>> > The thing is that, as mentioned by Alvaro and Andres on this thread,
>> > we have no guarantee that the different relation locks compared have a
>> > monotone hierarchy, and we may end up taking a lock that does not
>> > behave as we would like. We are lucky enough for now that ALTER TABLE
>> > only uses ShareUpdateExclusiveLock, ShareRowExclusiveLock and
>> > AccessExclusiveLock, which do form a hierarchy, so this is not a
>> > problem yet. However, it may become a problem if we add more lock
>> > modes in the future that are used by ALTER TABLE.
>> >
>>
>> Please provide the link to the discussion of this. I don't see a problem
>> here right now that can't be solved by saying
>>
>> Assert(locklevel==ShareUpdateExclusiveLock ||
>> locklevel>ShareRowExclusiveLock);
>
> Agreed; that addresses the foreseeable future of this threat.

Some sub-commands are using ShareRowExclusiveLock... So this one is better :)
Assert(locklevel==ShareUpdateExclusiveLock || locklevel >= ShareRowExclusiveLock);
Or we simply list all the locks allowed individually... But that's a
minor point.
-- 
Michael




Re: [HACKERS] tablecmds.c and lock hierarchy

2015-08-04 Thread Noah Misch
On Tue, Aug 04, 2015 at 07:35:43AM +0100, Simon Riggs wrote:
> On 4 August 2015 at 05:56, Michael Paquier  wrote:
> > The thing is that, as mentioned by Alvaro and Andres on this thread,
> > we have no guarantee that the different relation locks compared have a
> > monotone hierarchy, and we may end up taking a lock that does not
> > behave as we would like. We are lucky enough for now that ALTER TABLE
> > only uses ShareUpdateExclusiveLock, ShareRowExclusiveLock and
> > AccessExclusiveLock, which do form a hierarchy, so this is not a
> > problem yet. However, it may become a problem if we add more lock
> > modes in the future that are used by ALTER TABLE.
> >
> 
> Please provide the link to the discussion of this. I don't see a problem
> here right now that can't be solved by saying
> 
> Assert(locklevel==ShareUpdateExclusiveLock ||
> locklevel>ShareRowExclusiveLock);

Agreed; that addresses the foreseeable future of this threat.




Re: [HACKERS] Support for N synchronous standby servers - take 2

2015-08-04 Thread Michael Paquier
On Tue, Aug 4, 2015 at 8:37 PM, Beena Emerson  wrote:
> Robert Haas wrote:
>>Maybe shoehorning this into the GUC mechanism is the wrong thing, and
>>what we really need is a new config file for this.  The information
>>we're proposing to store seems complex enough to justify that.
>>
>
> I think the consensus is that JSON is better.

I guess so as well. Thanks for summarizing the whole thread in a single post.

> And using a new file with multi line support would be good.

This file just contains a JSON blob, hence we just need to fetch its
content entirely and then let the server parse it using the existing
facilities.

> Name of the file: how about pg_syncinfo.conf?
> Backward compatibility: synchronous_standby_names will be supported.
> synchronous_standby_names='pg_syncinfo' indicates use of new file.

This fits with the fact that parsing is done at SIGHUP, so that
sounds fine to me. We may still run into an application_name that
actually uses 'pg_syncinfo', but well, that's unlikely to happen...

> JSON format:
> It would contain 2 main keys: "sync_info" and "groups"
> The "sync_info" would consist of "quorum"/"priority" with the count and
> "nodes"/"group" with the group name or node list.
> The optional "groups" key would list out all the "group" mentioned within
> "sync_info" along with the node list.
>
> [...]
>
> Thoughts?

Yes, I think that's the idea. I would wait a couple of days though, to
give people time to offer their opinions and objections regarding this
approach.
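
For concreteness, a purely illustrative sketch of what such a pg_syncinfo
file could contain, using the keys Beena lists (node and group names are
made up, and the exact format is still under discussion):

{
    "sync_info": {
        "quorum": 2,
        "nodes": ["node1", "group_a"]
    },
    "groups": {
        "group_a": ["node2", "node3"]
    }
}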
-- 
Michael




Re: [HACKERS] Doubt about AccessExclusiveLock in ALTER TABLE .. SET ( .. );

2015-08-04 Thread Michael Paquier
On Wed, Aug 5, 2015 at 2:15 AM, Alvaro Herrera  wrote:
> Andres Freund wrote:
>> On 2015-08-03 14:15:27 +0900, Michael Paquier wrote:
>
>> > As long as this only applies on master, this may be fine... We could
>> > basically pass a LOCKMASK to the multiple layers of tablecmds.c
>> > instead of LOCKMODE to track all the locks that need to be taken, and
>> > all the relations open during operations.
>>
>> This sounds far too complicated to me. Just LockRelationOid() the
>> relation with the appropriate level everytime you pass through the
>> function?
>
> That opens up for lock escalation and deadlocks, doesn't it?  You are
> probably thinking that it's okay to ignore those but I don't necessarily
> agree with that.

Yes, I think so, as long as we try to take multiple locks... And then
we are back to the lock hierarchy problem, because in some cases there
is no way to define which lock is strictly stronger than the other.
Honestly, I think that we should go with Fabrizio's version doing the
lock comparison, with an assert safeguard in AlterTableGetLockLevel.
The logic would break if ALTER TABLE began to use a lock level
that cannot be ordered relative to the others, but that's not the
case now.
-- 
Michael




Re: [HACKERS] Transactions involving multiple postgres foreign servers

2015-08-04 Thread Amit Langote
On 2015-08-05 AM 06:11, Robert Haas wrote:
> On Mon, Aug 3, 2015 at 8:19 PM, Amit Langote
>  wrote:
>> On 2015-08-03 PM 09:24, Ashutosh Bapat wrote:
>>> For postgres_fdw it's a boolean server-level option 'twophase_compliant'
>>> (suggestions for name welcome).
>>>
>>
>> How about just 'twophase'?
> 
> How about two_phase_commit?
> 

Much cleaner, +1

Thanks,
Amit





Re: [HACKERS] WIP: Make timestamptz_out less slow.

2015-08-04 Thread David Rowley
On 29 July 2015 at 03:25, Andres Freund  wrote:

> On 2015-07-29 03:10:41 +1200, David Rowley wrote:
> > timestamp_out() = 2015-07-29 02:24:33.34 in 3.506000
> > timestamp_out_old() = 2015-07-29 02:24:33.034 in 64.518000
> > timestamp_out_af() = 2015-07-29 02:24:33.034 in 2.981000
> >
> > timestamp_out_old is master's version, timestamp_out_af() is yours, and
> > timestamp_out() is mine. Times are in seconds to perform 100 million calls.
>
> That looks good.
>
> > So it appears your version is a bit faster than mine, but we're both about
> > 20 times faster than the current one.
> > Also mine needs fixing up, as the fractional part is not padded the same as
> > yours, but I doubt that'll affect the performance by much.
>
> Worthwhile to finish that bit and try ;)
>
> > My view: It's probably not worth going quite as far as you've gone for a
> > handful of nanoseconds per call, but perhaps something along the lines of
> > mine can be fixed up.
>
> Yes, I agree that yours is probably going to be fast enough.
>
> > Have you thought about what to do when HAVE_INT64_TIMESTAMP is not
> > defined?
>
> I don't think it's actually important. The only difference vs float
> timestamps is that in the latter case we set fsecs to zero BC.
>
> Unless we want to slow down the common case it seems not unlikely that
> we're going to end up with a separate slow path anyway. E.g. neither
> your version nor mine handles 5 digit years (which is why I fell back to
> the slow path in that case in my patch).
>

It occurred to me that handling the 5 digit year is quite a simple change
to my code:

static char *
pg_uint2str_padding(char *str, unsigned int value, unsigned int padding)
{
	char	   *end = &str[padding];
	unsigned int num = value;

	Assert(padding > 0);

	*end = '\0';

	/* write the digits backwards into the padded space */
	while (padding--)
	{
		str[padding] = num % 10 + '0';
		num /= 10;
	}

	/*
	 * If the value was too big for the specified padding then rebuild the
	 * whole number again without padding.  Conveniently pg_uint2str() does
	 * exactly this.
	 */
	if (num > 0)
		return pg_uint2str(str, value);

	return end;
}

We simply need to check whether any 'num' is left over after consuming
the given space. If there is, then just use pg_uint2str().
This keeps things very fast for the likely most common case, where the year
is 4 digits long.

I've not thought about negative years. The whole function should perhaps
take signed ints instead of unsigned.

Regards

David Rowley

--
 David Rowley   http://www.2ndQuadrant.com/

 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] tablecmds.c and lock hierarchy

2015-08-04 Thread Michael Paquier
On Wed, Aug 5, 2015 at 3:05 AM, Robert Haas  wrote:
> On Tue, Aug 4, 2015 at 2:41 AM, Michael Paquier
>  wrote:
>> Yep, true as things stand now. But this would get broken if we add a
>> new lock level between ShareRowExclusiveLock and AccessExclusiveLock
>> that does not respect the current monotone hierarchy between those.
>
> But we're probably not going to do that, so it doesn't matter; and if
> we do do it, we can worry about it then.  I don't think this is worth
> getting concerned about now.

OK. Then let me suggest the attached Assert safeguard. It ensures
that all the lock modes used follow a monotone hierarchy, per the
definitions in the conflict table. That looks like cheap insurance...

In any case, this also means that we should move ahead with the
current logic presented by Fabrizio on the other thread.
-- 
Michael
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 970abd4..3aff387 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -3057,8 +3057,13 @@ AlterTableGetLockLevel(List *cmds)
 		}
 
 		/*
-		 * Take the greatest lockmode from any subcommand
+		 * Take the greatest lockmode from any subcommand following a monotone
+		 * hierarchy.
 		 */
+		Assert(cmd_lockmode == ShareUpdateExclusiveLock ||
+			   cmd_lockmode == ShareRowExclusiveLock ||
+			   cmd_lockmode == ExclusiveLock ||
+			   cmd_lockmode == AccessExclusiveLock);
 		if (cmd_lockmode > lockmode)
 			lockmode = cmd_lockmode;
 	}



Re: [HACKERS] RFC: replace pg_stat_activity.waiting with something more descriptive

2015-08-04 Thread Josh Berkus
On 06/25/2015 07:50 AM, Tom Lane wrote:
> To do that, we'd have to change the semantics of the 'waiting' column so
> that it becomes true for non-heavyweight-lock waits.  I'm not sure whether
> that's a good idea or not; I'm afraid there may be client-side code that
> expects 'waiting' to indicate that there's a corresponding row in
> pg_locks.  If we're willing to do that, then I'd be okay with
> allowing wait_status to be defined as "last thing waited for"; but the
> two points aren't separable.

Speaking as someone who writes a lot of monitoring and alerting code,
changing the meaning of the waiting column is OK as long as there's
still a boolean column named "waiting" and it means "query blocked" in
some way.

Users are used to pg_stat_activity.waiting failing to join against
pg_locks ... for one thing, there are timing issues there.  So pretty much
everything I've seen uses outer joins anyway.
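
For example, a monitoring query along these lines (a sketch using the current
catalog columns) degrades gracefully when no matching lock row exists:

SELECT a.pid, a.waiting, a.query, l.locktype, l.mode
FROM pg_stat_activity a
LEFT JOIN pg_locks l ON l.pid = a.pid AND NOT l.granted;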

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] patch: prevent user from setting wal_buffers over 2GB bytes

2015-08-04 Thread Takashi Horikawa
> ... Josh's approach of restricting the buffer size seems
> a lot more robust.
I understand that the approach of restricting the buffer size
is much more robust and is suitable in this case.

I, however, think that the change from
'page = &XLogCtl->pages[firstIdx * XLOG_BLCKSZ];'
to
'page = &XLogCtl->pages[firstIdx * (Size) XLOG_BLCKSZ];'
does no harm even when restricting the WAL buffer size.

It is in harmony with the usage of 'XLogCtl->pages' found in, for example,
'cachedPos = XLogCtl->pages + idx * (Size) XLOG_BLCKSZ;'
in GetXLogBuffer(XLogRecPtr ptr), and
'NewPage = (XLogPageHeader) (XLogCtl->pages + nextidx * (Size) XLOG_BLCKSZ);'
in AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic), etc.

The only exception is
'page = &XLogCtl->pages[firstIdx * XLOG_BLCKSZ];'
in StartupXLOG(void).
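
The reason the cast matters: firstIdx and XLOG_BLCKSZ are both int, so the
product is computed in 32-bit arithmetic and overflows before it is widened
for the array indexing. A standalone illustration on a 64-bit build (assuming
an XLOG_BLCKSZ of 8192; this is not the server code itself):

#include <stdio.h>
#include <stddef.h>

#define XLOG_BLCKSZ 8192

int
main(void)
{
	int			firstIdx = 300000;	/* an index past the 2GB boundary */

	/* int multiply: overflows (undefined behaviour, typically wraps) */
	long long	bad = firstIdx * XLOG_BLCKSZ;

	/* widened to size_t before multiplying: correct */
	long long	good = (long long) (firstIdx * (size_t) XLOG_BLCKSZ);

	printf("bad = %lld, good = %lld\n", bad, good);
	return 0;
}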
--
Takashi Horikawa
NEC Corporation
Knowledge Discovery Research Laboratories


> -Original Message-
> From: pgsql-hackers-ow...@postgresql.org
> [mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Tom Lane
> Sent: Tuesday, August 04, 2015 10:50 PM
> To: Horikawa Takashi(堀川 隆)
> Cc: PostgreSQL-development
> Subject: Re: [HACKERS] patch: prevent user from setting wal_buffers over
> 2GB bytes
> 
> Takashi Horikawa  writes:
> >> Why does this cause a core dump?  We could consider fixing whatever
> >> the problem is rather than capping the value.
> 
> > As far as I experiment with my own evaluation environment using
> > PostgreSQL-9.4.4 on a x86_64 Linux, this problem can be fixed with the
> > patch attached.
> 
> I'm unsure whether this represents a complete fix ... but even if it does,
> it would be awfully easy to re-introduce similar bugs in future code changes,
> and who would notice?  Josh's approach of restricting the buffer size seems
> a lot more robust.
> 
> If there were any practical use-case for such large WAL buffers then it
> might be worth spending some effort/risk here.  But AFAICS, there is not.
> Indeed, capping wal_buffers might be argued to be a good thing in itself
> because it would prevent users from wasting shared memory foolishly.
> 
> So my vote is for the original approach.  (I've not read Josh's patch, so
> there might be something wrong with it in detail, but I like the basic
> approach.)
> 
>   regards, tom lane
> 
> 




Re: [HACKERS] [sqlsmith] Failed assertion in joinrels.c

2015-08-04 Thread Tom Lane
Piotr Stefaniak  writes:
> On 08/03/2015 09:18 PM, Tom Lane wrote:
>> ... but I can't reproduce it on HEAD with either of these queries.
>> Not clear why you're getting different results.

> I'm terribly sorry, but I didn't notice that postgresql.conf was modified...
> Set join_collapse_limit = 32 and you should see the error.

Ah ... now I get that error on the smaller query, but the larger one
(that you put in an attachment) still doesn't show any problem.
Do you have any other nondefault settings?

regards, tom lane




Re: [HACKERS] tablecmds.c and lock hierarchy

2015-08-04 Thread Michael Paquier
On Wed, Aug 5, 2015 at 2:23 AM, Alvaro Herrera wrote:
> Michael Paquier wrote:
>> On Tue, Aug 4, 2015 at 2:43 PM, Alvaro Herrera wrote:
>> Now, let's take for example this case with locks A, B, C, D:
>> - Lock A conflicts with ACD
>> - B with BCD
>> - C with itself
>> - D with itself
>> What would you choose as a result of add(C,D)? A or B? Or the super
>> lock conflicting with all of them?

Actually the answer is the sum of all the potential candidates. This
converges to AccessExclusiveLock.
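
In that hypothetical case, an add() on two lock modes could be defined as the
weakest mode whose conflict set covers the union of both inputs' conflict
sets, which indeed converges to AccessExclusiveLock since that conflicts with
everything. A sketch, assuming a conflict table laid out like lock.c's
LockConflicts (conflictTab[m] being the bitmask of modes that mode m
conflicts with):

static LOCKMODE
add_lockmodes(const LOCKMASK *conflictTab, LOCKMODE a, LOCKMODE b)
{
	LOCKMASK	want = conflictTab[a] | conflictTab[b];
	LOCKMODE	m;

	for (m = 1; m <= MaxLockMode; m++)
	{
		if ((conflictTab[m] & want) == want)
			return m;			/* weakest mode covering both inputs */
	}
	return AccessExclusiveLock;
}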

> This appears to me to be a hypothetical case that I don't think occurs in
> our conflicts table, so I wouldn't worry about it.

No it doesn't. I am using a huge hypothetical "if" here :)
-- 
Michael




Re: [HACKERS] cost_agg() with AGG_HASHED does not account for startup costs

2015-08-04 Thread David Rowley
On 5 August 2015 at 01:54, Tom Lane  wrote:

> David Rowley  writes:
> > During working on allowing the planner to perform GROUP BY before joining
> > I've noticed that cost_agg() completely ignores input_startup_cost
> > when aggstrategy == AGG_HASHED.
>
> Isn't your proposed patch double-counting the input startup cost?
> input_total_cost already includes that charge.  The calculation
> reflects the fact that we have to read all of the input before we
> can deliver any aggregated results, so the time to get the first
> input row isn't really interesting.
>
> If this were wrong, the PLAIN costing path would also be wrong, but
> I don't think that either one is.
>
>
Sorry, false alarm, you're right.

This was a bug in my code where I was adding disable_cost to the
startup_cost, but not to the total_cost.

--
 David Rowley   http://www.2ndQuadrant.com/

 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] [DOCS] max_worker_processes on the standby

2015-08-04 Thread Alvaro Herrera
Adding CC to hackers, since this is clearly not just a docs issue.  Also
CCing Petr and Craig since they are the ones that know how this is used
in BDR.

Robert Haas wrote:
> On Tue, Aug 4, 2015 at 12:41 AM, Alvaro Herrera
>  wrote:

> > The alternative is to turn the feature on automatically if it sees that
> > the master also has it on, i.e. the value would not be what the config
> > file says it is.  Doing this would be a bit surprising IMO, but given
> > the behavior above maybe it's better than the current behavior.
> 
> I think it's totally reasonable for the standby to follow the master's
> behavior rather than the config file.  That should be documented, but
> otherwise, no problem.  If it were technologically possible for the
> standby to follow the config file rather than the master in all cases,
> that would be fine, too.  But the current behavior is somewhere in the
> middle, and that doesn't seem like a good plan.

So I discussed this with Petr.  He points out that if we make the
standby follow the master, then the problem would be the misbehavior
that results once the standby is promoted: at that point the standby
would no longer "follow the master" and would start with the feature
turned off, which could be disastrous (depending on what you are using
the commit timestamps for).

To solve that problem, you could suggest that if we see the setting
turned on in pg_control then we should follow that instead of the config
file; but then the problem is that there's no way to turn the feature
off.  And things are really crazy by then.

Given this, we're leaning towards the idea that the standby should not
try to follow the master at all.  Instead, an extension that wants to
use this stuff can check the value for itself, and raise a fatal error
if it's not already turned on in the config file.  That way, a lot of the
strange corner cases disappear.


-- 
Álvaro Herrera    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] brin index vacuum versus transaction snapshots

2015-08-04 Thread Alvaro Herrera
Alvaro Herrera wrote:

> Thankfully I found
> another way to solve it, which is to forgo usage of MVCC here and
> instead use SnapshotAny.  There's already a mode in
> IndexBuildHeapRangeScan that uses SnapshotAny, but it needs some tweaks
> to do what we need.  I'm going to propose a patch along those lines
> shortly.

Here's the promised patch.  Includes a new isolation test (which fails
with the old code in two ways: first when using VACUUM it causes the
error message Kevin didn't like to be raised; second when using
brin_summarize_new_values it causes the inserted tuple to not be part of
the summary, which is wrong).  This test leaves the pageinspect
extension installed in the isolationtest database, but that seems fine
to me.  (No other isolation test installs an extension.)

The error message Kevin complained about is gone, as is some related
commentary.  This is replaced by tweaks in IndexBuildHeapRangeScan that
know to do a SnapshotAny scan without being noisy about in progress
insertions/deletions.

-- 
Álvaro Herrera    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 360b26e..de32cca 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -951,9 +951,13 @@ summarize_range(IndexInfo *indexInfo, BrinBuildState *state, Relation heapRel,
 	 * Execute the partial heap scan covering the heap blocks in the specified
 	 * page range, summarizing the heap tuples in it.  This scan stops just
 	 * short of brinbuildCallback creating the new index entry.
+	 *
+	 * Note that it is critical we use the "any visible" mode of
+	 * IndexBuildHeapRangeScan here: otherwise, we would miss tuples inserted
+	 * by transactions that are still in progress, among other corner cases.
 	 */
 	state->bs_currRangeStart = heapBlk;
-	IndexBuildHeapRangeScan(heapRel, state->bs_irel, indexInfo, false,
+	IndexBuildHeapRangeScan(heapRel, state->bs_irel, indexInfo, false, true,
 			heapBlk, state->bs_pagesPerRange,
 			brinbuildCallback, (void *) state);
 
@@ -1058,36 +1062,6 @@ brinsummarize(Relation index, Relation heapRel, double *numSummarized,
 state = initialize_brin_buildstate(index, revmap,
    pagesPerRange);
 indexInfo = BuildIndexInfo(index);
-
-/*
- * We only have ShareUpdateExclusiveLock on the table, and
- * therefore other sessions may insert tuples into the range
- * we're going to scan.  This is okay, because we take
- * additional precautions to avoid losing the additional
- * tuples; see comments in summarize_range.  Set the
- * concurrent flag, which causes IndexBuildHeapRangeScan to
- * use a snapshot other than SnapshotAny, and silences
- * warnings emitted there.
- */
-indexInfo->ii_Concurrent = true;
-
-/*
- * If using transaction-snapshot mode, it would be possible
- * for another transaction to insert a tuple that's not
- * visible to our snapshot if we have already acquired one,
- * when in snapshot-isolation mode; therefore, disallow this
- * from running in such a transaction unless a snapshot hasn't
- * been acquired yet.
- *
- * This code is called by VACUUM and
- * brin_summarize_new_values. Have the error message mention
- * the latter because VACUUM cannot run in a transaction and
- * thus cannot cause this issue.
- */
-if (IsolationUsesXactSnapshot() && FirstSnapshotSet)
-	ereport(ERROR,
-			(errcode(ERRCODE_INVALID_TRANSACTION_STATE),
-			 errmsg("brin_summarize_new_values() cannot run in a transaction that has already obtained a snapshot")));
 			}
 			summarize_range(indexInfo, state, heapRel, heapBlk);
 
@@ -1111,7 +1085,10 @@ brinsummarize(Relation index, Relation heapRel, double *numSummarized,
 	/* free resources */
 	brinRevmapTerminate(revmap);
 	if (state)
+	{
 		terminate_brin_buildstate(state);
+		pfree(indexInfo);
+	}
 }
 
 /*
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 69f35c9..e59b163 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -2161,6 +2161,7 @@ IndexBuildHeapScan(Relation heapRelation,
 {
 	return IndexBuildHeapRangeScan(heapRelation, indexRelation,
    indexInfo, allow_sync,
+   false,
    0, InvalidBlockNumber,
    callback, callback_state);
 }
@@ -2170,12 +2171,17 @@ IndexBuildHeapScan(Relation heapRelation,
  * number of blocks are scanned.  Scan to end-of-rel can be signalled by
  * passing InvalidBlockNumber as numblocks.  Note that restricting the range
  * to scan cannot be done when requesting syncscan.
+ *
+ * When "anyvisible" mode is requested, all tuples visible to any transaction
+ * are considered, including those inserted or deleted by transactions that are
+ * still in progress.
  */
 double
 IndexBuildHeapRangeScan(Relation heapRelation,
 		Relation indexR

Re: [HACKERS] [sqlsmith] Failed assertion in joinrels.c

2015-08-04 Thread Ewan Higgs
Hi there, I've been following the sqlsmith work and wanted to jump in and try it
out. I took Peter's idea and tried building postgres with the flags suggested,
but it was hard to get anything working.

I'm on commit 85e5e222b1dd02f135a8c3bf387d0d6d88e669bd (Tue Aug 4 14:55:32 2015 
-0400)
Configure arguments: ./configure --prefix=$HOME/pkg CC=clang CFLAGS='-O1 -g 
-fsanitize=address -fno-omit-frame-pointer -fno-optimize-sibling-calls' 
--enable-cassert

I had to make a simple leak suppression file:
$ cat leak.supp
leak:save_ps_display_args
leak:__GI___strdup
$ export LSAN_OPTIONS=suppressions=leak.supp
And then I could run postgres. After 50,000 queries, I'm left with the 
following report:

queries: 50514
AST stats (avg): height = 7.29877 nodes = 37.8156
296    ERROR:  canceling statement due to statement timeout
166    ERROR:  invalid regular expression: quantifier operand invalid
26     ERROR:  could not determine which collation to use for string comparison
23     ERROR:  cannot compare arrays of different element types
12     ERROR:  invalid regular expression: brackets [] not balanced
5      ERROR:  cache lookup failed for index 2619
2      ERROR:  invalid regular expression: parentheses () not balanced
error rate: 0.0104921

AddressSanitizer didn't fire except for the suppressed leaks. The suppressed 
leaks were only hit at the beginning:
-
Suppressions used:
  count  bytes template
  1    520 save_ps_display_args
  1 10 __GI___strdup
-


sqlsmith is a cool little piece of kit and I see a lot of room for ongoing
work (performance bumps for more queries per second; more db back ends;
different fuzzers).

Yours,
Ewan Higgs

 

 From: Peter Geoghegan 
 To: Andreas Seltenreich  
Cc: Pg Hackers  
 Sent: Sunday, 2 August 2015, 10:39
 Subject: Re: [HACKERS] [sqlsmith] Failed assertion in joinrels.c
   
On Fri, Jul 31, 2015 at 5:56 PM, Andreas Seltenreich  wrote:
> sqlsmith triggered the following assertion in master (c188204).

Thanks for writing sqlsmith. It seems like a great tool.

I wonder, are you just running the tool with assertions enabled when
PostgreSQL is built? If so, it might make sense to make various
problems more readily detected. As you may know, Clang has a pretty
decent option called AddressSanitizer that can detect memory errors as
they occur with an overhead that is not excessive. One might use the
following configure arguments when building PostgreSQL to use
AddressSanitizer:

./configure CC=clang CFLAGS='-O1 -g -fsanitize=address
-fno-omit-frame-pointer -fno-optimize-sibling-calls' --enable-cassert

Of course, it remains to be seen if this pays for itself. Apparently
the tool has about a 2x overhead [1]. I'm really not sure that you'll
find any more bugs this way, but it's certainly possible that you'll
find a lot more. Given your success in finding bugs without using
AddressSanitizer, introducing it may be premature.

[1] http://clang.llvm.org/docs/AddressSanitizer.html#introduction
-- 
Peter Geoghegan






  

Re: [HACKERS] ON CONFLICT DO UPDATE using EXCLUDED.column gives an error about mismatched types

2015-08-04 Thread Peter Geoghegan
On Tue, Aug 4, 2015 at 2:29 AM, Amit Langote
 wrote:
> Perhaps, it may have to do with how EXCLUDED pseudo-rel's targetlist is
> manipulated through parse-plan stage?

I think so, yes.

I'll look into writing a fix for this later in the week.

Thanks for the report, Geoff, and thanks for the analysis Amit.

-- 
Peter Geoghegan




Re: [HACKERS] Sharing aggregate states between different aggregate functions

2015-08-04 Thread David Rowley
On 5 August 2015 at 03:03, Heikki Linnakangas  wrote:

> On 08/03/2015 08:53 AM, David Rowley wrote:
>
>> Attached is a delta patch, based on sharing_aggstate-heikki-2.patch, to fix
>> up the out-dated comments, plus a few more test scenarios which test that
>> the sharing works with matching INITCONDs and that it does not when they
>> don't match.
>>
>> What do you think?
>>
>
> I committed this, after some more cleanup of the comments. Thanks!
>
>
Great! Thanks for doing the cleanups and committing it.
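
For reference, the committed behaviour lets a query like the following
evaluate a single transition state for both aggregates, provided the two
aggregates declare the same transition function and matching INITCONDs (a
representative example, not taken from the patch):

SELECT sum(x), avg(x) FROM some_table;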

--
 David Rowley   http://www.2ndQuadrant.com/

 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Transactions involving multiple postgres foreign servers

2015-08-04 Thread Robert Haas
On Mon, Aug 3, 2015 at 8:19 PM, Amit Langote
 wrote:
> On 2015-08-03 PM 09:24, Ashutosh Bapat wrote:
>> On Sat, Aug 1, 2015 at 12:18 AM, Robert Haas  wrote:
>>>
>>> OK, sure.  But let's make sure postgres_fdw gets a server-level option
>>> to control this.
>>>
>>>
>> For postgres_fdw it's a boolean server-level option 'twophase_compliant'
>> (suggestions for name welcome).
>>
>
> How about just 'twophase'?

How about two_phase_commit?
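
For illustration, as a server-level option this would presumably be set like
so (the option name itself being the part under discussion):

CREATE SERVER remote1 FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'remote1.example.com', two_phase_commit 'true');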

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Raising our compiler requirements for 9.6

2015-08-04 Thread Robert Haas
On Tue, Aug 4, 2015 at 3:55 PM, Andres Freund  wrote:
> On 2015-08-04 15:45:44 -0400, Tom Lane wrote:
>> I'm not sure that there's any great urgency about changing the instances
>> that exist now; the real point of this discussion is that we will allow
>> new code to use static inlines in headers.
>
> I agree that we don't have to (and probably shouldn't) make the required
> configure changes and change definitions.  But I do think some of the
> current macro usage deserves to be cleaned up at some point. I,
> somewhere during 9.4 or 9.5, redefined some of the larger macros into
> static inlines and it both reduced the binary size and gave minor
> performance benefits.

We typically recommend that people write their new code like the
existing code.  If we say that the standards for new code are now
different from old code in this one way, I don't think that's going to
be very helpful to anyone.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] RFC: replace pg_stat_activity.waiting with something more descriptive

2015-08-04 Thread Alvaro Herrera
Ildus Kurbangaliev wrote:

> A new version of the patch. I used your idea with macros, and with tranches
> that allowed us to remove the array with names (they can be written directly
> to the corresponding tranche).

Just a bystander here; I haven't reviewed this patch at all, but I have
two questions:

1. have you tested this under -DEXEC_BACKEND ?  I wonder if those
initializations are going to work on Windows.

2. why keep the SLRU control locks as individual locks?  Surely we could
put them in the SlruCtl struct and get rid of a few individual lwlocks?

Also, I wonder if we shouldn't be doing this in two parts, one that
changes the underlying lwlock structure and another one to change
pg_stat_activity.


We have CppAsString() in c.h IIRC, which we use instead of # (we do use
## in a few places though).  I wonder if that stuff has any value
anymore.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] RFC: replace pg_stat_activity.waiting with something more descriptive

2015-08-04 Thread Robert Haas
On Tue, Aug 4, 2015 at 4:47 PM, Robert Haas  wrote:
> On Tue, Aug 4, 2015 at 4:37 PM, Ildus Kurbangaliev
>  wrote:
>> A new version of the patch. I used your idea with macros, and with tranches
>> that allowed us to remove the array with names (they can be written directly
>> to the corresponding tranche).
>
> You seem not to have addressed a few of the points I brought up here:
>
> http://www.postgresql.org/message-id/CA+TgmoaGqhah0VTamsfaOMaE9uOrCPYSXN8hCS9=wirupjs...@mail.gmail.com

More generally, I'd like to stop smashing all the things that need to
be done here into one patch.  We need to make some changes, such as
the one I proposed earlier today, to make it easier to properly
identify locks.  Let's talk about how to do that and agree on the
details.  Then, once that's done, let's do the main part of the work
afterwards, in a separate commit.  We're running through patch
versions at light speed here, but I'm not sure we're really building
consensus around how to do things.  The actual technical work here
isn't really the problem; that part is easy.  The hard part is
agreeing on the details of how it should work.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] FSM versus GIN pending list bloat

2015-08-04 Thread Simon Riggs
On 4 August 2015 at 21:04, Jeff Janes  wrote:


>> Couple of questions here...
>>
>> * the docs say "it's desirable to have pending-list cleanup occur in the
>> background", but there is no way to invoke that, except via VACUUM. I
>> think we need a separate function to be able to call this as a background
>> action. If we had that, we wouldn't need much else, would we?
>>
>
> I thought maybe the new bgworker framework would be a way to have a
> backend signal a bgworker to do the cleanup when it notices the pending
> list is getting large.  But that wouldn't directly fix this issue, because
> the bgworker still wouldn't recycle that space (without further changes),
> only vacuum workers do that currently.
>
> But I don't think this could be implemented as an extension, because the
> signalling code has to be in core, so (not having studied the matter at
> all) I don't know if it is good fit for bgworker.
>

We need to expose 2 functions:

1. a function to perform the recycling directly (BRIN has an equivalent
function)

2. a function to see how big the pending list is for a particular index,
i.e. do we need to run function 1?

We can then build a bgworker that polls the pending list and issues a
recycle if and when needed - which is how autovac started.
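
In SQL terms the pair could look something like this (hypothetical function
names and signatures; nothing has been committed):

-- 1. flush and recycle the pending list of a given GIN index
SELECT gin_clean_pending_list('some_gin_index'::regclass);

-- 2. report its current pending-list size, to decide whether to call 1.
SELECT gin_pending_list_size('some_gin_index'::regclass);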


>> * why do we have two parameters: gin_pending_list_limit and fastupdate?
>> What happens if we set gin_pending_list_limit but don't set fastupdate?
>>
>
> Fastupdate is on by default.  If it were turned off, then
> gin_pending_list_limit would be mostly irrelevant for those tables.
> Fastupdate could have been implemented as a magic value (0 or -1) for
> gin_pending_list_limit but that would break backwards compatibility (and
> arguably would not be a better way of doing things, anyway).
>
>
>> * how do we know how to set that parameter? Is there a way of knowing
>> gin_pending_list_limit has been reached?
>>
>
> I don't think there is an easier answer to that.  The trade offs are
> complex and depend on things like how well cached the parts of the index
> needing insertions are, how many lexemes/array elements are in an average
> document, and how many documents inserted near the same time as each other
> share lexemes in common.  And of course what you need to optimize for,
> latency or throughput, and if latency, whether search latency or insert latency.
>

So we also need a way to count the number of times the pending list is
flushed. Perhaps record that on the metapage, so we can see how often it
has happened - and another function to view the stats on that.

>> This and the OP seem like 9.5 open items to me.
>
> I don't think so.  Freeing gin_pending_list_limit from being forcibly tied
> to work_mem is a good thing.  Even if I don't know exactly how to set
> gin_pending_list_limit, I know I don't want it to be 4GB just because work_mem
> was set there for some temporary reason.  I'm happy to leave it at its
> default and let its fine tuning be a topic for people who really care about
> every microsecond of performance.
>

OK, I accept this.

-- 
Simon Riggs    http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] RFC: replace pg_stat_activity.waiting with something more descriptive

2015-08-04 Thread Robert Haas
On Tue, Aug 4, 2015 at 4:37 PM, Ildus Kurbangaliev
 wrote:
> A new version of the patch. I used your idea with macros, and with tranches
> that allowed us to remove the array with names (they can be written directly
> to the corresponding tranche).

You seem not to have addressed a few of the points I brought up here:

http://www.postgresql.org/message-id/CA+TgmoaGqhah0VTamsfaOMaE9uOrCPYSXN8hCS9=wirupjs...@mail.gmail.com

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] More work on SortSupport for text - strcoll() and strxfrm() caching

2015-08-04 Thread Peter Geoghegan
On Tue, Aug 4, 2015 at 1:30 PM, Peter Geoghegan  wrote:
>> 2. I believe the change to bttextcmp_abbrev() should be pulled out
>> into a separate patch and committed separately.  That part seems like
>> a slam dunk.
>
> Makes sense.

BTW, I want to put the string_uint() macro in a common header now. It
can be used for other types. I've written a SortSupport + abbreviated
keys patch for the UUID type which will use it, too, so that it too
can use simple unsigned integer comparisons within its abbreviated
comparator. I haven't posted the UUID patch yet only because everyone
is up to their ears in my sorting patches.

-- 
Peter Geoghegan




Re: [HACKERS] more-helpful-izing a debug message

2015-08-04 Thread Robert Haas
On Wed, Jul 8, 2015 at 5:38 AM, Marko Tiikkaja  wrote:
> One of the debug messages related to logical replication could be more
> helpful than it currently is.  The attached patch reorders the two
> operations to make it so.
>
> Please consider patching and back-patching.

Andres, this looks like a bug fix to me.  What do you think?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] RFC: replace pg_stat_activity.waiting with something more descriptive

2015-08-04 Thread Ildus Kurbangaliev

> On Aug 4, 2015, at 4:54 PM, Heikki Linnakangas  wrote:
> 
> On 08/04/2015 03:15 PM, Robert Haas wrote:
>> On Tue, Jul 28, 2015 at 3:28 PM, Heikki Linnakangas  wrote:
>>> * The patch requires that the LWLOCK_INDIVIDUAL_NAMES array is kept in sync
>>> with the list of individual locks in lwlock.h. Sooner or later someone will
>>> add an LWLock and forget to update the names-array. That needs to be made
>>> less error-prone, so that the names are maintained in the same place as the
>>> #defines. Perhaps something like rmgrlist.h.
>> 
>> This is a good idea, but it's not easy to do in the style of
>> rmgrlist.h, because I don't believe there's any way to define a macro
>> that expands to a preprocessor directive.  Attached is a patch that
>> instead generates the list of macros from a text file, and also
>> generates an array inside lwlock.c with the lock names that gets used
>> by the Trace_lwlocks stuff where applicable.
>> 
>> Any objections to this solution to the problem?  If not, I'd like to
>> go ahead and push this much.  I can't test the Windows changes
>> locally, though, so it would be helpful if someone could check that
>> out.
> 
> A more low-tech solution would be to do something like this in lwlocknames.c:
> 
> static char *MainLWLockNames[NUM_INDIVIDUAL_LWLOCKS];
> 
> /* Turn a pointer to one of the LWLocks in the main array into an index number */
> #define NAME_LWLOCK(l, name) MainLWLockNames[l - MainLWLockArray] = name
> 
> InitLWLockNames()
> {
>  NAME_LWLOCK(ShmemIndexLock, "ShmemIndexLock");
>  NAME_LWLOCK(OidGenLock, "OidGenLock");
>  ...
> }
> 
> That would not be auto-generated, so you'd need to keep that list in sync 
> with lwlock.h, but it would be much better than the original patch because if 
> you forgot to add an entry in the names-array, the numbering of all the other 
> locks would not go wrong. And you could have a runtime check that complains 
> if there's an entry missing, like Ildus did in his latest patch.
> 
> I have no particular objection to your perl script either, though. I'll leave 
> it up to you.
> 
> - Heikki
> 

A new version of the patch. I used your idea with macros, and with tranches
that allowed us to remove the array with names (they can be written directly
to the corresponding tranche).


Ildus Kurbangaliev
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



extend_pg_stat_activity_v7.patch
Description: Binary data



Re: [HACKERS] Dependency between bgw_notify_pid and bgw_flags

2015-08-04 Thread Robert Haas
On Tue, Jul 7, 2015 at 4:34 AM, Ashutosh Bapat
 wrote:
> With that notion of backend, to fix the original problem I reported,
> PostmasterMarkPIDForWorkerNotify() should also look at the
> BackgroundWorkerList. As per the comments in the prologue of this function,
> it assumes that only backends need to be notified about worker state change,
> which looks too restrictive. Any backend or background worker, which wants
> to be notified about a background worker it requested to be spawned, should
> be allowed to do so.

Yeah.  I'm wondering if we should fix this problem by just insisting
that all workers have an entry in BackendList.  At first look, this
seems like it would make the code a lot simpler and have virtually no
downside.  As it is, we're repeatedly reinventing new and different
ways for unconnected background workers to do things that work fine in
all other cases, and I don't see the point of that.

I haven't really tested the attached patch - sadly, we have no testing
of background workers without shared memory access in the tree - but
it shows what I have in mind.

Thoughts?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 000524d..1818f7c 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -155,8 +155,7 @@
  * they will never become live backends.  dead_end children are not assigned a
  * PMChildSlot.
  *
- * Background workers that request shared memory access during registration are
- * in this list, too.
+ * Background workers are in this list, too.
  */
 typedef struct bkend
 {
@@ -404,13 +403,11 @@ static long PostmasterRandom(void);
 static void RandomSalt(char *md5Salt);
 static void signal_child(pid_t pid, int signal);
 static bool SignalSomeChildren(int signal, int targets);
-static bool SignalUnconnectedWorkers(int signal);
 static void TerminateChildren(int signal);
 
 #define SignalChildren(sig)			   SignalSomeChildren(sig, BACKEND_TYPE_ALL)
 
 static int	CountChildren(int target);
-static int	CountUnconnectedWorkers(void);
 static void maybe_start_bgworker(void);
 static bool CreateOptsFile(int argc, char *argv[], char *fullprogname);
 static pid_t StartChildProcess(AuxProcType type);
@@ -2414,7 +2411,6 @@ SIGHUP_handler(SIGNAL_ARGS)
 (errmsg("received SIGHUP, reloading configuration files")));
 		ProcessConfigFile(PGC_SIGHUP);
 		SignalChildren(SIGHUP);
-		SignalUnconnectedWorkers(SIGHUP);
 		if (StartupPID != 0)
 			signal_child(StartupPID, SIGHUP);
 		if (BgWriterPID != 0)
@@ -2491,7 +2487,6 @@ pmdie(SIGNAL_ARGS)
 /* and bgworkers too; does this need tweaking? */
 SignalSomeChildren(SIGTERM,
 			   BACKEND_TYPE_AUTOVAC | BACKEND_TYPE_BGWORKER);
-SignalUnconnectedWorkers(SIGTERM);
 /* and the autovac launcher too */
 if (AutoVacPID != 0)
 	signal_child(AutoVacPID, SIGTERM);
@@ -2543,11 +2538,11 @@ pmdie(SIGNAL_ARGS)
 signal_child(BgWriterPID, SIGTERM);
 			if (WalReceiverPID != 0)
 signal_child(WalReceiverPID, SIGTERM);
-			SignalUnconnectedWorkers(SIGTERM);
 			if (pmState == PM_RECOVERY)
 			{
+SignalSomeChildren(SIGTERM, BACKEND_TYPE_BGWORKER);
 /*
- * Only startup, bgwriter, walreceiver, unconnected bgworkers,
+ * Only startup, bgwriter, walreceiver, possibly bgworkers,
  * and/or checkpointer should be active in this state; we just
  * signaled the first four, and we don't want to kill
  * checkpointer yet.
@@ -2999,25 +2994,21 @@ CleanupBackgroundWorker(int pid,
 		}
 
 		/* Get it out of the BackendList and clear out remaining data */
-		if (rw->rw_backend)
-		{
-			Assert(rw->rw_worker.bgw_flags & BGWORKER_BACKEND_DATABASE_CONNECTION);
-			dlist_delete(&rw->rw_backend->elem);
+		dlist_delete(&rw->rw_backend->elem);
 #ifdef EXEC_BACKEND
-			ShmemBackendArrayRemove(rw->rw_backend);
+		ShmemBackendArrayRemove(rw->rw_backend);
 #endif
 
-			/*
-			 * It's possible that this background worker started some OTHER
-			 * background worker and asked to be notified when that worker
-			 * started or stopped.  If so, cancel any notifications destined
-			 * for the now-dead backend.
-			 */
-			if (rw->rw_backend->bgworker_notify)
-BackgroundWorkerStopNotifications(rw->rw_pid);
-			free(rw->rw_backend);
-			rw->rw_backend = NULL;
-		}
+		/*
+		 * It's possible that this background worker started some OTHER
+		 * background worker and asked to be notified when that worker
+		 * started or stopped.  If so, cancel any notifications destined
+		 * for the now-dead backend.
+		 */
+		if (rw->rw_backend->bgworker_notify)
+			BackgroundWorkerStopNotifications(rw->rw_pid);
+		free(rw->rw_backend);
+		rw->rw_backend = NULL;
 		rw->rw_pid = 0;
 		rw->rw_child_slot = 0;
 		ReportBackgroundWorkerPID(rw);	/* report child death */
@@ -3160,15 +3151,12 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
 			 * Found entry for freshly-de

Re: [HACKERS] More work on SortSupport for text - strcoll() and strxfrm() caching

2015-08-04 Thread Peter Geoghegan
On Tue, Aug 4, 2015 at 12:41 PM, Robert Haas  wrote:
> Interesting work.

Thanks.

> 1. My biggest gripe with this patch is that the comments are not easy
> to understand.

> Of course everybody may prefer something different here; I'm just
> telling you what I think.

I have struggled with trying to put just the right amount of
exposition on the theory behind a particular optimization in source
code comments, and things like that. Since no one is telling me that I
need to write more, clearly I don't have the balance right yet. To a
certain extent, it is a matter of personal style, but I'll try and be
more terse.

> 2. I believe the change to bttextcmp_abbrev() should be pulled out
> into a separate patch and committed separately.  That part seems like
> a slam dunk.

Makes sense.

> 3. What is the worst case for the strxfrm()-reuse stuff?  I suppose
> it's the case where we have many strings that are long, all
> equal-length, and all different, but only in the last few characters.
> Then the memcmp() is as expensive as possible but never works out.
> How does the patch hold up in that case?

I haven't tested it. I'll get around to it at some point in the next
couple of weeks. I imagine that it's exactly the same as the memcmp()
equality thing because of factors like speculative execution, and the
fact that we need both strings in cache anyway. It's almost exactly
the same story, although unlike the memcmp() opportunistic equality
pre-check thing, this check happens only n times, not n log n times.

I'm quite sure that the cost needs to be virtually zero to go ahead
with the idea. I think it probably is. Note that like the memcmp()
thing, we check string length first, before a memcmp().
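
That pre-check amounts to a fragment like the following (a sketch of the
idea, not the patch itself):

/* opportunistic equality test: cheap, and exact whenever it fires */
if (len1 == len2 && memcmp(s1, s2, len1) == 0)
	return 0;			/* equal, so the expensive strcoll() is skipped */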

-- 
Peter Geoghegan




Re: [HACKERS] FSM versus GIN pending list bloat

2015-08-04 Thread Jeff Janes
On Tue, Aug 4, 2015 at 6:35 AM, Simon Riggs  wrote:

> On 4 August 2015 at 09:39, Simon Riggs  wrote:
>
>> On 4 August 2015 at 06:03, Jeff Janes  wrote:
>>
>>
>>> The attached proof of concept patch greatly improves the bloat for both
>>> the insert and the update cases.  You need to turn on both features: adding
>>> the pages to fsm, and vacuuming the fsm, to get the benefit (so JJ_GIN=3).
>>> The first of those two things could probably be adopted for real, but the
>>> second probably is not acceptable.  What is the right way to do this?
>>> Could a variant of RecordFreeIndexPage bubble the free space up the map
>>> immediately rather than waiting for a vacuum?  It would only have to move
>>> up until it found a page with freespace already recorded in it, which the
>>> vast majority of the time would mean observing up one level and then not
>>> writing to it, assuming the pending list pages remain well clustered.
>>>
>>
>> You make a good case for action here since insert only tables with GIN
>> indexes on text are a common use case for GIN.
>>
>> Why would vacuuming the FSM be unacceptable? With a
>> large gin_pending_list_limit it makes sense.
>>
>> If it is unacceptable, perhaps we can avoid calling it every time, or
>> simply have FreeSpaceMapVacuum() terminate more quickly on some kind of
>> 80/20 heuristic for this case.
>>
>
> Couple of questions here...
>
> * the docs say "it's desirable to have pending-list cleanup occur in the
> background", but there is no way to invoke that, except via VACUUM. I
> think we need a separate function to be able to call this as a background
> action. If we had that, we wouldn't need much else, would we?
>

I thought maybe the new bgworker framework would be a way to have a backend
signal a bgworker to do the cleanup when it notices the pending list is
getting large.  But that wouldn't directly fix this issue, because the
bgworker still wouldn't recycle that space (without further changes), only
vacuum workers do that currently.

But I don't think this could be implemented as an extension, because the
signalling code has to be in core, so (not having studied the matter at
all) I don't know if it is good fit for bgworker.


> * why do we have two parameters: gin_pending_list_limit and fastupdate?
> What happens if we set gin_pending_list_limit but don't set fastupdate?
>

Fastupdate is on by default.  If it were turned off, then
gin_pending_list_limit would be mostly irrelevant for those tables.
Fastupdate could have been implemented as a magic value (0 or -1) for
gin_pending_list_limit but that would break backwards compatibility (and
arguably would not be a better way of doing things, anyway).


> * how do we know how to set that parameter? Is there a way of knowing
> gin_pending_list_limit has been reached?
>

I don't think there is an easier answer to that.  The trade offs are
complex and depend on things like how well cached the parts of the index
needing insertions are, how many lexemes/array elements are in an average
document, and how many documents inserted near the same time as each other
share lexemes in common.  And of course what you need to optimize for,
latency or throughput, and if latency, whether search latency or insert latency.

> This and the OP seem like 9.5 open items to me.
>

I don't think so.  Freeing gin_pending_list_limit from being forcibly tied
to work_mem is a good thing.  Even if I don't know exactly how to set
gin_pending_list_limit, I know I don't want it to be 4GB just because work_mem
was set there for some temporary reason.  I'm happy to leave it at its
default and let its fine tuning be a topic for people who really care about
every microsecond of performance.

Cheers,

Jeff


Re: [HACKERS] Raising our compiler requirements for 9.6

2015-08-04 Thread Andres Freund
On 2015-08-04 15:45:44 -0400, Tom Lane wrote:
> I'm not sure that there's any great urgency about changing the instances
> that exist now; the real point of this discussion is that we will allow
> new code to use static inlines in headers.

I agree that we don't have to (and probably shouldn't) make the required
configure changes and change definitions.  But I do think some of the
current macro usage deserves to be cleaned up at some point. I,
somewhere during 9.4 or 9.5, redefined some of the larger macros into
static inlines and it both reduced the binary size and gave minor
performance benefits.

- Andres




Re: [HACKERS] Raising our compiler requirements for 9.6

2015-08-04 Thread Tom Lane
Andres Freund  writes:
> On 2015-08-04 15:20:14 -0400, Robert Haas wrote:
>> OK, so do we want to rip out all instances of the static inline dance
>> in favor of more straightforward coding?  Do we then shut pandemelon
>> and any other affected buildfarm members down as unsupported, or what?

> I think all that happens is that they'll log a couple more warnings
> about defined but unused static functions. configure already defines
> inline away if not supported.

Right.  We had already concluded that this would be safe to do, it's
just a matter of somebody being motivated to do it.

I'm not sure that there's any great urgency about changing the instances
that exist now; the real point of this discussion is that we will allow
new code to use static inlines in headers.

regards, tom lane




Re: [HACKERS] FSM versus GIN pending list bloat

2015-08-04 Thread Jeff Janes
On Tue, Aug 4, 2015 at 1:39 AM, Simon Riggs  wrote:

> On 4 August 2015 at 06:03, Jeff Janes  wrote:
>
>
>> The attached proof of concept patch greatly improves the bloat for both
>> the insert and the update cases.  You need to turn on both features: adding
>> the pages to fsm, and vacuuming the fsm, to get the benefit (so JJ_GIN=3).
>> The first of those two things could probably be adopted for real, but the
>> second probably is not acceptable.  What is the right way to do this?
>> Could a variant of RecordFreeIndexPage bubble the free space up the map
>> immediately rather than waiting for a vacuum?  It would only have to move
>> up until it found a page with freespace already recorded in it, which the
>> vast majority of the time would mean observing up one level and then not
>> writing to it, assuming the pending list pages remain well clustered.
>>
>
> You make a good case for action here since insert only tables with GIN
> indexes on text are a common use case for GIN.
>
> Why would vacuuming the FSM be unacceptable? With a
> large gin_pending_list_limit it makes sense.
>

But with a smallish gin_pending_list_limit (like the default 4MB) this
could be called a lot (multiple times a second during some spurts), and
would read the entire fsm each time.


>
> If it is unacceptable, perhaps we can avoid calling it every time, or
> simply have FreeSpaceMapVacuum() terminate more quickly on some kind of
> 80/20 heuristic for this case.
>

Or maybe it could be passed a range of blocks which need vacuuming, so it
concentrated on that range.

But from the README file, it sounds like it is already supposed to be
bubbling up.  I'll have to see just what's going on there when I get a
chance.

Cheers,

Jeff


Re: [HACKERS] More work on SortSupport for text - strcoll() and strxfrm() caching

2015-08-04 Thread Robert Haas
On Fri, Jul 3, 2015 at 8:33 PM, Peter Geoghegan  wrote:
> Since apparently we're back to development work, I thought it was time
> to share a patch implementing a few additional simple tricks to make
> sorting text under a non-C locale even faster than in 9.5. These
> techniques are mostly effective when values are physically clustered
> together. This might be because there is a physical/logical
> correlation, but cases involving any kind of clustering of values are
> helped significantly.

Interesting work.

Some comments:

1. My biggest gripe with this patch is that the comments are not easy
to understand.  For example:

+ * Attempt to re-use buffers across calls.  Also, avoid work in the event
+ * of clustered together identical items by exploiting temporal locality.
+ * This works well with divide-and-conquer, comparison-based sorts like
+ * quicksort and mergesort.
+ *
+ * With quicksort, there is, in general, a pretty strong chance that the
+ * same buffer contents can be used repeatedly for pivot items -- early
+ * pivot items will account for large number of total comparisons, since
+ * they must be compared against many (possibly all other) items.

Well, what I would have written is something like: "We're likely to be
asked to compare the same strings repeatedly, and memcmp() is so much
cheaper than memcpy() that it pays to attempt a memcmp() in the hopes
of avoiding a memcpy().  This doesn't seem to slow things down
measurably even if it doesn't work out very often."

+ * While it is worth going to the trouble of trying to re-use buffer
+ * contents across calls, ideally that will lead to entirely avoiding a
+ * strcoll() call by using a cached return value.
+ *
+ * This optimization can work well again and again for the same set of
+ * clustered together identical attributes;  as they're relocated to new
+ * subpartitions, only one strcoll() is required for each pivot (in respect
+ * of that clump of identical values, at least).  Similarly, the final
+ * N-way merge of a mergesort can be effectively accelerated if each run
+ * has its own locally clustered values.

And here I would have written something like: "If we're comparing the
same two strings that we compared last time, we can return the same
answer without calling strcoll() again.  This is more likely than it
seems, because quicksort compares the same pivot against many values,
and some of those values might be duplicates."

Of course everybody may prefer something different here; I'm just
telling you what I think.
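
As a self-contained illustration of the two tricks those suggested
comments describe (this is not the patch's code, which works on
length-counted text and abbreviated keys; strcmp() on NUL-terminated
strings stands in for its memcmp()):

#include <stdbool.h>
#include <string.h>

static char last1[256];
static char last2[256];
static int	last_result;
static bool last_valid = false;

static int
cached_strcoll(const char *s1, const char *s2)
{
	/* trick 2: same pair as the previous call => reuse the cached answer */
	if (last_valid && strcmp(s1, last1) == 0 && strcmp(s2, last2) == 0)
		return last_result;

	/* trick 1: byte-identical strings are equal in every locale, so a
	 * cheap comparison can skip strcoll() entirely */
	if (strcmp(s1, s2) == 0)
		return 0;

	last_result = strcoll(s1, s2);	/* the expensive locale-aware path */
	if (strlen(s1) < sizeof(last1) && strlen(s2) < sizeof(last2))
	{
		strcpy(last1, s1);
		strcpy(last2, s2);
		last_valid = true;
	}
	else
		last_valid = false;
	return last_result;
}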

2. I believe the change to bttextcmp_abbrev() should be pulled out
into a separate patch and committed separately.  That part seems like
a slam dunk.

3. What is the worst case for the strxfrm()-reuse stuff?  I suppose
it's the case where we have many strings that are long, all
equal-length, and all different, but only in the last few characters.
Then the memcmp() is as expensive as possible but never works out.
How does the patch hold up in that case?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Raising our compiler requirements for 9.6

2015-08-04 Thread Andres Freund
On 2015-08-04 15:20:14 -0400, Robert Haas wrote:
> OK, so do we want to rip out all instances of the static inline dance
> in favor of more straightforward coding?  Do we then shut pandemelon
> and any other affected buildfarm members down as unsupported, or what?

I think all that happens is that they'll log a couple more warnings
about defined but unused static functions. configure already defines
inline away if not supported.




Re: [HACKERS] Raising our compiler requirements for 9.6

2015-08-04 Thread Robert Haas
On Thu, Jul 2, 2015 at 12:10 PM, Tom Lane  wrote:
> Robert Haas  writes:
>> So far this thread is all about the costs of desupporting compilers
>> that don't have these features, and you're making a good argument
>> (that I think we all agree with) that the cost is small.  But you
>> haven't really said what the benefits are.
>
> I made the same remark with respect to varargs macros, and I continue
> to think that the cost-benefit ratio there is pretty marginal.
>
> However, I do think that there's a case to be made for adopting static
> inline.  The INCLUDE_DEFINITIONS dance is very inconvenient, so it
> discourages people from using static inlines over macros, even in cases
> where the macro approach is pretty evil (usually, because of multiple-
> evaluation hazards).  If we allowed static inlines then we could work
> towards having a safer coding standard along the lines of "thou shalt
> not write macros that evaluate their arguments more than once".
> So the benefits here are easier development and less risk of bugs.
> On the other side of the ledger, the costs are pretty minimal; we will
> not actually break any platforms, at worst we'll make them unpleasant
> to develop on because of lots of warning messages.  We have some platforms
> like that already, and it's not been a huge problem.
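
The hazard in question is easy to show (illustrative snippet; c.h's real
Max() macro has exactly this shape, while UnsafeMax() and max_int() are
made-up names):

#define UnsafeMax(x, y)  ((x) > (y) ? (x) : (y))

/* UnsafeMax(i++, j) increments i once or twice depending on which branch
 * the ternary takes -- the classic multiple-evaluation bug. */

static inline int
max_int(int x, int y)
{
	/* each argument is evaluated exactly once, and any modern compiler
	 * still inlines this to the same instructions as the macro */
	return (x > y) ? x : y;
}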

OK, so do we want to rip out all instances of the static inline dance
in favor of more straightforward coding?  Do we then shut pandemelon
and any other affected buildfarm members down as unsupported, or what?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] upgrade failure from 9.5 to head

2015-08-04 Thread Andrew Dunstan


On 08/04/2015 02:09 PM, Tom Lane wrote:

> Robert Haas  writes:
>> On Tue, Aug 4, 2015 at 1:54 PM, Andres Freund  wrote:
>>> On 2015-08-04 13:52:54 -0400, Tom Lane wrote:
>>>> Not sure whether we should consider it a back-patchable bug fix or
>>>> something to do only in HEAD, though --- comments?
>>>
>>> Tentatively I'd say it's a bug and should be back-patched.
>>
>> Agreed.  If investigation turns up reasons to worry about
>> back-patching it, I'd possibly back-track on that position, but I
>> think we should start with the notion that it is back-patchable and
>> retreat from that position only at need.
>
> OK.  Certainly we can fix 9.5 the same way as HEAD; the pg_dump code
> hasn't diverged much yet.  I'll plan to push it as far back as convenient,
> but I won't expend any great effort on making the older branches do it if
> they turn out to be too different.





From my POV 9.5 is the one that's most critical, because it's the one 
that introduced a regression test that leaves a shell type lying around. 
But "as far back as convenient" works for me.


cheers

andrew




Re: [HACKERS] [sqlsmith] subplan variable reference / unassigned NestLoopParams (was: [sqlsmith] Failed assertion in joinrels.c)

2015-08-04 Thread Tom Lane
Andreas Seltenreich  writes:
> Tom Lane writes:
>> Well, I certainly think all of these represent bugs:
>>> 3 | ERROR:  plan should not reference subplan's variable
>>> 2 | ERROR:  failed to assign all NestLoopParams to plan nodes

> These appear to be related.  The following query produces the former,
> but if you replace the very last reference of provider with the literal
> 'bar', it raises the latter error.

Fixed that, thanks for the test case!

> ,[ FWIW: git bisect run ]
> | first bad commit: [e83bb10d6dcf05a666d4ada00d9788c7974ad378]
> | Adjust definition of cheapest_total_path to work better with LATERAL.
> `

There's still something fishy about your git bisect results; they don't
have much to do with what seems to me to be the triggering condition.
I suspect the problem is that git bisect doesn't allow for the possibility
that the symptom might appear and disappear over time, ie it might have
been visible at some early stage of the LATERAL work but been fixed later,
and then reintroduced by still-later optimization efforts.

regards, tom lane




Re: [HACKERS] brin index vacuum versus transaction snapshots

2015-08-04 Thread Alvaro Herrera
Robert Haas wrote:
> On Fri, Jul 31, 2015 at 3:45 PM, Alvaro Herrera
>  wrote:

> > I think the only way to close this hole is to have summarize_range()
> > sleep until all open snapshots are gone after inserting the placeholder
> > tuple and before acquiring the snapshot, similarly to how CREATE INDEX
> > CONCURRENTLY does it.
> 
> That's gonna be really slow, though, right?  Even if you rework things
> so that vacuum inserts all the placeholder tuples first, then waits,
> then does all the summarization, that could easily turn a vacuum that
> would have finished in a second into one that instead takes multiple
> hours.  During that time an AV worker is pinned down, and all sorts of
> badness can ensue.

Yeah, it is bad and I was concerned about that too.  Thankfully I found
another way to solve it, which is to forgo usage of MVCC here and
instead use SnapshotAny.  There's already a mode in
IndexBuildHeapRangeScan that uses SnapshotAny, but it needs some tweaks
to do what we need.  I'm going to propose a patch along those lines
shortly.

> Maybe I'm misunderstanding, but this whole thing seems like a pretty
> serious problem for BRIN.  :-(

With this new approach it shouldn't be.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] upgrade failure from 9.5 to head

2015-08-04 Thread Tom Lane
Robert Haas  writes:
> On Tue, Aug 4, 2015 at 1:54 PM, Andres Freund  wrote:
>> On 2015-08-04 13:52:54 -0400, Tom Lane wrote:
>>> Not sure whether we should consider it a back-patchable bug fix or
>>> something to do only in HEAD, though --- comments?

>> Tentatively I'd say it's a bug and should be back-patched.

> Agreed.  If investigation turns up reasons to worry about
> back-patching it, I'd possibly back-track on that position, but I
> think we should start with the notion that it is back-patchable and
> retreat from that position only at need.

OK.  Certainly we can fix 9.5 the same way as HEAD; the pg_dump code
hasn't diverged much yet.  I'll plan to push it as far back as convenient,
but I won't expend any great effort on making the older branches do it if
they turn out to be too different.

regards, tom lane




Re: [HACKERS] tablecmds.c and lock hierarchy

2015-08-04 Thread Robert Haas
On Tue, Aug 4, 2015 at 2:41 AM, Michael Paquier
 wrote:
> Yep, true as things stand now. But this would get broken if we add a
> new lock level between ShareRowExclusiveLock and AccessExclusiveLock
> that does not respect the current monotone hierarchy between those.

But we're probably not going to do that, so it doesn't matter; and if
we do do it, we can worry about it then.  I don't think this is worth
getting concerned about now.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] upgrade failure from 9.5 to head

2015-08-04 Thread Robert Haas
On Tue, Aug 4, 2015 at 1:54 PM, Andres Freund  wrote:
> On 2015-08-04 13:52:54 -0400, Tom Lane wrote:
>> Not sure whether we should consider it a back-patchable bug fix or
>> something to do only in HEAD, though --- comments?
>
> Tentatively I'd say it's a bug and should be back-patched.

Agreed.  If investigation turns up reasons to worry about
back-patching it, I'd possibly back-track on that position, but I
think we should start with the notion that it is back-patchable and
retreat from that position only at need.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] upgrade failure from 9.5 to head

2015-08-04 Thread Andres Freund
On 2015-08-04 13:52:54 -0400, Tom Lane wrote:
> Not sure whether we should consider it a back-patchable bug fix or
> something to do only in HEAD, though --- comments?

Tentatively I'd say it's a bug and should be back-patched.

Andres




Re: [HACKERS] upgrade failure from 9.5 to head

2015-08-04 Thread Tom Lane
Robert Haas  writes:
> On Sun, Aug 2, 2015 at 8:20 PM, Stephen Frost  wrote:
>> +1.
>> 
>> I was doing testing the other day and ran into the "pg_dump doesn't
>> support shell types" issue and it was annoyingly confusing.

> Is anyone working on this?  Should it be added to the open items list?

I plan to work on it as soon as I'm done fixing the planner issue Andreas
found (which I'm almost done with).  Not sure whether we should consider
it a back-patchable bug fix or something to do only in HEAD, though ---
comments?

regards, tom lane




Re: [HACKERS] Reduce ProcArrayLock contention

2015-08-04 Thread Robert Haas
On Tue, Aug 4, 2015 at 11:29 AM, Robert Haas  wrote:
> 4. I factored out the actual XID-clearing logic into a new function
> ProcArrayEndTransactionInternal instead of repeating it twice. On the
> flip side, I merged PushProcAndWaitForXidClear with
> PopProcsAndClearXids and renamed the result to ProcArrayGroupClearXid,
> since there seemed to be no need to separate them.

Thinking about this a bit more, it's probably worth sticking an
"inline" designation on ProcArrayEndTransactionInternal.  Keeping the
time for which we hold ProcArrayLock in exclusive mode down to the
absolute minimum possible number of instructions seems like a good
plan.
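
Concretely, the suggestion amounts to no more than this (sketch; the real
body is in the patch upthread):

static inline void
ProcArrayEndTransactionInternal(PGPROC *proc, PGXACT *pgxact,
								TransactionId latestXid)
{
	/* clear xid, xmin and flags and advance latestCompletedXid exactly
	 * as in the patch; 'inline' keeps the exclusive-ProcArrayLock window
	 * at the minimum number of instructions */
}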

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] tablecmds.c and lock hierarchy

2015-08-04 Thread Alvaro Herrera
Michael Paquier wrote:
> On Tue, Aug 4, 2015 at 2:43 PM, Alvaro Herrera  
> wrote:

> > Maybe the solution to this is to add the concept of "addition" of two
> > lock modes, where the result is another lock mode that conflicts with
> > any lock that would conflict with either of the two operand lock modes.

> That's commutative: basically we look at the conflict table and take the
> union of the bits for all the locks conflicting with lock A and with
> lock B, and then we select the weakest lock mode on the table whose
> conflict mask includes that whole union.

Yes.

> Now, let's take for example this case with locks A, B, C, D:
> - Lock A conflicts with ACD
> - B with BCD
> - C with itself
> - D with itself
> What would you choose as a result of add(C,D)? A or B? Or the super
> lock conflicting with all of them?

This appears to me to be a hypothetical case that doesn't occur in our
conflicts table, so I wouldn't worry about it.
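
For what it's worth, the "addition" could be sketched against lock.c's
conflict table roughly like this; LockConflicts[] and MAX_LOCKMODES are
real, but the helper is hypothetical (it would have to live in lock.c,
where the table is static) and leans on the table being ordered roughly
weakest-first:

static LOCKMODE
LockModeAdd(LOCKMODE a, LOCKMODE b)
{
	LOCKMASK	want = LockConflicts[a] | LockConflicts[b];
	LOCKMODE	mode;

	/* return the first (weakest) mode whose conflict mask covers
	 * everything either operand conflicts with */
	for (mode = 1; mode < MAX_LOCKMODES; mode++)
	{
		if ((LockConflicts[mode] & want) == want)
			return mode;
	}
	return AccessExclusiveLock;	/* always a safe fallback */
}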

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] track_commit_timestamp and COMMIT PREPARED

2015-08-04 Thread Robert Haas
On Tue, Aug 4, 2015 at 1:18 PM, Andres Freund  wrote:
> On 2015-08-04 13:16:52 -0400, Robert Haas wrote:
>> On Mon, Aug 3, 2015 at 10:31 AM, Fujii Masao  wrote:
>> > track_commit_timestamp tracks COMMIT PREPARED as expected in standby 
>> > server,
>> > but not in master server. Is this intentional? It should track COMMIT 
>> > PREPARED
>> > even in master? Otherwise, we cannot use commit_timestamp feature to check
>> > the replication lag properly while we use 2PC.
>>
>> That sounds like it must be a bug.  I think you should add it to the
>> open items list.
>
> Yea, completely agreed. It's probably because twophase.c duplicates a
> good bit of xact.c. How about we also add a warning to the respective
> places that one should always check the others?

Sounds like a good idea.  Or we could try to, heh heh, refactor away
the duplication.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] FSM versus GIN pending list bloat

2015-08-04 Thread Alvaro Herrera
Jeff Janes wrote:

> The attached proof of concept patch greatly improves the bloat for both the
> insert and the update cases.  You need to turn on both features: adding the
> pages to fsm, and vacuuming the fsm, to get the benefit (so JJ_GIN=3).  The
> first of those two things could probably be adopted for real, but the
> second probably is not acceptable.  What is the right way to do this?

I think that's the right way, only that I wouldn't call FSMVacuum()
inside the loop but only once you're out of it.

FWIW I had to add a FSMVacuum call somewhere in the BRIN code also,
which I didn't like at all.  If there's a way to optimize this "bubbling
up" of some individual page all the way to the root level, that would
probably be a lot better than what it's currently doing.
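
In code, the shape being described is simply to keep recording pages
inside the loop and move the vacuum call after it; RecordFreeIndexPage()
and IndexFreeSpaceMapVacuum() are existing index-FSM APIs, while the
wrapper and its arguments are placeholders:

#include "postgres.h"
#include "storage/indexfsm.h"

static void
record_freed_pending_pages(Relation index, BlockNumber *freed, int nfreed)
{
	int			i;

	for (i = 0; i < nfreed; i++)
		RecordFreeIndexPage(index, freed[i]);

	/* a single pass over the map, after the loop, bubbles the newly
	 * recorded free space up toward the root */
	IndexFreeSpaceMapVacuum(index);
}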

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] track_commit_timestamp and COMMIT PREPARED

2015-08-04 Thread Andres Freund
On 2015-08-04 13:16:52 -0400, Robert Haas wrote:
> On Mon, Aug 3, 2015 at 10:31 AM, Fujii Masao  wrote:
> > track_commit_timestamp tracks COMMIT PREPARED as expected in standby server,
> > but not in master server. Is this intentional? It should track COMMIT 
> > PREPARED
> > even in master? Otherwise, we cannot use commit_timestamp feature to check
> > the replication lag properly while we use 2PC.
> 
> That sounds like it must be a bug.  I think you should add it to the
> open items list.

Yea, completely agreed. It's probably because twophase.c duplicates a
good bit of xact.c. How about we also add a warning to the respective
places that one should always check the others?




Re: [HACKERS] track_commit_timestamp and COMMIT PREPARED

2015-08-04 Thread Robert Haas
On Mon, Aug 3, 2015 at 10:31 AM, Fujii Masao  wrote:
> track_commit_timestamp tracks COMMIT PREPARED as expected in standby server,
> but not in master server. Is this intentional? It should track COMMIT PREPARED
> even in master? Otherwise, we cannot use commit_timestamp feature to check
> the replication lag properly while we use 2PC.

That sounds like it must be a bug.  I think you should add it to the
open items list.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Doubt about AccessExclusiveLock in ALTER TABLE .. SET ( .. );

2015-08-04 Thread Alvaro Herrera
Andres Freund wrote:
> On 2015-08-03 14:15:27 +0900, Michael Paquier wrote:

> > As long as this only applies on master, this may be fine... We could
> > basically pass a LOCKMASK to the multiple layers of tablecmds.c
> > instead of LOCKMODE to track all the locks that need to be taken, and
> > all the relations open during operations.
> 
> This sounds far too complicated to me. Just LockRelationOid() the
> relation with the appropriate level everytime you pass through the
> function?

That opens the door to lock escalation and deadlocks, doesn't it?  You are
probably thinking that it's okay to ignore those but I don't necessarily
agree with that.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] patch: prevent user from setting wal_buffers over 2GB bytes

2015-08-04 Thread Robert Haas
On Tue, Aug 4, 2015 at 9:52 AM, Andres Freund  wrote:
>> So my vote is for the original approach.  (I've not read Josh's patch,
>> so there might be something wrong with it in detail, but I like the
>> basic approach.)
>
> +1

OK, committed and back-patched that all the way back to 9.0.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Tab completion for CREATE SEQUENCE

2015-08-04 Thread Robert Haas
On Mon, Aug 3, 2015 at 2:17 AM, Michael Paquier
 wrote:
> On Tue, Jul 7, 2015 at 9:14 PM, Andres Freund  wrote:
>> On 2015-06-19 06:41:19 +, Brendan Jurd wrote:
>>> I'm marking this "Waiting on Author".  Once the problems have been
>>> corrected, it should be ready for a committer.
>>
>> Vik, are you going to update the patch?
>
> Seeing no activity on this thread and as it would be a waste not to do
> it, I completed the patch as attached. The following things are done:
> - Fixed code indentation
> - Removal of "RESTART", "SET SCHEMA", "OWNER TO", and "RENAME TO" in
> CREATE SEQUENCE
> - Added "START WITH".
> - Added TEMP/TEMPORARY in the set of keywords tracked.
> At the same time, I am switching this patch to "Ready for committer".

You didn't remove RESTART, so I did that.  And this needed some more
parentheses to make my compiler happy.

Committed with those changes.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Reduce ProcArrayLock contention

2015-08-04 Thread Robert Haas
On Tue, Aug 4, 2015 at 11:50 AM, Amit Kapila  wrote:
> I have kept barriers based on comments on top of atomic read, refer
> below code:
>
>  * No barrier semantics.
>  */
> STATIC_IF_INLINE uint32
> pg_atomic_read_u32(volatile pg_atomic_uint32 *ptr)
>
> Note - The function header comments on pg_atomic_read_u32 and
> corresponding write call seem to be reversed, but that is something
> separate.

That doesn't matter, because the compare-and-exchange *is* a barrier.
Putting a barrier between a store and an operation that is already a
barrier doesn't do anything useful.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Reduce ProcArrayLock contention

2015-08-04 Thread Andres Freund
On 2015-08-04 21:20:20 +0530, Amit Kapila wrote:
> I have kept barriers based on comments on top of atomic read, refer
> below code:


>  * No barrier semantics.
>  */
> STATIC_IF_INLINE uint32
> pg_atomic_read_u32(volatile pg_atomic_uint32 *ptr)
> 
> Note - The function header comments on pg_atomic_read_u32 and
> corresponding write call seem to be reversed, but that is something
> separate.

Well, the question is whether you *need* barrier semantics in that
place. If you just have a retry loop around a compare/exchange there's
no point in having one, it'll just cause needless slowdown due to
another bus-lock.
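
The pattern in question, sketched with the real pg_atomic primitives (the
flag-setting use case is illustrative; pg_atomic_fetch_or_u32() would do
the same job directly):

#include "postgres.h"
#include "port/atomics.h"

static void
atomic_set_flag(volatile pg_atomic_uint32 *flags, uint32 mask)
{
	uint32		old = pg_atomic_read_u32(flags);	/* plain read, no barrier */

	/* the compare-and-exchange is itself a full barrier, so wrapping the
	 * loop in explicit pg_memory_barrier() calls would only buy another
	 * bus lock; on failure 'old' is refreshed with the current value */
	while (!pg_atomic_compare_exchange_u32(flags, &old, old | mask))
		;
}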




Re: [HACKERS] Incorrect comment about abbreviated keys

2015-08-04 Thread Robert Haas
On Sun, Aug 2, 2015 at 5:11 AM, Peter Geoghegan  wrote:
> Attached patch fixes this issue. This was missed by
> 78efd5c1edb59017f06ef96773e64e6539bfbc86

Committed and back-patched to 9.5.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Reduce ProcArrayLock contention

2015-08-04 Thread Andres Freund
On 2015-08-04 11:43:45 -0400, Robert Haas wrote:
> On Tue, Aug 4, 2015 at 11:33 AM, Andres Freund  wrote:
> > Actually by far not all system calls are full barriers?
> 
> How do we know which ones are and which ones are not?

Good question. Reading the source code of all implementations I suppose
:(

E.g. gettimeofday()/clock_gettime(), getpid() on linux aren't
barriers.

> I can't believe PGSemaphoreUnlock isn't a barrier.  That would be cruel.

Yea, I think that's a pretty safe bet. I mean even if you'd implement it
locklessly in the kernel, that'd still employ significant enough
barriers/atomic ops itself.





Re: [HACKERS] upgrade failure from 9.5 to head

2015-08-04 Thread Robert Haas
On Sun, Aug 2, 2015 at 8:20 PM, Stephen Frost  wrote:
> +1.
>
> I was doing testing the other day and ran into the "pg_dump doesn't
> support shell types" issue and it was annoyingly confusing.

Is anyone working on this?  Should it be added to the open items list?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Reduce ProcArrayLock contention

2015-08-04 Thread Amit Kapila
On Tue, Aug 4, 2015 at 8:59 PM, Robert Haas  wrote:
>
> On Mon, Aug 3, 2015 at 8:39 AM, Amit Kapila  wrote:
> >> I agree and modified the patch to use 32-bit atomics based on idea
> >> suggested by Robert and didn't modify lwlock.c.
> >
> > While looking at patch, I found that the way it was initialising the list
> > to be empty was wrong, it was using pgprocno as 0 to initialise the
> > list, as 0 is a valid pgprocno.  I think we should use a number greater
> > than PROCARRAY_MAXPROC (maximum number of procs in proc
> > array).
> >
> > Apart from above fix, I have modified src/backend/access/transam/README
> > to include the information about the improvement this patch brings to
> > reduce ProcArrayLock contention.
>
> I spent some time looking at this patch today and found that a good
> deal of cleanup work seemed to be needed.  Attached is a cleaned-up
> version which makes a number of changes:
>
> 1. I got rid of all of the typecasts.  You're supposed to treat
> pg_atomic_u32 as a magic data type that is only manipulated via the
> primitives provided, not just cast back and forth between that and
> u32.
>
> 2. I got rid of the memory barriers.  System calls are full barriers,
> and so are compare-and-exchange operations.  Between those two facts,
> we should be fine without these.
>

I have kept barriers based on comments on top of atomic read, refer
below code:

 * No barrier semantics.
 */
STATIC_IF_INLINE uint32
pg_atomic_read_u32(volatile pg_atomic_uint32 *ptr)

Note - The function header comments on pg_atomic_read_u32 and
corresponding write call seem to be reversed, but that is something
separate.

>
> I'm not entirely happy with the name "nextClearXidElem" but apart from
> that I'm fairly happy with this version.  We should probably test it
> to make sure I haven't broken anything;

Okay, will look into it tomorrow.



With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Reduce ProcArrayLock contention

2015-08-04 Thread Robert Haas
On Tue, Aug 4, 2015 at 11:33 AM, Andres Freund  wrote:
> On 2015-08-04 11:29:39 -0400, Robert Haas wrote:
>> On Mon, Aug 3, 2015 at 8:39 AM, Amit Kapila  wrote:
>> 1. I got rid of all of the typecasts.  You're supposed to treat
>> pg_atomic_u32 as a magic data type that is only manipulated via the
>> primitives provided, not just cast back and forth between that and
>> u32.
>
> Absolutely. Otherwise no fallbacks can work.
>
>> 2. I got rid of the memory barriers.  System calls are full barriers,
>> and so are compare-and-exchange operations.  Between those two facts,
>> we should be fine without these.
>
> Actually by far not all system calls are full barriers?

How do we know which ones are and which ones are not?

I can't believe PGSemaphoreUnlock isn't a barrier.  That would be cruel.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] brin index vacuum versus transaction snapshots

2015-08-04 Thread Robert Haas
On Fri, Jul 31, 2015 at 3:45 PM, Alvaro Herrera
 wrote:
>> I think the real solution to this problem is to avoid use of
>> GetTransactionSnapshot(), and instead use GetLatestSnapshot().  As far
>> as I can see, that should completely close the hole.  This requires
>> patching IndexBuildHeapRangeScan() to allow for that.
>
> Actually I think there's another problem: if a transaction starts and
> inserts a tuple into the page range, then goes to sleep, and then
> another session does the summarization of the page range, session 1 is
> seen as "in progress" by session 2 (so the scan does not see the new
> tuple), but the placeholder tuple was not modified either because it was
> inserted later than the snapshot.  So the update is lost.
>
> I think the only way to close this hole is to have summarize_range()
> sleep until all open snapshots are gone after inserting the placeholder
> tuple and before acquiring the snapshot, similarly to how CREATE INDEX
> CONCURRENTLY does it.

That's gonna be really slow, though, right?  Even if you rework things
so that vacuum inserts all the placeholder tuples first, then waits,
then does all the summarization, that could easily turn a vacuum that
would have finished in a second into one that instead takes multiple
hours.  During that time an AV worker is pinned down, and all sorts of
badness can ensue.

Maybe I'm misunderstanding, but this whole thing seems like a pretty
serious problem for BRIN.  :-(
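
For reference, the CREATE INDEX CONCURRENTLY-style wait under discussion
boils down to something like the following, built from the same
primitives indexcmds.c uses; the function itself and where it would sit
in the BRIN code are assumptions:

#include "postgres.h"
#include "storage/lmgr.h"
#include "storage/proc.h"
#include "storage/procarray.h"

static void
wait_for_older_snapshots(TransactionId limitXmin)
{
	VirtualTransactionId *old_snapshots;
	int			n_old_snapshots;
	int			i;

	/* collect the VXIDs of transactions whose snapshots predate ours */
	old_snapshots = GetCurrentVirtualXIDs(limitXmin, true, false,
										  PROC_IS_AUTOVACUUM | PROC_IN_VACUUM,
										  &n_old_snapshots);

	/* block on each in turn; this wait is what could pin down an
	 * autovacuum worker for a long time */
	for (i = 0; i < n_old_snapshots; i++)
	{
		if (VirtualTransactionIdIsValid(old_snapshots[i]))
			VirtualXactLock(old_snapshots[i], true);
	}
}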

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Reduce ProcArrayLock contention

2015-08-04 Thread Andres Freund
On 2015-08-04 11:29:39 -0400, Robert Haas wrote:
> On Mon, Aug 3, 2015 at 8:39 AM, Amit Kapila  wrote:
> 1. I got rid of all of the typecasts.  You're supposed to treat
> pg_atomic_u32 as a magic data type that is only manipulated via the
> primitives provided, not just cast back and forth between that and
> u32.

Absolutely. Otherwise no fallbacks can work.

> 2. I got rid of the memory barriers.  System calls are full barriers,
> and so are compare-and-exchange operations.  Between those two facts,
> we should be fine without these.

Actually by far not all system calls are full barriers?

Regards,

Andres




Re: [HACKERS] FSM versus GIN pending list bloat

2015-08-04 Thread Tom Lane
Simon Riggs  writes:
> On 4 August 2015 at 15:18, Andres Freund  wrote:
>> The only thing that variable does is change what the pending size limit
>> is determined by. Previously it was work_mem, now it's
>> gin_pending_list_limit. Imo that has pretty much nothing to do with not
>> registering pages as free.

> We've made a change to the way GIN fast update works and that needs to be
> an effective change, not one that immediately triggers a secondary
> annoyance for the user.

Said annoyance has been there since day one, no?

> I've asked some questions; they may turn into
> actions, or not, but they are open items.

If it's not a new problem introduced by 9.5, I don't think it's
appropriate to insist that 9.5 must solve it.

regards, tom lane




Re: [HACKERS] Reduce ProcArrayLock contention

2015-08-04 Thread Robert Haas
On Mon, Aug 3, 2015 at 8:39 AM, Amit Kapila  wrote:
>> I agree and modified the patch to use 32-bit atomics based on idea
>> suggested by Robert and didn't modify lwlock.c.
>
> While looking at patch, I found that the way it was initialising the list
> to be empty was wrong, it was using pgprocno as 0 to initialise the
> list, as 0 is a valid pgprocno.  I think we should use a number greater
> than PROCARRAY_MAXPROC (maximum number of procs in proc
> array).
>
> Apart from above fix, I have modified src/backend/access/transam/README
> to include the information about the improvement this patch brings to
> reduce ProcArrayLock contention.

I spent some time looking at this patch today and found that a good
deal of cleanup work seemed to be needed.  Attached is a cleaned-up
version which makes a number of changes:

1. I got rid of all of the typecasts.  You're supposed to treat
pg_atomic_u32 as a magic data type that is only manipulated via the
primitives provided, not just cast back and forth between that and
u32.

2. I got rid of the memory barriers.  System calls are full barriers,
and so are compare-and-exchange operations.  Between those two facts,
we should be fine without these.

3. I removed the clearXid field you added to PGPROC.  Since that's
just a sentinel, I used nextClearXidElem for that purpose. There
doesn't seem to be a need for two fields.

4. I factored out the actual XID-clearing logic into a new function
ProcArrayEndTransactionInternal instead of repeating it twice. On the
flip side, I merged PushProcAndWaitForXidClear with
PopProcsAndClearXids and renamed the result to ProcArrayGroupClearXid,
since there seemed to be no need to separate them.

5. I renamed PROC_NOT_IN_PGPROCARRAY to INVALID_PGPROCNO, which I
think is more consistent with what we do elsewhere, and made it a
compile-time constant instead of a value that must be computed every
time it's used.

6. I overhauled the comments and README updates.

I'm not entirely happy with the name "nextClearXidElem" but apart from
that I'm fairly happy with this version.  We should probably test it
to make sure I haven't broken anything; on a quick 3-minute pgbench
test on cthulhu (128-way EDB server) with 128 clients I got 2778 tps
with master and 4330 tps with this version of the patch.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index bc68b47..f6db580 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -252,6 +252,9 @@ implementation of this is that GetSnapshotData takes the ProcArrayLock in
 shared mode (so that multiple backends can take snapshots in parallel),
 but ProcArrayEndTransaction must take the ProcArrayLock in exclusive mode
 while clearing MyPgXact->xid at transaction end (either commit or abort).
+(To reduce context switching, when multiple transactions commit nearly
+simultaneously, we have one backend take ProcArrayLock and clear the XIDs
+of multiple processes at once.)
 
 ProcArrayEndTransaction also holds the lock while advancing the shared
 latestCompletedXid variable.  This allows GetSnapshotData to use
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index 4f3c5c9..14d1219 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -167,6 +167,9 @@ static int KnownAssignedXidsGetAndSetXmin(TransactionId *xarray,
 static TransactionId KnownAssignedXidsGetOldestXmin(void);
 static void KnownAssignedXidsDisplay(int trace_level);
 static void KnownAssignedXidsReset(void);
+static void ProcArrayEndTransactionInternal(PGPROC *proc, PGXACT *pgxact,
+TransactionId latestXid);
+static void ProcArrayGroupClearXid(PGPROC *proc, TransactionId latestXid);
 
 /*
  * Report shared-memory space needed by CreateSharedProcArray.
@@ -370,7 +373,6 @@ ProcArrayRemove(PGPROC *proc, TransactionId latestXid)
 	elog(LOG, "failed to find proc %p in ProcArray", proc);
 }
 
-
 /*
  * ProcArrayEndTransaction -- mark a transaction as no longer running
  *
@@ -399,26 +401,18 @@ ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid)
 		 */
 		Assert(TransactionIdIsValid(allPgXact[proc->pgprocno].xid));
 
-		LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-
-		pgxact->xid = InvalidTransactionId;
-		proc->lxid = InvalidLocalTransactionId;
-		pgxact->xmin = InvalidTransactionId;
-		/* must be cleared with xid/xmin: */
-		pgxact->vacuumFlags &= ~PROC_VACUUM_STATE_MASK;
-		pgxact->delayChkpt = false;		/* be sure this is cleared in abort */
-		proc->recoveryConflictPending = false;
-
-		/* Clear the subtransaction-XID cache too while holding the lock */
-		pgxact->nxids = 0;
-		pgxact->overflowed = false;
-
-		/* Also advance global latestCompletedXid while holding the lock */
-		if (TransactionIdPrecedes(ShmemVariableCache->latestCompletedXid,
-  latestXid))
-			ShmemVariableCache->latestComplete

Re: [HACKERS] Autonomous Transaction is back

2015-08-04 Thread Merlin Moncure
On Tue, Aug 4, 2015 at 4:12 AM, Rajeev rastogi
 wrote:
> On 03 August 2015 18:40, Merlin Moncure [mailto:mmonc...@gmail.com] Wrote:
>>On Sun, Aug 2, 2015 at 11:37 PM, Rajeev rastogi
>> wrote:
>>> On 31 July 2015 23:10, Robert Haas Wrote:
>>>>I think we're going entirely down the wrong path here.  Why is it ever
>>>>useful for a backend's lock requests to conflict with themselves, even
>>>>with autonomous transactions?  That seems like an artifact of somebody
>>>>else's implementation that we should be happy we don't need to copy.
>>>
>>> IMHO, since most of the locking is managed at transaction level, not
>>> backend level, and we consider the main & autonomous transactions to be
>>> independent transactions, practically they may conflict, right?
>>> It is also right, as you said, that there are no really useful use-cases
>>> where an autonomous transaction conflicts with the main (parent)
>>> transaction. But we cannot take that for granted, as a user might make a
>>> mistake. So at least we should have some mechanism to handle this rare
>>> case, for which as of now I think throwing an error from the autonomous
>>> transaction is one of the solutions. Once the error is thrown from the
>>> autonomous transaction, the main transaction may continue as it is (or
>>> abort the main transaction also??).
>>
>>hm.  OK, what's the behavior of:
>>
>>BEGIN
>>  UPDATE foo SET x = x + 1 WHERE foo_id = 1;
>>
>>  BEGIN WITH AUTONOMOUS TRANSACTION
>>UPDATE foo SET x = x + 1 WHERE foo_id = 1;
>>  END;
>>
>>  RAISE EXCEPTION ...;
>>EXCEPTION ...
>>
>>END;
>
> It should throw an error (or something equivalent), as the second update will
> wait for the record lock to get released, which in this case will not happen
> till the second update finishes. So, a catch-22.

Yeah. Point being, from my point of view autonomous transactions have
to conflict with the master transaction (or any transaction really).
I agree the right course of action is to error out immediately...what
else could you do?  There isn't even a deadlock in the classic sense
and allowing control to continue would result in indeterminate
behavior FWICT.

>>Also,
>>*) What do the other candidate implementations do?  IMO, compatibility
>>should be the underlying design principle.
>
> Oracle throws an error in such cases. But we can decide on what behavior we
> want to keep.

gotcha.  makes sense.

>>*) What will the "SQL only" feature look like?
>
> Similar to PL as mentioned in your example, we can provide the "SQL only" 
> feature also.
>
>>*) Is the SPI interface going to be extended to expose AT?
>
> I don’t think at this point that there is any need of exposing SPI interface 
> for this.

Ok, how do ATs work in a non-plpgsql ("SQL only") scenario?  Are you
going to similarly extend BEGIN TRANSACTION?

merlin




Re: [HACKERS] FSM versus GIN pending list bloat

2015-08-04 Thread Simon Riggs
On 4 August 2015 at 15:18, Andres Freund  wrote:

> On 2015-08-04 14:59:11 +0100, Simon Riggs wrote:
> > On 4 August 2015 at 14:55, Heikki Linnakangas  wrote:
> >
> > > On 08/04/2015 04:35 PM, Simon Riggs wrote:
> > >
> > >> This and the OP seem like 9.5 open items to me.
> > >>
> > >
> > > Why? This is nothing new in 9.5.
> >
> >
> > gin_pending_list_limit is new in 9.5
> >
> > We're in Alpha, so if something has been added and isn't very usable, we
> > should change it while we can.
>
> The only thing that variable does is change what the pending size limit
> is determined by. Previously it was work_mem, now it's
> gin_pending_list_limit. Imo that has pretty much nothing to do with not
> registering pages as free.
>

We've made a change to the way GIN fast update works and that needs to be
an effective change, not one that immediately triggers a secondary
annoyance for the user. I've asked some questions; they may turn into
actions, or not, but they are open items.

-- 
Simon Riggs                   http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Sharing aggregate states between different aggregate functions

2015-08-04 Thread Heikki Linnakangas

On 08/03/2015 08:53 AM, David Rowley wrote:
> Attached is a delta patch which is based
> on sharing_aggstate-heikki-2.patch to fix up the out-dated comments and
> also a few more test scenarios which test that the sharing works with
> matching INITCONDs and that it does not when they don't match.
>
> What do you think?


I committed this, after some more cleanup of the comments. Thanks!

- Heikki





Re: [HACKERS] Doubt about AccessExclusiveLock in ALTER TABLE .. SET ( .. );

2015-08-04 Thread Fabrízio de Royes Mello
On Tue, Aug 4, 2015 at 5:55 AM, Andres Freund  wrote:
>
> On 2015-08-03 14:15:27 +0900, Michael Paquier wrote:
> > On Sat, Aug 1, 2015 at 9:20 PM, Andres Freund wrote:
> > > On August 1, 2015 2:17:24 PM GMT+02:00, Michael Paquier wrote:
> > >>> For instance, if you told me to choose between ShareLock and
> > >>> ShareUpdateExclusiveLock I wouldn't know which one is strongest.  I
> > >>> don't it's sensible to have the "lock mode compare" primitive
> > >>honestly.
> > >>> I don't have any great ideas to offer ATM sadly.
> > >>
> > >>Yes, the thing is that lowering the lock levels is good for
> > >>concurrency, but the non-monotony of the lock levels makes it
> > >>impossible to choose an intermediate state correctly.
> > >
> > > How about simply acquiring all the locks individually of they're
different types? These few acquisitions won't matter.
> >
> > As long as this only applies on master, this may be fine... We could
> > basically pass a LOCKMASK to the multiple layers of tablecmds.c
> > instead of LOCKMODE to track all the locks that need to be taken, and
> > all the relations open during operations.
>
> This sounds far too complicated to me. Just LockRelationOid() the
> relation with the appropriate level everytime you pass through the
> function?

Hi all,

IMHO it is simpler if we just fall back to AccessExclusiveLock when there are
different lock modes in the reloptions, as Michael suggested before.

Look at the new version attached.

Regards,


*** This work is funded by Zenvia Mobile Results (http://www.zenvia.com.br)

--
Fabrízio de Royes Mello
Consultoria/Coaching PostgreSQL
>> Timbira: http://www.timbira.com.br
>> Blog: http://fabriziomello.github.io
>> Linkedin: http://br.linkedin.com/in/fabriziomello
>> Twitter: http://twitter.com/fabriziomello
>> Github: http://github.com/fabriziomello
diff --git a/doc/src/sgml/ref/alter_table.sgml b/doc/src/sgml/ref/alter_table.sgml
index 1c1c181..ad985cd 100644
--- a/doc/src/sgml/ref/alter_table.sgml
+++ b/doc/src/sgml/ref/alter_table.sgml
@@ -543,6 +543,10 @@ ALTER TABLE ALL IN TABLESPACE name
   of ALTER TABLE that forces a table rewrite.
  
 
+ 
+  Changing autovacuum storage parameters acquires a SHARE UPDATE EXCLUSIVE lock.
+ 
+
  
   
While CREATE TABLE allows OIDS to be specified
diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index 180f529..e0f9f09 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -57,7 +57,8 @@ static relopt_bool boolRelOpts[] =
 		{
 			"autovacuum_enabled",
 			"Enables autovacuum in this relation",
-			RELOPT_KIND_HEAP | RELOPT_KIND_TOAST
+			RELOPT_KIND_HEAP | RELOPT_KIND_TOAST,
+			ShareUpdateExclusiveLock
 		},
 		true
 	},
@@ -65,7 +66,8 @@ static relopt_bool boolRelOpts[] =
 		{
 			"user_catalog_table",
 			"Declare a table as an additional catalog table, e.g. for the purpose of logical replication",
-			RELOPT_KIND_HEAP
+			RELOPT_KIND_HEAP,
+			AccessExclusiveLock
 		},
 		false
 	},
@@ -73,7 +75,8 @@ static relopt_bool boolRelOpts[] =
 		{
 			"fastupdate",
 			"Enables \"fast update\" feature for this GIN index",
-			RELOPT_KIND_GIN
+			RELOPT_KIND_GIN,
+			AccessExclusiveLock
 		},
 		true
 	},
@@ -81,7 +84,8 @@ static relopt_bool boolRelOpts[] =
 		{
 			"security_barrier",
 			"View acts as a row security barrier",
-			RELOPT_KIND_VIEW
+			RELOPT_KIND_VIEW,
+			AccessExclusiveLock
 		},
 		false
 	},
@@ -95,7 +99,8 @@ static relopt_int intRelOpts[] =
 		{
 			"fillfactor",
 			"Packs table pages only to this percentage",
-			RELOPT_KIND_HEAP
+			RELOPT_KIND_HEAP,
+			AccessExclusiveLock
 		},
 		HEAP_DEFAULT_FILLFACTOR, HEAP_MIN_FILLFACTOR, 100
 	},
@@ -103,7 +108,8 @@ static relopt_int intRelOpts[] =
 		{
 			"fillfactor",
 			"Packs btree index pages only to this percentage",
-			RELOPT_KIND_BTREE
+			RELOPT_KIND_BTREE,
+			AccessExclusiveLock
 		},
 		BTREE_DEFAULT_FILLFACTOR, BTREE_MIN_FILLFACTOR, 100
 	},
@@ -111,7 +117,8 @@ static relopt_int intRelOpts[] =
 		{
 			"fillfactor",
 			"Packs hash index pages only to this percentage",
-			RELOPT_KIND_HASH
+			RELOPT_KIND_HASH,
+			AccessExclusiveLock
 		},
 		HASH_DEFAULT_FILLFACTOR, HASH_MIN_FILLFACTOR, 100
 	},
@@ -119,7 +126,8 @@ static relopt_int intRelOpts[] =
 		{
 			"fillfactor",
 			"Packs gist index pages only to this percentage",
-			RELOPT_KIND_GIST
+			RELOPT_KIND_GIST,
+			AccessExclusiveLock
 		},
 		GIST_DEFAULT_FILLFACTOR, GIST_MIN_FILLFACTOR, 100
 	},
@@ -127,7 +135,8 @@ static relopt_int intRelOpts[] =
 		{
 			"fillfactor",
 			"Packs spgist index pages only to this percentage",
-			RELOPT_KIND_SPGIST
+			RELOPT_KIND_SPGIST,
+			AccessExclusiveLock
 		},
 		SPGIST_DEFAULT_FILLFACTOR, SPGIST_MIN_FILLFACTOR, 100
 	},
@@ -135,7 +144,8 @@ static relopt_int intRelOpts[] =
 		{
 			"autovacuum_vacuum_threshold",
 			"Minimum number of tuple updates or deletes prior to vacuum",
-			RELOPT_KIND_HEAP | RELOPT_KIND_TOAST
+			RELOPT_KI

Re: [HACKERS] FSM versus GIN pending list bloat

2015-08-04 Thread Andres Freund
On 2015-08-04 14:59:11 +0100, Simon Riggs wrote:
> On 4 August 2015 at 14:55, Heikki Linnakangas  wrote:
> 
> > On 08/04/2015 04:35 PM, Simon Riggs wrote:
> >
> >> This and the OP seem like 9.5 open items to me.
> >>
> >
> > Why? This is nothing new in 9.5.
> 
> 
> gin_pending_list_limit is new in 9.5
> 
> We're in Alpha, so if something has been added and isn't very usable, we
> should change it while we can.

The only thing that variable does is change what the pending size limit
is determined by. Previously it was work_mem, now it's
gin_pending_list_limit. Imo that has pretty much nothing to do with not
registering pages as free.

Andres




Re: [HACKERS] FSM versus GIN pending list bloat

2015-08-04 Thread Simon Riggs
On 4 August 2015 at 14:55, Heikki Linnakangas  wrote:

> On 08/04/2015 04:35 PM, Simon Riggs wrote:
>
>> This and the OP seem like 9.5 open items to me.
>>
>
> Why? This is nothing new in 9.5.


gin_pending_list_limit is new in 9.5

We're in Alpha, so if something has been added and isn't very usable, we
should change it while we can.

-- 
Simon Riggs                   http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] FSM versus GIN pending list bloat

2015-08-04 Thread Heikki Linnakangas

On 08/04/2015 04:35 PM, Simon Riggs wrote:
> This and the OP seem like 9.5 open items to me.

Why? This is nothing new in 9.5.

- Heikki





Re: [HACKERS] cost_agg() with AGG_HASHED does not account for startup costs

2015-08-04 Thread Tom Lane
David Rowley  writes:
> During working on allowing the planner to perform GROUP BY before joining
> I've noticed that cost_agg() completely ignores input_startup_cost
> when aggstrategy == AGG_HASHED.

Isn't your proposed patch double-counting the input startup cost?
input_total_cost already includes that charge.  The calculation
reflects the fact that we have to read all of the input before we
can deliver any aggregated results, so the time to get the first
input row isn't really interesting.

If this were wrong, the PLAIN costing path would also be wrong, but
I don't think that either one is.
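
Spelled out (a sketch of the argument, not cost_agg()'s actual code):

/*
 * For AGG_HASHED no row can be emitted until the whole input has been
 * read, so
 *
 *     startup_cost = input_total_cost + cost_to_build_hash_table
 *     total_cost   = startup_cost + cost_to_emit_groups
 *
 * and input_total_cost already folds in input_startup_cost once; adding
 * input_startup_cost separately would count it twice.
 */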

regards, tom lane




Re: [HACKERS] RFC: replace pg_stat_activity.waiting with something more descriptive

2015-08-04 Thread Heikki Linnakangas

On 08/04/2015 03:15 PM, Robert Haas wrote:
> On Tue, Jul 28, 2015 at 3:28 PM, Heikki Linnakangas  wrote:
>> * The patch requires that the LWLOCK_INDIVIDUAL_NAMES array is kept in sync
>> with the list of individual locks in lwlock.h. Sooner or later someone will
>> add an LWLock and forget to update the names-array. That needs to be made
>> less error-prone, so that the names are maintained in the same place as the
>> #defines. Perhaps something like rmgrlist.h.
>
> This is a good idea, but it's not easy to do in the style of
> rmgrlist.h, because I don't believe there's any way to define a macro
> that expands to a preprocessor directive.  Attached is a patch that
> instead generates the list of macros from a text file, and also
> generates an array inside lwlock.c with the lock names that gets used
> by the Trace_lwlocks stuff where applicable.
>
> Any objections to this solution to the problem?  If not, I'd like to
> go ahead and push this much.  I can't test the Windows changes
> locally, though, so it would be helpful if someone could check that
> out.


A more low-tech solution would be to do something like this in lwlocknames.c:

static char *MainLWLockNames[NUM_INDIVIDUAL_LWLOCKS];

/* Turn a pointer to one of the LWLocks in the main array into an index
number */

#define NAME_LWLOCK(l, name) MainLWLockNames[(l) - MainLWLockArray] = (name)

static void
InitLWLockNames(void)
{
  NAME_LWLOCK(ShmemIndexLock, "ShmemIndexLock");
  NAME_LWLOCK(OidGenLock, "OidGenLock");
  ...
}

That would not be auto-generated, so you'd need to keep that list in 
sync with lwlock.h, but it would be much better than the original patch 
because if you forgot to add an entry in the names-array, the numbering 
of all the other locks would not go wrong. And you could have a runtime 
check that complains if there's an entry missing, like Ildus did in his 
latest patch.
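
For illustration, such a check could be as simple as this (a sketch; the
function name and the message text are made up):

static void
CheckLWLockNames(void)
{
    int     i;

    for (i = 0; i < NUM_INDIVIDUAL_LWLOCKS; i++)
    {
        if (MainLWLockNames[i] == NULL)
            elog(WARNING, "individual LWLock %d has no name assigned", i);
    }
}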


I have no particular objection to your perl script either, though. I'll 
leave it up to you.


- Heikki





Re: [HACKERS] patch: prevent user from setting wal_buffers over 2GB bytes

2015-08-04 Thread Andres Freund
On 2015-08-04 09:49:58 -0400, Tom Lane wrote:
> Takashi Horikawa  writes:
>  Why does this cause a core dump?  We could consider fixing whatever
>  the problem is rather than capping the value.
> 
> > As far as I experiment with my own evaluation environment using 
> > PostgreSQL-9.4.4 on a x86_64 Linux, this problem can be fixed with the 
> > patch 
> > attached.
> 
> I'm unsure whether this represents a complete fix ... but even if it does,
> it would be awfully easy to re-introduce similar bugs in future code
> changes, and who would notice?  Josh's approach of restricting the buffer
> size seems a lot more robust.
> 
> If there were any practical use-case for such large WAL buffers then it
> might be worth spending some effort/risk here.  But AFAICS, there is not.
> Indeed, capping wal_buffers might be argued to be a good thing in itself
> because it would prevent users from wasting shared memory foolishly.
> 
> So my vote is for the original approach.  (I've not read Josh's patch,
> so there might be something wrong with it in detail, but I like the
> basic approach.)

+1




Re: [HACKERS] patch: prevent user from setting wal_buffers over 2GB bytes

2015-08-04 Thread Tom Lane
Takashi Horikawa  writes:
>> Why does this cause a core dump?  We could consider fixing whatever
>> the problem is rather than capping the value.

> As far as I experiment with my own evaluation environment using 
> PostgreSQL-9.4.4 on a x86_64 Linux, this problem can be fixed with the patch 
> attached.

I'm unsure whether this represents a complete fix ... but even if it does,
it would be awfully easy to re-introduce similar bugs in future code
changes, and who would notice?  Josh's approach of restricting the buffer
size seems a lot more robust.

If there were any practical use-case for such large WAL buffers then it
might be worth spending some effort/risk here.  But AFAICS, there is not.
Indeed, capping wal_buffers might be argued to be a good thing in itself
because it would prevent users from wasting shared memory foolishly.

So my vote is for the original approach.  (I've not read Josh's patch,
so there might be something wrong with it in detail, but I like the
basic approach.)

regards, tom lane




Re: [HACKERS] FSM versus GIN pending list bloat

2015-08-04 Thread Simon Riggs
On 4 August 2015 at 09:39, Simon Riggs  wrote:

> On 4 August 2015 at 06:03, Jeff Janes  wrote:
>
>
>> The attached proof of concept patch greatly improves the bloat for both
>> the insert and the update cases.  You need to turn on both features: adding
>> the pages to fsm, and vacuuming the fsm, to get the benefit (so JJ_GIN=3).
>> The first of those two things could probably be adopted for real, but the
>> second probably is not acceptable.  What is the right way to do this?
>> Could a variant of RecordFreeIndexPage bubble the free space up the map
>> immediately rather than waiting for a vacuum?  It would only have to move
>> up until it found a page with freespace already recorded in it, which the
>> vast majority of the time would mean observing up one level and then not
>> writing to it, assuming the pending list pages remain well clustered.
>>
>
> You make a good case for action here since insert only tables with GIN
> indexes on text are a common use case for GIN.
>
> Why would vacuuming the FSM be unacceptable? With a
> large gin_pending_list_limit it makes sense.
>
> If it is unacceptable, perhaps we can avoid calling it every time, or
> simply have FreeSpaceMapVacuum() terminate more quickly on some kind of
> 80/20 heuristic for this case.
>

Couple of questions here...

* the docs say "it's desirable to have pending-list cleanup occur in the
background", but there is no way to invoke that, except via VACUUM. I think
we need a separate function to be able to call this as a background action.
If we had that, we wouldn't need much else, would we?

* why do we have two parameters: gin_pending_list_limit and fastupdate?
What happens if we set gin_pending_list_limit but don't set fastupdate?

* how do we know how to set that parameter? Is there a way of knowing
gin_pending_list_limit has been reached?
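
(For reference, both knobs can apparently be set per index as storage
parameters; a sketch with a made-up table and column:

CREATE INDEX sale_note_idx ON sale USING gin (to_tsvector('english', note))
  WITH (fastupdate = on, gin_pending_list_limit = 1024);  -- limit is in kB

but that still doesn't tell the user how to pick the number.)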

This and the OP seem like 9.5 open items to me.

-- 
Simon Riggs    http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] RFC: replace pg_stat_activity.waiting with something more descriptive

2015-08-04 Thread Ildus Kurbangaliev

On 08/04/2015 03:15 PM, Robert Haas wrote:

On Tue, Jul 28, 2015 at 3:28 PM, Heikki Linnakangas  wrote:

>* The patch requires that the LWLOCK_INDIVIDUAL_NAMES array is kept in sync
>with the list of individual locks in lwlock.h. Sooner or later someone will
>add an LWLock and forget to update the names-array. That needs to be made
>less error-prone, so that the names are maintained in the same place as the
>#defines. Perhaps something like rmgrlist.h.

This is a good idea, but it's not easy to do in the style of
rmgrlist.h, because I don't believe there's any way to define a macro
that expands to a preprocessor directive.  Attached is a patch that
instead generates the list of macros from a text file, and also
generates an array inside lwlock.c with the lock names that gets used
by the Trace_lwlocks stuff where applicable.

Any objections to this solution to the problem?  If not, I'd like to
go ahead and push this much.  I can't test the Windows changes
locally, though, so it would be helpful if someone could check that
out.

In my latest patch I still have an array with names, but postgres will show
a message if somebody adds an individual LWLock and forgets to add its name.
Code generation is also a solution, and if committers support it I'll merge
it into the main patch.


--
Ildus Kurbangaliev
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: [HACKERS] [PATCH] postgres_fdw extension support

2015-08-04 Thread Paul Ramsey
Thanks so much Michael! Let me know when you have further feedback I should 
incorporate.

ATB,
P. 

-- 
http://postgis.net
http://cleverelephant.ca


On July 25, 2015 at 4:52:11 AM, Michael Paquier (michael.paqu...@gmail.com) 
wrote:

On Sat, Jul 25, 2015 at 2:19 AM, Paul Ramsey  wrote: 
 
> On Thu, Jul 23, 2015 at 7:48 AM, Paul Ramsey wrote:
>> Here's an updated patch that clears the cache on changes to foreign  
>> wrappers and servers.  
>  
> Any chance one of you folks could by my official commitfest reviewer?  
> Appreciate all the feedback so far!  
>  
> https://commitfest.postgresql.org/6/304/  

That's an interesting topic, I will register as such.  
--  
Michael  


Re: [HACKERS] RFC: replace pg_stat_activity.waiting with something more descriptive

2015-08-04 Thread Robert Haas
On Tue, Jul 28, 2015 at 3:28 PM, Heikki Linnakangas  wrote:
> * The patch requires that the LWLOCK_INDIVIDUAL_NAMES array is kept in sync
> with the list of individual locks in lwlock.h. Sooner or later someone will
> add an LWLock and forget to update the names-array. That needs to be made
> less error-prone, so that the names are maintained in the same place as the
> #defines. Perhaps something like rmgrlist.h.

This is a good idea, but it's not easy to do in the style of
rmgrlist.h, because I don't believe there's any way to define a macro
that expands to a preprocessor directive.  Attached is a patch that
instead generates the list of macros from a text file, and also
generates an array inside lwlock.c with the lock names that gets used
by the Trace_lwlocks stuff where applicable.

Any objections to this solution to the problem?  If not, I'd like to
go ahead and push this much.  I can't test the Windows changes
locally, though, so it would be helpful if someone could check that
out.

> * Instead of having LWLOCK_INDIVIDUAL_NAMES to name "individual" locks, how
> about just giving each one of them a separate tranche?

I don't think it's good to split things up to that degree;
standardizing on one name per fixed lwlock and one per tranche
otherwise seems like a good compromise to me.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
diff --git a/src/backend/Makefile b/src/backend/Makefile
index 98b978f..d16ab10 100644
--- a/src/backend/Makefile
+++ b/src/backend/Makefile
@@ -106,7 +106,7 @@ endif
 endif # aix
 
 # Update the commonly used headers before building the subdirectories
-$(SUBDIRS:%=%-recursive): $(top_builddir)/src/include/parser/gram.h $(top_builddir)/src/include/catalog/schemapg.h $(top_builddir)/src/include/utils/fmgroids.h $(top_builddir)/src/include/utils/errcodes.h $(top_builddir)/src/include/utils/probes.h
+$(SUBDIRS:%=%-recursive): $(top_builddir)/src/include/parser/gram.h $(top_builddir)/src/include/catalog/schemapg.h $(top_builddir)/src/include/storage/lwlocknames.h $(top_builddir)/src/include/utils/fmgroids.h $(top_builddir)/src/include/utils/errcodes.h $(top_builddir)/src/include/utils/probes.h
 
 # run this unconditionally to avoid needing to know its dependencies here:
 submake-schemapg:
@@ -135,6 +135,9 @@ postgres.o: $(OBJS)
 parser/gram.h: parser/gram.y
 	$(MAKE) -C parser gram.h
 
+storage/lmgr/lwlocknames.h: storage/lmgr/generate-lwlocknames.pl storage/lmgr/lwlocknames.txt
+	$(MAKE) -C storage/lmgr lwlocknames.h
+
 utils/fmgroids.h: utils/Gen_fmgrtab.pl catalog/Catalog.pm $(top_srcdir)/src/include/catalog/pg_proc.h
 	$(MAKE) -C utils fmgroids.h
 
@@ -165,6 +168,11 @@ $(top_builddir)/src/include/catalog/schemapg.h: catalog/schemapg.h
 	  cd '$(dir $@)' && rm -f $(notdir $@) && \
 	  $(LN_S) "$$prereqdir/$(notdir $<)" .
 
+$(top_builddir)/src/include/storage/lwlocknames.h: storage/lmgr/lwlocknames.h
+	prereqdir=`cd '$(dir $<)' >/dev/null && pwd` && \
+	  cd '$(dir $@)' && rm -f $(notdir $@) && \
+	  $(LN_S) "$$prereqdir/$(notdir $<)" .
+
 $(top_builddir)/src/include/utils/errcodes.h: utils/errcodes.h
 	prereqdir=`cd '$(dir $<)' >/dev/null && pwd` && \
 	  cd '$(dir $@)' && rm -f $(notdir $@) && \
@@ -192,6 +200,7 @@ distprep:
 	$(MAKE) -C bootstrap	bootparse.c bootscanner.c
 	$(MAKE) -C catalog	schemapg.h postgres.bki postgres.description postgres.shdescription
 	$(MAKE) -C replication	repl_gram.c repl_scanner.c
+	$(MAKE) -C storage	lwlocknames.h
 	$(MAKE) -C utils	fmgrtab.c fmgroids.h errcodes.h
 	$(MAKE) -C utils/misc	guc-file.c
 	$(MAKE) -C utils/sort	qsort_tuple.c
@@ -307,6 +316,7 @@ maintainer-clean: distclean
 	  catalog/postgres.shdescription \
 	  replication/repl_gram.c \
 	  replication/repl_scanner.c \
+	  storage/lmgr/lwlocknames.h \
 	  utils/fmgroids.h \
 	  utils/fmgrtab.c \
 	  utils/errcodes.h \
diff --git a/src/backend/storage/lmgr/.gitignore b/src/backend/storage/lmgr/.gitignore
new file mode 100644
index 000..9355cae
--- /dev/null
+++ b/src/backend/storage/lmgr/.gitignore
@@ -0,0 +1,2 @@
+/lwlocknames.c
+/lwlocknames.h
diff --git a/src/backend/storage/lmgr/Makefile b/src/backend/storage/lmgr/Makefile
index e12a854..e941f2d 100644
--- a/src/backend/storage/lmgr/Makefile
+++ b/src/backend/storage/lmgr/Makefile
@@ -24,8 +24,17 @@ s_lock_test: s_lock.c $(top_builddir)/src/port/libpgport.a
 	$(CC) $(CPPFLAGS) $(CFLAGS) -DS_LOCK_TEST=1 $(srcdir)/s_lock.c \
 		$(TASPATH) -L $(top_builddir)/src/port -lpgport -o s_lock_test
 
+# see explanation in ../../parser/Makefile
+lwlocknames.c: lwlocknames.h ;
+
+lwlocknames.h: $(top_srcdir)/src/backend/storage/lmgr/lwlocknames.txt generate-lwlocknames.pl
+	$(PERL) $(srcdir)/generate-lwlocknames.pl $<
+
 check: s_lock_test
 	./s_lock_test
 
-clean distclean maintainer-clean:
+clean distclean:
 	rm -f s_lock_test
+
+maintainer-clean: clean
+	rm -f lwlocknames.h
diff --git a/src/backend/storage/lmgr/generate-lwlocknames.pl b/src/backend/storage/lmgr/gen

Re: [HACKERS] Support for N synchronous standby servers - take 2

2015-08-04 Thread Beena Emerson
Robert Haas wrote:
>Maybe shoehorning this into the GUC mechanism is the wrong thing, and
>what we really need is a new config file for this.  The information
>we're proposing to store seems complex enough to justify that.
>

I think the consensus is that JSON is better.
And using a new file with multi-line support would be good.

Name of the file: how about pg_syncinfo.conf? 


Backward compatibility: synchronous_standby_names will be supported.
Setting synchronous_standby_names='pg_syncinfo' indicates use of the new file.


JSON format:
It would contain 2 main keys: "sync_info" and "groups".
The "sync_info" key would consist of "quorum"/"priority" with the count, and
"nodes"/"group" with the node list or group name.
The optional "groups" key would list out all the "group" entries mentioned
within "sync_info" along with their node lists.


Ex:
1.
{
    "sync_info":
    {
        "quorum": 2,
        "nodes":
        [
            "node1", "node2", "node3"
        ]
    }
}

2.
{
    "sync_info":
    {
        "quorum": 2,
        "nodes":
        [
            {"priority": 1, "group": "cluster1"},
            {"quorum": 2, "group": "cluster2"},
            "node99"
        ]
    },
    "groups":
    {
        "cluster1": ["node11", "node12"],
        "cluster2": ["node21", "node22", "node23"]
    }
}

Thoughts?



-
Beena Emerson





Re: [HACKERS] GROUP BY before JOIN

2015-08-04 Thread Ashutosh Bapat
On Tue, Aug 4, 2015 at 4:00 PM, David Rowley 
wrote:

> On 4 August 2015 at 21:56, Ashutosh Bapat wrote:
>
>> This looks like one example of general problem of finding optimal order
>> for SQL operations. Consider a query defined as sql_op1(sql_op2(sql_op3(A,
>> B), sql_op4(C, D), sql_op5(E, F))), where sql_op can be an SQL operation like
>> join, grouping, aggregation, projection, union, intersection, limit,
>> ordering etc. and A, B, C, D are tables. Given such a representation, how
>> do we find an optimal order of operation?
>>
>> In your example, the query which is defined like GROUPBY(JOIN(A, B)) might
>> have an optimal order JOIN(GROUPBY(A), GROUPBY(B)) if we can prove GROUPBY
>> is distributive over JOIN. Similarly GROUPBY(UNION(A, B)) might have an
>> optimal order UNION(GROUPBY(A), GROUPBY(B)) if A and B produce disjoint
>> groups i.e. UNION is distributive over GROUPBY.
>>
>>
> I'm not really sure that I understand why it's GROUPBY(A) and GROUPBY(B).
> If the group by contained columns from relations A and B, then it wouldn't
> be possible to group each one individually. We would have to wait until the
> join was performed in that case.
>

Sorry for not being clear enough. GROUPBY(A) implies GROUPBY on relation A,
using some columns (unspecified but common to both A and B here). GROUPBY
is used as an operator over relations, in classic sense of operator. This
representation does not have any columns, join conditions specified for
simplicity. It specifies SQL operators on relations as operands.


>
> Right now, we use dynamic programming to find the optimal order for join
>> tree (make_one_rel). In the language above, we find the optimal order for
>> evaluating an expression consisting of JOIN operators by exploiting their
>> commutative and associative properties. If we could extend that to SQL
>> expression tree representing the query, we will be able to solve many of
>> such re-ordering problems using current path infrastructure. The dynamic
>> program for this would look something like below
>>
>> root = top operator of the query
>> child = the operands of the root
>>
>> cost of query = minimum of
>> 1. cost of query without any mutation
>> 2. cost of root and its child operator commuted, if there is only one
>> child for root operator and root operator and child operator are commutable
>> 3. cost of root distributed over children operands if there are multiple
>> operands and root operator is distributive over child operators
>> 4. ...
>> 5. ...
>>
>> Each operator root will have a RelOptInfo to which we add many paths
>> possible for performing that operation using different orders of
>> evaluation. Current add_path should eliminate the expensive ones.
>>
>
> So in a query such as:
>
>  select s.product_id from sale s inner join product p on s.product_id =
> p.product_id group by s.product_id;
>
> Do you mean that there would be 2 RelOptInfos for the sale relation, one
> without the GROUP BY, and one with? That would likely solve the issue I
> have with join row estimates, but which of the 2 RelOptInfo would all of
> the Vars in the parse tree be pointing to?
>

This query would be represented as GROUPBY(JOIN(sales, product)), grouping
happens on s.product_id and JOIN on s.product_id = p.product_id. The
dynamic programming would consider GROUPBY(JOIN(sales, product)) and
JOIN(GROUPBY(sales), GROUPBY(product)) as possible evaluation orders, since
GROUPBY is distributive over JOIN in this case and choose the order with
the least cost.
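
In SQL terms, the second order for that query would look something like this
(a sketch only, ignoring the aggregate for the moment):

select s.product_id
from (select product_id from sale group by product_id) s
inner join (select product_id from product group by product_id) p
on s.product_id = p.product_id;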

>
>
>> This is going to explode the number of path (plan) trees considered by
>> planner, so we need to be judicious when to apply this technique or have a
>> rigorous pruning technique to discard costly paths early enough. This
>> approach might be better than trying to solve each re-ordering problem
>> separately and might go well with upper planner pathification being
>> discussed at
>> http://www.postgresql.org/message-id/ca+tgmoaqgzhux7frzddpghkqt3rcnx-f0df+zqzsefw-arx...@mail.gmail.com
>> .
>>
>>
> It's certainly going to increase that, but I'm not sure if 'explode' is
> the best word as we at least require all of the rels seen in the GROUP BY
> expressions and aggregate function parameters to be joined before we can
> consider a GroupingPath. The patch
> uses bms_is_subset(root->minimumGroupingRels, rel->relids) to determine
> that.
>
>
>> On Tue, Aug 4, 2015 at 1:57 PM, David Rowley <
>> david.row...@2ndquadrant.com> wrote:
>>
>>> == Overview ==
>>>
>>> As of today we always perform GROUP BY at the final stage, after each
>>> relation in the query has been joined. This of course works, but it's not
>>> always the most optimal way of executing the query.
>>>
>>> Consider the following two relations:
>>>
>>> create table product (product_id int primary key, code varchar(32) not
>>> null);
>>> create table sale (sale_id int primary key, product_id int not null, qty
>>> int not null);
>>> create index on sale (product_id);
>>>
>>> Populated with:
>>>
>>> ins

Re: [HACKERS] RFC: replace pg_stat_activity.waiting with something more descriptive

2015-08-04 Thread Ildus Kurbangaliev

On 08/03/2015 04:25 PM, Ildus Kurbangaliev wrote:

On 07/28/2015 10:28 PM, Heikki Linnakangas wrote:

On 07/27/2015 01:20 PM, Ildus Kurbangaliev wrote:

Hello.
In the attached patch I've made a refactoring for tranches.
The prefix for them was extended, and I've split LWLockAssign into two
functions (one with a tranche and a second for user-defined LWLocks).


This needs some work in order to be maintainable:

* The patch requires that the LWLOCK_INDIVIDUAL_NAMES array is kept 
in sync with the list of individual locks in lwlock.h. Sooner or 
later someone will add an LWLock and forget to update the 
names-array. That needs to be made less error-prone, so that the 
names are maintained in the same place as the #defines. Perhaps 
something like rmgrlist.h.


* The "base" tranches are a bit funny. They all have the same 
array_base, pointing to MainLWLockArray. If there are e.g. 5 clog 
buffer locks, I would expect the T_NAME() to return "ClogBufferLocks" 
for all of them, and T_ID() to return numbers between 0-4. But in 
reality, T_ID() will return something like 55-59.


Instead of passing a tranche-id to LWLockAssign(), I think it would 
be more clear to have a new function to allocate a contiguous block 
of lwlocks as a new tranche. It could then set the base correctly.


* Instead of having LWLOCK_INDIVIDUAL_NAMES to name "individual" 
locks, how about just giving each one of them a separate tranche?


* User manual needs to be updated to explain the new column in 
pg_stat_activity.


- Heikki



Hello. Thanks for the review. I've attached a new version of the patch.

It adds a new field in pg_stat_activity that shows the current wait in a
backend.

I've done some refactoring of the LWLocks tranche mechanism. In lwlock.c only
individual and user-defined LWLocks are created; other LWLocks are created by
the modules that need them. I think that is more logical (the user knows
about the module, not the module about all of its users). It also simplifies
LWLock acquisition.

Now each individual LWLock and the other groups of LWLocks have their own
tranche, and can be easily identified. If somebody adds a new individual
LWLock and forgets to add its name, postgres will show a message. Individual
LWLocks are still allocated in one array, but their tranches point to a
particular index in the main array.

Sample:

b1=# select pid, wait_event from pg_stat_activity; \watch 1

 pid  |   wait_event
--+-
 7722 | Storage: READ
 7653 |
 7723 | Network: WRITE
 7725 | Network: READ
 7727 | Locks: Transaction
 7731 | Storage: READ
 7734 | Network: READ
 7735 | Storage: READ
 7739 | LWLocks: WALInsertLocks
 7738 | Locks: Transaction
 7740 | LWLocks: BufferMgrLocks
 7741 | Network: READ
 7742 | Network: READ
 7743 | Locks: Transaction


Attached a new version of the patch with some small fixes in the code.

--
Ildus Kurbangaliev
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index e64b7ef..2e4e67e 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -630,6 +630,11 @@ postgres   27093  0.0  0.0  30096  2752 ?Ss   11:34   0:00 postgres: ser
  True if this backend is currently waiting on a lock
 
 
+ wait_event
+ text
+ Name of a current wait event in backend
+
+
  state
  text
  Current overall state of this backend.
diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index 3a58f1e..10c25cf 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -457,7 +457,8 @@ CLOGShmemInit(void)
 {
 	ClogCtl->PagePrecedes = CLOGPagePrecedes;
 	SimpleLruInit(ClogCtl, "CLOG Ctl", CLOGShmemBuffers(), CLOG_LSNS_PER_PAGE,
-  CLogControlLock, "pg_clog");
+  CLogControlLock, "pg_clog",
+  "CLogBufferLocks");
 }
 
 /*
diff --git a/src/backend/access/transam/commit_ts.c b/src/backend/access/transam/commit_ts.c
index 5ad35c0..dd7578f 100644
--- a/src/backend/access/transam/commit_ts.c
+++ b/src/backend/access/transam/commit_ts.c
@@ -466,7 +466,8 @@ CommitTsShmemInit(void)
 
 	CommitTsCtl->PagePrecedes = CommitTsPagePrecedes;
 	SimpleLruInit(CommitTsCtl, "CommitTs Ctl", CommitTsShmemBuffers(), 0,
-  CommitTsControlLock, "pg_commit_ts");
+  CommitTsControlLock, "pg_commit_ts",
+  "CommitTSBufferLocks");
 
 	commitTsShared = ShmemInitStruct("CommitTs shared",
 	 sizeof(CommitTimestampShared),
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 1933a87..b905c59 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1842,10 +1842,12 @@ MultiXactShmemInit(void)
 
 	SimpleLruInit(MultiXactOffsetCtl,
   "MultiXactOffset Ctl", NUM_MXACTOFFSET_BUFFERS, 0,
-  MultiXactOffsetControlLock, "pg_multixact/offsets");
+  MultiXactOffsetControlLock, "pg_multixact/offsets",
+  "MultiXactOffsetBufferLocks");
 	SimpleLru

Re: [HACKERS] GROUP BY before JOIN

2015-08-04 Thread David Rowley
On 4 August 2015 at 21:56, Ashutosh Bapat 
wrote:

> This looks like one example of general problem of finding optimal order
> for SQL operations. Consider a query defined as sql_op1(sql_op2(sql_op3(A,
> B), sql_op4(C, D), sql_op5(E, F))), where sql_op can be an SQL operation like
> join, grouping, aggregation, projection, union, intersection, limit,
> ordering etc. and A, B, C, D are tables. Given such a representation, how
> do we find an optimal order of operation?
>
> In your example, the query which is defined like GROUPBY(JOIN(A, B)) might
> have an optimal order JOIN(GROUPBY(A), GROUPBY(B)) if we can prove GROUPBY
> is distributive over JOIN. Similarly GROUPBY(UNION(A, B)) might have an
> optimal order UNION(GROUPBY(A), GROUPBY(B)) if A and B produce disjoint
> groups i.e. UNION is distributive over GROUPBY.
>
>
I'm not really sure that I understand why it's GROUPBY(A) and GROUPBY(B).
If the group by contained columns from relations A and B, then it wouldn't
be possible to group each one individually. We would have to wait until the
join was performed in that case.

Right now, we use dynamic programming to find the optimal order for join
> tree (make_one_rel). In the language above, we find the optimal order for
> evaluating an expression consisting of JOIN operators by exploiting their
> commutative and associative properties. If we could extend that to SQL
> expression tree representing the query, we will be able to solve many of
> such re-ordering problems using current path infrastructure. The dynamic
> program for this would look something like below
>
> root = top operator of the query
> child = the operands of the root
>
> cost of query = minimum of
> 1. cost of query without any mutation
> 2. cost of root and its child operator commuted, if there is only one
> child for root operator and root operator and child operator are commutable
> 3. cost of root distributed over children operands if there are multiple
> operands and root operator is distributive over child operators
> 4. ...
> 5. ...
>
> Each operator root will have a RelOptInfo to which we add many paths
> possible for performing that operation using different orders of
> evaluation. Current add_path should eliminate the expensive ones.
>

So in a query such as:

 select s.product_id from sale s inner join product p on s.product_id =
p.product_id group by s.product_id;

Do you mean that there would be 2 RelOptInfos for the sale relation, one
without the GROUP BY, and one with? That would likely solve the issue I
have with join row estimates, but which of the 2 RelOptInfo would all of
the Vars in the parse tree be pointing to?


> This is going to explode the number of path (plan) trees considered by
> planner, so we need to be judicious when to apply this technique or have a
> rigorous pruning technique to discard costly paths early enough. This
> approach might be better than trying to solve each re-ordering problem
> separately and might go well with upper planner pathification being
> discussed at
> http://www.postgresql.org/message-id/ca+tgmoaqgzhux7frzddpghkqt3rcnx-f0df+zqzsefw-arx...@mail.gmail.com
> .
>
>
It's certainly going to increase that, but I'm not sure if 'explode' is the
best word as we at least require all of the rels seen in the GROUP BY
expressions and aggregate function parameters to be joined before we can
consider a GroupingPath. The patch
uses bms_is_subset(root->minimumGroupingRels, rel->relids) to determine
that.


> On Tue, Aug 4, 2015 at 1:57 PM, David Rowley wrote:
>
>> == Overview ==
>>
>> As of today we always perform GROUP BY at the final stage, after each
>> relation in the query has been joined. This of course works, but it's not
>> always the most optimal way of executing the query.
>>
>> Consider the following two relations:
>>
>> create table product (product_id int primary key, code varchar(32) not
>> null);
>> create table sale (sale_id int primary key, product_id int not null, qty
>> int not null);
>> create index on sale (product_id);
>>
>> Populated with:
>>
>> insert into product select x.x,'ABC' || x.x from generate_series(1,100)
>> x(x);
>> insert into sale select x.x,x.x%100+1, 10 from generate_series(1,100)
>> x(x);
>>
>> Here we have a table with 100 products and another with 1 million sales
>> records.
>>
>> If we wanted to see the total sales for each product we may write a query
>> such as:
>>
>> select s.product_id,sum(qty) as qty from sale s inner join product p on
>> s.product_id = p.product_id group by s.product_id;
>>
>> If we look at the explain for this query we can see that the grouping
>> happened after the join took place:
>>
>> QUERY PLAN
>> --
>>  HashAggregate
>>Group Key: s.product_id
>>->  Hash Join
>>  Hash Cond: (s.product_id = p.product_id)
>>  ->  Seq Scan on sale s
>>  ->  Hash
>>->  Seq Scan on product p
>>
>> The join here would produce exactly 1 

Re: [HACKERS] GROUP BY before JOIN

2015-08-04 Thread Ashutosh Bapat
This looks like one example of the general problem of finding the optimal
order for SQL operations. Consider a query defined as
sql_op1(sql_op2(sql_op3(A, B), sql_op4(C, D), sql_op5(E, F))), where sql_op
can be an SQL operation like join, grouping, aggregation, projection, union,
intersection, limit, ordering etc. and A, B, C, D are tables. Given such a
representation, how do we find an optimal order of operations?

In your example, the query which is defined like GROUPBY(JOIN(A, B)) might
have an optimal order JOIN(GROUPBY(A), GROUPBY(B)) if we can prove GROUPBY
is distributive over JOIN. Similarly GROUPBY(UNION(A, B)) might have an
optimal order UNION(GROUPBY(A), GROUPBY(B)) if A and B produce disjoint
groups i.e. UNION is distributive over GROUPBY.

Right now, we use dynamic programming to find the optimal order for join
tree (make_one_rel). In the language above, we find the optimal order for
evaluating an expression consisting of JOIN operators by exploiting their
commutative and associative properties. If we could extend that to SQL
expression tree representing the query, we will be able to solve many of
such re-ordering problems using current path infrastructure. The dynamic
program for this would look something like below

root = top operator of the query
child = the operands of the root

cost of query = minimum of
1. cost of query without any mutation
2. cost of root and its child operator commuted, if there is only one child
for root operator and root operator and child operator are commutable
3. cost of root distributed over children operands if there are multiple
operands and root operator is distributive over child operators
4. ...
5. ...

Each operator root will have a RelOptInfo to which we add many paths
possible for performing that operation using different orders of
evaluation. Current add_path should eliminate the expensive ones.

This is going to explode the number of path (plan) trees considered by
planner, so we need to be judicious when to apply this technique or have a
rigorous pruning technique to discard costly paths early enough. This
approach might be better than trying to solve each re-ordering problem
separately and might go well with upper planner pathification being
discussed at
http://www.postgresql.org/message-id/ca+tgmoaqgzhux7frzddpghkqt3rcnx-f0df+zqzsefw-arx...@mail.gmail.com
.

On Tue, Aug 4, 2015 at 1:57 PM, David Rowley 
wrote:

> == Overview ==
>
> As of today we always perform GROUP BY at the final stage, after each
> relation in the query has been joined. This of course works, but it's not
> always the most optimal way of executing the query.
>
> Consider the following two relations:
>
> create table product (product_id int primary key, code varchar(32) not
> null);
> create table sale (sale_id int primary key, product_id int not null, qty
> int not null);
> create index on sale (product_id);
>
> Populated with:
>
> insert into product select x.x,'ABC' || x.x from generate_series(1,100)
> x(x);
> insert into sale select x.x,x.x%100+1, 10 from generate_series(1,100)
> x(x);
>
> Here we have a table with 100 products and another with 1 million sales
> records.
>
> If we wanted to see the total sales for each product we may write a query
> such as:
>
> select s.product_id,sum(qty) as qty from sale s inner join product p on
> s.product_id = p.product_id group by s.product_id;
>
> If we look at the explain for this query we can see that the grouping
> happened after the join took place:
>
> QUERY PLAN
> --
>  HashAggregate
>Group Key: s.product_id
>->  Hash Join
>  Hash Cond: (s.product_id = p.product_id)
>  ->  Seq Scan on sale s
>  ->  Hash
>->  Seq Scan on product p
>
> The join here would produce exactly 1 million rows, then hashaggregate
> will group 1 million rows.
>
> If it were possible for PostgreSQL to perform the GROUP BY before the join
> we could change that to Grouping 1 million rows, then joining 100 rows.
>
> Of course we could have written the query as:
>
> select s.product_id,qty from (select product_id,sum(qty) as qty from sale
> group by product_id) s inner join product p on s.product_id = p.product_id;
>
> Which would do exactly that, and in this particular case it is much
> faster, but it's not all rainbows and unicorns, as if the join happened to
> filter out many of the groups, then it might not be a win at all.
>
> Consider:
>
> select s.product_id from sale s inner join product p on s.product_id =
> p.product_id where p.code = 'ABC1' group by s.product_id;
>
> Since the product.code has no equivalence member in the sale table the
> predicate cannot be applied to the sale table, so in this case if we'd
> written the query as:
>
> select s.product_id,qty from (select product_id,sum(qty) as qty from sale
> group by product_id) s inner join product p on s.product_id = p.product_id
> where p.code = 'ABC1';
>
> We'd have thrown away 99% of the g

Re: [HACKERS] ON CONFLICT DO UPDATE using EXCLUDED.column gives an error about mismatched types

2015-08-04 Thread Amit Langote
On 2015-08-04 PM 05:58, Geoff Winkless wrote:
> 
> Although it seems Amit has defined the problem better than I could, so
> this is a bit late to the party (!), yes, the table had been ALTERed after
> it was created (looking back through the history, that modification
> included at least one DROP COLUMN).
> 

It seems using any columns that come after a dropped column causes the
EXCLUDED pseudo-relation to misbehave. For example, I observed another symptom:

test=# CREATE TABLE upsert_fail_test(a int, b int, c int, d smallint);
CREATE TABLE

test=# ALTER TABLE upsert_fail_test DROP b;
ALTER TABLE

test=# ALTER TABLE upsert_fail_test ADD PRIMARY KEY (a, c, d);
ALTER TABLE

test=# INSERT INTO  upsert_fail_test(a, c, d) VALUES (1, 2, 3) ON CONFLICT
(a, c, d) DO UPDATE SET c = EXCLUDED.c;
INSERT 0 1

test=# INSERT INTO  upsert_fail_test(a, c, d) VALUES (1, 2, 3) ON CONFLICT
(a, c, d) DO UPDATE SET c = EXCLUDED.c;
ERROR:  null value in column "c" violates not-null constraint
DETAIL:  Failing row contains (1, null, 3).

That is, the EXCLUDED pseudo-rel failed to deliver the '2' produced by the
subplan and instead produced a 'null', which I guess was caused by the
dropped column 'b'.

Perhaps it has to do with how the EXCLUDED pseudo-rel's targetlist is
manipulated through the parse/plan stages?

Thanks,
Amit





Re: [HACKERS] Autonomous Transaction is back

2015-08-04 Thread Rajeev rastogi
On 03 August 2015 18:40, Merlin Moncure [mailto:mmonc...@gmail.com] Wrote:
>On Sun, Aug 2, 2015 at 11:37 PM, Rajeev rastogi
> wrote:
>> On 31 July 2015 23:10, Robert Haas Wrote:
>>>I think we're going entirely down the wrong path here.  Why is it ever
>useful for a backend's lock requests to conflict with themselves, even
>with autonomous transactions?  That seems like an artifact of somebody
>else's implementation that we should be happy we don't need to copy.
>>
>> IMHO, since most of the locking are managed at transaction level not
>backend level and we consider main & autonomous transaction to be
>independent transaction, then practically they may conflict right.
>> It is also right as you said that there is no as such useful use-cases
>where autonomous transaction conflicts with main (parent) transaction.
>But we cannot take it for granted as user might make a mistake. So at-
>least we should have some mechanism to handle this rare case, for which
>as of now I think throwing error from autonomous transaction as one of
>the solution. Once error thrown from autonomous transaction, main
>transaction may continue as it is (or abort main transaction also??).
>
>hm.  OK, what's the behavior of:
>
>BEGIN
>  UPDATE foo SET x = x + 1 WHERE foo_id = 1;
>
>  BEGIN WITH AUTONOMOUS TRANSACTION
>UPDATE foo SET x = x + 1 WHERE foo_id = 1;
>  END;
>
>  RAISE EXCEPTION ...;
>EXCEPTION ...
>
>END;

It should throw an error (or something equivalent), as the second update will
wait for the record lock to be released, which in this case will not happen
until the second update finishes. So it's a catch-22.

>Also,
>*) What do the other candidate implementations do?  IMO, compatibility
>should be the underlying design principle.


Oracle throws an error in such cases, but we can decide what behavior we want
to keep.

>*) What will the "SQL only" feature look like?

Similar to the PL version in your example, we can provide the "SQL only"
feature as well.

>*) Is the SPI interface going to be extended to expose AT?

I don't think at this point that there is any need to expose the SPI
interface for this.

Thanks and Regards,
Kumar Rajeev Rastogi



Re: [HACKERS] ON CONFLICT DO UPDATE using EXCLUDED.column gives an error about mismatched types

2015-08-04 Thread Geoff Winkless
On 4 August 2015 at 09:30, Amit Langote 
wrote:

> On 2015-08-04 AM 02:57, Peter Geoghegan wrote:
> > On Mon, Aug 3, 2015 at 8:53 AM, Geoff Winkless 
> wrote:
> >> If I create a copy of the table using
> >>
> >> CREATE mytab (LIKE brokentab INCLUDING ALL);
> >> INSERT INTO mytab SELECT * FROM brokentab;
> >
> > Also, did you drop any columns from the original "brokentab" table
> > where the bug can be reproduced?
> >
>
> This seem to be the case. I could reproduce the reported problem:
>

Although it seems Amit has defined the problem better than I could (so this
is a bit late to the party!), yes, the table had been ALTERed after it was
created (looking back through the history, that modification included at
least one DROP COLUMN).

Thanks

Geoff


Re: [HACKERS] Doubt about AccessExclusiveLock in ALTER TABLE .. SET ( .. );

2015-08-04 Thread Andres Freund
On 2015-08-03 14:15:27 +0900, Michael Paquier wrote:
> On Sat, Aug 1, 2015 at 9:20 PM, Andres Freund wrote:
> > On August 1, 2015 2:17:24 PM GMT+02:00, Michael Paquier wrote:
> >>> For instance, if you told me to choose between ShareLock and
> >>> ShareUpdateExclusiveLock I wouldn't know which one is strongest.  I
> >>> don't it's sensible to have the "lock mode compare" primitive
> >>honestly.
> >>> I don't have any great ideas to offer ATM sadly.
> >>
> >>Yes, the thing is that lowering the lock levels is good for
> >>concurrency, but the non-monotony of the lock levels makes it
> >>impossible to choose an intermediate state correctly.
> >
> > How about simply acquiring all the locks individually if they're different
> > types? These few acquisitions won't matter.
> 
> As long as this only applies on master, this may be fine... We could
> basically pass a LOCKMASK to the multiple layers of tablecmds.c
> instead of LOCKMODE to track all the locks that need to be taken, and
> all the relations open during operations.

This sounds far too complicated to me. Just LockRelationOid() the
relation with the appropriate level every time you pass through the
function?




Re: [HACKERS] postgres_fdw join pushdown (was Re: Custom/Foreign-Join-APIs)

2015-08-04 Thread Shigeru Hanada
Hi Ashutosh,

Sorry for leaving the thread.

2015-07-20 16:09 GMT+09:00 Ashutosh Bapat :
> In find_user_mapping(), if the first cache search returns a valid tuple, it
> is checked twice for validity, un-necessarily. Instead if the first search
> returns a valid tuple, it should be returned immediately. I see that this
> code was copied from GetUserMapping(), where as well it had the same
> problem.

Oops.  I changed find_user_mapping to exit immediately when a valid cache
entry is found.
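
The lookup now has roughly this shape (a sketch of the idea, not the patch
verbatim):

    tup = SearchSysCache2(USERMAPPINGUSERSERVER,
                          ObjectIdGetDatum(userid),
                          ObjectIdGetDatum(serverid));
    if (HeapTupleIsValid(tup))
        return tup;        /* valid entry found, no second check */

    /* ... otherwise fall back to the PUBLIC user mapping ... */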

> In build_join_rel(), we check whether user mappings of the two joining
> relations are same. If the user mappings are same, doesn't that imply that
> the foreign server and local user are same too?

Yes, the validity of umid is identical to that of serverid.  We can remove
the check for serverid to save some cycles.  One idea is to put an Assert on
serverid inside the if-statement block.
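
Something like this, with the field names assumed for the sketch:

    if (outer_rel->umid == inner_rel->umid)
    {
        /* same user mapping implies same foreign server */
        Assert(outer_rel->serverid == inner_rel->serverid);
        ...
    }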

> Rest of the patch looks fine.

Thanks

-- 
Shigeru HANADA




Re: [HACKERS] Using quicksort and a merge step to significantly improve on tuplesort's single run "external sort"

2015-08-04 Thread Peter Geoghegan
On Tue, Aug 4, 2015 at 1:24 AM, Heikki Linnakangas  wrote:
> Yeah, something like that. To paraphrase, if I'm now understanding it
> correctly, Peter's idea is:
>
> When all the tuples have been fed to tuplesort, and it's time to perform the
> sort, quicksort all the tuples currently in the heap, ignoring the run
> numbers, and turn the resulting array into another tape. That tape is
> special: it's actually stored completely in memory. It is merged with the
> "real" tapes when tuples are returned from the tuplesort, just like regular
> tapes in TSS_FINALMERGE.

Yeah. I imagine that we'll want to add memory prefetch hints for the
new case, since I've independently shown that that works well for the
in-memory case, which this can be very close to.

My next patch will also include quicksorting of runs after we give up
on heapification (after there is more than one run and it is
established that we cannot use my "quicksort with spillover"
optimization, so there are two or more "real" runs on tape). Once
there is clearly not going to be one huge run (which can happen due to
everything largely being in order, even when work_mem is small), and
once incrementally spilling does not end in time to do a "quicksort
with spillover", then the replacement selection thing isn't too
valuable. Especially with large memory sizes but memory bandwidth +
latency as a bottleneck, which is the norm these days.

This seems simpler than my earlier idea of reusing half the memtuples
array only, and resorting the entire array each time, to have
something that consistently approximates replacement selection but
with quicksorting + batching, which I discussed before.

I have this working, and it takes a good chunk of the runtime
off a sort that merges 3 runs in one reasonable case I tested, where
work_mem was 300MB. It went from about 56.6 seconds with master to
35.8 seconds with this new approach when tested just now (this
approach saves no writing of tuples, so it's not as effective as the
original "quicksort with spillover" patch can be, but covers a
fundamentally different case). I just need to clean up the patch, and
see if I missed any further optimizations, but this feels like the way
forward multi-run wise. I think it's worth giving up on replacement
selection style runs after the first run is produced, because that's
where the benefits are, if anywhere.

-- 
Peter Geoghegan




Re: [HACKERS] FSM versus GIN pending list bloat

2015-08-04 Thread Simon Riggs
On 4 August 2015 at 06:03, Jeff Janes  wrote:


> The attached proof of concept patch greatly improves the bloat for both
> the insert and the update cases.  You need to turn on both features: adding
> the pages to fsm, and vacuuming the fsm, to get the benefit (so JJ_GIN=3).
> The first of those two things could probably be adopted for real, but the
> second probably is not acceptable.  What is the right way to do this?
> Could a variant of RecordFreeIndexPage bubble the free space up the map
> immediately rather than waiting for a vacuum?  It would only have to move
> up until it found a page with freespace already recorded in it, which the
> vast majority of the time would mean observing up one level and then not
> writing to it, assuming the pending list pages remain well clustered.
>

You make a good case for action here, since insert-only tables with GIN
indexes on text are a common use case for GIN.

Why would vacuuming the FSM be unacceptable? With a
large gin_pending_list_limit it makes sense.

If it is unacceptable, perhaps we can avoid calling it every time, or
simply have FreeSpaceMapVacuum() terminate more quickly on some kind of
80/20 heuristic for this case.
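
For reference, the core of Jeff's experiment amounts to this (a sketch using
the existing FSM API):

    /* as each emptied pending-list page is freed during cleanup */
    RecordFreeIndexPage(index, blkno);

    /* then, once per cleanup pass */
    FreeSpaceMapVacuum(index);

so the question is mostly how often we can afford that second call.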

-- 
Simon Riggs    http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] ON CONFLICT DO UPDATE using EXCLUDED.column gives an error about mismatched types

2015-08-04 Thread Amit Langote
On 2015-08-04 AM 02:57, Peter Geoghegan wrote:
> On Mon, Aug 3, 2015 at 8:53 AM, Geoff Winkless  wrote:
>> If I create a copy of the table using
>>
>> CREATE mytab (LIKE brokentab INCLUDING ALL);
>> INSERT INTO mytab SELECT * FROM brokentab;
> 
> Also, did you drop any columns from the original "brokentab" table
> where the bug can be reproduced?
> 

This seem to be the case. I could reproduce the reported problem:

test=# CREATE TABLE upsert_fail_test(a int, b int, c smallint);
CREATE TABLE

test=# ALTER TABLE upsert_fail_test DROP c;
ALTER TABLE

test=# ALTER TABLE upsert_fail_test ADD c smallint;
ALTER TABLE

test=# ALTER TABLE upsert_fail_test ADD PRIMARY KEY (a, b, c);
ALTER TABLE

test=# INSERT INTO  upsert_fail_test(a, b, c) VALUES (1, 2, 0) ON CONFLICT
(a, b, c) DO UPDATE SET c = EXCLUDED.c;
INSERT 0 1

test=# INSERT INTO  upsert_fail_test(a, b, c) VALUES (1, 2, 0) ON CONFLICT
(a, b, c) DO UPDATE SET b = EXCLUDED.b;
INSERT 0 1

test=# INSERT INTO  upsert_fail_test(a, b, c) VALUES (1, 2, 0) ON CONFLICT
(a, b, c) DO UPDATE SET c = EXCLUDED.c;
ERROR:  attribute 3 has wrong type
DETAIL:  Table has type integer, but query expects smallint.

FWIW, I tried to look why that happens. It seems during set_plan_refs(),
fix_join_expr on the splan->onConflictSet targetlist using EXCLUDE
pseudo-rel's targetlist as inner targetlist causes columns c's varattno to
be changed to 3, whereas in the actual tuple descriptor it's 4 (dropped
and added). Eventually, ExecEvalScalarVar() complains when it finds attno
3 is a dropped attribute.

Thanks,
Amit





[HACKERS] GROUP BY before JOIN

2015-08-04 Thread David Rowley
== Overview ==

As of today we always perform GROUP BY at the final stage, after each
relation in the query has been joined. This of course works, but it's not
always the most optimal way of executing the query.

Consider the following two relations:

create table product (product_id int primary key, code varchar(32) not
null);
create table sale (sale_id int primary key, product_id int not null, qty
int not null);
create index on sale (product_id);

Populated with:

insert into product select x.x,'ABC' || x.x from generate_series(1,100)
x(x);
insert into sale select x.x,x.x%100+1, 10 from generate_series(1,100)
x(x);

Here we have a table with 100 products and another with 1 million sales
records.

If we wanted to see the total sales for each product we may write a query
such as:

select s.product_id,sum(qty) as qty from sale s inner join product p on
s.product_id = p.product_id group by s.product_id;

If we look at the explain for this query we can see that the grouping
happened after the join took place:

QUERY PLAN
--
 HashAggregate
   Group Key: s.product_id
   ->  Hash Join
 Hash Cond: (s.product_id = p.product_id)
 ->  Seq Scan on sale s
 ->  Hash
   ->  Seq Scan on product p

The join here would produce exactly 1 million rows, then hashaggregate will
group 1 million rows.

If it were possible for PostgreSQL to perform the GROUP BY before the join
we could change that to Grouping 1 million rows, then joining 100 rows.

Of course we could have written the query as:

select s.product_id,qty from (select product_id,sum(qty) as qty from sale
group by product_id) s inner join product p on s.product_id = p.product_id;

Which would do exactly that, and in this particular case it is much faster.
But it's not all rainbows and unicorns: if the join happened to filter out
many of the groups, then it might not be a win at all.

Consider:

select s.product_id from sale s inner join product p on s.product_id =
p.product_id where p.code = 'ABC1' group by s.product_id;

Since the product.code has no equivalence member in the sale table the
predicate cannot be applied to the sale table, so in this case if we'd
written the query as:

select s.product_id,qty from (select product_id,sum(qty) as qty from sale
group by product_id) s inner join product p on s.product_id = p.product_id
where p.code = 'ABC1';

We'd have thrown away 99% of the groups created in the subquery.

== Proposal ==

I've been working on allowing the planner to properly cost which option is
better and choose to perform the GROUP BY at estimated most efficient join
level.

So far I've not gotten as far as supporting queries which contain aggregate
functions, as I need the ability to combine aggregate states (Simon and I
proposed a patch for that here https://commitfest.postgresql.org/5/131/).

Performance of the above test case is looking quite good so far:

select s.product_id from sale s inner join product p on s.product_id =
p.product_id group by s.product_id;

-- Master = 312.294 ms
-- Patched = 95.328 ms

So roughly a 3.3x (327%) speedup in this particular case.

== Implementation ==

I've had to make some choices about how exactly to implement this feature.
As I explain above, we can't just always do the grouping at the first
possible opportunity due to the possibility of joins eliminating groups and
needless groups being created.

To support this in the planner I've invented a GroupingPath type and I've
also added 'cheapest_sorted_group_path' and 'cheapest_group_path' to
RelOptInfo. This part I wasn't too sure about and I wondered if these
GroupingPaths should just be added to the normal list with add_path(), but
since these paths perform more work than normal paths that would give them
an unfair disadvantage and they could be thrown out. These paths are only
possibly cheaper after subsequent JOIN has taken place.

There's also an issue with row estimates being wrong on joinrels, where the
row estimate uses the base rel's row count rather than the number of groups.
As yet I'm unsure of the best way to fix this.

I wanted to post this early so as to register the fact that I'm working on
this, but I'm also posting in the hope of getting some early feedback on
what I have so far.

Of course there's lots more work to do here, aggregates need to be
supported, functional dependencies and transitive closures will also need
to be detected in a more complete implementation.

Here's what I have so far (attached)

Comments are most welcome. Thanks.

Regards

David Rowley

--
 David Rowley   http://www.2ndQuadrant.com/

 PostgreSQL Development, 24x7 Support, Training & Services


groupbeforejoin_v0.1.patch
Description: Binary data



Re: [HACKERS] Using quicksort and a merge step to significantly improve on tuplesort's single run "external sort"

2015-08-04 Thread Heikki Linnakangas

On 08/03/2015 11:36 PM, Robert Haas wrote:

On Mon, Aug 3, 2015 at 3:33 PM, Peter Geoghegan  wrote:

When it's time to drain the heap, in performsort, divide the array into two
arrays, based on the run number of each tuple, and then quicksort the arrays
separately. The first array becomes the in-memory tail of the current tape,
and the second array becomes the in-memory tail of the next tape.

You wouldn't want to actually allocate two arrays and copy SortTuples
around, but keep using the single large array, just logically divided into
two. So the bookkeeping isn't trivial, but seems doable.


You're talking about a new thing here, that happens when it is
necessary to dump everything and do a conventional merging of on-tape
runs. IOW, we cannot fit a significant fraction of overall tuples in
memory, and we need much of the memtuples array for the next run
(usually this ends as a TSS_FINALMERGE). That being the case
(...right?),


I don't think that's what Heikki is talking about.  He can correct me
if I'm wrong, but what I think he's saying is that we should try to
exploit the fact that we've already determined which in-memory tuples
can be part of the current run and which in-memory tuples must become
part of the next run.  Suppose half the tuples in memory can become
part of the current run and the other half must become part of the
next run.  If we throw all of those tuples into a single bucket and
quicksort it, we're losing the benefit of the comparisons we did to
figure out which tuples go in which runs.  Instead, we could quicksort
the current-run tuples and, separately, quick-sort the next-run
tuples.  Ignore the current-run tuples completely until the tape is
empty, and then merge them with any next-run tuples that remain.


Yeah, something like that. To paraphrase, if I'm now understanding it 
correctly, Peter's idea is:


When all the tuples have been fed to tuplesort, and it's time to perform 
the sort, quicksort all the tuples currently in the heap, ignoring the 
run numbers, and turn the resulting array into another tape. That tape 
is special: it's actually stored completely in memory. It is merged with 
the "real" tapes when tuples are returned from the tuplesort, just like 
regular tapes in TSS_FINALMERGE.


And my idea is:

When all the tuples have been fed to tuplesort, and it's time to perform 
the sort, take all the tuples in the heap belonging to currentRun, 
quicksort them, and make them part of the current tape. They're not just 
pushed to the tape as usual, however, but attached as in-memory tail of 
the current tape. The logical tape abstraction will return them after 
all the tuples already in the tape, as if they were pushed to the tape 
as usual. Then take all the remaining tuples in the heap (if any), 
belonging to next tape, and do the same for them. They become an 
in-memory tail of the next tape.



I'm not sure if there's any reason to believe that would be faster
than your approach.  In general, sorting is O(n lg n) so sorting two
arrays that are each half as large figures to be slightly faster than
sorting one big array.  But the difference may not amount to much.


Yeah, I don't think there's a big performance difference between the two 
approaches. I'm not wedded to either approach. Whichever approach we 
use, my main point was that it would be better to handle this in the 
logical tape abstraction. In my approach, you would have those 
"in-memory tails" attached to the last two tapes. In Peter's approach, 
you would have one tape that's completely in memory, backed by the 
array. In either case, the tapes would look like normal tapes to most of 
tuplesort.c. There would be no extra TSS state, it would be 
TSS_SORTEDONTAPE or TSS_FINALMERGE as usual.


The logical tape abstraction is currently too low-level for that. It's 
just a tape of bytes, and tuplesort.c determines where a tuple begins 
and ends. That needs to be changed so that the logical tape abstraction 
works tuple-at-a-time instead. For example, instead of 
LogicalTapeRead(N), which reads N bytes, you would have 
LogicalTapeReadTuple(), which reads the next tuple and returns its length 
and the tuple itself. But that would be quite sensible anyway.
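
Side by side, the existing call (real signature from logtape.h) and a
hypothetical tuple-at-a-time replacement; the new name, signature, and
memory-ownership rules are all illustrative, not a settled design:

/* Existing byte-oriented interface (from logtape.h): */
extern size_t LogicalTapeRead(LogicalTapeSet *lts, int tapenum,
                              void *ptr, size_t size);

/*
 * Hypothetical replacement: reads the next tuple from the tape,
 * returning its length (0 at end of tape) and a pointer to the tuple
 * data.
 */
extern size_t LogicalTapeReadTuple(LogicalTapeSet *lts, int tapenum,
                                   void **tuple);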


- Heikki





Re: [HACKERS] WIP: SCRAM authentication

2015-08-04 Thread Michael Paquier
On Mon, Mar 30, 2015 at 7:52 PM, Heikki Linnakangas  wrote:

> There have been numerous threads on replacing our MD5 authentication
> method, so I started hacking on that to see what it might look like. Just
> to be clear, this is 9.6 material. Attached is a WIP patch series that adds
> support for SCRAM. There's no need to look at the details yet, but it
> demonstrates what the protocol changes and the code structure would be like.
>
> I'm not wedded to SCRAM - SRP or JPAKE or something else might be better.
> But replacing the algorithm, or adding more of them, should be
> straightforward with this.
>

Agreed. We need such a facility.


> There is no negotiation of the authentication mechanism. SCRAM is just
> added as a new one, alongside all the existing ones. If the server requests
> SCRAM authentication, but the client doesn't support it, the attempt will
> fail. We might want to do something about that, to make the transition
> easier, but it's an orthogonal feature and not absolutely required.
>
> There are four patches in the series. The first two are just refactoring:
> moving the SHA-1 implementation from pgcrypto to src/common, and some
> refactoring in src/backend/auth.c that IMHO would make sense anyway.
>

The first two patches of the series look good to me.


> Patches three and four are the interesting ones:
>

I have not looked in detail yet at the one implementing SCRAM.


> 3. Allow storing multiple verifiers in pg_authid
> 
>
> Replace the pg_authid.rolpassword text field with an array, and rename it
> to 'rolverifiers'. This allows storing multiple password hashes: an MD5
> hash for MD5 authentication, and a SCRAM salt and stored key for SCRAM
> authentication, etc. Each element in the array is a string that begins with
> the method's name. For example "md5:", or "password:".
>
> For dump/reload, and for clients that wish to create the hashes in the
> client-side, there is a new option to CREATE/ALTER USER commands: PASSWORD
> VERIFIERS '{ ... }', that allows replacing the array.
>
> The old "ENCRYPTED/UNENCRYPTED PASSWORD 'foo'" options are still supported
> for backwards-compatibility, but it's not clear what it should mean now.
>
> TODO:
>
> * Password-checking hook needs to be redesigned, to allow for more kinds
> of hashes.
>
> * With "CREATE USER PASSWORD 'foo'", which hashes/verifiers should be
> generated by default? We currently have a boolean password_encryption
> setting for that. Needs to be a list.
>

I have been looking more in depth at this one, which adds essential
infrastructure to support multiple authentication hashes for more
protocols. Here are some comments:
- Docs are missing (not a big issue for a WIP)
- Instead of an array with the method identifier embedded in each element,
let's add a new catalog pg_authid_hashes that stores all the hashes for a
user (idea by Heikki; see the header sketch after this list):
-- hashrol, role Oid associated with the hash
-- hashmet, hash method
-- hashval, value of the hash
- New password-checking hook (contrib/passwordcheck will need a refresh).
As of now, we have this:
typedef void (*check_password_hook_type) (const char *username,
                                          const char *password,
                                          int password_type,
                                          Datum validuntil_time,
                                          bool validuntil_null);
We need to switch to something that checks a list of hashes:
typedef void (*check_password_hook_type) (const char *username,
                                          List *passwd,
                                          Datum validuntil_time,
                                          bool validuntil_null);
passwd is a list whose elements carry the password type and the hash value
(a possible shape is sketched after this list). The password type can then
be "plain" (matching the "password" method in pg_hba.conf) or "md5" for
now.
- When password_encryption is switched to a list, true means md5, and false
means plain. At the addition of SCRAM, we could think harder about the
default value; "true" may be worth meaning "md5,scram".
- For CREATE ROLE/ALTER ROLE, it is necessary to be able to define the list
of hashes that need to be generated, with something like this for example:
[ ENCRYPTED [(md5[, scram])] | UNENCRYPTED ] PASSWORD 'password'
When UNENCRYPTED is used, we could simply store the password as plain. When
only ENCRYPTED is used, we store it for all the available methods except
"plain". ENCRYPTED combined with plain is not allowed.
- Also, do we really want an option at the SQL level to allow storing custom
hashes generated client-side as a first step? We could have something
like WITH (md5 = 'blah', scram = 'blah2') appended after PASSWORD, for
example.
- rolpassword is removed from pg_authid.
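
To make the pg_authid_hashes idea above concrete, a rough catalog header
in the usual style; the OID placeholder, BKI flag, and column types are
my guesses for illustration, not a settled design:

/*
 * Hypothetical src/include/catalog/pg_authid_hashes.h sketch.
 */
CATALOG(pg_authid_hashes,0000) BKI_SHARED_RELATION
{
    Oid         hashrol;        /* OID of the role this hash belongs to */
    NameData    hashmet;        /* hash method: "plain", "md5", ... */

#ifdef CATALOG_VARLEN           /* variable-length fields start here */
    text        hashval;        /* the hash (verifier) value itself */
#endif
} FormData_pg_authid_hashes;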
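And a possible shape for the elements of the hook's passwd list mentioned
above (struct and field names invented for illustration only):

/*
 * Hypothetical element of the "passwd" List handed to the reworked
 * password-checking hook.
 */
typedef struct PasswordHash
{
    char   *method;     /* "plain", "md5", later "scram", ... */
    char   *value;      /* cleartext for "plain", the hash otherwise */
} PasswordHash;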

I am willing to write a patch for the next CF following more or less those
lines, depending of course on the outcome of the discussion we can have
here, so feel free to comment.

I'll have a more in-depth look at the SCRAM patch as well.
Regards,
-- 
Michael


Re: [HACKERS] FSM versus GIN pending list bloat

2015-08-04 Thread Heikki Linnakangas

On 08/04/2015 08:03 AM, Jeff Janes wrote:

For a GIN index with fastupdate turned on, both the user backends and the
autoanalyze routine will clear out the pending list, pushing the entries
into the normal index structure and deleting the pages used by the pending
list.  But those deleted pages will not get added to the free space map
until a vacuum is done.  This leads to horrible bloat on insert-only
tables, as they are never vacuumed and so the pending-list space is never
reused.  And the pending list is very inefficient in space usage to start
with, even compared to the old-style posting lists and especially compared
to the new compressed ones.  (If the pages were aggressively recycled, this
inefficient use wouldn't be much of a problem.)


Good point.


The attached proof-of-concept patch greatly improves the bloat for both the
insert and the update cases.  You need to turn on both features: adding the
pages to the FSM, and vacuuming the FSM, to get the benefit (so JJ_GIN=3).  The
first of those two things could probably be adopted for real, but the
second probably is not acceptable.  What is the right way to do this?
Could a variant of RecordFreeIndexPage bubble the free space up the map
immediately rather than waiting for a vacuum?  It would only have to move
up until it found a page with free space already recorded in it, which the
vast majority of the time would mean looking one level up and then not
writing to it, assuming the pending-list pages remain well clustered.


Yep, that sounds good.
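
For concreteness, a rough sketch of such a variant.  fsm_get_location(),
fsm_get_parent(), and fsm_set_and_search() are existing statics in
src/backend/storage/freespace/freespace.c, so something like this would
have to live there; the early-exit optimization Jeff describes is only
noted in a comment, since doing it cheaply needs feedback this sketch
doesn't model:

/*
 * Illustrative sketch only, not the actual patch: a variant of
 * RecordFreeIndexPage() that pushes the freed page's category up the
 * FSM tree immediately instead of waiting for FreeSpaceMapVacuum().
 */
static void
fsm_record_free_page_eagerly(Relation rel, BlockNumber freeBlock)
{
    uint16      slot;
    FSMAddress  addr = fsm_get_location(freeBlock, &slot);

    for (;;)
    {
        /* Mark the slot as a wholly free page (highest category). */
        fsm_set_and_search(rel, addr, slot, 255, 0);

        if (addr.level == FSM_ROOT_LEVEL)
            break;

        /*
         * Jeff's optimization would stop here whenever the parent slot
         * already records free space; checking that cheaply needs
         * fsm_set_avail()'s "did the page's max change" feedback,
         * omitted in this sketch.
         */
        addr = fsm_get_parent(addr, &slot);
    }
}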

- Heikki


