Re: [HACKERS] Better support of exported snapshots with pg_dump

2014-10-14 Thread Michael Paquier
On Wed, Oct 15, 2014 at 2:46 PM, Andres Freund 
wrote:

> > This seems more user-friendly. But I agree that we could do
> > a larger set of things that could be used for other purposes as well:
> > - Ability to define a snapshot name with pg_dump
> > - Take a system- or database-wide lock
> > - Extra client application running the whole
> > Now, is this set of features worth doing, knowing that snapshot export has
> > been designed for multi-threaded closed applications? I am not sure.
> > Regards,
>
> What do you mean by "designed for multi-threaded closed applications"?
>
External snapshot creation and control should be localized within
dedicated client applications only. At least that's what I understand,
as that's how it is used now.
-- 
Michael


Re: [HACKERS] WIP: dynahash replacement for buffer table

2014-10-14 Thread Andres Freund
On 2014-10-14 17:53:10 -0400, Robert Haas wrote:
> On Tue, Oct 14, 2014 at 4:42 PM, Andres Freund  wrote:
> >> The code in CHashSearch shows the problem there: you need to STORE the
> >> hazard pointer before you begin to do the LOAD operations to scan the
> >> bucket, and you must finish all of those LOADs before you STORE the
> >> NULL hazard pointer.  A read or write barrier won't provide store/load
> >> or load/store ordering.
> >
> > I'm not sure I understand the problem here - but a read + write barrier
> > should suffice? To avoid falling back to two full barriers, we'd need a
> > separate define pg_read_write_barrier(), but that's not a problem. IIUC
> > that should allow us to avoid emitting any actual code on x86.
> 
> Well, thinking about x86 specifically, it definitely needs at least
> one mfence, after setting the hazard pointer and before beginning to
> read the bucket chain.

Yes, I can see that now. I do wonder if that's not going to prove quite
expensive... I think I can see some ways around needing that. But I
think that needs some benchmarking first - no need to build an even more
complex solution if it's not needed.
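
To illustrate the ordering constraint Robert describes, here is a minimal
sketch of a hazard-pointer lookup; the types and the hazard slot are made up
for illustration (this is not the actual CHashSearch code), and the usual
re-validation step after publishing the hazard pointer is omitted for brevity:

#include "postgres.h"
#include "storage/barrier.h"

typedef struct HazNode
{
	uint32		key;
	struct HazNode *next;
} HazNode;

extern HazNode *volatile *my_hazard_slot;	/* this backend's published slot */

static bool
hazard_lookup(HazNode *volatile *bucket, uint32 key)
{
	HazNode    *node = *bucket;
	bool		found = false;

	*my_hazard_slot = node;		/* STORE: advertise what we are about to read */
	pg_memory_barrier();		/* full barrier: the LOADs below must not be
								 * reordered before the STORE above */

	for (; node != NULL; node = node->next)		/* LOADs of the bucket chain */
	{
		if (node->key == key)
		{
			found = true;
			break;
		}
	}

	pg_memory_barrier();		/* keep the LOADs above ahead of the STORE below;
								 * per the discussion, x86 preserves this ordering anyway */
	*my_hazard_slot = NULL;		/* STORE: deleters may reclaim again */
	return found;
}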

The solution I'm thinking of is essentially to move away from hazard
pointers and store something like a generation counter per
backend, which is updated less often, and in combination with other
operations. When a backend needs to clean up and sees that there's a
straggler with an old generation, it sends that backend a signal to ensure
it sets the latest generation.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] narwhal and PGDLLIMPORT

2014-10-14 Thread Craig Ringer
On 10/15/2014 12:53 PM, Noah Misch wrote:
> Windows Server 2003 isn't even EOL yet.  I'd welcome a buildfarm member with
> that OS and a modern toolchain.

It's possible to run multiple buildfarm animals on a single Windows
instance, each with a different toolchain.

There's a chance of interactions that couldn't occur if each SDK were used
in isolation on its own machine, but that chance should be pretty minimal.

I have a machine that currently does this with Jenkins. I'll look at
bringing up buildfarm members on it. It's win2k8 though.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Better support of exported snapshots with pg_dump

2014-10-14 Thread Andres Freund
On 2014-10-15 14:28:16 +0900, Michael Paquier wrote:
> On Wed, Oct 15, 2014 at 2:06 PM, Andres Freund 
> wrote:
> 
> > I think that's completely the wrong way to go at this. The time it takes
> > to create a replication slot under write load is far larger than the
> > time it takes to start pg_dump and load. This really doesn't add any
> > actual safety. Also, the inability to use the snapshot outside of
> > pg_dump restricts the feature far too much imo.
> >
> > I personally think we should just disregard the race here and add a
> > snapshot parameter. The race is already there and not exactly
> > small. Let's not kid ourselves that hiding it solves anything.
> >
> > But if that's not the way to go, we need to think about a way of how to
> > prevent "problematic" DDL that's not racy.
> >
> 
> Well, I would be perfectly happy to be able to specify a snapshot for
> pg_dump. The reason this approach is used is to isolate the DDL conflicts
> into pg_dump itself without relying on any external mechanism, be it an
> extra client controlling locks on the objects being dumped, or a
> system-wide lock preventing any DDL command (a LOCK SYSTEM kind of
> thing).

There's no 'isolation' here imo. pg_dump *does not* detect these
cases. I've posted a couple of examples of that in some earlier thread
about this.

> This seems more user-friendly. But I agree that we could do
> a larger set of things that could be used for other purposes as well:
> - Ability to define a snapshot name with pg_dump
> - Take a system- or database-wide lock
> - Extra client application running the whole
> Now, is this set of features worth doing, knowing that snapshot export has
> been designed for multi-threaded closed applications? I am not sure.
> Regards,

What do you mean by "designed for multi-threaded closed applications"?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Better support of exported snapshots with pg_dump

2014-10-14 Thread Michael Paquier
On Wed, Oct 15, 2014 at 2:06 PM, Andres Freund 
wrote:

> I think that's completely the wrong way to go at this. The time it takes
> to create a replication slot under write load is far larger than the
> time it takes to start pg_dump and load. This really doesn't add any
> actual safety. Also, the inability to use the snapshot outside of
> pg_dump restricts the feature far too much imo.
>
> I personally think we should just disregard the race here and add a
> snapshot parameter. The race is already there and not exactly
> small. Let's not kid ourselves that hiding it solves anything.
>
> But if that's not the way to go, we need to think about a way of how to
> prevent "problematic" DDL that's not racy.
>

Well, I would be perfectly happy to be able to specify a snapshot for
pg_dump. The reason this approach is used is to isolate the DDL conflicts
into pg_dump itself without relying on any external mechanism, be it an
extra client controlling locks on the objects being dumped, or a
system-wide lock preventing any DDL command (a LOCK SYSTEM kind of
thing). This seems more user-friendly. But I agree that we could do
a larger set of things that could be used for other purposes as well:
- Ability to define a snapshot name with pg_dump
- Take a system- or database-wide lock
- Extra client application running the whole
Now, is this set of features worth doing, knowing that snapshot export has
been designed for multi-threaded closed applications? I am not sure.
Regards,
-- 
Michael


[HACKERS] Additional role attributes && superuser review

2014-10-14 Thread Stephen Frost
Greetings,

  The attached patch for review implements a few additional role
  attributes (all requested by users or clients in various forums) for
  common operations which currently require superuser privileges.  This
  is not a complete solution for all of the superuser-only privileges we
  have but it's certainly good progress and along the correct path, as
  shown below in a review of the existing superuser checks in the
  backend.

  First though, the new privileges, about which the bikeshedding can
  begin, short-and-sweet format:

  BACKUP:
pg_start_backup()
pg_stop_backup()
pg_switch_xlog()
pg_create_restore_point()

  LOGROTATE:
pg_rotate_logfile()

  MONITOR:
View detailed information regarding other processes.
pg_stat_get_wal_senders()
pg_stat_get_activity()

  PROCSIGNAL:
pg_signal_backend()
(not allowed to signal superuser-owned backends)

  Before you ask, yes, this patch also does a bit of rework and cleanup.

  Yes, that could be done as an independent patch; just ask, but the
  changes are relatively minor.  One curious item to note is that the
  current if(!superuser()) {} block approach has masked an inconsistency
  in at least the FDW code which required a change to the regression
  tests; namely, we normally force FDW owners to have USAGE rights on
  the FDW, but that was being bypassed when a superuser made the call.
  I seriously doubt that was intentional.  I won't push back if someone
  really wants it to stay that way though.

  This also includes the previously discussed changes for pgstat and
  friends to use role membership instead of GetUserId() == backend_role.

  Full documentation is not included but will be forthcoming, of course.

  As part of working through what the best solution is regarding the
  existing superuser-only checks, I did a review of more-or-less all of
  the ones which exist in the backend.  From that review, I came to the
  conclusion (certainly one which can be debated, but still) that most
  cases of superuser() checks that should be possible for non-superusers
  to do are yes/no privileges, except for when it comes to server-side
    COPY and CREATE TABLESPACE operations, which also need a form of
    directory-level access control (a patch for that will be showing up
    shortly).
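
  To show the general shape of these changes (not the patch text itself),
  a superuser-only check such as the one guarding pg_start_backup() might
  become something along these lines; has_rolbackup() is a made-up helper
  name used only for this sketch:

    /* Hypothetical sketch; the patch may structure this differently. */
    if (!superuser() && !has_rolbackup(GetUserId()))
        ereport(ERROR,
                (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
                 errmsg("must be superuser or have the BACKUP attribute")));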

  For posterity's sake, here's my review and comments on the various
  existing superuser checks in the backend (those not addressed above):

  CREATE EXTENSION
This could be a role attribute as the others above, but I didn't
want to try and include it in this patch as it has a lot of hairy
parts, I expect.

  Connect using 'reserved' slots
This is another one which we might add as another role attribute.

  Only needed during recovery, where it's fine to require superuser:
pg_xlog_replay_pause()
pg_xlog_replay_resume()

  Directory / File access (addressed independently):
libpq/be/fsstubs.c
  lo_import()
  lo_export()

commands/tablespace.c - create tablespace
commands/copy.c - server-side COPY TO/FROM

utils/adt/genfile.c
  pg_read_file() / pg_read_text_file() / pg_read_binary_file()
  pg_stat_file()
  pg_ls_dir()

  Lots of things depend on being able to create C functions.  These, in
  my view, should either be done through extensions or should remain the
  purview of the superuser.  Below are the ones which I identified as
  falling into this category:

  commands/tsearchcmd.c
Create text search parser
Create text search template

  tcop/utility.c
LOAD (load shared library)

  commands/foreigncmds.c
Create FDW, Alter FDW
FDW ownership

  commands/functioncmds.c
create binary-compatible casts
create untrusted-language functions

  commands/proclang.c
Create procedural languages (unless PL exists in pg_pltemplate)
Create custom procedural language

  commands/opclasscmds.c
Create operator class
Create operator family
Alter operator family

  commands/aggregatecmds.c
Create aggregates which use 'internal' types

  commands/functioncmds.c
create leakproof functions
alter function to be leakproof

  commands/event_trigger.c
create event trigger

  Other cases which I don't think would make sense to change (and some I
  wonder if we should prevent the superuser from doing, unless they have
  rolcatupdate or similar...):

  commands/alter.c
rename objects (for objects which don't have an 'owner')
change object schema (ditto)

  commands/trigger.c
enable or disable internal triggers

  commands/functioncmds.c
execute DO blocks with untrusted languages

  commands/dbcommands.c
allow encoding/locale to be different

  commands/user.c
set 'superuser' on a role
alter superuser roles (other options)
drop superuser roles
alter role membership for superusers
force 'granted by' on a role grant
alter global settings
rename superuser roles

  utils/misc/guc.c
set superuser-only G

Re: [HACKERS] Better support of exported snapshots with pg_dump

2014-10-14 Thread Andres Freund
On 2014-10-15 07:09:10 +0900, Michael Paquier wrote:
> > whatever replication solution you use and just have pg_dump accept the
> > snapshot as an input parameter? I am not sure how much I like pg_dump creating
> > the slot. I am aware that you need to have the replication connection open
> > but that's IMHO just a matter of scripting it together.
> The whole point of the operation is to reduce the window between when the
> external snapshot is taken and when it is used, to reduce the risk of race
> conditions that may be induced by schema changes. See for example the
> discussions about why we do not allow specifying a snapshot name as an
> option of pg_dump.

I think that's completely the wrong way to go at this. The time it takes
to create a replication slot under write load is far larger than the
time it takes to start pg_dump and load. This really doesn't add any
actual safety. Also, the inability to use the snapshot outside of
pg_dump restricts the feature far too much imo.

I personally think we should just disregard the race here and add a
snapshot parameter. The race is already there and not exactly
small. Let's not kid ourselves that hiding it solves anything.

But if that's not the way to go, we need to think about a way of how to
prevent "problematic" DDL that's not racy.
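
For what it's worth, the "snapshot parameter" route is easy enough to script
together; a rough libpq sketch follows, with error handling omitted.  The
--snapshot switch shown here is the proposed pg_dump option under discussion,
not something pg_dump has today, and the replication connection must stay
open (keeping the exported snapshot alive) until the dump finishes:

#include <stdio.h>
#include <stdlib.h>
#include "libpq-fe.h"

int
main(void)
{
	PGconn	   *conn = PQconnectdb("dbname=mydb replication=database");
	PGresult   *res;
	char		cmd[1024];

	/* Creating a logical slot exports a snapshot (snapshot_name column). */
	res = PQexec(conn, "CREATE_REPLICATION_SLOT myslot LOGICAL test_decoding");

	snprintf(cmd, sizeof(cmd), "pg_dump --snapshot=%s -f dump.sql mydb",
			 PQgetvalue(res, 0, PQfnumber(res, "snapshot_name")));
	system(cmd);				/* 'conn' stays open while pg_dump runs */

	PQclear(res);
	PQfinish(conn);
	return 0;
}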

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] narwhal and PGDLLIMPORT

2014-10-14 Thread Noah Misch
On Tue, Oct 14, 2014 at 07:07:17PM -0400, Tom Lane wrote:
> Dave Page  writes:
> > On Tue, Oct 14, 2014 at 11:38 PM, Tom Lane  wrote:
> >> I think we're hoping that somebody will step up and investigate how
> >> narwhal's problem might be fixed.

I have planned to look at reproducing narwhal's problem once the dust settles
on orangutan, but I wouldn't mind if narwhal went away instead.

> > However, if "fixing" it comes down to upgrading the seriously old
> > compiler and toolchain on that box (which frankly is so obsolete, I
> > can't see why anyone would want to use anything like it these days),
> > then I think the best option is to retire it, and replace it with
> > Windows 2012R2 and a modern release of MinGW/Msys which is far more
> > likely to be similar to what someone would want to use at present.

If you upgrade the toolchain, you really have a new animal.

> No argument here.  I would kind of like to have more than zero
> understanding of *why* it's failing, just in case there's more to it
> than "oh, probably a bug in this old toolchain".  But finding that out
> might well take significant time, and in the end not tell us anything
> very useful.

Agreed on all those points.

> Is it likely that anyone is still using Windows 2003 in the field?
> A possible compromise is to update the toolchain but stay on the
> same OS release, so that we still have testing that's relevant to
> people using older OS releases.

Windows Server 2003 isn't even EOL yet.  I'd welcome a buildfarm member with
that OS and a modern toolchain.




Re: [HACKERS] group locking: incomplete patch, just for discussion

2014-10-14 Thread Robert Haas
On Wed, Oct 15, 2014 at 12:13 AM, Tom Lane  wrote:
> What is "timely service lock requests"?  You got the lock before firing
> off the background workers, you hold it till they're done.

If you want to run any non-trivial (read: useful) code in the
background workers, it rapidly gets very hard to predict which
relation locks they might need, and infeasible to hold them for the
lifetime of the computation.  For example, suppose we restrict
background workers to running only operators found in btree opclasses.
That's a far more draconian restriction than we'd like to have, but
perhaps liveable in a pinch.  But bttextcmp() uses
PG_GETARG_TEXT_PP(), which means it may (or may not) need to lock the
TOAST table; and it can indirectly call lookup_collation_cache() which
does SearchSysCache1(COLLOID, ...) which may result in scanning
pg_collation.  And enum_cmp_internal() will do
SearchSysCache1(ENUMOID, ...) which may result in scanning pg_enum.

There's obviously a balance to be struck, here.  It will never be safe
to run just anything in a background worker, and we'll go crazy if we
try to make that work.  On the other hand, if there's essentially no
code that can run in a background worker without extensive
modification, we might as well not bother implementing parallelism in
the first place.  I admit that getting group locking right is far from
simple, but I also don't believe it's as complicated as previous
projects like SSI, logical decoding, or LATERAL.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] group locking: incomplete patch, just for discussion

2014-10-14 Thread Tom Lane
Robert Haas  writes:
> For parallelism, I think we need a concept of group locking.  That is,
> suppose we have a user backend and N worker backends collaborating to
> execute some query.  For the sake of argument, let's say it's a
> parallel CLUSTER, requiring a full table lock.  We need all of the
> backends to be able to lock the table even though at least one of them
> holds AccessExclusiveLock.  This suggests that the backends should all
> be members of a locking group, and that locks within the same locking
> group should be regarded as mutually non-conflicting.

In the background worker case, I imagined that the foreground process
would hold a lock and the background processes would just assume they
could access the table without holding locks of their own.  Aren't
you building a mountain where a molehill would do?

> [1] You could argue that one process instead should take locks on
> behalf of the group.  But I think this is a terrible approach.  First,
> it means that whichever process is holding the locks, presumably the
> user backend, had better not try to do any real computation of its
> own, so that it can timely service lock requests, which seems like a
> huge loss of efficiency;

What is "timely service lock requests"?  You got the lock before firing
off the background workers, you hold it till they're done.

This is a truly horrible excuse for the amount of complexity (and,
without doubt, bugs and performance penalties) you propose to load onto
the lock mechanism.

regards, tom lane




[HACKERS] group locking: incomplete patch, just for discussion

2014-10-14 Thread Robert Haas
For parallelism, I think we need a concept of group locking.  That is,
suppose we have a user backend and N worker backends collaborating to
execute some query.  For the sake of argument, let's say it's a
parallel CLUSTER, requiring a full table lock.  We need all of the
backends to be able to lock the table even though at least one of them
holds AccessExclusiveLock.  This suggests that the backends should all
be members of a locking group, and that locks within the same locking
group should be regarded as mutually non-conflicting.
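
To make that concrete, the conflict test might be adjusted roughly as
follows; the lockGroupLeader field and the function shape are hypothetical
illustrations, not what the attached draft patch actually does:

#include "postgres.h"
#include "storage/lock.h"
#include "storage/proc.h"

/*
 * Sketch: a requested lock does not conflict with locks held by members of
 * the requester's own locking group.
 */
static bool
GroupAwareConflict(LockMethod lockMethodTable, LOCKMODE reqMode,
				   LOCKMODE heldMode, PGPROC *requester, PGPROC *holder)
{
	/* No conflict by lock mode means no conflict at all. */
	if ((lockMethodTable->conflictTab[reqMode] & LOCKBIT_ON(heldMode)) == 0)
		return false;

	/* Members of the same locking group never conflict with each other. */
	if (requester->lockGroupLeader != NULL &&
		requester->lockGroupLeader == holder->lockGroupLeader)
		return false;

	return true;
}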

At least to me, that simple scenario is clear-cut[1], but what do we
do in more complicated situations?  For example, suppose backends A
and B are members of the same locking group.  A locks a relation with
AccessShareLock, an unrelated process X queues up waiting for an
AccessExclusiveLock, and then B also requests AccessShareLock.  The
normal behavior here is that B should wait for X to go first, but here
that's a problem.  If A is the user backend and B is a worker backend,
A will eventually wait for B, which is waiting for X, which is waiting
for A: deadlock.  If it's the other way around, we might survive, but
at the very best it's undesirable to have "parallelism" where A and B
are actually completely serialized, with X taking its turn in the
middle.  Now, in practical cases, we probably expect that if A and B
lock the same relation, they're going to do so in very quick
succession, so the actual behavior of the lock manager shouldn't
matter much in terms of fairness.  But since we can't predict what
other processes may do in the window between those lock acquisitions,
a sound theoretical approach is needed to avoid deadlocks.

After some thought I've come to the conclusion that we should assume
that every process in a locking group will eventually wait for every
other process in that locking group.  It follows that once any single
process in a locking group obtains any lock on an object, any future
lock request by another lock group member on that object should, if
non-conflicting, be granted immediately.  It also follows that, when
processes in a locking group are waiting for a lock on an object, none
should be awoken until all lock requests have been satisfied.  For
example, suppose process X holds AccessExclusiveLock on a relation.
First, process Y waits for AccessShareLock.  Then, processes A and B,
members of the same locking group, queue up respectively for
AccessShareLock and AccessExclusiveLock, in that order.  Next, process
X releases its lock.  At this point, we should wake Y *but not A*; A
should continue to wait until we can grant locks to both A and B.  The
reason is that a new process Q might come along and, for purposes of
deadlock avoidance, we might need to reorder the lock queue.  Putting
Q ahead of both A and B is fine; but putting it between A and B is bad
for the reasons discussed in the previous paragraph.

Attached is an incomplete draft patch.  It modifies
LockAcquireExtended() and ProcSleep() but not ProcWakeup() or the
deadlock detector.  Aside from being incomplete, which is a problem in
and of itself, I don't have a very good idea how to comprehensively
test something like this.  I can of course cobble together some test
code for particular scenarios, but it's not at all clear to me what a
comprehensive test plan would look like.  Any ideas on that point
would be particularly appreciated, as would feedback on the overall
design.   A lot more work is clearly needed here, but I've been known
to advocate soliciting feedback on proposed patches early, so here I
am taking my own advice.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

[1] You could argue that one process instead should take locks on
behalf of the group.  But I think this is a terrible approach.  First,
it means that whichever process is holding the locks, presumably the
user backend, had better not try to do any real computation of its
own, so that it can timely service lock requests, which seems like a
huge loss of efficiency; if the optimum degree of parallelism is 2,
we'll need 50% more processes and the whole result set will have to be
copied an extra time.  Yuck.  Second, it means that the lock-holding
process must refuse to die until all of the processes on behalf of
which it is holding locks don't need them any more.  And I hate
backend processes that refuse to die with a fiery passion that cannot
be quenched.
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 723051e..85997f6 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -210,6 +210,10 @@ static bool FastPathUnGrantRelationLock(Oid relid, LOCKMODE lockmode);
 static bool FastPathTransferRelationLocks(LockMethod lockMethodTable,
 			  const LOCKTAG *locktag, uint32 hashcode);
 static PROCLOCK *FastPathGetRelationLockEntry(LOCALLOCK *locallock);
+static bool GroupLockShouldJumpQueue(LockMethod lockMethodTa

Re: [HACKERS] Wait free LW_SHARED acquisition - v0.9

2014-10-14 Thread Amit Kapila
On Wed, Oct 15, 2014 at 12:06 AM, Merlin Moncure  wrote:
>
> A while back, I submitted a minor tweak to the clock sweep so that,
> instead of spinlocking every single buffer header as it swept, it just
> did a single TAS as a kind of trylock and punted to the next buffer
> if the test failed, on the principle that there's no good reason to hang
> around.  You only spin if you passed the first test; that should
> reduce the likelihood of actual spinning to approximately zero.  I
> still maintain there's no reason not to do that (I couldn't show a
> benefit but that was because mapping list locking was masking any
> clock sweep contention at that time).

If you feel you can now show the benefit, then I think you can rebase
it for the coming commitfest (which is going to start today).
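
A rough sketch of the trylock idea described above, against the 9.4-era
buffer header spinlock; this is illustrative only, not the patch that was
actually submitted:

#include "postgres.h"
#include "storage/buf_internals.h"
#include "storage/spin.h"

/*
 * Sketch: try the buffer header lock exactly once; if somebody else holds
 * it, tell the caller to punt to the next buffer instead of spinning.
 */
static volatile BufferDesc *
try_lock_buffer_header(int victim)
{
	volatile BufferDesc *buf = &BufferDescriptors[victim];

	if (TAS(&buf->buf_hdr_lock))	/* nonzero: already held, don't wait */
		return NULL;				/* caller advances the clock hand */

	/* Caller inspects refcount/usage_count, then does S_UNLOCK(). */
	return buf;
}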


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] [PATCH] Cleanup: unify checks for catalog modification

2014-10-14 Thread Alvaro Herrera
Tom Lane wrote:
> Marti Raudsepp  writes:
> > I happened to notice that there are no less than 14 places in the code
> > that check whether a relation is a system catalog and throw the
> > error "permission denied: "foo" is a system catalog"
> 
> > The attached patch factors all of those into a single
> > ForbidSystemTableMods() function. Is this considered an improvement?
> 
> I'd argue not.  The code bulk savings is minimal, and this change
> would degrade the usefulness of the file/line number reporting that's
> built into ereport().

Along these lines, I've sometimes thought that it could be useful to
pass file/line info from certain callers down to generic check
subroutines such as the one being proposed here.  (I can't recall
any specific examples offhand.)  Of course, doing it manually would be
very tedious and error-prone, but perhaps we could have something like a
macro system that both adds the extra arguments at the call site and
makes the values available in the callee.
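
A minimal sketch of that kind of macro, reusing the __FILE__/__LINE__
capture that elog() itself relies on; the names here are made up purely for
illustration:

#include "postgres.h"
#include "catalog/catalog.h"
#include "miscadmin.h"
#include "utils/rel.h"

/* Capture the call site and hand it to one shared check routine. */
#define CheckNotSystemCatalog(rel) \
	check_not_system_catalog((rel), __FILE__, __LINE__)

static void
check_not_system_catalog(Relation rel, const char *callsite_file, int callsite_line)
{
	if (IsSystemRelation(rel) && !allowSystemTableMods)
		ereport(ERROR,
				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
				 errmsg("permission denied: \"%s\" is a system catalog",
						RelationGetRelationName(rel)),
				 errdetail("The check was reached from %s:%d.",
						   callsite_file, callsite_line)));
}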

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] [PATCH] Cleanup: unify checks for catalog modification

2014-10-14 Thread Tom Lane
Marti Raudsepp  writes:
> I happened to notice that there are no less than 14 places in the code
> that check whether a relation is a system catalog and throw the
> error "permission denied: "foo" is a system catalog"

> The attached patch factors all of those into a single
> ForbidSystemTableMods() function. Is this considered an improvement?

I'd argue not.  The code bulk savings is minimal, and this change
would degrade the usefulness of the file/line number reporting that's
built into ereport().  Admittedly it's a judgment call --- we've certainly
built error-reporting subroutines in cases where a significant amount of
complexity could be folded into the subroutine.  But I don't think this
case meets the threshold.

regards, tom lane




Re: [HACKERS] narwhal and PGDLLIMPORT

2014-10-14 Thread Tom Lane
Dave Page  writes:
> On Tue, Oct 14, 2014 at 11:38 PM, Tom Lane  wrote:
>> I think we're hoping that somebody will step up and investigate how
>> narwhal's problem might be fixed.  However, the machine's owner (Dave)
>> doesn't appear to have the time/interest to do that.

> It's a time issue right now I'm afraid (always interested in fixing bugs).

> However, if "fixing" it comes down to upgrading the seriously old
> compiler and toolchain on that box (which frankly is so obsolete, I
> can't see why anyone would want to use anything like it these days),
> then I think the best option is to retire it, and replace it with
> Windows 2012R2 and a modern release of MinGW/Msys which is far more
> likely to be similar to what someone would want to use at present.

No argument here.  I would kind of like to have more than zero
understanding of *why* it's failing, just in case there's more to it
than "oh, probably a bug in this old toolchain".  But finding that out
might well take significant time, and in the end not tell us anything
very useful.

> Does anyone really think there's a good reason to keep maintaining
> such an obsolete animal?

Is it likely that anyone is still using Windows 2003 in the field?
A possible compromise is to update the toolchain but stay on the
same OS release, so that we still have testing that's relevant to
people using older OS releases.

regards, tom lane




Re: [HACKERS] narwhal and PGDLLIMPORT

2014-10-14 Thread Andrew Dunstan


On 10/14/2014 06:44 PM, Dave Page wrote:

On Tue, Oct 14, 2014 at 11:38 PM, Tom Lane  wrote:

Alvaro Herrera  writes:

It seems we left this in broken state.  Do we need to do more here to
fix narwhal, or do we want to retire narwhal now?  Something else?  Are
we waiting on someone in particular to do something specific?

I think we're hoping that somebody will step up and investigate how
narwhal's problem might be fixed.  However, the machine's owner (Dave)
doesn't appear to have the time/interest to do that.  That means that
our realistic choices are to retire narwhal or revert the linker changes
that broke it.  Since those linker changes were intended to help expose
missing-PGDLLIMPORT bugs, I don't much care for the second alternative.

It's a time issue right now I'm afraid (always interested in fixing bugs).

However, if "fixing" it comes down to upgrading the seriously old
compiler and toolchain on that box (which frankly is so obsolete, I
can't see why anyone would want to use anything like it these days),
then I think the best option is to retire it, and replace it with
Windows 2012R2 and a modern release of MinGW/Msys which is far more
likely to be similar to what someone would want to use at present.

Does anyone really think there's a good reason to keep maintaining
such an obsolete animal?




I do not. I upgraded from this ancient toolset quite a few years ago, 
and I'm actually thinking of retiring what I replaced it with.


cheers

andrew




[HACKERS] [PATCH] Cleanup: unify checks for catalog modification

2014-10-14 Thread Marti Raudsepp
Hi list,

I happened to notice that there are no less than 14 places in the code
that check whether a relation is a system catalog and throw the
error "permission denied: "foo" is a system catalog"

The attached patch factors all of those into a single
ForbidSystemTableMods() function. Is this considered an improvement?
Should I add it to CommitFest?
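
For reference, the consolidated helper would presumably look roughly like
this sketch (the attached patch may differ in detail):

#include "postgres.h"
#include "catalog/catalog.h"
#include "miscadmin.h"
#include "utils/rel.h"

static void
ForbidSystemTableMods(Relation rel)
{
	if (IsSystemRelation(rel) && !allowSystemTableMods)
		ereport(ERROR,
				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
				 errmsg("permission denied: \"%s\" is a system catalog",
						RelationGetRelationName(rel))));
}

/* Each of the 14 call sites would then shrink to: ForbidSystemTableMods(rel); */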

Regards,
Marti


0001-Cleanup-unify-checks-for-catalog-modification.patch
Description: binary/octet-stream



Re: [HACKERS] narwhal and PGDLLIMPORT

2014-10-14 Thread Dave Page
On Tue, Oct 14, 2014 at 11:38 PM, Tom Lane  wrote:
> Alvaro Herrera  writes:
>> It seems we left this in broken state.  Do we need to do more here to
>> fix narwhal, or do we want to retire narwhal now?  Something else?  Are
>> we waiting on someone in particular to do something specific?
>
> I think we're hoping that somebody will step up and investigate how
> narwhal's problem might be fixed.  However, the machine's owner (Dave)
> doesn't appear to have the time/interest to do that.  That means that
> our realistic choices are to retire narwhal or revert the linker changes
> that broke it.  Since those linker changes were intended to help expose
> missing-PGDLLIMPORT bugs, I don't much care for the second alternative.

It's a time issue right now I'm afraid (always interested in fixing bugs).

However, if "fixing" it comes down to upgrading the seriously old
compiler and toolchain on that box (which frankly is so obsolete, I
can't see why anyone would want to use anything like it these days),
then I think the best option is to retire it, and replace it with
Windows 2012R2 and a modern release of MinGW/Msys which is far more
likely to be similar to what someone would want to use at present.

Does anyone really think there's a good reason to keep maintaining
such an obsolete animal?

-- 
Dave Page
Blog: http://pgsnake.blogspot.com
Twitter: @pgsnake

EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] narwhal and PGDLLIMPORT

2014-10-14 Thread Tom Lane
Alvaro Herrera  writes:
> It seems we left this in broken state.  Do we need to do more here to
> fix narwhal, or do we want to retire narwhal now?  Something else?  Are
> we waiting on someone in particular to do something specific?

I think we're hoping that somebody will step up and investigate how
narwhal's problem might be fixed.  However, the machine's owner (Dave)
doesn't appear to have the time/interest to do that.  That means that
our realistic choices are to retire narwhal or revert the linker changes
that broke it.  Since those linker changes were intended to help expose
missing-PGDLLIMPORT bugs, I don't much care for the second alternative.

regards, tom lane




Re: [HACKERS] Proposal for better support of time-varying timezone abbreviations

2014-10-14 Thread Tom Lane
I wrote:
> I got interested in the problem discussed in
> http://www.postgresql.org/message-id/20714.1412456...@sss.pgh.pa.us
> to wit:
>> It's becoming clear to me that our existing design whereby zone
>> abbreviations represent fixed GMT offsets isn't really good enough.
>> I've been wondering whether we could change things so that, for instance,
>> "EDT" means "daylight time according to America/New_York" and the system
>> would consult the zic database to find out what the prevailing GMT offset
>> was in that zone on that date.  This would be a lot more robust in the
>> face of the kind of foolishness we now see actually goes on.

Attached is an updated patch for this, incorporating my previous work on
changing the representation of datetkn.  The code changes are complete
I believe, but I've not touched the user documentation yet, nor updated
the tznames data files except for changing MSK for testing purposes.

Some notes:

* There wasn't any reasonable way to return the required information about
a dynamic zone abbreviation from DecodeSpecial().  However, on looking
closely I found that of the existing callers of DecodeSpecial(), only two
actually relied on it to find both timezone abbreviations and regular
keywords.  The other callers only wanted one case or the other.  So it
seemed to me that it'd be best to split the abbreviation-lookup behavior
apart from keyword-lookup.  DecodeSpecial() now does only the latter, and
there's a new function DecodeTimezoneAbbrev() to do the former.  This
avoids touching any code that doesn't care about timezone abbreviations,
and should be a bit faster for those cases too (but about the same speed
otherwise).

* I found that there were pre-existing bugs in DetermineTimeZoneOffset()
for cases in which a zone changed offset but neither the preceding nor
following time segment was marked as DST time.  This caused strange
behaviors for cases like Europe/Moscow's 2011 and 2014 time changes.
This is not terribly surprising because we never thought about zone
changes other than simple DST spring-forward/fall-back changes when
that code was written.

* I've not touched ecpg except for cosmetic changes to keep the struct
definitions in sync, and to fix the previously-mentioned bogus free()
attempt.  I doubt that it would be worth teaching ecpg how to access the
zic timezone database --- the problem of configuring where to find those
files seems like more trouble than it's worth given the lack of
complaints.  I'm not sure what we should do about the obsolete timezone
abbreviations in its table.

* timetz_zone() and DecodeTimeOnly() are not terribly consistent about
how they resolve timezones when not given a date.  The former resolves
based on the value of time(NULL), which is a moving target even within
a transaction (which is why the function is marked volatile I guess).
The latter gets today's date as of start of transaction (so at least
it's stable) and then merges that m/d/y with the h/m/s from the time
value.  That behavior does not seem terribly well thought out either:
on the day of, or day before or after, a DST change it's likely to
produce strange results, especially if the session timezone is different
from the zone specified in the input.  I think we ought to consider
changing one or both of these, though it's not material for back-patching.

* In HEAD, we might want to extend the pg_timezone_abbrevs view to have
a column that shows the underlying zone for a dynamic abbreviation.
The attached patch just shows the abbreviation's behavior as of current
time (defined as now()), which is sort of good enough but certainly not
the whole truth.
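
Relating to the DetermineTimeZoneOffset() note above, a rough illustration
(not the patch's actual code path) of what resolving a dynamic abbreviation
involves once, say, EDT is tied to America/New_York:

#include "postgres.h"
#include "pgtz.h"
#include "utils/datetime.h"

/* Rough illustration only; the real lookup path in the patch differs. */
static int
dynamic_abbrev_offset(struct pg_tm *tm)
{
	pg_tz	   *tzp = pg_tzset("America/New_York");	/* zone behind "EDT" */

	tm->tm_isdst = -1;			/* let the zic data decide DST status */
	return DetermineTimeZoneOffset(tm, tzp);	/* prevailing GMT offset, in seconds */
}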

Comments?

regards, tom lane

diff --git a/contrib/btree_gist/btree_ts.c b/contrib/btree_gist/btree_ts.c
index a13dcc8..b9c2b49 100644
*** a/contrib/btree_gist/btree_ts.c
--- b/contrib/btree_gist/btree_ts.c
*** tstz_dist(PG_FUNCTION_ARGS)
*** 200,226 
   **/
  
  
! static Timestamp
  tstz_to_ts_gmt(TimestampTz ts)
  {
! 	Timestamp	gmt;
! 	int			val,
! 				tz;
! 
! 	gmt = ts;
! 	DecodeSpecial(0, "gmt", &val);
! 
! 	if (ts < DT_NOEND && ts > DT_NOBEGIN)
! 	{
! 		tz = val * 60;
! 
! #ifdef HAVE_INT64_TIMESTAMP
! 		gmt -= (tz * INT64CONST(1000000));
! #else
! 		gmt -= tz;
! #endif
! 	}
! 	return gmt;
  }
  
  
--- 200,210 
   **/
  
  
! static inline Timestamp
  tstz_to_ts_gmt(TimestampTz ts)
  {
! 	/* No timezone correction is needed, since GMT is offset 0 by definition */
! 	return (Timestamp) ts;
  }
  
  
diff --git a/src/backend/utils/adt/date.c b/src/backend/utils/adt/date.c
index 073104d..bb23b12 100644
*** a/src/backend/utils/adt/date.c
--- b/src/backend/utils/adt/date.c
*** timetz_zone(PG_FUNCTION_ARGS)
*** 2695,2718 
  	pg_tz	   *tzp;
  
  	/*
! 	 * Look up the requested timezone.  First we look in the date token table
! 	 * (to handle cases like "EST"), and if that f

Re: [HACKERS] replicating DROP commands across servers

2014-10-14 Thread Alvaro Herrera
Robert Haas wrote:
> On Fri, Oct 3, 2014 at 4:58 PM, Alvaro Herrera  
> wrote:
> > Robert Haas wrote:
> >> I'm not really very convinced that it's a good idea to expose this
> >> instead of just figuring out a way to parse the object identity.
> >
> > That's the first thing I tried.  But it's not pretty: you have to
> > extract schema names by splitting at a period (and what if a schema name
> > contains a period?),
> 
> Please tell me that the literals are escaped if necessary.  If so,
> this is pretty easy.  quote_literal() is not a hard transformation to
> reverse, and splitting on an unquoted period is not hard...

I don't think it is necessary to force parsing strings for something
that can be more conveniently obtained from the get go, just because
we're too lazy to change the existing definition of the function.  I'm
not saying it is impossible or extremely hard to parse the strings, but
since we can emit the right format with no extra effort, there doesn't
seem to be any point on doing it that way.  It's not like this patch
adds excessive extra complexity to this code, either.

I'd say that the most complex part of this patch is the addition of the
two flag ("normal" and "original") columns, which we would need
regardless of the rest of the patch; these are used to tell whether
there are routes to the given object that have the eponymous flags set
in the dependency graph.

> > It's just not sane to try to parse such text strings.
> 
> But this is a pretty ridiculous argument.  We have an existing parser
> that does it just fine, and a special-purpose parser that does just
> that (and not all of the other stuff that the main parser does) would
> be a great deal simpler.

We have a main parser because we have no other option --- it's the way
stuff gets into the system in the first place.  But in this case, it's
not about accepting communication from the outside world, but emitting
state that is already in the database, in a different format.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Better support of exported snapshots with pg_dump

2014-10-14 Thread Michael Paquier
On Tue, Oct 14, 2014 at 11:55 PM, Petr Jelinek  wrote:
> On 22/09/14 02:24, Michael Paquier wrote:
>>
>> On Thu, Sep 4, 2014 at 11:33 PM, Michael Paquier
>>
>> Taking a dump consistent with a replication slot is useful first for
>> online upgrade cases, because you can simply run pg_dump, have a slot
>> created, and also get a state of the database consistent with the slot
>> creation before replaying changes in one way or another. Using that, a
>> decoder that generates raw queries, and a receiver able to apply changes
>> on a remote Postgres server, it is possible to get a kind of live
>> migration solution from one Postgres instance to another for a single
>> database, as long as the origin server uses 9.4. Making the receiver
>> report write and flush positions also makes it possible for the origin
>> server to use the synchronous replication protocol to be sure that
>> changes got applied on the remote before performing a switch from the
>> origin to the remote (that may actually explain why multi-syncrep would
>> be useful here for multiple databases). Also, I imagine that users could
>> even use this tool in pg_dump, for example, to do some post-processing
>> on the dumped data in accordance with the decoder plugin before applying
>> changes to a remote source.
>>
>> Now, this is done with the addition of two options in pg_dump to
>> control the logical slot creation:
>> - --slot to define the name of the slot being created
>> - --plugin-name, to define the name of the decoder plugin
>> And then you can of course do things like that:
>> # Raw data dump on a slot
>> $ pg_dump --slot bar --plugin-name test_decoding
>> # Existing parallel dump not changed:
>> $ pg_dump -j 4 -f data -F d
>> # Parallel dump on a slot
>> $ pg_dump -j 4 --slot bar --plugin-name test_decoding -f data -F d
>>
>
> Wouldn't it be better to have the slot handling done outside of pg_dump by
> whatever replication solution you use and just have pg_dump accept the
> snapshot as an input parameter? I am not sure how much I like pg_dump creating
> the slot. I am aware that you need to have the replication connection open
> but that's IMHO just a matter of scripting it together.
The whole point of the operation is to reduce the window between when the
external snapshot is taken and when it is used, to reduce the risk of race
conditions that may be induced by schema changes. See for example the
discussions about why we do not allow specifying a snapshot name as an
option of pg_dump.
-- 
Michael




Re: [HACKERS] WIP: dynahash replacement for buffer table

2014-10-14 Thread Robert Haas
On Tue, Oct 14, 2014 at 4:42 PM, Andres Freund  wrote:
>> The code in CHashSearch shows the problem there: you need to STORE the
>> hazard pointer before you begin to do the LOAD operations to scan the
>> bucket, and you must finish all of those LOADs before you STORE the
>> NULL hazard pointer.  A read or write barrier won't provide store/load
>> or load/store ordering.
>
> I'm not sure I understand the problem here - but a read + write barrier
> should suffice? To avoid falling back to two full barriers, we'd need a
> separate define pg_read_write_barrier(), but that's not a problem. IIUC
> that should allow us to avoid emitting any actual code on x86.

Well, thinking about x86 specifically, it definitely needs at least
one mfence, after setting the hazard pointer and before beginning to
read the bucket chain.  It probably doesn't need the second mfence,
before clearing the the hazard pointer, because x86 allows loads to be
reordered before stores but not stores before loads.  We could
certainly try removing the second fence on x86 for benchmarking
purposes, and we could also check whether mfence outperforms lock;
addl in that scenario.

I tested PPC, because that's what I have.  I think we're emitting two
sync instructions there, but maybe lwsync or isync would be enough in
one or both cases.  The first of these links indicates that lwsync
ought to be enough for both cases; the second says that we need a sync
after setting the hazard pointer but that an lwsync is enough before
clearing it.   Or that's my reading anyway.

http://www.ibm.com/developerworks/systems/articles/powerpc.html
http://peeterjoot.wordpress.com/2010/07/23/use-of-lwsync-instead-of-isync-as-a-mutex-acquire-memory-barrier/

>> > My guess is that the additional indirection via the arena explains the
>> > difference in cache misses? But I might be completely off here.
>>
>> The arena makes the calculation of the pointer address less
>> predictable, which might make things more difficult for the CPU
>> pipeline.  But aside from that, I think it's the same basic idea: you
>> bounce from some sort of bucket header to the first element of the
>> chain and then, hopefully, you're done.  Am I missing something?
>
> I haven't really read much of the code yet (wrote most of this email on
> my workstation before embarking on a plane), but afaics there's one
> indirection to the bucket, and then one to the first node in the linked
> list. Right?
> In dynahash it's first to the segment, but there's only few, so it's
> likely to be cached. Then to the bucket - which, afaics, already can
> contain the key.
>
> So it's not really the arena, but that you chose not to store the first
> element in a chain inline...

Hmm, you have a point.  I think the reason I did it that way is
because I don't know how to make the memory management work out
otherwise.  If you store the first element inline, then even an empty
bucket uses up as much memory space as one with a single element.
More seriously, it breaks the deletion algorithm.  There's no good way
to atomically swap out the bucket header if it is located in a fixed
position and bigger than the size of an object the machine can manipulate
atomically.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] narwhal and PGDLLIMPORT

2014-10-14 Thread Alvaro Herrera

It seems we left this in broken state.  Do we need to do more here to
fix narwhal, or do we want to retire narwhal now?  Something else?  Are
we waiting on someone in particular to do something specific?

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] proposal: plpgsql - Assert statement

2014-10-14 Thread Petr Jelinek

On 09/09/14 17:37, Pavel Stehule wrote:

Ada is a language with a strong character, and PLpgSQL is a little bit of a
strange fork - so it isn't easy to find a form that satisfies all requirements.

My requests:

* be consistent with current PLpgSQL syntax and logic
* allow some future extensibility
* allow static analysis without hard expression processing

But I think there are some points where we can reach agreement -
although it is not an ASSERT implementation.

enhancing RAISE WHEN - please, see attached patch -

I prefer RAISE WHEN over RAISE WHERE due to consistency with EXIT and
CONTINUE [ WHEN ];



Short review of the patch. Note that this has nothing to do with actual
assertions; at the moment it's just an enhancement of the RAISE statement to
make it optionally conditional. As I was one of the people who voted for
it, I do think we want this, and I like the syntax.


The code applies cleanly and seems formatted according to project standards -
there is unnecessary whitespace added in the variable declaration part of
exec_stmt_raise, which should be removed.


Passes make check; I would prefer a slightly more complex expression
than just "true" in the new regression test added for this feature.


Did some manual testing; it seems to work as advertised.

There are no docs at the moment, which is the only show-stopper that I
can see so far.


--
 Petr Jelinek  http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] pg_background (and more parallelism infrastructure patches)

2014-10-14 Thread Petr Jelinek

On 09/10/14 00:32, Andres Freund wrote:

 From c835a06f20792556d35a0eee4c2fa21f5f23e8a3 Mon Sep 17 00:00:00 2001
From: Robert Haas 
Date: Fri, 11 Jul 2014 09:53:40 -0400
Subject: [PATCH 6/6] pg_background: Run commands in a background worker, and
  get the results.

The currently-active GUC values from the user session will be copied
to the background worker.  If the command returns a result set, you
can retrieve the result set; if not, you can retrieve the command
tags.  If the command fails with an error, the same error will be
thrown in the launching process when the results are retrieved.
Warnings and other messages generated by the background worker, and
notifications received by it, are also propagated to the foreground
process.


I got to ask: Why is it helpful that we have this in contrib? I have a
good share of blame to bear for that, but I think we need to stop
diluting contrib ever more with test programs. These have a place, but I
don't think it should be contrib.



I don't see this as just a mere test module; it seems to provide actually
useful functionality (I still dream of it being extended with a scheduler
eventually, and even if it's not, you still get "asynchronous transactions").


--
 Petr Jelinek  http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] WIP: dynahash replacement for buffer table

2014-10-14 Thread Andres Freund
On 2014-10-14 11:19:16 -0400, Robert Haas wrote:
> On Tue, Oct 14, 2014 at 10:25 AM, Andres Freund  
> wrote:
> >> The key idea here is that lookups are done without any locks, only
> >> memory barriers; and inserts and deletes are done using atomic ops.
> >
> > Hm. I quickly looked and I see that you use two full barriers for every
> > lookup. That's pretty expensive. I think this should be doable using
> > only read/write barriers on the lookup side? Even on architectures where
> > read/write memory barriers have to be something more than a compiler barrier,
> > they're still usually an order of magnitude or so cheaper than full barriers.
> 
> The code in CHashSearch shows the problem there: you need to STORE the
> hazard pointer before you begin to do the LOAD operations to scan the
> bucket, and you must finish all of those LOADs before you STORE the
> NULL hazard pointer.  A read or write barrier won't provide store/load
> or load/store ordering.

I'm not sure I understand the problem here - but a read + write barrier
should suffice? To avoid falling back to two full barriers, we'd need a
separate define pg_read_write_barrier(), but that's not a problem. IIUC
that should allow us to avoid emitting any actual code on x86.

> > With regard for using a hash table for the buffer mapping lock I'm
> > doubtful that any form of separate chaining is the right one. We
> > currently have a quite noticeable problem with the number of cache
> > misses in the buffer mapping hash (and also the heavyweight lock mgr) -
> > if we stay with hashes that's only going to be fundamentally lower than
> > dynahash if we change the type of hashing. I've had good, *preliminary*,
> > results using an open addressing + linear probing approach.
> 
> I'm very skeptical of open addressing + linear probing.  Such hash
> tables tend to degrade over time, because you have to leave tombstones
> behind to ensure that future scans advance far enough.

You can try to be a bit smarter than a plain open addressing
approach. But yes, it has disadvantages.
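
(For anyone unfamiliar with the tombstone issue mentioned above, here is a toy
sketch of deletion under open addressing + linear probing; plain illustration,
not anything from the patches discussed here.  Keys 0 and UINT32_MAX are
reserved as the EMPTY and TOMBSTONE sentinels in this toy.)

#include <stdint.h>
#include <stdbool.h>

#define NSLOTS		1024
#define EMPTY		0
#define TOMBSTONE	UINT32_MAX

static uint32_t slots[NSLOTS];

static bool
oa_delete(uint32_t key)
{
	uint32_t	i = key % NSLOTS;	/* toy "hash" */
	uint32_t	n;

	for (n = 0; n < NSLOTS; n++, i = (i + 1) % NSLOTS)
	{
		if (slots[i] == EMPTY)
			return false;		/* probe chain ends: key not present */
		if (slots[i] == key)
		{
			/*
			 * Simply storing EMPTY here would break later lookups of keys
			 * inserted past this slot, so leave a tombstone instead; these
			 * accumulate, which is the degradation Robert refers to.
			 */
			slots[i] = TOMBSTONE;
			return true;
		}
		/* TOMBSTONE or a different key: keep probing. */
	}
	return false;
}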

> There's really
> no way to recover without rebuilding the whole hash table, and
> eventually it degrades to linear search.  If we're spending too much
> time walking hash chains, I think the solution is to increase the
> number of buckets so that the chains get shorter.

Might be worthwhile to try. I know that both my hand-rolled open
addressing and my hand-rolled chaining approaches have significantly fewer
cache misses than dynahash for the same amount of work.

Hm. Possibly that's at least partially because of the segment stuff in
dynahash?

Anyway, you can get the size of the entire hashtable down quite a
bit. Primarily because there's no need to store a next pointer. There's
also not really any need for the 'hashvalue' in the bufmgr case
imo.

> > My guess is that the additional indirection via the arena explains the
> > difference in cache misses? But I might be completely off here.
> 
> The arena makes the calculation of the pointer address less
> predictable, which might make things more difficult for the CPU
> pipeline.  But aside from that, I think it's the same basic idea: you
> bounce from some sort of bucket header to the first element of the
> chain and then, hopefully, you're done.  Am I missing something?

I haven't really read much of the code yet (wrote most of this email on
my workstation before embarking on a plane), but afaics there's one
indirection to the bucket, and then one to the first node in the linked
list. Right?
In dynahash it's first to the segment, but there's only few, so it's
likely to be cached. Then to the bucket - which, afaics, already can
contain the key.

So it's not really the arena, but that you chose not to store the first
element in a chain inline...

Greetings,

Andres Freund




[HACKERS] New Event Trigger: table_rewrite

2014-10-14 Thread Dimitri Fontaine
Hi fellow hackers,

Please find attached to this email a patch to implement a new Event
Trigger, fired on the "table_rewrite" event. As attached, it's meant
as a discussion enabler and only supports ALTER TABLE (and maybe not in
all forms of it). It will need to grow support for VACUUM FULL and
CLUSTER and more before getting committed.

Also, I'd like to work on the AccessExclusiveLock Event Trigger next,
but wanted this simpler one to get acceptance first, as the way to
approach adding events that are not DDL-centric.

This time it's not about which command is running, it's about what the
command is doing.

 src/backend/commands/event_trigger.c| 92 -
 src/backend/commands/tablecmds.c| 35 +++-
 src/backend/utils/cache/evtcache.c  |  2 +
 src/include/commands/event_trigger.h|  1 +
 src/include/utils/evtcache.h|  3 +-
 src/test/regress/expected/event_trigger.out | 18 
 src/test/regress/sql/event_trigger.sql  | 21 +
 7 files changed, 166 insertions(+), 6 deletions(-)

Regards,
-- 
Dimitri Fontaine06 63 07 10 78
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support

diff --git a/src/backend/commands/event_trigger.c b/src/backend/commands/event_trigger.c
index 1b8c94b..9314da9 100644
--- a/src/backend/commands/event_trigger.c
+++ b/src/backend/commands/event_trigger.c
@@ -119,11 +119,14 @@ static void AlterEventTriggerOwner_internal(Relation rel,
 HeapTuple tup,
 Oid newOwnerId);
 static event_trigger_command_tag_check_result check_ddl_tag(const char *tag);
+static event_trigger_command_tag_check_result check_table_rewrite_ddl_tag(
+	const char *tag);
 static void error_duplicate_filter_variable(const char *defname);
 static Datum filter_list_to_array(List *filterlist);
 static Oid insert_event_trigger_tuple(char *trigname, char *eventname,
 		   Oid evtOwner, Oid funcoid, List *tags);
 static void validate_ddl_tags(const char *filtervar, List *taglist);
+static void validate_table_rewrite_tags(const char *filtervar, List *taglist);
 static void EventTriggerInvoke(List *fn_oid_list, EventTriggerData *trigdata);
 
 /*
@@ -154,7 +157,8 @@ CreateEventTrigger(CreateEventTrigStmt *stmt)
 	/* Validate event name. */
 	if (strcmp(stmt->eventname, "ddl_command_start") != 0 &&
 		strcmp(stmt->eventname, "ddl_command_end") != 0 &&
-		strcmp(stmt->eventname, "sql_drop") != 0)
+		strcmp(stmt->eventname, "sql_drop") != 0 &&
+		strcmp(stmt->eventname, "table_rewrite") != 0)
 		ereport(ERROR,
 (errcode(ERRCODE_SYNTAX_ERROR),
  errmsg("unrecognized event name \"%s\"",
@@ -183,6 +187,9 @@ CreateEventTrigger(CreateEventTrigStmt *stmt)
 		 strcmp(stmt->eventname, "sql_drop") == 0)
 		&& tags != NULL)
 		validate_ddl_tags("tag", tags);
+	else if (strcmp(stmt->eventname, "table_rewrite") == 0
+			 && tags != NULL)
+		validate_table_rewrite_tags("tag", tags);
 
 	/*
 	 * Give user a nice error message if an event trigger of the same name
@@ -281,6 +288,40 @@ check_ddl_tag(const char *tag)
 }
 
 /*
+ * Validate table_rewrite command tags.
+ */
+static void
+validate_table_rewrite_tags(const char *filtervar, List *taglist)
+{
+	ListCell   *lc;
+
+	foreach(lc, taglist)
+	{
+		const char *tag = strVal(lfirst(lc));
+		event_trigger_command_tag_check_result result;
+
+		result = check_table_rewrite_ddl_tag(tag);
+		if (result == EVENT_TRIGGER_COMMAND_TAG_NOT_SUPPORTED)
+			ereport(ERROR,
+	(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+			/* translator: %s represents an SQL statement name */
+	 errmsg("event triggers are not supported for %s",
+			tag)));
+	}
+}
+
+static event_trigger_command_tag_check_result
+check_table_rewrite_ddl_tag(const char *tag)
+{
+	if (pg_strcasecmp(tag, "ALTER TABLE") == 0 ||
+		pg_strcasecmp(tag, "CLUSTER") == 0 ||
+		pg_strcasecmp(tag, "VACUUM") == 0)
+		return EVENT_TRIGGER_COMMAND_TAG_OK;
+
+	return EVENT_TRIGGER_COMMAND_TAG_NOT_SUPPORTED;
+}
+
+/*
  * Complain about a duplicate filter variable.
  */
 static void
@@ -838,6 +879,55 @@ EventTriggerSQLDrop(Node *parsetree)
 	list_free(runlist);
 }
 
+
+/*
+ * Fire table_rewrite triggers.
+ */
+void
+EventTriggerTableRewrite(Node *parsetree)
+{
+	List	   *runlist;
+	EventTriggerData trigdata;
+
+	/*
+	 * Event Triggers are completely disabled in standalone mode.  There are
+	 * (at least) two reasons for this:
+	 *
+	 * 1. A sufficiently broken event trigger might not only render the
+	 * database unusable, but prevent disabling itself to fix the situation.
+	 * In this scenario, restarting in standalone mode provides an escape
+	 * hatch.
+	 *
+	 * 2. BuildEventTriggerCache relies on systable_beginscan_ordered, and
+	 * therefore will malfunction if pg_event_trigger's indexes are damaged.
+	 * To allow recovery from a damaged index, we need some operating mode
+	 * wherein event triggers are disabled.  (Or we could implement
+	 * heapscan-and-sort logic for that case, but having disaster 

Re: [HACKERS] pgaudit - an auditing extension for PostgreSQL

2014-10-14 Thread Abhijit Menon-Sen
At 2014-10-14 20:09:50 +0100, si...@2ndquadrant.com wrote:
>
> I think that's a good idea.
> 
> We could have pg_audit.roles = 'audit1, audit2'

Yes, it's a neat idea, and we could certainly do that. But why is it any
better than "ALTER ROLE audit_rw SET pgaudit.log = …" and granting that
role to the users whose actions you want to audit?

-- Abhijit


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgaudit - an auditing extension for PostgreSQL

2014-10-14 Thread Stephen Frost
* Simon Riggs (si...@2ndquadrant.com) wrote:
> On 14 October 2014 13:57, Stephen Frost  wrote:
> 
> > Create an 'audit' role.
> >
> > Every command run by roles which are granted to the 'audit' role is
> > audited.
> >
> > Every 'select' against tables which the 'audit' role has 'select' rights
> > on is audited.  Similarly for every insert, update, delete.
> 
> I think that's a good idea.
> 
> We could have pg_audit.roles = 'audit1, audit2'
> so users can specify any audit roles they wish, which might even be
> existing user names.

Agreed.

> That is nice because it allows multiple completely independent
> auditors to investigate whatever they choose without discussing with
> other auditors.

Yes, also a good thought.

Thanks!

Stephen




Re: [HACKERS] pgaudit - an auditing extension for PostgreSQL

2014-10-14 Thread Simon Riggs
On 14 October 2014 13:57, Stephen Frost  wrote:

> Create an 'audit' role.
>
> Every command run by roles which are granted to the 'audit' role is
> audited.
>
> Every 'select' against tables which the 'audit' role has 'select' rights
> on is audited.  Similarly for every insert, update, delete.

I think that's a good idea.

We could have pg_audit.roles = 'audit1, audit2'
so users can specify any audit roles they wish, which might even be
existing user names.

That is nice because it allows multiple completely independent
auditors to investigate whatever they choose without discussing with
other auditors.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Wait free LW_SHARED acquisition - v0.9

2014-10-14 Thread Merlin Moncure
On Tue, Oct 14, 2014 at 8:58 AM, Andres Freund  wrote:
> On 2014-10-14 08:40:49 -0500, Merlin Moncure wrote:
>> On Fri, Oct 10, 2014 at 11:00 AM, Andres Freund  
>> wrote:
>> > Which is nearly trivial now that atomics are in. Check out the attached
>> > WIP patch which eliminates the spinlock from StrategyGetBuffer() unless
>> > there's buffers on the freelist.
>>
>> Is this safe?
>>
>> + victim = pg_atomic_fetch_add_u32(&StrategyControl->nextVictimBuffer, 1);
>>
>> - if (++StrategyControl->nextVictimBuffer >= NBuffers)
>> + buf = &BufferDescriptors[victim % NBuffers];
>> +
>> + if (victim % NBuffers == 0)
>> 
>>
>> I don't think there's any guarantee you could keep nextVictimBuffer
>> from wandering off the end.
>
> Not sure what you mean? It'll cycle around at 2^32. The code doesn't try
> to avoid nextVictimBuffer from going above NBuffers. To avoid overrunning
> &BufferDescriptors I'm doing % NBuffers.
>
> Yes, that'll have the disadvantage of the first buffers being slightly
> more likely to be hit, but for that to be relevant you'd need
> unrealistically large amounts of shared_buffers.

Right -- my mistake. That's clever.  I think that should work well.

> I think that's pretty much orthogonal. I *do* think it makes sense to
> increment nextVictimBuffer in bigger steps. But the above doesn't
> prohibit doing so - and it'll still be be much better without the
> spinlock. Same number of atomics, but no potential of spinning and no
> potential of being put to sleep while holding the spinlock.
>
> This callsite has a comparatively large likelihood of being put to sleep
> while holding a spinlock because it can run for a very long time doing
> nothing but spinlocking.

A while back, I submitted a minor tweak to the clock sweep so that,
instead of spinlocking every single buffer header as it swept, it just
did a single TAS as a kind of trylock and punted to the next buffer
if the test failed, on the principle that there's no good reason to hang
around.  You only spin if you passed the first test; that should
reduce the likelihood of actual spinning to approximately zero.  I
still maintain there's no reason not to do that (I couldn't show a
benefit but that was because mapping list locking was masking any
clock sweep contention at that time).
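
To illustrate, roughly what I have in mind, layered on the lockless
advance quoted above (TryLockBufHdr() is hypothetical here - a
non-blocking variant of LockBufHdr() that returns false instead of
spinning; this is a sketch, not the patch I submitted):

    for (;;)
    {
        volatile BufferDesc *buf;
        uint32      victim;

        victim = pg_atomic_fetch_add_u32(&StrategyControl->nextVictimBuffer, 1);
        buf = &BufferDescriptors[victim % NBuffers];

        if (!TryLockBufHdr(buf))
            continue;               /* contended: skip it rather than spin */

        if (buf->refcount == 0)
        {
            if (buf->usage_count > 0)
                buf->usage_count--; /* not a victim yet; age it and move on */
            else
                return buf;         /* victim found; header lock still held */
        }
        UnlockBufHdr(buf);
    }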

merlin


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_dump refactor patch to remove global variables

2014-10-14 Thread Alvaro Herrera
I have pushed this, thanks.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Buffer Requests Trace

2014-10-14 Thread Stephen Frost
* Lucas Lersch (lucasler...@gmail.com) wrote:
> I see this... but ReleaseBuffer() simply decrements the reference count of
> the page the buffer currently holds. Assuming that a ReadBuffer() -
> ReleaseBuffer() pattern is used for interacting with the shared_buffers,
> there will be a ReleaseBuffer() call for any page (heap or index) "loaded"
> into the shared_buffers.

Not sure what you're getting at here.  This was the original comment
that I was addressing:

---
Unfortunately, in the generated trace with over 2 million buffer requests,
only ~14k different pages are being accessed, out of the 800k of the whole
database. Am I missing something here?
---

So, there's 2MM buffer requests with only ~14k different pages even
though the database consists of ~800k different pages.

Either your short benchmark is only hitting ~14k different pages out of
the ~800k, or what you're actually looking at are the ~14k pages (eh,
more like 16k, but whatever) of the shared_buffer cache.  Somewhere in
your analysis of the 2MM buffer requests you reduced the set of buffer
requests down to the set of "~14k different pages" that you're asking
about here.

What would be helpful here would be actual code changes you made (eg: a
patch), the resulting buffer request data (or at least a snippet of it),
and exactly how you did your analysis to come up with the ~14k number.

Thanks

Stephen




Re: [HACKERS] Buffer Requests Trace

2014-10-14 Thread Lucas Lersch
I see this... but ReleaseBuffer() simply decrements the reference count of
the page the buffer currently holds. Assuming that a ReadBuffer() -
ReleaseBuffer() pattern is used for interacting with the shared_buffers,
there will be a ReleaseBuffer() call for any page (heap or index) "loaded"
into the shared_buffers.

On Tue, Oct 14, 2014 at 7:21 PM, Stephen Frost  wrote:

> * Lucas Lersch (lucasler...@gmail.com) wrote:
> > Aren't heap and index requests supposed to go through the shared buffers
> > anyway?
>
> Sure they do, but a given page in shared_buffers can be used over and
> over again for different heap and index pages..
>
> Thanks,
>
> Stephen
>



-- 
Lucas Lersch


Re: [HACKERS] Expose options to explain? (track_io_timing)

2014-10-14 Thread Robert Haas
On Tue, Oct 14, 2014 at 1:04 PM, Joshua D. Drake  wrote:
> On 10/14/2014 10:01 AM, Robert Haas wrote:
>> Hmm.  IIRC, there are only two use cases for I/O timing at present:
>> pg_stat_statements (which really only makes sense if it's turned on or
>> off system-wide) and EXPLAIN.  Rather than inventing more GUC
>> machinery, I think we could just add an explain flag called "IO".  So
>> you could do:
>>
>> EXPLAIN (ANALYZE, IO) SELECT 
>>
>> And that would gather I/O stats even if it's turned off system-wide.
>> Or you could do:
>>
>> EXPLAIN (ANALYZE, IO false) SELECT 
>>
>> That can't really be allowed to suppress gathering the I/O stats for
>> this query if the sysadmin wants those stats for all queries.  But it
>> could suppress the print-out.
>
> I think the first one makes the most sense.

It would be both or neither, not one or the other.   All EXPLAIN
options take true/false arguments; but "true" can be omitted for
brevity.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_background (and more parallelism infrastructure patches)

2014-10-14 Thread Robert Haas
On Wed, Oct 8, 2014 at 6:32 PM, Andres Freund  wrote:
>>  /*
>> + * Arrange to remove a dynamic shared memory mapping at cleanup time.
>> + *
>> + * dsm_keep_mapping() can be used to preserve a mapping for the entire
>> + * lifetime of a process; this function reverses that decision, making
>> + * the segment owned by the current resource owner.  This may be useful
>> + * just before performing some operation that will invalidate the segment
>> + * for future use by this backend.
>> + */
>> +void
>> +dsm_unkeep_mapping(dsm_segment *seg)
>> +{
>> + Assert(seg->resowner == NULL);
>> + ResourceOwnerEnlargeDSMs(CurrentResourceOwner);
>> + seg->resowner = CurrentResourceOwner;
>> + ResourceOwnerRememberDSM(seg->resowner, seg);
>> +}
>
> Hm, I dislike the name unkeep. I guess you want to be symmetric to
> dsm_keep_mapping? dsm_manage_mapping(), dsm_ensure_mapping_cleanup(),
> dsm_remember_mapping()?

Yes, I want to be symmetrical with dsm_keep_mapping, which is itself
symmetrical with dsm_keep_segment().  Your proposed names don't sound
good to me because it's not obvious that "manage" or "remember" is the
opposite of "keep".  It's pretty easy to remember that "unkeep" is the
opposite of "keep" though.

If I had to pick one of those, I'd probably pick
dsm_ensure_mapping_cleanup(), but I don't like that much.  It
describes what the function does fairly well, but it's totally unlike
the function it pairs with.

Another idea is to add another argument to the existing function
(possibly renaming it along the way).  For example, we could have
dsm_keep_mapping(seg, true) mean keep it, and dsm_keep_mapping(seg,
false) mean don't keep it.  That would break existing callers, but
there probably aren't many of those.  We could even - and much more
generally - replace it with dsm_set_resowner(seg, res), but that would
require #including some stuff into dsm.h that we might rather not
have there.
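
Sketched out, that variant might look like this (dsm_set_keep_mapping is
a made-up name; the "keep" branch is my reading of what dsm_keep_mapping
does today, and the other branch is the hunk quoted above):

void
dsm_set_keep_mapping(dsm_segment *seg, bool keep)
{
    if (keep)
    {
        /* what dsm_keep_mapping() does today: detach from the resource owner */
        if (seg->resowner != NULL)
        {
            ResourceOwnerForgetDSM(seg->resowner, seg);
            seg->resowner = NULL;
        }
    }
    else
    {
        /* the "unkeep" case: hand the mapping back to the current owner */
        Assert(seg->resowner == NULL);
        ResourceOwnerEnlargeDSMs(CurrentResourceOwner);
        seg->resowner = CurrentResourceOwner;
        ResourceOwnerRememberDSM(seg->resowner, seg);
    }
}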

I'll respond to the rest of this later.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Maximum number of WAL files in the pg_xlog directory

2014-10-14 Thread Bruce Momjian
On Tue, Oct 14, 2014 at 09:20:22AM -0700, Jeff Janes wrote:
> On Mon, Oct 13, 2014 at 12:11 PM, Bruce Momjian  wrote:
> 
> 
> I looked into this, and came up with more questions.  Why is
> checkpoint_completion_target involved in the total number of WAL
> segments?  If checkpoint_completion_target is 0.5 (the default), the
> calculation is:
> 
>         (2 + 0.5) * checkpoint_segments + 1
> 
> while if it is 0.9, it is:
> 
>         (2 + 0.9) * checkpoint_segments + 1
> 
> Is this trying to estimate how many WAL files are going to be created
> during the checkpoint?  If so, wouldn't it be (1 +
> checkpoint_completion_target), not "2 +".  My logic is you have the old
> WAL files being checkpointed (that's the "1"), plus you have new WAL
> files being created during the checkpoint, which would be
> checkpoint_completion_target * checkpoint_segments, plus one for the
> current WAL file.
> 
> 
> WAL is not eligible to be recycled until there have been 2 successful
> checkpoints.
> 
> So at the end of a checkpoint, you have 1 cycle of WAL which has just become
> eligible for recycling,
> 1 cycle of WAL which is now expendable but which is kept anyway, and
> checkpoint_completion_target worth of WAL which has occurred while the
> checkpoint was occurring and is still needed for crash recovery.

OK, so based on this analysis, what is the right calculation?  This?

(1 + checkpoint_completion_target) * checkpoint_segments + 1 +
max(wal_keep_segments, checkpoint_segments)
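
For illustration, assuming that formula is right, plugging in the old
defaults (checkpoint_segments = 3, checkpoint_completion_target = 0.5,
wal_keep_segments = 0) gives:

        (1 + 0.5) * 3 + 1 + max(0, 3) = 8.5

i.e. roughly 9 segments of 16MB each, or about 144MB of pg_xlog at
steady state.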

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Buffer Requests Trace

2014-10-14 Thread Stephen Frost
* Lucas Lersch (lucasler...@gmail.com) wrote:
> Aren't heap and index requests supposed to go through the shared buffers
> anyway?

Sure they do, but a given page in shared_buffers can be used over and
over again for different heap and index pages..

Thanks,

Stephen




[HACKERS] [RFC] Incremental backup v3: incremental PoC

2014-10-14 Thread Marco Nenciarini
Hi Hackers,

following the advices gathered on the list I've prepared a third partial
patch on the way of implementing incremental pg_basebackup as described
here https://wiki.postgresql.org/wiki/Incremental_backup


== Changes

Compared to the previous version I've made the following changes:

* The backup_profile is not optional anymore. Generating it is cheap
enough not to bother the user with such a choice.

* I've isolated the code which detects the maxLSN of a segment in a
separate getMaxLSN function. At the moment it works scanning the whole
file, but I'm looking to replace it in the next versions.

* I've made it possible to request an incremental backup by passing a "-I
lsn" option to pg_basebackup. It is probably too "raw" to remain as
is, but it's useful at this stage to test the code.

* I've modified the backup label to report the fact that the backup was
taken with the incremental option. The result will be something like:

START WAL LOCATION: 0/5228 (file 00010052)
CHECKPOINT LOCATION: 0/5260
INCREMENTAL FROM LOCATION: 0/5128
BACKUP METHOD: streamed
BACKUP FROM: master
START TIME: 2014-10-14 16:05:04 CEST
LABEL: pg_basebackup base backup


== Testing it

At this stage you can make an incremental file-level backup using this
procedure:

pg_basebackup -v -F p -D /tmp/x -x
LSN=$(awk '/^START WAL/{print $4}' /tmp/x/backup_profile)
pg_basebackup -v -F p -D /tmp/y -I $LSN -x

the result will be an incremental backup in /tmp/y based on the full
backup on /tmp/x.

You can "reintegrate" the incremental backup in the /tmp/z directory
with the following little python script, calling it as

./recover.py /tmp/x /tmp/y /tmp/z


#!/usr/bin/env python
# recover.py

import os
import shutil
import sys

if len(sys.argv) != 4:
    print >> sys.stderr, "usage: %s base incremental destination" % sys.argv[0]
    sys.exit(1)

base = sys.argv[1]
incr = sys.argv[2]
dest = sys.argv[3]

if os.path.exists(dest):
    print >> sys.stderr, "error: destination must not exist (%s)" % dest
    sys.exit(1)

profile = open(os.path.join(incr, 'backup_profile'), 'r')

# skip the header part of the profile up to the file list
for line in profile:
    if line.strip() == 'FILE LIST':
        break

# copy the incremental backup, then fill in the unsent files from the base
shutil.copytree(incr, dest)
for line in profile:
    tblspc, lsn, sent, date, size, path = line.strip().split('\t')
    if sent == 't' or lsn == '\\N':
        continue
    base_file = os.path.join(base, path)
    dest_file = os.path.join(dest, path)
    shutil.copy2(base_file, dest_file)


It obviously has to be replaced by a full-fledged user tool, but it is
enough to test the concept.

== What next

I would like to replace the getMaxLSN function with a more-or-less
persistent structure which contains the maxLSN for each data segment.

To make it work I would hook into the ForwardFsyncRequest() function in
src/backend/postmaster/checkpointer.c and update an in-memory hash every
time a block is going to be fsynced. The structure could be persisted to
disk at some point (probably on checkpoint).

I think a good key for the hash would be a BufferTag with blocknum
"rounded" to the start of the segment.

I'm here asking for comments and advice on how to implement it in an
acceptable way.

== Disclaimer

The code here is an intermediate step; it does not contain any
documentation besides the code comments and will be subject to deep and
radical changes. However I believe it can be a base to allow PostgreSQL
to have its file-based incremental backup, and a block-based incremental
backup after it.

Regards,
Marco

-- 
Marco Nenciarini - 2ndQuadrant Italy
PostgreSQL Training, Services and Support
marco.nenciar...@2ndquadrant.it | www.2ndQuadrant.it
From 5a7365fc3115c831627c087311c702a79cb355bc Mon Sep 17 00:00:00 2001
From: Marco Nenciarini 
Date: Tue, 14 Oct 2014 14:31:28 +0100
Subject: [PATCH] File-based incremental backup

Add backup profile to pg_basebackup
INCREMENTAL option naive implementation
---
 src/backend/access/transam/xlog.c  |   7 +-
 src/backend/access/transam/xlogfuncs.c |   2 +-
 src/backend/replication/basebackup.c   | 316 +++--
 src/backend/replication/repl_gram.y|   6 +
 src/backend/replication/repl_scanner.l |   1 +
 src/bin/pg_basebackup/pg_basebackup.c  |  83 +++--
 src/include/access/xlog.h  |   3 +-
 src/include/replication/basebackup.h   |   4 +
 8 files changed, 391 insertions(+), 31 deletions(-)

diff --git a/src/backend/access/transam/xlog.c 
b/src/backend/access/transam/xlog.c
index 235b442..4dc79f0 100644
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
*** XLogFileNameP(TimeLineID tli, XLogSegNo 
*** 9718,9724 
   * permissions of the calling user!
   */
  XLogRecPtr
! do_pg_start_backup(const char *backupidstr, bool fast, TimeLineID *starttli_p,
   char **labelfile)
  {
boolexclusive = (labelfile == NULL);
--- 9718,9725 
   * permissions of the calling user!
   */
  XLogRecPtr
! do_pg_start_backup(const char *backupid

Re: [HACKERS] Buffer Requests Trace

2014-10-14 Thread Lucas Lersch
Aren't heap and index requests supposed to go through the shared buffers
anyway?

On Tue, Oct 14, 2014 at 7:02 PM, Stephen Frost  wrote:

> * Lucas Lersch (lucasler...@gmail.com) wrote:
> > shared_buffers is 128MB and the version of pgsql is 9.3.5
>
> I suspect you're not tracking what you think you're tracking, which is
> why I brought up shared_buffers.
>
> ~14k * 8192 (page size) = ~110MB
>
> What it sounds like you're actually tracking are shared buffer requests
> and not heap or index requests.
>
> Now, perhaps the test you're running only touched 110MB of the 6G
> database, but that seems pretty unlikely.
>
> Thanks,
>
> Stephen
>



-- 
Lucas Lersch


Re: [HACKERS] Expose options to explain? (track_io_timing)

2014-10-14 Thread Joshua D. Drake


On 10/14/2014 10:01 AM, Robert Haas wrote:


Hmm.  IIRC, there are only two use cases for I/O timing at present:
pg_stat_statements (which really only makes sense if it's turned on or
off system-wide) and EXPLAIN.  Rather than inventing more GUC
machinery, I think we could just add an explain flag called "IO".  So
you could do:

EXPLAIN (ANALYZE, IO) SELECT 

And that would gather I/O stats even if it's turned off system-wide.
Or you could do:

EXPLAIN (ANALYZE, IO false) SELECT 

That can't really be allowed to suppress gathering the I/O stats for
this query if the sysadmin wants those stats for all queries.  But it
could suppress the print-out.


I think the first one makes the most sense.

JD



--
Command Prompt, Inc. - http://www.commandprompt.com/  503-667-4564
PostgreSQL Support, Training, Professional Services and Development
High Availability, Oracle Conversion, @cmdpromptinc
"If we send our children to Caesar for their education, we should
 not be surprised when they come back as Romans."


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Buffer Requests Trace

2014-10-14 Thread Stephen Frost
* Lucas Lersch (lucasler...@gmail.com) wrote:
> shared_buffers is 128MB and the version of pgsql is 9.3.5

I suspect you're not tracking what you think you're tracking, which is
why I brought up shared_buffers.

~14k * 8192 (page size) = ~110MB

What it sounds like you're actually tracking are shared buffer requests
and not heap or index requests.

Now, perhaps the test you're running only touched 110MB of the 6G
database, but that seems pretty unlikely.

Thanks,

Stephen




Re: [HACKERS] Expose options to explain? (track_io_timing)

2014-10-14 Thread Robert Haas
On Thu, Oct 9, 2014 at 2:41 PM, Jeff Janes  wrote:
> I think the theory for track_io_timing being PGC_SUSET is that if the
> superuser turned it on, no one should be able to turn it off.
>
> But I don't see an argument for the other way around, that no one should be
> able to turn it on locally if the superuser left it at the default of off.
>
> So I think the real behavior we would want is that anyone can turn it on in
> their session, and can also turn it off provided it was turned on by them in
> the first place.  But there is no machinery in the GUC code to do that,
> which is probably why it wasn't done.  I meant to work on that for this dev
> cycle, but I never dug into how to implement the "provided it was turned on
> by them in the first place" part of the requirement.  And how would this be
> expressed generically? Some notion that the default value can be a floor or
> ceiling which the user can alter in one direction, and reverse that
> alteration. PGC_SUSET_FLOOR and PGC_SUSET_CEILING?

Hmm.  IIRC, there are only two use cases for I/O timing at present:
pg_stat_statements (which really only makes sense if it's turned on or
off system-wide) and EXPLAIN.  Rather than inventing more GUC
machinery, I think we could just add an explain flag called "IO".  So
you could do:

EXPLAIN (ANALYZE, IO) SELECT 

And that would gather I/O stats even if it's turned off system-wide.
Or you could do:

EXPLAIN (ANALYZE, IO false) SELECT 

That can't really be allowed to suppress gathering the I/O stats for
this query if the sysadmin wants those stats for all queries.  But it
could suppress the print-out.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Buffer Requests Trace

2014-10-14 Thread Lucas Lersch
shared_buffers is 128MB and the version of pgsql is 9.3.5

On Tue, Oct 14, 2014 at 6:31 PM, Lucas Lersch  wrote:

> Sorry, I do not understand the question.
>
> But I forgot to give an additional information: I am printing the page id
> for the trace file in ReleaseBuffer() only if it is a shared buffer, I am
> not considering local buffers. I assumed that local buffers were used only
> for temporary tables.
>
> On Tue, Oct 14, 2014 at 6:25 PM, Stephen Frost  wrote:
>
>> * Lucas Lersch (lucasler...@gmail.com) wrote:
>> > Unfortunately, in the generated trace with over 2 million buffer
>> requests,
>> > only ~14k different pages are being accessed, out of the 800k of the
>> whole
>> > database. Am I missing something here?
>>
>> What do you have shared_buffers set to..?
>>
>> Thanks,
>>
>> Stephen
>>
>
>
>
> --
> Lucas Lersch
>



-- 
Lucas Lersch


Re: [HACKERS] CINE in CREATE TABLE AS ... and CREATE MATERIALIZED VIEW ...

2014-10-14 Thread Fabrízio de Royes Mello
On Wed, Oct 1, 2014 at 9:17 AM, Fabrízio de Royes Mello <
fabriziome...@gmail.com> wrote:
>
> Hi all,
>
> We already have IF NOT EXISTS for CREATE TABLE. Is there some reason not to
have it for CREATE TABLE AS and CREATE MATERIALIZED VIEW?
>

Patch attached to add CINE support to:

- CREATE TABLE AS
- CREATE MATERIALIZED VIEW

Regards,

--
Fabrízio de Royes Mello
Consultoria/Coaching PostgreSQL
>> Timbira: http://www.timbira.com.br
>> Blog: http://fabriziomello.github.io
>> Linkedin: http://br.linkedin.com/in/fabriziomello
>> Twitter: http://twitter.com/fabriziomello
>> Github: http://github.com/fabriziomello
diff --git a/doc/src/sgml/bgworker.sgml b/doc/src/sgml/bgworker.sgml
index 8e218ac..d53570d 100644
--- a/doc/src/sgml/bgworker.sgml
+++ b/doc/src/sgml/bgworker.sgml
@@ -192,7 +192,7 @@ typedef struct BackgroundWorker
opaque handle that can subsequently be passed to
GetBackgroundWorkerPid(BackgroundWorkerHandle *, pid_t *) or
TerminateBackgroundWorker(BackgroundWorkerHandle *).
-   GetBackgroundWorkerPid can be used to poll the status of the
+   GetBackgroundWorker can be used to poll the status of the
worker: a return value of BGWH_NOT_YET_STARTED indicates that
the worker has not yet been started by the postmaster;
BGWH_STOPPED indicates that it has been started but is
diff --git a/doc/src/sgml/ref/alter_view.sgml b/doc/src/sgml/ref/alter_view.sgml
index 3aef61b..cdcc4f1 100644
--- a/doc/src/sgml/ref/alter_view.sgml
+++ b/doc/src/sgml/ref/alter_view.sgml
@@ -28,6 +28,11 @@ ALTER VIEW [ IF EXISTS ] name RENAM
 ALTER VIEW [ IF EXISTS ] name SET SCHEMA new_schema
 ALTER VIEW [ IF EXISTS ] name SET ( view_option_name [= view_option_value] [, ... ] )
 ALTER VIEW [ IF EXISTS ] name RESET ( view_option_name [, ... ] )
+
+where view_option_name can be one of:
+
+security_barrier [ boolean ]
+check_option [ text (local or cascaded) ]
 
  
 
@@ -117,32 +122,19 @@ ALTER VIEW [ IF EXISTS ] name RESET

 

-SET ( view_option_name [= view_option_value] [, ... ] )
-RESET ( view_option_name [, ... ] )
+view_option_name
+
+ 
+  The name of a view option to be set or reset.
+ 
+
+   
+
+   
+view_option_value
 
  
-  Sets or resets a view option.  Currently supported options are:
-  
-   
-check_option (string)
-
- 
-  Changes the check option of the view.  The value must
-  be local or cascaded.
- 
-
-   
-   
-security_barrier (boolean)
-
- 
-  Changes the security-barrier property of the view.  The value must
-  be Boolean value, such as true
-  or false.
- 
-
-   
-  
+  The new value for a view option.
  
 

diff --git a/doc/src/sgml/ref/create_aggregate.sgml b/doc/src/sgml/ref/create_aggregate.sgml
index eaa410b..d61a22e 100644
--- a/doc/src/sgml/ref/create_aggregate.sgml
+++ b/doc/src/sgml/ref/create_aggregate.sgml
@@ -59,13 +59,13 @@ CREATE AGGREGATE name (
 [ , FINALFUNC = ffunc ]
 [ , FINALFUNC_EXTRA ]
 [ , INITCOND = initial_condition ]
-[ , MSFUNC = msfunc ]
-[ , MINVFUNC = minvfunc ]
-[ , MSTYPE = mstate_data_type ]
-[ , MSSPACE = mstate_data_size ]
-[ , MFINALFUNC = mffunc ]
+[ , MSFUNC = sfunc ]
+[ , MINVFUNC = invfunc ]
+[ , MSTYPE = state_data_type ]
+[ , MSSPACE = state_data_size ]
+[ , MFINALFUNC = ffunc ]
 [ , MFINALFUNC_EXTRA ]
-[ , MINITCOND = minitial_condition ]
+[ , MINITCOND = initial_condition ]
 [ , SORTOP = sort_operator ]
 )
 
diff --git a/doc/src/sgml/ref/create_materialized_view.sgml b/doc/src/sgml/ref/create_materialized_view.sgml
index 2c73852..8a0fb4d 100644
--- a/doc/src/sgml/ref/create_materialized_view.sgml
+++ b/doc/src/sgml/ref/create_materialized_view.sgml
@@ -21,7 +21,7 @@ PostgreSQL documentation
 
  
 
-CREATE MATERIALIZED VIEW table_name
+CREATE MATERIALIZED VIEW [ IF NOT EXISTS ] table_name
 [ (column_name [, ...] ) ]
 [ WITH ( storage_parameter [= value] [, ... ] ) ]
 [ TABLESPACE tablespace_name ]
@@ -53,6 +53,18 @@ CREATE MATERIALIZED VIEW table_name
  
   Parameters
 
+   
+IF NOT EXISTS
+
+ 
+  Do not throw an error if a materialized view with the same name already
+  exists. A notice is issued in this case.  Note that there is no guarantee
+  that the existing materialized view is anything like the one that would
+  have been created.
+ 
+
+   
+
   

 table_name
diff --git a/doc/src/sgml/ref/create_table_as.sgml b/doc/src/sgml/ref/create_table_as.sgml
index 60300ff..8e4ada7 100644
--- a/doc/src/sgml/ref/create_table_as.sgml
+++ b/doc/src/sgml/ref/create_table_as.sgml
@@ -21,7 +21,7 @@ PostgreSQL documentation
 
  
 
-CREATE [ [ GLOBAL | LOCAL ] { TEMPORARY | TEMP } | UNLOGGED ] TABLE table_name
+CREATE [ [ GLOBAL | LOCAL ] { TEMPORARY | TEMP } | UNLOGGED ] TABLE [ IF NOT EXISTS ] table_name
 [ (col

Re: [HACKERS] Column Redaction

2014-10-14 Thread Robert Haas
On Sat, Oct 11, 2014 at 3:40 AM, Simon Riggs  wrote:
> As soon as you issue the above query, you have clearly indicated your
> intention to steal. Receiving information is no longer accidental, it
> is an explicit act that is logged in the auditing system against your
> name. This is sufficient to bury you in court and it is now a real
> deterrent. Redaction has worked.

To me, this feels thin.  It's true that this might be good enough for
some users, but I wouldn't bet on it being good enough for very many
users, and I really hope there's a better option.  We have an existing
method of doing data redaction via security barrier views.  There are
information leaks, to be sure, but they are far more subtle and
low-bandwidth than Rod's query.  The reason for that is that only
trusted code (leakproof functions) is allowed to run against the
trusted data; the redaction is applied before any
potentially-untrustworthy stuff happens.  Here, you're applying the
redaction as the very last step before sending the data to the user,
and that allows too much freedom to do bad stuff between the time the
database first lays hands on the data and the time it gets redacted.

But maybe that can be fixed.  I don't know exactly how.  I think you
need a design that allows you to restrict very tightly the operations
that an untrusted user can perform on trusted data.  Maybe you only
want to allow "=" and nothing else, for example.  Perhaps the set of
allowable predicates could be defined via DDL.  Then when the query is
run, the system imposes a security fence.  Only approved predicates
can be pushed through the fence.  And when the data crosses the fence
from the trusted side to the untrusted side, redaction happens at that
point, rather than just before sending the data to the user.

This is, of course, more complicated.  But I think it's likely to be
worth it.  The problem with relying on auditing is that you need a
human to look at the audit logs and judge intent.  With a query as
overt as Rod's, that's maybe not too hard.  But with a lot of analysts
running a lot of queries, it might not be that hard to bury an
information-stealing query inside an innocent-looking query in such a
way that the administrator doesn't notice.  Granted, that's playing
with fire, but I've encountered many security vulnerabilities in my
career that can be exploited without doing anything obviously evil.
If you retroactively put a packet-sniffer on every network I've ever
been connected to, and carefully examined all my network traffic,
you'd find me finding holes in all kinds of things, but in fact,
nobody's ever noticed a problem in advance of me reporting it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Buffer Requests Trace

2014-10-14 Thread Lucas Lersch
Sorry, I do not understand the question.

But I forgot to give an additional information: I am printing the page id
for the trace file in ReleaseBuffer() only if it is a shared buffer, I am
not considering local buffers. I assumed that local buffers were used only
for temporary tables.

On Tue, Oct 14, 2014 at 6:25 PM, Stephen Frost  wrote:

> * Lucas Lersch (lucasler...@gmail.com) wrote:
> > Unfortunately, in the generated trace with over 2 million buffer
> requests,
> > only ~14k different pages are being accessed, out of the 800k of the
> whole
> > database. Am I missing something here?
>
> What do you have shared_buffers set to..?
>
> Thanks,
>
> Stephen
>



-- 
Lucas Lersch


Re: [HACKERS] Buffer Requests Trace

2014-10-14 Thread Stephen Frost
* Lucas Lersch (lucasler...@gmail.com) wrote:
> Unfortunately, in the generated trace with over 2 million buffer requests,
> only ~14k different pages are being accessed, out of the 800k of the whole
> database. Am I missing something here?

What do you have shared_buffers set to..?

Thanks,

Stephen




Re: [HACKERS] Maximum number of WAL files in the pg_xlog directory

2014-10-14 Thread Jeff Janes
On Mon, Oct 13, 2014 at 12:11 PM, Bruce Momjian  wrote:

>
> I looked into this, and came up with more questions.  Why is
> checkpoint_completion_target involved in the total number of WAL
> segments?  If checkpoint_completion_target is 0.5 (the default), the
> calculation is:
>
> (2 + 0.5) * checkpoint_segments + 1
>
> while if it is 0.9, it is:
>
> (2 + 0.9) * checkpoint_segments + 1
>
> Is this trying to estimate how many WAL files are going to be created
> during the checkpoint?  If so, wouldn't it be (1 +
> checkpoint_completion_target), not "2 +".  My logic is you have the old
> WAL files being checkpointed (that's the "1"), plus you have new WAL
> files being created during the checkpoint, which would be
> checkpoint_completion_target * checkpoint_segments, plus one for the
> current WAL file.
>

WAL is not eligible to be recycled until there have been 2 successful
checkpoints.

So at the end of a checkpoint, you have 1 cycle of WAL which has just
become eligible for recycling,
1 cycle of WAL which is now expendable but which is kept anyway, and
checkpoint_completion_target worth of WAL which has occurred while the
checkpoint was occurring and is still needed for crash recovery.

I don't really understand the point of this way of doing things.  I guess
it is because the control file contains two redo pointers, one for the last
checkpoint, and one for the previous to that checkpoint, and if recovery
finds that it can't use the most recent one it tries the ones before that.
Why?  Beats me.  If we are worried about the control file getting a corrupt
redo pointer, it seems like we would record the last one twice, rather than
recording two different ones once each.  And if the in-memory version got
corrupted before being written to the file, I really doubt anything is
going to save your bacon at that point.

I've never seen a case where recovery couldn't use the last recorded good
checkpoint, so instead used the previous one, and was successful at it.
But then again I haven't seen all possible crashes.

This is based on memory from the last time I looked into this, I haven't
re-verified it so could be wrong or obsolete.

Cheers,

Jeff


[HACKERS] Buffer Requests Trace

2014-10-14 Thread Lucas Lersch
Hello,

I changed the buffer manager code in order to generate a trace of page
requests from the buffer manager perspective. In summary, whenever
ReleaseBuffer() or ReleaseAndReadBuffer() is called, I print the page
currently being released, which is identified by the tuple (tableSpace,
dbNode, relationNode, blockNumber).
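
Something along these lines (a simplified sketch, not my exact diff; the
hook point inside ReleaseBuffer() and the output format are
approximations):

static void
trace_page_release(volatile BufferDesc *buf)
{
    elog(DEBUG1, "release %u %u %u %u",
         buf->tag.rnode.spcNode,      /* tableSpace */
         buf->tag.rnode.dbNode,       /* dbNode */
         buf->tag.rnode.relNode,      /* relationNode */
         buf->tag.blockNum);          /* blockNumber */
}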

I am now running a tpcc benchmark from http://oltpbenchmark.com/

Initially I create and load the database with a scale factor of 64. This
sums up to a database of around 6.7GB (~ 800k pages). Then I execute the
tpcc benchmark for 1 minute with only 1 terminal. Finally I analyse the
trace of the buffer requests made by the execution of the benchmark only
(creation and loading not considered).

Unfortunately, in the generated trace with over 2 million buffer requests,
only ~14k different pages are being accessed, out of the 800k of the whole
database. Am I missing something here?

Best regards.
-- 
Lucas Lersch


Re: [HACKERS] WIP: dynahash replacement for buffer table

2014-10-14 Thread Robert Haas
On Tue, Oct 14, 2014 at 11:31 AM, Andres Freund  wrote:
>> It doesn't look particularly dangerous to me.  Famous last words.
>
>> Basically, I think what we're doing right now is holding the buffer
>> mapping lock so that the buffer can't be renamed under us while we're
>> pinning it.
>
> What I'm afraid of is that there's hidden assumptions about the
> consistency provided by the mapping locks.

That's certainly worth checking for, but I think the only code that
needs to be checked is the code that would formerly have run while
holding said lock.  And there isn't that much of that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: dynahash replacement for buffer table

2014-10-14 Thread Andres Freund
On 2014-10-14 11:08:08 -0400, Robert Haas wrote:
> On Tue, Oct 14, 2014 at 10:47 AM, Andres Freund  
> wrote:
> > On 2014-10-14 09:30:58 -0400, Robert Haas wrote:
> >> I took the basic infrastructure from before and used it to replace the
> >> buffer table.  Patch is attached.
> >
> > Hm. Unless I missed something you pretty much completely eradicated
> > locks from the buffer mapping code? That'd be pretty cool, but it's also
> > scary.
> >
> > How confident are you that your conversion is correct? Not in the sense
> > that there aren't any buglets, but fundamentally.
> 
> It doesn't look particularly dangerous to me.  Famous last words.

> Basically, I think what we're doing right now is holding the buffer
> mapping lock so that the buffer can't be renamed under us while we're
> pinning it.

What I'm afraid of is that there's hidden assumptions about the
consistency provided by the mapping locks.

> If we don't do that, I think the only consequence is
> that, by the time we get the buffer pin, the buffer might no longer be
> the one we looked up.  So we need to recheck.  But assuming we do
> that, I don't see an issue.  In fact, it occurred to me while I was
> cobbling this together that we might want to experiment with that
> change independently of chash.  We already know (from the
> StrategyGetBuffer locking changes) that holding system-wide locks to
> prevent people from twaddling the state of individual buffers can be a
> loser.  This case isn't nearly as bad, because we're only pinning one
> buffer, rather than potentially all of them multiple times, and the
> lock we're holding only affects 1/128th of the buffers, not all of
> them.

Also IIRC at least Linux has targeted wakeup/time transfer logic when
using semaphores - and doesn't for spinlocks. So if you're not happening
to sleep while the lwlock's spinlock is held, the result will be
better. Only that we'll frequently sleep within that for very frequently
taken locks...

> The other application I had in mind for chash was SLRU stuff.  I
> haven't tried to write the code yet, but fundamentally, the same
> considerations apply there.  You do the lookup locklessly to find out
> which buffer is thought to contain the SLRU page you want, but by the
> time you lock the page the answer might be out of date.  Assuming that
> this doesn't happen many times in a row and that lookups are
> relatively cheap, you could get rid of having any centralized lock for
> SLRU, and just have per-page locks.

Hm. I have to admit I haven't really looked enough into the slru code to
judge this. My impression so far wasn't that the locking itself was the
problem in most scenarios - that's not what you've seen?

> I'm not quite sure what fundamental dangers you're thinking about
> here

Oh, only in the context of the bufmgr.c conversion. Not more
generally. I agree that a lockfree hashtable is something quite
worthwile to have.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: dynahash replacement for buffer table

2014-10-14 Thread Robert Haas
On Tue, Oct 14, 2014 at 10:25 AM, Andres Freund  wrote:
>> The key idea here is that lookups are done without any locks, only
>> memory barriers; and inserts and deletes are done using atomic ops.
>
> Hm. I quickly looked and I see that you use two full barriers for every
> lookup. That's pretty expensive. I think this should be doable using
> only read/write barriers on the lookup side? Even on architectures where
> read/write memory barriers have to be something but a compiler barrier,
> they're still usually a magnitude or so cheaper than full barriers.

The code in CHashSearch shows the problem there: you need to STORE the
hazard pointer before you begin to do the LOAD operations to scan the
bucket, and you must finish all of those LOADs before you STORE the
NULL hazard pointer.  A read or write barrier won't provide store/load
or load/store ordering.
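
To spell the ordering out, a simplified sketch (the types and the
MyHazardPointer slot are stand-ins, not chash's actual declarations):

bool
hazard_protected_lookup(CHashBucket *bucket, uint32 key, CHashEntry *out)
{
    CHashNode  *node;
    bool        found = false;

    MyHazardPointer = bucket;   /* STORE: announce what we are about to scan */
    pg_memory_barrier();        /* store/load fence; a write barrier is not enough */

    for (node = bucket->head; node != NULL; node = node->next)     /* LOADs */
    {
        if (node->key == key)
        {
            *out = node->entry; /* copy out while still protected */
            found = true;
            break;
        }
    }

    pg_memory_barrier();        /* load/store fence before the release */
    MyHazardPointer = NULL;     /* STORE: only now may the node be recycled */

    return found;
}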

> I don't see much reason to strive for full lock-free ness. If the locks
> aren't noticeable in real world scenarios - who cares?

That's my feeling too.  ISTR that when I stress-tested the hash table
back in 2012 when I started this code, the concurrency was far better
than dynahash with 16 lwlock partitions.  I don't remember off hand
exactly how I tested that, but it was with the code in hashtest.c.  I
also seem to recall that it was possible to make the freelists get
VERY hot, but of course that was a workload where hash table updates
were the only thing happening, so I assume that on a workload where
we're actually doing work (like, copying the data in and out of kernel
buffers) that's not going to be a real problem unless you have
thousands of cores.  But we'll see.

> With regard for using a hash table for the buffer mapping lock I'm
> doubtful that any form of separate chaining is the right one. We
> currently have a quite noticeable problem with the number of cache
> misses in the buffer mapping hash (and also the heavyweight lock mgr) -
> if we stay with hashes that's only going to be fundamentally lower than
> dynahash if we change the type of hashing. I've had good, *preliminary*,
> results using an open addressing + linear probing approach.

I'm very skeptical of open addressing + linear probing.  Such hash
tables tend to degrade over time, because you have to leave tombstones
behind to ensure that future scans advance far enough.  There's really
no way to recover without rebuilding the whole hash table, and
eventually it degrades to linear search.  If we're spending too much
time walking hash chains, I think the solution is to increase the
number of buckets so that the chains get shorter.
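
For illustration, a minimal sketch of why deletion forces tombstones
(all names here are made up):

#define SLOT_EMPTY      0
#define SLOT_OCCUPIED   1
#define SLOT_TOMBSTONE  2

typedef struct
{
    int         state;
    uint32      key;
} Slot;

static void
table_delete(Slot *slots, uint32 nslots, uint32 key)
{
    uint32      i = key % nslots;

    while (slots[i].state != SLOT_EMPTY)
    {
        if (slots[i].state == SLOT_OCCUPIED && slots[i].key == key)
        {
            /*
             * Cannot mark the slot EMPTY: a later lookup for a key that
             * was displaced past this point would stop probing too early.
             */
            slots[i].state = SLOT_TOMBSTONE;
            return;
        }
        i = (i + 1) % nslots;
    }
}

The tombstones never go away without a rebuild, which is why the table
degrades as entries churn.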

> My guess is that the additional indirection via the arena explains the
> difference in cache misses? But I might be completely off here.

The arena makes the calculation of the pointer address less
predictable, which might make things more difficult for the CPU
pipeline.  But aside from that, I think it's the same basic idea: you
bounce from some sort of bucket header to the first element of the
chain and then, hopefully, you're done.  Am I missing something?

> It'd be quite interesting to see a perf stat -ddd of dynahash.c/chash
> employing builds for some comparable load.

I'm hoping you can test this out on x86.  All I have to work with are
the POWER machines, and the characteristics of those are somewhat
different.  I can test it there, though.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Drop any statistics of table after it's truncated

2014-10-14 Thread Sawada Masahiko
On Tue, Oct 14, 2014 at 11:20 PM, Tom Lane  wrote:
> Sawada Masahiko  writes:
>> I found that the statistics are still remained after it's truncated.
>> In addition, the analyzing ignores table does not have any tuple.
>> After table truncated, the entry of statistics continues to remain
>> unless insertion and analyzing are executed.
>> There is reason why statistics are remained?
>
> Yes, actually, that's intentional.  The idea is that once you start
> loading data into the table, it's most likely going to look something
> like the old data; therefore, the stale statistics are better than
> none at all.  Eventually auto-ANALYZE will replace the stats,
> but until that happens, it seems best to keep using the old stats.
> (Of course there are counterexamples to this, but removing the stats
> is bad in more cases than not.)
>
>> Attached patch is one line patch adds RemoveStatistics() into
>> ExecuteTruncate(), to remove statistics entry of table.
>
> -1, for the reasons explained above.
>

I understood that reason.
Thank you for explaining!

-- 
Regards,

---
Sawada Masahiko


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: dynahash replacement for buffer table

2014-10-14 Thread Amit Kapila
On Tue, Oct 14, 2014 at 8:38 PM, Andres Freund 
wrote:
>
> On 2014-10-14 20:30:45 +0530, Amit Kapila wrote:
> > On Tue, Oct 14, 2014 at 7:55 PM, Andres Freund 
> > wrote:
> > >
> > > On 2014-10-14 09:30:58 -0400, Robert Haas wrote:
> > > > A few years ago I started working on a concurrent hash table for
> > > > PostgreSQL.  The hash table part of it worked, but I never did
> > > > anything with it, really.  Amit mentioned to me earlier this week
that
> > > > he was seeing contention inside the dynahash machinery, which
inspired
> > > > me to go back and update the patch.
> > >
> > > Interestingly I've benchmarked similar loads, even on the same machine
> > > as Amit,
> >
> > There is one catch here, for these profiles I am using Power-8 m/c
> > and the load is slightly higher (5000 scale factor).
>
> Ah, right. I don't think the scale factor changes much, but the
> different architecture certainly does. As I said elsewhere, I would not
> believe these profiles much until they're actually done with optimized
> code...

Today, that m/c is not accessible, so it will take a day or so to get the
optimized profiles; I will post them once I am able to take them.



With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] WIP: dynahash replacement for buffer table

2014-10-14 Thread Robert Haas
On Tue, Oct 14, 2014 at 10:47 AM, Andres Freund  wrote:
> On 2014-10-14 09:30:58 -0400, Robert Haas wrote:
>> I took the basic infrastructure from before and used it to replace the
>> buffer table.  Patch is attached.
>
> Hm. Unless I missed something you pretty much completely eradicated
> locks from the buffer mapping code? That'd be pretty cool, but it's also
> scary.
>
> How confident are you that your conversion is correct? Not in the sense
> that there aren't any buglets, but fundamentally.

It doesn't look particularly dangerous to me.  Famous last words.

Basically, I think what we're doing right now is holding the buffer
mapping lock so that the buffer can't be renamed under us while we're
pinning it.  If we don't do that, I think the only consequence is
that, by the time we get the buffer pin, the buffer might no longer be
the one we looked up.  So we need to recheck.  But assuming we do
that, I don't see an issue.  In fact, it occurred to me while I was
cobbling this together that we might want to experiment with that
change independently of chash.  We already know (from the
StrategyGetBuffer locking changes) that holding system-wide locks to
prevent people from twaddling the state of individual buffers can be a
loser.  This case isn't nearly as bad, because we're only pinning one
buffer, rather than potentially all of them multiple times, and the
lock we're holding only affects 1/128th of the buffers, not all of
them.
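
Roughly, the lookup would become a loop like this (names mirror the
existing buf_table/bufmgr routines, but the exact signatures in the
patch may differ):

    for (;;)
    {
        int         buf_id;
        volatile BufferDesc *bufHdr;

        buf_id = BufTableLookup(&tag, hash);    /* lockless in the chash version */
        if (buf_id < 0)
            break;                              /* not mapped; caller reads the page in */

        bufHdr = &BufferDescriptors[buf_id];
        PinBuffer(bufHdr, strategy);

        if (BUFFERTAGS_EQUAL(bufHdr->tag, tag))
            return bufHdr;                      /* still the block we asked for */

        UnpinBuffer(bufHdr, true);              /* renamed under us; retry */
    }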

The other application I had in mind for chash was SLRU stuff.  I
haven't tried to write the code yet, but fundamentally, the same
considerations apply there.  You do the lookup locklessly to find out
which buffer is thought to contain the SLRU page you want, but by the
time you lock the page the answer might be out of date.  Assuming that
this doesn't happen many times in a row and that lookups are
relatively cheap, you could get rid of having any centralized lock for
SLRU, and just have per-page locks.

I'm not quite sure what fundamental dangers you're thinking about
here, but from what I understand from reading the literature, doing
lookups in a lock-free hash table needs to be thought about in a sort
of relativistic way.  The different CPUs have slightly different views
of the world, so any answer you get may be obsolete by the time you
get it, and may according to some other CPU's view of the world have
been obsolete even at the time that it was copied.  That's OK, because
it's just equivalent to some other CPU doing its hash table update
slightly sooner or later than it actually did it, within the bounds
imposed by memory barriers, which it could have done anyway.  There is
no one source of truth.  The result of all this is that some of the
synchronization responsibility is transferred to the caller.  That's
obviously a bit more complex to reason about, but it hasn't stopped
lock-free hash tables from being wildly popular data structures.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: dynahash replacement for buffer table

2014-10-14 Thread Andres Freund
On 2014-10-14 20:30:45 +0530, Amit Kapila wrote:
> On Tue, Oct 14, 2014 at 7:55 PM, Andres Freund 
> wrote:
> >
> > On 2014-10-14 09:30:58 -0400, Robert Haas wrote:
> > > A few years ago I started working on a concurrent hash table for
> > > PostgreSQL.  The hash table part of it worked, but I never did
> > > anything with it, really.  Amit mentioned to me earlier this week that
> > > he was seeing contention inside the dynahash machinery, which inspired
> > > me to go back and update the patch.
> >
> > Interestingly I've benchmarked similar loads, even on the same machine
> > as Amit,
> 
> There is one catch here, for these profiles I am using Power-8 m/c
> and the load is slightly higher (5000 scale factor).

Ah, right. I don't think the scale factor changes much, but the
different architecture certainly does. As I said elsewhere, I would not
believe these profiles much until they're actually done with optimized
code...

I also think we need to be a bit careful about optimizing too much for
stock pgbench with working set >> s_b. The uniform readonly access
pattern you're profiling here isn't something super realistic. Still
valuable, but we should take it with a grain of salt.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_get_indexdef() doesn't quote string reloptions

2014-10-14 Thread Eric B. Ridge

> If this communication is in fact intended to be protected by some
> legal privilege, or to remain company confidential, you have
> definitely sent it to the wrong place.  

Sadly I don't control my company's email server.  They however don't control my 
gmail account.  I'll switch to that. 

eric 

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Shapes on the regression test for polygon

2014-10-14 Thread Bruce Momjian
On Tue, Oct 14, 2014 at 05:40:06PM +0300, Emre Hasegeli wrote:
> > I extracted Emre's diagram adjustments from the patch and applied it,
> > and no tabs now.  Emre, I assume your regression changes did not affect
> > the diagram contents.
> 
> Thank you for looking at it.  I wanted to make the tests consistent
> with the diagrams.  Now they look better but they still don't make
> sense with the tests.  I looked at it some more, and came to the
> conclusion that removing them is better than changing the tests.

OK, unless I hear more feedback I will remove the diagrams.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: dynahash replacement for buffer table

2014-10-14 Thread Amit Kapila
On Tue, Oct 14, 2014 at 7:55 PM, Andres Freund 
wrote:
>
> On 2014-10-14 09:30:58 -0400, Robert Haas wrote:
> > A few years ago I started working on a concurrent hash table for
> > PostgreSQL.  The hash table part of it worked, but I never did
> > anything with it, really.  Amit mentioned to me earlier this week that
> > he was seeing contention inside the dynahash machinery, which inspired
> > me to go back and update the patch.
>
> Interestingly I've benchmarked similar loads, even on the same machine
> as Amit,

There is one catch here: for these profiles I am using a Power-8 machine,
and the load is slightly higher (5000 scale factor).



With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Better support of exported snapshots with pg_dump

2014-10-14 Thread Petr Jelinek

On 22/09/14 02:24, Michael Paquier wrote:

On Thu, Sep 4, 2014 at 11:33 PM, Michael Paquier

Taking a dump consistent with a replication slot is useful first for online
upgrade cases, because you can simply run pg_dump, have a slot
created, and also get a state of the database consistent with the
slot creation before replaying changes in one way or another. Using
that, a decoder that generates raw queries, and a receiver able to
apply changes on a remote Postgres server, it is possible to get a
kind of live migration solution from one Postgres instance to another
for a single database, as long as the origin server uses 9.4. Making
the receiver report write and flush positions also makes it possible
for the origin server to use the synchronous replication protocol to
be sure that changes got applied on the remote before performing a
switch from the origin to the remote (that may actually explain why
multi-syncrep would be useful here for multiple databases). Also, I
imagine that users could even use this capability in pg_dump, for
example, to do some post-processing on the dumped data in accordance
with the decoder plugin before applying changes to a remote source.

Now, this is done with the addition of two options in pg_dump to
control the logical slot creation:
- --slot to define the name of the slot being created
- --plugin-name, to define the name of the decoder plugin
And then you can of course do things like that:
# Raw data dump on a slot
$ pg_dump --slot bar --plugin-name test_decoding
# Existing parallel dump not changed:
$ pg_dump -j 4 -f data -F d
# Parallel dump on a slot
$ pg_dump -j 4 --slot bar --plugin-name test_decoding -f data -F d



Wouldn't it be better to have the slot handling done outside of pg_dump 
by whatever replication solution you use, and just have pg_dump accept 
the snapshot as an input parameter? I am not sure how much I like pg_dump 
creating the slot. I am aware that you need to keep the replication 
connection open, but that's IMHO just a matter of scripting it together.


--
 Petr Jelinek  http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: dynahash replacement for buffer table

2014-10-14 Thread Andres Freund
On 2014-10-14 09:30:58 -0400, Robert Haas wrote:
> I took the basic infrastructure from before and used it to replace the
> buffer table.  Patch is attached.

Hm. Unless I missed something you pretty much completely eradicated
locks from the buffer mapping code? That'd be pretty cool, but it's also
scary.

How confident are you that your conversion is correct? Not in the sense
that there aren't any buglets, but fundamentally.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Shapes on the regression test for polygon

2014-10-14 Thread Emre Hasegeli
> I extracted Emre's diagram adjustments from the patch and applied it,
> and no tabs now.  Emre, I assume your regression changes did not affect
> the diagram contents.

Thank you for looking at it.  I wanted to make the tests consistent
with the diagrams.  Now they look better but they still don't make
sense with the tests.  I looked at it some more, and came to the
conclusion that removing them is better than changing the tests.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: dynahash replacement for buffer table

2014-10-14 Thread Andres Freund
On 2014-10-14 09:30:58 -0400, Robert Haas wrote:
> A few years ago I started working on a concurrent hash table for
> PostgreSQL.  The hash table part of it worked, but I never did
> anything with it, really.  Amit mentioned to me earlier this week that
> he was seeing contention inside the dynahash machinery, which inspired
> me to go back and update the patch.

Interestingly I've benchmarked similar loads, even on the same machine
as Amit, and I do see tremendous time spent in dynahash.c. It's nearly
all cache misses in my tests though.

> I took the basic infrastructure
> from before and used it to replace the buffer table.  Patch is
> attached.

That's pretty cool. The algorithm is complex enough that I haven't fully
understood it yet, but it sounds sane at first glance.

> The key idea here is that lookups are done without any locks, only
> memory barriers; and inserts and deletes are done using atomic ops.

Hm. I quickly looked and I see that you use two full barriers for every
lookup. That's pretty expensive. I think this should be doable using
only read/write barriers on the lookup side? Even on architectures where
read/write memory barriers have to be something more than a compiler
barrier, they're still usually an order of magnitude or so cheaper than
full barriers.
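
To illustrate the cost difference, here's a tiny sketch using the existing
primitives from src/include/storage/barrier.h (illustration only, not code
from the patch):

#include "postgres.h"
#include "storage/barrier.h"

static void
barrier_cost_illustration(void)
{
    /* illustration only; see barrier.h for the per-platform definitions */
    pg_read_barrier();      /* orders loads against later loads */
    pg_write_barrier();     /* orders stores against later stores */
    pg_memory_barrier();    /* full fence, including store->load ordering */
}

On a strongly ordered architecture like x86 the first two reduce to plain
compiler barriers, while the full barrier has to emit an actual fencing
instruction.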

> The algorithm is not strictly lock-free for reasons explained in the
> comments in chash.c, but it's a lot less locky than what we have now,
> so in theory you might think that would be a good thing.

I don't see much reason to strive for full lock-freeness. If the locks
aren't noticeable in real-world scenarios - who cares?

> I haven't had time to do much performance testing yet, but it looks
> like this may be slower at low client counts and faster at high client
> counts.  However, my results weren't really reproducible, and I haven't
> done comprehensive testing yet.  What was really bizarre is that I
> couldn't really pin down the cause of the slowness at low client
> counts; a quick perf profile showed overhead concentrated in
> CHashBucketScan, basically memory access latency for walking the
> bucket chain.  But the table is built to have a load factor of 1, so I
> can't see why that should amount to much, or why it should be
> significantly worse than for dynahash.

With regard to using a hash table for the buffer mapping, I'm doubtful
that any form of separate chaining is the right approach. We currently
have a quite noticeable problem with the number of cache misses in the
buffer mapping hash (and also the heavyweight lock manager) - if we stay
with hashes, that's only going to be fundamentally lower than with
dynahash if we change the type of hashing. I've had good, *preliminary*,
results using an open addressing + linear probing approach.
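
For illustration, a minimal open addressing + linear probing lookup could
look like the sketch below (hypothetical names and layout, not the
preliminary code mentioned above):

#include "postgres.h"
#include "storage/buf_internals.h"

/*
 * Toy linear-probing lookup, illustration only.  Entries are stored inline
 * in a power-of-two sized array, so a probe sequence walks consecutive
 * cache lines instead of chasing per-entry pointers.
 */
typedef struct ProbeEntry
{
    bool        used;
    uint32      hashcode;
    BufferTag   key;
    int         value;
} ProbeEntry;

static int
probe_lookup(ProbeEntry *table, uint32 mask, uint32 hashcode, BufferTag *key)
{
    uint32      pos = hashcode & mask;      /* table size is mask + 1 */

    /* assumes the table always keeps at least one slot unused */
    while (table[pos].used)
    {
        if (table[pos].hashcode == hashcode &&
            BUFFERTAGS_EQUAL(table[pos].key, *key))
            return table[pos].value;        /* hit */
        pos = (pos + 1) & mask;             /* probe the next slot */
    }
    return -1;                              /* miss */
}

Most of the extra complexity of that approach is in deletion (tombstones or
backward shifting) and in resizing, not in the lookup itself.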

My guess is that the additional indirection via the arena explains the
difference in cache misses? But I might be completely off here.

It'd be quite interesting to see perf stat -ddd output for dynahash.c-
and chash-employing builds under some comparable load.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Drop any statistics of table after it's truncated

2014-10-14 Thread Tom Lane
Sawada Masahiko  writes:
> I found that the statistics still remain after a table is truncated.
> In addition, ANALYZE ignores tables that do not have any tuples.
> After a table is truncated, its statistics entry continues to remain
> unless an insert and an ANALYZE are executed.
> Is there a reason why the statistics remain?

Yes, actually, that's intentional.  The idea is that once you start
loading data into the table, it's most likely going to look something
like the old data; therefore, the stale statistics are better than
none at all.  Eventually auto-ANALYZE will replace the stats,
but until that happens, it seems best to keep using the old stats.
(Of course there are counterexamples to this, but removing the stats
is bad in more cases than not.)

> The attached patch is a one-line patch that adds RemoveStatistics() to
> ExecuteTruncate(), to remove the statistics entry of the table.

-1, for the reasons explained above.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Wait free LW_SHARED acquisition - v0.9

2014-10-14 Thread Andres Freund
On 2014-10-14 08:40:49 -0500, Merlin Moncure wrote:
> On Fri, Oct 10, 2014 at 11:00 AM, Andres Freund  
> wrote:
> > On 2014-10-10 16:41:39 +0200, Andres Freund wrote:
> >> FWIW, the profile always looks like
> >> -  48.61%  postgres  postgres  [.] s_lock
> >>- s_lock
> >>   + 96.67% StrategyGetBuffer
> >>   + 1.19% UnpinBuffer
> >>   + 0.90% PinBuffer
> >>   + 0.70% hash_search_with_hash_value
> >> +   3.11%  postgres  postgres  [.] GetSnapshotData
> >> +   2.47%  postgres  postgres  [.] StrategyGetBuffer
> >> +   1.93%  postgres  [kernel.kallsyms] [k] copy_user_generic_string
> >> +   1.28%  postgres  postgres  [.] 
> >> hash_search_with_hash_value
> >> -   1.27%  postgres  postgres  [.] LWLockAttemptLock
> >>- LWLockAttemptLock
> >>   - 97.78% LWLockAcquire
> >>  + 38.76% ReadBuffer_common
> >>  + 28.62% _bt_getbuf
> >>  + 8.59% _bt_relandgetbuf
> >>  + 6.25% GetSnapshotData
> >>  + 5.93% VirtualXactLockTableInsert
> >>  + 3.95% VirtualXactLockTableCleanup
> >>  + 2.35% index_fetch_heap
> >>  + 1.66% StartBufferIO
> >>  + 1.56% LockReleaseAll
> >>  + 1.55% _bt_next
> >>  + 0.78% LockAcquireExtended
> >>   + 1.47% _bt_next
> >>   + 0.75% _bt_relandgetbuf
> >>
> >> to me. Now that's with the client count 496, but it's similar with lower
> >> counts.
> >>
> >> BTW, that profile *clearly* indicates we should make StrategyGetBuffer()
> >> smarter.
> >
> > Which is nearly trivial now that atomics are in. Check out the attached
> > WIP patch which eliminates the spinlock from StrategyGetBuffer() unless
> > there's buffers on the freelist.
> 
> Is this safe?
> 
> + victim = pg_atomic_fetch_add_u32(&StrategyControl->nextVictimBuffer, 1);
> 
> - if (++StrategyControl->nextVictimBuffer >= NBuffers)
> + buf = &BufferDescriptors[victim % NBuffers];
> +
> + if (victim % NBuffers == 0)
> 
> 
> I don't think there's any guarantee you could keep nextVictimBuffer
> from wandering off the end.

Not sure what you mean? It'll cycle around at 2^32. The code doesn't try
to keep nextVictimBuffer from going above NBuffers. To avoid overrunning
&BufferDescriptors I'm doing % NBuffers.

Yes, that'll have the disadvantage of the first buffers being slightly
more likely to be hit, but for that to be relevant you'd need
unrealistically large amounts of shared_buffers.
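
To spell that out, the victim selection boils down to something like this
sketch (simplified; it assumes nextVictimBuffer has been converted to a
pg_atomic_uint32 as in the WIP patch):

static volatile BufferDesc *
clock_sweep_next(void)
{
    uint32      victim;

    /*
     * Sketch only, simplified from the WIP patch.  Atomically claim the
     * next tick of the clock hand.  The counter is simply allowed to run
     * up to UINT32_MAX and wrap; the modulo keeps the array index valid.
     * Because 2^32 is generally not a multiple of NBuffers, buffers with
     * low indexes get visited marginally more often right after a
     * wraparound - the small bias mentioned above.
     */
    victim = pg_atomic_fetch_add_u32(&StrategyControl->nextVictimBuffer, 1);

    return &BufferDescriptors[victim % NBuffers];
}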

> You could buy a little safety by CAS'ing
> zero in instead, followed by atomic increment.  However that's still
> pretty dodgy IMO and requires two atomic ops which will underperform
> the spin in some situations.
> 
> I like Robert's idea to keep the spinlock but preallocate a small
> amount of buffers, say 8 or 16.

I think that's pretty much orthogonal. I *do* think it makes sense to
increment nextVictimBuffer in bigger steps. But the above doesn't
prohibit doing so - and it'll still be much better without the
spinlock. Same number of atomics, but no potential of spinning and no
potential of being put to sleep while holding the spinlock.

This callsite has a comparatively large likelihood of being put to sleep
while holding a spinlock because it can run for a very long time doing
nothing but spinlocking.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Wait free LW_SHARED acquisition - v0.9

2014-10-14 Thread Merlin Moncure
On Fri, Oct 10, 2014 at 11:00 AM, Andres Freund  wrote:
> On 2014-10-10 16:41:39 +0200, Andres Freund wrote:
>> FWIW, the profile always looks like
>> -  48.61%  postgres  postgres  [.] s_lock
>>- s_lock
>>   + 96.67% StrategyGetBuffer
>>   + 1.19% UnpinBuffer
>>   + 0.90% PinBuffer
>>   + 0.70% hash_search_with_hash_value
>> +   3.11%  postgres  postgres  [.] GetSnapshotData
>> +   2.47%  postgres  postgres  [.] StrategyGetBuffer
>> +   1.93%  postgres  [kernel.kallsyms] [k] copy_user_generic_string
>> +   1.28%  postgres  postgres  [.] 
>> hash_search_with_hash_value
>> -   1.27%  postgres  postgres  [.] LWLockAttemptLock
>>- LWLockAttemptLock
>>   - 97.78% LWLockAcquire
>>  + 38.76% ReadBuffer_common
>>  + 28.62% _bt_getbuf
>>  + 8.59% _bt_relandgetbuf
>>  + 6.25% GetSnapshotData
>>  + 5.93% VirtualXactLockTableInsert
>>  + 3.95% VirtualXactLockTableCleanup
>>  + 2.35% index_fetch_heap
>>  + 1.66% StartBufferIO
>>  + 1.56% LockReleaseAll
>>  + 1.55% _bt_next
>>  + 0.78% LockAcquireExtended
>>   + 1.47% _bt_next
>>   + 0.75% _bt_relandgetbuf
>>
>> to me. Now that's with the client count 496, but it's similar with lower
>> counts.
>>
>> BTW, that profile *clearly* indicates we should make StrategyGetBuffer()
>> smarter.
>
> Which is nearly trivial now that atomics are in. Check out the attached
> WIP patch which eliminates the spinlock from StrategyGetBuffer() unless
> there's buffers on the freelist.

Is this safe?

+ victim = pg_atomic_fetch_add_u32(&StrategyControl->nextVictimBuffer, 1);

- if (++StrategyControl->nextVictimBuffer >= NBuffers)
+ buf = &BufferDescriptors[victim % NBuffers];
+
+ if (victim % NBuffers == 0)


I don't think there's any guarantee you could keep nextVictimBuffer
from wandering off the end.  You could buy a little safety by CAS'ing
zero in instead, followed by atomic increment.  However that's still
pretty dodgy IMO and requires two atomic ops which will underperform
the spin in some situations.

I like Robert's idea to keep the spinlock but preallocate a small
amount of buffers, say 8 or 16.

merlin


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Drop any statistics of table after it's truncated

2014-10-14 Thread Sawada Masahiko
Hi all,

I found that the statistics still remain after a table is truncated.
In addition, ANALYZE ignores tables that do not have any tuples.
After a table is truncated, its statistics entry continues to remain
unless an insert and an ANALYZE are executed.
Is there a reason why the statistics remain?

The attached patch is a one-line patch that adds RemoveStatistics() to
ExecuteTruncate(), to remove the statistics entry of the table.

-- 
Regards,

---
Sawada Masahiko


Drop_statistics_after_truncated_v1.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] WIP: dynahash replacement for buffer table

2014-10-14 Thread Robert Haas
A few years ago I started working on a concurrent hash table for
PostgreSQL.  The hash table part of it worked, but I never did
anything with it, really.  Amit mentioned to me earlier this week that
he was seeing contention inside the dynahash machinery, which inspired
me to go back and update the patch.  I took the basic infrastructure
from before and used it to replace the buffer table.  Patch is
attached.

The key idea here is that lookups are done without any locks, only
memory barriers; and inserts and deletes are done using atomic ops.
The algorithm is not strictly lock-free for reasons explained in the
comments in chash.c, but it's a lot less locky than what we have now,
so in theory you might think that would be a good thing.
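
To make the atomic-ops part concrete, a toy compare-and-swap insert into a
bucket chain could look like the sketch below; the names and the
arena-offset layout are made up for illustration, and this is not the code
in the attached patch:

#include "postgres.h"
#include "port/atomics.h"

/*
 * Illustration only.  Entries live in a shared arena and are linked by
 * uint32 arena offsets rather than raw pointers, so the links can be
 * updated with 32-bit atomics.
 */
typedef struct ToyEntry
{
    uint32      next;       /* arena offset of the next entry in the chain */
    /* ... key and value would go here ... */
} ToyEntry;

static void
toy_insert(pg_atomic_uint32 *bucket_head, ToyEntry *arena, uint32 new_off)
{
    uint32      old_head = pg_atomic_read_u32(bucket_head);

    for (;;)
    {
        /* link the new entry in front of the current chain */
        arena[new_off].next = old_head;

        /*
         * Publish it: if the bucket head is still old_head, atomically
         * swing it to new_off.  On failure old_head is refreshed with the
         * current value and we retry.  A real insert would also search the
         * chain for a duplicate key first.
         */
        if (pg_atomic_compare_exchange_u32(bucket_head, &old_head, new_off))
            break;
    }
}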

I haven't had time to do much performance testing yet, but it looks
like this may be slower at low client counts and faster at high client
counts.  However, my results weren't really reproducible, and I haven't
done comprehensive testing yet.  What was really bizarre is that I
couldn't really pin down the cause of the slowness at low client
counts; a quick perf profile showed overhead concentrated in
CHashBucketScan, basically memory access latency for walking the
bucket chain.  But the table is built to have a load factor of 1, so I
can't see why that should amount to much, or why it should be
significantly worse than for dynahash.

This patch contains assorted leftovers and is grotty in various ways,
but I'm sharing it anyway just to get it out there.

git branch also available at:
http://git.postgresql.org/gitweb/?p=users/rhaas/postgres.git;a=shortlog;h=refs/heads/chash2014

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
diff --git a/contrib/Makefile b/contrib/Makefile
index b37d0dd..0b91ac1 100644
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -20,6 +20,7 @@ SUBDIRS = \
 		earthdistance	\
 		file_fdw	\
 		fuzzystrmatch	\
+		hashtest	\
 		hstore		\
 		intagg		\
 		intarray	\
diff --git a/contrib/hashtest/Makefile b/contrib/hashtest/Makefile
new file mode 100644
index 000..3ee42f8
--- /dev/null
+++ b/contrib/hashtest/Makefile
@@ -0,0 +1,18 @@
+# contrib/hashtest/Makefile
+
+MODULE_big = hashtest
+OBJS = hashtest.o
+
+EXTENSION = hashtest
+DATA = hashtest--1.0.sql
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/hashtest
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/hashtest/hashtest--1.0.sql b/contrib/hashtest/hashtest--1.0.sql
new file mode 100644
index 000..e271baf
--- /dev/null
+++ b/contrib/hashtest/hashtest--1.0.sql
@@ -0,0 +1,52 @@
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION hashtest" to load this file. \quit
+
+CREATE FUNCTION chash_insert_test()
+RETURNS void
+AS 'MODULE_PATHNAME', 'chash_insert_test'
+LANGUAGE C;
+
+CREATE FUNCTION chash_search_test()
+RETURNS void
+AS 'MODULE_PATHNAME', 'chash_search_test'
+LANGUAGE C;
+
+CREATE FUNCTION chash_delete_test()
+RETURNS void
+AS 'MODULE_PATHNAME', 'chash_delete_test'
+LANGUAGE C;
+
+CREATE FUNCTION chash_concurrent_test()
+RETURNS void
+AS 'MODULE_PATHNAME', 'chash_concurrent_test'
+LANGUAGE C;
+
+CREATE FUNCTION chash_collision_test()
+RETURNS void
+AS 'MODULE_PATHNAME', 'chash_collision_test'
+LANGUAGE C;
+
+CREATE FUNCTION dynahash_insert_test()
+RETURNS void
+AS 'MODULE_PATHNAME', 'dynahash_insert_test'
+LANGUAGE C;
+
+CREATE FUNCTION dynahash_search_test()
+RETURNS void
+AS 'MODULE_PATHNAME', 'dynahash_search_test'
+LANGUAGE C;
+
+CREATE FUNCTION dynahash_delete_test()
+RETURNS void
+AS 'MODULE_PATHNAME', 'dynahash_delete_test'
+LANGUAGE C;
+
+CREATE FUNCTION dynahash_concurrent_test()
+RETURNS void
+AS 'MODULE_PATHNAME', 'dynahash_concurrent_test'
+LANGUAGE C;
+
+CREATE FUNCTION dynahash_collision_test()
+RETURNS void
+AS 'MODULE_PATHNAME', 'dynahash_collision_test'
+LANGUAGE C;
diff --git a/contrib/hashtest/hashtest.c b/contrib/hashtest/hashtest.c
new file mode 100644
index 000..172a5bb
--- /dev/null
+++ b/contrib/hashtest/hashtest.c
@@ -0,0 +1,527 @@
+/*-
+ * hashtest.c
+ *-
+ */
+
+#include "postgres.h"
+
+#include "funcapi.h"
+#include "libpq/auth.h"
+#include "lib/stringinfo.h"
+#include "miscadmin.h"
+#include "portability/instr_time.h"
+#include "storage/ipc.h"
+#include "utils/chash.h"
+
+PG_MODULE_MAGIC;
+
+void		_PG_init(void);
+Datum		chash_insert_test(PG_FUNCTION_ARGS);
+Datum		chash_search_test(PG_FUNCTION_ARGS);
+Datum		chash_delete_test(PG_FUNCTION_ARGS);
+Datum		chash_concurrent_test(PG_FUNCTION_ARGS);
+Datum		chash_collision_test(PG_FUNCTION_ARGS);
+Datum		dynahash_insert_test(PG_FUNCTION_ARGS);
+Datum		dynahash_search_test(PG_FUNCTION_ARGS);
+Datum		dynahash_delete_test(P

Re: [HACKERS] pgaudit - an auditing extension for PostgreSQL

2014-10-14 Thread Stephen Frost
Abhijit,

* Abhijit Menon-Sen (a...@2ndquadrant.com) wrote:
> As before, the pgaudit code is at https://github.com/2ndQuadrant/pgaudit
> I did a quick round of testing to make sure things still work.

Thanks!

I recently had a very interesting discussion about auditing rules and the
need to limit what gets audited by user, table, column, policy, etc., and
an idea came up (not my own) about how to support that granularity
without having to modify the existing PG catalogs or use GUCs or
reloptions, etc.  It goes something like this:

Create an 'audit' role.

Every command run by roles which are granted to the 'audit' role is
audited.

Every 'select' against tables on which the 'audit' role has 'select'
rights is audited.  Similarly for every insert, update, delete.

Every 'select' against columns of tables on which the 'audit' role has
'select' rights is audited - and only those columns are logged.
Similarly for every insert, update, delete.

etc, etc, throughout the various objects which can have permissions.

We don't currently have more granular permissions for roles (role
membership is "all-or-nothing" currently) and so if we want to support
things like "audit all DML for this role" we need multiple roles, e.g.:

Create an 'audit_rw' role.  DML commands run by roles granted to the
'audit_rw' role are audited.  Similarly for 'audit_ro' or other
permutations.

The 'audit*' roles would need to be configured for pgAudit in some
fashion, but that's acceptable and is much more reasonable than having
an independent config file which has to keep track of the specific roles
or tables being audited.

The 'audit_ro' and 'audit_rw' roles lead to an interesting thought about
supporting something like the following, which I want to mention here
before I forget about it, but which should probably be a new thread if
folks think it's interesting:

GRANT role1 (SELECT) to role2;

Such that 'role2' would have role1's SELECT rights, but not role1's
INSERT, or UPDATE, or DELETE rights.  There would be more complexity if
we want to support this for more than just normal relations, of course.

Thanks,

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] [9.4 bug] The database server hangs with write-heavy workload on Windows

2014-10-14 Thread MauMau

From: "Heikki Linnakangas" 

Committed this.


Thank you very much.  I didn't anticipate such a difficult, complicated 
cause.  The user agreed to try the patch tonight.  I'll report back the 
result as soon as I get it from him.


BTW, in LWLockWaitForVar(), the first line of the following code fragment is 
not necessary, because lwWaitLink is set to lock->head immediately afterward. 
I think it would be good to eliminate as much unnecessary code as possible 
from the section protected by the spinlock.


 proc->lwWaitLink = NULL;

 /* waiters are added to the front of the queue */
 proc->lwWaitLink = lock->head;
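
I.e. the suggested simplification is just the following (sketch; the rest
of the enqueue code under the spinlock stays the same):

  /* waiters are added to the front of the queue */
  proc->lwWaitLink = lock->head;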

Regards
MauMau



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgaudit - an auditing extension for PostgreSQL

2014-10-14 Thread Abhijit Menon-Sen
At 2014-10-14 18:05:00 +0530, a...@2ndquadrant.com wrote:
>
> As before, the pgaudit code is at
> https://github.com/2ndQuadrant/pgaudit

Sorry, I forgot to attach the tarball.

-- Abhijit


pgaudit.tar.gz
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgaudit - an auditing extension for PostgreSQL

2014-10-14 Thread Abhijit Menon-Sen
Hi.

As before, the pgaudit code is at https://github.com/2ndQuadrant/pgaudit
I did a quick round of testing to make sure things still work.

I'll re-add it to the next CF now.

-- Abhijit


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] inherit support for foreign tables

2014-10-14 Thread Etsuro Fujita

(2014/09/12 16:30), Etsuro Fujita wrote:

(2014/09/11 20:51), Heikki Linnakangas wrote:

On 09/11/2014 02:30 PM, Etsuro Fujita wrote:

So, should I split the patch into the two?


Yeah, please do.


OK, Will do.


Here are separated patches.

fdw-chk.patch  - CHECK constraints on foreign tables
fdw-inh.patch  - table inheritance with foreign tables

The latter has been created on top of [1].  I've addressed the comment 
from Horiguchi-san [2] in it also, though I've slightly modified his 
proposal.


The previous patch included functionality for path reparameterization for 
ForeignScan, but since that functionality would be useful not only for such 
table inheritance cases but also for UNION ALL cases, I'd like to add it as 
an independent feature, maybe in CF 2014-12.


Thanks,

[1] http://www.postgresql.org/message-id/540da168.3040...@lab.ntt.co.jp
[2] 
http://www.postgresql.org/message-id/20140902.142218.253402812.horiguchi.kyot...@lab.ntt.co.jp


Best regards,
Etsuro Fujita
*** a/contrib/file_fdw/input/file_fdw.source
--- b/contrib/file_fdw/input/file_fdw.source
***
*** 134,139  DELETE FROM agg_csv WHERE a = 100;
--- 134,149 
  -- but this should be ignored
  SELECT * FROM agg_csv FOR UPDATE;
  
+ -- constraint exclusion tests
+ ALTER FOREIGN TABLE agg_csv ADD CONSTRAINT agg_csv_a_check CHECK (a >= 0);
+ \t on
+ EXPLAIN (VERBOSE, COSTS FALSE) SELECT * FROM agg_csv WHERE a < 0;
+ SET constraint_exclusion = 'on';
+ EXPLAIN (VERBOSE, COSTS FALSE) SELECT * FROM agg_csv WHERE a < 0;
+ SET constraint_exclusion = 'partition';
+ \t off
+ ALTER FOREIGN TABLE agg_csv DROP CONSTRAINT agg_csv_a_check;
+ 
  -- privilege tests
  SET ROLE file_fdw_superuser;
  SELECT * FROM agg_text ORDER BY a;
*** a/contrib/file_fdw/output/file_fdw.source
--- b/contrib/file_fdw/output/file_fdw.source
***
*** 219,224  SELECT * FROM agg_csv FOR UPDATE;
--- 219,242 
42 |  324.78
  (3 rows)
  
+ -- constraint exclusion tests
+ ALTER FOREIGN TABLE agg_csv ADD CONSTRAINT agg_csv_a_check CHECK (a >= 0);
+ \t on
+ EXPLAIN (VERBOSE, COSTS FALSE) SELECT * FROM agg_csv WHERE a < 0;
+  Foreign Scan on public.agg_csv
+Output: a, b
+Filter: (agg_csv.a < 0)
+Foreign File: @abs_srcdir@/data/agg.csv
+ 
+ SET constraint_exclusion = 'on';
+ EXPLAIN (VERBOSE, COSTS FALSE) SELECT * FROM agg_csv WHERE a < 0;
+  Result
+Output: a, b
+One-Time Filter: false
+ 
+ SET constraint_exclusion = 'partition';
+ \t off
+ ALTER FOREIGN TABLE agg_csv DROP CONSTRAINT agg_csv_a_check;
  -- privilege tests
  SET ROLE file_fdw_superuser;
  SELECT * FROM agg_text ORDER BY a;
*** a/contrib/postgres_fdw/expected/postgres_fdw.out
--- b/contrib/postgres_fdw/expected/postgres_fdw.out
***
*** 2488,2493  select c2, count(*) from "S 1"."T 1" where c2 < 500 group by 1 order by 1;
--- 2488,2512 
  (13 rows)
  
  -- ===
+ -- test constraint exclusion stuff
+ -- ===
+ ALTER FOREIGN TABLE ft1 ADD CONSTRAINT ft1_c1_check CHECK (c1 > 0);
+ \t on
+ EXPLAIN (VERBOSE, COSTS FALSE) SELECT * FROM ft1 WHERE c1 <= 0;
+  Foreign Scan on public.ft1
+Output: c1, c2, c3, c4, c5, c6, c7, c8
+Remote SQL: SELECT "C 1", c2, c3, c4, c5, c6, c7, c8 FROM "S 1"."T 1" WHERE (("C 1" <= 0))
+ 
+ SET constraint_exclusion = 'on';
+ EXPLAIN (VERBOSE, COSTS FALSE) SELECT * FROM ft1 WHERE c1 <= 0;
+  Result
+Output: c1, c2, c3, c4, c5, c6, c7, c8
+One-Time Filter: false
+ 
+ SET constraint_exclusion = 'partition';
+ \t off
+ ALTER FOREIGN TABLE ft1 DROP CONSTRAINT ft1_c1_check;
+ -- ===
  -- test serial columns (ie, sequence-based defaults)
  -- ===
  create table loc1 (f1 serial, f2 text);
*** a/contrib/postgres_fdw/sql/postgres_fdw.sql
--- b/contrib/postgres_fdw/sql/postgres_fdw.sql
***
*** 387,392  select c2, count(*) from ft2 where c2 < 500 group by 1 order by 1;
--- 387,404 
  select c2, count(*) from "S 1"."T 1" where c2 < 500 group by 1 order by 1;
  
  -- ===
+ -- test constraint exclusion stuff
+ -- ===
+ ALTER FOREIGN TABLE ft1 ADD CONSTRAINT ft1_c1_check CHECK (c1 > 0);
+ \t on
+ EXPLAIN (VERBOSE, COSTS FALSE) SELECT * FROM ft1 WHERE c1 <= 0;
+ SET constraint_exclusion = 'on';
+ EXPLAIN (VERBOSE, COSTS FALSE) SELECT * FROM ft1 WHERE c1 <= 0;
+ SET constraint_exclusion = 'partition';
+ \t off
+ ALTER FOREIGN TABLE ft1 DROP CONSTRAINT ft1_c1_check;
+ 
+ -- ===
  -- test serial columns (ie, sequence-based defaults)
  -- ===
  create table loc1 (f1 serial, f2 text);
*** a/doc/sr

Re: [HACKERS] Scaling shared buffer eviction

2014-10-14 Thread Andres Freund
On 2014-10-14 15:24:57 +0530, Amit Kapila wrote:
> After that I observed that contention for LW_SHARED has reduced
> for this load, but it didn't help much in terms of performance, so I again
> rechecked the profile, and this time most of the contention has moved
> to the spinlock used in dynahash for the buffer mapping table; please refer
> to the profile (for 128 client count; read-only load) below:
> 
> bgreclaimer patch + wait free lw_shared acquisition patches -
> --
> 
> 9.24%  swapper  [unknown]   [H] 0x011d9c10
> +   7.19% postgres  postgres[.] s_lock
> +   3.52% postgres  postgres[.] GetSnapshotData
> +   3.02% postgres  postgres[.] calc_bucket
> +   2.71% postgres  postgres  [.] hash_search_with_hash_value
> 2.32% postgres  [unknown]   [H] 0x011e0d7c
> +   2.17% postgres  postgres  [.] pg_atomic_fetch_add_u32_impl
> +   1.84% postgres  postgres[.] AllocSetAlloc
> +   1.57% postgres  postgres[.] _bt_compare
> +   1.05% postgres  postgres[.] AllocSetFreeIndex
> +   1.02% postgres  [kernel.kallsyms]   [k] .__copy_tofrom_user_power7
> +   0.94% postgres  postgres[.] tas
> +   0.85%  swapper  [kernel.kallsyms]   [k] .int_sqrt
> +   0.80% postgres  postgres[.] pg_encoding_mbcliplen
> +   0.78%  pgbench  [kernel.kallsyms]   [k] .find_busiest_group
> 0.65%  pgbench  [unknown]   [H] 0x011d96e0
> +   0.59% postgres  postgres[.] hash_any
> +   0.54% postgres  postgres[.] LWLockRelease

This profile is without -O2 again. I really don't think it makes much
sense to draw much inference from an unoptimized build. I realize that
you said that the builds you use for benchmarking don't have that
problem, but that doesn't make this profile meaningful...

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Scaling shared buffer eviction

2014-10-14 Thread Amit Kapila
On Thu, Oct 9, 2014 at 6:17 PM, Amit Kapila  wrote:
>
> On Fri, Sep 26, 2014 at 7:04 PM, Robert Haas 
wrote:
> >
> > On another point, I think it would be a good idea to rebase the
> > bgreclaimer patch over what I committed, so that we have a
> > clean patch against master to test with.
>
> Please find the rebased patch attached with this mail.  I have taken
> some performance data as well and done some analysis based on
> the same.
>
>
>
> Below data is median of 3 runs.
>
> patch_ver/client_count      1       8      32      64     128     256
> HEAD                    18884  118628  251093  216294  186625  177505
> PATCH                   18743  122578  247243  205521  179712  175031
>
>
> Here we can see that the performance dips at higher client
> count(>=32) which was quite surprising for me, as I was expecting
> it to improve, because bgreclaimer reduces the contention by making
> buffers available on free list.  So I tried to analyze the situation by
> using perf and found that in above configuration, there is a contention
> around freelist spinlock with HEAD and the same is removed by Patch,
> but still the performance goes down with Patch.  On further analysis, I
> observed that actually after Patch there is an increase in contention
> around ProcArrayLock (shared LWlock) via GetSnapshotData which
> sounds bit odd, but that's what I can see in profiles.  Based on analysis,
> few ideas which I would like to further investigate are:
> a.  As there is an increase in spinlock contention, I would like to check
> with Andres's latest patch which reduces contention around shared
> lwlocks.

I have tried by applying Andres's Wait free LW_SHARED acquisition
patch posted by him at below link along with bgreclaimer patch:
http://www.postgresql.org/message-id/20141008133533.ga5...@alap3.anarazel.de

After that I observed that contention for LW_SHARED has reduced
for this load, but it didn't help much in terms of performance, so I again
rechecked the profile, and this time most of the contention has moved
to the spinlock used in dynahash for the buffer mapping table; please refer
to the profile (for 128 client count; read-only load) below:

bgreclaimer patch + wait free lw_shared acquisition patches -
--

9.24%  swapper  [unknown]   [H] 0x011d9c10
+   7.19% postgres  postgres[.] s_lock
+   3.52% postgres  postgres[.] GetSnapshotData
+   3.02% postgres  postgres[.] calc_bucket
+   2.71% postgres  postgres  [.] hash_search_with_hash_value
2.32% postgres  [unknown]   [H] 0x011e0d7c
+   2.17% postgres  postgres  [.] pg_atomic_fetch_add_u32_impl
+   1.84% postgres  postgres[.] AllocSetAlloc
+   1.57% postgres  postgres[.] _bt_compare
+   1.05% postgres  postgres[.] AllocSetFreeIndex
+   1.02% postgres  [kernel.kallsyms]   [k] .__copy_tofrom_user_power7
+   0.94% postgres  postgres[.] tas
+   0.85%  swapper  [kernel.kallsyms]   [k] .int_sqrt
+   0.80% postgres  postgres[.] pg_encoding_mbcliplen
+   0.78%  pgbench  [kernel.kallsyms]   [k] .find_busiest_group
0.65%  pgbench  [unknown]   [H] 0x011d96e0
+   0.59% postgres  postgres[.] hash_any
+   0.54% postgres  postgres[.] LWLockRelease


Detailed Profile Data
-
-   7.19% postgres  postgres  [.] s_lock
   - s_lock
      - 3.79% s_lock
         - 1.69% get_hash_entry
              hash_search_with_hash_value
              BufTableInsert
              BufferAlloc
              0
         - 1.63% hash_search_with_hash_value
            - 1.63% BufTableDelete
                 BufferAlloc
                 ReadBuffer_common
                 ReadBufferExtended
      - 1.45% hash_search_with_hash_value
         - 1.45% hash_search_with_hash_value
              BufTableDelete
              BufferAlloc
              ReadBuffer_common
      - 1.43% get_hash_entry
         - 1.43% get_hash_entry
              hash_search_with_hash_value
              BufTableInsert
              BufferAlloc
              ReadBuffer_common
-   3.52% postgres  postgres  [.] GetSnapshotData
   - GetSnapshotData
      - 3.50% GetSnapshotData
         - 3.49% GetTransactionSnapshot
            - 1.75% exec_bind_message
                 PostgresMain
                 0
            - 1.74% PortalStart
                 exec_bind_message
                 PostgresMain
                 0

To reduce the above contention, I tried to write a patch to replace the
spinlock used in dynahash to manage the free list with atomic operations.
There is still work pending for this patch with respect to ensuring whether
the approach used in the patch is c

Re: [HACKERS] pg_receivexlog --status-interval add fsync feedback

2014-10-14 Thread furuyao
> >> >>> If we remove --fsync-interval, resoponse time to user will not
> be
> >> delay.
> >> >>> Although, fsync will be executed multiple times in a short period.
> >> >>> And there is no way to solve the problem without
> >> >>> --fsync-interval, what
> >> >> should we do about it?
> >> >>
> >> >> I'm sorry, I didn't understand that.
> >> >
> >> > Here is an example.
> >> > When WAL is sent at 100ms intervals, fsync() is executed 10 times
> >> > per
> >> second.
> >> > If --fsync-interval is set by 1 second, we have to wait SQL
> >> responce(kind of making WAL record) for 1 second, though, fsync()
> >> won't be executed several times in 1 second.
> >> > I think --fsync-interval meets the demands of people who wants to
> >> restrain fsync() happens for several time in short period, but what
> >> do you think?
> >> > Is it ok to delete --fsync-interval ?
> >>
> >> I still don't see the problem.
> >>
> >> In synchronous mode, pg_receivexlog should have similar logic as
> >> walreceiver does. It should read as much WAL from the socket as it
> >> can without blocking, and fsync() and send reply after that. And also
> >> fsync whenever switching to new segment. If the master sends WAL
> >> every 100ms, then pg_recevexlog will indeed fsync() 10 times per
> >> second. There's nothing wrong with that.
> >>
> >> In asynchronous mode, only fsync whenever switching to new segment.
> >>
> >> >> Yeah. Or rather, add a new message type, to indicate the
> >> >> synchronous/asynchronous status.
> >> >
> >> > What kind of 'message type' we have to add ?
> >> >
> >> > Do we need to separate 'w' into two types ? synchronous and
> >> asynchronous ?
> >> >
> >> > OR
> >> >
> >> > Add a new message type, kind of 'notify synchronous', and inform
> >> > pg_receivexlog of synchronous status when it connect to the server.
> >>
> >> Better to add a new "notify" message type. And pg_recevexlog should
> >> be prepared to receive it at any time. The status might change on the
> >> fly, if the server's configuration is reloaded.
> >
> > Thanks for the reply.
> >
> >> In synchronous mode, pg_receivexlog should have similar logic as
> walreceiver does.
> >
> > OK. I understand that removing --fsync-interval has no problem.
> 
> +1 for adding something like --synchronous option. To me,
> it sounds like a walreceiver-compatible mode rather than synchronous.
> 
> >> Better to add a new "notify" message type. And pg_recevexlog should
> be prepared to receive it at any time. The status might change on the
> fly, if the server's configuration is reloaded.
> >
> > OK. I'll consider it.
> 
> I don't think that's a good idea because it prevents us from using
> pg_receivexlog as async walreceiver (i.e., received WAL data is fsynced
> and feedback is sent back to the server soon, but transaction commit in
> the server doesn't wait for the feedback).

Thanks for the comment.

> it prevents us from using pg_receivexlog as async walreceiver.

Yes. If sync or async is switched by a message, then when pg_receivexlog
is async, it won't work like a walreceiver.

Users don't have to set options if the synchronous status can be
distinguished by a message, though they cannot specify the action.

> adding something like --synchronous option
Would it be more appropriate to integrate the options required for sync,
like the above?

Regards,

--
Furuya Osamu


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Commitfest progress, or lack thereof

2014-10-14 Thread Andres Freund
On 2014-10-13 21:01:57 +0300, Heikki Linnakangas wrote:
> The August commitfest is still Open, with a few more patches left. The
> patches that remain have stayed in limbo for a long time. It's not realistic
> to expect anything to happen to them.
> 
> I'm going to move the remaining patches to the next commitfest, and close
> the August one. I hate to do that, because the whole point of a commitfest
> is to get patches either rejected or committed, and not leave them hanging.
> But if nothing's happening, there's no point waiting.

FWIW, I think all of the remaining patches did get a fair amount of
feedback. It's not like they were completely ignored.

Thanks for managing!

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Commitfest progress, or lack thereof

2014-10-14 Thread Amit Kapila
On Mon, Oct 13, 2014 at 11:31 PM, Heikki Linnakangas <
hlinnakan...@vmware.com> wrote:
>
> I'm going to move the remaining patches to the next commitfest, and close
the August one.

Many thanks for managing the commitfest in the best possible
way.  I think it is a big bonanza for all the authors who have
patches in the CF, as most of the patches got a fair amount of
review in just one CF.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] split builtins.h to quote.h

2014-10-14 Thread Michael Paquier
On Mon, Oct 13, 2014 at 11:14 PM, Robert Haas  wrote:
>
> IMHO, putting some prototypes for a .c file in one header and others
> in another header is going to make it significantly harder to figure
> out which files you need to #include when. Keeping a simple rule there
> seems essential to me.

OK. Let's make the separation clear then. The attached does so, and
moves the remaining quote_* functions previously in builtins.h to
quote.h. I am adding it to the upcoming CF for tracking the effort on
it.
-- 
Michael
From d36cb786e15a296ed9b1a2b3d84741821aaa01df Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Sat, 11 Oct 2014 22:28:05 +0900
Subject: [PATCH] Move all quote-related functions into a single header quote.h

Note that this impacts as well contrib modules, and mainly external
modules particularly for quote_identifier.
---
 contrib/dblink/dblink.c   |   1 +
 contrib/postgres_fdw/deparse.c|   1 +
 contrib/postgres_fdw/postgres_fdw.c   |   1 +
 contrib/tablefunc/tablefunc.c |   1 +
 contrib/test_decoding/test_decoding.c |   1 +
 contrib/worker_spi/worker_spi.c   |   2 +-
 src/backend/catalog/namespace.c   |   1 +
 src/backend/catalog/objectaddress.c   |   1 +
 src/backend/commands/explain.c|   1 +
 src/backend/commands/extension.c  |   1 +
 src/backend/commands/matview.c|   2 +-
 src/backend/commands/trigger.c|   1 +
 src/backend/commands/tsearchcmds.c|   1 +
 src/backend/parser/parse_utilcmd.c|   1 +
 src/backend/utils/adt/format_type.c   |   1 +
 src/backend/utils/adt/quote.c | 110 ++
 src/backend/utils/adt/regproc.c   |   1 +
 src/backend/utils/adt/ri_triggers.c   |   1 +
 src/backend/utils/adt/ruleutils.c | 109 +
 src/backend/utils/adt/varlena.c   |   1 +
 src/backend/utils/cache/ts_cache.c|   1 +
 src/backend/utils/misc/guc.c  |   1 +
 src/include/utils/builtins.h  |  11 
 src/include/utils/quote.h |  31 ++
 src/pl/plpgsql/src/pl_gram.y  |   1 +
 src/pl/plpython/plpy_plpymodule.c |   1 +
 26 files changed, 164 insertions(+), 121 deletions(-)
 create mode 100644 src/include/utils/quote.h

diff --git a/contrib/dblink/dblink.c b/contrib/dblink/dblink.c
index d77b3ee..cbbc513 100644
--- a/contrib/dblink/dblink.c
+++ b/contrib/dblink/dblink.c
@@ -56,6 +56,7 @@
 #include "utils/guc.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
+#include "utils/quote.h"
 #include "utils/rel.h"
 #include "utils/tqual.h"
 
diff --git a/contrib/postgres_fdw/deparse.c b/contrib/postgres_fdw/deparse.c
index 2045774..0009cd3 100644
--- a/contrib/postgres_fdw/deparse.c
+++ b/contrib/postgres_fdw/deparse.c
@@ -50,6 +50,7 @@
 #include "parser/parsetree.h"
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
+#include "utils/quote.h"
 #include "utils/syscache.h"
 
 
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 4c49776..feb41f5 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -36,6 +36,7 @@
 #include "utils/guc.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
+#include "utils/quote.h"
 
 PG_MODULE_MAGIC;
 
diff --git a/contrib/tablefunc/tablefunc.c b/contrib/tablefunc/tablefunc.c
index 10ee8c7..66e5ccf 100644
--- a/contrib/tablefunc/tablefunc.c
+++ b/contrib/tablefunc/tablefunc.c
@@ -41,6 +41,7 @@
 #include "lib/stringinfo.h"
 #include "miscadmin.h"
 #include "utils/builtins.h"
+#include "utils/quote.h"
 
 #include "tablefunc.h"
 
diff --git a/contrib/test_decoding/test_decoding.c b/contrib/test_decoding/test_decoding.c
index fdbd313..b168fc3 100644
--- a/contrib/test_decoding/test_decoding.c
+++ b/contrib/test_decoding/test_decoding.c
@@ -25,6 +25,7 @@
 #include "utils/builtins.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
+#include "utils/quote.h"
 #include "utils/rel.h"
 #include "utils/relcache.h"
 #include "utils/syscache.h"
diff --git a/contrib/worker_spi/worker_spi.c b/contrib/worker_spi/worker_spi.c
index 328c722..ec6c9fa 100644
--- a/contrib/worker_spi/worker_spi.c
+++ b/contrib/worker_spi/worker_spi.c
@@ -37,7 +37,7 @@
 #include "fmgr.h"
 #include "lib/stringinfo.h"
 #include "pgstat.h"
-#include "utils/builtins.h"
+#include "utils/quote.h"
 #include "utils/snapmgr.h"
 #include "tcop/utility.h"
 
diff --git a/src/backend/catalog/namespace.c b/src/backend/catalog/namespace.c
index 5eb8fd4..7f8f54b 100644
--- a/src/backend/catalog/namespace.c
+++ b/src/backend/catalog/namespace.c
@@ -53,6 +53,7 @@
 #include "utils/inval.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
+#include "utils/quote.h"
 #include "utils/syscache.h"
 
 
diff --git a/src/backend/catalog/objectaddress.c b/src/backend/catalog/objectaddress.c
index b69b75b..d7a1ec7 100644
--- a/src/backend/catalog/objectaddress.c
+++ b/src/backend/catalog/objectaddress.c
@@ -74,6 +74,7 @@
 #include "utils/b

Re: [HACKERS] [9.4 bug] The database server hangs with write-heavy workload on Windows

2014-10-14 Thread Heikki Linnakangas

On 10/13/2014 06:57 PM, Heikki Linnakangas wrote:

Hmm, we could set releaseOK in LWLockWaitForVar(), though, when it
(re-)queues the backend. That would be less invasive, for sure
(attached). I like this better.


Committed this.

- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Hide 'Execution time' in EXPLAIN (COSTS OFF)

2014-10-14 Thread David Rowley
On Tue, Oct 14, 2014 at 6:04 AM, Tom Lane  wrote:

> Andres Freund  writes:
> > Well. Unless I miss something it doesn't resolve the problem that
> > started this thread. Namely that it's currently impossible to write
> > regression using EXPLAIN (ANALYZE, TIMING OFF. COSTS OFF). Which is
> > worthwhile because it allows to tests some behaviour that's only visible
> > in actually executed plans (like seing that a subtree wasn't executed).
>
> TBH, I don't particularly care about that.  A new flag for turning
> "summary timing" off would answer the complaint with not too much
> non-orthogonality ... but I really don't find this use case compelling
> enough to justify adding a feature to EXPLAIN.
>
>
Hmm, was my case above not compelling enough?
This leaves me out in the cold a bit when it comes to testing that inner
joins are properly skipped at execution time. I can see no other way to
properly verify when the joins are and are not being skipped other than
outputting the EXPLAIN ANALYZE in the test, and I can't really imagine it
ever getting committed without proper regression tests.

Can you think of some other way that I could test this? Keep in mind
there's no trace of the removal in the EXPLAIN without ANALYZE.

Regards

David Rowley