Re: [HACKERS] pgaudit - an auditing extension for PostgreSQL

2015-01-23 Thread Robert Haas
On Wed, Jan 21, 2015 at 1:42 AM, Abhijit Menon-Sen a...@2ndquadrant.com wrote:
 At 2015-01-20 20:36:39 -0500, robertmh...@gmail.com wrote:
 I think this is throwing the baby out with the bathwater.  Neither C
 functions nor all-or-nothing are going to be of any practical use.

 Do you see some approach that has a realistic chance of making 9.5 and
 would also actually be worth putting into 9.5? Suggestions very welcome.

Well, I'm still of the view that there's little to lose by having this
remain out-of-core for a release or three.  We don't really all agree
on what we want, and non-core code can evolve a lot faster than core
code, so what's the rush?

I'm coming around to the view that we're going to need fairly deep
integration to make this work nicely.  It seems to me that the natural
way of controlling auditing of a table is with table or column options
on that table, but that requires either an extensible reloptions
framework that we don't have today, or that it be in the core server.
Similarly, the nice way of controlling options on a user is some
property attached to the user: audit operations X, Y, Z when performed
by this user.

If you held a gun to my head and said, we must have something this
release, I'd probably go for a GUC with a value that is a
comma-separated list of role:operation:object triplets, like this:

audit_stuff = 'alice:*:*, *:delete:*, bob:*:table secret'

...read as:

- Audit everything alice does.
- Audit all deletes by anyone.
- Audit all access by bob to table secret.
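
For illustration only, matching an event against such triplets could look
roughly like this (a sketch; AuditRule and audit_matches are made-up names,
not part of any proposed patch):

#include <stdbool.h>
#include <string.h>

/* One parsed rule from the hypothetical GUC; "*" acts as a wildcard. */
typedef struct AuditRule
{
    const char *role;
    const char *operation;
    const char *object;
} AuditRule;

static bool
field_matches(const char *pattern, const char *value)
{
    return strcmp(pattern, "*") == 0 || strcmp(pattern, value) == 0;
}

/* Return true if any rule says this (role, operation, object) is audited. */
static bool
audit_matches(const AuditRule *rules, int nrules,
              const char *role, const char *operation, const char *object)
{
    int         i;

    for (i = 0; i < nrules; i++)
    {
        if (field_matches(rules[i].role, role) &&
            field_matches(rules[i].operation, operation) &&
            field_matches(rules[i].object, object))
            return true;
    }
    return false;
}

With the example setting above, anything alice does matches the first rule,
any DELETE matches the second, and any access by bob to the table named
secret matches the third.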

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] TODO : Allow parallel cores to be used by vacuumdb [ WIP ]

2015-01-23 Thread Alvaro Herrera
Alvaro Herrera wrote:

 I'm tweaking your v24 a bit more now, thanks -- main change is to make
 vacuum_one_database be called only to run one analyze stage, so it never
 iterates for each stage; callers must iterate calling it multiple times
 in those cases.  (There's only one callsite that needs changing anyway.)

I made some more changes, particularly so that the TAP test pass (we
were missing the semicolon when a table name was not specified to
prepare_vacuum_command).  I reordered the code in a more sensible
manner, removed the vacuum_database_stage layer (which was pretty
useless), and changed the analyze-in-stages mode: if we pass a valid
stage number, run that stage; if not, we're not in analyze-in-stages
mode.  So I got rid of the boolean flag altogether.  I also moved the
per-stage commands and messages back into a struct inside a function,
since there's no need to have them be file-level variables anymore.

-j1 is now the same as not specifying anything, and vacuum_one_database
uses more common code in the parallel and not-parallel cases: the
not-parallel case is just the parallel case with a single connection, so
the setup and shutdown is mostly the same in both cases.

I pushed the result.  Please test, particularly on Windows.  If you can
use configure --enable-tap-tests and run them (make check in the
src/bin/scripts subdir) that would be good too .. not sure whether
that's expected to work on Windows.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2015-01-23 Thread Robert Haas
On Wed, Jan 21, 2015 at 2:22 AM, Peter Geoghegan p...@heroku.com wrote:
 You'll probably prefer the attached. This patch works by disabling
 abbreviation, but only after writing out runs, with the final merge
 left to go. That way, it doesn't matter when abbreviated keys are not
 read back from disk (or regenerated).

Yes, this seems like the way to go for now.  Thanks, committed.  And
thanks to Andrew for spotting this so quickly.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Windows buildfarm animals are still not happy with abbreviated keys patch

2015-01-23 Thread Robert Haas
On Thu, Jan 22, 2015 at 5:51 PM, Peter Geoghegan p...@heroku.com wrote:
 That having been said, it's clearer to continue to handle each case (C
 locale vs other locales) separately within the new
 bttext_abbrev_convert() function, just to be consistent, but also to
 avoid NUL-terminating the text strings to pass to strxfrm(), which as
 you point out is an avoidable cost. So, I'm not asking you to restore
 the terser uniform use of strxfrm() we had before, but, for what it's
 worth, that was not the real issue. The real issue was that strxfrm()
 spuriously used the wrong locale, as you said. Provided strxfrm() had
 the correct locale (the C locale), then AFAICT it ought to have been
 fine, regardless of whether or not it then behaved exactly equivalently
 to memcpy().

I don't agree.  On a system where HAVE_LOCALE_T is not defined, there
is *no way* to get strxfrm() to behave like memcpy(), because we're
not passing a locale to it.  Clearly, strxfrm() is going to do a
locale-aware transformation in that case whether the user wrote
collate C or not.  The comments in regc_pg_locale.c explain:

 * 1. In C/POSIX collations, we use hard-wired code.  We can't depend on
 * the ctype.h functions since those will obey LC_CTYPE.  Note that these
 * collations don't give a fig about multibyte characters.
 *
 * 2. In the default collation (which is supposed to obey LC_CTYPE):
 *
 * 2a. When working in UTF8 encoding, we use the wctype.h functions if
 * available.  This assumes that every platform uses Unicode codepoints
 * directly as the wchar_t representation of Unicode.  On some platforms
 * wchar_t is only 16 bits wide, so we have to punt for codepoints > 0xFFFF.
 *
 * 2b. In all other encodings, or on machines that lack wctype.h, we use
 * the ctype.h functions for pg_wchar values up to 255, and punt for values
 * above that.  This is only 100% correct in single-byte encodings such as
 * LATINn.  However, non-Unicode multibyte encodings are mostly Far Eastern
 * character sets for which the properties being tested here aren't very
 * relevant for higher code values anyway.  The difficulty with using the
 * wctype.h functions with non-Unicode multibyte encodings is that we can
 * have no certainty that the platform's wchar_t representation matches
 * what we do in pg_wchar conversions.
 *
 * 3. Other collations are only supported on platforms that HAVE_LOCALE_T.
 * Here, we use the locale_t-extended forms of the wctype.h and ctype.h
 * functions, under exactly the same cases as #2.

In other words, even on systems that don't HAVE_LOCALE_T, we still
have to support the default collation and the C collation, and they
have to behave differently.  There's no way to make that work using
only strxfrm(), because nothing gets passed to that function to tell
it which of those two things it is supposed to do.

Now even if the above were not an issue, for example because we
dropped support for systems where !HAVE_LOCALE_T, I think it would be
poor form to depend on strxfrm_l() to behave like memcpy() where we
could just as easily call memcpy() and be *sure* that it was going to
do what we wanted.  Much of writing good code is figuring out what
could go wrong and then figuring out how to prevent it, and relying on
strxfrm_l() would be an unnecessary risk regardless.  Given the
existence of !HAVE_LOCALE_T systems, it's just plain broken.
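
To illustrate (a sketch with made-up names, not the actual
bttext_abbrev_convert()): the C-collation path just copies a prefix, and only
other collations go through strxfrm(), so nothing has to hope that strxfrm()
behaves like memcpy():

#include <stdlib.h>
#include <string.h>

#define ABBREV_WIDTH 8          /* illustrative: one datum's worth of bytes */

static void
abbreviate_text(char *dst, const char *src, int collate_is_c)
{
    memset(dst, 0, ABBREV_WIDTH);

    if (collate_is_c)
    {
        /* C collation compares bytewise, so a prefix copy is exactly right. */
        size_t      n = strlen(src);

        memcpy(dst, src, n < ABBREV_WIDTH ? n : ABBREV_WIDTH);
    }
    else
    {
        /* Other collations: let the locale define the transformation. */
        size_t      needed = strxfrm(NULL, src, 0);     /* required buffer size */
        char       *xfrm = malloc(needed + 1);

        if (xfrm == NULL)
            return;             /* sketch: skip real error handling */
        strxfrm(xfrm, src, needed + 1);
        memcpy(dst, xfrm, needed < ABBREV_WIDTH ? needed : ABBREV_WIDTH);
        free(xfrm);
    }
}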

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] TODO : Allow parallel cores to be used by vacuumdb [ WIP ]

2015-01-23 Thread Alvaro Herrera
Andres Freund wrote:
 On 2014-12-31 18:35:38 +0530, Amit Kapila wrote:

  +   PQsetnonblocking(connSlot[0].connection, 1);
  +
  +   for (i = 1; i < concurrentCons; i++)
  +   {
  +   connSlot[i].connection = connectDatabase(dbname, host, port, 
  username,
  + 
  prompt_password, progname, false);
  +
  +   PQsetnonblocking(connSlot[i].connection, 1);
  +   connSlot[i].isFree = true;
  +   connSlot[i].sock = PQsocket(connSlot[i].connection);
  +   }
 
 Are you sure about this global PQsetnonblocking()? This means that you
 might not be able to send queries... And you don't seem to be waiting
 for sockets waiting for writes in the select loop - which means you
 might end up being stuck waiting for reads when you haven't submitted
 the query.
 
 I think you might need a more complex select() loop. On nonfree
 connections also wait for writes if PQflush() returns != 0.

I removed the PQsetnonblocking() calls.  They were a misunderstanding, I
think.
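
(Had the nonblocking calls stayed, the kind of loop Andres describes would
look roughly like this -- a sketch with made-up names, not code from the
patch:)

#include <stdbool.h>
#include <sys/select.h>
#include <libpq-fe.h>

/* Assumed to mirror the slot struct in the quoted patch. */
typedef struct
{
    PGconn     *connection;
    bool        isFree;
} ParallelSlot;

static void
wait_on_slots(ParallelSlot *slots, int nslots)
{
    fd_set      rfds;
    fd_set      wfds;
    int         maxfd = -1;
    int         i;

    FD_ZERO(&rfds);
    FD_ZERO(&wfds);

    for (i = 0; i < nslots; i++)
    {
        int         sock = PQsocket(slots[i].connection);

        if (slots[i].isFree || sock < 0)
            continue;
        FD_SET(sock, &rfds);            /* always watch for incoming results */
        if (PQflush(slots[i].connection) != 0)
            FD_SET(sock, &wfds);        /* output pending: also wait for writability */
        if (sock > maxfd)
            maxfd = sock;
    }

    if (maxfd >= 0)
        (void) select(maxfd + 1, &rfds, &wfds, NULL, NULL);     /* sketch: ignore EINTR */
}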

  +/*
  + * GetIdleSlot
  + * Process the slot list, if any free slot is available then return
  + * the slotid else perform the select on all the socket's and wait
  + * until atleast one slot becomes available.
  + */
  +static int
  +GetIdleSlot(ParallelSlot *pSlot, int max_slot, const char *dbname,
  +   const char *progname, bool completedb)
  +{
  +   int i;
  +   fd_set  slotset;
 
 
 Hm, you probably need to limit -j to FD_SETSIZE - 1 or so.

I tried without the check to use 1500 connections, and select() didn't
even blink -- everything worked fine vacuuming 1500 tables in parallel
on a set of 2000 tables.  Not sure what the actual limit is, but my
FD_SETSIZE says 1024, so I'm clearly over the limit.  (I tried to run it
with -j2000 but the server didn't start with that many connections.  I
didn't try any intermediate numbers.)  Anyway I added the check.
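
(For reference, the kind of guard in question might look roughly like this --
a sketch only, not necessarily the committed check; progname and
concurrentCons are the variables from the quoted patch:)

#include <stdio.h>
#include <stdlib.h>
#include <sys/select.h>

static void
check_connection_limit(const char *progname, int concurrentCons)
{
    /* Reject a -j value that a select()-based loop cannot cope with. */
    if (concurrentCons > FD_SETSIZE - 1)
    {
        fprintf(stderr, "%s: too many parallel jobs requested (maximum: %d)\n",
                progname, FD_SETSIZE - 1);
        exit(1);
    }
}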

I fixed some more minor issues and pushed.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_upgrade and rsync

2015-01-23 Thread Jim Nasby

On 1/22/15 7:54 PM, Stephen Frost wrote:

* Bruce Momjian (br...@momjian.us) wrote:

On Fri, Jan 23, 2015 at 01:19:33AM +0100, Andres Freund wrote:

 Or do you - as the text edited in your patch, but not the quote above -
 mean to run pg_upgrade just on the primary and then rsync?


No, I was going to run it on both, then rsync.

I'm pretty sure this is all a lot easier than you believe it to be.  If
you want to recreate what pg_upgrade does to a cluster then the simplest
thing to do is rsync before removing any of the hard links.  rsync will
simply recreate the same hard link tree that pg_upgrade created when it
ran, and update files which were actually changed (the catalog tables).

The problem, as mentioned elsewhere, is that you have to checksum all
the files because the timestamps will differ.  You can actually get
around that with rsync if you really want though- tell it to only look
at file sizes instead of size+time by passing in --size-only.


What if instead of trying to handle that on the rsync side, we changed 
pg_upgrade so that it created hardlinks that had the same timestamp as the 
original file?

That said, the whole timestamp race condition in rsync gives me the 
heebie-jeebies. For normal workloads maybe it's not that big a deal, but when 
dealing with fixed-size data (ie: Postgres blocks)? Eww.

How horribly difficult would it be to allow pg_upgrade to operate on multiple servers? 
Could we have it create a shell script instead of directly modifying things itself? Or 
perhaps some custom command file that could then be replayed by pg_upgrade on 
another server? Of course, that's assuming that replicas are compatible enough with 
masters for that to work...
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgaudit - an auditing extension for PostgreSQL

2015-01-23 Thread Robert Haas
On Wed, Jan 21, 2015 at 7:05 PM, Jim Nasby jim.na...@bluetreble.com wrote:
 I'm still nervous about overloading this onto the roles system; I think it
 will end up being very easy to accidentally break. But if others think it'll
 work then I guess I'm just being paranoid.

I agree with you.  I don't hear anyone who actually likes that
proposal except for Stephen.  The list of people who don't like it
seems to include the patch author, which strikes me as a rather
significant point.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgaudit - an auditing extension for PostgreSQL

2015-01-23 Thread Stephen Frost
* Robert Haas (robertmh...@gmail.com) wrote:
 On Wed, Jan 21, 2015 at 7:05 PM, Jim Nasby jim.na...@bluetreble.com wrote:
  I'm still nervous about overloading this onto the roles system; I think it
  will end up being very easy to accidentally break. But if others think it'll
  work then I guess I'm just being paranoid.

 I agree with you.  I don't hear anyone who actually likes that
 proposal except for Stephen.  The list of people who don't like it
 seems to include the patch author, which strikes me as a rather
 significant point.

Just to clarify- this concept isn't actually mine but was suggested by a
pretty sizable PG user who has a great deal of familiarity with other
databases.  I don't mean to try and invoke the 'silent majority' but
rather to make sure folks don't think this is my idea alone or that it's
only me who thinks it makes sense. :)  Simon had weighed in earlier
with, iirc, a comment that he thought it was a good approach also,
though that was a while ago and things have changed.

I happen to like the idea specifically because it would allow regular
roles to change the auditing settings (no need to be a superuser or to
be able to modify postgresql.conf/postgresql.auto.conf), it would
provide the level of granularity desired, would follow table, column,
role changes, renames, drops, recreations, the dependency system would
function as expected, etc, while keeping it as an extension, which I
understood to be pretty desirable, especially initially.

* Robert Haas (robertmh...@gmail.com) wrote:
 On Wed, Jan 21, 2015 at 1:42 AM, Abhijit Menon-Sen a...@2ndquadrant.com 
 wrote:
  At 2015-01-20 20:36:39 -0500, robertmh...@gmail.com wrote:
  I think this is throwing the baby out with the bathwater.  Neither C
  functions nor all-or-nothing are going to be of any practical use.
 
  Do you see some approach that has a realistic chance of making 9.5 and
  would also actually be worth putting into 9.5? Suggestions very welcome.

 Well, I'm still of the view that there's little to lose by having this
 remain out-of-core for a release or three.  We don't really all agree
 on what we want, and non-core code can evolve a lot faster than core
 code, so what's the rush?

Being out of core makes it unusable in production environments for a
large number of organizations. :/

 I'm coming around to the view that we're going to need fairly deep
 integration to make this work nicely.  It seems to me that the natural
 way of controlling auditing of a table is with table or column options
 on that table, but that requires either an extensible reloptions
 framework that we don't have today, or that it be in the core server.
 Similarly, the nice way of controlling options on a user is some
 property attached to the user: audit operations X, Y, Z when performed
 by this user.

This is basically the same position which I held about a year ago when
we were discussing this, but there was quite a bit of push-back on
having an in-core solution, from the additional catalog tables to making
the grammar larger and slowing things down.  For my 2c, auditing is more
than valuable enough to warrant those compromises, but I'm not anxious
to spend time developing an in-core solution only to get it shot down
after all the work has been put into it.

Still, if folks have come around to feeling like an in-core solution
makes sense, I'm certainly willing to work towards making that happen;
it's, after all, what I've been interested in for years. :)

 If you held a gun to my head and said, we must have something this
 release, I'd probably go for a GUC with a value that is a
 comma-separated list of role:operation:object triplets, like this:

 audit_stuff = 'alice:*:*, *:delete:*, bob:*:table secret'

 ...read as:

 - Audit everything alice does.
 - Audit all deletes by anyone.
 - Audit all access by bob to table secret.

And this approach doesn't address any of the issues mentioned above,
unfortunately, which would make it pretty difficult to really use.  I
think I'd rather deal with pgaudit being outside of the tree than pursue
an approach which has so many issues.

Thanks!

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] B-Tree support function number 3 (strxfrm() optimization)

2015-01-23 Thread Robert Haas
On Fri, Jan 23, 2015 at 2:18 AM, David Rowley dgrowle...@gmail.com wrote:
 On 20 January 2015 at 17:10, Peter Geoghegan p...@heroku.com wrote:

 On Mon, Jan 19, 2015 at 7:47 PM, Michael Paquier
 michael.paqu...@gmail.com wrote:

  With your patch applied, the failure with MSVC disappeared, but there
  is still a warning showing up:
  (ClCompile target) -
src\backend\lib\hyperloglog.c(73): warning C4334: '<<' : result of
  32-bit shift implicitly converted to 64 bits (was 64-bit shift
  intended?)

 That seems harmless. I suggest an explicit cast to Size here.

 This caught my eye too.

 I agree about casting to Size.

 Patch attached.
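
(For anyone following along, the shape of the fix is roughly this --
illustrative names only, not the actual hyperloglog.c code:)

#include <stddef.h>

typedef size_t Size;            /* PostgreSQL's Size is a typedef for size_t */

static Size
nbuckets_for(int bwidth)        /* hypothetical helper, illustrative only */
{
    /*
     * Cast the left operand first so the shift itself is performed at Size
     * width; assigning an int-width shift result to a 64-bit value is what
     * MSVC's C4334 warns about.
     */
    return (Size) 1 << bwidth;
}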

Thanks, committed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgaudit - an auditing extension for PostgreSQL

2015-01-23 Thread Jim Nasby

On 1/21/15 6:50 PM, Stephen Frost wrote:

I'm still nervous about overloading this onto the roles system; I think it will 
end up being very easy to accidentally break. But if others think it'll work 
then I guess I'm just being paranoid.

Break in which way..?  If you're saying it'll be easy for a user to
misconfigure then I might agree with you- but documentation and
examples can help to address that.


I'm worried about user misconfiguration. Setting up a good system of roles (as 
in, distinguishing between application accounts, users, account(s) used to 
deploy code changes, DBAs, etc) is already tricky because of all the different 
use cases to consider. I fear that adding auditing to that matrix is just going 
to make it worse.

I do like Robert's idea of role:action:object triplets more, though I'm not 
sure it's enough. For example, what happens if you

CREATE ROLE su SUPERUSER NOINHERIT NOLOGIN;
CREATE ROLE su_role IN ROLE su NOLOGIN;
GRANT su_role TO bob;

and have

su_role:*:*

Does bob get audited all the time then? Only when he does SET ROLE su? For that 
matter, how does SET ROLE affect auditing?
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgaudit - an auditing extension for PostgreSQL

2015-01-23 Thread Stephen Frost
* Jim Nasby (jim.na...@bluetreble.com) wrote:
 On 1/21/15 6:50 PM, Stephen Frost wrote:
 I'm still nervous about overloading this onto the roles system; I think it 
 will end up being very easy to accidentally break. But if others think 
 it'll work then I guess I'm just being paranoid.
 Break in which way..?  If you're saying it'll be easy for a user to
 misconfigure then I might agree with you- but documentation and
 examples can help to address that.
 
 I'm worried about user misconfiguration. Setting up a good system of roles 
 (as in, distinguishing between application accounts, users, account(s) used 
 to deploy code changes, DBAs, etc) is already tricky because of all the 
 different use cases to consider. I fear that adding auditing to that matrix 
 is just going to make it worse.

Even with an in-core solution, users would need to work out who should
be able to configure auditing..  I agree that seeing the permission
grants to the auditing roles might be confusing for folks who have not
seen it before, but I think that'll quickly resolve itself since the
only people who would see that are those who want to use pgaudit...

 I do like Robert's idea of role:action:object triplets more, though I'm not 
 sure it's enough. For example, what happens if you

I'd suggest considering what happens if you:

ALTER ROLE su_role RENAME TO new_su_role;

Or if you want to grant a non-superuser the ability to modify the
auditing rules..

Thanks,

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] Parallel Seq Scan

2015-01-23 Thread Joshua D. Drake


On 01/23/2015 10:44 AM, Jim Nasby wrote:

number of workers especially at slightly higher worker count.

Those fixed chunk numbers look pretty screwy. 2, 4 and 8 workers make no
difference, then suddenly 16 cuts times by 1/2 to 1/3? Then 32 cuts time
by another 1/2 to 1/3?


cached? First couple of runs gets the relations into memory?

JD



--
Command Prompt, Inc. - http://www.commandprompt.com/  503-667-4564
PostgreSQL Support, Training, Professional Services and Development
High Availability, Oracle Conversion, @cmdpromptinc
If we send our children to Caesar for their education, we should
 not be surprised when they come back as Romans.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_upgrade and rsync

2015-01-23 Thread Stephen Frost
* Andres Freund (and...@2ndquadrant.com) wrote:
 On 2015-01-23 13:52:54 -0500, Stephen Frost wrote:
  That wouldn't actually help with what Bruce is trying to do, which
  is to duplicate the results of the pg_upgrade from the master over to
  the standby.
 
 Well, it'd pretty much obviate the need to run pg_upgrade on the
 standby. As there's no renamed files you don't need to muck around with
 leaving hardlinks in place and such just so that rsync recognizes
 unchanged files.

Uh, pg_upgrade always either creates a hard link tree or copies
everything over.  If I follow what you're suggesting, pg_upgrade would
need a new 'in-place' mode that removes all of the catalog tables from
the old cluster and puts the new catalog tables into place and leaves
everything else alone.

I don't really think I'd want to go there either..

Thanks,

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] pg_upgrade and rsync

2015-01-23 Thread Stephen Frost
* Andres Freund (and...@2ndquadrant.com) wrote:
 On 2015-01-23 14:27:51 -0500, Stephen Frost wrote:
  * Andres Freund (and...@2ndquadrant.com) wrote:
   On 2015-01-23 14:05:10 -0500, Stephen Frost wrote:
If I follow what you're suggesting, pg_upgrade would
need a new 'in-place' mode that removes all of the catalog tables from
the old cluster and puts the new catalog tables into place and leaves
everything else alone.
   
   No. Except that it'd preserve the relfilenodes (i.e. the filenames of
   relations) it'd work exactly the same as today. The standby is simply
   updated by rsyncing the new data directory of the primary to the
   standby.
  
  You'd have to replace the existing data directory on the master to do
  that, which pg_upgrade was designed specifically to not do, in case
  things went poorly.
 
 Why? Just rsync the new data directory onto the old directory on the
 standbys. That's fine and simple.

That still doesn't address the need to use --size-only, it would just
mean that you don't need to use -H.  If anything the -H part is the
aspect which worries me the least about this approach.

Thanks,

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] pg_upgrade and rsync

2015-01-23 Thread Stephen Frost
* Jim Nasby (jim.na...@bluetreble.com) wrote:
 On 1/22/15 7:54 PM, Stephen Frost wrote:
 * Bruce Momjian (br...@momjian.us) wrote:
 On Fri, Jan 23, 2015 at 01:19:33AM +0100, Andres Freund wrote:
  Or do you - as the text edited in your patch, but not the quote above -
  mean to run pg_upgrade just on the primary and then rsync?
 
 No, I was going to run it on both, then rsync.
 I'm pretty sure this is all a lot easier than you believe it to be.  If
 you want to recreate what pg_upgrade does to a cluster then the simplest
 thing to do is rsync before removing any of the hard links.  rsync will
 simply recreate the same hard link tree that pg_upgrade created when it
 ran, and update files which were actually changed (the catalog tables).
 
 The problem, as mentioned elsewhere, is that you have to checksum all
 the files because the timestamps will differ.  You can actually get
 around that with rsync if you really want though- tell it to only look
 at file sizes instead of size+time by passing in --size-only.
 
 What if instead of trying to handle that on the rsync side, we changed 
 pg_upgrade so that it created hardlinks that had the same timestamp as the 
 original file?

So, two things, I chatted w/ Bruce and he was less concerned about the
lack of being able to match up the timestamps than I was.  He has a
point though- the catalog tables are going to get copied anyway since
they won't be hard links and checking that all the other files match in
size and that both the master and the standby are at the same xlog
position should give you a pretty good feeling that everything matches
up sufficiently.

Second, I don't follow what you mean by having pg_upgrade change the
hardlinks to have the same timestamp- for starters, the timestamp is in
the inode and not the actual hard link (two files hard linked together
won't have different timestamps..) and second, the problem isn't on the
master side- it's on the standby side.  The standby's files will have
timestamps different from the master and there really isn't much to be
done about that.
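
(To illustrate: hard links share an inode, and the mtime lives in the inode,
so a link and its original can never disagree on timestamps.  A quick sketch,
assuming a file original.dat exists:)

#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int
main(void)
{
    struct stat a, b;

    if (link("original.dat", "hardlink.dat") != 0)
        perror("link");
    stat("original.dat", &a);
    stat("hardlink.dat", &b);
    /* Same inode, hence necessarily the same modification time. */
    printf("inode %llu vs %llu, mtime %ld vs %ld\n",
           (unsigned long long) a.st_ino, (unsigned long long) b.st_ino,
           (long) a.st_mtime, (long) b.st_mtime);
    return 0;
}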

 That said, the whole timestamp race condition in rsync gives me the 
 heebie-jeebies. For normal workloads maybe it's not that big a deal, but when 
 dealing with fixed-size data (ie: Postgres blocks)? Eww.

The race condition is a problem for pg_start/stop_backup and friends.
In this instance, everything will be shut down when the rsync is
running, so there isn't a timestamp race condition to worry about.

 How horribly difficult would it be to allow pg_upgrade to operate on multiple 
 servers? Could we have it create a shell script instead of directly modifying 
 things itself? Or perhaps some custom command file that could then be 
 replayed by pg_upgrade on another server? Of course, that's assuming that 
 replicas are compatible enough with masters for that to work...

Yeah, I had suggested that to Bruce also, but it's not clear why that
would be any different from an rsync --size-only in the end, presuming
everything went according to plan.

Thanks,

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] pg_upgrade and rsync

2015-01-23 Thread Andres Freund
On 2015-01-22 20:54:47 -0500, Stephen Frost wrote:
 * Bruce Momjian (br...@momjian.us) wrote:
  On Fri, Jan 23, 2015 at 01:19:33AM +0100, Andres Freund wrote:
   Or do you - as the text edited in your patch, but not the quote above -
   mean to run pg_upgrade just on the primary and then rsync?
  
  No, I was going to run it on both, then rsync.
 
 I'm pretty sure this is all a lot easier than you believe it to be.  If
 you want to recreate what pg_upgrade does to a cluster then the simplest
 thing to do is rsync before removing any of the hard links.  rsync will
 simply recreate the same hard link tree that pg_upgrade created when it
 ran, and update files which were actually changed (the catalog tables).

I don't understand why that'd be better than simply fixing (yes, that's
imo the correct term) pg_upgrade to retain relfilenodes across the
upgrade. Afaics there's no conflict risk and it'd make the clusters much
more similar, which would be good; independent of rsyncing standbys.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_upgrade and rsync

2015-01-23 Thread Andres Freund
On 2015-01-23 13:52:54 -0500, Stephen Frost wrote:
 * Andres Freund (and...@2ndquadrant.com) wrote:
  On 2015-01-22 20:54:47 -0500, Stephen Frost wrote:
   * Bruce Momjian (br...@momjian.us) wrote:
On Fri, Jan 23, 2015 at 01:19:33AM +0100, Andres Freund wrote:
 Or do you - as the text edited in your patch, but not the quote above 
 -
 mean to run pg_upgrade just on the primary and then rsync?

No, I was going to run it on both, then rsync.
   
   I'm pretty sure this is all a lot easier than you believe it to be.  If
   you want to recreate what pg_upgrade does to a cluster then the simplest
   thing to do is rsync before removing any of the hard links.  rsync will
   simply recreate the same hard link tree that pg_upgrade created when it
   ran, and update files which were actually changed (the catalog tables).
  
  I don't understand why that'd be better than simply fixing (yes, that's
  imo the correct term) pg_upgrade to retain relfilenodes across the
  upgrade. Afaics there's no conflict risk and it'd make the clusters much
  more similar, which would be good; independent of rsyncing standbys.
 
 That's an entirely orthogonal discussion from the original one though,
 no?

Don't think so.

 That wouldn't actually help with what Bruce is trying to do, which
 is to duplicate the results of the pg_upgrade from the master over to
 the standby.

Well, it'd pretty much obviate the need to run pg_upgrade on the
standby. As there's no renamed files you don't need to muck around with
leaving hardlinks in place and such just so that rsync recognizes
unchanged files.

 Trying to pg_upgrade both the master and the standby, to me at least,
 seems like an even *worse* approach than trusting rsync with -H and
 --size-only..

I think running pg_upgrade on the standby is a dangerous folly.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_upgrade and rsync

2015-01-23 Thread Andres Freund
On 2015-01-23 14:05:10 -0500, Stephen Frost wrote:
 * Andres Freund (and...@2ndquadrant.com) wrote:
  On 2015-01-23 13:52:54 -0500, Stephen Frost wrote:
   That wouldn't actually help with what Bruce is trying to do, which
   is to duplicate the results of the pg_upgrade from the master over to
   the standby.
  
  Well, it'd pretty much obviate the need to run pg_upgrade on the
  standby. As there's no renamed files you don't need to muck around with
  leaving hardlinks in place and such just so that rsync recognizes
  unchanged files.
 
 Uh, pg_upgrade always either creates a hard link tree or copies
 everything over.

Yes. The problem is that the filenames after pg_upgrade aren't the same
as before. Which means that a simple rsync call won't be able to save
anything because the standby's filenames differ.  What you can do is
rsync both cluster directories (i.e. the old and the post pg_upgrade
ones) and use rsync -H, right? Without transferring both -H won't detect
the hardlinks as they need to be in the synced set. That's pretty
cumbersome/complicated, and far from cheap.

 If I follow what you're suggesting, pg_upgrade would
 need a new 'in-place' mode that removes all of the catalog tables from
 the old cluster and puts the new catalog tables into place and leaves
 everything else alone.

No. Except that it'd preserve the relfilenodes (i.e. the filenames of
relations) it'd work exactly the same as today. The standby is simply
updated by rsyncing the new data directory of the primary to the
standby.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] src/port/gethostname.c sure looks like dead code

2015-01-23 Thread Tom Lane
AFAICS there are no provisions to ever build this file, on any platform,
and have not been since 8.0.

If we did build it, it would likely fail on a large subset of platforms,
because it refers to a constant SYS_NMLN that isn't too portable.

I propose we just remove it.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_upgrade and rsync

2015-01-23 Thread Stephen Frost
* Andres Freund (and...@2ndquadrant.com) wrote:
 On 2015-01-22 20:54:47 -0500, Stephen Frost wrote:
  * Bruce Momjian (br...@momjian.us) wrote:
   On Fri, Jan 23, 2015 at 01:19:33AM +0100, Andres Freund wrote:
Or do you - as the text edited in your patch, but not the quote above -
mean to run pg_upgrade just on the primary and then rsync?
   
   No, I was going to run it on both, then rsync.
  
  I'm pretty sure this is all a lot easier than you believe it to be.  If
  you want to recreate what pg_upgrade does to a cluster then the simplest
  thing to do is rsync before removing any of the hard links.  rsync will
  simply recreate the same hard link tree that pg_upgrade created when it
  ran, and update files which were actually changed (the catalog tables).
 
 I don't understand why that'd be better than simply fixing (yes, that's
 imo the correct term) pg_upgrade to retain relfilenodes across the
 upgrade. Afaics there's no conflict risk and it'd make the clusters much
 more similar, which would be good; independent of rsyncing standbys.

That's an entirely orthogonal discussion from the original one though,
no?  That wouldn't actually help with what Bruce is trying to do, which
is to duplicate the results of the pg_upgrade from the master over to
the standby.  Even if the relfilenodes were the same across the upgrade,
I don't think it'd be a good idea to run pg_upgrade on the standby and
hope the results match close enough to the master that you can trust
updates to the catalog tables on the standby from the master going
forward to work..

Trying to pg_upgrade both the master and the standby, to me at least,
seems like an even *worse* approach than trusting rsync with -H and
--size-only..

Thanks,

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] basebackups during ALTER DATABASE ... SET TABLESPACE ... not safe?

2015-01-23 Thread Andres Freund
On 2015-01-23 12:52:03 -0600, Jim Nasby wrote:
 On 1/22/15 3:18 PM, Andres Freund wrote:
, but to then put its entire contents into WAL? Blech.
 Besides actually having a chance of being correct, doing so will save
 having to do two checkpoints inside movedb(). I think it's pretty likely
 that that actually saves overall IO, even including the WAL
 writes. Especially if there's other databases in the cluster at the same
 time.
 
 If you're moving a small amount of data, maybe. If you're moving
 several hundred GB or more? You're going to flood WAL and probably
 cause all replication/archiving to lag.

I have a hard time believing many people do AD ST on a database a couple
hundred GB large. That'd mean the database is out of business for a
significant amount of time (remember, there can't be any connected users
during that time). I think realistically it's only used on smaller
databases.

The reason createdb()/movedb() work the way they do is imo simply
because they're old.

So far I can't see how the current solution can actually be made safe to
be executed on a not yet consistent database. And we obviously can't
wait for it to be consistent (as we're replaying a linear stream of WAL).

 Is there some way we can add an option to control this? I'm thinking
 that by default ADAT will error if a backup is running, but allow the
 user to over-ride that.

Why would we want to allow that? There's simply no way it's safe, so
...?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Perl coding error in msvc build system?

2015-01-23 Thread Alvaro Herrera
Why is the last; still there?  AFAICS it matters in the old coding
because of the while, but I don't think there is an equivalent looping
construct in the new code.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Perl coding error in msvc build system?

2015-01-23 Thread Andrew Dunstan


On 01/23/2015 03:00 PM, Alvaro Herrera wrote:

Why is the last; still there?  AFAICS it matters in the old coding
because of the while, but I don't think there is an equivalent looping
construct in the new code.



You're right, I missed that. It shouldn't be there, and will produce an 
error like


   Can't "last" outside a loop block

cheers

andrew


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] basebackups during ALTER DATABASE ... SET TABLESPACE ... not safe?

2015-01-23 Thread Jim Nasby

On 1/23/15 9:10 AM, Andres Freund wrote:

On 2015-01-22 22:58:17 +0100, Andres Freund wrote:

On 2015-01-22 16:38:49 -0500, Stephen Frost wrote:

I'm trying to figure out why you'd do '2' in master when in discussion
of '1' you say I also don't think ALTER DATABASE is even intentionally
run at the time of a base backup.  I agree with that sentiment and am
inclined to say that '1' is good enough throughout.


Because the way it currently works is a major crock. It's more luck than
anything that it actually somewhat works. We normally rely on WAL to
bring us into a consistent state. But around CREATE/MOVE/DROP DATABASE
we've ignored that.



And. Hm. The difficulty of the current method is evidenced by the fact
that so far nobody recognized that 1) as described above isn't actually
safe.  It fails to protect against basebackups on a standby as its
XLogCtl state will obviously not be visible on the master.


Further evidenced by the fact that the current method isn't
crash/shutdown safe at all. If a standby was shut down/crashed/was
started on a base backup when a CREATE DATABASE from the primary is
replayed the template database used can be in a nearly arbitrarily bad
state. It'll later get fixed up by recovery - but those changes won't
make it to the copied database.


I think we all agree that ADAT can't run while a base backup is happening, 
which I believe is what you're describing above. We'd have to somehow cover 
that same scenario on replicas too.

Perhaps there isn't really an issue here; I suspect ADAT is very rarely used. 
What I'm worried about though is that someone is using it regularly for some 
reason, with non-trivial databases, and this is going to completely hose them.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_upgrade and rsync

2015-01-23 Thread Stephen Frost
* Andres Freund (and...@2ndquadrant.com) wrote:
 On 2015-01-23 14:05:10 -0500, Stephen Frost wrote:
  * Andres Freund (and...@2ndquadrant.com) wrote:
   On 2015-01-23 13:52:54 -0500, Stephen Frost wrote:
That wouldn't actually help with what Bruce is trying to do, which
is to duplicate the results of the pg_upgrade from the master over to
the standby.
   
    Well, it'd pretty much obviate the need to run pg_upgrade on the
   standby. As there's no renamed files you don't need to muck around with
   leaving hardlinks in place and such just so that rsync recognizes
   unchanged files.
  
  Uh, pg_upgrade always either creates a hard link tree or copies
  everything over.
 
 Yes. The problem is that the filenames after pg_upgrade aren't the same
 as before. Which means that a simple rsync call won't be able to save
 anything because the standby's filenames differ.  What you can do is
 rsync both cluster directories (i.e. the old and the post pg_upgrade
 ones) and use rsync -H, right? Without transferring both -H won't detect
 the hardlinks as they need to be in the synced set. That's pretty
 cumbersome/complicated, and far from cheap.

The filenames don't need to be the same for rsync -H to work.  You
specifically do *not* want to independently rsync the old and new
clusters- you need to run a single rsync (and one for each tablespace)
with -H and then it'll realize that the old cluster on both systems is
identical and will just recreate the hard links, and copy the completely
new files (the catalog tables).

  If I follow what you're suggesting, pg_upgrade would
  need a new 'in-place' mode that removes all of the catalog tables from
  the old cluster and puts the new catalog tables into place and leaves
  everything else alone.
 
 No. Except that it'd preserve the relfilenodes (i.e. the filenames of
 relations) it'd work exactly the same as today. The standby is simply
 updated by rsyncing the new data directory of the primary to the
 standby.

You'd have to replace the existing data directory on the master to do
that, which pg_upgrade was designed specifically to not do, in case
things went poorly.  You'd still have to deal with the tablespace
directories being renamed also, since we include the major version and
catalog build in the directory name..

This whole process really isn't all that complicated in the end..

my_data_dir/old_cluster
my_data_dir/new_cluster

pg_upgrade
rsync -H --size-only my_data_dir/ standby:/path/to/my_data_dir
start the clusters
remove the old cluster on the master and standby.

Thanks,

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] pgaudit - an auditing extension for PostgreSQL

2015-01-23 Thread Jim Nasby

On 1/23/15 12:16 PM, Stephen Frost wrote:

Just to clarify- this concept isn't actually mine but was suggested by a
pretty sizable PG user who has a great deal of familiarity with other
databases.  I don't mean to try and invoke the 'silent majority' but
rather to make sure folks don't think this is my idea alone or that it's
only me who thinks it makes sense.:)   Simon had weighed in earlier
with, iirc, a comment that he thought it was a good approach also,
though that was a while ago and things have changed.


I know there's definitely demand for auditing. I'd love to see us support it.


I happen to like the idea specifically because it would allow regular
roles to change the auditing settings (no need to be a superuser or to
be able to modify postgresql.conf/postgresql.auto.conf)


Is there really a use case for non-superusers to be able to change auditing 
config? That seems like a bad idea.

Also, was there a solution to how to configure auditing on specific objects 
with a role-based mechanism? I think we really do need something akin to 
role:action:object tuples, and I don't see how to do that with roles alone.

BTW, I'm starting to feel like this needs a wiki page to get the design pulled 
together.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel Seq Scan

2015-01-23 Thread Jim Nasby

On 1/23/15 5:42 AM, Amit Kapila wrote:

Fixed-Chunks
No. of workers / Time (ms)

        0       2       4       8       16      24      32
Run-1   250536  266279  251263  234347  87930   50474   35474
Run-2   249587  230628  225648  193340  83036   35140   9100
Run-3   234963  220671  230002  256183  105382  62493   27903
Run-4   239111  245448  224057  189196  123780  63794   24746
Run-5   239937  222820  219025  220478  114007  77965   39766



The trend remains the same, although there is some variation.
In the block-by-block approach, performance dips (execution takes
more time) with a larger number of workers, though it stabilizes at
some higher value; still, I feel it is random as it leads to a random
scan pattern.
In the fixed-chunk approach, performance improves with more
workers, especially at slightly higher worker counts.


Those fixed chunk numbers look pretty screwy. 2, 4 and 8 workers make no 
difference, then suddenly 16 cuts times by 1/2 to 1/3? Then 32 cuts time by 
another 1/2 to 1/3?
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] basebackups during ALTER DATABASE ... SET TABLESPACE ... not safe?

2015-01-23 Thread Jim Nasby

On 1/22/15 3:18 PM, Andres Freund wrote:

, but to then put its entire contents into WAL? Blech.

Besides actually having a chance of being correct, doing so will save
having to do two checkpoints inside movedb(). I think it's pretty likely
that that actually saves overall IO, even including the WAL
writes. Especially if there's other databases in the cluster at the same
time.


If you're moving a small amount of data, maybe. If you're moving several 
hundred GB or more? You're going to flood WAL and probably cause all 
replication/archiving to lag.

Obviously, we need to be reliable, but I think a lot of users would much 
rather that they can't muck with tablespaces while a backup is running than an 
ADAT suddenly consumes way more resources than before.

Is there some way we can add an option to control this? I'm thinking that by 
default ADAT will error if a backup is running, but allow the user to over-ride 
that.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] basebackups during ALTER DATABASE ... SET TABLESPACE ... not safe?

2015-01-23 Thread Andres Freund
On 2015-01-23 13:01:34 -0600, Jim Nasby wrote:
 On 1/23/15 9:10 AM, Andres Freund wrote:
 On 2015-01-22 22:58:17 +0100, Andres Freund wrote:
 On 2015-01-22 16:38:49 -0500, Stephen Frost wrote:
 I'm trying to figure out why you'd do '2' in master when in discussion
 of '1' you say I also don't think ALTER DATABASE is even intentionally
 run at the time of a base backup.  I agree with that sentiment and am
 inclined to say that '1' is good enough throughout.
 
 Because the way it currently works is a major crock. It's more luck than
 anything that it actually somewhat works. We normally rely on WAL to
 bring us into a consistent state. But around CREATE/MOVE/DROP DATABASE
 we've ignored that.
 
 And. Hm. The difficulty of the current method is evidenced by the fact
 that so far nobody recognized that 1) as described above isn't actually
 safe.  It fails to protect against basebackups on a standby as its
 XLogCtl state will obviously not be visible on the master.
 
 Further evidenced by the fact that the current method isn't
 crash/shutdown safe at all. If a standby was shut down/crashed/was
 started on a base backup when a CREATE DATABASE from the primary is
 replayed the template database used can be in a nearly arbitrarily bad
 state. It'll later get fixed up by recovery - but those changes won't
 make it to the copied database.
 
 I think we all agree that ADAT can't run while a base backup is
 happening, which I believe is what you're describing above.

No, that's not what I'm describing in the last paragraph. The problem is
present without any AD ST. When a cluster crashes it's likely not in a
consistent state - we need to replay from the last checkpoint's REDO
pointer to the minimum recovery location to make it so. The problem
though is that the record that copies the database (both for CREATE DB/SET
TABLESPACE) can be replayed well before the database is in a consistent
state. In that case the new copy will be in a corrupted state, as it'll
not be fixed up by recovery, in contrast to the original, which will.

 Perhaps there isn't really an issue here; I suspect ADAT is very
 rarely used. What I'm worried about though is that someone is using it
 regularly for some reason, with non-trivial databases, and this is
 going to completely hose them.

I think if somebody does this regularly on nontrivially sized databases,
over replication, they'd have hit this bug a long time ago. It's really
not that hard to reproduce.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] hung backends stuck in spinlock heavy endless loop

2015-01-23 Thread Jeff Janes
On Thu, Jan 22, 2015 at 1:50 PM, Merlin Moncure mmonc...@gmail.com wrote:


 So far, the 'nasty' damage seems to generally if not always follow a
 checksum failure and the checksum failures are always numerically
 adjacent.  For example:

 [cds2 12707 2015-01-22 12:51:11.032 CST 2754]WARNING:  page
 verification failed, calculated checksum 9465 but expected 9477 at
 character 20
 [cds2 21202 2015-01-22 13:10:18.172 CST 3196]WARNING:  page
 verification failed, calculated checksum 61889 but expected 61903 at
 character 20
 [cds2 29153 2015-01-22 14:49:04.831 CST 4803]WARNING:  page
 verification failed, calculated checksum 27311 but expected 27316

 I'm not up on the intricacies of our checksum algorithm but this is
 making me suspicious that we are looking at a improperly flipped
 visibility bit via some obscure problem -- almost certainly with
 vacuum playing a role.


That very much sounds like the block is getting duplicated from one place
to another.

Even flipping one hint bit (aren't these index pages?  Do they have hint
bits?) should thoroughly scramble the checksum.

Because the checksum adds in the block number after the scrambling has been
done, copying a page to another nearby location will just move the
(expected) checksum a little bit.
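
(A simplified sketch of that property -- emphatically not the real
pg_checksum_page() algorithm -- showing why identical contents at adjacent
block numbers produce nearby expected checksums:)

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* The page contents are scrambled first; the block number is mixed in after. */
static uint16_t
sketch_page_checksum(const unsigned char *page, size_t len, uint32_t blkno)
{
    uint32_t    h = 2166136261u;            /* stand-in scramble (FNV-1a style) */
    size_t      i;

    for (i = 0; i < len; i++)
        h = (h ^ page[i]) * 16777619u;
    h ^= blkno;                             /* block number added after the scramble */
    return (uint16_t) ((h % 65535) + 1);
}

int
main(void)
{
    unsigned char page[8192];

    memset(page, 0xab, sizeof(page));
    /* Identical contents, adjacent block numbers: checksums land close together. */
    printf("blk 100: %u\n", (unsigned) sketch_page_checksum(page, sizeof(page), 100));
    printf("blk 101: %u\n", (unsigned) sketch_page_checksum(page, sizeof(page), 101));
    return 0;
}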

Cheers,

Jeff


Re: [HACKERS] pg_upgrade and rsync

2015-01-23 Thread Andres Freund
On 2015-01-23 14:27:51 -0500, Stephen Frost wrote:
 * Andres Freund (and...@2ndquadrant.com) wrote:
  On 2015-01-23 14:05:10 -0500, Stephen Frost wrote:
   If I follow what you're suggesting, pg_upgrade would
   need a new 'in-place' mode that removes all of the catalog tables from
   the old cluster and puts the new catalog tables into place and leaves
   everything else alone.
  
  No. Except that it'd preserve the relfilenodes (i.e. the filenames of
  relations) it'd work exactly the same as today. The standby is simply
  updated by rsyncing the new data directory of the primary to the
  standby.
 
 You'd have to replace the existing data directory on the master to do
 that, which pg_upgrade was designed specifically to not do, in case
 things went poorly.

Why? Just rsync the new data directory onto the old directory on the
standbys. That's fine and simple.

 You'd still have to deal with the tablespace directories being renamed
 also, since we include the major version and catalog build in the
 directory name..

True.


Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Minor issues with code comments related to abbreviated keys

2015-01-23 Thread Peter Geoghegan
On Thu, Jan 22, 2015 at 5:17 PM, Peter Geoghegan p...@heroku.com wrote:
 Attached patch fixes minor issues in code comments that relate to
 abbreviated keys.

There should also be a description of hyperLogLog in the new README
file within ./src/backend/lib/

I suggest:

hyperloglog.c - a streaming cardinality estimator
-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Minor issues with code comments related to abbreviated keys

2015-01-23 Thread Robert Haas
On Fri, Jan 23, 2015 at 2:59 PM, Peter Geoghegan p...@heroku.com wrote:
 On Thu, Jan 22, 2015 at 5:17 PM, Peter Geoghegan p...@heroku.com wrote:
 Attached patch fixes minor issues in code comments that relate to
 abbreviated keys.

 There should also be a description of hyperLogLog in the new README
 file within ./src/backend/lib/

 I suggest:

 hyperloglog.c - a streaming cardinality estimator

Committed with that addition.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgaudit - an auditing extension for PostgreSQL

2015-01-23 Thread Stephen Frost
* Jim Nasby (jim.na...@bluetreble.com) wrote:
 On 1/23/15 12:16 PM, Stephen Frost wrote:
 Just to clarify- this concept isn't actually mine but was suggested by a
 pretty sizable PG user who has a great deal of familiarity with other
 databases.  I don't mean to try and invoke the 'silent majority' but
 rather to make sure folks don't think this is my idea alone or that it's
 only me who thinks it makes sense.:)   Simon had weighed in earlier
 with, iirc, a comment that he thought it was a good approach also,
 though that was a while ago and things have changed.
 
 I know there's definitely demand for auditing. I'd love to see us support it.
 
 I happen to like the idea specifically because it would allow regular
 roles to change the auditing settings (no need to be a superuser or to
 be able to modify postgresql.conf/postgresql.auto.conf)
 
 Is there really a use case for non-superusers to be able to change auditing 
 config? That seems like a bad idea.

What's a bad idea is having every auditor on the system running around
as superuser..

 Also, was there a solution to how to configure auditing on specific objects 
 with a role-based mechanism? I think we really do need something akin to 
 role:action:object tuples, and I don't see how to do that with roles alone.

That is supported with the grant-based proposal.

 BTW, I'm starting to feel like this needs a wiki page to get the design 
 pulled together.

I agree with that and was planning to offer help with the documentation
and building of such a wiki with examples, etc, once the implementation
was far enough along to demonstrate that the design will actually work..

Thanks!

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] Perl coding error in msvc build system?

2015-01-23 Thread Brar Piening

On 2015-01-23 09:17, Abhijit Menon-Sen wrote:

At 2014-06-03 22:30:50 -0400, pete...@gmx.net wrote:

I'm not sure whether the following coding actually detects any errors:

Solution.pm:

     open(P, "cl /? 2>&1 |") || die "cl command not found";

Since nobody with a Windows system has commented, I'm just writing to
say that from a Perl perspective, I agree with your analysis and the
patch looks perfectly sensible.



I can confirm it on my Windows system.

Calling build from a console without nmake in the path, I always get:
Unable to determine Visual Studio version: The nmake version could not 
be determined. at src/tools/msvc/Mkvcbuild.pm line 63.


This means that the following construct in VSObjectFactory.pm doesn't 
have the desired effect.

open(P, "nmake /? 2>&1 |")
  || croak
"Unable to determine Visual Studio version: The nmake command wasn't found.";


On the other hand, complicacy is in the eye of the beholder.
Perl constructs like the following get quite a few wtf's 
(http://www.osnews.com/story/19266/WTFs_m) from a simple-minded person 
like me.

$? >> 8 == 0 or die "cl command not found";

However, as it fixes a confirmed problem and as maintenance of Perl code
is an issue of its own, please go ahead.


Regards,
Brar




--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] REVIEW: Track TRUNCATE via pgstat

2015-01-23 Thread Alvaro Herrera
Here's v0.5. (Why did you use that weird decimal versioning scheme?  You
could just say v4 and save a couple of keystrokes).  This patch makes
perfect sense to me now.  I was ready to commit, but I checked the
regression test you added and noticed that you're only reading results
for the last set of operations because they all use the same table and
so each new set clobbers the values for the previous one.  So I modified
them to use one table for each set, and report the counters for all
tables.  In doing this I noticed that the one for trunc_stats_test3 is
at odds with what your comment in the .sql file says; would you review
it please?  Thanks.

(I didn't update the expected file.)

BTW you forgot to update expected/prepared_xact_1.out, for the case when
prep xacts are disabled.

If some other committer decides to give this a go, please remember to
bump catversion before pushing.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 66d5083..b2993b8 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -71,6 +71,7 @@
 #include "parser/parse_type.h"
 #include "parser/parse_utilcmd.h"
 #include "parser/parser.h"
+#include "pgstat.h"
 #include "rewrite/rewriteDefine.h"
 #include "rewrite/rewriteHandler.h"
 #include "rewrite/rewriteManip.h"
@@ -1220,6 +1221,8 @@ ExecuteTruncate(TruncateStmt *stmt)
 			 */
 			reindex_relation(heap_relid, REINDEX_REL_PROCESS_TOAST);
 		}
+
+		pgstat_count_truncate(rel);
 	}
 
 	/*
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 827ad71..afe1da2 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -197,8 +197,12 @@ typedef struct TwoPhasePgStatRecord
 	PgStat_Counter tuples_inserted;		/* tuples inserted in xact */
 	PgStat_Counter tuples_updated;		/* tuples updated in xact */
 	PgStat_Counter tuples_deleted;		/* tuples deleted in xact */
+	PgStat_Counter inserted_pre_trunc;	/* tuples inserted prior to truncate */
+	PgStat_Counter updated_pre_trunc;	/* tuples updated prior to truncate */
+	PgStat_Counter deleted_pre_trunc;	/* tuples deleted prior to truncate */
 	Oid			t_id;			/* table's OID */
 	bool		t_shared;		/* is it a shared catalog? */
+	bool		t_truncated;	/* was the relation truncated? */
 } TwoPhasePgStatRecord;
 
 /*
@@ -1859,6 +1863,59 @@ pgstat_count_heap_delete(Relation rel)
 }
 
 /*
+ * pgstat_table_record_truncate - remember i/u/d counts accumulated so far
+ */
+static void
+pgstat_table_record_truncate(PgStat_TableXactStatus *trans)
+{
+	if (!trans->truncated)
+	{
+		trans->inserted_pre_trunc = trans->tuples_inserted;
+		trans->updated_pre_trunc = trans->tuples_updated;
+		trans->deleted_pre_trunc = trans->tuples_deleted;
+	}
+}
+
+/*
+ * pgstat_table_restore_truncate - restore counters when a truncate aborts
+ */
+static void
+pgstat_table_restore_truncate(PgStat_TableXactStatus *trans)
+{
+	if (trans->truncated)
+	{
+		trans->tuples_inserted = trans->inserted_pre_trunc;
+		trans->tuples_updated = trans->updated_pre_trunc;
+		trans->tuples_deleted = trans->deleted_pre_trunc;
+	}
+}
+
+/*
+ * pgstat_count_truncate - update tuple counters due to truncate
+ */
+void
+pgstat_count_truncate(Relation rel)
+{
+	PgStat_TableStatus *pgstat_info = rel->pgstat_info;
+
+	if (pgstat_info != NULL)
+	{
+		/* We have to log the effect at the proper transactional level */
+		int			nest_level = GetCurrentTransactionNestLevel();
+
+		if (pgstat_info->trans == NULL ||
+			pgstat_info->trans->nest_level != nest_level)
+			add_tabstat_xact_level(pgstat_info, nest_level);
+
+		pgstat_table_record_truncate(pgstat_info->trans);
+		pgstat_info->trans->truncated = true;
+		pgstat_info->trans->tuples_inserted = 0;
+		pgstat_info->trans->tuples_updated = 0;
+		pgstat_info->trans->tuples_deleted = 0;
+	}
+}
+
+/*
  * pgstat_update_heap_dead_tuples - update dead-tuples count
  *
  * The semantics of this are that we are reporting the nontransactional
@@ -1916,12 +1973,22 @@ AtEOXact_PgStat(bool isCommit)
 			Assert(trans->upper == NULL);
 			tabstat = trans->parent;
 			Assert(tabstat->trans == trans);
+			/* restore pre-truncate stats (if any) in case of aborted xact */
+			if (!isCommit)
+				pgstat_table_restore_truncate(trans);
 			/* count attempted actions regardless of commit/abort */
 			tabstat->t_counts.t_tuples_inserted += trans->tuples_inserted;
 			tabstat->t_counts.t_tuples_updated += trans->tuples_updated;
 			tabstat->t_counts.t_tuples_deleted += trans->tuples_deleted;
 			if (isCommit)
 			{
+				tabstat->t_counts.t_truncated = trans->truncated;
+				if (trans->truncated)
+				{
+					/* forget live/dead stats seen by backend thus far */
+					tabstat->t_counts.t_delta_live_tuples = 0;
+					tabstat->t_counts.t_delta_dead_tuples = 0;
+				}
 				/* insert adds a live tuple, delete removes one */
 				tabstat->t_counts.t_delta_live_tuples +=
 	

[HACKERS] Abbreviated keys for Datum tuplesort (was: Re: B-Tree support function number 3 (strxfrm() optimization))

2015-01-23 Thread Andrew Gierth
[pruning the Cc: list and starting a new thread]

Here's the cleaned-up version of the patch to allow abbreviated keys
when sorting a single Datum. This also removes comments that suggest
that the caller of tuplesort_begin_datum should ever have to care
about the abbreviated key optimization.

I'll add this to the CF.

-- 
Andrew (irc:RhodiumToad)


diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 8079d97..08088ea 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -363,10 +363,6 @@ initialize_aggregates(AggState *aggstate,
 			 * We use a plain Datum sorter when there's a single input column;
 			 * otherwise sort the full tuple.  (See comments for
 			 * process_ordered_aggregate_single.)
-			 *
-			 * In the future, we should consider forcing the
-			 * tuplesort_begin_heap() case when the abbreviated key
-			 * optimization can thereby be used, even when numInputs is 1.
 			 */
 			peraggstate->sortstate =
 				(peraggstate->numInputs == 1) ?
diff --git a/src/backend/utils/adt/orderedsetaggs.c b/src/backend/utils/adt/orderedsetaggs.c
index f9a5f7f..39ed85b 100644
--- a/src/backend/utils/adt/orderedsetaggs.c
+++ b/src/backend/utils/adt/orderedsetaggs.c
@@ -268,10 +268,6 @@ ordered_set_startup(FunctionCallInfo fcinfo, bool use_tuples)
 
 	/*
 	 * Initialize tuplesort object.
-	 *
-	 * In the future, we should consider forcing the tuplesort_begin_heap()
-	 * case when the abbreviated key optimization can thereby be used, even
-	 * when !use_tuples.
 	 */
 	if (use_tuples)
 		osastate->sortstate = tuplesort_begin_heap(qstate->tupdesc,
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index b6e302f..1f506bf 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -147,13 +147,18 @@ bool		optimize_bounded_sort = true;
  * case where the first key determines the comparison result.  Note that
  * for a pass-by-reference datatype, datum1 points into the tuple storage.
  *
+ * There is one special case: when the sort support infrastructure provides an
+ * abbreviated key representation, where the key is (typically) a pass by
+ * value proxy for a pass by reference type.  In this case, the abbreviated key
+ * is stored in datum1 in place of the actual first key column.
+ *
  * When sorting single Datums, the data value is represented directly by
- * datum1/isnull1.  If the datatype is pass-by-reference and isnull1 is false,
- * then datum1 points to a separately palloc'd data value that is also pointed
- * to by the tuple pointer; otherwise tuple is NULL.  There is one special
- * case:  when the sort support infrastructure provides an abbreviated key
- * representation, where the key is (typically) a pass by value proxy for a
- * pass by reference type.
+ * datum1/isnull1 for pass by value types (or null values).  If the datatype is
+ * pass-by-reference and isnull1 is false, then tuple points to a separately
+ * palloc'd data value, otherwise tuple is NULL.  The value of datum1 is then
+ * either the same pointer as tuple, or is an abbreviated key value as
+ * described above.  Accordingly, tuple is always used in preference to
+ * datum1 as the authoritative value for pass-by-reference cases.
  *
  * While building initial runs, tupindex holds the tuple's run number.  During
  * merge passes, we re-use it to hold the input tape number that each tuple in
@@ -901,30 +906,36 @@ tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
 	state->copytup = copytup_datum;
 	state->writetup = writetup_datum;
 	state->readtup = readtup_datum;
+	state->abbrevNext = 10;
 
 	state->datumType = datumType;
 
-	/* Prepare SortSupport data */
-	state->onlyKey = (SortSupport) palloc0(sizeof(SortSupportData));
-
-	state->onlyKey->ssup_cxt = CurrentMemoryContext;
-	state->onlyKey->ssup_collation = sortCollation;
-	state->onlyKey->ssup_nulls_first = nullsFirstFlag;
-	/*
-	 * Conversion to abbreviated representation infeasible in the Datum case.
-	 * It must be possible to subsequently fetch original datum values within
-	 * tuplesort_getdatum(), which would require special-case preservation of
-	 * original values.
-	 */
-	state->onlyKey->abbreviate = false;
-
-	PrepareSortSupportFromOrderingOp(sortOperator, state->onlyKey);
-
 	/* lookup necessary attributes of the datum type */
 	get_typlenbyval(datumType, &typlen, &typbyval);
 	state->datumTypeLen = typlen;
 	state->datumTypeByVal = typbyval;
 
+	/* Prepare SortSupport data */
+	state->sortKeys = (SortSupport) palloc0(sizeof(SortSupportData));
+
+	state->sortKeys->ssup_cxt = CurrentMemoryContext;
+	state->sortKeys->ssup_collation = sortCollation;
+	state->sortKeys->ssup_nulls_first = nullsFirstFlag;
+
+	/* abbreviation is possible here only for by-reference types */
+	state->sortKeys->abbreviate = !typbyval;
+
+	PrepareSortSupportFromOrderingOp(sortOperator, state->sortKeys);
+
+	/*
+	 * The onlyKey optimization cannot be used with abbreviated keys, since
+	 

Re: [HACKERS] hung backends stuck in spinlock heavy endless loop

2015-01-23 Thread Peter Geoghegan
On Thu, Jan 22, 2015 at 1:50 PM, Merlin Moncure mmonc...@gmail.com wrote:
 Quick update:  not done yet, but I'm making consistent progress, with
 several false starts.  (for example, I had a .conf problem with the
 new dynamic shared memory setting and git merrily bisected down to the
 introduction of the feature.).


Thanks. Looking forward to your findings. Ideally, we'd be able to get
a fix in for 9.4.1.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_upgrade and rsync

2015-01-23 Thread Bruce Momjian
On Fri, Jan 23, 2015 at 02:34:36PM -0500, Stephen Frost wrote:
  Why? Just rsync the new data directory onto the old directory on the
  standbys. That's fine and simple.
 
 That still doesn't address the need to use --size-only, it would just
 mean that you don't need to use -H.  If anything the -H part is the
 aspect which worries me the least about this approach.

It took me a while to understand what Stephen was saying, so let me
explain the details so everyone can get on the same page.

First, let's look at the downsides of using non-hardlink rsync against a
slave cluster, whether we run pg_upgrade on the slave or not:

o  must preserve db directory and relfilenodes (4 new things for
   pg_upgrade to preserve)
o  must checksum files because there is no way to distinguish
   user tables/indexes (which don't need to be copied) from system
   tables/indexes (which must be copied so it is in sync with
   the master)
o  must use wal_log_hints when the slave is installed so the
   checksums match

So, even if all the checksums work, it will be slow/expensive.

Stephen's idea is quite interesting.  You run pg_upgrade on the master,
then, before you start the new server, you run rsync with special flags
and sync the old _and_ new clusters on the master with just the old
cluster on the standby (yeah, odd).  Yes, this is very odd, and where I
got lost too.

First, this only works when pg_upgrade is run in --link mode.  What
rsync --hard-links --size-only is going to do is record which files have
hard links, remember their inode numbers, and cross-reference the
hard-linked files.  When doing the rsync remote comparisons, the
master's old relfilenodes will match the standby's old relfilenodes, and
because we are using --size-only, they will be considered identical and
not copied, or even checksummed.  When it goes to do the standby's new
cluster, none of the directories will exist, so they will all be copied
along with the system objects (they are small), but the user
tables/indexes will be identified as already existing in the slave's old
cluster so it will hard-link to those standby's old cluster files. Once
rsync is complete, you can delete the old cluster on master and standby.
This is effectively duplicating the way pg_upgrade works.

What is interesting is that this will work on any version of pg_upgrade,
with no modifications, as long as link mode is used.  You _cannot_ run
initdb on the standby, as this will create system files that would
prevent the master's system files from being copied.  This is also going
to remove your recovery.conf on the standby, and replace your
postgresql.conf with the master's, so any modifications you made to the
standby will have to be saved and restored into the new cluster before
starting.

I plan to run some tests soon to verify this method works, and if so, I
can document it in the pg_upgrade manual page.

-- 
  Bruce Momjian  br...@momjian.us        http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Unsafe coding in ReorderBufferCommit()

2015-01-23 Thread Tom Lane
There are at least two bugs in reorderbuffer.c's ReorderBufferCommit():

1. Although iterstate is modified within the PG_TRY segment and
referenced within the PG_CATCH segment, it is not marked volatile.
This means that its value upon reaching the PG_CATCH segment is
indeterminate.  In practice, what can happen is that it gets set back
to its value at the time of reaching PG_TRY, which will always be NULL,
so that the effect would be to miss out calling
ReorderBufferIterTXNFinish in the CATCH code.

2. On the other hand, if we get past the ReorderBufferIterTXNFinish
call within the PG_TRY block and then suffer a failure,
ReorderBufferIterTXNFinish will be called again in the PG_CATCH block.
This is due to failure to reset iterstate to NULL after the finish call.
(So this error could be masked if the compiler did cause iterstate
to revert to NULL during longjmp.)

I'm not sure whether #1 is harmless, but #2 most certainly isn't, as
it would result in access to already-freed memory.
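
For concreteness, the combined fix would look roughly like this (a sketch
only, not the exact committed code; ReorderBufferIterTXNInit/Finish are the
existing static helpers in reorderbuffer.c):

    ReorderBufferIterTXNState *volatile iterstate = NULL;

    PG_TRY();
    {
        iterstate = ReorderBufferIterTXNInit(rb, txn);
        /* ... iterate over and apply the transaction's changes ... */
        ReorderBufferIterTXNFinish(rb, iterstate);
        iterstate = NULL;       /* keep PG_CATCH from finishing it twice */
        /* ... rest of commit processing ... */
    }
    PG_CATCH();
    {
        if (iterstate)
            ReorderBufferIterTXNFinish(rb, iterstate);
        PG_RE_THROW();
    }
    PG_END_TRY();

Marking the pointer itself volatile addresses #1; resetting it to NULL right
after the in-TRY finish call addresses #2.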

The first of these was pointed out to me by Mark Wilding of Salesforce.
It's really pretty distressing that modern versions of gcc don't warn
about this (not even with -Wclobbered).  The very ancient gcc on gaur
does warn, but my experience is that it emits a lot of false positives
too, so I'm not that eager anymore to plaster volatile on any variable
it whinges about.  Still, it sure looks like we need a volatile here.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Perl coding error in msvc build system?

2015-01-23 Thread Abhijit Menon-Sen
At 2014-06-03 22:30:50 -0400, pete...@gmx.net wrote:

 I'm not sure whether the following coding actually detects any errors:
 
 Solution.pm:
 
 open(P, "cl /? 2>&1 |") || die "cl command not found";

Since nobody with a Windows system has commented, I'm just writing to
say that from a Perl perspective, I agree with your analysis and the
patch looks perfectly sensible.

-- Abhijit


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_basebackup fails with long tablespace paths

2015-01-23 Thread Abhijit Menon-Sen
At 2014-12-24 08:10:46 -0500, pete...@gmx.net wrote:

 As a demo for how this might look, attached is a wildly incomplete
 patch to produce GNU long-link headers.

Hi Peter.

In what way exactly is this patch wildly incomplete? (I ask because it's
been added to the current CF).

-- Abhijit


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Parallel Seq Scan

2015-01-23 Thread Amit Kapila
On Thu, Jan 22, 2015 at 7:23 PM, Robert Haas robertmh...@gmail.com wrote:

 On Thu, Jan 22, 2015 at 5:57 AM, Amit Kapila amit.kapil...@gmail.com
wrote:
  1. Scanning block-by-block has negative impact on performance and
  I think it will degrade more if we increase parallel count as that can
lead
  to more randomness.
 
  2. Scanning in fixed chunks improves the performance. Increasing
  parallel count to a very large number might impact the performance,
  but I think we can have a lower bound below which we will not allow
  multiple processes to scan the relation.

 I'm confused.  Your actual test numbers seem to show that the
 performance with the block-by-block approach was slightly higher with
 parallelism than without, where as the performance with the
 chunk-by-chunk approach was lower with parallelism than without, but
 the text quoted above, summarizing those numbers, says the opposite.

 Also, I think testing with 2 workers is probably not enough.  I think
 we should test with 8 or even 16.


Below is the data with a larger number of workers; the amount of data and
other configuration remain as before, I have only increased the parallel
worker count:

Block-By-Block (times in ms)

  No. of workers:     0       2       4       8      16      24      32
  Run-1          257851  287353  350091  330193  284913  338001  295057
  Run-2          263241  314083  342166  347337  378057  351916  348292
  Run-3          315374  334208  389907  340327  328695  330048  330102
  Run-4          301054  312790  314682  352835  323926  324042  302147
  Run-5          304547  314171  349158  350191  350468  341219  281315

Fixed-Chunks (times in ms)

  No. of workers:     0       2       4       8      16      24      32
  Run-1          250536  266279  251263  234347   87930   50474   35474
  Run-2          249587  230628  225648  193340   83036   35140    9100
  Run-3          234963  220671  230002  256183  105382   62493   27903
  Run-4          239111  245448  224057  189196  123780   63794   24746
  Run-5          239937  222820  219025  220478  114007   77965   39766


The trend remains the same although there is some variation.
In the block-by-block approach, performance dips (execution takes
more time) with a larger number of workers, though it stabilizes at
some higher value; I still feel it is erratic because it leads to a
random access pattern.
In the fixed-chunk approach, performance improves with more
workers, especially at slightly higher worker counts.
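
To make the two scanning strategies concrete, here is a rough sketch (purely
illustrative -- the struct and function names below are made up, not the
patch's actual code) of how a worker could claim its next blocks in each mode:

    /* Hypothetical illustration only, not the patch's code. */
    typedef struct ParallelScanShared
    {
        pg_atomic_uint32 next_block;   /* shared cursor into the heap */
        uint32           nblocks;      /* total blocks in the relation */
        uint32           chunk_size;   /* 1 = block-by-block; larger = fixed chunks */
    } ParallelScanShared;

    /* Claim the next block range for this worker; returns false when the scan is done. */
    static bool
    claim_next_blocks(ParallelScanShared *shared, BlockNumber *start, BlockNumber *end)
    {
        uint32      first = pg_atomic_fetch_add_u32(&shared->next_block,
                                                    shared->chunk_size);

        if (first >= shared->nblocks)
            return false;
        *start = first;
        *end = Min(first + shared->chunk_size, shared->nblocks);
        return true;
    }

With chunk_size = 1 the workers' reads interleave and the heap is effectively
read in random order; with a larger chunk_size each worker reads a contiguous
range, which keeps the I/O mostly sequential and is consistent with the
fixed-chunk numbers above.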


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Parallel Seq Scan

2015-01-23 Thread Amit Kapila
On Sat, Jan 24, 2015 at 12:24 AM, Joshua D. Drake j...@commandprompt.com
wrote:


 On 01/23/2015 10:44 AM, Jim Nasby wrote:

 number of workers especially at slightly higher worker count.

 Those fixed chunk numbers look pretty screwy. 2, 4 and 8 workers make no
 difference, then suddenly 16 cuts times by 1/2 to 1/3? Then 32 cuts time
 by another 1/2 to 1/3?


There is variation in the tests at different worker counts, but there is
definitely an improvement from 0 to 2 workers (if you refer to my
initial mail on this data, with 2 workers there is a benefit of ~20%).
If we run the tests as straight comparisons (0 vs. 2, 0 vs. 4, 0 vs. 8),
the other effects could be minimised and we might see better consistency;
however, the general trend with fixed chunks seems to be that scanning
that way is better.

I think the real benefit with the current approach/patch can be seen
with qualifications (especially costly expression evaluation).

Further, if we want to just get the benefit of parallel I/O, then
I think we can get that by parallelising partition scan where different
table partitions reside on different disk partitions; however, that is
a matter for a separate patch.


 cached? First couple of runs gets the relations into memory?


Not entirely, as the table size is double the RAM, so each run
has to perform I/O.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] TODO : Allow parallel cores to be used by vacuumdb [ WIP ]

2015-01-23 Thread Dilip kumar
On 22 January 2015 23:16, Alvaro Herrera wrote:

 Here's v23.

 There are two things that continue to bother me and I would like you,
 dear patch author, to change them before committing this patch:

 1. I don't like having vacuum_one_database() and a separate
 vacuum_one_database_parallel().  I think we should merge them into one
 function, which does either thing according to parameters.  There's
 plenty in there that's duplicated.

 2. in particular, the above means that run_parallel_vacuum can no
 longer exist as it is.  Right now vacuum_one_database_parallel relies
 on run_parallel_vacuum to do the actual job parallellization.  I would
 like to have that looping in the improved vacuum_one_database()
 function instead.

 Looking forward to v24,

Thank you for your effort; I have tried to change the patch as per your
instructions and come up with v24.

Changes:
1. In the current patch, vacuum_one_database (for a table list) has the table
loop outside and the analyze_stage loop inside, so it finishes all three
stages for one table first and then picks the next table.  But
vacuum_one_database_parallel does the stage loop outside and calls
run_parallel_vacuum, which has the table loop, so for one stage all the
tables are vacuumed first before moving to the next stage.
To merge the two functions their behavior should be identical.  I think that
if the user has given a list of tables with analyze-in-stages, then doing all
the tables for one stage first and then picking the next stage is the better
solution, so I have done it that way.

2. In select_loop
For WIN32, I replaced the TranslateSocketError call with

    if (WSAGetLastError() == WSAEINTR)
        errno = EINTR;

otherwise I would have to expose the TranslateSocketError function from the
socket code and include an extra header.
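
For context, this is roughly where that check sits (an illustrative sketch
only; the real select_loop() in vacuumdb.c differs in the details):

    static void
    select_loop(int maxFd, fd_set *workerset)
    {
        fd_set      saveSet = *workerset;
        int         i;

        for (;;)
        {
            *workerset = saveSet;
            i = select(maxFd + 1, workerset, NULL, NULL, NULL);

    #ifdef WIN32
            /* winsock reports interruption via WSAGetLastError(), not errno */
            if (i == SOCKET_ERROR && WSAGetLastError() == WSAEINTR)
                errno = EINTR;
    #endif

            if (i < 0 && errno == EINTR)
                continue;       /* interrupted by a signal: retry the select */
            break;
        }
    }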

I have tested on Windows as well; it's working fine.

Regards,
Dilip





vacuumdb_parallel_v24.patch
Description: vacuumdb_parallel_v24.patch

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WITH CHECK and Column-Level Privileges

2015-01-23 Thread Stephen Frost
Alvaro,

Thanks for the review.

* Alvaro Herrera (alvhe...@2ndquadrant.com) wrote:
 Note the first comment in this hunk was not updated to talk about NULL
 instead of "":

Ah, good catch, will fix.

 Hm, this is a bit odd.  I thought you were going to return the subset of
 columns that the user had permission to read, not empty if any of them
 was inaccessible.  Did I misunderstand?

The subset is for regular relations.  For indexes and keys, we only
return either the whole key or nothing.  Returning a partial key didn't
make much sense to me and we also don't know which columns were actually
provided by the user since it's going through the index AM, so we can't
return just what the user provided.

  +#define GetModifiedColumns(relinfo, estate) \
  +   (rt_fetch((relinfo)->ri_RangeTableIndex, (estate)->es_range_table)->modifiedCols)
 
 I assume you are aware that this GetModifiedColumns() macro is a
 duplicate of another one found elsewhere.  However I think this is not
 such a hot idea; the UPSERT patch series has a preparatory patch that
 changes that other macro definition, as far as I recall; probably it'd
 be a good idea to move it elsewhere, to avoid a future divergence.

Yeah, I had meant to do something about that and had looked around but
didn't find any particularly good place to put it.  Any suggestions on
that?

  @@ -1689,25 +1722,54 @@ ExecWithCheckOptions(ResultRelInfo *resultRelInfo,
* dropped columns.  We used to use the slot's tuple descriptor to decode 
  the
* data, but the slot's descriptor doesn't identify dropped columns, so we
* now need to be passed the relation's descriptor.
  + *
  + * Note that, like BuildIndexValueDescription, if the user does not have
  + * permission to view any of the columns involved, a NULL is returned.  
  Unlike
  + * BuildIndexValueDescription, if the user has access to view a subset of 
  the
  + * column involved, that subset will be returned with a key identifying 
  which
  + * columns they are.
*/
 
 Ah, I now see that you are aware of the NULL-or-nothing nature of
 BuildIndexValueDescription ... but the comments there don't explain the
 reason.  Perhaps a comment in BuildIndexValueDescription is in order
 rather than a code change?

Yeah, I'll add comments to BuildIndexValueDescription to explain why
it's all-or-nothing.

I've also been working on back-patching the fixes; the next update will
hopefully include patches for all branches.

Thanks!

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] Perl coding error in msvc build system?

2015-01-23 Thread Andrew Dunstan


On 01/23/2015 03:17 AM, Abhijit Menon-Sen wrote:

At 2014-06-03 22:30:50 -0400, pete...@gmx.net wrote:

I'm not sure whether the following coding actually detects any errors:

Solution.pm:

     open(P, "cl /? 2>&1 |") || die "cl command not found";

Since nobody with a Windows system has commented, I'm just writing to
say that from a Perl perspective, I agree with your analysis and the
patch looks perfectly sensible.


Not quite. This line:

   if ($output =~ /^\/favor:.+AMD64/)

needs an m modifier on the regex, I think, so that the ^ matches any 
beginning of line, not just the first.


cheers

andrew



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WITH CHECK and Column-Level Privileges

2015-01-23 Thread Alvaro Herrera
Stephen Frost wrote:

   +#define GetModifiedColumns(relinfo, estate) \
   + (rt_fetch((relinfo)->ri_RangeTableIndex, (estate)->es_range_table)->modifiedCols)
  
  I assume you are aware that this GetModifiedColumns() macro is a
  duplicate of another one found elsewhere.  However I think this is not
  such a hot idea; the UPSERT patch series has a preparatory patch that
  changes that other macro definition, as far as I recall; probably it'd
  be a good idea to move it elsewhere, to avoid a future divergence.
 
 Yeah, I had meant to do something about that and had looked around but
 didn't find any particularly good place to put it.  Any suggestions on
 that?

Hmm, tough call now that I look it up.  This macro depends on
ResultRelInfo and EState, both of which are in execnodes.h, and also on
rt_fetch which is in parsetree.h.  There is no existing header that
includes parsetree.h (only .c files), so we would have to add one
inclusion on some other header file, or create a new header with just
this definition.  None of these sounds real satisfactory (including
parsetree.h in execnodes.h sounds very bad).  Maybe just add a comment
on both definitions to note that they are dupes of each other?  That
would be more backpatchable that anything else that occurs to me right
away.
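
Something along these lines on each copy would probably be enough (wording is
just a suggestion, and I'm assuming the other copy is the one in
commands/trigger.c):

    /*
     * GetModifiedColumns -- bitmapset of columns the current query modifies.
     *
     * NB: a verbatim copy of this macro exists in commands/trigger.c (there
     * is no good common header for it); keep the two in sync.
     */
    #define GetModifiedColumns(relinfo, estate) \
        (rt_fetch((relinfo)->ri_RangeTableIndex, (estate)->es_range_table)->modifiedCols)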

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: knowing detail of config files via SQL

2015-01-23 Thread Sawada Masahiko
On Fri, Jan 23, 2015 at 4:36 AM, Jim Nasby jim.na...@bluetreble.com wrote:


 Would this view have a row for every option in a config file? IE: if you set
 something in both postgresql.conf and postgresql.auto.conf, would it show up
 twice? I think it should, and that there should be a way to see which
 setting is actually in effect.

yes.

 It looks like you're attempting to handle #include, yes?

Of course yes.

 Also other idea is to add additional columns existing view
 (pg_settings), according to prev discussion.


 I think it would be useful to have a separate view that shows all
 occurrences of a setting. I recall some comment about source_file and
 source_line not always being correct in pg_settings; if that's true we
 should fix that.

You mean that the pg_settings view shows the variable value in the current
session?  What pg_settings shows can change if the user sets the GUC
parameter in the session, or if the GUC parameter is set for a role
individually.
But I don't think the pg_file_settings view has such behavior.

Regards,

---
Sawada Masahiko


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] basebackups during ALTER DATABASE ... SET TABLESPACE ... not safe?

2015-01-23 Thread Andres Freund
On 2015-01-22 22:58:17 +0100, Andres Freund wrote:
 On 2015-01-22 16:38:49 -0500, Stephen Frost wrote:
  I'm trying to figure out why you'd do '2' in master when in discussion
  of '1' you say I also don't think ALTER DATABASE is even intentionally
  run at the time of a base backup.  I agree with that sentiment and am
  inclined to say that '1' is good enough throughout.
 
 Because the way it currently works is a major crock. It's more luck than
 anything that it actually somewhat works. We normally rely on WAL to
 bring us into a consistent state. But around CREATE/MOVE/DROP DATABASE
 we've ignored that.

 And. Hm. The difficulty of the current method is evidenced by the fact
 that so far nobody recognized that 1) as described above isn't actually
 safe.  It fails to protect against basebackups on a standby as its
 XLogCtl state will obviously not be visible on the master.

Further evidenced by the fact that the current method isn't
crash/shutdown safe at all. If a standby was shut down, crashed, or was
started from a base backup while a CREATE DATABASE from the primary is
replayed, the template database used can be in a nearly arbitrarily bad
state. It'll later get fixed up by recovery - but those changes won't
make it to the copied database.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WITH CHECK and Column-Level Privileges

2015-01-23 Thread Stephen Frost
* Alvaro Herrera (alvhe...@2ndquadrant.com) wrote:
 Stephen Frost wrote:
+#define GetModifiedColumns(relinfo, estate) \
+   (rt_fetch((relinfo)->ri_RangeTableIndex, (estate)->es_range_table)->modifiedCols)
   
   I assume you are aware that this GetModifiedColumns() macro is a
   duplicate of another one found elsewhere.  However I think this is not
   such a hot idea; the UPSERT patch series has a preparatory patch that
   changes that other macro definition, as far as I recall; probably it'd
   be a good idea to move it elsewhere, to avoid a future divergence.
  
  Yeah, I had meant to do something about that and had looked around but
  didn't find any particularly good place to put it.  Any suggestions on
  that?
 
 Hmm, tough call now that I look it up.  This macro depends on
 ResultRelInfo and EState, both of which are in execnodes.h, and also on
 rt_fetch which is in parsetree.h.  There is no existing header that
 includes parsetree.h (only .c files), so we would have to add one
 inclusion on some other header file, or create a new header with just
 this definition.  None of these sounds real satisfactory (including
 parsetree.h in execnodes.h sounds very bad).  Maybe just add a comment
 on both definitions to note that they are dupes of each other?  That
 would be more backpatchable that anything else that occurs to me right
 away.

Good thought, I'll do that.

Thanks!

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] WITH CHECK and Column-Level Privileges

2015-01-23 Thread Alvaro Herrera
Note the first comment in this hunk was not updated to talk about NULL
instead of "":

 @@ -163,13 +168,63 @@ BuildIndexValueDescription(Relation indexRelation,
  Datum *values, bool *isnull)
  {
   StringInfoData buf;
 + Form_pg_index idxrec;
 + HeapTuple   ht_idx;
   int natts = indexRelation->rd_rel->relnatts;
   int i;
 + int keyno;
 + Oid indexrelid = RelationGetRelid(indexRelation);
 + Oid indrelid;
 + AclResult   aclresult;
 +
 + /*
 +  * Check permissions- if the users does not have access to view the
 +  * data in the key columns, we return  instead, which callers can
 +  * test for and use or not accordingly.
 +  *
 +  * First we need to check table-level SELECT access and then, if
 +  * there is no access there, check column-level permissions.
 +  */

[hunk continues]

 + /* Table-level SELECT is enough, if the user has it */
 + aclresult = pg_class_aclcheck(indrelid, GetUserId(), ACL_SELECT);
 + if (aclresult != ACLCHECK_OK)
 + {
 + /*
 +  * No table-level access, so step through the columns in the
 +  * index and make sure the user has SELECT rights on all of 
 them.
 +  */
 +	for (keyno = 0; keyno < idxrec->indnatts; keyno++)
 +	{
 +		AttrNumber	attnum = idxrec->indkey.values[keyno];
 +
 +		aclresult = pg_attribute_aclcheck(indrelid, attnum, GetUserId(),
 +										  ACL_SELECT);
 +
 + if (aclresult != ACLCHECK_OK)
 + {
 + /* No access, so clean up and return */
 + ReleaseSysCache(ht_idx);
 + return NULL;
 + }
 + }
 + }

Hm, this is a bit odd.  I thought you were going to return the subset of
columns that the user had permission to read, not empty if any of them
was inaccessible.  Did I misunderstand?


 diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
 index 4c1..20acf04 100644
 --- a/src/backend/executor/execMain.c
 +++ b/src/backend/executor/execMain.c
 @@ -82,12 +82,17 @@ static void ExecutePlan(EState *estate, PlanState 
 *planstate,
   DestReceiver *dest);
  static bool ExecCheckRTEPerms(RangeTblEntry *rte);
  static void ExecCheckXactReadOnly(PlannedStmt *plannedstmt);
 -static char *ExecBuildSlotValueDescription(TupleTableSlot *slot,
 +static char *ExecBuildSlotValueDescription(Oid reloid,
 +   TupleTableSlot *slot,
 TupleDesc tupdesc,
 +   Bitmapset *modifiedCols,
 int maxfieldlen);
  static void EvalPlanQualStart(EPQState *epqstate, EState *parentestate,
 Plan *planTree);
  
 +#define GetModifiedColumns(relinfo, estate) \
 + (rt_fetch((relinfo)->ri_RangeTableIndex, (estate)->es_range_table)->modifiedCols)

I assume you are aware that this GetModifiedColumns() macro is a
duplicate of another one found elsewhere.  However I think this is not
such a hot idea; the UPSERT patch series has a preparatory patch that
changes that other macro definition, as far as I recall; probably it'd
be a good idea to move it elsewhere, to avoid a future divergence.

 @@ -1689,25 +1722,54 @@ ExecWithCheckOptions(ResultRelInfo *resultRelInfo,
   * dropped columns.  We used to use the slot's tuple descriptor to decode the
   * data, but the slot's descriptor doesn't identify dropped columns, so we
   * now need to be passed the relation's descriptor.
 + *
 + * Note that, like BuildIndexValueDescription, if the user does not have
 + * permission to view any of the columns involved, a NULL is returned.  
 Unlike
 + * BuildIndexValueDescription, if the user has access to view a subset of the
 + * column involved, that subset will be returned with a key identifying which
 + * columns they are.
   */

Ah, I now see that you are aware of the NULL-or-nothing nature of
BuildIndexValueDescription ... but the comments there don't explain the
reason.  Perhaps a comment in BuildIndexValueDescription is in order
rather than a code change?


-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] TODO : Allow parallel cores to be used by vacuumdb [ WIP ]

2015-01-23 Thread Alvaro Herrera
Dilip kumar wrote:

 Changes:
 1. In the current patch, vacuum_one_database (for a table list) has the table
 loop outside and the analyze_stage loop inside, so it finishes all three
 stages for one table first and then picks the next table.  But
 vacuum_one_database_parallel does the stage loop outside and calls
 run_parallel_vacuum, which has the table loop, so for one stage all the
 tables are vacuumed first before moving to the next stage.
 To merge the two functions their behavior should be identical.  I think that
 if the user has given a list of tables with analyze-in-stages, then doing all
 the tables for one stage first and then picking the next stage is the better
 solution, so I have done it that way.

Yeah, I think the stages loop should be outermost, as discussed upthread
somewhere -- it's better to have initial stats for all tables as soon as
possible, and improve them later, than have some tables/dbs with no
stats for a longer period while full stats are computed for some
specific tables/database.
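
To illustrate what "stages loop outermost" ends up meaning for the caller (a
simplified sketch only; the real code takes more parameters and also handles
the single-database and table-list cases):

    SimpleStringListCell *cell;
    int         stage;

    if (analyze_in_stages)
    {
        for (stage = 0; stage < ANALYZE_NUM_STAGES; stage++)
        {
            /* run this stage for every database before refining any further */
            for (cell = databases.head; cell; cell = cell->next)
                vacuum_one_database(cell->val, stage /* , ... */);
        }
    }
    else
    {
        for (cell = databases.head; cell; cell = cell->next)
            vacuum_one_database(cell->val, ANALYZE_NO_STAGE /* , ... */);
    }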

I'm tweaking your v24 a bit more now, thanks -- main change is to make
vacuum_one_database be called only to run one analyze stage, so it never
iterates for each stage; callers must iterate calling it multiple times
in those cases.  (There's only one callsite that needs changing anyway.)

 2. In select_loop
 For WIN32, I replaced the TranslateSocketError call with
 
     if (WSAGetLastError() == WSAEINTR)
         errno = EINTR;
 
 otherwise I would have to expose the TranslateSocketError function from the
 socket code and include an extra header.

Grumble.  Don't like this bit, but moving TranslateSocketError to
src/common is outside the scope of this patch, so okay.  (pg_dump's
parallel stuff has the same issue anyway.)

In case you're up for doing some more work later on, there are two ideas
here: move the backend's TranslateSocketError to src/common, and try to
merge pg_dump's select_loop function with the one in this new code.  But
that's for another patch anyway (and this new function needs a little
battle-testing, of course.)

 I have tested on Windows as well; it's working fine.

Great, thanks.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers