Re: [HACKERS] Buildfarm not happy with test module move

2014-12-01 Thread Magnus Hagander
On Mon, Dec 1, 2014 at 2:22 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Alvaro Herrera alvhe...@2ndquadrant.com writes:
 That was my fault -- the alvh.no-ip.org domain was deleted, and the
 email system in postgresql.org rejected the commit message because the
 sender was not in a deliverable domain.  I noticed within a few hours,
 but the damage was already done.

 I guess I'm confused about which is your preferred email address?

Yeah. I was assuming the alvh.no-ip.org was going to come back, as
that's what you're still subscribed to critical mailing lists with :) Do
you need that one changed?

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Buildfarm not happy with test module move

2014-12-01 Thread Alvaro Herrera
Magnus Hagander wrote:
 On Mon, Dec 1, 2014 at 2:22 AM, Tom Lane t...@sss.pgh.pa.us wrote:
  Alvaro Herrera alvhe...@2ndquadrant.com writes:
  That was my fault -- the alvh.no-ip.org domain was deleted, and the
  email system in postgresql.org rejected the commit message because the
  sender was not in a deliverable domain.  I noticed within a few hours,
  but the damage was already done.
 
  I guess I'm confused about which is your preferred email address?
 
 Yeah. I was assuming the alvh.no-ip.org was going to come back, as
 that's what you're still subscribed to critical mailing lists with :) Do
 you need that one changed?

My preferred email address continues to be alvhe...@alvh.no-ip.org.  It
was deleted by mistake; that wasn't intended.  (If you must know, it's a
new idiotic feature of no-ip.com, coupled with my server machine at home
dying because of a stuck SSD, which I thought was non-critical and so
didn't replace for over half a year, plus a change of laptop; it goes to
show that the non-critical machine wasn't really non-critical after all,
and that I'm juggling way too many things.)

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] compress method for spgist - 2

2014-12-01 Thread Teodor Sigaev


Initial message:
http://www.postgresql.org/message-id/5447b3ff.2080...@sigaev.ru

The second version fixes some forgotten changes in pg_am.


--
Teodor Sigaev   E-mail: teo...@sigaev.ru
   WWW: http://www.sigaev.ru/


spgist_compress_method-2.patch.gz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [Windows,PATCH] Use faster, higher precision timer API

2014-12-01 Thread Craig Ringer
Hi all

I've attached a revised patchset for GetSystemTimeAsFileTime and
GetSystemTimePreciseAsFileTime to address David Rowley's review notes.

Thanks for the review, and for poking me to follow up.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
From 63edf0571961f2f2c7fd6a08b99e747ee39a3377 Mon Sep 17 00:00:00 2001
From: Craig Ringer cr...@2ndquadrant.com
Date: Fri, 12 Sep 2014 12:41:35 +0800
Subject: [PATCH 1/2] Use GetSystemTimeAsFileTime directly in windows
 gettimeofday
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

PostgreSQL was calling GetSystemTime followed by SystemTimeToFileTime in the
win32 port gettimeofday function. This is unnecessary and limits the reported
precision to the 1ms granularity that the SYSTEMTIME struct can represent. By
using GetSystemTimeAsFileTime we avoid the intermediate conversion and capture
timestamps at 100ns granularity, which are then rounded to 1µs granularity for
storage in a PostgreSQL timestamp.

On most Windows systems this change will actually have no significant effect,
as the system timer tick is typically between 1ms and 15ms depending on what
timer resolution currently running applications have requested. You can check
this with clockres.exe from Sysinternals. Despite the platform limitation this
change still permits capture of finer timestamps where the system is capable
of producing them, and it gets rid of an unnecessary syscall.

Future work may permit use of GetSystemTimePreciseAsFileTime on Windows 8 and
Windows Server 2012 for higher resolution time capture. This call has the same
interface as GetSystemTimeAsFileTime.
---
 src/port/gettimeofday.c | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/src/port/gettimeofday.c b/src/port/gettimeofday.c
index 75a9199..73ec406 100644
--- a/src/port/gettimeofday.c
+++ b/src/port/gettimeofday.c
@@ -44,16 +44,14 @@ int
 gettimeofday(struct timeval * tp, struct timezone * tzp)
 {
 	FILETIME	file_time;
-	SYSTEMTIME	system_time;
 	ULARGE_INTEGER ularge;
 
-	GetSystemTime(&system_time);
-	SystemTimeToFileTime(&system_time, &file_time);
+	GetSystemTimeAsFileTime(&file_time);
 	ularge.LowPart = file_time.dwLowDateTime;
 	ularge.HighPart = file_time.dwHighDateTime;
 
 	tp->tv_sec = (long) ((ularge.QuadPart - epoch) / 10000000L);
-	tp->tv_usec = (long) (system_time.wMilliseconds * 1000);
+	tp->tv_usec = (long) (((ularge.QuadPart - epoch) % 10000000L) / 10);
 
 	return 0;
 }
-- 
1.9.3

From c3c9f38379dca3f6f59520c5ceafce34ee8c8d90 Mon Sep 17 00:00:00 2001
From: Craig Ringer cr...@2ndquadrant.com
Date: Thu, 18 Sep 2014 23:02:14 +0800
Subject: [PATCH 2/2] Use GetSystemTimePreciseAsFileTime when available

This will cause PostgreSQL on Windows 8 or Windows Server 2012 to
obtain high-resolution timestamps while allowing the same binaries
to run without problems on older releases.
---
 src/backend/main/main.c |  6 ++
 src/include/port.h  |  2 ++
 src/port/gettimeofday.c | 56 +++--
 3 files changed, 62 insertions(+), 2 deletions(-)

diff --git a/src/backend/main/main.c b/src/backend/main/main.c
index c51b391..73c30c5 100644
--- a/src/backend/main/main.c
+++ b/src/backend/main/main.c
@@ -260,6 +260,12 @@ startup_hacks(const char *progname)
 
 		/* In case of general protection fault, don't show GUI popup box */
 		SetErrorMode(SEM_FAILCRITICALERRORS | SEM_NOGPFAULTERRORBOX);
+
+#ifndef HAVE_GETTIMEOFDAY
+		/* Figure out which syscall to use to capture timestamp information */
+		init_win32_gettimeofday();
+#endif
+
 	}
 #endif   /* WIN32 */
 
diff --git a/src/include/port.h b/src/include/port.h
index 94a0e2f..58677ec 100644
--- a/src/include/port.h
+++ b/src/include/port.h
@@ -328,6 +328,8 @@ extern FILE *pgwin32_popen(const char *command, const char *type);
 #ifndef HAVE_GETTIMEOFDAY
 /* Last parameter not used */
 extern int	gettimeofday(struct timeval * tp, struct timezone * tzp);
+/* On windows we need to call some backend start setup for accurate timing */
+extern void init_win32_gettimeofday(void);
 #endif
 #else			/* !WIN32 */
 
diff --git a/src/port/gettimeofday.c b/src/port/gettimeofday.c
index 73ec406..a82a1a4 100644
--- a/src/port/gettimeofday.c
+++ b/src/port/gettimeofday.c
@@ -30,14 +30,66 @@
 
 #include <sys/time.h>
 
+#ifndef FRONTEND
+#include "utils/elog.h"
+#endif
+
 
 /* FILETIME of Jan 1 1970 00:00:00. */
 static const unsigned __int64 epoch = UINT64CONST(116444736000000000);
 
 /*
+ * Both GetSystemTimeAsFileTime and GetSystemTimePreciseAsFileTime share a
+ * signature, so we can just store a pointer to whichever we find. This
+ * is the pointer's type.
+ */
+typedef VOID (WINAPI *PgGetSystemTimeFn)(LPFILETIME);
+/* Storage for the function we pick at runtime */
+static PgGetSystemTimeFn pg_get_system_time = NULL;
+
+/*
+ * During backend startup, determine if GetSystemTimePreciseAsFileTime is
+ * available and use it; if not, 
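A minimal sketch of the runtime probe being described, assuming the
PgGetSystemTimeFn typedef above (illustrative only; the actual patch may
differ in detail):

void
init_win32_gettimeofday(void)
{
	/*
	 * Probe kernel32 for GetSystemTimePreciseAsFileTime, which exists only
	 * on Windows 8 / Server 2012 and later.  If it is absent, fall back to
	 * GetSystemTimeAsFileTime, which every supported release provides.
	 */
	pg_get_system_time = (PgGetSystemTimeFn)
		GetProcAddress(GetModuleHandle(TEXT("kernel32.dll")),
					   "GetSystemTimePreciseAsFileTime");

	if (pg_get_system_time == NULL)
		pg_get_system_time = &GetSystemTimeAsFileTime;
}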

Re: [HACKERS] [Windows,PATCH] Use faster, higher precision timer API

2014-12-01 Thread Marco Nenciarini
Il 01/12/14 14:16, Craig Ringer ha scritto:
  
 +#ifndef FRONTEND
  +#include "utils/elog.h"
 +#endif
 +

I think this is a leftover, as you don't use elog afterwards.

Regards,
Marco

-- 
Marco Nenciarini - 2ndQuadrant Italy
PostgreSQL Training, Services and Support
marco.nenciar...@2ndquadrant.it | www.2ndQuadrant.it



signature.asc
Description: OpenPGP digital signature


Re: [HACKERS] Proposal : REINDEX SCHEMA

2014-12-01 Thread Sawada Masahiko
On Thu, Nov 27, 2014 at 11:43 PM, Michael Paquier
michael.paqu...@gmail.com wrote:
 On Thu, Nov 27, 2014 at 8:18 PM, Sawada Masahiko sawada.m...@gmail.com 
 wrote:
 On Thu, Nov 27, 2014 at 1:07 PM, Michael Paquier
 michael.paqu...@gmail.com wrote:
 On Thu, Nov 27, 2014 at 12:55 AM, Sawada Masahiko sawada.m...@gmail.com 
 wrote:
 +1 for defining some new object type and removing do_user and do_system.
 But if we add OBJECT_SYSTEM to the ObjectType data type,
 system catalogs are OBJECT_SYSTEM as well as OBJECT_TABLE.
 Isn't that a bit redundant?
 Yes, kind of. That's a superset of a type of relations, aka a set of
 catalog tables. If you find something cleaner to propose, feel free.

 I thought we could add a new type like ReindexObjectType, which has
 REINDEX_OBJECT_TABLE,
 REINDEX_OBJECT_SYSTEM and so on. It's similar to the GRANT syntax.
 Check.

 Another thing, ReindexDatabaseOrSchema should be renamed to ReindexObject.
 So, I think that we need to think a bit more here. We are not far from
 something that could be committed, so marking as Waiting on Author for
 now. Thoughts?

 Is the table also kind of object?
 Sorry, I am not sure I follow you here. Indexes and tables have
 already their relkind set in ReindexStmt, and I think that we're fine
 to continue letting them go in their own reindex code path for now.

 It was not enough, sorry.
 I mean that there is already a ReindexTable() function;
 if we renamed ReindexDatabaseOrSchema to ReindexObject, I would feel
 uncomfortable, because a table is also a kind of object.
 Check.

 If you get that done, I'll have an extra look at it and then let's
 have a committer look at it.

The attached patch is the latest version.
I added new enum values like REINDEX_OBJECT_XXX
and changed the ReindexObject() function.
The ACL problem is also fixed in this version.
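For reference, a minimal sketch of the enum shape being discussed (the
member names here are illustrative, not necessarily the patch's exact
list):

typedef enum ReindexObjectType
{
	REINDEX_OBJECT_INDEX,		/* index */
	REINDEX_OBJECT_TABLE,		/* table or materialized view */
	REINDEX_OBJECT_SCHEMA,		/* schema */
	REINDEX_OBJECT_SYSTEM,		/* system catalogs */
	REINDEX_OBJECT_DATABASE		/* database */
} ReindexObjectType;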

Regards,

---
Sawada Masahiko


000_reindex_schema_v6.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PROTOCOL TODO] Permit streaming of unknown-length lob/clob (bytea,text,etc)

2014-12-01 Thread David Fetter
On Mon, Dec 01, 2014 at 02:55:22PM +0800, Craig Ringer wrote:
 Hi all
 
 Currently the client must know the size of a large lob/clob field, like
 a 'bytea' or 'text' field, in order to send it to the server. This can
 force the client to buffer all the data before sending it to the server.

Yes, this is not good.

 It would be helpful if the v4 protocol permitted the client to specify
 the field length as unknown / TBD, then stream data until an end marker
 is read.

What's wrong with specifying its length in advance instead?  Are you
thinking of one or more use cases where it's both large and unknown?

Cheers,
David.
-- 
David Fetter da...@fetter.org http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david.fet...@gmail.com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PROTOCOL TODO] Permit streaming of unknown-length lob/clob (bytea,text,etc)

2014-12-01 Thread Craig Ringer
On 12/01/2014 10:38 PM, David Fetter wrote:
 On Mon, Dec 01, 2014 at 02:55:22PM +0800, Craig Ringer wrote:
 Hi all

 Currently the client must know the size of a large lob/clob field, like
 a 'bytea' or 'text' field, in order to send it to the server. This can
 force the client to buffer all the data before sending it to the server.
 
 Yes, this is not good.
 
 It would be helpful if the v4 protocol permitted the client to specify
 the field length as unknown / TBD, then stream data until an end marker
 is read.
 
 What's wrong with specifying its length in advance instead?  Are you
 thinking of one or more use cases where it's both large and unknown?

I am - specifically, the JDBC setBlob(...) and setClob(...) APIs that
accept streams without a specified length:

https://docs.oracle.com/javase/7/docs/api/java/sql/PreparedStatement.html#setBlob(int,%20java.io.InputStream)

https://docs.oracle.com/javase/7/docs/api/java/sql/PreparedStatement.html#setClob(int,%20java.io.Reader)

There are variants that do take a length, so PgJDBC can (and now does)
implement the no-length variants by internally buffering the stream
until EOF. It'd be nice to get rid of that though.


-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PROTOCOL TODO] Permit streaming of unknown-length lob/clob (bytea,text,etc)

2014-12-01 Thread Tom Lane
Craig Ringer cr...@2ndquadrant.com writes:
 Currently the client must know the size of a large lob/clob field, like
 a 'bytea' or 'text' field, in order to send it to the server. This can
 force the client to buffer all the data before sending it to the server.

 It would be helpful if the v4 protocol permitted the client to specify
 the field length as unknown / TBD, then stream data until an end marker
 is read. Some encoding would be required for binary data to ensure that
 occurrences of the end marker in the streamed data were properly
 handled, but there are many well established schemes for doing this.

I think this is pretty much a non-starter as stated, because the v3
protocol requires all messages to have a preceding length word.  That's
not very negotiable.

What's already on the TODO list is to allow large field values to be sent
or received in segments, perhaps with a cursor-like arrangement.  You can
do that today for blobs, but not for oversize regular table fields.
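For comparison, a minimal libpq sketch of the segmented access that
already works for blobs today (error handling elided; lo_open, lo_read
and lo_close are the documented large-object calls):

#include <stdio.h>
#include "libpq-fe.h"
#include "libpq/libpq-fs.h"		/* for INV_READ */

/* Stream a large object in 8kB segments instead of buffering it whole. */
static void
stream_blob(PGconn *conn, Oid loid)
{
	char		buf[8192];
	int			fd,
				n;

	PQclear(PQexec(conn, "BEGIN"));		/* LO calls need a transaction */
	fd = lo_open(conn, loid, INV_READ);
	while ((n = lo_read(conn, fd, buf, sizeof(buf))) > 0)
		fwrite(buf, 1, n, stdout);		/* consume one segment at a time */
	lo_close(conn, fd);
	PQclear(PQexec(conn, "COMMIT"));
}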

Of course, considering that the maximum practical size of a regular field
is probably in the dozens of megabytes, and that RAM is getting cheaper
all the time, it's not clear that it's all that much of a hardship for
clients to buffer the whole thing.  If we've not gotten around to this
in the last dozen years, it's unlikely we'll get to it in the future
either ...

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Selectivity estimation for inet operators

2014-12-01 Thread Tom Lane
I wrote:
 I spent a fair chunk of the weekend hacking on this patch to make
 it more understandable and fix up a lot of what seemed to me pretty
 clear arithmetic errors in the upper layers of the patch.  However,
 I couldn't quite convince myself to commit it, because the business
 around estimation for partial histogram-bucket matches still doesn't
 make any sense to me.

Actually, there's a second large problem with this patch: blindly
iterating through all combinations of MCV and histogram entries makes the
runtime O(N^2) in the statistics target.  I made up some test data (by
scanning my mail logs) and observed the following planning times, as
reported by EXPLAIN ANALYZE:

explain analyze select * from relays r1, relays r2 where r1.ip = r2.ip;
explain analyze select * from relays r1, relays r2 where r1.ip && r2.ip;

stats target    eqjoinsel    networkjoinsel

100               0.27 ms         1.85 ms
1000              2.56 ms       167.2 ms
10000             56.6 ms     13987.1 ms

I don't think it's necessary for network selectivity to be quite as
fast as eqjoinsel, but I doubt we can tolerate 14 seconds planning
time for a query that might need just milliseconds to execute :-(

It seemed to me that it might be possible to reduce the runtime by
exploiting knowledge about the ordering of the histograms, but
I don't have time to pursue that right now.
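For what it's worth, a hypothetical sketch of that idea: because the
histogram bounds are sorted, a merge-style sweep over the two histograms
can skip bucket pairs that cannot overlap, visiting roughly
O(N + M + matching pairs) combinations instead of all N * M.  The types
and the accumulation are placeholders, not the patch's actual code:

typedef struct { double lo, hi; } Bucket;	/* stand-in for inet bounds */

static double
sweep_overlaps(const Bucket *a, int na, const Bucket *b, int nb)
{
	double		matches = 0;
	int			i = 0,
				j = 0;

	while (i < na && j < nb)
	{
		if (a[i].hi < b[j].lo)
			i++;				/* a[i] ends before b[j] begins: skip it */
		else if (b[j].hi < a[i].lo)
			j++;				/* b[j] ends before a[i] begins: skip it */
		else
		{
			matches++;			/* overlapping pair: accumulate estimate */
			/* advance whichever bucket ends first */
			if (a[i].hi <= b[j].hi)
				i++;
			else
				j++;
		}
	}
	return matches;
}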

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Buildfarm not happy with test module move

2014-12-01 Thread Alvaro Herrera
Tom Lane wrote:
 Alvaro Herrera alvhe...@2ndquadrant.com writes:
  Tom Lane wrote:
  All of the MSVC critters are failing at make check.
 
  Yeah, I noticed that, thanks.  As far as I can see the only way to fix
  it is to install dummy_seclabel to run the core seclabel test.  That
  doesn't seem smart; I think it'd be better to remove that part of the
  core seclabel test, and move the rest of the test to a new test in the
  dummy_seclabel module.
 
 Sounds good to me.  The other parts of the core tests that depend on
 contrib modules aren't exactly good models to follow.

Pushed; tests pass for me, let's see what buildfarm says.

I think the input/ and output/ stuff is rather annoying.  I tried to
make dummy_seclabel an extension instead of a bare library, so that
CREATE EXTENSION inside the .sql loads it easily.  The problem there is
that dummy_seclabel doesn't have any SQL-callable function, so the
module doesn't ever load.  I guess we could create a dummy function
which is there only to cause the module to get loaded ...
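A hypothetical minimal shape for such a dummy function, just enough for
CREATE EXTENSION to load the shared library and run its _PG_init():

#include "postgres.h"
#include "fmgr.h"

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(dummy_seclabel_dummy);

/* No-op function whose only purpose is to force the module to load. */
Datum
dummy_seclabel_dummy(PG_FUNCTION_ARGS)
{
	PG_RETURN_VOID();
}

/* ... and in the extension script:
 *   CREATE FUNCTION dummy_seclabel_dummy() RETURNS void
 *       AS 'MODULE_PATHNAME' LANGUAGE C;
 */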

I am in touch with Andrew about adding a new stage to the buildfarm client
so that animals will build src/test/modules by default.  It should work
fine everywhere except MSVC, I think.  The issue with MSVC continues to
be that Install.pm and Mkvcbuild.pm need to be in sync regarding what to
build and what to install.  I could use help from some MSVC-enabled
developer there ...

In the meantime, I'm going to rebase commit_ts to src/test/modules and
get one bug I found there fixed, and will push that shortly too.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Buildfarm not happy with test module move

2014-12-01 Thread Tom Lane
Alvaro Herrera alvhe...@2ndquadrant.com writes:
 Tom Lane wrote:
 Sounds good to me.  The other parts of the core tests that depend on
 contrib modules aren't exactly good models to follow.

 Pushed; tests pass for me, let's see what buildfarm says.

Pushed?  Don't see it ...

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] using Core Foundation locale functions

2014-12-01 Thread David E. Wheeler
On Nov 28, 2014, at 8:43 AM, Peter Eisentraut pete...@gmx.net wrote:
 
 At the moment, this is probably just an experiment that shows where
 refactoring and better abstractions might be suitable if we want to
 support multiple locale libraries.  If we want to pursue ICU, I think
 this could be a useful third option.

Gotta say, I’m thrilled to see movement on this front, and especially pleased 
to see how consensus seems to be building around an abstracted interface to 
keep options open. This platform-specific example really highlights the need 
for it (I had no idea that there was separate and more up-to-date collation 
support in Core Foundation than in the UNIX layer of OS X).

Really looking forward to seeing where we end up.

Best,

David
 



smime.p7s
Description: S/MIME cryptographic signature


Re: [HACKERS] Turning recovery.conf into GUCs

2014-12-01 Thread Alex Shulgin
Alex Shulgin a...@commandprompt.com writes:

 Here's an attempt to revive this patch.

Here's the patch rebased against current HEAD, that is, including the
recently committed action_at_recovery_target option.

The default for the new GUC is 'pause', as in HEAD, and
pause_at_recovery_target is removed completely in favor of it.

I've also taken the liberty of removing the part that errors out upon
finding $PGDATA/recovery.conf.  Now get your rotten tomatoes ready. ;-)

--
Alex



recovery_guc_v5.4.patch.gz
Description: application/gzip

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal: Log inability to lock pages during vacuum

2014-12-01 Thread Jim Nasby

On 12/1/14, 11:57 AM, Andres Freund wrote:

On 2014-11-30 20:46:51 -0600, Jim Nasby wrote:

On 11/10/14, 7:52 PM, Tom Lane wrote:

On the whole, I'm +1 for just logging the events and seeing what we learn
that way.  That seems like an appropriate amount of effort for finding out
whether there is really an issue.


Attached is a patch that does this.


Unless somebody protests I plan to push this soon. I'll change two
things:

* Always use the same message, independent of scan_all. For one, Jim's
   version would be untranslatable; for another, it's not actually
   correct. In most cases we'll *not* wait, even if scan_all is
   true, as we'll often just balk after !lazy_check_needs_freeze().


Good point.


* Change the new bit in the errdetail. "could not acquire cleanup lock"
   sounds too much like an error to me. "skipped %u pinned pages" maybe?


Seems reasonable.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] No documentation on record *= record?

2014-12-01 Thread Jim Nasby

There doesn't seem to be documentation on *= (or search isn't finding it). Is 
this intentional?
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.2 recovery/startup problems with unlogged tables

2014-12-01 Thread Jeff Janes
On Wed, Nov 26, 2014 at 4:13 PM, Jeff Janes jeff.ja...@gmail.com wrote:

 On Tue, Nov 25, 2014 at 9:30 PM, Jeff Janes jeff.ja...@gmail.com wrote:

 Using both 9.2.9 and 9_2_STABLE 9b468bcec15f1ea7433d4, I get a fairly
 reproducible startup failure.

 What I was doing is restore a database from a base backup and roll it
 forward with a recovery.conf until it completes and starts up.  Then I
 truncate an unlogged table and start repopulating it with a large slow
 'insert into ...select from' query.

 While that is running, if I reboot the server, when it restarts it does
 not come back up initially.

 This is what I get in the log from the attempted restart:

 PST LOG:  database system was interrupted; last known up at 2014-11-25
 15:40:33 PST
 PST LOG:  database system was not properly shut down; automatic recovery
 in progress
 PST LOG:  redo starts at 84/EF80
 PST LOG:  record with zero length at 84/EF09AE18
 PST LOG:  redo done at 84/EF09AD28
 PST LOG:  last completed transaction was at log time 2014-11-25
 15:42:09.173599-08
 PST LOG:  checkpoint starting: end-of-recovery immediate
 PST LOG:  checkpoint complete: wrote 103 buffers (0.2%); 0 transaction
 log file(s) added, 246 removed, 7 recycled; write=0.002 s, sync=0.020 s,
 total=0.526 s; sync files=51, longest=0.003 s, average=0.000 s
 PST FATAL:  could not create file "base/16416/59288": File exists
 PST LOG:  startup process (PID 2472) exited with exit code 1
 PST LOG:  aborting startup due to startup process failure

 oid2name doesn't show me any 59288, so I think it is the new copy of the
 unlogged table which is being created at the moment of the reboot.


 59288 is not the new (uncommitted) unlogged table itself, but is the new
 (uncommitted) toast table associated with that unlogged table.

 immediately before the 'sudo /sbin/reboot', while the 'truncate
 unlogged_table; insert into unlogged_table from select...' is running, the
 two files below exist and have zero size.

 base/16416/59288_init
 base/16416/59288

 Immediately after the system has come back up, the file
 base/16416/59288_init no longer exists.

 I can't reproduce this behavior without the reboot.  It seems to depend on
 the sequencing and timing of the signals which the kernel delivers to the
 running processes during the reboot.

 If I do a pg_ctl stop -mf, then both files go away.  If I do a pg_ctl stop
 -mi, then neither goes away.  It is only with the /sbin/reboot that I get
 the fatal combination of _init being gone but the other still present.


 Then once the system is back and I restart postmaster, the first pass
 through ResetUnloggedRelationsInDbspaceDir doesn't remove base/16416/59288,
 because base/16416/59288_init doesn't exist to trigger the deletion.  On
 the 2nd pass through, base/16416/59288_init exists because it was created
 by WAL replay, and the copy fails because it is expecting base/16416/59288
 to have already been cleaned up by the first pass.

 Should ResetUnloggedRelationsInDbspaceDir on the second pass attempt to
 unlink the target immediately before the copy (line 338, in 9.2) to
 accommodate cases like this?


I still haven't figured out what sequence of signals causes the init fork
to be removed but the main fork to remain, but I think the startup process
should be robust to this situation.  The attached patch adds an
unconditional and unchecked unlink before the copy.  It should probably
have a comment, but I don't really know what to say.

This resolves the situation for me.  It is expected that the unlink will
fail for a non-existent file in the vast majority of cases.  Should
it check that the unlink either succeeds or fails for non-existence, and
throw an error in cases other than those?  Would it be a performance
concern to call unlink potentially thousands of times?  I don't think so,
because it only calls the unlink when it is about to copy a file anyway,
which is a much heavier operation.
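For illustration, a sketch of the proposed change in
ResetUnloggedRelationsInDbspaceDir's second pass (variable names assumed
from the surrounding code, not quoted from the actual patch):

	snprintf(dstpath, sizeof(dstpath), "%s/%s",
			 dbspacedirname, de->d_name);

	/*
	 * Remove any stale main fork left behind by an unclean shutdown
	 * before copying the init fork over it.  In the vast majority of
	 * cases this fails with ENOENT, which is expected and harmless.
	 */
	unlink(dstpath);

	copy_file(srcpath, dstpath);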

My understanding of the reason that it was done in two passes in the first
place is that during hot standby, it is OK for the main fork to be missing,
but not OK for it to be gibberish.  But then upon promotion to writable, it
is no longer OK for it to be missing, so they get copied over at that time
(and not earlier, as then init forks created during replay wouldn't get
copied).  Given that reasoning, I think that unconditionally unlinking the
destination file before the copy would not be a problem.  The gibberish
main fork would not be a problem during hot standby queries, as the table's
creation has not been committed and so can't be seen.

(I've added Robert to CC because he wrote the original code.)

Cheers,

Jeff


reinit_unlink.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Buildfarm not happy with test module move

2014-12-01 Thread Alvaro Herrera
Tom Lane wrote:
 Alvaro Herrera alvhe...@2ndquadrant.com writes:
  Tom Lane wrote:
  Sounds good to me.  The other parts of the core tests that depend on
  contrib modules aren't exactly good models to follow.
 
  Pushed; tests pass for me, let's see what buildfarm says.
 
 Pushed?  Don't see it ...

Meh.  Done for real this time.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Role Attribute Bitmask Catalog Representation

2014-12-01 Thread Stephen Frost
Adam,

* Adam Brightwell (adam.brightw...@crunchydatasolutions.com) wrote:
 Please let me know what you think, all feedback is greatly appreciated.

Thanks for this.  Found time to do more of a review and have a few
comments:

 diff --git a/src/backend/catalog/aclchk.c b/src/backend/catalog/aclchk.c
 new file mode 100644
 index d30612c..93eb2e6
 *** a/src/backend/catalog/aclchk.c
 --- b/src/backend/catalog/aclchk.c
 *** pg_extension_ownercheck(Oid ext_oid, Oid
 *** 5051,5056 
 --- 5031,5058 
   }
   
   /*
 +  * Check whether the specified role has a specific role attribute.
 +  */
 + bool
 + role_has_attribute(Oid roleid, RoleAttr attribute)
 + {
 + RoleAttrattributes;
 + HeapTuple   tuple;
 + 
 + tuple = SearchSysCache1(AUTHOID, ObjectIdGetDatum(roleid));
 + 
 + if (!HeapTupleIsValid(tuple))
 + ereport(ERROR,
 + (errcode(ERRCODE_UNDEFINED_OBJECT),
  +  errmsg("role with OID %u does not exist",
  roleid)));
 + 
  + attributes = ((Form_pg_authid) GETSTRUCT(tuple))->rolattr;
  + ReleaseSysCache(tuple);
  + 
  + return ((attributes & attribute) > 0);
 + }

I'd put the superuser_arg() check in role_has_attribute.

 --- 5066,5089 
   bool
   has_createrole_privilege(Oid roleid)
   {
   /* Superusers bypass all permission checking. */
   if (superuser_arg(roleid))
   return true;
   
 ! return role_has_attribute(roleid, ROLE_ATTR_CREATEROLE);
   }

And then ditch the individual has_X_privilege() calls (I note that you
didn't appear to add back the has_rolcatupdate() function..).
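Something like the following, a sketch of the suggested consolidation
that reuses the patch's own lookup (not a finished implementation):

bool
role_has_attribute(Oid roleid, RoleAttr attribute)
{
	RoleAttr	attributes;
	HeapTuple	tuple;

	/* Superusers bypass all permission checking. */
	if (superuser_arg(roleid))
		return true;

	tuple = SearchSysCache1(AUTHOID, ObjectIdGetDatum(roleid));
	if (!HeapTupleIsValid(tuple))
		ereport(ERROR,
				(errcode(ERRCODE_UNDEFINED_OBJECT),
				 errmsg("role with OID %u does not exist", roleid)));

	attributes = ((Form_pg_authid) GETSTRUCT(tuple))->rolattr;
	ReleaseSysCache(tuple);

	return (attributes & attribute) != 0;
}

Callers could then test ROLE_ATTR_CREATEROLE and friends directly, and
the thin has_X_privilege() wrappers would disappear.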

 diff --git a/src/backend/commands/user.c b/src/backend/commands/user.c
 new file mode 100644
 index 1a73fd8..72c5dcc
 *** a/src/backend/commands/user.c
 --- b/src/backend/commands/user.c
 *** have_createrole_privilege(void)
 *** 63,68 
 --- 63,73 
   return has_createrole_privilege(GetUserId());
   }
   
 + static RoleAttr
 + set_role_attribute(RoleAttr attributes, RoleAttr attribute)
 + {
  + return ((attributes & ~(0x)) | attribute);
 + }

I don't think this is doing what you think it is..?  It certainly isn't
working as expected by AlterRole().  Rather than using a function for
these relatively simple operations, why not just have AlterRole() do:

if (isX >= 0)
    attributes = (isX > 0) ? attributes | ROLE_ATTR_X
                           : attributes & ~(ROLE_ATTR_X);

Otherwise, you'd probably need to pass a flag into set_role_attribute()
to indicate if you're setting or unsetting the bit, or have an
'unset_role_attribute()' function, but all of that seems like overkill..

 *** CreateRole(CreateRoleStmt *stmt)
 --- 86,99 
   char   *password = NULL;/* user password */
   boolencrypt_password = Password_encryption; /* encrypt 
 password? */
   charencrypted_password[MD5_PASSWD_LEN + 1];
 ! boolissuper = false;/* Make the user a 
 superuser? */
 ! boolinherit = true; /* Auto inherit 
 privileges? */

I'd probably leave the whitespace-only changes out.

 *** AlterRoleSet(AlterRoleSetStmt *stmt)
 *** 889,895 
* To mess with a superuser you gotta be superuser; else you 
 need
* createrole, or just want to change your own settings
*/
  ! if (((Form_pg_authid) GETSTRUCT(roletuple))->rolsuper)
   {
   if (!superuser())
   ereport(ERROR,
 --- 921,928 
* To mess with a superuser you gotta be superuser; else you 
 need
* createrole, or just want to change your own settings
*/
  ! attributes = ((Form_pg_authid) GETSTRUCT(roletuple))->rolattr;
  ! if ((attributes & ROLE_ATTR_SUPERUSER) > 0)
   {
   if (!superuser())
   ereport(ERROR,

We don't bother with the '> 0' in any of the existing bit testing that
goes on in the backend, so I don't think it makes sense to start now.
Just do

if (attributes & ROLE_ATTR_SUPERUSER) ... etc

and be done with it.

 *** aclitemin(PG_FUNCTION_ARGS)
 *** 577,582 
 --- 578,584 
   PG_RETURN_ACLITEM_P(aip);
   }
   
 + 
   /*
* aclitemout
*  Allocates storage for, and fills in, a new null-delimited string

More whitespace changes... :/

 *** pg_role_aclcheck(Oid role_oid, Oid rolei
 *** 4602,4607 
 --- 4604,4712 
   return ACLCHECK_NO_PRIV;
   }
   
 + /*
 +  * has_role_attribute_id
 +  *  Check the named role attribute on a role by given role oid.
 +  */
 + Datum
 + has_role_attribute_id(PG_FUNCTION_ARGS)
 + {
 + Oid roleoid = PG_GETARG_OID(0);
 + text   *attr_type_text = PG_GETARG_TEXT_P(1);
 + RoleAttr

Re: [HACKERS] bug in json_to_record with arrays

2014-12-01 Thread Josh Berkus
On 11/28/2014 12:55 PM, Tom Lane wrote:
 * This would only really address Josh's complaint if we were to back-patch
 it into 9.4, but I'm hesitant to change error texts post-RC1.  Thoughts?

If we don't fix an error text in an RC, what would we fix in an RC?
That's as minor as things get.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] bug in json_to_record with arrays

2014-12-01 Thread Tom Lane
Josh Berkus j...@agliodbs.com writes:
 On 11/28/2014 12:55 PM, Tom Lane wrote:
 * This would only really address Josh's complaint if we were to back-patch
 it into 9.4, but I'm hesitant to change error texts post-RC1.  Thoughts?

 If we don't fix an error text in an RC, what would we fix in an RC?
 That's as minor as things get.

Code-wise, yeah, but it would put some pressure on the translators.

What did you think of the new error texts in themselves?

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] bug in json_to_record with arrays

2014-12-01 Thread Joshua D. Drake


On 12/01/2014 12:28 PM, Josh Berkus wrote:


On 11/28/2014 12:55 PM, Tom Lane wrote:

* This would only really address Josh's complaint if we were to back-patch
it into 9.4, but I'm hesitant to change error texts post-RC1.  Thoughts?


If we don't fix an error text in an RC, what would we fix in an RC?
That's as minor as things get.


I think the idea is that you only fix *major* things in an RC? That
said, I agree that we should fix something this minor.


JD






--
Command Prompt, Inc. - http://www.commandprompt.com/  503-667-4564
PostgreSQL Support, Training, Professional Services and Development
High Availability, Oracle Conversion, @cmdpromptinc
If we send our children to Caesar for their education, we should
 not be surprised when they come back as Romans.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] bug in json_to_record with arrays

2014-12-01 Thread Andres Freund
On 2014-12-01 15:30:06 -0500, Tom Lane wrote:
 Josh Berkus j...@agliodbs.com writes:
  On 11/28/2014 12:55 PM, Tom Lane wrote:
  * This would only really address Josh's complaint if we were to back-patch
  it into 9.4, but I'm hesitant to change error texts post-RC1.  Thoughts?
 
  If we don't fix an error text in an RC, what would we fix in an RC?
  That's as minor as things get.
 
 Code-wise, yeah, but it would put some pressure on the translators.

I think an untranslated but on-point error message is better than a
translated, confusing one.
 
Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] bug in json_to_record with arrays

2014-12-01 Thread Josh Berkus
On 12/01/2014 12:30 PM, Tom Lane wrote:
 Josh Berkus j...@agliodbs.com writes:
 On 11/28/2014 12:55 PM, Tom Lane wrote:
 * This would only really address Josh's complaint if we were to back-patch
 it into 9.4, but I'm hesitant to change error texts post-RC1.  Thoughts?
 
 If we don't fix an error text in an RC, what would we fix in an RC?
 That's as minor as things get.
 
 Code-wise, yeah, but it would put some pressure on the translators.
 
 What did you think of the new error texts in themselves?

I would prefer "invalid input syntax" to "malformed array literal", but
I'll take anything we can backpatch.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] bug in json_to_record with arrays

2014-12-01 Thread Tom Lane
Josh Berkus j...@agliodbs.com writes:
 On 12/01/2014 12:30 PM, Tom Lane wrote:
 Code-wise, yeah, but it would put some pressure on the translators.
 
 What did you think of the new error texts in themselves?

 I would prefer "invalid input syntax" to "malformed array literal", but
 I'll take anything we can backpatch.

I think if we're changing them at all, it's about the same cost
no matter what we change them to exactly.

I'd be good with going to "invalid input syntax" if we also change
record_in to match.  Anyone have a feeling pro or con about that?

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] bug in json_to_record with arrays

2014-12-01 Thread Merlin Moncure
On Mon, Dec 1, 2014 at 3:01 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Josh Berkus j...@agliodbs.com writes:
 On 12/01/2014 12:30 PM, Tom Lane wrote:
 Code-wise, yeah, but it would put some pressure on the translators.

 What did you think of the new error texts in themselves?

 I would prefer "invalid input syntax" to "malformed array literal", but
 I'll take anything we can backpatch.

 I think if we're changing them at all, it's about the same cost
 no matter what we change them to exactly.

 I'd be good with going to "invalid input syntax" if we also change
 record_in to match.  Anyone have a feeling pro or con about that?

I don't know.  "malformed array literal" is a mighty big clue that you
have a bad postgres-format array literal and will be well supported
by googling -- this problem isn't new.

merlin


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] excessive amounts of consumed memory (RSS), triggering OOM killer

2014-12-01 Thread Tomas Vondra
Hi all,

while working on the patch decreasing the amount of memory consumed by
array_agg [1], I've run into some strange OOM issues. Reproducing them
using the attached SQL script is rather simple.

  [1] https://commitfest.postgresql.org/action/patch_view?id=1652

At first I thought there's some rare hidden memory leak, but I'm pretty
sure that's not the case. I've even put some explicit counters into
aset.c counting allocated/freed blocks, and it seems to be working just
fine (and matching the context tree perfectly). So no memory leak.

The only thing I can think of is some interaction (fragmentation?) with
the system allocator (I'm using glibc-2.19 and kernel 3.16.1, btw),
making the RSS values even less useful than I thought. Sadly, it seems
to trigger the OOM killer :-(

It's entirely possible that this is a known behavior of the allocator,
and I've been unaware of it. It's also true that the work_mem values are
really excessive, and the actual values would be much lower, which would
make the issues much less serious.

To demonstrate the problem I'll use the attached SQL script - it's a
rather simple script generating sample tables and then executing trivial
array_agg() queries.

The script sets work_mem to two different values:

  work_mem = '1GB'    ==>  this limits the INSERT (generating data)

  work_mem = '1024GB' ==>  bogus value, forcing a hash aggregate
                           (assuming the hash table fits into memory)

The size of the sample tables (and the amount of memory needed for the
hash aggregate) is determined by the first parameter set in the script.
With this value

  \set size 1

I get an OOM crash on the first execution of the SQL script (on a machine
with 8GB of RAM and 512MB shared buffers), but YMMV.

The problem is that even with a much smaller dataset (say, using size
7500) you'll get an OOM error after several executions of the script.
How many executions are needed seems to be inversely proportional to the
size of the data set.

The problem is that the RSS amount is increasing over time for some
reason. For example with the size = 5000, the memory stats for the
process look like this over the first few minutes:

 VIRT     RES     SHR     %CPU %MEM  TIME+   COMMAND
  5045508 2,818g 187220  51,9 36,6   0:15.39 postgres: INSERT
  5045508 3,600g 214868  62,8 46,8   3:11.58 postgres: INSERT
  5045508 3,771g 214868  50,9 49,0   3:40.03 postgres: INSERT
  5045508 3,840g 214868  48,5 49,9   4:00.71 postgres: INSERT
  5045508 3,978g 214880  51,5 51,7   4:40.73 postgres: INSERT
  5045508 4,107g 214892  53,2 53,4   5:22.04 postgres: INSERT
  5045508 4,297g 215276  53,9 55,9   6:22.63 postgres: INSERT
  ...

Those are rows for the backend process, captured from top over time.
How long the backend was running is in the TIME column. Each iteration
takes ~30 seconds, so those lines represent approximately iterations 1,
6, 7, 8, 11, etc.

Notice how the RSS value grows over time, and also notice that this is
the INSERT, restricted by work_mem=1GB. So the memory consumption should
be ~1.5GB, and MemoryContextStats(TopMemoryContext) collected at this
point is consistent with that (see the mem-ctx.log).

And then the value stabilizes at ~4,430g and stops growing. With size
7500, however, it takes only ~20 iterations to reach the OOM issue, with a
crash log like this:

[10560.843547] Killed process 15862 (postgres) total-vm:7198260kB,
anon-rss:6494136kB, file-rss:300436kB

So, any ideas what might be the culprit here?

As I said, this is clearly made worse by inappropriately high work_mem
values, but I'm not sure it's completely harmless. Imagine, for example,
long-running backends executing complex queries with inaccurate
estimates. That may easily result in using much more memory than the
work_mem limit, with the RSS value increasing over time.

regards
Tomas


array-agg.sql
Description: application/sql
TopMemoryContext: 136614192 total in 16678 blocks; 136005936 free (500017 chunks); 608256 used
  TopTransactionContext: 8192 total in 1 blocks; 7904 free (0 chunks); 288 used
  TableSpace cache: 8192 total in 1 blocks; 3216 free (0 chunks); 4976 used
  Attopt cache: 8192 total in 1 blocks; 2704 free (0 chunks); 5488 used
  Type information cache: 24240 total in 2 blocks; 3744 free (0 chunks); 20496 used
  Operator lookup cache: 24576 total in 2 blocks; 11888 free (5 chunks); 12688 used
  MessageContext: 32768 total in 3 blocks; 9640 free (0 chunks); 23128 used
  Operator class cache: 8192 total in 1 blocks; 1680 free (0 chunks); 6512 used
  smgr relation table: 24576 total in 2 blocks; 9808 free (4 chunks); 14768 used
  TransactionAbortContext: 32768 total in 1 blocks; 32736 free (0 chunks); 32 used
  Portal hash: 8192 total in 1 blocks; 1680 free (0 chunks); 6512 used
  PortalMemory: 8192 total in 1 blocks; 7888 free (0 chunks); 304 used
PortalHeapMemory: 1024 total in 1 blocks; 848 free (0 chunks); 176 used
  ExecutorState: 318758960 total in 40 blocks; 5879664 free (8 chunks); 

Re: [HACKERS] excessive amounts of consumed memory (RSS), triggering OOM killer

2014-12-01 Thread Andrew Dunstan


On 12/01/2014 05:39 PM, Tomas Vondra wrote:

Hi all,

while working on the patch decreasing the amount of memory consumed by
array_agg [1], I've run into some strange OOM issues. Reproducing them
using the attached SQL script is rather simple.

   [1] https://commitfest.postgresql.org/action/patch_view?id=1652

At first I thought there's some rare hidden memory leak, but I'm pretty
sure that's not the case. I've even put some explicit counters into
aset.c counting allocated/freed blocks, and it seems to be working just
fine (and matching the context tree perfectly). So no memory leak.



Doesn't this line:

   TopMemoryContext: 136614192 total in 16678 blocks; 136005936 free (500017 
chunks); 608256 used


look pretty suspicious?


cheers

andrew


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] excessive amounts of consumed memory (RSS), triggering OOM killer

2014-12-01 Thread Tomas Vondra
On 2.12.2014 00:31, Andrew Dunstan wrote:
 
 On 12/01/2014 05:39 PM, Tomas Vondra wrote:
 Hi all,

 while working on the patch decreasing the amount of memory consumed by
 array_agg [1], I've run into some strange OOM issues. Reproducing them
 using the attached SQL script is rather simple.

[1] https://commitfest.postgresql.org/action/patch_view?id=1652

 At first I thought there's some rare hidden memory leak, but I'm pretty
 sure that's not the case. I've even put some explicit counters into
 aset.c counting allocated/freed blocks, and it seems to be working just
 fine (and matching the context tree perfectly). So no memory leak.
 
 
 Doesn't this line:
 
TopMemoryContext: 136614192 total in 16678 blocks; 136005936 free
 (500017 chunks); 608256 used
 
 
 look pretty suspicious?

It certainly looks a bit large, especially considering this is a fresh
cluster with a single DB, containing a single table (created by user),
no doubt about that.

For comparison, this is a new backend:

TopMemoryContext: 78128 total in 11 blocks; 8528 free (16 chunks); 69600
used

OTOH, it's just 130MB, and the RSS values are ~6GB when the OOM hits.


For the record, I only tweaked these settings in the config file:

test=# select name, setting from pg_settings where source like
'%configuration file%';
            name            |      setting
----------------------------+-------------------
 checkpoint_segments        | 32
 DateStyle                  | ISO, DMY
 default_text_search_config | pg_catalog.simple
 dynamic_shared_memory_type | posix
 lc_messages                | cs_CZ.UTF-8
 lc_monetary                | cs_CZ.UTF-8
 lc_numeric                 | cs_CZ.UTF-8
 lc_time                    | cs_CZ.UTF-8
 log_timezone               | Europe/Prague
 maintenance_work_mem       | 524288
 max_connections            | 100
 shared_buffers             | 65536
 TimeZone                   | Europe/Prague
 work_mem                   | 1048576
(14 rows)

Some of those are set by the initdb, of course.

regards
Tomas


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] bug in json_to_record with arrays

2014-12-01 Thread Tom Lane
Merlin Moncure mmonc...@gmail.com writes:
 On Mon, Dec 1, 2014 at 3:01 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 I'd be good with going to "invalid input syntax" if we also change
 record_in to match.  Anyone have a feeling pro or con about that?

 I don't know.  "malformed array literal" is a mighty big clue that you
 have a bad postgres-format array literal and will be well supported
 by googling -- this problem isn't new.

Good point about googling, and after all we are already using "malformed
array literal" for a subset of these cases.  Let's stick with that
wording.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] excessive amounts of consumed memory (RSS), triggering OOM killer

2014-12-01 Thread Tom Lane
Tomas Vondra t...@fuzzy.cz writes:
 On 2.12.2014 00:31, Andrew Dunstan wrote:
 Doesn't this line:
 TopMemoryContext: 136614192 total in 16678 blocks; 136005936 free
 (500017 chunks); 608256 used
 look pretty suspicious?

 It certainly looks a bit large, especially considering this is a fresh
 cluster with a single DB, containing a single table (created by user),
 no doubt about that.
 OTOH, it's just 130MB, and the RSS values are ~6GB when the OOM hits.

Yeah, but what was that 130MB being used for?  It's not generally
considered good practice to put large or random stuff in TopMemoryContext
--- arguably, any code doing so is broken already, because almost by
definition such allocations won't be reclaimed on error.

What I suspect you're looking at here is the detritus of creation of a
huge number of memory contexts.  mcxt.c keeps its own state about existing
contents in TopMemoryContext.  So, if we posit that those contexts weren't
real small, there's certainly room to believe that there was some major
memory bloat going on recently.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Turning recovery.conf into GUCs

2014-12-01 Thread Michael Paquier
On Tue, Dec 2, 2014 at 1:58 AM, Alex Shulgin a...@commandprompt.com wrote:
 Here's the patch rebased against current HEAD, that is including the
 recently committed action_at_recovery_target option.
If this patch gets in, it gives a good argument to jump to 10.0 IMO.
That's not a bad thing, only the cost of making recovery params as
GUCs which is still a feature wanted.

 The default for the new GUC is 'pause', as in HEAD, and
 pause_at_recovery_target is removed completely in favor of it.
Makes sense. Another idea that popped out was to rename this parameter
as recovery_target_action as well, but that's not really something
this patch should care about.

 I've also taken the liberty to remove that part that errors out when
 finding $PGDATA/recovery.conf.
I am not in favor of this part. It may be better to let the users know
that their old configuration is not valid anymore with an error. This
patch cuts in the flesh with a huge axe, let's be sure that users do
not ignore the side pain effects, or recovery.conf would be simply
ignored and users would not be aware of that.
Regards,
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] excessive amounts of consumed memory (RSS), triggering OOM killer

2014-12-01 Thread Tomas Vondra
On 2.12.2014 01:33, Tom Lane wrote:
 Tomas Vondra t...@fuzzy.cz writes:
 On 2.12.2014 00:31, Andrew Dunstan wrote:
 Doesn't this line:
 TopMemoryContext: 136614192 total in 16678 blocks; 136005936 free
 (500017 chunks); 608256 used
 look pretty suspicious?
 
 It certainly looks a bit large, especially considering this is a fresh
 cluster with a single DB, containing a single table (created by user),
 no doubt about that.
 OTOH, it's just 130MB, and the RSS values are ~6GB when the OOM hits.
 
 Yeah, but what was that 130MB being used for?  It's not generally
 considered good practice to put large or random stuff in TopMemoryContext
 --- arguably, any code doing so is broken already, because almost by
 definition such allocations won't be reclaimed on error.
 
 What I suspect you're looking at here is the detritus of creation of
 a huge number of memory contexts. mcxt.c keeps its own state about
 existing contexts in TopMemoryContext. So, if we posit that those
 contexts weren't real small, there's certainly room to believe that
 there was some major memory bloat going on recently.

Aha! MemoryContextCreate allocates the memory for the new context from
TopMemoryContext explicitly, so that it survives resets of the parent
context. Is that what you had in mind by keeping state about existing
contexts?

That'd probably explain the TopMemoryContext size, because array_agg()
creates a separate context for each group. So if you have 1M groups, you
have 1M contexts. Although I don't see how the size of those contexts
would matter?
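For context, what each group does in the array_agg case is roughly this
(9.4-era API; a sketch abridged from accumArrayResult, where rcontext is
the caller-supplied parent):

	/*
	 * Each group gets its own allocation context.  The context header
	 * itself is allocated by mcxt.c out of TopMemoryContext, so a million
	 * groups leave a million headers there, even though each context's
	 * 8kB initial block is accounted to the context itself.
	 */
	arr_context = AllocSetContextCreate(rcontext,
										"accumArrayResult",
										ALLOCSET_DEFAULT_MINSIZE,
										ALLOCSET_DEFAULT_INITSIZE,
										ALLOCSET_DEFAULT_MAXSIZE);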

Maybe we could move this info (the list of child contexts) to the parent
context somehow, so that it's freed when the context is destroyed? Aside
from the regular blocks, of course. But maybe it's not worth the effort,
because you can only have so many memory contexts created concurrently.
For AllocSet it's at most RAM/1kB (because that's the minimum block
size) before the OOM intervenes, and in this case the contexts have 8kB
initial size, so the number of contexts is even lower.

Also, this explains the TopMemoryContext size, but not the RSS size (or
am I missing something)?

regards
Tomas


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] excessive amounts of consumed memory (RSS), triggering OOM killer

2014-12-01 Thread Tom Lane
Tomas Vondra t...@fuzzy.cz writes:
 On 2.12.2014 01:33, Tom Lane wrote:
 What I suspect you're looking at here is the detritus of creation of
 a huge number of memory contexts. mcxt.c keeps its own state about
  existing contexts in TopMemoryContext. So, if we posit that those
 contexts weren't real small, there's certainly room to believe that
 there was some major memory bloat going on recently.

 Aha! MemoryContextCreate allocates the memory for the new context from
 TopMemoryContext explicitly, so that it survives resets of the parent
 context. Is that what you had in mind by keeping state about existing
 contexts?

Right.

  That'd probably explain the TopMemoryContext size, because array_agg()
  creates a separate context for each group. So if you have 1M groups, you
  have 1M contexts. Although I don't see how the size of those contexts
  would matter?

Well, if they're each 6K, that's your 6GB right there.

 Maybe we could move this info (list of child contexts) to the parent
 context somehow, so that it's freed when the context is destroyed?

We intentionally didn't do that, because in many cases it'd result in
parent contexts becoming nonempty when otherwise they'd never have
anything actually in them.  The idea was that such shell parent contexts
should be cheap, requiring only a control block in TopMemoryContext and
not an actual allocation arena.  This idea of a million separate child
contexts was never part of the design of course; we might need to rethink
whether that's a good idea.  Or maybe there need to be two different
policies about where to put child control blocks.

 Also, this explains the TopMemoryContext size, but not the RSS size (or
 am I missing something)?

Very possibly you're left with islands that prevent reclaiming very
much of the peak RAM usage.  It'd be hard to be sure without some sort
of memory map, of course.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Role Attribute Bitmask Catalog Representation

2014-12-01 Thread Adam Brightwell
Stephen,

Thanks for this.  Found time to do more of a review and have a few
 comments:


Thanks for taking a look at this and for the feedback.

I'd put the superuser_arg() check in role_has_attribute.


Ok.  Though, this would affect how CATUPDATE is handled.  Peter Eisentraut
previously raised a question about whether superuser checks should be
included with catupdate, which led me to create the following post:

http://www.postgresql.org/message-id/cakrt6cqovt2kiykg2gff7h9k8+jvu1149zlb0extkkk7taq...@mail.gmail.com

Certainly, we could keep has_rolcatupdate for this case and put the
superuser check in role_has_attribute, but it seems like it might be worth
taking a look at whether a superuser can bypass catupdate or not.  Just a
thought.

And then ditch the individual has_X_privilege() calls (I note that you
 didn't appear to add back the has_rolcatupdate() function..).


Ok.  I had originally thought that for this patch I would try to minimize
these types of changes, though perhaps this is the opportunity previously
mentioned to refactor those.  However, the catupdate question still
remains.

 + static RoleAttr
  + set_role_attribute(RoleAttr attributes, RoleAttr attribute)
  + {
  + return ((attributes & ~(0x)) | attribute);
  + }

 I don't think this is doing what you think it is..?  It certainly isn't
 working as expected by AlterRole().  Rather than using a function for
 these relatively simple operations, why not just have AlterRole() do:

 if (isX >= 0)
     attributes = (isX > 0) ? attributes | ROLE_ATTR_X
                            : attributes & ~(ROLE_ATTR_X);


Yes, this was originally my first approach.  I don't recollect why I
decided on the other route, but I agree that it is preferable and simpler.


 Otherwise, you'd probably need to pass a flag into set_role_attribute()
 to indicate if you're setting or unsetting the bit, or have an
 'unset_role_attribute()' function, but all of that seems like overkill..


Agreed.

We don't bother with the '> 0' in any of the existing bit testing that
 goes on in the backend, so I don't think it makes sense to start now.
 Just do

 if (attributes & ROLE_ATTR_SUPERUSER) ... etc

 and be done with it.


Ok, easy enough.

Why not have this as 'pg_has_role_id_attribute' and then have a
 'pg_has_role_name_attribute' also?  Seems like going with the pg_
 namespace is the direction we want to go in, though I'm not inclined to
 argue about it if folks prefer has_X.


I have no reason for one over the other, though I did ask myself that
question.  I did find it curious that in some cases there is has_X and
then in others pg_has_X.  Perhaps I'm not looking in the right places,
but I haven't found anything that helps to distinguish when one vs the
other is appropriate (even if it is a general rule of thumb).

Seems like you could just make temp_array be N_ROLE_ATTRIBUTES in
 size, instead of building a list, counting it, and then building the
 array from that..?


Yes, this is true.

Do we really need to keep has_rolinherit for any reason..?  Seems
 unnecessary given it's only got one call site and it's nearly the same
 as a call to role_has_attribute() anyway.  Ditto with
 has_rolreplication().


I really don't see any reason and as above, I can certainly do those
refactors now if that is what is desired.

Thought we were getting rid of this..?

  ! #define N_ROLE_ATTRIBUTES   8   /* 1 plus the last 1<<x */
  ! #define ROLE_ATTR_NONE  0
  ! #define ROLE_ATTR_ALL   255 /* All currently available attributes. */

 Or:

 #define ROLE_ATTR_ALL  (1<<N_ROLE_ATTRIBUTES)-1


Yes, we were, however the latter causes a syntax error with initdb. :-/

 ! DATA(insert OID = 10 ( "POSTGRES" PGROLATTRALL -1 _null_ _null_));
 
#define BOOTSTRAP_SUPERUSERID 10

 Is it actually necessary to do this substitution when the value is
 #define'd in the same .h file...?  Might be something to check, if you
 haven't already.


Yes, I had previously checked this, I get the following error from initdb.

FATAL:  invalid input syntax for integer: "ROLE_ATTR_ALL"

 + #define ROLE_ATTR_SUPERUSER (1<<0)
  + #define ROLE_ATTR_INHERIT   (1<<1)
  + #define ROLE_ATTR_CREATEROLE(1<<2)
  + #define ROLE_ATTR_CREATEDB  (1<<3)
  + #define ROLE_ATTR_CATUPDATE (1<<4)
  + #define ROLE_ATTR_CANLOGIN  (1<<5)
  + #define ROLE_ATTR_REPLICATION   (1<<6)
  + #define ROLE_ATTR_BYPASSRLS (1<<7)
  + #define N_ROLE_ATTRIBUTES   8
  + #define ROLE_ATTR_NONE  0

 These shouldn't need to be here any more..?


No they shouldn't, not sure how I missed that.

Thanks,
Adam

-- 
Adam Brightwell - adam.brightw...@crunchydatasolutions.com
Database Engineer - www.crunchydatasolutions.com


Re: [HACKERS] tracking commit timestamps

2014-12-01 Thread Alvaro Herrera
Petr Jelinek wrote:
 On 25/11/14 16:30, Alvaro Herrera wrote:
 Petr Jelinek wrote:
 On 25/11/14 16:23, Alvaro Herrera wrote:
 Robert Haas wrote:
 
 Maybe 0 should get translated to a NULL return, instead of a bogus 
 timestamp.
 
 That's one idea --- surely no transaction is going to commit at 00:00:00
 on 2000-01-01 anymore.  Yet this is somewhat discomforting.
 
 I solved it for xids that are out of range by returning -infinity and then
 changing that to NULL in sql interface, but no idea how to do that for
 aborted transactions.
 
 I guess the idea is that we just read the value from the slru and if it
 exactly matches allballs we do the same -infinity return and translation
 to NULL.  (Do we really love this -infinity idea?  If it's just an
 internal API we can use a boolean instead.)
 
 As in returning boolean instead of void as found? That works for me
 (for the C interface).

Petr sent me privately some changes to implement this idea.  I also
reworked the tests so that they only happen on src/test/modules (getting
rid of the one in core regress) and made them work with
track_commit_timestamp both enabled and disabled; in make installcheck
they pass regardless of the installed server's setting, and in make
check they run a server with the setting enabled.

I made two more changes:
1. introduce newestCommitTs.  Original code was using lastCommitXact to
check that no future transaction is asked for, but this doesn't really
work when a long-running transaction commits, because asking for a
transaction with a higher Xid that committed earlier would raise an
error (see the sketch after this list).

2. change CommitTsControlLock to CommitTsLock as the lock that protects
the stuff we keep in ShmemVariableCache.  The Control one is for SLRU
access, and so it might be slow at times.  This is important because we
fill the checkpoint struct from values protected by that lock, so a
checkpoint might be delayed if it happens to land in the middle of a
slru IO operation.
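
To illustrate point 1, here is a toy model of the two counters; the
variable names mirror lastCommitXact/newestCommitTs but are only
stand-ins, and the plain integer comparison would have to be a
wraparound-aware xid comparison in the real code:

#include <stdint.h>
#include <stdio.h>

static uint32_t lastCommitXact;		/* xid of the most recent commit */
static uint32_t newestCommitXid;	/* highest xid that ever committed */

static void
record_commit(uint32_t xid)
{
	lastCommitXact = xid;
	if (xid > newestCommitXid)
		newestCommitXid = xid;
}

int
main(void)
{
	record_commit(200);		/* young transaction commits first */
	record_commit(100);		/* older, long-running xid commits later */

	/*
	 * lastCommitXact is now 100, so bounding lookups by it would wrongly
	 * reject xid 200 even though it committed; newestCommitXid (200) is
	 * the correct upper bound.
	 */
	printf("last=%u newest=%u\n", lastCommitXact, newestCommitXid);
	return 0;
}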

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training  Services
diff --git a/contrib/pg_upgrade/pg_upgrade.c b/contrib/pg_upgrade/pg_upgrade.c
index 3b8241b..f0a023f 100644
--- a/contrib/pg_upgrade/pg_upgrade.c
+++ b/contrib/pg_upgrade/pg_upgrade.c
@@ -423,8 +423,10 @@ copy_clog_xlog_xid(void)
 	/* set the next transaction id and epoch of the new cluster */
 	prep_status("Setting next transaction ID and epoch for new cluster");
 	exec_prog(UTILITY_LOG_FILE, NULL, true,
-			  "\"%s/pg_resetxlog\" -f -x %u \"%s\"",
-			  new_cluster.bindir, old_cluster.controldata.chkpnt_nxtxid,
+			  "\"%s/pg_resetxlog\" -f -x %u -c %u \"%s\"",
+			  new_cluster.bindir,
+			  old_cluster.controldata.chkpnt_nxtxid,
+			  old_cluster.controldata.chkpnt_nxtxid,
 			  new_cluster.pgdata);
 	exec_prog(UTILITY_LOG_FILE, NULL, true,
 			  "\"%s/pg_resetxlog\" -f -e %u \"%s\"",
diff --git a/contrib/pg_xlogdump/rmgrdesc.c b/contrib/pg_xlogdump/rmgrdesc.c
index 9397198..180818d 100644
--- a/contrib/pg_xlogdump/rmgrdesc.c
+++ b/contrib/pg_xlogdump/rmgrdesc.c
@@ -10,6 +10,7 @@
 
 #include "access/brin_xlog.h"
 #include "access/clog.h"
+#include "access/commit_ts.h"
 #include "access/gin.h"
 #include "access/gist_private.h"
 #include "access/hash.h"
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index ab8c263..e3713d3 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2673,6 +2673,20 @@ include_dir 'conf.d'
   </listitem>
  </varlistentry>
 
+ <varlistentry id="guc-track-commit-timestamp" xreflabel="track_commit_timestamp">
+  <term><varname>track_commit_timestamp</varname> (<type>bool</type>)</term>
+  <indexterm>
+   <primary><varname>track_commit_timestamp</> configuration parameter</primary>
+  </indexterm>
+  <listitem>
+   <para>
+    Record commit time of transactions. This parameter
+    can only be set in <filename>postgresql.conf</> file or on the server
+    command line. The default value is <literal>off</literal>.
+   </para>
+  </listitem>
+ </varlistentry>
+
  </variablelist>
 </sect2>
 
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index baf81ee..62ec275 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -15938,6 +15938,45 @@ SELECT collation for ('foo' COLLATE de_DE);
 For example <literal>10:20:10,14,15</literal> means
 <literal>xmin=10, xmax=20, xip_list=10, 14, 15</literal>.
   </para>
+
+   <para>
+    The functions shown in <xref linkend="functions-commit-timestamp">
+    provide information about transactions that have been already committed.
+    These functions mainly provide information about when the transactions
+    were committed. They only provide useful data when
+    <xref linkend="guc-track-commit-timestamp"> configuration option is enabled
+    and only for transactions that were committed after it was enabled.
+   </para>
+
+   <table id="functions-commit-timestamp">
+    <title>Committed transaction information</title>
+    <tgroup cols="3">
+     <thead>
+      

Re: [HACKERS] [Windows,PATCH] Use faster, higher precision timer API

2014-12-01 Thread Craig Ringer
On 12/01/2014 09:51 PM, Marco Nenciarini wrote:
 I think this is a leftover, as you don't use elog afterwards.

Good catch, fixed.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services
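
As a quick sanity check of the arithmetic in the patch below, the same
conversion can be tried host-portably; the epoch constant is the number
of 100ns intervals between 1601-01-01 (the FILETIME epoch) and the Unix
epoch, and the sample tick count is made up:

#include <stdio.h>
#include <stdint.h>

#define FILETIME_UNIX_EPOCH UINT64_C(116444736000000000)

int
main(void)
{
	/* pretend this came from GetSystemTimeAsFileTime() */
	uint64_t	ft = FILETIME_UNIX_EPOCH + UINT64_C(12345678901);

	long		sec = (long) ((ft - FILETIME_UNIX_EPOCH) / UINT64_C(10000000));
	long		usec = (long) (((ft - FILETIME_UNIX_EPOCH) % UINT64_C(10000000)) / 10);

	printf("%ld.%06ld\n", sec, usec);	/* 1234.567890 */
	return 0;
}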
From 63edf0571961f2f2c7fd6a08b99e747ee39a3377 Mon Sep 17 00:00:00 2001
From: Craig Ringer cr...@2ndquadrant.com
Date: Fri, 12 Sep 2014 12:41:35 +0800
Subject: [PATCH 1/2] Use GetSystemTimeAsFileTime directly in windows
 gettimeofday
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

PostgreSQL was calling GetSystemTime followed by SystemTimeToFileTime in the
win32 port gettimeofday function. This is not necessary and limits the reported
precision to the 1ms granularity that the SYSTEMTIME struct can represent. By
using GetSystemTimeAsFileTime we avoid unnecessary conversions and capture
timestamps at 100ns granularity, which is then rounded to 1µs granularity for
storage in a PostgreSQL timestamp.

On most Windows systems this change will actually have no significant effect as
the system timer tick is typically between 1ms and 15ms depending on what timer
resolution currently running applications have requested. You can check this
with clockres.exe from sysinternals. Despite the platform limitation this change
still permits capture of finer timestamps where the system is capable of
producing them and it gets rid of an unnecessary syscall.

Future work may permit use of GetSystemTimePreciseAsFileTime on Windows 8 and
Windows Server 2012 for higher resolution time capture. This call has the same
interface as GetSystemTimeAsFileTime.
---
 src/port/gettimeofday.c | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/src/port/gettimeofday.c b/src/port/gettimeofday.c
index 75a9199..73ec406 100644
--- a/src/port/gettimeofday.c
+++ b/src/port/gettimeofday.c
@@ -44,16 +44,14 @@ int
 gettimeofday(struct timeval * tp, struct timezone * tzp)
 {
 	FILETIME	file_time;
-	SYSTEMTIME	system_time;
 	ULARGE_INTEGER ularge;
 
-	GetSystemTime(&system_time);
-	SystemTimeToFileTime(&system_time, &file_time);
+	GetSystemTimeAsFileTime(&file_time);
 	ularge.LowPart = file_time.dwLowDateTime;
 	ularge.HighPart = file_time.dwHighDateTime;
 
 	tp->tv_sec = (long) ((ularge.QuadPart - epoch) / 10000000L);
-	tp->tv_usec = (long) (system_time.wMilliseconds * 1000);
+	tp->tv_usec = (long) (((ularge.QuadPart - epoch) % 10000000L) / 10);
 
 	return 0;
 }
-- 
1.9.3

From 035bb3b19f93560d18834d2daf397c70e1555015 Mon Sep 17 00:00:00 2001
From: Craig Ringer cr...@2ndquadrant.com
Date: Thu, 18 Sep 2014 23:02:14 +0800
Subject: [PATCH 2/2] Use GetSystemTimePreciseAsFileTime when available

This will cause PostgreSQL on Windows 8 or Windows Server 2012 to
obtain high-resolution timestamps while allowing the same binaries
to run without problems on older releases.
---
 src/backend/main/main.c |  6 ++
 src/include/port.h  |  2 ++
 src/port/gettimeofday.c | 52 +++--
 3 files changed, 58 insertions(+), 2 deletions(-)

diff --git a/src/backend/main/main.c b/src/backend/main/main.c
index c51b391..73c30c5 100644
--- a/src/backend/main/main.c
+++ b/src/backend/main/main.c
@@ -260,6 +260,12 @@ startup_hacks(const char *progname)
 
 		/* In case of general protection fault, don't show GUI popup box */
 		SetErrorMode(SEM_FAILCRITICALERRORS | SEM_NOGPFAULTERRORBOX);
+
+#ifndef HAVE_GETTIMEOFDAY
+		/* Figure out which syscall to use to capture timestamp information */
+		init_win32_gettimeofday();
+#endif
+
 	}
 #endif   /* WIN32 */
 
diff --git a/src/include/port.h b/src/include/port.h
index 94a0e2f..58677ec 100644
--- a/src/include/port.h
+++ b/src/include/port.h
@@ -328,6 +328,8 @@ extern FILE *pgwin32_popen(const char *command, const char *type);
 #ifndef HAVE_GETTIMEOFDAY
 /* Last parameter not used */
 extern int	gettimeofday(struct timeval * tp, struct timezone * tzp);
+/* On windows we need to call some backend start setup for accurate timing */
+extern void init_win32_gettimeofday(void);
 #endif
 #else			/* !WIN32 */
 
diff --git a/src/port/gettimeofday.c b/src/port/gettimeofday.c
index 73ec406..ab4f491 100644
--- a/src/port/gettimeofday.c
+++ b/src/port/gettimeofday.c
@@ -35,9 +35,57 @@
 static const unsigned __int64 epoch = UINT64CONST(116444736000000000);
 
 /*
+ * Both GetSystemTimeAsFileTime and GetSystemTimePreciseAsFileTime share a
+ * signature, so we can just store a pointer to whichever we find. This
+ * is the pointer's type.
+ */
+typedef VOID (WINAPI *PgGetSystemTimeFn)(LPFILETIME);
+/* Storage for the function we pick at runtime */
+static PgGetSystemTimeFn pg_get_system_time = NULL;
+
+/*
+ * During backend startup, determine if GetSystemTimePreciseAsFileTime is
+ * available and use it; if not, fall back to GetSystemTimeAsFileTime.
+ */
+void
+init_win32_gettimeofday(void)
+{
+	/*
+	 * Because it's guaranteed that kernel32.dll will be linked into our
+	 * address space already, 
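
A plausible completion of this probe (a sketch only, not necessarily
the patch's exact code) does a GetModuleHandle() plus GetProcAddress()
lookup and falls back to the coarser call:

#include <windows.h>

typedef VOID (WINAPI *PgGetSystemTimeFn) (LPFILETIME);
static PgGetSystemTimeFn pg_get_system_time = NULL;

void
init_win32_gettimeofday(void)
{
	/* kernel32 is always mapped, so no LoadLibrary/FreeLibrary dance */
	HMODULE		kernel32 = GetModuleHandle(TEXT("kernel32.dll"));

	if (kernel32 != NULL)
		pg_get_system_time = (PgGetSystemTimeFn)
			GetProcAddress(kernel32, "GetSystemTimePreciseAsFileTime");

	/* pre-Windows 8 / Server 2012: fall back */
	if (pg_get_system_time == NULL)
		pg_get_system_time = &GetSystemTimeAsFileTime;
}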

Re: [HACKERS] inherit support for foreign tables

2014-12-01 Thread Etsuro Fujita

(2014/11/28 18:14), Ashutosh Bapat wrote:

On Thu, Nov 27, 2014 at 3:52 PM, Etsuro Fujita
fujita.ets...@lab.ntt.co.jp wrote:
Apart from the above, I noticed that the patch doesn't consider to
call ExplainForeignModify during EXPLAIN for an inherited
UPDATE/DELETE, as shown below (note that there are no UPDATE remote
queries displayed):



So, I'd like to modify explain.c to show those queries like this:



postgres=# explain verbose update parent set a = a * 2 where a = 5;
                               QUERY PLAN
------------------------------------------------------------------------
 Update on public.parent  (cost=0.00..280.77 rows=25 width=10)
   ->  Seq Scan on public.parent  (cost=0.00..0.00 rows=1 width=10)
         Output: (parent.a * 2), parent.ctid
         Filter: (parent.a = 5)
   Remote SQL: UPDATE public.mytable_1 SET a = $2 WHERE ctid = $1
   ->  Foreign Scan on public.ft1  (cost=100.00..140.38 rows=12 width=10)
         Output: (ft1.a * 2), ft1.ctid
         Remote SQL: SELECT a, ctid FROM public.mytable_1 WHERE ((a = 5)) FOR UPDATE
   Remote SQL: UPDATE public.mytable_2 SET a = $2 WHERE ctid = $1
   ->  Foreign Scan on public.ft2  (cost=100.00..140.38 rows=12 width=10)
         Output: (ft2.a * 2), ft2.ctid
         Remote SQL: SELECT a, ctid FROM public.mytable_2 WHERE ((a = 5)) FOR UPDATE
(12 rows)



Two remote SQL under a single node would be confusing. Also the node
is labelled as Foreign Scan. It would be confusing to show an UPDATE
command under this scan node.


I thought of this as an extension of the existing (ie, non-inherited) case
(see the example below) to the inherited case.


postgres=# explain verbose update ft1 set a = a * 2 where a = 5;
                               QUERY PLAN
------------------------------------------------------------------------
 Update on public.ft1  (cost=100.00..140.38 rows=12 width=10)
   Remote SQL: UPDATE public.mytable_1 SET a = $2 WHERE ctid = $1
   ->  Foreign Scan on public.ft1  (cost=100.00..140.38 rows=12 width=10)
         Output: (a * 2), ctid
         Remote SQL: SELECT a, ctid FROM public.mytable_1 WHERE ((a = 5)) FOR UPDATE
(5 rows)

I think we should show update commands somewhere for the inherited case 
as for the non-inherited case.  Alternatives to this are welcome.



BTW, I was experimenting with DMLs being executed on multiple FDW servers
under the same transaction and found that the transactions may not be atomic
(and may be inconsistent), if one or more of the server fails to commit
while rest of them commit the transaction. The reason for this is, we do
not rollback the already committed changes to the foreign server, if
one or more of them fail to commit a transaction. With foreign tables
under inheritance hierarchy a single DML can span across multiple
servers and the result may not be atomic (and may be inconsistent). So,


IIUC, even the transactions over the local and the *single* remote 
server are not guaranteed to be executed atomically in the current form. 
 It is possible that the remote transaction succeeds and the local one 
fails, for example, resulting in data inconsistency between the local 
and the remote.



either we have to disable DMLs on an inheritance hierarchy which spans
multiple servers, or make sure that such transactions follow 2PC norms.


-1 for disabling update queries on such an inheritance hierarchy because 
I think we should leave that to the user's judgment.  And I think 2PC is 
definitely a separate patch.
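
For reference, the 2PC norms mentioned above boil down to a prepare
phase on every participant followed by a commit phase; a sketch in
plain libpq over two remote servers, with error handling and in-doubt
resolution elided (conn1/conn2 are assumed-open connections):

#include <stdio.h>
#include <stdbool.h>
#include <libpq-fe.h>

static bool
two_phase_commit(PGconn *conn1, PGconn *conn2, const char *gid)
{
	PGconn	   *conns[2];
	PGresult   *res;
	char		sql[128];
	int			i;

	conns[0] = conn1;
	conns[1] = conn2;

	/* phase 1: every participant must durably promise to commit */
	for (i = 0; i < 2; i++)
	{
		snprintf(sql, sizeof(sql), "PREPARE TRANSACTION '%s'", gid);
		res = PQexec(conns[i], sql);
		if (PQresultStatus(res) != PGRES_COMMAND_OK)
		{
			PQclear(res);
			return false;	/* caller must ROLLBACK PREPARED the rest */
		}
		PQclear(res);
	}

	/* phase 2: once all are prepared, commit everywhere */
	for (i = 0; i < 2; i++)
	{
		snprintf(sql, sizeof(sql), "COMMIT PREPARED '%s'", gid);
		PQclear(PQexec(conns[i], sql));
	}
	return true;
}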


Thanks,

Best regards,
Etsuro Fujita


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.5: Better memory accounting, towards memory-bounded HashAgg

2014-12-01 Thread Jeff Davis
On Sun, 2014-11-30 at 17:49 -0800, Peter Geoghegan wrote:
 On Mon, Nov 17, 2014 at 11:39 PM, Jeff Davis pg...@j-davis.com wrote:
  I can also just move isReset there, and keep mem_allocated as a uint64.
  That way, if I find later that I want to track the aggregated value for
  the child contexts as well, I can split it into two uint32s. I'll hold
 off on any such optimizations until I see some numbers from HashAgg
  though.
 
 I took a quick look at memory-accounting-v8.patch.
 
 Is there some reason why mem_allocated is a uint64? All other things
 being equal, I'd follow the example of tuplesort.c's
 MemoryContextAllocHuge() API, which (following bugfix commit
 79e0f87a1) uses int64 variables to track available memory and so on.

No reason. New version attached; that's the only change.

Regards,
Jeff Davis


*** a/src/backend/utils/mmgr/aset.c
--- b/src/backend/utils/mmgr/aset.c
***
*** 438,451  AllocSetContextCreate(MemoryContext parent,
  	  Size initBlockSize,
  	  Size maxBlockSize)
  {
! 	AllocSet	context;
  
  	/* Do the type-independent part of context creation */
! 	context = (AllocSet) MemoryContextCreate(T_AllocSetContext,
! 			 sizeof(AllocSetContext),
! 			 &AllocSetMethods,
! 			 parent,
! 			 name);
  
  	/*
  	 * Make sure alloc parameters are reasonable, and save them.
--- 438,454 
  	  Size initBlockSize,
  	  Size maxBlockSize)
  {
! 	AllocSet			set;
! 	MemoryContext		context;
  
  	/* Do the type-independent part of context creation */
! 	context = MemoryContextCreate(T_AllocSetContext,
!   sizeof(AllocSetContext),
!   &AllocSetMethods,
!   parent,
!   name);
! 
! 	set = (AllocSet) context;
  
  	/*
  	 * Make sure alloc parameters are reasonable, and save them.
***
*** 459,467  AllocSetContextCreate(MemoryContext parent,
  	if (maxBlockSize < initBlockSize)
  		maxBlockSize = initBlockSize;
  	Assert(AllocHugeSizeIsValid(maxBlockSize)); /* must be safe to double */
! 	context->initBlockSize = initBlockSize;
! 	context->maxBlockSize = maxBlockSize;
! 	context->nextBlockSize = initBlockSize;
  
  	/*
  	 * Compute the allocation chunk size limit for this context.  It can't be
--- 462,470 
  	if (maxBlockSize < initBlockSize)
  		maxBlockSize = initBlockSize;
  	Assert(AllocHugeSizeIsValid(maxBlockSize)); /* must be safe to double */
! 	set->initBlockSize = initBlockSize;
! 	set->maxBlockSize = maxBlockSize;
! 	set->nextBlockSize = initBlockSize;
  
  	/*
  	 * Compute the allocation chunk size limit for this context.  It can't be
***
*** 477,486  AllocSetContextCreate(MemoryContext parent,
  	 * and actually-allocated sizes of any chunk must be on the same side of
  	 * the limit, else we get confused about whether the chunk is big.
  	 */
! 	context->allocChunkLimit = ALLOC_CHUNK_LIMIT;
! 	while ((Size) (context->allocChunkLimit + ALLOC_CHUNKHDRSZ) >
 		   (Size) ((maxBlockSize - ALLOC_BLOCKHDRSZ) / ALLOC_CHUNK_FRACTION))
! 		context->allocChunkLimit >>= 1;
  
  	/*
  	 * Grab always-allocated space, if requested
--- 480,489 
  	 * and actually-allocated sizes of any chunk must be on the same side of
  	 * the limit, else we get confused about whether the chunk is big.
  	 */
! 	set->allocChunkLimit = ALLOC_CHUNK_LIMIT;
! 	while ((Size) (set->allocChunkLimit + ALLOC_CHUNKHDRSZ) >
 		   (Size) ((maxBlockSize - ALLOC_BLOCKHDRSZ) / ALLOC_CHUNK_FRACTION))
! 		set->allocChunkLimit >>= 1;
  
  	/*
  	 * Grab always-allocated space, if requested
***
*** 500,519  AllocSetContextCreate(MemoryContext parent,
  	 errdetail("Failed while creating memory context \"%s\".",
  			   name)));
  		}
! 		block->aset = context;
  		block->freeptr = ((char *) block) + ALLOC_BLOCKHDRSZ;
  		block->endptr = ((char *) block) + blksize;
! 		block->next = context->blocks;
! 		context->blocks = block;
  		/* Mark block as not to be released at reset time */
! 		context->keeper = block;
  
  		/* Mark unallocated space NOACCESS; leave the block header alone. */
  		VALGRIND_MAKE_MEM_NOACCESS(block->freeptr,
     blksize - ALLOC_BLOCKHDRSZ);
  	}
  
! 	return (MemoryContext) context;
  }
  
  /*
--- 503,525 
  	 errdetail("Failed while creating memory context \"%s\".",
  			   name)));
  		}
! 
! 		context->mem_allocated += blksize;
! 
! 		block->aset = set;
  		block->freeptr = ((char *) block) + ALLOC_BLOCKHDRSZ;
  		block->endptr = ((char *) block) + blksize;
! 		block->next = set->blocks;
! 		set->blocks = block;
  		/* Mark block as not to be released at reset time */
! 		set->keeper = block;
  
  		/* Mark unallocated space NOACCESS; leave the block header alone. */
  		VALGRIND_MAKE_MEM_NOACCESS(block->freeptr,
     blksize - ALLOC_BLOCKHDRSZ);
  	}
  
! 	return context;
  }
  
  /*
***
*** 590,595  AllocSetReset(MemoryContext context)
--- 596,603 
  		else
  		{
  			/* Normal case, release the block */
+ 			context->mem_allocated 

Re: [HACKERS] Proposal : REINDEX SCHEMA

2014-12-01 Thread Michael Paquier
On Mon, Dec 1, 2014 at 11:29 PM, Sawada Masahiko sawada.m...@gmail.com wrote:
 On Thu, Nov 27, 2014 at 11:43 PM, Michael Paquier
 michael.paqu...@gmail.com wrote:
 If you get that done, I'll have an extra look at it and then let's
 have a committer look at it.

 Attached patch is latest version.
 I added new enum values like REINDEX_OBJECT_XXX,
 and changed ReindexObject() function.
 Also ACL problem is fixed for this version.

Thanks for the updated version.
I have been looking at this patch more deeply, and I noticed a couple
of things that were missing or mishandled:
- The patch completely ignored that a catalog schema could be reindexed
- When reindexing pg_catalog as a schema, it is critical to reindex
pg_class first, as reindex_relation updates pg_class (see the sketch below)
- gram.y generated some warnings because ReindexObjectType stuff was
casted as ObjectType.
- There was a memory leak, the scan keys palloc'd  in ReindexObject
were not pfree'd
- Using do_user, do_system and now do_database was really confusing
and reduced code readability. I reworked it using the object kind
- The patch to support SCHEMA in reindexdb has been forgotten.
Reattaching it here.
Adding on top of that a couple of things cleaned up, like docs and
typos, and I got the patch attached. Let's have a committer have a
look at it now; I am marking that as Ready for Committer.
Regards,
-- 
Michael
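
The pg_class-first ordering could be sketched like this (illustrative
names; the attached patch does the equivalent inside ReindexObject):

#include "postgres.h"
#include "catalog/pg_class.h"
#include "catalog/pg_namespace.h"
#include "nodes/pg_list.h"

/*
 * Force pg_class to the head of the relation list when reindexing
 * pg_catalog, since reindex_relation() updates pg_class and must see
 * it with valid indexes first.
 */
static List *
order_relids_for_reindex(List *relids, Oid namespaceOid)
{
	if (namespaceOid == PG_CATALOG_NAMESPACE)
	{
		relids = list_delete_oid(relids, RelationRelationId);
		relids = lcons_oid(RelationRelationId, relids);
	}
	return relids;
}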
From 1e384f579ffa865c364616573bd8767a24306c44 Mon Sep 17 00:00:00 2001
From: Michael Paquier mich...@otacoo.com
Date: Tue, 2 Dec 2014 15:29:36 +0900
Subject: [PATCH 1/2] Support for REINDEX SCHEMA

When pg_catalog is reindexed, note that pg_class is reindexed first
for a behavior consistent with the cases of DATABASE and SYSTEM.
---
 doc/src/sgml/ref/reindex.sgml  |  15 +++-
 src/backend/commands/indexcmds.c   | 110 +
 src/backend/parser/gram.y  |  35 +
 src/backend/tcop/utility.c |  15 ++--
 src/bin/psql/tab-complete.c|   4 +-
 src/include/commands/defrem.h  |   3 +-
 src/include/nodes/parsenodes.h |  11 ++-
 src/test/regress/expected/create_index.out |  31 
 src/test/regress/sql/create_index.sql  |  23 ++
 9 files changed, 191 insertions(+), 56 deletions(-)

diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index cabae19..0a4c7d4 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -21,7 +21,7 @@ PostgreSQL documentation
 
 <refsynopsisdiv>
<synopsis>
-REINDEX { INDEX | TABLE | DATABASE | SYSTEM } <replaceable class="PARAMETER">name</replaceable> [ FORCE ]
+REINDEX { INDEX | TABLE | SCHEMA | DATABASE | SYSTEM } <replaceable class="PARAMETER">name</replaceable> [ FORCE ]
 </synopsis>
 </refsynopsisdiv>
 
@@ -101,6 +101,19 @@ REINDEX { INDEX | TABLE | DATABASE | SYSTEM } <replaceable class="PARAMETER">nam
    </varlistentry>
 
    <varlistentry>
+    <term><literal>SCHEMA</literal></term>
+    <listitem>
+     <para>
+      Recreate all indexes of the specified schema.  If a table of this
+      schema has a secondary <quote>TOAST</> table, that is reindexed as
+      well. Indexes on shared system catalogs are also processed.
+      This form of <command>REINDEX</command> cannot be executed inside a
+      transaction block.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
    <term><literal>DATABASE</literal></term>
    <listitem>
     <para>
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 12b4ac7..a3e8a15 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1777,34 +1777,58 @@ ReindexTable(RangeVar *relation)
 }
 
 /*
- * ReindexDatabase
- *		Recreate indexes of a database.
+ * ReindexObject
+ *		Recreate indexes of object whose type is defined by objectKind.
  *
  * To reduce the probability of deadlocks, each table is reindexed in a
  * separate transaction, so we can release the lock on it right away.
  * That means this must not be called within a user transaction block!
  */
 Oid
-ReindexDatabase(const char *databaseName, bool do_system, bool do_user)
+ReindexObject(const char *objectName, ReindexObjectType objectKind)
 {
+	Oid			objectOid;
 	Relation	relationRelation;
 	HeapScanDesc scan;
+	ScanKeyData	*scan_keys = NULL;
 	HeapTuple	tuple;
 	MemoryContext private_context;
 	MemoryContext old;
 	List	   *relids = NIL;
 	ListCell   *l;
+	int	num_keys;
 
-	AssertArg(databaseName);
+	AssertArg(objectName);
+	Assert(objectKind == REINDEX_OBJECT_SCHEMA ||
+		   objectKind == REINDEX_OBJECT_SYSTEM ||
+		   objectKind == REINDEX_OBJECT_DATABASE);
 
-	if (strcmp(databaseName, get_database_name(MyDatabaseId)) != 0)
-		ereport(ERROR,
-				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-				 errmsg("can only reindex the currently open database")));
+	/*
+	 * Get OID of object to reindex, being the database currently being
+	 * used by session for a database or for system catalogs, or the schema
+	 * defined by caller. At the same time do 

Re: [HACKERS] New Event Trigger: table_rewrite

2014-12-01 Thread Michael Paquier
On Thu, Nov 20, 2014 at 10:37 PM, Dimitri Fontaine
dimi...@2ndquadrant.fr wrote:
 Alvaro Herrera alvhe...@2ndquadrant.com writes:
 CLUSTER and VACUUM are not part of the supported commands anymore, so
 I think that we could replace that by the addition of a reference
 number in the cell of ALTER TABLE for the event table_rewrite and
 write at the bottom of the table a description of how this event
 behaves with ALTER TABLE. Note as well that might or might not is
 not really helpful for the user.

 That's precisely why we have an event trigger here, I think --- for some
 subcommands, it's not easy to determine whether a rewrite happens or
 not.  (I think SET TYPE is the one).  I don't think we want to document
 precisely under what condition a rewrite takes place.

 Yeah, the current documentation expands to the following sentence, as
 browsed in

   http://www.postgresql.org/docs/9.3/interactive/sql-altertable.html

   As an exception, if the USING clause does not change the column
   contents and the old type is either binary coercible to the new type
   or an unconstrained domain over the new type, a table rewrite is not
   needed, but any indexes on the affected columns must still be rebuilt.

 I don't think that "might or might not" is less helpful in the context
 of the Event Trigger, because the whole point is that the event is only
 triggered in case of a rewrite. Of course we could cross link the two
 paragraphs or something.

 2) The examples of SQL queries provided are still in lower case in the
 docs, that's contrary to the rest of the docs where upper case is used
 for reserved keywords.

 Right, being consistent trumps personal preferences, changed in the
 attached.

 Yes please.  <nitpick> Another thing in that sample code is "not current_hour
 between 1 and 6".  That reads strange to me.  It should be equally
 correct to spell it as "current_hour not between 1 and 6", which seems
 more natural. </nitpick>

 True, fixed in the attached.
The status of this patch was not updated on the commit fest app, so I
lost track of it. Sorry for not answering earlier btw.

The following things to note about v5:
1) There are still mentions of VACUUM, ANALYZE and CLUSTER:
@@ -264,6 +275,10 @@ check_ddl_tag(const char *tag)
obtypename = tag + 6;
else if (pg_strncasecmp(tag, "DROP ", 5) == 0)
obtypename = tag + 5;
+   else if (pg_strncasecmp(tag, "ANALYZE", 7) == 0 ||
+            pg_strncasecmp(tag, "CLUSTER", 7) == 0 ||
+            pg_strncasecmp(tag, "VACUUM", 6) == 0)
+       return EVENT_TRIGGER_COMMAND_TAG_OK;
2) There are a couple of typos and incorrect styling, like "if(". Nothing huge.
Cleanup is done in the attached.

In any case, all the issues mentioned seem to have been addressed, so
switching this patch to ready for committer.
Regards,
-- 
Michael
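
For readers following along, the cleaned-up tag check could look
roughly like this sketch; whether ALTER TYPE is accepted alongside
ALTER TABLE, and the exact not-recognized enum value, follow the
conventions of the existing check_ddl_tag():

/* only commands that can actually rewrite a table qualify */
static event_trigger_command_tag_check_result
check_table_rewrite_ddl_tag(const char *tag)
{
	if (pg_strcasecmp(tag, "ALTER TABLE") == 0 ||
		pg_strcasecmp(tag, "ALTER TYPE") == 0)
		return EVENT_TRIGGER_COMMAND_TAG_OK;

	return EVENT_TRIGGER_COMMAND_TAG_NOT_RECOGNIZED;
}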
diff --git a/src/backend/commands/event_trigger.c b/src/backend/commands/event_trigger.c
index 6a3002f..17b0818 100644
--- a/src/backend/commands/event_trigger.c
+++ b/src/backend/commands/event_trigger.c
@@ -42,11 +42,14 @@
#include "utils/syscache.h"
#include "tcop/utility.h"
 
-
+/*
+ * Data Structure for sql_drop and table_rewrite Event Trigger support.
+ */
 typedef struct EventTriggerQueryState
 {
 	slist_head	SQLDropList;
 	bool		in_sql_drop;
+	Oid			table_rewrite_oid;
 	MemoryContext cxt;
 	struct EventTriggerQueryState *previous;
 } EventTriggerQueryState;
@@ -119,11 +122,14 @@ static void AlterEventTriggerOwner_internal(Relation rel,
 HeapTuple tup,
 Oid newOwnerId);
 static event_trigger_command_tag_check_result check_ddl_tag(const char *tag);
+static event_trigger_command_tag_check_result check_table_rewrite_ddl_tag(
+	const char *tag);
 static void error_duplicate_filter_variable(const char *defname);
 static Datum filter_list_to_array(List *filterlist);
 static Oid insert_event_trigger_tuple(char *trigname, char *eventname,
 		   Oid evtOwner, Oid funcoid, List *tags);
 static void validate_ddl_tags(const char *filtervar, List *taglist);
+static void validate_table_rewrite_tags(const char *filtervar, List *taglist);
 static void EventTriggerInvoke(List *fn_oid_list, EventTriggerData *trigdata);
 
 /*
@@ -154,7 +160,8 @@ CreateEventTrigger(CreateEventTrigStmt *stmt)
 	/* Validate event name. */
 	if (strcmp(stmt->eventname, "ddl_command_start") != 0 &&
 		strcmp(stmt->eventname, "ddl_command_end") != 0 &&
-		strcmp(stmt->eventname, "sql_drop") != 0)
+		strcmp(stmt->eventname, "sql_drop") != 0 &&
+		strcmp(stmt->eventname, "table_rewrite") != 0)
 		ereport(ERROR,
 				(errcode(ERRCODE_SYNTAX_ERROR),
 				 errmsg("unrecognized event name \"%s\"",
@@ -183,6 +190,9 @@ CreateEventTrigger(CreateEventTrigStmt *stmt)
 		 strcmp(stmt->eventname, "sql_drop") == 0) &&
 		tags != NULL)
 		validate_ddl_tags("tag", tags);
+	else if (strcmp(stmt->eventname, "table_rewrite") == 0 &&
+			 tags != NULL)
+		validate_table_rewrite_tags("tag", tags);
 
 	/*
 	 * Give user a nice error message if an event trigger of the same name