Re: [HACKERS] Buildfarm not happy with test module move

2014-12-01 Thread Magnus Hagander
On Mon, Dec 1, 2014 at 2:22 AM, Tom Lane  wrote:
> Alvaro Herrera  writes:
>> That was my fault -- the alvh.no-ip.org domain was deleted, and the
>> email system in postgresql.org rejected the commit message because the
>> sender was not in a deliverable domain.  I noticed within a few hours,
>> but the damage was already done.
>
> I guess I'm confused about which is your preferred email address?

Yeah. I was assuming the alvh.no-ip.org was going to come back, as
that's what you're still subscribed to critical mailing lists with :) Do
you need that one changed?

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/




Re: [HACKERS] Typo/spacing fix for "29.1. Reliability"

2014-12-01 Thread Magnus Hagander
On Mon, Dec 1, 2014 at 1:31 AM, Ian Barwick  wrote:
> This fixes a missing space here:
>
>   http://www.postgresql.org/docs/devel/static/wal-reliability.html
>
> (present in 9.3 onwards).

Applied, thanks!


-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/




Re: [HACKERS] Buildfarm not happy with test module move

2014-12-01 Thread Alvaro Herrera
Magnus Hagander wrote:
> On Mon, Dec 1, 2014 at 2:22 AM, Tom Lane  wrote:
> > Alvaro Herrera  writes:
> >> That was my fault -- the alvh.no-ip.org domain was deleted, and the
> >> email system in postgresql.org rejected the commit message because the
> >> sender was not in a deliverable domain.  I noticed within a few hours,
> >> but the damage was already done.
> >
> > I guess I'm confused about which is your preferred email address?
> 
> Yeah. I was assuming the alvh.no-ip.org was going to come back, as
> that's what you're still subscribed to critical mailing lists with :) Do
> you need that one changed?

My preferred email address continues to be alvhe...@alvh.no-ip.org.  It
was deleted by mistake; it wasn't intended.  (If you must know, it's a
new idiotic "feature" by no-ip.com, coupled with my server machine at home
dying because of a "stuck" SSD, which I thought was non-critical so I
didn't replace it for over half a year, coupled with a change in laptop;
goes to show that the non-critical machine wasn't really non-critical
after all, and that I'm juggling way too many things.)

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Buildfarm not happy with test module move

2014-12-01 Thread Magnus Hagander
On Mon, Dec 1, 2014 at 1:41 PM, Alvaro Herrera  wrote:
> Magnus Hagander wrote:
>> On Mon, Dec 1, 2014 at 2:22 AM, Tom Lane  wrote:
>> > Alvaro Herrera  writes:
>> >> That was my fault -- the alvh.no-ip.org domain was deleted, and the
>> >> email system in postgresql.org rejected the commit message because the
>> >> sender was not in a deliverable domain.  I noticed within a few hours,
>> >> but the damage was already done.
>> >
>> > I guess I'm confused about which is your preferred email address?
>>
>> Yeah. I was assuming the alvh.no-ip.org was going to come back, as
>> that's what you're still subscribed to critical mailing lists with :) Do
>> you need that one changed?
>
> My preferred email address continues to be alvhe...@alvh.no-ip.org.  It
> was deleted by mistake; it wasn't intended.  (If you must know, it's a
> new idiotic "feature" by no-ip.com, coupled with my server machine at home
> dying because of a "stuck" SSD, which I thought was non-critical so I
> didn't replace it for over half a year, coupled with a change in laptop;
> goes to show that the non-critical machine wasn't really non-critical
> after all, and that I'm juggling way too many things.)

Ok, good! Will add you back to the list you were dropped from - but
you'll want to double-check all the other lists as well :)

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/




[HACKERS] compress method for spgist - 2

2014-12-01 Thread Teodor Sigaev


Initial message:
http://www.postgresql.org/message-id/5447b3ff.2080...@sigaev.ru

The second version fixes some forgotten changes in pg_am.


--
Teodor Sigaev   E-mail: teo...@sigaev.ru
   WWW: http://www.sigaev.ru/


spgist_compress_method-2.patch.gz
Description: GNU Zip compressed data



Re: [HACKERS] [Windows,PATCH] Use faster, higher precision timer API

2014-12-01 Thread Craig Ringer
Hi all

I've attached a revised patchset for GetSystemTimeAsFileTime and
GetSystemTimePreciseAsFileTime to address David Rowley's review notes.

Thanks for the review, and for poking me to follow up.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
>From 63edf0571961f2f2c7fd6a08b99e747ee39a3377 Mon Sep 17 00:00:00 2001
From: Craig Ringer 
Date: Fri, 12 Sep 2014 12:41:35 +0800
Subject: [PATCH 1/2] Use GetSystemTimeAsFileTime directly in windows
 gettimeofday
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

PostgreSQL was calling GetSystemTime followed by SystemTimeToFileTime in the
win32 port gettimeofday function. This is not necessary and limits the reported
precision to the 1ms granularity that the SYSTEMTIME struct can represent. By
using GetSystemTimeAsFileTime we avoid unnecessary conversions and capture
timestamps at 100ns granularity, which are then rounded to 1µs granularity for
storage in a PostgreSQL timestamp.

On most Windows systems this change will actually have no significant effect, as
the system timer tick is typically between 1ms and 15ms depending on what timer
resolution currently running applications have requested. You can check this
with clockres.exe from Sysinternals. Despite the platform limitation, this change
still permits capture of finer timestamps where the system is capable of
producing them, and it gets rid of an unnecessary syscall.

Future work may permit use of GetSystemTimePreciseAsFileTime on Windows 8 and
Windows Server 2012 for higher resolution time capture. This call has the same
interface as GetSystemTimeAsFileTime.
---
 src/port/gettimeofday.c | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/src/port/gettimeofday.c b/src/port/gettimeofday.c
index 75a9199..73ec406 100644
--- a/src/port/gettimeofday.c
+++ b/src/port/gettimeofday.c
@@ -44,16 +44,14 @@ int
 gettimeofday(struct timeval * tp, struct timezone * tzp)
 {
 	FILETIME	file_time;
-	SYSTEMTIME	system_time;
 	ULARGE_INTEGER ularge;
 
-	GetSystemTime(&system_time);
-	SystemTimeToFileTime(&system_time, &file_time);
+	GetSystemTimeAsFileTime(&file_time);
 	ularge.LowPart = file_time.dwLowDateTime;
 	ularge.HighPart = file_time.dwHighDateTime;
 
 	tp->tv_sec = (long) ((ularge.QuadPart - epoch) / 10000000L);
-	tp->tv_usec = (long) (system_time.wMilliseconds * 1000);
+	tp->tv_usec = (long) (((ularge.QuadPart - epoch) % 10000000L) / 10);
 
 	return 0;
 }
-- 
1.9.3

>From c3c9f38379dca3f6f59520c5ceafce34ee8c8d90 Mon Sep 17 00:00:00 2001
From: Craig Ringer 
Date: Thu, 18 Sep 2014 23:02:14 +0800
Subject: [PATCH 2/2] Use GetSystemTimePreciseAsFileTime when available

This will cause PostgreSQL on Windows 8 or Windows Server 2012 to
obtain high-resolution timestamps while allowing the same binaries
to run without problems on older releases.
---
 src/backend/main/main.c |  6 ++
 src/include/port.h  |  2 ++
 src/port/gettimeofday.c | 56 +++--
 3 files changed, 62 insertions(+), 2 deletions(-)

diff --git a/src/backend/main/main.c b/src/backend/main/main.c
index c51b391..73c30c5 100644
--- a/src/backend/main/main.c
+++ b/src/backend/main/main.c
@@ -260,6 +260,12 @@ startup_hacks(const char *progname)
 
 		/* In case of general protection fault, don't show GUI popup box */
 		SetErrorMode(SEM_FAILCRITICALERRORS | SEM_NOGPFAULTERRORBOX);
+
+#ifndef HAVE_GETTIMEOFDAY
+		/* Figure out which syscall to use to capture timestamp information */
+		init_win32_gettimeofday();
+#endif
+
 	}
 #endif   /* WIN32 */
 
diff --git a/src/include/port.h b/src/include/port.h
index 94a0e2f..58677ec 100644
--- a/src/include/port.h
+++ b/src/include/port.h
@@ -328,6 +328,8 @@ extern FILE *pgwin32_popen(const char *command, const char *type);
 #ifndef HAVE_GETTIMEOFDAY
 /* Last parameter not used */
 extern int	gettimeofday(struct timeval * tp, struct timezone * tzp);
+/* On windows we need to call some backend start setup for accurate timing */
+extern void init_win32_gettimeofday(void);
 #endif
 #else			/* !WIN32 */
 
diff --git a/src/port/gettimeofday.c b/src/port/gettimeofday.c
index 73ec406..a82a1a4 100644
--- a/src/port/gettimeofday.c
+++ b/src/port/gettimeofday.c
@@ -30,14 +30,66 @@
 
 #include <sys/time.h>
 
+#ifndef FRONTEND
+#include "utils/elog.h"
+#endif
+
 
 /* FILETIME of Jan 1 1970 00:00:00. */
static const unsigned __int64 epoch = UINT64CONST(116444736000000000);
 
 /*
+ * Both GetSystemTimeAsFileTime and GetSystemTimePreciseAsFileTime share a
+ * signature, so we can just store a pointer to whichever we find. This
+ * is the pointer's type.
+ */
+typedef VOID (WINAPI *PgGetSystemTimeFn)(LPFILETIME);
+/* Storage for the function we pick at runtime */
+static PgGetSystemTimeFn pg_get_system_time = NULL;
+
+/*
+ * During backend startup, determine if GetSystemTimePreciseAsFileTime is
+ * available and use it; if not, fall back to GetSystemTimeAsFileTime.
+ */
+void
+init
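
(The patch text is truncated here in the archive. As a rough sketch only -
reusing the PgGetSystemTimeFn typedef and pg_get_system_time pointer from
the hunk above, and assuming the conventional GetProcAddress technique
rather than quoting the actual patch - the function presumably continues
along these lines:)

void
init_win32_gettimeofday(void)
{
	/*
	 * GetSystemTimePreciseAsFileTime only exists on Windows 8 / Windows
	 * Server 2012 and newer, so it must be looked up at runtime instead
	 * of being linked directly, or the binary would fail to start on
	 * older releases.
	 */
	HMODULE kernel32 = GetModuleHandle(TEXT("kernel32.dll"));

	if (kernel32 != NULL)
		pg_get_system_time = (PgGetSystemTimeFn)
			GetProcAddress(kernel32, "GetSystemTimePreciseAsFileTime");

	/* Fall back to the 1ms-granularity call on older Windows. */
	if (pg_get_system_time == NULL)
		pg_get_system_time = &GetSystemTimeAsFileTime;
}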

Re: [HACKERS] [Windows,PATCH] Use faster, higher precision timer API

2014-12-01 Thread Marco Nenciarini
On 01/12/14 14:16, Craig Ringer wrote:
>  
> +#ifndef FRONTEND
> +#include "utils/elog.h"
> +#endif
> +

I think this is a leftover, as you don't use elog afterwards.

Regards,
Marco

-- 
Marco Nenciarini - 2ndQuadrant Italy
PostgreSQL Training, Services and Support
marco.nenciar...@2ndquadrant.it | www.2ndQuadrant.it



signature.asc
Description: OpenPGP digital signature


Re: [HACKERS] Proposal : REINDEX SCHEMA

2014-12-01 Thread Sawada Masahiko
On Thu, Nov 27, 2014 at 11:43 PM, Michael Paquier
 wrote:
> On Thu, Nov 27, 2014 at 8:18 PM, Sawada Masahiko  
> wrote:
>> On Thu, Nov 27, 2014 at 1:07 PM, Michael Paquier
>>  wrote:
>>> On Thu, Nov 27, 2014 at 12:55 AM, Sawada Masahiko  
>>> wrote:
>>>> +1 to define new something object type and remove do_user and do_system.
>>>> But if we add OBJECT_SYSTEM to ObjectType data type,
>>>> system catalogs are OBJECT_SYSTEM as well as OBJECT_TABLE.
>>>> It's a bit redundant?
>>> Yes, kind of. That's a superset of a type of relations, aka a set of
>>> catalog tables. If you find something cleaner to propose, feel free.
>>
>> I thought we can add new struct like ReindexObjectType which has
>> REINDEX_OBJECT_TABLE,
>> REINDEX_OBJECT_SYSTEM and so on. It's similar to GRANT syntax.
> Check.
>
> Another thing, ReindexDatabaseOrSchema should be renamed to ReindexObject.
> So, I think that we need to think a bit more here. We are not far from
> smth that could be committed, so marking as "Waiting on Author" for
> now. Thoughts?

>>>> Is the table also kind of "object"?
>>> Sorry, I am not sure I follow you here. Indexes and tables have
>>> already their relkind set in ReindexStmt, and I think that we're fine
>>> to continue letting them go in their own reindex code path for now.
>>
>> It was not enough, sorry.
>> I mean that there is already a ReindexTable() function.
>> If we renamed it to ReindexObject, I would feel uncomfortable,
>> because a table is also a kind of "object".
> Check.
>
> If you get that done, I'll have an extra look at it and then let's
> have a committer look at it.

The attached patch is the latest version.
I added new enum values like REINDEX_OBJECT_XXX,
and changed the ReindexObject() function.
The ACL problem is also fixed in this version.

Regards,

---
Sawada Masahiko


000_reindex_schema_v6.patch
Description: Binary data



Re: [HACKERS] [PROTOCOL TODO] Permit streaming of unknown-length lob/clob (bytea,text,etc)

2014-12-01 Thread David Fetter
On Mon, Dec 01, 2014 at 02:55:22PM +0800, Craig Ringer wrote:
> Hi all
> 
> Currently the client must know the size of a large lob/clob field, like
> a 'bytea' or 'text' field, in order to send it to the server. This can
> force the client to buffer all the data before sending it to the server.

Yes, this is not good.

> It would be helpful if the v4 protocol permitted the client to specify
> the field length as unknown / TBD, then stream data until an end marker
> is read.

What's wrong with specifying its length in advance instead?  Are you
thinking of one or more use cases where it's both large and unknown?

Cheers,
David.
-- 
David Fetter  http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter  XMPP: david.fet...@gmail.com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate




Re: [HACKERS] [PROTOCOL TODO] Permit streaming of unknown-length lob/clob (bytea,text,etc)

2014-12-01 Thread Craig Ringer
On 12/01/2014 10:38 PM, David Fetter wrote:
> On Mon, Dec 01, 2014 at 02:55:22PM +0800, Craig Ringer wrote:
>> Hi all
>>
>> Currently the client must know the size of a large lob/clob field, like
>> a 'bytea' or 'text' field, in order to send it to the server. This can
>> force the client to buffer all the data before sending it to the server.
> 
> Yes, this is not good.
> 
>> It would be helpful if the v4 protocol permitted the client to specify
>> the field length as unknown / TBD, then stream data until an end marker
>> is read.
> 
> What's wrong with specifying its length in advance instead?  Are you
> thinking of one or more use cases where it's both large and unknown?

I am - specifically, the JDBC setBlob(...) and setClob(...) APIs that
accept streams without a specified length:

https://docs.oracle.com/javase/7/docs/api/java/sql/PreparedStatement.html#setBlob(int,%20java.io.InputStream)

https://docs.oracle.com/javase/7/docs/api/java/sql/PreparedStatement.html#setClob(int,%20java.io.Reader)

There are variants that do take a length, so PgJDBC can (and now does)
implement the no-length variants by internally buffering the stream
until EOF. It'd be nice to get rid of that though.


-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] [PROTOCOL TODO] Permit streaming of unknown-length lob/clob (bytea,text,etc)

2014-12-01 Thread Tom Lane
Craig Ringer  writes:
> Currently the client must know the size of a large lob/clob field, like
> a 'bytea' or 'text' field, in order to send it to the server. This can
> force the client to buffer all the data before sending it to the server.

> It would be helpful if the v4 protocol permitted the client to specify
> the field length as unknown / TBD, then stream data until an end marker
> is read. Some encoding would be required for binary data to ensure that
> occurrences of the end marker in the streamed data were properly
> handled, but there are many well established schemes for doing this.

I think this is pretty much a non-starter as stated, because the v3
protocol requires all messages to have a preceding length word.  That's
not very negotiable.

What's already on the TODO list is to allow large field values to be sent
or received in segments, perhaps with a cursor-like arrangement.  You can
do that today for blobs, but not for oversize regular table fields.
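
For illustration, streaming a value of unknown length is indeed possible
today with libpq's large-object API - roughly like the following sketch
(error checking and failure paths omitted for brevity; the calls used are
the standard libpq large-object functions):

#include <stdio.h>
#include "libpq-fe.h"
#include "libpq/libpq-fs.h"		/* INV_READ / INV_WRITE */

static void
stream_to_blob(PGconn *conn, FILE *src)
{
	char		buf[8192];
	size_t		n;
	Oid			loid;
	int			fd;

	/* Large-object operations must run inside a transaction. */
	PQclear(PQexec(conn, "BEGIN"));

	loid = lo_creat(conn, INV_READ | INV_WRITE);
	fd = lo_open(conn, loid, INV_WRITE);

	/* Send one chunk at a time; the total size never needs to be known. */
	while ((n = fread(buf, 1, sizeof(buf), src)) > 0)
		lo_write(conn, fd, buf, n);

	lo_close(conn, fd);
	PQclear(PQexec(conn, "COMMIT"));
}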

Of course, considering that the maximum practical size of a regular field
is probably in the dozens of megabytes, and that RAM is getting cheaper
all the time, it's not clear that it's all that much of a hardship for
clients to buffer the whole thing.  If we've not gotten around to this
in the last dozen years, it's unlikely we'll get to it in the future
either ...

regards, tom lane




Re: [HACKERS] Selectivity estimation for inet operators

2014-12-01 Thread Tom Lane
I wrote:
> I spent a fair chunk of the weekend hacking on this patch to make
> it more understandable and fix up a lot of what seemed to me pretty
> clear arithmetic errors in the "upper layers" of the patch.  However,
> I couldn't quite convince myself to commit it, because the business
> around estimation for partial histogram-bucket matches still doesn't
> make any sense to me.

Actually, there's a second large problem with this patch: blindly
iterating through all combinations of MCV and histogram entries makes the
runtime O(N^2) in the statistics target.  I made up some test data (by
scanning my mail logs) and observed the following planning times, as
reported by EXPLAIN ANALYZE:

explain analyze select * from relays r1, relays r2 where r1.ip = r2.ip;
explain analyze select * from relays r1, relays r2 where r1.ip && r2.ip;

stats target    eqjoinsel    networkjoinsel

100             0.27 ms      1.85 ms
1000            2.56 ms      167.2 ms
10000           56.6 ms      13987.1 ms

I don't think it's necessary for network selectivity to be quite as
fast as eqjoinsel, but I doubt we can tolerate 14 seconds planning
time for a query that might need just milliseconds to execute :-(
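
(As a quick sanity check of the O(N^2) claim against those numbers: each
10x increase in the statistics target multiplies the networkjoinsel time
by roughly 100x - 167.2/1.85 ≈ 90 and 13987.1/167.2 ≈ 84 - which is just
what quadratic behavior predicts.)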

It seemed to me that it might be possible to reduce the runtime by
exploiting knowledge about the ordering of the histograms, but
I don't have time to pursue that right now.

regards, tom lane




Re: [HACKERS] Buildfarm not happy with test module move

2014-12-01 Thread Alvaro Herrera
Tom Lane wrote:
> Alvaro Herrera  writes:
> > Tom Lane wrote:
> >> All of the MSVC critters are failing at "make check".
> 
> > Yeah, I noticed that, thanks.  As far as I can see the only way to fix
> > it is to install dummy_seclabel to run the core seclabel test.  That
> > doesn't seem smart; I think it'd be better to remove that part of the
> > core seclabel test, and move the rest of the test to a new test in the
> > dummy_seclabel module.
> 
> Sounds good to me.  The other parts of the core tests that depend on
> contrib modules aren't exactly good models to follow.

Pushed; tests pass for me, let's see what buildfarm says.

I think the input/ and output/ stuff is rather annoying.  I tried to
make dummy_seclabel an extension instead of a bare library, so that
CREATE EXTENSION inside the .sql loads it easily.  The problem there is
that dummy_seclabel doesn't have any sql-callable function, so the
module doesn't ever load.  I guess we could create a dummy function
which is there only to cause the module to get loaded ...
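
For what it's worth, the dummy function itself would be trivial - a sketch
of what the module could expose (names invented for illustration, not a
committed design):

#include "postgres.h"
#include "fmgr.h"

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(dummy_seclabel_dummy);

/*
 * No-op function whose only purpose is to give CREATE EXTENSION a
 * SQL-callable symbol, so loading the extension pulls in the shared
 * library and runs its _PG_init() hook.
 */
Datum
dummy_seclabel_dummy(PG_FUNCTION_ARGS)
{
	PG_RETURN_VOID();
}

The extension script would then only need a matching
CREATE FUNCTION dummy_seclabel_dummy() RETURNS void
  AS 'MODULE_PATHNAME' LANGUAGE C;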

I am in touch with Andrew about adding a new stage to buildfarm client
so that animals will build src/test/modules by default.  It should work
fine everywhere except MSVC, I think.  The issue with MSVC continues to
be that Install.pm and Mkvcbuild.pm need to be in sync regarding what to
build and what to install.  I could use help from some MSVC-enabled
developer there ...

In the meantime, I'm going to rebase commit_ts to src/test/modules and
get one bug I found there fixed, and will push that shortly too.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Buildfarm not happy with test module move

2014-12-01 Thread Tom Lane
Alvaro Herrera  writes:
> Tom Lane wrote:
>> Sounds good to me.  The other parts of the core tests that depend on
>> contrib modules aren't exactly good models to follow.

> Pushed; tests pass for me, let's see what buildfarm says.

Pushed?  Don't see it ...

regards, tom lane




Re: [HACKERS] using Core Foundation locale functions

2014-12-01 Thread David E. Wheeler
On Nov 28, 2014, at 8:43 AM, Peter Eisentraut  wrote:
> 
> At the moment, this is probably just an experiment that shows where
> refactoring and better abstractions might be suitable if we want to
> support multiple locale libraries.  If we want to pursue ICU, I think
> this could be a useful third option.

Gotta say, I’m thrilled to see movement on this front, and especially pleased 
to see how consensus seems to be building around an abstracted interface to 
keep options open. This platform-specific example really highlights the need 
for it (I had no idea that there was separate and more up-to-date collation 
support in Core Foundation than in the UNIX layer of OS X).

Really looking forward to seeing where we end up.

Best,

David
 



smime.p7s
Description: S/MIME cryptographic signature


Re: [HACKERS] Turning recovery.conf into GUCs

2014-12-01 Thread Alex Shulgin
Alex Shulgin  writes:
>
> Here's an attempt to revive this patch.

Here's the patch rebased against current HEAD, that is including the
recently committed action_at_recovery_target option.

The default for the new GUC is 'pause', as in HEAD, and
pause_at_recovery_target is removed completely in favor of it.

I've also taken the liberty to remove that part that errors out when
finding $PGDATA/recovery.conf.  Now get your rotten tomatoes ready. ;-)

--
Alex



recovery_guc_v5.4.patch.gz
Description: application/gzip



Re: [HACKERS] Proposal: Log inability to lock pages during vacuum

2014-12-01 Thread Jim Nasby

On 12/1/14, 11:57 AM, Andres Freund wrote:
> On 2014-11-30 20:46:51 -0600, Jim Nasby wrote:
>> On 11/10/14, 7:52 PM, Tom Lane wrote:
>>> On the whole, I'm +1 for just logging the events and seeing what we learn
>>> that way.  That seems like an appropriate amount of effort for finding out
>>> whether there is really an issue.
>>
>> Attached is a patch that does this.
>
> Unless somebody protests I plan to push this soon. I'll change two
> things:
>
> * Always use the same message, independent of scan_all. For one, Jim's
>   version would be untranslatable; for another, it's not actually
>   correct. In most cases we'll *not* wait, even if scan_all is
>   true, as we'll often just balk after !lazy_check_needs_freeze().

Good point.

> * Change the new bit in the errdetail. "could not acquire cleanup lock"
>   sounds too much like an error to me. "skipped %u pinned pages" maybe?

Seems reasonable.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com




[HACKERS] No documentation on record *= record?

2014-12-01 Thread Jim Nasby

There doesn't seem to be documentation on *= (or search isn't finding it). Is 
this intentional?
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com




Re: [HACKERS] 9.2 recovery/startup problems with unlogged tables

2014-12-01 Thread Jeff Janes
On Wed, Nov 26, 2014 at 4:13 PM, Jeff Janes  wrote:

> On Tue, Nov 25, 2014 at 9:30 PM, Jeff Janes  wrote:
>
>> Using both 9.2.9 and 9_2_STABLE 9b468bcec15f1ea7433d4, I get a fairly
>> reproducible startup failure.
>>
>> What I was doing is restore a database from a base backup and roll it
>> forward with a recovery.conf until it completes and starts up.  Then I
>> truncate an unlogged table and start repopulating it with a large slow
>> 'insert into ...select from' query.
>>
>> While that is running, if I reboot the server, when it restarts it does
>> not come back up initially.
>>
>> This is what I get in the log from the attempted restart:
>>
>> PST LOG:  database system was interrupted; last known up at 2014-11-25
>> 15:40:33 PST
>> PST LOG:  database system was not properly shut down; automatic recovery
>> in progress
>> PST LOG:  redo starts at 84/EF80
>> PST LOG:  record with zero length at 84/EF09AE18
>> PST LOG:  redo done at 84/EF09AD28
>> PST LOG:  last completed transaction was at log time 2014-11-25
>> 15:42:09.173599-08
>> PST LOG:  checkpoint starting: end-of-recovery immediate
>> PST LOG:  checkpoint complete: wrote 103 buffers (0.2%); 0 transaction
>> log file(s) added, 246 removed, 7 recycled; write=0.002 s, sync=0.020 s,
>> total=0.526 s; sync files=51, longest=0.003 s, average=0.000 s
>> PST FATAL:  could not create file "base/16416/59288": File exists
>> PST LOG:  startup process (PID 2472) exited with exit code 1
>> PST LOG:  aborting startup due to startup process failure
>>
>> oid2name doesn't show me any 59288, so I think it is the new copy of the
>> unlogged table which is being created at the moment of the reboot.
>>
>
> 59288 is not the new (uncommitted) unlogged table itself, but is the new
> (uncommitted) toast table associated with that unlogged table.
>
> immediately before the 'sudo /sbin/reboot', while the 'truncate
> unlogged_table; insert into unlogged_table from select...' is running, the
> two files below exist and have zero size.
>
> base/16416/59288_init
> base/16416/59288
>
> Immediately after the system has come back up, the file
> base/16416/59288_init no longer exists.
>
> I can't reproduce this behavior without the reboot.  It seems to depend on
> the sequencing and timing of the signals which the kernel delivers to the
> running processes during the reboot.
>
> If I do a pg_ctl stop -mf, then both files go away.  If I do a pg_ctl stop
> -mi, then neither goes away.  It is only with the /sbin/reboot that I get
> the fatal combination of _init being gone but the other still present.
>
>
> Then once the system is back and I restart postmaster, the first pass
> through ResetUnloggedRelationsInDbspaceDir doesn't remove base/16416/59288,
> because base/16416/59288_init doesn't exist to trigger the deletion.  On
> the 2nd pass through, base/16416/59288_init exists because it was created
> by WAL replay, and the copy fails because it is expecting base/16416/59288
> to have already been cleaned up by the first pass.
>
> Should ResetUnloggedRelationsInDbspaceDir on the second pass attempt to
> unlink the target immediately before the copy (line 338, in 9.2) to
> accommodate cases like this?
>

I still haven't figured out what sequence of signals causes the init fork
to be removed but the main fork to remain, but I think the startup process
should be robust to this situation.  The attached patch adds an
unconditional and unchecked unlink before the copy.  It should probably
have a comment, but I don't really know what to say.

This resolves the situation for me.  The unlink is expected to fail with a
non-existent file in the vast majority of cases.  Should it check that the
unlink either succeeds, or fails for non-existence, and throw an error in
cases other than those?  Would it be a performance concern to call unlink
potentially thousands of times?  I don't think so, because it only calls
the unlink when it is about to copy a file anyway, which is a much heavier
operation.
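
Concretely, the proposed unlink-before-copy would look something like this
(a sketch against the reinit.c code; identifier names are assumptions, not
a verified excerpt of the attached patch):

	/*
	 * Remove any stale main fork left behind by an unclean shutdown
	 * before copying the init fork over it.  ENOENT is the expected
	 * outcome in the vast majority of cases; anything else is a real
	 * error.
	 */
	if (unlink(dstpath) < 0 && errno != ENOENT)
		ereport(ERROR,
				(errcode_for_file_access(),
				 errmsg("could not remove file \"%s\": %m", dstpath)));

	copy_file(srcpath, dstpath);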

My understanding of the reason that it was done in two passes in the first
place is that during hot standby, it is OK for the main fork to be missing,
but not OK for it to be gibberish.  But then upon promotion to writable, it
is no longer OK for it to be missing, so they get copied over at that time
(and not earlier, as then init forks created during replay wouldn't get
copied).  Given that reasoning, I think that unconditionally unlinking the
destination file before the copy would not be a problem.  The gibberish
main fork would not be a problem during hot standby queries, as the table's
creation has not been committed and so can't be seen.

(I've added Robert to CC because he wrote the original code.)

Cheers,

Jeff


reinit_unlink.patch
Description: Binary data



Re: [HACKERS] Buildfarm not happy with test module move

2014-12-01 Thread Alvaro Herrera
Tom Lane wrote:
> Alvaro Herrera  writes:
> > Tom Lane wrote:
> >> Sounds good to me.  The other parts of the core tests that depend on
> >> contrib modules aren't exactly good models to follow.
> 
> > Pushed; tests pass for me, let's see what buildfarm says.
> 
> Pushed?  Don't see it ...

Meh.  Done for real this time.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Role Attribute Bitmask Catalog Representation

2014-12-01 Thread Stephen Frost
Adam,

* Adam Brightwell (adam.brightw...@crunchydatasolutions.com) wrote:
> Please let me know what you think, all feedback is greatly appreciated.

Thanks for this.  Found time to do more of a review and have a few
comments:

> diff --git a/src/backend/catalog/aclchk.c b/src/backend/catalog/aclchk.c
> new file mode 100644
> index d30612c..93eb2e6
> *** a/src/backend/catalog/aclchk.c
> --- b/src/backend/catalog/aclchk.c
> *** pg_extension_ownercheck(Oid ext_oid, Oid
> *** 5051,5056 
> --- 5031,5058 
>   }
>   
>   /*
> +  * Check whether the specified role has a specific role attribute.
> +  */
> + bool
> + role_has_attribute(Oid roleid, RoleAttr attribute)
> + {
> + RoleAttrattributes;
> + HeapTuple   tuple;
> + 
> + tuple = SearchSysCache1(AUTHOID, ObjectIdGetDatum(roleid));
> + 
> + if (!HeapTupleIsValid(tuple))
> + ereport(ERROR,
> + (errcode(ERRCODE_UNDEFINED_OBJECT),
> +  errmsg("role with OID %u does not exist", 
> roleid)));
> + 
> + attributes = ((Form_pg_authid) GETSTRUCT(tuple))->rolattr;
> + ReleaseSysCache(tuple);
> + 
> + return ((attributes & attribute) > 0);
> + }

I'd put the superuser_arg() check in role_has_attribute.

> --- 5066,5089 
>   bool
>   has_createrole_privilege(Oid roleid)
>   {
>   /* Superusers bypass all permission checking. */
>   if (superuser_arg(roleid))
>   return true;
>   
> ! return role_has_attribute(roleid, ROLE_ATTR_CREATEROLE);
>   }

And then ditch the individual has_X_privilege() calls (I note that you
didn't appear to add back the has_rolcatupdate() function..).

> diff --git a/src/backend/commands/user.c b/src/backend/commands/user.c
> new file mode 100644
> index 1a73fd8..72c5dcc
> *** a/src/backend/commands/user.c
> --- b/src/backend/commands/user.c
> *** have_createrole_privilege(void)
> *** 63,68 
> --- 63,73 
>   return has_createrole_privilege(GetUserId());
>   }
>   
> + static RoleAttr
> + set_role_attribute(RoleAttr attributes, RoleAttr attribute)
> + {
> + return ((attributes & ~(0x)) | attribute);
> + }

I don't think this is doing what you think it is..?  It certainly isn't
working as expected by AlterRole().  Rather than using a function for
these relatively simple operations, why not just have AlterRole() do:

if (isX >= 0)
    attributes = (isX > 0) ? attributes | ROLE_ATTR_X
                           : attributes & ~(ROLE_ATTR_X);

Otherwise, you'd probably need to pass a flag into set_role_attribute()
to indicate if you're setting or unsetting the bit, or have an
'unset_role_attribute()' function, but all of that seems like overkill..

> *** CreateRole(CreateRoleStmt *stmt)
> --- 86,99 
>   char   *password = NULL;/* user password */
>   boolencrypt_password = Password_encryption; /* encrypt 
> password? */
>   charencrypted_password[MD5_PASSWD_LEN + 1];
> ! boolissuper = false;/* Make the user a 
> superuser? */
> ! boolinherit = true; /* Auto inherit 
> privileges? */

I'd probably leave the whitespace-only changes out.

> *** AlterRoleSet(AlterRoleSetStmt *stmt)
> *** 889,895 
>* To mess with a superuser you gotta be superuser; else you 
> need
>* createrole, or just want to change your own settings
>*/
> ! if (((Form_pg_authid) GETSTRUCT(roletuple))->rolsuper)
>   {
>   if (!superuser())
>   ereport(ERROR,
> --- 921,928 
>* To mess with a superuser you gotta be superuser; else you 
> need
>* createrole, or just want to change your own settings
>*/
> ! attributes = ((Form_pg_authid) GETSTRUCT(roletuple))->rolattr;
> ! if ((attributes & ROLE_ATTR_SUPERUSER) > 0)
>   {
>   if (!superuser())
>   ereport(ERROR,

We don't bother with the '> 0' in any of the existing bit testing that
goes on in the backend, so I don't think it makes sense to start now.
Just do

if (attributes & ROLE_ATTR_SUPERUSER) ... etc

and be done with it.

> *** aclitemin(PG_FUNCTION_ARGS)
> *** 577,582 
> --- 578,584 
>   PG_RETURN_ACLITEM_P(aip);
>   }
>   
> + 
>   /*
>* aclitemout
>*  Allocates storage for, and fills in, a new null-delimited string

More whitespace changes... :/

> *** pg_role_aclcheck(Oid role_oid, Oid rolei
> *** 4602,4607 
> --- 4604,4712 
>   return ACLCHECK_NO_PRIV;
>   }
>   
> + /*
> +  * has_role_attribute_id
> +  *  Check the named role attribute on a role by given role oid.
> +  */
> + Datum
> + has_role_attribute_id(PG_FUNCTION_ARGS)
> + {
> + Oid   

Re: [HACKERS] bug in json_to_record with arrays

2014-12-01 Thread Josh Berkus
On 11/28/2014 12:55 PM, Tom Lane wrote:
> * This would only really address Josh's complaint if we were to back-patch
> it into 9.4, but I'm hesitant to change error texts post-RC1.  Thoughts?

If we don't fix an error text in an RC, what would we fix in an RC?
That's as minor as things get.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] bug in json_to_record with arrays

2014-12-01 Thread Tom Lane
Josh Berkus  writes:
> On 11/28/2014 12:55 PM, Tom Lane wrote:
>> * This would only really address Josh's complaint if we were to back-patch
>> it into 9.4, but I'm hesitant to change error texts post-RC1.  Thoughts?

> If we don't fix an error text in an RC, what would we fix in an RC?
> That's as minor as things get.

Code-wise, yeah, but it would put some pressure on the translators.

What did you think of the new error texts in themselves?

regards, tom lane




Re: [HACKERS] bug in json_to_record with arrays

2014-12-01 Thread Joshua D. Drake


On 12/01/2014 12:28 PM, Josh Berkus wrote:
> On 11/28/2014 12:55 PM, Tom Lane wrote:
>> * This would only really address Josh's complaint if we were to back-patch
>> it into 9.4, but I'm hesitant to change error texts post-RC1.  Thoughts?
>
> If we don't fix an error text in an RC, what would we fix in an RC?
> That's as minor as things get.

I think the idea is that you only fix *major* things in an RC? That
said, I agree that we should fix something so minor.


JD






--
Command Prompt, Inc. - http://www.commandprompt.com/  503-667-4564
PostgreSQL Support, Training, Professional Services and Development
High Availability, Oracle Conversion, @cmdpromptinc
"If we send our children to Caesar for their education, we should
 not be surprised when they come back as Romans."




Re: [HACKERS] bug in json_to_record with arrays

2014-12-01 Thread Andres Freund
On 2014-12-01 15:30:06 -0500, Tom Lane wrote:
> Josh Berkus  writes:
> > On 11/28/2014 12:55 PM, Tom Lane wrote:
> >> * This would only really address Josh's complaint if we were to back-patch
> >> it into 9.4, but I'm hesitant to change error texts post-RC1.  Thoughts?
> 
> > If we don't fix an error text in an RC, what would we fix in an RC?
> > That's as minor as things get.
> 
> Code-wise, yeah, but it would put some pressure on the translators.

I think an untranslated but on-point error message is better than a
translated, confusing one.
 
Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] bug in json_to_record with arrays

2014-12-01 Thread Josh Berkus
On 12/01/2014 12:30 PM, Tom Lane wrote:
> Josh Berkus  writes:
>> On 11/28/2014 12:55 PM, Tom Lane wrote:
>>> * This would only really address Josh's complaint if we were to back-patch
>>> it into 9.4, but I'm hesitant to change error texts post-RC1.  Thoughts?
> 
>> If we don't fix an error text in an RC, what would we fix in an RC?
>> That's as minor as things get.
> 
> Code-wise, yeah, but it would put some pressure on the translators.
> 
> What did you think of the new error texts in themselves?

I would prefer "invalid input syntax" to "malformed array literal", but
I'll take anything we can backpatch.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] bug in json_to_record with arrays

2014-12-01 Thread Tom Lane
Josh Berkus  writes:
> On 12/01/2014 12:30 PM, Tom Lane wrote:
>> Code-wise, yeah, but it would put some pressure on the translators.
>> 
>> What did you think of the new error texts in themselves?

> I would prefer "invalid input syntax" to "malformed array literal", but
> I'll take anything we can backpatch.

I think if we're changing them at all, it's about the same cost
no matter what we change them to exactly.

I'd be good with going to "invalid input syntax" if we also change
record_in to match.  Anyone have a feeling pro or con about that?

regards, tom lane




Re: [HACKERS] bug in json_to_record with arrays

2014-12-01 Thread Merlin Moncure
On Mon, Dec 1, 2014 at 3:01 PM, Tom Lane  wrote:
> Josh Berkus  writes:
>> On 12/01/2014 12:30 PM, Tom Lane wrote:
>>> Code-wise, yeah, but it would put some pressure on the translators.
>>>
>>> What did you think of the new error texts in themselves?
>
>> I would prefer "invalid input syntax" to "malformed array literal", but
>> I'll take anything we can backpatch.
>
> I think if we're changing them at all, it's about the same cost
> no matter what we change them to exactly.
>
> I'd be good with going to "invalid input syntax" if we also change
> record_in to match.  Anyone have a feeling pro or con about that?

I don't know.  "malformed array literal" is a mighty big clue that you
have a bad "postgres format" array literal and will be well supported
by googling -- this problem isn't new.

merlin




[HACKERS] excessive amounts of consumed memory (RSS), triggering OOM killer

2014-12-01 Thread Tomas Vondra
Hi all,

while working on the patch decreasing the amount of memory consumed by
array_agg [1], I've run into some strange OOM issues. Reproducing them
using the attached SQL script is rather simple.

  [1] https://commitfest.postgresql.org/action/patch_view?id=1652

At first I thought there's some rare hidden memory leak, but I'm pretty
sure that's not the case. I've even put some explicit counters into
aset.c counting allocated/freed blocks, and it seems to be working just
fine (and matching the context tree perfectly). So no memory leak.

The only thing I can think of is some interaction (fragmentation?) with
the system allocator (I'm using glibc-2.19 and kernel 3.16.1, btw),
making the RSS values even less useful than I thought. Sadly, it seems
to trigger the OOM killer :-(

It's entirely possible that this is a known behavior of the allocator,
and I've been unaware of it. It's also true that the work_mem values are
really excessive, and the actual values would be much lower, which would
make the issues much less serious.

To demonstrate the problem I'll use the attached SQL script - it's a
rather simple script generating sample tables and then executing trivial
array_agg() queries.

The script sets work_mem to two different values:

  work_mem = '1GB'    ==>  this limits the INSERT (generating data)

  work_mem = '1024GB' ==>  bogus value, forcing a hash aggregate
                           (assuming the hash table fits into memory)

The size of the sample tables (and the amount of memory needed for the
hash aggregate) is determined by the first parameter set in the script.
With this value

  \set size 1

I get an OOM crash on the first execution of the SQL script (on a machine
with 8GB of RAM and 512MB shared buffers), but YMMV.

The problem is that even with a much smaller dataset (say, using size
7500) you'll get an OOM error after several executions of the script.
The number of executions needed seems to be inversely proportional to the
size of the data set.

The "problem" is that the RSS amount is increasing over time for some
reason. For example with the "size = 5000", the memory stats for the
process look like this over the first few minutes:

     VIRT    RES    SHR %CPU %MEM     TIME+ COMMAND
  5045508 2,818g 187220  51,9 36,6   0:15.39 postgres: INSERT
  5045508 3,600g 214868  62,8 46,8   3:11.58 postgres: INSERT
  5045508 3,771g 214868  50,9 49,0   3:40.03 postgres: INSERT
  5045508 3,840g 214868  48,5 49,9   4:00.71 postgres: INSERT
  5045508 3,978g 214880  51,5 51,7   4:40.73 postgres: INSERT
  5045508 4,107g 214892  53,2 53,4   5:22.04 postgres: INSERT
  5045508 4,297g 215276  53,9 55,9   6:22.63 postgres: INSERT
  ...

Those are rows for the backend process, captured from "top" over time.
How long the backend was running is in the TIME column. Each iteration
takes ~30 seconds, so those lines represent approximately iterations 1,
6, 7, 8, 11, etc.

Notice how the RSS value grows over time, and also notice that this is
the INSERT, restricted by work_mem=1GB. So the memory consumption should
be ~1.5GB, and MemoryContextStats(TopMemoryContext) collected at this
point is consistent with that (see the mem-ctx.log).

And then the value stabilizes at ~4,430g and stops growing. With size
7500 it however takes only ~20 iterations to reach the OOM issue, with a
crash log like this:

[10560.843547] Killed process 15862 (postgres) total-vm:7198260kB,
anon-rss:6494136kB, file-rss:300436kB

So, any ideas what might be the culprit here?

As I said, this is clearly made worse by inappropriately high work_mem
values, but I'm not sure it's completely harmless. Imagine for example
long-running backends, executing complex queries with inaccurate
estimates. That may easily result in using much more memory than the
work_mem limit, and increasing the RSS value over time.

regards
Tomas


array-agg.sql
Description: application/sql
TopMemoryContext: 136614192 total in 16678 blocks; 136005936 free (500017 chunks); 608256 used
  TopTransactionContext: 8192 total in 1 blocks; 7904 free (0 chunks); 288 used
  TableSpace cache: 8192 total in 1 blocks; 3216 free (0 chunks); 4976 used
  Attopt cache: 8192 total in 1 blocks; 2704 free (0 chunks); 5488 used
  Type information cache: 24240 total in 2 blocks; 3744 free (0 chunks); 20496 used
  Operator lookup cache: 24576 total in 2 blocks; 11888 free (5 chunks); 12688 used
  MessageContext: 32768 total in 3 blocks; 9640 free (0 chunks); 23128 used
  Operator class cache: 8192 total in 1 blocks; 1680 free (0 chunks); 6512 used
  smgr relation table: 24576 total in 2 blocks; 9808 free (4 chunks); 14768 used
  TransactionAbortContext: 32768 total in 1 blocks; 32736 free (0 chunks); 32 used
  Portal hash: 8192 total in 1 blocks; 1680 free (0 chunks); 6512 used
  PortalMemory: 8192 total in 1 blocks; 7888 free (0 chunks); 304 used
PortalHeapMemory: 1024 total in 1 blocks; 848 free (0 chunks); 176 used
  ExecutorState: 318758960 total in 40 blocks; 5879664 free (8 chunks)

Re: [HACKERS] excessive amounts of consumed memory (RSS), triggering OOM killer

2014-12-01 Thread Andrew Dunstan


On 12/01/2014 05:39 PM, Tomas Vondra wrote:

Hi all,

while working on the patch decreasing the amount of memory consumed by
array_agg [1], I've run into some strange OOM issues. Reproducing them
using the attached SQL script is rather simple.

   [1] https://commitfest.postgresql.org/action/patch_view?id=1652

At first I thought there's some rare hidden memory leak, but I'm pretty
sure that's not the case. I've even put some explicit counters into
aset.c counting allocated/freed blocks, and it seems to be working just
fine (and matching the context tree perfectly). So no memory leak.



Doesn't this line:

   TopMemoryContext: 136614192 total in 16678 blocks; 136005936 free (500017 chunks); 608256 used


look pretty suspicious?


cheers

andrew




Re: [HACKERS] excessive amounts of consumed memory (RSS), triggering OOM killer

2014-12-01 Thread Tomas Vondra
On 2.12.2014 00:31, Andrew Dunstan wrote:
> 
> On 12/01/2014 05:39 PM, Tomas Vondra wrote:
>> Hi all,
>>
>> while working on the patch decreasing the amount of memory consumed by
>> array_agg [1], I've run into some strange OOM issues. Reproducing them
>> using the attached SQL script is rather simple.
>>
>>[1] https://commitfest.postgresql.org/action/patch_view?id=1652
>>
>> At first I thought there's some rare hidden memory leak, but I'm pretty
>> sure that's not the case. I've even put some explicit counters into
>> aset.c counting allocated/freed blocks, and it seems to be working just
>> fine (and matching the context tree perfectly). So no memory leak.
> 
> 
> Doesn't this line:
> 
>TopMemoryContext: 136614192 total in 16678 blocks; 136005936 free
> (500017 chunks); 608256 used
> 
> 
> look pretty suspicious?

It certainly looks a bit large, especially considering this is a fresh
cluster with a single DB, containing a single table (created by user),
no doubt about that.

For comparison, this is a new backend:

TopMemoryContext: 78128 total in 11 blocks; 8528 free (16 chunks); 69600
used

OTOH, it's "just" 130MB, and the RSS values are ~6GB when the OOM hits.


For the record, I only tweaked these settings in the config file:

test=# select name, setting from pg_settings where source like
'%configuration file%';
            name            |      setting
----------------------------+-------------------
 checkpoint_segments        | 32
 DateStyle                  | ISO, DMY
 default_text_search_config | pg_catalog.simple
 dynamic_shared_memory_type | posix
 lc_messages                | cs_CZ.UTF-8
 lc_monetary                | cs_CZ.UTF-8
 lc_numeric                 | cs_CZ.UTF-8
 lc_time                    | cs_CZ.UTF-8
 log_timezone               | Europe/Prague
 maintenance_work_mem       | 524288
 max_connections            | 100
 shared_buffers             | 65536
 TimeZone                   | Europe/Prague
 work_mem                   | 1048576
(14 rows)

Some of those are set by initdb, of course.

regards
Tomas




Re: [HACKERS] bug in json_to_record with arrays

2014-12-01 Thread Tom Lane
Merlin Moncure  writes:
> On Mon, Dec 1, 2014 at 3:01 PM, Tom Lane  wrote:
>> I'd be good with going to "invalid input syntax" if we also change
>> record_in to match.  Anyone have a feeling pro or con about that?

> I don't know.  "malformed array literal" is a mighty big clue that you
> have a bad "postgres format" array literal and will be well supported
> by googling -- this problem isn't new.

Good point about googling, and after all we are already using "malformed
array literal" for a subset of these cases.  Let's stick with that
wording.

regards, tom lane




Re: [HACKERS] excessive amounts of consumed memory (RSS), triggering OOM killer

2014-12-01 Thread Tom Lane
Tomas Vondra  writes:
> On 2.12.2014 00:31, Andrew Dunstan wrote:
>> Doesn't this line:
>> TopMemoryContext: 136614192 total in 16678 blocks; 136005936 free
>> (500017 chunks); 608256 used
>> look pretty suspicious?

> It certainly looks a bit large, especially considering this is a fresh
> cluster with a single DB, containing a single table (created by user),
> no doubt about that.
> OTOH, it's "just" 130MB, and the RSS values are ~6GB when the OOM hits.

Yeah, but what was that 130MB being used for?  It's not generally
considered good practice to put large or random stuff in TopMemoryContext
--- arguably, any code doing so is broken already, because almost by
definition such allocations won't be reclaimed on error.

What I suspect you're looking at here is the detritus of creation of a
huge number of memory contexts.  mcxt.c keeps its own state about existing
contexts in TopMemoryContext.  So, if we posit that those contexts weren't
real small, there's certainly room to believe that there was some major
memory bloat going on recently.

regards, tom lane




Re: [HACKERS] Turning recovery.conf into GUCs

2014-12-01 Thread Michael Paquier
On Tue, Dec 2, 2014 at 1:58 AM, Alex Shulgin  wrote:
> Here's the patch rebased against current HEAD, that is including the
> recently committed action_at_recovery_target option.
If this patch gets in, it gives a good argument to jump to 10.0 IMO.
That's not a bad thing, just the cost of making the recovery params into
GUCs, which is still a wanted feature.

> The default for the new GUC is 'pause', as in HEAD, and
> pause_at_recovery_target is removed completely in favor of it.
Makes sense. Another idea that popped out was to rename this parameter
as recovery_target_action as well, but that's not really something
this patch should care about.

> I've also taken the liberty to remove that part that errors out when
> finding $PGDATA/recovery.conf.
I am not in favor of this part. It may be better to let users know
with an error that their old configuration is not valid anymore. This
patch cuts into the flesh with a huge axe; let's be sure that users do
not ignore the painful side effects, or recovery.conf would simply be
ignored and users would not be aware of that.
Regards,
-- 
Michael




Re: [HACKERS] excessive amounts of consumed memory (RSS), triggering OOM killer

2014-12-01 Thread Tomas Vondra
On 2.12.2014 01:33, Tom Lane wrote:
> Tomas Vondra  writes:
>> On 2.12.2014 00:31, Andrew Dunstan wrote:
>>> Doesn't this line:
>>> TopMemoryContext: 136614192 total in 16678 blocks; 136005936 free
>>> (500017 chunks); 608256 used
>>> look pretty suspicious?
> 
>> It certainly looks a bit large, especially considering this is a fresh
>> cluster with a single DB, containing a single table (created by user),
>> no doubt about that.
>> OTOH, it's "just" 130MB, and the RSS values are ~6GB when the OOM hits.
> 
> Yeah, but what was that 130MB being used for?  It's not generally
> considered good practice to put large or random stuff in TopMemoryContext
> --- arguably, any code doing so is broken already, because almost by
> definition such allocations won't be reclaimed on error.
> 
> What I suspect you're looking at here is the detritus of creation of
> a huge number of memory contexts. mcxt.c keeps its own state about
>> existing contexts in TopMemoryContext. So, if we posit that those
> contexts weren't real small, there's certainly room to believe that
> there was some major memory bloat going on recently.

Aha! MemoryContextCreate allocates the memory for the new context from
TopMemoryContext explicitly, so that it survives resets of the parent
context. Is that what you had in mind by keeping state about existing
contexts?

That'd probably explain the TopMemoryContext size, because array_agg()
creates a separate context for each group. So if you have 1M groups, you
have 1M contexts. Although I don't see how the size of those contexts
would matter?
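
For anyone following along, the per-group pattern is roughly this (modeled
on accumArrayResult() as it existed at the time; treat the details as an
approximation rather than a quotation):

	/*
	 * Sketch of the array_agg transition setup: each group gets its own
	 * private memory context.  Every AllocSetContextCreate() call also
	 * allocates a control block in TopMemoryContext, which is why a hash
	 * aggregate over 1M groups leaves 1M entries there.
	 */
	arr_context = AllocSetContextCreate(rcontext,
										"accumArrayResult",
										ALLOCSET_DEFAULT_MINSIZE,
										ALLOCSET_DEFAULT_INITSIZE,
										ALLOCSET_DEFAULT_MAXSIZE);

	astate = (ArrayBuildState *)
		MemoryContextAlloc(arr_context, sizeof(ArrayBuildState));
	astate->mcontext = arr_context;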

Maybe we could move this info (list of child contexts) to the parent
context somehow, so that it's freed when the context is destroyed? Aside
from the regular blocks, of course. But maybe it's not worth the effort,
because you can only have so many memory contexts created concurrently:
for AllocSet it's at most RAM/1kB (because that's the minimum block
size) before the OOM intervenes, and in this case the contexts have 8kB
initial size, so the number of contexts is even lower.

Also, this explains the TopMemoryContext size, but not the RSS size (or
am I missing something)?

regards
Tomas




Re: [HACKERS] excessive amounts of consumed memory (RSS), triggering OOM killer

2014-12-01 Thread Tom Lane
Tomas Vondra  writes:
> On 2.12.2014 01:33, Tom Lane wrote:
>> What I suspect you're looking at here is the detritus of creation of
>> a huge number of memory contexts. mcxt.c keeps its own state about
>> existing contexts in TopMemoryContext. So, if we posit that those
>> contexts weren't real small, there's certainly room to believe that
>> there was some major memory bloat going on recently.

> Aha! MemoryContextCreate allocates the memory for the new context from
> TopMemoryContext explicitly, so that it survives resets of the parent
> context. Is that what you had in mind by keeping state about existing
> contexts?

Right.

> That'd probably explain the TopMemoryContext size, because array_agg()
> creates a separate context for each group. So if you have 1M groups, you
> have 1M contexts. Although I don't see how the size of those contexts
> would matter?

Well, if they're each 6K, that's your 6GB right there.

> Maybe we could move this info (list of child contexts) to the parent
> context somehow, so that it's freed when the context is destroyed?

We intentionally didn't do that, because in many cases it'd result in
parent contexts becoming nonempty when otherwise they'd never have
anything actually in them.  The idea was that such "shell" parent contexts
should be cheap, requiring only a control block in TopMemoryContext and
not an actual allocation arena.  This idea of a million separate child
contexts was never part of the design of course; we might need to rethink
whether that's a good idea.  Or maybe there need to be two different
policies about where to put child control blocks.

> Also, this explains the TopMemoryContext size, but not the RSS size (or
> am I missing something)?

Very possibly you're left with "islands" that prevent reclaiming very
much of the peak RAM usage.  It'd be hard to be sure without some sort
of memory map, of course.

regards, tom lane




Re: [HACKERS] Role Attribute Bitmask Catalog Representation

2014-12-01 Thread Adam Brightwell
Stephen,

> Thanks for this.  Found time to do more of a review and have a few
> comments:

Thanks for taking a look at this and for the feedback.

> I'd put the superuser_arg() check in role_has_attribute.

Ok.  Though this would affect how CATUPDATE is handled.  Peter Eisentraut
previously raised a question about whether superuser checks should be
included with CATUPDATE, which led me to create the following post:

http://www.postgresql.org/message-id/cakrt6cqovt2kiykg2gff7h9k8+jvu1149zlb0extkkk7taq...@mail.gmail.com

Certainly, we could keep has_rolcatupdate for this case and put the
superuser check in role_has_attribute, but it seems like it might be worth
taking a look at whether a superuser can bypass catupdate or not.  Just a
thought.
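
To make that concrete, here is roughly what I have in mind (a sketch
only -- get_role_attributes() is a placeholder for however the pg_authid
lookup ends up being spelled):

    static bool
    role_has_attribute(Oid roleid, RoleAttr attribute)
    {
        /*
         * Superusers implicitly have every attribute -- except,
         * possibly, CATUPDATE, per the question above.
         */
        if (superuser_arg(roleid))
            return true;

        return (get_role_attributes(roleid) & attribute) != 0;
    }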

> And then ditch the individual has_X_privilege() calls (I note that you
> didn't appear to add back the has_rolcatupdate() function..).

Ok.  I had originally tried to minimize these types of changes in this
patch, though perhaps this is the refactoring opportunity previously
mentioned.  However, the CATUPDATE question still remains.

> > + static RoleAttr
> > + set_role_attribute(RoleAttr attributes, RoleAttr attribute)
> > + {
> > +     return ((attributes & ~(0x)) | attribute);
> > + }
>
> I don't think this is doing what you think it is..?  It certainly isn't
> working as expected by AlterRole().  Rather than using a function for
> these relatively simple operations, why not just have AlterRole() do:
>
> if (isX >= 0)
>     attributes = (isX > 0) ? attributes | ROLE_ATTR_X
>                            : attributes & ~(ROLE_ATTR_X);
>

Yes, this was originally my first approach.  I don't recall why I
decided on the other route, but I agree that is preferable and simpler.


> Otherwise, you'd probably need to pass a flag into set_role_attribute()
> to indicate if you're setting or unsetting the bit, or have an
> 'unset_role_attribute()' function, but all of that seems like overkill..

Agreed.

> We don't bother with the '> 0' in any of the existing bit testing that
> goes on in the backend, so I don't think it makes sense to start now.
> Just do
>
>     if (attributes & ROLE_ATTR_SUPERUSER) ... etc
>
> and be done with it.

Ok, easy enough.

> Why not have this as 'pg_has_role_id_attribute' and then have a
> 'pg_has_role_name_attribute' also?  Seems like going with the pg_
> namespace is the direction we want to go in, though I'm not inclined to
> argue about it if folks prefer has_X.

I have no reason for one over the other, though I did ask myself that
question.  I did find it curious that in some cases there is "has_X" and
then in others "pg_has_X".  Perhaps I'm not looking in the right places,
but I haven't found anything that helps to distinguish when one vs the
other is appropriate (even if it is a general rule of thumb).
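
(For comparison, both naming styles are already exposed at the SQL
level today, e.g.:

    SELECT has_table_privilege('bob', 'accounts', 'SELECT');
    SELECT pg_has_role('bob', 'admin', 'MEMBER');

so existing precedent doesn't settle it either way.)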

> Seems like you could just make temp_array be N_ROLE_ATTRIBUTES in
> size, instead of building a list, counting it, and then building the
> array from that..?

Yes, this is true.

> Do we really need to keep has_rolinherit for any reason..?  Seems
> unnecessary given it's only got one call site and it's nearly the same
> as a call to role_has_attribute() anyway.  Ditto with
> has_rolreplication().

I really don't see any reason to, and as above, I can certainly do those
refactorings now if that is what is desired.

> Thought we were getting rid of this..?
>
> > ! #define N_ROLE_ATTRIBUTES   8   /* 1 plus the last 1<<x */
> > ! #define ROLE_ATTR_NONE      0
> > ! #define ROLE_ATTR_ALL       255 /* All currently available attributes. */
>
> Or:
>
> #define ROLE_ATTR_ALL  ((1<<N_ROLE_ATTRIBUTES)-1)

Yes, we were; however, the latter causes a syntax error with initdb. :-/

> > ! DATA(insert OID = 10 ( "POSTGRES" PGROLATTRALL -1 _null_ _null_));
> >
> >   #define BOOTSTRAP_SUPERUSERID 10
>
> Is it actually necessary to do this substitution when the value is
> #define'd in the same .h file...?  Might be something to check, if you
> haven't already.

Yes, I had previously checked this; I get the following error from
initdb (the bootstrap data is parsed literally, so the macro never gets
expanded):

FATAL:  invalid input syntax for integer: "ROLE_ATTR_ALL"

> + #define ROLE_ATTR_SUPERUSER (1<<0)
> > + #define ROLE_ATTR_INHERIT   (1<<1)
> > + #define ROLE_ATTR_CREATEROLE(1<<2)
> > + #define ROLE_ATTR_CREATEDB  (1<<3)
> > + #define ROLE_ATTR_CATUPDATE (1<<4)
> > + #define ROLE_ATTR_CANLOGIN  (1<<5)
> > + #define ROLE_ATTR_REPLICATION   (1<<6)
> > + #define ROLE_ATTR_BYPASSRLS (1<<7)
> > + #define N_ROLE_ATTRIBUTES   8
> > + #define ROLE_ATTR_NONE  0
>
> These shouldn't need to be here any more..?

No, they shouldn't; not sure how I missed that.

Thanks,
Adam

-- 
Adam Brightwell - adam.brightw...@crunchydatasolutions.com
Database Engineer - www.crunchydatasolutions.com


Re: [HACKERS] tracking commit timestamps

2014-12-01 Thread Alvaro Herrera
Petr Jelinek wrote:
> On 25/11/14 16:30, Alvaro Herrera wrote:
> >Petr Jelinek wrote:
> >>On 25/11/14 16:23, Alvaro Herrera wrote:
> >>>Robert Haas wrote:
> >>>>Maybe 0 should get translated to a NULL return, instead of a bogus
> >>>>timestamp.
> >>>
> >>>That's one idea --- surely no transaction is going to commit at 00:00:00
> >>>on 2000-01-01 anymore.  Yet this is somewhat discomforting.
> >>
> >>I solved it for xids that are out of range by returning -infinity and then
> >>changing that to NULL in sql interface, but no idea how to do that for
> >>aborted transactions.
> >
> >I guess the idea is that we just read the value from the slru and if it
> >exactly matches allballs we do the same -infinity return and translation
> >to NULL.  (Do we really love this -infinity idea?  If it's just an
> >internal API we can use a boolean instead.)
> 
> As in returning boolean instead of void as "found"? That works for me
> (for the C interface).

Petr sent me privately some changes to implement this idea.  I also
reworked the tests so that they only happen in src/test/modules (getting
rid of the one in core regress) and made them work with both enabled and
disabled track_commit_timestamp; in make installcheck, they pass
regardless of the setting of the installed server, and in make check,
they run a server with the setting enabled.

I made two more changes:
1. introduce newestCommitTs.  The original code was using lastCommitXact
to check that no "future" transaction is asked for, but this doesn't
really work, because transactions do not commit in XID order: after a
long-running transaction commits, asking for a transaction with a higher
XID that committed earlier would raise an error.

2. change CommitTsControlLock to CommitTsLock as the lock that protects
the stuff we keep in ShmemVariableCache.  The Control one is for SLRU
access, and so it might be slow at times.  This is important because we
fill the checkpoint struct from values protected by that lock, so a
checkpoint might be delayed if it happens to land in the middle of a
slru IO operation.
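
For anyone wanting to play with this, usage ends up looking like the
following (a sketch; it requires track_commit_timestamp = on in
postgresql.conf, which takes effect at server start):

    SELECT pg_xact_commit_timestamp('2001'::xid);
    SELECT * FROM pg_last_committed_xact();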

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
diff --git a/contrib/pg_upgrade/pg_upgrade.c b/contrib/pg_upgrade/pg_upgrade.c
index 3b8241b..f0a023f 100644
--- a/contrib/pg_upgrade/pg_upgrade.c
+++ b/contrib/pg_upgrade/pg_upgrade.c
@@ -423,8 +423,10 @@ copy_clog_xlog_xid(void)
 	/* set the next transaction id and epoch of the new cluster */
 	prep_status("Setting next transaction ID and epoch for new cluster");
 	exec_prog(UTILITY_LOG_FILE, NULL, true,
-			  "\"%s/pg_resetxlog\" -f -x %u \"%s\"",
-			  new_cluster.bindir, old_cluster.controldata.chkpnt_nxtxid,
+			  "\"%s/pg_resetxlog\" -f -x %u -c %u \"%s\"",
+			  new_cluster.bindir,
+			  old_cluster.controldata.chkpnt_nxtxid,
+			  old_cluster.controldata.chkpnt_nxtxid,
 			  new_cluster.pgdata);
 	exec_prog(UTILITY_LOG_FILE, NULL, true,
 			  "\"%s/pg_resetxlog\" -f -e %u \"%s\"",
diff --git a/contrib/pg_xlogdump/rmgrdesc.c b/contrib/pg_xlogdump/rmgrdesc.c
index 9397198..180818d 100644
--- a/contrib/pg_xlogdump/rmgrdesc.c
+++ b/contrib/pg_xlogdump/rmgrdesc.c
@@ -10,6 +10,7 @@
 
 #include "access/brin_xlog.h"
 #include "access/clog.h"
+#include "access/commit_ts.h"
 #include "access/gin.h"
 #include "access/gist_private.h"
 #include "access/hash.h"
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index ab8c263..e3713d3 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2673,6 +2673,20 @@ include_dir 'conf.d'
   
  
 
+ 
+  track_commit_timestamp (bool)
+  
+   track_commit_timestamp configuration parameter
+  
+  
+   
+        Record commit time of transactions. This parameter
+        can only be set in the postgresql.conf file or on the
+        server command line. The default value is off.
+   
+  
+ 
+
  
 
 
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index baf81ee..62ec275 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -15938,6 +15938,45 @@ SELECT collation for ('foo' COLLATE "de_DE");
 For example 10:20:10,14,15 means
 xmin=10, xmax=20, xip_list=10, 14, 15.

+
+   
+The functions shown in 
+provide information about transactions that have already been committed.
+These functions mainly provide information about when the transactions
+were committed. They only provide useful data when
+ configuration option is enabled
+and only for transactions that were committed after it was enabled.
+   
+
+   
+Committed transaction information
+
+ 
+  Name Return Type Description
+ 
+
+ 
+  
+   
+pg_xact_commit_timestamp
+pg_xact_commit_timestamp(xid)
+   
+   timestamp with time zone
+   get commit timestamp of a transaction
+  
+
+  
+   
+pg_last_committed_xact
+pg_last_committed_xact()
+   
+   xid xid, timestamp tim

Re: [HACKERS] [Windows,PATCH] Use faster, higher precision timer API

2014-12-01 Thread Craig Ringer
On 12/01/2014 09:51 PM, Marco Nenciarini wrote:
> I think this is a leftover, as you don't use elog afterwards.

Good catch, fixed.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
From 63edf0571961f2f2c7fd6a08b99e747ee39a3377 Mon Sep 17 00:00:00 2001
From: Craig Ringer 
Date: Fri, 12 Sep 2014 12:41:35 +0800
Subject: [PATCH 1/2] Use GetSystemTimeAsFileTime directly in windows
 gettimeofday
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

PostgreSQL was calling GetSystemTime followed by SystemTimeToFileTime in the
win32 port gettimeofday function. This is not necessary and limits the reported
precision to the 1ms granularity that the SYSTEMTIME struct can represent. By
using GetSystemTimeAsFileTime we avoid unnecessary conversions and capture
timestamps at 100ns granularity, which is then rounded to 1µs granularity for
storage in a PostgreSQL timestamp.

On most Windows systems this change will actually have no significant effect as
the system timer tick is typically between 1ms and 15ms depending on what timer
resolution currently running applications have requested. You can check this
with clockres.exe from sysinternals. Despite the platform limitation, this change
still permits capture of finer timestamps where the system is capable of
producing them, and it gets rid of an unnecessary syscall.

Future work may permit use of GetSystemTimePreciseAsFileTime on Windows 8 and
Windows Server 2012 for higher resolution time capture. This call has the same
interface as GetSystemTimeAsFileTime.
---
 src/port/gettimeofday.c | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/src/port/gettimeofday.c b/src/port/gettimeofday.c
index 75a9199..73ec406 100644
--- a/src/port/gettimeofday.c
+++ b/src/port/gettimeofday.c
@@ -44,16 +44,14 @@ int
 gettimeofday(struct timeval * tp, struct timezone * tzp)
 {
 	FILETIME	file_time;
-	SYSTEMTIME	system_time;
 	ULARGE_INTEGER ularge;
 
-	GetSystemTime(&system_time);
-	SystemTimeToFileTime(&system_time, &file_time);
+	GetSystemTimeAsFileTime(&file_time);
 	ularge.LowPart = file_time.dwLowDateTime;
 	ularge.HighPart = file_time.dwHighDateTime;
 
 	tp->tv_sec = (long) ((ularge.QuadPart - epoch) / 10000000L);
-	tp->tv_usec = (long) (system_time.wMilliseconds * 1000);
+	tp->tv_usec = (long) (((ularge.QuadPart - epoch) % 10000000L) / 10);
 
 	return 0;
 }
-- 
1.9.3

From 035bb3b19f93560d18834d2daf397c70e1555015 Mon Sep 17 00:00:00 2001
From: Craig Ringer 
Date: Thu, 18 Sep 2014 23:02:14 +0800
Subject: [PATCH 2/2] Use GetSystemTimePreciseAsFileTime when available

This will cause PostgreSQL on Windows 8 or Windows Server 2012 to
obtain high-resolution timestamps while allowing the same binaries
to run without problems on older releases.
---
 src/backend/main/main.c |  6 ++
 src/include/port.h  |  2 ++
 src/port/gettimeofday.c | 52 +++--
 3 files changed, 58 insertions(+), 2 deletions(-)

diff --git a/src/backend/main/main.c b/src/backend/main/main.c
index c51b391..73c30c5 100644
--- a/src/backend/main/main.c
+++ b/src/backend/main/main.c
@@ -260,6 +260,12 @@ startup_hacks(const char *progname)
 
 		/* In case of general protection fault, don't show GUI popup box */
 		SetErrorMode(SEM_FAILCRITICALERRORS | SEM_NOGPFAULTERRORBOX);
+
+#ifndef HAVE_GETTIMEOFDAY
+		/* Figure out which syscall to use to capture timestamp information */
+		init_win32_gettimeofday();
+#endif
+
 	}
 #endif   /* WIN32 */
 
diff --git a/src/include/port.h b/src/include/port.h
index 94a0e2f..58677ec 100644
--- a/src/include/port.h
+++ b/src/include/port.h
@@ -328,6 +328,8 @@ extern FILE *pgwin32_popen(const char *command, const char *type);
 #ifndef HAVE_GETTIMEOFDAY
 /* Last parameter not used */
 extern int	gettimeofday(struct timeval * tp, struct timezone * tzp);
+/* On windows we need to call some backend start setup for accurate timing */
+extern void init_win32_gettimeofday(void);
 #endif
 #else			/* !WIN32 */
 
diff --git a/src/port/gettimeofday.c b/src/port/gettimeofday.c
index 73ec406..ab4f491 100644
--- a/src/port/gettimeofday.c
+++ b/src/port/gettimeofday.c
@@ -35,9 +35,57 @@
static const unsigned __int64 epoch = UINT64CONST(116444736000000000);
 
 /*
+ * Both GetSystemTimeAsFileTime and GetSystemTimePreciseAsFileTime share a
+ * signature, so we can just store a pointer to whichever we find. This
+ * is the pointer's type.
+ */
+typedef VOID (WINAPI *PgGetSystemTimeFn)(LPFILETIME);
+/* Storage for the function we pick at runtime */
+static PgGetSystemTimeFn pg_get_system_time = NULL;
+
+/*
+ * During backend startup, determine if GetSystemTimePreciseAsFileTime is
+ * available and use it; if not, fall back to GetSystemTimeAsFileTime.
+ */
+void
+init_win32_gettimeofday(void)
+{
+	/*
+	 * Because it's guaranteed that kernel32.dll will be linked into our
+	 * address space already, we don't need to LoadLibrary it 

Re: [HACKERS] inherit support for foreign tables

2014-12-01 Thread Etsuro Fujita

(2014/11/28 18:14), Ashutosh Bapat wrote:

On Thu, Nov 27, 2014 at 3:52 PM, Etsuro Fujita wrote:
Apart from the above, I noticed that the patch doesn't consider to
call ExplainForeignModify during EXPLAIN for an inherited
UPDATE/DELETE, as shown below (note that there are no UPDATE remote
queries displayed):



So, I'd like to modify explain.c to show those queries like this:



postgres=# explain verbose update parent set a = a * 2 where a = 5;
                              QUERY PLAN
--------------------------------------------------------------------------
 Update on public.parent  (cost=0.00..280.77 rows=25 width=10)
   ->  Seq Scan on public.parent  (cost=0.00..0.00 rows=1 width=10)
         Output: (parent.a * 2), parent.ctid
         Filter: (parent.a = 5)
   Remote SQL: UPDATE public.mytable_1 SET a = $2 WHERE ctid = $1
   ->  Foreign Scan on public.ft1  (cost=100.00..140.38 rows=12 width=10)
         Output: (ft1.a * 2), ft1.ctid
         Remote SQL: SELECT a, ctid FROM public.mytable_1 WHERE ((a = 5)) FOR UPDATE
   Remote SQL: UPDATE public.mytable_2 SET a = $2 WHERE ctid = $1
   ->  Foreign Scan on public.ft2  (cost=100.00..140.38 rows=12 width=10)
         Output: (ft2.a * 2), ft2.ctid
         Remote SQL: SELECT a, ctid FROM public.mytable_2 WHERE ((a = 5)) FOR UPDATE
(12 rows)



Two "remote SQL" under a single node would be confusing. Also the node
is labelled as "Foreign Scan". It would be confusing to show an "UPDATE"
command under this "scan" node.


I thought of this as an extension of the existing (ie, non-inherited)
case (see the example below) to the inherited case.


postgres=# explain verbose update ft1 set a = a * 2 where a = 5;
                             QUERY PLAN
--------------------------------------------------------------------------
 Update on public.ft1  (cost=100.00..140.38 rows=12 width=10)
   Remote SQL: UPDATE public.mytable_1 SET a = $2 WHERE ctid = $1
   ->  Foreign Scan on public.ft1  (cost=100.00..140.38 rows=12 width=10)
         Output: (a * 2), ctid
         Remote SQL: SELECT a, ctid FROM public.mytable_1 WHERE ((a = 5)) FOR UPDATE
(5 rows)

I think we should show the update commands somewhere for the inherited
case, as for the non-inherited case.  Alternatives to this are welcome.



BTW, I was experimenting with DMLs being executed on multiple FDW servers
under the same transaction and found that the transactions may not be
atomic (and may be inconsistent) if one or more of the servers fail to
commit while the rest of them commit the transaction. The reason for this
is that we do not "rollback" the already "committed" changes on the
foreign servers if one or more of them fail to "commit" a transaction.
With foreign tables under an inheritance hierarchy, a single DML can span
multiple servers and the result may not be atomic (and may be
inconsistent). So,

IIUC, even the transactions over the local and the *single* remote 
server are not guaranteed to be executed atomically in the current form. 
 It is possible that the remote transaction succeeds and the local one 
fails, for example, resulting in data inconsistency between the local 
and the remote.



either we have to disable DMLs on an inheritance hierarchy which spans
multiple servers, OR make sure that such transactions follow 2PC norms.


-1 for disabling update queries on such an inheritance hierarchy because 
I think we should leave that to the user's judgment.  And I think 2PC is 
definitely a separate patch.


Thanks,

Best regards,
Etsuro Fujita


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 9.5: Better memory accounting, towards memory-bounded HashAgg

2014-12-01 Thread Jeff Davis
On Sun, 2014-11-30 at 17:49 -0800, Peter Geoghegan wrote:
> On Mon, Nov 17, 2014 at 11:39 PM, Jeff Davis  wrote:
> > I can also just move isReset there, and keep mem_allocated as a uint64.
> > That way, if I find later that I want to track the aggregated value for
> > the child contexts as well, I can split it into two uint32s. I'll hold
> > off any any such optimizations until I see some numbers from HashAgg
> > though.
> 
> I took a quick look at memory-accounting-v8.patch.
> 
> Is there some reason why mem_allocated is a uint64? All other things
> being equal, I'd follow the example of tuplesort.c's
> MemoryContextAllocHuge() API, which (following bugfix commit
> 79e0f87a1) uses int64 variables to track available memory and so on.

No reason. New version attached; that's the only change.
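
(For anyone skimming: the accounting itself is deliberately simple -- a
per-context int64 tally maintained at block, not chunk, granularity,
essentially

    context->mem_allocated += blksize;

in the block-allocation paths, with the matching subtraction when blocks
are freed; see the aset.c hunks below.)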

Regards,
Jeff Davis


*** a/src/backend/utils/mmgr/aset.c
--- b/src/backend/utils/mmgr/aset.c
***
*** 438,451  AllocSetContextCreate(MemoryContext parent,
  	  Size initBlockSize,
  	  Size maxBlockSize)
  {
! 	AllocSet	context;
  
  	/* Do the type-independent part of context creation */
! 	context = (AllocSet) MemoryContextCreate(T_AllocSetContext,
! 			 sizeof(AllocSetContext),
! 			 &AllocSetMethods,
! 			 parent,
! 			 name);
  
  	/*
  	 * Make sure alloc parameters are reasonable, and save them.
--- 438,454 
  	  Size initBlockSize,
  	  Size maxBlockSize)
  {
! 	AllocSet			set;
! 	MemoryContext		context;
  
  	/* Do the type-independent part of context creation */
! 	context = MemoryContextCreate(T_AllocSetContext,
!   sizeof(AllocSetContext),
!   &AllocSetMethods,
!   parent,
!   name);
! 
! 	set = (AllocSet) context;
  
  	/*
  	 * Make sure alloc parameters are reasonable, and save them.
***
*** 459,467  AllocSetContextCreate(MemoryContext parent,
  	if (maxBlockSize < initBlockSize)
  		maxBlockSize = initBlockSize;
  	Assert(AllocHugeSizeIsValid(maxBlockSize)); /* must be safe to double */
! 	context->initBlockSize = initBlockSize;
! 	context->maxBlockSize = maxBlockSize;
! 	context->nextBlockSize = initBlockSize;
  
  	/*
  	 * Compute the allocation chunk size limit for this context.  It can't be
--- 462,470 
  	if (maxBlockSize < initBlockSize)
  		maxBlockSize = initBlockSize;
  	Assert(AllocHugeSizeIsValid(maxBlockSize)); /* must be safe to double */
! 	set->initBlockSize = initBlockSize;
! 	set->maxBlockSize = maxBlockSize;
! 	set->nextBlockSize = initBlockSize;
  
  	/*
  	 * Compute the allocation chunk size limit for this context.  It can't be
***
*** 477,486  AllocSetContextCreate(MemoryContext parent,
  	 * and actually-allocated sizes of any chunk must be on the same side of
  	 * the limit, else we get confused about whether the chunk is "big".
  	 */
! 	context->allocChunkLimit = ALLOC_CHUNK_LIMIT;
! 	while ((Size) (context->allocChunkLimit + ALLOC_CHUNKHDRSZ) >
  		   (Size) ((maxBlockSize - ALLOC_BLOCKHDRSZ) / ALLOC_CHUNK_FRACTION))
! 		context->allocChunkLimit >>= 1;
  
  	/*
  	 * Grab always-allocated space, if requested
--- 480,489 
  	 * and actually-allocated sizes of any chunk must be on the same side of
  	 * the limit, else we get confused about whether the chunk is "big".
  	 */
! 	set->allocChunkLimit = ALLOC_CHUNK_LIMIT;
! 	while ((Size) (set->allocChunkLimit + ALLOC_CHUNKHDRSZ) >
  		   (Size) ((maxBlockSize - ALLOC_BLOCKHDRSZ) / ALLOC_CHUNK_FRACTION))
! 		set->allocChunkLimit >>= 1;
  
  	/*
  	 * Grab always-allocated space, if requested
***
*** 500,519  AllocSetContextCreate(MemoryContext parent,
  	 errdetail("Failed while creating memory context \"%s\".",
  			   name)));
  		}
! 		block->aset = context;
  		block->freeptr = ((char *) block) + ALLOC_BLOCKHDRSZ;
  		block->endptr = ((char *) block) + blksize;
! 		block->next = context->blocks;
! 		context->blocks = block;
  		/* Mark block as not to be released at reset time */
! 		context->keeper = block;
  
  		/* Mark unallocated space NOACCESS; leave the block header alone. */
  		VALGRIND_MAKE_MEM_NOACCESS(block->freeptr,
     blksize - ALLOC_BLOCKHDRSZ);
  	}
  
! 	return (MemoryContext) context;
  }
  
  /*
--- 503,525 
  	 errdetail("Failed while creating memory context \"%s\".",
  			   name)));
  		}
! 
! 		context->mem_allocated += blksize;
! 
! 		block->aset = set;
  		block->freeptr = ((char *) block) + ALLOC_BLOCKHDRSZ;
  		block->endptr = ((char *) block) + blksize;
! 		block->next = set->blocks;
! 		set->blocks = block;
  		/* Mark block as not to be released at reset time */
! 		set->keeper = block;
  
  		/* Mark unallocated space NOACCESS; leave the block header alone. */
  		VALGRIND_MAKE_MEM_NOACCESS(block->freeptr,
     blksize - ALLOC_BLOCKHDRSZ);
  	}
  
! 	return context;
  }
  
  /*
***
*** 590,595  AllocSetReset(MemoryContext context)
--- 596,603 
  		else
  		{
  			/* Normal case

Re: [HACKERS] Proposal : REINDEX SCHEMA

2014-12-01 Thread Michael Paquier
On Mon, Dec 1, 2014 at 11:29 PM, Sawada Masahiko  wrote:
> On Thu, Nov 27, 2014 at 11:43 PM, Michael Paquier
>  wrote:
>> If you get that done, I'll have an extra look at it and then let's
>> have a committer look at it.
>
> The attached patch is the latest version.
> I added new enum values like REINDEX_OBJECT_XXX,
> and changed the ReindexObject() function.
> The ACL problem is also fixed in this version.

Thanks for the updated version.
I have been looking at this patch more deeply, and I noticed a couple
of things that were missing or mishandled:
- The patch completely ignored that a catalog schema could be reindexed
- When reindexing pg_catalog as a schema, it is critical to reindex
pg_class first as reindex_relation updates pg_class.
- gram.y generated some warnings because ReindexObjectType stuff was
cast as ObjectType.
- There was a memory leak: the scan keys palloc'd in ReindexObject
were not pfree'd.
- Using do_user, do_system and now do_database was really confusing
and reduced code readability. I reworked it using the object kind.
- The patch to support SCHEMA in reindexdb had been forgotten.
Reattaching it here.
On top of that I cleaned up a couple of things, like docs and typos,
and got the patch attached. Let's have a committer look at it now; I am
marking this as "Ready for Committer".
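
With the patch applied, usage is simply:

    REINDEX SCHEMA public;

which, as documented, cannot run inside a transaction block; the second
patch adds the matching option to reindexdb.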
Regards,
-- 
Michael
From 1e384f579ffa865c364616573bd8767a24306c44 Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Tue, 2 Dec 2014 15:29:36 +0900
Subject: [PATCH 1/2] Support for REINDEX SCHEMA

When pg_catalog is reindexed, note that pg_class is reindexed first
for a behavior consistent with the cases of DATABASE and SYSTEM.
---
 doc/src/sgml/ref/reindex.sgml  |  15 +++-
 src/backend/commands/indexcmds.c   | 110 +
 src/backend/parser/gram.y  |  35 +
 src/backend/tcop/utility.c |  15 ++--
 src/bin/psql/tab-complete.c|   4 +-
 src/include/commands/defrem.h  |   3 +-
 src/include/nodes/parsenodes.h |  11 ++-
 src/test/regress/expected/create_index.out |  31 
 src/test/regress/sql/create_index.sql  |  23 ++
 9 files changed, 191 insertions(+), 56 deletions(-)

diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index cabae19..0a4c7d4 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -21,7 +21,7 @@ PostgreSQL documentation
 
  
 
-REINDEX { INDEX | TABLE | DATABASE | SYSTEM } name [ FORCE ]
+REINDEX { INDEX | TABLE | SCHEMA | DATABASE | SYSTEM } name [ FORCE ]
 
  
 
@@ -101,6 +101,19 @@ REINDEX { INDEX | TABLE | DATABASE | SYSTEM } nam

 

+SCHEMA
+
+ 
+  Recreate all indexes of the specified schema.  If a table of this
+  schema has a secondary TOAST table, that is reindexed as
+  well. Indexes on shared system catalogs are also processed.
+  This form of REINDEX cannot be executed inside a
+  transaction block.
+ 
+
+   
+
+   
 DATABASE
 
  
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 12b4ac7..a3e8a15 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -1777,34 +1777,58 @@ ReindexTable(RangeVar *relation)
 }
 
 /*
- * ReindexDatabase
- *		Recreate indexes of a database.
+ * ReindexObject
+ *		Recreate indexes of object whose type is defined by objectKind.
  *
  * To reduce the probability of deadlocks, each table is reindexed in a
  * separate transaction, so we can release the lock on it right away.
  * That means this must not be called within a user transaction block!
  */
 Oid
-ReindexDatabase(const char *databaseName, bool do_system, bool do_user)
+ReindexObject(const char *objectName, ReindexObjectType objectKind)
 {
+	Oid			objectOid;
 	Relation	relationRelation;
 	HeapScanDesc scan;
+	ScanKeyData	*scan_keys = NULL;
 	HeapTuple	tuple;
 	MemoryContext private_context;
 	MemoryContext old;
 	List	   *relids = NIL;
 	ListCell   *l;
+	int	num_keys;
 
-	AssertArg(databaseName);
+	AssertArg(objectName);
+	Assert(objectKind == REINDEX_OBJECT_SCHEMA ||
+		   objectKind == REINDEX_OBJECT_SYSTEM ||
+		   objectKind == REINDEX_OBJECT_DATABASE);
 
-	if (strcmp(databaseName, get_database_name(MyDatabaseId)) != 0)
-		ereport(ERROR,
-(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("can only reindex the currently open database")));
+	/*
+	 * Get OID of object to reindex, being the database currently being
+	 * used by session for a database or for system catalogs, or the schema
+	 * defined by caller. At the same time do permission checks that need
+	 * different processing depending on the object type.
+	 */
+	if (objectKind == REINDEX_OBJECT_SCHEMA)
+	{
+		objectOid = get_namespace_oid(objectName, false);
 
-	if (!pg_database_ownercheck(MyDatabaseId, GetUserId()))
-		aclcheck_error(ACLCHECK_NOT_OWNER, ACL_KIND_DATABASE,
-	   databaseName);
+		if (!pg_namespace_ownerc

Re: [HACKERS] New Event Trigger: table_rewrite

2014-12-01 Thread Michael Paquier
On Thu, Nov 20, 2014 at 10:37 PM, Dimitri Fontaine
 wrote:
> Alvaro Herrera  writes:
>>> CLUSTER and VACUUM are not part of the supported commands anymore, so
>>> I think that we could replace that by the addition of a reference
>>> number in the cell of ALTER TABLE for the event table_rewrite and
>>> write at the bottom of the table a description of how this event
>>> behaves with ALTER TABLE. Note as well that "might or might not" is
>>> not really helpful for the user.
>>
>> That's precisely why we have an event trigger here, I think --- for some
>> subcommands, it's not easy to determine whether a rewrite happens or
>> not.  (I think SET TYPE is the one).  I don't think we want to document
>> precisely under what condition a rewrite takes place.
>
> Yeah, the current documentation expands to the following sentence, as
> browsed in
>
>   http://www.postgresql.org/docs/9.3/interactive/sql-altertable.html
>
>   As an exception, if the USING clause does not change the column
>   contents and the old type is either binary coercible to the new type
>   or an unconstrained domain over the new type, a table rewrite is not
>   needed, but any indexes on the affected columns must still be rebuilt.
>
> I don't think that "might or might not" is less helpful in the context
> of the Event Trigger, because the whole point is that the event is only
> triggered in case of a rewrite. Of course we could cross link the two
> paragraphs or something.
>
>>> 2) The examples of SQL queries provided are still in lower case in the
>>> docs, that's contrary to the rest of the docs where upper case is used
>>> for reserved keywords.
>
> Right, being consistent trumps personal preferences, changed in the
> attached.
>
>> Yes please.   Another thing in that sample code is "not current_hour
>> between 1 and 6".  That reads strange to me.  It should be equally
>> correct to spell it as "current_hour not between 1 and 6", which seems
>> more natural. 
>
> True, fixed in the attached.
The status of this patch was not updated on the commit fest app, so I
lost track of it. Sorry for not answering earlier btw.

Two things to note about v5:
1) There are still mentions of VACUUM, ANALYZE and CLUSTER:
@@ -264,6 +275,10 @@ check_ddl_tag(const char *tag)
 		obtypename = tag + 6;
 	else if (pg_strncasecmp(tag, "DROP ", 5) == 0)
 		obtypename = tag + 5;
+	else if (pg_strncasecmp(tag, "ANALYZE", 7) == 0 ||
+			 pg_strncasecmp(tag, "CLUSTER", 7) == 0 ||
+			 pg_strncasecmp(tag, "VACUUM", 6) == 0)
+		return EVENT_TRIGGER_COMMAND_TAG_OK;
2) There are a couple of typos and instances of incorrect styling, like
"if(". Nothing huge; cleanup is done in the attached.

In any case, all the issues mentioned seem to have been addressed, so I
am switching this patch to "Ready for Committer".
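
For readers wanting to see the shape of the feature, usage follows the
pattern discussed upthread (a sketch only -- the function and trigger
names here are mine):

    CREATE FUNCTION limit_rewrites() RETURNS event_trigger
    LANGUAGE plpgsql AS $$
    DECLARE
        current_hour integer := extract(hour FROM current_time);
    BEGIN
        -- allow table rewrites only in the 01:00-06:00 window
        IF current_hour NOT BETWEEN 1 AND 6 THEN
            RAISE EXCEPTION 'table rewrites are only allowed between 1am and 6am';
        END IF;
    END;
    $$;

    CREATE EVENT TRIGGER no_daytime_rewrites ON table_rewrite
        EXECUTE PROCEDURE limit_rewrites();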
Regards,
-- 
Michael
diff --git a/src/backend/commands/event_trigger.c b/src/backend/commands/event_trigger.c
index 6a3002f..17b0818 100644
--- a/src/backend/commands/event_trigger.c
+++ b/src/backend/commands/event_trigger.c
@@ -42,11 +42,14 @@
 #include "utils/syscache.h"
 #include "tcop/utility.h"
 
-
+/*
+ * Data Structure for sql_drop and table_rewrite Event Trigger support.
+ */
 typedef struct EventTriggerQueryState
 {
 	slist_head	SQLDropList;
 	bool		in_sql_drop;
+	Oid			table_rewrite_oid;
 	MemoryContext cxt;
 	struct EventTriggerQueryState *previous;
 } EventTriggerQueryState;
@@ -119,11 +122,14 @@ static void AlterEventTriggerOwner_internal(Relation rel,
 HeapTuple tup,
 Oid newOwnerId);
 static event_trigger_command_tag_check_result check_ddl_tag(const char *tag);
+static event_trigger_command_tag_check_result check_table_rewrite_ddl_tag(
+	const char *tag);
 static void error_duplicate_filter_variable(const char *defname);
 static Datum filter_list_to_array(List *filterlist);
 static Oid insert_event_trigger_tuple(char *trigname, char *eventname,
 		   Oid evtOwner, Oid funcoid, List *tags);
 static void validate_ddl_tags(const char *filtervar, List *taglist);
+static void validate_table_rewrite_tags(const char *filtervar, List *taglist);
 static void EventTriggerInvoke(List *fn_oid_list, EventTriggerData *trigdata);
 
 /*
@@ -154,7 +160,8 @@ CreateEventTrigger(CreateEventTrigStmt *stmt)
 	/* Validate event name. */
 	if (strcmp(stmt->eventname, "ddl_command_start") != 0 &&
 		strcmp(stmt->eventname, "ddl_command_end") != 0 &&
-		strcmp(stmt->eventname, "sql_drop") != 0)
+		strcmp(stmt->eventname, "sql_drop") != 0 &&
+		strcmp(stmt->eventname, "table_rewrite") != 0)
 		ereport(ERROR,
 (errcode(ERRCODE_SYNTAX_ERROR),
  errmsg("unrecognized event name \"%s\"",
@@ -183,6 +190,9 @@ CreateEventTrigger(CreateEventTrigStmt *stmt)
 		 strcmp(stmt->eventname, "sql_drop") == 0)
 		&& tags != NULL)
 		validate_ddl_tags("tag", tags);
+	else if (strcmp(stmt->eventname, "table_rewrite") == 0
+			 && tags != NULL)
+		validate_table_rewrite_tags("tag", tags);
 
 	/