[HACKERS] pretty bad n_distinct estimate, causing HashAgg OOM on TPC-H

2015-06-17 Thread Tomas Vondra

Hi,

I'm currently running some tests on a 3TB TPC-H data set, and I tripped 
over a pretty bad n_distinct underestimate, causing OOM in HashAgg 
(which somehow illustrates the importance of the memory-bounded hashagg 
patch Jeff Davis is working on).


The problem is Q18, particularly this simple subquery:

select l_orderkey
from lineitem
group by l_orderkey
having sum(l_quantity) > 313;

which is planned like this:

                           QUERY PLAN
--------------------------------------------------------------------------
 HashAggregate  (cost=598510163.92..598515393.93 rows=418401 width=12)
   Group Key: l_orderkey
   Filter: (sum(l_quantity) > '313'::double precision)
   ->  Seq Scan on lineitem  (cost=0.00..508509923.28 rows=1848128 width=12)
(4 rows)

but sadly, in reality the l_orderkey cardinality looks like this:

tpch=# select count(distinct l_orderkey) from lineitem;
   count
------------
 4500000000
(1 row)

That's a helluva difference - not the usual one or two orders of 
magnitude, but a ~10000x underestimate.


The usual thing to do in this case is increasing the statistics target, and 
while this improves the estimate, the improvement is rather small:


   statistics target      estimate    difference
   ---------------------------------------------
   100                      429491        10000x
   1000                    4240418         1000x
   10000                  42913759          100x

I find the pattern rather strange - every time the statistics target 
increases 10x, the difference decreases 10x - maybe that's natural, but 
the perfect proportionality is suspicious IMHO.


Also, this is a quite large dataset - the table has ~18 billion rows, 
and even with target=10000 we're sampling only 3M rows, which is ~0.02%. 
That's a tiny sample, so inaccuracy is naturally expected, but OTOH the 
TPC-H dataset is damn uniform - there's pretty much no skew in the 
distributions AFAIK. So I'd expect a slightly better result.


With target=10000 the plan switches to GroupAggregate, because the 
estimate gets sufficient to exceed work_mem (2GB). But it's still way 
off, and it's mostly just a lucky coincidence.


So I'm wondering if there's some bug because of the dataset size (an 
integer overflow or something like that), so I added a bunch of logging 
into the estimator, logging all the parameters computed:


target=100 (samplerows=30000)
-----------------------------
WARNING:  attnum=1 attname=l_orderkey f1=27976 ndistinct=28977 
nmultiple=1001 toowide_cnt=0 d=28977 numer=869310000.00 
denom=2024.046627 stadistinct=429491.094029
WARNING:  ndistinct estimate attnum=1 attname=l_orderkey 
current=429491.09 adaptive=443730.00


target=1000 (samplerows=300000)
-------------------------------
WARNING:  attnum=1 attname=l_orderkey f1=279513 ndistinct=289644 
nmultiple=10131 toowide_cnt=0 d=289644 numer=86893200000.00 
denom=20491.658538 stadistinct=4240418.111618
WARNING:  ndistinct estimate attnum=1 attname=l_orderkey 
current=4240418.11 adaptive=4375171.00


target=10000 (samplerows=3000000)
---------------------------------
WARNING:  attnum=1 attname=l_orderkey f1=2797888 ndistinct=2897799 
nmultiple=99911 toowide_cnt=0 d=2897799 numer=8693397000000.00 
denom=202578.313396 stadistinct=42913759.396282
WARNING:  ndistinct estimate attnum=1 attname=l_orderkey 
current=42913759.40 adaptive=9882.00


It's totalrows=1849031 in all cases. The logs also show the estimate 
produced by the adaptive estimator (discussed in a separate thread), but 
apparently that does not change the estimates much :-(


Any ideas?

--
Tomas Vondra   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On columnar storage

2015-06-17 Thread Alvaro Herrera
Jim Nasby wrote:

 Related to idea of an 'auto join', I do wish we had the ability to access
 columns in a referenced FK table from a referring key; something like SELECT
 customer_id.first_name FROM invoice (which would be translated to SELECT
 first_name FROM invoice JOIN customer USING( customer_id )).

We already have this feature -- you just need to set the
add_missing_from GUC.

(... actually we removed that feature because it was considered
dangerous or something.)

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] On columnar storage

2015-06-17 Thread Jim Nasby

On 6/17/15 2:04 PM, Alvaro Herrera wrote:

Jim Nasby wrote:


Related to idea of an 'auto join', I do wish we had the ability to access
columns in a referenced FK table from a referring key; something like SELECT
customer_id.first_name FROM invoice (which would be translated to SELECT
first_name FROM invoice JOIN customer USING( customer_id )).


We already have this feature -- you just need to set the
add_missing_from GUC.

(... actually we removed that feature because it was considered
dangerous or something.)


That was removed in 9.0. Even so, I don't believe it added the JOIN, so 
you'd suddenly get a Cartesian product.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Data in Trouble? Get it in Treble! http://BlueTreble.com




Re: [HACKERS] [PATCH] Function to get size of asynchronous notification queue

2015-06-17 Thread Brendan Jurd
On Thu, 18 Jun 2015 at 03:06 Gurjeet Singh gurj...@singh.im wrote:

 I don't see this in the CF app; can you please add it there?


Done.  I did try to add it when I posted the email, but for some reason I
couldn't connect to commitfest.postgresql.org at all.  Seems fine now,
though.

Cheers,
BJ


Re: [HACKERS] WAL replay bugs

2015-06-17 Thread Alvaro Herrera
Michael Paquier wrote:

 From 077d675795b4907904fa4e85abed8c4528f4666f Mon Sep 17 00:00:00 2001
 From: Michael Paquier mich...@otacoo.com
 Date: Sat, 19 Jul 2014 10:40:20 +0900
 Subject: [PATCH 3/3] Buffer capture facility: check WAL replay consistency

Is there a newer version of this tech?


-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] 9.5 make world failing due to sgml tools missing

2015-06-17 Thread Tom Lane
Keith Fiske ke...@omniti.com writes:
 The current HEAD of postgres in the git repo is not building when using
 make world. It's been like this for about a month or so that I've been
 aware of. I didn't really need the world build so been making due without
 it. At PGCon now, though, so asked Bruce and he said this error was due to
 my not having the sgml tools installed, which should not be a requirement.

Dunno about "about a month"; this seems to have come in with commit
5d93ce2d0c619ba1 from last October.  But yes, that commit moved the
goalposts in terms of what is required to build the documentation.
You now must have xmllint which is something supplied by libxml2.

I am not sure this is a good thing, especially since xmllint is only
doing checking, which is of little use to consumers of the documentation.
Maybe if xmllint is not found, we should just skip the checking steps
rather than hard failing?

regards, tom lane




Re: [HACKERS] GIN function of pageinspect has inconsistency data type.

2015-06-17 Thread Jim Nasby

On 6/16/15 8:26 AM, Sawada Masahiko wrote:

I noticed while using the gin functions of pageinspect that there are some
inconsistent data types.
For example, the data type of GinMetaPageData.head and tail is
BlockNumber, i.e., uint32.
But in ginfunc.c, we get that value as int64.


You can't put a uint32 into a signed int32 (which is what the SQL int 
is). That's why it's returning a bigint.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Data in Trouble? Get it in Treble! http://BlueTreble.com




Re: [HACKERS] could not adopt C locale failure at startup on Windows

2015-06-17 Thread Noah Misch
On Wed, Jun 17, 2015 at 01:43:55PM -0400, Tom Lane wrote:
 Noah Misch n...@leadboat.com writes:
  On Mon, Jun 15, 2015 at 12:37:43PM -0400, Tom Lane wrote:
  It's mere chance that the order of calls to pg_perm_setlocale() is
  such that the code works now; and it's not hard to imagine future
  changes in gettext, or reordering of our own code, that would break it.
 
  pg_bind_textdomain_codeset() looks at one piece of the locale environment,
  namely setlocale(LC_CTYPE, NULL), so the order of pg_perm_setlocale() calls
  does not affect it.
 
 Well, my point is that that is a larger assumption about the behavior of
 pg_bind_textdomain_codeset() than I think we ought to make, even if it
 happens to be true today.

Perhaps it's just me, but I can envision changes of similar plausibility that
break under each approach and not the other.  Without a way to distinguish on
that basis, I'm left shrugging about a proposal to switch.  For that matter,
if pg_bind_textdomain_codeset() starts to act on other facets of the locale,
that's liable to be a bug independent of our choice here.  However locale
facets conflict, we expect LC_CTYPE to control the message codeset.

  There's nothing acutely bad about the alternative you
  identify here, but it's no better equipped to forestall mistakes.  Moving 
  the
  call later would slightly expand the window during which translated messages
  are garbled.
 
 I'm not exactly sure that they wouldn't be garbled anyway during the
 interval where we're setting these things up.  Until DatabaseEncoding,
 ClientEncoding, and gettext's internal notions are all in sync, we
 are at risk of that type of issue no matter what.

True; the window exists and is small enough both ways.  This is merely one
more reason to fix the bug without fixing what ain't broke.




Re: [HACKERS] 9.5 make world failing due to sgml tools missing

2015-06-17 Thread Joshua D. Drake


On 06/17/2015 01:07 PM, Tom Lane wrote:

Keith Fiske ke...@omniti.com writes:

The current HEAD of postgres in the git repo is not building when using
make world. It's been like this for about a month or so that I've been
aware of. I didn't really need the world build so been making do without
it. At PGCon now, though, so asked Bruce and he said this error was due to
my not having the sgml tools installed, which should not be a requirement.


Dunno about "about a month"; this seems to have come in with commit
5d93ce2d0c619ba1 from last October.  But yes, that commit moved the
goalposts in terms of what is required to build the documentation.
You now must have xmllint which is something supplied by libxml2.

I am not sure this is a good thing, especially since xmllint is only
doing checking, which is of little use to consumers of the documentation.
Maybe if xmllint is not found, we should just skip the checking steps
rather than hard failing?



If they are building from source, this does not seem to be an 
unrealistic package to require.


JD



regards, tom lane





--
Command Prompt, Inc. - http://www.commandprompt.com/  503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Announcing I'm offended is basically telling the world you can't
control your own emotions, so everyone else should do it for you.




Re: [HACKERS] [PATCH] Function to get size of asynchronous notification queue

2015-06-17 Thread Brendan Jurd
Posting v2 of the patch, incorporating some helpful suggestions from Merlin.

Cheers,
BJ
*** a/doc/src/sgml/func.sgml
--- b/doc/src/sgml/func.sgml
***
*** 14800,14805  SELECT * FROM pg_ls_dir('.') WITH ORDINALITY AS t(ls,n);
--- 14800,14811 ----
    </row>
  
    <row>
+    <entry><literal><function>pg_notification_queue_usage()</function></literal></entry>
+    <entry><type>double</type></entry>
+    <entry>proportion of the asynchronous notification queue currently occupied (0-1)</entry>
+   </row>
+ 
+   <row>
     <entry><literal><function>pg_my_temp_schema()</function></literal></entry>
     <entry><type>oid</type></entry>
     <entry>OID of session's temporary schema, or 0 if none</entry>
***
*** 14939,14948  SET search_path TO replaceableschema/ optional, replaceableschema/, ..
     <primary>pg_listening_channels</primary>
    </indexterm>
  
    <para>
     <function>pg_listening_channels</function> returns a set of names of
!    channels that the current session is listening to.  See <xref
!    linkend="sql-listen"> for more information.
    </para>
  
    <indexterm>
--- 14945,14963 ----
     <primary>pg_listening_channels</primary>
    </indexterm>
  
+   <indexterm>
+    <primary>pg_notification_queue_usage</primary>
+   </indexterm>
+ 
    <para>
     <function>pg_listening_channels</function> returns a set of names of
!    asynchronous notification channels that the current session is listening
!    to.  <function>pg_notification_queue_usage</function> returns the
!    proportion of the total available space for notifications currently
!    occupied by notifications that are waiting to be processed, as a
!    <type>double</type> in the range 0-1.
!    See <xref linkend="sql-listen"> and <xref linkend="sql-notify">
!    for more information.
    </para>
  
    <indexterm>
*** a/doc/src/sgml/ref/notify.sgml
--- b/doc/src/sgml/ref/notify.sgml
***
*** 166,171  NOTIFY replaceable class=PARAMETERchannel/replaceable [ , replaceable cla
--- 166,176 
 current transaction so that cleanup can proceed.
/para
    <para>
+    The function <function>pg_notification_queue_usage</function> returns the
+    proportion of the queue that is currently occupied by pending notifications.
+    See <xref linkend="functions-info"> for more information.
+   </para>
+   <para>
 A transaction that has executed commandNOTIFY/command cannot be
 prepared for two-phase commit.
/para
*** a/src/backend/commands/async.c
--- b/src/backend/commands/async.c
***
*** 371,376  static bool asyncQueueIsFull(void);
--- 371,377 
  static bool asyncQueueAdvance(volatile QueuePosition *position, int entryLength);
  static void asyncQueueNotificationToEntry(Notification *n, AsyncQueueEntry *qe);
  static ListCell *asyncQueueAddEntries(ListCell *nextNotify);
+ static double asyncQueueUsage(void);
  static void asyncQueueFillWarning(void);
  static bool SignalBackends(void);
  static void asyncQueueReadAllNotifications(void);
***
*** 1362,1387  asyncQueueAddEntries(ListCell *nextNotify)
  }
  
  /*
!  * Check whether the queue is at least half full, and emit a warning if so.
!  *
!  * This is unlikely given the size of the queue, but possible.
!  * The warnings show up at most once every QUEUE_FULL_WARN_INTERVAL.
   *
!  * Caller must hold exclusive AsyncQueueLock.
   */
! static void
! asyncQueueFillWarning(void)
  {
! 	int			headPage = QUEUE_POS_PAGE(QUEUE_HEAD);
! 	int			tailPage = QUEUE_POS_PAGE(QUEUE_TAIL);
! 	int			occupied;
! 	double		fillDegree;
! 	TimestampTz t;
  
  	occupied = headPage - tailPage;
  
  	if (occupied == 0)
! 		return;	/* fast exit for common case */
  
  	if (occupied < 0)
  	{
--- 1363,1399 
  }
  
  /*
!  * SQL function to return the proportion of the notification queue currently
!  * occupied.
!  */
! Datum
! pg_notification_queue_usage(PG_FUNCTION_ARGS)
! {
! 	double usage;
! 
! 	LWLockAcquire(AsyncQueueLock, LW_SHARED);
! 	usage = asyncQueueUsage();
! 	LWLockRelease(AsyncQueueLock);
! 
! 	PG_RETURN_FLOAT8(usage);
! }
! 
! /*
!  * Return the proportion of the queue that is currently occupied.
   *
!  * The caller must hold (at least) shared AsyncQueueLock.
   */
! static double
! asyncQueueUsage(void)
  {
! 	int		headPage = QUEUE_POS_PAGE(QUEUE_HEAD);
! 	int		tailPage = QUEUE_POS_PAGE(QUEUE_TAIL);
! 	int		occupied;
  
  	occupied = headPage - tailPage;
  
  	if (occupied == 0)
! 		return (double) 0;		/* fast exit for common case */
  
  	if (occupied < 0)
  	{
***
*** 1389,1396  asyncQueueFillWarning(void)
  		occupied += QUEUE_MAX_PAGE + 1;
  	}
  
! 	fillDegree = (double) occupied / (double) ((QUEUE_MAX_PAGE + 1) / 2);
  
  	if (fillDegree < 0.5)
  		return;
  
--- 1401,1424 
  		occupied += QUEUE_MAX_PAGE + 1;
  	}
  
! 	return (double) occupied / (double) ((QUEUE_MAX_PAGE + 1) / 2);
! }
! 
! /*
!  * Check whether the queue is at least half full, and emit a warning if so.
!  *
!  * This is unlikely given the size of the queue, but possible.
! 

Re: [HACKERS] last_analyze/last_vacuum not being updated

2015-06-17 Thread Peter Eisentraut
On 6/15/15 4:45 PM, Peter Eisentraut wrote:
 On 6/8/15 3:16 PM, Peter Eisentraut wrote:
 I'm looking at a case on 9.4.1 where the last_analyze and last_vacuum
 stats for a handful of tables seem stuck.  They don't update after
 running an ANALYZE or VACUUM command, and they don't react to
 pg_stat_reset_single_table_counters().  All of the affected tables are
 system catalogs, some shared, some not.  Other system catalogs and other
 tables have their statistics updated normally.  Any ideas (before I try
 to blow it away)?
 
 This issue somehow went away before I had time to analyze it further,
 which is weird in itself.  But now I have seen a segfault on a
 completely different 9.4 instance while querying pg_stat_databases.

This was probably bug #12918.





Re: [HACKERS] [PATCH] Function to get size of asynchronous notification queue

2015-06-17 Thread Brendan Jurd
On Thu, 18 Jun 2015 at 08:19 Merlin Moncure mmonc...@gmail.com wrote:


 scratch that.  that note already exists in sql-notify.html.  Instead,
 I'd modify that section to note that you can check queue usage with
 your new function.


I have already done so.  Under the paragraph about the queue filling up, I
have added:

The function <function>pg_notify_queue_saturation</function> returns the
proportion of the queue that is currently occupied by pending
notifications.

A link from here back to the section in System Information Functions might
be sensible?

I will rename the function with _usage as you suggest, and add the
explanation of the return value in the docs.

Cheers,
BJ


Re: [HACKERS] Auto-vacuum is not running in 9.1.12

2015-06-17 Thread Tom Lane
Alvaro Herrera alvhe...@2ndquadrant.com writes:
 Yeah, the case is pretty weird and I'm not really sure that the server
 ought to be expected to behave.  But if this is actually the only part
 of the server that misbehaves because of sudden gigantic time jumps, I
 think it's fair to patch it.  Here's a proposed patch.

I think the parenthetical bit in the comment should read plus any
fractional second, for simplicity.  Also, maybe replace 300 by
a #define constant MAX_AUTOVAC_SLEEPTIME for readability?

regards, tom lane




Re: [HACKERS] WAL replay bugs

2015-06-17 Thread Michael Paquier
On Thu, Jun 18, 2015 at 3:39 AM, Alvaro Herrera
alvhe...@2ndquadrant.com wrote:
 Michael Paquier wrote:

 From 077d675795b4907904fa4e85abed8c4528f4666f Mon Sep 17 00:00:00 2001
 From: Michael Paquier mich...@otacoo.com
 Date: Sat, 19 Jul 2014 10:40:20 +0900
 Subject: [PATCH 3/3] Buffer capture facility: check WAL replay consistency

 Is there a newer version of this tech?

Not yet.
-- 
Michael




Re: [HACKERS] [GENERAL] psql weird behaviour with charset encodings

2015-06-17 Thread Michael Paquier
On Thu, Jun 18, 2015 at 9:47 AM, Noah Misch n...@leadboat.com wrote:
 On Wed, Jun 03, 2015 at 05:25:45PM +0900, Michael Paquier wrote:
 On Tue, Jun 2, 2015 at 4:19 PM, Michael Paquier  michael.paqu...@gmail.com 
 wrote:
  On Sun, May 24, 2015 at 2:43 AM, Noah Misch n...@leadboat.com wrote:
   It would be good to purge the code of precisions on s conversion 
   specifiers,
   then Assert(!pointflag) in fmtstr() to catch new introductions.  I won't 
   plan
   to do it myself, but it would be a nice little defensive change.
 
  This sounds like a good protection idea, but as it impacts existing
  backend code relying on sprintf's port version, we should only do the
  assertion in HEAD in my opinion, and mention it in the release notes of the
  next major version at the time a patch in this area is applied. I guess

 Adding the assertion would be master-only.  We don't necessarily release-note
 C API changes.

Cool. So we are on the same page.

  that we had better backpatch the places using .*s though on back-branches.

 I would tend to back-patch only the ones that cause interesting bugs.  For
 example, we should not reach the read.c elog() calls anyway, so it's not a big
 deal if the GNU libc bug makes them a bit less helpful in back branches.
 (Thanks for the list of code sites; it was more complete than anything I had.)
 So far, only tar.c looks harmed enough to back-patch.

 Attached is a patch purging a bunch of places from using %.*s, this will
 make the code more solid when facing non-ASCII strings. I let pg_upgrade
 and pg_basebackup code paths alone as it reduces the code readability by
 moving out of this separator. We may want to fix them though if file path
 names have non-ASCII characters, but it seems less critical.

 To add the assertion, we must of course fix all uses.

Sure.

 Having seen the patch I requested, I don't like the result as much as I
 expected to like it.  The patched code is definitely harder to read and write:

 @@ -1534,7 +1541,10 @@ parseNodeString(void)
   return_value = _readDeclareCursorStmt();
   else
   {
 - elog(ERROR, "badly formatted node string \"%.32s\"...", token);
 + char buf[33];
 + memcpy(buf, token, 32);
 + buf[33] = '\0';
 + elog(ERROR, "badly formatted node string \"%s\"...", buf);
   return_value = NULL;/* keep compiler quiet */
   }

We could spread what the first patch did in readfuncs.c by having some
more macros doing the duplicated work. Not that it would improve the
code readability of those macros..

 (Apropos, that terminator belongs in buf[32], not buf[33].)

Indeed.

 Perhaps we're better off setting aside the whole idea,

At least on OS X (10.8), I am seeing that no more than the number of
bytes defined by the precision is written. So it looks like we are
safe there. So yes, thinking long-term this looks like the better
alternative. And I am wondering about the potential breakage that this
could have for third-party Postgres tools using src/port's
snprintf.

 or forcing use of snprintf.c on configurations having the bug?

I am less sure about that. It doesn't seem worth it, given that the
tendency is to evaluate the precision in terms of bytes rather than
characters.
-- 
Michael




Re: [HACKERS] [GENERAL] psql weird behaviour with charset encodings

2015-06-17 Thread Noah Misch
On Wed, Jun 03, 2015 at 05:25:45PM +0900, Michael Paquier wrote:
 On Tue, Jun 2, 2015 at 4:19 PM, Michael Paquier  michael.paqu...@gmail.com 
 wrote:
  On Sun, May 24, 2015 at 2:43 AM, Noah Misch n...@leadboat.com wrote:
   It would be good to purge the code of precisions on s conversion 
   specifiers,
   then Assert(!pointflag) in fmtstr() to catch new introductions.  I won't 
   plan
   to do it myself, but it would be a nice little defensive change.
 
  This sounds like a good protection idea, but as it impacts existing
  backend code relying on sprintf's port version, we should only do the
  assertion in HEAD in my opinion, and mention it in the release notes of the
  next major version at the time a patch in this area is applied. I guess

Adding the assertion would be master-only.  We don't necessarily release-note
C API changes.

  that we had better backpatch the places using .*s though on back-branches.

I would tend to back-patch only the ones that cause interesting bugs.  For
example, we should not reach the read.c elog() calls anyway, so it's not a big
deal if the GNU libc bug makes them a bit less helpful in back branches.
(Thanks for the list of code sites; it was more complete than anything I had.)
So far, only tar.c looks harmed enough to back-patch.

 Attached is a patch purging a bunch of places from using %.*s, this will
 make the code more solid when facing non-ASCII strings. I let pg_upgrade
 and pg_basebackup code paths alone as it reduces the code readability by
 moving out of this separator. We may want to fix them though if file path
 names have non-ASCII characters, but it seems less critical.

To add the assertion, we must of course fix all uses.

Having seen the patch I requested, I don't like the result as much as I
expected to like it.  The patched code is definitely harder to read and write:

 @@ -1534,7 +1541,10 @@ parseNodeString(void)
   return_value = _readDeclareCursorStmt();
   else
   {
 - elog(ERROR, "badly formatted node string \"%.32s\"...", token);
 + char buf[33];
 + memcpy(buf, token, 32);
 + buf[33] = '\0';
 + elog(ERROR, "badly formatted node string \"%s\"...", buf);
   return_value = NULL;/* keep compiler quiet */
   }

(Apropos, that terminator belongs in buf[32], not buf[33].)

Perhaps we're better off setting aside the whole idea, or forcing use of
snprintf.c on configurations having the bug?

Thanks,
nm




Re: [HACKERS] 9.5 make world failing due to sgml tools missing

2015-06-17 Thread Peter Eisentraut
On 6/17/15 3:35 PM, Keith Fiske wrote:
 The current HEAD of postgres in the git repo is not building when using
 make world. It's been like this for about a month or so that I've been
 aware of. I didn't really need the world build so been making do
 without it. At PGCon now, though, so asked Bruce and he said this error
 was due to my not having the sgml tools installed, which should not be a
 requirement.

make world has always required documentation build tools.




Re: [HACKERS] could not adopt C locale failure at startup on Windows

2015-06-17 Thread Tom Lane
Noah Misch n...@leadboat.com writes:
 On Mon, Jun 15, 2015 at 12:37:43PM -0400, Tom Lane wrote:
 It's mere chance that the order of calls to pg_perm_setlocale() is
 such that the code works now; and it's not hard to imagine future
 changes in gettext, or reordering of our own code, that would break it.

 pg_bind_textdomain_codeset() looks at one piece of the locale environment,
 namely setlocale(LC_CTYPE, NULL), so the order of pg_perm_setlocale() calls
 does not affect it.

Well, my point is that that is a larger assumption about the behavior of
pg_bind_textdomain_codeset() than I think we ought to make, even if it
happens to be true today.

 There's nothing acutely bad about the alternative you
 identify here, but it's no better equipped to forestall mistakes.  Moving the
 call later would slightly expand the window during which translated messages
 are garbled.

I'm not exactly sure that they wouldn't be garbled anyway during the
interval where we're setting these things up.  Until DatabaseEncoding,
ClientEncoding, and gettext's internal notions are all in sync, we
are at risk of that type of issue no matter what.

regards, tom lane




Re: [HACKERS] pg_rewind and xlogtemp files

2015-06-17 Thread Michael Paquier
On Wed, Jun 17, 2015 at 9:07 PM, Fujii Masao masao.fu...@gmail.com wrote:
 On Wed, Jun 17, 2015 at 4:57 PM, Vladimir Borodin r...@simply.name wrote:

 On 17 June 2015, at 9:48, Michael Paquier michael.paqu...@gmail.com
 wrote:

 On Wed, Jun 17, 2015 at 3:17 PM, Michael Paquier
 michael.paqu...@gmail.com wrote:

 As pointed by dev1ant on the original bug report, process_remote_file
 should ignore files named as pg_xlog/xlogtemp.*, and I think that this
 is the right thing to do. Any objections for a patch that at the same
 time makes xlogtemp. a define declaration in xlog_internal.h?


 Declaration seems to be the right thing.

 Another problem I’ve caught twice already in the same test:

 error reading xlog record: record with zero length at 0/7890
 unexpected result while fetching remote files: ERROR:  could not open file
 base/13003/t6_2424967 for reading: No such file or directory
 The servers diverged at WAL position 0/76BADD50 on timeline 303.
 Rewinding from Last common checkpoint at 0/7651F870 on timeline 303

 I don’t know if this problem could be solved the same way (by skipping such
 files)… Should I start a new thread for that?

 That's the file of the temporary table, so there is no need to copy it
 from the source server. pg_rewind can safely skip such file, I think.

Yes. It is actually recommended to copy them manually if needed from
the archive (per the docs).

 But even if we make pg_rewind skip such file, we would still get the
 similar problem. You can see the problem that I reported in other thread.
 In order to address this type of problem completely, we would need
 to apply the fix that is being discussed in that thread.
 http://www.postgresql.org/message-id/CAHGQGwEdsNgeNZo+GyrzZtjW_TkC=XC6XxrjuAZ7=x_cj1a...@mail.gmail.com

There are two things to take into account here in my opinion:
1) Ignoring files that should not be added into the filemap, like
postmaster.pid, xlogtemp, etc.
2) bypass the files that can be added in the file map, for example a
relation file or a fsm file, and prevent erroring out if they are
missing.

 BTW, even pg_basebackup doesn't skip the file of temporary table.
 But maybe we should change this, too.

 Also pg_rewind doesn't skip the files that pg_basebackup does. ISTM
 that basically pg_rewind can safely skip any files that pg_basebackup does.
 So probably we need to reconsider which file to make pg_rewind skip.

pg_rewind and basebackup.c are beginning to share many things in this
area, perhaps we should consider a common routine in let's say
libpqcommon to define if a file can be safely skipped depending on its
path name in PGDATA.
-- 
Michael




Re: [HACKERS] pg_rewind and xlogtemp files

2015-06-17 Thread Fujii Masao
On Wed, Jun 17, 2015 at 4:57 PM, Vladimir Borodin r...@simply.name wrote:

 On 17 June 2015, at 9:48, Michael Paquier michael.paqu...@gmail.com
 wrote:

 On Wed, Jun 17, 2015 at 3:17 PM, Michael Paquier
 michael.paqu...@gmail.com wrote:

 As pointed by dev1ant on the original bug report, process_remote_file
 should ignore files named as pg_xlog/xlogtemp.*, and I think that this
 is the right thing to do. Any objections for a patch that at the same
 time makes xlogtemp. a define declaration in xlog_internal.h?


 Declaration seems to be the right thing.

 Another problem I’ve caught twice already in the same test:

 error reading xlog record: record with zero length at 0/7890
 unexpected result while fetching remote files: ERROR:  could not open file
 base/13003/t6_2424967 for reading: No such file or directory
 The servers diverged at WAL position 0/76BADD50 on timeline 303.
 Rewinding from Last common checkpoint at 0/7651F870 on timeline 303

 I don’t know if this problem could be solved the same way (by skipping such
 files)… Should I start a new thread for that?

That's a temporary-table file, so there is no need to copy it
from the source server. pg_rewind can safely skip such files, I think.

But even if we make pg_rewind skip such files, we would still hit a
similar problem. You can see the problem that I reported in another thread.
In order to address this type of problem completely, we would need
to apply the fix that has been discussed in that thread.
http://www.postgresql.org/message-id/CAHGQGwEdsNgeNZo+GyrzZtjW_TkC=XC6XxrjuAZ7=x_cj1a...@mail.gmail.com

BTW, even pg_basebackup doesn't skip the files of temporary tables.
But maybe we should change this, too.

Also, pg_rewind doesn't skip the files that pg_basebackup does. ISTM
that pg_rewind can basically safely skip any file that pg_basebackup skips.
So probably we need to reconsider which files pg_rewind should skip.

Regards,

-- 
Fujii Masao




Re: [HACKERS] could not adopt C locale failure at startup on Windows

2015-06-17 Thread Noah Misch
On Mon, Jun 15, 2015 at 12:37:43PM -0400, Tom Lane wrote:
 After further thought, ISTM that this bug is evidence that 5f538ad
 was badly designed, and the proposed fix has blinkers on.  If
 pg_bind_textdomain_codeset() is looking at the locale environment,
 we should not be calling it from inside pg_perm_setlocale() at all,
 but from higher level code *after* we're done setting up the whole libc
 locale environment --- thus, after the unsetenv(LC_ALL) call in main.c,
 and somewhere near the bottom of CheckMyDatabase() in postinit.c.
 It's mere chance that the order of calls to pg_perm_setlocale() is
 such that the code works now; and it's not hard to imagine future
 changes in gettext, or reordering of our own code, that would break it.

pg_bind_textdomain_codeset() looks at one piece of the locale environment,
namely setlocale(LC_CTYPE, NULL), so the order of pg_perm_setlocale() calls
does not affect it.  There's nothing acutely bad about the alternative you
identify here, but it's no better equipped to forestall mistakes.  Moving the
call later would slightly expand the window during which translated messages
are garbled.




Re: [HACKERS] Auto-vacuum is not running in 9.1.12

2015-06-17 Thread Haribabu Kommi
On Wed, Jun 17, 2015 at 2:17 PM, Prakash Itnal prakash...@gmail.com wrote:
 Hi,

 Currently the issue is easily reproducible. Steps to reproduce:
 * Set some aggressive values for auto-vacuuming.
 * Run a heavy database update/delete/insert queries. This leads to invoking
 auto-vacuuming in quick successions.
 * Change the system time to an older date, e.g. 1995-01-01

 Suddenly auto-vacuuming stops working. Even after changing system time back
 to current time, the auto-vacuuming did not resume.

 So the question is: does postgres support system time changes?

PostgreSQL uses the system time to check whether it has reached the
nap time at which to trigger an autovacuum worker.

Is it possible for you to check what the autovacuum launcher is doing
even after the system time is reset to the current time?

I can think of a case where the launcher_determine_sleep function
returns a big sleep value because of the system time change.
Because of that, it is possible that the launcher is not launching
workers to do the vacuum. Maybe I am wrong.


Regards,
Hari Babu
Fujitsu Australia




Re: [HACKERS] pg_rewind and xlogtemp files

2015-06-17 Thread Michael Paquier
On Wed, Jun 17, 2015 at 3:17 PM, Michael Paquier
michael.paqu...@gmail.com wrote:
 As pointed out by dev1ant on the original bug report, process_remote_file
 should ignore files named pg_xlog/xlogtemp.*, and I think that this
 is the right thing to do. Any objections to a patch that at the same
 time turns "xlogtemp." into a #define in xlog_internal.h?

And attached is a patch following those lines.
-- 
Michael
diff --git a/src/backend/access/transam/timeline.c b/src/backend/access/transam/timeline.c
index c6862a8..d9e062f 100644
--- a/src/backend/access/transam/timeline.c
+++ b/src/backend/access/transam/timeline.c
@@ -302,7 +302,8 @@ writeTimeLineHistory(TimeLineID newTLI, TimeLineID parentTLI,
 	/*
 	 * Write into a temp file name.
 	 */
-	snprintf(tmppath, MAXPGPATH, XLOGDIR "/xlogtemp.%d", (int) getpid());
+	snprintf(tmppath, MAXPGPATH, XLOGDIR "/" XLOG_TEMP_PREFIX ".%d",
+			 (int) getpid());
 
 	unlink(tmppath);
 
@@ -462,7 +463,8 @@ writeTimeLineHistoryFile(TimeLineID tli, char *content, int size)
 	/*
 	 * Write into a temp file name.
 	 */
-	snprintf(tmppath, MAXPGPATH, XLOGDIR "/xlogtemp.%d", (int) getpid());
+	snprintf(tmppath, MAXPGPATH, XLOGDIR "/" XLOG_TEMP_PREFIX ".%d",
+			 (int) getpid());
 
 	unlink(tmppath);
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 150d56a..e481554 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2928,7 +2928,8 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
 	 */
 	elog(DEBUG2, "creating and filling new WAL file");
 
-	snprintf(tmppath, MAXPGPATH, XLOGDIR "/xlogtemp.%d", (int) getpid());
+	snprintf(tmppath, MAXPGPATH, XLOGDIR "/" XLOG_TEMP_PREFIX ".%d",
+			 (int) getpid());
 
 	unlink(tmppath);
 
@@ -3072,7 +3073,8 @@ XLogFileCopy(char *srcfname, int upto, XLogSegNo segno)
 	/*
 	 * Copy into a temp file name.
 	 */
-	snprintf(tmppath, MAXPGPATH, XLOGDIR "/xlogtemp.%d", (int) getpid());
+	snprintf(tmppath, MAXPGPATH, XLOGDIR "/" XLOG_TEMP_PREFIX ".%d",
+			 (int) getpid());
 
 	unlink(tmppath);
 
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index fbf9324..6d8af4e 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -131,6 +131,9 @@ typedef XLogLongPageHeaderData *XLogLongPageHeader;
 #define XLOGDIR				"pg_xlog"
 #define XLOG_CONTROL_FILE	"global/pg_control"
 
+/* Temporary segment of XLog directory */
+#define XLOG_TEMP_PREFIX	"xlogtemp"
+
 /*
  * These macros encapsulate knowledge about the exact layout of XLog file
  * names, timeline history file names, and archive-status file names.



[HACKERS] Is Postgres database server works fine if there is a change in system time?

2015-06-17 Thread Prakash Itnal
Hi,

We have observed that certain postgres child processes, e.g. the
autovacuum worker, are not working as expected if there is a system time
change. So I wanted to know whether postgres already supports system time
changes.

Please confirm whether postgres already handles system time changes or not.

-- 
Cheers,
Prakash


Re: [HACKERS] pg_rewind and xlogtemp files

2015-06-17 Thread Vladimir Borodin

 On 17 June 2015 at 9:48, Michael Paquier michael.paqu...@gmail.com wrote:
 
 On Wed, Jun 17, 2015 at 3:17 PM, Michael Paquier
 michael.paqu...@gmail.com wrote:
 As pointed out by dev1ant on the original bug report, process_remote_file
 should ignore files named pg_xlog/xlogtemp.*, and I think that this
 is the right thing to do. Any objections to a patch that at the same
 time turns "xlogtemp." into a #define in xlog_internal.h?

Declaration seems to be the right thing.

Another problem I’ve caught twice already in the same test:

error reading xlog record: record with zero length at 0/7890
unexpected result while fetching remote files: ERROR:  could not open file 
base/13003/t6_2424967 for reading: No such file or directory
The servers diverged at WAL position 0/76BADD50 on timeline 303.
Rewinding from Last common checkpoint at 0/7651F870 on timeline 303

I don’t know if this problem could be solved the same way (by skipping such 
files)… Should I start a new thread for that?

 
 And attached is a patch following those lines.
 -- 
 Michael
 20150617_rewind_xlogtemp.patch


--
May the force be with you…
https://simply.name



[HACKERS] pg_rewind and xlogtemp files

2015-06-17 Thread Michael Paquier
Hi all,

I just bumped into this report regarding pg_rewind, which also impacts
the version shipped in src/bin/pg_rewind:
https://github.com/vmware/pg_rewind/issues/45

In short, the issue is that if the source server filemap includes
xlogtemp files, pg_rewind will reliably fail with something like the
following error:

error reading xlog record: record with zero length at 1/D590
unexpected result while fetching remote files: ERROR:  could not open
file pg_xlog/xlogtemp.23056 for reading: No such file or directory

The servers diverged at WAL position 1/D4A081B0 on timeline 174.
Rewinding from Last common checkpoint at 1/D30A5650 on timeline 174

As pointed out by dev1ant on the original bug report, process_remote_file
should ignore files named pg_xlog/xlogtemp.*, and I think that this
is the right thing to do. Any objections to a patch that at the same
time turns "xlogtemp." into a #define in xlog_internal.h?

Regards,
-- 
Michael




Re: [HACKERS] checkpointer continuous flushing

2015-06-17 Thread Fabien COELHO



Hello,

Here is version 3, including many performance tests with various settings, 
representing about 100 hours of pgbench runs. This patch aims at improving 
checkpoint I/O behavior so that tps throughput is improved, late 
transactions are less frequent, and overall performance is more stable.



* SOLILOQUIZING


- The bgwriter flushing option seems ineffective, it could be removed
 from the patch?


I did that.


- Move fsync as early as possible, suggested by Andres Freund?

My opinion is that this should be left out for the nonce.


I did that.


- Take into account tablespaces, as pointed out by Andres Freund?

Alternatively, maybe it can be done from one thread, but it would probably 
involve some strange hocus-pocus to switch frequently between tablespaces.


I did the hocus-pocus approach, including a quasi-proof (not sure what 
this mathematical object is :-) in comments to show how/why it works.



* PATCH CONTENTS

 - as version 1: simplified asynchronous flush based on Andres Freund
   patch, with sync_file_range/posix_fadvise used to hint the OS that
   the buffer must be sent to disk now.

 - as version 2: checkpoint buffer sorting based on a 2007 patch by
   Takahiro Itagaki but with a smaller and static buffer allocated once.
   Also, sorting is done by chunks of 131072 pages in the current version,
   with a guc to change this value.

 - as version 2: sync/advise calls are now merged if possible,
   so less calls will be used, especially when buffers are sorted,
   but also if there are few files written.

 - new: the checkpointer balances its page writes per tablespace.
   This is done by choosing to write pages for a tablespace whose
   progress ratio (written/to_write) lags behind the overall progress
   ratio for all tablespaces, and by doing that in a round-robin manner
   so that all tablespaces regularly get some attention. No threads.

 - new: some more documentation is added.

 - removed: bgwriter_flush_to_write is removed, as there was no clear
   benefit on the (simple) tests. It could be considered for another patch.

 - question: I'm not sure I understand the checkpointer memory management.
   There is some exception handling in the checkpointer main. I wonder
   whether the allocated memory would be lost in such event and should
   be reallocated.  The patch currently assumes that the memory is kept.
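The per-tablespace balancing rule above can be sketched roughly like this
(the struct and function names are mine, not the patch's; this is only an
illustration of the round-robin/lagging-ratio idea):

```c
#include <assert.h>

/* Hypothetical per-tablespace bookkeeping */
typedef struct TsProgress
{
	int		written;		/* pages already written this checkpoint */
	int		to_write;		/* total pages to write for this tablespace */
} TsProgress;

/*
 * Pick the next tablespace to write a page for.  Scan round-robin from
 * the last position and take the first pending tablespace whose progress
 * ratio written/to_write does not exceed the overall progress ratio, so
 * all tablespaces tend to finish at the same time.  Returns -1 when the
 * checkpoint is complete.
 */
static int
next_tablespace(TsProgress *ts, int nts, int *cursor)
{
	int		total_written = 0;
	int		total_to_write = 0;
	int		tries;

	for (int i = 0; i < nts; i++)
	{
		total_written += ts[i].written;
		total_to_write += ts[i].to_write;
	}
	if (total_written >= total_to_write)
		return -1;				/* nothing left to write */

	for (tries = 0; tries < nts; tries++)
	{
		int		i = (*cursor + tries) % nts;

		if (ts[i].written >= ts[i].to_write)
			continue;			/* this tablespace is done */
		/* lagging iff written/to_write <= total_written/total_to_write */
		if ((long long) ts[i].written * total_to_write <=
			(long long) total_written * ts[i].to_write)
		{
			*cursor = (i + 1) % nts;
			return i;
		}
	}

	/* everything pending is ahead of schedule; take the next pending one */
	for (tries = 0; tries < nts; tries++)
	{
		int		i = (*cursor + tries) % nts;

		if (ts[i].written < ts[i].to_write)
		{
			*cursor = (i + 1) % nts;
			return i;
		}
	}
	return -1;
}
```

The cross-multiplication avoids floating-point division in the comparison;
a tablespace with nothing left simply drops out of the rotation.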


* PERFORMANCE TESTS

Impacts on pgbench -M prepared -N -P 1 ... (simple update test, mostly
random write activity on one table), checkpoint_completion_target=0.8, with
different settings on a 16GB 8-core host:

 . tiny: scale=10 shared_buffers=1GB checkpoint_timeout=30s time=6400s
 . small: scale=120 shared_buffers=2GB checkpoint_timeout=300s time=4000s
 . medium: scale=250 shared_buffers=4GB checkpoint_timeout=15min time=4000s
 . large: scale=1000 shared_buffers=4GB checkpoint_timeout=40min time=7500s

Note: figures noted with a star (*) had various issues during their run, so
pgbench progress figures were more or less incorrect; thus the standard
deviation computation is not to be trusted beyond "pretty bad".

Caveat: these are only benches on one host at a particular time and
location, which may or may not be reproducible nor be representative
as such of any other load.  The good news is that all these tests tell
the same thing.

- full-speed 1-client

 options   | tps performance over per second data
  flush | sort |tiny|small |   medium |large
off |  off | 687 +- 231 | 163 +- 280 * | 191 +- 626 * | 37.7 +- 25.6
off |   on | 699 +- 223 | 457 +- 315   | 479 +- 319   | 48.4 +- 28.8
 on |  off | 740 +- 125 | 143 +- 387 * | 179 +- 501 * | 37.3 +- 13.3
 on |   on | 722 +- 119 | 550 +- 140   | 549 +- 180   | 47.2 +- 16.8

- full speed 4-clients

  options  | tps performance over per second data
  flush | sort |tiny | small |medium
off |  off | 2006 +- 748 | 193 +- 1898 * | 205 +- 2465 *
off |   on | 2086 +- 673 | 819 +-  905 * | 807 +- 1029 *
 on |  off | 2212 +- 451 | 169 +- 1269 * | 160 +-  502 *
 on |   on | 2073 +- 437 | 743 +-  413   | 822 +-  467

- 100-tps 1-client max 100-ms latency

 options   | percent of late transactions
  flush | sort |  tiny | small | medium
off |  off |  6.31 | 29.44 | 30.74
off |   on |  6.23 |  8.93 |  7.12
 on |  off |  0.44 |  7.01 |  8.14
 on |   on |  0.59 |  0.83 |  1.84

- 200-tps 1-client max 100-ms latency

 options   | percent of late transactions
  flush | sort |  tiny | small | medium
off |  off | 10.00 | 50.61 | 45.51
off |   on |  8.82 | 12.75 | 12.89
 on |  off |  0.59 | 40.48 | 42.64
 on |   on |  0.53 |  1.76 |  2.59

- 400-tps 1-client (or 4 for medium) max 100-ms latency

 options   | percent of late transactions
  flush | sort | tiny | small | medium
off |  off | 12.0 | 64.28 | 68.6
off |   on | 11.3 | 22.05 | 22.6
 on |  off |  1.1 | 67.93 | 67.9
 on |   on |  0.6 |  3.24 |  

[HACKERS] [PATCH] Function to get size of asynchronous notification queue

2015-06-17 Thread Brendan Jurd
Hello hackers,

I present a patch to add a new built-in function
pg_notify_queue_saturation().

The purpose of the function is to allow users to monitor the health of
their notification queue.  In certain cases, a client connection listening
for notifications might get stuck inside a transaction, and this would
cause the queue to keep filling up, until finally it reaches capacity and
further attempts to NOTIFY error out.

The current documentation under LISTEN explains this possible gotcha, but
doesn't really suggest a useful way to address it, except to mention that
warnings will show up in the log once you get to 50% saturation of the
queue.  Unless you happen to be eyeballing the logs when it happens, that's
not a huge help.  The choice of 50% as a threshold is also very much
arbitrary, and by the time you hit 50% the problem has likely been going on
for quite a while.  If you want your nagios (or whatever) to, say, alert you
when the queue goes over 5% or 1%, your options are limited and awkward.

The patch has almost no new code.  It makes use of the existing logic for
the 50% warning.  I simply refactored that logic into a separate function
asyncQueueSaturation, and then added pg_notify_queue_saturation to make
that available in SQL.
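For illustration, the underlying computation is roughly the following (the
page-count constant and the helper name here are assumptions; only the shape
of the wraparound-aware ratio mirrors the existing 50%-full warning logic):

```c
#include <assert.h>

/* Stand-in for the SLRU queue geometry in async.c (value is assumed) */
#define QUEUE_MAX_PAGE 8191

/*
 * Sketch of the refactored helper: fraction of the notification queue
 * currently occupied, given head and tail page numbers, with the same
 * wraparound handling as the 50%-full warning.
 */
static double
queue_saturation(int headPage, int tailPage)
{
	int		occupied = headPage - tailPage;

	if (occupied < 0)
		occupied += QUEUE_MAX_PAGE + 1;		/* head has wrapped around */

	return (double) occupied / ((double) QUEUE_MAX_PAGE + 1);
}
```

Exposed at the SQL level, a monitoring check could then be as simple as
`SELECT pg_notify_queue_saturation() > 0.05;`.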

I am not convinced that pg_notify_queue_saturation is the best possible
name for this function, and am very much open to other suggestions.

The patch includes documentation, a regression test and an isolation test.

Cheers,
BJ
*** a/doc/src/sgml/func.sgml
--- b/doc/src/sgml/func.sgml
***
*** 14800,14805  SELECT * FROM pg_ls_dir('.') WITH ORDINALITY AS t(ls,n);
--- 14800,14811 
      </row>
  
      <row>
+      <entry><literal><function>pg_notify_queue_saturation()</function></literal></entry>
+      <entry><type>double</type></entry>
+      <entry>proportion of the asynchronous notification queue currently occupied</entry>
+     </row>
+ 
+     <row>
       <entry><literal><function>pg_my_temp_schema()</function></literal></entry>
       <entry><type>oid</type></entry>
       <entry>OID of session's temporary schema, or 0 if none</entry>
***
*** 14939,14948  SET search_path TO <replaceable>schema</> <optional>, <replaceable>schema</>, ..
       <primary>pg_listening_channels</primary>
      </indexterm>
  
      <para>
       <function>pg_listening_channels</function> returns a set of names of
!      channels that the current session is listening to.  See <xref
!      linkend="sql-listen"> for more information.
      </para>
  
 indexterm
--- 14945,14962 
       <primary>pg_listening_channels</primary>
      </indexterm>
  
+     <indexterm>
+      <primary>pg_notify_queue_saturation</primary>
+     </indexterm>
+ 
      <para>
       <function>pg_listening_channels</function> returns a set of names of
!      asynchronous notification channels that the current session is listening
!      to.  <function>pg_notify_queue_saturation</function> returns the proportion
!      of the total available space for notifications currently occupied by
!      notifications that are waiting to be processed.  See
!      <xref linkend="sql-listen"> and <xref linkend="sql-notify">
!      for more information.
      </para>
  
      <indexterm>
*** a/doc/src/sgml/ref/notify.sgml
--- b/doc/src/sgml/ref/notify.sgml
***
*** 166,171  NOTIFY <replaceable class="PARAMETER">channel</replaceable> [ , <replaceable cla
--- 166,175 
 current transaction so that cleanup can proceed.
    </para>
    <para>
+    The function <function>pg_notify_queue_saturation</function> returns the
+    proportion of the queue that is currently occupied by pending notifications.
+   </para>
+   <para>
     A transaction that has executed <command>NOTIFY</command> cannot be
     prepared for two-phase commit.
    </para>
*** a/src/backend/commands/async.c
--- b/src/backend/commands/async.c
***
*** 371,376  static bool asyncQueueIsFull(void);
--- 371,377 
  static bool asyncQueueAdvance(volatile QueuePosition *position, int entryLength);
  static void asyncQueueNotificationToEntry(Notification *n, AsyncQueueEntry *qe);
  static ListCell *asyncQueueAddEntries(ListCell *nextNotify);
+ static double asyncQueueSaturation(void);
  static void asyncQueueFillWarning(void);
  static bool SignalBackends(void);
  static void asyncQueueReadAllNotifications(void);
***
*** 1362,1387  asyncQueueAddEntries(ListCell *nextNotify)
  }
  
  /*
!  * Check whether the queue is at least half full, and emit a warning if so.
!  *
!  * This is unlikely given the size of the queue, but possible.
!  * The warnings show up at most once every QUEUE_FULL_WARN_INTERVAL.
   *
!  * Caller must hold exclusive AsyncQueueLock.
   */
! static void
! asyncQueueFillWarning(void)
  {
! 	int			headPage = QUEUE_POS_PAGE(QUEUE_HEAD);
! 	int			tailPage = QUEUE_POS_PAGE(QUEUE_TAIL);
! 	int			occupied;
! 	double		fillDegree;
! 	TimestampTz t;
  
  	occupied = headPage - tailPage;
  
  	if (occupied == 0)
! 		return;	/* fast exit for common case */
  
 	if (occupied < 0)
  	{
--- 1363,1399 
  }
  
  

[HACKERS] Inheritance planner CPU and memory usage change since 9.3.2

2015-06-17 Thread Thomas Munro
Hi

We saw a rather extreme performance problem in a cluster upgraded from
9.1 to 9.3.  It uses a largish number of child tables (partitions) and
many columns.  Planning a simple UPDATE via the base table started
using a very large amount of memory and CPU time.

My colleague Rushabh Lathia tracked the performance change down to
http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=c03ad5602f529787968fa3201b35c119bbc6d782
.

The call to copyObject in the loop introduced here seems to be
problematic (copying append_rel_list for every element in
append_rel_list unconditionally), though we're still trying to figure
it out.  Attached is a simple repro script, with variables to tweak.

Quite a few others have posted about this sort of thing and been
politely reminded of the 100-table caveat [1][2], which is fair enough,
but the situation seems to have got dramatically worse for some users
after an upgrade.

[1] 
http://www.postgresql.org/message-id/8c9acaa.1f453.14c0da0402f.coremail.chjis...@163.com
[2] 
http://www.postgresql.org/message-id/flat/20141107185824.2513.53...@wrigleys.postgresql.org#20141107185824.2513.53...@wrigleys.postgresql.org

-- 
Thomas Munro
http://www.enterprisedb.com


repro-planner-explosion.sh
Description: Bourne shell script



Re: [HACKERS] Inheritance planner CPU and memory usage change since 9.3.2

2015-06-17 Thread Robert Haas
On Wed, Jun 17, 2015 at 9:32 AM, Thomas Munro
thomas.mu...@enterprisedb.com wrote:
 We saw a rather extreme performance problem in a cluster upgraded from
 9.1 to 9.3.  It uses a largish number of child tables (partitions) and
 many columns.  Planning a simple UPDATE via the base table started
 using a very large amount of memory and CPU time.

 My colleague Rushabh Lathia tracked the performance change down to
 http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=c03ad5602f529787968fa3201b35c119bbc6d782
 .

 The call to copyObject in the loop introduced here seems to be
 problematic (copying append_rel_list for every element in
 append_rel_list unconditionally), though we're still trying to figure
 it out.  Attached is a simple repro script, with variables to tweak.

 Quite a few others have posted about this sort of thing and been
 politely reminded of the 100-table caveat [1][2], which is fair enough,
 but the situation seems to have got dramatically worse for some users
 after an upgrade.

Yes.  The copyObject() call introduced by this commit seems to have
complexity O(T^2*C) where T is the number of child tables and C is the
number of columns per child.  That's because the List that is being
copied is a list of AppendRelInfo nodes, each of which has a
translated_vars member that is a list of every Var in one table, and
we copy that list once per child.

It appears that in a lot of cases this work is unnecessary.  The
second (long) for loop in inheritance_planner copies root->rowMarks
and root->append_rel_list so as to be able to apply ChangeVarNodes()
to the result, but we only actually do that if the rte is of type
RTE_SUBQUERY or if it has security quals.  In the common case where we
reach inheritance_planner not because of UNION ALL but just because
the table has a bunch of inheritance children (none of which have RLS
policies applied), we copy everything and then modify none of it,
using up startlingly large amounts of memory in ways that pre-9.2
versions did not.
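A back-of-envelope way to see the O(T^2*C) growth (pure arithmetic, not
planner code):

```c
#include <assert.h>

/*
 * Rough cost of the copyObject() in inheritance_planner(): for each of
 * T children we copy a list of T AppendRelInfo nodes, each dragging
 * along a translated_vars list of roughly C Vars.
 */
static long long
nodes_copied(long long children, long long columns)
{
	return children * children * columns;
}
```

At 1000 children and 100 columns that is 10^8 Var copies per planned
UPDATE, which is consistent with the reported blow-up.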

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] PATCH: adaptive ndistinct estimator v4

2015-06-17 Thread Tomas Vondra

Hi,

On 05/13/15 23:07, Jeff Janes wrote:

With the warning it is very hard to correlate the discrepancy you do
see with which column is causing it, as the warnings don't include
table or column names (Assuming of course that you run it on a
substantial database--if you just run it on a few toy cases then the
warning works well).


That's true. I've added attnum/attname to the warning in the attached 
version of the patch.



If we want to have an explicitly experimental patch which we want
people with interesting real-world databases to report back on, what
kind of patch would it have to be to encourage that to happen? Or are
we never going to get such feedback no matter how friendly we make
it? Another problem is that you really need to have the gold standard
to compare them to, and getting that is expensive (which is why we
resort to sampling in the first place). I don't think there is much
to be done on that front other than bite the bullet and just do
it--perhaps only for the tables which have discrepancies.


Not sure. The experimental part of the patch was not really aimed at 
users outside the development community - it was meant to be used by 
members of the community, possibly testing it on customer databases. 
I don't think adding the GUC to the final release is a good idea; it's 
just noise in the config that no-one would actually use.



Some of the regressions I've seen are at least partly a bug:

+   /* find the 'm' value minimizing the difference */
+   for (m = 1; m <= total_rows; m += step)
+   {
+   double q = k / (sample_rows * m);

sample_rows and m are both integers, and their product overflows
vigorously. A simple cast to double before the multiplication fixes
the first example I produced. The estimate goes from 137,177 to
1,108,076. The reality is 1,062,223.

Perhaps m should be just be declared a double, as it is frequently
used in double arithmetic.


Yeah, I just discovered this bug independently. There's another bug: the 
adaptive_estimator takes total_rows as an integer, so it breaks for 
tables with more than INT_MAX rows. Both are fixed in v5.
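For reference, the overflow fix amounts to forcing the multiplication into
double before it happens (a sketch; the actual estimator code in v5 may
differ):

```c
#include <assert.h>

/*
 * sample_rows * m as int * int overflows long before the division runs;
 * casting to double first (or declaring m as double) avoids that while
 * staying exact enough for the estimator.
 */
static double
q_fixed(double k, int sample_rows, int m)
{
	return k / ((double) sample_rows * (double) m);
}
```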




Therefore, I think that:

1. This should be committed near the beginning of a release cycle,
not near the end. That way, if there are problem cases, we'll have a
year or so of developer test to shake them out.


It can't hurt, but how effective will it be? Will developers know or
care whether ndistinct happened to get better or worse while they
are working on other things? I would think that problems will be
found by focused testing, or during beta, and probably not by
accidental discovery during the development cycle. It can't hurt, but
I don't know how much it will help.


I agree with that - it's unlikely the regressions will get discovered 
randomly. OTOH I'd expect a non-trivial number of people on this list to 
have a few examples of ndistinct failures, and testing those would be 
more useful, I guess. But that's unlikely to find the cases that worked 
OK before and got broken by the new estimator :-(



I agree with the experimental GUC.  That way if hackers do happen to
see something suspicious, they can just turn it off and see what
difference it makes.  If they have to reverse out a patch from 6 months
ago in an area of the code they aren't particularly interested in and
then recompile their code and then juggle two different sets of
binaries, they will likely just shrug it off without investigation.


+1


--
Tomas Vondra   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 861048f..f030702 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -16,6 +16,7 @@
 
 #include math.h
 
+#include access/hash.h
 #include access/multixact.h
 #include access/transam.h
 #include access/tupconvert.h
@@ -97,6 +98,9 @@ static void update_attstats(Oid relid, bool inh,
 static Datum std_fetch_func(VacAttrStatsP stats, int rownum, bool *isNull);
 static Datum ind_fetch_func(VacAttrStatsP stats, int rownum, bool *isNull);
 
+static double adaptive_estimator(double total_rows, int sample_rows,
+int *f, int f_max);
+static int hash_comparator(const void *a, const void *b);
 
 /*
  *	analyze_rel() -- analyze one relation
@@ -1698,6 +1702,7 @@ static void compute_scalar_stats(VacAttrStatsP stats,
 	 int samplerows,
 	 double totalrows);
 static int	compare_scalars(const void *a, const void *b, void *arg);
+static int	compare_scalars_simple(const void *a, const void *b, void *arg);
 static int	compare_mcvs(const void *a, const void *b);
 
 
@@ -1816,6 +1821,23 @@ compute_minimal_stats(VacAttrStatsP stats,
 	StdAnalyzeData *mystats = (StdAnalyzeData *) stats->extra_data;
 
 	/*
+	 * The adaptive ndistinct estimator requires counts for all the
+	 * repetition counts - we can't do the sort-based count directly
+	 * 

Re: [HACKERS] Auto-vacuum is not running in 9.1.12

2015-06-17 Thread Alvaro Herrera
Prakash Itnal wrote:

 Currently the issue is easily reproducible. Steps to reproduce:
 * Set some aggressive values for auto-vacuuming.
 * Run a heavy database update/delete/insert queries. This leads to invoking
 auto-vacuuming in quick successions.
 * Change the system time to an older date, e.g. 1995-01-01
 
 Suddenly auto-vacuuming stops working. Even after changing system time back
 to current time, the auto-vacuuming did not resume.
 
 So the question is: does postgres support system time changes?

So Tom Lane just pinged me about this.  As far as I can tell, the
problem is that the clock goes backwards 20 years, then autovacuum
figures that it needs to sleep for 20 years until the next vacuum is
scheduled.  Then regardless of the clock moving forwards again, the
sleep is not affected by this.  (I didn't actually test this.)

A simple workaround would be to stop autovacuum then restart it, that
is, set autovacuum=off in postgresql.conf, send SIGHUP to postmaster
which should stop the autovacuum launcher, then set autovacuum=on and
SIGHUP again, which would start a new launcher.

As a proposed code fix, we could just clamp the sleep time to, say, 5
minutes.  In that case the launcher would wake up even if the next
database check is still a ways off, but that should be pretty harmless
-- we just compute a new sleep period and continue sleeping.  (That is,
notwithstanding the fact that the sleep time should never really go
above a minute in most configurations.)
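Sketched, the proposed clamp is trivial (the constants are illustrative;
9.1's actual launcher code differs and keeps its own minimum):

```c
#include <assert.h>

/* Illustrative bounds, in seconds */
#define MIN_AUTOVAC_SLEEP 0.1
#define MAX_AUTOVAC_SLEEP 300.0		/* the proposed 5-minute cap */

/*
 * Clamp the computed sleep so that a backwards clock jump can never make
 * the launcher sleep for years: it wakes at least every 5 minutes,
 * recomputes the schedule, and goes back to sleep.
 */
static double
clamp_sleep(double seconds)
{
	if (seconds < MIN_AUTOVAC_SLEEP)
		seconds = MIN_AUTOVAC_SLEEP;
	if (seconds > MAX_AUTOVAC_SLEEP)
		seconds = MAX_AUTOVAC_SLEEP;
	return seconds;
}
```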

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Auto-vacuum is not running in 9.1.12

2015-06-17 Thread Tom Lane
Haribabu Kommi kommi.harib...@gmail.com writes:
 I can think of a case where the launcher_determine_sleep function
 returns a big sleep value because of the system time change.
 Because of that, it is possible that the launcher is not launching
 workers to do the vacuum. Maybe I am wrong.

I talked with Alvaro about this and we agreed that's most likely what
happened.  The launcher tracks future times-to-wake-up as absolute times,
so shortly after the system clock went backwards, it could have computed
that the next time to wake up was 20 years in the future, and issued a
sleep() call for 20 years.  Fixing the system clock after that would not
have caused it to wake up again.

It looks like a SIGHUP (pg_ctl reload) ought to be enough to wake it up,
or of course you could restart the server.

In HEAD this doesn't seem like it could cause an indefinite sleep because
if nothing else, sinval queue overrun would eventually wake the launcher
even without any manual action from the DBA.  But the loop logic is
different in 9.1.

launcher_determine_sleep() does have a minimum sleep time, and it seems
like we could fairly cheaply guard against this kind of scenario by also
enforcing a maximum sleep time (of say 5 or 10 minutes).  Not quite
convinced whether it's worth the trouble though.

regards, tom lane




Re: [HACKERS] [PATCH] Function to get size of asynchronous notification queue

2015-06-17 Thread Gurjeet Singh
I don't see this in the CF app; can you please add it there?

Best regards,

On Wed, Jun 17, 2015 at 3:31 AM, Brendan Jurd dire...@gmail.com wrote:

 Hello hackers,

 I present a patch to add a new built-in function
 pg_notify_queue_saturation().

 The purpose of the function is to allow users to monitor the health of
 their notification queue.  In certain cases, a client connection listening
 for notifications might get stuck inside a transaction, and this would
 cause the queue to keep filling up, until finally it reaches capacity and
 further attempts to NOTIFY error out.

 The current documentation under LISTEN explains this possible gotcha, but
 doesn't really suggest a useful way to address it, except to mention that
 warnings will show up in the log once you get to 50% saturation of the
 queue.  Unless you happen to be eyeballing the logs when it happens, that's
 not a huge help.  The choice of 50% as a threshold is also very much
 arbitrary, and by the time you hit 50% the problem has likely been going on
 for quite a while.  If you want your nagios (or whatever) to, say, alert you
 when the queue goes over 5% or 1%, your options are limited and awkward.

 The patch has almost no new code.  It makes use of the existing logic for
 the 50% warning.  I simply refactored that logic into a separate function
 asyncQueueSaturation, and then added pg_notify_queue_saturation to make
 that available in SQL.
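 The underlying computation is simple; a rough sketch (simplified stand-ins
 for the real async.c bookkeeping, with an invented page count and wrap-around
 handling, not the actual refactored function):

```c
/* Hypothetical sketch: the fill fraction of a circular page queue, the
 * quantity a function like pg_notify_queue_saturation() would expose.
 * QUEUE_MAX_PAGE and the head/tail page numbers are simplified stand-ins
 * for the real notify-queue state. */
#include <assert.h>

#define QUEUE_MAX_PAGE 8192   /* assumed: pages in the circular queue */

/* tail_page is the oldest page still needed; head_page is where the next
 * notification is written.  Page numbers wrap modulo QUEUE_MAX_PAGE. */
static double
queue_saturation(int tail_page, int head_page)
{
    int occupied = head_page - tail_page;

    if (occupied < 0)
        occupied += QUEUE_MAX_PAGE;   /* head has wrapped past page 0 */
    return (double) occupied / QUEUE_MAX_PAGE;
}
```

 Exposing that fraction lets a monitoring system apply whatever threshold
 it likes (1%, 5%, ...) instead of waiting for the hard-coded 50% warning.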

 I am not convinced that pg_notify_queue_saturation is the best possible
 name for this function, and am very much open to other suggestions.

 The patch includes documentation, a regression test and an isolation test.

 Cheers,
 BJ






-- 
Gurjeet Singh http://gurjeet.singh.im/


Re: [HACKERS] Sequence Access Method WIP

2015-06-17 Thread Petr Jelinek

On 2015-06-15 11:32, Vik Fearing wrote:

I've been looking at these patches a bit and here are some comments:



Thanks for looking at this.


Patch 1: seqam

I would like to see an example in the docs for CREATE SEQUENCE.  That's
perhaps not possible (or desirable) with only the local seqam?  Not sure.



It is possible to have an example with the local seqam; it might be confusing 
though, given that it produces the same results as not putting USING in the query.



In the docs for pg_class, there is no mention that relam refers to
pg_seqam for sequences but pg_am for all other types.

There are errant half-sentences in the documentation, for example "to
the options in the CREATE SEQUENCE or ALTER SEQUENCE statement." in
Sequence Access Method Functions.


I think that's a side effect of all the rebases and rewrites over the 
two years(!) that this has been going forward and back. It can be easily fixed 
by proofreading before final submission. I didn't pay too much 
attention yet because it's not clear how the docs should look if 
there is no real agreement on the API. (This applies to the other comments 
about docs as well.)




I'd prefer a README instead of the long comment at the start of seqam.c.
  The other ams do that.



OK, since things have been moved to a separate directory, a README is 
doable. I personally prefer the docs in the main .c file, but I know the 
project sometimes uses a README for this.



As mentioned upthread, this patch isn't a seamless replacement for
what's already there because of the amdata field.  I wasn't part of the
conversation of FOSDEM unfortunately, and there's not enough information
in this thread to know why this solution is preferred over each seqam
having its own table type with all the columns it needs.  I see that
Heikki is waffling a bit between the two, and I have a fairly strong
opinion that amdata should be split into separate columns.  The patch
already destroys and recreates what it needs when changing access method
via ALTER SEQUENCE, so I don't really see what the problem is.



At FOSDEM we just agreed, after discussing it with Heikki, that amdata is 
simpler. Nothing too important you missed there, I guess.


I can try to summarize the differences:
- amdata is somewhat simpler in terms of code for init, alter and 
DDL, since with custom columns you have to specify them somehow and deal 
with them in the catalog; also ALTER SEQUENCE USING means that there are 
going to be columns marked as deleted, which produces needless waste, etc.
- amdata makes it easier to change the storage model, as the tuple 
descriptor is the same for all sequences

- the separate columns are much nicer from the user's point of view
- my opinion is that separate columns also more nicely separate state 
from options, and I think that if we move to a separate storage model, 
there can be a state table per AM, which solves the tuple descriptor issue
- there is probably some slight performance benefit to amdata, but I 
don't think it's big enough to be important


I personally have a slight preference for the separate-columns design, but 
honestly I am OK with both ways.




There is no \d command for sequence access methods.  Without querying
pg_seqam directly, how does one discover what's available?



Good point.



Patch 2: seqam ddl

When defining a new access method for sequences, it is possible to list
the arguments multiple times (last one wins).  Other defel loops raise
an error if the argument is specified more than once.  I haven't looked
at all of such loops to see if this is the only odd man out or not, but
I prefer the error behavior.



Hmm yeah, there should be an error. I think tsearch is the only existing 
code that doesn't enforce this; it should probably be fixed as well 
(separately, of course).
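For illustration, the duplicate-option check Vik asks for boils down to 
something like this (a simplified sketch with an invented struct; the real 
code walks a List of DefElem nodes and raises ereport(ERROR, ...) with a 
"conflicting or redundant options" message):

```c
/* Hypothetical sketch: detect an option name given more than once in a
 * DDL option list, instead of silently letting the last one win. */
#include <assert.h>
#include <string.h>

typedef struct DefElemSim
{
    const char *defname;    /* option name, as in a real DefElem */
} DefElemSim;

/* Returns the index of the first repeated option, or -1 if all option
 * names are unique.  Real code would ereport(ERROR, ...) on a hit. */
static int
find_duplicate_option(const DefElemSim *opts, int n)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < i; j++)
            if (strcmp(opts[i].defname, opts[j].defname) == 0)
                return i;
    return -1;
}
```

The quadratic scan is fine here because option lists are tiny; erroring 
out matches what most other defel loops in the tree already do.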




Patch 3: gapless_seq

I really like the idea of having a gapless sequence in contrib.
Although it has big potential to be abused, doing it manually when it's
needed (like for invoices, at least in France) is a major pain.  So big
+1 for including this.



Yeah, people make gapless sequences regardless, so it's better to provide 
one that behaves correctly; it's also quite a good test for the API.




It would be nice to be able to specify an access method when declaring a
serial or bigserial column.  This could be a separate patch later on,
though.



The patch originally had a GUC for this, but Heikki didn't like it, so it's 
left for future development.




On the whole, I think this is a pretty good patchset.  Aside from the
design decision of whether amdata is a single opaque column or a set of
columns, there are only a few things that need to be changed before it's
ready for committer, and those things are mostly documentation.



Unfortunately, amdata being opaque vs. a set of columns is the main 
issue here.


--
 Petr Jelinek                  http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

