Re: [HACKERS] SQL MERGE is quite distinct from UPSERT

2014-07-20 Thread Craig Ringer
On 07/20/2014 12:55 PM, Peter Geoghegan wrote:

> There is a *really* big
> demand for UPSERT from users, not MERGE, although MERGE is certainly
> useful too.

The inability to efficiently say "add this unique-keyed row, or if a row
with the same key already exists, replace it atomically" is a fundamental
defect in SQL itself. Vendors shouldn't need to come up with their own
versions; the standard should really cover this - much like LIMIT and
OFFSET.
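To make the defect concrete, consider the naive read-then-write idiom on a
hypothetical table "kv" (k unique): it is racy, because another session can
insert the same key between the existence check and the INSERT:

  =# create table kv(k int primary key, v text);

  =# -- naive upsert, NOT atomic: two concurrent sessions may both see
  =# -- "no row" and both attempt the INSERT, so one fails with a
  =# -- unique-constraint violation
  =# select v from kv where k = 42;
  =# insert into kv values (42, 'new');     -- if the select found nothing
  =# update kv set v = 'new' where k = 42;  -- if it found a row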

It ranks very high among the most frequently asked questions on Stack
Overflow, right up there with questions about pg_hba.conf, connection
issues on OS X, etc.

I'd be very keen to see atomic upsert in Pg. Please Cc me on any patches
/ discussion, I'll be an eager tester.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] SQL MERGE is quite distinct from UPSERT

2014-07-20 Thread Craig Ringer

On 07/21/2014 01:40 AM, Martijn van Oosterhout wrote:
> FWIW, I agree. MERGE is hard enough as it is, but trying to
> guarantee some kind of atomicity makes it nigh on impossible.
> Indeed, after reading what you wrote I think it may well be
> impossible to make it atomic *and* make it perform in the general
> case.
> 
> So, +1 UPSERT.

I totally agree. Particularly because MERGE-like behaviour is already
possible with wCTEs and/or table locking. It's not beautiful, but
neither is MERGE.
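For readers unfamiliar with the wCTE idiom alluded to above, a rough sketch
(on a hypothetical "kv" table; note it is still not safe against concurrent
inserts of the same key without a retry loop or an explicit table lock):

  =# -- try the UPDATE first; INSERT only if it touched no row
  =# with upsert as (
  =#     update kv set v = 'new' where k = 42
  =#     returning k
  =# )
  =# insert into kv (k, v)
  =# select 42, 'new'
  =# where not exists (select k from upsert);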

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] 9.5: UPDATE/DELETE .. ORDER BY .. LIMIT ..

2014-07-20 Thread Marko Tiikkaja

On 2014-06-24 11:08, Heikki Linnakangas wrote:

> IMHO this needs to work with inheritance if we are to accept it. It
> would be a rather strange limitation for no apparent reason, other than
> that we didn't bother to implement it. It doesn't seem very difficult in
> theory to add the table OID to the plan as a junk column, and use that
> in the ModifyTable node to know which table a row came from.


So I've been trying to look at this with Rukh, and it doesn't seem
nearly as easy as you make it out to be.  Consider the following example:


  =# create table foo(a int);
  =# create table foo_c1(a int, b int) inherits (foo);
  =# create table foo_c2(a int, c text) inherits (foo);

  =# update foo set a = a+1;

Currently, the way that works is that the ModifyTable node knows of
three plans, all with different target lists, targeting each of the
tables separately.  To make $SUBJECT work with inheritance, we would
somehow have to get all three different types of tuples through an
Append node to the ModifyTable (with ctid and tableoid as junk columns).
I can see how the approach Tom and Andres talk about upthread could
work, but that's a helluva lot of work just so we can check the
inheritance check box for this feature, even though it's not clear
anyone would actually want to use this on inheritance sets.  The other
approach I can think of would be to create some kind of a Frankenstein
tuple which would satisfy the needs of all tables in the set, but that
sounds really awful.


Or we could just forget that we ever had inheritance and assume that 
partitioning is the only interesting use case.  But it's not clear to me 
that we could assume the output to be the same in every case even then.


If someone has a clear idea on how this could be implemented in a 
reasonable time, we'd be happy to hear about it.



.marko




[HACKERS] Draft release notes are up for 9.3.5

2014-07-20 Thread Tom Lane
I've committed first-draft release notes into HEAD (only, so far).
As in the past couple of release cycles, I've only created a full
new documentation section for 9.3.5; tomorrow I will copy final text
into the back-branch release-note files, trimming away whatever
does not apply to a given branch.

With respect to the ongoing discussion about pg_upgrade's handling
of datminmxid/relminmxid, I've documented that patch on the assumption
that we don't really want to recommend any manual attempts to accomplish
the same result.  See the thread on -bugs for discussion if you care.

Comments, corrections?

regards, tom lane




Re: [HACKERS] pg_stat_statements cluttered with "DEALLOCATE dbdpg_p*"

2014-07-20 Thread Peter Geoghegan
On Sun, Jul 20, 2014 at 10:56 AM, Tom Lane  wrote:
> However, this is certainly a behavioral change.  Perhaps squeeze it
> into 9.4, but not the back branches?

+1

-- 
Peter Geoghegan




Re: [HACKERS] pg_stat_statements cluttered with "DEALLOCATE dbdpg_p*"

2014-07-20 Thread Fabien COELHO



>>> If you do not like my normalization hack (I do not like it much either:-), I
>>> have suggested to add "&& !IsA(parsetree, DeallocateStmt)" to the condition
>>> above, which would ignore DEALLOCATE as PREPARE and EXECUTE are currently
>>> and rightfully ignored.

>> Well, EXECUTE isn't actually ignored, but tracked via the execution
>> time. But that doesn't diminish your point with PREPARE. If we do
>> something we should go for the && !IsA(parsetree, DeallocateStmt), not
>> the normalization. The latter is pretty darn bogus.

> Agreed.  I think basically the reasoning here is "since we don't track
> PREPARE or EXECUTE, we shouldn't track DEALLOCATE either".

Yes. It is not just because it is nicely symmetric; it is also annoying to
have hundreds of useless DEALLOCATE stats in the table.

> However, this is certainly a behavioral change.  Perhaps squeeze it
> into 9.4,

That would be nice, and the one-liner looks safe enough.

> but not the back branches?

Yep. I doubt that pg_stat_statements users rely on statistics about
DEALLOCATE, so back-patching it would be quite safe as well, but I would
not advocate it.


--
Fabien.




Re: [HACKERS] pg_stat_statements cluttered with "DEALLOCATE dbdpg_p*"

2014-07-20 Thread Tom Lane
Andres Freund  writes:
> On 2014-07-20 17:01:50 +0200, Fabien COELHO wrote:
>> If you do not like my normalization hack (I do not like it much either:-), I
>> have suggested to add "&& !IsA(parsetree, DeallocateStmt)" to the condition
>> above, which would ignore DEALLOCATE as PREPARE and EXECUTE are currently
>> and rightfully ignored.

> Well, EXECUTE isn't actually ignored, but tracked via the execution
> time. But that doesn't diminish your point with PREPARE. If we do
> something we should go for the && !IsA(parsetree, DeallocateStmt), not
> the normalization. The latter is pretty darn bogus.

Agreed.  I think basically the reasoning here is "since we don't track
PREPARE or EXECUTE, we shouldn't track DEALLOCATE either".

However, this is certainly a behavioral change.  Perhaps squeeze it
into 9.4, but not the back branches?

regards, tom lane




Re: [HACKERS] SQL MERGE is quite distinct from UPSERT

2014-07-20 Thread Martijn van Oosterhout
On Sat, Jul 19, 2014 at 09:55:19PM -0700, Peter Geoghegan wrote:
> At a high level SQL MERGE is quite distinct from UPSERT, in that it is
> a utility command that performs inserts, updates and deletes while
> avoiding race conditions (e.g. unique constraint violations) on a more
> or less best effort basis. MERGE is conceptually messy. In contrast
> UPSERT is actually atomic, and having its behavior be relatively easy
> to reason about ought to be the top priority. There is a *really* big
> demand for UPSERT from users, not MERGE, although MERGE is certainly
> useful too.

FWIW, I agree. MERGE is hard enough as it is, but trying to guarantee
some kind of atomicity makes it nigh on impossible.  Indeed, after
reading what you wrote I think it may well be impossible to make it
atomic *and* make it perform in the general case.

So, +1 UPSERT.

Have a nice day,
-- 
Martijn van Oosterhout  http://svana.org/kleptog/
> He who writes carelessly confesses thereby at the very outset that he does
> not attach much importance to his own thoughts.
   -- Arthur Schopenhauer




Re: [HACKERS] bad estimation together with large work_mem generates terrible slow hash joins

2014-07-20 Thread Tomas Vondra
Attached is v9 of the patch. Aside from a few minor fixes, the main change
is that it is assumed to be combined with the "dense allocation" patch.

It also rewrites the ExecHashIncreaseNumBuckets to follow the same
pattern as ExecHashIncreaseNumBatches (i.e. scanning chunks directly,
instead of buckets). It's cleaner and more consistent.

hashjoin-nbuckets-growth-v9.patch contains only this patch, so you need
to grab the hashjoin-alloc-v4.patch from a different thread and apply it
first.

hashjoin-nbuckets-growth-v9-combined.patch contains both patches combined

regards
Tomas Vondra
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 781a736..ee3fffb 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1900,18 +1900,20 @@ show_hash_info(HashState *hashstate, ExplainState *es)
 		if (es->format != EXPLAIN_FORMAT_TEXT)
 		{
 			ExplainPropertyLong("Hash Buckets", hashtable->nbuckets, es);
+			ExplainPropertyLong("Original Hash Buckets",
+hashtable->nbuckets_original, es);
 			ExplainPropertyLong("Hash Batches", hashtable->nbatch, es);
 			ExplainPropertyLong("Original Hash Batches",
 hashtable->nbatch_original, es);
 			ExplainPropertyLong("Peak Memory Usage", spacePeakKb, es);
 		}
-		else if (hashtable->nbatch_original != hashtable->nbatch)
+		else if ((hashtable->nbatch_original != hashtable->nbatch) || (hashtable->nbuckets_original != hashtable->nbuckets))
 		{
 			appendStringInfoSpaces(es->str, es->indent * 2);
 			appendStringInfo(es->str,
-			"Buckets: %d  Batches: %d (originally %d)  Memory Usage: %ldkB\n",
-			 hashtable->nbuckets, hashtable->nbatch,
-			 hashtable->nbatch_original, spacePeakKb);
+			"Buckets: %d (originally %d)  Batches: %d (originally %d)  Memory Usage: %ldkB\n",
+			 hashtable->nbuckets, hashtable->nbuckets_original,
+			 hashtable->nbatch, hashtable->nbatch_original, spacePeakKb);
 		}
 		else
 		{
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 589b2f1..014b6f1 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -37,8 +37,10 @@
 #include "utils/lsyscache.h"
 #include "utils/syscache.h"
 
+bool enable_hash_resize = true;
 
 static void ExecHashIncreaseNumBatches(HashJoinTable hashtable);
+static void ExecHashIncreaseNumBuckets(HashJoinTable hashtable);
 static void ExecHashBuildSkewHash(HashJoinTable hashtable, Hash *node,
 	  int mcvsToUse);
 static void ExecHashSkewTableInsert(HashJoinTable hashtable,
@@ -47,6 +49,10 @@ static void ExecHashSkewTableInsert(HashJoinTable hashtable,
 		int bucketNumber);
 static void ExecHashRemoveNextSkewBucket(HashJoinTable hashtable);
 
+static char * chunk_alloc(HashJoinTable hashtable, int tupleSize);
+
+/* Memory needed for optimal number of buckets. */
+#define BUCKETS_SPACE(htab)	((htab)->nbuckets_optimal * sizeof(void*))
 
 /* 
  *		ExecHash
@@ -116,6 +122,7 @@ MultiExecHash(HashState *node)
 /* It's a skew tuple, so put it into that hash table */
 ExecHashSkewTableInsert(hashtable, slot, hashvalue,
 		bucketNumber);
+hashtable->skewTuples += 1;
 			}
 			else
 			{
@@ -126,10 +133,34 @@ MultiExecHash(HashState *node)
 		}
 	}
 
+	/* If average number of tuples per bucket is over the defined threshold,
+	 * increase the number of buckets to get below it. */
+	if (enable_hash_resize) {
+
+		/* 
+		 * Do we actually need to resize? Only scale up, don't scale down (for example
+		 * after adding a batch, it's possible that nbuckets > nbuckets_optimal).
+		 */
+		if (hashtable->nbuckets < hashtable->nbuckets_optimal) {
+
+#ifdef HJDEBUG
+			printf("Increasing nbuckets %d => %d\n",
+   hashtable->nbuckets, hashtable->nbuckets_optimal);
+#endif
+			ExecHashIncreaseNumBuckets(hashtable);
+		}
+
+	}
+
+	/* Account for the buckets in spaceUsed (reported in EXPLAIN ANALYZE) */
+	hashtable->spaceUsed += BUCKETS_SPACE(hashtable);
+	if (hashtable->spaceUsed > hashtable->spacePeak)
+			hashtable->spacePeak = hashtable->spaceUsed;
+
 	/* must provide our own instrumentation support */
 	if (node->ps.instrument)
 		InstrStopNode(node->ps.instrument, hashtable->totalTuples);
-
+	
 	/*
 	 * We do not return the hash table directly because it's not a subtype of
 	 * Node, and so would violate the MultiExecProcNode API.  Instead, our
@@ -223,7 +254,6 @@ ExecEndHash(HashState *node)
 	ExecEndNode(outerPlan);
 }
 
-
 /* 
  *		ExecHashTableCreate
  *
@@ -239,6 +269,7 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	int			nbatch;
 	int			num_skew_mcvs;
 	int			log2_nbuckets;
+	int			log2_nbatch;
 	int			nkeys;
 	int			i;
 	ListCell   *ho;
@@ -263,6 +294,10 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	log2_nbuckets = my_log2(nbuckets);
 	Assert(nbuckets == (1 << log2_nbuckets));
 
+	/* nb

Re: [HACKERS] tweaking NTUP_PER_BUCKET

2014-07-20 Thread Tomas Vondra
On 19.7.2014 20:24, Tomas Vondra wrote:
> On 13.7.2014 21:32, Tomas Vondra wrote:
>> The current patch only implements this for tuples in the main
>> hash table, not for skew buckets. I plan to do that, but it will
>> require separate chunks for each skew bucket (so we can remove it
>> without messing with all of them). The chunks for skew buckets
>> should be smaller (I'm thinking about ~4kB), because there'll be
>> more of them, but OTOH those buckets are for frequent values so the
>> chunks should not remain empty.
> 
> I've looked into extending the dense allocation to the skew buckets, 
> and I think we shouldn't do that. I got about 50% of the changes and 
> then just threw it out because it turned out quite pointless.
> 
> The amount of memory for skew buckets is limited to 2% of work mem, 
> so even with 100% overhead it'll use ~4% of work mem. So there's 
> pretty much nothing to gain here. So the additional complexity 
> introduced by the dense allocation seems pretty pointless.
> 
> I'm not entirely happy with the code allocating some memory densely 
> and some using traditional palloc, but it certainly seems cleaner 
> than the code I had.
> 
> So I think the patch is mostly OK as is.

Attached is v4 of the patch, mostly with minor improvements and cleanups
(removed MemoryContextStats, code related to skew buckets).

regards
Tomas
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 589b2f1..6a9d956 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -47,6 +47,7 @@ static void ExecHashSkewTableInsert(HashJoinTable hashtable,
 		int bucketNumber);
 static void ExecHashRemoveNextSkewBucket(HashJoinTable hashtable);
 
+static char * chunk_alloc(HashJoinTable hashtable, int tupleSize);
 
 /* 
  *		ExecHash
@@ -129,7 +130,7 @@ MultiExecHash(HashState *node)
 	/* must provide our own instrumentation support */
 	if (node->ps.instrument)
 		InstrStopNode(node->ps.instrument, hashtable->totalTuples);
-
+	
 	/*
 	 * We do not return the hash table directly because it's not a subtype of
 	 * Node, and so would violate the MultiExecProcNode API.  Instead, our
@@ -223,7 +224,6 @@ ExecEndHash(HashState *node)
 	ExecEndNode(outerPlan);
 }
 
-
 /* 
  *		ExecHashTableCreate
  *
@@ -294,6 +294,8 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	hashtable->spaceAllowedSkew =
 		hashtable->spaceAllowed * SKEW_WORK_MEM_PERCENT / 100;
 
+	hashtable->chunks_batch = NULL;
+
 	/*
 	 * Get info about the hash functions to be used for each hash key. Also
 	 * remember whether the join operators are strict.
@@ -556,10 +558,10 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 	int			oldnbatch = hashtable->nbatch;
 	int			curbatch = hashtable->curbatch;
 	int			nbatch;
-	int			i;
 	MemoryContext oldcxt;
 	long		ninmemory;
 	long		nfreed;
+	HashChunk	oldchunks = hashtable->chunks_batch;
 
 	/* do nothing if we've decided to shut off growth */
 	if (!hashtable->growEnabled)
@@ -612,51 +614,70 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 	 */
 	ninmemory = nfreed = 0;
 
-	for (i = 0; i < hashtable->nbuckets; i++)
-	{
-		HashJoinTuple prevtuple;
-		HashJoinTuple tuple;
+	/* we're going to scan through the chunks directly, not buckets, so we can reset the
+	 * buckets and reinsert all the tuples there - just set them to NULL */
+	memset(hashtable->buckets, 0, sizeof(void*) * hashtable->nbuckets);
 
-		prevtuple = NULL;
-		tuple = hashtable->buckets[i];
+	/* reset the chunks (we have a copy in oldchunks) - this way we'll allocate */
+	hashtable->chunks_batch = NULL;
 
-		while (tuple != NULL)
-		{
-			/* save link in case we delete */
-			HashJoinTuple nexttuple = tuple->next;
-			int			bucketno;
-			int			batchno;
+	/* so, let's scan through the old chunks, and all tuples in each chunk */
+	while (oldchunks != NULL) {
+
+		/* pointer to the next memory chunk (may be NULL for the last chunk) */
+		HashChunk nextchunk = oldchunks->next;
+
+		/* position within the buffer (up to oldchunks->used) */
+		size_t idx = 0;	
+
+		/* process all tuples stored in this chunk (and then free it) */
+		while (idx < oldchunks->used) {
+
+			/* get the hashjoin tuple and mintuple stored right after it */
+			HashJoinTuple hashTuple = (HashJoinTuple)(oldchunks->data + idx);
+			MinimalTuple tuple = HJTUPLE_MINTUPLE(hashTuple);
+
+			int		hashTupleSize = (HJTUPLE_OVERHEAD + tuple->t_len);
+
+			int		bucketno;
+			int		batchno;
 
 			ninmemory++;
-			ExecHashGetBucketAndBatch(hashtable, tuple->hashvalue,
+			ExecHashGetBucketAndBatch(hashtable, hashTuple->hashvalue,
 	  &bucketno, &batchno);
-			Assert(bucketno == i);
+
 			if (batchno == curbatch)
 			{
-/* keep tuple */
-prevtuple = tuple;
+/* keep tuple - this means we need to copy it into the new chunks */
+HashJoinTuple copyTuple = (H

Re: [HACKERS] pg_stat_statements cluttered with "DEALLOCATE dbdpg_p*"

2014-07-20 Thread Fabien COELHO


> [...]. If we do something we should go for the && !IsA(parsetree,
> DeallocateStmt), not the normalization.

Ok.

> The latter is pretty darn bogus.

Yep:-) I'm fine with ignoring DEALLOCATE altogether.

--
Fabien.




Re: [HACKERS] pg_stat_statements cluttered with "DEALLOCATE dbdpg_p*"

2014-07-20 Thread Andres Freund
On 2014-07-20 17:01:50 +0200, Fabien COELHO wrote:
> 
> >That's because PREPARE isn't executed as its own statement, but done on
> >the protocol level (which will need noticeably fewer messages). There's
> >no builtin logic to ignore actual PREPARE statements.
> 
> ISTM that there is indeed a special handling in function pgss_ProcessUtility
> for PREPARE and EXECUTE:
> 
>   /*
>* If it's an EXECUTE statement, we don't track it and don't increment 
> the
>* nesting level.  This allows the cycles to be charged to the 
> underlying
>* PREPARE instead (by the Executor hooks), which is much more useful.
>*
>* We also don't track execution of PREPARE.  If we did, we would get 
> one
>* hash table entry for the PREPARE (with hash calculated from the query
>* string), and then a different one with the same query string (but 
> hash
>* calculated from the query tree) would be used to accumulate costs of
>* ensuing EXECUTEs.  This would be confusing, and inconsistent with 
> other
>* cases where planning time is not included at all.
>*/
>   if (pgss_track_utility && pgss_enabled() &&
>   !IsA(parsetree, ExecuteStmt) &&
>   !IsA(parsetree, PrepareStmt))
>   ... standard handling ...
> else
> ... special "no" handling ...
> 
> >So I don't think your consistency argument counts as much here.
> 
> I think that given the above code, my argument stands reasonably.

Ick. You have a point.

> If you do not like my normalization hack (I do not like it much either:-), I
> have suggested to add "&& !IsA(parsetree, DeallocateStmt)" to the condition
> above, which would ignore DEALLOCATE as PREPARE and EXECUTE are currently
> and rightfully ignored.

Well, EXECUTE isn't actually ignored, but tracked via the execution
time. But that doesn't diminish your point with PREPARE. If we do
something we should go for the && !IsA(parsetree, DeallocateStmt), not
the normalization. The latter is pretty darn bogus.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] pg_stat_statements cluttered with "DEALLOCATE dbdpg_p*"

2014-07-20 Thread Fabien COELHO



>> That's because PREPARE isn't executed as its own statement, but done on
>> the protocol level (which will need noticeably fewer messages). There's
>> no builtin logic to ignore actual PREPARE statements.

> ISTM that there is indeed a special handling in function
> pgss_ProcessUtility for PREPARE and EXECUTE:
>
> [...]

For completeness, here is the one-liner patch to handle DEALLOCATE as
PREPARE & EXECUTE are handled. It is cleaner than the other one, but then
DEALLOCATE disappears from the table, as PREPARE and EXECUTE do.


--
Fabien.

diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 07f09e1..a8000cb 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -955,10 +955,14 @@ pgss_ProcessUtility(Node *parsetree, const char *queryString,
 	 * calculated from the query tree) would be used to accumulate costs of
 	 * ensuing EXECUTEs.  This would be confusing, and inconsistent with other
 	 * cases where planning time is not included at all.
+	 *
+	 * DEALLOCATE of a prepared statement is skipped too, as some drivers
+	 * use process/counter-based names which end up cluttering the stats table.
 	 */
 	if (pgss_track_utility && pgss_enabled() &&
 		!IsA(parsetree, ExecuteStmt) &&
-		!IsA(parsetree, PrepareStmt))
+		!IsA(parsetree, PrepareStmt) &&
+		!IsA(parsetree, DeallocateStmt))
 	{
 		instr_time	start;
 		instr_time	duration;



Re: [HACKERS] pg_stat_statements cluttered with "DEALLOCATE dbdpg_p*"

2014-07-20 Thread Fabien COELHO



> That's because PREPARE isn't executed as its own statement, but done on
> the protocol level (which will need noticeably fewer messages). There's
> no builtin logic to ignore actual PREPARE statements.

ISTM that there is indeed a special handling in function
pgss_ProcessUtility for PREPARE and EXECUTE:

	/*
	 * If it's an EXECUTE statement, we don't track it and don't increment the
	 * nesting level.  This allows the cycles to be charged to the underlying
	 * PREPARE instead (by the Executor hooks), which is much more useful.
	 *
	 * We also don't track execution of PREPARE.  If we did, we would get one
	 * hash table entry for the PREPARE (with hash calculated from the query
	 * string), and then a different one with the same query string (but hash
	 * calculated from the query tree) would be used to accumulate costs of
	 * ensuing EXECUTEs.  This would be confusing, and inconsistent with other
	 * cases where planning time is not included at all.
	 */
	if (pgss_track_utility && pgss_enabled() &&
		!IsA(parsetree, ExecuteStmt) &&
		!IsA(parsetree, PrepareStmt))
		... standard handling ...
	else
		... special "no" handling ...

> So I don't think your consistency argument counts as much here.

I think that given the above code, my argument stands reasonably.

If you do not like my normalization hack (I do not like it much either:-),
I have suggested to add "&& !IsA(parsetree, DeallocateStmt)" to the
condition above, which would ignore DEALLOCATE as PREPARE and EXECUTE are
currently and rightfully ignored.


--
Fabien.




Re: [HACKERS] tweaking NTUP_PER_BUCKET

2014-07-20 Thread Tomas Vondra
On 20.7.2014 00:12, Tomas Vondra wrote:
> On 19.7.2014 23:07, Tomas Vondra wrote:
>> On 19.7.2014 20:28, Tomas Vondra wrote:
>> For the first case, a WARNING at the end of estimate_hash_bucketsize
>> says this:
>>
>> WARNING:  nbuckets=8388608.00 estfract=0.01
>> WARNING:  nbuckets=65536.00 estfract=0.000267
>>
>> There are 4.3M rows in the big_table, so it says ~4 tuples (selectivity
>> >= 1e-6), and ~10 tuples/bucket for the small_table (42k rows).
>>
>> For the second case, I get this:
>>
>> WARNING:  nbuckets=16777216.00 estfract=0.01
>> WARNING:  nbuckets=262144.00 estfract=0.000100
>>
>> i.e. ~11 tuples/bucket for big_table and ~20 tuples for small_table.
>>
>> That sounds really strange, because small_table in the second test case
>> is almost perfectly unique. And in the first test case, there are only
>> very few keys with >2 occurences.
> 
> FWIW, the "problem" seems to be this:
> 
>   /*
>    * Adjust estimated bucketsize upward to account for skewed
>    * distribution.
>    */
>   if (avgfreq > 0.0 && mcvfreq > avgfreq)
>       estfract *= mcvfreq / avgfreq;
> 
> Which is explained in the estimate_hash_bucketsize comment like this:
> 
> * be nonuniformly occupied. If the other relation in the join has a key
> * distribution similar to this one's, then the most-loaded buckets are
> * exactly those that will be probed most often. Therefore, the "average"
> * bucket size for costing purposes should really be taken as something
> * close to the "worst case" bucket size. We try to estimate this by
> * adjusting the fraction if there are too few distinct data values, and
> * then scaling up by the ratio of the most common value's frequency to
> * the average frequency.

After thinking about this a bit more, I'm starting to question the logic
behind the adjustment. The thing is, with NTUP_PER_BUCKET=1, all the
tuples in a bucket should have the same hash value and (hash collisions
aside) the same join key value. That means we're not really searching
through the bucket, we're merely iterating and returning all the tuples.

With NTUP_PER_BUCKET=10 it may have made sense, because it was expected
to see multiple hash values in a single bucket. With NTUP_PER_BUCKET=1
we shouldn't really expect that. So I think we should rip the MCV
adjustment out of estimate_hash_bucketsize (and indeed, it fixed both
the test cases).

The only case where I think this might play role is when we can't use
the optimal number of buckets (achieving NTUP_PER_BUCKET=1), but in that
case the patched ExecChooseHashTableSize prefers to increase number of
batches. So that seems safe to me.

Or in what scenario (with NTUP_PER_BUCKET=1) does this still make sense?


Two more comments:

(a) Isn't the minimum selectivity enforced in estimate_hash_bucketsize
(1.0e-6) too high? I mean, with 1GB work_mem I can get ~20M tuples
into the hash table, and in that case it's off by an order of
magnitude. Maybe 1e-9 would be more appropriate?

(b) Currently, ExecChooseHashTableSize sizes the hash table (nbuckets)
using ntuples, but maybe ndistinct (if available) would be better?
It's tricky, though, because while both values are initially
estimates, ndistinct is notoriously less reliable. Also, we
currently don't have means to count ndistinct while building the
table (as opposed to ntuples, which is trivial), which is needed
when resizing the hash table in case the initial estimate was off.
So currently the code optimizes for "ndistinct==ntuples" which may
produce larger nbuckets values, but ISTM it's the right choice.

regards
Tomas






Re: [HACKERS] pg_stat_statements cluttered with "DEALLOCATE dbdpg_p*"

2014-07-20 Thread Andres Freund
On 2014-07-20 14:43:27 +0200, Fabien COELHO wrote:
> >a) Consider using the extended query protocol.
> >b) consider using unnamed prepared statements to reduce the number of
> >  roundtrips
> >c) wonder why PREPARE/DEALLOCATE are so much more frequent than the
> >  actual query execution.
> 
> (1) I'm not responsible for DBD::Pg allocating "random" names to prepared
> statements, even if the queries are the same, and that accumulate over time
> (weeks, possibly months).
> 
> (2) pg_stat_statements is currently inconsistent anyway, as PREPARE is not
> counted (but the underlying query is on each EXECUTE), although its
> corresponding DEALLOCATE is counted, so I think that something is needed for
> consistency.

That's because PREPARE isn't executed as its own statement, but done on
the protocol level (which will need noticeably fewer messages). There's
no builtin logic to ignore actual PREPARE statements. So I don't think
your consistency argument counts as much here.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] pg_stat_statements cluttered with "DEALLOCATE dbdpg_p*"

2014-07-20 Thread Fabien COELHO


Hello Andres,


>>> Why isn't the driver using the extended query protocol? Sending
>>> PREPARE/EXECUTE/DEALLOCATE wastes roundtrips...

>> It seems to me that it would be more helpful if these similar entries were
>> aggregated together, that is if the query "normalization" could ignore the
>> name of the descriptor.

> I'm not in favor of this. If there's DEALLOCATEs that are frequent
> enough to drown out other entries you should

Thanks for the advice. I'm not responsible for the application nor the
driver, and I think that pg_stat_statements should be consistent and
reasonable independently of drivers and applications.

> a) Consider using the extended query protocol.
> b) consider using unnamed prepared statements to reduce the number of
>    roundtrips
> c) wonder why PREPARE/DEALLOCATE are so much more frequent than the
>    actual query execution.

(1) I'm not responsible for DBD::Pg allocating "random" names to prepared
statements, even if the queries are the same, and that accumulate over
time (weeks, possibly months).

(2) pg_stat_statements is currently inconsistent anyway, as PREPARE is not
counted (but the underlying query is on each EXECUTE), although its
corresponding DEALLOCATE is counted, so I think that something is needed
for consistency.


--
Fabien.




Re: [HACKERS] pg_stat_statements cluttered with "DEALLOCATE dbdpg_p*"

2014-07-20 Thread Marko Tiikkaja

On 2014-07-20 14:06, Andres Freund wrote:

On 2014-07-20 13:54:01 +0200, Andres Freund wrote:

On 2014-04-01 16:45:29 +0200, Fabien COELHO wrote:

I noticed that my pg_stat_statements is cluttered with hundreds of entries
like "DEALLOCATE dbdpg_p123456_7", each occurring only once.


Why isn't the driver using the extended query protocol? Sending
PREPARE/EXECUTE/DEALLOCATE wastes roundtrips...


Hm. It's probably because libpq doesn't expose sending a Close message for
prepared statements :(. No idea why.


Yeah, I always considered that a missing feature, and even wrote a patch 
to add it at some point.  I wonder what happened to it.



.marko




Re: [HACKERS] pg_stat_statements cluttered with "DEALLOCATE dbdpg_p*"

2014-07-20 Thread Andres Freund
On 2014-07-20 13:54:01 +0200, Andres Freund wrote:
> Hi,
> 
> On 2014-04-01 16:45:29 +0200, Fabien COELHO wrote:
> > I noticed that my pg_stat_statements is cluttered with hundreds of entries
> like "DEALLOCATE dbdpg_p123456_7", each occurring only once.
> 
> Why isn't the driver using the extended query protocol? Sending
> PREPARE/EXECUTE/DEALLOCATE wastes roundtrips...

Hm. It's probably because libpq doesn't expose sending a Close message for
prepared statements :(. No idea why.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] pg_stat_statements cluttered with "DEALLOCATE dbdpg_p*"

2014-07-20 Thread Andres Freund
Hi,

On 2014-04-01 16:45:29 +0200, Fabien COELHO wrote:
> I noticed that my pg_stat_statements is cluttered with hundreds of entries
> like "DEALLOCATE dbdpg_p123456_7", each occurring only once.

Why isn't the driver using the extended query protocol? Sending
PREPARE/EXECUTE/DEALLOCATE wastes roundtrips...

> It seems to me that it would be more helpful if these similar entries were
> aggregated together, that is if the query "normalization" could ignore the
> name of the descriptor.

I'm not in favor of this. If there are DEALLOCATEs that are frequent
enough to drown out other entries you should

a) Consider using the extended query protocol.
b) Consider using unnamed prepared statements to reduce the number of
   roundtrips.
c) Wonder why PREPARE/DEALLOCATE are so much more frequent than the
   actual query execution.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] pg_stat_statements cluttered with "DEALLOCATE dbdpg_p*"

2014-07-20 Thread Fabien COELHO


Hello devs,

I noticed that my pg_stat_statements is cluttered with hundreds of entries 
like "DEALLOCATE dbdpg_p123456_7", each occurring only once.


Here is a patch and sql test file to:

* normalize DEALLOCATE utility statements in pg_stat_statements

Some drivers, such as DBD::Pg, generate process/counter-based identifiers for 
PREPARE, which results in hundreds of DEALLOCATE entries being tracked even 
though the prepared query may be the same. This is also consistent with the 
corresponding PREPARE not being tracked (although the underlying prepared 
query *is* tracked).



** Note **: another, simpler option would be to skip DEALLOCATEs altogether 
by inserting a "&& !IsA(parsetree, DeallocateStmt)" test at the beginning of 
pgss_ProcessUtility(). I'm not sure which is best.
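For illustration, the normalization rule the patch applies can be sketched in
Python (the function name is mine, not part of the patch; it mirrors the
10-character strncasecmp prefix test):

```python
def normalize_utility_query(query_string: str) -> str:
    """Mirror the patch's strncasecmp(queryString, "DEALLOCATE", 10) == 0
    test: collapse every driver-generated DEALLOCATE (e.g. "DEALLOCATE
    dbdpg_p123456_7") into a single normalized entry, so they all share
    one pg_stat_statements row instead of cluttering it."""
    if query_string[:10].upper() == "DEALLOCATE":
        return "DEALLOCATE ?;"
    return query_string
```

As in the patch, the test is a case-insensitive prefix match on the raw query
string, not a parse-tree check, which is why the parsetree-based alternative
above is mentioned as the other option.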


--
Fabien.

diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 07f09e1..4b3348a 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -966,10 +966,19 @@ pgss_ProcessUtility(Node *parsetree, const char *queryString,
 		BufferUsage bufusage_start,
 	bufusage;
 		uint32		queryId;
+		const char *normalQueryString;
 
 		bufusage_start = pgBufferUsage;
 		INSTR_TIME_SET_CURRENT(start);
 
+		/*
+		 * normalize queryString for DEALLOCATE, as drivers such as DBD::Pg
+		 * generate counter-based names for prepared statements which can
+		 * result in many useless entries.
+		 */
+		normalQueryString = strncasecmp(queryString, "DEALLOCATE", 10)==0 ?
+			"DEALLOCATE ?;": queryString;
+
 		nested_level++;
 		PG_TRY();
 		{
@@ -1025,9 +1034,9 @@ pgss_ProcessUtility(Node *parsetree, const char *queryString,
 		INSTR_TIME_SUBTRACT(bufusage.blk_write_time, bufusage_start.blk_write_time);
 
 		/* For utility statements, we just hash the query string directly */
-		queryId = pgss_hash_string(queryString);
+		queryId = pgss_hash_string(normalQueryString);
 
-		pgss_store(queryString,
+		pgss_store(normalQueryString,
    queryId,
    INSTR_TIME_GET_MILLISEC(duration),
    rows,


pgss-norm-deallocate.sql
Description: application/sql



Re: [HACKERS] Built-in binning functions

2014-07-20 Thread Simon Riggs
On 16 July 2014 20:35, Pavel Stehule  wrote:
>
>
>
> 2014-07-16 10:04 GMT+02:00 Petr Jelinek :
>
>> On 08/07/14 02:14, Tom Lane wrote:
>>>
>>> Petr Jelinek  writes:

 here is a patch implementing varwidth_bucket (naming is up for
 discussion) function which does binning with variable bucket width.
>>>
>>>
>>> I didn't see any discussion of the naming question in this thread.
>>> I'd like to propose that it should be just "width_bucket()"; we can
>>> easily determine which function is meant, considering that the
>>> SQL-spec variants don't take arrays and don't even have the same
>>> number of actual arguments.
>>
>>
>> I did mention in submission that the names are up for discussion, I am all
>> for naming it just width_bucket.
>
>
> I had this idea too - but I am not sure it is a good idea. The distance
> between the ANSI SQL width_bucket and our enhancement is larger than in our
> implementation of "median", for example.
>
> I can live with both names, but I prefer the current name.

Hmmm, not sure. Let's look around and think what words people use.

Transforms of this type are referred to as discretization in the formal
literature and as binning in common usage and statistical software.

width_bucket() seems to refer to an equal-width binning process. The
function being discussed here is a generic mechanism, the boundaries
of which could have been decided using equal-frequency or other
mechanisms. Using the word "width" in those contexts could be
confusing.
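As a sketch of the semantics under discussion (assuming the proposed
varwidth_bucket() behaviour: lower bounds inclusive, with bucket 0 for values
below the first threshold), the variable-width case reduces to a binary
search over the threshold array:

```python
from bisect import bisect_right

def varwidth_bucket(value, thresholds):
    """Variable-width binning: thresholds is an ascending list of bucket
    lower bounds; returns the 1-based bucket number, or 0 when the value
    falls below the first threshold."""
    return bisect_right(thresholds, value)

# thresholds [0, 10, 50] partition values into
# (-inf, 0) -> 0, [0, 10) -> 1, [10, 50) -> 2, [50, inf) -> 3
```

Unlike the SQL-standard width_bucket(operand, low, high, count), which
derives equal-width boundaries from its arguments, here the caller supplies
the boundaries directly, however they were chosen.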

Given that width_bucket() is already in use for the SQL-standard function,
I agree it would be confusing to have the same name for different
parameter profiles.

So I suggest that we use the more generic function name bin(), with
discretize() as a second choice (though that seems annoyingly easy to
spell incorrectly).

Whatever we do, it seems clear we need a section in the manual describing
statistical functions, covering width_bucket() and whatever we call this
function, and including the terms bin, binning, transform, discretize and
discretization so that they are easily searchable.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




[HACKERS] Production block comparison facility

2014-07-20 Thread Simon Riggs
The block comparison facility presented earlier by Heikki could not be
used in production systems. ISTM that it would be desirable to have
something that could be used in that way.

ISTM it would be easy to make these changes:

* optionally generate a FPW for every WAL record, not just first
change after checkpoint
full_page_writes = 'always'

* when an FPW arrives, optionally run a check to see if it compares
correctly against the page already there, when running streaming
replication without a recovery target. We could skip reporting any
problems until the database is consistent
full_page_write_check = on

The above changes seem easy to implement.

With FPW compression, this would be a usable feature in production.
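As a sketch, a setup using the proposed knobs might look like the fragment
below; note that the 'always' value and the full_page_write_check GUC are
both hypothetical, coming from this proposal and existing in no released
PostgreSQL:

```
# primary (hypothetical): emit a full-page image for every WAL record,
# not only the first change after a checkpoint
full_page_writes = 'always'

# standby (hypothetical): compare each incoming FPW against the local
# page, deferring any reports until the database is consistent
full_page_write_check = on
```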

Comments?

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

